CIT650
Introduction to Big Data
Principles of Big Data
Today’s Agenda
Big Data Phenomena
Big Data 1.0 Systems
Big Data 2.0 Systems
Part I
Big Data Phenomena
Big Data
Data is a key resource in the modern world.
According to IBM, we are currently creating 2.5 quintillion bytes of data every day.
IDC predicts that the worldwide volume of data will reach 40 zettabytes by 2020.
The radical expansion and integration of computation, networking, digital devices, and data storage has provided a robust platform for the explosion in big data.
On the Verge of A Disruptive Century: Breakthroughs
Gene Sequencing and Biotechnology
Ubiquitous Computing
Smaller, Faster, Cheaper Sensors
Faster Communication
Big Data Applications are Everywhere
Big Data: What Happens in the Internet in a Minute?
Data Generation and Consumption Model is Changing
Old Model: A few companies (producers) generate data; everyone else consumes it.
New Model: All of us generate data, and all of us consume it.
Big Data
Data generation and consumption are becoming a major part of people's daily lives, especially with the pervasive availability and usage of Internet technology and applications.
Your Smartphone is Now Very Smart
Internet of Things (IoT)
A network of devices that connect directly with each other to capture, share, and monitor vital data automatically, through a secure link (SSL) to a central command-and-control server in the cloud.
Enables communication between devices, people, and processes to exchange useful information and knowledge that creates value for humans.
A global network infrastructure linking physical and virtual objects:
Infrastructure: Internet and network developments
Specific object identification, sensing, and connection capability
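As a toy illustration of this device-to-cloud reporting pattern, here is a minimal Java sketch of a sensor posting one reading to a central server; the endpoint URL and the JSON field names are hypothetical, not part of any particular IoT platform.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SensorReport {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint of the central command-and-control server.
        String endpoint = "https://example.com/iot/readings";
        // One sensor reading encoded as JSON (field names are illustrative).
        String reading = "{\"deviceId\":\"sensor-42\",\"tempC\":21.5,\"ts\":"
                + System.currentTimeMillis() + "}";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(reading))
                .build();
        // Send synchronously and report the server's status code.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Server responded: " + response.statusCode());
    }
}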
Big Data: Internet of Things
Prediction of IoT Usage [1]
[1] https://www.ericsson.com/
Why is the IoT Opportunity Growing Now?
Affordable hardware: costs of actuators and sensors have been cut in half over the last 10 years
Smaller, more powerful hardware: form factors of hardware have shrunk to millimeter or even nanometer scales
Ubiquitous and cheap mobility: costs for mobile devices, bandwidth, and data processing have declined over the last 10 years
Availability of supporting tools: big data tools and cloud-based infrastructure have become widely available
Smart X Phenomena
What Does It All Produce?
Data ... Data ... Data
Big Data: Activity Data
Simple activities like listening to music or reading a book are now
generating data.
Digital music players and eBooks collect data on our activities.
Your smartphone collects data on how you use it, and your web browser collects information on what you are searching for.
Your credit card company collects data on where you shop, and your store collects data on what you buy.
It is hard to imagine any activity that does not generate data.
Big Data
The cost of sequencing one human genome has fallen from $100 million in 2001 to about $1,000 in 2015.
New Types of Data
The Data Structure Evolution Over the Years
What Does Big Data Mean?
Big Data (3V): Volume, Velocity, and Variety
Big Data (5V): Volume, Velocity, Variety, Veracity, and Value
Big Data Definition
The McKinsey Global Institute report described big data as the next frontier for innovation and competition.
The report defined big data as "data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value."
Big Data Revolution
IBM 5MB Hard Disk ;-)
Recent Advances in Computational Power
Cheaper, larger, and faster disk storage
You can now keep your entire large database on disk
Cheaper, larger, and faster memory
You may even be able to accommodate it all in memory
Cheaper, more capable, and faster processors
Parallel computing architectures:
Operate on large datasets in reasonable time
Try exhaustive searches and brute-force solutions
Big Data
Moore's Law: the information density on silicon integrated circuits doubles every 18 to 24 months; at that rate, capacity grows roughly 30-100x per decade.
Users, in turn, expect ever more sophisticated information.
Your Pocket-Size Terabyte Hard Disk
Hardware Advancements Enable Big Data Processing
Scale Up vs. Scale Out: scaling up means a bigger, more powerful machine; scaling out means more commodity machines working together.
The Data Overload Problem
Data is growing at a phenomenal rate. It has become massive, operational, and opportunistic.
We are drowning in data but starving for knowledge.
The hidden information and knowledge in these mountains of data is what is really most useful.
Fourth Paradigm
Jim Gray, a database pioneer, described the big data phenomenon as the Fourth Paradigm and called for a paradigm shift in computing architectures and large-scale data processing mechanisms.
The first three paradigms were experimental, theoretical, and, more recently, computational science.
Fourth Paradigm
A thousand years ago - Experimental Science
Description of natural phenomena
Last few hundred years - Theoretical Science
Newton's laws, Maxwell's equations, ...
Last few decades - Computational Science
Simulations of complex phenomena
Today - Data-Intensive Science
Scientists are overwhelmed with datasets from many different sources:
Data captured by instruments
Data generated by simulations
Data generated by sensor networks
Computing Clusters
Many racks of computers, thousands of machines per cluster.
Limited bisection bandwidth between racks.
Data Centers
Big Data is a Competitive Advantage
"It's not who has the best algorithm that wins, it's who has the most data."
- Andrew Ng
Data is the new Oil/Gold
Big Data Processing Systems
Big Data is the New Oil
and
Big Data Processing Systems are the Machinery
Part II
Big Data 1.0 Systems: The Hadoop Decade
A Little History: Two Seminal Contributions
"The Google File System" [2]
Describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications that runs on inexpensive commodity hardware and delivers high aggregate performance.
"MapReduce: Simplified Data Processing on Large Clusters" [3]
Describes a simple programming model and an implementation for processing large data sets on computing clusters.
[2] S. Ghemawat, H. Gobioff, S. Leung. The Google File System. SOSP 2003.
[3] J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
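To make the programming model concrete, here is the canonical word-count example sketched against the Hadoop MapReduce Java API; the job-configuration boilerplate (driver class, input/output paths) is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in an input line.
class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts gathered for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

The framework handles everything between the two phases: partitioning the input, shuffling and sorting the (word, 1) pairs, scheduling, and recovering from machine failures.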
Hadoop [4]: A Star is Born
Hadoop is an open-source software framework that supports data-intensive distributed applications and clones Google's MapReduce framework.
It is designed to process very large amounts of unstructured and complex data.
It is designed to run on a large number of machines that don't share any memory or disks.
It is designed to run on a cluster of machines that can be put together at relatively low cost and maintained easily.
[4] http://hadoop.apache.org/
Key Aspects of Hadoop
Hadoop's Success [5]
Big Data 1.0 = Hadoop
[5] https://www.google.com/trends/
The Perennial Dilemma: Does One Size Fit All?!
Big Data 2.0 Processing Systems
Big Data 2.0 != Hadoop
Domain-specific, optimized, and vertically focused systems.
A timeline (2004-2015) of representative systems includes: Google MapReduce, Hadoop, Apache Hive, Google Pregel, Apache Giraph, Apache Spark, Apache Storm, GraphLab, PowerGraph, GraphX, Cloudera Impala, Facebook Presto, IBM Big SQL, Apache Phoenix, Apache Tajo, Apache Samza, Apache S4, Apache Flink, Apache Tez, and Trinity.
Big Graphs
Google estimates that the total number of web pages exceeds 1 trillion; experimental graphs of the World Wide Web contain more than 20 billion nodes and 160 billion edges.
Facebook reported more than a billion users (nodes) and more than 140 billion friendship relationships (edges) in 2012.
The LinkedIn network contains almost 260 million nodes and billions of edges.
The Linked Data cloud contains about 31 billion triples.
Hadoop for Big Graphs?!
Popular graph query/analysis operations, such as PageRank, pattern matching, shortest paths, clustering (e.g., max clique, triangle closure), and community detection, are iterative in nature.
The MapReduce programming model does not directly support iterative data analysis. Programmers may implement iterative programs by manually issuing multiple MapReduce jobs and orchestrating their execution with a driver program, which wastes I/O, network bandwidth, and CPU resources (a sketch of such a driver follows this slide).
It is not intuitive to think of graphs as key/value pairs or matrices.
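As a rough sketch of that driver pattern, assuming the standard Hadoop Java API: the mapper/reducer classes and the convergence test are elided, and the paths and iteration bound are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver that fakes iteration by chaining jobs: the output of round i
// becomes the input of round i + 1, re-reading and re-writing HDFS each time.
public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        int maxIterations = 10; // or loop until a convergence counter stabilizes
        for (int i = 0; i < maxIterations; i++) {
            Job job = Job.getInstance(conf, "iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            // job.setMapperClass(...); job.setReducerClass(...); // algorithm-specific
            Path output = new Path(args[1] + "/iter-" + i);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (!job.waitForCompletion(true)) {
                break; // stop if a round fails
            }
            input = output; // the materialization cost paid every round
        }
    }
}

Every round pays the full cost of job startup plus reading and writing HDFS, which is exactly the waste the slide describes.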
Pregel/Giraph [6]
In 2010, Google introduced the Pregel system as a scalable platform for implementing graph algorithms.
Pregel relies on a vertex-centric approach and is inspired by the Bulk Synchronous Parallel (BSP) model.
In 2012, Apache Giraph was launched as an open-source project that clones the concepts of Pregel and leverages the Hadoop infrastructure.
Other projects: Spark GraphX (Apache), GoldenOrb (Apache), GraphLab (CMU), and Signal/Collect (UZH).
[6] https://giraph.apache.org/
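To give a feel for the vertex-centric BSP style, here is a toy single-process Java sketch of single-source shortest paths. This is not the Giraph API: the graph, messaging, and superstep loop are simulated in-process, and partitioning, combiners, and fault tolerance are all elided.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In each superstep, every vertex with pending messages takes the best
// distance it has heard of and, if improved, propagates along its edges.
public class VertexCentricSSSP {
    public static void main(String[] args) {
        // A small directed graph with unit edge weights, as adjacency lists.
        Map<Integer, List<Integer>> edges = Map.of(
                0, List.of(1, 2), 1, List.of(3), 2, List.of(3), 3, List.of());
        Map<Integer, Integer> dist = new HashMap<>();
        edges.keySet().forEach(v -> dist.put(v, Integer.MAX_VALUE));
        // Superstep 0: only the source vertex (0) has a message, distance 0.
        Map<Integer, List<Integer>> inbox = new HashMap<>(Map.of(0, List.of(0)));

        int superstep = 0;
        while (!inbox.isEmpty()) { // halt when no messages are in flight
            Map<Integer, List<Integer>> next = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : inbox.entrySet()) {
                int vertex = e.getKey();
                int best = Collections.min(e.getValue());
                if (best < dist.get(vertex)) { // improved: update and propagate
                    dist.put(vertex, best);
                    for (int nbr : edges.get(vertex)) {
                        next.computeIfAbsent(nbr, k -> new ArrayList<>()).add(best + 1);
                    }
                }
            }
            inbox = next;
            superstep++;
        }
        System.out.println("Distances after " + superstep + " supersteps: " + dist);
    }
}

In Pregel or Giraph the same logic lives in a per-vertex compute() method, the runtime delivers the messages between supersteps, and the computation halts when every vertex has voted to halt and no messages remain.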
Big Streaming Data
Every day, Twitter generates more than 12 TB of tweets.
The New York Stock Exchange captures 1 TB of trade information.
About 30 billion radio-frequency identification (RFID) tags are created every day.
Hundreds of millions of GPS devices are sold every year.
Static Data Computation vs Streaming Data Computation
Hadoop for Big Streams?!
From the stream-processing point of view, the main limitation of the original implementation of the MapReduce framework is that it was designed so that the entire output of each map and reduce task is materialized into a local file before it can be consumed by the next stage.
This materialization step enables a simple and elegant checkpoint/restart fault-tolerance mechanism, but it causes significant delay for jobs with real-time processing requirements.
Some Hadoop-based attempts to address this include MapReduce Online and Incoop.
Twitter Storm [7]
An open-source project developed by Nathan Marz and acquired by Twitter in 2012.
Storm is a distributed stream-processing system with the following key design features: horizontal scalability, guaranteed reliable communication between the processing nodes, fault tolerance, and programming-language agnosticism.
A Storm cluster is superficially similar to a Hadoop cluster. One key difference is that a MapReduce job eventually finishes, whereas a Storm job processes messages forever (or until the user kills it).
Other projects: Flink, Apex, Spark Streaming, and Kafka Streams.
[7] http://storm.apache.org/
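For flavor, here is a minimal word-counting bolt sketched against the Apache Storm (1.x+) Java API; the upstream sentence spout and splitter bolt are assumed and shown only in comments.

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A bolt that maintains a running count per word. Unlike a reduce task,
// it never "finishes": it handles each tuple as it arrives, forever.
public class WordCountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        long count = counts.merge(word, 1L, Long::sum);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

// In a topology, this bolt would be wired downstream of a sentence spout
// and a splitter bolt (both assumed here), e.g. via TopologyBuilder:
//   builder.setBolt("count", new WordCountBolt(), 4)
//          .fieldsGrouping("split", new Fields("word"));

The fieldsGrouping on "word" routes all tuples for the same word to the same bolt instance, so each instance's running counts stay consistent as the topology scales out.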
Massively Parallel Processing (MPP) Optimized SQL Query Engines
NoSQL Databases [8]
NoSQL database systems represent a new generation of low-cost, high-performance database software that is steadily gaining popularity.
These systems promise to simplify administration, be fault-tolerant, and scale out on commodity hardware; the partitioning sketch after this slide illustrates the idea.
[8] http://nosql-database.org/
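A toy Java sketch of that scale-out idea: keys are hashed across commodity nodes, so capacity grows by adding machines. The node names are illustrative, and real systems use consistent hashing plus replication rather than this naive scheme.

import java.util.List;

public class KeyPartitioner {
    private final List<String> nodes;

    public KeyPartitioner(List<String> nodes) {
        this.nodes = nodes;
    }

    // Naive modulo hashing; production systems use consistent hashing
    // (plus replication) so that adding a node moves only a few keys.
    public String nodeFor(String key) {
        int h = key.hashCode() & 0x7fffffff; // force a non-negative hash
        return nodes.get(h % nodes.size());
    }

    public static void main(String[] args) {
        KeyPartitioner p = new KeyPartitioner(List.of("node-a", "node-b", "node-c"));
        System.out.println("user:42 -> " + p.nodeFor("user:42"));
    }
}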
Data Storage Options
Big Data Landscape
Big Data Market Size [9]
[9] https://www.statista.com/
The End