CIT650
Introduction to Big Data
Principles of Big Data
Today’s Agenda
Big Data Phenomena
Big Data 1.0 Systems
Big Data 2.0 Systems
Part I
Big Data Phenomena
Big Data
Data is a key resource in the modern world.
According to IBM, we are currently creating 2.5 quintillion bytes of data every day.
IDC predicts that the worldwide volume of data will reach 40 zettabytes by 2020.
The radical expansion and integration of computation, networking, digital devices, and data storage has provided a robust platform for the explosion in big data.
On the Verge of A Disruptive Century: Breakthroughs
Gene Sequencing and Biotechnology
Ubiquitous Computing
Smaller, Faster, Cheaper Sensors
Faster Communication
Big Data Applications are Everywhere
Big Data: What Happens in the Internet in a Minute?
Data Generation and Consumption Model is Changing
Old Model: A few companies (producers) generate data; everyone else consumes it.
New Model: All of us generate data, and all of us consume it.
Big Data
Data generation and consumption are becoming a major part of people's daily lives, especially with the pervasive availability and usage of Internet technology and applications.
Your Smartphone is Now Very Smart
Internet of Things (IoT)
A network of devices that connect directly with each other to capture, share, and monitor vital data automatically, through a secure link (SSL) to a central command-and-control server in the cloud.
Enables communication between devices, people, and processes to exchange useful information and knowledge that creates value for humans.
A global network infrastructure linking physical and virtual objects:
Infrastructure: Internet and network developments
Specific object identification, sensing, and connection capability
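As a toy illustration of this device-to-cloud reporting pattern, here is a minimal Java sketch of a sensor posting one reading to a central server; the endpoint URL and the JSON field names are hypothetical, not part of any particular IoT platform.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SensorReport {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint of the central command-and-control server.
        String endpoint = "https://example.com/iot/readings";
        // One sensor reading encoded as JSON (field names are illustrative).
        String reading = "{\"deviceId\":\"sensor-42\",\"tempC\":21.5,\"ts\":"
                + System.currentTimeMillis() + "}";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(reading))
                .build();
        // Send synchronously and report the server's status code.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Server responded: " + response.statusCode());
    }
}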
Big Data: Internet of Things
Prediction of IoT Usage [1]
[1] https://www.ericsson.com/
Why is the IoT Opportunity Growing Now?
Affordable hardware: costs of actuators and sensors have been cut in half over the last 10 years
Smaller, more powerful hardware: form factors of hardware have shrunk to millimeter or even nanometer scales
Ubiquitous and cheap mobility: costs for mobile devices, bandwidth, and data processing have declined over the last 10 years
Availability of supporting tools: big data tools and cloud-based infrastructure have become widely available
Smart X Phenomena
What Does It All Produce?
Data ... Data ... Data
Big Data: Activity Data
Simple activities like listening to music or reading a book are now
generating data.
Digital music players and eBooks collect data on our activities.
Your smartphone collects data on how you use it, and your web browser collects information on what you are searching for.
Your credit card company collects data on where you shop, and your store collects data on what you buy.
It is hard to imagine any activity that does not generate data.
Big Data
The cost of sequencing one human genome has fallen from $100 million in 2001 to about $1,000 in 2015.
New Types of Data
The Data Structure Evolution Over the Years
What Does Big Data Mean?
Big Data (3V): Volume, Velocity, and Variety
Big Data (5V): Volume, Velocity, Variety, Veracity, and Value
Big Data Definition
The McKinsey Global Institute report described big data as the next frontier for innovation and competition.
The report defined big data as "data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value."
Big Data Revolution
IBM 5MB Hard Disk ;-)
Recent Advances in Computational Power
Cheaper, larger, and faster disk storage
You can now keep your entire large database on disk
Cheaper, larger, and faster memory
You may even be able to accommodate it all in memory
Cheaper, more capable, and faster processors
Parallel computing architectures:
Operate on large datasets in reasonable time
Try exhaustive searches and brute-force solutions
Big Data
Moore's Law: the information density on silicon integrated circuits doubles every 18 to 24 months; at that rate, capacity grows roughly 30-100x per decade.
Users, in turn, expect ever more sophisticated information.
Your Pocket-Size Terabyte Hard Disk
Hardware Advancements Enable Big Data Processing
Scale Up vs. Scale Out: scaling up means a bigger, more powerful machine; scaling out means more commodity machines working together.
The Data Overload Problem
Data is growing at a phenomenal rate. It has become massive, operational, and opportunistic.
We are drowning in data but starving for knowledge.
The hidden information and knowledge in these mountains of data is what is really most useful.
Fourth Paradigm
Jim Gray, a database pioneer, described the big data phenomenon as the Fourth Paradigm and called for a paradigm shift in computing architectures and large-scale data processing mechanisms.
The first three paradigms were experimental, theoretical, and, more recently, computational science.
Fourth Paradigm
A thousand years ago - Experimental Science
Description of natural phenomena
Last few hundred years - Theoretical Science
Newton's laws, Maxwell's equations, ...
Last few decades - Computational Science
Simulations of complex phenomena
Today - Data-Intensive Science
Scientists are overwhelmed with datasets from many different sources:
Data captured by instruments
Data generated by simulations
Data generated by sensor networks
Computing Clusters
Many racks of computers, thousands of machines per cluster.
Limited bisection bandwidth between racks.
Data Centers
Big Data is a Competitive Advantage
"It's not who has the best algorithm that wins, it's who has the most data."
- Andrew Ng
Data is the new Oil/Gold
Big Data Processing Systems
Big Data is the New Oil
and
Big Data Processing Systems are the Machinery
Part II
Big Data 1.0 Systems: The Hadoop Decade
A Little History: Two Seminal Contributions
"The Google File System" [2]
Describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications that runs on inexpensive commodity hardware and delivers high aggregate performance.
"MapReduce: Simplified Data Processing on Large Clusters" [3]
Describes a simple programming model and an implementation for processing large data sets on computing clusters.
[2] S. Ghemawat, H. Gobioff, S. Leung. The Google File System. SOSP 2003.
[3] J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
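To make the programming model concrete, here is the canonical word-count example sketched against the Hadoop MapReduce Java API; the job-configuration boilerplate (driver class, input/output paths) is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in an input line.
class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts gathered for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

The framework handles everything between the two phases: partitioning the input, shuffling and sorting the (word, 1) pairs, scheduling, and recovering from machine failures.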
Hadoop [4]: A Star is Born
Hadoop is an open-source software framework that supports data-intensive distributed applications and clones Google's MapReduce framework.
It is designed to process very large amounts of unstructured and complex data.
It is designed to run on a large number of machines that don't share any memory or disks.
It is designed to run on a cluster of machines that can be put together at relatively low cost and maintained easily.
[4] http://hadoop.apache.org/
Key Aspects of Hadoop
Hadoop's Success [5]
Big Data 1.0 = Hadoop
[5] https://www.google.com/trends/
The Perennial Dilemma: Does One Size Fit All?!
Big Data 2.0 Processing Systems
Big Data 2.0 != Hadoop
Domain-specific, optimized, and vertically focused systems.
A timeline (2004-2015) of representative systems includes: Google MapReduce, Hadoop, Apache Hive, Google Pregel, Apache Giraph, Apache Spark, Apache Storm, GraphLab, PowerGraph, GraphX, Cloudera Impala, Facebook Presto, IBM Big SQL, Apache Phoenix, Apache Tajo, Apache Samza, Apache S4, Apache Flink, Apache Tez, and Trinity.
Big Graphs
Google estimates that the total number of web pages exceeds 1 trillion; experimental graphs of the World Wide Web contain more than 20 billion nodes and 160 billion edges.
Facebook reported more than a billion users (nodes) and more than 140 billion friendship relationships (edges) in 2012.
The LinkedIn network contains almost 260 million nodes and billions of edges.
The Linked Data cloud contains about 31 billion triples.
Hadoop for Big Graphs?!
Popular graph query/analysis operations, such as PageRank, pattern matching, shortest paths, clustering (e.g., max clique, triangle closure), and community detection, are iterative in nature.
The MapReduce programming model does not directly support iterative data analysis. Programmers may implement iterative programs by manually issuing multiple MapReduce jobs and orchestrating their execution with a driver program, which wastes I/O, network bandwidth, and CPU resources (a sketch of such a driver follows this slide).
It is not intuitive to think of graphs as key/value pairs or matrices.
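As a rough sketch of that driver pattern, assuming the standard Hadoop Java API: the mapper/reducer classes and the convergence test are elided, and the paths and iteration bound are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver that fakes iteration by chaining jobs: the output of round i
// becomes the input of round i + 1, re-reading and re-writing HDFS each time.
public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        int maxIterations = 10; // or loop until a convergence counter stabilizes
        for (int i = 0; i < maxIterations; i++) {
            Job job = Job.getInstance(conf, "iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            // job.setMapperClass(...); job.setReducerClass(...); // algorithm-specific
            Path output = new Path(args[1] + "/iter-" + i);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (!job.waitForCompletion(true)) {
                break; // stop if a round fails
            }
            input = output; // the materialization cost paid every round
        }
    }
}

Every round pays the full cost of job startup plus reading and writing HDFS, which is exactly the waste the slide describes.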
Pregel/Giraph [6]
In 2010, Google introduced the Pregel system as a scalable platform for implementing graph algorithms.
Pregel relies on a vertex-centric approach and is inspired by the Bulk Synchronous Parallel (BSP) model.
In 2012, Apache Giraph was launched as an open-source project that clones the concepts of Pregel and leverages the Hadoop infrastructure.
Other projects: Spark GraphX (Apache), GoldenOrb (Apache), GraphLab (CMU), and Signal/Collect (UZH).
[6] https://giraph.apache.org/
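To give a feel for the vertex-centric BSP style, here is a toy single-process Java sketch of single-source shortest paths. This is not the Giraph API: the graph, messaging, and superstep loop are simulated in-process, and partitioning, combiners, and fault tolerance are all elided.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In each superstep, every vertex with pending messages takes the best
// distance it has heard of and, if improved, propagates along its edges.
public class VertexCentricSSSP {
    public static void main(String[] args) {
        // A small directed graph with unit edge weights, as adjacency lists.
        Map<Integer, List<Integer>> edges = Map.of(
                0, List.of(1, 2), 1, List.of(3), 2, List.of(3), 3, List.of());
        Map<Integer, Integer> dist = new HashMap<>();
        edges.keySet().forEach(v -> dist.put(v, Integer.MAX_VALUE));
        // Superstep 0: only the source vertex (0) has a message, distance 0.
        Map<Integer, List<Integer>> inbox = new HashMap<>(Map.of(0, List.of(0)));

        int superstep = 0;
        while (!inbox.isEmpty()) { // halt when no messages are in flight
            Map<Integer, List<Integer>> next = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : inbox.entrySet()) {
                int vertex = e.getKey();
                int best = Collections.min(e.getValue());
                if (best < dist.get(vertex)) { // improved: update and propagate
                    dist.put(vertex, best);
                    for (int nbr : edges.get(vertex)) {
                        next.computeIfAbsent(nbr, k -> new ArrayList<>()).add(best + 1);
                    }
                }
            }
            inbox = next;
            superstep++;
        }
        System.out.println("Distances after " + superstep + " supersteps: " + dist);
    }
}

In Pregel or Giraph the same logic lives in a per-vertex compute() method, the runtime delivers the messages between supersteps, and the computation halts when every vertex has voted to halt and no messages remain.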
Big Streaming Data
Every day, Twitter generates more than 12 TB of tweets.
The New York Stock Exchange captures 1 TB of trade information.
About 30 billion radio-frequency identification (RFID) tags are created every day.
Hundreds of millions of GPS devices are sold every year.
Static Data Computation vs Streaming Data Computation
Hadoop for Big Streams?!
From the stream-processing point of view, the main limitation of the original implementation of the MapReduce framework is that it was designed so that the entire output of each map and reduce task is materialized into a local file before it can be consumed by the next stage.
This materialization step enables a simple and elegant checkpoint/restart fault-tolerance mechanism, but it causes significant delay for jobs with real-time processing requirements.
Some Hadoop-based attempts to address this include MapReduce Online and Incoop.
Twitter Storm [7]
An open-source project developed by Nathan Marz and acquired by Twitter in 2012.
Storm is a distributed stream-processing system with the following key design features: horizontal scalability, guaranteed reliable communication between the processing nodes, fault tolerance, and programming-language agnosticism.
A Storm cluster is superficially similar to a Hadoop cluster. One key difference is that a MapReduce job eventually finishes, whereas a Storm job processes messages forever (or until the user kills it).
Other projects: Flink, Apex, Spark Streaming, and Kafka Streams.
[7] http://storm.apache.org/
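For flavor, here is a minimal word-counting bolt sketched against the Apache Storm (1.x+) Java API; the upstream sentence spout and splitter bolt are assumed and shown only in comments.

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A bolt that maintains a running count per word. Unlike a reduce task,
// it never "finishes": it handles each tuple as it arrives, forever.
public class WordCountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        long count = counts.merge(word, 1L, Long::sum);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

// In a topology, this bolt would be wired downstream of a sentence spout
// and a splitter bolt (both assumed here), e.g. via TopologyBuilder:
//   builder.setBolt("count", new WordCountBolt(), 4)
//          .fieldsGrouping("split", new Fields("word"));

The fieldsGrouping on "word" routes all tuples for the same word to the same bolt instance, so each instance's running counts stay consistent as the topology scales out.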
Massively Parallel Processing (MPP) Optimized SQL Query Engines
NoSQL Databases [8]
NoSQL database systems represent a new generation of low-cost, high-performance database software that is steadily gaining popularity.
These systems promise to simplify administration, be fault-tolerant, and scale out on commodity hardware; the partitioning sketch after this slide illustrates the idea.
[8] http://nosql-database.org/
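A toy Java sketch of that scale-out idea: keys are hashed across commodity nodes, so capacity grows by adding machines. The node names are illustrative, and real systems use consistent hashing plus replication rather than this naive scheme.

import java.util.List;

public class KeyPartitioner {
    private final List<String> nodes;

    public KeyPartitioner(List<String> nodes) {
        this.nodes = nodes;
    }

    // Naive modulo hashing; production systems use consistent hashing
    // (plus replication) so that adding a node moves only a few keys.
    public String nodeFor(String key) {
        int h = key.hashCode() & 0x7fffffff; // force a non-negative hash
        return nodes.get(h % nodes.size());
    }

    public static void main(String[] args) {
        KeyPartitioner p = new KeyPartitioner(List.of("node-a", "node-b", "node-c"));
        System.out.println("user:42 -> " + p.nodeFor("user:42"));
    }
}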
Data Storage Options
Big Data Landscape
Big Data Market Size [9]
[9] https://www.statista.com/
The End