Basic Concepts in Big Data
What’s Big Data?
No single definition; here is one from Wikipedia:
•Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
•The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
•The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of
related data, as compared to separate smaller sets with the
same total amount of data, allowing correlations to be found to
"spot business trends, determine quality of research, prevent
diseases, link legal citations, combat crime, and determine
real-time roadway traffic conditions.”
• "Big Data are high-volume, high-velocity, and/or
high-variety information assets that require new
forms of processing to enable enhanced decision
making, insight discovery and process
optimization” (Gartner 2012)
• Complicated (intelligent) analysis of data may
make a small data set “appear” to be “big”
• Bottom line: Any data that exceeds our current
capability of processing can be regarded as “big”
Why is “big data” a “big deal”?
• Government
– Many leading national governments have announced “big data” initiatives
– Many different big data programs have been launched
• Private Sector
– Walmart handles more than 1 million customer transactions
every hour, which is imported into databases estimated to
contain more than 2.5 petabytes of data
– Facebook handles 40 billion photos from its user base.
– Falcon Credit Card Fraud Detection System protects 2.1 billion
active accounts world-wide
• Science
– The Large Synoptic Survey Telescope will generate 140 terabytes
of data every 5 days.
– Biomedical computation, such as decoding the human genome and
personalized medicine
– Social science revolution
– …
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine (digital archive of www) has 3 PB +
100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a
year
640K ought to be
enough for anybody.
Type of Data
• Relational Data (Tables/Transaction/Legacy
Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF), …
• Streaming Data
– You can only scan the data once
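The "scan the data once" constraint on streaming data can be illustrated with a one-pass statistic. The sketch below uses Welford's online update, which is one common choice and not something the slides prescribe. The function name `running_mean_variance` and the generator `sensor_readings` are hypothetical names introduced for illustration.

```python
from typing import Iterable, Iterator, Tuple


def running_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) after a single pass over `stream`.

    Uses Welford's online update, so no value is stored or revisited.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


def sensor_readings() -> Iterator[float]:
    """Hypothetical one-shot source: a generator can be consumed only once."""
    for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
        yield value


count, mean, variance = running_mean_variance(sensor_readings())
print(count, mean, round(variance, 4))
```

Memory use stays constant regardless of stream length, which is the point of the single-scan restriction.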
What to do with these data?
• Aggregation and Statistics
– Data warehouse and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF). The Resource Description Framework (RDF) is a
family of World Wide Web Consortium (W3C) specifications originally
designed as a metadata data model.
• Knowledge discovery
– Data Mining
– Statistical Modeling
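As a small illustration of the "Indexing, Searching, and Querying" item, keyword-based search is often built on an inverted index that maps each term to the documents containing it. This is a minimal sketch under simplifying assumptions (whitespace tokenization, lowercase matching, AND semantics); the function names and sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercase token to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return sorted ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical sample documents.
docs = {
    1: "big data needs new tools",
    2: "data warehouse and OLAP",
    3: "keyword search over big text data",
}
index = build_inverted_index(docs)
print(search(index, "big data"))
```

The index is built once and each query then touches only the posting sets for its own terms, rather than rescanning every document.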
Lifecycle of Data: 4 “A”s
[Figure: the data lifecycle as a cycle of four stages: Acquisition, Aggregation, Analysis, Application. Intermediate products labeled in the diagram include log data, integrated data, and knowledge.]
Computational View of Big Data
[Figure: layers as labeled in the diagram]
• Data Visualization
• Data Access, Data Analysis
• Data Understanding, Data Integration
• Formatting, Cleaning
• Storage, Data
Big Data: 3V’s
Volume (Scale)
• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
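The growth figures on this slide (0.8 ZB in 2009 to 35 ZB in 2020) imply a compound annual growth rate. The short calculation below is a sketch that assumes smooth exponential growth over the 11 years, which the slide does not state explicitly; the variable names are illustrative.

```python
import math

start_zb = 0.8          # data volume in 2009, from the slide
end_zb = 35.0           # projected data volume in 2020, from the slide
years = 2020 - 2009     # 11 years

growth_factor = end_zb / start_zb                   # overall multiple, about 44x
annual_rate = growth_factor ** (1 / years) - 1      # compound annual growth rate
doubling_time = math.log(2) / math.log(1 + annual_rate)

print(f"overall growth: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
print(f"implied doubling time: {doubling_time:.2f} years")
```

Under that assumption the volume grows by roughly 41% per year, which means it doubles about every two years.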
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 12+ TBs of tweet data every day
• 100s of millions of GPS-enabled devices sold annually
• ? TBs of data every day
• 25+ TBs of log data every day
• 2+ billion people on the Web by end of 2011
• 76 million smart meters in 2009… 200M by 2014
CERN’s Large Hadron Collider (LHC) generates 15 PB a year
Maximilien Brice, © CERN
The Earthscope
•The Earthscope is the world's
largest science project. Designed to
track North America's geological
evolution, this observatory records
data over 3.8 million square miles,
amassing 67 terabytes of data. It
analyzes seismic slips in the San
Andreas fault, sure, but also the
plume of magma underneath
Yellowstone and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Variety (Complexity)
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF), …
• Streaming Data
– You can only scan the data once
• A single application can be generating/collecting
many types of data
• Big Public Data (online, weather, finance, etc)
To extract knowledge, all these types of data
need to be linked together
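Linking several data types can be sketched as joining them on a shared key. The example below is a minimal illustration, assuming every source carries a customer id; the sample records, field names, and the function `build_customer_view` are all hypothetical.

```python
import json
from collections import defaultdict

# Relational rows (hypothetical purchase table): (customer_id, item, amount)
purchases = [(1, "laptop", 1200.0), (2, "phone", 650.0), (1, "mouse", 25.0)]

# Semi-structured records (hypothetical JSON web-log lines; fields may vary)
weblog_lines = [
    '{"customer_id": 1, "page": "/deals"}',
    '{"customer_id": 2, "page": "/support", "referrer": "search"}',
]

# Unstructured text (hypothetical social-media posts keyed by customer id)
posts = {1: "Loving my new laptop!", 3: "Where is my order?"}


def build_customer_view(purchases, weblog_lines, posts):
    """Merge three differently shaped sources into one record per customer."""
    view = defaultdict(lambda: {"total_spent": 0.0, "pages": [], "posts": []})
    for customer_id, _item, amount in purchases:
        view[customer_id]["total_spent"] += amount
    for line in weblog_lines:
        record = json.loads(line)
        view[record["customer_id"]]["pages"].append(record["page"])
    for customer_id, text in posts.items():
        view[customer_id]["posts"].append(text)
    return dict(view)


customer_view = build_customer_view(purchases, weblog_lines, posts)
for customer_id in sorted(customer_view):
    print(customer_id, customer_view[customer_id])
```

In practice the hard part is that real sources rarely share a clean key, so entity resolution usually comes before a join like this one.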
A Single View to the Customer
[Figure: the customer at the center, linked to Social Media, Banking, Finance, Gaming, Entertain, Purchase, and Our Known History]
Velocity (Speed)
• Data is being generated fast and needs to be
processed fast
• Online Data Analytics
• Late decisions mean missed opportunities
• Examples
– E-Promotions: based on your current location, your purchase history, and
what you like, send promotions right now for the store next to you
– Healthcare monitoring: sensors monitor your activities and body; any
abnormal measurements require immediate reaction
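The healthcare monitoring example above can be sketched as a check that runs on each reading as it arrives, rather than on a stored batch. This is an illustrative sketch only: the sliding-window rule, the window size, the threshold, and the sample heart-rate values are all assumptions, not a clinical method.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Iterable, Iterator, Tuple


def detect_abnormal(
    readings: Iterable[float], window: int = 5, threshold: float = 3.0
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for readings far from the recent window's mean.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of the last `window` normal readings.
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline window
        recent.append(value)


# Hypothetical heart-rate readings in beats per minute.
heart_rate = [72, 74, 73, 75, 74, 73, 140, 74, 72]
alerts = list(detect_abnormal(heart_rate))
print(alerts)
```

Each reading is judged when it arrives, so the alert can fire immediately. One limitation of this simple rule is that a perfectly constant window (zero standard deviation) never raises an alert.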
Real-time/Fast Data
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
Real-Time Analytics/Decision Requirement
[Figure: the customer at the center of six real-time needs]
• Product recommendations that are relevant and compelling
• Learning why customers switch to competitors and their offers, in time to counter
• Influence behavior
• Friend invitations to join a game or activity that expands business
• Improving the marketing effectiveness of a promotion while it is still in play
• Preventing fraud as it is occurring, and preventing more proactively
Some Make it 4V’s
• Volume:
– How much data is really relevant to the problem solution? Cost of processing?
– So, can you really afford to store and process all that data?
• Velocity:
– Much data coming in at high speed
– Need for streaming versus block approach to data analysis
– So, how to analyze data in-flight and combine with data at-rest
• Variety:
– A small fraction is in structured formats: relational, XML, etc.
– A fair amount is semi-structured, such as web logs, etc.
– The rest of the data is unstructured text, photographs, etc.
– So, no single data model can currently handle the diversity
• Veracity: cover term for …
– Accuracy, Precision, Reliability, Integrity
– So, what is it that you don’t know you don’t know about the data?
• Value:
– How much value is created for each unit of data (whatever it is)?
– So, what is the contribution of subsets of the data to the problem solution?
Harnessing Big Data
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
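The three processing styles differ mainly in access pattern, which a toy example can make concrete. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the `sales` table, the function names, and the in-memory running total standing in for RTAP are assumptions, not a reference architecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, amount REAL)")


def record_sale(store: str, amount: float) -> None:
    """OLTP-style work: a short transaction that touches a single row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sales (store, amount) VALUES (?, ?)", (store, amount)
        )


def revenue_by_store() -> dict:
    """OLAP-style work: a read-only query that scans and aggregates many rows."""
    rows = conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    )
    return dict(rows.fetchall())


running_totals: dict = {}


def on_sale_event(store: str, amount: float) -> float:
    """RTAP-style work: update the aggregate as each event arrives."""
    running_totals[store] = running_totals.get(store, 0.0) + amount
    return running_totals[store]


# Hypothetical sales events.
events = [("north", 20.0), ("south", 35.5), ("north", 12.5), ("south", 4.5)]
for store, amount in events:
    record_sale(store, amount)      # transactional write
    on_sale_event(store, amount)    # aggregate kept current per event

print(revenue_by_store())
print(running_totals)
```

Both aggregates end up equal. The difference is when the work happens: the OLAP query recomputes from stored rows on demand, while the RTAP-style total is already current when it is read.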
The Model Has Changed…
• The Model of Generating/Consuming Data has Changed
– Old Model: Few companies are generating data, all others are consuming data
– New Model: all of us are generating data, and all of us are consuming data
What’s driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
THE EVOLUTION OF BUSINESS INTELLIGENCE
[Figure: BI tools by decade, plotted against speed and scale]
• 1990’s: BI Reporting, OLAP & Data warehouse (Business Objects, SAS, Informatica, Cognos, other SQL reporting tools)
• 2000’s: Interactive Business Intelligence & In-memory RDBMS (QlikView, Tableau, HANA)
• 2010’s: Big Data: Batch Processing & Distributed Data Store (Hadoop/Spark; HBase/Cassandra)
• 2010’s: Big Data: Real Time & Single View (Graph Databases)
Big Data Analytics
• Big data is more real-time in nature
than traditional DW applications
• Traditional DW architectures (e.g.
Exadata, Teradata) are not
well-suited for big data apps
• Shared nothing, massively parallel
processing, scale out architectures
are well-suited for big data apps
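The shared-nothing, scale-out idea can be sketched in a few lines: each node owns a disjoint partition of the data, computes a local result without touching other nodes' state, and only the small partial results are merged. The example below simulates nodes as plain lists inside one process; the record format and function names are hypothetical, and a real system would run the local step on separate machines.

```python
import zlib
from collections import Counter
from typing import List


def partition(records: List[str], num_nodes: int) -> List[List[str]]:
    """Assign each record to exactly one node by hashing its key."""
    nodes: List[List[str]] = [[] for _ in range(num_nodes)]
    for record in records:
        key = record.split(",")[0]
        # crc32 is stable across runs, unlike Python's built-in str hash.
        node_id = zlib.crc32(key.encode("utf-8")) % num_nodes
        nodes[node_id].append(record)
    return nodes


def local_aggregate(node_records: List[str]) -> Counter:
    """Work done independently on one node: sum amounts per key."""
    totals: Counter = Counter()
    for record in node_records:
        key, amount = record.split(",")
        totals[key] += int(amount)
    return totals


def merge(partials: List[Counter]) -> Counter:
    """Combine the small per-node results into the final answer."""
    combined: Counter = Counter()
    for partial in partials:
        combined.update(partial)
    return combined


# Hypothetical "store,amount" records.
records = ["north,20", "south,35", "north,12", "east,7"]
partials = [local_aggregate(node) for node in partition(records, num_nodes=3)]
print(merge(partials))
```

Adding nodes changes only `num_nodes`; the result is the same for any node count, which is what lets this style of processing scale out.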
Big Data Technology
Cloud Computing
• IT resources provided as a service
– Compute, storage, databases, queues
• Clouds leverage economies of scale of commodity
hardware
– Cheap storage, high bandwidth networks & multicore
processors
– Geographically distributed data centers
• Offerings from Microsoft, Amazon, Google, …
wikipedia:Cloud Computing
Benefits
• Cost & management
– Economies of scale, “out-sourced” resource
management
• Reduced Time to deployment
– Ease of assembly, works “out of the box”
• Scaling
– On demand provisioning, co-locate data and compute
• Reliability
– Massive, redundant, shared resources
• Sustainability
– Hardware not owned
Types of Cloud Computing
• Public Cloud: Computing infrastructure is hosted at the
vendor’s premises.
• Private Cloud: Computing architecture is dedicated to the
customer and is not shared with other organisations.
• Hybrid Cloud: Organisations host some critical, secure
applications in private clouds. The not so critical applications
are hosted in the public cloud
– Cloud bursting: the organisation uses its own infrastructure for normal
usage, but cloud is used for peak loads.
• Community Cloud
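The cloud-bursting rule described under Hybrid Cloud can be written as a simple placement policy: fill the organisation's own capacity first and send only the overflow to the public cloud. This is a sketch with made-up numbers; `place_load` and the capacity units are hypothetical.

```python
from typing import Dict


def place_load(demand: int, private_capacity: int) -> Dict[str, int]:
    """Split demand between private infrastructure and a public cloud.

    Normal usage stays on the private side; only the excess during
    peak load is sent ("burst") to the public cloud.
    """
    if demand < 0 or private_capacity < 0:
        raise ValueError("demand and capacity must be non-negative")
    on_private = min(demand, private_capacity)
    on_public = demand - on_private
    return {"private": on_private, "public": on_public}


# Hypothetical hourly demand, in arbitrary compute units.
hourly_demand = [60, 80, 130, 95]
for demand in hourly_demand:
    print(demand, place_load(demand, private_capacity=100))
```

Only the 130-unit hour exceeds the private capacity of 100, so only that hour uses (and pays for) public resources.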
Classification of Cloud Computing based on
Service Provided
• Infrastructure as a service (IaaS)
– Offering hardware related services using the principles of cloud
computing. These could include storage services (database or disk
storage) or virtual servers.
– Amazon EC2, Amazon S3, Rackspace Cloud Servers and Flexiscale.
• Platform as a Service (PaaS)
– Offering a development platform on the cloud.
– Google’s App Engine, Microsoft’s Azure, Salesforce.com’s Force.com.
• Software as a service (SaaS)
– Including a complete software offering on the cloud. Users can
access a software application hosted by the cloud vendor on
pay-per-use basis. This is a well-established sector.
– Salesforce.com’s offering in the online Customer Relationship …
Infrastructure as a Service (IaaS)
More Refined Categorization
• Storage-as-a-service
• Database-as-a-service
• Information-as-a-service
• Process-as-a-service
• Application-as-a-service
• Platform-as-a-service
• Integration-as-a-service
• Security-as-a-service
• Management/Governance-as-a-service
• Testing-as-a-service
• Infrastructure-as-a-service
InfoWorld Cloud Computing Deep Dive
Key Ingredients in Cloud Computing
• Service-Oriented Architecture (SOA)
• Utility Computing (on demand)
• Virtualization (P2P Network)
• SaaS (Software as a Service)
• PaaS (Platform as a Service)
• IaaS (Infrastructure as a Service)
• Web Services in Cloud
Enabling Technology: Virtualization
[Figure: two stacks side by side]
• Traditional Stack: Apps on one Operating System on Hardware
• Virtualized Stack: Apps on per-VM OSes on a Hypervisor on Hardware
Everything as a Service
• Utility computing = Infrastructure as a Service
(IaaS)
– Why buy machines when you can rent cycles?
– Examples: Amazon’s EC2, Rackspace
• Platform as a Service (PaaS)
– Give me nice API and take care of the maintenance,
upgrades, …
– Example: Google App Engine
• Software as a Service (SaaS)
– Just run it for me!
– Example: Gmail, Salesforce
Cloud examples
• Amazon Elastic Compute Cloud
• Google App Engine
• Microsoft Azure
• GoGrid
• AppNexus
The Obligatory Timeline Slide
(Mike Culver @ AWS)
[Figure: timeline with tick marks at 1959, 1969, 1982, 1996, 1997, 2001, 2004, 2006]
• Milestones: COBOL, Edsel; ARPANET; Internet; Web Awareness; Amazon.com; Web as a Platform; Web Services, Resources Eliminated
• Eras: Darkness; Dot-Com Bubble; Web 2.0; Web Scale Computing
AWS
• Elastic Compute Cloud – EC2 (IaaS)
• Simple Storage Service – S3 (IaaS)
• Elastic Block Store – EBS (IaaS)
• SimpleDB (SDB) (PaaS)
• Simple Queue Service – SQS (PaaS)
• CloudFront (S3 based Content Delivery Network –
PaaS)
• Consistent AWS Web Services API
Big Data & Related Topics
[Figure: the layered view of big data surrounded by related fields]
• Layers: Data Visualization; Data Access, Data Analysis; Data Understanding, Data Integration; Formatting, Cleaning; Storage, Data
• Related fields: Human-Computer Interaction, Databases, Information Retrieval, Machine Learning, Data Mining, Computer Vision, Speech Recognition, Natural Language Processing, Data Warehousing, Signal Processing, Information Theory
• Many Applications!
Some Data Analysis Techniques
• Visualization
• Classification
• Predictive Modeling
• Time Series
• Clustering