CSIS22H
Advanced Database Systems
Lecture 8
Big Data
“I have been surprised and delighted over the years about how
many people are interested in working with data. There’s
definitely a new geek in town. And in 2015, this geek is a data
geek.”
Christian Chabot, founder and CEO - Tableau
“We have for the first time an economy based on a key
resource [information] that is not only renewable, but
self-generating. Running out of it is not a problem, but
drowning in it is.”
John Naisbitt, American author and public speaker

“Big data = Crude oil … But you need to refine the crude oil. Enter Data Science”
Carlos Somohano, Data Scientist - London

“It’s a great time to be a data geek.”
Roger Barga, Microsoft Research

“There is a big data revolution”
Prof. Gary King, Director for the IQSS - Harvard Univ.
Lecture Contents:
• Why Big Data?
• Definition – 3 & 4 Vs
• Tools for Big Data
• IBM’s Big Data Platform
• What is Hadoop
• Hadoop vs. Other Systems
• Some Hadoop Related Names to Know
Why Big Data?
• 2.5 quintillion (10^18) bytes of data are generated every day!
• Social media sites
• Sensors
• Digital photos
• Business transactions
• Location-based data
[Figure: enterprise data sources, including Website, Social Media, Billing, ERP, Network Switches, CRM, and RFID]
Source: IBM http://www-01.ibm.com/software/data/bigdata/
Why Big Data?
• Big data itself isn’t new – it’s been here for a while and growing exponentially. What is new is the technology to process and analyze it.
• Increase of storage capacities
• Increase of processing power
• Availability of data
Available technology can cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming.
It is all about deriving new insight for the business.
Why Big Data?
• Big data is about deriving new insight from previously untouched data & integrating that insight into your business operation.
• It’s about applying new tools to do more analytics on more data for more people.
Glen Mules – Big Data University
Big Data - Definition
“Big Data is any data that is expensive to manage and hard
to extract value from.”
Michael Franklin
Thomas M. Siebel Professor of Computer Science
Director of the Algorithms, Machines and People Lab
University of California, Berkeley
Key idea: “Big” is relative! “Difficult Data” is perhaps more apt!
Bill Howe, UW
Big Data Scenario: Netflix
Big Data Scenario: Amazon
Big Data Characteristics: 3 V’s
• Volume – the size of the data
  (Terabyte = 10^12, Exabyte = 10^18, Zettabyte = 10^21, Brontobyte = 10^27 bytes)
• Velocity – the speed at which new data is generated
• Variety – the diversity of sources, formats, quality, structures
They could also be 4 V’s, or 6 V’s, or even 10 V’s.
Traditional Data Warehouse Solution
Problem with Traditional DWH Solution
Tools for Big Data
• NoSQL Systems (a minimal document-store sketch follows this list)
  MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper, Neo4j
• MapReduce
  Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
• Storage
  S3 (Simple Storage Service), Hadoop Distributed File System (HDFS)
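To make the NoSQL category concrete, the following is a minimal document-store sketch. It assumes the third-party pymongo client, a MongoDB server on localhost:27017, and a made-up lecture_demo database; none of these names come from the lecture itself.

# Document-store sketch (assumes: pip install pymongo, MongoDB on localhost:27017).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client.lecture_demo.events  # hypothetical database and collection names

# No schema is declared up front: documents in the same collection may differ in structure.
events.insert_one({"type": "click", "user": "u42", "page": "/home"})
events.insert_one({"type": "sensor", "device": 7, "temp_c": 21.5, "tags": ["lab", "iot"]})

# Query by field value; only documents that contain the field can match.
for doc in events.find({"type": "click"}):
    print(doc)

The point of the sketch is the data model: unlike a relational table, the collection accepts structured, semi-structured, and irregular records side by side, which is exactly the variety these systems are built to handle.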
Big Data is Not JUST Hadoop → Big Data is a platform
Understand and navigate federated big data sources → Federated Discovery and Navigation
Manage & store huge volume of any data → Hadoop File System, MapReduce
Structure and control data → Data Warehousing
Manage streaming data → Stream Computing
Analyze unstructured data → Text Analytics Engine
Integrate and govern all data sources → Integration, Data Quality, Security, Lifecycle Management, MDM
Source: IBM http://www-01.ibm.com/software/data/bigdata/
IBM’s Big Data Platform
The key aspects of the platform are:
• Integration
• Analytics
• Visualization
• Development
• Workload Optimization
• Security and Governance
Source: IBM http://www-01.ibm.com/software/data/bigdata/
What is Hadoop
• Hadoop is a distributed file system and data processing engine that is designed to
handle extremely high volumes of data in any structure across large clusters of
computers.
• Hadoop has two components:
1. The Hadoop distributed file system (HDFS), which supports data in structured relational
form, in unstructured form, and in any form in between
2. The MapReduce programming paradigm for managing applications on multiple distributed servers (a minimal word-count sketch follows this slide)
• The focus is on supporting redundancy, distributed architectures, and parallel
processing
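To illustrate the MapReduce paradigm, here is a minimal word-count sketch. It assumes the mrjob library listed earlier under the MapReduce tools; the file name wordcount.py and the input paths are placeholders, not part of the lecture.

# Word count with mrjob (assumes: pip install mrjob).
from mrjob.job import MRJob

class MRWordCount(MRJob):
    # Map phase: split each input line into words, emitting (word, 1) pairs.
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    # Reduce phase: all counts for the same word arrive together and are summed.
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Run it locally for testing with "python wordcount.py input.txt", or against a configured Hadoop cluster with "python wordcount.py -r hadoop hdfs:///path/to/input"; the same mapper and reducer code is executed in parallel across the cluster’s nodes, which is the redundancy and parallelism described above.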
Scalability in Hadoop
What is Hadoop
Hadoop vs RDBMS
Bigger Picture: Hadoop vs. Other Systems
Computing Model
  Distributed Databases: notion of transactions; the transaction is the unit of work; ACID properties, concurrency control
  Hadoop: notion of jobs; the job is the unit of work; no concurrency control
Data Model
  Distributed Databases: structured data with a known schema; read/write mode
  Hadoop: any data fits, in any format – (un)(semi)structured; read-only mode
Cost Model
  Distributed Databases: expensive servers
  Hadoop: cheap commodity machines
Fault Tolerance
  Distributed Databases: failures are rare; recovery mechanisms
  Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance
Key Characteristics
  Distributed Databases: efficiency, optimizations, fine-tuning
  Hadoop: scalability, flexibility, fault tolerance
Some Hadoop Related Names to Know
• Apache Avro: designed for communication between Hadoop nodes through data
serialization
• Cassandra and HBase: non-relational databases designed for use with Hadoop
• Hive: a data warehouse layer for Hadoop that provides an SQL-like query language (HiveQL) (see the sketch after this list)
• Mahout: an AI tool designed for machine learning; that is, to assist with filtering
data for analysis and exploration
• Pig Latin: A data-flow language and execution framework for parallel computation
• ZooKeeper: Keeps all the parts coordinated and working together
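To show what HiveQL looks like in practice, here is a minimal sketch. It assumes the third-party PyHive client, a HiveServer2 instance on localhost:10000, and a hypothetical web_logs table; none of these come from the lecture, and the query itself is plain HiveQL text.

# HiveQL via PyHive (assumes: pip install "pyhive[hive]", HiveServer2 on localhost:10000,
# and a hypothetical table web_logs(page STRING, user_id STRING, ts TIMESTAMP)).
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="student")
cur = conn.cursor()

# HiveQL reads like SQL, but Hive compiles it into distributed jobs over data stored in HDFS.
cur.execute("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")

for page, hits in cur.fetchall():
    print(page, hits)

This is the sense in which Hive is compatible with Hadoop: the familiar query syntax stays, while execution happens on the cluster rather than in a single database server.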