CSIS22H
Advanced Database Systems
Lecture 8
Big Data
“I have been surprised and delighted over the years about how
many people are interested in working with data. There’s
definitely a new geek in town. And in 2015, this geek is a data
geek.”
Christian Chabot, founder and CEO - Tableau
“We have for the first time an economy based on a key
resource [information] that is not only renewable, but
self-generating. Running out of it is not a problem, but
drowning in it is.”
John Naisbitt, American author and public speaker

“Big data = Crude oil … But you need to refine the crude oil. Enter Data Science”
Carlos Somohano, Data Scientist - London

“It’s a great time to be a data geek.”
Roger Barga, Microsoft Research

“There is a big data revolution”
Prof. Gary King, Director for the IQSS - Harvard Univ.
Lecture Contents:
• Why Big Data?
• Definition – 3 & 4 Vs
• Tools for Big Data
• IBM’s Big Data Platform
• What is Hadoop
• Hadoop vs. Other Systems
• Some Hadoop Related Names to Know
Why Big Data?
• 2.5 quintillion (10^18) bytes of data are generated every day!
• Social media sites
• Sensors
• Digital photos
• Business transactions
• Location-based data
[Figure: enterprise data sources, including Website, Social Media, Billing, ERP, Network Switches, CRM, and RFID]
Source: IBM http://www-01.ibm.com/software/data/bigdata/
Why Big Data?
• Big data itself isn’t new – it’s been here for a while and growing exponentially. What is new is the technology to process and analyze it.
• Increase of storage capacities
• Increase of processing power
• Availability of data
Available technology can cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming.
It is all about deriving new insight for the business.
Why Big Data?
• Big data is about deriving new insight from previously untouched data & integrating that insight into your business operation.
• It’s about applying new tools to do more analytics on more data for more people.
Glen Mules – Big Data University
Big Data - Definition
“Big Data is any data that is expensive to manage and hard
to extract value from.”
Michael Franklin
Thomas M. Siebel Professor of Computer Science
Director of the Algorithms, Machines and People Lab
University of California, Berkeley
Key idea: “Big” is relative! “Difficult Data” is perhaps more apt!
Bill Howe, UW
Big Data Scenario: Netflix
Big Data Scenario: Amazon
Big Data Characteristics: 3 V’s
• Volume – the size of the data
  (Terabyte = 10^12, Exabyte = 10^18, Zettabyte = 10^21, Brontobyte = 10^27 bytes)
• Velocity – the speed at which new data is generated
• Variety – the diversity of sources, formats, quality, structures
They could also be 4 V’s, or 6 V’s, or even 10 V’s.
Traditional Data Warehouse Solution
Problem with Traditional DWH Solution
Tools for Big Data
• NoSQL Systems (a minimal document-store sketch follows this list)
  MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper, Neo4j
• MapReduce
  Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
• Storage
  S3 (Simple Storage Service), Hadoop Distributed File System (HDFS)
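To make the NoSQL category concrete, the following is a minimal document-store sketch. It assumes the third-party pymongo client, a MongoDB server on localhost:27017, and a made-up lecture_demo database; none of these names come from the lecture itself.

# Document-store sketch (assumes: pip install pymongo, MongoDB on localhost:27017).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client.lecture_demo.events  # hypothetical database and collection names

# No schema is declared up front: documents in the same collection may differ in structure.
events.insert_one({"type": "click", "user": "u42", "page": "/home"})
events.insert_one({"type": "sensor", "device": 7, "temp_c": 21.5, "tags": ["lab", "iot"]})

# Query by field value; only documents that contain the field can match.
for doc in events.find({"type": "click"}):
    print(doc)

The point of the sketch is the data model: unlike a relational table, the collection accepts structured, semi-structured, and irregular records side by side, which is exactly the variety these systems are built to handle.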
Big Data is Not JUST Hadoop → Big Data is a platform
Understand and navigate federated big data sources → Federated Discovery and Navigation
Manage & store huge volume of any data → Hadoop File System, MapReduce
Structure and control data → Data Warehousing
Manage streaming data → Stream Computing
Analyze unstructured data → Text Analytics Engine
Integrate and govern all data sources → Integration, Data Quality, Security, Lifecycle Management, MDM
Source: IBM http://www-01.ibm.com/software/data/bigdata/
IBM’s Big Data Platform
The key aspects of the platform are:
• Integration
• Analytics
• Visualization
• Development
• Workload Optimization
• Security and Governance
Source: IBM http://www-01.ibm.com/software/data/bigdata/
What is Hadoop
• Hadoop is a distributed file system and data processing engine that is designed to
handle extremely high volumes of data in any structure across large clusters of
computers.
• Hadoop has two components:
1. The Hadoop distributed file system (HDFS), which supports data in structured relational
form, in unstructured form, and in any form in between
2. The MapReduce programming paradigm for managing applications on multiple distributed servers (a minimal word-count sketch follows this slide)
• The focus is on supporting redundancy, distributed architectures, and parallel
processing
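To illustrate the MapReduce paradigm, here is a minimal word-count sketch. It assumes the mrjob library listed earlier under the MapReduce tools; the file name wordcount.py and the input paths are placeholders, not part of the lecture.

# Word count with mrjob (assumes: pip install mrjob).
from mrjob.job import MRJob

class MRWordCount(MRJob):
    # Map phase: split each input line into words, emitting (word, 1) pairs.
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    # Reduce phase: all counts for the same word arrive together and are summed.
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Run it locally for testing with "python wordcount.py input.txt", or against a configured Hadoop cluster with "python wordcount.py -r hadoop hdfs:///path/to/input"; the same mapper and reducer code is executed in parallel across the cluster’s nodes, which is the redundancy and parallelism described above.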
Scalability in Hadoop
What is Hadoop
Hadoop vs RDBMS
Bigger Picture: Hadoop vs. Other Systems
Computing Model
  Distributed Databases: notion of transactions; the transaction is the unit of work; ACID properties, concurrency control
  Hadoop: notion of jobs; the job is the unit of work; no concurrency control
Data Model
  Distributed Databases: structured data with a known schema; read/write mode
  Hadoop: any data fits, in any format – (un)(semi)structured; read-only mode
Cost Model
  Distributed Databases: expensive servers
  Hadoop: cheap commodity machines
Fault Tolerance
  Distributed Databases: failures are rare; recovery mechanisms
  Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance
Key Characteristics
  Distributed Databases: efficiency, optimizations, fine-tuning
  Hadoop: scalability, flexibility, fault tolerance
Some Hadoop Related Names to Know
• Apache Avro: designed for communication between Hadoop nodes through data
serialization
• Cassandra and HBase: non-relational databases designed for use with Hadoop
• Hive: a data warehouse layer for Hadoop that provides an SQL-like query language (HiveQL) (see the sketch after this list)
• Mahout: an AI tool designed for machine learning; that is, to assist with filtering
data for analysis and exploration
• Pig Latin: A data-flow language and execution framework for parallel computation
• ZooKeeper: Keeps all the parts coordinated and working together
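To show what HiveQL looks like in practice, here is a minimal sketch. It assumes the third-party PyHive client, a HiveServer2 instance on localhost:10000, and a hypothetical web_logs table; none of these come from the lecture, and the query itself is plain HiveQL text.

# HiveQL via PyHive (assumes: pip install "pyhive[hive]", HiveServer2 on localhost:10000,
# and a hypothetical table web_logs(page STRING, user_id STRING, ts TIMESTAMP)).
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="student")
cur = conn.cursor()

# HiveQL reads like SQL, but Hive compiles it into distributed jobs over data stored in HDFS.
cur.execute("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")

for page, hits in cur.fetchall():
    print(page, hits)

This is the sense in which Hive is compatible with Hadoop: the familiar query syntax stays, while execution happens on the cluster rather than in a single database server.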