What is Hadoop?
• An open source framework that allows distributed processing of large data-sets across a cluster of machines
• Runs on commodity hardware: typically low-performance, economical / affordable machines
• Open source framework written in Java
• Inspired by Google's Map-Reduce programming model as well as
its file system (GFS)
Hadoop History
• 2002: Doug Cutting started working on Nutch
• 2003–2004: Google published the GFS & MapReduce papers
• 2005: Doug Cutting added DFS & MapReduce to Nutch
• 2006: Development of Hadoop started as a Lucene sub-project
• 2007: 4 TB of image archives were converted over 100 EC2 instances
• 2008: Hadoop became a top-level Apache project and defeated supercomputers in sorting a terabyte of data; Hive launched, bringing SQL support to Hadoop
• 2009: Doug Cutting joined Cloudera
What is Hadoop?
• Open source software framework designed for storage and processing of large-scale datasets on large clusters of commodity hardware
• Large datasets: terabytes or petabytes of data
• Large clusters: hundreds or thousands of nodes
• Uses for Hadoop:
  • Data-intensive text processing
  • Graph mining
  • Machine learning and data mining
  • Large-scale social network analysis
What is Hadoop (Cont’d)
• The Hadoop framework consists of two main layers:
  • Hadoop Distributed File System (HDFS)
  • Execution engine (MapReduce)
Hadoop Master/Slave Architecture
• Hadoop is designed as a master-slave architecture
• Master node (single node)
• Many slave nodes
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
  • A large number of low-end, cheap machines working in parallel to solve a computing problem
Properties of HDFS
• Large: An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
• Replication: Each data block is replicated many times
(default is 3)
• Failure: Failure is the norm rather than the exception
• Fault Tolerance: Detection of faults and quick,
automatic recovery from them is a core architectural goal
of HDFS
Hadoop: How it Works
Hadoop Architecture
• Distributed file system (HDFS)
• Execution engine (MapReduce)
• Master node (single node)
• Many slave nodes
Hadoop Distributed File System (HDFS)
• Centralized namenode
  - Maintains metadata info about files
  - Files are divided into blocks (64 MB each); e.g., file F → blocks 1, 2, 3, 4, 5
• Many datanodes (1000s)
  - Store the actual data
  - Each block is replicated N times (default N = 3)
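The block size and replication factor above are ordinary Hadoop configuration properties. As a minimal sketch (not from the slides), they can be set programmatically; "dfs.blocksize" and "dfs.replication" are the standard property names, though in practice they usually live in hdfs-site.xml:

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 64 MB blocks, matching the slide; newer clusters default to 128 MB
        conf.set("dfs.blocksize", String.valueOf(64 * 1024 * 1024));
        // replicate each block 3 times (the HDFS default)
        conf.set("dfs.replication", "3");
        System.out.println("block size  = " + conf.get("dfs.blocksize"));
        System.out.println("replication = " + conf.get("dfs.replication"));
    }
}
```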
Hadoop Distributed File System
• NameNode:
  • Stores metadata (file names, block locations, etc.)
• DataNode:
  • Stores the actual HDFS data blocks
(figure: File1 is split into blocks 1–4; each block is replicated across the datanodes)
Data Retrieval
• When a client wants to retrieve data, it communicates with the NameNode to determine which blocks make up a file and on which datanodes those blocks are stored
• It then communicates directly with the datanodes to read the data
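In the Java API, that two-step interaction is hidden behind a single open() call. A minimal read sketch, assuming a running cluster and a hypothetical path /data/file.txt:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // connects to the default file system named in the cluster config
        FileSystem fs = FileSystem.get(conf);
        // open() asks the NameNode for block locations; the stream then
        // reads the bytes directly from the datanodes
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/file.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```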
MapReduce
Distributing computation across nodes
MapReduce Overview
• A method for distributing computation across multiple
nodes
• Each node processes the data that is stored at that node
• Consists of two main phases:
  • Map
  • Reduce
The Mapper
• Reads data as key/value pairs
• The key is often discarded
• Outputs zero or more key/value pairs
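A minimal mapper sketch in the spirit of the word count example later in the deck, using the org.apache.hadoop.mapreduce API; the class name is ours:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line in the file (discarded here),
        // value = one line of text
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // zero or more output pairs per input
            }
        }
    }
}
```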
Shuffle and Sort
• Output from the mapper is sorted by key
• All values with the same key are guaranteed to go to
the same machine
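That guarantee comes from partitioning the map output by key. A sketch mirroring the logic of Hadoop's default HashPartitioner:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Every record with the same key hashes to the same partition, so all of a
// key's values end up at the same reducer on the same machine.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // mask the sign bit so the partition index is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```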
The Reducer
• Called once for each unique key
• Gets a list of all values associated with a key as input
• The reducer outputs zero or more final key/value pairs
• Usually just one output per input key
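The matching reducer sketch for the mapper above: it is called once per unique word with the list of all of that word's counts, and emits a single total:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();  // add up all the 1s emitted for this word
        }
        context.write(key, new IntWritable(sum));  // one output pair per key
    }
}
```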
JobTracker and TaskTracker
• JobTracker
  • Determines the execution plan for the job
  • Assigns individual tasks
• TaskTracker
  • Keeps track of the performance of an individual mapper or reducer
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the namenode)
  • Receives the user’s job
  • Decides how many tasks will run (number of mappers)
  • Decides where to run each mapper (concept of locality)
• Example: a file with 5 blocks → run 5 map tasks; the task reading block “1” should run on a node that stores that block (e.g., Node 1 or Node 3)
Properties of MapReduce Engine (Cont’d)
• Task Tracker is the slave node (runs on each datanode)
  • Receives the task from the Job Tracker
  • Runs the task until completion (either a map or a reduce task)
  • Always in communication with the Job Tracker, reporting progress
• Example: one map-reduce job consisting of 4 map tasks and 3 reduce tasks
MapReduce Phases
• Deciding what will be the key and what will be the value is the developer’s responsibility
Map-Reduce Execution Engine (Example: Color Count)
• Input blocks on HDFS are fed to map tasks
• Map produces (k, v) pairs, e.g., (color, 1)
• Shuffle & sorting groups the pairs based on k
• Reduce consumes (k, [v]), e.g., (color, [1, 1, 1, 1, 1, 1, ...]), and produces (k’, v’), e.g., (color, 100)
• Users only provide the “Map” and “Reduce” functions
Key-Value Pairs
• Mappers and Reducers are users’ code (provided functions)
• Just need to obey the Key-Value pairs interface
• Mappers:
  • Consume <key, value> pairs
  • Produce <key, value> pairs
• Reducers:
  • Consume <key, <list of values>>
  • Produce <key, value>
• Shuffling and Sorting:
  • Hidden phase between mappers and reducers
  • Groups all pairs with the same key from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
Example 1: Word Count
• Job: Count the occurrences of each word in a data set
(figure: map tasks feeding reduce tasks)
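Putting the mapper and reducer sketches from earlier together, a minimal driver for this job might look like the following; the input and output paths come from the command line and are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // sketched earlier
        job.setReducerClass(WordCountReducer.class);  // sketched earlier
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```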
Example 2: Color Count
Job: Count the number of each color in a data set
• Input blocks on HDFS are fed to map tasks
• Map produces (k, v) pairs, e.g., (color, 1)
• Shuffle & sorting groups the pairs based on k
• Reduce consumes (k, [v]), e.g., (color, [1, 1, 1, 1, 1, 1, ...]), and produces (k’, v’), e.g., (color, 100)
• The output file has 3 parts (Part0001, Part0002, Part0003), probably on 3 different machines
Example 3: Color Filter
Job: Select only the blue and the green colors
• Each map task selects only the blue or green colors, producing (k, v) pairs, e.g., (color, 1), from input blocks on HDFS
• No need for a reduce phase; each map task writes its output directly to HDFS
• The output file has 4 parts (Part0001, Part0002, Part0003, Part0004), probably on 4 different machines
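A sketch of such a map-only filter, assuming the input is one color name per line (an assumption, since the slide’s data is pictorial):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ColorFilterMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String color = value.toString().trim();
        if (color.equals("blue") || color.equals("green")) {
            // keep only blue and green; everything else is dropped
            context.write(new Text(color), NullWritable.get());
        }
    }
}
```

In the driver, job.setNumReduceTasks(0) skips the reduce phase entirely, so each map task writes its output straight to HDFS as one part file.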
Other Tools
• Hive
• Hadoop processing with SQL
• Pig
• Hadoop processing with scripting
• HBase
• Database model built on top of Hadoop
Who Uses Hadoop?