Module 2 Hadoop
• Introduction to Hadoop
• Core Hadoop components
• Hadoop Ecosystem
• Physical architecture
• Hadoop Limitations
No. of Hrs. 6
Big Data issues
The vast amount of data generated every minute is difficult to acquire, store, search, share, analyze, and visualize.
A situation in which data sets have grown to such enormous sizes that conventional information technologies can no longer effectively handle either the size of the data set or its scale and growth is termed Big Data.
Desired properties of Big data System
• Robust and fault tolerant – the system must behave correctly even when machines go down randomly, and must cope with the semantics of consistency in distributed databases, duplicated data, concurrency, etc.
• Low-latency reads and updates – a few milliseconds.
• Scalable – scalability is the ability to maintain performance in the face of increasing data.
• Extensible – allows functionality to be added with minimal development cost.
• Debuggable – must provide the information needed to trace how any value in the system was computed.
Massive data Storage
The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up.
Traditional scenario: 2 orders per hour. Data is generated at a steady rate and is structured in nature. Traditional processing system: a traditional RDBMS.
Online orders started: 10 orders per hour. Big Data scenario: heterogeneous data is generated at an alarming rate by multiple sources, yet the processing system is still a traditional RDBMS.
Solution ?
• Orders have increased / so we have Big Data.
• Hire more chefs / add more processing units.
• BT did the same thing: multiple cooks cooking the food / multiple processing units for the data.
Bottleneck: all the cooks share the same food shelf. Likewise, bringing data to the processing units causes a lot of network overhead, congestion and delay; real-time operation is not possible.
Solution ??
• The food shelves / single storage system become the bottleneck.
• Adopt a distributed and parallel approach.
• BT did the same thing: each order is divided into tasks, and the shelf is distributed so that each cook has his own shelf with the same ingredients.
The analogy maps onto Hadoop as follows:
• Junior chefs → Map tasks; head chefs → Reduce tasks.
• Data locality: each unit processes the data stored with it.
• Storage: a Distributed File System (HDFS).
• Processing: parallel and distributed processing (Hadoop MapReduce).
• The whole system is scalable.
In pioneer days they used oxen for heavy pulling, and
when one ox couldn’t budge a log, they didn’t try to
grow a larger ox.
We shouldn’t be trying for bigger computers, but for
more systems of computers.
—Grace Hopper
Apache Hadoop
• It is a framework to process Big Data.
• It allows us to store and process large data sets in a parallel and distributed fashion.
• Data is stored locally on the processing units (data locality).
• All the units are interconnected to form a cluster.
Ex: food for thought!
• I have 1 terabyte of data on a drive, and the transfer speed is about 100 MB/s. How much time is required to read all of the data?
1 Kilobyte = 1,024 bytes
1 Megabyte = 1,024 Kilobytes
1 Gigabyte = 1,024 Megabytes
1 Terabyte = 1,024 Gigabytes
• It takes more than two and a half hours to read all the data off the disk.
• This is a long time to read all the data from the drive, and writing is even slower.
How to reduce the time ??
• Save the data in many drives (disks) and
access the data in parallel (To read data
from multiple disks at once)
• Imagine if we had 100 drives, each
holding one hundredth of the data.
Working in parallel, we could read the
data in under two minutes.
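A quick back-of-the-envelope check of these figures, as a minimal sketch in plain Java (the 1 TB size, the assumed ~100 MB/s transfer rate, and the 100-drive count are taken from the example above):

```java
// Rough read-time estimate for the example above (assumed figures, not a benchmark).
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double totalMB = 1024.0 * 1024.0;   // 1 TB expressed in MB (1 TB = 1,048,576 MB)
        double mbPerSecond = 100.0;         // assumed drive transfer rate of ~100 MB/s

        double singleDriveSeconds = totalMB / mbPerSecond;
        System.out.printf("Single drive : %.0f s (~%.1f hours)%n",
                singleDriveSeconds, singleDriveSeconds / 3600.0);

        int drives = 100;                   // 100 drives, each holding 1/100th of the data
        double parallelSeconds = singleDriveSeconds / drives;
        System.out.printf("100 drives   : %.0f s (~%.1f minutes)%n",
                parallelSeconds, parallelSeconds / 60.0);
    }
}
```

Running this prints roughly 10,486 seconds (about 2.9 hours) for one drive and about 105 seconds (under two minutes) for 100 drives working in parallel, matching the figures above.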
Think…
Isn't it a waste of storage capacity to store only one-hundredth of the data on each drive, when each drive's capacity is 1 TB?
• No: we can store one hundred datasets, each of one terabyte, and provide shared access to them.
• Both the storage and the access problems are solved, and hence the analysis time is shorter!
What could be the next problem with
multiple hardware/ Machines ?
• Hardware failure: the chance that a machine fails; you will then not be able to access that machine, data may be lost, and the data-analysis process is hindered.
• A common way of avoiding data loss is through
replication: redundant copies of the data are kept
by the system so that in the event of failure, there
is another copy available.
• Solution is Hadoop Distributed Filesystem (HDFS) !
What could be the next problem ??
• The second problem is that most analysis tasks
need to be able to combine the data in some
way; data read from one disk may need to be
combined with the data from any of the other
99 disks.
• Various distributed systems allow data to be
combined from multiple sources, but doing
this correctly is notoriously challenging.
• Solution is Map Reduce !
• This, in a nutshell, is what Hadoop provides: a reliable, shared storage and analysis system.
• The storage is provided by HDFS and the analysis by MapReduce; Hadoop has other parts, but these capabilities are its kernel.
To do two things
at once
is to do neither !!
A Brief History of Hadoop
• Hadoop was developed based on the paper written by Google on the MapReduce system, and it applies concepts of functional programming.
• Hadoop was developed by Doug
Cutting and Michael J. Cafarella.
• Hadoop is written in the Java programming language and is a top-level Apache project.
A Brief History of Hadoop
• Hadoop is an open-source software
framework, developed by a nonprofit
organization Apache Software Foundation
(ASF), used for storing and processing Big
Data in a distributed manner on large clusters
of commodity hardware.
• Hadoop is licensed under the Apache v2
license.
Hadoop Architecture
(Figure: a project manager with a team of four employees (Ram, Mohan, Shyam, Rakesh). The manager assigns each employee a part of the project (Ram: X, Mohan: Y, Shyam: Z, Rakesh: W), keeps a record of which employee holds which parts (Ram: X, Z; Mohan: Y, W; Rakesh: W, Y; Shyam: Z, X), and tracks the progress.)
Project manager: a team of 4 employees; he assigns the project and keeps track of the progress.
Hadoop: Master/Slave Architecture
(Figure: the same analogy mapped onto Hadoop. The project manager corresponds to the master node, which keeps the record of which part is held where (Ram: X, Z; Mohan: Y, W; Rakesh: W, Y; Shyam: Z, X); the four employees correspond to the slave nodes, each holding its assigned parts (Ram: X, Mohan: Y, Shyam: Z, Rakesh: W).)
Project manager: a team of 4 employees; he assigns the project and keeps track of the progress.
Hadoop has a Master-Slave Architecture for data storage and
distributed data processing using HDFS & Map Reduce methods.
HDFS : Storage
Hadoop Distributed File System
• It is the core component or you can
say, the backbone of Hadoop
Ecosystem.
• HDFS is the component that makes it possible to store different types of large data sets (i.e. structured, unstructured and semi-structured data).
• HDFS creates a level of abstraction
over the resources, from where we can
see the whole HDFS as a single unit.
• It helps us in storing our data across
various slave nodes and maintaining
the log file about the stored data
(metadata).
HDFS Core Components
Data is Distributed across all the
commodity hardware machines.
Master Node(Name Node)
▪ It is the master daemon that
maintains and manages the
Data Nodes (slave nodes).
▪ It records the metadata of all
the files stored in the cluster,
e.g. the size of the files,
number of blocks, location
of all blocks stored, in which
nodes they are stored,
permissions, hierarchy etc.
Master Node (Name Node)
▪ It records each and every change that
takes place to the file system metadata.
▪ This information is stored in main memory as well as on disk for persistent storage.
• FsImage: FsImage stands for File System image. It contains the complete namespace of the Hadoop file system since the NameNode's creation.
• Edit log: it contains all the changes made to the file system namespace since the most recent FsImage.
▪ For example, if a file is deleted in HDFS,
the Name Node will immediately record
this in the EditLog.
Master Node (Name Node)
• If a Data Node fails, the Name Node chooses new Data Nodes for new replicas.
• It regularly receives a Heartbeat and a block report from all the Data Nodes in the cluster to ensure that the Data Nodes are live.
(EditLog: keeps track of each and every change to HDFS. FsImage: stores the snapshot of the file system.)
Slave Nodes(Data Nodes)
• These are slave daemons which run on each slave machine.
• Each slave machine is commodity hardware: an inexpensive system which is not of high quality or high availability.
• Hadoop can be installed on any commodity hardware; we don't need supercomputers or high-end hardware to work on Hadoop.
• Commodity hardware includes RAM, because some services will be running in RAM. This is what makes Hadoop economical.
• The actual data is stored on the Data Nodes.
Slave Nodes(Data Nodes)
• They are responsible for serving read and write requests from the clients.
• They are also responsible for creating blocks, deleting blocks and replicating them based on the decisions taken by the Name Node.
• They also send heartbeats to the Name Node periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
Secondary Name Node
It is not a backup Name node.
• Name node stores metadata in main memory as well
as in disks for persistent storage.
• This information is stored in two files:
✔ Editlogs- It keeps track of each and every changes to
HDFS.
✔ Fsimage- It stores the snapshot of the HDFS at a
certain point of time.
✔ Any change made to HDFS gets recorded in the edit log; its file size increases while the size of the FsImage stays the same. This does not cause any problem until we restart the server.
• The NameNode merges the FsImage and EditLog files only at start-up, so the edit log file might get very large over time on a busy cluster.
• Whenever a NameNode is restarted, the latest state of the FsImage is built by applying the edit records to the last saved copy of the FsImage.
• That means that if the EditLog is very large, the NameNode restart process results in a considerable delay in the availability of the file system.
• So it is important to keep the edit log as small as possible, which is one of the main functions of the Secondary NameNode.
• Secondary NameNode in hadoop is a specially
dedicated node in HDFS cluster whose main function is
to take checkpoints of the file system metadata present
on Name Node. It just checkpoints Namenode’s file
system namespace.
• The Secondary Name Node is a helper to the primary
Name Node but not replacement for primary Name
Node.
• As the Name Node is the single point of failure in HDFS, if the Name Node fails the entire HDFS file system is lost.
• So in order to overcome this, Hadoop implemented
Secondary NameNode whose main function is to store a
copy of Fs Image file and edits log file.
File System Meta Data ??
So, at any point of time, applying edit logs
records to Fs Image (recently saved copy)
will give the current status of Fs Image, i.e.
file system metadata.
Whenever a Name Node is restarted, the latest
status of Fs Image is built by applying edits
records on last saved copy of Fs Image.
Fs Image is a snapshot of the HDFS file system metadata at a certain point of
time. Edit Log is a transaction log which contains records for every change that
occurs to file system metadata (Most recent change).
Checkpoint Node in Hadoop first downloads Fsimage and
edits from the Active Namenode. Then it merges them
(Fsimage and edits) locally, and at last, it uploads the new
image back to the active NameNode.
It stores the latest checkpoint in a directory that has the
same structure as the Namenode’s directory. This
permits the check pointed image to be always available
for reading by the NameNode if necessary.
It usually runs on a different machine than the primary NameNode, since its memory requirements are the same as those of the primary NameNode.
Secondary NameNode in hadoop maintains FsImage &
edits files in current directory which is similar to the
structure of NameNode’s current directory.
In summary :
Functions of Secondary Namenode
1. Stores a copy of Fs Image file and edits log.
2. Periodically applies edits log records to Fs Image file and
refreshes the edits log. And sends this updated FsImage file
to NameNode so that NameNode doesn’t need to re-apply
the EditLog records during its start up process. Thus
Secondary NameNode makes NameNode start up process
fast.
3. If the NameNode fails, the file system metadata can be recovered from the last saved FsImage on the Secondary NameNode, but the Secondary NameNode cannot take over the primary NameNode's functionality.
Functions of Secondary Namenode
Check pointing of File system metadata is performed. A
checkpoint is nothing but the updation of the latest FsImage
file by applying the latest Edits Log files to it .
If the gap between checkpoints is large, too many edit log records accumulate, and it becomes very cumbersome and time-consuming to apply them all at once to the latest FsImage file. This may lead to a long start-up time for the primary Name Node after a reboot.
The default checkpoint interval is 1 hour.
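The heartbeat and checkpoint intervals quoted above are ordinary Hadoop configuration properties. The following is a hedged sketch of reading them with the standard Configuration API; the property names shown (dfs.heartbeat.interval, dfs.namenode.checkpoint.period) are the Hadoop 2.x names, and the fallback defaults simply mirror the values stated in these slides (3 s heartbeat, 1 hr checkpoint):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: reading the intervals discussed above from the Hadoop configuration.
public class IntervalCheck {
    public static void main(String[] args) {
        // Loads core-site.xml / hdfs-site.xml if they are on the classpath.
        Configuration conf = new Configuration();

        long heartbeatSeconds = conf.getLong("dfs.heartbeat.interval", 3);
        long checkpointSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);

        System.out.println("DataNode heartbeat interval : " + heartbeatSeconds + " s");
        System.out.println("Checkpoint period           : " + checkpointSeconds + " s");
    }
}
```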
• 1. What is HDFS?
• Storage layer
• Batch processing engine
• Resource management layer
• All of the above
• Storage layer
• 2. Which among the following is the correct
statement ?
• Datanode manage file system namespace
• Namenode stores metadata
• NameNode stores actual data
• All of the above
• Namenode stores metadata
3. The namenode knows that the data node is
active using a mechanism known as
• Active pulse
• Data pulse
• Heartbeats
• h-signal
Heartbeats
4. When a file is deleted from HDFS system,
its records can be seen
• Active pulse
• FS Image
• Heart beats
• Edit Logs
Edit Logs
5. Single point of failure in HDFS is
• Name Node
• Secondary Name Node
• Edit Logs
• Data node
Name Node
Answer in one word
1. The Heartbeat period is ??
2. The default value of Checkpoint procedure
is ??
3. Metadata is stored on which Node ??
4. Name the backup node in the Hadoop system.
Answer in one word
5. Who stores a copy of Fs Image file and
edits log along with name node ?
6. If NameNode is failed, File System
metadata can be recovered from the last
saved FsImage on the Secondary
NameNode, True or False ?
How the data storage takes place in HDFS ?
Each file is stored in HDFS as blocks: the entire file is broken into chunks of data called blocks. In Apache Hadoop 1, the default block size is 64 MB.
• Internally, HDFS splits the file into block-sized chunks called blocks. The block size is 128 MB by default in Apache Hadoop 2. One can configure the block size as per the requirement.
• For example, if there is a file of size 612 MB, then HDFS will create four blocks of 128 MB and one block of 100 MB.
• A file of a smaller size does not occupy the full block-size space on the disk; a file stored in HDFS does not need to be an exact multiple of the configured block size.
• The user doesn't have any control over the location of the blocks.
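To make the 612 MB example concrete, here is a minimal sketch in plain Java (no HDFS API involved) that splits a file size into block-sized chunks exactly as described; the 128 MB figure is the Apache Hadoop 2 default quoted above:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: how a file size is carved into HDFS-style blocks (e.g. 612 MB -> 128,128,128,128,100).
public class BlockSplit {
    static List<Long> blockSizes(long fileSizeMb, long blockSizeMb) {
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSizeMb;
        while (remaining > 0) {
            long block = Math.min(blockSizeMb, remaining); // last block may be smaller
            blocks.add(block);
            remaining -= block;
        }
        return blocks;
    }

    public static void main(String[] args) {
        System.out.println(blockSizes(612, 128)); // prints [128, 128, 128, 128, 100]
    }
}
```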
The default size of the HDFS data block is 128 MB.
The reasons for the large block size are:
• To minimize the cost of seeks: with large blocks, the time taken to transfer the data from the disk is much longer than the time taken to seek to the start of the block. Data is therefore transferred at close to the disk transfer rate.
• If blocks are small, there will be too many blocks in
Hadoop HDFS and thus too much metadata to store.
• Managing such a huge number of blocks and metadata
will create overhead and lead to traffic in a network.
Replication Management
• In a distributed system, the data must be kept redundantly in multiple places so that if one machine fails, the data is still accessible from other machines.
• In Hadoop, HDFS stores replicas of a block on multiple
Data Nodes based on the replication factor.
• The replication factor is the number of copies to be
created for blocks of a file in HDFS architecture.
• If the replication factor is 3, then three copies of a
block get stored on different Data Nodes. So if one Data
Node containing the data block fails, then the block is
accessible from the other Data Node containing a
replica of the block.
EX
If we store a file of 128 MB and the replication factor is 3, then (3 × 128 = 384) 384 MB of disk space is occupied for the file, as three copies of the block get stored.
This replication mechanism makes HDFS fault-tolerant.
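The replication factor can also be inspected and changed per file through the HDFS FileSystem API. A hedged sketch follows (the path /data/sample.txt is purely illustrative, and a reachable cluster configured via fs.defaultFS is assumed):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: inspecting and changing the replication factor of a single HDFS file.
public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);          // connects to the cluster named in fs.defaultFS

        Path file = new Path("/data/sample.txt");      // hypothetical file used for illustration
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);

        // Ask the NameNode to keep 3 copies of every block of this file.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}
```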
Rack Awareness
The Rack is the collection of around 40-50 Data Nodes connected using the
same network switch.
If the network goes down, the whole rack becomes unavailable. A large Hadoop cluster is deployed across multiple racks in a data centre.
Communication between Data Nodes on the same rack is more efficient than communication between Data Nodes residing on different racks.
(Hierarchy: Node → Rack → Data Centre → Cluster)
Rack Awareness
• To reduce the network traffic during file
read/write, NameNode chooses the
closest DataNode for serving the client
read/write request.
• The NameNode maintains the rack ID of each DataNode to obtain this rack information.
• This concept of choosing the closest Data
Node based on the rack information is
known as Rack Awareness.
• The reasons for the Rack Awareness in Hadoop are:
✔To reduce the network traffic while file
read/write, which improves the cluster
performance.
✔To achieve fault tolerance, even when the rack
fails.
✔Achieve high availability of data so that data is
available even in unfavorable conditions.
✔To reduce the latency, that is, to make the file
read/write operations done with lower delay.
Rack Awareness Algorithm
To ensure that all the replicas of a block are not stored on the same rack or a single rack, the NameNode follows a rack awareness algorithm to store replicas and provide low latency and fault tolerance:
✔ No more than one replica is placed on any one node.
✔ No more than two replicas are placed on the same rack.
✔ The number of racks used for block replication should always be smaller than the number of replicas.
• Suppose if the replication factor is 3, then
according to the rack awareness algorithm:
✔The first replica will get stored on the local rack.
✔The second replica will get stored on the other
Data Node in the same rack.
✔The third replica will get stored on a different rack, near to the first rack, so as to have higher bandwidth and low latency.
Example: a 300 MB file = A (128 MB) + B (128 MB) + C (44 MB), i.e. 3 blocks.
Advantage of Rack Awareness Algorithm
1. Preventing data loss against rack
failure
2. Minimize the cost of write and
maximize the read speed
3. Maximize network bandwidth and
low latency
Replication ensures high Reliability and Availability of the System
Summary
• HDFS is the distributed file system in Hadoop for
storing huge volumes and variety of data.
• HDFS follows the master-slave architecture
where the Name Node is the master node, and
Data Nodes are the slave nodes.
• The files in HDFS are broken into data blocks.
• The Name Node stores the metadata about the blocks, while the Data Nodes store the data blocks; the blocks are placed on different slave nodes as per the rack awareness algorithm.
1.Which of the following is the correct statement?
• DataNode is the slave/worker node and holds the
user data in the form of data blocks
• Each incoming file is broken into 32 MB blocks by default
• NameNode stores user data in the form of Data
Blocks
• None of these
2. What is default replication factor?
• 1
• 2
• 3
• 5
3
3. Which scenario demands highest bandwidth
for data transfer between nodes…
• Different nodes on the same rack
• Nodes on different racks in the same data
center.
• Nodes in different data centers
• Data on the same node.
Nodes in different data centers !
4. What is HDFS Block in Hadoop?
• It is the logical representation of data
• It is the physical representation of data
• Both the above
• None of the above
It is the physical representation of data
5. The need for data replication can arise in
various scenarios like
• Replication Factor is changed
• Data Node goes down
• Data blocks get corrupted
• None of the above
6. A file in HDFS that is smaller than a single
block size
• Cannot be stored in HDFS
• Occupies the full block size
• Occupies only the size it needs and not the
full block
• Can span over multiple blocks
Occupies only the size it needs and not the full
block
7. Which among the following are the duties of the
NameNodes
• Manage file system namespace
• It is responsible for storing actual data
• Perform read-write operation as per request for the
clients
• None of the above
• Manage file system namespace
8. In a Hadoop cluster, what is true for a HDFS block
that is no longer available due to disk corruption or
machine failure?
• It is lost forever
• It can be replicated from its alternative locations to
other live machines.
• The namenode allows new client requests to keep
trying to read it.
It can be replicated from its alternative locations to
other live machines.
Give answer in one word…
• The default value of the block size in Apache Hadoop 1 is
• 64 MB
• The default value of the block size in Apache Hadoop 2 is
• 128 MB
How to Access HDFS for Reading & Writing ?
- Local machines can access the cluster through the gateway, using the REST API, with the help of Oozie.
- The purpose of an edge node is to provide an access point to the cluster and prevent users from connecting directly to critical components such as the Name Node or the Data Nodes.
- The Hadoop client is installed and configured on this node.
(Figure: local machines LM1, LM2, …, LMn inside the corporate organization connect to the Hadoop cluster through a gateway node / edge node / client node.)
HDFS Write Operation
• To write data in HDFS, the client first interacts with
the NameNode to get permission to write data and to
get IPs of DataNodes where the client writes the data.
• The client then directly interacts with the DataNodes for
writing data.
• The DataNode then creates a replica of the data block
to other DataNodes in the pipeline based on the
replication factor.
• DFSOutputStream in HDFS maintains two queues (data
queue and ack queue) during the write operation.
(Figure: write pipeline. The client writes blocks A, B and C; each block is pushed to the first Data Node of its pipeline, forwarded along the pipeline, and acknowledgements flow back to the client.)
Steps in Data Write operation in HDFS
1. Create in Distributed File System
2. Create a new file
3. Data write
4. Data Queue
5. Streamer
6. Data node pipe line
7. Data (packet) Write
8. Ack from data nodes to first data node
9. Ack to FS data output stream
10. Ack queue clear
11. Close from Client to FS data output stream
12. Complete
Interaction of a Client node with Name node
• The client calls the create() method on Distributed File
System to create a file. (step 1)
• Distributed File System interacts with Name Node
through the RPC call to create a new file in the file
system namespace with no blocks associated with it.
(step 2)
• The Name Node first checks whether the client (local machine, LM) has the privileges to write a file.
• If the client has sufficient privileges and no file with the same name already exists, the Name Node creates a record for the new file.
Interaction of a Client node with Name node
• Name Node then provides the address of all
Data Nodes, where the client can write its data.
• Name Node also provides a security token to
the client, which they need to present to the
Data Nodes before writing the block.
• If the file already exists in the HDFS, then file
creation fails, and the client receives an IO
Exception.
• Once a new record in Name Node is created, an object
of type FS Data Output Stream is returned to the
client.
• After receiving the list of Data Nodes and the file-write permission, the client uses them to write data into HDFS. The data write method is invoked. (step 3)
• FS Data Output Stream contains DFS Output Stream
object which looks after communication with Data
Nodes and Name Node.
• While the client continues writing data, the DFS Output Stream keeps creating packets from this data.
• These packets are enqueued into a queue called the Data Queue. (step 4)
• There is one more component called Data
Streamer which consumes this Data Queue.
• Data Streamer also asks Name Node for allocation of
new blocks thereby picking desirable Data Nodes to be
used for replication. (step 5)
• The Data Streamer pours packets into the first Data
Node in the pipeline. (step 6) and starts writing data
directly to the first Data Node in the list.
• Every Data Node in a pipeline stores packet received by
it and forwards the same to the second Data Node in a
pipeline. (the Data Node starts making replicas of a
block to other Data Nodes depending on the
replication factor) (step 7)
• If the replication factor is 3, then there will be a
minimum of 3 copies of blocks created in
different DataNodes, and after creating required
replicas, it sends an acknowledgment to the
client.
• Another queue, 'Ack Queue' is maintained by DFS
Output Stream to store packets which are
waiting for acknowledgment from Data Nodes.
• Once acknowledgment for a packet in the queue is
received from all Data Nodes in the pipeline, it is
removed from the 'Ack Queue'.
• In the event of any DataNode failure, packets from
this queue are used to reinitiate the operation.
• After the client is done writing data, it calls the close() method. The call to close() flushes the remaining data packets to the pipeline and then waits for acknowledgments. (step 11)
• Once a final acknowledgment is received, Name
Node is contacted to tell it that the file write
operation is complete. (step 12)
• Note: all blocks are written simultaneously, while the replication of each block proceeds sequentially along its pipeline.
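From the client's point of view, the write path described above reduces to a few FileSystem calls. The following is a minimal sketch (the path /user/demo/out.txt is hypothetical and a reachable cluster is assumed): create() corresponds to steps 1-2, write() feeds the data/ack queues and the pipeline, and close() corresponds to steps 11-12.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

// Sketch: writing a file to HDFS with the client API described in the steps above.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);              // DistributedFileSystem when fs.defaultFS is hdfs://

        Path out = new Path("/user/demo/out.txt");         // hypothetical output path
        try (FSDataOutputStream stream = fs.create(out)) { // steps 1-2: NameNode creates the file record
            // steps 3-7: data is packetized, queued and pushed down the DataNode pipeline
            stream.write("Welcome to Hadoop Class\n".getBytes(StandardCharsets.UTF_8));
        }                                                  // close(): flush, wait for acks, complete (steps 11-12)

        fs.close();
    }
}
```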
Data Write Operation
HDFS Data Read operation
1. A client initiates read request by
calling 'open()' method of File System object; it
is an object of type Distributed File System.
(step 1)
2. This object connects to name node using RPC
and gets metadata information such as the
locations of the blocks of the file. Please note
that these addresses are of first few blocks of a
file. (step2)
3. In response to this metadata request, addresses
of the DataNodes having a copy of that block is
returned back.(step 3)
4. Once addresses of DataNodes are received,
an object of type FS Data Input Stream is
returned to the client.
5. FS Data Input Stream contains DFS Input
Stream which takes care of interactions
with DataNode and NameNode.
6. Client invokes 'read()' method which
causes DFS Input Stream to establish a
connection with the first Data Node with
the first block of a file. (step 4)
7. Data is read in the form of streams wherein
client invokes 'read()' method repeatedly.
This process of read() operation continues till
it reaches the end of block.
8. Once the end of a block is reached, the DFS Input Stream closes the connection and moves on to locate the next Data Node for the next block.
9. Once the client is done with reading, it calls the close() method.
(Figure: example block locations. Block A is stored on DataNodes D2, D5 and D7; block B is stored on DataNodes D3, D9 and D11.)
• The Client interacts with the HDFS NameNode.
• As the NameNode stores the block metadata for the file "File.txt", the client reaches out to the NameNode asking for the locations of the DataNodes containing the data blocks.
• The NameNode first checks for the required privileges, and if the client has sufficient privileges, the NameNode sends the locations of the DataNodes containing blocks A and B.
• The NameNode also gives a security token to the client, which the client needs to show to the DataNodes for authentication.
Let the NameNode provide the following list of IPs for blocks A and B: for block A, the locations of DataNodes D2, D5, D7; for block B, the locations of DataNodes D3, D9, D11.
The client interacts with the HDFS DataNodes
• After receiving the addresses of the DataNodes, the client interacts directly with the DataNodes.
• The client sends a request to the closest DataNodes (D2 for block A and D3 for block B) through the FS Data Input Stream object. The DFS Input Stream manages the interaction between the client and the DataNode.
• The client shows the security tokens provided by the NameNode to the DataNodes and starts reading data from the DataNode. The data flows directly from the DataNode to the client.
• After reading all the required file blocks, the client calls the close() method on the FS Data Input Stream object.
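The read path is the mirror image of the write path: open() fetches the block locations from the NameNode, read() streams the bytes from the nearest DataNodes, and close() ends the session. A minimal sketch, reusing the same hypothetical path as in the write example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch: reading a file back from HDFS with the client API described in the steps above.
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path in = new Path("/user/demo/out.txt");            // hypothetical input path
        try (FSDataInputStream stream = fs.open(in);          // steps 1-3: NameNode returns block locations
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {      // data flows directly from the DataNodes
                System.out.println(line);
            }
        }                                                     // close(): done reading

        fs.close();
    }
}
```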
Fill in the blanks…
• If a particular file exists before writing, error is
informed to client node via….
• Name node creates a file in ----- on the
request of ------- ?
• The HDFS client calls which method on DFS to
write a file ?
• The HDFS client calls which method on DFS to
read a file ?
Give answer in one word…
• Users can access cluster via which node ??
• The client interacts for data writing with which
node ?
• In Hadoop, permission to a client for writing is
provided by which node?
• Client node will start writing on data node
only after presenting------
• Number of queues in HDFS output stream.
Map Reduce: Processing
Ex: Big Data Analytics class
Task: count how many times the word "Map Reduce" occurs in the Big Data Analytics book (1000 pages).
Serial approach: students A, B, C and D each count through the entire book. Results: A: 102, B: 100, C: 103, D: 102. The majority answer is 102, so 102 is taken as the correct answer. Time taken: about 4 hrs.
Ex: Big Data Analytics class (Map Reduce approach)
The pages are divided among the students: A: 1-250, B: 251-500, C: 501-750, D: 751-1000.
Map: each student counts the word in his own portion: A: 16, B: 17, C: 10, D: 59 (about 1 hr).
Reduce: the partial counts are summed: 16 + 17 + 10 + 59 = 102, the correct answer (about 1 min).
Map Reduce
• MAPREDUCE is a software framework and
programming model used for processing huge
amounts of data.
• MapReduce programs work in two phases, namely Map and Reduce.
• Map tasks deal with the splitting and mapping of data, while Reduce tasks shuffle and reduce the data.
• Hadoop is capable of running Map Reduce programs
written in various languages: Java, Ruby, Python, and
C++.
• MapReduce programs are parallel in nature, thus are
very useful for performing large-scale data analysis
using multiple machines in the cluster.
The whole process goes through four phases of execution
✔Splitting,
✔Mapping,
✔Shuffling, and
✔Reducing.
Consider you have following input data for
Map Reduce Word Count Program
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
Map stage : To process the input data.
• Input splits: an input (a file stored in HDFS) to a MapReduce job is divided into fixed-size pieces called input splits. Each chunk of the input is given to a mapper function, line by line.
• Mapping : Map takes a chunk of data and
converts it into another set of data, where
individual elements are broken down into
tuples (key/value pairs).
Ex: Job of mapping phase is to count a number of
occurrences of each word from input splits and
prepare a list in the form of <word, frequency>
Reduce : To process the data coming from
Mapper
Shuffling: This phase consumes the output of
Mapping phase. Its task is to consolidate the
relevant records from Mapping phase output.
In our example, the same words are clubbed
together along with their respective frequency.
Reducing: In this phase, output values from the
Shuffling phase are aggregated. This phase
combines values from Shuffling phase and
returns a single output value. In short, this
phase summarizes the complete dataset.
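The word-count walk-through above corresponds to the classic Hadoop MapReduce program. The following condensed sketch shows it in Java (class names and the input/output paths passed on the command line are illustrative; the input could be the three sample lines shown earlier):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: the classic word-count job matching the Split -> Map -> Shuffle -> Reduce phases above.
public class WordCount {

    // Map: one input line at a time -> (word, 1) pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: after the shuffle, each word arrives with all its 1s -> sum them
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. the sample lines stored in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

For the sample input, the job would emit pairs such as <Hadoop, 3>, <is, 2>, <good, 1>, <bad, 1>, <Welcome, 1>, <to, 1>, <Class, 1>.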
The execution of both Map and Reduce tasks is controlled by two types of entities:
• Job tracker: Resides on
Name node, acts like
a master (responsible for
scheduling job runs,
managing computational
resources and complete
execution of submitted
job)
• Multiple Task Trackers: reside on the data nodes and act like slaves, each of them performing the job.
(Job History Server: a daemon that saves historical information about completed tasks/applications.)
1. A job is divided into multiple tasks which are
then run onto multiple data nodes in a cluster.
2. It is the responsibility of job tracker to
coordinate the activity by scheduling tasks to
run on different data nodes.
3. The execution of each individual task is then looked after by a task tracker, which resides on every data node executing part of the job.
4. The task tracker's responsibility is to send progress reports to the job tracker.
5. In addition, the task tracker periodically sends a 'heartbeat' signal to the Job tracker so as to notify it of the current state of the system.
6. Thus job tracker keeps track of the overall
progress of each job.
7. In the event of task failure, the job tracker can
reschedule it on a different task tracker
depending upon where the data is present.
Map Reduce Limitations
1. Scalability issues: Single Job Tracker (a cluster
of 5000 nodes and 40,000 tasks running
concurrently).
2. Inefficient utilization of computational
resources (food for thought !)
3. Hadoop framework limited to map reduce
processing paradigm (No flexibility)
• To overcome all these issues, YARN was
introduced in Hadoop version 2.0 in the year
2012 by Yahoo and Hortonworks.
• The basic idea behind YARN is to relieve
MapReduce by taking over the responsibility
of Resource Management and Job Scheduling.
• YARN started to give Hadoop the ability to run
non-MapReduce jobs within the Hadoop
framework.
Quick Revision
State True or False
1. An input (a file stored in HDFS) to a MapReduce job is divided into fixed-size pieces called input splits.
2. This chunk of the input is given to the Reducer function line by line.
Apache Yarn
• “Yet Another Resource Negotiator” is the
resource management layer of Hadoop.
• The Yarn was introduced in Hadoop 2.x.
• Yarn allows different data processing engines like
graph processing, interactive processing, stream
processing as well as batch processing to run and
process data stored in HDFS.
• Apart from resource management, Yarn also
does job Scheduling.
• Apache yarn is also a data operating system for
Hadoop 2.x.
• This architecture of Hadoop 2.x provides a
general purpose data processing platform which
is not just limited to the MapReduce.
• It enables Hadoop to run purpose-built data processing systems other than MapReduce.
• It allows running several different frameworks
on the same hardware where Hadoop is
deployed.
Hadoop Yarn Architecture
Resource Manager: Runs on a master daemon and
manages the resource allocation in the cluster.
Node Manager: runs as a slave daemon and is responsible for the execution of tasks on every single Data Node.
Application Master: Manages the user job lifecycle and
resource needs of individual applications. It works
along with the Node Manager and monitors the
execution of tasks(one per application).
Container: Package of resources including RAM, CPU,
Network, HDD etc on a single node.
Resource Manager
• It is the master daemon of
Yarn.
• RM manages the global
assignments of resources
(CPU and memory) among
all the applications.
• It arbitrates system
resources between
competing applications.
• Resource Manager has two
Main components:
✔ Scheduler
✔ Application manager
Resource Manager Components
a) Scheduler
• The scheduler is responsible for allocating resources to the running applications.
• The scheduler is a pure scheduler: it performs no monitoring or tracking of the applications, and it gives no guarantee about restarting failed tasks, whether they fail due to application failure or hardware failure.
b) Application Manager
• It manages the running Application Masters in the cluster, i.e. it is responsible for starting Application Masters and for monitoring and restarting them on different nodes in case of failures.
Node Manager
• It is the slave daemon of Yarn.
• NM is responsible for containers monitoring,
their resource usage and reporting the same to
the Resource Manager.
• Manage the user process on that machine.
• Yarn Node Manager also tracks the health of the
node on which it is running.
• The MapReduce shuffle is a typical auxiliary service provided by the NMs for MapReduce applications on YARN.
Container
• It is a collection of physical resources such as RAM,
CPU cores, and disks on a single node.
• YARN containers are managed by a container launch context (CLC), which describes the container life cycle. This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process.
• It grants rights to an application to use a specific
amount of resources (memory, CPU etc.) on a specific
host.
Application Master
• One application master runs per application.
• It negotiates resources from the resource
manager and works with the node manager.
It Manages the application life cycle.
• The AM acquires containers from the RM’s
Scheduler before contacting the
corresponding NMs to start the application’s
individual tasks.
•Apache Hadoop is the most powerful tool of Big Data.
•Hadoop ecosystem revolves around three main components HDFS, MapReduce, and
YARN.
•Apart from these Hadoop Components, there are some other Hadoop ecosystem
components also, that play an important role to boost Hadoop functionalities.
Hadoop EcoSystem:
It is neither a programming language nor a single
service.
Hadoop Ecosystem is a platform or a suite or a
framework which helps in solving the big data
problems.
There are four major elements of Hadoop i.e. HDFS, Map
Reduce, YARN, and Hadoop Common.
Hadoop Common: the necessary Java Archive (JAR) files and scripts needed to start Hadoop. Hadoop requires JRE 1.6 or a higher version. The standard start-up and shutdown scripts need Secure Shell (SSH) to be set up between the nodes in the cluster.
• It includes Apache projects and various open-source and commercial tools and solutions for ingesting, storing, analyzing and maintaining data.
• Various components
together form a
Hadoop ecosystem.
HDFS -> Hadoop Distributed File System
YARN -> Yet Another Resource Negotiator
MapReduce -> Data processing using programming
PIG, HIVE-> Data Processing Services using Query (SQL-
like)
HBase -> NoSQL Database
Mahout, Spark MLlib -> Machine Learning
Spark -> In-memory Data Processing
Apache Drill -> SQL on Hadoop
Zookeeper -> Managing Cluster
Oozie -> Job Scheduling
Flume, Sqoop -> Data Ingesting Services
Ambari -> Provision, Monitor and Maintain cluster
Apache PIG
• The need for Apache Pig
came up when many
programmers weren’t
comfortable with Java and
were facing a lot of
struggle working with
Hadoop, especially, when
Map Reduce tasks had to
be performed.
• Apache Pig came into the
Hadoop world as a boon
for all such programmers.
• Pig Hadoop is basically a high-level
programming language that is helpful for the
analysis of huge datasets, to execute queries
(No need of map reduce programs) on huge
datasets that are stored in Hadoop HDFS.
• Pig Hadoop was developed by Yahoo! and is
generally used with Hadoop to perform a lot of
data administration operations.
• Pig Latin is a query-based language used in Pig, which is very similar to SQL; the Pig runtime provides the execution environment.
• After the introduction of Pig Latin, now, programmers
are able to work on Map Reduce tasks without the use
of complicated codes as in Java.
• To reduce the length of code, the multi-query approach is used by Apache Pig, which reduces the development time by almost 16 times.
10 lines of Pig Latin ≈ approximately 200 lines of MapReduce Java code
• Since Pig Latin is very similar to SQL, it is comparatively
easy to learn Apache Pig if we have little knowledge of
SQL.
• This language provides various operators using which
programmers can develop their own functions for
reading, writing, and processing data.
Features : Apache Pig comes with the following
features −
• Rich set of operators − It provides many
operators to perform operations like join, sort,
filter, etc.
• Ease of programming − Pig Latin is similar to
SQL and it is easy to write a Pig script if you are
good at SQL.
• Optimization opportunities − The tasks in
Apache Pig optimize their execution
automatically, so the programmers need to
focus only on semantics of the language.
Features :
• Extensibility − Using the existing operators,
users can develop their own functions to read,
process, and write data.
• UDF’s − Pig provides the facility to create User-
defined Functions in other programming
languages such as Java and invoke or embed
them in Pig Scripts.
• Handles all kinds of data − Apache Pig analyzes
all kinds of data, both structured as well as
unstructured. It stores the results in HDFS.
• To analyze data using Apache Pig, programmers
need to write scripts using Pig Latin language.
• All these scripts are internally converted to Map
and Reduce tasks.
• Apache Pig has a component known as Pig
Engine that accepts the Pig Latin scripts as input
and converts those scripts into MapReduce jobs.
• It produces a sequential set of MapReduce jobs,
and that’s an abstraction (which works like black
box).
Apache Hive
• Apache Hive is an open source data warehouse
system used for querying and analyzing large
datasets stored in Hadoop files.
• It processes structured data in Hadoop, much like a relational database.
• Hive supports analysis of large datasets stored in HDFS as well as in the Amazon S3 file system.
• Hive uses the language called HiveQL (HQL), which is
similar to SQL.
• HiveQL automatically translates SQL-like queries into
MapReduce jobs.
It gives you a platform for building data flow for
ETL (Extract, Transform and Load), processing
and analyzing huge data sets.
In Pig, the load command first loads the data. Then you perform various functions on it like grouping, filtering, joining, and sorting. At last, you can either dump the data on the screen or store the result back in HDFS.
S.No. | Apache Pig | Apache Hive
1 | Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. | Hive uses a language called HiveQL. It was originally created at Facebook.
2 | Pig Latin is a data flow language. | HiveQL is a query processing language.
3 | Pig Latin is a procedural language and it fits in the pipeline paradigm. | HiveQL is a declarative language.
4 | Apache Pig can handle structured, unstructured, and semi-structured data. | Hive is mostly for structured data.
In a procedural language, you define the whole process
and provide the steps how to do it. You define how the
process will be served.
In a declarative language, you just set the command or
order, and let it be on the system how to complete
that order.
Apache Pig is generally used by data scientists for
performing tasks involving ad-hoc processing and quick
prototyping. Apache Pig is used −
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time sensitive data loads.
Think & answer
• A jet engine generates various types of data from different
sensors like pressure sensor, temperature sensor, speed
sensor, etc. which indicates the health of the engine. This is
very useful to understand the problems and status of the
flight.
• Continuous Engine Operations generates 500 GB data per
flight and there are 3000 flights per day approximately.
• So, Engine Analytics applied to such data in near real time
can be used to proactively diagnose problems and reduce
unplanned downtime.
• This requires a distributed environment to store large amounts of data with fast random reads and writes for real-time processing. Here, HBase comes to the rescue.
Apache HBase
• Apache HBase is NoSQL database that runs on
the top of Hadoop.
• It is an open source, distributed, scalable,
written in java, column oriented store that runs
on top of HDFS.
• HBase is important and mainly used when you
need random, real-time, read or write access to
your Big Data.
• It is a database that stores structured data in
tables that could have billions of rows and
millions of columns.
Apache HBase
• HBase also provides real-time access to read or write data in HDFS. (As it is based on columns rather than rows, this essentially increases the speed of execution of operations when they need to be performed on similar values across massive data sets.)
• Hbase does not provide its own query or
scripting language, but is accessible through
java, Thrift and Rest APIs.
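Since HBase is accessed through its Java API (among others), here is a hedged sketch of a basic put and get; the table name engine_data and the column family sensors are illustrative assumptions tied to the jet-engine example above, not part of any standard schema:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: random, real-time write and read against an HBase table via the Java client API.
public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("engine_data"))) { // assumed table

            // Write: one row keyed by flight id, one cell in the 'sensors' column family
            Put put = new Put(Bytes.toBytes("flight-001"));
            put.addColumn(Bytes.toBytes("sensors"), Bytes.toBytes("temperature"), Bytes.toBytes("642"));
            table.put(put);

            // Read the same cell back (random, real-time access by row key)
            Get get = new Get(Bytes.toBytes("flight-001"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("sensors"), Bytes.toBytes("temperature"));
            System.out.println("temperature = " + Bytes.toString(value));
        }
    }
}
```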
• HBase achieves high throughput and low
latency by providing faster Read/Write Access
on huge data sets. Therefore, HBase is the
choice for the applications which require fast
& random access to large amount of data.
• It provides compression, in-memory
operations and Bloom filters (data structure
which tells whether a value is present in a set
or not) to fulfill the requirement of fast and
random read-writes.
S.No. | Name | Address | Product Id | Product Name
1 | A K Paul | Chembur, Mumbai | 125 | Socks
2 | V. Richard | Shakti Nagar, Delhi | 860 | Key Board

Row-oriented database: 1, A K Paul, Chembur Mumbai, 125, Socks; 2, V. Richard, Shakti Nagar Delhi, 860, Key Board
Column-oriented database: 1, 2; A K Paul, V. Richard; Chembur Mumbai, Shakti Nagar Delhi; 125, 860; Socks, Key Board
In a column-oriented database, all the values of a column are stored together.
• HBase is a column-oriented NoSQL database. Although it looks similar to a relational database, which contains rows and columns, it is not a relational database.
• Relational databases are row oriented while HBase is
column-oriented.
Row-oriented vs column-oriented Databases:
• Row-oriented databases store table records in a
sequence of rows. Whereas column-oriented
databases store table records in a sequence of columns,
i.e. the entries in a column are stored in contiguous
locations on disks in the form of cell.
• Column Oriented:
• In this database, data is stored in cell grouped in
column rather than rows. Columns are logically
grouped into column families which can be either
created during schema definition or at runtime.
• These types of databases store all the cell
corresponding to a column as continuous disk
entry, thus making the access and search much
faster.
• Column Based Databases: HBase, Accumulo,
Cassandra, Druid, Vertica.
• Column Oriented:
• Use-case
• It supports huge storage and allows faster read/write access over it. This makes column-oriented databases suitable for storing customer behaviour in e-commerce websites, financial systems like Google Finance and stock market data, Google Maps, etc.
• When the amount of data is very huge, like in terms of
petabytes or exabytes, column-oriented approach is
used, because the data of a single column is stored
together and can be accessed faster.
• The row-oriented approach, in comparison, efficiently handles a smaller number of rows and columns, as a row-oriented database stores data in a structured format.
• When we need to process and analyze a large set of
semi-structured or unstructured data, we use column
oriented approach. Such as applications dealing
with Online Analytical Processing like data mining, data
warehousing, applications including analytics, etc.
• Whereas, Online Transactional Processing such as
banking and finance domains which handle structured
data and require transactional properties (ACID
properties) use row-oriented approach.
Components of HBase:
• HBase Master – It is not part of the actual
data storage. But it performs administration
(interface for creating, updating and deleting
tables.).
• Region Server – It is the worker node. It handles read, write, update and delete requests from clients. A Region Server process runs on every node in the Hadoop cluster.
HCatalog
• It is a table and storage management layer on top of Apache Hadoop.
• HCatalog is a key component of Hive. It enables the user to store their data in any format and structure.
• It also supports different Hadoop components to
easily read and write data from the cluster.
• Advantages of HCatalog:
• Provide visibility for data cleaning and archiving tools.
• With the table abstraction, HCatalog frees the user
from the overhead of data storage.
• Enables notifications of data availability.
Apache Mahout
• A mahout is one who drives an elephant as
its master.
• It is an open-source framework used for creating scalable machine-learning algorithms: recommendation, classification, clustering.
• Once we store data in HDFS, mahout
provides the data science tools to
automatically find meaningful patterns in
those Big Data sets.
Sqoop
• It is mainly used for importing and
exporting data.
• So, it imports data from external sources
into related Hadoop components like
HDFS, HBase or Hive.
• It also exports data from Hadoop to
other external sources.
• Sqoop works with relational databases
such as Teradata, Netezza, Oracle,
MySQL.
Flume
• If you want to ingest event data such as streaming
data, sensor data, or log files, then Flume is used.
• Flume efficiently collects, aggregates and moves large amounts of data from their origin and sends them to HDFS.
• It has a very simple and flexible architecture based on
streaming data flows.
• Flume is fault tolerant, also a reliable mechanism.
• It uses a simple extensible data model that allows for
the online analytic application. Hence, using Flume
we can get the data from multiple servers (event type
of data) immediately into Hadoop.
Avro
• It is an open source project that provides data
serialization and data exchange services for Hadoop.
• Using serialization, service programs can serialize data
into files or messages.
• It also stores data definition and data together in one
message or file.
• Hence, this makes it easy for programs to dynamically
understand information stored in Avro file or message.
Avro provides:
Container file, to store persistent data.
Remote procedure call.
Rich data structures.
Compact, fast, binary data format.
Ambari
• It is an open source management
platform.
• It is a platform for provisioning,
managing, monitoring and securing
Apache Hadoop cluster.
• Hadoop management gets simpler
because Ambari provides consistent,
secure platform for operational control.
Thrift
Apache Thrift is a software framework
that allows scalable cross-language
services development.
Thrift is also used for RPC
communication. Apache Hadoop does a
lot of RPC calls, so there is a possibility
of using Thrift for performance.
Benefits of Ambari
• Simplified installation, configuration, and
management – It can easily and efficiently create and
manage clusters at scale.
• Centralized security setup – Ambari configures cluster
security across the entire platform. It also reduces the
complexity to administer.
• Highly extensible and customizable – Ambari is highly
extensible for bringing custom services under
management.
• Full visibility into cluster health – Ambari ensures that
the cluster is healthy and available with a holistic
approach to monitoring.
Zookeeper
• It is a distributed service with master slave nodes,
stores and maintains configuration information,
naming, and provide distributed synchronization.
• It also provides group services in memory on
Zookeeper servers.
• Zookeeper also manages and coordinates a large
cluster of machines through a shared hierarchical
name space of data registers called znodes.
Benefits of Zookeeper
Fast – zookeeper is fast with workloads
where reads to data are more common
than writes. The ideal read/write ratio
is 10:1.
Ordered – Zookeeper maintains a record of all transactions, which can also be used for high-level abstractions such as synchronization primitives.
Oozie
• It is a job coordinator and workflow manager/
scheduler system to manage Apache Hadoop jobs.
• It supports and combines multiple Hadoop jobs (Java MapReduce, streaming MapReduce, Pig, Hive, Sqoop, as well as system-specific jobs like Java programs and shell scripts) sequentially into one logical unit of work.
• Oozie framework is fully integrated with Apache
Hadoop stack, YARN as an architecture center.
• An Oozie workflow is a collection of actions and
hadoop jobs arranged in a Directed Acyclic
Graphs(DAG), since tasks are arranged in a sequence
and subjected to certain constraints.
Oozie
• Oozie is scalable and also very much flexible.
• One can easily start, stop, suspend and rerun jobs.
Hence, Oozie makes it very easy to rerun failed
workflows.
• It is also possible to skip a specific failed node.
There are two basic types of Oozie jobs:
• Oozie workflow – It is to store and run workflows
composed of Hadoop jobs e.g., MapReduce, Pig, Hive.
• Oozie coordinator – It runs workflow jobs based on
predefined schedules and availability of data.
Stages of Big Data processing
There are four stages of Big Data processing:
Ingest, Processing, Analyze, Access.
1. Ingest
• The first stage of Big Data processing is Ingest.
• The data is ingested or transferred to Hadoop from
various sources such as relational databases,
systems, or local files.
Sqoop transfers data from RDBMS to HDFS.
Flume transfers event data like sensors, servers etc.
2. Processing
The second stage is Processing.
• In this stage, the data is stored and processed.
• The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
• Spark and MapReduce perform the data
processing.
3. Analyze.
• Here, the data is analyzed by processing
frameworks such as Pig, Hive, and Impala.
• Pig: A procedural language platform for developing
a script for map reduce operations, converts the
data using a map and reduce and then analyzes it.
• Hive: A platform for developing SQL type
scripts(Structured data) to do map and reduce
operations.
4. Access
The fourth stage is Access, which is performed
by tools such as Hue and Cloudera Search.
In this stage, the analyzed data can be accessed
by users.
Hue is the web interface, whereas Cloudera
Search provides a text interface for exploring
data.
Hadoop Limitations
• Security Concerns
• Vulnerable By Nature
• Not fit for Small Data
• Potential Stability Issues
• General Limitations
Quiz
1. A non-relational (columnar) database
Apache Hbase !
Quiz
2. Data Access and query based platform
Apache Hive !
Quiz
3. Scripting Platform for map reduce operations
Apache Pig !
Quiz
4. A Platform where machine learning libraries are
available to find hidden patterns
Apache Mahout !
Quiz
5. A tool which transfers data in both ways between
relational systems and HDFS or other Hadoop data
stores such as Hive or Hbase
Apache Sqoop !
Quiz
6. A workflow manager for various jobs
Apache Oozie !
Quiz
7. A Table storage Management System
Apache HCatalog !
Quiz
8. A distributed service for storing, maintaining
configuration information and providing
synchronization
Apache ZooKeeper !
Empty your basket of expectations; your troubles will get annoyed and leave on their own.
If people criticize you, you become sad; if people praise you, you become happy.
That means the switch for your happiness and sorrow is in other people's hands.
Try to keep that switch in your own hands.