Introduction to Hadoop Ecosystem
Introduction
“Hadoop is a framework that allows for the
distributed processing of large data sets across
clusters of computers using simple programming
models”.
In other words, Hadoop is a ‘software library’ that
allows its users to process large datasets across
distributed clusters of computers, thereby
enabling them to gather, store and analyze
huge sets of data.
Hadoop provides various tools and technologies, collectively termed the Hadoop ecosystem, to enable the development and deployment of Big Data solutions.
Hadoop Ecosystem
The Hadoop ecosystem can be defined as a comprehensive collection of tools and technologies that can be effectively implemented and deployed to provide Big Data solutions in a cost-effective manner.
MapReduce and the Hadoop Distributed File System (HDFS) are the two core components of the Hadoop ecosystem. They provide a great starting point for managing Big Data; however, they are not sufficient on their own to deal with all Big Data challenges.
Along with these two, the Hadoop ecosystem provides
a collection of various elements to support the
complete development and deployment of Big
Data solutions.
All these elements enable users to process large
datasets in real time and provide tools to support
various types of Hadoop projects, schedule jobs
and manage cluster resources.
In short, MapReduce and HDFS provide the necessary
services and basic structure to deal with the core
requirements of Big Data solutions.
Other services and tools of the ecosystem provide the
environment and components required to build and
manage purpose-driven Big Data applications.
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is designed to
reliably store very large files across machines in a large
cluster.
A large data file is split into blocks
Blocks are managed by different nodes in the cluster
Each block is replicated on multiple nodes
The NameNode stores metadata information about files and blocks
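To make this concrete, below is a minimal sketch (an assumed client-side view, not the HDFS internals) of reading a file through the Java FileSystem API; the NameNode address and file path are hypothetical. The client works with whole files, while HDFS resolves blocks and DataNode locations behind the call to open().

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // data streams back block by block from DataNodes
            }
        }
    }
}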
Some of the terms related to HDFS:
Very large files:
HDFS is a file system intended for storing very large files for later analysis.
Commodity hardware:
HDFS is designed to run on clusters of commonly available, general-purpose machines rather than expensive, special-purpose appliances.
Streaming data access:
HDFS is built for batch processing.
Priority is given to high throughput of data access rather than low latency of data access.
A dataset is typically generated or copied from a source, and various analyses are then performed on that dataset over time.
Low-latency data access:
Applications that require access to data in milliseconds do not work well with HDFS.
Lots of small files:
The NameNode holds file system metadata in memory, so the number of files a file system can hold is limited by the amount of memory on the NameNode server.
As a rule of thumb, each file and directory takes about 150 bytes of that memory.
HDFS Architecture
HDFS has a master-slave architecture.
It comprises a NameNode and a number of DataNodes.
The NameNode is the master that manages the various DataNodes.
The NameNode manages the HDFS cluster metadata, whereas the DataNodes store the data.
Files and directories are presented by clients to the NameNode and are managed on the NameNode.
Operations on them, such as opening, closing, and modifying them, are performed by the NameNode.
On the other hand, internally, a file is divided into one
or more blocks, which are stored in a group of
DataNodes.
DataNodes can also execute operations like the
creation, deletion, and replication of blocks, depending
on the instructions from the NameNode.
In the HDFS architecture, data is stored in different
blocks.
Blocks are managed by the different nodes.
The default block size is 64 MB, although many HDFS installations use 128 MB.
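As a hedged illustration (the property name dfs.blocksize and the exact create() overload below are assumptions to verify against your Hadoop version), the block size can be raised cluster-wide or chosen per file when the file is created:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // cluster-wide hint: 128 MB
        try (FileSystem fs = FileSystem.get(conf);
             // create(path, overwrite, bufferSize, replication, blockSize)
             FSDataOutputStream out = fs.create(new Path("/data/large.bin"),
                     true, 4096, (short) 3, 256L * 1024 * 1024)) {   // per-file 256 MB blocks
            out.writeBytes("payload\n");   // hypothetical content
        }
    }
}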
Some of the failure management tasks:
Monitoring:
DataNodes and the NameNode communicate through continuous signals ("heartbeats").
If the NameNode stops receiving heartbeats from a DataNode, that node is considered to have failed and is no longer used.
The blocks held by the failed node are then served from their replicas (a hedged configuration sketch follows this list).
Rebalancing:
Blocks are shifted from one location to another wherever free space is available, so that data remains evenly distributed across the cluster.
Metadata replication:
Replicas of the metadata files are maintained on the same HDFS so that the file system can be recovered if the metadata becomes corrupted.
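As a sketch of how the heartbeat behaviour is tuned (the property names below are the usual Hadoop 2.x names and should be treated as assumptions; in practice they are set in hdfs-site.xml rather than in code):

import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // DataNodes send a heartbeat to the NameNode every few seconds (default is 3 s).
        conf.setLong("dfs.heartbeat.interval", 3);
        // How long (in ms) the NameNode waits before declaring a silent DataNode dead.
        conf.setLong("dfs.namenode.heartbeat.recheck-interval", 300000);
        System.out.println("Heartbeat interval: " + conf.get("dfs.heartbeat.interval") + " s");
    }
}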
NameNodes and DataNodes
An HDFS cluster has two node types working in a master-slave design:
a NameNode (the master) and a number of DataNodes (the slaves).
The NameNode manages the file system namespace.
It stores the metadata for all the files and directories in the file system.
This metadata is stored on the local disk as two files: the namespace image (fsimage) and the edit log.
The NameNode also knows the DataNodes on which all the blocks of a given file are located.
A client accesses the file system on behalf of the user
by communicating with the DataNodes and NameNode.
DataNodes are the workhorses of the file system.
They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks they are storing.
Without the NameNode, the file system cannot be used.
In fact, if the machine running the NameNode were lost, all files on the file system would be lost, because there would be no way to reconstruct them from the blocks on the DataNodes.
To guard against this, Hadoop provides two methods.
One is to back up the files that make up the persistent state of the file system metadata.
The other is to run a secondary NameNode, which is updated periodically by merging the namespace image with the edit log.
Features of HDFS
The key features are:
Data Replication
Data Resilience
Highly fault-tolerant
Data Integrity
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Data Replication
The default replication factor is 3.
HDFS places the first replica of each block on the local node (the node where the data is being written).
A second replica of the block is placed on a node in a different rack, to guard against rack failure.
A third replica is placed on a different node of that same remote rack.
Any additional replicas are placed on randomly chosen nodes in the cluster.
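A minimal sketch of how the replication factor is controlled from a client (the path and the value 5 are hypothetical; dfs.replication is the standard property, normally set in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // cluster-wide default replication factor
        try (FileSystem fs = FileSystem.get(conf)) {
            // Raise the replication factor of one (hypothetical) frequently read file.
            fs.setReplication(new Path("/data/hot-file.txt"), (short) 5);
        }
    }
}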
Data Resilience
Resiliency is the ability of a server, network, storage
system, or an entire data center, to recover quickly and
continue operating even when there has been an
equipment failure, power outage or other disruption.
Fault tolerance
Fault tolerance is the property that enables a system to
continue operating properly in the event of the failure of
(one or more faults within) some of its components.
An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
Because there is a huge number of components and each component has a non-trivial probability of failure, some component is nearly always non-functional.
Detection of faults and quick, automatic recovery from them is therefore a core architectural goal of HDFS.
Data Integrity
Consider a situation in which a block of data fetched from a DataNode arrives corrupted.
This corruption may occur because of faults in a storage device, network faults, or buggy software.
An HDFS client computes the checksum of every block of its file and stores it in hidden files in the HDFS namespace.
When a client retrieves the contents of a file, it verifies that the corresponding checksums match.
If they do not match, the client can retrieve the block from another replica.
HDFS ensures data integrity throughout the cluster
with the help of the following features:
Maintaining Transaction Logs:
HDFS maintains transaction logs in order to monitor
every operation and carry out effective auditing and
recovery of data in case something goes wrong.
Validating Checksums:
A checksum is an effective error-detection technique wherein a numerical value is computed for a transmitted message on the basis of the bits it contains.
HDFS uses checksum validation to verify the contents of its files.
Validation is carried out as follows (a small illustrative sketch follows this list):
1. When a file is requested by a client, its contents are verified using the checksum.
2. The receiver computes the checksum of the received message and compares it with the checksum sent along with the message.
3. If the two values match, the file operations proceed; otherwise, the message is discarded on the assumption that it was tampered with or altered in transit, and the block is read again from a replica. Checksum files are kept hidden to prevent tampering.
Creating Data Blocks:
HDFS maintains replicated copies of data blocks to avoid corruption of a file due to the failure of a server.
The DataNodes that hold these blocks are sometimes referred to as block servers.
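The compare-on-receive idea can be illustrated with a short sketch using java.util.zip.CRC32 (an illustration only; HDFS itself computes CRC checksums per fixed-size chunk of each block, so this is not its implementation):

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC-32 checksum over a byte array.
    static long checksumOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] sent = "block contents".getBytes(StandardCharsets.UTF_8);
        long storedChecksum = checksumOf(sent);   // computed when the block is written

        byte[] received = sent.clone();           // stands in for data read back over the network
        if (checksumOf(received) == storedChecksum) {
            System.out.println("Checksums match: accept the block");
        } else {
            System.out.println("Checksum mismatch: discard and read the block from another replica");
        }
    }
}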
High Throughput
Throughput is a measure of how many units of information a system can process in a given amount of time.
Because data is distributed and can be read in parallel from many DataNodes, HDFS provides high throughput.
Suitable for applications with large data sets
HDFS is suitable for applications that need to collect, store, or analyze large data sets.
Streaming access to file system data
Even though HDFS gives priority to batch processing, it provides streaming access to file data.
Data Pipelining
A connection between multiple DataNodes that supports the movement of data across servers is termed a pipeline.
The client retrieves from the NameNode a list of DataNodes on which to place the replicas of a block.
The client writes the block to the first DataNode.
The first DataNode forwards the data to the next DataNode in the pipeline, and so on.
When all replicas are written, the client moves on to write the next block of the file.
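From the client's point of view, a write is just an output stream; HDFS builds the replica pipeline behind the call. A minimal sketch (NameNode address and path are hypothetical assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // hypothetical NameNode address
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            // Data written here is streamed DataNode-to-DataNode along the pipeline.
            out.writeBytes("hello hdfs\n");
        }   // close() returns once the written data has been acknowledged
    }
}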
MapReduce
Working of MapReduce
Input phase
Here we have a Record Reader that translates each record
in an input file and sends the parsed data to the mapper in
the form of key-value pairs.
The Mapper
Reads data as key/value pairs
◦ The key is often discarded
Outputs zero or more key/value pairs
Map is a user-defined function, which takes a series of key-
value pairs and processes each one of them to generate zero
or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer
that groups similar data from the map phase into
identifiable sets.
It takes the intermediate keys from the mapper as input
and applies a user-defined code to aggregate the values
in a small scope of one mapper.
It is not a part of the main MapReduce algorithm; it is
optional.
Shuffle and Sort
Output from the mapper is sorted by key
All values with the same key are guaranteed to go to
the same machine.
The Reducer task starts with the Shuffle and Sort step.
It downloads the grouped key-value pairs onto the local
machine, where the Reducer is running.
The individual key-value pairs are sorted by key into a
larger data list. The data list groups the equivalent keys
together so that their values can be iterated easily in
the Reducer task.
The Reducer
Called once for each unique key
Gets a list of all values associated with a key as input
The reducer outputs zero or more final key/value pairs
◦ Usually just one output per input key.
Output Phase − In the output phase, we have an
output formatter that translates the final key-value pairs
from the Reducer function and writes them onto a file
using a record writer.
Simple Example
MapReduce: Word Count Example
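A minimal word count job in Java, following the canonical Apache Hadoop example; the input and output paths are taken from the command line, and the class names are the usual ones from that example:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token; the input key (the line's byte offset) is discarded.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also used as the optional combiner): sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local reduce on the map side (optional)
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The compiled jar is submitted with an input directory and a not-yet-existing output directory, e.g. hadoop jar wordcount.jar WordCount /input /output.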
MapReduce Example
Let us take a real-world example to see the power of MapReduce.
Twitter receives around 500 million tweets per day, which averages nearly 6,000 tweets per second.
The following steps show how Twitter can manage its tweets with the help of MapReduce.
The MapReduce algorithm performs the following
actions:
Tokenize − Tokenizes the tweets into maps of tokens
and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of
tokens and writes the filtered maps as key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of
similar counter values into small manageable units.
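Below is a hedged sketch of the first two steps (Tokenize and Filter) as a single mapper; the input format, the stop-word list, and the class name are illustrative assumptions, not Twitter's actual pipeline. Counting and aggregation would then be handled by a summing combiner/reducer, as in the word count example above.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TweetTokenFilterMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    // Hypothetical stop-word list used by the Filter step.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "to", "of", "is"));
    private final Text token = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize: split the tweet text (assumed to be the input value) into lower-case tokens.
        for (String raw : value.toString().toLowerCase().split("\\s+")) {
            // Filter: strip punctuation (keeping #hashtags and @mentions) and drop stop words.
            String w = raw.replaceAll("[^a-z0-9#@]", "");
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                token.set(w);
                context.write(token, ONE);   // emit (token, 1) as an intermediate key-value pair
            }
        }
    }
}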
MapReduce Features
Automatic parallelization and distribution
Fault-Tolerance
Used to process large data sets.
MapReduce programs can be written in many languages, such as Java, Python, and Ruby (non-Java languages are supported through Hadoop Streaming).