
Introduction to Hadoop Ecosystem
Introduction
 “Hadoop is a framework that allows for the
distributed processing of large data sets across
clusters of computers using simple programming
models”.

 In other words, Hadoop is a ‘software library’ that allows its users to process large datasets across distributed clusters of computers, thereby enabling them to gather, store and analyze huge sets of data.

 Hadoop provides various tools and technologies, collectively termed the Hadoop ecosystem, to enable the development and deployment of Big Data solutions.
Hadoop Ecosystem
 The Hadoop ecosystem can be defined as a comprehensive collection of tools and technologies that can be effectively implemented and deployed to provide Big Data solutions in a cost-effective manner.

 MapReduce and the Hadoop Distributed File System (HDFS) are two core components of the Hadoop ecosystem that provide a great starting point to manage Big Data; however, they alone are not sufficient to deal with all the Big Data challenges.
 Along with these two, the Hadoop ecosystem provides
a collection of various elements to support the
complete development and deployment of Big
Data solutions.

 All these elements enable users to process large datasets in real time and provide tools to support various types of Hadoop projects, schedule jobs and manage cluster resources.
 In short, MapReduce and HDFS provide the necessary
services and basic structure to deal with the core
requirements of Big Data solutions.

 Other services and tools of the ecosystem provide the environment and components required to build and manage purpose-driven Big Data applications.
Hadoop Distributed File System (HDFS)

 Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster.

 A large data file is distributed into blocks
 Blocks are managed by different nodes in the cluster
 Each block is replicated on multiple nodes
 The NameNode stores metadata information about files and blocks
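 The sketch below is a minimal, hypothetical illustration of this block layout; the function name plan_blocks, the DataNode names and the 128 MB block size are assumptions for illustration, not HDFS internals.

```python
# Illustrative sketch only: split a file into fixed-size blocks and assign
# each block to several DataNodes, mimicking the NameNode's block-to-node map.
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size (128 MB)
REPLICATION = 3                  # default replication factor

def plan_blocks(file_size, datanodes):
    """Return a hypothetical {block_id: [datanode, ...]} mapping for one file."""
    node_cycle = itertools.cycle(datanodes)
    num_blocks = -(-file_size // BLOCK_SIZE)          # ceiling division
    return {b: [next(node_cycle) for _ in range(REPLICATION)]
            for b in range(num_blocks)}

# A 400 MB file becomes 4 blocks, each replicated on 3 of the 5 DataNodes.
print(plan_blocks(400 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4", "dn5"]))
```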
 Some of the terms related to HDFS:
 Huge documents:

 HDFS is a file system intended for keeping huge documents for future analysis.

 Appliance hardware:
 A device that is dedicated to a specific function, in contrast to a general-purpose computer.
 Streaming information access:

 HDFS is created for batch processing.

 The priority is given to high throughput of data access rather than low latency of data access.
 A dataset is commonly produced or copied from the source, and then various analyses are performed on that dataset over time.
 Low-latency information access:

 Applications that require access to data in milliseconds do not work well with HDFS.

 Loads of small documents:

 The NameNode holds file system metadata in memory, so the number of documents a file system can hold is limited by the amount of memory on the server.
 As a rule of thumb, each document and directory takes around 150 bytes.
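 As a back-of-the-envelope check of this guideline, the following sketch (hypothetical numbers, not an exact HDFS formula) estimates NameNode memory from the number of files, directories and blocks:

```python
# Rough estimate only: each file, directory and block is assumed to cost
# about 150 bytes of NameNode memory, per the rule of thumb above.
BYTES_PER_OBJECT = 150

def estimate_namenode_memory(num_files, num_dirs, num_blocks):
    return (num_files + num_dirs + num_blocks) * BYTES_PER_OBJECT

# Ten million single-block files plus 100,000 directories:
needed = estimate_namenode_memory(10_000_000, 100_000, 10_000_000)
print(f"approx. {needed / 1024**3:.2f} GB of NameNode memory")   # ~2.8 GB
```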
HDFS Architecture
 HDFS follows a master-slave architecture.

 It comprises a NameNode and a number of DataNodes.

 The NameNode is the master that manages the various DataNodes.

 The NameNode manages the HDFS cluster metadata, whereas the DataNodes store the actual data.
 Files and directories are presented by clients to the NameNode.
 These files and directories are managed on the NameNode.
 Operations on them, such as modifying, opening and closing them, are performed by the NameNode.
 On the other hand, a file is internally divided into one or more blocks, which are stored in a group of DataNodes.
 DataNodes also execute operations such as the creation, deletion and replication of blocks, depending on the instructions from the NameNode.
 In the HDFS architecture, data is stored in different
blocks.

 Blocks are managed by the different nodes.

 The default block size is 64 MB, although numerous HDFS installations use 128 MB.
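 As a quick illustration of what the block size means for a single file, the arithmetic below (illustrative only) counts the blocks of a 1 GB file at both sizes:

```python
# How many blocks does a 1 GB file occupy at 64 MB and 128 MB block sizes?
import math

file_size = 1 * 1024**3                       # 1 GB
for block_size_mb in (64, 128):
    blocks = math.ceil(file_size / (block_size_mb * 1024**2))
    print(f"{block_size_mb} MB blocks -> {blocks} blocks")   # 16 and 8
```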
 Some of the failure management tasks:

 Monitoring:

 The DataNodes and the NameNode communicate through continuous signals (“heartbeats”).
 If the heartbeat signal is not received, the node is considered to have failed and is no longer available.
 The failed node's blocks are then replaced using their replicas.
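 A minimal sketch of heartbeat-based failure detection is shown below; the timeout value and all names are assumptions, and the real NameNode logic is considerably more involved.

```python
# Illustrative only: track the last heartbeat time of each DataNode and treat
# any node that has been silent longer than a timeout as failed.
import time

HEARTBEAT_TIMEOUT = 30.0                      # assumed timeout in seconds
last_heartbeat = {}                           # datanode name -> last time seen

def record_heartbeat(datanode):
    last_heartbeat[datanode] = time.time()

def failed_datanodes(now=None):
    now = now if now is not None else time.time()
    return [dn for dn, seen in last_heartbeat.items()
            if now - seen > HEARTBEAT_TIMEOUT]

record_heartbeat("dn1")
record_heartbeat("dn2")
# Any DataNode returned here would have its blocks re-created from replicas.
print(failed_datanodes())
```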


 Rebalancing:

 In this process, blocks are shifted from one location to another wherever free space is available.

 Metadata replication:

 A replica of the corresponding metadata files is maintained on the same HDFS.
NameNodes and DataNodes
 An HDFS cluster has two node types working in a master-slave design:
 A NameNode (the master) and various DataNodes (slaves).

 The NameNode manages the file system namespace.

 It stores the metadata for all the documents and directories in the file system.
 The metadata is stored on the local disk as two files:
 The file system image (fsimage) and the edit log.

 The NameNode is aware of the DataNodes on which all the pieces of a given document are found.
 A client accesses the file system on behalf of the user by communicating with the DataNodes and the NameNode.
 DataNodes are the workhorses of a file system.

 They store and retrieve blocks when they are asked to (by clients or the NameNode), and they report back to the NameNode periodically.
 Without the NameNode, the file system cannot be used.
 In fact, if the machine running the NameNode crashes, all files on the file system would be lost.

 To overcome this, Hadoop provides two methods.

 One way is to back up the files that make up the persistent state of the file system metadata.
 Another way is to run a secondary NameNode.
 The secondary NameNode is updated periodically.


Features of HDFS
 The important key features are:

 Data Replication
 Data Resilience
 Highly fault-tolerant
 Data Integrity
 High throughput
 Suitable for applications with large data sets
 Streaming access to file system data
Data Replication
 The default replication factor is 3.
 HDFS primarily maintains one replica of each block
locally.

 A second replica of the block is then placed on a different rack to guard against rack failure.
 A third replica is maintained on a different server of the same remote rack.
 Finally, additional replicas are sent to random locations in local and remote clusters.
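 The following sketch captures this replica-placement idea; the data structures and random choices are assumptions (and every rack is assumed to have at least two nodes), while the real HDFS placement policy has many more rules.

```python
# Illustrative rack-aware placement: 1st replica on the writer's node, 2nd on a
# node in a different rack, 3rd on another node of that same remote rack.
import random

def place_replicas(writer_node, cluster):
    """cluster: {rack: [node, ...]}; assumes every rack has >= 2 nodes."""
    writer_rack = next(r for r, nodes in cluster.items() if writer_node in nodes)
    first = writer_node
    remote_rack = random.choice([r for r in cluster if r != writer_rack])
    second = random.choice(cluster[remote_rack])
    third = random.choice([n for n in cluster[remote_rack] if n != second])
    return [first, second, third]

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
print(place_replicas("dn1", cluster))
```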
Data Resilience

 Resiliency is the ability of a server, network, storage system, or an entire data center to recover quickly and continue operating even when there has been an equipment failure, power outage or other disruption.
Fault tolerance
 Fault tolerance is the property that enables a system to
continue operating properly in the event of the failure of
(one or more faults within) some of its components.

 An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
 Since there is a huge number of components and each component has a non-trivial probability of failure, there is always some component that is non-functional.
 Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Data Integrity
 Consider a situation: a block of data fetched from a DataNode arrives corrupted.
 This corruption may occur because of faults in a storage
device, network faults, or buggy software.

 An HDFS client computes the checksum of every block of its file and stores the checksums in hidden files in the HDFS namespace.
 When a client retrieves the contents of a file, it verifies that the corresponding checksums match.
 If they do not match, the client can retrieve the block from a replica.
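 The sketch below illustrates this checksum idea using CRC32; the functions and the storage layout are assumptions for illustration and do not reproduce the actual HDFS checksum format.

```python
# Illustrative only: store a checksum next to each block when writing, and
# verify it when reading; on a mismatch the client would fall back to a replica.
import zlib

def block_checksum(data: bytes) -> int:
    return zlib.crc32(data)

def write_block(data: bytes) -> dict:
    # In HDFS the checksums live in separate hidden files in the namespace.
    return {"data": data, "checksum": block_checksum(data)}

def read_block(stored: dict) -> bytes:
    if block_checksum(stored["data"]) != stored["checksum"]:
        raise IOError("checksum mismatch - fetch the block from another replica")
    return stored["data"]

blk = write_block(b"some block contents")
print(read_block(blk))      # checksum verifies, data is returned
```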
 HDFS ensures data integrity throughout the cluster
with the help of the following features:

 Maintaining Transaction Logs:

 HDFS maintains transaction logs in order to monitor every operation and carry out effective auditing and recovery of data in case something goes wrong.
 Validation Checksum:

 Checksum is an effective error-detection technique wherein a numerical value is computed for a transmitted message on the basis of the bits it contains.
 HDFS uses checksum validation to verify the contents of a file.
 Validation is carried out as follows:

1. When a file is requested by the client, its contents are verified using a checksum.

2. If the checksums of the received and sent messages match, the file operations proceed further; otherwise, an error is reported.

3. The message receiver verifies the checksum of the message to ensure that it is the same as in the sent message. If a difference is identified between the two values, the message is discarded on the assumption that it has been tampered with or altered in transit. Checksum files are hidden to avoid tampering.
 Creating Data Blocks:

 HDFS maintains replicated copies of data blocks to avoid corruption of a file due to the failure of a server.
 The DataNodes that store these blocks are also referred to as block servers.

 High Throughput

 Throughput is a measure of how many units of information a system can process in a given amount of time. HDFS is designed to provide high throughput.

 Suitable for applications with large data sets

 HDFS is suitable for applications that need to store, collect or analyze large data sets.

 Streaming access to file system data

 Even though HDFS gives priority to batch processing, streaming access to files is still provided.
Data Pipelining
 A connection between multiple DataNodes that supports the movement of data across servers is termed a pipeline.

 The client retrieves a list of DataNodes on which to place replicas of a block
 The client writes the block to the first DataNode
 The first DataNode forwards the data to the next DataNode in the pipeline
 When all replicas are written, the client moves on to write the next block in the file
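 A minimal sketch of this write pipeline is given below; the function and node names are illustrative, and the real protocol involves packets and acknowledgements that are omitted here.

```python
# Illustrative only: the client hands a block to the first DataNode, and each
# DataNode stores the block and forwards it to the next node in the pipeline.
def write_through_pipeline(block: bytes, pipeline: list, storage: dict):
    if not pipeline:
        return
    first, rest = pipeline[0], pipeline[1:]
    storage.setdefault(first, []).append(block)       # this node stores a copy
    write_through_pipeline(block, rest, storage)       # ...and forwards it on

storage = {}
for block in [b"block-0", b"block-1"]:                 # client writes block by block
    write_through_pipeline(block, ["dn1", "dn2", "dn3"], storage)
print(storage)    # each of the three DataNodes ends up holding every block
```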
MapReduce
Working of MapReduce
 Input phase
 Here we have a Record Reader that translates each record
in an input file and sends the parsed data to the mapper in
the form of key-value pairs.

 The Mapper
 Reads data as key/value pairs
◦ The key is often discarded

 Outputs zero or more key/value pairs

 Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.

 Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets.
 It takes the intermediate keys from the mapper as input
and applies a user-defined code to aggregate the values
in a small scope of one mapper.

 It is not a part of the main MapReduce algorithm; it is optional.
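 As a small illustration of what a combiner does for word counting, the sketch below (hypothetical names, not Hadoop's combiner API) pre-aggregates the pairs emitted by a single mapper:

```python
# Illustrative only: locally sum the (word, 1) pairs produced by one mapper so
# that far fewer pairs have to be shuffled across the network.
from collections import Counter

def combine(mapper_output):
    totals = Counter()
    for word, count in mapper_output:
        totals[word] += count
    return list(totals.items())

print(combine([("big", 1), ("data", 1), ("big", 1)]))   # [('big', 2), ('data', 1)]
```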
 Shuffle and Sort
 Output from the mapper is sorted by key

 All values with the same key are guaranteed to go to the same machine.

 The Reducer task starts with the Shuffle and Sort step.
It downloads the grouped key-value pairs onto the local
machine, where the Reducer is running.

 The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
 The Reducer
 Called once for each unique key

 Gets a list of all values associated with a key as input

 The reducer outputs zero or more final key/value pairs
◦ Usually just one output per input key.

 Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
Simple Example
MapReduce: Word Count Example
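 A compact word-count sketch in the style described above is shown below; the mapper and reducer here are plain Python functions run on an in-memory list rather than a real Hadoop job, so the names and driver code are illustrative only.

```python
# Illustrative word count: the mapper emits (word, 1) pairs and the reducer
# sums the counts for each key after the shuffle-and-sort step.
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # pairs are assumed to arrive sorted by key, as after shuffle and sort
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["hadoop stores big data", "hadoop processes big data"]
    print(list(reducer(mapper(text))))
    # [('big', 2), ('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```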
MapReduce-Example

 Let us take a real-world example to comprehend the power of MapReduce.

 Twitter receives around 500 million tweets per day, which is nearly 6,000 tweets per second.

 The following steps show how Twitter manages its tweets with the help of MapReduce.
 The MapReduce algorithm performs the following
actions:

 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.

 Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.

 Count − Generates a token counter per word.

 Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
MapReduce Features

 Automatic parallelization and distribution

 Fault-Tolerance

 Used to process large data sets.

 Users can write scripts in many languages such as Java, Python, Ruby and so on.
