Introduction to Hadoop Ecosystem
Introduction
“Hadoop is a framework that allows for the
distributed processing of large data sets across
clusters of computers using simple programming
models”.
In other words, Hadoop is a ‘software library’ that
allows its users to process large datasets across
distributed clusters of computers, thereby
enabling them to gather, store and analyze
huge sets of data.
Hadoop provides various tools and technologies, collectively termed the Hadoop ecosystem, to enable the development and deployment of Big Data solutions.
Hadoop Ecosystem
The Hadoop ecosystem can be defined as a comprehensive collection of tools and technologies that can be effectively implemented and deployed to provide Big Data solutions in a cost-effective manner.
MapReduce and the Hadoop Distributed File System (HDFS) are the two core components of the Hadoop ecosystem. They provide a great starting point for managing Big Data; however, they are not sufficient on their own to deal with all Big Data challenges.
Along with these two, the Hadoop ecosystem provides
a collection of various elements to support the
complete development and deployment of Big
Data solutions.
All these elements enable users to process large
datasets in real time and provide tools to support
various types of Hadoop projects, schedule jobs
and manage cluster resources.
In short, MapReduce and HDFS provide the necessary
services and basic structure to deal with the core
requirements of Big Data solutions.
Other services and tools of the ecosystem provide the
environment and components required to build and
manage purpose-driven Big Data applications.
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is designed to
reliably store very large files across machines in a large
cluster.
A large data file is split into blocks
Blocks are managed by different nodes in the cluster
Each block is replicated on multiple nodes
The NameNode stores metadata information about files and blocks
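To make this concrete, below is a minimal sketch (an assumed client-side view, not the HDFS internals) of reading a file through the Java FileSystem API; the NameNode address and file path are hypothetical. The client works with whole files, while HDFS resolves blocks and DataNode locations behind the call to open().

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // data streams back block by block from DataNodes
            }
        }
    }
}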
Some of the terms related to HDFS:
Very large files:
HDFS is a file system intended for storing very large files for later analysis.
Commodity hardware:
HDFS is designed to run on clusters of commonly available, general-purpose machines rather than expensive, special-purpose appliances.
Streaming data access:
HDFS is built for batch processing.
Priority is given to high throughput of data access rather than low latency of data access.
A dataset is typically generated or copied from a source, and various analyses are then performed on that dataset over time.
Low-latency data access:
Applications that require access to data in milliseconds do not work well with HDFS.
Lots of small files:
The NameNode holds file system metadata in memory, so the number of files a file system can hold is limited by the amount of memory on the NameNode server.
As a rule of thumb, each file and directory takes about 150 bytes of that memory.
HDFS Architecture
HDFS has a master-slave architecture.
It comprises a NameNode and a number of DataNodes.
The NameNode is the master that manages the various DataNodes.
The NameNode manages the HDFS cluster metadata, whereas the DataNodes store the data.
Files and directories are presented by clients to the NameNode and are managed on the NameNode.
Operations on them, such as opening, closing, and modifying them, are performed by the NameNode.
On the other hand, internally, a file is divided into one
or more blocks, which are stored in a group of
DataNodes.
DataNodes can also execute operations like the
creation, deletion, and replication of blocks, depending
on the instructions from the NameNode.
In the HDFS architecture, data is stored in different
blocks.
Blocks are managed by the different nodes.
The default block size is 64 MB, although many HDFS installations use 128 MB.
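As a hedged illustration (the property name dfs.blocksize and the exact create() overload below are assumptions to verify against your Hadoop version), the block size can be raised cluster-wide or chosen per file when the file is created:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // cluster-wide hint: 128 MB
        try (FileSystem fs = FileSystem.get(conf);
             // create(path, overwrite, bufferSize, replication, blockSize)
             FSDataOutputStream out = fs.create(new Path("/data/large.bin"),
                     true, 4096, (short) 3, 256L * 1024 * 1024)) {   // per-file 256 MB blocks
            out.writeBytes("payload\n");   // hypothetical content
        }
    }
}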
Some of the failure management tasks:
Monitoring:
DataNodes and the NameNode communicate through continuous signals ("heartbeats").
If the NameNode stops receiving heartbeats from a DataNode, that node is considered to have failed and is no longer used.
The blocks held by the failed node are then served from their replicas (a hedged configuration sketch follows this list).
Rebalancing:
Blocks are shifted from one location to another wherever free space is available, so that data remains evenly distributed across the cluster.
Metadata replication:
Replicas of the metadata files are maintained on the same HDFS so that the file system can be recovered if the metadata becomes corrupted.
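As a sketch of how the heartbeat behaviour is tuned (the property names below are the usual Hadoop 2.x names and should be treated as assumptions; in practice they are set in hdfs-site.xml rather than in code):

import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // DataNodes send a heartbeat to the NameNode every few seconds (default is 3 s).
        conf.setLong("dfs.heartbeat.interval", 3);
        // How long (in ms) the NameNode waits before declaring a silent DataNode dead.
        conf.setLong("dfs.namenode.heartbeat.recheck-interval", 300000);
        System.out.println("Heartbeat interval: " + conf.get("dfs.heartbeat.interval") + " s");
    }
}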
NameNodes and DataNodes
An HDFS cluster has two node types working in a master-slave design:
a NameNode (the master) and a number of DataNodes (the slaves).
The NameNode manages the file system namespace.
It stores the metadata for all the files and directories in the file system.
This metadata is stored on the local disk as two files: the namespace image (fsimage) and the edit log.
The NameNode also knows the DataNodes on which all the blocks of a given file are located.
A client accesses the file system on behalf of the user
by communicating with the DataNodes and NameNode.
DataNodes are the workhorses of the file system.
They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks they are storing.
Without the NameNode, the file system cannot be used.
In fact, if the machine running the NameNode were lost, all files on the file system would be lost, because there would be no way to reconstruct them from the blocks on the DataNodes.
To guard against this, Hadoop provides two methods.
One is to back up the files that make up the persistent state of the file system metadata.
The other is to run a secondary NameNode, which is updated periodically by merging the namespace image with the edit log.
Features of HDFS
The key features are:
Data Replication
Data Resilience
Highly fault-tolerant
Data Integrity
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Data Replication
The default replication factor is 3.
HDFS places the first replica of each block on the local node (the node where the data is being written).
A second replica of the block is placed on a node in a different rack, to guard against rack failure.
A third replica is placed on a different node of that same remote rack.
Any additional replicas are placed on randomly chosen nodes in the cluster.
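A minimal sketch of how the replication factor is controlled from a client (the path and the value 5 are hypothetical; dfs.replication is the standard property, normally set in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // cluster-wide default replication factor
        try (FileSystem fs = FileSystem.get(conf)) {
            // Raise the replication factor of one (hypothetical) frequently read file.
            fs.setReplication(new Path("/data/hot-file.txt"), (short) 5);
        }
    }
}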
Data Resilience
Resiliency is the ability of a server, network, storage
system, or an entire data center, to recover quickly and
continue operating even when there has been an
equipment failure, power outage or other disruption.
Fault tolerance
Fault tolerance is the property that enables a system to
continue operating properly in the event of the failure of
(one or more faults within) some of its components.
An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
Because there is a huge number of components and each component has a non-trivial probability of failure, some component is nearly always non-functional.
Detection of faults and quick, automatic recovery from them is therefore a core architectural goal of HDFS.
Data Integrity
Consider a situation in which a block of data fetched from a DataNode arrives corrupted.
This corruption may occur because of faults in a storage device, network faults, or buggy software.
An HDFS client computes the checksum of every block of its file and stores it in hidden files in the HDFS namespace.
When a client retrieves the contents of a file, it verifies that the corresponding checksums match.
If they do not match, the client can retrieve the block from another replica.
HDFS ensures data integrity throughout the cluster
with the help of the following features:
Maintaining Transaction Logs:
HDFS maintains transaction logs in order to monitor
every operation and carry out effective auditing and
recovery of data in case something goes wrong.
Validating Checksums:
A checksum is an effective error-detection technique wherein a numerical value is computed for a transmitted message on the basis of the bits it contains.
HDFS uses checksum validation to verify the contents of its files.
Validation is carried out as follows (a small illustrative sketch follows this list):
1. When a file is requested by a client, its contents are verified using the checksum.
2. The receiver computes the checksum of the received message and compares it with the checksum sent along with the message.
3. If the two values match, the file operations proceed; otherwise, the message is discarded on the assumption that it was tampered with or altered in transit, and the block is read again from a replica. Checksum files are kept hidden to prevent tampering.
Creating Data Blocks:
HDFS maintains replicated copies of data blocks to avoid corruption of a file due to the failure of a server.
The DataNodes that hold these blocks are sometimes referred to as block servers.
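The compare-on-receive idea can be illustrated with a short sketch using java.util.zip.CRC32 (an illustration only; HDFS itself computes CRC checksums per fixed-size chunk of each block, so this is not its implementation):

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC-32 checksum over a byte array.
    static long checksumOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] sent = "block contents".getBytes(StandardCharsets.UTF_8);
        long storedChecksum = checksumOf(sent);   // computed when the block is written

        byte[] received = sent.clone();           // stands in for data read back over the network
        if (checksumOf(received) == storedChecksum) {
            System.out.println("Checksums match: accept the block");
        } else {
            System.out.println("Checksum mismatch: discard and read the block from another replica");
        }
    }
}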
High Throughput
Throughput is a measure of how many units of information a system can process in a given amount of time.
Because data is distributed and can be read in parallel from many DataNodes, HDFS provides high throughput.
Suitable for applications with large data sets
HDFS is suitable for applications that need to collect, store, or analyze large data sets.
Streaming access to file system data
Even though HDFS gives priority to batch processing, it provides streaming access to file data.
Data Pipelining
A connection between multiple DataNodes that supports the movement of data across servers is termed a pipeline.
The client retrieves from the NameNode a list of DataNodes on which to place the replicas of a block.
The client writes the block to the first DataNode.
The first DataNode forwards the data to the next DataNode in the pipeline, and so on.
When all replicas are written, the client moves on to write the next block of the file.
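From the client's point of view, a write is just an output stream; HDFS builds the replica pipeline behind the call. A minimal sketch (NameNode address and path are hypothetical assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // hypothetical NameNode address
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            // Data written here is streamed DataNode-to-DataNode along the pipeline.
            out.writeBytes("hello hdfs\n");
        }   // close() returns once the written data has been acknowledged
    }
}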
MapReduce
Working of MapReduce
Input phase
Here we have a Record Reader that translates each record
in an input file and sends the parsed data to the mapper in
the form of key-value pairs.
The Mapper
Reads data as key/value pairs
◦ The key is often discarded
Outputs zero or more key/value pairs
Map is a user-defined function, which takes a series of key-
value pairs and processes each one of them to generate zero
or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer
that groups similar data from the map phase into
identifiable sets.
It takes the intermediate keys from the mapper as input
and applies a user-defined code to aggregate the values
in a small scope of one mapper.
It is not a part of the main MapReduce algorithm; it is
optional.
Shuffle and Sort
Output from the mapper is sorted by key
All values with the same key are guaranteed to go to
the same machine.
The Reducer task starts with the Shuffle and Sort step.
It downloads the grouped key-value pairs onto the local
machine, where the Reducer is running.
The individual key-value pairs are sorted by key into a
larger data list. The data list groups the equivalent keys
together so that their values can be iterated easily in
the Reducer task.
The Reducer
Called once for each unique key
Gets a list of all values associated with a key as input
The reducer outputs zero or more final key/value pairs
◦ Usually just one output per input key.
Output Phase − In the output phase, we have an
output formatter that translates the final key-value pairs
from the Reducer function and writes them onto a file
using a record writer.
Simple Example
MapReduce: Word Count Example
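A minimal word count job in Java, following the canonical Apache Hadoop example; the input and output paths are taken from the command line, and the class names are the usual ones from that example:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token; the input key (the line's byte offset) is discarded.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also used as the optional combiner): sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local reduce on the map side (optional)
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The compiled jar is submitted with an input directory and a not-yet-existing output directory, e.g. hadoop jar wordcount.jar WordCount /input /output.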
MapReduce Example
Let us take a real-world example to see the power of MapReduce.
Twitter receives around 500 million tweets per day, which averages nearly 6,000 tweets per second.
The following steps show how Twitter can manage its tweets with the help of MapReduce.
The MapReduce algorithm performs the following
actions:
Tokenize − Tokenizes the tweets into maps of tokens
and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of
tokens and writes the filtered maps as key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of
similar counter values into small manageable units.
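Below is a hedged sketch of the first two steps (Tokenize and Filter) as a single mapper; the input format, the stop-word list, and the class name are illustrative assumptions, not Twitter's actual pipeline. Counting and aggregation would then be handled by a summing combiner/reducer, as in the word count example above.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TweetTokenFilterMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    // Hypothetical stop-word list used by the Filter step.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "to", "of", "is"));
    private final Text token = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize: split the tweet text (assumed to be the input value) into lower-case tokens.
        for (String raw : value.toString().toLowerCase().split("\\s+")) {
            // Filter: strip punctuation (keeping #hashtags and @mentions) and drop stop words.
            String w = raw.replaceAll("[^a-z0-9#@]", "");
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                token.set(w);
                context.write(token, ONE);   // emit (token, 1) as an intermediate key-value pair
            }
        }
    }
}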
MapReduce Features
Automatic parallelization and distribution
Fault-Tolerance
Used to process large data sets.
MapReduce programs can be written in many languages, such as Java, Python, and Ruby (non-Java languages are supported through Hadoop Streaming).