Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is the storage component of Hadoop. All data stored in Hadoop is distributed across a cluster of machines, and HDFS has a few defining properties. Its architecture and components are described below.
Components of HDFS
HDFS consists of several components, each explained in detail below.
HDFS Blocks
HDFS breaks down a file into smaller units, and each of these units is stored on a different
machine in the cluster. To a user working with HDFS, however, it appears as if all the data
were stored on a single machine. These smaller units are called blocks. Each block is 128MB
by default, though the size is configurable. For example, a 512MB file divides evenly into 4
blocks of 128MB each; when the file size is not an exact multiple of the block size, the final
block stores only the remainder.
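The split described above can be sketched as follows. This is an illustrative Python sketch, not HDFS source code; the function name is hypothetical, but the 128MB default matches HDFS.

```python
# Illustrative sketch (not HDFS source code): how a file of a given size
# is split into 128MB blocks, with the last block holding any remainder.
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size (128MB)

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full
    if remainder:
        blocks.append(remainder)  # the final block stores only the leftover bytes
    return blocks

# A 512MB file divides evenly into four 128MB blocks.
print(len(split_into_blocks(512 * 1024 * 1024)))  # 4
# A 500MB file needs three full blocks plus a 116MB final block.
print([b // (1024 * 1024) for b in split_into_blocks(500 * 1024 * 1024)])
```

Note that a block smaller than 128MB does not occupy a full 128MB on disk; it uses only as much space as it needs.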
NameNode in HDFS
HDFS operates in a master-slave architecture: there is one master node and several slave
nodes in a cluster. The Namenode is the master; it runs on a dedicated machine in the cluster
and does the following:
Manages the filesystem namespace which is the filesystem tree or hierarchy of the
files and directories.
Stores information including owners of files and file permissions for all the files.
It is also aware of the locations of all the blocks of a file and their size.
All this information is maintained persistently on the local disk in the form of two files:
Fsimage - stores the information about the files and directories in the filesystem. For files, it
stores the replication level, modification and access times, access permissions, the blocks the
file is made up of, and their sizes. For directories, it stores the modification time and
permissions.
Edit Log - records every write operation the client performs. These operations are also
applied to the in-memory metadata, which serves read requests.
Whenever a client wants to write information to HDFS or read information from HDFS, it
connects with the Namenode. The Namenode returns the location of the blocks to the client
and the operation is carried out. It is worth noting that the Namenode does not store the
blocks themselves, only their metadata.
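The read path just described can be sketched as below. This is a hypothetical simplification with made-up class, path, and node names; the real Namenode is a Java service, but the division of labor is the same: it returns only block locations, never block data.

```python
# Hypothetical sketch of the HDFS read path: the Namenode holds only
# metadata (file -> blocks -> Datanode locations); block data stays on
# the Datanodes themselves.
class NameNode:
    def __init__(self) -> None:
        # filesystem namespace: file path -> ordered list of block IDs
        self.namespace: dict[str, list[str]] = {}
        # block ID -> Datanodes holding a replica of that block
        self.block_locations: dict[str, list[str]] = {}

    def get_block_locations(self, path: str) -> list[tuple[str, list[str]]]:
        """Return (block_id, datanodes) pairs for a file; no block data."""
        return [(b, self.block_locations[b]) for b in self.namespace[path]]

nn = NameNode()
nn.namespace["/logs/app.log"] = ["blk_1", "blk_2"]
nn.block_locations["blk_1"] = ["dn1", "dn3", "dn4"]
nn.block_locations["blk_2"] = ["dn2", "dn3", "dn5"]

# The client asks the Namenode for locations, then reads from Datanodes.
for block_id, nodes in nn.get_block_locations("/logs/app.log"):
    print(block_id, "->", nodes)
```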
Datanodes in HDFS
Datanodes are the slave nodes. They run on inexpensive commodity hardware that can be
easily added to the cluster. Datanodes are responsible for storing, retrieving, replicating, and
deleting blocks as instructed by the Namenode. They periodically send heartbeats to the
Namenode so that it is aware of their health. Along with the heartbeat, a Datanode also sends
a block report, the list of blocks stored on it, so that the Namenode can maintain the mapping
of blocks to Datanodes in its memory.
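The heartbeat mechanism can be sketched as follows. This is a hypothetical illustration, not the real protocol; the timeout value here is only indicative (real HDFS declares a Datanode dead after roughly 10 minutes without a heartbeat, and the interval is configurable).

```python
# Hypothetical sketch of heartbeat tracking: the Namenode records the
# last heartbeat time per Datanode and considers a node dead once no
# heartbeat has arrived within the timeout window.
HEARTBEAT_TIMEOUT = 600.0  # seconds; illustrative value only

def live_datanodes(last_heartbeat: dict[str, float], now: float) -> set[str]:
    """Return the Datanodes whose last heartbeat is within the timeout."""
    return {dn for dn, t in last_heartbeat.items() if now - t <= HEARTBEAT_TIMEOUT}

# dn2 last reported 700 seconds ago, so it is considered dead.
heartbeats = {"dn1": 1000.0, "dn2": 300.0, "dn3": 990.0}
print(sorted(live_datanodes(heartbeats, now=1000.0)))  # ['dn1', 'dn3']
```

When a Datanode is declared dead, the Namenode schedules re-replication of its blocks onto the remaining live nodes, using the block reports it has collected.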
Secondary Namenode in HDFS
The Secondary Namenode is another node in the cluster whose main task is to regularly merge
the Edit Log with the Fsimage and produce checkpoints of the primary's in-memory
filesystem metadata. This is referred to as checkpointing. The Secondary Namenode runs on a
separate machine because checkpointing is expensive and requires a lot of memory. Note that,
despite its name, it is not a failover standby for the Namenode; it exists only for checkpointing
and keeping a copy of the latest Fsimage.
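Checkpointing can be sketched as replaying the Edit Log on top of the last Fsimage snapshot. This is a hypothetical simplification: the operation names and the dictionary representation are invented, and the real formats are binary, but the merge idea is the same.

```python
# Hypothetical sketch of checkpointing: replay the edit log over the
# last fsimage snapshot to produce a new, up-to-date fsimage.
def checkpoint(fsimage: dict[str, dict], edit_log: list[tuple]) -> dict[str, dict]:
    image = dict(fsimage)  # start from the last saved snapshot
    for op, path, *args in edit_log:
        if op == "create":
            image[path] = {"replication": args[0]}
        elif op == "delete":
            image.pop(path, None)
    # The merged image would be written to disk and the edit log truncated,
    # so the log does not grow without bound.
    return image

old_image = {"/a": {"replication": 3}}
edits = [("create", "/b", 3), ("delete", "/a")]
print(checkpoint(old_image, edits))  # {'/b': {'replication': 3}}
```

Without periodic checkpointing, the Edit Log would grow indefinitely and a Namenode restart, which replays the log, would take very long.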
Replication Management in HDFS
One of the features that makes HDFS very reliable is block replication: every block stored in
the filesystem is replicated on different Datanodes across the cluster, which makes HDFS
fault-tolerant. The default replication factor is 3, though it is configurable; it means every
block has two additional copies, each stored on a separate Datanode in the cluster.
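The storage cost of replication is simple arithmetic, sketched below; the function name is hypothetical, but the factor of 3 is the HDFS default (`dfs.replication`).

```python
# Quick arithmetic: raw cluster capacity consumed under replication.
REPLICATION_FACTOR = 3  # HDFS default (dfs.replication)

def raw_storage_needed(logical_bytes: int, rf: int = REPLICATION_FACTOR) -> int:
    """Each block is stored rf times, so raw usage is rf x logical size."""
    return logical_bytes * rf

one_tb = 1024 ** 4
print(raw_storage_needed(one_tb) // 1024 ** 4)  # 1TB of data occupies 3TB raw
```

This tripled storage cost is the price paid for tolerating the loss of up to two replicas of any block.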
Rack awareness
A rack is a collection of machines (30-40 in a typical Hadoop deployment) housed in the
same physical location. A Hadoop cluster spans multiple racks connected through switches.
To increase fault tolerance, block replicas are spread across different racks as well as different
Datanodes. Hadoop's rack awareness policy balances this reliability against write bandwidth:
with a replication factor of 3, the first replica is placed on the writer's Datanode, the second
on a Datanode in a different rack, and the third on a different Datanode in the same rack as
the second, so a write crosses at most two racks.
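The default placement policy for a replication factor of 3 can be sketched as below. This is a hypothetical simplification: real HDFS picks the second rack and the nodes within it at random subject to capacity, while this sketch deterministically takes the first candidates.

```python
# Hypothetical sketch of the default replica placement for replication
# factor 3: first replica on the writer's node, second on a node in a
# different rack, third on a different node in that same second rack.
def place_replicas(writer: str, racks: dict[str, list[str]]) -> list[str]:
    writer_rack = next(r for r, nodes in racks.items() if writer in nodes)
    other_rack = next(r for r in racks if r != writer_rack)  # real HDFS: random
    second = racks[other_rack][0]
    third = next(n for n in racks[other_rack] if n != second)
    return [writer, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", racks))  # ['dn1', 'dn3', 'dn4']
```

Placing the second and third replicas in the same remote rack means only one copy crosses the inter-rack switch during the write, saving bandwidth while still surviving the loss of an entire rack.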
Benefits of using HDFS
The following are advantages of using HDFS:
Cost effectiveness. The Datanodes that store the data run on inexpensive off-the-shelf
hardware, which cuts storage costs, and because HDFS is open source there is no
licensing fee.
Large data set storage. It stores data of any size, from megabytes to petabytes, and in
any format, including structured and unstructured data.
Fast recovery from hardware failure. HDFS is designed to detect faults and recover
from them automatically.
Portability. It is portable across all hardware platforms, and it is compatible with
several operating systems, including Windows, Linux and Mac OS/X.
Streaming data access. It is built for high data throughput, which is best for access to
streaming data.
COMPARISON & BENEFITS OF NFS AND HADOOP