HDFS
Hadoop Distributed File System (HDFS) is a crucial component of the Hadoop ecosystem, designed to
store and manage large amounts of data in a distributed, fault-tolerant, and scalable manner. It's the
primary storage system for big data analytics, enabling efficient processing of massive datasets.
In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability to parallel applications.
It is cost-effective because it runs on commodity hardware. It is built around the concepts of blocks, data nodes, and the name node.
It provides one of the most reliable file systems. HDFS (Hadoop Distributed File System) has a unique
design that provides storage for extremely large files with a streaming data access pattern, and it runs
on commodity hardware. Let’s elaborate on these terms:
Extremely large files: Here, we are talking about data in the range of petabytes (1 PB = 1,000 TB).
Streaming data access pattern: HDFS is designed on the principle of write-once, read-many-times.
Once data is written, large portions of the dataset can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily available in the market. This
is one of the features that especially distinguishes HDFS from other file systems.
Where to use HDFS
o Very Large Files: Files should be hundreds of megabytes, gigabytes, or larger.
o Streaming Data Access: The time to read the whole dataset matters more than the latency of
reading the first record. HDFS is built on the write-once, read-many-times pattern.
o Commodity Hardware: It works on low-cost hardware.
Where not to use HDFS
o Low-Latency Data Access: Applications that need very fast access to the first record should
not use HDFS, as it favours throughput over the whole dataset rather than the time to fetch
the first record.
o Lots of Small Files: The name node holds the metadata of all files in memory, and if the files
are small, the sheer number of them consumes a great deal of the name node's memory,
which is not feasible.
o Multiple Writes: HDFS should not be used when data must be modified repeatedly; it does
not support multiple concurrent writers or arbitrary modifications to a file.
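To make the small-files problem concrete, here is a rough sketch in Python of the name node's memory pressure. It assumes the commonly cited rule of thumb of roughly 150 bytes of name-node heap per namespace object (file or block); the exact figure varies by Hadoop version, so treat the numbers as illustrative only.

```python
import math

# Assumption: ~150 bytes of name-node heap per namespace object
# (file entry or block) -- a commonly cited rule of thumb, not exact.
BYTES_PER_OBJECT = 150
MB = 1024 * 1024

def namenode_memory(num_files, file_size_mb, block_size_mb=128):
    """Rough name-node heap needed for num_files files of file_size_mb each."""
    blocks_per_file = max(1, math.ceil(file_size_mb / block_size_mb))
    objects = num_files * (1 + blocks_per_file)  # file entry + its blocks
    return objects * BYTES_PER_OBJECT

# The same ~1 TB of data stored as 1 MB files vs. 128 MB files:
small_files = namenode_memory(num_files=1_000_000, file_size_mb=1)
large_files = namenode_memory(num_files=8_192, file_size_mb=128)
print(small_files // MB, "MB vs", large_files // MB, "MB")
```

The point is the ratio, not the absolute values: a million tiny files cost the name node orders of magnitude more memory than the same data in block-sized files.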
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128
MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which
are stored as independent units. Unlike an ordinary file system, if a file in HDFS is smaller than
the block size, it does not occupy a full block's worth of space; e.g. a 5 MB file stored in HDFS
with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large in order to
minimize the cost of seeks relative to the time spent transferring data.
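The splitting described above can be sketched in a few lines of Python. This is a conceptual illustration, not Hadoop code; it just shows how a file's size maps onto block-sized chunks, including the partial final block.

```python
DEFAULT_BLOCK_SIZE = 128  # MB; configurable in real HDFS (dfs.blocksize)

def split_into_blocks(file_size_mb, block_size_mb=DEFAULT_BLOCK_SIZE):
    """Return the sizes of the chunks a file occupies on data nodes."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(300))  # a 300 MB file -> [128, 128, 44]
print(split_into_blocks(5))    # a 5 MB file occupies only 5 MB -> [5]
```

Note that the last chunk is only as large as the remaining data, which is exactly why a 5 MB file does not waste a 128 MB block.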
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the
master. The name node is the controller and manager of HDFS, as it knows the status and the
metadata of all files in HDFS; the metadata includes file permissions, names, and the
location of each block. The metadata is small, so it is stored in the memory of the name
node, allowing fast access. Moreover, since the HDFS cluster is accessed by multiple
clients concurrently, all this information is handled by a single machine. File system
operations such as opening, closing, and renaming are executed by it.
3. Data Nodes: They store and retrieve blocks when told to, by a client or the name node.
They report back to the name node periodically with the list of blocks they are storing. A data
node, running on commodity hardware, also performs block creation, deletion, and
replication as instructed by the name node.
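The division of labour above, where the name node decides placement and the data nodes hold the bytes, can be sketched as a toy round-robin placement function. Real HDFS placement is rack-aware and more sophisticated; this Python sketch only illustrates the bookkeeping: each block is mapped to a replication factor's worth of distinct data nodes.

```python
import itertools

def place_blocks(block_ids, data_nodes, replication=3):
    """Toy name-node placement: map each block id to `replication`
    distinct data nodes, spreading load round-robin.
    (Real HDFS is rack-aware; this is only a conceptual sketch.)"""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the replication factor")
    placement = {}
    starts = itertools.cycle(range(len(data_nodes)))
    for block in block_ids:
        start = next(starts)
        placement[block] = [data_nodes[(start + i) % len(data_nodes)]
                            for i in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_blocks(["blk_1", "blk_2"], nodes))
```

With three replicas on distinct nodes, any single data-node failure leaves two live copies, which is what gives HDFS its fault tolerance.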
(Figure: HDFS data node and name node architecture)
(Figure: HDFS read path)
(Figure: HDFS write path)
Since all the metadata is stored in the name node, it is very important. If it fails, the file system
cannot be used, as there would be no way of knowing how to reconstruct the files from the
blocks present in the data nodes. To overcome this, the concept of the secondary name node arises.
4. Secondary Name Node: It is a separate physical machine that acts as a helper to the name
node. It performs periodic checkpoints: it communicates with the name node and takes a
snapshot of the metadata, which helps minimize downtime and data loss.
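The checkpointing idea can be illustrated with a small conceptual sketch: a namespace image plus a log of recent edits is merged into a fresh image. The names and operation formats here (fsimage as a dict, three edit operations) are purely illustrative and do not reflect Hadoop's on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Conceptual secondary-name-node checkpoint: apply the logged
    operations to a copy of the namespace image and return the new image.
    (Illustrative only -- not Hadoop's actual fsimage/edit-log format.)"""
    image = dict(fsimage)  # never mutate the live image
    for op, path, *args in edit_log:
        if op == "create":
            image[path] = {"blocks": args[0]}
        elif op == "rename":
            image[args[0]] = image.pop(path)
        elif op == "delete":
            image.pop(path, None)
    return image

old_image = {"/a.txt": {"blocks": ["blk_1"]}}
edits = [("create", "/b.txt", ["blk_2"]),
         ("rename", "/a.txt", "/a2.txt"),
         ("delete", "/missing")]
new_image = checkpoint(old_image, edits)
print(sorted(new_image))  # -> ['/a2.txt', '/b.txt']
```

After a checkpoint the edit log can be truncated, so on restart the name node replays only a short log on top of a recent image, which is what limits downtime and data loss.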