Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is the storage component of Hadoop. All data stored in Hadoop is distributed across a cluster of machines, and HDFS has a few defining properties. Its architecture and components are described below.
Components of HDFS
HDFS consists of several components, each explained in detail below.
HDFS Blocks
HDFS breaks down a file into smaller units, and each of these units is stored on a different
machine in the cluster. To a user working with HDFS, however, it appears as if all the data
were stored on a single machine. These smaller units are called blocks. Each block is 128MB
by default, though the size is configurable. For example, a 512MB file divides evenly into 4
blocks of 128MB each; when the file size is not an exact multiple of the block size, the final
block stores only the remainder.
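The split described above can be sketched as follows. This is an illustrative Python sketch, not HDFS source code; the function name is hypothetical, but the 128MB default matches HDFS.

```python
# Illustrative sketch (not HDFS source code): how a file of a given size
# is split into 128MB blocks, with the last block holding any remainder.
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size (128MB)

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full
    if remainder:
        blocks.append(remainder)  # the final block stores only the leftover bytes
    return blocks

# A 512MB file divides evenly into four 128MB blocks.
print(len(split_into_blocks(512 * 1024 * 1024)))  # 4
# A 500MB file needs three full blocks plus a 116MB final block.
print([b // (1024 * 1024) for b in split_into_blocks(500 * 1024 * 1024)])
```

Note that a block smaller than 128MB does not occupy a full 128MB on disk; it uses only as much space as it needs.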
NameNode in HDFS
HDFS operates in a master-slave architecture: there is one master node and several slave
nodes in a cluster. The Namenode is the master; it runs on a dedicated machine in the cluster
and does the following:
Manages the filesystem namespace which is the filesystem tree or hierarchy of the
files and directories.
Stores information including owners of files and file permissions for all the files.
It is also aware of the locations of all the blocks of a file and their size.
All this information is maintained persistently on the local disk in the form of two files:
Fsimage - stores the information about the files and directories in the filesystem. For files, it
stores the replication level, modification and access times, access permissions, the blocks the
file is made up of, and their sizes. For directories, it stores the modification time and
permissions.
Edit Log - records every write operation the client performs. These operations are also
applied to the in-memory metadata, which serves read requests.
Whenever a client wants to write information to HDFS or read information from HDFS, it
connects with the Namenode. The Namenode returns the location of the blocks to the client
and the operation is carried out. It is worth noting that the Namenode does not store the
blocks themselves, only their metadata.
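The read path just described can be sketched as below. This is a hypothetical simplification with made-up class, path, and node names; the real Namenode is a Java service, but the division of labor is the same: it returns only block locations, never block data.

```python
# Hypothetical sketch of the HDFS read path: the Namenode holds only
# metadata (file -> blocks -> Datanode locations); block data stays on
# the Datanodes themselves.
class NameNode:
    def __init__(self) -> None:
        # filesystem namespace: file path -> ordered list of block IDs
        self.namespace: dict[str, list[str]] = {}
        # block ID -> Datanodes holding a replica of that block
        self.block_locations: dict[str, list[str]] = {}

    def get_block_locations(self, path: str) -> list[tuple[str, list[str]]]:
        """Return (block_id, datanodes) pairs for a file; no block data."""
        return [(b, self.block_locations[b]) for b in self.namespace[path]]

nn = NameNode()
nn.namespace["/logs/app.log"] = ["blk_1", "blk_2"]
nn.block_locations["blk_1"] = ["dn1", "dn3", "dn4"]
nn.block_locations["blk_2"] = ["dn2", "dn3", "dn5"]

# The client asks the Namenode for locations, then reads from Datanodes.
for block_id, nodes in nn.get_block_locations("/logs/app.log"):
    print(block_id, "->", nodes)
```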
Datanodes in HDFS
Datanodes are the slave nodes. They run on inexpensive commodity hardware that can be
easily added to the cluster. Datanodes are responsible for storing, retrieving, replicating, and
deleting blocks as instructed by the Namenode. They periodically send heartbeats to the
Namenode so that it is aware of their health. Along with the heartbeat, a Datanode also sends
a block report, the list of blocks stored on it, so that the Namenode can maintain the mapping
of blocks to Datanodes in its memory.
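The heartbeat mechanism can be sketched as follows. This is a hypothetical illustration, not the real protocol; the timeout value here is only indicative (real HDFS declares a Datanode dead after roughly 10 minutes without a heartbeat, and the interval is configurable).

```python
# Hypothetical sketch of heartbeat tracking: the Namenode records the
# last heartbeat time per Datanode and considers a node dead once no
# heartbeat has arrived within the timeout window.
HEARTBEAT_TIMEOUT = 600.0  # seconds; illustrative value only

def live_datanodes(last_heartbeat: dict[str, float], now: float) -> set[str]:
    """Return the Datanodes whose last heartbeat is within the timeout."""
    return {dn for dn, t in last_heartbeat.items() if now - t <= HEARTBEAT_TIMEOUT}

# dn2 last reported 700 seconds ago, so it is considered dead.
heartbeats = {"dn1": 1000.0, "dn2": 300.0, "dn3": 990.0}
print(sorted(live_datanodes(heartbeats, now=1000.0)))  # ['dn1', 'dn3']
```

When a Datanode is declared dead, the Namenode schedules re-replication of its blocks onto the remaining live nodes, using the block reports it has collected.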
Secondary Namenode in HDFS
The Secondary Namenode is another node in the cluster whose main task is to regularly merge
the Edit Log with the Fsimage and produce checkpoints of the primary's in-memory
filesystem metadata. This is referred to as checkpointing. The Secondary Namenode runs on a
separate machine because checkpointing is expensive and requires a lot of memory. Note that,
despite its name, it is not a failover standby for the Namenode; it exists only for checkpointing
and keeping a copy of the latest Fsimage.
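Checkpointing can be sketched as replaying the Edit Log on top of the last Fsimage snapshot. This is a hypothetical simplification: the operation names and the dictionary representation are invented, and the real formats are binary, but the merge idea is the same.

```python
# Hypothetical sketch of checkpointing: replay the edit log over the
# last fsimage snapshot to produce a new, up-to-date fsimage.
def checkpoint(fsimage: dict[str, dict], edit_log: list[tuple]) -> dict[str, dict]:
    image = dict(fsimage)  # start from the last saved snapshot
    for op, path, *args in edit_log:
        if op == "create":
            image[path] = {"replication": args[0]}
        elif op == "delete":
            image.pop(path, None)
    # The merged image would be written to disk and the edit log truncated,
    # so the log does not grow without bound.
    return image

old_image = {"/a": {"replication": 3}}
edits = [("create", "/b", 3), ("delete", "/a")]
print(checkpoint(old_image, edits))  # {'/b': {'replication': 3}}
```

Without periodic checkpointing, the Edit Log would grow indefinitely and a Namenode restart, which replays the log, would take very long.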
Replication Management in HDFS
One of the features that makes HDFS very reliable is block replication: every block stored in
the filesystem is replicated on different Datanodes across the cluster, which makes HDFS
fault-tolerant. The default replication factor is 3, though it is configurable; it means every
block has two additional copies, each stored on a separate Datanode in the cluster.
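The storage cost of replication is simple arithmetic, sketched below; the function name is hypothetical, but the factor of 3 is the HDFS default (`dfs.replication`).

```python
# Quick arithmetic: raw cluster capacity consumed under replication.
REPLICATION_FACTOR = 3  # HDFS default (dfs.replication)

def raw_storage_needed(logical_bytes: int, rf: int = REPLICATION_FACTOR) -> int:
    """Each block is stored rf times, so raw usage is rf x logical size."""
    return logical_bytes * rf

one_tb = 1024 ** 4
print(raw_storage_needed(one_tb) // 1024 ** 4)  # 1TB of data occupies 3TB raw
```

This tripled storage cost is the price paid for tolerating the loss of up to two replicas of any block.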
Rack awareness
A rack is a collection of machines (30-40 in a typical Hadoop deployment) housed in the
same physical location. A Hadoop cluster spans multiple racks connected through switches.
To increase fault tolerance, block replicas are spread across different racks as well as different
Datanodes. Hadoop's rack awareness policy balances this reliability against write bandwidth:
with a replication factor of 3, the first replica is placed on the writer's Datanode, the second
on a Datanode in a different rack, and the third on a different Datanode in the same rack as
the second, so a write crosses at most two racks.
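The default placement policy for a replication factor of 3 can be sketched as below. This is a hypothetical simplification: real HDFS picks the second rack and the nodes within it at random subject to capacity, while this sketch deterministically takes the first candidates.

```python
# Hypothetical sketch of the default replica placement for replication
# factor 3: first replica on the writer's node, second on a node in a
# different rack, third on a different node in that same second rack.
def place_replicas(writer: str, racks: dict[str, list[str]]) -> list[str]:
    writer_rack = next(r for r, nodes in racks.items() if writer in nodes)
    other_rack = next(r for r in racks if r != writer_rack)  # real HDFS: random
    second = racks[other_rack][0]
    third = next(n for n in racks[other_rack] if n != second)
    return [writer, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", racks))  # ['dn1', 'dn3', 'dn4']
```

Placing the second and third replicas in the same remote rack means only one copy crosses the inter-rack switch during the write, saving bandwidth while still surviving the loss of an entire rack.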
Benefits of using HDFS
The following are advantages of using HDFS:
Cost effectiveness. The Datanodes that store the data run on inexpensive off-the-shelf
hardware, which cuts storage costs, and because HDFS is open source there is no
licensing fee.
Large data set storage. It stores data of any size, from megabytes to petabytes, and in
any format, including structured and unstructured data.
Fast recovery from hardware failure. HDFS is designed to detect faults and recover
from them automatically.
Portability. It is portable across all hardware platforms, and it is compatible with
several operating systems, including Windows, Linux and Mac OS/X.
Streaming data access. It is built for high data throughput, which is best for access to
streaming data.
COMPARISON & BENEFITS OF NFS AND HADOOP