Big Data 2 - Part

Hadoop is an open-source framework designed for storing and processing large datasets in a distributed computing environment, utilizing the MapReduce programming model. It consists of key components such as HDFS for storage and YARN for resource management, along with additional modules like Hive and Pig for enhanced functionality. While Hadoop offers advantages like scalability and fault tolerance, it also has limitations, including complexity and challenges with real-time processing.

INTRODUCTION TO HADOOP

INTRODUCTION:

Hadoop is an open-source software framework that is used for storing and processing
large amounts of data in a distributed computing environment. It is designed to handle
big data and is based on the MapReduce programming model, which allows for the
parallel processing of large datasets.

What is Hadoop?

Hadoop is an open-source software programming framework for storing a large amount of data and performing computation. Its framework is based on Java programming, with some native code in C and shell scripts.

Hadoop has two main components:

 HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which allows for the storage of large amounts of data across multiple machines. It is designed to work with commodity hardware, which makes it cost-effective.

 YARN (Yet Another Resource Negotiator): This is the resource management component of Hadoop, which manages the allocation of resources (such as CPU and memory) for processing the data stored in HDFS.

 Hadoop also includes several additional modules that provide extra functionality, such as Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a non-relational, distributed database).

 Hadoop is commonly used in big data scenarios such as data warehousing, business intelligence, and machine learning. It is also used for data processing, data analysis, and data mining. It enables the distributed processing of large data sets across clusters of computers using a simple programming model.
History of Hadoop

Hadoop was developed at the Apache Software Foundation; its co-founders are Doug Cutting and Mike Cafarella. Co-founder Doug Cutting named it after his son's toy elephant. In October 2003, Google released its first paper, on the Google File System. In January 2006, MapReduce development started on Apache Nutch, which consisted of around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.

Hadoop is an open-source software framework for storing and processing big data. It
was created by Apache Software Foundation in 2006, based on a white paper written
by Google in 2003 that described the Google File System (GFS) and the MapReduce
programming model. The Hadoop framework allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. It is used by many organizations, including Yahoo,
Facebook, and IBM, for a variety of purposes such as data warehousing, log
processing, and research. Hadoop has been widely adopted in the industry and has
become a key technology for big data processing.

Features of Hadoop:

1. It is fault tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It has huge, flexible storage.
5. It is low cost.

Hadoop has several key features that make it well-suited for big data
processing:

 Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and processing of extremely large amounts of data.

 Scalability: Hadoop can scale from a single server to thousands of machines, making it easy to add more capacity as needed.

 Fault Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to operate even in the presence of hardware failures.

 Data Locality: Hadoop stores data on the same node where it will be processed, which reduces network traffic and improves performance.

 High Availability: Hadoop provides a high-availability feature, which helps to make sure that the data is always available and is not lost.

 Flexible Data Processing: Hadoop's MapReduce programming model allows for the processing of data in a distributed fashion, making it easy to implement a wide variety of data processing tasks.

 Data Integrity: Hadoop provides a built-in checksum feature, which helps to ensure that the stored data is consistent and correct.

 Data Replication: Hadoop provides a data replication feature, which replicates data across the cluster for fault tolerance.

 Data Compression: Hadoop provides built-in data compression, which helps to reduce storage space and improve performance (see the sketch after this list).

 YARN: A resource management platform that allows multiple data processing engines, such as real-time streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.
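
As a small illustration of the compression feature mentioned above, one common way to use Hadoop's built-in compression codecs is to compress the output of a MapReduce job. This is a minimal sketch only: the class name and job name are illustrative, while FileOutputFormat and GzipCodec are standard Hadoop classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputSetup {
    // Returns a Job whose final output files will be written gzip-compressed,
    // saving storage space at the cost of some CPU time.
    public static Job newCompressedJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "compressed output example");
        FileOutputFormat.setCompressOutput(job, true);                   // enable output compression
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); // pick the gzip codec
        return job;
    }
}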

Hadoop Distributed File System

Hadoop has a distributed file system known as HDFS, which splits files into blocks and sends them across the various nodes of large clusters. In case of a node failure, the system continues to operate, and the data transfer that takes place between the nodes is facilitated by HDFS.
Advantages of HDFS:

 It is inexpensive.
 It is immutable in nature.
 It stores data reliably.
 It has the ability to tolerate faults.
 It is scalable.
 It is block structured.
 It can process a large amount of data simultaneously, and many more.

Disadvantages of HDFS:

 Its biggest disadvantage is that it is not a good fit for small quantities of data.
 It also has issues related to potential stability.
 It is restrictive and rough in nature.

Hadoop also supports a wide range of software packages such as Apache Flume, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.

Some common frameworks of Hadoop

1. HIVE- It uses HiveQL for data structuring and for writing complicated MapReduce jobs over HDFS.

2. DRILL- It consists of user-defined functions and is used for data exploration.

3. STORM- It allows real-time processing and streaming of data.

4. SPARK- It contains a machine learning library (MLlib) for providing enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.

5. PIG- It has Pig Latin, a SQL-like language, and performs data transformation on unstructured data.

6. TEZ- It reduces the complexities of Hive and Pig and helps their code run faster.

Hadoop Framework is made up of the following modules:

1. Hadoop MapReduce- a MapReduce programming model for handling and processing large data.

2. Hadoop Distributed File System (HDFS)- distributes files in clusters among nodes.

3. Hadoop YARN- a platform which manages computing resources.

4. Hadoop Common- contains the packages and libraries which are used by the other modules.

Advantages and Disadvantages of Hadoop

Advantages:

 Ability to store a large amount of data.

 High flexibility.

 Cost effective.

 High computational power.

 Tasks are independent.

 Linear scaling.

Hadoop has several advantages that make it a popular choice for big data
processing:

 Scalability: Hadoop can easily scale to handle large amounts of data by adding more
nodes to the cluster.

 Cost-effective: Hadoop is designed to work with commodity hardware, which makes it a cost-effective option for storing and processing large amounts of data.

 Fault-tolerance: Hadoop's distributed architecture provides built-in fault tolerance, which means that if one node in the cluster goes down, the data can still be processed by the other nodes.

 Flexibility: Hadoop can process structured, semi-structured, and unstructured data, which makes it a versatile option for a wide range of big data scenarios.
 Open-source: Hadoop is open-source software, which means that it is free to use
and modify. This also allows developers to access the source code and make
improvements or add new features.

 Large community: Hadoop has a large and active community of developers and
users who contribute to the development of the software, provide support, and share
best practices.

 Integration: Hadoop is designed to work with other big data technologies such as
Spark, Storm, and Flink, which allows for integration with a wide range of data
processing and analysis tools.

Disadvantages:

 Not very effective for small data.

 Hard cluster management.

 Has stability issues.

 Security concerns.

 Complexity: Hadoop can be complex to set up and maintain, especially for organizations without a dedicated team of experts.

 Latency: Hadoop is not well-suited for low-latency workloads and may not be the best choice for real-time data processing.

 Limited Support for Real-time Processing: Hadoop's batch-oriented nature makes it less suited for real-time streaming or interactive data processing use cases.

 Limited Support for Structured Data: Hadoop is designed to work with unstructured and semi-structured data; it is not well-suited for structured data processing.

 Data Security: Hadoop's built-in security features, such as data encryption and user authentication, are limited by default, which can make it difficult to secure sensitive data.

 Limited Support for Ad-hoc Queries: Hadoop's MapReduce programming model is not well-suited for ad-hoc queries, making it difficult to perform exploratory data analysis.

 Limited Support for Graph and Machine Learning: Hadoop's core components, HDFS and MapReduce, are not well-suited for graph and machine learning workloads; specialized components like Apache Giraph and Mahout are available but have some limitations.

 Cost: Hadoop can be expensive to set up and maintain, especially for organizations with large amounts of data.

 Data Loss: In the event of a hardware failure, the data stored in a single node may be lost permanently.

 Data Governance: Data governance is a critical aspect of data management; Hadoop does not provide built-in features to manage data lineage, data quality, data cataloging, and data auditing.

Hadoop – Architecture

As we all know, Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to maintain and store big data. Hadoop works on the MapReduce programming algorithm that was introduced by Google. Today, many big-brand companies, e.g. Facebook, Yahoo, Netflix, and eBay, use Hadoop in their organizations to deal with big data. The Hadoop architecture mainly consists of 4 components:

 MapReduce

 HDFS (Hadoop Distributed File System)

 YARN (Yet Another Resource Negotiator)

 Common Utilities or Hadoop Common

Let's understand the role of each of these components in detail.

1. MapReduce

MapReduce is a programming model, running on top of the YARN framework, whose major feature is to perform distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop work so fast. When you are dealing with big data, serial processing is no longer of any use. MapReduce has mainly two tasks, which are divided phase-wise: in the first phase Map is used, and in the next phase Reduce is used.

Here, the input is provided to the Map() function, its output is then used as input to the Reduce() function, and after that we receive our final output. Let's understand what Map() and Reduce() do.

The input provided to Map() is a set of data. The Map() function breaks these data blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines these tuples based on their key, forms a new set of tuples, and performs operations such as sorting or summation, which is then sent to the final output node. Finally, the output is obtained.

The data processing is always done in the Reducer, depending on the business requirement of the industry. This is how first Map() and then Reduce() are used, one after the other.

Let's understand the Map Task and Reduce Task in detail.

Map Task:

 RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing key-value pairs to the Map() function. The key is the record's locational information and the value is the data associated with it.

 Map: A map is nothing but a user-defined function whose work is to process the tuples obtained from the RecordReader. The Map() function may generate no key-value pairs at all, or it may generate multiple pairs of these tuples.

 Combiner: The Combiner is used for grouping the data in the Map workflow. It is similar to a local reducer. The intermediate key-value pairs that are generated in the Map phase are combined with the help of this combiner. Using a combiner is optional.

 Partitioner: The Partitioner is responsible for fetching the key-value pairs generated in the Mapper phase. It generates the shards corresponding to each reducer. The hash code of each key is also fetched by the partitioner, which then takes its modulus with the number of reducers (key.hashCode() % numberOfReducers), as in the sketch below.
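
The hash-and-modulus behavior described above can also be written as a custom Partitioner. This is a minimal sketch only: the class name and the Text/IntWritable key-value types are illustrative choices, while Partitioner itself is a standard Hadoop class.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default hash-based partitioning: hash the key, clear the sign bit,
// and take the modulus with the number of reducers to pick a partition.
public class ModuloPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}

A job would select it with job.setPartitionerClass(ModuloPartitioner.class); when no partitioner is set, Hadoop's built-in HashPartitioner does essentially the same thing.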

Reduce Task:

 Shuffle and Sort: The task of the Reducer starts with this step. The process in which the Mapper generates the intermediate key-value pairs and transfers them to the Reducer task is known as shuffling. Using the shuffling process, the system can sort the data by its key. Shuffling begins once some of the Map tasks are done, which is why it is a faster process that does not wait for all of the Mapper's work to complete.

 Reduce: The main task of Reduce is to gather the tuples generated by Map and then perform some sorting- and aggregation-type operations on those key-value pairs, depending on the key element.

 OutputFormat: Once all the operations are performed, the key-value pairs are written into a file with the help of the RecordWriter, each record on a new line, with the key and value separated by a space.
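
Putting the pieces above together, the classic word-count job is a small, self-contained illustration of the Map and Reduce tasks. This is a sketch only: the class names are illustrative, while Mapper, Reducer, Job, and the other types used here are standard Hadoop MapReduce classes, and the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: the RecordReader hands each line to map(), which emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce task: after shuffle and sort, all counts for a word arrive together and are summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local reducer (the Combiner)
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, such a job would typically be submitted with something like hadoop jar wordcount.jar WordCount /inputDir /outputDir, where the directory names are just examples.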

2. HDFS

HDFS (Hadoop Distributed File System) is used as the storage layer in Hadoop. It is mainly designed for working on commodity hardware (inexpensive devices), using a distributed file system design. HDFS is designed in such a way that it prefers storing the data in large blocks rather than in many small blocks.

HDFS provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster. The data storage nodes in HDFS are:

 NameNode (Master)

 DataNode (Slave)

NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e. the data about the data. The metadata can be the transaction logs that keep track of the user's activity in the Hadoop cluster.

The metadata can also include the name of a file, its size, and information about the locations (block numbers, block IDs) on the DataNodes, which the NameNode stores to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations like delete, create, and replicate.

DataNode: DataNodes work as slaves. DataNodes are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster will be able to store. It is therefore advised that DataNodes have a high storage capacity, so that they can store a large number of file blocks.

High Level Architecture of Hadoop


File Block In HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks of 128 MB, which is the default size and can also be changed manually.

Let's understand this concept of breaking a file into blocks with an example. Suppose you upload a file of 400 MB to HDFS. The file gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, meaning 4 blocks are created, each of 128 MB except the last one. Hadoop doesn't know or care about what data is stored in these blocks, so a record may be split across blocks, and Hadoop simply treats the final, smaller block like any other. In the Linux file system, the size of a file block is about 4 KB, which is very much less than the default block size in the Hadoop file system. As we all know, Hadoop is mainly configured for storing large data sets, on the petabyte scale; this is what makes the Hadoop file system different from other file systems, as it can be scaled. Nowadays, file blocks of 128 MB to 256 MB are used in Hadoop.
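
To see how a particular file was actually split up, the block size and the block locations can be read back through the Hadoop FileSystem Java API. This is a minimal sketch, assuming the cluster configuration is on the classpath and using a hypothetical file path; FileSystem, FileStatus, and BlockLocation are standard Hadoop classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/geeks/bigfile.dat")); // hypothetical file

        // Block size used for this file (128 MB by default) and its total length.
        System.out.println("Block size: " + status.getBlockSize() + " bytes");
        System.out.println("File size:  " + status.getLen() + " bytes");

        // One BlockLocation per block, listing the DataNodes that hold its replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset() + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}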

Replication In HDFS: Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies made of a particular thing is its replication factor. As we have seen with file blocks, HDFS stores the data in the form of various blocks, and at the same time Hadoop is also configured to make copies of those file blocks.

By default, the replication factor for Hadoop is set to 3, which can be configured, meaning you can change it manually as per your requirement. In the above example we made 4 file blocks, which means that 3 replicas or copies of each file block are made, so a total of 4 × 3 = 12 blocks are kept for backup purposes.

This is because, for running Hadoop, we are using commodity hardware (inexpensive system hardware) which can crash at any time; we are not using supercomputers for our Hadoop setup. That is why we need a feature in HDFS which can make copies of the file blocks for backup purposes; this is known as fault tolerance.

One thing we also need to notice is that after making so many replicas of our file blocks we are using a lot of extra storage, but for big-brand organizations the data is far more important than the storage, so nobody cares about this extra storage. You can configure the replication factor in your hdfs-site.xml file, or change it per file from code as in the sketch below.
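
As an illustration only, the replication factor of an existing file can also be changed from a client program. This is a minimal sketch, assuming the cluster configuration is on the classpath; the path /geeks/AI.txt is a hypothetical example, while Configuration, FileSystem, and setReplication are standard Hadoop classes and methods.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ask for 2 replicas of this file; the NameNode adds or removes
        // block copies in the background to match the new factor.
        boolean accepted = fs.setReplication(new Path("/geeks/AI.txt"), (short) 2);
        System.out.println("Replication change accepted: " + accepted);

        fs.close();
    }
}

The equivalent shell command, hdfs dfs -setrep, appears later in the HDFS Commands section.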

Rack Awareness: A rack is nothing but a physical collection of nodes in our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode to achieve the maximum performance while performing read/write operations, which reduces network traffic.

HDFS Architecture

3. YARN (Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has higher priority, dependencies between the jobs, and all the other information like job timing, etc. The resource manager is used to manage all the resources that are made available for running the Hadoop cluster.

Features of YARN

 Multi-Tenancy

 Scalability

 Cluster Utilization

 Compatibility

4. Hadoop common or Common Utilities

Hadoop Common, or the common utilities, is nothing but the Java library and Java files, in other words the Java scripts, that we need for all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.

Hadoop Ecosystem


Overview: Apache Hadoop is an open-source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets which can't be processed efficiently with the help of traditional methodology such as an RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.

Introduction: The Hadoop ecosystem is a platform or a suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common Utilities. Most of the tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as absorption, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:

 HDFS: Hadoop Distributed File System

 YARN: Yet Another Resource Negotiator

 MapReduce: Programming based Data Processing

 Spark: In-Memory data processing

 PIG, HIVE: Query based processing of data services

 HBase: NoSQL Database

 Mahout, Spark MLLib: Machine Learning algorithm libraries

 Solr, Lucene: Searching and Indexing

 Zookeeper: Managing cluster

 Oozie: Job Scheduling


Note: Apart from the above-mentioned components, there are many other components too
that are part of the Hadoop ecosystem.
All these toolkits and components revolve around one term, i.e. data. That's the beauty of Hadoop: it revolves around data, which makes its synthesis easier.
HDFS:

 HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of log files.

 HDFS consists of two core components, i.e.

1. Name Node

2. Data Node

 The Name Node is the prime node and contains the metadata (data about data), requiring comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in the distributed environment, undoubtedly making Hadoop cost effective.

 HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system.

YARN:

 Yet Another Resource Negotiator, as the name implies: YARN is the one that helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.

 It consists of three major components, i.e.

1. Resource Manager

2. Node Manager

3. Application Manager

 The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Manager and performs negotiations as per the requirements of the two.

MapReduce:

 By making use of distributed and parallel algorithms, MapReduce carries over the processing logic and helps to write applications which transform big data sets into manageable ones.

 MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:

1. Map() performs sorting and filtering of data and thereby organizes it into groups. Map generates a key-value-pair-based result which is later processed by the Reduce() method.

2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.

PIG:
Pig was basically developed by Yahoo. It works on the Pig Latin language, which is a query-based language similar to SQL.

 It is a platform for structuring the data flow, processing and analyzing huge data sets.
 Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.

 Pig Latin language is specially designed for this framework which runs on Pig
Runtime. Just the way Java runs on the JVM.

 Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.

HIVE:

 With the help of SQL methodology and an SQL-like interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).

 It is highly scalable, as it allows both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.

 Similar to other query processing frameworks, HIVE comes with two components: JDBC Drivers and the HIVE Command Line.

 JDBC, along with ODBC drivers, works on establishing the data storage permissions and connection, whereas the HIVE Command Line helps in the processing of queries.

Mahout:

 Mahout allows machine learnability for a system or application. Machine learning, as the name suggests, helps a system develop itself based on patterns, user/environment interaction, or algorithms.

 It provides various libraries and functionalities such as collaborative filtering, clustering, and classification, which are nothing but concepts of machine learning. It allows invoking algorithms as per our need with the help of its own libraries.

Apache Spark:

 It's a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization.

 It consumes in-memory resources and is thus faster than the prior frameworks in terms of optimization.

 Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used interchangeably in most companies.

Apache HBase:

 It's a NoSQL database which supports all kinds of data and is thus capable of handling anything in a Hadoop database. It provides capabilities similar to Google's BigTable and is thus able to work on big data sets effectively.

 At times when we need to search for or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.

Other Components: Apart from all of these, there are some other components too that
carry out a huge task in order to make Hadoop capable of processing large datasets. They
are as follows:

 Solr, Lucene: These are two services that perform the task of searching and indexing with the help of some Java libraries. Lucene in particular is based on Java and also provides a spell-check mechanism; Solr is built on top of Lucene.

 Zookeeper: There was a huge issue with the management of coordination and synchronization among the resources or components of Hadoop, which often resulted in inconsistency. Zookeeper overcame all these problems by performing synchronization, inter-component communication, grouping, and maintenance.

 Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.

The building blocks of Hadoop

Hadoop employs a master/slave architecture for both distributed storage and distributed
computation. The distributed storage system is called the Hadoop Distributed File System
(HDFS).
On a fully configured cluster, "running Hadoop" means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on one server, some exist across multiple servers. The daemons include:

1. NameNode

2. DataNode

3. Secondary NameNode

4. JobTracker

5. TaskTracker

NameNode: The NameNode is the master of HDFS that directs the slave DataNode
daemons to perform the low-level I/O tasks. It is the bookkeeper of HDFS; it keeps track of
how your files are broken down into file blocks, which nodes store those blocks and the
overall health of the distributed file system.

The server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program, in order to lower the workload on the machine; hence it is memory- and I/O-intensive.

There is unfortunately a negative aspect to the importance of the NameNode: it's a single point of failure of your Hadoop cluster. For any of the other daemons, if their host fails for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly, or you can quickly restart it. Not so for the NameNode.

DataNode: Each slave machine in your cluster will host a DataNode daemon to perform the
grunt work of the distributed filesystem - reading and writing HDFS blocks to actual files on
the local file system.

When you want to read or write a HDFS file, the file is broken into blocks and the
NameNode will tell your client which DataNode each block resides in. Your client
communicates directly with the DataNode daemons to process the local files corresponding
to the blocks.

Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy. This ensures that if any one DataNode crashes or becomes inaccessible over the network, you'll still be able to read the files.

DataNodes are constantly reporting to the NameNode. Upon initialization, each of the DataNodes informs the NameNode of the blocks it's currently storing. After this mapping is complete, the DataNodes continually poll the NameNode to provide information regarding local changes as well as receive instructions to create, move, or delete blocks from the local disk.

Secondary NameNode (SNN): The SNN is an assistant daemon for monitoring the state of the cluster's HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on
its own machine as well. No other DataNode or TaskTracker daemons run on the same
server. The SNN differs from the NameNode in that this process doesn't receive or record
any real-time changes to HDFS. Instead, it communicates with the NameNode to take
snapshots of the HDFS metadata at intervals defined by the cluster configuration.

As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and
the SNN snapshots help minimize the downtime and loss of data.

There is another topic which can be covered under the SNN, i.e., the fsimage (filesystem image) file and the edits file:

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log
called the EditLog to persistently record every change that occurs to file system metadata.
For example, creating a new file in HDFS causes the NameNode to insert a record into the
EditLog indicating this. Similarly, changing the replication factor of a file causes a new
record to be inserted into the EditLog. The NameNode uses a file in its local host OS file
system to store the EditLog. The entire file system namespace, including the mapping of
blocks to files and file system properties, is stored in a file called the FsImage. The FsImage
is stored as a file in the NameNode’s local file system too.
JobTracker: Once you submit your code to your cluster, the JobTracker determines the
execution plan by determining which files to process, assigns nodes to different tasks, and
monitors all tasks as they're running. Should a task fail, the JobTracker will automatically
relaunch the task, possibly on a different node, up to a predefined limit of retries.
There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a
master node of the cluster.

TaskTracker: As with the storage daemons, the computing daemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node.

Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.

One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
Difference Between RDBMS and Hadoop

RDBMS and Hadoop are both used for handling, storing, and processing data, but they differ in terms of design, implementation, and use cases. An RDBMS primarily stores structured data and processes it with SQL, while Hadoop stores and handles both structured and unstructured data and processes it using MapReduce or Spark. In this article, we will go into detail about RDBMS and Hadoop and also look into the differences between them.

What is an RDBMS?

An RDBMS is an information management system based on the relational data model. In an RDBMS, tables are used for information storage. Each row of a table represents a record and each column represents an attribute of the data. The organization of data and its manipulation processes differ in an RDBMS from other databases. An RDBMS ensures the ACID (atomicity, consistency, isolation, durability) properties required for designing a database. The purpose of an RDBMS is to store, manage, and retrieve data as quickly and reliably as possible.

Advantages

 Supports high data integrity.

 Provides multiple levels of security.

 Normalization is present.

 Creates a replica of the data so that it can be used in case of disasters.

Disadvantages

 It is less scalable as compared to Hadoop.

 Huge costs are required.

 Complex design.

 The structure of the database is fixed.

What is Hadoop?

Hadoop is an open-source software framework used for storing data and running applications on clusters of commodity hardware. It has a large storage capacity and high processing power. It
can manage multiple concurrent processes at the same time. It is used in predictive analysis,
data mining, and machine learning. It can handle both structured and unstructured forms of
data. It is more flexible in storing, processing, and managing data than traditional RDBMS.
Unlike traditional systems, Hadoop enables multiple analytical processes on the same data
at the same time. It supports scalability very flexibly.

Advantages

 Highly scalable.

 Cost effective, because it is open-source software.

 Stores any kind of data (structured, semi-structured, and unstructured).

 High throughput.

Disadvantages

 Not effective for small files.

 Security features are not available.

 High processing overhead.

 Supports only batch processing (MapReduce).

Differences Between RDBMS and Hadoop

 RDBMS: Traditional row-column based databases, basically used for data storage, manipulation, and retrieval. Hadoop: An open-source software framework used for storing data and running applications or processes concurrently.

 RDBMS: Mostly structured data is processed. Hadoop: Both structured and unstructured data are processed.

 RDBMS: Best suited for an OLTP environment. Hadoop: Best suited for big data.

 RDBMS: Less scalable than Hadoop. Hadoop: Highly scalable.

 RDBMS: Data normalization is required. Hadoop: Data normalization is not required.

 RDBMS: Stores transformed and aggregated data. Hadoop: Stores huge volumes of data.

 RDBMS: Has no latency in response. Hadoop: Has some latency in response.

 RDBMS: The data schema is static. Hadoop: The data schema is dynamic.

 RDBMS: High data integrity. Hadoop: Lower data integrity than RDBMS.

 RDBMS: Cost applies for licensed software. Hadoop: Free of cost, as it is open-source software.

 RDBMS: Data is processed using SQL. Hadoop: Data is processed using MapReduce.

 RDBMS: Follows ACID properties. Hadoop: Does not follow ACID properties.

Google File System

Google Inc. developed the Google File System (GFS), a scalable distributed file system
(DFS), to meet the company’s growing data processing needs. GFS offers fault tolerance,
dependability, scalability, availability, and performance to big networks and connected
nodes. GFS is made up of a number of storage systems constructed from inexpensive
commodity hardware parts. The search engine, which creates enormous volumes of data that
must be kept, is only one example of how it is customized to meet Google’s various data use
and storage requirements.

The Google File System reduces the impact of hardware failures while taking advantage of inexpensive, commercially available servers.

GoogleFS is another name for GFS. It manages two types of data, namely file metadata and file data.
The GFS node cluster consists of a single master and several chunk servers that various
client systems regularly access. On local discs, chunk servers keep data in the form of Linux
files. Large (64 MB) pieces of the stored data are split up and replicated at least three times
around the network. Reduced network overhead results from the greater chunk size.

Without hindering applications, GFS is made to meet Google’s huge cluster requirements.
Hierarchical directories with path names are used to store files. The master is in charge of
managing metadata, including namespace, access control, and mapping data. The master
communicates with each chunk server by timed heartbeat messages and keeps track of its
status updates.

More than 1,000 nodes with 300 TB of disc storage capacity make up the largest GFS
clusters. This is available for constant access by hundreds of clients.

Components of GFS

A group of computers makes up GFS. A cluster is just a group of connected computers. There could be hundreds or even thousands of computers in each cluster. Three basic entities are included in any GFS cluster, as follows:

 GFS Clients: These can be computer programs or applications which may be used to request files. Requests may be made to access and modify already-existing files or to add new files to the system.

 GFS Master Server: It serves as the cluster's coordinator. It preserves a record of the cluster's actions in an operation log. Additionally, it keeps track of the data that describes the chunks, i.e. the metadata. The metadata indicates to the master server the chunks' place in the overall file and which files they belong to.

 GFS Chunk Servers: They are the GFS's workhorses. They keep 64 MB-sized file chunks. The chunk servers do not send chunks through the master server; instead, they deliver the desired chunks directly to the client. To assure stability, GFS makes numerous copies of each chunk and stores them on various chunk servers; the default is three copies. Each copy is referred to as a replica.

Features of GFS

 Namespace management and locking.

 Fault tolerance.

 Reduced client and master interaction because of the large chunk size.

 High availability.

 Critical data replication.

 Automatic and efficient data recovery.

 High aggregate throughput.

Advantages of GFS

1. High accessibility: data is still accessible even if a few nodes fail (thanks to replication). Component failures are more common than not, as the saying goes.

2. High throughput: many nodes operate concurrently.

3. Dependable storage: data that has been corrupted can be detected and re-replicated.

Disadvantages of GFS

1. Not the best fit for small files.

2. The master may act as a bottleneck.

3. No support for random writes.

4. Suitable only for data that is written once and read (appended to) later.
Google File System (GFS) vs. Hadoop Distributed File System (HDFS)

In distributed file systems, Google File System (GFS) and Hadoop Distributed File System
(HDFS) stand out as crucial technologies. Both are designed to handle large-scale data, but
they cater to different needs and environments. In this article, we will understand the
differences between them.

What is Google File System (GFS)?

Google File System (GFS) is a distributed file system designed by Google to handle large-
scale data storage across multiple machines while providing high reliability and
performance.

 It was developed to meet the needs of Google's massive data processing and storage
requirements, particularly for its search engine and other large-scale applications.

 GFS is optimized for storing and processing very large files (in the range of
gigabytes or terabytes) and supports high-throughput data operations rather than
low-latency access.

Key Features of the Google File System (GFS)

Below are the key features of the Google File System (GFS):

 Scalability: GFS can scale to thousands of storage nodes and manage petabytes of
data.
 Fault Tolerance: Data is replicated across multiple machines, ensuring reliability
even in case of hardware failures.

 High Throughput: It’s optimized for large data sets and supports concurrent read
and write operations.

 Chunk-based Storage: Files are divided into fixed-size chunks (usually 64 MB) and
distributed across many machines.

 Master and Chunkserver Architecture: GFS employs a master server that manages metadata and multiple chunkservers that store the actual data.

What is Hadoop Distributed File System (HDFS)?

Hadoop Distributed File System (HDFS) is an open-source distributed file system inspired by GFS and designed to store large amounts of data across a cluster of machines while ensuring fault tolerance and scalability. It is a core component of the Apache Hadoop ecosystem and is designed to handle large-scale data processing jobs such as those found in big data environments.

Key Features of Hadoop Distributed File System (HDFS)

Below are the key features of Hadoop Distributed File System:

 Distributed Architecture: HDFS stores files across a distributed cluster of machines.

 Fault Tolerance: Data is replicated across multiple nodes, ensuring that the system can recover from failures.

 Master-Slave Architecture: HDFS consists of a single master node (NameNode) that manages metadata and multiple slave nodes (DataNodes) that store the actual data.

 Large Block Size: HDFS breaks files into large blocks (128 MB by default; 64 MB in older versions) to optimize read/write operations for large datasets.

 Write Once, Read Many: HDFS is optimized for workloads that involve writing files once and reading them multiple times.

Google File System (GFS) vs. Hadoop Distributed File System (HDFS)

Below are the key differences between the Google File System (GFS) and the Hadoop Distributed File System (HDFS):

 Origin. GFS: Developed by Google for their internal applications. HDFS: Developed by Apache for open-source big data frameworks.

 Architecture. GFS: Master-slave architecture with a single master (GFS master) and chunkservers. HDFS: Master-slave architecture with a NameNode and DataNodes.

 Block/Chunk Size. GFS: Default chunk size of 64 MB. HDFS: Default block size of 128 MB (configurable).

 Replication Factor. GFS: Default replication is 3 copies. HDFS: Default replication is 3 copies (configurable).

 File Access Pattern. GFS: Optimized for write-once, read-many access patterns. HDFS: Also optimized for write-once, read-many workloads.

 Fault Tolerance. GFS: Achieves fault tolerance via data replication across multiple chunkservers. HDFS: Achieves fault tolerance via data replication across multiple DataNodes.

 Data Integrity. GFS: Uses checksums to ensure data integrity. HDFS: Uses checksums to ensure data integrity.

 Data Locality. GFS: Focuses on computation close to the data for efficiency. HDFS: Provides data locality by moving computation to where the data is stored.

 Cost Efficiency. GFS: Designed to run on commodity hardware. HDFS: Also designed to run on commodity hardware.

Use Cases of Google File System (GFS)

Below are the use cases of the Google File System (GFS):

 Web Indexing and Search Engine Operations:

o GFS was originally developed to support Google’s search engine.

o It handles massive amounts of web data (such as crawled web pages) that
need to be processed, indexed, and stored efficiently.

o The system enables fast access to large datasets, making it ideal for web
crawling and indexing tasks.

 Large-Scale Data Processing:

o GFS is used in large-scale distributed data processing jobs where files can be
extremely large (gigabytes or terabytes).

o It supports high-throughput data access, making it suitable for data processing jobs like MapReduce.

o Google used GFS for data-intensive tasks like search indexing, log analysis,
and content processing.

 Machine Learning and AI Workloads:

o GFS is also employed in machine learning tasks at Google.

o Since machine learning often involves processing large datasets for training
models, GFS’s ability to handle large files and provide high-throughput data
access makes it useful for machine learning pipelines.

 Distributed Video and Image Storage:

o GFS is used to store and process large multimedia files, such as videos and
images, for Google services like YouTube and Google Images.

o Its fault tolerance and ability to scale out to handle massive amounts of
media make it ideal for these types of workloads.

 Log File Storage and Processing:

o Large-scale applications generate enormous log files, which can be stored in GFS for future analysis.

o Google uses GFS to store and analyze logs for various services (e.g., Google
Ads, Gmail) to identify trends, detect anomalies, and improve service quality.

Use Cases of Hadoop Distributed File System (HDFS)

Below are the use cases of the Hadoop Distributed File System (HDFS):

 Big Data Analytics:

o HDFS is widely used for big data analytics in environments that require the
storage and processing of massive datasets.

o Organizations use HDFS for tasks such as customer behavior analysis, predictive modeling, and large-scale business intelligence analysis using tools like Apache Hadoop and Apache Spark.

 Data Warehousing:

o HDFS serves as the backbone for data lakes and distributed data warehouses.

o Enterprises use it to store structured, semi-structured, and unstructured data, enabling them to run complex queries, generate reports, and derive insights using data warehouse tools like Hive and Impala.

 Batch Processing via MapReduce:

o HDFS is the foundational storage layer for running batch processing jobs
using the MapReduce framework.
o Applications like log analysis, recommendation engines, and ETL (extract-
transform-load) workflows commonly run on HDFS with MapReduce.

 Machine Learning and Data Mining:

o HDFS is also popular in machine learning environments for storing large datasets that need to be processed by distributed algorithms.

o Frameworks like Apache Mahout and MLlib (Spark's machine learning library) work seamlessly with HDFS for training and testing machine learning models.

 Social Media Data Processing:

o HDFS is commonly used in social media analytics to process large-scale user-generated content such as tweets, posts, and multimedia.

o Social media companies use HDFS to store, analyze, and extract trends or
insights from vast amounts of data.

Conclusion

In conclusion, GFS is used only by Google for its own tasks, while HDFS is open for
everyone and widely used by many companies. GFS handles Google’s big data, and HDFS
helps other businesses store and process large amounts of data through tools like Hadoop.

HDFS Commands

HDFS is the primary or major component of the Hadoop ecosystem which is responsible for
storing large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files. To use the HDFS commands, first you
need to start the Hadoop services using the following command:

sbin/start-all.sh

To check that the Hadoop services are up and running, use the following command:

jps
Commands:

1. ls: This command is used to list all the files. Use -ls -R (or the older -lsr) for a recursive listing; it is useful when we want the hierarchy of a folder.

Syntax:

bin/hdfs dfs -ls <path>

Example:

bin/hdfs dfs -ls /

It will print all the directories present in HDFS. The bin directory contains executables, so bin/hdfs means we want the executables of hdfs, particularly the dfs (Distributed File System) commands.

2. mkdir: To create a directory. In Hadoop dfs there is no home directory by default, so let's first create it.

Syntax:

bin/hdfs dfs -mkdir <folder name>

creating home directory:

bin/hdfs dfs -mkdir /user

bin/hdfs dfs -mkdir /user/username -> write the username of your computer

Example:

bin/hdfs dfs -mkdir /geeks => '/' means absolute path

bin/hdfs dfs -mkdir geeks2 => relative path -> the folder will be created relative to the home directory.


3. touchz: It creates an empty file.

Syntax:

bin/hdfs dfs -touchz <file_path>

Example:

bin/hdfs dfs -touchz /geeks/myfile.txt

4. copyFromLocal (or) put: To copy files/folders from local file system to hdfs store.
This is the most important command. Local filesystem means the files present on the
OS.

Syntax:

bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>

Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder geeks present on hdfs.

bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks

(OR)

bin/hdfs dfs -put ../Desktop/AI.txt /geeks


5. cat: To print file contents.

Syntax:

bin/hdfs dfs -cat <path>

Example:

// print the content of AI.txt present

// inside geeks folder.

bin/hdfs dfs -cat /geeks/AI.txt

6. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.

Syntax:

bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>

Example:

bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero

(OR)

bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero

myfile.txt from geeks folder will be copied to folder hero present on Desktop.

Note: Observe that we don’t write bin/hdfs while checking the things present on local
filesystem.

7. moveFromLocal: This command will move a file from the local filesystem to hdfs.


Syntax:

bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>

Example:

bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks

8. cp: This command is used to copy files within hdfs. Let's copy the folder geeks to geeks_copied.

Syntax:

bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>

Example:

bin/hdfs dfs -cp /geeks /geeks_copied

9. mv: This command is used to move files within hdfs. Let's cut-paste the file myfile.txt from the geeks folder to geeks_copied.

Syntax:

bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>

Example:

bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied


10. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.

Syntax:

bin/hdfs dfs -rmr <filename/directoryName>

Example:

bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory and then the directory itself.

11. du: It will give the size of each file in the directory.

Syntax:

bin/hdfs dfs -du <dirName>

Example:

bin/hdfs dfs -du /geeks

12. dus: This command will give the total size of the directory/file.
Syntax:

bin/hdfs dfs -dus <dirName>

Example:

bin/hdfs dfs -dus /geeks

13. stat: It will give the last modified time of the directory or path. In short, it will give the stats of the directory or file.

Syntax:

bin/hdfs dfs -stat <hdfs file>

Example:

bin/hdfs dfs -stat /geeks

14. setrep: This command is used to change the replication factor of a file/directory in HDFS. By default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).

Example 1: To change the replication factor to 6 for geeks.txt stored in HDFS.

bin/hdfs dfs -setrep -R -w 6 geeks.txt

Example 2: To change the replication factor to 4 for the directory geeks stored in HDFS.

bin/hdfs dfs -setrep -R 4 /geeks

Note: The -w flag means wait until the replication is completed, and -R means recursive; we use it for directories, as they may contain many files and folders inside them.
Note: There are more commands in HDFS but we discussed the commands which are
commonly used when working with Hadoop. You can check out the list of dfs commands
using the following command:

bin/hdfs dfs
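
Beyond the shell, the same kinds of operations can also be performed programmatically through the Hadoop FileSystem Java API. This is a brief sketch only, assuming the cluster configuration is on the classpath; the /geeks paths are the same illustrative examples used above, and the class name is made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsApiDemo {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and related settings from the XML configuration files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Roughly equivalent to: bin/hdfs dfs -mkdir /geeks
        fs.mkdirs(new Path("/geeks"));

        // Roughly equivalent to: bin/hdfs dfs -touchz /geeks/myfile.txt
        FSDataOutputStream out = fs.create(new Path("/geeks/myfile.txt"));
        out.close();

        // Roughly equivalent to: bin/hdfs dfs -ls /geeks
        for (FileStatus status : fs.listStatus(new Path("/geeks"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}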
