Big Data 2 - Part
INTRODUCTION:
What is Hadoop?
Hadoop is an open-source software framework used for storing and processing
large amounts of data in a distributed computing environment. It is designed to handle
big data and is based on the MapReduce programming model, which allows for the
parallel processing of large datasets.
HDFS (Hadoop Distributed File System): This is the storage component of Hadoop,
which allows for the storage of large amounts of data across multiple machines. It is
designed to work with commodity hardware, which makes it cost-effective.
Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis,
and data mining. It enables the distributed processing of large data sets across
clusters of computers using a simple programming model.
History of Hadoop
Hadoop is an open-source software framework for storing and processing big data. It
was created under the Apache Software Foundation in 2006, based on papers published
by Google in 2003 and 2004 that described the Google File System (GFS) and the
MapReduce programming model. The Hadoop framework allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. It is used by many organizations, including Yahoo,
Facebook, and IBM, for a variety of purposes such as data warehousing, log
processing, and research. Hadoop has been widely adopted in the industry and has
become a key technology for big data processing.
Features of Hadoop:
Hadoop has several key features that make it well-suited for big data
processing:
Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.
Data Locality: Hadoop moves the computation to the node where the data is stored; this
reduces network traffic and improves performance.
High Availability: Hadoop replicates data across nodes, which helps make sure that the
data is always available and is not lost.
Data Integrity: Hadoop provides a built-in checksum feature, which helps to ensure that
the stored data is consistent and correct.
Data Compression: Hadoop provides built-in data compression, which helps to reduce
storage space and improve performance.
Hadoop has a distributed file system known as HDFS, which splits files into blocks and
distributes them across the various nodes of large clusters. In case of a node failure, the
system keeps operating, and HDFS takes care of transferring the affected data between
the remaining nodes.
Advantages of HDFS:
It is inexpensive because it runs on commodity hardware, it stores data reliably by
replicating blocks, it tolerates node failures, and it scales to very large data sets.
Disadvantages of HDFS:
Its biggest disadvantage is that it is not a good fit for small quantities of data. It also has
potential stability issues and can be restrictive and rough to work with.
Hadoop also supports a wide range of software packages such as Apache Flume,
Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache
Pig, Apache Hive, Apache Phoenix, and Cloudera Impala. Some of the main ones are:
1. HIVE: It uses HiveQL for data structuring and for writing complicated MapReduce
jobs over data in HDFS.
2. PIG: It has Pig Latin, a SQL-like language, and performs data transformation of
unstructured data.
3. TEZ: It reduces the complexities of Hive and Pig and helps their code run faster.
4. Hadoop Common: It contains the packages and libraries which are used by the other
modules.
Advantages:
Hadoop has several advantages that make it a popular choice for big data
processing:
Scalability: Hadoop can easily scale to handle large amounts of data by adding more
nodes to the cluster.
Large community: Hadoop has a large and active community of developers and
users who contribute to the development of the software, provide support, and share
best practices.
Integration: Hadoop is designed to work with other big data technologies such as
Spark, Storm, and Flink, which allows for integration with a wide range of data
processing and analysis tools.
Disadvantages:
Latency: Hadoop is not well-suited for low-latency workloads and may not be the
best choice for real-time data processing.
Data Security: Hadoop's built-in security features, such as data encryption and user
authentication, are limited out of the box, which can make it difficult to secure sensitive
data.
Limited Support for Graph and Machine Learning: Hadoop's core components,
HDFS and MapReduce, are not well suited to graph and machine-learning workloads;
specialized projects such as Apache Giraph and Apache Mahout are available but have
some limitations.
Cost: Hadoop can be expensive to set up and maintain, especially for organizations
with large amounts of data.
Data Loss: If the replication factor is set too low, a hardware failure can mean that the
data stored on a single node is lost permanently.
Hadoop – Architecture
As we all know, Hadoop is a framework written in Java that utilizes a large cluster of
commodity hardware to maintain and store big data. Hadoop works on the MapReduce
programming algorithm that was introduced by Google. Today lots of big-brand
companies are using Hadoop in their organizations to deal with big data, e.g. Facebook,
Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of 4 components:
MapReduce, HDFS, YARN, and Hadoop Common (the common utilities).
1. MapReduce
MapReduce is essentially a programming model (an algorithm, if you like) that runs on top
of the YARN framework. Its major feature is distributed, parallel processing across the
Hadoop cluster, which is what makes Hadoop so fast; when you are dealing with Big Data,
serial processing is no longer of any use. MapReduce has two main tasks, which are
divided phase-wise:
An input, which for Big Data is a set of data blocks, is provided to the Map() function.
Map() breaks these data blocks into tuples, i.e. key-value pairs. These key-value pairs are
then sent as input to Reduce(). The Reduce() function combines the tuples based on their
key, performs operations such as sorting or summation on them, and forms a reduced set
of tuples that is sent to the final output node; this is how the final output is obtained.
The processing performed in the Reducer always depends on the business requirement of
that industry. This is how Map() is applied first and then Reduce(), one after the other.
(A short runnable sketch of this flow, using Hadoop Streaming, follows the task
descriptions below.)
Map Task:
Map: A map is a user-defined function that processes the tuples obtained from the
RecordReader. For each input tuple, Map() may generate zero, one, or many key-value
pairs.
Combiner: The Combiner is used for grouping the data in the Map workflow. It is
similar to a local reducer: the intermediate key-value pairs generated by the Map are
combined with its help. Using a combiner is optional.
Reduce Task:
Shuffle and Sort: The Reducer's task starts with this step. The process by which the
intermediate key-value pairs generated by the Mappers are transferred to the Reducers
is known as shuffling; during shuffling the system sorts the data by key. Shuffling begins
as soon as some of the map tasks are done, which is why it is a faster process that does
not wait for all Mappers to finish.
Reduce: The main task of Reduce is to gather the tuples generated by Map and then
perform sorting and aggregation on those key-value pairs, grouped by key.
OutputFormat: Once all the operations are performed, the key-value pairs are written
into a file with the help of a RecordWriter, each record on a new line with the key and
value separated by a space.
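To make the Map, Shuffle and Sort, and Reduce phases concrete, here is a rough word-count sketch using Hadoop Streaming, which lets plain shell commands act as the mapper and reducer. The streaming jar path and the /user/hadoop input and output directories are illustrative and will differ on your cluster:

cat > wc_map.sh <<'EOF'
#!/bin/bash
# emit one word per line; each word becomes a key with an empty value
tr -s '[:space:]' '\n'
EOF

cat > wc_reduce.sh <<'EOF'
#!/bin/bash
# input arrives sorted by key, so counting adjacent duplicates counts each word
uniq -c
EOF

bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files wc_map.sh,wc_reduce.sh \
    -input /user/hadoop/books \
    -output /user/hadoop/wordcount-out \
    -mapper "bash wc_map.sh" \
    -reducer "bash wc_reduce.sh"

The mapper emits one word per line, the shuffle phase sorts the words so that identical keys arrive at the same Reducer next to each other, and uniq -c then counts each run of identical words, mirroring the Map -> Shuffle and Sort -> Reduce flow described above.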
2. HDFS
HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is mainly
designed to work on commodity hardware (inexpensive devices) and follows a distributed
file-system design. HDFS is designed in such a way that it prefers storing the data in large
blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and to the other
devices present in the Hadoop cluster. The data storage nodes in HDFS are the NameNode
(master) and the DataNodes (slaves).
NameNode: The NameNode works as the master of the cluster and mainly stores
metadata, i.e. data about the data. Metadata can be the name of a file, its size, and the
information about the location (block numbers, block IDs) of the DataNodes, which the
NameNode uses to find the closest DataNode for faster communication. The NameNode
instructs the DataNodes with operations like create, delete, and replicate.
DataNode: DataNodes work as slaves. They are mainly used for storing the actual data in
a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The
more DataNodes the cluster has, the more data it can store, so it is advised that DataNodes
have a high storage capacity to hold a large number of file blocks.
Let’s understand this concept of breaking a file into blocks with an example. Suppose you
upload a 400 MB file to HDFS. The file is divided into blocks of 128 MB + 128 MB +
128 MB + 16 MB = 400 MB, meaning four blocks are created, each of 128 MB except the
last one. Hadoop doesn't know, and doesn't care, what data is stored in these blocks, so it
simply treats the final, smaller block as a partial block. In the Linux file system the size of
a file block is about 4 KB, which is far smaller than the default block size in the Hadoop
file system. Hadoop is mainly configured for storing large data, up to petabytes; this is
what makes the Hadoop file system different from other file systems, since it can be
scaled, and nowadays file blocks of 128 MB to 256 MB are typical in Hadoop.
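If you want to see this block layout for yourself, the fsck tool reports how a file was split and where each block (and its replicas) lives; the file path below is only an illustrative example:

bin/hdfs fsck /user/hadoop/sample-400MB.dat -files -blocks -locations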
Replication in HDFS: Replication ensures the availability of the data. Replication means
making a copy of something, and the number of copies kept of that particular thing is its
replication factor. Just as HDFS stores the data as a set of file blocks, Hadoop is also
configured to make copies of those file blocks.
By default, the replication factor for Hadoop is set to 3, and it can be changed manually as
per your requirements. In the example above we made 4 file blocks, so with 3 replicas
(copies) kept of each block, a total of 4 × 3 = 12 blocks are stored for backup purposes.
This is because, for running Hadoop, we are using commodity hardware (inexpensive
system hardware) which can crash at any time; we are not using a supercomputer for our
Hadoop setup. That is why we need a feature in HDFS that can make copies of the file
blocks for backup purposes; this is known as fault tolerance.
One thing we also need to notice is that making so many replicas of our file blocks uses a
lot of extra storage, but for big brand organizations the data is much more important than
the storage, so nobody minds this overhead. You can configure the replication factor in
your hdfs-site.xml file.
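For example (the file path is illustrative), the replication factor of data already in HDFS can be changed with setrep, while the cluster-wide default comes from the dfs.replication property in hdfs-site.xml:

bin/hdfs dfs -setrep -w 2 /user/hadoop/sample-400MB.dat    # -w waits until re-replication finishes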
Rack Awareness: A rack is simply the physical collection of nodes in our Hadoop cluster
(maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this
rack information, the Namenode chooses the closest Datanode to achieve maximum
performance while performing read/write operations, which reduces network traffic.
HDFS Architecture
3. YARN (Yet Another Resource Negotiator)
Features of YARN
Multi-Tenancy
Cluster Utilization
Scalability
Compatibility
4. Hadoop Common
Hadoop Common, or the common utilities, is simply the set of Java libraries and Java files
(the Java scripts) needed by all the other components present in a Hadoop cluster. These
utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop
Common assumes that hardware failures in a Hadoop cluster are common, so they need to
be handled automatically, in software, by the Hadoop framework.
Hadoop Ecosystem
HDFS:
1. Name Node
2. Data Node
Name Node is the prime node, which contains metadata (data about data) and requires
comparatively fewer resources than the data nodes that store the actual data. These data
nodes are commodity hardware in the distributed environment, which undoubtedly makes
Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
YARN:
Yet Another Resource Negotiator, as the name implies, YARN is the one who helps
to manage the resources across the clusters. In short, it performs scheduling and
resource allocation for the Hadoop System.
1. Resource Manager
2. Nodes Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the
system, whereas Node Managers work on the allocation of resources such as CPU,
memory, and bandwidth per machine, and later acknowledge the Resource Manager. The
Application Manager works as an interface between the Resource Manager and the Node
Manager and performs negotiations as per the requirements of the two.
MapReduce:
MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is:
1. Map() performs sorting and filtering of the data and thereby organizes it into
groups. Map generates a key-value-pair-based result which is later processed by
the Reduce() method.
2. Reduce() performs summarization by aggregating the mapped data. In simple
terms, it takes the key-value pairs generated by Map() as input and combines
those tuples into a smaller set of tuples.
PIG:
Pig was basically developed by Yahoo and works with Pig Latin, a query-based language
similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig
Runtime. Just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and
writing of large data sets. Its query language is called HQL (Hive Query
Language).
Similar to the Query Processing frameworks, HIVE too comes with two
components: JDBC Drivers and HIVE Command Line.
JDBC, along with ODBC drivers work on establishing the data storage permissions
and connection whereas HIVE Command line helps in the processing of queries.
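As a small illustration of the JDBC path, the beeline client that ships with Hive can connect to a HiveServer2 instance and run an HQL statement; this assumes Hive's bin directory is on the PATH and HiveServer2 is running locally on its default port 10000:

beeline -u jdbc:hive2://localhost:10000 -e "SHOW DATABASES;"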
Mahout:
Mahout brings machine learning to a system or application, providing scalable libraries
for tasks such as clustering, classification, and collaborative filtering on top of Hadoop.
Apache Spark:
It’s a platform that handles all the process-consumptive tasks like batch processing,
interactive or iterative real-time processing, graph conversions, and visualization.
It performs its computations in memory, and is therefore faster than MapReduce in
terms of optimization.
Spark is best suited for real-time data whereas Hadoop is best suited for structured
data or batch processing, hence both are used in most of the companies
interchangeably.
Apache HBase:
It’s a NoSQL database which supports all kinds of data and is thus capable of handling
anything within a Hadoop database. It provides the capabilities of Google’s BigTable
and is thus able to work on big data sets effectively.
Other Components: Apart from all of these, there are some other components too that
carry out a huge task in order to make Hadoop capable of processing large datasets. They
are as follows:
Solr, Lucene: These are two services that perform the task of searching and indexing
with the help of Java libraries. Lucene is a Java library that also provides a spell-check
mechanism, while Solr is a search platform built on top of Lucene.
Oozie: Oozie simply performs the task of a scheduler, thus scheduling jobs and
binding them together as a single unit. There are two kinds of jobs, i.e. Oozie
workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be
executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those
that are triggered when some data or an external stimulus is given to them.
Hadoop employs a master/slave architecture for both distributed storage and distributed
computation. The distributed storage system is called the Hadoop Distributed File System
(HDFS).
On a fully configured cluster, "running Hadoop" means running a set of daemons, or
resident programs, on the different servers in your network. These daemons have specific
roles; some exist only on one server, some exist across multiple servers. The daemons
include:
1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker
NameNode: The NameNode is the master of HDFS that directs the slave DataNode
daemons to perform the low-level I/O tasks. It is the bookkeeper of HDFS; it keeps track of
how your files are broken down into file blocks, which nodes store those blocks and the
overall health of the distributed file system.
The server hosting the NameNode typically doesn't store any user data or perform any
computations for a MapReduce program, in order to lower the workload on that machine;
the NameNode itself is memory- and I/O-intensive.
There is unfortunately a negative aspect to the importance of the NameNode: it's a single
point of failure of your Hadoop cluster. For any of the other daemons, if their host fails for
software or hardware reasons, the Hadoop cluster will likely continue to function smoothly,
or you can quickly restart it. Not so for the NameNode.
DataNode: Each slave machine in your cluster will host a DataNode daemon to perform the
grunt work of the distributed filesystem - reading and writing HDFS blocks to actual files on
the local file system.
When you want to read or write a HDFS file, the file is broken into blocks and the
NameNode will tell your client which DataNode each block resides in. Your client
communicates directly with the DataNode daemons to process the local files corresponding
to the blocks.
Furthermore, a DataNode may communicate with other DataNodes to replicate its data
blocks for redundancy. This ensures that if any one DataNode crashes or becomes
inaccessible over the network, you'll still be able to read the files.
DataNodes are constantly reporting to the NameNode. Upon initialization, each of the
DataNodes informs the NameNode of the blocks it's currently storing. After this mapping is
complete, the DataNodes continually poll the NameNode to provide information regarding
local changes as well as receive instructions to create, move, or delete from the local disk.
Secondary NameNode (SNN): The SNN is an assistant daemon for monitoring the state of
the cluster HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on
its own machine as well. No other DataNode or TaskTracker daemons run on the same
server. The SNN differs from the NameNode in that this process doesn't receive or record
any real-time changes to HDFS. Instead, it communicates with the NameNode to take
snapshots of the HDFS metadata at intervals defined by the cluster configuration.
As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and
the SNN snapshots help minimize the downtime and loss of data.
Another topic that can be covered under the SNN is the fsimage (file system image) file
and the edits file:
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log
called the EditLog to persistently record every change that occurs to file system metadata.
For example, creating a new file in HDFS causes the NameNode to insert a record into the
EditLog indicating this. Similarly, changing the replication factor of a file causes a new
record to be inserted into the EditLog. The NameNode uses a file in its local host OS file
system to store the EditLog. The entire file system namespace, including the mapping of
blocks to files and file system properties, is stored in a file called the FsImage. The FsImage
is stored as a file in the NameNode’s local file system too.
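Both files can be inspected offline with the viewers that ship with Hadoop; the paths and file names below are illustrative, since the real files live under the NameNode's configured metadata directory and carry transaction-id suffixes:

hdfs oiv -p XML -i /data/namenode/current/fsimage_0000000000000000042 -o /tmp/fsimage.xml    # offline image viewer
hdfs oev -i /data/namenode/current/edits_0000000000000000001-0000000000000000042 -o /tmp/edits.xml    # offline edits viewer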
JobTracker: Once you submit your code to your cluster, the JobTracker determines the
execution plan by determining which files to process, assigns nodes to different tasks, and
monitors all tasks as they're running. Should a task fail, the JobTracker will automatically
relaunch the task, possibly on a different node, up to a predefined limit of retries.
There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a
master node of the cluster.
TaskTracker: As with the storage daemons, the computing daemons also follow a
master/slave architecture: the JobTracker is the master overseeing the overall execution of
a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each
slave node.
Each TaskTracker is responsible for executing the individual tasks that the JobTracker
assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn
multiple JVMs to handle many map or reduce tasks in parallel.
RDBMS and Hadoop are both used for handling, storing, and processing data, but they
differ in design, implementation, and use cases. An RDBMS primarily stores structured
data and processes it with SQL, while Hadoop stores and handles both structured and
unstructured data and processes it using MapReduce or Spark. In this article, we will go
into detail about RDBMS and Hadoop and also look at the differences between them.
Advantages
Normalization is present, data integrity is high, and SQL makes data manipulation and
retrieval straightforward.
Disadvantages
It mainly handles structured data with a static schema, it scales poorly to very large or
unstructured data sets, and licensed software adds cost.
Hadoop
It is an open-source software framework used for storing data and running applications on a
group of commodity hardware. It has a large storage capacity and high processing power. It
can manage multiple concurrent processes at the same time. It is used in predictive analysis,
data mining, and machine learning. It can handle both structured and unstructured forms of
data. It is more flexible in storing, processing, and managing data than traditional RDBMS.
Unlike traditional systems, Hadoop enables multiple analytical processes on the same data
at the same time. It supports scalability very flexibly.
Advantages
It is open source and free of cost, it scales out easily on commodity hardware, and it
handles both structured and unstructured data.
Disadvantages
It offers lower data integrity than an RDBMS, it is not suited to low-latency or OLTP
workloads, and setting up and securing a cluster takes effort.
RDBMS vs Hadoop:
RDBMS: Traditional row-column based databases, basically used for data storage,
manipulation and retrieval. Hadoop: An open-source software framework used for storing
data and running applications or processes concurrently.
RDBMS: Mostly structured data is processed. Hadoop: Both structured and unstructured
data are processed.
RDBMS: Best suited for the OLTP environment. Hadoop: Best suited for BIG data.
RDBMS: The data schema is of static type. Hadoop: The data schema is of dynamic type.
RDBMS: High data integrity is available. Hadoop: Lower data integrity than RDBMS.
RDBMS: Cost is applicable for licensed software. Hadoop: Free of cost, as it is
open-source software.
Google Inc. developed the Google File System (GFS), a scalable distributed file system
(DFS), to meet the company’s growing data processing needs. GFS offers fault tolerance,
dependability, scalability, availability, and performance to big networks and connected
nodes. GFS is made up of a number of storage systems constructed from inexpensive
commodity hardware parts. The search engine, which creates enormous volumes of data that
must be kept, is only one example of how it is customized to meet Google’s various data use
and storage requirements.
The Google File System reduced the impact of hardware flaws while exploiting the gains
of inexpensive, commercially available servers.
GoogleFS is another name for GFS. It manages two types of data namely File metadata and
File Data.
The GFS node cluster consists of a single master and several chunk servers that various
client systems regularly access. On local discs, chunk servers keep data in the form of Linux
files. Large (64 MB) pieces of the stored data are split up and replicated at least three times
around the network. Reduced network overhead results from the greater chunk size.
Without hindering applications, GFS is made to meet Google’s huge cluster requirements.
Hierarchical directories with path names are used to store files. The master is in charge of
managing metadata, including namespace, access control, and mapping data. The master
communicates with each chunk server by timed heartbeat messages and keeps track of its
status updates.
More than 1,000 nodes with 300 TB of disc storage capacity make up the largest GFS
clusters. This is available for constant access by hundreds of clients.
Components of GFS
GFS Clients: The computers or applications that access the file system to read existing
files or create new ones.
GFS Master Server: The coordinator of the cluster. As described above, it maintains the
metadata, including the namespace, access control, and the mapping of files to chunks.
GFS Chunk Servers: These are the GFS's workhorses. They keep the 64 MB file chunks.
The chunk servers do not send chunks through the master server; instead, they deliver the
requested chunks directly to the client. To assure reliability, the GFS makes numerous
copies of each chunk and stores them on various chunk servers; the default is three copies,
and each copy is referred to as a replica.
Features of GFS
Fault tolerance, reliability, scalability, availability, and high performance on networks of
commodity hardware; files are stored as large (64 MB) chunks, each replicated at least
three times across the network.
Advantages of GFS
1. High accessibility: data is still accessible even if a few nodes fail, thanks to replication.
Component failures are the norm rather than the exception, as the saying goes.
2. Dependable storage: data that has been corrupted can be detected and re-replicated.
Disadvantages of GFS
1. Not the best fit for small files.
2. The master may act as a bottleneck.
3. Unable to write at random; it is suitable only for procedures or data that are written
once and only read (appended) later.
Google File System(GFS) vs. Hadoop Distributed File System (HDFS)
In distributed file systems, Google File System (GFS) and Hadoop Distributed File System
(HDFS) stand out as crucial technologies. Both are designed to handle large-scale data, but
they cater to different needs and environments. In this article, we will understand the
differences between them.
Google File System (GFS) is a distributed file system designed by Google to handle large-
scale data storage across multiple machines while providing high reliability and
performance.
It was developed to meet the needs of Google's massive data processing and storage
requirements, particularly for its search engine and other large-scale applications.
GFS is optimized for storing and processing very large files (in the range of
gigabytes or terabytes) and supports high-throughput data operations rather than
low-latency access.
Scalability: GFS can scale to thousands of storage nodes and manage petabytes of
data.
Fault Tolerance: Data is replicated across multiple machines, ensuring reliability
even in case of hardware failures.
High Throughput: It’s optimized for large data sets and supports concurrent read
and write operations.
Chunk-based Storage: Files are divided into fixed-size chunks (usually 64 MB) and
distributed across many machines.
Hadoop Distributed File System (HDFS) is an open-source distributed file system inspired
by GFS and is designed to store large amounts of data across a cluster of machines,
ensuring fault tolerance and scalability. It is a core component of the Apache Hadoop
ecosystem and is designed to handle large-scale data processing jobs such as those found in
big data environments.
Fault Tolerance: Data is replicated across multiple nodes, ensuring that the system
can recover from failures.
Large Block Size: HDFS breaks files into large blocks (default 128 MB or 64 MB)
to optimize read/write operations for large datasets.
Write Once, Read Many: HDFS is optimized for workloads that involve writing
files once and reading them multiple times
Aspect-by-aspect comparison of the Google File System (GFS) and the Hadoop Distributed
File System (HDFS):
File Access Pattern: GFS is optimized for write-once, read-many access patterns. HDFS
is also optimized for write-once, read-many workloads.
Fault Tolerance: GFS achieves fault tolerance via data replication across multiple
chunkservers. HDFS achieves fault tolerance via data replication across multiple
DataNodes.
Data Locality: GFS focuses on keeping computation close to the data. HDFS provides
data locality by moving computation to the data.
Use Cases of GFS:
o It handles massive amounts of web data (such as crawled web pages) that
need to be processed, indexed, and stored efficiently.
o The system enables fast access to large datasets, making it ideal for web
crawling and indexing tasks.
o GFS is used in large-scale distributed data processing jobs where files can be
extremely large (gigabytes or terabytes).
o Google used GFS for data-intensive tasks like search indexing, log analysis,
and content processing.
o Since machine learning often involves processing large datasets for training
models, GFS’s ability to handle large files and provide high-throughput data
access makes it useful for machine learning pipelines.
o GFS is used to store and process large multimedia files, such as videos and
images, for Google services like YouTube and Google Images.
o Its fault tolerance and ability to scale out to handle massive amounts of
media make it ideal for these types of workloads.
o Google uses GFS to store and analyze logs for various services (e.g., Google
Ads, Gmail) to identify trends, detect anomalies, and improve service quality.
Use Cases of HDFS:
o HDFS is widely used for big data analytics in environments that require the
storage and processing of massive datasets.
Data Warehousing:
o HDFS serves as the backbone for data lakes and distributed data warehouses.
o HDFS is the foundational storage layer for running batch processing jobs
using the MapReduce framework.
o Applications like log analysis, recommendation engines, and ETL (extract-
transform-load) workflows commonly run on HDFS with MapReduce.
o Social media companies use HDFS to store, analyze, and extract trends or
insights from vast amounts of data.
Conclusion
In conclusion, GFS is used only by Google for its own tasks, while HDFS is open for
everyone and widely used by many companies. GFS handles Google’s big data, and HDFS
helps other businesses store and process large amounts of data through tools like Hadoop.
HDFS Commands
HDFS is the primary or major component of the Hadoop ecosystem which is responsible for
storing large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files. To use the HDFS commands, first you
need to start the Hadoop services using the following command:
sbin/start-all.sh
To check that the Hadoop services are up and running, use the following command:
jps
Commands:
1. ls: This command is used to list all the files. Use lsr for recursive approach. It is
useful when we want a hierarchy of a folder.
Syntax:
Example:
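A likely general form and a sample invocation (the target path is illustrative):

bin/hdfs dfs -ls <path>
bin/hdfs dfs -ls /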
It will print all the directories present in HDFS. The bin directory contains executables,
so bin/hdfs means we want the hdfs executable, and dfs selects the DFS (Distributed File
System) shell commands.
2. mkdir: To create a directory in HDFS.
Syntax:
Example:
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be created inside the
user's HDFS home directory (e.g. /user/<username>/geeks2).
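With an absolute path, the directory is created exactly where stated, for example:

bin/hdfs dfs -mkdir /geeks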
3. touchz: To create an empty file in HDFS.
Syntax:
Example:
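A plausible invocation, creating the empty file that later examples refer to:

bin/hdfs dfs -touchz /geeks/myfile.txt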
4. copyFromLocal (or) put: To copy files/folders from local file system to hdfs store.
This is the most important command. Local filesystem means the files present on the
OS.
Syntax:
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder geeks present on hdfs.
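Assuming the shell's working directory sits next to Desktop, the copy could look like this:

bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks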
(OR)
Syntax:
Example:
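The put form is equivalent, for instance:

bin/hdfs dfs -put ../Desktop/AI.txt /geeks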
6. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
Example:
(OR)
myfile.txt from geeks folder will be copied to folder hero present on Desktop.
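Concretely, either of these two equivalent commands performs the copy just described (the local path is illustrative):

bin/hdfs dfs -copyToLocal /geeks/myfile.txt ../Desktop/hero
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero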
Note: Observe that we don’t write bin/hdfs while checking the things present on local
filesystem.
7. moveFromLocal: This command moves a file from the local file system to HDFS (the
local copy is removed after the transfer).
Syntax:
Example:
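A plausible invocation (the local file name is hypothetical):

bin/hdfs dfs -moveFromLocal ../Desktop/notes.txt /geeks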
8. cp: This command is used to copy files within hdfs. Let's copy the
folder geeks to geeks_copied.
Syntax:
Example:
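One way this could look, using the folder names from the description above:

bin/hdfs dfs -cp /geeks /geeks_copied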
9. mv: This command is used to move files within hdfs. Let's cut and paste the
file myfile.txt from the geeks folder to geeks_copied.
Syntax:
Example:
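Using the names from the description above, the move would plausibly be:

bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied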
10. rmr: This command deletes files/directories from HDFS recursively.
Syntax:
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory and
then the directory itself.
11. du: It will give the size of each file/directory inside the given path.
Syntax:
Example:
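For instance, to see the size of everything under the geeks directory:

bin/hdfs dfs -du /geeks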
12. dus: This command will give the total size of a directory/file.
Syntax:
Example:
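For instance (on recent releases the same result comes from du -s):

bin/hdfs dfs -dus /geeks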
13. stat: It will give the last modified time of directory or path. In short it will give stats
of the directory or file.
Syntax:
Example:
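For instance, to print the last modification time of the geeks directory:

bin/hdfs dfs -stat /geeks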
14. setrep: This command is used to change the replication factor of a file/directory in
HDFS. By default it is 3 for anything stored in HDFS (as set by the dfs.replication
property in hdfs-site.xml).
Example 2: To change the replication factor to 4 for a directory geeksInput stored in HDFS.
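The corresponding command would plausibly be (assuming geeksInput sits at the HDFS root):

bin/hdfs dfs -setrep -R -w 4 /geeksInput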
Note: The -w means wait till the replication is completed. And -R means recursively, we use
it for directories as they may also contain many files and folders inside them.
Note: There are more commands in HDFS but we discussed the commands which are
commonly used when working with Hadoop. You can check out the list of dfs commands
using the following command:
bin/hdfs dfs