Introduction to Big Data
HADOOP is an open source, Java-based programming framework that supports the processing
and storage of extremely large data sets in a distributed computing environment. It is part of the
Apache project sponsored by the Apache Software Foundation.

•  The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File
System (HDFS), and a processing part, the MapReduce programming model. Hadoop splits
files into large blocks and distributes them across nodes in a cluster. It then transfers
packaged code to the nodes to process the data in parallel. This approach takes advantage of
data locality, where nodes manipulate the data they have access to. This allows the dataset
to be processed faster and more efficiently than in a more conventional
supercomputer architecture that relies on a parallel file system where computation and data
are distributed via high-speed networking.
•  The base Apache Hadoop framework is composed of the following modules:
•  Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
•  Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster;
•  Hadoop YARN – a resource-management platform responsible for managing computing
resources in clusters and using them for scheduling of users' applications; and
•  Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale
data processing.

Hadoop makes it possible to run applications on systems with thousands of commodity hardware
nodes, and to handle thousands of terabytes of data. Its distributed file system facilitates rapid
data transfer rates among nodes and allows the system to continue operating in case of a node
failure. This approach lowers the risk of catastrophic system failure and unexpected data loss,
even if a significant number of nodes become inoperative. Consequently, Hadoop quickly
emerged as a foundation for big data processing tasks, such as scientific analytics, business and
sales planning, and processing enormous volumes of sensor data, including from internet of
things sensors.
Hadoop Features

Flexibility: Hadoop is very flexible in its ability to deal with all kinds of data. It can store and
process structured, semi-structured and unstructured data alike.

Reliability: When machines work in tandem and one of them fails, another machine takes over its
responsibility, so the cluster keeps working in a reliable and fault-tolerant fashion. The Hadoop
infrastructure has built-in fault-tolerance features, which makes Hadoop highly reliable. We will
understand this feature in more detail in the upcoming material on HDFS.

Economical: Hadoop runs on commodity hardware (for example, your PC or laptop). In a
small Hadoop cluster, all your data nodes can have normal configurations such as 8-16 GB RAM,
5-10 TB of disk and Xeon processors; if you instead used hardware-based RAID
with Oracle for the same purpose, you would end up spending at least 5x more. So the cost of
ownership of a Hadoop-based project is quite low. The Hadoop environment is easier to maintain
and economical as well. Also, Hadoop is open source software, so there
is no licensing cost.

Scalability: Last but not least, there is the scalability factor. Hadoop has built-in
capability to integrate seamlessly with cloud-based services. So if you install Hadoop in the
cloud, you don't need to worry about scalability, because you can procure
more hardware and expand your setup within minutes whenever required.
These 4 characteristics make Hadoop a front runner as a Big Data solution. Now that we
know what Hadoop is, we can explore the core components of Hadoop.
But before talking about the Hadoop core components, I will explain what led to the creation of these
components.

There were two major challenges with Big Data:

Big Data Storage: Storing Big Data in a flexible infrastructure that scales up in a cost-effective
manner was critical.

Big Data Processing: Even if only a part of the Big Data is stored, processing it would take years.
To solve the storage and processing issues, two core components were created in Hadoop –
HDFS and YARN. HDFS solved the storage issue, as it stores data in a distributed fashion and
is easily scalable.
Hadoop Core Components
When you set up a Hadoop cluster, you can choose from many services that can be part
of the Hadoop platform, but two services are always mandatory parts of the
setup. One is HDFS (storage) and the other is YARN (processing). HDFS stands for
Hadoop Distributed File System, which is used for storing Big Data; it is highly scalable.
Once you have brought the data onto HDFS, MapReduce can run jobs to process this
data, and YARN does all the resource management and scheduling of these jobs (a minimal MapReduce job is sketched below).
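To make the storage-plus-processing split concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the class name and the input/output paths passed on the command line are illustrative, not part of these notes.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in its input split stored on HDFS.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

A jar built from this class would be submitted with something like: hadoop jar wordcount.jar WordCount <hdfs input> <hdfs output>.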
Hadoop Master Slave Architecture
The main components of HDFS are Name Node and Data Node.
Name Node
It is the master daemon that maintains and manages the Data Nodes (slave nodes)
It records the metadata of all the files stored in the cluster, e.g. location of blocks stored,
the size of the files, permissions, hierarchy, etc.
It records each and every change that takes place to the file system metadata
For example, if a file is deleted in HDFS, the Name Node will immediately record this in the
Edit Log
It regularly receives a heartbeat and a block report from all the Data Nodes in the cluster
to ensure that the Data Nodes are live
It keeps a record of all the blocks in HDFS and in which nodes these blocks are stored
It has high availability and federation features which I will discuss in HDFS architecture in
detail
Data Node
These are slave daemons which run on each slave machine
The actual data is stored on Data Nodes
They are responsible for serving read and write requests from the clients
They are also responsible for creating blocks, deleting blocks and replicating the same
based on the decisions taken by the Name Node
They send heartbeats to the Name Node periodically to report the overall health of HDFS,
by default, this frequency is set to 3 seconds
The components of YARN are Resource Manager and Node Manager.
Resource Manager 
It is a cluster level (one for each cluster) component and runs on the master machine
It manages resources and schedules applications running on top of YARN
It has two components: Scheduler & Application Manager
The Scheduler is responsible for allocating resources to the various running applications
The Application Manager is responsible for accepting job submissions and negotiating the
first container for executing the application
It keeps a track of the heartbeats from the Node Manager
Node Manager
It is a node level component (one on each node) and runs on each slave machine
It is responsible for managing containers and monitoring resource utilization in each
container
It also keeps track of node health and log management
It continuously communicates with Resource Manager to remain up-to-date
[Diagram: enterprise data systems. Simple data is handled by a traditional, single-machine system that scales up (vertical scaling); big data needs a distributed system, a cluster of machines that scales out (horizontal scaling). Distributed systems split into compute-intensive (CPU-bound) frameworks such as Condor/HPC (high-performance computing) and SETI@home-style volunteer computing, and data-intensive (I/O-bound) frameworks such as Hadoop, HAMA, NoSQL stores, Spark, Storm, Flink and HPCC.]
Data life cycle: capture → store → process → report, supported by systems (software + hardware).

1) Data sources: machine generated (logs, sensors, RFIDs, scanners) and user generated (RDBMS transactions, social media, mails/documents, images).
2) Data formats: structured (RDBMS), semi-structured (XML, CSV, JSON, AVRO, TSV), unstructured (flat files, images, audio/video, web) – roughly an 80/20 split.

3) Data properties
a. Volume
b. Variety (combinations, e.g. structured and semi-structured)
c. Velocity – the speed at which data is generated
d. Complexity
Big Data = confluence of (transactions + operations + observations)

Ex: logs – if any operation is done, it makes an entry in the logs. Ex: CCTV.

Scalability

Scale up (vertical scaling, the traditional approach) vs. scale out (horizontal scaling, the distributed approach).

Disadvantages of scale up / traditional:

•  Expensive
•  Time consuming
•  Single point of failure
•  I/O bottleneck
MR, Spark and Flink are batch-processing computational paradigms.
Hadoop: distributed computing framework (MapReduce) + distributed storage (HDFS);
programming model → MR.
NoSQL: non-relational databases ("Not only SQL").

Spark: another batch processing engine (an alternative to MapReduce), e.g. for iterative workloads;
programming model → RDD.
Flink: an optimized alternative to Spark and MR; it is much faster in computation for some algorithms (e.g. ETL);
programming model → PACT.
Storm / S4 / Samza: real-time distributed computing frameworks; not as scalable, used for short-lived data.
HPCC: a distributed framework from LexisNexis, scaled to a few petabytes.
Sphinx: a full-text search engine cum analytics engine.

NoSQL Databases:
1.  CAP theorem
2.  Polyglot persistence
3.  Limitations of SQL

CAP → Consistency, Availability, Partition tolerance

In a distributed system only 2 of them can be 100%; the third one will not be 100%.
SQL → scale up
NoSQL → scale out
The NoSQL space is divided between AP and CP systems.

Cassandra is tunably consistent: it is highly available, and its consistency can be tuned during configuration. It has its own file system called the Cassandra File System.

Cassandra favours high availability (AP).

HBase favours strong consistency (CP).
Polyglot persistence: a single application uses multiple database models.
Ex: Application A using MongoDB, Neo4j, Solr and a SQL DB together.
Limitations of SQL

The 1st generation of NoSQL databases were key-value stores.

NoSQL: it stands for "not only SQL", i.e. non-relational databases. There is no single "NoSQL" database;
it is an umbrella term that brings all the non-relational databases under a single group.
NoSQL databases are designed to overcome the limitations of SQL databases.
There are currently around 150 NoSQL databases, categorized into 12+ categories based on data model.
Out of these categories, the following are very popular in the current market.
•  Key value stores
•  Document stores
•  Columnar / column based
•  Graph databases

Key value stores:

A map (key-value pair association), i.e. a hash table (the same data structure as a key-value pair) on a distributed system.
Some are disk based and some are memory based.
•  Forward indexing (Ex: PDFs): page → words
•  P1 (words)
•  P2 (words)
•  P3 (words)
•  Inverted indexing (Ex: all search engines – Google, Yahoo): word → pages
•  W1 (pages)
•  W2 (pages)
•  W3 (pages)

•  If the application requires a key-value store with a huge number of keys, use a persistent key-value store.
•  Persistent key-value store: inverted indexing applications.
•  Non-persistent key-value store: cache applications. (See the sketch below.)
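To make the forward/inverted index idea concrete, here is a minimal in-memory (non-persistent) key-value sketch in plain Java; the page names and words are made up for illustration, and a persistent key-value store would keep the same word → pages map on disk rather than in a HashMap.

    import java.util.*;

    public class InvertedIndexSketch {
      public static void main(String[] args) {
        // Forward index: page -> words (e.g. what a PDF parser produces).
        Map<String, List<String>> forward = new HashMap<>();
        forward.put("P1", Arrays.asList("hadoop", "hdfs"));
        forward.put("P2", Arrays.asList("hadoop", "yarn"));
        forward.put("P3", Arrays.asList("spark", "hdfs"));

        // Inverted index: word -> pages (what a search engine looks up).
        Map<String, Set<String>> inverted = new HashMap<>();
        for (Map.Entry<String, List<String>> e : forward.entrySet()) {
          for (String word : e.getValue()) {
            inverted.computeIfAbsent(word, w -> new TreeSet<>()).add(e.getKey());
          }
        }

        System.out.println(inverted.get("hdfs")); // [P1, P3]
      }
    }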

Key value store:

Around 2005/2006, key-value stores performed only basic operations like put / get / delete.

JSON object: JSON documents are document oriented. There are 3 data types:
1)  Primitive
2)  Array
3)  JSON object
Ex: { id : 101, name : "abc", age : 25, contact : ['c1','c2','c3'],
      address : { street : "abc", block : "cba", area : "xyz" } }

Address is a key, but its value is another set of key-value pairs: this address is
another JSON object, which is an example of a JSON object within a JSON object.
SQL → Document oriented
database → database
tables → collections
records → documents
fields → key-value pairs
Document oriented:
It is an extended version of the key-value store.
Ex: MongoDB, CouchDB, Couchbase Server, Terrastore, Elasticsearch.

// Illustrative (cleaned-up pseudocode from the notes): reading fields out of a JSON document
JSONObject myJSON = new JSONObject(cursor.next());
String name = myJSON.getString("name");
JSONObject addr = myJSON.getJSONObject("address");
String city = addr.getString("city");
JSONArray contact = myJSON.getJSONArray("contact");
Ex: database = db, collection name = myapp.
{ _id : 10, name : "xyz", age : 25, type : "DOP" }
{ _id : 20, source : "hindi", title : "elections", content : "...", type : "news" }
{ _id : 30, movie : "xyz", cast : [p1, p2, ...], budget : "...", type : "ent" }
db.myapp.find()
db.myapp.find({ type : "ent" })

MongoDB is based on the document data model. Here the document is a JSON document at the API level and a BSON
(Binary JSON) document at storage.
Document = a collection of key-value pairs treated as one object.
The document model is a wrapper over key-value, enclosing the key-value pairs of a particular entity.
A document is atomic at storage.
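As an illustration only (not part of the original notes), this is roughly how the myapp collection above could be queried from Java with the MongoDB Java driver, assuming a local mongod on the default port 27017:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.eq;

    public class MyAppQuery {
      public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoDatabase db = client.getDatabase("db");
          MongoCollection<Document> myapp = db.getCollection("myapp");

          // Insert one document (stored as BSON on disk, exposed as a JSON-like Document).
          myapp.insertOne(new Document("_id", 10).append("name", "xyz")
                                                 .append("age", 25)
                                                 .append("type", "DOP"));

          // Equivalent of db.myapp.find({ type : "ent" })
          for (Document d : myapp.find(eq("type", "ent"))) {
            System.out.println(d.toJson());
          }
        }
      }
    }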

Columnar DB:
In an RDBMS the data is row oriented, so, for example, computing the average price per order id
over a 10 GB table requires reading whole rows.

Choosing a model by workload (OLTP vs OLAP):
70% OLTP / 30% OLAP → document oriented
30% OLTP / 70% OLAP → columnar

Columnar DB: the data is organized as columns C1{ }, C2{ }, C3{ }.

Rows:  1 Naga 29   |   2 Shiva 30

RDBMS (row oriented): 01: (1, naga, 29); 02: (2, shiva, 30)
Columnar: 01:1, 02:naga, 03:29; 01:2, 02:shiva, 03:30 (each cell stored as column-id:value)

Table: a row key plus column families CF1, CF2, CF3, ...; a table can have billions of rows and billions of columns.

A column is a key-value pair with a timestamp, i.e. a triplet (key, value, timestamp).
Ex: share market.
Key "Google share" → value V1 at timestamp T1, V2 at T2, V3 at T3, ..., Vn at Tn.
The timestamp exists because a columnar database has a versioning advantage: data
versioning (ex: the stock market).
It also gives speedier analytical processing.
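As a hedged illustration of the (key, value, timestamp) triplet, the sketch below uses the HBase 2.x Java client; the table name "stockmarket", column family "bse" and the prices are assumptions for the example and would have to exist in your cluster (with the column family created to retain more than one version).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StockVersions {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("stockmarket"))) {

          // Each cell is a (key, value, timestamp) triplet; writing the same
          // row/column with new timestamps keeps older versions (data versioning),
          // provided the column family was created with VERSIONS > 1.
          Put put = new Put(Bytes.toBytes("INFY"));
          put.addColumn(Bytes.toBytes("bse"), Bytes.toBytes("price"), 1L, Bytes.toBytes("2305.80"));
          put.addColumn(Bytes.toBytes("bse"), Bytes.toBytes("price"), 2L, Bytes.toBytes("2400.07"));
          table.put(put);

          // Read back the last 2 versions of the cell.
          Get get = new Get(Bytes.toBytes("INFY"));
          get.readVersions(2);
          Result r = table.get(get);
          r.getColumnCells(Bytes.toBytes("bse"), Bytes.toBytes("price"))
           .forEach(c -> System.out.println(
               Bytes.toString(c.getValueArray(), c.getValueOffset(), c.getValueLength())));
        }
      }
    }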

Redis → built using C.

Redis:

•  It is a persistent key-value store.
•  It is one of the key-value stores.
•  It is an in-memory plus persistent key-value store.

Installing Redis Server

Redis is available for Windows, Linux and Mac.

•  It is written in C.
•  It has clients for all major programming languages.
•  The downloaded software is source code; we need to build it using the make tool.
•  Make (the build tool for C) plays the same role as Ant, a build tool for Java.
•  We can also install Redis using a package installer such as apt-get for Ubuntu or yum for Red Hat.
•  Installing Redis on Ubuntu:
•  Search for the package in the Ubuntu repository using the following command.
sudo apt-cache search redis
→ give the password
→ sudo apt-get install redis-server
The above command downloads, installs and starts the redis server.
To check whether the redis server is running or not:

redis-cli
it will display:
redis 127.0.0.1:6379>
Example table (row key plus column family:column):
Table "stockmarket", row key "INFY 000" → bse:infy = 2305804 (TS1), 2400.07 (TS2)
Table "stockmarket", row key "INFY 001" → nyse:infy = ...
In case of billions of values in a column family, the data is divided into partitions and distributed across
all the computers.
('weather', 'bangalore', temp:temperature) → value, timestamp, i.e. (K, V, T)
Column family name → temp
Column name → temperature
Apache Phoenix → gives a SQL skin over HBase.
Graph database:
vertex/node – entity. Ex: the Oracle of Bacon uses a graph database.
edge – relation.
ex: entity/node → name: "abc"
age: 23
place: "banglr"

edge → relation type: "manager"

relation name: ...

Ex: Neo4j
It has 100% ACID features.

Jedis → the Java client for Redis; it works with key-value pairs (see the sketch below).

To see explanations of the commands, type help.
Swap-file / persistence settings are loaded from the Redis configuration file.
Documentation: redis.io
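Since the notes mention Jedis (the Java client for Redis), here is a minimal sketch assuming a Redis server running on localhost:6379 (as installed above); the key name is illustrative.

    import redis.clients.jedis.Jedis;

    public class RedisQuickstart {
      public static void main(String[] args) {
        // Connect to the local Redis server started by "sudo apt-get install redis-server".
        try (Jedis jedis = new Jedis("localhost", 6379)) {
          // Basic key-value operations: put / get / delete.
          jedis.set("stock:INFY", "2400.07");
          System.out.println(jedis.get("stock:INFY")); // 2400.07
          jedis.del("stock:INFY");

          // Redis is in-memory plus persistent: SAVE forces a snapshot to disk.
          jedis.save();
        }
      }
    }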

•  CAP theorem
•  Document oriented
•  Columnar / key-value stores

Humongous → huge
The name MongoDB is derived from this.

Open source → given to the world and it can be redistributed (it may be commercial or free open source).
Freeware → developed and given to the world, but it cannot be redistributed.
HDFS
HADOOP
WWW – Tim Berners-Lee (1989) → people started exchanging data over the network.
1993
Google → PageRank algorithm; it uses inverted indexing. It is full-text search (Boolean + IR model).

[Diagram: search engine pipeline – crawler → pre-processing → indexer → search.]
GFS: Google File System.
Google built GFS for storage and Google MR (MapReduce) for processing; the map and reduce idea comes from Lisp.

Apache: they built the search library Lucene (an IR library in which the pre-processor, index and
search were available but the crawler was missing; for a search engine, the crawler is the main
intelligence), and then came up with Nutch (2003/2004), which added:
→ NDFS
→ Nutch MR
Yahoo → 1994, mail.

In the early years Yahoo was using the Google search interface. Then Google started its
own mail service, and Yahoo started its own search → (Dreadnaught).

In 2006 Yahoo joined with Apache and created

Hadoop
→ HDFS
→ Hadoop MR
By 2010 Google had also started using Hadoop.

The MapReduce origin is LISP.

Hadoop is a distributed file system and a distributed computing framework.

•  Apache Hadoop is an open source distributed computing framework.
•  It provides two layers: a storage layer and a processing layer.
1.  The storage layer is HDFS (Hadoop Distributed File System); the early name of HDFS was NDFS (Nutch Distributed File System).
2.  The processing layer is MapReduce (MR); its early name was NMR (Nutch MapReduce).
•  Hadoop is designed based on GMR (Google MapReduce).
•  MR and GMR are based on the LISP map and reduce functions.
•  Hadoop was created by Apache and Yahoo in Feb 2006.
•  The creators of Hadoop are Doug Cutting and Mike Cafarella.
•  The entire framework is written in Java.

Lucene → Nutch → Hadoop

Earlier, Hadoop was a sub-project of Nutch, and Lucene is the parent project of Nutch.

Ø  Hadoop is one of the top-level projects of the ASF (Apache Software Foundation)
Ø  Hadoop is write once, read many times (HDFS)
Ø  Hadoop is a batch processing engine (MR)
Ø  Hadoop can crunch terabytes of data in seconds.
Ø  Hadoop has 3 main advantages: time, scale and cost.

Definition:
Apache Hadoop is an open source, distributed, large-scale, reliable framework written in Java,
for creating and deploying large-scale distributed applications which process humongous (huge)
amounts of data on top of a cluster of commodity hardware (low-cost hardware).
Hadoop is now the de-facto standard for distributed computing.

Components of Hadoop:
It has two components:
1.  Core
2.  Ecosystem

Core:
Hadoop-1.X.X – HDFS (storage platform), MapReduce (platform + application), common utilities.
Hadoop-2.X.X – HDFS 2 (platform), MapReduce (application), YARN (platform), common utilities.

Ecosystem:
Pig, Hive, HBase, Sqoop, Oozie, HUE, Mahout, Tez, Ranger, Sentry, Knox gateway, ZooKeeper,
Lipstick, Flume, Falcon, Chukwa, Scribe, Kafka, Azkaban, Ambari, Samza, Drill, Impala, Presto,
Tajo, BookKeeper, Avro, Parquet, Trevni, Cassandra, XASecure, Thrift, MRUnit.

NOTE: Many organizations took the Apache Hadoop code base and redistributed their
own distributions with some customizations and add-ons.
Some of the prominent Hadoop distributions:
1. Cloudera, 2. Hortonworks, 3. MapR, 4. Pivotal [EMC], 5. BigInsights [IBM]

Comparing Hadoop with other systems:

1.  Comparing Hadoop with databases

Databases (RDBMS) vs. Hadoop (core):
a) Scale up vs. scale out
b) Structured data vs. a confluence of many kinds of data (structured, semi-structured, unstructured)
c) Normalized / joins vs. de-normalized
d) Designed for OLTP / point queries / real-time processing vs. OLAP / analytical queries / batch processing
e) Declarative style (SQL) vs. functional programming (MR)
f) Schema on write vs. schema on read
g) Hot data vs. cold data

There is little overlap between the two, so Hadoop will not replace the RDBMS (the traditional data processing
system); it is designed to complement the RDBMS.
The RDBMS cannot handle the entire volume of OLTP data, so older data is moved to the data warehouse, and
Hadoop handles that data.
Comparing Hadoop with the data warehouse and ETL (the data warehouse stack):

Data warehouse stack vs. Hadoop (core):
1.  Scale up vs. scale out
2.  Structured data vs. all varieties of data
3.  De-normalized vs. de-normalized
4.  Cold data / history on both sides
5.  OLAP / analytical queries / batch processing on both sides
6.  Data is multi-dimensional on both sides
7.  Transformation + aggregation on both sides

We can replace the traditional data warehousing stack with Hadoop in 2 ways:
1.  Include Hadoop (as the ETL or the warehouse) in the existing traditional warehousing stack.
2.  Build an end-to-end Hadoop-based warehousing stack.

Fig: warehouse stack – sources (flat files, RDBMS) → ETL (e.g. Informatica, Pentaho) → warehouse (e.g. Teradata, Netezza, Greenplum, Vertica) → Business Intelligence.
Fig: Hadoop used with the existing traditional data warehouse – sources (flat files, RDBMS) → HDFS → ETL → traditional warehouse (e.g. Teradata) → BI.
Hadoop is famous for sorting and joining in this setup.

Some transformations, like data standardization, are difficult in Hadoop (for example, I.B.M, IBM, I,B,M should all
be standardized to IBM, which is difficult with Hadoop algorithms alone).
So Hadoop gives its output as input to the traditional ETL tool, through a connector, where it is transformed
and loaded into the warehouse.
Fig: end-to-end Hadoop warehouse stack – HDFS → transformations T1 ... Tn → HDFS → warehouse → BI.
Hadoop-based warehouses include Hive, Tajo, Drill and Presto.

•  Schema on write → first create the table schema and then load the data.
•  Schema on read → load the data first, then create the schema.

Hadoop vs Grid computing

Grid computing:
Data is stored in one cluster and processed in another cluster.
[Diagram: a "store data" cluster of machines and a separate "process data" cluster.]
If we run a query in grid computing, the cluster where processing happens pulls the data from the cluster
where the data is stored.
But in Hadoop every node both stores the data and processes the data (data locality).
Hadoop
[Diagram: one cluster of machines, each node doing both storage and processing.]

In grid computing we have separate grids for storing and processing, whereas in Hadoop every
node has both storage and processing capability.
Hadoop vs Volunteer computing
Volunteer computing: Ex: SETI@home.
[Diagram: a central server farms work units out to volunteer workstations w1 ... w10.]

Volunteer computing: the data is moved to where the computing takes place.
Hadoop: the computing is done where the data is available.
Analogies:
Ø  Hadoop is an operating system or kernel for data: just as the OS sits between the hardware and the applications, Hadoop sits between raw data and insights.
Ø  Hadoop is a data refinery: just as refinery machinery turns crude oil into products, Hadoop turns raw data into insights.
FEATURES OF HADOOP :
1.  Scalability
2.  Availability
3.  Reliability
4.  Simplicity
5.  Fault – tolerance
6.  Security
7.  Modularity
8.  Robust
9.  Integrity
10.  Accessibility
Ø  Hadoop has all the above features except security (it is not 100% secure).
Ø  Hadoop is distinguished from other distributed systems by its scalability.
Ø  Hadoop is available at practically infinite scale.

Hadoop architecture:
It is nothing but the HDFS architecture + the MapReduce architecture in 1.0, and the HDFS
architecture + the YARN architecture in 2.0.
There is no separate Hadoop architecture.
Hadoop 1.0 architecture Hadoop 2.0 architecture

HDFS Arch HDFS Arch

Map Reduce arch Yarn arch

HDFS architecture : HDFS is a master slave architecture , it has one master and ‘n’ slaves , here
master is name node and slaves are data nodes.
Name node

Data Node Data Node Data Node

NOTE : HDFS architecture is common for both hadoop 1.0 and hadoop 2.0.
In hadoop 2.0 HDFS many have more than one name node ( federation of name node )

Meta data is sliced N1 N2 N3 à Name Nodes


d1 d2 d3 à data node
S1 S2 S3 à Secondary Name node
S1
S2 S3

N1 N2 N3

d1 d2 d3
More than one name node

In Hadoop 1.0 MapReduce is a platform + application, whereas in 2.0 it is only an
application.
MapReduce is a master-slave architecture: it has one master, i.e. the job tracker, and N slaves, i.e.
the task trackers.
Job tracker

Task tracker Task tracker Task tracker

Yarn architecture
In Hadoop 2.0, YARN is introduced as the resource management system; it replaces the MapReduce
platform. YARN stands for Yet Another Resource Negotiator. It is a master-slave architecture with
one master, i.e. the resource manager, and n slaves, i.e. the node managers.

Resource manager

Node manager    Node manager    Node manager

Hadoop 1.0 architecture :


It has two configurations:
1.  Small node cluster (<50 )
2.  Large node cluster (>50 )
Fig : Small node cluster

Job tracker
Name node

Task tracker Task tracker Task tracker


Data node Data node Data node

Fig : Large node cluster

Name node Job tracker

Task tracker Task tracker Task tracker


Data node Data node Data node
Hadoop 2.0 architecture

It also has two configurations:
1.  Small node cluster
2.  Large node cluster
(Hadoop 1.0 daemons = HDFS + MapReduce; Hadoop 2.0 daemons = HDFS + YARN.)
Fig: 1. Small node cluster

Resource Manager
Name Node

Node manager Node manager Node manager

Data node Data node Data node

Fig: 2. Large node cluster

Name Node    Resource manager

Data node Data node Data node

Node manager Node manager Node manager

Building Blocks of Hadoop :


Hadoop has the following architectural building blocks, which are also called daemons.

HDFS daemons (1):
1.  Name node
2.  Data node
3.  Secondary name node
4.  Check point node
5.  Back up node
MapReduce daemons (2):
6.  Job tracker
7.  Task tracker
8.  History server
YARN daemons (3):
9.  Resource manager
10. Node manager
11. Job history server
Hadoop 1.0 daemons = HDFS (1) + MapReduce (2); Hadoop 2.0 daemons = HDFS (1) + YARN (3).
HDFS Daemons :

1.  Name node


Ø  It is a master component of HDFS cluster and one of the master components of Hadoop
cluster.
Ø  It is a point of interaction between HDFS cluster and HDFS clients.
Ø  It is responsible for holding the metadata of HDFS cluster.

Note: HDFS has two types of data: one is the actual data, i.e. the users' data; the second is the
metadata, i.e. data about the users' data.

Ø  The name node is memory intensive as well as I/O intensive.

Ø  The name node is a single point of failure (this can be overcome in various ways).
Ø  It is the heart of the Hadoop cluster.

2. Data node
Ø  It is a slave component of the HDFS cluster and one of the slave components of the Hadoop cluster.
Ø  It is responsible for holding a portion of the actual data.
Ø  It is a dumb node, used only for low-level operations like reading and writing.
Ø  All the data nodes interact with HDFS clients and other data nodes under the direction of the name
node.
Ø  Data nodes send heartbeats to the name node to report their status (alive or dead) as well as other
information. The default interval is 3 seconds, i.e. every 3 seconds each data node informs the name
node; the interval can be configured to the required value, and as the number of seconds increases
the heartbeat I/O decreases.

3. Secondary name node:

In a Hadoop cluster (irrespective of size) we have one secondary name node; it is used to do
the following things:
1) Checkpointing (by default checkpointing happens every 1 hour; it is configurable).
2)  Periodic snapshots or backups.

With the secondary name node we have a backup of the name node metadata, but it is not a
mirror image of the name node's metadata.
Metadata in the secondary name node = from cluster start to the last checkpoint.
So there is a possible metadata loss: everything from the last checkpoint to the current moment.
To overcome this, the secondary name node is decomposed into two nodes:
Ø  one is the check point node (to perform checkpointing);
Ø  the other is the backup node → a real-time mirror image of the name node.

Fig: HDFS – an HDFS write client (WC) loads a 1 GB file /data/xyz with replication factor 1. The name node
records the block metadata (/data/xyz → b1(1), b2(1), ..., b16(1) and the data nodes holding them, e.g. b1[D1],
b2[D2], b3[D3], ..., b16[D3]); the blocks are stored on data nodes D1, D2, D3; the secondary name node (SNN),
or a check point node plus backup node, backs up the metadata.
A file of 1 GB is physically converted into 16 blocks if the block size is 64 MB:
file size / block size = 1024 MB / 64 MB = 16 blocks,
and for every block roughly 150 bytes of name node metadata is created.

Every 3 seconds each data node sends a heartbeat to the name node, and every 10th heartbeat the
data node sends a block report, i.e. information about the blocks it is holding. A small sketch of this
arithmetic follows.
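A quick back-of-the-envelope sketch of that block arithmetic (the ~150 bytes per block entry is the approximate figure quoted in these notes):

    public class BlockMath {
      public static void main(String[] args) {
        long fileSizeMb = 1024;        // 1 GB file
        long blockSizeMb = 64;         // Hadoop 1.0 default block size
        long bytesPerBlockEntry = 150; // approximate name node metadata per block

        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division
        System.out.println(blocks + " blocks");                     // 16 blocks
        System.out.println(blocks * bytesPerBlockEntry + " bytes of metadata (approx)");
      }
    }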
Map Reduce Daemon

Job tracker (JT)
•  It is the master component of the MapReduce cluster and one of the master components of the
Hadoop cluster.
•  It is the point of interaction between the MapReduce cluster and MapReduce clients.
•  It is responsible for two things:
1.  Resource management
2.  Job management
•  It is memory intensive as well as I/O intensive.
•  The job tracker decomposes an MR job into tasks, and these tasks are assigned to task trackers.
•  It is a single point of failure.

Task tracker (TT)
•  It is a slave component of the MapReduce cluster and one of the slave components of the Hadoop
cluster.
•  It is the worker node that launches the tasks assigned by the job tracker, monitors the tasks
and reports their status back to the job tracker.
•  All the task trackers send heartbeats to the JT to report their status.
•  All TTs participate in the distributed computation under the direction of the job tracker.

History server
•  The history server is used to maintain the job history.
•  By default it is enabled as an integral part of the JT.
•  For bigger or production clusters it is better to run the history server as a separate daemon.
How MapReduce works:
An MR client (a compiled Java program packaged as a jar, or the Pig engine, which internally converts
scripts to MapReduce and creates a jar) interacts with the job tracker, which coordinates the task
trackers (TT) running next to the data nodes (DN); each task tracker launches JVMs for its tasks.

(1)  Registration of the job with the job tracker.

(2)  A registration ID is generated, and the client is asked to copy all the required MR files to the staging
directory, from where they are copied to the task trackers.
(3)  Every task tracker creates JVMs based on the resources available.

Yarn daemons:
Resource manager
•  It is the master component of YARN and one of the master components of the Hadoop cluster.
•  It is the point of interaction between the YARN cluster and YARN clients.
•  Examples of YARN clients → MapReduce, Spark, Storm, graph processing, MPI, Tez.
•  YARN is a resource management system, and the resource manager is responsible for resource
scheduling.
•  The resource manager is a single point of failure.

Node manager
•  It is a slave component of YARN and one of the slave components of the Hadoop cluster.
•  It is responsible for launching containers, monitoring containers and killing containers.
•  It sends heartbeats to the resource manager to report its status and the resource-related
information of the node.
Container: a pool of resources (CPU, RAM, disk, I/O, etc.).
Application Master:
•  In YARN, to support multiple paradigms like MR, Tez, Spark etc., we have paradigm-specific
application masters.
•  A paradigm-specific application master decomposes an application into one or more
jobs, and a job into one or more tasks.
•  It does job management or application management.

History server:
In YARN the history server is not an integral part of the resource manager; it must be run separately.

How YARN works (with the name node involved for block locations):
(1) The YARN client registers the application with the resource manager.
(2) The resource manager returns an application registration id.
(3) The client copies the application resources (e.g. the MR or Spark job jar) to HDFS.
(4) The client informs the resource manager that the resources are copied to HDFS.
(5) The resource manager initializes the application.
(6) The resource manager allocates a container on a node manager and the Application Master (AM) is launched in it.
(7) The AM registers with the resource manager.
(8) The AM interacts with the name node to get the block locations.

The AM asks the RM for the required resources; the RM gives containers to the AM; the AM then launches
the containers based on data location (which is received from the name node).

Uber mode: if there is only one task, the AM launches the task inside itself instead of asking the RM
for resources.
Once the tasks are completed, the node manager kills the containers.

Hadoop cluster setup:

For installing Hadoop in any mode we need the following prerequisite software.

Mandatory:
1.  Operating system (*nix / Windows / Mac)
2.  Java JDK (1.6 / 1.7)
3.  SSH (Secure Shell)
4.  rsync
Optional:
5.  Vim editor
6.  Eclipse IDE
1. Operating system:
•  Hadoop is mainly designed on top of *nix operating systems; it is also available on
other operating systems like Windows, with some constraints or limitations.
•  The majority of deployments are on Linux, so prefer a Linux OS like Ubuntu.

Installing Ubuntu :
To get the Ubuntu OS we have two ways :
1. Dual boot
2. VM ware player

Step 1: Create a new hard disk partition, or free up and use an existing partition.
Step 2: Download the Ubuntu ISO image from the Ubuntu website (choose 32-bit or 64-bit based
on your system architecture). After downloading, convert the ISO image into a bootable image
using PowerISO (freeware) and write it out.
NOTE: The bootable image is created on an external storage device like a USB stick or hard disk, so the
external device has to be formatted before use.
Step 3: Restart the computer with the external device inserted; during restart, open the boot
menu with the appropriate key and change the boot priority to the external device.
Step 4: This starts the Ubuntu installation. During installation choose options such as
"Install Ubuntu alongside Windows" and enter the user details, then
restart the computer, changing the boot priority back to the internal hard disk.

Installing Java:
NOTE: After installing Ubuntu, to install any application we have the package installer apt-get
(for CentOS / Red Hat it is yum). We need to update the package index before installing any
third-party software:
sudo apt-get update
To install any software use the following command:
sudo apt-get install <package>
To search for any package or software in the Ubuntu repository use:
sudo apt-cache search <package>
To remove the software use:
sudo apt-get remove / purge <package>

What is inside Hadoop :


The downloaded Hadoop ( binaries ) contains the following files and directories.
bin – executables of Hadoop
lib – Jar
conf – configuration files
C++ - pipes API
src – source code of Hadoop project
contrib – contributed projects on top of Hadoop
logs – logs directory for daemons + Jobs
To install and configure Hadoop we are going to modify some files in conf directory.

Installing Hadoop in local / standalone mode:

To extract the Hadoop tar file:
tar -xvf hadoop-1.2.1.tar.gz

Local mode → MR paradigm only → no HDFS.

The downloaded Hadoop is all set to run in local mode.

Edit ~/.bashrc:
vim /home/$USER/.bashrc
(:$ takes you to the end of the file)
Then add the following lines:
export JAVA_HOME=<path where Java is installed>
export HADOOP_HOME=/home/radha/bigdata/hadoop-1.2.1   (the path where Hadoop was extracted)

If any software is installed via the package manager, its binaries go to the bin directory;
/usr/lib is where installed packages end up.

To reinitialize the shell:
source $HOME/.bashrc   (OR)   source ~/.bashrc
To check whether the variable is set: echo $JAVA_HOME
To run Hadoop in local mode, Hadoop requires the Java environment variable, i.e. JAVA_HOME.
To set this JAVA_HOME variable we have a file in the conf directory of Hadoop, i.e.
hadoop-env.sh

Testing whether Hadoop is running in local mode:
hadoop jar <examples jar> ...

start-all.sh
starts NN / JT on the local system
and DN / TT / SNN on the remote machines.

Installing Hadoop in pseudo mode

This mode requires password-less ssh; it is used to start and stop the Hadoop daemons
without a password.
Generating password-less ssh:
ssh-keygen -t rsa
This generates two keys:
1.  Public key
2.  Private key
Go to the .ssh directory:
cd $HOME/.ssh
ls
Two keys will be displayed:
id_rsa id_rsa.pub

Copy the public key into a file called authorized_keys.

Testing password-less ssh

The following configuration parameters are overridden to install Hadoop in pseudo mode
(a basic installation – not for production, only for development and testing):

fs.default.name
mapred.job.tracker
dfs.replication
dfs.name.dir
dfs.data.dir

Ø  fs.default.name → This parameter tells where the name node is running. The default value
of this parameter is the local file system, i.e. file:///.
The overriding value is hdfs://<hostname>:<port>/ (preferred ports are 9000 or 8020).
This overriding parameter is placed in core-site.xml.
Ø  mapred.job.tracker → This parameter tells where the job tracker is running. The default
value is "local".
The overriding value is <hostname>:<port> (9001 or 8021).
We place the overriding parameter in mapred-site.xml.
Ø  dfs.replication → Tells the replication factor of HDFS. The default value is 3; the overriding
parameter is placed in hdfs-site.xml. The maximum replication is 512.
Ø  dfs.name.dir → This parameter tells the hard disk location where the name node metadata (HDFS metadata)
is stored.
The default value is /tmp/<some dir>; the overriding value is a permanent location on the hard disk,
e.g. $HOME/bigdata/hdfsdata/name.
The overriding parameter is placed in hdfs-site.xml.
Ø  dfs.data.dir → This parameter tells the hard disk location of the actual HDFS data.
The default value is /tmp/<some dir>; the overriding value is some permanent location on the hard disk,
e.g. $HOME/bigdata/hdfsdata/data.
The overriding parameter is placed in hdfs-site.xml.
(A programmatic sketch of these overrides follows.)
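The overrides above normally live in the *-site.xml files; purely as an illustration, the same values could be expressed programmatically through org.apache.hadoop.conf.Configuration (the hostname and directories below are placeholders):

    import org.apache.hadoop.conf.Configuration;

    public class PseudoModeConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // core-site.xml: where the name node runs (default is the local file system, file:///)
        conf.set("fs.default.name", "hdfs://localhost:9000/");

        // mapred-site.xml: where the job tracker runs (default is "local")
        conf.set("mapred.job.tracker", "localhost:9001");

        // hdfs-site.xml: replication factor and metadata / data directories
        conf.set("dfs.replication", "1");                             // pseudo mode: single node
        conf.set("dfs.name.dir", "/home/user/bigdata/hdfsdata/name"); // placeholder path
        conf.set("dfs.data.dir", "/home/user/bigdata/hdfsdata/data"); // placeholder path

        System.out.println(conf.get("fs.default.name"));
      }
    }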

To install Hadoop in Pseudo mode we need to modify the following files :


1.  hadoop-env.sh
2.  core-site.xml
3.  mapred-site.xml
4.  hdfs-site.xml
5.  slave
6.  master

1. hadoop-env.sh
This file is used to place Hadoop related environment variable like JAVA_HOME,
name node env variables , data node env variables etc.
2.  core-site.xml
This file is used to place Hadoop cluster parameter ( overriding parameter of cluster or
core).
By default the file is empty xml file.
3. mapred-site.xml
This file is used to place the overriding parameter of Map Reduce group.
By default it is empty xml file.
4.  hdfs-site.xml
This file is used to place the overriding parameters of the HDFS group.
By default it is an empty xml file.
5. slaves
This file is used to place the slave machine names (data node + task tracker),
one machine name per line.
6. master
This file is used to place the machine name of the secondary name node.

Ø  Password-less ssh should be set up.

Ø  Go to the bigdata directory and create the HDFS data and MR data directories.
Ø  Change permissions of the Hadoop HDFS data directories to 755 (owner / group / others):
chmod -R 755 *

Edit hdfs-site.xml, mapred-site.xml and core-site.xml.
Once all modifications are done, format HDFS using the following command:
hadoop namenode -format
Do not run the format command again from now on (it is a one-time command).
Now we are ready to start Hadoop in pseudo mode using the following scripts:
1.  start-all.sh → all daemons
2.  start-dfs.sh → only the HDFS daemons
3.  start-mapred.sh → only the MapReduce daemons

Testing HDFS

hadoop fs -ls /

Load a file using:

hadoop fs -put CHANGES.txt /data/change
Open the web interface (served by Jetty).

Testing MapReduce
To switch from pseudo mode to local mode:

hadoop --config localconf jar hadoop-examples-1.2.1.jar wordcount CHANGES.txt wordcount
(local input CHANGES.txt, local output directory wordcount)

Local mode to pseudo mode:

hadoop jar hadoop-examples-1.2.1.jar wordcount <hdfs input file> <hdfs output file>

Task tracker web address → port 50060

To re-format:
stop-all.sh
rm -rf <hdfs data directories>
jps
hadoop namenode -format
To stop Hadoop we have the following scripts:

stop-all.sh
stop-dfs.sh – stops the name node, data nodes and secondary name node
stop-mapred.sh – stops the job tracker and task trackers
HADOOP
Cluster Setup
Hadoop Distributed File System:
•  It is a distributed file system designed based on Google's GFS; its early name was NDFS
(Nutch Distributed File System).
•  HDFS is one of the core components of Hadoop and covers distributed storage in Hadoop.
•  It is not an operating system or kernel file system; it is a user-space file system.
•  HDFS is a cheaper, cost-effective mechanism to store hundreds of terabytes or petabytes
with a good replication factor, compared to traditional storage systems (SAN, NAS, RAID).
•  HDFS is designed for write once, read many times.
•  HDFS is a flat file system and it provides sequential access.
•  Definition of HDFS: it is an open source, distributed file system (multi-terabyte or petabyte),
written in Java, to store vast amounts of data on top of a cluster of commodity hardware in a
highly reliable and fault-tolerant manner.

HDFS design principles or goals:
It is designed based on the following assumptions :
1.  Handling hardware failure
2.  Fault tolerance
3.  Streaming data access
4.  Large blocks
5.  Large files
6.  High access/ Read throughput
7.  Write once and read many times
8.  Deploy on commodity hardware
9.  Independent of software and hardware architecture
10. Reliability and scalability
11. Simple coherency model

JBOD → Just a Bunch Of Disks; this is the disk configuration typically used.

POSIX interface: it allows streaming data access; it is a wrapper over HDFS.

1. Handling hardware failure:

•  HDFS is designed to be deployed on commodity hardware.
•  The probability of failure is very high on commodity hardware.
•  To handle these hardware failures HDFS is designed in a more robust fashion with the help of
replication factors.
2. Fault tolerance :
It is highly fault tolerant because of its replication factor.
3. Streaming data access :
HDFS offers streaming data access with the help of Posix interface on top of it.
4. Large blocks:
HDFS represents data in terms of blocks of a large size, to trade off seek time against transfer time
and reduce I/O (i.e. to reduce the number of seeks).
Ex: In Linux the block size is 4 KB, so 1 TB of data would be stored as a huge number of 4 KB blocks, and
seek and transfer overhead would be high. HDFS solves this: with a 128 MB block size,
each 128 MB block is stored in Linux as contiguous blocks.
5. Large files:
HDFS is a scalable distributed file system designed to store files of terabyte size,
because of its write-once, read-many-times semantics.
•  Large files are optimal for HDFS in terms of less HDFS metadata and fewer seek operations.
•  HDFS is optimized for handling large files, not small files.
6. High access throughput:
It offers high access/read throughput because there are few writers and a good
replication factor.
7. Deploy on commodity hardware:
It is designed to be deployed on commodity hardware, though we can still use enterprise hardware
(the disk used in commodity hardware is SATA).
8. Simple coherency model:
HDFS behaves like a single window: from every data node we can view the
entire HDFS, i.e. a simple coherency model.

HDFS limitations:
While achieving the above design principles, HDFS is constrained by the following
limitations:
1.  Small files → more metadata, more Java processes, more seeks.
2.  No random access → HDFS is a flat file system; access is sequential by default.
3.  High latency → because of preparing and gathering metadata, the client has to
communicate with N data nodes.
4.  No multiple writers (i.e. a single writer per file).
5.  Does not allow arbitrary file modification or record-level inserts, updates and deletes, but
it does allow append.

HDFS concepts:
HDFS metadata is managed by the name node, and this metadata contains the following things:
file to block mapping
block to machines mapping
other block-related parameters (like replication)

Every block entry requires approximately 150 bytes.

Ex: metadata
file → block (properties) → machines
/xyz – b1(P) → (slave1, slave2, slave3)
/xyz – b2(P) → (slave3, slave4, slave5)
:
:
/abc – b20(P) → (slave5, slave7, slave9)

This entire metadata sits in the name node's primary memory, i.e. RAM.
Persistence of the file system namespace / fsimage / metadata:
Though the name node keeps the metadata in memory, it also writes this metadata into two files which
are stored under dfs.name.dir (i.e. ${dfs.name.dir}/name/current), and those two files are edits and
fsimage; this writing process happens periodically.

Edits:
Edits is a small, temporary, windowed file used to record all client interactions with HDFS. By
default all live transactions (metadata creations, updates, deletions) go to the edits file; the
window size is 1 hour or 64 MB by default.

FS image:
•  It is the complete metadata image.
•  It is a read-only file with respect to clients.
•  It is used for metadata lookup (for lookup, both the FS image and the edits are required).
•  There is a small difference between the in-memory version of the metadata and the disk version:
the in-memory version has block-to-machine info, whereas the disk version does not, because
block-to-machine info is more volatile.
NOTE: "Every 3 seconds the data nodes send heartbeats, and every
10th heartbeat is a block report, which is sent to the name node."

Memory version:  file → block (properties) → machines
  xyz → B1 (64 MB, 3) → D1, D3, D5
  xyz → B10 (128 MB, 5) → D1, D2, D5, ..., D10

Disk version:  file → block (properties) only

•  If the cluster is stopped and restarted, the name node retrieves the disk version of the metadata into
memory.
•  At this moment there is no block-to-machine info in the metadata. Once the cluster is started, all the
data nodes send heartbeats to the name node; every 10th heartbeat is a block report, in which each data
node sends the blocks it contains. The name node consolidates these individual block reports and
finally maps blocks to machines in the in-memory version of the metadata.
•  If the metadata is 5 GB and RAM is 4 GB, the name node will not start.

Ø  In Hadoop 1.0 there is single name node.


Ø  In Hadoop 2.0 we can use distributed name node.

Checkpointing:
•  It is a process done by the secondary name node or the check point node.
•  It is used to merge the edits file into the FS image and reset the edits.
•  After completion of this process the secondary name node pushes the result back to the name node.
•  This checkpointing process happens every 1 hour (configurable):
FS image = FS image + edits
edits = reset

[Diagram: the name node holds the live edits and FS image; the secondary name node downloads them,
merges them, and uploads the new FS image back.]

Ex: If checkpointing starts at 1.00 PM, the secondary name node performs the checkpoint by
merging the FS image and edits and then resetting the edits. Suppose the secondary name node finishes
and hands the new FS image and edits back at 1.02 PM; during those 2 minutes the live edit data is
written to a temporary file, edits.new, on the name node. Once the merged edits file is back, the
name node renames edits.new to edits and continues writing into it.
Checkpoint-related parameters are available in core-site.xml; fs.checkpoint.period in
core-site.xml is where we can change the interval at which checkpointing is done.

Cluster balancing:
•  HDFS offers a balancing algorithm to balance storage across the HDFS data nodes. The balancer
identifies over-utilized and under-utilized nodes in the Hadoop cluster and moves the
data from the over-utilized nodes to the under-utilized ones.
•  The balancer is applicable only in fully distributed mode.
•  The balancer requires more bandwidth.

Protocols:
HDFS has several protocols which are used in the IPC (Inter Process Communication) of the
HDFS components (name node, data nodes, clients); these protocols are built on Hadoop's
IPC stack.
IPC = RPC + serialization
(RPC = Remote Procedure Call)
Hadoop has its own RPC engine, and the serialization framework is Google Protocol Buffers.

The following are the protocols which are used in HDFS IPC.

•  Client protocol.
•  Client data node protocol.
•  Data transfer protocol.
•  Data node protocol.
•  Inter data node protocol.
[Diagram: (1) client ↔ name node (client protocol); (2) client ↔ data nodes (client data node protocol);
(3) client ↔ data nodes data transfer (data transfer protocol); (4) data nodes → name node (data node
protocol / heartbeats); (5) data node ↔ data node (inter data node protocol).]

1.  Client protocol:

Clients use the client protocol to establish communication with the name node, and the name node
responds.
2. Client data node protocol:
Clients use the client data node protocol to establish communication with the data nodes, and the data
nodes respond.
3. Data transfer protocol:
This protocol is used between clients and data nodes for transferring the data
(writing and reading).
4. Data node protocol:
All data nodes use the data node protocol to establish communication with the name node, and the name
node responds. This is also called the heartbeat mechanism.
5. Inter data node protocol:
This protocol is used by one data node to communicate with another data node.

NOTE: The name node never initiates any protocol. In an HDFS cluster there are several IPC calls; the
name node is always a participant in these calls, but it never initiates communication
with other components of HDFS.

Heartbeats:
HDFS has a heartbeat mechanism used to know the health or status of the data nodes.
The data nodes initiate this process and report their status to the name node. It happens
periodically, by default every 3 seconds (configurable).
The protocol used for this process is the data node protocol.

Block report:

This process is used by all data nodes to inform the name node which blocks they
contain; every 10th heartbeat (configurable) is a block report.
Block scanner report:
This process is used by the data node to validate its blocks and inform the name node.
It uses the checksum process to produce the block scanner report.

Re-replication and replication:

Replication:
In HDFS, data replicates at the time of data writing or loading.

Re-replication:
It is a process that happens in the following scenarios:
1.  Under replication
2.  Over replication
3.  Missing replicas

(1) Replication:
Ex: file /xyz → 200 MB.
Block size 64 MB, 200 MB / 64 MB → 4 blocks, i.e. b1 b2 b3 b4, with replication factor = 3.
Metadata (on the NN):
/xyz → b1 (3) → D1, D3, D4
/xyz → b2 (3) → D1, D2, D4
/xyz → b3 (3) → D2, D3, D4
/xyz → b4 (3) → D1, D2, D3

Data nodes:
D1: b1, b2, b4    D2: b2, b3, b4    D3: b1, b3, b4    D4: b1, b2, b3

fig (1): Here all blocks are properly replicated.


(2) Under replication:
If data node D1 is dead:
/xyz → b1 (3) → D3, D4
/xyz → b2 (3) → D2, D4
/xyz → b3 (3) → D2, D3, D4
/xyz → b4 (3) → D2, D3
Blocks b1, b2, b4 are under-replicated, whereas b3 is properly replicated.

D1: dead    D2: b2, b3, b4    D3: b1, b3, b4    D4: b1, b2, b3

In this scenario the name node re-replicates the data by instructing the data nodes.
Metadata after re-replication:
/xyz → b1 (3) → D3, D4, D2
/xyz → b2 (3) → D2, D4, D3
/xyz → b3 (3) → D2, D3, D4
/xyz → b4 (3) → D2, D3, D4

D2: b1, b2, b3, b4    D3: b1, b2, b3, b4    D4: b1, b2, b3, b4

"Now all the blocks are properly replicated."


(3) Over replication:
If data node D1 comes back:
/xyz → b1 (3) → D3, D4, D2, D1
/xyz → b2 (3) → D2, D4, D3, D1
/xyz → b3 (3) → D2, D3, D4
/xyz → b4 (3) → D2, D3, D4, D1

D1: b1, b2, b4    D2: b1, b2, b3, b4    D3: b1, b2, b3, b4    D4: b1, b2, b3, b4

fig: over replicated

Blocks b1, b2 and b4 are over-replicated and b3 is properly replicated. The name node asks the data nodes
holding the over-replicated blocks to delete the extra replicas.

Metadata after deleting the extra replicas:
/xyz → b1 (3) → D3, D4, D1
/xyz → b2 (3) → D2, D4, D1
/xyz → b3 (3) → D2, D3, D4
/xyz → b4 (3) → D2, D3, D1

D1: b1, b2, b4    D2: b2, b3, b4    D3: b1, b3, b4    D4: b1, b2, b3

"Now all the blocks are properly replicated."


(4) Missing replicas:
If data nodes D1 and D2 are dead:
/xyz → b1 (3) → D3, D4
/xyz → b2 (3) → D4
/xyz → b3 (3) → D3, D4
/xyz → b4 (3) → D3
All 4 blocks are under-replicated.

D1: dead    D2: dead    D3: b1, b3, b4    D4: b1, b2, b3

no. of missing replicas = |(replication factor – data nodes available)| × blocks under-replicated

After re-replication:
/xyz → b1 (3) → D3, D4
/xyz → b2 (3) → D4, D3
/xyz → b3 (3) → D3, D4
/xyz → b4 (3) → D3, D4

D3: b1, b2, b3, b4    D4: b1, b2, b3, b4

Although the name node re-replicates the data, 4 replicas are still missing (1 replica for each block).
This is only because no more data nodes are available to hold them.

The missing replicas scenario occurs when the number of data nodes is less than the replication factor.

No. of missing replicas = |(replication factor – data nodes available)| × blocks under-replicated. (A worked example follows.)
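Plugging the example above into the formula (replication factor 3, only D3 and D4 alive, 4 blocks under-replicated):

    public class MissingReplicas {
      public static void main(String[] args) {
        int replicationFactor = 3;
        int dataNodesAvailable = 2;   // D1 and D2 are dead, D3 and D4 remain
        int blocksUnderReplicated = 4;

        int missing = Math.abs(replicationFactor - dataNodesAvailable) * blocksUnderReplicated;
        System.out.println(missing + " replicas missing"); // 4, one per block
      }
    }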

Missing blocks:
This scenario occurs when all replicas of a particular block (or blocks) are missing; that is a
missing block.
Ex: Assume b1 has 3 replicas r1, r2, r3 on d1, d2, d3 respectively. If d1, d2 and d3 die
simultaneously, the name node no longer has any copy of block b1, i.e. b1 is
missing.
The name node can never recover from the missing block scenario.

Data integrity:
HDFS achieves data integrity with the help of checksums. This checksum concept is applied at the
block level.
Ø  At the time of loading, for every block the HDFS client generates a checksum (of type MD5 in these notes)
and stores this checksum in a file blockid_<random no>.meta along with the actual block file.
Ø  This checksum adds an overhead of 512 bytes per block.
Ø  At the time of data retrieval, the HDFS client generates checksums for the blocks on the fly,
and these on-the-fly checksums are compared with the checksums created at the time of data loading.
Ø  If the checksums match, those blocks are called validated blocks; otherwise the
blocks are corrupted.
The block scanner report uses the same process.
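A minimal sketch of the load-time vs. read-time checksum comparison, using java.security.MessageDigest with MD5 (the digest named in these notes; real HDFS keeps its checksums in the .meta file, and the block file path below is a placeholder):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.security.MessageDigest;
    import java.util.Arrays;

    public class BlockChecksum {

      static byte[] checksum(Path blockFile) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        return md.digest(Files.readAllBytes(blockFile));
      }

      public static void main(String[] args) throws Exception {
        Path block = Paths.get("blk_1234");          // placeholder block file
        byte[] storedAtLoadTime = checksum(block);   // what would be kept in the .meta file at load time

        // Later, at read time, recompute on the fly and compare.
        byte[] onTheFly = checksum(block);
        if (Arrays.equals(storedAtLoadTime, onTheFly)) {
          System.out.println("block validated");
        } else {
          System.out.println("block corrupted");
        }
      }
    }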

Disk failures:
HDFS handles disk failures for both the name node and the data nodes as follows.
Name node: to handle name node disk failures (i.e. to protect the metadata) we use the following approaches:
1.  Having RAID as the storage device.
2.  Writing the metadata to multiple hard disks.
To write metadata to multiple hard disks we use the parameter dfs.name.dir, whose value is a
comma-separated list of locations (location1, location2, location3).

Data node:
Data node disk failures are handled in the following ways:
1.  Having a good replication factor.
2.  Having multiple hard disks for every data node (2 or more).

[Fig (1): each data node D1..D4 with a single hard disk. Fig (2): each data node D1..D4 with two hard disks HD1 and HD2.]

In fig (2) every data node has 2 hard disks; the data node writes 100% of its data to both HD1 and
HD2 (HD2 is a mirror of HD1, and vice versa). This not only handles disk failure, it also speeds
up I/O by doubling parallel I/O.
To provide this kind of setup we use the parameter dfs.data.dir, with the value (HD1
location, HD2 location). The HD1 location (absolute path) has to be the same on all data nodes, and
likewise the HD2 location.
The disadvantage of this setup is that it increases storage cost.
The advantages are:
1.  Handling disk failures.
2.  Parallel I/O, which increases availability.

Data replica placement:

Replica placement uses the following criteria:
1.  Topology distance
2.  Disk utilization

For a small HDFS cluster with one rack, the above criteria are applied directly.
For a bigger HDFS cluster with multiple racks, the data replica placement criteria are as follows.

Suppose block b1 has replicas r1, r2, r3. The placement in a multi-rack Hadoop cluster works like this:
The first replica goes to any one of the racks, say R1 (rack 1); inside R1 it goes to any one of
the nodes, say D5 (based on topology distance and disk utilization), so the first replica is on R1/D5.
The second replica goes to a different rack, say R4; inside R4 it goes to any one of the data
nodes, say D10, i.e. R4/D10. The third replica goes to the second replica's rack, R4, but to a
different data node than the second replica's, say D15, so the third replica is on R4/D15.
This concept is applicable to any block.
NOTE: This is HDFS's default replica placement policy for a multi-rack environment:
1/3 of a block's replicas go to one rack and 2/3 go to another rack.

Short circuit reads:

In HDFS, data reads normally happen via the data nodes. If we bypass the data node while reading the
data, that read is called a short circuit read. These reads speed up the reading process (it
will be faster).
The client should be co-located with the data node.
[Diagram: client ↔ name node; a regular read goes through the data node, a short circuit read goes
directly to the block file on the local disk.]

Synthetic load generator:

It is a process to test the name node's capabilities by applying a virtual load instead of physical
data.
Data organization:
HDFS organizes data in terms of blocks. All these blocks are of fixed size, i.e. 64 MB by default
(Hadoop 1.0) and 128 MB (Hadoop 2.0); a 64 MB block's data is represented in the OS file system
as contiguous 4 KB blocks.

In HDFS, for a file of n blocks, the first (n – 1) blocks are exactly the block size and the nth block is less
than or equal to the block size.

Fig : HDFS write path for a 1 GB file with replication = 3 — the client buffers the data, computes checksums over a checksum window, and flushes block by block ( B1, B2, B3 ); each block is stored with its metadata on the data nodes ( D1–D4 ) as replicas r1, r2, r3, while the client exchanges metadata queries with the name node.
HDFS client / HDFS accessibility :
HDFS has several clients; some of them are :
1.  FS shell
2.  Admin shell
3.  Web interface ( read only )
4.  Web HDFS ( REST interface; see the example after this list )
5.  Java HDFS client API
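A hedged Web HDFS example ( it assumes WebHDFS is enabled on the name node and that 50070 is the name node HTTP port; <namenode-host> is a placeholder ) :
curl -i "http://<namenode-host>:50070/webhdfs/v1/user/$USER?op=LISTSTATUS"
This lists the user's default HDFS directory, similar to hadoop fs –ls.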

HDFS Commands :
HDFS has several commands; these commands are used to perform file operations, archival,
backup and administrative operations.
HDFS commands are categorized into two groups :
1.  User commands
2.  Admin commands

User commands :
HDFS has the following user commands to perform file operations, backups, archival etc.
1. fs
2. fsck
3. archive
4. distcp / distcp2

Hadoop usage : hadoop <command> [generic args] [command args]

hadoop is an executable shell script and it is available in the bin directory of $HADOOP_HOME.
With the hadoop script we have several commands; these commands come from HDFS, Map Reduce
and general utilities.

HDFS path : The HDFS file system starts with root ( / ).

HDFS has a default path for every user, i.e. /user/$USER.
To see the environment variables give the command env.

Absolute path of HDFS : hdfs://<host name>:<port>/

The hdfs root is ‘/’.
Relative path of hdfs : /user/$USER.
NOTE : HDFS is not a kernel file system; there is no executable option for files, while there is an
executable option for directories ( to access their children ).
In HDFS, commands like cd, pwd and many other Linux commands won't work.
In HDFS a directory is an empty file and the size occupied by a directory is 0 bytes.
fs :
fs is a generic file system user client, used to perform file system operations.
NOTE : The dfs command is a variant of fs; dfs works only on the distributed file system.
fs has several command options to perform individual file and directory operations like file listing,
directory creation, file writing, file reading etc.

ls, lsr à these two are used to list the files under a specified path in HDFS; lsr lists recursively
under the specified path.

Usage of ls :

Ex : hadoop fs –ls /
hadoop fs –ls <path>

lsr :
Usage : hadoop fs –lsr <path>
Ex : hadoop fs –lsr /
NOTE : We can use HDFS commands and Linux commands in combination with the help of
Unix / Linux pipes.

Ex : hadoop fs –lsr / | grep <directoryname>


mkdir : To create directories under a specific path in HDFS.

Usage :
hadoop fs –mkdir <path>/<dirname>
Ex : hadoop fs –mkdir /stocks à under root.
hadoop fs –mkdir nyse à under the default path.
(OR)
hadoop fs –mkdir /user/$USER/nyse
There are 3 commands for loading data :
1.  copyFromLocal
2.  put
3.  moveFromLocal
All these 3 are used to load data from the local system ( Linux ) to HDFS.
copyFromLocal and put are the same; these two are like copy and paste ( they load the file to the
destination and keep a copy at the source ).
moveFromLocal : like cut and paste ( cut from local and paste into HDFS ).
hadoop fs –put <source> <dest>

( <source> is some path in Linux, <dest> is some path in HDFS )

Ex : hadoop fs –put README.txt NOTICE.txt /stocks/

moveFromLocal :

Usage : hadoop fs –moveFromLocal <source> <dest>
Ex : hadoop fs –moveFromLocal wordcount /

du , dus :
These are used to get the space consumed by files and directories under a specified path.
du gives individual usages.
dus gives a summary / consolidated usage.
Usage : hadoop fs –du <path>
hadoop fs –du /
hadoop fs –dus /

Count :
It is used to get the number of directories, number of files, size, and quota information
( space quota + file quota ).
Usage : hadoop fs –count /
hadoop fs –count –q /

cp : It is used to copy files within HDFS ( HDFS to HDFS ).

Usage : hadoop fs –cp <source> <dest>
Ex : hadoop fs –cp /logs /user/$USER/
mv : It is used to move files within HDFS ( HDFS to HDFS ).
Usage : hadoop fs –mv <src> <dst>
Ex : hadoop fs –mv /logs /mylogs

rm , rmr : used to remove files or directories.

rm – only files.
rmr – a directory and its files.

Usage : hadoop fs –rm [-skipTrash] <path>

[-skipTrash] can be used only if trash is enabled.

If [-skipTrash] is used then the deleted files are permanently deleted ( like shift-delete ).
If [-skipTrash] is not used then the deleted files go to the trash.
expunge : It is used to clear the HDFS trash folder ( the folder name is .Trash ).
Usage : hadoop fs –expunge

get , copyToLocal :

These are used to copy data from HDFS to the local system ( Linux ).
Usage : hadoop fs –get [-ignoreCrc] [-crc] <src> <localdst>

[-ignoreCrc] skips the cyclic redundancy ( checksum ) check; by default the checksum check is enabled.

Ex : hadoop fs –get /stocks/README.txt
hadoop fs –get /stocks/README.txt /home/$USER/read
To find the difference between two files ( locally ) : diff file1 file2.
cat : It is used to print / display the content of a file.
Usage : hadoop fs –cat <path>
Ex : hadoop fs –cat /stocks/README.txt
NOTE : Don't use cat on large files; use the tail command.

tail : It is used to get the last 1 KB of data of a file under the specified path.
Usage : hadoop fs –tail <path>
Ex : hadoop fs –tail /stocks / README.txt

touchz : It is used to create an empty file in HDFS.

Usage : hadoop fs –touchz <path>/<file>
The file must be a zero-length file.
Ex : hadoop fs –touchz Readme

setrep : It is used to alter the replication factor of files and directories which already exist in
HDFS.
NOTE : In HDFS, replication is applicable only to files and not to directories.
Usage : hadoop fs –setrep [-R] [-w] <rep> <path>/<file>
Ex : hadoop fs –setrep –R 3 /
In this case the replication for all the files under root is set to 3.

chmod , chown , chgrp : these are used to alter file permissions ( access ), ownership, group etc.

HDFS file permissions : HDFS has file permissions like Linux; every file has an owner, a group, others
and a file type.
1st column – file permissions
2nd column – replication
3rd column – owner

chown à to change the owner.

Usage : hadoop fs –chown [-R] [owner][:[group]] <path>

if new and old user are in same group then new request
chmod à to change the file permissions of owner, group and others.
Usage : hadoop fs –chmod [-R] <mode> <path>

In HDFS the execute bit ( x ) is 0 for a file and 1 for a directory.

For example, a file with permissions rw- rw- r-- has the octal mode 664.

chgrp à It is used to change the group.


usage : hadoop fs –chgrp –R group <path>

-help à to get the help of any fs command option


hadoop fs –help <command>

fsck :
•  This command is used to check the healthiness of the file system under a specified path.
•  This command provides the following info : number of directories, files, size, blocks,
validated blocks, corrupted blocks, under-replicated blocks, over-replicated blocks, missing blocks and
the default replication.

Usage :
hadoop fsck <path> [-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]

hadoop fsck / -files -blocks -locations

This command gives the locations of each block.
hadoop fsck / -files -blocks à gives the information of the blocks.

[-openforwrite] à gives info on the files that are undergoing the writing process; by default fsck will
display only the files that are already written.
hadoop fsck / -delete à this command deletes only the corrupted files.
hadoop fsck / -move à this moves the corrupted files to /lost+found.
distcp / distcp2 à distcp stands for distributed copy.
NOTE : All operations with fs are sequential.

distcp is a Map Reduce job ( map only ) for migrating data from one system to another
system in parallel.
It is used in 2 scenarios :
1.  Data loading from a source system to HDFS.
2.  Data backup from one HDFS cluster to another.

Data loading : In this scenario our source system is any file system like Linux, Windows etc. and we
move the data from the source to HDFS in parallel.
The degree of parallelism depends on the source system's capability.
For this data movement both the source and destination locations must have a prefix, that is, a protocol.

Data backup : It is very common in Hadoop clusters; here we take a backup of either the entire
source HDFS or some directories from the source HDFS to a destination HDFS.
The degree of parallelism depends on the source and destination clusters.
Usage : hadoop distcp [options] <srcurl>* <desturl>
Options :
-p à preserve status; useful when a data backup is taken from one HDFS to another HDFS.
-i à ignore failures.
-m à tells how many blocks ( map tasks ) should be copied in parallel.
Ex : -m 5 – 5 blocks will be copied at a time.

distcp2 is the newer version of distcp.

Archive :
•  It is used to convert n number of HDFS files into a single HDFS file, i.e. an archive file.
•  In general archiving provides compression; in HDFS the archive does not provide any
compression.
•  The objective of the archive is to reduce the metadata on the name node by reducing the
number of blocks.
Archiving is a Map Reduce program.

Ex : For 100 small files, the first ( n – 1 ) blocks of each file are of exact block size and the nth block
holds the remaining bytes; so for the 100 files there will be 100 such partially filled blocks, and for all of
them metadata entries will exist on the name node.
The archive file's extension is .har.
Usage : hadoop archive –archiveName NAME –p <parent path> <src>* <dest>
hadoop fs –mkdir /archive
hadoop archive –archiveName nysc.har –p /nysc /archive
•  The entire archived data is stored in one file, that is part-0, and its replication factor is
dfs.replication.
•  In the archive we will also see two more files, _index and _masterindex; the replication
factor is 5 for these two files. These two files hold the meta info about the archive.
NOTE : There is no unarchive process for archived HDFS data.
•  To retrieve individual files from archived HDFS data, HDFS introduced a wrapper file system
HAR ( Hadoop archive file system ) and the protocol is “har”.
hadoop fs –ls har:///archive/nysc.har/

HDFS admin command :


These are used to perform administrative operations.
1.  namenode –format
It is used to format HDFS; we run this command at the time of cluster setup, before starting
the cluster.
NOTE : Don't run this command on a running cluster ( we use this command only once in the cluster's
life cycle, at the time of cluster setup ).
2. namenode, datanode, secondarynamenode
These are used to start the daemons individually instead of restarting the cluster.
NOTE : Restarting a Hadoop production cluster is an expensive operation ( in terms of time and
availability ). In the following scenarios we need to start the daemons individually :
1.  Node failure ( NN, SNN, DN )
2.  Adding new nodes ( DN )
Commands to start the daemons :
nohup hadoop datanode &
nohup hadoop secondarynamenode &
OIV ( offline fsimage viewer )
It is used to view the HDFS metadata ( fsimage ) in a human-readable format.
Usage : hadoop oiv [options] –i <input file> –o <output file>
Ex : hadoop oiv –i /home/hduser/hadoop/name/current/fsimage –o $PWD/fsimage

Hadoop HDFS also provides another command, the offline edits viewer, to read the binary edits files in
a human-readable format.
fetchdt :
It is used to generate, renew and cancel delegation tokens, which are sent by clients to the name
node. We need to modify some of the configuration parameters to use it.
Usage : hadoop fetchdt <opts> <token file>
For every client the name node generates a delegation token and stores it on the client machine; for
this the command is
hadoop fetchdt --webservice <url> <token file>
To renew a token file :
hadoop fetchdt --webservice --renew <url> <token file>
To cancel :
hadoop fetchdt --webservice --cancel <url> <token file>

dfsadmin :
It is an admin shell client to perform very frequent administrative operations.
Usage : hadoop dfsadmin [generic options] [command options]
dfsadmin has several command options which are used to perform individual tasks.

Command options :
report : It is used to get the cluster summary and individual data node summaries.

This summary includes total capacity, present capacity, DFS remaining, DFS used etc.
Usage : hadoop dfsadmin –report

Safe mode : The name node uses a special mode called safe mode for its repair and maintenance.
Safe mode makes HDFS a read-only file system. Safe mode does not allow
re-replication, writes, deletes or updates on HDFS.
It allows 2 operations, i.e. lookup and reading.
The name node enters into safe mode in 2 scenarios :
1.  At the time of cluster start ( automatic )
2.  An admin enters safe mode at any point of time ( manual )

1. Automatic case : On every start of the Hadoop cluster, the name node automatically enters safe
mode; it stays there until 2/3 of the block replicas are validated plus about 30
seconds, and then leaves safe mode automatically.
2. Manual case : Admins enter safe mode for name node or HDFS maintenance, stay for some time,
and leave safe mode manually. For manual safe mode we have the
command option safemode.
Usage : hadoop dfsadmin -safemode [ enter | leave | get | wait ]
hadoop dfsadmin –safemode enter
hadoop dfsadmin –safemode leave
hadoop dfsadmin –safemode get
hadoop dfsadmin –safemode wait
saveNamespace :
It is used to merge the edits into the fsimage and reset the edits within the name node.
1.  Enter safe mode :
hadoop dfsadmin –safemode enter
2. Give the saveNamespace command :
hadoop dfsadmin –saveNamespace
3. Leave safe mode :
hadoop dfsadmin –safemode leave

refreshNodes : It is used to make the HDFS cluster up to date after doing some changes over
HDFS ( e.g. the allowed / excluded host lists ).
refreshServiceAcl : Service ACLs are used to authorize your cluster in a proper manner; by default
they are disabled.
For production clusters people enable service ACLs ( access control lists ) to administer the cluster
properly. Enabling service ACLs is nothing but doing some modifications in the xml files
( ex : core-site.xml ).
After doing the modifications, to reflect them we use refreshServiceAcl instead of
restarting the cluster.
Usage : hadoop dfsadmin –refreshServiceAcl

refreshUserToGroupsMappings :

To make HDFS up to date after changing the user-to-group mappings.
Usage : hadoop dfsadmin –refreshUserToGroupsMappings

refreshSuperUserGroupsConfiguration :

This is used to refresh HDFS whenever there is a change in the superuser-to-proxy-user
configuration.
hadoop dfsadmin –refreshSuperUserGroupsConfiguration

metasave :
This option is used to get metadata information like blocks being replicated, blocks waiting
for replication, blocks waiting for deletion etc.
This command option writes the above info to a file in human-readable format; the location of the
file is the Hadoop logs directory.
Usage : hadoop dfsadmin –metasave <filename>

Upgrading :
HDFS allows upgrading from one HDFS version to another HDFS version. To manage this upgrade
process there are two command options with dfsadmin.
Those two are :
1.  finalizeUpgrade
2.  upgradeProgress

•  An HDFS upgrade maintains two states, the current and the previous HDFS version.
•  If the upgrade process is successful then run finalizeUpgrade to delete the previous
version, otherwise roll back.
•  To get the upgrade progress we use the following command :
hadoop dfsadmin -upgradeProgress [ status | details | force ]
•  To finalize the upgrade process use the following command :
hadoop dfsadmin –finalizeUpgrade

Quotas :
HDFS allows quotas on the file system to manage the data, like Linux. There are two types of quotas :
1.  File quota
2.  Space quota

File quota :
This concept is used to set a fixed number of files ( names ) for a directory.
To create a file quota on HDFS directories we use the following command.
Usage : hadoop dfsadmin –setQuota <quota> <dirname>...<dirname>
Ex : hadoop dfsadmin –setQuota 4 /input

Here the quota is set to 4 for the directory ‘input’.

To clear the file quota use the following command :
Usage : hadoop dfsadmin –clrQuota <dirname>...<dirname>
Ex : hadoop dfsadmin –clrQuota /input
To see the quota :
hadoop fs –count –q /input

Space quota :
This concept is used to limit the size of an HDFS directory to a fixed amount of space.
To set the space quota on a directory use the following command :
hadoop dfsadmin –setSpaceQuota <quota> <dirname>...<dirname>

The quota is a size in bytes. It also allows another notation, i.e. nM for megabytes ( 1M = 1 MB )
and nT for terabytes ( 1T = 1 TB ).
To check the disk usage : hadoop fs –du /input
Ex : hadoop dfsadmin –setSpaceQuota 65m /input

NOTE : In HDFS we have the blocks concept; blocks are of a fixed size, so the space quota set on any
directory must be greater than or equal to the block size ( 64 MB / 128 MB ).

To write a file into an HDFS directory on which a space quota is already set, the remaining space
quota must be more than or equal to the block size.
To clear the space quota we use the following command :
Usage : hadoop dfsadmin –clrSpaceQuota <dirname>...<dirname>
Ex : hadoop dfsadmin –clrSpaceQuota /input
In production HDFS clusters people use a combination of file quota and space quota on HDFS
directories for managing HDFS.

setBalancerBandwidth :

It is used to set the bandwidth for the Hadoop balancer algorithm.
Usage : hadoop dfsadmin –setBalancerBandwidth <bandwidth in bytes per second>

help :
It is used to get the help of the dfsadmin command options.
Usage : hadoop dfsadmin –help <command>
We can also reserve disk space on the data nodes with a configuration parameter
( dfs.datanode.du.reserved ).

Java HDFS API :

It allows all kinds of operations over HDFS, but in general people use the Java HDFS API for writing
and reading.
To add the jars to the CLASSPATH variable use the following script :

for i in *.jar
do
  export CLASSPATH=$CLASSPATH:$PWD/$i
done

To compile the Java class use the following command :

javac -d . WriteHDFS.java

To build a jar :
jar -cvf write.jar -C . com/hdfs/example/*.class

To run it :
hadoop jar write.jar com.hdfs.example.WriteHDFS


HDFS Code Examples :
Hdfs commands :
hadoop fs -mkdir /training
hadoop fs -ls /
hadoop fs -lsr /
hadoop fs -put CHANGES.txt /training/data/changes
hadoop fs -copyFromLocal CHANGES.txt /training/input/changes
hadoop fs -cat /training/data/changes
hadoop fs -tail /training/data/changes
hadoop fs -expunge
hadoop fs -count /

JAVA HDFS API :


MapFileWrite.java

package com.hdfs;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
public class MapFileWrite {
/**
* @param args
*/
@SuppressWarnings("deprecation")
public static void main(String[] args) throws IOException,
URISyntaxException {
String name = "/home/naga/dept";
@SuppressWarnings("resource")
BufferedReader br = new BufferedReader(new FileReader(name));
String line = br.readLine();
String uri = "/nyse/";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://hadoop:9000"), conf);
Path path = new Path(uri);
Text key = new Text();
Text value = new Text();
MapFile.Writer writer = null;
try {
writer = new MapFile.Writer(conf, fs, uri, key.getClass(), value.getClass());
while (line != null) {
String parts[] = line.split("\\t");
key.set(parts[0]);
value.set(parts[1]);
writer.append(key, value);
line = br.readLine();
}
} finally {
IOUtils.closeStream(writer);
}
}
}
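A companion read sketch ( hedged: this is not from the original notes; it assumes the same /nyse MapFile and the same hdfs://hadoop:9000 name node used in MapFileWrite above, and the looked-up key "IT" is just an illustrative department code ) :

MapFileRead.java

package com.hdfs;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileRead {
@SuppressWarnings("deprecation")
public static void main(String[] args) throws IOException, URISyntaxException {
String uri = "/nyse/"; // MapFile directory created by MapFileWrite
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://hadoop:9000"), conf);
MapFile.Reader reader = null;
Text value = new Text();
try {
reader = new MapFile.Reader(fs, uri, conf);
// get() looks the key up through the MapFile index and returns null if the key is absent
if (reader.get(new Text("IT"), value) != null) {
System.out.println("IT\t" + value);
}
} finally {
if (reader != null) {
reader.close();
}
}
}
}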

MergeFiles.java
package twok.hadoop.hdfs;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class MergeFiles {

/**
* @param args
* @author Nagamallikarjuna
* @throws IOException
*/
public static void main(String[] args) throws IOException {
String dir = "/home/prasad/training/data/mydata";
File directory = new File(dir);
BufferedWriter writer = new BufferedWriter(new FileWriter("/home/prasad/training/20news"));
File[] files = directory.listFiles();
for(File file : files)
{
String filepath = file.getPath();
@SuppressWarnings("resource")
BufferedReader br = new BufferedReader(new FileReader(filepath));
String line = br.readLine();
String content = "";
content = file.getName() + "\t";
while(line != null)
{
content = content + " " + line;
line= br.readLine();
}
writer.append(content + "\n"); // newline added so that each input file becomes one line in the merged output
}
writer.close();
}
}

ReadHDFS.java
package twok.hadoop.hdfs;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadHDFS {

/**
* @param args
* @author Nagamallikarjuna
* @throws URISyntaxException
* @throws IOException
*/
public static void main(String[] args) throws IOException, URISyntaxException {
Configuration conf = new Configuration();
Path fileName = new Path(args[0]);
Pattern pattern = Pattern.compile("^part.*");
Matcher matcher = null;
FileSystem hdfs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);
FileStatus files[] = hdfs.listStatus(fileName);
for(FileStatus file: files)
{
matcher = pattern.matcher(file.getPath().getName());
if(matcher.matches())
{
FSDataInputStream in = hdfs.open(file.getPath());
in.seek(0);
IOUtils.copyBytes(in, System.out, conf, false);
}
}
}
}

SequenceFileRead.java
package twok.hadoop.hdfs;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
public class SequenceFileRead {

/**
* @author Nagamallikarjuna
* @param args
*/

public static void main(String[] args) throws IOException, URISyntaxException {


String uri = "/sequenceFiles/stocks";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://prasad:9000"), conf);
Path path = new Path(uri);
SequenceFile.Reader reader = null;
try
{
reader = new SequenceFile.Reader(fs, path, conf);
System.out.println(reader.getKeyClassName());
System.out.println(reader.getValueClassName());
}
finally {
IOUtils.closeStream(reader);
}
}
}
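The reader above only prints the key and value class names. A small hedged extension of the same try block that also iterates over the records ( it assumes the file was written by SequenceFileWrite.java shown next, i.e. Text keys and LongWritable values, and needs the extra imports org.apache.hadoop.io.Text and org.apache.hadoop.io.LongWritable ) :

reader = new SequenceFile.Reader(fs, path, conf);
Text key = new Text();                     // key type used by SequenceFileWrite
LongWritable value = new LongWritable();   // value type used by SequenceFileWrite
while (reader.next(key, value)) {          // next() returns false at the end of the file
System.out.println(key + "\t" + value.get());
}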

SequenceFileWrite.java
package twok.hadoop.hdfs;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
public class SequenceFileWrite {

/**
* @param args
* @author Nagamallikarjuna
*/
public static void main(String[] args) throws IOException, URISyntaxException {
String name = "/home/naga/bigdata/hadoop-1.0.3/jobs/daily";
@SuppressWarnings("resource")
BufferedReader br = new BufferedReader(new FileReader(name));
String line = br.readLine();
String uri = "/sequenceFiles/stocks";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);
Path path = new Path(uri);
Text key = new Text();
LongWritable value = new LongWritable();
SequenceFile.Writer writer = null;
try
{
writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
while(line != null)
{
String parts[] = line.split("\\t");
key.set(parts[1]);
value.set(Long.valueOf(parts[7]));
writer.append(key, value);
line = br.readLine();
}

}
finally {
IOUtils.closeStream(writer);
}
}
}

WriteHDFS.java
package twok.hadoop.hdfs;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.GenericOptionsParser;

public class WriteHDFS {

/**
* @param args
* @author Nagamallikarjuna
* @throws URISyntaxException
* @throws IOException
*/
public static void main(String[] args) throws IOException, URISyntaxException {
Configuration conf = new Configuration();
String otherArgs[] = new GenericOptionsParser(args).getRemainingArgs();
String localfile = otherArgs[0];
String hdfsfile = otherArgs[1];
FileSystem local = FileSystem.getLocal(conf);
FileSystem hdfs = FileSystem.get(new URI("hdfs://master:9000"), conf);
FSDataInputStream in = local.open(new Path(localfile));
FSDataOutputStream out = hdfs.create(new Path(hdfsfile));
byte[] buffer = new byte[256];
int bytesRead;
// write only the number of bytes actually read, so the last partial buffer is not padded with stale data
while((bytesRead = in.read(buffer)) > 0)
{
out.write(buffer, 0, bytesRead);
}
out.close();
in.close();
}
}

HDFS Commands :
This documentation gives complete information about user and admin commands of HDFS.
HDFS User Commands are:
1. fs
2. archive
3. distcp

HDFS Admin Commands are:


1. namenode -format
2. namenode
3. secondarynamenode
4. datanode
5. dfsadmin
6. Fsck

USER Commands:
1.  fs -- FileSystem Client.
hadoop fs is the command to execute file system operations like directory creation, renaming
files, copying files, removing and displaying files, etc.
Usage: hadoop fs –fs [local | <file system URI>] [Generic Options] [Command Options]
Specify the file system to use. If not specified, the current configuration is used, taken from the
following,
in increasing precedence: core-default.xml inside the hadoop jar file, core-site.xml in configuration
directory in hadoop home conf/ 'Local' means use the local file system as your DFS. <File system
URI> specifies a particular file system to contact.
Example :
hadoop fs -fs local à it will use the file system as local file system.
hadoop fs -fs hostname:9000 à It will use the file system as Hadoop distributed file system.

There is no need to specify, by default it chooses based on configuration files [ core-site, hdfs-site,
mapred-site xml files]

The following are the available command options with fs:


1. ls:
Cmd> hadoop fs -ls <path>:
List the contents that match the specified file pattern. If path is not specified, the contents of /
user/<currentUser> will be listed. Directory entries are of the form dirName full path/<dir> and
file entries are of the form fileName(full path) <r n> size where n is the number of replicas
specified for the file and size is the size of the file, in bytes.
Example:- hadoop fs -ls / à This command lists all the files under the root directory of HDFS

Cmd> hadoop fs -lsr <path>:


Recursively list the contents that match the specified file pattern. Behaves very similarly to
hadoop fs -ls, except that the data is shown for all the entries in the subtree.
Example:- hadoop fs -lsr / àThis command lists all the files Recursively under the root directory
of HDFS

Cmd> hadoop fs -du <path>:


Show the amount of space, in bytes, used by the files that match the specified file pattern.
Equivalent to the unix command "du -sb <path>/*" in case of a directory, and to "du -b <path>"
in case of a file. The output is in the form name(full path) size (in bytes)
Example: hadoop fs -du / à This gives size in bytes and file name (Individual files).

Cmd> hadoop fs -dus <path>:


Show the amount of space, in bytes, used by the files that match the specified file pattern.
Equivalent to the unix command "du -sb" The output is in the form name(full path) size (in bytes)
Example: hadoop fs -dus / à This gives the filename and size in bytes (all files)

Cmd> hadoop fs -mv <src> <dst>:


Move files that match the specified file pattern <src> to a destination <dst>. When moving
multiple files, the destination must be a directory. Here the both src and dst are HDFS locations
only.
Example: hadoop fs -mv /data/groups /groups à This moves the file groups from /data to / and
deletes the groups in /data.
Cmd> hadoop fs -cp <src> <dst>:
Copy files that match the file pattern <src> to a destination. When copying multiple files, the
destination must be a directory.
Example: hadoop fs -cp /data/groups /kanna/groups à This copies the file groups from /data to /
kanna/.

Cmd> hadoop fs -rm [-skipTrash] <src>:


Delete all files that match the specified file pattern. Equivalent to the Unix command "rm <src>" -
skipTrash option bypasses trash, if enabled, and immediately deletes <src>
Example: hadoop fs -rm /groups --> Removes the file name groups from the specified directory.

Cmd> hadoop fs -rmr [-skipTrash] <src>:


Remove all directories which match the specified file pattern. Equivalent to the Unix command
"rm -rf <src>" -skipTrash option bypasses trash, if enabled, and immediately deletes <src>
Example: hadoop fs -rmr /data --> Removes the specified directory.

Cmd> hadoop -put <localsrc>... <dst>:


Copy files from the local file system into HDFS.
Example: hadoop fs -put groups /data --> Copy the data from local system to HDFS.

Cmd> hadoop -copyFromLocal <localsrc>... <dst>:


Identical to the -put command.
Example: hadoop fs -put groups /data --> Copy the data from local system to HDFS.

Cmd> hadoop fs -moveFromLocal <localsrc>... <dst>:


Same as -put, except that the source is deleted after it's copied.
Example: hadoop fs - moveFromLocal groups /data --> Copy the data from local system to HDFS
and removes the file in local system.

Cmd> hadoop fs -get [-ignoreCrc] [-crc] <src> <localdst>:


Copy files that match the file pattern <src> to the local name. <src> is kept. When copying
multiple, files, the destination must be a directory.
Example: hadoop fs -get /data/groups groups --> This copies the groups file from HDFS to local
file system.

Cmd> hadoop fs -getmerge <src> <localdst>:


Get all the files in the directories that match the source file pattern and merge and sort them to
only one file on local fs. <src> is kept.
Example: hadoop fs -getmerge /data/groups groups --> This copies the all the files from data
directory in HDFS to local file system.
Cmd> hadoop fs -cat <src>:
Fetch all files that match the file pattern <src> and display their content on stdout.
Example: hadoop fs -cat /data/groups --> It displays the content of groups file in stdout

Cmd> hadoop fs -copyToLocal [-ignoreCrc] [-crc] <src> <localdst>:


Identical to the -get command.
Example: hadoop fs - copyToLocal /data/groups groups --> This copies the groups file from HDFS
to local file system

Cmd> hadoop fs -moveToLocal <src> <localdst>:


Not implemented yet

Cmd> hadoop fs -mkdir <path>:


Create a directory in specified location.
Example: hadoop fs -mkdir /training --> it creates a directory training in HDFS.

Cmd> hadoop fs -setrep [-R] [-w] <rep> <path/file>:


Set the replication level of a file. The -R flag requests a recursive change of replication level for an
entire tree.
Example: hadoop fs -setrep 2 /training/input --> It changes replication factor of input file to 2 from
the old replication factor.
hadoop fs -setrep -R 2 /training/ --> It changes replication factor of all files to 2 from the old
replication factor.

Cmd> hadoop fs -tail [-f] <file>:


Show the last 1KB of the file. The -f option shows appended data as the file grows.
Example: hadoop fs -tail /training/input --> It displays the last 1KB content of input file

Cmd> hadoop fs -touchz <path>:


It is used to create an empty file in specified directory.
Example: hadoop fs -touchz /training/name --> It creates an empty file name in training directory.

Cmd> hadoop fs -test -[ezd] <path>:


If the file exists ( -e ), has zero length ( -z ), or is a directory ( -d ), then return 0, else return 1.
Example: hadoop fs -test -e /training/name

Cmd> hadoop fs -text <src>:


Takes a source file and outputs the file in text format. The allowed formats are zip and
TextRecordInputStream.
Example: hadoop fs -text /training/binaryfile --> It prints the binary file in text format.
Cmd> hadoop fs -stat [format] <path>:
Print statistics about the file/directory at <path> in the specified format. Format accepts filesize in
blocks (%b), filename (%n), block size (%o), replication (%r), modification date (%y, %Y)
Example: hadoop fs -stat [%b,%n,%o,%r] /training --> It prints the stats about the file/directory
based on the format.

Cmd> hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...


Changes permissions of a file. This works similar to shell's chmod with a few exceptions.
-R modifies the files recursively. This is the only option currently supported.
MODE : Mode is same as mode used for chmod shell command.
Only letters recognized are 'rwxX'. E.g. a+r,g-w,+rwx,o=r
OCTALMODE Mode specifed in 3 digits. Unlike shell command, this requires all three digits.
E.g. 754 is same as u=rwx,g=rx,o=r
If none of 'augo' is specified, 'a' is assumed and unlike shell command, no umask is
applied.
Example: hadoop fs -chmod -R 777 /data/daily --> It changes the permissions to 777 means read/
write for owner, group, and rest also.

Cmd> hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH...


Changes owner and group of a file.
This is similar to shell's chown with a few exceptions.
-R modifies the files recursively. This is the only option currently supported.
If only owner or group is specified then only owner or group is modified.
The owner and group names may only consist of digits, alphabet, and any of '-_.@/' i.e. [-_.@/a-zA-
Z0-9]. The names are case sensitive.
WARNING: Avoid using '.' to separate user name and group though Linux allows it. If user names
have dots in them and you are using local file system, you might see surprising results since shell
command 'chown' is used for local files.
Example: hadoop fs -chown -R naga:supergroup /data/daily --> This will change the user and
group of the file /data/daily

Cmd> hadoop fs -chgrp [-R] GROUP PATH...


This is equivalent to -chown... :GROUP...

Cmd> hadoop fs -count[-q] <path>:


Count the number of directories, files and bytes under the paths that match the specified file
pattern. The output columns are: DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
Example: hadoop fs -count /data/

Cmd> hadoop fs -help [cmd]:


Displays help for given command or all commands if none is specified.
Example: hadoop fs -help ls --> it gives the usage information about command ls.
Cmd> hadoop balancer [-threshold <threshold>]    ( <threshold> is a percentage of disk capacity )
Example: hadoop balancer --> It does the cluster balancing

Cmd> hadoop fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
<path> start checking from this path
-move move corrupted files to /lost+found
-delete delete corrupted files
-files print out files being checked
-openforwrite print out files opened for write
-blocks print out block report
-locations print out locations for every block
-racks print out network topology for data-node locations
By default fsck ignores files opened for write; use -openforwrite to report such files. They are
usually tagged CORRUPT or HEALTHY depending on their block allocation status.
Example: hadoop fsck / --> It provides the healthiness of the hadoop cluster.

Cmd> hadoop distcp [OPTIONS] <srcurl>* <desturl>


OPTIONS:
-p[rbugp] Preserve status
r: replication number
b: block size
u: user
g: group
p: permission
-p alone is equivalent to –prbugp
-i Ignore failures
-log <logdir> Write logs to <logdir>
-m <num_maps> Maximum number of simultaneous copies
-overwrite Overwrite destination
-update Overwrite if src size different from dst size
-skipcrccheck Do not use CRC check to determine if src is
different from dest. Relevant only if -update is specified
-f <urilist_uri> Use list at <urilist_uri> as src list
-filelimit <n> Limit the total number of files to be <= n
-sizelimit <n> Limit the total size to be <= n bytes
-delete Delete the files existing in the dst but not in src
-mapredSslConf <f> Filename of SSL configuration for mapper task
Example:
hadoop distcp -p -overwrite "hdfs://localhost:9000/training/bar" "hdfs://anotherhostname:8020/
user/foo" --> It loads data from one cluster to another cluster in parallel. It over writes if file is
there.
hadoop distcp -p -update “file:///filename” "hdfs://localhost:9000/training/filename" --> It loads
data from local system to hadoop cluster in parallel. It updates if file is there.
Cmd> hadoop archive -archiveName NAME -p <parent path> <src>* <dest>
Example: hadoop archive –archiveName myfile.har –p /data /data/files /archive/

hadoop dfsadmin is the command to execute DFS administrative commands.


Example: hadoop dfsadmin <command name>

Cmd> hadoop dfsadmin -report:


Reports basic filesystem information and statistics.
Example: hadoop dfsadmin -report

Cmd> hadoop dfsadmin -safemode <enter|leave|get|wait>:


Safe mode maintenance command.
Safe mode is a Namenode state in which it
1.  does not accept changes to the name space (read-only)
2.  does not replicate or delete blocks.
Safe mode is entered automatically at Namenode startup, and leaves safe mode automatically
when the configured minimum percentage of blocks satisfies the minimum replication condition.
Safe mode can also be entered manually, but then it can only be turned off manually as well.
Example:
hadoop dfsadmin -safemode get à To get status of safe mode
hadoop dfsadmin -safemode enter à To enter safe mode
hadoop dfsadmin -safemode leave à To leave the safemode

Cmd> hadoop dfsadmin -saveNamespace:


Save the current namespace into the storage directories and reset the edits log. Requires superuser
permissions and safe mode.

Cmd> hadoop dfsadmin -refreshNodes:


Updates the set of hosts allowed to connect to namenode.
Re-reads the config file to update values defined by dfs.hosts and dfs.hosts.exclude and reads the
entries (hostnames) in those files.
Each entry not defined in dfs.hosts but in dfs.hosts.exclude is decommissioned. Each entry
defined in dfs.hosts and also in dfs.hosts.exclude is stopped from decommissioning if it has already
been marked for decommission.
Entries not present in both the lists are decommissioned.

Cmd> hadoop dfsadmin -finalizeUpgrade:


Finalize upgrade of HDFS. Datanodes delete their previous version working directories, followed
by Namenode doing the same. This completes the upgrade process.

Cmd> hadoop dfsadmin -upgradeProgress <status|details|force>:


Request current distributed upgrade status, a detailed status or force the upgrade to proceed.
Cmd> hadoop dfsadmin -metasave <filename>:
Save Namenode's primary data structures to <filename> in the directory specified by
hadoop.log.dir property. <filename> will contain one line for each of the following
1.  Datanodes heart beating with Namenode
2. Blocks waiting to be replicated
3. Blocks currently being replicated
4. Blocks waiting to be deleted

Cmd> hadoop dfsadmin -setQuota <quota> <dirname>...<dirname>:


Set the quota <quota> for each directory <dirName>.
The directory quota is a long integer that puts a hard limit on the number of names in the
directory tree
Best effort for the directory, with faults reported if
1.  N is not a positive integer, or
2. user is not an administrator, or
3. the directory does not exist or is a file, or

Cmd> hadoop dfsadmin -clrQuota <dirname>...<dirname>:


Clear the quota for each directory <dirName>.
Best effort for the directory. with fault reported if
1.  the directory does not exist or is a file, or
2. user is not an administrator.
It does not fault if the directory has no quota.

Cmd> hadoop dfsadmin -setSpaceQuota <quota> <dirname>...<dirname>:


Set the disk space quota <quota> for each directory <dirName>.
The space quota is a long integer that puts a hard limit on the total size of all the files under the
directory tree.
The extra space required for replication is also counted. E.g. a 1GB file with replication of 3
consumes 3GB of the quota.
Quota can also be specified with a binary prefix for terabytes, petabytes etc (e.g. 50t is 50TB, 5m
is 5MB, 3p is 3PB).
Best effort for the directory, with faults reported if
1.  N is not a positive integer, or
2. user is not an administrator, or
3. the directory does not exist or is a file, or

Cmd> hadoop dfsadmin -clrSpaceQuota <dirname>...<dirname>:


Clear the disk space quota for each directory <dirName>.
Best effort for the directory. with fault reported if
1.  the directory does not exist or is a file, or
2. user is not an administrator.
It does not fault if the directory has no quota.
Cmd> hadoop dfsadmin -refreshServiceAcl:
Reload the service-level authorization policy file. Namenode will reload the authorization policy
file.

Cmd> hadoop dfsadmin -refreshUserToGroupsMappings:


Refresh user-to-groups mappings

Cmd> hadoop dfsadmin -refreshSuperUserGroupsConfiguration:


Refresh superuser proxy groups mappings

Cmd> hadoop dfsadmin -setBalancerBandwidth <bandwidth>:


Changes the network bandwidth used by each datanode during HDFS block balancing.
<bandwidth> is the maximum number of bytes per second that will be used by each datanode.
This value overrides the dfs.balance.bandwidthPerSec parameter.

NOTE: The new value is not persistent on the DataNode

Cmd> hadoop -help [cmd]:


Displays help for the given command or all commands if none is specified.
Example: hadoop -help namenode --> It gives the complete information about the command
namenode.
HDFS Commands :

This documentation gives complete information about user and admin commands of HDFS.
Hadoop fs is the command to execute file system operations like directory creation, renaming
files, copying files etc.
Cmd> hadoop fs –fs [local | <file system URI>]
Specify the file system to use. If not specified, the current configuration is used, taken from the
following, in increasing precedence: core-default.xml inside the hadoop jar file, core-site.xml in
configuration directory in hadoop home conf/
'Local' means use the local file system as your DFS.
<File system URI> specifies a particular file system to contact.
Example:-
hadoop fs -fs local ---> it will use the file system as local file system.
hadoop fs -fs localhost: 9000 ---> It will use the file system as Hadoop distributed file system.

There is no need to specify, by default it chooses based on configuration files [ core-site, hdfs-site,
mapred-site xml files]
Cmd> hadoop fs -ls <path>:
List the contents that match the specified file pattern. If path is not specified, the contents of /
user/<currentUser> will be listed. Directory entries are of the form dirName full path/<dir> and
file entries are of the form fileName(full path) <r n> size where n is the number of replicas
specified for the file and size is the size of the file, in bytes.
Example:-
hadoop fs -ls / ---> This command lists all the files under the root directory of HDFS

Cmd> hadoop fs -lsr <path>:


Recursively list the contents that match the specified file pattern. Behaves very similarly to hadoop
fs -ls, except that the data is shown for all the entries in the subtree.
Example:-
hadoop fs -lsr / ---> This command lists all the files recursively under the root directory of HDFS

Cmd> hadoop fs -du <path>:


Show the amount of space, in bytes, used by the files that match the specified file pattern.
Equivalent to the unix command "du -sb <path>/*" in case of a directory, and to "du -b <path>"
in case of a file. The output is in the form name(full path) size (in bytes)
Example:
hadoop fs -du / ---> This gives size in bytes and file name (Individual files).

Cmd> hadoop fs -dus <path>:


Show the amount of space, in bytes, used by the files that match the specified file pattern.
Equivalent to the unix command "du -sb" The output is in the form name(full path) size (in bytes)
Example:
hadoop fs -dus / ---> This gives the filename and size in bytes (all files)
Cmd> hadoop fs -mv <src> <dst>:
Move files that match the specified file pattern <src> to a destination <dst>. When moving
multiple files, the destination must be a directory. Here both <src> and <dst> must be HDFS locations.
Example:
hadoop fs -mv /data/groups /groups ---> This moves the file groups from /data to / and deletes
the groups in /data.

Cmd> hadoop fs -cp <src> <dst>:


Copy files that match the file pattern <src> to a destination. When copying multiple files, the
destination must be a directory.
Example:
hadoop fs -cp /data/groups /kanna/groups ---> This copies the file groups from /data to /kanna/.

Cmd> hadoop fs -rm [-skipTrash] <src>:


Delete all files that match the specified file pattern. Equivalent to the Unix command "rm <src>" -
skipTrash option bypasses trash, if enabled, and immediately deletes <src>
Example:
hadoop fs -rm /groups --> Removes the file name groups from the specified directory.

Cmd> hadoop fs -rmr [-skipTrash] <src>:


Remove all directories which match the specified file pattern. Equivalent to the Unix command
"rm -rf <src>" -skipTrash option bypasses trash, if enabled, and immediately deletes <src>
Example:
hadoop fs -rmr /data --> Removes the specified directory.

Cmd> hadoop fs -put <localsrc>... <dst>:


Copy files from the local file system into HDFS.
Example:
hadoop fs -put groups /data --> Copy the data from local system to HDFS.

Cmd> hadoop fs -copyFromLocal <localsrc>... <dst>:


Identical to the -put command.
Example:
hadoop fs -copyFromLocal groups /data --> Copy the data from local system to HDFS.

Cmd> hadoop fs -moveFromLocal <localsrc>... <dst>:


Same as -put, except that the source is deleted after it's copied.
Example:
hadoop fs -moveFromLocal groups /data --> Copies the data from the local system to HDFS and
removes the file from the local system.

Cmd> hadoop fs -get [-ignoreCrc] [-crc] <src> <localdst>:


Copy files that match the file pattern <src> to the local name. <src> is kept. When copying
multiple files, the destination must be a directory.
Example:
hadoop fs -get /data/groups groups --> This copies the groups file from HDFS to local file system.

Cmd> hadoop fs -getmerge <src> <localdst>:


Get all the files in the directories that match the source file pattern and merge and sort them to
only one file on local fs. <src> is kept.
Example:
hadoop fs -getmerge /data/groups groups --> This merges all the files under /data/groups in
HDFS into the single local file groups.
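
The same merge can be done from Java with FileUtil.copyMerge. The sketch below is illustrative
only and simply mirrors the example above (the paths are examples):

// Illustrative Java counterpart of -getmerge; paths are examples.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class GetMergeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);           // source file system (HDFS)
        FileSystem local = FileSystem.getLocal(conf);     // destination file system (local)
        // Concatenate every file under /data/groups into one local file named groups.
        FileUtil.copyMerge(hdfs, new Path("/data/groups"), local, new Path("groups"),
                false /* keep the source files */, conf, null /* no separator string */);
    }
}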

Cmd> hadoop fs -cat <src>:


Fetch all files that match the file pattern <src> and display their content on stdout.
Example:
hadoop fs -cat /data/groups --> It displays the content of groups file in stdout

Cmd> hadoop fs -copyToLocal [-ignoreCrc] [-crc] <src> <localdst>:


Identical to the -get command.
Example:
hadoop fs -copyToLocal /data/groups groups --> This copies the groups file from HDFS to the local
file system

Cmd> hadoop fs -moveToLocal <src> <localdst>:


Not implemented yet
Cmd> hadoop fs -mkdir <path>:
Create a directory at the specified location.
Example:
hadoop fs -mkdir /training --> it creates a directory training in HDFS.

Cmd> hadoop fs -setrep [-R] [-w] <rep> <path/file>:


Set the replication level of a file. The -R flag requests a recursive change of replication level for an
entire tree.
Example:
hadoop fs -setrep 2 /training/input --> It changes replication factor of input file to 2 from the old
replication factor.
hadoop fs -setrep -R 2 /training/ --> It changes replication factor of all files to 2 from the old
replication factor.
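
The Java equivalent is FileSystem.setReplication; a small illustrative sketch (not from the original
notes) using the same path as the example above:

// Illustrative sketch: Java equivalent of "hadoop fs -setrep 2 /training/input".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask the Namenode to keep 2 replicas of the file from now on.
        boolean accepted = fs.setReplication(new Path("/training/input"), (short) 2);
        System.out.println("replication change accepted: " + accepted);
    }
}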

Cmd> hadoop fs -tail [-f] <file>:


Show the last 1KB of the file. The -f option shows appended data as the file grows.
Example:
hadoop fs -tail /training/input --> It displays the last 1KB content of input file

Cmd> hadoop fs -touchz <path>:


It is used to create an empty file in the specified directory.
Example:
hadoop fs -touchz /training/name --> It creates an empty file named name in the /training directory.

Cmd> hadoop fs -test -[ezd] <path>:


Tests the path: -e returns 0 if the file exists, -z if it has zero length, -d if it is a directory; otherwise 1 is returned.
Example:
hadoop fs -test -e /training/name

Cmd> hadoop fs -text <src>:


Takes a source file and outputs the file in text format. The allowed formats are zip and
TextRecordInputStream.
Example:
hadoop fs -text /training/binaryfile --> It prints the binary file in text format.

Cmd> hadoop fs -stat [format] <path>:


Print statistics about the file/directory at <path> in the specified format. Format accepts filesize in
blocks (%b), filename (%n), block size (%o), replication (%r), modification date (%y, %Y)
Example:
hadoop fs -stat [%b,%n,%o,%r] /training --> It prints the stats about the file/directory based on the
format.

Cmd> hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...


Changes permissions of a file. This works similar to shell's chmod with a few exceptions.
-R modifies the files recursively. This is the only option currently supported.
MODE Mode is same as mode used for chmod shell command.
Only letters recognized are 'rwxX'. E.g. a+r,g-w,+rwx,o=r
OCTALMODE Mode specified in 3 digits. Unlike the shell command, this requires all three digits.
E.g. 754 is same as u=rwx,g=rx,o=r
If none of 'augo' is specified, 'a' is assumed and unlike
shell command, no umask is applied.
Example:
hadoop fs -chmod -R 777 /data/daily --> It changes the permissions to 777, i.e. read/write/execute
for the owner, the group, and others.
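
Permissions can also be changed through the Java API with FileSystem.setPermission. The sketch
below is illustrative (the path is the one from the example, and unlike -chmod -R this single call is
not recursive):

// Illustrative sketch: Java equivalent of "hadoop fs -chmod 777 /data/daily".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class ChmodSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // 0777 = read/write/execute for owner, group and others.
        fs.setPermission(new Path("/data/daily"), new FsPermission((short) 0777));
    }
}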

Cmd> hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH...


Changes owner and group of a file.
This is similar to shell's chown with a few exceptions.
-R modifies the files recursively. This is the only option currently supported.
If only owner or group is specified then only owner or group is modified.
The owner and group names may only consist of digits, letters, and any of '-_.@/', i.e. [-_.@/a-zA-Z0-9].
The names are case sensitive.
WARNING: Avoid using '.' to separate user name and group though
Linux allows it. If user names have dots in them and you are
using local file system, you might see surprising results since
shell command 'chown' is used for local files.
Example:
hadoop fs -chown -R naga:supergroup /data/daily --> This will change the user and group of the
file /data/daily

Cmd> hadoop fs -chgrp [-R] GROUP PATH...


This is equivalent to -chown... :GROUP...
Cmd> hadoop fs -count[-q] <path>:
Count the number of directories, files and bytes under the paths that match the specified file
pattern. The output columns are: DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
Example:
hadoop fs -count /data/

Cmd> hadoop fs -help [cmd]:


Displays help for given command or all commands if none is specified.
Example:
hadoop fs -help ls --> it gives the usage information about command ls.

Cmd> hadoop balancer [-threshold <threshold>]    (<threshold> is a percentage of disk capacity)


Example:
hadoop balancer --> It does the cluster balancing

Cmd> hadoop fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
<path> start checking from this path
-move move corrupted files to /lost+found
-delete delete corrupted files
-files print out files being checked
-openforwrite print out files opened for write
-blocks print out block report
-locations print out locations for every block
-racks print out network topology for data-node locations

By default fsck ignores files opened for write; use -openforwrite to report such files. They are
usually tagged CORRUPT or HEALTHY depending on their block allocation status.
Example:
hadoop fsck / --> It provides the healthiness of the hadoop cluster.

Cmd> hadoop distcp [OPTIONS] <srcurl>* <desturl>


OPTIONS:
-p[rbugp] Preserve status
r: replication number
b: block size
u: user
g: group
p: permission
-p alone is equivalent to -prbugp
-i Ignore failures
-log <logdir> Write logs to <logdir>
-m <num_maps> Maximum number of simultaneous copies
-overwrite Overwrite destination
-update Overwrite if src size different from dst size
-skipcrccheck Do not use CRC check to determine if src is different from dest. Relevant only if -
update is specified
-f <urilist_uri> Use list at <urilist_uri> as src list
-filelimit <n> Limit the total number of files to be <= n
-sizelimit <n> Limit the total size to be <= n bytes
-delete Delete the files existing in the dst but not in src
-mapredSslConf <f> Filename of SSL configuration for mapper task
Example:
hadoop distcp -p -overwrite "hdfs://localhost:9000/training/bar" "hdfs://anotherhostname:8020/user/foo"
--> It copies data from one cluster to another cluster in parallel, overwriting the file if it already
exists at the destination.
hadoop distcp -p -update "file:///filename" "hdfs://localhost:9000/training/filename" --> It copies
data from the local system to the hadoop cluster in parallel, updating the file if it already exists at
the destination.

Cmd> hadoop archive -archiveName NAME -p <parent path> <src>* <dest>


Example: hadoop archive -archiveName myfile.har -p /data /data/files /archive/
hadoop dfsadmin is the command to execute DFS administrative commands.
Example:
hadoop dfsadmin <command name>

Cmd> hadoop dfsadmin -report:


Reports basic filesystem information and statistics.
Example:
hadoop dfsadmin -report

Cmd> hadoop dfsadmin -safemode <enter|leave|get|wait>:


Safe mode maintenance command.
Safe mode is a Namenode state in which it
1. does not accept changes to the name space (read-only)
2. does not replicate or delete blocks
Safe mode is entered automatically at Namenode startup, and leaves safe mode automatically
when the configured minimum percentage of blocks satisfies the minimum replication condition.
Safe mode can also be entered manually, but then it can only be turned off manually as well.
Example:
hadoop dfsadmin -safemode get --> To get status of safe mode
hadoop dfsadmin -safemode enter --> To enter safe mode
hadoop dfsadmin -safemode leave --> To leave the safemode

Cmd> hadoop dfsadmin -saveNamespace:


Save current namespace into storage directories and reset edits log. Requires superuser
permissions and safe mode.
Cmd> hadoop dfsadmin -refreshNodes:
Updates the set of hosts allowed to connect to namenode. Re-reads the config file to update
values defined by dfs.hosts and dfs.hosts.exclude and reads the entries (hostnames) in those files.
Each entry not defined in dfs.hosts but in dfs.hosts.exclude is decommissioned. Each entry
defined in dfs.hosts and also in dfs.hosts.exclude is stopped from decommissioning if it has already
been marked for decommission.
Entries not present in both the lists are decommissioned.

Cmd> hadoop dfsadmin -finalizeUpgrade:


Finalize upgrade of HDFS. Datanodes delete their previous version working directories, followed
by Namenode doing the same. This completes the upgrade process.

Cmd> hadoop dfsadmin -upgradeProgress <status|details|force>:


Request current distributed upgrade status, a detailed status or force the upgrade to proceed.

Cmd> hadoop dfsadmin -metasave <filename>:


Save Namenode's primary data structures to <filename> in the directory specified by
hadoop.log.dir property. <filename> will contain one line for each of the following
1. Datanodes heart beating with Namenode
2. Blocks waiting to be replicated
3. Blocks currently being replicated
4. Blocks waiting to be deleted

Cmd> hadoop dfsadmin -setQuota <quota> <dirname>...<dirname>:


Set the quota <quota> for each directory <dirName>.
The directory quota is a long integer that puts a hard limit
on the number of names in the directory tree
Best effort for the directory, with faults reported if
1. N is not a positive integer, or
2. user is not an administrator, or
3. the directory does not exist or is a file, or
4. the directory would immediately exceed the new quota.
Cmd> hadoop dfsadmin -clrQuota <dirname>...<dirname>:
Clear the quota for each directory <dirName>.
Best effort for the directory, with faults reported if
1. the directory does not exist or is a file, or
2. user is not an administrator.
It does not fault if the directory has no quota.

Cmd> hadoop dfsadmin -setSpaceQuota <quota> <dirname>...<dirname>:


Set the disk space quota <quota> for each directory <dirName>.
The space quota is a long integer that puts a hard limit on the total size of all the files under the
directory tree.
The extra space required for replication is also counted. E.g. a 1GB file with replication of 3
consumes 3GB of the quota.
Quota can also be specified with a binary prefix for terabytes, petabytes etc (e.g. 50t is 50TB, 5m
is 5MB, 3p is 3PB).

Best effort for the directory, with faults reported if


1. N is not a positive integer, or
2. user is not an administrator, or
3. the directory does not exist or is a file, or
4. the directory would immediately exceed the new space quota.

Cmd> hadoop dfsadmin -clrSpaceQuota <dirname>...<dirname>:


Clear the disk space quota for each directory <dirName>.
Best effort for the directory, with faults reported if
1. the directory does not exist or is a file, or
2. user is not an administrator.
It does not fault if the directory has no quota.
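
Quota settings and usage can be inspected with hadoop fs -count -q, or from Java through
FileSystem.getContentSummary. The sketch below is illustrative only (the /data directory is an example):

// Illustrative sketch: reading quota and usage for a directory; the path is an example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowQuota {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        ContentSummary cs = fs.getContentSummary(new Path("/data"));
        System.out.println("name quota     : " + cs.getQuota());          // -1 means no name quota set
        System.out.println("space quota    : " + cs.getSpaceQuota());     // in bytes, -1 means none
        System.out.println("space consumed : " + cs.getSpaceConsumed());  // counts every replica
        System.out.println("file bytes     : " + cs.getLength());         // raw size, before replication
    }
}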

Cmd> hadoop dfsadmin -refreshServiceAcl:


Reload the service-level authorization policy file. Namenode will reload the authorization policy
file.

Cmd> hadoop dfsadmin -refreshUserToGroupsMappings:


Refresh user-to-groups mappings

Cmd> hadoop dfsadmin -refreshSuperUserGroupsConfiguration:


Refresh superuser proxy groups mappings

Cmd> hadoop dfsadmin -setBalancerBandwidth <bandwidth>:


Changes the network bandwidth used by each datanode during
HDFS block balancing.
<bandwidth> is the maximum number of bytes per second that will be used by each datanode.
This value overrides the dfs.balance.bandwidthPerSec parameter.

NOTE: The new value is not persistent on the DataNode

Cmd> hadoop -help [cmd]:


Displays help for the given command or all commands if none is specified.
Example:
hadoop -help namenode --> It gives the complete information about the command namenode.

HDFS - Commands
hadoop fs -mkdir /training
hadoop fs -ls /
hadoop fs -lsr /
hadoop fs -put CHANGES.txt /training/data/changes
hadoop fs -copyFromLocal CHANGES.txt /training/input/changes
hadoop fs -cat /training/data/changes
hadoop fs -tail /training/data/changes
hadoop fs -expunge
hadoop fs -count /

JAVA HDFS API
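
The listings below use the org.apache.hadoop.fs and org.apache.hadoop.io APIs. As a warm-up, the
short sketch that follows is not one of the original files (the URI and paths are examples); it shows
how the basic FileSystem calls line up with the shell commands above:

// Illustrative warm-up sketch; URI and paths are examples only.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemBasics {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);

        fs.mkdirs(new Path("/training"));                          // hadoop fs -mkdir /training
        fs.copyFromLocalFile(new Path("CHANGES.txt"),              // hadoop fs -put CHANGES.txt ...
                new Path("/training/data/changes"));
        for (FileStatus status : fs.listStatus(new Path("/"))) {   // hadoop fs -ls /
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
        fs.delete(new Path("/training/data/changes"), false);      // hadoop fs -rm ...
        fs.close();
    }
}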

MapFileWrite.java
package com.hdfs;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWrite {


/**
* @param args
*/
@SuppressWarnings("deprecation")
public static void main(String[] args) throws IOException,
URISyntaxException {
String name = "/home/naga/dept";
@SuppressWarnings("resource")
BufferedReader br = new BufferedReader(new FileReader(name));
String line = br.readLine();
String uri = "/nyse/";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://hadoop:9000"), conf);
Path path = new Path(uri);
Text key = new Text();
Text value = new Text();
MapFile.Writer writer = null;
try {
writer = new MapFile.Writer(conf, fs, uri, key.getClass(), value.getClass());
while (line != null) {
String parts[] = line.split("\\t");
key.set(parts[0]);
value.set(parts[1]);
writer.append(key, value);
line = br.readLine();
}
} finally {
IOUtils.closeStream(writer);
}
}
}
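
Two notes on the listing above: MapFile.Writer expects keys to be appended in sorted order, so the
input file must already be sorted on its first column, and the resulting map file can later be queried
by key with MapFile.Reader. The reader sketch below is illustrative only (the key "10" is an example):

// Illustrative counterpart of MapFileWrite; the lookup key is an example value.
package com.hdfs;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://hadoop:9000"), conf);
        MapFile.Reader reader = null;
        try {
            reader = new MapFile.Reader(fs, "/nyse/", conf);
            Text value = new Text();
            // get() binary-searches the index and returns null when the key is absent.
            if (reader.get(new Text("10"), value) != null) {
                System.out.println(value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}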

MergeFiles.java
package twok.hadoop.hdfs;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class MergeFiles {

/**
* @param args
* @author Nagamallikarjuna
* @throws IOException
*/
public static void main(String[] args) throws IOException {
String dir = "/home/prasad/training/data/mydata";
File directory = new File(dir);
BufferedWriter writer = new BufferedWriter(new FileWriter("/home/prasad/training/20news"));
File[] files = directory.listFiles();
for(File file : files)
{
String filepath = file.getPath();
@SuppressWarnings("resource")
BufferedReader br = new BufferedReader(new FileReader(filepath));
String line = br.readLine();
String content = "";
content = file.getName() + "\t";
while(line != null)
{
content = content + " " + line;
line= br.readLine();
}
writer.append(content);
writer.newLine(); // write one merged record (file name, a tab, then the file contents) per line
}
writer.close();
}
}

ReadHDFS.java
package twok.hadoop.hdfs;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
public class ReadHDFS {

/**
* @param args
* @author Nagamallikarjuna
* @throws URISyntaxException
* @throws IOException
*/
public static void main(String[] args) throws IOException, URISyntaxException {
Configuration conf = new Configuration();
Path fileName = new Path(args[0]);
Pattern pattern = Pattern.compile("^part.*");
Matcher matcher = null;
FileSystem hdfs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);
FileStatus files[] = hdfs.listStatus(fileName);
for(FileStatus file: files)
{
matcher = pattern.matcher(file.getPath().getName());
if(matcher.matches())
{
FSDataInputStream in = hdfs.open(file.getPath());
in.seek(0);
IOUtils.copyBytes(in, System.out, conf, false);
}
}
}
}
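
ReadHDFS takes one argument, an HDFS directory, and streams to stdout every file in it whose name
starts with part (the naming convention for MapReduce output files). It would typically be packaged
into a jar and run with something like hadoop jar <your-jar> twok.hadoop.hdfs.ReadHDFS /path/to/job/output,
where the jar name and output path are placeholders.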

SequenceFileRead.java
package twok.hadoop.hdfs;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
public class SequenceFileRead {

/**
* @author Nagamallikarjuna
* @param args
*/
public static void main(String[] args) throws IOException, URISyntaxException {
String uri = "/sequenceFiles/stocks";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://prasad:9000"), conf);
Path path = new Path(uri);
SequenceFile.Reader reader = null;
try
{
reader = new SequenceFile.Reader(fs, path, conf);
System.out.println(reader.getKeyClassName());
System.out.println(reader.getValueClassName());
}
finally {
IOUtils.closeStream(reader);
}
}
}
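
SequenceFileRead above only prints the key and value class names. Iterating over the actual records
is a small extension; the sketch below is illustrative and reuses the same path and the (Text,
LongWritable) types written by SequenceFileWrite further down:

// Illustrative extension of SequenceFileRead: iterate over the (Text, LongWritable) records.
package twok.hadoop.hdfs;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileIterate {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://prasad:9000"), conf);
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, new Path("/sequenceFiles/stocks"), conf);
            Text key = new Text();
            LongWritable value = new LongWritable();
            // next() fills key and value and returns false at the end of the file.
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value.get());
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}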

SequenceFileWrite.java
package twok.hadoop.hdfs;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
public class SequenceFileWrite {
/**
* @param args
* @author Nagamallikarjuna
*/
public static void main(String[] args) throws IOException, URISyntaxException {
String name = "/home/naga/bigdata/hadoop-1.0.3/jobs/daily";
@SuppressWarnings("resource")
BufferedReader br = new BufferedReader(new FileReader(name));
String line = br.readLine();
String uri = "/sequenceFiles/stocks";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);
Path path = new Path(uri);
Text key = new Text();
LongWritable value = new LongWritable();
SequenceFile.Writer writer = null;
try
{
writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
while(line != null)
{
String parts[] = line.split("\\t");
key.set(parts[1]);
value.set(Long.valueOf(parts[7]));
writer.append(key, value);
line = br.readLine();
}

}
finally {
IOUtils.closeStream(writer);
}
}
}

WriteHDFS.java
package twok.hadoop.hdfs;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.GenericOptionsParser;

public class WriteHDFS {

/**
* @param args
* @author Nagamallikarjuna
* @throws URISyntaxException
* @throws IOException
*/
public static void main(String[] args) throws IOException, URISyntaxException {
Configuration conf = new Configuration();
String otherArgs[] = new GenericOptionsParser(args).getRemainingArgs();
String localfile = otherArgs[0];
String hdfsfile = otherArgs[1];
FileSystem local = FileSystem.getLocal(conf);
FileSystem hdfs = FileSystem.get(new URI("hdfs://master:9000"), conf);
FSDataInputStream in = local.open(new Path(localfile));
FSDataOutputStream out = hdfs.create(new Path(hdfsfile));
byte[] buffer = new byte[256];
int bytesRead;
while((bytesRead = in.read(buffer)) > 0)
{
out.write(buffer, 0, bytesRead); // write only the bytes actually read, not the whole buffer
}
out.close();
in.close();
}
}
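
The manual buffer loop in WriteHDFS can also be replaced by helpers that ship with Hadoop. The
sketch below is illustrative only (the URI and paths are examples): copyFromLocalFile performs the
whole copy in one call, and IOUtils.copyBytes handles buffering and stream closing.

// Illustrative alternatives to the manual buffer loop in WriteHDFS; URI and paths are examples.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteHDFSAlternatives {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(new URI("hdfs://master:9000"), conf);

        // 1. Let the FileSystem do the whole copy.
        hdfs.copyFromLocalFile(new Path("/tmp/localfile"), new Path("/data/hdfsfile"));

        // 2. Or stream through IOUtils, which buffers and closes both streams.
        FileSystem local = FileSystem.getLocal(conf);
        FSDataOutputStream out = hdfs.create(new Path("/data/hdfsfile2"));
        IOUtils.copyBytes(local.open(new Path("/tmp/localfile")), out, conf, true);
    }
}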
