Introduction to Big Data
HADOOP is an open source, Java-based programming framework that supports the processing
and storage of extremely large data sets in a distributed computing environment. It is part of the
Apache project sponsored by the Apache Software Foundation.
• The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File
System (HDFS), and a processing part, which is the MapReduce programming model. Hadoop splits
files into large blocks and distributes them across nodes in a cluster. It then transfers
packaged code to the nodes to process the data in parallel. This approach takes advantage of
data locality, where nodes manipulate the data they have access to. This allows the dataset
to be processed faster and more efficiently than it would be in a more conventional
supercomputer architecture that relies on a parallel file system where computation and data
are distributed via high-speed networking.
• The base Apache Hadoop framework is composed of the following modules:
• Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
• Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster;
• Hadoop YARN – a resource-management platform responsible for managing computing
resources in clusters and using them for scheduling of users' applications; and
• Hadoop Map Reduce – an implementation of the MapReduce programming model for large
scale data processing.
Hadoop makes it possible to run applications on systems with thousands of commodity hardware
nodes, and to handle thousands of terabytes of data. Its distributed file system facilitates rapid
data transfer rates among nodes and allows the system to continue operating in case of a node
failure. This approach lowers the risk of catastrophic system failure and unexpected data loss,
even if a significant number of nodes become inoperative. Consequently, Hadoop quickly
emerged as a foundation for big data processing tasks, such as scientific analytics, business and
sales planning, and processing enormous volumes of sensor data, including from internet of
things sensors.
Hadoop Features
Flexibility: Hadoop is very flexible in its ability to deal with all kinds of data. Data can be of any
kind and Hadoop can store and process it all, whether it is structured, semi-structured or
unstructured.
Reliability: When machines work in tandem, if one of the machines fails, another machine
takes over its responsibility and works in a reliable and fault-tolerant fashion. The Hadoop
infrastructure has built-in fault-tolerance features, so Hadoop is highly reliable. We will
look at this feature in more detail in the HDFS sections below.
Economical: Hadoop uses commodity hardware (for example, your PC or laptop). In a
small Hadoop cluster, all your data nodes can have normal configurations like 8-16 GB RAM,
5-10 TB hard disks and Xeon processors, whereas using hardware-based RAID
with Oracle for the same purpose would cost at least 5x more. So the cost of
ownership of a Hadoop-based project is quite low. The Hadoop environment is easier to maintain
and is economical as well. Also, Hadoop is open source software, so there
is no licensing cost.
Scalability: Last but not least, there is the scalability factor. Hadoop has built-in
capability to integrate seamlessly with cloud-based services. So if you are installing Hadoop on a
cloud, you don't need to worry about scalability, because you can go ahead and procure
more hardware and expand your setup within minutes whenever required.
These 4 characteristics make Hadoop a front runner as a Big Data solution. Now that we
know what Hadoop is, we can explore its core components.
But before talking about Hadoop core components, I will explain what led to the creation of these
components.
Big Data Processing: even if part of the Big Data could be stored, processing it would take years on a single machine.
To solve the storage issue and the processing issue, two core components were created in Hadoop –
HDFS and YARN. HDFS solves the storage issue, as it stores the data in a distributed fashion and
is easily scalable; YARN solves the processing issue by managing cluster resources and scheduling jobs.
Hadoop Core Components
When you are setting up a Hadoop cluster, you can choose a lot of services that can be part
of Hadoop platform, but there are two services which are always mandatory as a part of
Hadoop set up. One is HDFS (storage) and the other is YARN (processing). HDFS stands for
Hadoop Distributed File System, which is used for storing Big Data. It is highly scalable.
Once you have brought the data onto HDFS, Map Reduce can run jobs to process this
data. YARN does all the resource management and scheduling of these jobs.
Hadoop Master Slave Architecture
The main components of HDFS are Name Node and Data Node.
Name Node
It is the master daemon that maintains and manages the Data Nodes (slave nodes)
It records the metadata of all the files stored in the cluster, e.g. location of blocks stored,
the size of the files, permissions, hierarchy, etc.
It records each and every change that takes place to the file system metadata
For example, if a file is deleted in HDFS, the Name Node will immediately record this in the
Edit Log
It regularly receives a Heartbeat and a block report from all the Data Nodes in the cluster
to ensure that the Data Nodes are live
It keeps a record of all the blocks in HDFS and in which nodes these blocks are stored
It has high availability and federation features which I will discuss in HDFS architecture in
detail
Data Node
These are slave daemons which run on each slave machine
The actual data is stored on Data Nodes
They are responsible for serving read and write requests from the clients
They are also responsible for creating blocks, deleting blocks and replicating the same
based on the decisions taken by the Name Node
They send heartbeats to the Name Node periodically to report the overall health of HDFS,
by default, this frequency is set to 3 seconds
The components of YARN are Resource Manager and Node Manager.
Resource Manager
It is a cluster level (one for each cluster) component and runs on the master machine
It manages resources and schedules applications running on top of YARN
It has two components: Scheduler & Application Manager
The Scheduler is responsible for allocating resources to the various running applications
The Application Manager is responsible for accepting job submissions and negotiating the
first container for executing the application
It keeps track of the heartbeats from the Node Managers
Node Manager
It is a node level component (one on each node) and runs on each slave machine
It is responsible for managing containers and monitoring resource utilization in each
container
It also keeps track of node health and log management
It continuously communicates with Resource Manager to remain up-to-date
Big Data
Enterprise data systems [software + hardware] and the data life cycle are characterized by:
1) Data sources 2) Data formats 3) Data properties:
a. Volume
b. Variety (combinations, e.g. structured and semi-structured)
c. Velocity – the speed at which data is generated
d. Complexity
Big Data = confluence of (transactions + operations + observations)
Ex: logs (if any operation is done, it makes an entry in the logs), CCTV footage.
Scalability
NoSQL Databases:
1. CAP theorem
2. Polyglot persistence
3. Limitations of SQL
(Examples: Neo4j, MongoDB, Solr vs. traditional SQL databases.)
NoSQL: it stands for "not only SQL", or non-relational databases. There is no single "NoSQL" database;
it is an umbrella term that groups all the non-relational databases together.
NoSQL databases are designed to overcome the limitations of SQL databases.
There are roughly 150 NoSQL databases currently; all of them are categorized into 12+
categories based on data model. Out of these categories, the following are very popular
in the current market:
• Key value stores
• Document stores
• Columnar / column based
• Graph databases
A key-value store is essentially a hash table (the same data structure as a key-value pair map) on a distributed system.
Some key-value stores are disk based and some are memory based.
• Forward indexing (Ex: PDFs):
• P1 → (words)
• P2 → (words)
• P3 → (words)
• Inverted indexing (Ex: all search engines – Google, Yahoo):
• W1 → (pages)
• W2 → (pages)
• W3 → (pages)
• If the application requires a key-value store with a huge number of keys, use a persistent key-value store.
• Persistent key-value store: e.g. an inverted-indexing application (see the sketch after this list).
• Non-persistent key-value store: e.g. a cache application.
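To make the forward-vs-inverted distinction concrete, here is a minimal, self-contained Java sketch (not tied to any particular NoSQL product) that builds an inverted index – word → set of page ids – from a forward index of page → words:

import java.util.*;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // Forward index: page id -> words on that page (illustrative data).
        Map<String, List<String>> forward = new HashMap<>();
        forward.put("P1", Arrays.asList("hadoop", "hdfs", "yarn"));
        forward.put("P2", Arrays.asList("hdfs", "blocks"));
        forward.put("P3", Arrays.asList("yarn", "containers", "hdfs"));

        // Inverted index: word -> pages containing it.
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : forward.entrySet()) {
            for (String word : e.getValue()) {
                inverted.computeIfAbsent(word, w -> new TreeSet<>()).add(e.getKey());
            }
        }

        // A search engine answers "which pages contain 'hdfs'?" with one lookup.
        System.out.println(inverted.get("hdfs"));   // [P1, P2, P3]
    }
}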
JSON objects: JSON objects are document oriented. There are 3 data types:
1) Primitive
2) Array
3) JSON object
Ex: { id : 101, name : "abc", age : 25, contact : ["c1", "c2", "c3"],
address : { street : "abc", block : "cba", area : "xyz" } }
Here address is a key, but its value is another key-value structure; this address is
another JSON object, which is an example of a JSON object nested within a JSON object.
SQL                     Document oriented
database                database
tables                  collections
records                 documents
fields                  key-value pairs
Document oriented:
It is the extended version of the key-value store.
Ex: MongoDB, CouchDB, Couchbase Server, Terrastore, Elasticsearch.
JSONObject myJSON = new JSONObject(cursor.next());   // illustrative: wrap the next document from a cursor
String name = myJSON.get("name");
JSONObject addr = myJSON.get("address");
String city = addr.get("city");
String[] cont = myJSON.get("contact");
Ex: database = db
collection name = myapp
{ _id : 10, name : "xyz", age : 25, type : "DOP" }
{ _id : 20, source : "hindi", title : "electrons", content : " ", type : "news" }
{ _id : 30, movie : "xyz", cast : [p1, p2, …], budget : " ", type : "ent" };
db.myapp.find()
db.myapp.find({ type : "ent" })
MongoDB is based on the document data model. Here the document is a JSON document, stored as a BSON
(Binary JSON) document on disk.
Document = a collection of key-value pairs treated as one object.
The document model is a wrapper over key-value: it encloses all the key-value pairs of a particular entity.
A document is atomic at storage.
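A minimal Java sketch of the same idea using the MongoDB Java driver (driver version 3.7+ and a local mongod at mongodb://localhost:27017 are assumptions; the database db, collection myapp and field values simply mirror the shell example above):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class MongoFindSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("db");
            MongoCollection<Document> coll = db.getCollection("myapp");

            // Insert one document (equivalent to { _id: 30, ..., type: "ent" }); _id must be unique.
            coll.insertOne(new Document("_id", 30)
                    .append("movie", "xyz")
                    .append("type", "ent"));

            // Equivalent of db.myapp.find({ type : "ent" })
            for (Document d : coll.find(Filters.eq("type", "ent"))) {
                System.out.println(d.toJson());
            }
        }
    }
}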
Columnar DB:
In an RDBMS the data is row oriented. For example, if we need only the order id and the aggregate
price from a 10 GB table, a row store still has to read entire rows; a columnar DB instead stores each
column (C1{ }, C2{ }, C3{ }) together.
Example table:
  1   Naga    29
  2   Shiva   30
RDBMS (row-oriented) storage:            Columnar storage:
01: (1, naga, 29); 02: (2, shiva, 30)    01:1, 02:naga, 03:29;
                                         01:2, 02:shiva, 03:30;
Table layout: row key + column families (CF1, CF2, CF3); there can be billions of row keys and billions of
columns, and each cell value carries a version/timestamp (Vn, Tn).
The timestamp exists because a columnar database has a versioning advantage: data
versioning (ex: stock market data).
It also gives speedier analytical processing.
Redis:
Running redis-cli opens a prompt such as:
redis 127.0.0.1:6379>
Table          Row key       Column family : column = value (timestamp)
Stockmarket    "Infy 000"    BSE:infy = 2305.804 (TS1), 2400.07 (TS2)
Stockmarket    "Infy 001"    NYSE:infy = …
In case of billions of values in a column, the data is divided into partitions and distributed to
all the computers.
Ex: table 'Weather', row key 'Bangalore', column temp:temperature → value + timestamp (K, V, T).
Column family name → temp
Column name → temperature
Apache Phoenix → gives a SQL skin over HBase.
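A minimal Java sketch of how such a row key / column family / column / timestamp cell is written and read with the HBase client API (the HBase 1.x client API is assumed; the table stockmarket, family bse and qualifier infy are just the illustrative names from the example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("stockmarket"))) {

            // Write: row key "Infy 000", column family "bse", column "infy".
            Put put = new Put(Bytes.toBytes("Infy 000"));
            put.addColumn(Bytes.toBytes("bse"), Bytes.toBytes("infy"), Bytes.toBytes("2400.07"));
            table.put(put);

            // Read back up to 3 versions of the cell (versioning = the timestamp advantage).
            Get get = new Get(Bytes.toBytes("Infy 000"));
            get.setMaxVersions(3);
            Result result = table.get(get);
            byte[] latest = result.getValue(Bytes.toBytes("bse"), Bytes.toBytes("infy"));
            System.out.println("Latest value: " + Bytes.toString(latest));
        }
    }
}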
Graph database:
vertex/node – an entity. Ex: the Oracle of Bacon uses a graph database.
edge – a relationship
Ex of an entity/node: name : "abc"
age : 23
place : "banglr"
Ex: Neo4j
It has 100% ACID features.
Topics covered so far:
• CAP theorem
• Document oriented
• Columnar and key-value stores
Humongous → huge; the MongoDB name is derived from this word.
GFS: Google File System. Google built GFS and Google MapReduce (MR); the MapReduce idea has its origin in
LISP's map and reduce functions.
Apache: the Lucene project provided the search-engine building blocks (an IR library: pre-processing, indexing
and search were available, but the crawler was missing; for a search engine the crawler is the main
intelligence). To add it, they came up with Nutch (2003/2004), which in turn produced:
→ NDFS (Nutch Distributed File System)
→ Nutch MapReduce
Yahoo → founded 1994, known for mail.
In the early years Yahoo used Google's search interface; then Google started its
own mail service, and Yahoo started its own search infrastructure (Dreadnaught).
So the lineage is: LISP → MapReduce, and
Lucene →
Nutch →
Hadoop
• Hadoop is one of the top-level projects of the ASF (Apache Software Foundation)
• Hadoop is write once, read many times (HDFS)
• Hadoop is a batch processing engine (MR)
• Hadoop can crunch terabytes of data quickly by processing it in parallel
• Hadoop has 3 main advantages: time, scale and cost
Definition:
Apache Hadoop is an open source, distributed, large-scale, reliable framework written in Java,
for creating and deploying large-scale distributed applications which process humongous (huge)
amounts of data on top of a cluster of commodity hardware (low-cost hardware).
Hadoop is now the de-facto standard for distributed computing.
Components of Hadoop :
It has two components ,
1. Core
2. Eco system
Hadoop core                                   Hadoop ecosystem
Hadoop-1.X.X                                  Pig, Hive, HBase,
  HDFS (storage platform)                     Sqoop, Oozie, HUE,
  MapReduce (platform + application)          Mahout, Tez, Ranger,
  Common utilities                            Sentry, Knox Gateway, ZooKeeper,
Hadoop-2.X.X                                  Lipstick, Flume, Falcon,
  HDFS 2 (platform)                           Chukwa, Scribe, Kafka,
  MapReduce (application)                     Azkaban, Ambari, Samza,
  YARN (platform)                             Drill, Impala, Presto, Tajo,
  Common utilities                            BookKeeper, Avro, Parquet,
                                              Trevni, Cassandra, XASecure,
                                              Thrift, MRUnit
NOTE: Many organizations took the Apache Hadoop code base and redistributed their
own distributions with some customizations and add-ons.
Some of the prominent Hadoop distributions:
1. Cloudera, 2. Hortonworks, 3. MapR, 4. Pivotal [EMC2], 5. BigInsights [IBM]
There is no overlap between the two, so Hadoop will not replace the RDBMS (the traditional data processing
system); it is designed to complement the RDBMS.
The RDBMS cannot handle the entire OLTP data, so older data is moved to the data warehouse, and Hadoop
handles that data.
Comparing Hadoop with the data warehouse and ETL (the data warehousing stack):
We can replace the traditional data warehousing stack with Hadoop in 2 ways.
1. Include Hadoop (as ETL or warehouse) in the traditional, existing warehousing stack.
2. Build an end-to-end Hadoop-based warehousing stack.
Traditional stack: RDBMS → ETL (ex: Informatica, Pentaho) → warehouse (ex: Teradata, Netezza,
Greenplum, Vertica).
Fig: Hadoop used with the existing traditional data warehouse – sources (flat files, RDBMS, e.g. Teradata)
are loaded into HDFS, HDFS feeds the ETL layer, the ETL layer loads the traditional warehouse, and BI
tools query the warehouse.
Some transformations, like data standardization, are difficult in Hadoop (for example: I.B.M, IBM, I,B,M should
all be standardized to IBM, which is difficult to do with Hadoop algorithms alone).
So Hadoop gives its input to the traditional ETL tool using a connector; the data is then transformed
and loaded into the warehouse.
Fig: end-to-end Hadoop-based warehousing stack – data lands in HDFS, goes through transformations
(T1 … Tn) back into HDFS, and an HDFS-based warehouse serves BI.
Hadoop-based warehouses are: Hive, Tajo, Drill, Presto.
• Schema on write → first create the table schema and then load the data.
• Schema on read → load the data first, then create the schema.
Hadoop vs Grid Computing
(Fig: separate clusters of machines – one grid for storage and one for processing.)
If we query in grid computing, the cluster where processing is done has to pull the data over the network
from the cluster where the data is stored.
But in Hadoop every node both stores the data and processes the data (data localization).
In grid computing we have separate grids for storing and processing, whereas in Hadoop every
node has both storage and processing capability.
Hadoop Vs Volunteer Computing
Volunteer computing: Ex: SETI@home
(Fig: a central server distributes work units to volunteer workstations w1 … w10.)
Volunteer computing                                Hadoop
The data is moved to where the computing           The computing is done where the data is
is taking place.                                   available.
Analogies:
• Hadoop is an operating system or kernel for data:
  hardware → OS → application, just as
  raw data → Hadoop → insights.
• Hadoop is a data refinery:
  crude oil → machinery → products, just as
  raw data → Hadoop → insights.
FEATURES OF HADOOP :
1. Scalability
2. Availability
3. Reliability
4. Simplicity
5. Fault – tolerance
6. Security
7. Modularity
8. Robust
9. Integrity
10. Accessibility
• Hadoop has all of the above features except security (it is not 100% secure out of the box).
• Hadoop is distinguished from other distributed systems by its scalability.
• Hadoop scales out almost without limit.
Hadoop architecture :
It is nothing but the HDFS architecture + MapReduce architecture in Hadoop 1.0, and the HDFS
architecture + YARN architecture in Hadoop 2.0.
There is no separate Hadoop architecture.
(Fig: Hadoop 1.0 architecture vs Hadoop 2.0 architecture.)
HDFS architecture: HDFS is a master-slave architecture; it has one master and 'n' slaves. Here the
master is the name node and the slaves are the data nodes.
NOTE: The HDFS architecture is common to both Hadoop 1.0 and Hadoop 2.0.
In Hadoop 2.0, HDFS may have more than one name node (federation of name nodes).
(Fig: more than one name node – N1, N2, N3 – over data nodes d1, d2, d3.)
Yarn architecture
In Hadoop 2.0, YARN is introduced as the resource management system; it replaces the MapReduce
platform. YARN stands for Yet Another Resource Negotiator. It is a master-slave architecture with
one master, i.e. the Resource Manager, and n slaves, i.e. the Node Managers.
(Fig: in Hadoop 1.0 the masters are the Job Tracker and the Name Node; in Hadoop 2.0 they are the
Resource Manager and the Name Node.)
Note: HDFS has two types of data: one is the actual data, i.e. the user's data; the second one is
metadata, i.e. data about the user's data.
2. Data node
• It is a slave component of the HDFS cluster and one of the slave components of the Hadoop cluster.
• It is responsible for holding a portion of the actual data.
• It is a dumb node, used only for low-level operations like reading and writing.
• All the data nodes interact with HDFS clients and other data nodes under the direction of the name
node.
• Data nodes send heartbeats to the name node to inform it of their status as well as other
information; the default interval is 3 seconds, i.e. every 3 seconds each data node tells the name node
whether it is alive. This interval can be configured: as the number of seconds increases, the heartbeat
I/O decreases.
In a Hadoop cluster (irrespective of size) we have one secondary name node; it is used to do
the following things:
1) Checkpointing (by default checkpointing happens every 1 hour, and it is configurable).
2) Periodic snapshot or backup.
With the secondary name node we have a backup of the name node metadata, but it is not an exact
mirror image of the name node's metadata.
Metadata in the secondary name node = from cluster start up to the last checkpoint.
So there is a potential loss of the metadata written between the last checkpoint and the current moment.
To overcome this metadata loss, the secondary name node is decomposed into two nodes:
• One is the checkpoint node (to perform checkpointing)
• The other is the backup node → a real-time mirror image of the name node.
Fig: HDFS write example – an HDFS write client writes the file /input/xyz of 1 GB with the replication
factor configured to 1. The name node records the metadata /data/xyz-b1(1), /data/xyz-b2(1), …,
/data/xyz-b16(1) and maps each block to a data node (b1[D1], b2[D2], b3[D3], …, b16[D3]). The SNN
(secondary name node) acts as the checkpoint node (or backup node). The blocks B1, B2, B3, … are spread
across data nodes D1, D2, D3.
A file of 1 GB is physically converted into 16 blocks if the block size is 64 MB.
Every 3 seconds each data node sends a heartbeat to the name node, and every 10th heartbeat the
data node sends a block report, i.e. information about the blocks of data it is holding.
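A small Java sketch of how a client can see this block layout for itself through the HDFS API (the path /input/xyz and the pseudo-distributed address hdfs://localhost:9000 are assumptions for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.util.Arrays;

public class BlockLayoutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        FileStatus status = fs.getFileStatus(new Path("/input/xyz"));
        System.out.println("File size  : " + status.getLen());
        System.out.println("Block size : " + status.getBlockSize());   // e.g. 64 MB -> a 1 GB file has 16 blocks

        // Ask the name node which data nodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}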
Map Reduce Daemon
Job tracker ( JT )
• It is a master component of the MapReduce cluster and one of the master components of the Hadoop cluster.
• It is a point of interaction between Map Reduce cluster and Map Reduce client.
• It is responsible for two things.
1. Resource management
2. Job management
• It is memory intensive as well as I/O intensive.
• The Job tracker decomposes an MR job into tasks, and these tasks are assigned to task trackers.
• It is a single point of failure.
Task tracker ( TT )
• It is a slave component of the MapReduce cluster and one of the slave components of the Hadoop cluster.
• It is the worker node that launches the tasks assigned by the Job tracker, monitors the tasks
and reports the task status to the Job tracker.
• All the task trackers give heartbeats to the JT to inform it of their status.
• All TT are participating in the distributed computing in the direction of Job tracker.
History server
• The history server is used to maintain the job history.
• By default it is enabled as an integral part of the JT.
• For a bigger or production cluster it is better to run the history server as a separate daemon.
How MapReduce works:
(Fig: (1) the client code – Java compiled into a jar file, or a Pig script run through the Pig engine – is
submitted as a job; the job is split into tasks that run on the Task Trackers (TT), which are co-located
with the Data Nodes (DN).)
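To make the flow concrete, here is a minimal word-count job sketch using the Hadoop MapReduce API (the org.apache.hadoop.mapreduce "new" API of Hadoop 2.x is assumed; input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class WordCount {

    // Mapper: for every input line, emit (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer (also used as combiner): sum the 1s for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar and submitted with hadoop jar, the JobTracker (or, in Hadoop 2.0, the YARN Resource Manager plus an MR Application Master) schedules the map tasks next to the data blocks.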
Yarn Daemons :
Resource manager
• It is master component of Yarn and one of the master component of Hadoop cluster.
• It is point of interaction between Yarn cluster and Yarn client.
• Examples of YARN clients → MapReduce, Spark, Storm, Giraph, MPI, Tez.
• YARN is a resource management system, and the resource manager is responsible for resource
scheduling.
• Resource manager is a single point of failure.
Node manager
• It is slave component of Yarn and one of the slave component of Hadoop cluster.
• It is responsible for launching the containers , monitor the containers , killing the containers.
• It gives heartbeats to the resource manager to inform it of its status and reports node resource
information to the resource manager.
Container : It is nothing but a pool of resource ( CPU, RAM , Disk , IO etc ).
Application Master :
• In YARN, to support multiple paradigms like MR, Tez, Spark, etc., we have paradigm-specific
application masters.
• A paradigm-specific application master decomposes the application into one or more
jobs, and each job into one or more tasks.
• It does job management or application management.
History server :
In Yarn History server is not the integral part of Resource manager it should be run separately.
(Fig: a YARN client submits, e.g., a Spark job; the AM (Application Master) is launched in container C1 on
one node manager, and further containers C2 … C5 are launched on other node managers, each of which is
co-located with a data node.)
The AM asks the RM for the required resources; the RM grants containers to the AM, and the AM then
launches the containers based on data location (which it learns from the name node).
Uber mode: if there is only one small task, the AM runs the task inside its own container instead of asking
the RM for more resources.
Once the tasks are completed, the node manager kills the containers.
Installing Ubuntu :
To get the Ubuntu OS we have two ways :
1. Dual boot
2. VM ware player
Step 1: Create a new hard disk partition, or free an existing partition and use it.
Step 2: Download the Ubuntu ISO image from the Ubuntu website (choose either 32- or 64-bit based
on your system architecture). After downloading the ISO image, convert it into a bootable image by
using PowerISO (freeware) and install it.
NOTE: The bootable image is created on an external storage device like a USB drive or external hard disk,
so the external device has to be formatted before using it.
Step 3: Restart the computer with the external device inserted; at the time of restart go to the boot
options by pressing the proper key and change the boot priority to the external device.
Step 4: This starts the Ubuntu installation. In this installation process we have to choose some
options like "Install Ubuntu alongside Windows" and we need to give user details; then
restart the computer after changing the boot priority back to the internal hard disk.
Installing Java:
NOTE: After installing Ubuntu, to install any application we have the package installer apt-get
(for CentOS / Red Hat it is yum). We need to update the package index before installing any
third-party software:
sudo apt-get update
To install any software, use the following command:
sudo apt-get install <package>
To search for any package or software in the Ubuntu repository, use:
sudo apt-cache search <package>
To remove software, use:
sudo apt-get remove / purge <package>
Edit the shell profile:
vim /home/$USER/.bashrc   (in vim, ':$' jumps to the end of the file)
Then add:
export JAVA_HOME=
export HADOOP_HOME=/home/radha/bigdata/hadoop-1.2.1
After this, start-all.sh brings up the daemons:
NN / JT → on the local system
DN / TT / SNN → can be on remote machines
fs.default.name
mapred.job.tracker
dfs.replication        (Basic installation – not for production, only for
                        development and testing.)
dfs.name.dir
dfs.data.dir
• fs.default.name → this parameter tells where the name node is running; the default value
of this parameter is the local file system, i.e. file:///
The overriding value is hdfs://<hostname>:<port>/
The preferred port is 9000 (or 8020).
This overriding parameter is placed in core-site.xml.
• mapred.job.tracker → this parameter tells where the Job tracker is running; the default
value is "local".
The overriding value is <hostname>:<port>.
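As a sketch, the same overriding values can also be set programmatically on a Hadoop Configuration object (the localhost host/port values below are only the usual pseudo-distributed assumptions, not something mandated by the notes); in a real cluster they would normally live in core-site.xml, mapred-site.xml and hdfs-site.xml as described next:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import java.net.URI;

public class PseudoModeConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");   // where the name node runs
        conf.set("mapred.job.tracker", "localhost:9001");       // where the job tracker runs
        conf.set("dfs.replication", "1");                       // single-node testing value

        // Connecting proves the name node address is being picked up.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        System.out.println("Connected to " + fs.getUri());
        fs.close();
    }
}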
1. hadoop-env.sh
This file is used to place Hadoop related environment variable like JAVA_HOME,
name node env variables , data node env variables etc.
2. core-site.xml
This file is used to place Hadoop cluster parameter ( overriding parameter of cluster or
core).
By default the file is empty xml file.
3. mapred-site.xml
This file is used to place the overriding parameter of Map Reduce group.
By default it is empty xml file.
4. hdfs-site.xml
This file is used to place the overriding parameters of the HDFS group.
By default it is an empty XML file.
5. slaves
This file is used to place the slave machine names (data node + task tracker).
One machine name per line.
6. master
This file is used to place machine name of the secondary name node.
vim hdfs-site.xml
mapred-site.xml
core-site.xml
All modifications are done; now we have to format HDFS using the following command:
hadoop namenode -format
Don't run the format command again from now on (it is a one-time command).
Now we are ready to start Hadoop in pseudo-distributed mode using the following scripts:
1. start-all.sh → all daemons
2. start-dfs.sh → only the HDFS daemons
3. start-mapred.sh → only the MapReduce daemons
Testing HDFS:
hadoop fs -ls /
Testing MapReduce:
To switch from pseudo mode to local mode / to re-format:
stop-all.sh
rm -rf *
jps
hadoop namenode -format
To stop Hadoop we have the following scripts:
stop-all.sh
stop-dfs.sh      → stops the name node, data nodes and secondary name node
stop-mapred.sh   → stops the Job tracker and task trackers
HADOOP
Cluster Setup
Hadoop Distributed file system :
• It is a distributed file system , it is designed based on Google GFS , its early name was NDFS
( Nutch Distributed File System )
• HDFS is a one of the core components of Hadoop and it covers the distributed storage in
Hadoop.
• It is not an operating system or kernel file system ,. it is a user space file system
• HDFS is a cheaper or cost effective mechanism to stores hundreds of terabyte or petabytes
with good replication factor compared to spark, NASS and RA
• HDFS is designed , write once read many times.
• HDFS is a flat file system and it provides sequential access.
• Definition of HDFS : It is an open source , distributed file system ( multi terabyte or petabyte)
written in JAVA to store vast amount of data on top of cluster of commodity hardware in
highly reliable and fault tolerance manner HDFS, design principles or goals.
It is designed based on the following assumptions :
1. Handling hardware failure
2. Fault tolerance
3. Streaming data access
4. Large blocks
5. Large files
6. High access/ Read throughput
7. Write once and read many times
8. Deploy on commodity hardware
9. Independent of software and hardware architecture
10. Reliability and scalability
11. Simple coherency model
HDFS limitations :
While achieving the above design principles, HDFS is constrained by the following
limitations:
1. Small files → more metadata, more Java processes, and more seeks.
2. No random access → HDFS is a flat file system; access is sequential by default.
3. High latency → because of preparing and gathering metadata, the client has to
communicate with N data nodes.
4. No multiple writers (i.e. a single writer per file).
5. Does not allow arbitrary file modification or record-level inserts, updates or deletes, but
it does allow appends.
HDFS concepts :
HDFS metadata is managed by the name node, and this metadata contains the following things:
File-to-block mapping
Block-to-machine mapping
Other block-related parameters (like replication)
Ex: metadata
file – block (properties) → machines
/xyz – b1(P) → (slave1, slave2, slave3)
/xyz – b2(P) → (slave3, slave4, slave5)
:
:
/abc – b20(P) → (slave5, slave7, slave9)
This entire metadata sits in the name node's primary memory, i.e. RAM.
Persistence of the file system namespace / fsimage / metadata:
Though the name node keeps the metadata in memory, it also writes this metadata into two files which
are stored in dfs.name.dir (i.e. ${dfs.name.dir}/name/current); those two files are edits and
fsimage, and this writing process happens periodically.
Edits:
Edits is a small, temporary window file; it is used to record all client interactions with HDFS. By
default all live transactions go to the edits file (metadata creations, updates, deletions); the
window size is 1 hour or 64 MB by default.
FS image:
• It is the complete metadata image.
• It is a read-only file with respect to clients.
• It is used for metadata lookup (for a lookup, both the FS image and edits are required).
• There is a small difference between the memory version of the metadata and the disk version: in the
memory version we have the machine info, whereas in the disk version there is no machine
info, because block-to-machine info is more volatile.
NOTE : “ Every 3 seconds the data node will be sending the heart beats and every
10th heart beat will be a block report which will be sent to name node.”
• If the cluster is stopped and restarted, the name node retrieves the disk version of the metadata into
memory.
• At this moment there is no block-to-machine info in the metadata. Once the cluster is started, all the
data nodes give heartbeats to the name node, and every 10th heartbeat is a block report: the data
nodes send the list of blocks they are holding as a block report, and the name node
consolidates these individual block reports and finally maps the block-to-machine info in the
memory version of the metadata.
• If the metadata is 5 GB and RAM is 4 GB, then the name node will not start.
Check pointing :
• It is a process done by secondary name node or check point node.
• It is used to merge the edits file into FS image and reset the edits.
• After completion of this process, the secondary name node pushes the merged FS image back to the name node.
• This checkpointing process happens every 1 hour (configurable).
new FS image = old FS image + edits
edits = reset
(Fig: the secondary name node downloads the FS image and edits from the name node, merges them, and
pushes the result back, while live transactions keep going to the edits file on the name node.)
Ex: if checkpointing is started at 1.00 PM, the secondary name node completes checkpointing by
merging the FS image and edits and then resetting the edits. Suppose this complete process finishes and the
FS image and edits are loaded back at 1.02 PM; these 2 minutes of edit data are written to a temporary file
edits.new on the name node. Once the merged edits file is available, the name node renames edits.new to
edits and continues writing into this edits file.
Checkpoint-related parameters are available in core-site.xml: with fs.checkpoint.period in
core-site.xml we can change the interval at which the checkpoint should be done.
Cluster balancing :
• HDFS offers a balancing algorithm to balance the data nodes' storage: the balancer
identifies over-utilized and under-utilized nodes in the Hadoop cluster and moves
data from the over-utilized nodes to the under-utilized ones.
• The balancer is applicable only for the fully distributed mode.
• The balancer requires more bandwidth.
Protocols :
HDFS has several protocols which are used in the IPC (Inter-Process Communication) of the
HDFS components (name node, data node, clients); these protocols are built on top of Hadoop's RPC stack.
IPC = RPC + serialization
(RPC = Remote Procedure Call)
Hadoop has its own RPC engine, and the serialization framework is Google Protocol Buffers.
The following are the protocols which are used in HDFS IPC:
• Client protocol
• Client–data node protocol
• Data transfer protocol
• Data node protocol
• Inter-data-node protocol
(Fig: (1) client ↔ name node, (2) client ↔ data node, (3) data node ↔ data node.)
NOTE: The name node never initiates any protocol. In an HDFS cluster there are several IPC calls; in
all these calls the name node is always a participant, but it never initializes any communication
with other components in HDFS.
Heartbeats:
HDFS has a heartbeat mechanism; it is used to know the healthiness or status of the data nodes.
The data nodes initiate this process and report their status to the name node. This happens
periodically, by default every 3 seconds (configurable).
The protocol used for this process is the data node protocol.
Replication:
In HDFS, data replicates at the time of the data writing or loading process.
Re-replication:
It is a process that happens in the following scenarios:
1. Under-replication
2. Over-replication
3. Mis-replication
(1) Under-replication:
Ex: file xyz → 200 MB; block size 64 MB, so 200 MB / 64 MB → 4 blocks, i.e. b1 b2 b3 b4,
with replication factor = 3.
(Fig: metadata plus initial placement – b1…b4 each stored 3 times across data nodes D1–D4; then D1
goes dead, leaving some blocks under-replicated.)
In this scenario name node replicates the data by instructing data nodes.
(Fig: after re-replication the surviving data nodes D2–D4 again hold 3 replicas of every block. Later D1
comes back, so some blocks temporarily have 4 replicas.)
Blocks that now have more than 3 replicas are over-replicated, while the rest are properly replicated;
the name node asks the data nodes containing the extra replicas to delete them, bringing every block back
to 3 replicas.
(Fig: a smaller cluster where only two data nodes are available – each holds one replica of b1…b4, so the
third replica of every block cannot be placed.)
Although the name node tries to re-replicate the data, 4 replicas are still missing (1 replica for each block);
this is because no further data node is available to hold them.
The missing-replica scenario occurs when the number of data nodes is less than the replication factor.
Missing blocks:
This scenario occurs when all replicas of a particular block (or blocks) are missing; that is a
missing block.
Ex: assume b1 has 3 replicas r1, r2, r3 on d1, d2, d3 respectively. If d1, d2 and d3 are
simultaneously dead, the name node no longer has any live copy of block b1, i.e. b1 is
missing.
The name node can never recover a missing block on its own.
Data integrity:
HDFS achieves data integrity with the help of checksums. This checksum concept is applied at the
block level.
• At the time of loading, for every block the HDFS client generates a checksum (e.g. of type MD5) and
stores this checksum in a file named <blockid>_<random no>.meta along with the actual block file.
• This checksum is an overhead of roughly 512 bytes per block.
• At the time of data retrieval, the HDFS client again generates checksums for the blocks (on-the-fly
checksum generation); this on-the-fly checksum of each block is compared with the
checksum that was created at the time of data loading.
• If the checksums match, those matched blocks are called validated blocks; otherwise the
blocks are corrupted. (A sketch of this compare-on-read idea follows.)
The block scanner report uses the same process.
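A simplified illustration of the compare-on-read idea in plain Java, using java.util.zip.CRC32 (this is only a sketch of the principle; HDFS's actual checksum type, chunk size and .meta file layout differ):

import java.util.zip.CRC32;

public class ChecksumSketch {

    // Compute a checksum over a chunk of block data.
    static long checksumOf(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] blockChunk = "some bytes belonging to block b1".getBytes();

        // At load time: compute and "store" the checksum (HDFS keeps it in the .meta file).
        long storedChecksum = checksumOf(blockChunk);

        // At read time: recompute on the fly and compare with the stored value.
        long onTheFlyChecksum = checksumOf(blockChunk);
        if (onTheFlyChecksum == storedChecksum) {
            System.out.println("validated block");
        } else {
            System.out.println("corrupted block");
        }
    }
}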
Disk failures:
HDFS handles disk failures for both the name node and the data nodes as follows.
Name node: to handle disk failures of the name node we use the following ways (to protect
the metadata):
1. Having RAID as the storage device.
2. Writing the metadata to multiple hard disks.
To write the metadata to multiple hard disks we use the parameter dfs.name.dir, whose value is
a comma-separated list of storage locations (location 1, location 2, location 3, …).
Data node:
We handle disk failures of a data node in the following ways:
1. Having a good replication factor.
2. Having multiple hard disks for every data node (2 or more).
(Fig (1): a name node over data nodes D1–D4, each with a single hard disk. Fig (2): the same cluster where
each data node has two hard disks, HD1 and HD2.)
In fig (2) every data node has 2 hard disks; the data node writes 100% of its data to both HD1 and
HD2 (HD2 is a mirror of HD1) and vice versa. This not only handles disk failure, it also doubles
the parallel I/O.
To provide this kind of setup we have to use the parameter dfs.data.dir, and the value is (HD1
location, HD2 location). The HD1 location has to be the same (absolute path) on all data nodes, and
likewise the HD2 location.
The disadvantage of this setup is that it increases the storage cost.
The advantages are:
1. Handling disk failures.
2. Parallel I/O, which increases availability.
For a small HDFS cluster having one rack, the above criteria are applicable.
For a bigger HDFS cluster having multiple racks, the replica placement criteria are as follows.
Say we have block b1 with replicas r1, r2, r3; the replica placement for block b1 in a multi-rack
Hadoop cluster works like this:
The first replica goes to any one of the racks, say R1 (Rack 1); inside R1 the replica goes to any one of
the nodes, say D5 (based on topology distance and disk utilization), i.e. the first replica is on R1D5.
The second replica goes to any rack other than R1, say R4; inside R4 the second
replica goes to any one of the data nodes, say D10, i.e. R4D10. The third replica goes to the second
replica's rack, i.e. R4; inside R4 it goes to any data node other than the second replica's data node
(D10), say D15, so the third replica is on R4D15.
This concept is applicable to any block.
NOTE: This is HDFS's default replica placement policy for a multi-rack environment:
1/3 of a block's replicas go to one rack and 2/3 go to another rack.
In HDFS, when a file is split into n blocks, the first (n – 1) blocks are exactly the block size and the
nth one is less than or equal to the block size.
(Fig: HDFS write path – an HDFS write client on, say, a Linux system writes /xyz (1 GB, replication = 3). The
client buffers the data, computes checksums over a checksum window, and flushes full blocks B1, B2, B3, …;
each block file plus its .meta checksum file is written to a pipeline of data nodes (replicas r1 → r2 → r3,
e.g. D1 → D2 → D3 out of D1–D4), while the name node records the metadata and answers the client's
queries.)
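A minimal sketch of such a write client using the HDFS Java API (the target path /xyz, replication factor 3 and 64 MB block size come from the example; the local source file /home/user/xyz and the address hdfs://localhost:9000 are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

public class HdfsWriteClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Create /xyz with replication factor 3 and a 64 MB block size.
        Path target = new Path("/xyz");
        FSDataOutputStream out = fs.create(target, true, 4096, (short) 3, 64L * 1024 * 1024);

        // Stream the local 1 GB file; HDFS cuts it into blocks and pipelines the replicas.
        try (InputStream in = new BufferedInputStream(new FileInputStream("/home/user/xyz"))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) > 0) {
                out.write(buffer, 0, read);
            }
        } finally {
            out.close();
            fs.close();
        }
    }
}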
HDFS client / HDFS accessibility :
HDFS has several clients some of them are :
1. FS shell
2. Admin shell
3. Web interface ( read only)
4. Web HDFS
5. Java HDFS client API
HDFS Commands:
HDFS has several commands; these commands are used to perform file operations, archival,
backup and administrative operations.
HDFS commands are categorized into two groups:
1. User commands
2. Admin commands
User commands:
HDFS has the following user commands to perform file operations, backups, archival, etc.:
1. fs
2. fsck
3. archive
4. distcp / distcp2
The HDFS root is '/'.
The relative path base in HDFS is /user/$USER.
NOTE: HDFS is not a kernel file system; there is no executable permission for files, while there
is an executable permission for directories (to access their children).
In HDFS, commands like cd, pwd and many other Linux commands won't work.
In HDFS a directory is an empty entry, and the size occupied by a directory is 0 bytes.
fs:
fs is a generic file system user client, used to perform file system operations.
NOTE: The dfs command is a variant of fs, and dfs works only for the distributed file system.
fs has several command options to perform individual file and directory operations like file listing,
directory creation, file writing, file reading, etc.
ls, lsr → these two are used to list the files under a specified path in HDFS; lsr lists recursively
under the specified path.
Usage of ls:
Ex: hadoop fs -ls /
hadoop fs -ls <path>
Usage of lsr:
hadoop fs -lsr <path>
Ex: hadoop fs -lsr /
NOTE: We can use HDFS commands and Linux commands in combination with the help of
Unix/Linux pipes.
Usage:
hadoop fs -mkdir <path>/<dirname>
Ex: hadoop fs -mkdir /stocks → under root.
hadoop fs -mkdir nyse → under the default path.
(OR)
hadoop fs -mkdir /user/$USER/nyse
There are 3 commands for loading data:
1. copyFromLocal
2. put
3. moveFromLocal
All these 3 are used to load data from the local system (Linux) to HDFS.
copyFromLocal and put are the same; these two are like copy-and-paste (they load the file to the
destination and keep a copy at the source).
moveFromLocal: like cut-and-paste (cut from local and paste into HDFS).
hadoop fs -put <source> <dest>
du, dus:
These are used to get the space consumed by files and directories under a specified path.
du gives individual usages.
dus gives a summary / consolidated usage.
Usage: hadoop fs -du <path>
hadoop fs -du /
hadoop fs -dus /
count:
It is used to get the number of directories, number of files, size and quota information
(space quota + file quota).
Usage: hadoop fs -count /
hadoop fs -count -q /
cp: copies within HDFS (HDFS → HDFS).
hadoop fs -cp /logs /user/$USER/
mv: moves within HDFS (HDFS → HDFS).
hadoop fs -mv <src> <dst>
Ex: hadoop fs -mv /logs /mylogs
tail command: it is used to get the last 1 KB of data of a file under a specified path.
Usage: hadoop fs -tail <path>
Ex: hadoop fs -tail /stocks/README.txt
setrep: it is used to alter the replication factor of files and directories which already exist in
HDFS.
NOTE: In HDFS, replication is applicable only to files, not to directories.
Usage: hadoop fs -setrep [-R] [-w] <rep> <path>/<file>
Ex: hadoop fs -setrep -R 3 /
In this case the replication for all the files under the root is set to 3.
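For completeness, the same change can be made from a Java client through FileSystem.setReplication (a sketch; the path /stocks/README.txt is the illustrative file used above, and hdfs://localhost:9000 is an assumed address):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class SetReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        // Equivalent of: hadoop fs -setrep 3 /stocks/README.txt (files only, like the shell command).
        boolean changed = fs.setReplication(new Path("/stocks/README.txt"), (short) 3);
        System.out.println("Replication change accepted: " + changed);
        fs.close();
    }
}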
chmod, chown, chgrp: these are used to alter file permissions, ownership, group, etc.
HDFS file permissions: HDFS has file permissions like Linux; every file has an owner, group, others and
a file type.
In a listing, the 1st column is the file type and permissions,
the 2nd column is the replication factor,
the 3rd column is the owner.
(If the new and old user are in the same group, then the new request …)
chmod → to change the file permissions for owner, group and others.
Usage: hadoop fs -chmod -R <mode> <path>
fsck:
• This command is used to check the healthiness of the file system under a specified path.
• This command provides info such as the number of directories, files, size, blocks,
validated blocks, corrupted blocks, under-replicated blocks, over-replicated blocks, missing blocks and
the default replication.
Usage:
hadoop fsck <path> [-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]
-openforwrite → info about the files that are undergoing the writing process; by default HDFS
displays only the files that are already written.
hadoop fsck / -delete → this command deletes only the corrupted files (blocks).
hadoop fsck / -move → moves the corrupted files to /lost+found.
distcp / distcp2 → distcp stands for distributed copy.
NOTE: All operations with fs are sequential.
distcp is a MapReduce job (map only) for migrating data from one system to another
system in parallel.
It is used for 2 scenarios:
1. Data loading from a source system into HDFS.
2. Data backup from one HDFS cluster to another.
Data loading: in this scenario our source system is any file system like Linux, Windows, etc., and we
move the data from the source to HDFS in parallel.
The degree of parallelism is dependent on the source system's capability.
For this data movement, both the source and destination locations must have a prefix, that is, a protocol.
Data backup: this is very common in Hadoop clusters; here we take a backup of either the entire
source HDFS or some directories from the source HDFS to the destination HDFS.
The degree of parallelism is dependent on the source and destination clusters.
Usage:
hadoop distcp [options] <srcurl>* <desturl>
Options:
-p → preserves status; useful when a data backup is taken from one HDFS to another HDFS.
-i → ignores failures (e.g. a pipeline break).
-m → tells how many copies should run in parallel.
Ex: -m 5 → 5 copies at a time.
Archive:
• It is used to convert n HDFS files into a single HDFS file, i.e. an archive file.
• In general, archival provides compression; in HDFS, archival does not provide any
compression.
• The objective of archival is reducing the metadata on the name node by reducing the
number of blocks.
Archival is a MapReduce program.
Ex: with 100 small files, each file has its own last, partially filled block (the first (n-1) blocks of a file
are exactly the block size and the nth block holds the remaining bytes), so 100 files mean 100 such extra
blocks and 100 sets of metadata entries on the name node.
The archive file has the extension .har.
Usage: hadoop archive -archiveName <name> -p <parent path> <src>* <dest>
hadoop fs -mkdir /archive
hadoop archive -archiveName nyse.har -p /nyse /archive
• The entire archival data is stored in one file, that is part-0, and its replication factor is
dfs.replication.
• In the archive we also see two more files, _index and _masterindex; the replication
factor is 5 for these two files. These two files hold the meta info about the archive.
NOTE: There is no un-archival process for archived data in HDFS.
• To retrieve individual files from archived HDFS data, HDFS introduced a wrapper file system,
HAR (Hadoop Archive file system), and the protocol is "har":
hadoop fs -ls har:///archive/nyse.har/
Hadoop HDFS also provides another command, the offline edits viewer, to read the binary edits file in a
human-readable format.
fetchdt is used to fetch, renew and cancel delegation tokens, which are sent by clients to the name
node. We may need to modify some of the configuration parameters.
hadoop fetchdt <opts> <token file>
For every client the name node generates a delegation token and stores it on the client machine; for
this the command is:
hadoop fetchdt --webservice <url> <token file>
For renewal (to renew a token file):
hadoop fetchdt --webservice --renew <url> <token file>
For cancel:
hadoop fetchdt --webservice --cancel <url> <token file>
dfsadmin:
It is an admin shell client to perform frequent administrative operations.
Usage: hadoop dfsadmin [generic options] [command options]
dfsadmin has several command options which are used to perform individual tasks.
Command options:
-report: it is used to get the cluster summary and individual data node summaries.
This summary includes total capacity, present capacity, DFS remaining, DFS used, etc.
Usage: hadoop dfsadmin -report
Safe mode: the name node uses a special mode called safe mode for its repair and maintenance.
Safe mode makes HDFS a read-only file system. Safe mode does not allow
re-replication, writes, deletes or updates on HDFS.
It allows 2 operations, i.e. lookups and reads.
The name node enters safe mode in 2 scenarios:
1. At the time of cluster starting (automatic).
2. An admin enters safe mode at any point in time (manual).
1. Automatic case: on every start of the Hadoop cluster, the name node automatically enters safe
mode and stays there until the configured minimum percentage of block replicas has been validated
(plus a short extension, about 30 seconds); then it leaves safe mode automatically.
2. Manual case: admin people enter safe mode for name node or HDFS maintenance,
stay for some time, and leave safe mode manually. For manual safe mode we have the
command option -safemode.
Usage: hadoop dfsadmin -safemode [enter | leave | get | wait]
hadoop dfsadmin -safemode enter
hadoop dfsadmin -safemode leave
hadoop dfsadmin -safemode get
hadoop dfsadmin -safemode wait
saveNamespace:
It is used to merge the edits into the FS image and reset the edits within the name node itself:
1. Enter safe mode:
hadoop dfsadmin -safemode enter
2. Give the saveNamespace command:
hadoop dfsadmin -saveNamespace
3. Leave safe mode:
hadoop dfsadmin -safemode leave
refreshNodes: it is used to make the HDFS cluster up to date after making some changes to
HDFS (e.g. the dfs.hosts / dfs.hosts.exclude lists).
refreshServiceAcl: service ACLs are used to authorize access to your cluster in a proper manner; by default
they are disabled.
For a production cluster, people enable service ACLs (access control lists) to administer the cluster
properly. Enabling service ACLs is nothing but making some modifications in the XML files
(ex: core-site.xml).
After doing the modifications, to make them take effect we use refreshServiceAcl instead of
restarting the cluster.
Usage: hadoop dfsadmin -refreshServiceAcl
metasave:
This option is used to get metadata information such as blocks being replicated, blocks waiting
for replication, blocks waiting for deletion, etc.
This command option writes the above info into a file, in human-readable format, in the
Hadoop logs directory: hadoop dfsadmin -metasave <filename>
Upgrading:
HDFS allows upgrading from one HDFS version to another HDFS version. To drive this upgrade
process there are two command options with dfsadmin.
Those two are:
1. -finalizeUpgrade
2. -upgradeProgress
• An HDFS upgrade maintains two states: the current and the previous HDFS version.
• If the upgrade process is successful, then run finalizeUpgrade to delete the previous
version; otherwise roll back.
• To get the upgrade progress we use the following command:
hadoop dfsadmin -upgradeProgress [status | details | force]
• To finalize the upgrade process use the following command:
hadoop dfsadmin -finalizeUpgrade
Quotas:
HDFS allows quotas on the file system to manage the data, like Linux; there are two types of quotas:
1. File quota
2. Space quota
File quota:
This concept is used to set a fixed number of names (files/directories) for a directory.
To create a file quota on HDFS directories we use the following command:
Usage: hadoop dfsadmin -setQuota <quota> <dirname>...<dirname>
Ex: hadoop dfsadmin -setQuota 4 /input
Space quota:
This concept is used to fix the size of an HDFS directory to a fixed amount of space.
To set the space quota on a directory use the following command:
hadoop dfsadmin -setSpaceQuota <quota> <dirname>...<dirname>
The quota is a size in bytes.
It also allows another notation, i.e. nM for megabytes (1M = 1 MB, where n = 1, 2, 3, …) and
nT for terabytes (1T = 1 terabyte).
hadoop fs -du /input → disk usage
Ex: hadoop dfsadmin -setSpaceQuota 65m /input
NOTE: In HDFS we have the block concept; blocks are of fixed size, so a space quota set on any
directory must be greater than or equal to the block size (64 MB / 128 MB).
To write any file into an HDFS directory which already has a space quota set, the remaining
space quota must be more than or equal to the block size.
To clear a space quota we use the following command:
Usage: hadoop dfsadmin -clrSpaceQuota <dirname>...<dirname>
Ex: hadoop dfsadmin -clrSpaceQuota /input
In a production HDFS cluster, people use a combination of the file quota and the space quota on HDFS
directories for managing HDFS.
Help:
It is used to get help for the dfsadmin command options.
Usage: hadoop dfsadmin -help <command>
We can reserve disk space on the data nodes with the configuration parameter
dfs.datanode.du.reserved.
To add all jars in a directory to the classpath:
for i in *.jar
> do
> export CLASSPATH=$CLASSPATH:$PWD/$i
> done
To build a jar:
jar -cvf write.jar -C . com/hdfs/example/*.class
Cmd> hadoop fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
<path> start checking from this path
-move move corrupted files to /lost+found
-delete delete corrupted files
-files print out files being checked
-openforwrite print out files opened for write
-blocks print out block report
-locations print out locations for every block
-racks print out network topology for data-node locations
By default fsck ignores files opened for write; use -openforwrite to report such files. They are usually
tagged CORRUPT or HEALTHY depending on their block allocation status.
Example: hadoop fsck / --> It provides the healthiness of the hadoop cluster.
Cmd> hadoop distcp [OPTIONS] <srcurl>* <desturl>
OPTIONS:
-p[rbugp] Preserve status
r: replication number
b: block size
u: user
g: group
p: permission
-p alone is equivalent to -prbugp
-i Ignore failures
-log <logdir> Write logs to <logdir>
-m <num_maps> Maximum number of simultaneous copies
-overwrite Overwrite destination
-update Overwrite if src size different from dst size
-skipcrccheck Do not use CRC check to determine if src is
different from dest. Relevant only if -update is specified
-f <urilist_uri> Use list at <urilist_uri> as src list
-filelimit <n> Limit the total number of files to be <= n
-sizelimit <n> Limit the total size to be <= n bytes
-delete Delete the files existing in the dst but not in src
-mapredSslConf <f> Filename of SSL configuration for mapper task
Example:
hadoop distcp -p -overwrite "hdfs://localhost:9000/training/bar" "hdfs://anotherhostname:8020/user/foo"
--> It loads data from one cluster to another cluster in parallel. It overwrites if the file is there.
hadoop distcp -p -update "file:///filename" "hdfs://localhost:9000/training/filename"
--> It loads data from the local system to the Hadoop cluster in parallel. It updates if the file is there.
Cmd> hadoop archive -archiveName NAME -p <parent path> <src>* <dest>
Example: hadoop archive –archiveName myfile.har –p /data /data/files /archive/
hadoop dfsadmin is the command to execute DFS administrative commands.
Example: hadoop dfsadmin <command name>
Cmd> hadoop dfsadmin -report:
Reports basic filesystem information and statistics.
Example: hadoop dfsadmin -report
Cmd> hadoop dfsadmin -safemode <enter|leave|get|wait>:
Safe mode maintenance command.
Safe mode is a Namenode state in which it
1. does not accept changes to the name space (read-only)
2. does not replicate or delete blocks.
Safe mode is entered automatically at Namenode startup, and
leaves safe mode automatically when the configured minimum
percentage of blocks satisfies the minimum replication
condition. Safe mode can also be entered manually, but then
it can only be turned off manually as well.
Example:
hadoop dfsadmin -safemode get --> To get status of safe mode
hadoop dfsadmin -safemode enter --> To enter safe mode
hadoop dfsadmin -safemode leave --> To leave the safemode
Cmd> hadoop dfsadmin -saveNamespace:
Save current namespace into storage directories and reset edits log. Requires superuser permissions and
safe mode.
Cmd> hadoop dfsadmin -refreshNodes:
Updates the set of hosts allowed to connect to the namenode.
Re-reads the config file to update values defined by
dfs.hosts and dfs.hosts.exclude and reads the
entries (hostnames) in those files.
Each entry not defined in dfs.hosts but in dfs.hosts.exclude is decommissioned. Each entry defined
in dfs.hosts and also in dfs.hosts.exclude is stopped from decommissioning if it has already been marked for
decommission. Entries not present in both the lists are decommissioned.
Cmd> hadoop dfsadmin -finalizeUpgrade:
Finalize upgrade of HDFS. Datanodes delete their previous version working directories, followed by
Namenode doing the same. This completes the upgrade process.
Cmd> hadoop dfsadmin -upgradeProgress <status|details|force>:
Request current distributed upgrade status, a detailed status or force the upgrade to proceed.
Cmd> hadoop dfsadmin -metasave <filename>:
Save Namenode's primary data structures
to <filename> in the directory specified by hadoop.log.dir property.
<filename> will contain one line for each of the following
1. Datanodes heart beating with Namenode
2. Blocks waiting to be replicated
3. Blocks currently being replicated
4. Blocks waiting to be deleted
Cmd> hadoop dfsadmin -setQuota <quota> <dirname>...<dirname>:
Set the quota <quota> for each directory <dirName>.
The directory quota is a long integer that puts a hard limit
on the number of names in the directory tree
Best effort for the directory, with faults reported if
1. N is not a positive integer, or
2. user is not an administrator, or
3. the directory does not exist or is a file, or
Cmd> hadoop dfsadmin -clrQuota <dirname>...<dirname>:
Clear the quota for each directory <dirName>.
Best effort for the directory. with fault reported if
1. the directory does not exist or is a file, or
2. user is not an administrator.
It does not fault if the directory has no quota.
Cmd> hadoop dfsadmin -setSpaceQuota <quota> <dirname>...<dirname>:
Set the disk space quota <quota> for each directory <dirName>.
The space quota is a long integer that puts a hard limit on the total size of all the files under the directory
tree. The extra space required for replication is also counted. E.g. a 1GB file with replication of 3 consumes
3GB of the quota. Quota can also be specified with a binary prefix for terabytes,
petabytes etc (e.g. 50t is 50TB, 5m is 5MB, 3p is 3PB). Best effort for the directory, with faults reported if
1. N is not a positive integer, or
2. user is not an administrator, or
3. the directory does not exist or is a file, or
Cmd> hadoop dfsadmin -clrSpaceQuota <dirname>...<dirname>:
Clear the disk space quota for each directory <dirName>.
Best effort for the directory. with fault reported if
1. the directory does not exist or is a file, or
2. user is not an administrator.
It does not fault if the directory has no quota.
Cmd> hadoop dfsadmin -refreshServiceAcl:
Reload the service-level authorization policy file. Namenode will reload the authorization policy file.
Cmd> hadoop dfsadmin -refreshUserToGroupsMappings:
Refresh user-to-groups mappings
Cmd> hadoop dfsadmin -refreshSuperUserGroupsConfiguration:
Refresh superuser proxy groups mappings
Cmd> hadoop dfsadmin -setBalancerBandwidth <bandwidth>:
Changes the network bandwidth used by each datanode during
HDFS block balancing.
<bandwidth> is the maximum number of bytes per second that will be used by each datanode. This value
overrides the dfs.balance.bandwidthPerSec parameter.
NOTE: The new value is not persistent on the DataNode
Cmd> hadoop -help [cmd]:
Displays help for the given command or all commands if none is specified.
Example: hadoop -help namenode --> It gives the complete information about the command namenode.
Hdfs commands :
hadoop fs -mkdir /training
hadoop fs -ls /
hadoop fs -lsr /
hadoop fs -put CHANGES.txt /training/data/changes
hadoop fs -copyFromLocal CHANGES.txt /training/input/changes
hadoop fs -cat /training/data/changes
hadoop fs -tail /training/data/changes
hadoop fs -expunge
hadoop fs -count /
package com.hdfs;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
public class MapFileWrite {
/**
* @param args
*/
@SuppressWarnings("deprecation")
public static void main(String[] args) throws IOException,
URISyntaxException {
String name = "/home/naga/dept";
@SuppressWarnings("resource")
BufferedReader br = new BufferedReader(new FileReader(name));
String line = br.readLine();
String uri = "/nyse/";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://hadoop:9000"), conf);
Path path = new Path(uri);
Text key = new Text();
Text value = new Text();
MapFile.Writer writer = null;
try {
// MapFile.Writer requires keys to be appended in sorted order, so /home/naga/dept must be sorted on its first column.
writer = new MapFile.Writer(conf, fs, uri, key.getClass(), value.getClass());
while (line != null) {
String parts[] = line.split("\\t");
key.set(parts[0]);
value.set(parts[1]);
writer.append(key, value);
line = br.readLine();
}
} finally {
IOUtils.closeStream(writer);
}
}
}
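A MapFile written this way keeps its keys sorted and maintains an index, so a single key can be looked up without scanning the whole data file. The sketch below is one way to read it back with MapFile.Reader; it reuses the /nyse/ directory and hdfs://hadoop:9000 URI from the writer above, but the class name MapFileLookup and the idea of passing the lookup key as a command-line argument are our own illustration, not part of the original example.
MapFileLookup.java
package com.hdfs;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
public class MapFileLookup {
/**
* Looks up a single key in the MapFile directory written by MapFileWrite.
* Usage: MapFileLookup <key>
*/
public static void main(String[] args) throws IOException, URISyntaxException {
String uri = "/nyse/";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://hadoop:9000"), conf);
MapFile.Reader reader = null;
try {
reader = new MapFile.Reader(fs, uri, conf);
Text key = new Text(args[0]);
Text value = new Text();
// get() consults the MapFile index and seeks directly to the block holding the key.
if (reader.get(key, value) != null) {
System.out.println(key + "\t" + value);
} else {
System.out.println(key + " not found");
}
} finally {
if (reader != null) {
reader.close();
}
}
}
}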
MergeFiles.java
package twok.hadoop.hdfs;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
public class MergeFiles {
/**
* @param args
* @author Nagamallikarjuna
* @throws IOException
*/
public static void main(String[] args) throws IOException {
String dir = "/home/prasad/training/data/mydata";
File directory = new File(dir);
BufferedWriter writer = new BufferedWriter(new FileWriter("/home/prasad/training/20news"));
File[] files = directory.listFiles();
for(File file : files)
{
String filepath = file.getPath();
@SuppressWarnings("resource")
BufferedReader br = new BufferedReader(new FileReader(filepath));
String line = br.readLine();
String content = "";
content = file.getName() + "\t";
while(line != null)
{
content = content + " " + line;
line= br.readLine();
}
writer.append(content);
}
writer.close();
}
}
ReadHDFS.java
package twok.hadoop.hdfs;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
public class ReadHDFS {
/**
* @param args
* @author Nagamallikarjuna
* @throws URISyntaxException
* @throws IOException
*/
public static void main(String[] args) throws IOException, URISyntaxException {
Configuration conf = new Configuration();
Path fileName = new Path(args[0]);
Pattern pattern = Pattern.compile("^part.*");
Matcher matcher = null;
FileSystem hdfs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);
FileStatus files[] = hdfs.listStatus(fileName);
for(FileStatus file: files)
{
matcher = pattern.matcher(file.getPath().getName());
if(matcher.matches())
{
FSDataInputStream in = hdfs.open(file.getPath());
in.seek(0);
IOUtils.copyBytes(in, System.out, conf, false);
}
}
}
}
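To run ReadHDFS, package it into a jar and pass the HDFS directory whose part-* files should be printed; the jar name and output path below are only examples:
hadoop jar hdfs-examples.jar twok.hadoop.hdfs.ReadHDFS /user/naga/wordcount-output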
SequenceFileRead.java
package twok.hadoop.hdfs;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
public class SequenceFileRead {
/**
* @author Nagamallikarjuna
* @param args
*/
public static void main(String[] args) throws IOException, URISyntaxException {
// Open the sequence file and print the key and value class names stored in its header.
String uri = "/sequenceFiles/stocks";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);
Path path = new Path(uri);
SequenceFile.Reader reader = null;
try
{
reader = new SequenceFile.Reader(fs, path, conf);
System.out.println(reader.getKeyClassName());
System.out.println(reader.getValueClassName());
}
finally {
IOUtils.closeStream(reader);
}
}
}
SequenceFileWrite.java
package twok.hadoop.hdfs;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
public class SequenceFileWrite {
/**
* @param args
* @author Nagamallikarjuna
*/
public static void main(String[] args) throws IOException, URISyntaxException {
String name = "/home/naga/bigdata/hadoop-1.0.3/jobs/daily";
@SuppressWarnings("resource")
BufferedReader br = new BufferedReader(new FileReader(name));
String line = br.readLine();
String uri = "/sequenceFiles/stocks";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);
Path path = new Path(uri);
Text key = new Text();
LongWritable value = new LongWritable();
SequenceFile.Writer writer = null;
try
{
writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
while(line != null)
{
String parts[] = line.split("\\t");
key.set(parts[1]);
value.set(Long.valueOf(parts[7]));
writer.append(key, value);
line = br.readLine();
}
}
finally {
IOUtils.closeStream(writer);
}
}
}
WriteHDFS.java
package twok.hadoop.hdfs;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.GenericOptionsParser;
public class WriteHDFS {
/**
* @param args
* @author Nagamallikarjuna
* @throws URISyntaxException
* @throws IOException
*/
public static void main(String[] args) throws IOException, URISyntaxException {
Configuration conf = new Configuration();
String otherArgs[] = new GenericOptionsParser(args).getRemainingArgs();
String localfile = otherArgs[0];
String hdfsfile = otherArgs[1];
FileSystem local = FileSystem.getLocal(conf);
FileSystem hdfs = FileSystem.get(new URI("hdfs://master:9000"), conf);
FSDataInputStream in = local.open(new Path(localfile));
FSDataOutputStream out = hdfs.create(new Path(hdfsfile));
byte[] buffer = new byte[256];
int bytesRead;
// Copy only the bytes actually read; writing a fixed 256 bytes would corrupt the tail of the file.
while((bytesRead = in.read(buffer)) > 0)
{
out.write(buffer, 0, bytesRead);
}
out.close();
in.close();
}
}
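WriteHDFS takes a local source file and an HDFS destination; GenericOptionsParser strips standard Hadoop options (such as -conf or -D) before the two file arguments are read. The jar name and paths below are only examples:
hadoop jar hdfs-examples.jar twok.hadoop.hdfs.WriteHDFS /home/naga/CHANGES.txt /training/data/changes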
HDFS Commands :
This documentation gives complete information about user and admin commands of HDFS.
HDFS User Commands are:
1. fs
2. archive
3. distcp
USER Commands:
1. fs -- FileSystem Client:
hadoop fs is the command used to execute file system operations such as creating directories, renaming,
copying, removing and displaying files.
Usage: hadoop fs -fs [local | <file system URI>] [Generic Options] [Command Options]
Specify the file system to use. If not specified, the current configuration is used, taken from the
following, in increasing precedence: core-default.xml inside the hadoop jar file, and core-site.xml in the
configuration directory (conf/) under the hadoop home. 'local' means use the local file system as your DFS.
<file system URI> specifies a particular file system to contact.
Example:
hadoop fs -fs local ---> it will use the local file system.
hadoop fs -fs hostname:9000 ---> It will use the Hadoop distributed file system on hostname.
There is no need to specify it explicitly; by default it is chosen based on the configuration files
[core-site.xml, hdfs-site.xml, mapred-site.xml].
Cmd> hadoop fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
<path> start checking from this path
-move move corrupted files to /lost+found
-delete delete corrupted files
-files print out files being checked
-openforwrite print out files opened for write
-blocks print out block report
-locations print out locations for every block
-racks print out network topology for data-node locations
By default fsck ignores files opened for write; use -openforwrite to report such files. They are
usually tagged CORRUPT or HEALTHY depending on their block allocation status.
Example: hadoop fsck / --> It reports the health of the entire HDFS file system.
Cmd> hadoop fs -ls <path>:
List the contents that match the specified file pattern. If path is not specified, the contents of
/user/<currentUser> will be listed. Directory entries are of the form dirName(full path) <dir> and
file entries are of the form fileName(full path) <r n> size, where n is the number of replicas
specified for the file and size is the size of the file, in bytes.
Example:-
hadoop fs -ls / ---> This command lists all the files under the root directory of HDFS
Cmd> hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH...:
Changes the owner and group of a file or directory.
-R : modifies the files recursively. This is the only option currently supported.
If only the owner or group is specified, then only the owner or group is modified.
The owner and group names may only consist of digits, alphabet, and any of '-_.@/', i.e. [-_.@/a-zA-Z0-9]. The names are case sensitive.
WARNING: Avoid using '.' to separate user name and group though Linux allows it. If user
names have dots in them and you are using the local file system, you might see surprising
results since the shell command 'chown' is used for local files.
Example: hadoop fs -chown -R naga:supergroup /data/daily --> This will change the owner
and group of /data/daily and everything under it.