Hands-On Hadoop
Tutorial
Chris Sosa
Wolfgang Richter
May 23, 2008
General Information
Hadoop uses HDFS, a distributed file
system based on GFS, as its shared
filesystem
HDFS architecture divides files into large
chunks (~64MB) distributed across data
servers
HDFS has a global namespace
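To make the chunking concrete, here is a minimal sketch of how many chunks a file of a given size would occupy, assuming the 64 MB chunk size quoted above. The helper name chunk_count is hypothetical and is not part of Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not a Hadoop command): number of chunks needed
# for a file of the given size in bytes.
CHUNK_BYTES=$((64 * 1024 * 1024))   # 64 MB, the chunk size from the slide

chunk_count() {
    size=$1
    # Ceiling division: a final partial chunk still counts as one chunk.
    echo $(( (size + CHUNK_BYTES - 1) / CHUNK_BYTES ))
}

chunk_count 200000000   # a 200 MB (decimal) file; prints 3
```

A file that is exactly 64 MB fits in one chunk, and a single extra byte pushes it to two.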
General Information (cont’d)
A script is provided for your convenience
– Run source /localtmp/hadoop/setupVars from centurion064
– After this, you can type command instead of {somePath}/command
Go to http://www.cs.virginia.edu/~cbs6n/hadoop for web
access. These slides and more information are also
available there.
Once you use the DFS (put something in it), relative
paths are resolved from /usr/{your user id}. For example, if your
id is tb28, your “home dir” is /usr/tb28
Master Node
Hadoop is currently configured with
centurion064 as the master node
Master node
– Keeps track of namespace and metadata
about items
– Keeps track of MapReduce jobs in the system
Slave Nodes
Centurion064 also acts as a slave node
Slave nodes
– Manage blocks of data sent from master node
– In terms of GFS, these are the chunkservers
Currently centurion060 is another
slave node
Hadoop Paths
Hadoop is locally “installed” on each machine
– Installed location is /localtmp/hadoop/hadoop-0.15.3
– Slave nodes store their data in
/localtmp/hadoop/hadoop-dfs (this is automatically
created by the DFS)
– /localtmp/hadoop is owned by group gbg (it must be
administered by someone in this group or by a CS admin)
Files are divided into 64 MB chunks (this is
configurable)
Starting / Stopping Hadoop
For the purposes of this tutorial, we
assume you have run the setupVars script from
earlier
start-all.sh – starts all slave nodes and
master node
stop-all.sh – stops all slave nodes and
master node
Using HDFS (1/2)
hadoop dfs
– [-ls <path>]
– [-du <path>]
– [-cp <src> <dst>]
– [-rm <path>]
– [-put <localsrc> <dst>]
– [-copyFromLocal <localsrc> <dst>]
– [-moveFromLocal <localsrc> <dst>]
– [-get [-crc] <src> <localdst>]
– [-cat <src>]
– [-copyToLocal [-crc] <src> <localdst>]
– [-moveToLocal [-crc] <src> <localdst>]
– [-mkdir <path>]
– [-touchz <path>]
– [-test -[ezd] <path>]
– [-stat [format] <path>]
– [-help [cmd]]
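As an illustration, a typical round trip through HDFS might look like the sketch below. It uses only options from the list above. The run wrapper, the file notes.txt, and the directory demo are hypothetical names chosen for the example. By default the script only prints the commands; set RUN=1 to execute them against a live cluster.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical helper): prints each command unless
# RUN=1 is set, so the sketch can be read without a live cluster.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# notes.txt and demo are made-up names for illustration.
hdfs_session() {
    run hadoop dfs -mkdir demo                         # create a directory in HDFS
    run hadoop dfs -put notes.txt demo/notes.txt       # upload a local file
    run hadoop dfs -ls demo                            # list the directory
    run hadoop dfs -cat demo/notes.txt                 # print the file contents
    run hadoop dfs -get demo/notes.txt notes-copy.txt  # download a copy
    run hadoop dfs -rm demo/notes.txt                  # delete the file from HDFS
}

hdfs_session
```

Because demo is a relative path, it is created under your HDFS home directory.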
Using HDFS (2/2)
Want to reformat?
Easy
– hadoop namenode -format
Most commands follow the same pattern
– hadoop “some command” options
– If you just type hadoop, you get a list of all possible
commands (including undocumented ones – hooray)
To Add Another Slave
This adds another data node / job execution site
to the pool
– Hadoop dynamically uses the filesystem underneath it
– If more space is available on the HDD, HDFS will try
to use it when it needs to
Modify the slaves file
– In centurion064:/localtmp/hadoop/hadoop-0.15.3/conf
– Copy the code installation dir to
newMachine:/localtmp/hadoop/hadoop-0.15.3 (very small)
– Restart Hadoop
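The steps above can be sketched as a short plan. This is only an illustration: the function add_slave_plan is hypothetical, the use of scp for the copy is an assumption (the slides do not say which tool to use), and it assumes /localtmp/hadoop already exists on the new machine. The script prints the commands rather than running them, so you can review them first.

```shell
#!/bin/sh
# Prints the commands for adding a slave; nothing is executed.
HADOOP_DIR=/localtmp/hadoop/hadoop-0.15.3   # install path from the slides

# add_slave_plan is a hypothetical helper, not part of Hadoop.
add_slave_plan() {
    new_host=$1
    # 1. Register the new host in the slaves file on the master.
    echo "echo $new_host >> $HADOOP_DIR/conf/slaves"
    # 2. Copy the installation to the new machine (scp is an assumption).
    echo "scp -r $HADOOP_DIR $new_host:/localtmp/hadoop/"
    # 3. Restart Hadoop so the new slave is picked up.
    echo "stop-all.sh"
    echo "start-all.sh"
}

add_slave_plan newMachine
```

Run the printed commands from centurion064 after sourcing setupVars, so that stop-all.sh and start-all.sh are found.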
Configure Hadoop
Can configure in {$installation dir}/conf
– hadoop-default.xml for global
– hadoop-site.xml for site specific (overrides global)
That’s it for Configuration!
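As a small example of a site-specific override, the sketch below writes a hadoop-site.xml that raises the chunk size from 64 MB to 128 MB. It assumes the property is named dfs.block.size, as listed in hadoop-default.xml for releases of this era; check your own hadoop-default.xml before relying on it. The file is written to a scratch directory rather than the live conf directory.

```shell
#!/bin/sh
# Write a sample hadoop-site.xml to a scratch directory (not the live conf dir).
out_dir=$(mktemp -d)

# Assumption: dfs.block.size is the block-size property in your
# hadoop-default.xml. 134217728 bytes = 128 MB.
cat > "$out_dir/hadoop-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
</configuration>
EOF

echo "wrote $out_dir/hadoop-site.xml"
```

Any property set here takes precedence over the same property in hadoop-default.xml, so the site file only needs to list the values you want to change.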
Real-time Access