Hadoop single cluster installation

Hadoop Single Cluster
Installation
Minh Tran – Software Architect
05/2013

Prerequisites
• Ubuntu Server 10.04 (Lucid Lynx)
• JDK 6u34 Linux
• Hadoop 1.0.4
• VMWare Player / VMWare Workstation /
VMWare Server
• Ubuntu Server VMWare Image:
http://www.thoughtpolice.co.uk/vmware/#u
buntu10.04 (notroot / thoughtpolice)

Install SSH
• sudo apt-get update
• sudo apt-get install openssh-server

Install JDK
• wget -c -O jdk-6u34-linux-i586.bin
http://download.oracle.com/otn/java/jdk/6u34-b04/jdk-6u34-
linux-
i586.bin?AuthParam=1347897296_c6dd13e0af9e099dc731937f9
5c1cd01
• chmod 777 jdk-6u34-linux-i586.bin
• ./jdk-6u34-linux-i586.bin
• sudo mv jdk1.6.0_34 /usr/local
• sudo ln -s /usr/local/jdk1.6.0_34 /usr/local/jdk

Create group / account for
Hadoop
• sudo addgroup hadoop
• sudo adduser --ingroup hadoop hduser

Install Local Hadoop
• wget http://mirrors.digipower.vn/apache/hadoop/common/hado
op-1.0.4/hadoop-1.0.4.tar.gz
• tar -zxvf hadoop-1.0.4.tar.gz
• sudo mv hadoop-1.0.4 /usr/local
• sudo chown -R hduser:hadoop /usr/local/hadoop-1.0.4
• sudo ln -s /usr/local/hadoop-1.0.4 /usr/local/hadoop

Install Apache Ant
• wget http://mirrors.digipower.vn/apache/ant/binaries/apache-
ant-1.9.0-bin.tar.gz
• tar -zxvf apache-ant-1.9.0-bin.tar.gz
• sudo mv apache-ant-1.9.0 /usr/local
• sudo ln -s /usr/local/apache-ant-1.9.0 /usr/local/apache-ant

Modify environment variables
• su - hduser
• vi .bashrc
• export JAVA_HOME=/usr/local/jdk
• export HADOOP_PREFIX=/usr/local/hadoop
• export
PATH=${JAVA_HOME}/bin:${HADOOP_PREFIX}/bin:${PATH}
• . .bashrc

Try 1st example
hduser@ubuntu:/usr/local/hadoop$ cd $HADOOP_PREFIX
hduser@ubuntu:/usr/local/hadoop$ hadoop jar hadoop-examples-1.0.4.jar pi 2 10
Number of Maps = 2
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Starting Job
13/04/03 15:01:40 INFO mapred.FileInputFormat: Total input paths to process : 2
13/04/03 15:01:41 INFO mapred.JobClient: Running job: job_201304031458_0003
13/04/03 15:01:42 INFO mapred.JobClient: map 0% reduce 0%
13/04/03 15:02:19 INFO mapred.JobClient: Job complete: job_201304031458_0003
13/04/03 15:02:19 INFO mapred.JobClient: Counters: 30
13/04/03 15:02:19 INFO mapred.JobClient: Job Counters
…
13/04/03 15:02:19 INFO mapred.JobClient: Reduce output records=0
13/04/03 15:02:19 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1118670848
13/04/03 15:02:19 INFO mapred.JobClient: Map output records=4
Job Finished in 39.148 seconds
Estimated value of Pi is 3.80000000000000000000

Setup Single Node Cluster
• Disabling ipv6
• Configuring SSH
• Configuration
– hadoop-env.sh
– conf/*-site.xml
• Start / Stop node cluster
• Running MapReduce job

Disabling ipv6
• Open /etc/sysctl.conf, add following lines
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
• Reboot your machine
• Verify ipv6 enabled / disabled
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
(0 – enabled, 1 – disabled)

Configuring SSH
• Create SSH keys in the localhost
su - hduser
ssh-keygen -t rsa -P "“
• Put the key id_rsa.pub to localhost
touch ~/.ssh/authorized_keys && chmod 600
~/.ssh/authorized_keys
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Configuration
• Edit the configuration
in /usr/local/hadoop/conf/hadoop-
env.sh, add following lines:
export JAVA_HOME=/usr/local/jdk

Configuration (cont.)
• Create a folder to store data for node
sudo mkdir -p /hadoop_data/name
sudo mkdir -p /hadoop_data/data
sudo mkdir -p /hadoop_data/temp
sudo chown hduser:hadoop /hadoop_data/name
sudo chown hduser:hadoop /hadoop_data/data
sudo chown hduser:hadoop /hadoop_data/temp

conf/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop_data/temp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>

conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and reduce task.
</description>
</property>
</configuration>

conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.name.dir</name>

<value>/hadoop_data/name</value>
</property>
<property>
<name>dfs.data.dir</name>

<value>/hadoop_data/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can
be specified when the file is created. The default is used if replication is not
specified in create time.
</description>
</property>
</configuration>

Format a new system
notroot@ubuntu:/usr/local/hadoop/conf$ su - hduser
Password:
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format
13/04/03 13:41:24 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu.localdomain/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.4
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290;
compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
************************************************************/
Re-format filesystem in /hadoop_data/name ? (Y or N) Y
13/04/03 13:41:26 INFO util.GSet: VM type = 32-bit
13/04/03 13:41:26 INFO util.GSet: 2% max memory = 19.33375 MB
13/04/03 13:41:26 INFO util.GSet: capacity = 2^22 = 4194304 entries
….
13/04/03 13:41:28 INFO common.Storage: Storage directory /hadoop_data/name has been successfully formatted.
13/04/03 13:41:28 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu.localdomain/127.0.1.1
************************************************************/
Do not format a running Hadoop file system as you will lose all the
data currently in the cluster (in HDFS)!

Start Single Node Cluster
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
starting namenode, logging to /usr/local/hadoop-1.0.4/libexec/../logs/hadoop-hduser-
namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop-1.0.4/libexec/../logs/hadoop-
hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-
1.0.4/libexec/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop-1.0.4/libexec/../logs/hadoop-hduser-
jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.0.4/libexec/../logs/hadoop-
hduser-tasktracker-ubuntu.out

How to verify Hadoop process
• A nifty tool for checking whether the expected Hadoop processes are running is jps
(part of Sun JDK tool)
hduser@ubuntu:~$ jps
1203 NameNode
1833 Jps
1615 JobTracker
1541 SecondaryNameNode
1362 DataNode
1788 TaskTracker
• You can also check with netstat if Hadoop is listening on the configured ports.
notroot@ubuntu:/usr/local/hadoop/conf$ sudo netstat -plten | grep java
tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 1001 7167 2438/java
tcp 0 0 127.0.0.1:54311 0.0.0.0:* LISTEN 1001 7949 2874/java
tcp 0 0 0.0.0.0:50090 0.0.0.0:* LISTEN 1001 7898 2791/java
tcp 0 0 0.0.0.0:50030 0.0.0.0:* LISTEN 1001 8035 2874/java
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 1001 7202 2438/java
tcp 0 0 0.0.0.0:57143 0.0.0.0:* LISTEN 1001 7585 2791/java
tcp 0 0 0.0.0.0:41943 0.0.0.0:* LISTEN 1001 7222 2608/java
tcp 0 0 0.0.0.0:58936 0.0.0.0:* LISTEN 1001 6969 2438/java
tcp 0 0 127.0.0.1:50234 0.0.0.0:* LISTEN 1001 8158 3050/java
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 1001 7697 2608/java
tcp 0 0 0.0.0.0:50075 0.0.0.0:* LISTEN 1001 7775 2608/java
tcp 0 0 0.0.0.0:40067 0.0.0.0:* LISTEN 1001 7764 2874/java
tcp 0 0 0.0.0.0:50020 0.0.0.0:* LISTEN 1001 7939 2608/java

Stop your single node cluster
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Running a MapReduce job
• We will use three ebooks from Project
Gutenberg for this example:
– The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
– The Notebooks of Leonardo Da Vinci
– Ulysses by James Joyce
• Download each ebook as text files in Plain
Text UTF-8 encoding and store the files in
/tmp/gutenberg

(cont.)
• Copy these files into HDFS
hduser@ubuntu:~$ hadoop dfs -copyFromLocal /tmp/gutenberg
/user/hduser/gutenberg
hduser@ubuntu:~$ hadoop dfs -ls /user/hduser/gutenberg
Found 3 items
-rw-r--r-- 1 hduser supergroup 661807 2013-04-03 14:01
/user/hduser/gutenberg/pg20417.txt

(cont.)
hduser@ubuntu:~$ cd /usr/local/hadoop
hduser@ubuntu:/usr/local/hadoop$ hadoop jar hadoop-examples-1.0.4.jar wordcount /user/hduser/gutenberg
/user/hduser/gutenberg-output
13/04/03 14:02:45 INFO input.FileInputFormat: Total input paths to process : 3
13/04/03 14:02:45 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/03 14:02:45 WARN snappy.LoadSnappy: Snappy native library not loaded
13/04/03 14:02:45 INFO mapred.JobClient: Running job: job_201304031352_0001
13/04/03 14:03:53 INFO mapred.JobClient: Job complete: job_201304031352_0001
13/04/03 14:03:53 INFO mapred.JobClient: Counters: 29
13/04/03 14:03:53 INFO mapred.JobClient: Job Counters
13/04/03 14:03:53 INFO mapred.JobClient: Launched reduce tasks=1
…
13/04/03 14:03:53 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=59114
13/04/03 14:03:53 INFO mapred.JobClient: SPLIT_RAW_BYTES=361
13/04/03 14:03:53 INFO mapred.JobClient: Reduce input records=102321
13/04/03 14:03:53 INFO mapred.JobClient: Reduce input groups=82334
13/04/03 14:03:53 INFO mapred.JobClient: Combine output records=102321
13/04/03 14:03:53 INFO mapred.JobClient: Physical memory (bytes) snapshot=576069632
13/04/03 14:03:53 INFO mapred.JobClient: Reduce output records=82334
13/04/03 14:03:53 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1490481152
13/04/03 14:03:53 INFO mapred.JobClient: Map output records=629172

Check the result
• hduser@ubuntu:/usr/local/hadoop$ hadoop dfs -ls /user/hduser/gutenberg-output
Found 3 items
-rw-r--r-- 1 hduser supergroup 0 2013-04-03 14:03 /user/hduser/gutenberg-output/_SUCCESS
drwxr-xr-x - hduser supergroup 0 2013-04-03 14:02 /user/hduser/gutenberg-output/_logs
-rw-r--r-- 1 hduser supergroup 880829 2013-04-03 14:03 /user/hduser/gutenberg-output/part-r-00000
• hduser@ubuntu:/usr/local/hadoop$ hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000 | more
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 1
"A_ 1
"Absoluti 1
"Alack! 1
"Alack!" 1
"Alla 1
"Allegorical 1
"Alpha 1
"Alpha," 1
…

Hadoop Interfaces
• NameNode Web UI:
http://192.168.65.134:50070/
• JobTracker Web UI:
http://192.168.65.134:50030/
• TaskTracker Web UI:
http://192.168.65.134:50060/

Troubleshooting
• VMware Ubuntu image lost eth0 after moving
it http://www.whiteboardcoder.com/2012/03/vmware-ubuntu-image-
lost-eth0-after.html
• Hadoop Troubleshooting:
http://wiki.apache.org/hadoop/TroubleShooting
• Error when formatting the Hadoop
filesystem: http://askubuntu.com/questions/35551/error-when-
formatting-the-hadoop-filesystem

Hadoop single cluster installation

More Related Content

What's hot

Similar to Hadoop single cluster installation

Recently uploaded

Hadoop single cluster installation

Editor's Notes