DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
CLOUD COMPUTING
UCS641
ASSIGNMENT 2
Hadoop and Map-Reduce
Submitted by: Submitted to:
Lohitya Kumar Ambashta Dr. Ashima Singh
101883073
COE-14
PATIALA-147001, PUNJAB
Table of Contents
1. Setting up and Installing Hadoop ....................................................................................... 2
1.1. Stand-alone Mode ....................................................................................................... 3
1.2. Pseudo-Distributed Mode ........................................................................................... 4
1.3. Fully Distributed Mode.............................................................................................. 10
2. Implementation of HDFS Commands ............................................................................... 16
3. Upload and Download File from HDFS ............................................................................. 18
4. Copy file from Source to Destination in Hadoop.............................................................. 18
5. Copy file from/to local file system to HDFS...................................................................... 19
6. Remove a File or Directory from HDFS ............................................................................. 20
7. Display last few lines of a file in HDFS .............................................................................. 20
8. Display aggregate length of a file ..................................................................................... 21
9. Implementation of Word Count Map Reduce .................................................................. 22
10. Matrix multiplication with Hadoop Map Reduce ................ Error! Bookmark not defined.
11. Small binary file to One sequential file................................ Error! Bookmark not defined.
1. Setting up and Installing Hadoop
First, we need to download Hadoop.
Figure 1: Download Website for Hadoop
We will download Hadoop 3.2.1. Click on the link under the “Binary Download” column. This
will take us to the download page of Hadoop.
Figure 2: Link to get Hadoop-3.2.1-src.tar.gz file
Now we have Hadoop zip file. Before setting up, we need java 7 i.e. Java1.7 or above pre-
installed on our computer.
To check for installed java versions, we can run in terminal:
$ java -version
If java is not installed, we can install it using the following commands:
$ sudo apt-get update
$ sudo apt-get install default-jdk
Now check version once again by running:
$ java -version
1.1. Stand-alone Mode
Installing Hadoop in Stand-alone mode step by step:
i. Go to the directory where you have downloaded the Hadoop file and run the
following command:
$ tar -xzvf Hadoop-3.2.1-src.tar.gz
This will extract the files into a new folder in the same directory.
ii. Move the extracted folder to a suitable install location.
$ sudo mv Hadoop-3.2.1 /usr/local/hadoop
iii. We will add the Hadoop-3.2.1/bin/ to the PATH variable.
$ export HADOOP_PATH=~/Hadoop-3.2.1/bin/
$ export PATH=$PATH;$HADOOP_PATH
Now we are ready to run Hadoop in stand-alone mode.
To execute a jar file using Hadoop, first move to the Hadoop install directory and run:
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-
examples-3.2.1.jar
If you are not in the correct directory, you will get an error saying file not found, otherwise
you see the output as below:
Figure 3: Output for Hadoop in stand-alone mode
1.2. Pseudo-Distributed Mode
Configuring Hadoop in pseudo-distributed mode requires more configurations after the
stand-alone installation.
Pre-requirements include Java (already done in stand-alone mode) and SSH installed.
We can check it using:
$ ssh localhost
Figure 4: 'ssh localhost' output when SSH is installed
If not installed, you can install it using the command:
$ sudo apt-get install openssh-server
Master node communicate with slave node very frequently over SSH protocol. in Pseudo-
Distributed mode, only one node exists (your machine) and master slave interaction is
simulated by JVM. Since communication is very frequent, SSH should be password less.
Authentication needs to be done using Public key.
We can achieve this by creating a RSA key value pair. We use the command:
$ ssh-keygen
You'll be prompted to choose the location to store the keys. The default location is good
unless you already have a key. Press Enter to choose the default location.
Enter file in which to save the key (/Users/yourname/.ssh/id
_rsa):
Next, you'll be asked to choose a password. Using a password means a password will be
required to use the private key. It's a good idea to use a password on your private key.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
After you choose a password, your public and private keys will be generated. There will be
two different files. The one named id_rsa is your private key. The one
named id_rsa.pub is your public key.
Your identification has been saved in /Users/yourname/.ssh/
id_rsa.
Your public key has been saved in /Users/yourname/.ssh/
Id_rsa.pub.
You'll also be shown a fingerprint and "visual fingerprint" of your key. You do not need to
save these.
The key fingerprint is:
d7:21:c7:d6:b8:3a:29:29:11:ae:6f:79:bc:67:63:53
yourname@laptop1
The key's randomart image is:
+--[ RSA 2048]----+
| |
| . o |
| . . * . |
| . . = o |
| o S . o |
| . . o oE |
| . .oo +. |
| .o.o.*. |
| ....= o |
+-----------------+
Now to give access using the public key, create a file .ssh/authorized_keys and
paste the content of the rsa.pub key in it.
Figure 5: Content of .ssh folder
Figure 6: Content of private RSA Key
Figure 7: Content of pubic RSA key
Figure 8: Content of authorized RSA keys
Now we have configured our SSH, now we need to configure Hadoop configuration files.
Navigate to Hadoop installation directory; and within it navigate to etc/hadoop/.
Figure 9: Content of etc/hadoop (Configuration Files)
First, we need to tell Hadoop where in our computer JAVA is located. We can find path to
java by using the whereis command and following the symbolic links to its original path.
Figure 10: Steps to get java path
We will now modify the hadoop-env.sh file to export this path. Within the hadoop
installation directory, change your directory to etc/hadoop.
Open hadoop-env.sh using nano in terminal:
$ nano hadoop-env.sh
Now add this environment variable to the file
$ export JAVA_HOME=path/from/above/command
Figure 11: JAVA_HOME path inside hadoop-env.sh
Now save and exit by pressing Ctrl+X and then enter Y to save buffer with existing name.
Second, we will configure the core-site.xml in the same directory.
$ nano core-site.xml
add the following inside the <configuration> </configuration> tag.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
Third, we need to configure hdfs-site.xml.
$ nano hdfs-site.xml
add the following inside the <configuration> </configuration> tag.
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
It tells any data you store on HDFS is replicated to one another node as a backup.
Fourth, we configure mapred-site.xml. We need to create this file as it is not present
in the directory.
$ nano mapred-site.xml
add the following inside the <configuration> </configuration> tag.
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
In this property we configured the resource negotiator as YARN.
At last, we configure resource negotiator YARN. We do that by configuring yarn-
site.xml.
$ nano yarn-site.xml
add the following inside the <configuration> </configuration> tag.
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Now we can start executing Hadoop commands.
We begin by formatting the NameNode so we have a fresh start.
$ hdfs namenode -format
Let us first see the currently active Hadoop processes.
$ jps
Now we start our NameNode by running the following command:
$ start-dfs.sh
Figure 12: Output for start-dfs.sh
Check running processes of Hadoop.
$ jps
Figure 13:Starting DFS
Then we start our resource negotiator YARN.
$ start-yarn.sh
Figure 14: Output for start-yarn.sh
Let us check the running processes of Hadoop.
$ jps
We can verify if our NameNode is running using our web browser. Visit the url:
localhost:9870. Hadoop, by default, works on port 9870.
1.3. Fully Distributed Mode
We can switch Hadoop from Pseudo-distributed to Fully-distributed by making few changes
in the configuration file.
We need to edit the host file.
$ sudo gedit /etc/hosts
Add all machines IP address and host names.
192.168.2.14 HadoopMaster
192.168.2.15 HadoopSlave1
192.168.2.16 HadoopSlave2
Then we create a group ‘hdgroup’ and add a user ‘hduser’ in the group.
$ sudo addgroup hdgroup
$ sudo adduser –ingroup hdgroup hduser
Then we configure the sudo permission for ‘hduser’
$ sudo visudo
In the file, give permissions to ‘hduser’. Then save and exit the file.
$ hduser ALL=(ALL) ALL
Give permissions to the Hadoop installation folders.
$ sudo chown -R hduser /usr/local/hadoop
$ sudo chmod -R 755 /usr/local/hadoop
All the steps till now should be done on all the machines i.e Master and slave nodes.
Now we need to re-configure our Master node for fully distributed mode.
We will now edit core-site.xml.
<property>
<name>fs.default.name</name>
<value>hdfs://HadoopMaster:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.
</description>
</property>
Next, we will configure hdfs-site.xml.
<property>
<name>dfs.name.dir</name>
<value>/app/hadoop/tmp/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/app/hadoop/tmp/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>false</value>
</property>
Continued command on next page…
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-
check</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>HadoopMaster:50070</value>
<description>Your NameNode hostname for http
access.</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>HadoopMaster:50090</value>
<description>Your Secondary NameNode hostname for http
access.</description>
</property>
Next we will edit yarn-site.xml.
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
<description>Long running service which executes on Node
Manager(s) and provides MapReduce Sort and Shuffle
functionality.</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>Enable log aggregation so application logs are
moved onto hdfs and are viewable via web ui after the
application completed. The default location on hdfs is
'/log' and can be changed via yarn.nodemanager.remote-app-
log-dir property</description>
</property>
Continuation from previous page…
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>HadoopMaster:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>HadoopMaster:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>HadoopMaster:8032</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>HadoopMaster:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>HadoopMaster:8088</value>
</property>
Next, we edit mapred-site.xml.
<property>
<name>mapred.job.tracker</name>
<value>HadoopMaster:9001</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Now we edit the slave file.
$ sudo gedit slaves
Add IP of slave machines.
192.168.2.15
192.168.2.16
Secure copy the Hadoop folder to other slave hosts.
$ scp -r /usr/local/hadoop/* hduser@192.168.2.15:/usr/local/hadoop
$ scp -r /usr/local/hadoop/* hduser@192.168.2.16:/usr/local/hadoop
$ scp -r $HOME/.bashrc hduser@192.168.2.15:$HOME/.bashrc
$ scp -r $HOME/.bashrc hduser@192.168.2.16:$HOME/.bashrc
Format NameNode and run DFS and YARN.
$ start-dfs.sh
$ start-yarn.sh
Now Hadoop is running in Fully-distributed mode successfully.
2. Implementation of HDFS Commands
All commands can be used in two ways.
First, using hadoop fs
$ hadoop fs [-command] <parameter if any>
Example,
$ hadoop fs -ls /
Second using hdfs dfs
$ hdfs dfs [-command] <parameter if any>
Example,
$ hdfs dfs -ls \
Commands and their implementations:
a) ls: list directory
b) mkdir: make direcotry
c) cat: output content of file
d) touchz: create a file with zero length
e) du: returns size of directories or length of files in the specified directory
f) dus: deprecated command. Same as du -s. Gives summary of files in a directory
g) stats: returns the stat information about the path. Example, date modified.
3. Upload and Download File from HDFS
To upload we use hadoop fs -put <src/path> <dfs/path>
To download a file, we use hadoop fs -get <dfs/path> <src/path>
4. Copy file from Source to Destination in Hadoop
To copy, we use hadoop fs -cp <src/path> <dest/path>
To move, we use hadoop fs -mv <src/path> <dest/path>
5. Copy file from/to local file system to HDFS
To copy from local to hdfs we use hadoop fs -copyFromLocal <src> <dest>
To copy to local from hdfs we use hadoop fs -copyToLocal <src> <dest>
6. Remove a File or Directory from HDFS
To remove a file, we use hadoop fs -rm </path/to/filename> and to remove a
directory we use hadoop fs -rmr </path/to/dir>
7. Display last few lines of a file in HDFS
To display last few kilobytes of a file we use hadoop fs -tail <path/to/file>
Using a pipelined expression to get last 5 lines.
8. Display aggregate length of a file
To display aggregate length we can use hadoop fs -du -s <path/to/file>
9. Implementation of Word Count Map Reduce
Q. Implement Word Count Map Reduce program using the Map Reduce Paradigm.
A. Program for Word Count:
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class wordcount {
public static class MyMap extends MapReduceBase implements Mapper<LongWrit
able, Text, Text, IntWritable> {
private Text myKey = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, In
tWritable> output, Reporter reporter)
throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
myKey.set(tokenizer.nextToken());
output.collect(myKey, new IntWritable(1));
}
}
}
public static class MyReduce extends MapReduceBase implements Reducer<Text
, IntWritable, Text, IntWritable>{
public void reduce(Text key, Iterator<IntWritable> values, OutputColle
ctor<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws IOException {
JobConf conf = new JobConf(wordcount.class);
conf.setJobName("CountWordFreq");
conf.setMapperClass(MyMap.class);
conf.setReducerClass(MyReduce.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Hadoop jar libraries required:
▪ share/hadoop/common/*.jar
▪ share/hadoop/mapreduce/*.jar
These jar files should be added as dependencies while writing the program.
Compiling the java code.
$ javac -classpath ${HADOOP_CLASSPATH} -d wc_class/ wc.java
Packing the program in one jar file.
$ jar -cvf wc.jar -C wc_class/
Copy ‘input’ file to HDFS.
$ hadoop fs -put input /WordCount/input
Run program in Hadoop Map Reduce paradigm.
$ hadoop jar wc.jar wc /WordCount/input /WordCount/output
Snapshot of above command running:
Printing Output File from HDFS:
$ hadoop fs -cat /WordCount/output/part-00000