Name: Chandra Prakash Chaudhary
Roll No: 079MSDSA005
How to Set Up an Apache Hadoop Cluster
on AWS EC2
Introduction
Let's talk about how to set up an Apache Hadoop cluster on AWS.
In a previous article, we discussed setting up a Hadoop processing
pipeline on a single node (laptop). That involved running all the
components of Hadoop on a single machine. In the setup we discuss
here, we set up a multi-node cluster to run processing jobs.
Our setup involves a single NameNode and three DataNodes which
serve as processing slaves.
Starting with setting up the AWS EC2 resources, we take you all the
way through complete configuration of the machines in this
arrangement.
We use Apache Hadoop 2.7.3 for this demonstration.
Pre-Requisites
Sign up for an AWS account if you don’t already have one. You get
some resources free for the first year, including an EC2 Micro
Instance.
AWS EC2 Startup
We will now create 4 instances of Ubuntu Server 16.04 LTS using
Amazon EC2.
Select Instance
Go to your AWS Console, click on Launch Instance, and select
Ubuntu Server 16.04 LTS.
Instance Type
For the instance type, we choose t2.micro since that is sufficient for
the purposes of this demo. If you need a high-memory or high-CPU
instance, you can select one of those instead.
Instance Details
Here, we request 4 instances of the selected machine type. We also
choose a subnet (us-west-1b) just so we can launch into the same
location if we need more machines.
Storage
For our purpose, the default instance storage of 8GB is sufficient. If
you need more storage, either increase the size or attach a disk by
clicking “Add Volume”. If you add a volume, you will need to attach the
volume to your instance, format it and mount it. Since this is a
beginner tutorial, these steps are not covered here.
Setting Up Instances
Once the instances are up and running, it is time to set them up for
our purpose. This includes the following:
Set up password-less login between the namenode and the
datanodes.
Install Java.
Set up Hadoop.
Copy Instance Public DNS Name
We now need to copy the Public DNS Name of each node (1
namenode and 3 datanodes). These names are used in the
configuration steps below. Since the DNS names are specific to each
setup, we refer to them using placeholders, as follows.
For example, in the description below, if you see <nnode>, substitute
with the value of <NameNode Public DNS>. Similarly for <dnode1>
and so on.
Common Setup on All Nodes
Some setup is common to all the nodes: NameNode and DataNodes.
This is covered in this section.
All Nodes: Update the instance
Let us update the OS with the latest available software patches.
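On Ubuntu, a typical update sequence is the following (run it on every
node):
sudo apt-get update
sudo apt-get -y upgrade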
After the updates, the system might require a restart. Perform a
Reboot from the EC2 Instances page.
All Nodes: Install Java
Let us now install Java. We install the openjdk-8-jdk-headless
package on all the nodes.
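For example:
sudo apt-get -y install openjdk-8-jdk-headless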
All Nodes: Install Apache Hadoop
Install Apache Hadoop 2.7.3 on all the instances. Obtain the link to
download from the Apache website and run the following commands.
We install Hadoop under a directory named server in the home
directory.
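For example, assuming the tarball is fetched from the Apache release
archive (your mirror URL may differ):
mkdir -p ~/server
cd ~/server
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz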
tar xvzf hadoop-2.7.3.tar.gz
All Nodes: Setup JAVA_HOME
On each of the nodes,
edit ~/server/hadoop-2.7.3/etc/hadoop/hadoop-env.sh.
Replace this line:
export JAVA_HOME=${JAVA_HOME}
With the following line:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
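If you prefer to make the change non-interactively on each node, a
sed one-liner along these lines should work:
sed -i 's|export JAVA_HOME=${JAVA_HOME}|export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64|' ~/server/hadoop-2.7.3/etc/hadoop/hadoop-env.sh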
All Nodes: Update core-site.xml
On each node,
edit ~/server/hadoop-2.7.3/etc/hadoop/core-site.xml and
replace the following lines:
<configuration>
</configuration>
with these (as mentioned above, replace <nnode> with NameNode’s
public DNS):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://<nnode>:9000</value>
</property>
</configuration>
All Nodes: Create Data Dir
HDFS needs the data directory to be present on each node: the
namenode and the 3 datanodes. Create this directory as shown and
change its ownership to the user ubuntu.
sudo mkdir -p /usr/local/hadoop/hdfs/data
sudo chown -R ubuntu:ubuntu /usr/local/hadoop/hdfs/data
Create an Image and Launch the DataNode Instances
Alternatively, since the setup so far is identical on every node, you can
create an image (AMI) of the configured instance and launch the three
DataNode instances from that image, instead of repeating the steps
above on each node.
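If you prefer the command line to the console, the AWS CLI has
equivalents; a sketch, with the IDs and names as placeholders:
aws ec2 create-image --instance-id <instance-id> --name hadoop-common-setup
aws ec2 run-instances --image-id <ami-id> --count 3 --instance-type t2.micro --key-name <key-name> --subnet-id <subnet-id>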
Configuring NameNode
After performing configuration common to all nodes, let us now setup
the NameNode.
Namenode: Password-less SSH
As mentioned before, we need password-less SSH between the
namenode and the datanodes. Let us create a public-private key pair
for this purpose on the namenode.
namenode> ssh-keygen
Use the default (/home/ubuntu/.ssh/id_rsa) for the key location
and hit enter for an empty passphrase.
Datanodes: Setup Public Key
The public key is saved in /home/ubuntu/.ssh/id_rsa.pub. We
need to copy this file from the namenode to each data node and
append the contents
to /home/ubuntu/.ssh/authorized_keys on each data node.
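One way to copy the file is with scp from the namenode, using the
EC2 launch key pair (this assumes the .pem file from the EC2 launch
is present on the namenode; the key file name below is a
placeholder):
namenode> scp -i ~/.ssh/mykey.pem ~/.ssh/id_rsa.pub ubuntu@<dnode1>:~/
namenode> scp -i ~/.ssh/mykey.pem ~/.ssh/id_rsa.pub ubuntu@<dnode2>:~/
namenode> scp -i ~/.ssh/mykey.pem ~/.ssh/id_rsa.pub ubuntu@<dnode3>:~/
Then, on each data node: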
datanode1> cat id_rsa.pub >> ~/.ssh/authorized_keys
datanode2> cat id_rsa.pub >> ~/.ssh/authorized_keys
datanode3> cat id_rsa.pub >> ~/.ssh/authorized_keys
Namenode: Setup SSH Config
SSH uses a configuration file located at ~/.ssh/config for various
parameters. Set it up as shown below. Again, substitute each node’s
Public DNS for the HostName parameter (for example, replace
<nnode> with EC2 Public DNS for NameNode).
Host nnode
HostName <nnode>
User ubuntu
IdentityFile ~/.ssh/id_rsa
Host dnode1
HostName <dnode1>
User ubuntu
IdentityFile ~/.ssh/id_rsa
Host dnode2
HostName <dnode2>
User ubuntu
IdentityFile ~/.ssh/id_rsa
Host dnode3
HostName <dnode3>
User ubuntu
IdentityFile ~/.ssh/id_rsa
At this point, verify that password-less operation works from the
namenode to each node as follows (the first time, you will see a
warning that the host is unknown and be asked whether you want to
connect; type yes and hit enter. This is needed only once per host.
Type exit to return after each login):
namenode> ssh nnode
namenode> ssh dnode1
namenode> ssh dnode2
namenode> ssh dnode3
Namenode: Setup HDFS Properties
On the NameNode, edit the following
file: ~/server/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
Replace:
<configuration>
</configuration>
With:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop/hdfs/data</value>
</property>
</configuration>
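With dfs.replication set to 3, each HDFS block will be stored on all
three DataNodes in this cluster.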
Namenode: Setup MapReduce Properties
On the NameNode, copy the file
~/server/hadoop-2.7.3/etc/hadoop/mapred-site.xml.template
to ~/server/hadoop-2.7.3/etc/hadoop/mapred-site.xml.
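For example:
namenode> cp ~/server/hadoop-2.7.3/etc/hadoop/mapred-site.xml.template ~/server/hadoop-2.7.3/etc/hadoop/mapred-site.xml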
Replace:
<configuration>
</configuration>
With this (as above, replace <nnode> with NameNode's public DNS):
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value><nnode>:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
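Note that it is mapreduce.framework.name set to yarn that routes jobs
to YARN; the jobtracker address is a holdover from MapReduce v1.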
Namenode: Setup YARN Properties
Next we need to set
up ~/server/hadoop-2.7.3/etc/hadoop/yarn-site.xml on
the NameNode. Replace the following:
<configuration>
<!-- Site specific YARN configuration properties -->
</configuration>
With (as before, replace <nnode> with NameNode’s public DNS):
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value><nnode></value>
</property>
</configuration>
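The yarn.resourcemanager.hostname property is how the
NodeManagers on the DataNodes locate the ResourceManager
running on the NameNode.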
Namenode: Setup Master and Slaves
On the NameNode,
create ~/server/hadoop-2.7.3/etc/hadoop/masters with the
following (replace <nnode> with the NameNode’s public DNS):
<nnode>
Also replace all content
in ~/server/hadoop-2.7.3/etc/hadoop/slaves with the following
(replace each of <dnode1>, etc. with the appropriate DataNode's
public DNS):
<dnode1>
<dnode2>
<dnode3>
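In Hadoop 2.x, the masters file controls where start-dfs.sh launches
the SecondaryNameNode, while the slaves file lists the hosts on which
the start scripts launch the DataNode and NodeManager daemons.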
Configuring Data Nodes
After covering configuration common to both the NameNode and the
DataNodes, a little configuration remains that is specific to the
DataNodes. On each data node, edit the
file ~/server/hadoop-2.7.3/etc/hadoop/hdfs-site.xml and
replace the following:
<configuration>
</configuration>
With:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop/hdfs/data</value>
</property>
</configuration>
Starting the Hadoop Cluster
After all that configuration, it is now time to test drive the cluster. First,
format the HDFS file system on the NameNode:
namenode> cd ~/server
namenode> ./hadoop-2.7.3/bin/hdfs namenode -format
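Note that formatting erases any existing HDFS metadata, so run it
only once, when the cluster is first set up.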
Finally, start up the Hadoop cluster. After this step you should have
all the daemons running on the NameNode and the DataNodes.
namenode> ./hadoop-2.7.3/sbin/start-dfs.sh
namenode> ./hadoop-2.7.3/sbin/start-yarn.sh
namenode> ./hadoop-2.7.3/sbin/mr-jobhistory-daemon.sh start historyserver
Check the console carefully for any error messages. If everything
looks OK, check for the daemons using jps.
namenode> jps
You can also check the DataNodes for Java processes.
datanode1> jps
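If everything is healthy, jps on the NameNode should list daemons
along these lines (the process IDs are illustrative):
2305 NameNode
2541 SecondaryNameNode
2707 ResourceManager
3021 JobHistoryServer
3159 Jps
Each DataNode should show DataNode and NodeManager processes
in addition to Jps.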