
HADOOP AND BIG DATA(UNIT-II)

UNIT-II
Big Data: Big Data is a collection of large datasets that cannot be processed using
traditional computing techniques. It is not a single technique or tool; rather, it
involves many areas of business and technology.

Big Data involves huge volume, high velocity, and an extensible variety of data. The data involved is
generally of three types:

1. Structured data: relational data.
2. Semi-structured data: XML data.
3. Unstructured data: Word, PDF, text, media logs.

Big data can be described by the following characteristics:


 Volume: The quantity of generated and stored data. The size of the data determines the
value and potential insight, and whether it can actually be considered big data or not.
 Variety: The type and nature of the data. This helps people who analyze it to effectively
use the resulting insight.
 Velocity: In this context, the speed at which the data is generated and processed to meet
the demands and challenges that lie in the path of growth and development.
 Variability: Inconsistency of the data set can hamper processes to handle and manage it.
 Veracity: The quality of captured data can vary greatly, affecting accurate analysis.

Benefits of Big Data:


Using information kept in social networks such as Facebook, marketing agencies learn
about the response to their campaigns, promotions, and other advertising media.
Using social media information such as the preferences and product perceptions of their
consumers, product companies and retail organizations plan their production.
Using data on the previous medical history of patients, hospitals provide better and
quicker service.

Why Big Data Now?


1. Low cost storage to store data that was discarded earlier.
2. Powerful multi-core processors.
3. Low latency made possible by distributed computing: compute clusters and grids connected
via high-speed networks.
4. Virtualization: partition, aggregate, and isolate resources at any scale and change the
allocation dynamically to minimize latency.
5. Affordable storage and computing with minimal manpower via clouds, made possible by
advances in networking.
6. Better understanding of task distribution (MapReduce) and computing architecture (Hadoop).
7. Advanced analytical techniques (machine learning).
8. Managed Big Data platforms: cloud service providers such as Amazon Web Services offer
Elastic MapReduce, Simple Storage Service (S3) and HBase, a column-oriented database;
Google offers BigQuery and the Prediction API.
9. Open-source software: OpenStack, PostgreSQL.
10. In March 2012, the Obama administration announced $200M for Big Data research,
distributed via NSF, NIH, DOE, DoD, DARPA, and USGS (the Geological Survey).

Big Data Applications:


 Monitor premature infants and alert when intervention is needed.
 Predict machine failures in manufacturing.
 Prevent traffic jams, save fuel, reduce pollution.
 Big Data is used in Banking, Education, Finance, Communications, Media and
Entertainment, Healthcare, Manufacturing and Natural Resources, Government,
Insurance, Transportation, and Retail and Wholesale Trade.

Where does Big Data come from?


The original big data was web data -- as in the entire Internet! Remember that Hadoop was built to
index the web. These days big data comes from multiple sources:

 Web data -- still big data.
 Social media data: sites like Facebook, Twitter, and LinkedIn generate a large amount of
data.
 Click stream data: when users navigate a website, the clicks are logged for further
analysis (such as navigation patterns). Click stream data is important in online advertising
and e-commerce.
 Sensor data: sensors embedded in roads to monitor traffic, and miscellaneous other
applications, generate a large volume of data.
Examples of Big Data in the real world:
 Facebook: 40 PB of data, captures 100 TB/day
 Yahoo: 60 PB of data
 Twitter: 8 TB/day
 EBay: 40 PB of data, captures 50 TB/day

Challenges of Big Data:


1. Sheer size of Big Data: Big data is... well... big in size! How much data constitutes Big
Data is not very clear-cut, so let's not get bogged down in that debate. For a small
company that is used to dealing with data in gigabytes, 10 TB of data would be big;
for companies like Facebook and Yahoo, petabytes are big. The sheer size of big
data makes it impossible (or at least cost-prohibitive) to store in traditional storage
such as databases or conventional filers, where the cost per gigabyte stored is already high.
2. Big Data is unstructured or semi-structured: A lot of Big Data is unstructured. For
example, a click stream log record might look like: time stamp, user_id, page, referrer_page.
This lack of structure makes relational databases poorly suited to storing Big Data; besides, not
many databases can cope with storing billions of rows of data. (A small parsing sketch follows this section.)
3. No point in just storing big data if we can't process it: Storing Big Data is only part of the
game. We have to process it to mine intelligence out of it. Traditional storage
systems are pretty 'dumb' in that they just store bits -- they don't offer any processing
power. The traditional data processing model has data stored in a 'storage cluster', which
is copied over to a 'compute cluster' for processing, and the results are written back to the
storage cluster.

This model, however, doesn't quite work for Big Data, because copying so much data out to a
compute cluster might be too time-consuming or impossible. So what is the answer?
One solution is to process Big Data 'in place' -- with a storage cluster doubling as a compute
cluster.
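
To make the structure problem in challenge 2 concrete, here is a minimal sketch in Java (Hadoop's
native language) of turning one semi-structured clickstream record into a (user_id, page) pair. The
comma-separated field layout is the hypothetical one mentioned above, not a real log format, and the
class and method names are made up for illustration.

// Sketch: parse one clickstream record of the hypothetical form
// "time stamp, user_id, page, referrer_page" into a (user_id, page) pair.
public class ClickstreamParser {

    static String[] parse(String logLine) {
        String[] fields = logLine.split(",");
        String userId = fields[1].trim();   // second field: user_id
        String page   = fields[2].trim();   // third field: page visited
        return new String[] { userId, page };
    }

    public static void main(String[] args) {
        String[] kv = parse("2009-01-04 15:32:22, user42, /products/tv, /home");
        System.out.println(kv[0] + " -> " + kv[1]);   // prints: user42 -> /products/tv
    }
}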


How Hadoop solves the Big Data problem:


 Hadoop clusters scale horizontally
More storage and compute power can be achieved by adding more nodes to a Hadoop
cluster. This eliminates the need to buy more and more powerful and expensive
hardware.
 Hadoop can handle unstructured / semi-structured data
Hadoop doesn't enforce a 'schema' on the data it stores. It can handle arbitrary text and
binary data. So Hadoop can 'digest' any unstructured data easily.
 Hadoop clusters provide storage and computing
We saw how having separate storage and processing clusters is not the best fit for Big
Data. Hadoop clusters provide storage and distributed computing all in one.

What is Hadoop ?
Hadoop is an open source framework for writing and running distributed applications that
process large amounts of data. Distributed computing is a wide and varied field, but the key
distinctions of Hadoop are that it is
 Accessible—Hadoop runs on large clusters of commodity machines or on cloud
computing services such as Amazon’s Elastic Compute Cloud (EC2 ).
 Robust—Because it is intended to run on commodity hardware, Hadoop is architected
with the assumption of frequent hardware malfunctions. It can gracefully handle most
such failures.
 Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the
cluster.
 Simple—Hadoop allows users to quickly write efficient parallel code.


Comparing SQL databases and Hadoop:


Given that Hadoop is a framework for processing data, what makes it better than standard
relational databases, the workhorse of data processing in most of today’s applications? One
reason is that SQL (structured query language) is by design targeted at structured data. Many of
Hadoop’s initial applications deal with unstructured data such as text. From this perspective
Hadoop provides a more general paradigm than SQL. For working only with structured data, the
comparison is more nuanced. In principle, SQL and Hadoop can be complementary, as SQL is a
query language which can be implemented on top of Hadoop as the execution engine. But in
practice, SQL databases tend to refer to a whole set of legacy technologies, with several
dominant vendors, optimized for a historical set of applications. Many of these existing
commercial databases are a mismatch to the requirements that Hadoop targets. With that in mind,
let’s make a more detailed comparison of Hadoop with typical SQL databases on specific
dimensions.
1. SCALE-OUT INSTEAD OF SCALE-UP: Scaling commercial relational databases is
expensive. Their design is more friendly to scaling up. To run a bigger database you need
to buy a bigger machine. In fact, it’s not unusual to see server vendors market their
expensive high-end machines as “database-class servers.” Unfortunately, at some point
there won’t be a big enough machine available for the larger data sets. More importantly,
the high-end machines are not cost effective for many applications. For example, a
machine with four times the power of a standard PC costs a lot more than putting four
such PCs in a cluster. Hadoop is designed to be a scale-out architecture operating on a
cluster of commodity PC machines. Adding more resources means adding more machines
to the Hadoop cluster. Hadoop clusters with tens to hundreds of machines are standard. In
fact, other than for development purposes, there’s no reason to run Hadoop on a single
server.
2. KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES: A fundamental tenet
of relational databases is that data resides in tables having relational structure defined by
a schema. Although the relational model has great formal properties, many modern
applications deal with data types that don’t fit well into this model. Text documents,
images, and XML files are popular examples. Also, large data sets are often unstructured
or semistructured. Hadoop uses key/value pairs as its basic data unit, which is flexible
enough to work with the less-structured data types. In Hadoop, data can originate in any
form, but it eventually transforms into (key/value) pairs for the processing functions to
work on.
3. FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF
DECLARATIVE QUERIES (SQL): SQL is fundamentally a high-level declarative
language. You query data by stating the result you want and let the database engine figure
out how to derive it. Under MapReduce you specify the actual steps in processing the
data, which is more analogous to an execution plan for a SQL engine. Under SQL you
have query statements; under MapReduce you have scripts and code. MapReduce allows
you to process data in a more general fashion than SQL queries. For example, you can
build complex statistical models from your data or reformat your image data. SQL is not
well designed for such tasks. On the other hand, when working with data that does fit well
into relational structures, some people may find MapReduce less natural to use and
SQL programming more intuitive. Note, however, that many extensions are available to
allow one to take advantage of the scalability of Hadoop while programming in more
familiar paradigms. In fact, some enable you to write queries in a SQL-like language, and
your query is automatically compiled into MapReduce code for execution. (A small
word-count sketch in Java follows this comparison.)
4. OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS:
Hadoop is designed for offline processing and analysis of large-scale data. It doesn’t
work for random reading and writing of a few records, which is the type of load for
online transaction processing. In fact, as of this writing (and in the foreseeable future),
Hadoop is best used as a write-once, read-many-times type of data store. In this aspect
it’s similar to data warehouses in the SQL world. You have seen how Hadoop relates to
distributed systems and SQL databases at a high level. Let’s learn how to program in it.
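
To give a concrete taste of the key/value and MapReduce ideas above, here is a minimal word-count
sketch in Java using the org.apache.hadoop.mapreduce API. It is an illustrative sketch, not code taken
from these notes: the map function emits a (word, 1) pair for every word in a line, and the reduce
function sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // emit (word, 1)
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();                    // add up the 1s for this word
        }
        context.write(word, new IntWritable(sum));
    }
}

A driver program that configures and submits this job to the cluster is sketched later, after the
discussion of the Hadoop job flow.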

Google File System: The Google File System (GFS) is a scalable distributed file system for large,
distributed, data-intensive applications. It provides fault tolerance while running on inexpensive
commodity hardware, and it delivers high aggregate performance to a large number of clients.
GFS provides a familiar file system interface, though it does not implement a standard API such
as POSIX (Portable Operating System Interface for Unix). Files are organized hierarchically in
directories and identified by path names. GFS supports the usual operations to create, delete, open,
close, read, and write files. Moreover, GFS has snapshot and record append operations. Snapshot
creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to
append data to the same file concurrently while guaranteeing the atomicity of each individual
client's append.

Architecture: A GFS cluster consists of a single master and multiple chunkservers and is accessed
by multiple clients, as shown in the figure.

Each of these is typically a commodity Linux machine running a user-level server
process. It is easy to run both a chunkserver and a client on the same machine, as long as
machine resources permit and the lower reliability caused by running possibly flaky application
code is acceptable.


Files are divided into fixed-size chunks. Each chunk is identified by an immutable and
globally unique 64 bit chunk handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by
a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers.
By default, we store three replicas, though users can designate different replication levels for
different regions of the file namespace. The master maintains all file system metadata. This
includes the namespace, access control information, the mapping from files to chunks, and the
current locations of chunks. It also controls system-wide activities such as chunk lease
management, garbage collection of orphaned chunks, and chunk migration between
chunkservers. The master periodically communicates with each chunkserver in HeartBeat
messages to give it instructions and collect its state.

GFS client code linked into each application implements the file system API (Application
Program Interface) and communicates with the master and chunkservers to read or write data on
behalf of the application. Clients interact with the master for metadata operations, but all data-
bearing communication goes directly to the chunkservers. We do not provide the POSIX API
and therefore need not hook into the Linux vnode layer. Neither the client nor the chunkserver
caches file data. Client caches offer little benefit because most applications stream through huge
files or have working sets too large to be cached. Not having them simplifies the client and the
overall system by eliminating cache coherence issues. (Clients do cache metadata, however.)
Chunkservers need not cache file data because chunks are stored as local files and so Linux’s
buffer cache already keeps frequently accessed data in memory.


1. Single Master: Having a single master vastly simplifies the design and enables the
master to make sophisticated chunk placement and replication decisions
using global knowledge. However, we must minimize its involvement in reads and writes
so that it does not become a bottleneck. Clients never read and write file data through the
master. Instead, a client asks the master which chunkservers it should contact. (A small
sketch of this offset-to-chunk-index translation appears after the consistency model discussion below.)
2. Chunk Size : Chunk size is one of the key design parameters. We have chosen 64 MB,
which is much larger than typical file system block sizes. Each chunk replica is stored as
a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation
avoids wasting space due to internal fragmentation, perhaps the greatest objection against
such a large chunk size. A large chunk size offers several important advantages. First, it
reduces clients’ need to interact with the master because reads and writes on the same
chunk require only one initial request to the master for chunk location information.

The reduction is especially significant for our workloads because applications mostly
read and write large files sequentially. Even for small random reads, the client can
comfortably cache all the chunk location information for a multi-TB working set. Second,
since on a large chunk, a client is more likely to perform many operations on a given
chunk, it can reduce network overhead by keeping a persistent TCP connection to the
chunkserver over an extended period of time. Third, it reduces the size of the metadata
stored on the master. This allows us to keep the metadata in memory, which in turn
brings other advantages. On the other hand, a large chunk size, even with lazy space
allocation, has its disadvantages. A small file consists of a small number of chunks,
perhaps just one. The chunkservers storing those chunks may become hot spots if many
clients are accessing the same file. In practice, hot spots have not been a major issue
because our applications mostly read large multi-chunk files sequentially. However, hot
spots did develop when GFS was first used by a batch-queue system: an executable was
written to GFS as a single-chunk file and then started on hundreds of machines at the
same time.

3. Metadata: The master stores three major types of metadata: the file and chunk
namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas.
All metadata is kept in the master’s memory. The first two types (namespaces and file-to
chunk mapping) are also kept persistent by logging mutations to an operation log stored
on the master’s local disk and replicated on remote machines. Using a log allows us to
update the master state simply, reliably, and without risking inconsistencies in the event
of a master crash. The master does not store chunk location information persistently.
Instead, it asks each chunkserver about its chunks at master startup and whenever a
chunkserver joins the cluster.
i. In-Memory Data Structures: Since metadata is stored in memory, master
operations are fast. Furthermore, it is easy and efficient for the master to
periodically scan through its entire state in the background. This periodic
scanning is used to implement chunk garbage collection, re-replication in the
presence of chunkserver failures, and chunk migration to balance load and disk
space usage across chunkservers.
ii. Chunk Locations: The master does not keep a persistent record of which
chunkservers have a replica of a given chunk. It simply polls chunkservers for that
information at startup. The master can keep itself up-to-date thereafter because it
controls all chunk placement and monitors chunkserver status with regular
HeartBeat messages.
4. Consistency Model: GFS has a relaxed consistency model that supports highly
distributed applications well but remains relatively simple and efficient to implement.
The model covers GFS's guarantees and what they mean to applications; how GFS
maintains these guarantees is detailed in the original GFS paper.
i. Guarantees by GFS
ii. Operation Log

The operation log contains a historical record of critical metadata changes. It is central to
GFS. Not only is it the only persistent record of metadata, but it also serves as a logical
timeline that defines the order of concurrent operations.
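
To make the client-to-master interaction in items 1 and 2 concrete: before asking the master for chunk
locations, a client translates a file byte offset into a chunk index. The following is a minimal sketch in
Java under the 64 MB chunk size described above; the class and method names are made up for illustration.

// Sketch: translating a byte offset into a GFS-style chunk index and
// intra-chunk offset, assuming the 64 MB chunk size described above.
public class ChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB

    // Which chunk of the file holds this byte offset?
    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    // Offset of the byte within that chunk.
    static long offsetInChunk(long byteOffset) {
        return byteOffset % CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024;  // byte 200 MB into the file
        // The client would send (file name, chunk index 3) to the master and
        // receive the chunk handle plus replica locations in return.
        System.out.println(chunkIndex(offset) + ", " + offsetInChunk(offset));
    }
}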

Advantages and disadvantages of large sized chunks in Google File System:

Chunk size is one of the key design parameters. In GFS it is 64 MB, which is much
larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a
chunkserver and is extended only as needed.

Advantages:

1. It reduces clients’ need to interact with the master because reads and writes on the same
chunk require only one initial request to the master for chunk location information.
2. Since on a large chunk, a client is more likely to perform many operations on a given
chunk, it can reduce network overhead by keeping a persistent TCP connection to the
chunk server over an extended period of time.
3. It reduces the size of the metadata stored on the master. This allows us to keep the
metadata in memory, which in turn brings other advantages.

Disadvantages:
1. Internal fragmentation: a large chunk can waste space when files are small, although lazy
space allocation largely avoids this.
2. A small file consists of a small number of chunks, perhaps just one. The chunkservers
storing those chunks may become hot spots if many clients are accessing the same file.
In practice, hot spots have not been a major issue because the applications mostly read
large multi-chunk files sequentially. When they do occur, a higher replication factor and
allowing clients to read from other clients can mitigate the problem.


History of Hadoop: Hadoop was created specifically to manage Big Data, and its history begins with
Google, probably the most popular search engine in the online world. To provide search results for its
users, Google had to store and process huge amounts of data. In the 1990s Google started searching for
ways to do this, and in 2003 it presented the world with an innovative Big Data storage idea called GFS,
the Google File System, a technique for storing very large amounts of data. In 2004 Google described
another technique called MapReduce for processing the data stored in GFS. It took Google well over a
decade to come up with and fine-tune these ideas for storing and processing Big Data.

These techniques were presented to the world only as descriptions in white papers. Interested people
were given the idea of what GFS is and how it would store data, and what MapReduce is and how it
would process the data stored in GFS, but no working model or code was provided. Then, around
2006-07, another major search engine company, Yahoo, sponsored the development of open-source
implementations of these ideas, which became HDFS and MapReduce. HDFS and MapReduce are the
two core concepts that make up Hadoop.

Hadoop itself was created by Doug Cutting. People who know a little about Hadoop know that its logo
is a yellow elephant, and many wonder why Doug Cutting chose such a name and logo for his project.
Hadoop was the name Doug Cutting's young son gave to his favorite soft toy, a yellow stuffed elephant,
and that is where the name and the logo for the project come from. This is the brief history behind
Hadoop and its name.

 Naming Conventions?
Doug Cutting drew inspiration from his family
– Lucene: Doug’s wife’s middle name
– Nutch: A word for "meal" that his son used as a toddler
– Hadoop: Yellow stuffed elephant named by his son


HADOOP:
Hadoop is made up of 2 parts:

1. HDFS – Hadoop Distributed File System
2. MapReduce – the programming model that is used to work on the data present in HDFS.

HDFS (Hadoop Distributed File System): HDFS is a file system written in Java that resides in
user space, unlike traditional file systems such as FAT, NTFS, and ext2 that reside in kernel space.
HDFS was primarily written to store large amounts of data (terabytes and petabytes). HDFS was
built in line with Google's paper on GFS.

MapReduce: MapReduce is the programming model, with Java as the primary programming
language, used to retrieve and process data from files stored in HDFS. All data in HDFS is stored as
files. MapReduce, too, was built in line with another Google paper. Google, apart from their papers,
did not release their implementations of GFS and MapReduce; the open-source community built
Hadoop's HDFS and MapReduce based on those papers. The initial adoption of Hadoop was at
Yahoo Inc., where it gained good momentum and went on to become part of their production
systems. After Yahoo, many organizations such as LinkedIn, Facebook, and Netflix have
successfully implemented Hadoop within their organizations.

Building Blocks of Hadoop:


Hadoop uses HDFS to store files efficiently in the cluster. When a file is placed in HDFS
it is broken down into blocks, 64 MB block size by default. These blocks are then replicated
across the different nodes (DataNodes) in the cluster. The default replication value is 3, i.e. there
will be 3 copies of the same block in the cluster. We will see later on why we maintain replicas
of the blocks in the cluster. A Hadoop cluster can comprise a single node (a single-node cluster)
or thousands of nodes. Once you have installed Hadoop, you can try out a few basic commands
to work with HDFS.

The above diagram depicts a 6-node Hadoop cluster. In the diagram you see that the
NameNode, Secondary NameNode and the JobTracker are running on a single machine. Usually,
in production clusters having more than 20-30 nodes, these daemons run on separate nodes.
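
For a programmatic taste of the basic HDFS operations just mentioned, the following is a minimal
sketch in Java using the standard org.apache.hadoop.fs.FileSystem API against a running cluster; the
paths are hypothetical and only serve the example.

// Sketch: copying a local file into HDFS and listing a directory,
// using the standard Hadoop FileSystem API. Paths are illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml, etc.
        FileSystem fs = FileSystem.get(conf);       // handle to the configured file system

        // Copy a local file into HDFS (hypothetical paths).
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                             new Path("/user/hadoop-user/sample.txt"));

        // List the contents of the target directory.
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop-user"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}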


Hadoop follows a master-slave architecture. As mentioned earlier, a file in HDFS is split into
blocks and replicated across DataNodes in a Hadoop cluster. You can see that the three files A, B
and C have been split into blocks and replicated with a factor of 3 across the different DataNodes. Now
let us go through each node and daemon:

1. NameNode: The NameNode is the node where Hadoop stores all the location
information of the files in HDFS. In other words, it holds the metadata for HDFS.
Whenever a file is placed in the cluster, a corresponding entry of its location is maintained
by the NameNode. So, for the files A, B and C we would have something like the following in
the NameNode:
File A – DataNode1, DataNode2, DataNode4
File B – DataNode1, DataNode3, DataNode4
File C – DataNode2, DataNode3, DataNode4
This information is required when retrieving data from the cluster as the data is
spread across multiple machines. The NameNode is a Single Point of Failure for the
Hadoop Cluster.
2. Secondary NameNode: IMPORTANT – the Secondary NameNode is not a failover
node for the NameNode. The Secondary NameNode is responsible for performing
periodic housekeeping functions for the NameNode; it only creates checkpoints of the
file system metadata held by the NameNode.
3. DataNode: The DataNode is responsible for storing the files in HDFS. It manages the
file blocks within the node. It sends information to the NameNode about the files and
blocks stored in that node and responds to the NameNode for all filesystem operations.
4. JobTracker: JobTracker is responsible for taking in requests from a client and assigning
TaskTrackers with tasks to be performed. The JobTracker tries to assign tasks to the
TaskTracker on the DataNode where the data is locally present (Data Locality). If that is
not possible it will at least try to assign tasks to TaskTrackers within the same rack. If for
some reason the node fails the JobTracker assigns the task to another TaskTracker where
the replica of the data exists since the data blocks are replicated across the DataNodes.
This ensures that the job does not fail even if a node fails within the cluster.
5. TaskTracker: TaskTracker is a daemon that accepts tasks (Map, Reduce and Shuffle)
from the JobTracker. The TaskTracker keeps sending a heartbeat message to
the JobTracker to notify it that it is alive. Along with the heartbeat it also reports the free slots
available within it to process tasks. The TaskTracker starts and monitors the Map and Reduce
tasks and sends progress/status information back to the JobTracker.


Fig 1: JobTracker and TaskTracker interaction. After a client calls the JobTracker to
begin a data processing job, the JobTracker partitions the work and assigns different map
and reduce tasks to each TaskTracker in the cluster.

Fig 2: Topology of a typical Hadoop cluster. It's a master/slave architecture in which the
NameNode and JobTracker are masters and the DataNodes and TaskTrackers are slaves.


All the above daemons run within their own JVMs. A typical (simplified) flow in Hadoop
is as follows:

1. A client (usually a MapReduce program) submits a job to the JobTracker.
2. The JobTracker gets information from the NameNode on the location of the data within
the DataNodes. The JobTracker places the client program (usually a jar file along with
the configuration file) in HDFS. Once it is placed, the JobTracker tries to assign tasks to
TaskTrackers on the DataNodes based on data locality.
3. The TaskTracker takes care of starting the Map tasks on the DataNodes by picking up
the client program from the shared location on HDFS.
4. The progress of the operation is relayed back to the JobTracker by the TaskTracker.
5. On completion of a Map task an intermediate file is created on the local filesystem of
the TaskTracker.
6. Results from the Map tasks are then passed on to the Reduce tasks.
7. The Reduce tasks work on all data received from the Map tasks and write the final output
to HDFS.
8. After the tasks complete, the intermediate data generated by the TaskTrackers is deleted. A
very important feature of Hadoop to note here is that the program goes to where the
data is, and not the other way around, resulting in efficient processing of data. (A minimal
sketch of a client driver that submits a job follows this list.)
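
As referenced above, here is a minimal sketch of the client-side driver from step 1: a Java program that
configures and submits a MapReduce job. WordCountMapper and WordCountReducer refer to the
earlier word-count sketch, and the input/output paths are hypothetical.

// Sketch: a client driver that submits a MapReduce job (step 1 of the flow above).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");      // newer releases: Job.getInstance(conf, ...)
        job.setJarByClass(WordCountDriver.class);   // the jar shipped to the cluster
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/hadoop-user/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop-user/output"));
        // Blocks until the job tracker reports completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}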

Introducing and Configuring a Hadoop Cluster (Local, Pseudo-Distributed and Fully Distributed Modes):
Setting up SSH (Secure Shell) for a Hadoop cluster: When setting up a Hadoop cluster, you'll
need to designate one specific node as the master node. As shown in figure 2, this server will
typically host the NameNode and JobTracker daemons. It will also serve as the base station
contacting and activating the DataNode and TaskTracker daemons on all of the slave nodes. As
such, we need to define a means for the master node to remotely access every node in your
cluster.
Hadoop uses passphraseless SSH for this purpose. SSH utilizes standard public key
cryptography to create a pair of keys for user verification: one public, one private. The public
key is stored on every node in the cluster, and the master node uses the matching private key when
attempting to access a remote machine. With both pieces of information, the target machine can
validate the login attempt.
1. Define a common account: We've been speaking in general terms of one node accessing
another; more precisely, this access is from a user account on one node to another user
account on the target machine. For Hadoop, the accounts should have the same username
on all of the nodes (we use hadoop-user in these notes), and for security purposes we
recommend a user-level account. This account is only for managing your Hadoop
cluster. Once the cluster daemons are up and running, you'll be able to run your actual
MapReduce jobs from other accounts.


2. Verify SSH installation: The first step is to check whether SSH is installed on your
nodes. We can easily do this by use of the "which" UNIX command:
[hadoop-user@master]$ which ssh
/usr/bin/ssh
[hadoop-user@master]$ which sshd
/usr/bin/sshd
[hadoop-user@master]$ which ssh-keygen
/usr/bin/ssh-keygen
If you instead receive an error message such as
/usr/bin/which: no ssh in (/usr/bin:/bin:/usr/sbin...
install OpenSSH (www.openssh.com) via a Linux package manager or by downloading
the source directly. (Better yet, have your system administrator do it for you.)
3. Generate SSH key pair: Having verified that SSH is correctly installed on all nodes of
the cluster, we use ssh-keygen on the master node to generate an RSA key pair. Be
certain to avoid entering a passphrase, or you'll have to manually enter that phrase every
time the master node attempts to access another node.
[hadoop-user@master]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop-user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop-user/.ssh/id_rsa.
Your public key has been saved in /home/hadoop-user/.ssh/id_rsa.pub.
After creating your key pair, your public key will be of the form
[hadoop-user@master]$ more /home/hadoop-user/.ssh/id_rsa.pub

and we next need to distribute this public key across your cluster.
4. Distribute public key and validate logins: Albeit a bit tedious, you’ll next need to copy
the public key to every slave node as well as the master node:
[hadoop-user@master]$ scp ~/.ssh/id_rsa.pub hadoop-user@target:~/master_key
Manually log in to the target node and set the master key as an authorized key (or append to
the list of authorized keys if you have others defined).
[hadoop-user@target]$ mkdir ~/.ssh
[hadoop-user@target]$ chmod 700 ~/.ssh
[hadoop-user@target]$ mv ~/master_key ~/.ssh/authorized_keys
[hadoop-user@target]$ chmod 600 ~/.ssh/authorized_keys
After generating the key, you can verify it’s correctly defined by attempting to log in to
the target node from the master:

[hadoop-user@master]$ ssh target


The authenticity of host 'target (xxx.xxx.xxx.xxx)' can’t be established.
RSA key fingerprint is 72:31:d8:1b:11:36:43:52:56:11:77:a4:ec:82:03:1d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'target' (RSA) to the list of known hosts.
Last login: Sun Jan 4 15:32:22 2009 from master
After confirming the authenticity of a target node to the master node, you won’t be prompted
upon subsequent login attempts.
[hadoop-user@master]$ ssh target
Last login: Sun Jan 4 15:32:49 2009 from master

Running Hadoop:
We need to configure a few things before running Hadoop. Let's take a closer look at
the Hadoop configuration directory:

[hadoop-user@master]$ cd $HADOOP_HOME
[hadoop-user@master]$ ls -l conf/
total 100
-rw-rw-r-- 1 hadoop-user hadoop 2065 Dec 1 10:07 capacity-scheduler.xml
-rw-rw-r-- 1 hadoop-user hadoop 535 Dec 1 10:07 configuration.xsl
-rw-rw-r-- 1 hadoop-user hadoop 49456 Dec 1 10:07 hadoop-default.xml
-rwxrwxr-x 1 hadoop-user hadoop 2314 Jan 8 17:01 hadoop-env.sh
-rw-rw-r-- 1 hadoop-user hadoop 2234 Jan 2 15:29 hadoop-site.xml
-rw-rw-r-- 1 hadoop-user hadoop 2815 Dec 1 10:07 log4j.properties
-rw-rw-r-- 1 hadoop-user hadoop 28 Jan 2 15:29 masters
-rw-rw-r-- 1 hadoop-user hadoop 84 Jan 2 15:29 slaves
-rw-rw-r-- 1 hadoop-user hadoop 401 Dec 1 10:07 sslinfo.xml.example

The first thing you need to do is specify the location of Java on all the nodes, including the
master. In hadoop-env.sh define the JAVA_HOME environment variable to point
to the Java installation directory. On our servers we have it defined as
export JAVA_HOME=/usr/share/jdk

The hadoop-env.sh file contains other variables for defining your Hadoop
environment, but JAVA_HOME is the only one requiring initial modification. The default
settings on the other variables will probably work fine. As you become more familiar
with Hadoop you can later modify this file to suit your individual needs (logging
directory location, Java class path, and so on).

The majority of Hadoop settings are contained in XML configuration files. Before
version 0.20, these XML files are hadoop-default.xml and hadoop-site.xml. As the
names imply, hadoop-default.xml contains the default Hadoop settings to be used
unless they are explicitly overridden in hadoop-site.xml. In practice you only deal with
hadoop-site.xml. In version 0.20 this file has been separated out into three XML files:
core-site.xml, hdfs-site.xml, and mapred-site.xml. This refactoring better aligns the
configuration settings with the subsystem of Hadoop that they control. In the rest of this
unit we'll generally point out which of the three files is used to adjust a configuration
setting. If you use an earlier version of Hadoop, keep in mind that all such configuration
settings are modified in hadoop-site.xml.

In the following subsections we’ll provide further details about the different
operational modes of Hadoop and example configuration files for each.

1. Local (standalone) mode:


The standalone mode is the default mode for Hadoop. When you first uncompress the
Hadoop source package, it's ignorant of your hardware setup. Hadoop chooses to be
conservative and assumes a minimal configuration. All three XML files (or
hadoop-site.xml before version 0.20) are empty under this default mode:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>

With empty configuration files, Hadoop will run completely on the local machine.
Because there’s no need to communicate with other nodes, the standalone mode
doesn’t use HDFS, nor will it launch any of the Hadoop daemons. Its primary
use is for developing and debugging the application logic of a MapReduce program
without the additional complexity of interacting with the daemons.

2. Pseudo-distributed mode:

The pseudo-distributed mode is running Hadoop in a "cluster of one" with all
daemons running on a single machine. This mode complements the standalone mode
for debugging your code, allowing you to examine memory usage, HDFS input/output
issues, and other daemon interactions. Listing 2.1 below provides simple XML files
to configure a single server in this mode.
Listing 2.1 Example of the three configuration files for pseudo-distributed mode
core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

In core-site.xml and mapred-site.xml we specify the hostname and port of the
NameNode and the JobTracker, respectively. In hdfs-site.xml we specify the default
replication factor for HDFS, which should only be one because we're running on only
one node. We must also specify the location of the Secondary NameNode in the masters
file and the slave nodes in the slaves file:


[hadoop-user@master]$ cat masters
localhost
[hadoop-user@master]$ cat slaves
localhost
While all the daemons are running on the same machine, they still communicate
with each other using the same SSH protocol as if they were distributed over a cluster.
The earlier section on setting up SSH has a more detailed discussion of the SSH channels, but
for single-node operation simply check to see whether your machine already allows you to ssh
back to itself:
[hadoop-user@master]$ ssh localhost
If it does, then you're good. Otherwise setting it up takes two lines:
[hadoop-user@master]$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
[hadoop-user@master]$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
You are almost ready to start Hadoop. But first you'll need to format your HDFS by
using the command
[hadoop-user@master]$ bin/hadoop namenode -format
We can now launch the daemons by use of the start-all.sh script. The Java jps
command will list all daemons to verify the setup was successful.
[hadoop-user@master]$ bin/start-all.sh
[hadoop-user@master]$ jps
26893 Jps
26832 TaskTracker
26620 SecondaryNameNode
26333 NameNode
26484 DataNode
26703 JobTracker
When you’ve finished with Hadoop you can shut down the Hadoop daemons by
the command.
[hadoop-user@master]$ bin/stop-all.sh
Both standalone and pseudo-distributed modes are for development and debugging
purposes. An actual Hadoop cluster runs in the third mode, the fully distributed mode.

3. Fully distributed mode:


After continually emphasizing the benefits of distributed storage and distributed
computation, it’s time for us to set up a full cluster. In the discussion below we’ll use
the following server names:
■ master—The master node of the cluster and host of the NameNode and JobTracker
daemons
■ backup—The server that hosts the Secondary NameNode daemon
■ hadoop1, hadoop2, hadoop3, ...—The slave boxes of the cluster running both
DataNode and TaskTracker daemons

Using the preceding naming convention, listing 2.2 below is a modified
version of the pseudo-distributed configuration files in listing 2.1
that can be used as a skeleton for your cluster's setup.
Listing 2.2 Example configuration files for fully distributed mode
core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
</property>
</configuration>
hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>

The key differences are:

■ We explicitly stated the hostname and port of the NameNode (in core-site.xml) and the
JobTracker (in mapred-site.xml).
■ We increased the HDFS replication factor to 3 to take advantage of distributed storage.
Recall that data is replicated across HDFS to increase availability and reliability.

We also need to update the masters and slaves files to reflect the locations of the other
daemons.


[hadoop-user@master]$ cat masters
backup
[hadoop-user@master]$ cat slaves
hadoop1
hadoop2
hadoop3
...
Once you have copied these files across all the nodes in your cluster, be sure to format
HDFS to prepare it for storage:
[hadoop-user@master]$ bin/hadoop namenode -format
Now you can start the Hadoop daemons:
[hadoop-user@master]$ bin/start-all.sh
and verify the nodes are running their assigned jobs.
[hadoop-user@master]$ jps
30879 JobTracker
30717 NameNode
30965 Jps
[hadoop-user@backup]$ jps
2099 Jps
1679 SecondaryNameNode
[hadoop-user@hadoop1]$ jps
7101 TaskTracker
7617 Jps
6988 DataNode

Web-based cluster UI: We can now introduce the web interfaces that Hadoop provides to
monitor the health of your cluster. The browser interface allows you to access the information you
desire much faster than digging through logs and directories.

The NameNode hosts a general report on port 50070. It gives you an overview of
the state of your cluster’s HDFS. Figure 2.4 displays this report for a 2-node cluster
example. From this interface, you can browse through the filesystem, check the
status of each DataNode in your cluster, and peruse the Hadoop daemon logs to verify
your cluster is functioning correctly.
Hadoop provides a similar status overview of ongoing MapReduce jobs. Figure 2.5
depicts one hosted at port 50030 of the JobTracker.

Again, a wealth of information is available through this reporting interface. You
can access the status of ongoing MapReduce tasks as well as detailed reports about
completed jobs. The latter is of particular importance: these logs describe which
nodes performed which tasks and the time and resources required to complete each task.
Finally, the Hadoop configuration for each job is also available, as shown in figure 2.6.
With all of this information you can streamline your MapReduce programs to better
utilize the resources of your cluster.
