Hadoop Notes Unit2
UNIT-II
Big Data: Big Data is a collection of large datasets that cannot be processed using
traditional computing techniques. It is not a single technique or a tool, rather it
involves many areas of business and technology.
Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it is of three types: structured data (e.g., relational data), semi-structured data (e.g., XML data), and unstructured data (e.g., Word or PDF documents, text, media, logs).
5. Affordable storage and computing with minimal manpower via clouds, made possible by advances in networking.
6. Better understanding of task distribution (MapReduce) and computing architecture (Hadoop).
7. Advanced analytical techniques (machine learning).
8. Managed Big Data platforms: cloud service providers such as Amazon Web Services provide Elastic MapReduce, the Simple Storage Service (S3) and HBase, a column-oriented database; Google offers BigQuery and the Prediction API.
9. Open-source software: OpenStack, PostgreSQL.
10. March 12, 2012: Obama announced $200M for Big Data research, distributed via NSF, NIH, DOE, DoD, DARPA, and USGS (Geological Survey).
Challenges with Big Data:
1. Big Data is big: For a company that is used to dealing with data in gigabytes, 10 TB of data would be BIG. However, for companies like Facebook and Yahoo, petabytes are big. Just the size of Big Data makes it impossible (or at least cost prohibitive) to store in traditional storage such as databases or conventional filers, and doing so with traditional storage filers can cost a lot of money.
2. Big Data is unstructured or semi-structured: A lot of Big Data is unstructured. For example, clickstream log data might look like: timestamp, user_id, page, referrer_page. This lack of structure makes relational databases not well suited to storing Big Data. Plus, not many databases can cope with storing billions of rows of data.
3. No point in just storing Big Data if we can't process it: Storing Big Data is only part of the game. We have to process it to mine intelligence out of it. Traditional storage systems are fairly 'dumb' in the sense that they just store bits; they don't offer any processing power. The traditional data processing model has data stored in a 'storage cluster', which is copied over to a 'compute cluster' for processing, and the results are written back to the storage cluster.
This model, however, doesn't quite work for Big Data because copying so much data out to a compute cluster might be too time consuming or impossible. So what is the answer?
One solution is to process Big Data 'in place' -- as in a storage cluster doubling as a compute
cluster.
What is Hadoop?
Hadoop is an open source framework for writing and running distributed applications that
process large amounts of data. Distributed computing is a wide and varied field, but the key
distinctions of Hadoop are that it is
Accessible—Hadoop runs on large clusters of commodity machines or on cloud
computing services such as Amazon's Elastic Compute Cloud (EC2).
Robust—Because it is intended to run on commodity hardware, Hadoop is architected
with the assumption of frequent hardware malfunctions. It can gracefully handle most
such failures.
Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the
cluster.
Simple—Hadoop allows users to quickly write efficient parallel code.
3. FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL):
MapReduce allows you to process data in a more general fashion than SQL queries. For example, you can build complex statistical models from your data or reformat your image data. SQL is not well designed for such tasks. On the other hand, when working with data that do fit well into relational structures, some people may find MapReduce less natural to use, while those used to procedural programming may find MapReduce programming more intuitive. Note that many extensions are available to allow you to take advantage of the scalability of Hadoop while programming in more familiar paradigms. In fact, some enable you to write queries in a SQL-like language, and your query is automatically compiled into MapReduce code for execution.
4. OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS:
Hadoop is designed for offline processing and analysis of large-scale data. It doesn’t
work for random reading and writing of a few records, which is the type of load for
online transaction processing. In fact, as of this writing (and in the foreseeable future),
Hadoop is best used as a write-once, read-many-times type of data store. In this aspect
it’s similar to data warehouses in the SQL world. You have seen how Hadoop relates to
distributed systems and SQL databases at a high level. Let’s learn how to program in it.
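For a first taste of what programming in Hadoop looks like in practice, the sketch below runs the word-count example that ships with the Hadoop distribution. This is only a hedged sketch: the jar name hadoop-0.20.2-examples.jar and the directory names input and output are assumptions that depend on your Hadoop version and setup.
[hadoop-user@master]$ hadoop fs -put local_docs input
[hadoop-user@master]$ hadoop jar hadoop-0.20.2-examples.jar wordcount input output
[hadoop-user@master]$ hadoop fs -cat output/part-*
The first command copies a local directory of text files into HDFS, the second submits the bundled word-count MapReduce job to the JobTracker, and the third prints the resulting counts. This is exactly the write-once, read-many, batch-oriented style of processing described above.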
Google File System: The Google File System (GFS) is a scalable distributed file system for large
distributed data-intensive applications. It provides fault tolerance while running on inexpensive
commodity hardware, and it delivers high aggregate performance to a large number of clients.
GFS provides a familiar file system interface, though it does not implement a standard API such
as POSIX (Portable Operating System Interface for Unix). Files are organized hierarchically in
directories and identified by path names. The usual operations to create, delete, open, close, read,
and write files are supported. Moreover, GFS has snapshot and record append operations. Snapshot
creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to
append data to the same file concurrently while guaranteeing the atomicity of each individual
client's append.
Architecture: A GFS cluster consists of a single master and multiple chunkservers and is accessed
by multiple clients, as shown in the figure.
Files are divided into fixed-size chunks. Each chunk is identified by an immutable and
globally unique 64 bit chunk handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by
a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers.
By default, we store three replicas, though users can designate different replication levels for
different regions of the file namespace. The master maintains all file system metadata. This
includes the namespace, access control information, the mapping from files to chunks, and the
current locations of chunks. It also controls system-wide activities such as chunk lease
management, garbage collection of orphaned chunks, and chunk migration between
chunkservers. The master periodically communicates with each chunkserver in HeartBeat
messages to give it instructions and collect its state.
GFS client code linked into each application implements the file system API (Application
Program Interface) and communicates with the master and chunkservers to read or write data on
behalf of the application. Clients interact with the master for metadata operations, but all data-
bearing communication goes directly to the chunkservers. We do not provide the POSIX API
and therefore need not hook into the Linux vnode layer. Neither the client nor the chunkserver
caches file data. Client caches offer little benefit because most applications stream through huge
files or have working sets too large to be cached. Not having them simplifies the client and the
overall system by eliminating cache coherence issues. (Clients do cache metadata, however.)
Chunkservers need not cache file data because chunks are stored as local files and so Linux’s
buffer cache already keeps frequently accessed data in memory.
1. Single Master: Having a single master vastly simplifies the design and enables the
master to make sophisticated chunk placement and replication decisions using global
knowledge. However, its involvement in reads and writes must be minimized so that it
does not become a bottleneck. Clients never read and write file data through the
master. Instead, a client asks the master which chunkservers it should contact.
2. Chunk Size : Chunk size is one of the key design parameters. We have chosen 64 MB,
which is much larger than typical file system block sizes. Each chunk replica is stored as
a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation
avoids wasting space due to internal fragmentation, perhaps the greatest objection against
such a large chunk size. A large chunk size offers several important advantages. First, it
reduces clients’ need to interact with the master because reads and writes on the same
chunk require only one initial request to the master for chunk location information.
The reduction is especially significant for our work loads because applications mostly
read and write large files sequentially. Even for small random reads, the client can
comfortably cache all the chunk location information for a multi-TB working set. Second,
since on a large chunk, a client is more likely to perform many operations on a given
chunk, it can reduce network overhead by keeping a persistent TCP connection to the
chunkserver over an extended period of time. Third, it reduces the size of the metadata
stored on the master. This allows us to keep the metadata in memory, which in turn
brings other advantages. On the other hand, a large chunk size, even with lazy space
allocation, has its disadvantages. A small file consists of a small number of chunks,
perhaps just one. The chunkservers storing those chunks may become hot spots if many
clients are accessing the same file. In practice, hot spots have not been a major issue
because our applications mostly read large multi-chunk files sequentially. However, hot
spots did develop when GFS was first used by a batch-queue system: an executable was
written to GFS as a single-chunk file and then started on hundreds of machines at the
same time.
3. Metadata: The master stores three major types of metadata: the file and chunk
namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas.
All metadata is kept in the master’s memory. The first two types (namespaces and file-to
chunk mapping) are also kept persistent by logging mutations to an operation log stored
on the master’s local disk and replicated on remote machines. Using a log allows us to
update the master state simply, reliably, and without risking inconsistencies in the event
of a master crash. The master does not store chunk location information persistently.
Instead, it asks each chunkserver about its chunks at master startup and whenever a
chunkserver joins the cluster.
i. In-Memory Data Structures: Since metadata is stored in memory, master
operations are fast. Furthermore, it is easy and efficient for the master to
periodically scan through its entire state in the background. This periodic
scanning is used for chunk garbage collection, re-replication in the presence of
chunkserver failures, and chunk migration to balance load and disk-space usage
across chunkservers.
ii. Operation Log: The operation log contains a historical record of critical metadata
changes. It is central to GFS. Not only is it the only persistent record of metadata,
but it also serves as a logical timeline that defines the order of concurrent operations.
Chunk size is one of the key design parameters. In GFS it is 64 MB, which is much
larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a
chunkserver and is extended only as needed.
Advantages:
1. It reduces clients’ need to interact with the master because reads and writes on the same
chunk require only one initial request to the master for chunk location information.
2. Since on a large chunk, a client is more likely to perform many operations on a given
chunk, it can reduce network overhead by keeping a persistent TCP connection to the
chunk server over an extended period of time.
3. It reduces the size of the metadata stored on the master. This allows us to keep the
metadata in memory, which in turn brings other advantages.
Disadvantages:
1. Even with lazy space allocation (which avoids wasting space due to internal fragmentation),
a small file consists of a small number of chunks, perhaps just one. The chunkservers storing
those chunks may become hot spots if many clients are accessing the same file. In practice,
hot spots have not been a major issue because the applications mostly read large multi-chunk
files sequentially. Hot spots can be mitigated by storing such files with a higher replication
factor and by allowing clients to read data from other clients.
History of Hadoop: It is a known fact that Hadoop has been specially created to manage Big
Data. Here we are going to learn about the brief history of Hadoop. Everybody in the world
knows about Google; it is probably the most popular search engine in the online world. To
provide search results for users Google had to store huge amounts of data. In the 1990s, Google
started searching for ways to store and process huge amounts of data. And finally in 2003 they
provided the world with an innovative Big Data storage idea called GFS or Google File System;
it is a technique to store data especially huge amount of data. In the year 2004 they provided the
world with another technique called MapReduce, which is the technique for processing the data
that is present in GFS. And it can be observed that it took Google 13 years to come up with this
innovative idea of storing and processing Big Data and fine tuning the idea.
But these techniques have been presented to the world just as a description through white
papers. So the world and interested people have just been provided with the idea of what GFS is
and how it would store data theoretically and what MapReduce is and how it would process the
data stored in GFS theoretically. So people had the knowledge of the technique, which was just
its description, but there was no working model or code provided. Then, in 2006-07,
another major search engine, Yahoo, came up with working implementations called HDFS and
MapReduce based on the white papers published by Google. So finally, HDFS and MapReduce
are the two core concepts that make up Hadoop.
Hadoop was actually created by Doug Cutting. People who have some knowledge of
Hadoop know that its logo is a yellow elephant, so many people wonder why Doug Cutting
chose such a name and such a logo for his project. There is a reason
behind it; the elephant is symbolic in the sense that it is a good solution for Big Data. Actually
Hadoop was the name that came from the imagination of Doug Cutting’s son; it was the name
that the little boy gave to his favorite soft toy which was a yellow elephant and this is where the
name and the logo for the project have come from. Thus, this is the brief history behind Hadoop
and its name.
Naming Conventions:
Doug Cutting drew inspiration from his family
– Lucene: Doug’s wife’s middle name
– Nutch: A word for "meal" that his son used as a toddler
– Hadoop: Yellow stuffed elephant named by his son
HADOOP:
Hadoop is made up of 2 parts:
HDFS (Hadoop Distributed File System): HDFS is a file system written in Java that resides in
user space, unlike traditional file systems such as FAT, NTFS and ext2, which reside in kernel
space. HDFS was primarily written to store large amounts of data (terabytes and petabytes) and
was built in line with Google's paper on GFS.
MapReduce: MapReduce is the programming model, with Java as the usual programming
language, for processing the data in files stored in HDFS. All data in HDFS is stored as files.
MapReduce, too, was built in line with another paper by Google. Google, apart from the papers,
did not release its implementations of GFS and MapReduce. However, the open-source
community built Hadoop (HDFS and MapReduce) based on those papers. The initial adoption of
Hadoop was at Yahoo Inc., where it gained good momentum and went on to be a part of their
production systems. After Yahoo, many organizations such as LinkedIn, Facebook and Netflix
have successfully implemented Hadoop within their organizations.
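Because all data in HDFS is stored as files, the everyday way of interacting with it is the hadoop fs command-line shell. The transcript below is a minimal sketch; the directory /user/hadoop-user/logs and the file weblog.txt are made-up names used only for illustration.
[hadoop-user@master]$ hadoop fs -mkdir /user/hadoop-user/logs
[hadoop-user@master]$ hadoop fs -put weblog.txt /user/hadoop-user/logs
[hadoop-user@master]$ hadoop fs -ls /user/hadoop-user/logs
[hadoop-user@master]$ hadoop fs -cat /user/hadoop-user/logs/weblog.txt
Behind the scenes the file is split into blocks, replicated across DataNodes and tracked by the NameNode, as explained in the HDFS section that follows.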
HDFS
The accompanying diagram depicts a 6-node Hadoop cluster. In the diagram, the NameNode,
Secondary NameNode and the JobTracker run on a single machine. Usually, in production
clusters having more than 20-30 nodes, the daemons run on separate nodes.
Hadoop follows a master-slave architecture. As mentioned earlier, a file in HDFS is split into
blocks and replicated across DataNodes in a Hadoop cluster. In the diagram, the three files A, B
and C have been split with a replication factor of 3 across the different DataNodes. Now
let us go through each node and daemon:
1. NameNode: The NameNode in Hadoop is the node where Hadoop stores all the location
information of the files in HDFS. In other words, it holds the metadata for HDFS.
Whenever a file is placed in the cluster, a corresponding entry of its location is maintained
by the NameNode. So, for the files A, B and C we would have something like the following
in the NameNode (the fsck sketch just after this list shows how to view this mapping for a real file):
File A – DataNode1, DataNode2, DataNode4
File B – DataNode1, DataNode3, DataNode4
File C – DataNode2, DataNode3, DataNode4
This information is required when retrieving data from the cluster as the data is
spread across multiple machines. The NameNode is a Single Point of Failure for the
Hadoop Cluster.
2. Secondary NameNode: IMPORTANT – The Secondary NameNode is not a failover
node for the NameNode. It is responsible for performing periodic housekeeping
functions for the NameNode; it only creates checkpoints of the filesystem metadata
present in the NameNode.
3. DataNode: The DataNode is responsible for storing the files in HDFS. It manages the
file blocks within the node, sends information to the NameNode about the files and
blocks stored in that node, and responds to the NameNode for all filesystem operations
(a cluster-wide view of the DataNodes is given by the dfsadmin sketch after this list).
4. JobTracker: JobTracker is responsible for taking in requests from a client and assigning
TaskTrackers with tasks to be performed. The JobTracker tries to assign tasks to the
TaskTracker on the DataNode where the data is locally present (Data Locality). If that is
not possible it will at least try to assign tasks to TaskTrackers within the same rack. If for
some reason the node fails the JobTracker assigns the task to another TaskTracker where
the replica of the data exists since the data blocks are replicated across the DataNodes.
This ensures that the job does not fail even if a node fails within the cluster.
5. TaskTracker: TaskTracker is a daemon that accepts tasks (Map, Reduce and Shuffle)
from the JobTracker. The TaskTracker keeps sending a heartbeat message to
the JobTracker to notify it that it is alive. Along with the heartbeat it also reports the free
slots available within it to process tasks. The TaskTracker starts and monitors the Map and
Reduce tasks and sends progress/status information back to the JobTracker.
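The two commands below show how the NameNode metadata and the DataNode reports described in items 1 and 3 above can be inspected from the command line. This is a hedged sketch: the file path is the hypothetical /user/hadoop-user/logs/weblog.txt used earlier, and the report output (omitted here) varies by Hadoop version.
[hadoop-user@master]$ hadoop fsck /user/hadoop-user/logs/weblog.txt -files -blocks -locations
[hadoop-user@master]$ hadoop dfsadmin -report
The fsck report lists every block of the file, its replication factor and the DataNodes holding each replica, which is exactly the File A/B/C style mapping shown in the NameNode item. The dfsadmin report prints the overall HDFS capacity followed by one section per live DataNode, reflecting the block reports and heartbeats each DataNode sends to the NameNode.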
Fig 1: JobTracker and TaskTracker interaction. After a client calls the JobTracker to
begin a data processing job, the JobTracker partitions the work and assigns different map
and reduce tasks to each TaskTracker in the cluster.
Fig 2: Topology of a typical Hadoop cluster . It’s a master/slave architecture in which the
NameNode and JobTracker are masters and the DataNodes and TaskTrackers are slaves.
All the above daemons run in their own JVMs. A typical (simplified) flow in Hadoop is as
follows: a client submits a job to the JobTracker; the JobTracker splits it into map and reduce
tasks and assigns them to TaskTrackers, preferring nodes where the input blocks already reside;
the tasks read their input from, and write their output to, HDFS through the DataNodes, with the
NameNode supplying the block locations. A quick way to confirm which daemons are running
on a given node is shown in the sketch below.
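The jps tool that ships with the JDK lists the Java processes running on a machine, and since each Hadoop daemon runs in its own JVM, it shows which daemons are up. The process IDs below are illustrative, and the exact set of daemons you see depends on the node.
[hadoop-user@master]$ jps
2873 NameNode
2995 SecondaryNameNode
3061 JobTracker
3134 Jps
On a slave node you would instead expect to see DataNode and TaskTracker entries.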
Setting up SSH for a Hadoop cluster:
1. Define a common account: The master node communicates with the slave nodes over SSH
using a common user account (for example, hadoop-user) that exists on every node.
2. Verify SSH installation: The next step is to check whether SSH is installed on your
nodes. We can easily do this by use of the "which" UNIX command:
[hadoop-user@master]$ which ssh
/usr/bin/ssh
[hadoop-user@master]$ which sshd
/usr/bin/sshd
[hadoop-user@master]$ which ssh-keygen
/usr/bin/ssh-keygen
If you instead receive an error message such as this,
/usr/bin/which: no ssh in (/usr/bin:/bin:/usr/sbin...
you will need to install SSH (for example, OpenSSH) through a Linux package manager.
3. Generate SSH key pair: Next, on the master node, generate an RSA key pair for the
hadoop-user account, as in the sketch below; the resulting public key (~/.ssh/id_rsa.pub) is
what we next need to distribute across your cluster.
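A minimal sketch of the key-generation step; the file locations are the ssh-keygen defaults, and the passphrase is left empty on purpose so that the Hadoop scripts can log in to the slaves without prompting:
[hadoop-user@master]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop-user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
The private key stays in ~/.ssh/id_rsa on the master; the public key ~/.ssh/id_rsa.pub is the one distributed in the next step.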
4. Distribute public key and validate logins: Albeit a bit tedious, you’ll next need to copy
the public key to every slave node as well as the master node:
[hadoop-user@master]$ scp ~/.ssh/id_rsa.pub hadoop-user@target:~/master_key
Manually log in to the target node and set the master key as an authorized key (or append to
the list of authorized keys if you have others defined).
[hadoop-user@target]$ mkdir ~/.ssh
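The remaining commands for this step are a sketch that assumes ~/master_key was copied over as shown above; they install the master's public key as an authorized key, tighten the permissions that sshd requires, and then validate the passwordless login from the master:
[hadoop-user@target]$ chmod 700 ~/.ssh
[hadoop-user@target]$ cat ~/master_key >> ~/.ssh/authorized_keys
[hadoop-user@target]$ chmod 600 ~/.ssh/authorized_keys
[hadoop-user@master]$ ssh target
[hadoop-user@target]$ exit
If ssh target logs you in without asking for a password, the key distribution worked; repeat these steps for every node in the cluster.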
Running Hadoop:
We need to configure a few things before running Hadoop. Let’s take a closer look at
the Hadoop configuration directory:
[hadoop-user@master]$ cd $HADOOP_HOME
[hadoop-user@master]$ ls -l conf/
total 100
-rw-rw-r-- 1 hadoop-user hadoop 2065 Dec 1 10:07 capacity-scheduler.xml
-rw-rw-r-- 1 hadoop-user hadoop 535 Dec 1 10:07 configuration.xsl
-rw-rw-r-- 1 hadoop-user hadoop 49456 Dec 1 10:07 hadoop-default.xml
-rwxrwxr-x 1 hadoop-user hadoop 2314 Jan 8 17:01 hadoop-env.sh
-rw-rw-r-- 1 hadoop-user hadoop 2234 Jan 2 15:29 hadoop-site.xml
-rw-rw-r-- 1 hadoop-user hadoop 2815 Dec 1 10:07 log4j.properties
-rw-rw-r-- 1 hadoop-user hadoop 28 Jan 2 15:29 masters
-rw-rw-r-- 1 hadoop-user hadoop 84 Jan 2 15:29 slaves
-rw-rw-r-- 1 hadoop-user hadoop 401 Dec 1 10:07 sslinfo.xml.example
The first thing you need to do is to specify the location of Java on all the nodes including the
master. In hadoop-env.sh define the JAVA_HOME environment variable to point
to the Java installation directory. On our servers we have it defined as
export JAVA_HOME=/usr/share/jdk
The hadoop-env.sh file contains other variables for defining your Hadoop
environment, but JAVA_HOME is the only one requiring initial modification. The default
settings on the other variables will probably work fine. As you become more familiar
with Hadoop you can later modify this file to suit your individual needs (logging
directory location, Java class path, and so on).
The majority of Hadoop settings are contained in XML configuration files. Before
version 0.20, these XML files are hadoop-default.xml and hadoop-site.xml. As the
names imply, hadoop-default.xml contains the default Hadoop settings to be used
unless they are explicitly overridden in hadoop-site.xml. In practice you only deal with
hadoop-site.xml. In version 0.20 this file has been separated out into three XML files:
core-site.xml, hdfs-site.xml, and mapred-site.xml . This refactoring better aligns the
configuration settings to the subsystem of Hadoop that they control. In the rest of this
chapter we’ll generally point out which of the three files used to adjust a configuration
setting. If you use an earlier version of Hadoop, keep in mind that all such configuration
settings are modified in hadoop-site.xml.
In the following subsections we’ll provide further details about the different
operational modes of Hadoop and example configuration files for each.
1. Local (standalone) mode: The standalone mode is the default mode for Hadoop; the site
configuration files are left essentially empty:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>
With empty configuration files, Hadoop will run completely on the local machine.
Because there’s no need to communicate with other nodes, the standalone mode
doesn’t use HDFS, nor will it launch any of the Hadoop daemons. Its primary
use is for developing and debugging the application logic of a MapReduce program
without the additional complexity of interacting with the daemons.
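In standalone mode a MapReduce job can be run directly against local files, with no daemons and no HDFS involved. The sketch below uses the examples jar bundled with Hadoop; the jar name matches a 0.20.x release and may differ in yours, and input and output are ordinary local directories:
[hadoop-user@master]$ mkdir input
[hadoop-user@master]$ cp conf/*.xml input
[hadoop-user@master]$ bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'
[hadoop-user@master]$ cat output/*
Because everything runs in a single JVM on the local filesystem, this is the most convenient way to debug the application logic of a MapReduce program.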
2. Pseudo-distributed mode:
In pseudo-distributed mode all of the Hadoop daemons run on a single machine, each in its own
JVM, simulating a cluster on one host; the configuration files point both the NameNode
(fs.default.name) and the JobTracker (mapred.job.tracker) at localhost and set dfs.replication
to 1. It is mainly used for developing and testing against the full daemon stack.
3. Fully distributed mode: Using the preceding naming convention, the XML files listed below
are a modified version of the pseudo-distributed configuration files and can be used as a skeleton
for your cluster's setup.
Listing 2.2 Example configuration files for fully distributed mode
core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
</property>
</configuration>
hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
We also need to update the masters and slaves files to reflect the locations of the other
daemons; a sketch follows below.
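A hedged sketch of what this looks like, assuming a cluster with hosts named backup, hadoop1, hadoop2 and hadoop3 (the hostnames are made up): the masters file names the node that will run the Secondary NameNode, the slaves file lists the nodes that run the DataNode and TaskTracker daemons, and HDFS is then formatted once and the daemons started from the master node.
[hadoop-user@master]$ cat conf/masters
backup
[hadoop-user@master]$ cat conf/slaves
hadoop1
hadoop2
hadoop3
[hadoop-user@master]$ bin/hadoop namenode -format
[hadoop-user@master]$ bin/start-all.sh
Somewhat confusingly, the masters file only controls where the Secondary NameNode runs; the NameNode and JobTracker themselves run on whichever machine the start scripts are invoked from, in combination with the addresses given in core-site.xml and mapred-site.xml.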
Web-based cluster UI: We can now introduce the web interfaces that Hadoop provides to
monitor the health of your cluster. The browser interface allows you to access the information
you need much faster than digging through logs and directories.
The NameNode hosts a general report on port 50070. It gives you an overview of
the state of your cluster’s HDFS. Figure 2.4 displays this report for a 2-node cluster
example. From this interface, you can browse through the filesystem, check the
status of each DataNode in your cluster, and peruse the Hadoop daemon logs to verify
your cluster is functioning correctly.
Hadoop provides a similar status overview of ongoing MapReduce jobs. Figure 2.5
depicts one hosted at port 50030 of the JobTracker.