BIG DATA & Hadoop Tutorial

The document discusses big data and Hadoop. It defines big data as extremely large data sets that cannot be processed by traditional data management tools. It provides examples of big data sources like social media and jet engines. Big data has characteristics of volume, variety, velocity and variability. Hadoop is an open source software framework for distributed storage and processing of large data sets across clusters of computers. The core components of Hadoop are HDFS for storage and MapReduce for processing. Related projects include Hive, HBase, Mahout and others.

Uploaded by

saif salah

What is BIG DATA? Introduction, Types, Characteristics & Example

In order to understand 'Big Data', you first need to know what data is.

What is Data?
The quantities, characters, or symbols on which operations are performed by a
computer, which may be stored and transmitted in the form of electrical signals
and recorded on magnetic, optical, or mechanical recording media.

What is Big Data?


Big Data is also data, but of a huge size. Big Data is a term used to describe a
collection of data that is huge in volume and yet growing exponentially with time.
In short, such data is so large and complex that none of the traditional data
management tools are able to store it or process it efficiently.

In this tutorial, you will learn:

• Examples Of Big Data
• Types Of Big Data
• Characteristics Of Big Data
• Advantages Of Big Data Processing
Examples Of Big Data


Following are some examples of Big Data:

The New York Stock Exchange generates about one terabyte of new trade data
per day.

Social Media

Statistics show that 500+ terabytes of new data get ingested into the databases
of the social media site Facebook every day. This data is mainly generated
through photo and video uploads, message exchanges, comments, etc.

A single jet engine can generate 10+ terabytes of data in 30 minutes of flight
time. With many thousand flights per day, data generation reaches many
petabytes.

Types Of Big Data
Big Data can be found in three forms:

1. Structured
2. Unstructured
3. Semi-structured

Structured

Any data that can be stored, accessed and processed in a fixed format is termed
'structured' data. Over time, computer science has achieved great success in
developing techniques for working with this kind of data (where the format is
well known in advance) and in deriving value out of it. However, nowadays we are
foreseeing issues when the size of such data grows to a huge extent, with
typical sizes in the range of multiple zettabytes.

Do you know? 10^21 bytes equal 1 zettabyte; that is, one billion terabytes form
a zettabyte.

Looking at these figures one can easily understand why the name Big Data is given
and imagine the challenges involved in its storage and processing.
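
The unit arithmetic can be checked with a few lines of Java. This is only a
sketch: the class name ByteUnits and its constants are made up for this
example, and decimal (SI) units are assumed, where each step is a factor of
1000. BigInteger is used because 10^21 does not fit in a 64-bit long.

```java
import java.math.BigInteger;

class ByteUnits {
    // Decimal (SI) units, assumed for this example.
    static final BigInteger TERABYTE = BigInteger.TEN.pow(12);
    static final BigInteger ZETTABYTE = BigInteger.TEN.pow(21);

    // How many terabytes make up one zettabyte.
    static BigInteger terabytesPerZettabyte() {
        return ZETTABYTE.divide(TERABYTE);
    }

    public static void main(String[] args) {
        // Prints 1000000000, i.e. one billion terabytes.
        System.out.println(terabytesPerZettabyte());
    }
}
```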

Do you know? Data stored in a relational database management system is one
example of 'structured' data.

Examples Of Structured Data

An 'Employee' table in a database is an example of structured data:

Employee_ID  Employee_Name    Gender  Department  Salary_In_lacs
2365         Rajesh Kulkarni  Male    Finance     650000
3398         Pratibha Joshi   Female  Admin       650000
7465         Shushil Roy      Male    Admin       500000
7500         Shubhojit Das    Male    Finance     500000
7699         Priya Sane       Female  Finance     550000

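
Because every row has the same fixed columns, such data is easy to process in
code. The following sketch models the table rows as a small Java class and
totals the salaries per department. The names EmployeeTable, Employee and
totalSalaryByDepartment are hypothetical and chosen only for this illustration;
the salary figures are used exactly as they appear in the table.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EmployeeTable {
    // One table row: every row has the same fixed, typed set of columns.
    static final class Employee {
        final int id;
        final String name;
        final String gender;
        final String department;
        final long salary;

        Employee(int id, String name, String gender, String department, long salary) {
            this.id = id;
            this.name = name;
            this.gender = gender;
            this.department = department;
            this.salary = salary;
        }
    }

    // The five rows of the 'Employee' table.
    static final List<Employee> ROWS = List.of(
            new Employee(2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
            new Employee(3398, "Pratibha Joshi", "Female", "Admin", 650000),
            new Employee(7465, "Shushil Roy", "Male", "Admin", 500000),
            new Employee(7500, "Shubhojit Das", "Male", "Finance", 500000),
            new Employee(7699, "Priya Sane", "Female", "Finance", 550000));

    // Sum the salary column, grouped by the department column.
    static Map<String, Long> totalSalaryByDepartment(List<Employee> rows) {
        Map<String, Long> totals = new TreeMap<>();
        for (Employee e : rows) {
            totals.merge(e.department, e.salary, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Prints {Admin=1150000, Finance=1700000}
        System.out.println(totalSalaryByDepartment(ROWS));
    }
}
```
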
Unstructured
Any data with an unknown form or structure is classified as unstructured data.
In addition to its huge size, unstructured data poses multiple challenges in
terms of processing it to derive value. A typical example of unstructured data
is a heterogeneous data source containing a combination of simple text files,
images, videos, etc. Nowadays, organizations have a wealth of data available to
them, but unfortunately they don't know how to derive value out of it, since
this data is in its raw form or unstructured format.

Examples Of Un-structured Data

The output returned by 'Google Search'

Semi-structured

Semi-structured data can contain both forms of data. Semi-structured data
appears structured in form, but it is not actually defined with, for example, a
table definition in a relational DBMS. An example of semi-structured data is
data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file-


<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
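
A program can still read such records by following the tags, even though no
table definition exists. The sketch below uses the XML parser from the Java
standard library. The class name PersonXmlReader and the <people> wrapper
element are assumptions added for this example: the sample lines have no single
root element, so they are wrapped in one to form a well-formed XML document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PersonXmlReader {
    // Returns one "name,sex,age" string per <rec> element.
    static List<String> readRecords(String recElements) {
        // Hypothetical <people> root, added so that the input is well-formed.
        String xml = "<people>" + recElements + "</people>";
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }

        List<String> rows = new ArrayList<>();
        NodeList recs = doc.getElementsByTagName("rec");
        for (int i = 0; i < recs.getLength(); i++) {
            Element rec = (Element) recs.item(i);
            rows.add(text(rec, "name") + "," + text(rec, "sex") + "," + text(rec, "age"));
        }
        return rows;
    }

    // Text of the first child element with the given tag, or "" if it is absent.
    static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        String sample =
                "<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>"
              + "<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>";
        readRecords(sample).forEach(System.out::println);
    }
}
```

Note that a record with a missing field is still accepted (the field is read
as an empty string), which is exactly what a fixed table definition would not
allow.
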

Data Growth over the years

Please note that web application data, which is unstructured, consists of log
files, transaction history files, etc. OLTP systems are built to work with
structured data, wherein data is stored in relations (tables).

Characteristics Of Big Data


(i) Volume – The name Big Data itself is related to a size which is enormous. Size
of data plays a very crucial role in determining value out of data. Also, whether a
particular data can actually be considered as a Big Data or not, is dependent upon
the volume of data. Hence, 'Volume' is one characteristic which needs to be
considered while dealing with Big Data.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. During earlier days, spreadsheets and databases were the only
sources of data considered by most of the applications. Nowadays, data in the form
of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being
considered in the analysis applications. This variety of unstructured data poses
certain issues for storage, mining and analyzing data.

(iii) Velocity – The term 'velocity' refers to the speed of generation of data.
How fast the data is generated and processed to meet demand determines the real
potential in the data.

Big Data velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, social media sites, sensors,
mobile devices, etc. The flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency which the data can show at
times, thus hampering the process of handling and managing the data
effectively.

Benefits of Big Data Processing

Ability to process Big Data brings in multiple benefits, such as-

o Businesses can utilize outside intelligence while making decisions

Access to social data from search engines and sites like Facebook and Twitter is
enabling organizations to fine-tune their business strategies.

o Improved customer service

Traditional customer feedback systems are being replaced by new systems
designed with Big Data technologies. In these new systems, Big Data and natural
language processing technologies are used to read and evaluate consumer
responses.

o Early identification of risk to the product/services, if any


o Better operational efficiency

Big Data technologies can be used for creating a staging area or landing zone for
new data before identifying what data should be moved to the data warehouse. In
addition, such integration of Big Data technologies and data warehouse helps an
organization to offload infrequently accessed data.

Summary

• Big Data is defined as data that is huge in size. Big Data is a term used to
  describe a collection of data that is huge in size and yet growing
  exponentially with time.
• Examples of Big Data generation include stock exchanges, social media
  sites, jet engines, etc.
• Big Data can be 1) Structured, 2) Unstructured, 3) Semi-structured.
• Volume, Variety, Velocity, and Variability are a few characteristics of
  Big Data.
• Improved customer service, better operational efficiency, and better decision
  making are a few advantages of Big Data.

What is Hadoop? Introduction, Architecture, Ecosystem, Components

What is Hadoop?

Apache Hadoop is an open source software framework used to develop data
processing applications which are executed in a distributed computing
environment.

Applications built using Hadoop run on large data sets distributed across
clusters of commodity computers. Commodity computers are cheap and widely
available. They are mainly useful for achieving greater computational power at
low cost.

Similar to data residing in the local file system of a personal computer, in
Hadoop, data resides in a distributed file system called the Hadoop Distributed
File System (HDFS). The processing model is based on the 'Data Locality'
concept, wherein computational logic is sent to the cluster nodes (servers)
containing the data. This computational logic is nothing but a compiled version
of a program written in a high-level language such as Java. Such a program
processes data stored in HDFS.

Do you know? A computer cluster consists of a set of multiple processing units
(storage disk + processor) which are connected to each other and act as a
single system.

In this tutorial, you will learn:

• Hadoop EcoSystem and Components
• Hadoop Architecture
• Features Of 'Hadoop'
• Network Topology In Hadoop
Hadoop EcoSystem and Components


The diagram below shows various components in the Hadoop ecosystem.

Apache Hadoop consists of two sub-projects:

1. Hadoop MapReduce: MapReduce is a computational model and software
   framework for writing applications which are run on Hadoop. These
   MapReduce programs are capable of processing enormous data in parallel
   on large clusters of computation nodes.
2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage
   part of Hadoop applications. MapReduce applications consume data from
   HDFS. HDFS creates multiple replicas of data blocks and distributes them
   on compute nodes in a cluster. This distribution enables reliable and
   extremely rapid computations.

Although Hadoop is best known for MapReduce and its distributed file system,
HDFS, the term is also used for a family of related projects that fall under the
umbrella of distributed computing and large-scale data processing. Other
Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume,
and ZooKeeper.

Hadoop Architecture

High Level Hadoop Architecture

Hadoop has a master-slave architecture for data storage and distributed data
processing, using HDFS and MapReduce respectively.

NameNode:

The NameNode represents every file and directory used in the namespace.

DataNode:

A DataNode helps you to manage the state of an HDFS node and allows you to
interact with the blocks.

MasterNode:

The master node allows you to conduct parallel processing of data using Hadoop
MapReduce.

Slave node:

The slave nodes are the additional machines in the Hadoop cluster which allow
you to store data and conduct complex calculations. Moreover, every slave node
comes with a Task Tracker and a DataNode. This allows you to synchronize the
processes with the Job Tracker and the NameNode, respectively.

In Hadoop, the master or slave systems can be set up in the cloud or
on-premises.

Features Of 'Hadoop'
• Suitable for Big Data Analysis

As Big Data tends to be distributed and unstructured in nature, Hadoop clusters
are best suited for analysis of Big Data. Since it is the processing logic (not
the actual data) that flows to the computing nodes, less network bandwidth is
consumed. This concept is called data locality, and it helps increase the
efficiency of Hadoop-based applications.

• Scalability

Hadoop clusters can easily be scaled to any extent by adding additional cluster
nodes, and thus allow for the growth of Big Data. Also, scaling does not require
modifications to application logic.

• Fault Tolerance

HADOOP ecosystem has a provision to replicate the input data on to other cluster
nodes. That way, in the event of a cluster node failure, data processing can still
proceed by using data stored on another cluster node.

Network Topology In Hadoop


The topology (arrangement) of the network affects the performance of the Hadoop
cluster as the size of the cluster grows. In addition to performance, one also
needs to care about high availability and the handling of failures. To achieve
this, Hadoop cluster formation makes use of network topology.

Typically, network bandwidth is an important factor to consider while forming
any network. However, as measuring bandwidth can be difficult, in Hadoop a
network is represented as a tree, and the distance between nodes of this tree
(number of hops) is considered an important factor in the formation of a Hadoop
cluster. Here, the distance between two nodes is equal to the sum of their
distances to their closest common ancestor.
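
The distance rule can be sketched in a few lines of Java. This is a minimal
illustration and not Hadoop's own code: the class name NetworkDistance and the
/datacenter/rack/node path format (for example /d1/r1/n1) are assumptions made
for this example.

```java
class NetworkDistance {
    // Distance between two nodes identified by tree paths such as "/d1/r1/n1":
    // the hops from each node up to their closest common ancestor, added together.
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // same node: 0
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // same rack: 2
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // same data center, other rack: 4
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // different data centers: 6
    }
}
```

The further apart two nodes are in the tree, the larger the distance, which
mirrors the idea that less bandwidth is available between them.
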

A Hadoop cluster consists of a data center, racks, and the nodes which actually
execute jobs. Here, a data center consists of racks and a rack consists of
nodes. The network bandwidth available to processes varies depending upon the
location of the processes. That is, the available bandwidth becomes
progressively smaller for each of the following:

• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks of the same data center
• Nodes in different data centers

HDFS Tutorial: Architecture, Read & Write Operation using Java API

What is HDFS?

HDFS is a distributed file system for storing very large data files, running on
clusters of commodity hardware. It is fault tolerant, scalable, and extremely
simple to expand. Hadoop comes bundled with HDFS (Hadoop Distributed File
System).

When data exceeds the storage capacity of a single physical machine, it becomes
essential to divide it across a number of separate machines. A file system that
manages storage-specific operations across a network of machines is called a
distributed file system. HDFS is one such software.

In this tutorial, we will learn:

• What is HDFS?
• HDFS Architecture
• Read Operation
• Write Operation
• Access HDFS using JAVA API
• Access HDFS Using COMMAND-LINE INTERFACE

HDFS Architecture
An HDFS cluster primarily consists of a NameNode, which manages the file system
metadata, and DataNodes, which store the actual data.

• NameNode: The NameNode can be considered the master of the system. It
  maintains the file system tree and the metadata for all the files and
  directories present in the system. Two files, the 'namespace image' and
  the 'edit log', are used to store metadata information. The NameNode has
  knowledge of all the DataNodes containing data blocks for a given file;
  however, it does not store block locations persistently. This information is
  reconstructed from the DataNodes every time the system starts.
• DataNode: DataNodes are slaves which reside on each machine in a cluster
  and provide the actual storage. They are responsible for serving read and
  write requests from clients.

Read/write operations in HDFS operate at a block level. Data files in HDFS are
broken into block-sized chunks, which are stored as independent units. The
default block size is 64 MB (in Hadoop 1.x; later versions default to 128 MB).

HDFS operates on a concept of data replication, wherein multiple replicas of
data blocks are created and distributed on nodes throughout a cluster to enable
high availability of data in the event of node failure.

Do you know?  A file in HDFS, which is smaller than a single block, does not
occupy a block's full storage. 
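
The block arithmetic can be illustrated with a short Java sketch. It is not
part of the HDFS API: the class BlockMath, its method names and the 200 MB file
size are hypothetical, and the 64 MB block size is simply the figure quoted
above.

```java
class BlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Bytes actually held by the last block (it may be only partly filled).
    static long lastBlockBytes(long fileSize, long blockSize) {
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;  // block size quoted in the text
        long fileSize = 200 * MB;  // hypothetical file size
        System.out.println(blockCount(fileSize, blockSize));          // 4
        System.out.println(lastBlockBytes(fileSize, blockSize) / MB); // 8
    }
}
```

Here a 200 MB file needs four blocks, and the last one holds only 8 MB, in line
with the note that a partly filled block does not occupy a full block's storage.
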

Read Operation In HDFS


A data read request is served by HDFS, NameNode, and DataNode. Let's call the
reader a 'client'. The diagram below depicts the file read operation in Hadoop.

1. A client initiates the read request by calling the 'open()' method of the
   FileSystem object; it is an object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information
   such as the locations of the blocks of the file. Please note that these
   addresses are of the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having
   a copy of each block are returned.
4. Once the addresses of the DataNodes are received, an object of
   type FSDataInputStream is returned to the client. FSDataInputStream
   contains DFSInputStream, which takes care of interactions with the
   DataNodes and the NameNode. In step 4 shown in the above diagram, the
   client invokes the 'read()' method, which causes DFSInputStream to
   establish a connection with the first DataNode holding the first block of
   the file.
5. Data is read in the form of streams, wherein the client invokes the 'read()'
   method repeatedly. This read() process continues until it reaches the end
   of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection
   and moves on to locate the next DataNode for the next block.
7. Once the client is done with the reading, it calls the close() method.

Write Operation In HDFS


In this section, we will understand how data is written into HDFS through files.

1. A client initiates the write operation by calling the 'create()' method of
   the DistributedFileSystem object, which creates a new file (step no. 1 in
   the above diagram).
2. The DistributedFileSystem object connects to the NameNode using an RPC call
   and initiates new file creation. However, this file create operation does
   not associate any blocks with the file. It is the responsibility of the
   NameNode to verify that the file (which is being created) does not already
   exist and that the client has the correct permissions to create a new file.
   If the file already exists or the client does not have sufficient
   permission to create a new file, then an IOException is thrown to the
   client. Otherwise, the operation succeeds and a new record for the file is
   created by the NameNode.
3. Once the new record in the NameNode is created, an object of type
   FSDataOutputStream is returned to the client. The client uses it to write
   data into HDFS. The data write method is invoked (step 3 in the diagram).
4. FSDataOutputStream contains a DFSOutputStream object, which looks after
   communication with the DataNodes and the NameNode. While the client
   continues writing data, DFSOutputStream continues creating packets with
   this data. These packets are enqueued into a queue called the DataQueue.
5. There is one more component called DataStreamer, which consumes
   this DataQueue. DataStreamer also asks the NameNode for allocation of new
   blocks, thereby picking desirable DataNodes to be used for replication.
6. Now, the process of replication starts by creating a pipeline using
   DataNodes. In our case, we have chosen a replication level of 3, and hence
   there are 3 DataNodes in the pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in the pipeline stores the packet it receives and forwards
   it to the next DataNode in the pipeline.
9. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store
   packets which are waiting for acknowledgment from DataNodes.
10. Once acknowledgment for a packet in the queue is received from all
    DataNodes in the pipeline, it is removed from the 'Ack Queue'. In the event
    of any DataNode failure, packets from this queue are used to reinitiate the
    operation.
11. After the client is done with writing data, it calls the close() method
    (step 9 in the diagram). The call to close() results in flushing the
    remaining data packets to the pipeline, followed by waiting for
    acknowledgment.
12. Once the final acknowledgment is received, the NameNode is contacted to
    tell it that the file write operation is complete.
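
The interplay of the two queues in steps 4 to 10 can be pictured with a small
in-memory model. This is a toy sketch in plain Java and not the real
DFSOutputStream: the class WritePipelineSketch and all of its methods are
hypothetical, packets are plain strings, and acknowledgments are assumed to
arrive for the oldest unacknowledged packet first. Moving unacknowledged
packets back to the front of the data queue on failure is one way to read
"used to reinitiate the operation" and is an assumption of this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class WritePipelineSketch {
    private final Deque<String> dataQueue = new ArrayDeque<>(); // packets not yet sent
    private final Deque<String> ackQueue = new ArrayDeque<>();  // packets sent, awaiting acks
    private final int pipelineSize;                             // DataNodes in the pipeline
    private int acksForHead = 0;                                // acks seen for the oldest sent packet

    WritePipelineSketch(int pipelineSize) {
        this.pipelineSize = pipelineSize;
    }

    // Client side: a newly created packet joins the data queue.
    void enqueue(String packet) {
        dataQueue.addLast(packet);
    }

    // Streamer side: send the next packet and park it in the ack queue.
    void sendNext() {
        String packet = dataQueue.pollFirst();
        if (packet != null) {
            ackQueue.addLast(packet);
        }
    }

    // One DataNode acknowledges the oldest sent packet; the packet leaves the
    // ack queue only after every DataNode in the pipeline has acknowledged it.
    void acknowledge() {
        if (ackQueue.isEmpty()) {
            return;
        }
        acksForHead++;
        if (acksForHead == pipelineSize) {
            ackQueue.pollFirst();
            acksForHead = 0;
        }
    }

    // On a DataNode failure, unacknowledged packets return to the front of the
    // data queue, in their original order, so that they can be sent again.
    void onDataNodeFailure() {
        while (!ackQueue.isEmpty()) {
            dataQueue.addFirst(ackQueue.pollLast());
        }
        acksForHead = 0;
    }

    int pending() {
        return dataQueue.size();
    }

    int unacknowledged() {
        return ackQueue.size();
    }

    public static void main(String[] args) {
        WritePipelineSketch w = new WritePipelineSketch(3); // replication level 3
        w.enqueue("p1");
        w.enqueue("p2");
        w.sendNext();
        w.acknowledge();
        w.acknowledge();
        System.out.println(w.unacknowledged()); // 1: only two of three acks so far
        w.acknowledge();
        System.out.println(w.unacknowledged()); // 0: all three DataNodes acknowledged
    }
}
```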

Access HDFS using JAVA API


In this section, we try to understand the Java interface used for accessing
Hadoop's file system.

In order to interact with Hadoop's filesystem programmatically, Hadoop provides
multiple Java classes. The package named org.apache.hadoop.fs contains classes
useful in the manipulation of a file in Hadoop's filesystem. These operations
include open, read, write, and close. In fact, the file API for Hadoop is
generic and can be extended to interact with filesystems other than HDFS.

Reading a file from HDFS, programmatically

An object of type java.net.URL is used for reading the contents of a file. To
begin with, we need to make Java recognize Hadoop's hdfs URL scheme. This is
done by calling the static setURLStreamHandlerFactory method of the URL class
and passing it an instance of FsUrlStreamHandlerFactory. This method can be
called only once per JVM, hence the call is enclosed in a static block.

An example code is:

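import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // Register Hadoop's handler so that java.net.URL understands hdfs:// URLs.
        // This may be done only once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            // args[0] is the URL of the file to read.
            in = new URL(args[0]).openStream();
            // Copy the stream to standard output using a 4096-byte buffer;
            // 'false' leaves the streams open after the copy.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}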

This code opens and reads the contents of a file. The location of this file on
HDFS is passed to the program as a command-line argument, in the form of an
hdfs:// URL.

Access HDFS Using COMMAND-LINE INTERFACE


This is one of the simplest ways to interact with HDFS. The command-line
interface has support for filesystem operations like reading files, creating
directories, moving files, deleting data, and listing directories.

We can run '$HADOOP_HOME/bin/hdfs dfs -help' to get detailed help on every
command. Here, 'dfs' is a shell command of HDFS which supports multiple
subcommands.

Some of the widely used commands are listed below, along with some details of
each one.

1. Copy a file from the local filesystem to HDFS


$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /

This command copies file temp.txt from the local filesystem to HDFS.

2. We can list files present in a directory using -ls


$HADOOP_HOME/bin/hdfs dfs -ls /

We can see a file 'temp.txt' (copied earlier) being listed under ' / ' directory.

3. Command to copy a file to the local filesystem from HDFS


$HADOOP_HOME/bin/hdfs dfs -copyToLocal /temp.txt

We can see temp.txt copied to the local filesystem.

4. Command to create a new directory


$HADOOP_HOME/bin/hdfs dfs -mkdir /mydirectory

Check whether a directory is created or not. Now, you should know how to do it ;-)

What is MapReduce? How it Works - Hadoop MapReduce Tutorial

What is MapReduce?
MapReduce is a software framework and programming model used for processing
huge amounts of data. MapReduce programs work in two phases, namely Map and
Reduce. Map tasks deal with splitting and mapping of data, while Reduce tasks
shuffle and reduce the data.

Hadoop is capable of running MapReduce programs written in various languages:
Java, Ruby, Python, and C++. MapReduce programs are parallel in nature, and
thus are very useful for performing large-scale data analysis using multiple
machines in the cluster.

The input to each phase is key-value pairs. In addition, every programmer needs
to specify two functions: a map function and a reduce function.

In this beginner training, you will learn:

• What is MapReduce in Hadoop?
• How MapReduce Works? Complete Process
• MapReduce Architecture explained in detail
• How MapReduce Organizes Work?

How MapReduce Works? Complete Process


The whole process goes through four phases of execution namely, splitting,
mapping, shuffling, and reducing.

Let's understand this with an example –

Consider that you have the following input data for your MapReduce program:

Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
MapReduce Architecture

The final output of the MapReduce task is

bad  1
Class  1
good  1
Hadoop  3
is  2
to  1
Welcome  1

The data goes through the following phases

Input Splits:

The input to a MapReduce job is divided into fixed-size pieces called input
splits. An input split is a chunk of the input that is consumed by a single map.

Mapping

This is the very first phase in the execution of a map-reduce program. In this
phase, data in each split is passed to a mapping function to produce output
values. In our example, the job of the mapping phase is to count the number of
occurrences of each word from the input splits (more details about input splits
are given below) and prepare a list in the form of <word, frequency>.

Shuffling

This phase consumes the output of the Mapping phase. Its task is to consolidate
the relevant records from the Mapping phase output. In our example, the same
words are clubbed together along with their respective frequency.

Reducing

In this phase, output values from the Shuffling phase are aggregated. This phase
combines values from Shuffling phase and returns a single output value. In short,
this phase summarizes the complete dataset.

In our example, this phase aggregates the values from Shuffling phase i.e.,
calculates total occurrences of each word.
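
The phases above can be mimicked in plain Java to see how the sample input
turns into the final output. This is a single-machine sketch and does not use
the Hadoop MapReduce API: the class WordCountSketch and its methods are
hypothetical, each input line is treated as one split, and the map step emits a
frequency of 1 for every occurrence of a word. A case-insensitive sort order is
assumed only to reproduce the ordering of the output listed earlier; as a side
effect, words that differ only in case would be counted together. Map.entry and
List.of require Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Map: emit one <word, 1> pair for every word in a split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group the values of all pairs that share the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        grouped.forEach((word, ones) ->
                totals.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    // Run all phases over the given splits.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            mapped.addAll(map(split));
        }
        return reduce(shuffle(mapped));
    }

    public static void main(String[] args) {
        List<String> splits = List.of(
                "Welcome to Hadoop Class",
                "Hadoop is good",
                "Hadoop is bad");
        // Prints {bad=1, Class=1, good=1, Hadoop=3, is=2, to=1, Welcome=1}
        System.out.println(run(splits));
    }
}
```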

MapReduce Architecture explained in detail

• One map task is created for each split, which then executes the map function
  for each record in the split.
• It is always beneficial to have multiple splits, because the time taken to
  process a split is small compared to the time taken to process the whole
  input. When the splits are smaller, the processing is better load-balanced,
  since we are processing the splits in parallel.
• However, it is also not desirable to have splits that are too small. When
  splits are too small, the overhead of managing the splits and of map task
  creation begins to dominate the total job execution time.
• For most jobs, it is better to make the split size equal to the size of an
  HDFS block (which is 64 MB, by default).
• Execution of map tasks results in writing output to a local disk on the
  respective node, and not to HDFS.
• The reason for choosing the local disk over HDFS is to avoid the replication
  which takes place in the case of an HDFS store operation.
• Map output is intermediate output which is processed by reduce tasks to
  produce the final output.
• Once the job is complete, the map output can be thrown away. So, storing it
  in HDFS with replication becomes overkill.
• In the event of node failure before the map output is consumed by the
  reduce task, Hadoop reruns the map task on another node and re-creates the
  map output.
• The reduce task doesn't work on the concept of data locality. The output of
  every map task is fed to the reduce task. Map output is transferred to the
  machine where the reduce task is running.
• On this machine, the output is merged and then passed to the user-defined
  reduce function.
• Unlike the map output, reduce output is stored in HDFS (the first replica is
  stored on the local node and other replicas are stored on off-rack nodes). So,
  writing the reduce output

How MapReduce Organizes Work?


Hadoop divides the job into tasks. There are two types of tasks:

1. Map tasks (Splits & Mapping)


2. Reduce tasks (Shuffling, Reducing)

as mentioned above.

The complete execution process (execution of both Map and Reduce tasks) is
controlled by two types of entities:

1. Jobtracker: acts like a master (responsible for complete execution of the
   submitted job)
2. Multiple Task Trackers: act like slaves, each of them performing part of
   the job

For every job submitted for execution in the system, there is one Jobtracker
that resides on the NameNode, and there are multiple task trackers which reside
on the DataNodes.

• A job is divided into multiple tasks, which are then run on multiple data
  nodes in a cluster.
• It is the responsibility of the job tracker to coordinate the activity by
  scheduling tasks to run on different data nodes.
• Execution of an individual task is then looked after by the task tracker,
  which resides on every data node executing part of the job.
• The task tracker's responsibility is to send the progress report to the job
  tracker.
• In addition, the task tracker periodically sends a 'heartbeat' signal to the
  Jobtracker so as to notify it of the current state of the system.
• Thus the job tracker keeps track of the overall progress of each job. In the
  event of task failure, the job tracker can reschedule it on a different task
  tracker.
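
The scheduling and rescheduling behaviour described in this list can be
pictured with a toy model. This is not Hadoop's JobTracker: the class
JobTrackerSketch, its round-robin placement and its "next tracker in the list"
rescheduling rule are all assumptions made for illustration, task tracker names
are assumed to be distinct, and heartbeats and progress reports are not
modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JobTrackerSketch {
    private final List<String> trackers;                            // task tracker names
    private final Map<String, String> assignment = new HashMap<>(); // task -> task tracker

    JobTrackerSketch(List<String> trackers) {
        if (trackers.size() < 2) {
            throw new IllegalArgumentException("need at least two task trackers");
        }
        this.trackers = new ArrayList<>(trackers);
    }

    // Divide the job: place tasks round-robin across the task trackers.
    void schedule(List<String> tasks) {
        for (int i = 0; i < tasks.size(); i++) {
            assignment.put(tasks.get(i), trackers.get(i % trackers.size()));
        }
    }

    // A task failed: move it to the next tracker in the list, which is always
    // a different one because there are at least two distinct trackers.
    String reschedule(String task) {
        String failedOn = assignment.get(task);
        if (failedOn == null) {
            throw new IllegalArgumentException("unknown task: " + task);
        }
        String next = trackers.get((trackers.indexOf(failedOn) + 1) % trackers.size());
        assignment.put(task, next);
        return next;
    }

    String trackerOf(String task) {
        return assignment.get(task);
    }

    public static void main(String[] args) {
        JobTrackerSketch jobTracker = new JobTrackerSketch(List.of("tt1", "tt2", "tt3"));
        jobTracker.schedule(List.of("map-0", "map-1", "map-2", "map-3"));
        System.out.println(jobTracker.trackerOf("map-3"));  // tt1
        System.out.println(jobTracker.reschedule("map-3")); // tt2
    }
}
```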
