
BIG DATA Master IT Lecture 8 Course code: M25331

Batch Analytics

Dr. Ali Haider Shamsan

1
Lecture Outlines
• Batch Analysis frameworks
• Hadoop and MapReduce
• Pig
• Apache Oozie
• Apache Spark
• Apache Solr

Review

• Data Acquisition

Keywords

Big Data, Database, Batch Analytics

2
Review
NoSQL

• Data Acquisition Considerations


• Publish - Subscribe Messaging Frameworks
• Big Data Collection Systems
• Messaging Queues
• Custom Connectors

3
Batch Analysis frameworks
• Batch analytics is a type of analytics that involves processing large amounts of
data in batches, typically over a period of time.
• Batch analytics typically involves the use of ETL (Extract-Transform-Load)
processes to
• extract data from various sources,
• transform the data into a structure suitable for analytics,
• and load it into a data warehouse or other data repository.
• This is then followed by the use of analytical tools to
• explore the data,
• discover patterns, and
• generate insights.
• Batch analytics can be used to conduct predictive and descriptive analytics, as
well as more complex analytics such as machine learning and deep learning.
4
Batch Analysis frameworks

• Hadoop-MapReduce
• Pig
• Spark
• Solr.

5
Batch Analysis frameworks
Hadoop and
MapReduce
• Apache Hadoop is an open source framework for distributed batch
processing of big data.
• Similarly, MapReduce is a parallel programming model suitable for the
analysis of big data.
• MapReduce algorithms allow large-scale computations to be
automatically parallelized across a large cluster of servers.

6
Batch Analysis frameworks
MapReduce
Programming Model
• MapReduce is a parallel data processing model for the processing and analysis
of massive scale data.
• MapReduce model has two phases: Map and Reduce.
• The input data to the map and reduce phases is in the form of key-value
pairs.
• Run-time systems for MapReduce are typically large clusters built of
commodity hardware.
• The MapReduce run-time systems take care of tasks such as partitioning the
data, scheduling jobs and communication between nodes in the cluster.
• This makes it easier for programmers to analyze massive scale data
without worrying about tasks such as data partitioning and scheduling. 7
Batch Analysis frameworks
MapReduce
Programming Model
• In the Map phase, data is read from a distributed file system, partitioned
among a set of computing nodes in the cluster, and sent to the nodes as a
set of key-value pairs.
• The Map tasks process the input records independently of each other
and produce intermediate results as key-value pairs.
• When all the Map tasks are completed, the Reduce phase begins in
which the intermediate data with the same key is aggregated.
• An optional Combine task can be used to aggregate the mapper's intermediate
output for the same key before it is transferred to the Reduce task
(a minimal word-count sketch of these phases follows below).
8
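To make the Map and Reduce phases concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API; the class names, file layout, and the choice of word count as the example are illustrative, not part of the lecture material.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input record is processed independently and emits
// intermediate (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key-value pair
        }
    }
}

// Reduce phase: all intermediate values with the same key are aggregated.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

Because summing partial counts per key is associative, the same reducer class could also be registered as the optional Combiner mentioned above.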
Batch Analysis frameworks

Hadoop YARN

• Hadoop YARN is the next-generation architecture of Hadoop (version 2.x).
• In the YARN architecture, the original processing engine of Hadoop
(MapReduce) has been separated from the resource management
component (which is now part of YARN).
• This makes YARN effectively an operating system for Hadoop that
supports different processing engines on a Hadoop cluster such as
• MapReduce for batch processing, Apache Tez for interactive queries, Apache
Storm for stream processing.

9
Batch Analysis frameworks

Hadoop YARN

Figure 8.1
10
Batch Analysis frameworks

Hadoop YARN

The key components of YARN are described as follows:


1. Resource Manager (RM):
• RM manages the global assignment of compute resources to
applications.
• RM consists of two main services:
• Scheduler: Scheduler is a pluggable service that manages and enforces
the resource scheduling policy in the cluster.
• Applications Manager (AsM):
• AsM manages the running Application Masters in the cluster.
• AsM is responsible for starting application masters and for monitoring
and restarting them on different nodes in case of failures.
11
Batch Analysis frameworks

Hadoop YARN
The key components of YARN are described as follows:
2. Application Master (AM):
• An Application Master (AM) manages the application’s life cycle.
• AM is responsible for negotiating resources from the RM and working with the
NMs to execute and monitor the tasks.
3. Node Manager (NM):
• An NM runs on each node of the cluster and manages the user processes on that machine.
4. Containers:
• Container is a bundle of resources allocated by RM (memory, CPU and network).
• A container is a conceptual entity that grants an application the privilege to use a
certain amount of resources on a given machine to run a task.
• Each node has multiple containers based on the resource allocations made by
the RM. 12
Batch Analysis frameworks

Hadoop YARN

Figure 8.2

13
Batch Analysis frameworks

Hadoop YARN
• Figure 8.2 shows a YARN cluster with a Resource Manager node and
three Node Manager nodes.
• There are as many Application Masters running as there are applications
(jobs).
• Each application’s AM manages the application tasks such as
• starting, monitoring and restarting tasks in case of failures.
• Each application has multiple tasks.
• Each task runs in a separate container.
• Each container in YARN can be used for both map and reduce tasks.
• The resource allocation model of YARN is more flexible with the
introduction of resource containers which improve cluster utilization.
14
Batch Analysis frameworks

Hadoop YARN

Figure 8.3

15
Batch Analysis frameworks

Hadoop YARN
• To better understand the YARN job execution workflow, let us analyze the interactions
between the main components of YARN.
• Figure 8.3 shows the interactions between a Client and Resource Manager.
• Job execution begins with the submission of a new application request by the client to
the RM.
• The RM then responds with a unique application ID and information about cluster
resource capabilities that the client will need in requesting resources for running the
application’s AM.
• Using the information received from the RM, the client constructs and submits an
Application Submission Context which contains information such as scheduler queue,
priority and user information.
• The Application Submission Context also contains a Container Launch Context which
contains the application’s jar, job files, security tokens and any resource requirements.
• The client can query the RM for application reports.
• The client can also "force kill" an application by sending a request to the RM (these client-RM interactions are sketched below).
16
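As a hedged illustration of this client-side sequence (obtain an application ID, build the submission and container launch contexts, submit, then query or kill), the sketch below uses the YarnClient API shipped with Hadoop 2.x/3.x; the application name, queue, and resource sizes are illustrative assumptions.

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the Resource Manager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step 1: request a new application ID and cluster capability info from the RM.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        ApplicationId appId = appContext.getApplicationId();

        // Step 2: build the Application Submission Context, including the
        // Container Launch Context for the Application Master.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        appContext.setAMContainerSpec(amContainer);
        appContext.setApplicationName("example-app");       // illustrative name
        appContext.setQueue("default");                      // scheduler queue
        appContext.setResource(Resource.newInstance(1024, 1)); // AM memory (MB) and vCores

        // Step 3: submit the Application Submission Context to the RM.
        yarnClient.submitApplication(appContext);

        // The client can later query the RM for an application report,
        // or force-kill the application.
        System.out.println(yarnClient.getApplicationReport(appId).getYarnApplicationState());
        // yarnClient.killApplication(appId);
    }
}
```

A real client would also populate the ContainerLaunchContext with the application's jar, environment, and AM launch command before submitting.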
Batch Analysis frameworks

Hadoop YARN

Figure 8.4

17
Batch Analysis frameworks

Hadoop YARN
• Figure 8.4 shows the interactions between the Resource Manager and the
Application Master.
• Upon receiving an application submission context from a client, the RM finds
an available container meeting the resource requirements for running the AM
for the application.
• On finding a suitable container, the RM contacts the NM for the container to
start the AM process on its node.
• When the AM is launched it registers itself with the RM.
• The registration process consists of handshaking that conveys information such
as the port that the AM will be listening on, the tracking URL for monitoring the
application’s status and progress, etc.
• The registration response from the RM contains information for the AM that is
used in calculating and requesting any resource requests for the application’s
individual tasks (such as minimum and maximum resource capabilities for the
cluster). 18
Batch Analysis frameworks

Hadoop YARN

• The AM relays heartbeat and progress information to the RM.


• The AM sends resource allocation requests to the RM; each request contains a list
of requested containers and may also list containers released by the AM.
• Upon receiving the allocation request, the scheduler component of the
RM computes a list of containers that satisfy the request and sends back
an allocation response.
• Upon receiving the resource list, the AM contacts the associated NMs
for starting the containers.
• When the job finishes, the AM sends a Finish Application message to the
RM (this AM-RM exchange is sketched below).
19
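A minimal sketch of this exchange from the AM's side, assuming the AMRMClient helper API that ships with YARN; the host name, tracking URL, and resource sizes are placeholders.

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmSketch {
    public static void main(String[] args) throws Exception {
        // Register the Application Master with the Resource Manager
        // (host, port and tracking URL are part of the registration handshake).
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();
        rmClient.registerApplicationMaster("am-host", 0, "http://am-host/tracking");

        // Ask the RM for one task container (memory/cores are illustrative).
        Resource capability = Resource.newInstance(512, 1);
        rmClient.addContainerRequest(
                new ContainerRequest(capability, null, null, Priority.newInstance(0)));

        // Heartbeat/allocate call: the RM answers with granted containers.
        AllocateResponse response = rmClient.allocate(0.1f);
        for (Container container : response.getAllocatedContainers()) {
            // The AM would now contact the container's Node Manager to launch a task.
            System.out.println("Granted container on " + container.getNodeId());
        }

        // When the job finishes, tell the RM the application is done.
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", null);
    }
}
```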
Batch Analysis frameworks

Hadoop YARN

Figure 8.5
20
Batch Analysis frameworks

Hadoop YARN

• Figure 8.5 shows the interactions between an Application Master and the Node Manager.
• Based on the resource list received from the RM, the AM requests the
hosting NM for each container to start the container.
• The AM can request and receive a container status report from the Node
Manager.
• Figure 8.6 shows the MapReduce job execution within a YARN cluster.

21
Batch Analysis frameworks

Hadoop YARN

Figure 8.6

22
Hadoop YARN

Hadoop Schedulers

• The scheduler is a pluggable component in Hadoop that allows it to support different scheduling algorithms.
• The pluggable scheduler framework provides the flexibility to support a
variety of workloads with varying priority and performance constraints.
• The Hadoop scheduling algorithms are described as follows:
FIFO
• The FIFO scheduler maintains a work queue in which the jobs are queued.
The scheduler pulls jobs in a first-in, first-out manner (oldest job first) for
scheduling.
• There is no concept of priority or job size in the FIFO scheduler.
23
Hadoop YARN

Hadoop Schedulers
Fair Scheduler
• The Fair Scheduler was originally developed by Facebook.
• Facebook uses Hadoop to manage the massive content and log data it
accumulates every day.
• The need for the Fair Scheduler arose when Facebook wanted to share its data
warehousing infrastructure among multiple users.
• The Fair Scheduler allocates resources evenly between multiple jobs and also
provides capacity guarantees.
• Fair Scheduler assigns resources to jobs such that each job gets an equal share
of the available resources on average over time.
• The Fair Scheduler lets short jobs finish in reasonable time while not starving
long jobs.
24
Hadoop YARN

Hadoop Schedulers

Fair Scheduler
• Task slots that are free are assigned to new jobs, so that each job
gets roughly the same amount of CPU time.
• The Fair Scheduler maintains a set of pools into which jobs are placed.
Each pool has a guaranteed capacity.
• When there is a single job running, all the resources are assigned to that
job. When there are multiple jobs in the pools, each pool gets at least as
many task slots as guaranteed.
• This lets the scheduler guarantee capacity for pools while utilizing
resources efficiently when these pools don’t contain jobs (a simplified allocation sketch follows below).
25
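The following is a simplified sketch (not the actual Fair Scheduler code) of the allocation rule described above: each active pool first receives its guaranteed capacity, and the remaining slots are split evenly among the pools that currently have jobs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified fair-share allocation: guaranteed capacity first, then an
// even split of the leftover slots among pools that actually have jobs.
class FairShareSketch {
    record Pool(String name, int guaranteed, boolean hasJobs) {}

    static Map<String, Integer> allocate(List<Pool> pools, int totalSlots) {
        Map<String, Integer> shares = new LinkedHashMap<>();
        List<Pool> active = pools.stream().filter(Pool::hasJobs).toList();
        int remaining = totalSlots;
        for (Pool p : active) {                       // honour the guarantees
            int grant = Math.min(p.guaranteed(), remaining);
            shares.put(p.name(), grant);
            remaining -= grant;
        }
        int extra = active.isEmpty() ? 0 : remaining / active.size();
        for (Pool p : active) {                       // split what is left evenly
            shares.merge(p.name(), extra, Integer::sum);
        }
        return shares;
    }

    public static void main(String[] args) {
        System.out.println(allocate(
                List.of(new Pool("etl", 4, true), new Pool("adhoc", 2, true),
                        new Pool("reporting", 6, false)),
                20)); // {etl=11, adhoc=9}: the idle pool's capacity is reused
    }
}
```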
Hadoop YARN

Hadoop Schedulers

Fair Scheduler
• The Fair Scheduler keeps track of the compute time received by each
job.
• The Fair Scheduler is useful when a Hadoop cluster, whether small or large, is shared
among multiple groups of users.
• Though the fair scheduler ensures fairness by maintaining a set of pools
and providing guaranteed capacity to each pool, it does not provide any
timing guarantees and hence it is ill-equipped for real-time jobs.

26
Hadoop YARN

Hadoop Schedulers

Capacity Scheduler
• Capacity scheduler has similar functionality as the Fair Scheduler but
adopts a different scheduling philosophy.
• In Capacity Scheduler, multiple named queues are defined, each with a
configurable number of map and reduce slots.
• Each queue is also assigned a guaranteed capacity.
• The Capacity Scheduler gives each queue its capacity when it contains
jobs, and shares any unused capacity between the queues.

27
Hadoop YARN

Hadoop Schedulers

Capacity Scheduler
• When a TaskTracker has free slots, the Capacity Scheduler picks the queue
for which the ratio of running slots to capacity is lowest (a small sketch of this selection rule follows below).
• The Capacity Scheduler is useful when a large Hadoop cluster is shared
among multiple clients with different types and priorities of jobs.
• Though the capacity scheduler ensures fairness by maintaining a set of
queues and providing guaranteed capacity to each queue, it does not
provide any timing guarantees and, therefore, it may be ill-equipped for
real-time jobs.

28
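A hedged sketch of that selection rule, with made-up queue names and numbers; the real Capacity Scheduler implementation is more involved.

```java
import java.util.Comparator;
import java.util.List;

// Pick the queue whose ratio of running tasks to guaranteed capacity is lowest,
// which is how the Capacity Scheduler decides where a freed slot should go.
class CapacityPickSketch {
    record Queue(String name, int running, int capacity) {}

    static Queue pick(List<Queue> queues) {
        return queues.stream()
                .min(Comparator.comparingDouble(q -> (double) q.running() / q.capacity()))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Queue next = pick(List.of(
                new Queue("prod", 8, 10),     // 0.8 of capacity in use
                new Queue("research", 2, 5),  // 0.4 of capacity in use -> gets the slot
                new Queue("default", 3, 5))); // 0.6 of capacity in use
        System.out.println(next.name());      // prints "research"
    }
}
```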
Batch Analysis frameworks

Pig
• While MapReduce is a powerful model for big data analysis, for certain complex
analysis jobs developers may find it difficult to identify the key-value pairs
involved at each step and then implement the map and reduce functions.
• Moreover, complex analysis jobs may require multiple MapReduce jobs to be
chained.
• Pig is a high-level data processing language which makes it easy for developers
to write data analysis scripts, which are translated into MapReduce programs
by the Pig compiler.
Pig includes:
• (1) a high-level language (called Pig Latin) for expressing data analysis programs and
• (2) a compiler which produces sequences of MapReduce programs from the Pig scripts.
• Pig can be executed either in local mode or MapReduce mode. 29
Batch Analysis frameworks

Pig
• In local mode, Pig runs inside a single JVM process on a local machine.
• Local mode is useful for development purpose and testing the
scripts with small data files on a single machine.
• MapReduce mode requires a Hadoop cluster.
• In MapReduce mode, Pig can analyze data stored in HDFS.
• Pig compiler translates the pig scripts into MapReduce programs
which are executed in a Hadoop cluster.
• Pig provides an interactive shell called grunt, for developing pig scripts.
• Pig provides various built-in functions such as AVG, MIN, MAX, SUM,
and COUNT.
30
Pig
Pig Operators
• Pig operators are the set of relational operators that Apache Pig provides for building
data analysis programs.
These operators include the following (several of them appear in the sketch after this list):
• 1. Foreach
• This operator generates a new data set by applying a set of transformations to each record
of the input data set.
• 2. Join
• This operator allows two data sets to be joined together to form a single data set.
• 3. Filter
• This operator is used to select a subset of records from the input data set that meet certain
criteria.
• 4. Group
• This operator is used to group related records from the input data set.
• 5. Sort
• This operator is used to sort the records of the input data set in ascending or descending
order. 31
Pig
Pig Operators
• 6. Limit
• This operator is used to limit the number of records returned from the input data
set.
• 7. Distinct
• This operator is used to remove duplicate records from the input data set.
• 8. Union
• This operator is used to combine two data sets into a single data set.
• 9. Load
• This operator is used to load data from an external source such as a file or database
into the Apache Pig environment.
• 10. Store
• This operator is used to write the output data set to an external source such as a file
or database.
32
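To show several of these operators together, here is a hedged sketch that embeds a small Pig Latin script in Java through the PigServer API in local mode; the file name, schema, and aliases are illustrative.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigOperatorsSketch {
    public static void main(String[] args) throws Exception {
        // Local mode: Pig runs inside a single JVM, useful for testing scripts.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // LOAD: read records from a file (path and schema are illustrative).
        pig.registerQuery("students = LOAD 'students.tsv' AS (name:chararray, dept:chararray, gpa:double);");

        // FILTER: keep only the records that meet a condition.
        pig.registerQuery("good = FILTER students BY gpa >= 3.0;");

        // GROUP + FOREACH: group related records and apply built-in functions.
        pig.registerQuery("by_dept = GROUP good BY dept;");
        pig.registerQuery("avg_gpa = FOREACH by_dept GENERATE group AS dept, AVG(good.gpa) AS avg, COUNT(good) AS n;");

        // Nothing has executed so far (lazy evaluation); STORE triggers the run.
        pig.store("avg_gpa", "avg_gpa_out");
    }
}
```

The same statements could be typed interactively in the grunt shell; in MapReduce mode the Pig compiler would translate them into a chain of MapReduce jobs over data in HDFS.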
Pig
Data Types
• The simple data types work the same way as in other programming languages.
Let us look at the complex data types in detail.
• Tuple
• A tuple is an ordered set of fields.
• A collection of fields with different data types.
• Bag
• A bag is an unordered collection of tuples.
• A bag is represented with curly braces.
• Map
• A Map is a set of key-value pairs.
• A map is represented with square brackets.
• A # is used to separate the key and value (a short sketch of building these types programmatically follows below).
33
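As a hedged illustration, the complex types can also be constructed through Pig's Java data API; the values used here are made up.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class PigDataTypesSketch {
    public static void main(String[] args) throws Exception {
        // Tuple: an ordered set of fields, possibly of different types.
        Tuple t = TupleFactory.getInstance().newTuple();
        t.append("alice");
        t.append(3.8);

        // Bag: an unordered collection of tuples (printed with curly braces).
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        bag.add(t);

        // Map: key-value pairs (written as [key#value] in Pig Latin).
        Map<String, Object> m = new HashMap<>();
        m.put("dept", "cs");

        System.out.println(t);    // (alice,3.8)
        System.out.println(bag);  // {(alice,3.8)}
        System.out.println(m);    // {dept=cs}
    }
}
```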
Pig
Storing Results
• To save the results on the filesystem, the STORE operator is used.
• Pig uses a lazy evaluation strategy and delays the evaluation of
expressions till a STORE or DUMP operator triggers the results to be
stored or displayed.
• The DUMP operator is used to dump the results on the console.
• The EXPLAIN operator is used to view the logical, physical, and
MapReduce execution plans for computing a relation.
• The ILLUSTRATE operator is used to display the step-by-step
execution of statements to compute a relation with a small sample of
data.

34
Batch Analysis frameworks

Apache Oozie
• Many batch analysis applications require more than one MapReduce job to
be chained to perform data analysis.
• This can be accomplished using the Apache Oozie system.
• Oozie is a workflow scheduler system for managing Hadoop jobs.
• With Oozie, you can create workflows, which are collections of actions
(such as MapReduce jobs) arranged as Directed Acyclic Graphs (DAGs).
• Control dependencies exist between the actions in a workflow.
• Thus, an action is executed only when the preceding action is completed.
• An Oozie workflow specifies a sequence of actions that need to be
executed using an XML-based Process Definition Language called Hadoop
Process Definition Language (hPDL).
• Oozie supports various types of actions such as Hadoop MapReduce, Hadoop
file system, Pig, Java, Email, Shell, Hive, Sqoop, SSH and custom actions (a client-side submission sketch follows below).
35
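As a hedged sketch of submitting and monitoring such a workflow from Java, assuming the standard OozieClient API; the server URL, HDFS paths, and the properties other than OozieClient.APP_PATH are placeholders that depend on the workflow definition.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server (URL is a placeholder).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties: where the workflow.xml (hPDL definition) lives,
        // plus any parameters the workflow expects.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("queueName", "default");

        // Submit and start the workflow job.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow: " + jobId);

        // Poll until the DAG of actions has finished executing.
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```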
Next lecture

• Realtime Analysis

Assignment

Deadline

Previous Deadline

54
