Big Data and Analytics
Seema Acharya
Subhashini Chellappan
Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Copyright 2015, WILEY INDIA PVT. LTD.
Chapter 5
Introduction to Hadoop
Learning Objectives and Learning Outcomes
Introduction to Hadoop
Learning Objectives:
1. To study the features of Hadoop.
2. To learn the basic concepts of HDFS and MapReduce Programming.
3. To study HDFS Architecture.
4. To study MapReduce Programming Model.
5. To study Hadoop Ecosystem.
Learning Outcomes:
a) To comprehend the reasons behind the popularity of Hadoop.
b) To be able to perform HDFS operations.
c) To comprehend MapReduce framework.
d) To understand the read and write in HDFS.
e) To be able to understand Hadoop Ecosystem.
Agenda
► Hadoop - An Introduction
► Why Hadoop?
► Why not RDBMS?
► RDBMS versus Hadoop
► Distributed Computing Challenges
► History of Hadoop
► Hadoop Overview
❖ Key Aspects of Hadoop
❖ Hadoop Components
❖ Hadoop Conceptual Layer
❖ High Level Architecture of Hadoop
► Use case for Hadoop
❖ ClickStream Data
► Hadoop Distributors
► HDFS
❖ HDFS Daemons
❖ Anatomy of File Read
❖ Anatomy of File Write
❖ Replica Placement Strategy
❖ Working with HDFS commands
❖ Special Features of HDFS
► Processing Data with Hadoop
❖ What is MapReduce Programming?
❖ MapReduce Daemons
❖ How does MapReduce Work?
❖ MapReduce Word Count Example
► Managing Resources and Application with Hadoop YARN
❖ Limitations of Hadoop 1.0 Architecture
❖ HDFS Limitation
❖ Hadoop 2: HDFS
❖ Hadoop 2 YARN: Taking Hadoop Beyond Batch
► Interacting with Hadoop Ecosystem
❖ Pig
❖ Hive
❖ Sqoop
❖ HBase
Hadoop – An Introduction
1. Every day:
► NYSE
► Facebook
► Google
2. Every minute:
► Facebook
► Twitter
► Instagram
► YouTube
► Apple
► Email
► Amazon
► Google
3. Every second:
► Banking Applications
Data: The Treasure Trove
► Provides business advantages like generating
product recommendations, inventing new
products, analyzing the market etc.
► Provides a few early key indicators that can
turn the fortune of a business.
► Provides room for precise analysis.
► To process, analyze, and make sense of these
different kinds of data, we need a system
that scales and addresses the challenges
shown in Fig. 5.1.
Why Hadoop?
Ever wondered why Hadoop has been, and still is, one of the most wanted
technologies?
The key consideration (the rationale behind its huge popularity) is
its capability to handle massive amounts of data and different
categories of data fairly quickly.
The other considerations are:
Why not RDBMS?
RDBMS versus HADOOP
Distributed Computing Challenges
• Hardware failure: addressed by the replication factor.
  • The default replication factor is 3, so each block of a file is stored as
    three copies across the cluster. The default is set in hdfs-site.xml (the
    dfs.replication property) and can be changed for individual files by using:
    hdfs dfs -setrep <replication factor> <filename>
• How to process this gigantic store of data?
  • The challenge is to integrate the data available on several machines prior to processing it.
  • MapReduce programming is the solution.
History of Hadoop
Hadoop Overview
Key Aspects of Hadoop
Hadoop Components
Flume is a distributed,
reliable, and available service
for efficiently collecting,
aggregating, and moving
large amounts of streaming
data into the Hadoop
Distributed File System
(HDFS).
Hadoop Core Components:
HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.
MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
Hadoop Ecosystem Components:
Flume, Oozie, Mahout, Hive, Pig, Sqoop, HBase
Hadoop Conceptual Layer:
► Data storage layer
► Data Processing layer
Hadoop High Level Architecture
Use case for Hadoop
ClickStream Data Analysis
ClickStream data (mouse clicks) helps you to understand the
purchasing behavior of customers. ClickStream analysis helps online
marketers to optimize their product web pages, promotional content,
etc. to improve their business.
Hadoop Distributors
HDFS
(HADOOP DISTRIBUTED FILE SYSTEM)
Hadoop Distributed File System
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after Google File System.
4. Optimized for high throughput (HDFS leverages large block size and
moves computation where data is stored).
5. You can replicate a file a configured number of times, which makes
HDFS tolerant of both software and hardware failures.
6. Automatically re-replicates the data blocks of failed nodes onto
other nodes.
7. You can realize the power of HDFS when you perform read or write
on large files (gigabytes and larger).
8. Sits on top of the native file system.
HDFS key points
► Block-structured file
► Default replication factor: 3
► Default block size: 64 MB (Hadoop 1.x; 128 MB in Hadoop 2.x)
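As a rough illustration of these defaults, the following Python sketch computes how many blocks a file occupies and how much raw storage its replicas consume. The helper name hdfs_footprint and the 200 MB file size are assumptions chosen for illustration; the 64 MB block size and the replication factor of 3 are the defaults listed above.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, raw storage in MB) for one file.

    Hypothetical helper for illustration; it is not part of Hadoop.
    """
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # A partial block only occupies its actual size, so the raw storage
    # is simply the file size multiplied by the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

blocks, raw = hdfs_footprint(200)
print(blocks, raw)  # 4 600
```

A 200 MB file therefore needs four blocks (three full and one partial) and about 600 MB of raw disk space across the cluster.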
HDFS Daemons
NameNode:
• Single NameNode per cluster.
• Keeps the metadata details.
DataNode:
• Multiple DataNodes per cluster.
• Performs read/write operations.
SecondaryNameNode:
• Housekeeping Daemon
NameNode
► FsImage – file in which the entire file system namespace is stored
► EditLog – records every transaction that changes the file system metadata
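The relationship between the two files can be sketched in Python as replaying a log over a snapshot. This is a deliberately simplified model, not the real HDFS file formats: the namespace is assumed to be a dictionary of path to replication factor, and the three transaction types ("create", "setrep", "delete") are hypothetical names chosen for illustration.

```python
def replay(fsimage, edit_log):
    """Apply logged transactions to a namespace snapshot.

    Simplified model: the namespace is a dict of path -> replication
    factor, and each transaction is a tuple. Real HDFS is far richer.
    """
    namespace = dict(fsimage)  # start from the last saved snapshot
    for op, path, *args in edit_log:
        if op in ("create", "setrep"):
            namespace[path] = args[0]
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

fsimage = {"/sample/test.txt": 3}
edit_log = [
    ("create", "/sample/new.txt", 3),
    ("setrep", "/sample/test.txt", 2),
    ("delete", "/sample/new.txt"),
]
current = replay(fsimage, edit_log)
print(current)  # {'/sample/test.txt': 2}
```

The snapshot itself is left unchanged; the current state is always the snapshot plus every transaction recorded since it was taken.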
Name Node and Data Node communication
Anatomy of File Read
Anatomy of File Write
Steps involved in Anatomy of File Write
Special Features of HDFS
Data Replication: There is absolutely no need for a client application to
track all blocks. HDFS directs the client to the nearest replica to ensure high
performance.
Data Pipeline: A client application writes a block to the first DataNode in
the pipeline. Then this DataNode takes over and forwards the data to the
next node in the pipeline. This process continues for all the data blocks,
and subsequently all the replicas are written to the disk.
Replica Placement Strategy
As per the Hadoop Replica Placement Strategy, the first replica is placed on the same node as
the client. The second replica is placed on a node on a different rack. The third
replica is placed on the same rack as the second, but on a different node in that rack.
Once the replica locations have been set, a pipeline is built. This strategy provides good
reliability.
Fig: Replica Placement strategy
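The placement rule described above can be sketched in Python as follows. The function place_replicas, the rack names, and the node names are all hypothetical; the sketch picks nodes deterministically for readability, whereas a real cluster chooses among the eligible nodes itself.

```python
def place_replicas(client_node, topology):
    """Pick three nodes following the placement strategy described above.

    `topology` maps a rack name to its list of node names (an assumed
    layout used only for this illustration).
    """
    client_rack = next(r for r, nodes in topology.items() if client_node in nodes)
    # First replica: the node on which the client runs.
    first = client_node
    # Second replica: a node on a different rack.
    remote_rack = next(r for r in topology if r != client_rack)
    second = topology[remote_rack][0]
    # Third replica: same rack as the second, but a different node.
    third = next(n for n in topology[remote_rack] if n != second)
    return [first, second, third]

topology = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
}
print(place_replicas("node1", topology))  # ['node1', 'node3', 'node4']
```

Losing a whole rack still leaves at least one replica available, while two of the three replicas share a rack and so keep cross-rack write traffic low.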
Working with HDFS Commands
Objective: To create a directory (say, sample) in HDFS.
Act:
hadoop fs -mkdir /sample
Objective: To copy a file from local file system to HDFS.
Act:
hadoop fs -put /root/sample/test.txt /sample/test.txt
Objective: To copy a file from HDFS to local file system.
Act:
hadoop fs -get /sample/test.txt /root/sample/testsample.txt
Objective: To get the list of directories and files at the root of HDFS.
Act:
hadoop fs -ls /
Objective: To get the complete (recursive) list of directories and files in HDFS.
Act:
hadoop fs -ls -R /
Objective: To copy a file from local file system to HDFS via copyFromLocal command.
Act:
hadoop fs -copyFromLocal /root/sample/test.txt /sample/testsample.txt
Objective: To copy a file from Hadoop file system to local file system via copyToLocal command.
Act:
hadoop fs -copyToLocal /sample/test.txt /root/sample/testsample1.txt
Objective: To display contents of an HDFS file on console.
Act:
hadoop fs -cat /sample/test.txt
Objective: To copy a file from one directory to another on HDFS.
Act:
hadoop fs -cp /sample/test.txt /sample1
Objective: To remove a directory from HDFS.
Act:
hadoop fs -rm -r /sample1
Processing Data with Hadoop
What is MapReduce Programming?
MapReduce Programming is a software framework that helps you to process
massive amounts of data in parallel.
► In MapReduce programming, the input data set is split into independent chunks.
► Map Tasks process these independent chunks completely in a parallel manner. The
output produced by the map tasks serves as intermediate data and is stored on the local
disk of that server.
► The output of the mappers is automatically shuffled and sorted by the framework.
The MapReduce framework sorts the output based on KEYS.
► This sorted output becomes the input to the Reduce Tasks.
► Reduce Tasks provide reduced output by combining the output of the various
mappers.
► Job inputs and outputs are stored in a file system.
► MapReduce framework also takes care of the other tasks such as Scheduling,
Monitoring, Re-Executing failed tasks, etc.
► HDFS and MapReduce frameworks run on the same set of nodes. This configuration
allows effective scheduling of tasks on the nodes where data is present (DATA
LOCALITY). This in turn results in very high throughput.
► There are two daemons associated with MapReduce programming.
► A single master JOB TRACKER per cluster and one slave TASK TRACKER per
cluster node.
► The JobTracker is responsible for scheduling tasks to the TaskTracker,
monitoring the task and re-executing the task just in case the TaskTracker
fails.
► TaskTracker executes the tasks.
► MapReduce applications use suitable interfaces to construct the job. The
application and the job parameters together are called the JOB CONFIGURATION.
► The Hadoop JOB CLIENT submits the job (jar/executable, etc.) to the JobTracker.
It is then the responsibility of the JobTracker to schedule the tasks to the
slaves, monitor them, and provide status information to the
client.
MapReduce daemons: Job Tracker and Task Tracker
Fig: Job Tracker and Task Tracker interaction
How Does MapReduce Programming Work?
MapReduce programming architecture
MapReduce Word Count Example
► Count the occurrences of each word across 50 files
► Driver class: Job configuration details
► Mapper class: Map function
► Reducer class: Reduce function
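The three roles listed above can be imitated in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop code: the function names mapper, reducer, and run_job are assumptions, two short strings stand in for the input files, and a dictionary stands in for the shuffle and sort that the framework performs between the two phases.

```python
from collections import defaultdict

def mapper(line):
    """Map function: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce function: sum all counts for one word."""
    return word, sum(counts)

def run_job(lines):
    """Driver: wires the mapper and the reducer together.

    The grouping step stands in for the shuffle and sort that the
    framework performs between the map and reduce phases.
    """
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    return dict(reducer(word, grouped[word]) for word in sorted(grouped))

lines = ["Hadoop stores data", "Hadoop processes data"]
print(run_job(lines))
# {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

On a cluster, many mapper instances would run in parallel on different chunks of the input, and each reducer would receive only the keys assigned to it.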
SQL vs MapReduce
Parameter     SQL                         MapReduce
Access        Interactive and batch       Batch
Structure     Static                      Dynamic
Updates       Read and write many times   Write once, read many times
Integrity     High                        Low
Scalability   Nonlinear                   Linear
MANAGING RESOURCES AND APPLICATIONS
WITH HADOOP - YARN
(YET ANOTHER RESOURCE NEGOTIATOR)
Limitations of Hadoop 1.0 Architecture
1. Single NameNode is responsible for managing entire namespace for Hadoop
Cluster.
2. It has a restricted processing model which is suitable for batch-oriented
MapReduce jobs.
3. Hadoop MapReduce is not suitable for interactive analysis.
4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and
other memory intensive algorithms.
5. MapReduce is responsible for cluster resource management and data
processing.
HDFS Limitation:
The NameNode keeps all its file metadata in main memory, so it can quickly become
overwhelmed as the load on the system increases.
Hadoop 2: HDFS
► Major components:
  ► Namespace
  ► Block storage service
► Features:
  ► Horizontal scalability
  ► High availability
Fig: Active and Passive NameNode Interaction
Hadoop 2 YARN: Taking Hadoop beyond Batch
The fundamental idea behind this architecture is splitting the JobTracker responsibilities of
resource management and job scheduling/monitoring into separate daemons. The daemons that
are part of the YARN architecture are described below.
A Global ResourceManager: Its main responsibility is to distribute resources among various
applications in the system. It has two main components:
    Scheduler: Decides the allocation of resources to the various running applications. It is a
    pure scheduler and does not monitor or track the status of the applications.
    Application Manager: Accepts job submissions, negotiates resources for executing the
    application-specific ApplicationMaster, and restarts the ApplicationMaster if it
    fails.
NodeManager: This is a per-machine slave daemon. The NodeManager is responsible for launching
the application containers for application execution. It monitors resource
usage such as memory, CPU, disk, network, etc., and reports this usage to the
global ResourceManager.
Per-application ApplicationMaster: This is an application-specific entity. Its responsibility is
to negotiate the required resources for execution from the ResourceManager. It works along with
the NodeManager for executing and monitoring component tasks.
Basic concepts
► Application
► Container
► YARN Architecture
Interacting with Hadoop Ecosystem
Pig: Pig is a data flow system for Hadoop. It uses Pig Latin to specify data
flow. Pig is an alternative to MapReduce Programming. It abstracts some
details and allows you to focus on data processing.
Hive: Hive is a Data Warehousing Layer on top of Hadoop. Analysis and queries
can be done using an SQL-like language. Hive can be used to do ad-hoc queries,
summarization, and data analysis. Figure 5.31 depicts Hive in the Hadoop
ecosystem.
Sqoop: Sqoop is a tool which helps to transfer data between Hadoop and
Relational Databases. With the help of Sqoop, you can import data from RDBMS
to HDFS and vice-versa. Figure 5.32 depicts the Sqoop in Hadoop ecosystem.
HBase: HBase is a column-oriented NoSQL database for Hadoop. It is used to
store billions of rows and millions of columns. HBase provides random
read/write operations. It also supports record-level updates, which are not
possible using HDFS. HBase sits on top of HDFS.
Figure 5.33 depicts HBase in the Hadoop ecosystem.
Chapter 8
Introduction to MapReduce Programming
► Introduction
► Mapper
❖ RecordReader
❖ Map
❖ Combiner
❖ Partitioner
► Reducer
❖ Shuffle
❖ Sort
❖ Reduce
❖ Output Format
► Combiner
► Partitioner
► Searching
► Sorting
► Compression
Introduction
In MapReduce Programming, Jobs (Applications) are split into a set
of map tasks and reduce tasks. These tasks are then executed in a
distributed fashion on a Hadoop cluster.
Each task processes a small subset of the data that has been assigned to
it. This way, Hadoop distributes the load across the cluster.
A MapReduce job takes a set of files stored in HDFS (Hadoop
Distributed File System) as input.
Mapper
A mapper maps the input key–value pairs into a set of
intermediate key–value pairs. Maps are individual tasks that
have the responsibility of transforming input records into
intermediate key–value pairs.
The Mapper consists of the following phases:
• RecordReader
• Map
• Combiner
• Partitioner
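The first of these phases can be illustrated with a small Python sketch. It assumes newline-delimited text input and models the reader as producing (byte offset, line) pairs for the map function; the function name record_reader and the sample bytes are hypothetical.

```python
def record_reader(data: bytes):
    """Yield (byte offset, line text) pairs from a raw chunk of bytes.

    Mimics the record-oriented view that a line-based reader presents
    to the map function; simplified to newline-delimited UTF-8 text.
    """
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

split = b"big data\nhadoop\nmapreduce\n"
print(list(record_reader(split)))
# [(0, 'big data'), (9, 'hadoop'), (16, 'mapreduce')]
```

Each pair is one input record: the key is where the line starts in the chunk and the value is the line itself.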
Reducer
The primary chore of the Reducer is to reduce a set of
intermediate values (the ones that share a common
key) to a smaller set of values.
The Reducer has three primary phases: Shuffle and
Sort, Reduce, and Output Format.
⮚ Shuffle and sort
⮚ Reduce
⮚ Output format
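The three phases can be traced in a single-process Python sketch. Everything here is an assumption made for illustration: the city and temperature pairs are invented mapper output, max is used as the reduce function, and the tab-separated text lines are just one possible output format.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(map_outputs):
    """Merge the outputs of all mappers and sort them by key."""
    merged = [pair for output in map_outputs for pair in output]
    return sorted(merged, key=itemgetter(0))

def reduce_phase(sorted_pairs, reduce_fn):
    """Call reduce_fn once per key with all of that key's values."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, reduce_fn([value for _, value in group])

def output_format(results):
    """Render each result as a tab-separated line of text."""
    return [f"{key}\t{value}" for key, value in results]

map_outputs = [
    [("pune", 31), ("delhi", 40)],  # hypothetical output of mapper 1
    [("pune", 35), ("delhi", 38)],  # hypothetical output of mapper 2
]
pairs = shuffle_and_sort(map_outputs)
print(output_format(reduce_phase(pairs, max)))
# ['delhi\t40', 'pune\t35']
```

Sorting by key is what makes it possible to hand the reduce function all the values of one key together.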
The chores of Mapper, Combiner, Partitioner, and Reducer
Combiner
The combiner is an optimization technique for a MapReduce job. Generally, the
reducer class is set to be the combiner class. The difference between the
combiner class and the reducer class is as follows:
• Output generated by the combiner is intermediate data and it is passed
to the reducer.
• Output of the reducer is passed to the output file on disk.
• Objective
• Input data
• Act
• Output data
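To see why a combiner counts as an optimization, the Python sketch below compares how many intermediate pairs one mapper would hand to the reducers with and without local combining. The function names map_words and combine and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

def map_words(line):
    """Emit (word, 1) pairs for one line of input."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Locally sum the counts per word before anything leaves the mapper."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

line = "to be or not to be"
raw = map_words(line)
combined = combine(raw)
print(len(raw), len(combined))  # 6 4
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Here the combiner applies the same summing logic a word count reducer would use, which is why the reducer class can often double as the combiner class.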
Partitioner
The partitioning phase happens after the map phase and
before the reduce phase. Usually the number of partitions is
equal to the number of reducers. The default partitioner is
the hash partitioner.
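Hash partitioning can be sketched in Python as "hash the key, drop the sign bit, take the remainder by the number of reducers". As an assumption, the sketch uses Java's String.hashCode formula as a stand-in hash function (Hadoop's own key types define their own hash codes), because Python's built-in hash for strings changes between runs. The function names are hypothetical.

```python
def java_string_hash(s):
    """Java-style String.hashCode as a signed 32-bit integer.

    Stand-in hash chosen for reproducibility; valid for characters
    in the Basic Multilingual Plane.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h

def hash_partition(key, num_reducers):
    """Mask off the sign bit, then take the remainder."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

for key in ["a", "b", "ab"]:
    print(key, hash_partition(key, 3))
# a 1
# b 2
# ab 0
```

Because the partition depends only on the key, every pair with the same key reaches the same reducer, no matter which mapper produced it.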
Searching and Sorting Demo
Compression
In MapReduce programming, you can compress the MapReduce output file.
Compression provides two benefits as follows:
1. Reduces the space to store files.
2. Speeds up data transfer across the network.
You can specify compression format in the Driver Program as shown below:
// conf is the job configuration object created in the Driver Program
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec",
              GzipCodec.class, CompressionCodec.class);
Here, codec is the implementation of a compression and decompression algorithm.
GzipCodec is the compression algorithm for gzip. This compresses the output file.
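The space saving can be checked outside Hadoop with Python's standard gzip module. The sketch below is only an illustration of the first benefit: the repetitive key/value text is invented to resemble reducer output, and the actual ratio on real data will differ.

```python
import gzip

# Hypothetical reducer output: repetitive key/value text compresses well.
output = "".join(f"word{i % 10}\t{i}\n" for i in range(1000)).encode("utf-8")
compressed = gzip.compress(output)

print(len(compressed) < len(output))          # True
print(gzip.decompress(compressed) == output)  # True
```

The second check confirms that the compression is lossless: decompressing gives back exactly the original bytes.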
Answer a few questions…
Fill in the blanks
1. Partitioner phase belongs to ------------------------ task.
2. Combiner is also known as ---------------------------.
3. RecordReader converts byte-oriented view into --------------------------- view.
4. MapReduce sorts the intermediate value based on -------------------------- .
5. In MapReduce Programming, reduce function is applied ---------------- group at a
time.
Thank You
Answer a few quick questions…
Match the columns
Column A                 Column B
HDFS                     DataNode
MapReduce Programming    NameNode
Master node              Processing Data
Slave node               Google File System and MapReduce
Hadoop Implementation    Storage
Match the columns
Column A             Column B
JobTracker           Executes Task
MapReduce            Schedules Task
TaskTracker          Programming Model
Job Configuration    Converts input into Key Value pair
Map                  Job Parameters
Thank You