BigData &
Hadoop
Shushrutha Reddy K
M.Tech in Computational Engineering from RGUKT
Senior BigData Developer @ServiceNow
Agenda
• BigData
• Hadoop
• MapReduce
• YARN
• Spark
• Amazon EMR
Friday, 21 January 2022 2
How It All Started?
What is BigData?
BigData is a term used for collections of data sets so large and
complex that they are difficult to store and process using available
database management tools or traditional data processing applications.
The challenges include capturing, curating, storing, searching, sharing,
transferring, analysing and visualizing the data.
Every minute:
Characteristics:
Types of Big Data
• Three types:
• Structured – stored and processed in a fixed format, e.g. relational/SQL tables
• Semi-Structured – XML or JSON files
• Unstructured – text files, images, audio, video
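The distinction can be illustrated with a short sketch (the record and field names below are invented for illustration): structured data fits a fixed schema, while semi-structured data such as JSON carries its field names with each record and can be parsed with Python's standard library.

```python
import json

# Semi-structured: JSON carries field names with the data,
# and records need not share an identical schema.
raw = '{"name": "Alice", "age": 30, "skills": ["Hadoop", "Spark"]}'
record = json.loads(raw)
print(record["name"])         # fields accessed by name, not by position
print(len(record["skills"]))

# Structured: the same data flattened into a fixed-format row,
# as it would appear in a SQL table (column order is the schema).
row = ("Alice", 30, "Hadoop;Spark")
print(row[1])
```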
Why Big Data Analytics?
• Making Organisations Smarter and More Efficient
• Optimizing Business Operations by Analysing Customer Behaviour
• Cost Reduction
• New-Generation Products
Stages in Big Data Analytics
Types of Big Data Analytics
Descriptive Analytics
• data aggregation and data mining to provide insight into the past
Diagnostic Analytics
• determines why something happened in the past
Predictive Analytics
• statistical models and forecasting techniques to understand the future
Prescriptive Analytics
• optimization and simulation algorithms to advise on possible outcomes
Big Data Domains
Scope of Big Data
Problems with Traditional Approach
Evolution of Hadoop
What is Hadoop?
• Hadoop is a framework that lets you store Big Data in a
distributed environment so that you can process it in parallel.
• HDFS (Hadoop Distributed File System)
• storage
• YARN (Yet Another Resource Negotiator)
• resource management
Advantages Of HDFS
1. Distributed Storage
2. Distributed & Parallel Computation
3. Horizontal Scalability
HDFS
Hadoop - NameNode
• Master daemon that maintains and manages the DataNodes (slave nodes)
• Records the metadata of all the blocks stored in the cluster:
• location of blocks, size of the files, permissions, hierarchy, etc.
• Records each and every change that takes place to the file system metadata
• If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
• Regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to
ensure that the DataNodes are alive
• Keeps a record of all the blocks in the HDFS and DataNode in which they are stored
Secondary NameNode:
Hadoop - DataNode
• Slave daemon which runs on each slave machine
• The actual data is stored on DataNodes
• Responsible for serving read and write requests from the clients
• Responsible for creating blocks, deleting blocks and replicating the
same based on the decisions taken by the NameNode
• Sends heartbeats to the NameNode periodically to report the overall
health of HDFS; by default, this interval is set to 3 seconds
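The 3-second heartbeat is configurable. A minimal hdfs-site.xml fragment (a sketch; these property names come from hdfs-default.xml and the values shown are the stock defaults):

```xml
<!-- hdfs-site.xml: a minimal sketch; values shown are the defaults -->
<configuration>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>          <!-- DataNode heartbeat period, in seconds -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- default replication factor -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>  <!-- default block size: 128 MB -->
  </property>
</configuration>
```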
Blocks
Replication Management
HDFS Write Architecture
• A 248 MB file “example.txt” is split into 2 blocks (default block size 128 MB):
• 128 MB (Block A)
• 120 MB (Block B)
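The split follows directly from the fixed block size: every block is full-sized except possibly the last. A small sketch (file name and sizes taken from the slide):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Return the sizes of the HDFS blocks for a file of file_size bytes.
    Every block is full-sized except possibly the last one."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# "example.txt" from the slide: 248 MB -> one 128 MB block, one 120 MB block
sizes_mb = [b // (1024 * 1024) for b in split_into_blocks(248 * 1024 * 1024)]
print(sizes_mb)  # [128, 120]
```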
Data copy process
• Three stages:
• Set up of Pipeline
• Data streaming and replication
• Shutdown of Pipeline (Acknowledgement stage)
For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}.
For Block A: 1A -> 2A -> 3A -> 4A
For Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B
MapReduce: Traditional Way
What is MapReduce?
• Framework that allows us to perform distributed and parallel processing on large data
sets in a distributed environment
• Two tasks – Map and Reduce
• Map: a block of data is read and processed to produce key-value pairs as intermediate output
• The output of a Mapper (key-value pairs) is the input to the Reducer
• Reduce: the reducer aggregates those intermediate key-value pairs into a
smaller set of tuples or key-value pairs
MapReduce: Word Count
Deer, Bear, River, Car, Car, River, Deer, Car and Bear
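The classic word-count flow can be simulated in plain Python (a sketch of the map, shuffle and reduce logic only, not the Hadoop API):

```python
from collections import defaultdict

words = "Deer Bear River Car Car River Deer Car Bear".split()

# Map: emit a (word, 1) pair for every word in the input split
mapped = [(w, 1) for w in words]

# Shuffle/sort: group all values by key, as the framework does
# between the map and reduce phases
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: aggregate each key's values into a single count
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'Deer': 2, 'Bear': 2, 'River': 2, 'Car': 3}
```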
YARN
Resource Manager
• Cluster-level component (one per cluster) that runs on the master machine
• Manages resources and schedules applications running on top of YARN
• Keeps track of the heartbeats from the Node Managers
• Two components:
• Scheduler – responsible for allocating resources to the various running applications
• Application Manager – responsible for accepting job submissions and negotiating
the first container for executing the application
Node Manager
• Node-level component that runs on each slave machine
• Responsible for managing containers and monitoring resource utilization in each
container
• Keeps track of node health and log management
• Continuously communicates with the Resource Manager to remain up to date
Application Submission in YARN
1) Submit the job
2) Get Application ID
3) Application Submission Context
4a) Start Container Launch
4b) Launch Application Master
5) Allocate Resources
6a) Container
6b) Launch
7) Execute
Application Workflow
in Hadoop YARN
1. Client submits an application
2. Resource Manager allocates a container to start the Application Master
3. Application Master registers with the Resource Manager
4. Application Master requests containers from the Resource Manager
5. Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. Client contacts the Resource Manager/Application Master to monitor the
application’s status
8. Application Master unregisters with the Resource Manager
Hadoop Ecosystem
Apache Spark
• Framework for real-time data analytics in a distributed computing environment
• Executes in-memory computations to increase the speed of data processing over MapReduce
• Up to 100x faster than Hadoop MapReduce for large-scale data processing, by exploiting
in-memory computation and other optimizations
Amazon EMR
Provides a managed Hadoop framework using the elastic infrastructure
of Amazon EC2 and Amazon S3.
Distributes the computation of the data over multiple Amazon EC2 instances.
Makes analysis of the data easy with Amazon Elastic MapReduce.
Benefits of Amazon EMR
• Elastic – Auto Scaling can be used to modify the number of instances automatically
• Economical – low cost, with support for Amazon EC2 Spot and Reserved Instances
• Secure – built-in firewall settings for protecting and controlling network access
to instances
• Flexible – supports root access to any instance, installation of additional
applications, and customization of the cluster with bootstrap actions