BigData &
Hadoop
Shushrutha Reddy K
M.Tech in Computational Engineering from RGUKT
Senior BigData Developer @ServiceNow
Agenda
• BigData
• Hadoop
• MapReduce
• YARN
• Spark
• Amazon EMR
Friday, 21 January 2022 2
How It All Started?
What is BigData?
BigData is a term used for collections of data sets so large and
complex that they are difficult to store and process using available
database management tools or traditional data processing applications.
The challenges include capturing, curating, storing, searching, sharing,
transferring, analysing and visualizing the data.
Every minute:
Characteristics:
Types of Big Data
• Three types:
• Structured – stored and processed in a fixed format, e.g. relational/SQL tables
• Semi-Structured – XML or JSON files
• Unstructured – text files, images, audio, video
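The distinction can be illustrated with a short sketch (the record and field names below are invented for illustration): structured data fits a fixed schema, while semi-structured data such as JSON carries its field names with each record and can be parsed with Python's standard library.

```python
import json

# Semi-structured: JSON carries field names with the data,
# and records need not share an identical schema.
raw = '{"name": "Alice", "age": 30, "skills": ["Hadoop", "Spark"]}'
record = json.loads(raw)
print(record["name"])         # fields accessed by name, not by position
print(len(record["skills"]))

# Structured: the same data flattened into a fixed-format row,
# as it would appear in a SQL table (column order is the schema).
row = ("Alice", 30, "Hadoop;Spark")
print(row[1])
```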
Why Big Data Analytics?
• Making Organisations Smarter and More Efficient
• Optimizing Business Operations by Analysing Customer Behaviour
• Cost Reduction
• New-Generation Products
Stages in Big Data Analytics
Types of Big Data Analytics
Descriptive Analytics
• data aggregation and data mining to provide insight into the past
Diagnostic Analytics
• determines why something happened in the past
Predictive Analytics
• statistical models and forecasting techniques to understand the future
Prescriptive Analytics
• optimization and simulation algorithms to advise on possible outcomes
Big Data Domains
Scope of Big Data
Problems with Traditional Approach
Evolution of Hadoop
What is Hadoop?
• Hadoop is a framework that lets you store Big Data in a
distributed environment so that you can process it in parallel.
• HDFS (Hadoop Distributed File System)
• storage
• YARN (Yet Another Resource Negotiator)
• resource management
Advantages Of HDFS
1. Distributed Storage
2. Distributed & Parallel Computation
3. Horizontal Scalability
HDFS
Hadoop - NameNode
• Master daemon that maintains and manages the DataNodes (slave nodes)
• Records the metadata of all the blocks stored in the cluster:
• location of blocks, size of the files, permissions, hierarchy, etc.
• Records each and every change that takes place to the file system metadata
• If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
• Regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to
ensure that the DataNodes are alive
• Keeps a record of all the blocks in the HDFS and DataNode in which they are stored
Secondary NameNode:
Hadoop - DataNode
• Slave daemon which runs on each slave machine
• The actual data is stored on DataNodes
• Responsible for serving read and write requests from the clients
• Responsible for creating blocks, deleting blocks and replicating the
same based on the decisions taken by the NameNode
• Sends heartbeats to the NameNode periodically to report the overall
health of HDFS; by default, this interval is set to 3 seconds
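The 3-second heartbeat is configurable. A minimal hdfs-site.xml fragment (a sketch; these property names come from hdfs-default.xml and the values shown are the stock defaults):

```xml
<!-- hdfs-site.xml: a minimal sketch; values shown are the defaults -->
<configuration>
  <property>
    <name>dfs.heartbeat.interval</name>
    <value>3</value>          <!-- DataNode heartbeat period, in seconds -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- default replication factor -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>  <!-- default block size: 128 MB -->
  </property>
</configuration>
```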
Blocks
Replication Management
HDFS Write Architecture
• A 248 MB file “example.txt” is split into 2 blocks (default block size 128 MB):
• 128 MB (Block A)
• 120 MB (Block B)
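The split follows directly from the fixed block size: every block is full-sized except possibly the last. A small sketch (file name and sizes taken from the slide):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Return the sizes of the HDFS blocks for a file of file_size bytes.
    Every block is full-sized except possibly the last one."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# "example.txt" from the slide: 248 MB -> one 128 MB block, one 120 MB block
sizes_mb = [b // (1024 * 1024) for b in split_into_blocks(248 * 1024 * 1024)]
print(sizes_mb)  # [128, 120]
```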
Data copy process
• Three stages:
• Set up of Pipeline
• Data streaming and replication
• Shutdown of Pipeline (Acknowledgement stage)
For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}.
For Block A: 1A -> 2A -> 3A -> 4A
For Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B
MapReduce: Traditional Way
What is MapReduce?
• Framework that allows us to perform distributed and parallel processing on large data
sets in a distributed environment
• Two tasks – Map and Reduce
• Map: a block of data is read and processed to produce key-value pairs as intermediate output
• The output of a Mapper (key-value pairs) is the input to the Reducer
• Reduce: the reducer aggregates those intermediate key-value pairs into a
smaller set of tuples or key-value pairs
MapReduce: Word Count
Deer, Bear, River, Car, Car, River, Deer, Car and Bear
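The classic word-count flow can be simulated in plain Python (a sketch of the map, shuffle and reduce logic only, not the Hadoop API):

```python
from collections import defaultdict

words = "Deer Bear River Car Car River Deer Car Bear".split()

# Map: emit a (word, 1) pair for every word in the input split
mapped = [(w, 1) for w in words]

# Shuffle/sort: group all values by key, as the framework does
# between the map and reduce phases
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: aggregate each key's values into a single count
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'Deer': 2, 'Bear': 2, 'River': 2, 'Car': 3}
```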
YARN
Resource Manager
• Cluster-level component (one per cluster) that runs on the master machine
• Manages resources and schedules applications running on top of YARN
• Keeps track of the heartbeats from the Node Managers
• Two components:
• Scheduler – responsible for allocating resources to the various running applications
• Application Manager – responsible for accepting job submissions and negotiating
the first container for executing the application
Node Manager
• Node-level component that runs on each slave machine
• Responsible for managing containers and monitoring resource utilization in each
container
• Keeps track of node health and log management
• Continuously communicates with the Resource Manager to remain up to date
Application Submission in YARN
1) Submit the job
2) Get Application ID
3) Application Submission Context
4a) Start Container Launch
4b) Launch Application Master
5) Allocate Resources
6a) Container
6b) Launch
7) Execute
Application Workflow
in Hadoop YARN
1. Client submits an application
2. Resource Manager allocates a container to start the Application Master
3. Application Master registers with the Resource Manager
4. Application Master requests containers from the Resource Manager
5. Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. Client contacts the Resource Manager/Application Master to monitor the
application’s status
8. Application Master unregisters with the Resource Manager
Hadoop Ecosystem
Apache Spark
• Framework for real-time data analytics in a distributed computing environment
• Executes in-memory computations to increase the speed of data processing over MapReduce
• Up to 100x faster than Hadoop MapReduce for large-scale data processing, by exploiting
in-memory computation and other optimizations
Amazon EMR
Provides a managed Hadoop framework using the elastic infrastructure
of Amazon EC2 and Amazon S3.
Distributes the computation of the data over multiple Amazon EC2 instances.
Makes analysis of the data easy with Amazon Elastic MapReduce.
Benefits of Amazon EMR
• Elastic – Auto Scaling can be used to modify the number of instances automatically
• Economical – low cost, with support for Amazon EC2 Spot and Reserved Instances
• Secure – built-in firewall settings for protecting and controlling network access
to instances
• Flexible – supports root access to any instance, installation of additional
applications, and customization of the cluster with bootstrap actions