Unit 2 & 3: Big Data Notes

📘 UNIT 2: HADOOP AND MAPREDUCE – DETAILED EXPLANATION

🔷 PART 1: HADOOP

1.1 History of Hadoop

 Origin: Hadoop started as part of the Nutch project, an open-source web search engine.

 Doug Cutting and Mike Cafarella found Google's GFS and MapReduce concepts ideal for
managing large data.

 To handle the growing web content, they separated the distributed processing/storage parts
from Nutch.

 Named "Hadoop" after Doug Cutting’s son’s toy elephant.

 2006: Yahoo! adopted Hadoop and contributed heavily to its development.

 2008: Hadoop became a top-level Apache project.

1.2 Apache Hadoop

 An open-source framework for storing and processing large datasets in a distributed manner.

 Written in Java.

 Supports parallel processing across many machines.

 It provides:

o Storage (via HDFS)

o Computation (via MapReduce)

o Resource Management (via YARN)

1.3 Hadoop Distributed File System (HDFS)

 Core component of Hadoop, responsible for data storage.

 Block-based storage: Splits files into large blocks (default 128 MB in Hadoop 2.x and later; often raised to 256 MB).

 Each block is stored across multiple machines with replication (default 3 copies) for fault
tolerance.

 Architecture:

o NameNode: Master server; stores metadata (file names, locations).

o DataNode: Worker nodes; store actual data blocks.


o Secondary NameNode: Periodically merges the edit log into the filesystem image (checkpointing); it is not a standby NameNode.

1.4 Components of Hadoop

1. HDFS – Stores large data reliably across clusters.

2. MapReduce – Processes large data sets.

3. YARN – Manages cluster resources and schedules tasks.

4. Hadoop Common – Provides libraries and utilities used by other modules.

1.5 Data Format in Hadoop

 Hadoop supports different data formats:

o Text files (CSV, JSON, XML)

o Binary formats (Avro, Parquet, ORC)

o Sequence files: Binary key-value format used internally.

 Within MapReduce, keys and values are represented as Writable objects for fast serialization/deserialization.

1.6 Analyzing Data with Hadoop

 Hadoop processes data using the MapReduce model.

 Suitable for:

o ETL operations

o Data aggregation

o Statistical computation

 Scales to handle petabytes of data.

 Enables parallel computation by distributing data and tasks.

1.7 Scaling Out

 Hadoop is designed to scale horizontally.

 Add more nodes to increase capacity, instead of upgrading hardware.

 No single processing bottleneck, since work is spread across nodes (metadata, however, remains centralized on the NameNode).

 Cost-efficient as it works on commodity hardware.

1.8 Hadoop Streaming


 Allows developers to write MapReduce programs in any language (not just Java).

 Works via command-line and uses stdin and stdout for data communication.

 Example: A Python script can act as a mapper or reducer.

1.9 Hadoop Pipes

 Provides C++ API to write MapReduce programs.

 Uses sockets (not JNI) for communication between the C++ process and the Java-based Hadoop task.

 Suitable for developers who prefer C++ over Java.

1.10 Hadoop Ecosystem

 Extends Hadoop’s capabilities with various tools:

o Hive: SQL-like language for querying data stored in HDFS.

o Pig: High-level scripting language for data processing.

o HBase: NoSQL database on top of HDFS.

o Sqoop: Tool for importing/exporting data between Hadoop and RDBMS.

o Flume: Used to collect and aggregate log data.

o Oozie: Workflow scheduler for Hadoop jobs.

o Zookeeper: Coordinates distributed systems.

o Spark: In-memory computation engine for fast processing.

🔷 PART 2: MAPREDUCE

2.1 MapReduce Framework and Basics

 A programming model for parallel processing of large datasets.

 Developed by Google and adopted by Hadoop.

 Two main functions:

o Map: Takes input and produces intermediate key-value pairs.

o Reduce: Merges and summarizes intermediate results based on key.

2.2 How MapReduce Works

1. Input data is split into fixed-size chunks.


2. Mappers process these chunks and output intermediate key-value pairs.

3. The framework performs shuffle and sort to group data by keys.

4. Reducers process grouped data and produce final output.

5. Output is stored in HDFS.
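
For example, counting words over two small input splits, "the cat sat" and "the dog", works out as follows:

Map output: (the,1) (cat,1) (sat,1) and (the,1) (dog,1)
After shuffle and sort: (cat,[1]) (dog,[1]) (sat,[1]) (the,[1,1])
Reduce output: (cat,1) (dog,1) (sat,1) (the,2)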

2.3 Developing a MapReduce Application

 Consists of three main components:

1. Mapper: Processes input and emits intermediate key-value pairs.

2. Reducer: Aggregates values by key.

3. Driver: Sets job configuration (input/output paths, mapper/reducer classes).

 Written primarily in Java, but other languages are supported via streaming.
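
 Example – a minimal sketch of the classic word-count job using the standard org.apache.hadoop.mapreduce API (class names and the whitespace tokenizer here are illustrative choices):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in a line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    // Driver: configures the job; input and output paths come from the command line
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}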

2.4 Unit Tests with MRUnit

 MRUnit is a Java library for testing MapReduce programs.

 Allows testing of:

o Mapper

o Reducer

o Complete job logic

 Ensures correctness before deploying to production.
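
 Example – a sketch of a mapper test with MRUnit, assuming the TokenizerMapper from the word-count sketch above and MRUnit 1.x on the classpath:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Bind the MRUnit driver to the mapper under test
        mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
    }

    @Test
    public void emitsOnePairPerWord() throws Exception {
        mapDriver
            .withInput(new LongWritable(0), new Text("the cat"))   // simulated input record
            .withOutput(new Text("the"), new IntWritable(1))       // expected output, in emission order
            .withOutput(new Text("cat"), new IntWritable(1))
            .runTest();                                             // fails the test on any mismatch
    }
}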

2.5 Test Data and Local Tests

 Local testing helps validate logic without using an entire Hadoop cluster.

 Developers can use sample data and run jobs in local mode.
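
 Example – a sketch of pointing a job at the local job runner and the local file system for a quick test (property values are the standard ones, paths are illustrative, imports as in the word-count sketch):

Configuration conf = new Configuration();
conf.set("mapreduce.framework.name", "local"); // run with the local job runner instead of YARN
conf.set("fs.defaultFS", "file:///");          // read and write the local file system
Job job = Job.getInstance(conf, "word count (local test)");
// ...set mapper, reducer and key/value classes as usual, then use a small sample file
FileInputFormat.addInputPath(job, new Path("sample-input.txt"));
FileOutputFormat.setOutputPath(job, new Path("sample-output"));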

2.6 Anatomy of a MapReduce Job Run

1. Job submission by the client.

2. Input splits assigned to mappers.

3. Mapper output is shuffled and sorted.

4. Reducer processes the intermediate data.

5. Final output is saved in HDFS.


2.7 Failures in MapReduce

 Fault-tolerant system:

o Automatically retries failed tasks.

o Keeps track of task status so that only failed tasks are re-run, not the whole job.

 Speculative execution:

o Launches duplicate tasks for slow nodes.

o Uses result from whichever finishes first.

2.8 Job Scheduling

 Hadoop YARN manages resource allocation and task scheduling.

 Supports:

o FIFO: First-come-first-served.

o Fair Scheduler: Equal sharing among users/jobs.

o Capacity Scheduler: Allocates resources in a multi-tenant environment.

2.9 Shuffle and Sort

 Takes place between Map and Reduce stages.

 Shuffle: Transfers map output to appropriate reducers.

 Sort: Orders the intermediate data by key so that all values for a key are grouped together.

 Crucial for ensuring all values for a key go to the same reducer.

2.10 Task Execution

 Map and Reduce tasks are executed as separate JVMs on worker nodes (usually the same machines that run DataNodes).

 TaskTracker/NodeManager monitors progress and manages retries.

2.11 MapReduce Types

 Map-only Job: No reducer; used for filtering or parsing.

 MapReduce Job: Typical job with both stages.

 Reduce-only Job: Rare; MapReduce always runs a map phase, so in practice this is an identity mapper followed by reducers that aggregate pre-grouped data.

2.12 Input Formats


 Defines how data is read into a job:

o TextInputFormat – Default; lines of text.

o KeyValueTextInputFormat – Splits each line into a key and a value at a separator character (tab by default).

o SequenceFileInputFormat – Binary format for key-value pairs.

2.13 Output Formats

 Defines how final data is written:

o TextOutputFormat – Default output as plain text.

o SequenceFileOutputFormat – Outputs binary files.

o Custom formats can be created for specialized use cases.
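
 Example – input and output formats are selected in the driver; a short sketch using standard library classes:

// Read each line as key<TAB>value instead of (offset, line)
job.setInputFormatClass(KeyValueTextInputFormat.class);
// Write the final output as a binary SequenceFile instead of plain text
job.setOutputFormatClass(SequenceFileOutputFormat.class);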

2.14 MapReduce Features

 Processes huge datasets across clusters.

 Highly scalable and fault-tolerant.

 Handles structured and unstructured data.

 Integrates well with Hadoop ecosystem tools.

 Optimized for batch processing, not real-time tasks.

2.15 Real-world Applications of MapReduce

 Search Engines: Indexing web pages.

 Retail: Customer trend analysis.

 Banking: Fraud detection.

 Social Media: Analyzing user behavior.

 Bioinformatics: DNA sequencing.

 Weather Forecasting: Processing sensor data.


📘 UNIT 3: HDFS, HADOOP I/O AND THE HADOOP ENVIRONMENT – DETAILED EXPLANATION

🔷 1. HDFS (Hadoop Distributed File System)

1.1 Design of HDFS

 Modeled after Google File System (GFS).

 Designed for high throughput access to large datasets.

 Assumes hardware failures are common and handles them gracefully.

 Optimized for write-once-read-many access patterns.

1.2 HDFS Concepts

 NameNode: Manages metadata and namespace.

 DataNodes: Store actual file blocks.

 Block: A unit of storage; files are split into blocks.

 Replication: Blocks are replicated (default 3) across different nodes for fault tolerance.

1.3 Benefits and Challenges of HDFS

Benefits:

 Fault-tolerant

 Highly scalable

 Cost-effective (uses commodity hardware)

 Suitable for batch processing

Challenges:

 Not ideal for low-latency access

 Write-once-read-many limitation

 All file-system metadata is held in NameNode memory, which limits the number of files a cluster can manage (the small-files problem).

1.4 File Sizes and Block Sizes in HDFS

 File Sizes: Can be terabytes in size.

 Block Size: Default is 128 MB (Hadoop 2.x and later); often raised to 256 MB.

 Large block sizes reduce overhead and improve performance.

1.5 Block Abstraction in HDFS

 Files are split into blocks.

 Blocks are distributed across DataNodes.

 Each block is treated independently, and replication ensures durability.


1.6 Data Replication

 Default replication factor is 3.

 Ensures fault tolerance and data availability.

 NameNode maintains a replica placement policy.

1.7 How HDFS Stores, Reads, and Writes Files

 Storing: File is split into blocks and written across DataNodes.

 Reading: Client contacts NameNode to locate blocks, then reads from DataNodes.

 Writing: Data is first written to a pipeline of DataNodes and acknowledged upon successful
replication.

1.8 Java Interfaces to HDFS

 HDFS provides a rich Java API.

 Common interfaces:

o FileSystem

o FSDataInputStream

o FSDataOutputStream

 Example:

FileSystem fs = FileSystem.get(new Configuration());

FSDataInputStream in = fs.open(new Path("/file.txt"));
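
A slightly fuller, self-contained sketch that reads an HDFS file and copies it to standard output (the path argument and class name are illustrative):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // the file system named by fs.defaultFS
        try (InputStream in = fs.open(new Path(args[0]))) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file contents to stdout
        }
    }
}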

1.9 Command Line Interface (CLI)

 Common HDFS commands:

o hdfs dfs -ls /

o hdfs dfs -put localfile /hdfsfile

o hdfs dfs -get /hdfsfile localfile

o hdfs dfs -rm /file

1.10 Hadoop File System Interfaces

 Hadoop supports multiple file systems:

o HDFS

o Local FS

o Amazon S3
o Azure Blob Storage

o All use the FileSystem abstract class

1.11 Data Flow in HDFS

 Write Flow: Client → NameNode (metadata) → DataNodes (data)

 Read Flow: Client → NameNode (block info) → DataNodes (read data directly)

1.12 Data Ingest with Flume and Sqoop

 Apache Flume: Collects, aggregates, and moves log data to HDFS.

o Useful for ingesting streaming data.

 Apache Sqoop: Imports and exports data between HDFS and relational databases (like
MySQL, Oracle).

o Used for ETL tasks.
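
o Example – a typical import invocation looks roughly like this (connection string, table and target path are placeholders): sqoop import --connect jdbc:mysql://dbhost/sales --table orders --target-dir /data/orders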

1.13 Hadoop Archives (HAR)

 Packs a large number of small files into a single archive file (HAR files do not compress the data).

 Improves NameNode performance by reducing metadata load.
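
 Example – archives are created with the archive tool (paths are placeholders): hadoop archive -archiveName logs.har -p /user/logs /user/archives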

🔷 2. Hadoop I/O

2.1 Compression

 Reduces storage needs and speeds up data transfer.

 Types:

o Gzip

o BZip2

o Snappy

 Hadoop can compress both intermediate map output and final job output.
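
 Example – a driver-side sketch enabling compression with standard properties and codecs (assumes the usual org.apache.hadoop.io.compress imports; the codec choice depends on the workload):

Configuration conf = job.getConfiguration();
// Compress intermediate map output to cut shuffle traffic
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);
// Compress the final job output with Gzip
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);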

2.2 Serialization

 Process of converting data into a format for storage or transmission.

 Hadoop uses Writable for efficient serialization.

 Custom data types must implement the Writable interface (keys must implement WritableComparable so they can be sorted).
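
 Example – a minimal custom value type implementing Writable (field names are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A simple (sum, count) pair usable as a MapReduce value type
public class SumCountWritable implements Writable {
    private long sum;
    private long count;

    public SumCountWritable() {}                      // Hadoop needs a no-argument constructor

    public SumCountWritable(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(sum);                           // serialize fields in a fixed order
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sum = in.readLong();                          // deserialize in exactly the same order
        count = in.readLong();
    }
}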

2.3 Avro

 Row-oriented data serialization framework.

 Supports:

o Rich data structures


o Schema evolution

 Integrates well with Hive and Pig.

2.4 File-based Data Structures

 SequenceFile: Binary key-value pair format.

 MapFile: Sorted SequenceFile with an index.

 Parquet: Columnar storage, optimized for analytics.

🔷 3. Hadoop Environment

3.1 Setting Up a Hadoop Cluster

 Cluster: A collection of machines working together.

 Types:

o Single-node cluster: For development/testing.

o Multi-node cluster: For production workloads.

3.2 Cluster Specification

 Depends on:

o Storage needs

o Number of concurrent users

o Processing requirements

 Typical components:

o Master node (NameNode, ResourceManager)

o Worker nodes (DataNode, NodeManager)

3.3 Cluster Setup and Installation

 Install Java and Hadoop binaries.

 Configure core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.

 Format the NameNode and start daemons.
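
 On a typical Apache Hadoop installation the final steps look roughly like this (run as the HDFS/YARN service user):

o hdfs namenode -format

o start-dfs.sh

o start-yarn.sh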

3.4 Hadoop Configuration

 Key configuration files:

o core-site.xml: General Hadoop settings.

o hdfs-site.xml: HDFS settings.

o mapred-site.xml: MapReduce configuration.

o yarn-site.xml: YARN resource manager config.


 Important parameters:

o fs.defaultFS

o dfs.replication

o yarn.nodemanager.resource.memory-mb
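
 Example – a minimal core-site.xml and hdfs-site.xml might contain entries like these (the host name and port are placeholders):

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>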

3.5 Security in Hadoop

 Supports:

o Kerberos authentication

o Access Control Lists (ACLs)

o Service-level authorization

o Data encryption in transit and at rest

3.6 Administering Hadoop

 Tools:

o jps – Java process status.

o hdfs dfsadmin – HDFS admin operations.

o yarn top – Resource manager monitoring.

 Routine tasks:

o Managing disk usage

o Monitoring node health

o Restarting failed services

3.7 HDFS Monitoring & Maintenance

 Tools: Ambari, Cloudera Manager, Grafana, Prometheus

 Tasks include:

o Checking replication

o Verifying data integrity

o Handling disk failures

3.8 Hadoop Benchmarks

 Measure performance of Hadoop clusters.

 Common tools:

o TestDFSIO

o TeraSort

o HiBench
 Used for tuning and capacity planning.

3.9 Hadoop in the Cloud

 Cloud providers offer Hadoop as a service:

o AWS EMR

o Azure HDInsight

o Google Cloud Dataproc

 Advantages:

o Scalability

o Pay-per-use

o Managed infrastructure
