📘 UNIT 2: HADOOP AND MAPREDUCE – DETAILED EXPLANATION
🔷 PART 1: HADOOP
1.1 History of Hadoop
Origin: Hadoop started as part of the Nutch project, an open-source web search engine.
Doug Cutting and Mike Cafarella found Google's published GFS and MapReduce designs ideal for
managing large volumes of data.
To handle the growing web content, they separated the distributed processing/storage parts
from Nutch.
Named "Hadoop" after Doug Cutting’s son’s toy elephant.
2006: Yahoo! adopted Hadoop and contributed heavily to its development.
2008: Hadoop became a top-level Apache project.
1.2 Apache Hadoop
An open-source framework for storing and processing large datasets in a distributed manner.
Written in Java.
Supports parallel processing across many machines.
It provides:
o Storage (via HDFS)
o Computation (via MapReduce)
o Resource Management (via YARN)
1.3 Hadoop Distributed File System (HDFS)
Core component of Hadoop, responsible for data storage.
Block-based storage: Splits files into large blocks (128 MB by default in Hadoop 2.x and later; commonly raised to 256 MB).
Each block is stored across multiple machines with replication (default 3 copies) for fault
tolerance.
Architecture:
o NameNode: Master server; stores metadata (file names, locations).
o DataNode: Worker nodes; store actual data blocks.
o Secondary NameNode: Assists NameNode by merging edit logs and checkpointing.
1.4 Components of Hadoop
1. HDFS – Stores large data reliably across clusters.
2. MapReduce – Processes large data sets.
3. YARN – Manages cluster resources and schedules tasks.
4. Hadoop Common – Provides libraries and utilities used by other modules.
1.5 Data Format in Hadoop
Hadoop supports different data formats:
o Text files (CSV, JSON, XML)
o Binary formats (Avro, Parquet, ORC)
o Sequence files: Binary key-value format used internally.
Keys and values in MapReduce are represented as Writable objects for fast, compact serialization/deserialization.
1.6 Analyzing Data with Hadoop
Hadoop processes data using the MapReduce model.
Suitable for:
o ETL operations
o Data aggregation
o Statistical computation
Scales to handle petabytes of data.
Enables parallel computation by distributing data and tasks.
1.7 Scaling Out
Hadoop is designed to scale horizontally.
Add more nodes to increase capacity, instead of upgrading individual machines (vertical scaling).
No central bottleneck due to distributed design.
Cost-efficient as it works on commodity hardware.
1.8 Hadoop Streaming
Allows developers to write MapReduce programs in any language (not just Java).
Works via command-line and uses stdin and stdout for data communication.
Example: A Python script can act as a mapper or reducer.
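An illustrative streaming invocation (the jar location varies by installation and version; mapper.py and reducer.py are placeholder scripts that read stdin and write stdout):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /data/input \
  -output /data/output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py -file reducer.py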
1.9 Hadoop Pipes
Provides C++ API to write MapReduce programs.
Uses sockets (not JNI) to communicate between the C++ process and the Java-based Hadoop framework.
Suitable for developers who prefer C++ over Java.
1.10 Hadoop Ecosystem
Extends Hadoop’s capabilities with various tools:
o Hive: SQL-like language for querying data stored in HDFS.
o Pig: High-level scripting language for data processing.
o HBase: NoSQL database on top of HDFS.
o Sqoop: Tool for importing/exporting data between Hadoop and RDBMS.
o Flume: Used to collect and aggregate log data.
o Oozie: Workflow scheduler for Hadoop jobs.
o Zookeeper: Coordinates distributed systems.
o Spark: In-memory computation engine for fast processing.
🔷 PART 2: MAPREDUCE
2.1 MapReduce Framework and Basics
A programming model for parallel processing of large datasets.
Developed by Google and adopted by Hadoop.
Two main functions:
o Map: Takes input and produces intermediate key-value pairs.
o Reduce: Merges and summarizes intermediate results based on key.
2.2 How MapReduce Works
1. Input data is divided into input splits (typically one split per HDFS block).
2. Mappers process these chunks and output intermediate key-value pairs.
3. The framework performs shuffle and sort to group data by keys.
4. Reducers process grouped data and produce final output.
5. Output is stored in HDFS.
2.3 Developing a MapReduce Application
Consists of three main components:
1. Mapper: Processes input and emits intermediate key-value pairs.
2. Reducer: Aggregates values by key.
3. Driver: Sets job configuration (input/output paths, mapper/reducer classes).
Written primarily in Java, but other languages are supported via streaming.
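A minimal word-count sketch in Java (new MapReduce API) showing all three components; class names and paths are illustrative:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in a line
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: wires the job together and submits it
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}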
2.4 Unit Tests with MRUnit
MRUnit is a Java library for testing MapReduce programs.
Allows testing of:
o Mapper
o Reducer
o Complete job logic
Ensures correctness before deploying to production.
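A small MRUnit sketch, assuming the WordCount classes sketched in 2.3 above and MRUnit 1.x plus JUnit on the classpath:
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {

  @Test
  public void mapperEmitsOnePerWord() throws Exception {
    // Feed one input line and assert the expected (word, 1) pairs in emission order
    MapDriver.newMapDriver(new WordCount.TokenizerMapper())
        .withInput(new LongWritable(0), new Text("big data big"))
        .withOutput(new Text("big"), new IntWritable(1))
        .withOutput(new Text("data"), new IntWritable(1))
        .withOutput(new Text("big"), new IntWritable(1))
        .runTest();
  }

  @Test
  public void reducerSumsCounts() throws Exception {
    // Feed grouped values for one key and assert the summed count
    ReduceDriver.newReduceDriver(new WordCount.IntSumReducer())
        .withInput(new Text("big"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
        .withOutput(new Text("big"), new IntWritable(2))
        .runTest();
  }
}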
2.5 Test Data and Local Tests
Local testing helps validate logic without using an entire Hadoop cluster.
Developers can use sample data and run jobs in local mode.
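For quick local tests, the job can be pointed at the local job runner and the local file system; a sketch that replaces the Configuration in the driver sketched earlier:
Configuration conf = new Configuration();
conf.set("mapreduce.framework.name", "local"); // run map and reduce in a single local JVM
conf.set("fs.defaultFS", "file:///");          // use the local file system instead of HDFS
Job job = Job.getInstance(conf, "word count (local test)");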
2.6 Anatomy of a MapReduce Job Run
1. Job submission by the client.
2. Input splits assigned to mappers.
3. Mapper output is shuffled and sorted.
4. Reducer processes the intermediate data.
5. Final output is saved in HDFS.
2.7 Failures in MapReduce
Fault-tolerant system:
o Automatically retries failed tasks.
o Stores job progress.
Speculative execution:
o Launches duplicate tasks for slow nodes.
o Uses result from whichever finishes first.
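These behaviours are controlled by standard job properties; a small sketch of the relevant settings (the values shown are typical defaults, not recommendations):
Configuration conf = job.getConfiguration();
conf.setInt("mapreduce.map.maxattempts", 4);           // attempts per map task before the job is failed
conf.setInt("mapreduce.reduce.maxattempts", 4);        // attempts per reduce task
conf.setBoolean("mapreduce.map.speculative", true);    // allow speculative copies of slow map tasks
conf.setBoolean("mapreduce.reduce.speculative", true); // allow speculative copies of slow reduce tasks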
2.8 Job Scheduling
Hadoop YARN manages resource allocation and task scheduling.
Supports:
o FIFO: First-come-first-served.
o Fair Scheduler: Equal sharing among users/jobs.
o Capacity Scheduler: Allocates resources in a multi-tenant environment.
2.9 Shuffle and Sort
Takes place between Map and Reduce stages.
Shuffle: Transfers map output to appropriate reducers.
Sort: Groups values by keys.
Crucial for ensuring all values for a key go to the same reducer.
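Which reducer receives a key is decided by the partitioner, set with job.setPartitionerClass(...). The class below is a hypothetical custom partitioner for illustration; the default HashPartitioner instead uses (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys by their first character, so all keys starting with the same letter reach the same reducer
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String s = key.toString();
    char first = s.isEmpty() ? ' ' : Character.toLowerCase(s.charAt(0));
    return first % numReduceTasks;
  }
}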
2.10 Task Execution
Map and Reduce tasks are executed as separate JVMs on DataNodes.
TaskTracker/NodeManager monitors progress and manages retries.
2.11 MapReduce Types
Map-only Job: No reducer (the reducer count is set to zero; see the sketch after this list); used for filtering or parsing.
MapReduce Job: Typical job with both stages.
Reduce-only Job: Rare; strictly speaking the map stage always runs, but an identity mapper can pass pre-grouped data straight through for the reducers to aggregate.
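A map-only job (the first case above) is obtained by a single driver call; mapper output is then written straight to the output path with no shuffle, sort, or reduce phase:
job.setNumReduceTasks(0); // disables the shuffle, sort, and reduce phases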
2.12 Input Formats
Defines how data is read into a job:
o TextInputFormat – Default; lines of text.
o KeyValueTextInputFormat – Treats each line as key and value.
o SequenceFileInputFormat – Binary format for key-value pairs.
2.13 Output Formats
Defines how final data is written:
o TextOutputFormat – Default output as plain text.
o SequenceFileOutputFormat – Outputs binary files.
o Custom formats can be created for specialized use cases.
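Input and output formats are both chosen in the driver; a small sketch using classes named in the two lists above (format classes come from org.apache.hadoop.mapreduce.lib.input and lib.output):
job.setInputFormatClass(KeyValueTextInputFormat.class);   // each input line is split at the first tab into key and value
job.setOutputFormatClass(SequenceFileOutputFormat.class); // write results as a binary SequenceFile
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);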
2.14 MapReduce Features
Processes huge datasets across clusters.
Highly scalable and fault-tolerant.
Handles structured and unstructured data.
Integrates well with Hadoop ecosystem tools.
Optimized for batch processing, not real-time tasks.
2.15 Real-world Applications of MapReduce
Search Engines: Indexing web pages.
Retail: Customer trend analysis.
Banking: Fraud detection.
Social Media: Analyzing user behavior.
Bioinformatics: DNA sequencing.
Weather Forecasting: Processing sensor data.
🔷 1. HDFS (Hadoop Distributed File System)
1.1 Design of HDFS
Modeled after Google File System (GFS).
Designed for high throughput access to large datasets.
Assumes hardware failures are common and handles them gracefully.
Optimized for write-once-read-many access patterns.
1.2 HDFS Concepts
NameNode: Manages metadata and namespace.
DataNodes: Store actual file blocks.
Block: A unit of storage; files are split into blocks.
Replication: Blocks are replicated (default 3) across different nodes for fault tolerance.
1.3 Benefits and Challenges of HDFS
Benefits:
Fault-tolerant
Highly scalable
Cost-effective (uses commodity hardware)
Suitable for batch processing
Challenges:
Not ideal for low-latency access
Write-once-read-many limitation
All file system metadata is kept in the NameNode's memory, which limits scalability when there are very many small files.
1.4 File Sizes and Block Sizes in HDFS
File Sizes: Can be terabytes in size.
Block Size: 128 MB by default in Hadoop 2.x and later (64 MB in Hadoop 1.x); often configured larger, e.g. 256 MB.
Large block sizes reduce metadata and seek overhead and improve throughput.
1.5 Block Abstraction in HDFS
Files are split into blocks.
Blocks are distributed across DataNodes.
Each block is treated independently, and replication ensures durability.
1.6 Data Replication
Default replication factor is 3.
Ensures fault tolerance and data availability.
NameNode applies a rack-aware replica placement policy: by default, the first replica goes on the writer's node (or a random node), the second on a node in a different rack, and the third on a different node in that second rack.
1.7 How HDFS Stores, Reads, and Writes Files
Storing: File is split into blocks and written across DataNodes.
Reading: Client contacts NameNode to locate blocks, then reads from DataNodes.
Writing: Data is first written to a pipeline of DataNodes and acknowledged upon successful
replication.
1.8 Java Interfaces to HDFS
HDFS provides a rich Java API.
Common interfaces:
o FileSystem
o FSDataInputStream
o FSDataOutputStream
Example (Java):
// Obtain a handle to the default file system (HDFS, per fs.defaultFS)
FileSystem fs = FileSystem.get(new Configuration());
// Open a file in HDFS for reading
FSDataInputStream in = fs.open(new Path("/file.txt"));
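A matching write-side sketch using FSDataOutputStream, reusing the FileSystem handle above (the path is illustrative):
// Create (or overwrite) a file in HDFS and write a short string to it
FSDataOutputStream out = fs.create(new Path("/out.txt"), true); // true = overwrite if the file exists
out.writeUTF("hello hdfs");
out.close();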
1.9 Command Line Interface (CLI)
Common HDFS commands:
o hdfs dfs -ls /
o hdfs dfs -put localfile /hdfsfile
o hdfs dfs -get /hdfsfile localfile
o hdfs dfs -rm /file
1.10 Hadoop File System Interfaces
Hadoop supports multiple file systems, all accessed through the same FileSystem abstract class:
o HDFS
o Local FS
o Amazon S3
o Azure Blob Storage
1.11 Data Flow in HDFS
Write Flow: Client → NameNode (metadata) → DataNodes (data)
Read Flow: Client → NameNode (block info) → DataNodes (read data directly)
1.12 Data Ingest with Flume and Sqoop
Apache Flume: Collects, aggregates, and moves log data to HDFS.
o Useful for ingesting streaming data.
Apache Sqoop: Imports and exports data between HDFS and relational databases (like
MySQL, Oracle).
o Used for ETL tasks.
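An illustrative Sqoop import (connection string, table name, credentials, and paths are placeholders):
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4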
1.13 Hadoop Archives (HAR)
Packs a large number of small files into a single archive file (a .har file); the files themselves are not compressed.
Improves NameNode performance by reducing metadata load.
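Archives are built with the hadoop archive tool; an illustrative invocation (paths are placeholders) that packs /user/etl/logs into /user/etl/archive/logs.har:
hadoop archive -archiveName logs.har -p /user/etl logs /user/etl/archive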
🔷 2. Hadoop I/O
2.1 Compression
Reduces storage needs and speeds up data transfer.
Types:
o Gzip
o BZip2
o Snappy
Hadoop can compress both intermediate map output and the final job output.
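Compression is usually enabled per job in the driver; a sketch of common settings (codec choice depends on what is installed on the cluster):
Configuration conf = job.getConfiguration();
// Compress intermediate map output to reduce shuffle traffic
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
    org.apache.hadoop.io.compress.SnappyCodec.class,
    org.apache.hadoop.io.compress.CompressionCodec.class);
// Compress the final job output with Gzip
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setCompressOutput(job, true);
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputCompressorClass(
    job, org.apache.hadoop.io.compress.GzipCodec.class);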
2.2 Serialization
Process of converting data into a format for storage or transmission.
Hadoop uses Writable for efficient serialization.
Custom data types must implement the Writable interface.
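A minimal custom Writable (the type and its fields are hypothetical, for illustration). readFields must read fields in exactly the order write wrote them, and a no-argument constructor is required:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Simple (pageViews, bytesServed) pair usable as a MapReduce value type
public class TrafficWritable implements Writable {
  private long pageViews;
  private long bytesServed;

  public TrafficWritable() { } // required by the framework for deserialization

  public TrafficWritable(long pageViews, long bytesServed) {
    this.pageViews = pageViews;
    this.bytesServed = bytesServed;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(pageViews);
    out.writeLong(bytesServed);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    pageViews = in.readLong();
    bytesServed = in.readLong();
  }
}
(Keys additionally need to implement WritableComparable so they can be sorted during the shuffle.)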
2.3 Avro
Row-oriented data serialization framework.
Supports:
o Rich data structures
o Schema evolution
Integrates well with Hive and Pig.
2.4 File-based Data Structures
SequenceFile: Binary key-value pair format.
MapFile: Sorted SequenceFile with an index.
Parquet: Columnar storage, optimized for analytics.
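Writing a SequenceFile through the Java API (the path is illustrative; imports from org.apache.hadoop.conf, fs, and io are assumed):
Configuration conf = new Configuration();
Path path = new Path("/data/pairs.seq");
// Append a few key-value pairs to a new SequenceFile
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(path),
    SequenceFile.Writer.keyClass(Text.class),
    SequenceFile.Writer.valueClass(IntWritable.class));
writer.append(new Text("big"), new IntWritable(2));
writer.append(new Text("data"), new IntWritable(1));
writer.close();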
🔷 3. Hadoop Environment
3.1 Setting Up a Hadoop Cluster
Cluster: A collection of machines working together.
Types:
o Single-node cluster: For development/testing.
o Multi-node cluster: For production workloads.
3.2 Cluster Specification
Depends on:
o Storage needs
o Number of concurrent users
o Processing requirements
Typical components:
o Master node (NameNode, ResourceManager)
o Worker nodes (DataNode, NodeManager)
3.3 Cluster Setup and Installation
Install Java and Hadoop binaries.
Configure core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.
Format the NameNode and start daemons.
3.4 Hadoop Configuration
Key configuration files:
o core-site.xml: General Hadoop settings.
o hdfs-site.xml: HDFS settings.
o mapred-site.xml: MapReduce configuration.
o yarn-site.xml: YARN resource manager config.
Important parameters:
o fs.defaultFS
o dfs.replication
o yarn.nodemanager.resource.memory-mb
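A minimal illustration of how such parameters appear in the XML files (the host name and values are examples only, not recommendations):
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>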
3.5 Security in Hadoop
Supports:
o Kerberos authentication
o Access Control Lists (ACLs)
o Service-level authorization
o Data encryption in transit and at rest
3.6 Administering Hadoop
Tools:
o jps – Java process status.
o hdfs dfsadmin – HDFS admin operations.
o yarn top – Resource manager monitoring.
Routine tasks:
o Managing disk usage
o Monitoring node health
o Restarting failed services
3.7 HDFS Monitoring & Maintenance
Tools: Ambari, Cloudera Manager, Grafana, Prometheus
Tasks include:
o Checking replication
o Verifying data integrity
o Handling disk failures
3.8 Hadoop Benchmarks
Measure performance of Hadoop clusters.
Common tools:
o TestDFSIO
o TeraSort
o HiBench
Used for tuning and capacity planning.
3.9 Hadoop in the Cloud
Cloud providers offer Hadoop as a service:
o AWS EMR
o Azure HDInsight
o Google Cloud Dataproc
Advantages:
o Scalability
o Pay-per-use
o Managed infrastructure