Unit 2 & 3: Big Data Notes

📘 UNIT 2: HADOOP AND MAPREDUCE – DETAILED EXPLANATION

🔷 PART 1: HADOOP

1.1 History of Hadoop

 Origin: Hadoop started as part of the Nutch project, an open-source web search engine.

 Doug Cutting and Mike Cafarella found Google's GFS and MapReduce concepts ideal for
managing large data.

 To handle the growing web content, they separated the distributed processing/storage parts
from Nutch.

 Named "Hadoop" after Doug Cutting’s son’s toy elephant.

 2006: Yahoo! adopted Hadoop and contributed heavily to its development.

 2008: Hadoop became a top-level Apache project.

1.2 Apache Hadoop

 An open-source framework for storing and processing large datasets in a distributed manner.

 Written in Java.

 Supports parallel processing across many machines.

 It provides:

o Storage (via HDFS)

o Computation (via MapReduce)

o Resource Management (via YARN)

1.3 Hadoop Distributed File System (HDFS)

 Core component of Hadoop, responsible for data storage.

 Block-based storage: Splits files into large blocks (default 128 MB in Hadoop 2.x and later; often raised to 256 MB).

 Each block is stored across multiple machines with replication (default 3 copies) for fault
tolerance.

 Architecture:

o NameNode: Master server; stores metadata (file names, locations).

o DataNode: Worker nodes; store actual data blocks.


o Secondary NameNode: Periodically merges the edit log into the filesystem image (checkpointing); it is not a standby NameNode.

1.4 Components of Hadoop

1. HDFS – Stores large data reliably across clusters.

2. MapReduce – Processes large data sets.

3. YARN – Manages cluster resources and schedules tasks.

4. Hadoop Common – Provides libraries and utilities used by other modules.

1.5 Data Format in Hadoop

 Hadoop supports different data formats:

o Text files (CSV, JSON, XML)

o Binary formats (Avro, Parquet, ORC)

o Sequence files: Binary key-value format used internally.

 Within MapReduce, keys and values are represented as Writable objects for fast serialization/deserialization.

1.6 Analyzing Data with Hadoop

 Hadoop processes data using the MapReduce model.

 Suitable for:

o ETL operations

o Data aggregation

o Statistical computation

 Scales to handle petabytes of data.

 Enables parallel computation by distributing data and tasks.

1.7 Scaling Out

 Hadoop is designed to scale horizontally.

 Add more nodes to increase capacity, instead of upgrading hardware.

 No single processing bottleneck, since work is spread across nodes (metadata, however, remains centralized on the NameNode).

 Cost-efficient as it works on commodity hardware.

1.8 Hadoop Streaming


 Allows developers to write MapReduce programs in any language (not just Java).

 Works via command-line and uses stdin and stdout for data communication.

 Example: A Python script can act as a mapper or reducer.

1.9 Hadoop Pipes

 Provides C++ API to write MapReduce programs.

 Uses sockets (not JNI) for communication between the C++ process and the Java-based Hadoop task.

 Suitable for developers who prefer C++ over Java.

1.10 Hadoop Ecosystem

 Extends Hadoop’s capabilities with various tools:

o Hive: SQL-like language for querying data stored in HDFS.

o Pig: High-level scripting language for data processing.

o HBase: NoSQL database on top of HDFS.

o Sqoop: Tool for importing/exporting data between Hadoop and RDBMS.

o Flume: Used to collect and aggregate log data.

o Oozie: Workflow scheduler for Hadoop jobs.

o Zookeeper: Coordinates distributed systems.

o Spark: In-memory computation engine for fast processing.

🔷 PART 2: MAPREDUCE

2.1 MapReduce Framework and Basics

 A programming model for parallel processing of large datasets.

 Developed by Google and adopted by Hadoop.

 Two main functions:

o Map: Takes input and produces intermediate key-value pairs.

o Reduce: Merges and summarizes intermediate results based on key.

2.2 How MapReduce Works

1. Input data is split into fixed-size chunks.


2. Mappers process these chunks and output intermediate key-value pairs.

3. The framework performs shuffle and sort to group data by keys.

4. Reducers process grouped data and produce final output.

5. Output is stored in HDFS.
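
For example, counting words over two small input splits, "the cat sat" and "the dog", works out as follows:

Map output: (the,1) (cat,1) (sat,1) and (the,1) (dog,1)
After shuffle and sort: (cat,[1]) (dog,[1]) (sat,[1]) (the,[1,1])
Reduce output: (cat,1) (dog,1) (sat,1) (the,2)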

2.3 Developing a MapReduce Application

 Consists of three main components:

1. Mapper: Processes input and emits intermediate key-value pairs.

2. Reducer: Aggregates values by key.

3. Driver: Sets job configuration (input/output paths, mapper/reducer classes).

 Written primarily in Java, but other languages are supported via streaming.
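
 Example – a minimal sketch of the classic word-count job using the standard org.apache.hadoop.mapreduce API (class names and the whitespace tokenizer here are illustrative choices):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in a line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    // Driver: configures the job; input and output paths come from the command line
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}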

2.4 Unit Tests with MRUnit

 MRUnit is a Java library for testing MapReduce programs.

 Allows testing of:

o Mapper

o Reducer

o Complete job logic

 Ensures correctness before deploying to production.
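
 Example – a sketch of a mapper test with MRUnit, assuming the TokenizerMapper from the word-count sketch above and MRUnit 1.x on the classpath:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Bind the MRUnit driver to the mapper under test
        mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
    }

    @Test
    public void emitsOnePairPerWord() throws Exception {
        mapDriver
            .withInput(new LongWritable(0), new Text("the cat"))   // simulated input record
            .withOutput(new Text("the"), new IntWritable(1))       // expected output, in emission order
            .withOutput(new Text("cat"), new IntWritable(1))
            .runTest();                                             // fails the test on any mismatch
    }
}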

2.5 Test Data and Local Tests

 Local testing helps validate logic without using an entire Hadoop cluster.

 Developers can use sample data and run jobs in local mode.
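
 Example – a sketch of pointing a job at the local job runner and the local file system for a quick test (property values are the standard ones, paths are illustrative, imports as in the word-count sketch):

Configuration conf = new Configuration();
conf.set("mapreduce.framework.name", "local"); // run with the local job runner instead of YARN
conf.set("fs.defaultFS", "file:///");          // read and write the local file system
Job job = Job.getInstance(conf, "word count (local test)");
// ...set mapper, reducer and key/value classes as usual, then use a small sample file
FileInputFormat.addInputPath(job, new Path("sample-input.txt"));
FileOutputFormat.setOutputPath(job, new Path("sample-output"));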

2.6 Anatomy of a MapReduce Job Run

1. Job submission by the client.

2. Input splits assigned to mappers.

3. Mapper output is shuffled and sorted.

4. Reducer processes the intermediate data.

5. Final output is saved in HDFS.


2.7 Failures in MapReduce

 Fault-tolerant system:

o Automatically retries failed tasks.

o Keeps track of task status so that only failed tasks are re-run, not the whole job.

 Speculative execution:

o Launches duplicate tasks for slow nodes.

o Uses result from whichever finishes first.

2.8 Job Scheduling

 Hadoop YARN manages resource allocation and task scheduling.

 Supports:

o FIFO: First-come-first-served.

o Fair Scheduler: Equal sharing among users/jobs.

o Capacity Scheduler: Allocates resources in a multi-tenant environment.

2.9 Shuffle and Sort

 Takes place between Map and Reduce stages.

 Shuffle: Transfers map output to appropriate reducers.

 Sort: Orders the intermediate data by key so that all values for a key are grouped together.

 Crucial for ensuring all values for a key go to the same reducer.

2.10 Task Execution

 Map and Reduce tasks are executed as separate JVMs on worker nodes (usually the same machines that run DataNodes).

 TaskTracker/NodeManager monitors progress and manages retries.

2.11 MapReduce Types

 Map-only Job: No reducer; used for filtering or parsing.

 MapReduce Job: Typical job with both stages.

 Reduce-only Job: Rare; MapReduce always runs a map phase, so in practice this is an identity mapper followed by reducers that aggregate pre-grouped data.

2.12 Input Formats


 Defines how data is read into a job:

o TextInputFormat – Default; lines of text.

o KeyValueTextInputFormat – Splits each line into a key and a value at a separator character (tab by default).

o SequenceFileInputFormat – Binary format for key-value pairs.

2.13 Output Formats

 Defines how final data is written:

o TextOutputFormat – Default output as plain text.

o SequenceFileOutputFormat – Outputs binary files.

o Custom formats can be created for specialized use cases.
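
 Example – input and output formats are selected in the driver; a short sketch using standard library classes:

// Read each line as key<TAB>value instead of (offset, line)
job.setInputFormatClass(KeyValueTextInputFormat.class);
// Write the final output as a binary SequenceFile instead of plain text
job.setOutputFormatClass(SequenceFileOutputFormat.class);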

2.14 MapReduce Features

 Processes huge datasets across clusters.

 Highly scalable and fault-tolerant.

 Handles structured and unstructured data.

 Integrates well with Hadoop ecosystem tools.

 Optimized for batch processing, not real-time tasks.

2.15 Real-world Applications of MapReduce

 Search Engines: Indexing web pages.

 Retail: Customer trend analysis.

 Banking: Fraud detection.

 Social Media: Analyzing user behavior.

 Bioinformatics: DNA sequencing.

 Weather Forecasting: Processing sensor data.


📘 UNIT 3: HDFS, HADOOP I/O AND THE HADOOP ENVIRONMENT – DETAILED EXPLANATION

🔷 1. HDFS (Hadoop Distributed File System)

1.1 Design of HDFS

 Modeled after Google File System (GFS).

 Designed for high throughput access to large datasets.

 Assumes hardware failures are common and handles them gracefully.

 Optimized for write-once-read-many access patterns.

1.2 HDFS Concepts

 NameNode: Manages metadata and namespace.

 DataNodes: Store actual file blocks.

 Block: A unit of storage; files are split into blocks.

 Replication: Blocks are replicated (default 3) across different nodes for fault tolerance.

1.3 Benefits and Challenges of HDFS

Benefits:

 Fault-tolerant

 Highly scalable

 Cost-effective (uses commodity hardware)

 Suitable for batch processing

Challenges:

 Not ideal for low-latency access

 Write-once-read-many limitation

 All file-system metadata is held in NameNode memory, which limits the number of files a cluster can manage (the small-files problem).

1.4 File Sizes and Block Sizes in HDFS

 File Sizes: Can be terabytes in size.

 Block Size: Default is 128 MB (Hadoop 2.x and later); often raised to 256 MB.

 Large block sizes reduce overhead and improve performance.

1.5 Block Abstraction in HDFS

 Files are split into blocks.

 Blocks are distributed across DataNodes.

 Each block is treated independently, and replication ensures durability.


1.6 Data Replication

 Default replication factor is 3.

 Ensures fault tolerance and data availability.

 NameNode maintains a replica placement policy.

1.7 How HDFS Stores, Reads, and Writes Files

 Storing: File is split into blocks and written across DataNodes.

 Reading: Client contacts NameNode to locate blocks, then reads from DataNodes.

 Writing: Data is first written to a pipeline of DataNodes and acknowledged upon successful
replication.

1.8 Java Interfaces to HDFS

 HDFS provides a rich Java API.

 Common interfaces:

o FileSystem

o FSDataInputStream

o FSDataOutputStream

 Example:

FileSystem fs = FileSystem.get(new Configuration());

FSDataInputStream in = fs.open(new Path("/file.txt"));
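
A slightly fuller, self-contained sketch that reads an HDFS file and copies it to standard output (the path argument and class name are illustrative):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // the file system named by fs.defaultFS
        try (InputStream in = fs.open(new Path(args[0]))) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream the file contents to stdout
        }
    }
}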

1.9 Command Line Interface (CLI)

 Common HDFS commands:

o hdfs dfs -ls /

o hdfs dfs -put localfile /hdfsfile

o hdfs dfs -get /hdfsfile localfile

o hdfs dfs -rm /file

1.10 Hadoop File System Interfaces

 Hadoop supports multiple file systems:

o HDFS

o Local FS

o Amazon S3
o Azure Blob Storage

o All use the FileSystem abstract class

1.11 Data Flow in HDFS

 Write Flow: Client → NameNode (metadata) → DataNodes (data)

 Read Flow: Client → NameNode (block info) → DataNodes (read data directly)

1.12 Data Ingest with Flume and Sqoop

 Apache Flume: Collects, aggregates, and moves log data to HDFS.

o Useful for ingesting streaming data.

 Apache Sqoop: Imports and exports data between HDFS and relational databases (like
MySQL, Oracle).

o Used for ETL tasks.
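
o Example – a typical import invocation looks roughly like this (connection string, table and target path are placeholders): sqoop import --connect jdbc:mysql://dbhost/sales --table orders --target-dir /data/orders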

1.13 Hadoop Archives (HAR)

 Packs a large number of small files into a single archive file (HAR files do not compress the data).

 Improves NameNode performance by reducing metadata load.
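
 Example – archives are created with the archive tool (paths are placeholders): hadoop archive -archiveName logs.har -p /user/logs /user/archives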

🔷 2. Hadoop I/O

2.1 Compression

 Reduces storage needs and speeds up data transfer.

 Types:

o Gzip

o BZip2

o Snappy

 Hadoop can compress both intermediate map output and final job output.
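
 Example – a driver-side sketch enabling compression with standard properties and codecs (assumes the usual org.apache.hadoop.io.compress imports; the codec choice depends on the workload):

Configuration conf = job.getConfiguration();
// Compress intermediate map output to cut shuffle traffic
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);
// Compress the final job output with Gzip
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);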

2.2 Serialization

 Process of converting data into a format for storage or transmission.

 Hadoop uses Writable for efficient serialization.

 Custom data types must implement the Writable interface (keys must implement WritableComparable so they can be sorted).
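
 Example – a minimal custom value type implementing Writable (field names are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A simple (sum, count) pair usable as a MapReduce value type
public class SumCountWritable implements Writable {
    private long sum;
    private long count;

    public SumCountWritable() {}                      // Hadoop needs a no-argument constructor

    public SumCountWritable(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(sum);                           // serialize fields in a fixed order
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sum = in.readLong();                          // deserialize in exactly the same order
        count = in.readLong();
    }
}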

2.3 Avro

 Row-oriented data serialization framework.

 Supports:

o Rich data structures


o Schema evolution

 Integrates well with Hive and Pig.

2.4 File-based Data Structures

 SequenceFile: Binary key-value pair format.

 MapFile: Sorted SequenceFile with an index.

 Parquet: Columnar storage, optimized for analytics.

🔷 3. Hadoop Environment

3.1 Setting Up a Hadoop Cluster

 Cluster: A collection of machines working together.

 Types:

o Single-node cluster: For development/testing.

o Multi-node cluster: For production workloads.

3.2 Cluster Specification

 Depends on:

o Storage needs

o Number of concurrent users

o Processing requirements

 Typical components:

o Master node (NameNode, ResourceManager)

o Worker nodes (DataNode, NodeManager)

3.3 Cluster Setup and Installation

 Install Java and Hadoop binaries.

 Configure core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.

 Format the NameNode and start daemons.
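
 On a typical Apache Hadoop installation the final steps look roughly like this (run as the HDFS/YARN service user):

o hdfs namenode -format

o start-dfs.sh

o start-yarn.sh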

3.4 Hadoop Configuration

 Key configuration files:

o core-site.xml: General Hadoop settings.

o hdfs-site.xml: HDFS settings.

o mapred-site.xml: MapReduce configuration.

o yarn-site.xml: YARN resource manager config.


 Important parameters:

o fs.defaultFS

o dfs.replication

o yarn.nodemanager.resource.memory-mb
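
 Example – a minimal core-site.xml and hdfs-site.xml might contain entries like these (the host name and port are placeholders):

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>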

3.5 Security in Hadoop

 Supports:

o Kerberos authentication

o Access Control Lists (ACLs)

o Service-level authorization

o Data encryption in transit and at rest

3.6 Administering Hadoop

 Tools:

o jps – Java process status.

o hdfs dfsadmin – HDFS admin operations.

o yarn top – Resource manager monitoring.

 Routine tasks:

o Managing disk usage

o Monitoring node health

o Restarting failed services

3.7 HDFS Monitoring & Maintenance

 Tools: Ambari, Cloudera Manager, Grafana, Prometheus

 Tasks include:

o Checking replication

o Verifying data integrity

o Handling disk failures

3.8 Hadoop Benchmarks

 Measure performance of Hadoop clusters.

 Common tools:

o TestDFSIO

o TeraSort

o HiBench
 Used for tuning and capacity planning.

3.9 Hadoop in the Cloud

 Cloud providers offer Hadoop as a service:

o AWS EMR

o Azure HDInsight

o Google Cloud Dataproc

 Advantages:

o Scalability

o Pay-per-use

o Managed infrastructure
