Complete Hadoop Notes - Line by Line with Topics

These notes cover the Hadoop Distributed Filesystem (HDFS) and its architecture, including namenodes, datanodes, blocks, replication, and fault tolerance. They also cover command-line interface usage, security considerations, alternative access methods (HTTP, FUSE, the Java API), and integration with Apache Flume and Sqoop for data ingestion and extraction.

Uploaded by

ashinisjdf
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
37 views4 pages

Complete Hadoop Notes Final

The document provides comprehensive notes on Hadoop, focusing on the Hadoop Distributed Filesystem (HDFS) and its architecture, including components like Namenodes and Datanodes, as well as replication and fault tolerance. It also covers command-line interface usage, security considerations, and integration with tools like Apache Flume and Sqoop for data ingestion and extraction. Key features of HDFS include handling large files, optimized data access, and support for high availability and federation.

Uploaded by

ashinisjdf
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 4

Complete Hadoop Notes - Line by Line with Topics

The Hadoop Distributed Filesystem (HDFS)

The Design of HDFS


Very large files: HDFS handles files ranging from hundreds of megabytes to petabytes.

Streaming data access: Optimized for a write-once, read-many-times access pattern.

Commodity hardware: Runs on clusters of inexpensive, commonly available hardware, and is designed to keep working despite frequent node failures.

Low-latency data access: Not a good fit; HDFS is optimized for throughput at the expense of latency, so applications needing access in tens of milliseconds should look elsewhere.

Lots of small files: Limited by the namenode's memory, since all file metadata is held in RAM.

Multiple writers, arbitrary file modifications: Not supported; a file has a single writer, and writes are append-only.

HDFS Concepts - Blocks


Block size: HDFS blocks are 128 MB by default, much larger than the 512-byte blocks of a typical disk.

File division: Files in HDFS are broken into block-sized chunks, stored as independent units across the cluster.

Storage efficiency: A file smaller than a block occupies only the space it needs, not a full block.

Large block size: HDFS blocks are large to minimize the cost of seeks relative to data transfer; the larger the block, the more time is spent transferring data rather than seeking to the start of it. See the worked example after this list.
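
As a rough illustration, assuming a typical 10 ms seek time and a 100 MB/s transfer rate, keeping seek overhead to about 1% of transfer time implies a block of roughly 100 MB:

    seek time             ~ 10 ms
    transfer rate         ~ 100 MB/s
    target seek overhead  ~ 1% of transfer time
    transfer time needed  = 10 ms / 1% = 1 s
    block size            ~ 100 MB/s x 1 s = 100 MB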

HDFS Concepts Continued


Block size calculation: To keep seek time to about 1% of transfer time, the block size works out to roughly 100 MB at typical disk speeds; the default is 128 MB.

Multiple blocks for large files: A large file spans multiple blocks, which may be stored on different disks and machines, so a file can even be larger than any single disk in the cluster.

Simplified storage management: Fixed-size blocks simplify storage and metadata management in a distributed system.

Fault tolerance: Blocks are replicated across multiple machines for redundancy and availability.

Replication factor: Each block typically has three replicas, so data remains available even if a machine fails.

HDFS fsck: The fsck command lists the blocks that make up each file and checks their health, as shown below.
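
For example, to list the blocks making up each file under the root directory:

    hdfs fsck / -files -blocks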

Namenodes and Datanodes


Namenode and datanodes: HDFS uses a master-worker model with one namenode (master) and multiple datanodes (workers).

Namenode role: Manages the filesystem namespace; the namespace image and edit log are stored persistently on local disk, while block locations are not persisted and are rebuilt from datanode reports at startup.

Datanodes role: Store and retrieve blocks, and periodically report back to the namenode with lists of the blocks they hold.

Failure of the namenode: If the namenode is lost, the filesystem is unusable, since files cannot be reconstructed from the blocks without the metadata.

Namenode backup: Hadoop can write the namenode's persistent state to multiple filesystems (for example, local disk plus a remote NFS mount) for resilience; see the sketch after this list.

Secondary namenode: Periodically merges the namespace image with the edit log to keep the edit log from growing without bound; its (lagged) copy of the metadata can help recover from a primary namenode failure.
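
A minimal hdfs-site.xml sketch writing namenode metadata to two locations (the paths, including the NFS mount, are illustrative):

    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/disk1/hdfs/name,/remote/hdfs/name</value>
    </property>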

Components of HDFS

NameNode: Maintains and manages the filesystem metadata (e.g., which blocks make up a file, and on which datanodes those blocks are stored).

DataNode: Stores the actual data blocks; a cluster typically has many datanodes.

HDFS Federation and High Availability


Block caching: Frequently accessed blocks can be cached in a datanode's memory for faster reads.

Configurable caching: By default a block is cached in only one datanode's memory, adjustable on a per-file basis.

Job scheduler optimization: Schedulers (for MapReduce, Spark, etc.) can place tasks on nodes where a block is cached.

Cache directives: Users tell the namenode which files to cache, and for how long, by adding cache directives to a cache pool; see the example after this list.

HDFS federation: Multiple namenodes each manage a portion of the filesystem namespace, letting the cluster scale beyond a single namenode's memory.

Independent namenodes: Each manages its own namespace and does not communicate with the others, so the failure of one does not affect the rest.

Block pool storage: Datanodes register with every namenode and store blocks from multiple block pools.

Client-side mounting: Clients use ViewFileSystem and client-side mount tables to see a single namespace over a federated cluster; see the configuration sketch after this list.


Single point of failure: Without HA, the namenode is a single point of failure; if it fails, the filesystem is down until a replacement namenode is brought up.

High availability (HA): Hadoop 2 supports a pair of namenodes in an active-standby configuration, so the standby can take over when the active namenode fails.
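
A sketch of adding a cache directive with the hdfs cacheadmin tool (the pool and path names are illustrative):

    hdfs cacheadmin -addPool hot-data
    hdfs cacheadmin -addDirective -path /user/tom/lookup-table -pool hot-data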
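
For client-side mounting over a federated cluster, core-site.xml can map parts of the namespace to different namenodes. A sketch, assuming illustrative namenode hostnames nn1 and nn2:

    <property>
      <name>fs.defaultFS</name>
      <value>viewfs://clusterX</value>
    </property>
    <property>
      <name>fs.viewfs.mounttable.clusterX.link./user</name>
      <value>hdfs://nn1:8020/user</value>
    </property>
    <property>
      <name>fs.viewfs.mounttable.clusterX.link./data</name>
      <value>hdfs://nn2:8020/data</value>
    </property>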

Command-Line Interface

Basic CLI Usage


Interacting with HDFS: The command line is a simple and familiar way to interact with HDFS.

Pseudodistributed mode: Set fs.defaultFS to hdfs://localhost/ in core-site.xml to make HDFS the default filesystem.

Replication factor: Set dfs.replication to 1 so blocks are not flagged as under-replicated on a single-datanode cluster.

Basic operations: copyFromLocal, copyToLocal, mkdir, and ls cover day-to-day file management.

Copying files: Use hadoop fs -copyFromLocal and hadoop fs -copyToLocal, as shown below.
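
For example, copying a file into HDFS and back out again (the paths are illustrative):

    hadoop fs -mkdir /user/tom
    hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt
    hadoop fs -copyToLocal /user/tom/quangle.txt quangle.copy.txt
    hadoop fs -ls /user/tom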

Advanced CLI Options


Verifying integrity: Compare checksums (for example, with md5sum) of a file before and after a round trip through HDFS, as shown below.

File listings: hadoop fs -ls shows permissions, replication factor, owner, group, size, and modification time.

HDFS permissions: A POSIX-like model with read (r), write (w), and execute (x) permissions for owner, group, and others.

Security considerations: Security is off by default, and the client's identity is taken from the client process, so identity spoofing is possible unless Kerberos authentication is enabled.

Superuser privileges: The identity that started the namenode process is the superuser, for which no permission checks are performed.
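
For example, checking that a file survived the round trip intact (filenames are illustrative and match the copy example above):

    md5sum input/docs/quangle.txt quangle.copy.txt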

Hadoop Filesystems

HTTP
Java API: HDFS's native API is Java, which makes non-Java access awkward.

WebHDFS: Exposes an HTTP REST API, as sketched below.

Two access methods: Directly, with the HDFS daemons serving HTTP requests themselves, or via a proxy that accesses HDFS on the client's behalf using the usual DistributedFileSystem API.
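
A sketch of a WebHDFS call, assuming an illustrative namenode hostname and the default HTTP port 50070 (9870 on Hadoop 3):

    curl -i "http://namenode:50070/webhdfs/v1/user/tom?op=LISTSTATUS"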

FUSE
FUSE: Filesystem in Userspace allows filesystems implemented in user space to be mounted and used as standard Unix filesystems.

Hadoop Fuse-DFS: Mounts HDFS as a local filesystem; implemented in C using libhdfs.

Alternative: The Hadoop NFS gateway is the more robust option and is generally preferred; a mount sketch follows.
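
With the NFS gateway running, HDFS can be mounted with a standard NFSv3 mount. A sketch, assuming an illustrative gateway hostname and mount point:

    mount -t nfs -o vers=3,proto=tcp,nolock,sync nfs-gateway-host:/ /mnt/hdfs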

Java Interface
Reading data from a Hadoop URL: java.net.URL can read from HDFS once FsUrlStreamHandlerFactory is registered.

Displaying files with and without seek(): FileSystem.open() returns an FSDataInputStream, which supports seeking back to any position; see the sketch below.

Writing and deleting data: FileSystem.create() writes new files and FileSystem.delete() removes files or directories.
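
A minimal sketch of reading a file from HDFS twice, seeking back to the start in between (the class name and the example URI are illustrative; the API calls are the standard Hadoop Java interface):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Print an HDFS file to stdout twice, using seek() between passes.
    public class FileSystemDoubleCat {
        public static void main(String[] args) throws Exception {
            String uri = args[0]; // e.g. hdfs://localhost/user/tom/quangle.txt
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(new Path(uri));
                IOUtils.copyBytes(in, System.out, 4096, false); // first pass
                in.seek(0); // go back to the start of the file
                IOUtils.copyBytes(in, System.out, 4096, false); // second pass
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }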

Data Flow and Flume

Flume Overview
Apache Flume: Ingests high-volume, event-based data into Hadoop.

Typical use: Collecting log files from web servers and delivering the events to HDFS.

Other destinations: Flume can also deliver events to systems such as HBase and Solr.

Agents: Long-running Java processes that move data through sources, channels, and sinks.

Flume Architecture
Multi-agent setups: Agents can be tiered to aggregate events from many collectors and forward them to a central store.

Example flows: A spooling directory source paired with a file or logger sink; see the configuration sketch after this list.

Avro sink-source pair: Connects one agent tier to the next over the network.
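
A minimal sketch of a single-agent configuration with a spooling directory source, a file channel, and a logger sink (the agent name and spool directory are illustrative):

    agent1.sources = source1
    agent1.sinks = sink1
    agent1.channels = channel1

    agent1.sources.source1.channels = channel1
    agent1.sinks.sink1.channel = channel1

    agent1.sources.source1.type = spooldir
    agent1.sources.source1.spoolDir = /tmp/spooldir

    agent1.sinks.sink1.type = logger
    agent1.channels.channel1.type = file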

Sqoop

Sqoop Features
Imports data from structured data stores (such as relational databases) into Hadoop.

Supports loading into Hive and HBase, and exporting results back to the original databases.

Sqoop 2 introduces a server component, along with a CLI, a web UI, and a REST API.

Sqoop 2 also aims for improved extensibility and Spark support.

Sqoop 1 is the stable release; Sqoop 2 is under development.

Import & Export


Imports run as parallel MapReduce jobs that write the data into HDFS.

Exports move processed results in parallel from Hadoop back into external data stores; see the example invocations below.
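
Illustrative import and export invocations (the JDBC connection string, table names, and directories are assumptions):

    sqoop import --connect jdbc:mysql://localhost/hadoopguide \
        --table widgets -m 1
    sqoop export --connect jdbc:mysql://localhost/hadoopguide \
        --table sales_by_zip \
        --export-dir /user/hive/warehouse/zip_profits -m 1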
