HDFS Commands - Detailed Definitions
1. File and Directory Listing Commands
ls - List Directory Contents
Definition: Lists files and directories in the specified HDFS path, similar to the Unix ls command.
hdfs dfs -ls [options] <path>
Purpose:
View contents of HDFS directories
Check file permissions, ownership, and timestamps
Verify file existence and properties
Options:
-d: List directories as plain entries instead of listing their contents
-R: Recursive listing of subdirectories
-h: Human-readable file sizes
-t: Sort output by modification time (most recent first)
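Example (illustrative; /data/sales is a placeholder path):
hdfs dfs -ls -R -h /data/sales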
mkdir - Make Directory
Definition: Creates one or more directories in the HDFS file system.
hdfs dfs -mkdir [options] <path1> [path2] ...
Purpose:
Create directory structure for data organization
Establish folder hierarchy for different data types
Prepare storage locations before data ingestion
Options:
-p: Create parent directories if they don't exist
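Example (illustrative; the nested path is a placeholder; -p creates the missing parents):
hdfs dfs -mkdir -p /data/raw/2024/01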
rmdir - Remove Directory
Definition: Removes empty directories from HDFS.
hdfs dfs -rmdir <path1> [path2] ...
Purpose:
Clean up empty directory structures
Remove unused organizational folders
Maintain clean file system hierarchy
Note: Only works on empty directories. Use rm -r for non-empty directories.
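Example (illustrative; the path is a placeholder):
hdfs dfs -rmdir /data/tmp/empty_dir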
2. File Transfer Commands
put - Upload Files to HDFS
Definition: Copies files from the local file system to HDFS.
hdfs dfs -put [options] <local_src> ... <hdfs_dest>
Purpose:
Upload data files from local system to Hadoop cluster
Initial data ingestion into HDFS
Transfer processed results back to HDFS
Options:
-f: Force overwrite if destination exists
-p: Preserve file attributes (timestamps, ownership, permissions)
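Example (illustrative; sales.csv and /data/raw are placeholders; -f overwrites any existing copy):
hdfs dfs -put -f sales.csv /data/raw/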
get - Download Files from HDFS
Definition: Copies files from HDFS to the local file system.
hdfs dfs -get [options] <hdfs_src> ... <local_dest>
Purpose:
Download processed results from Hadoop cluster
Extract data for local analysis
Create local backups of HDFS data
Options:
-ignoreCrc: Skip CRC checksum verification
-crc: Also copy CRC files
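Example (illustrative; the source and local destination are placeholders):
hdfs dfs -get /data/results/part-r-00000 ./results.txt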
copyFromLocal - Copy from Local
Definition: Copies files from the local file system to HDFS. Works like put, except that the source must be a local file reference; it fails if the destination already exists unless -f is specified.
hdfs dfs -copyFromLocal <local_src> ... <hdfs_dest>
Purpose:
Safe file upload that prevents accidental overwrites
Initial data loading with protection against duplicates
Batch uploads where file uniqueness is important
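Example (illustrative; the file and target directory are placeholders):
hdfs dfs -copyFromLocal config.xml /apps/etl/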
copyToLocal - Copy to Local
Definition: Copies files from HDFS to the local file system. Works like get, except that the destination must be a local file reference.
hdfs dfs -copyToLocal <hdfs_src> ... <local_dest>
Purpose:
Extract specific files for local processing
Create local copies while keeping HDFS originals
Download configuration or result files
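Example (illustrative; paths are placeholders):
hdfs dfs -copyToLocal /apps/etl/config.xml ./config.xml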
moveFromLocal - Move from Local
Definition: Moves files from the local file system to HDFS; the local copy is deleted after the transfer.
hdfs dfs -moveFromLocal <local_src> ... <hdfs_dest>
Purpose:
Transfer files while saving local disk space
One-time data migration to HDFS
Move temporary files after processing
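Example (illustrative; staging.csv and /data/incoming are placeholders):
hdfs dfs -moveFromLocal staging.csv /data/incoming/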
3. File Manipulation Commands
cp - Copy Files within HDFS
Definition: Copies files or directories from one HDFS location to another.
hdfs dfs -cp [options] <src> ... <dest>
Purpose:
Create backups within HDFS
Duplicate data for different processing pipelines
Reorganize data across different directories
Options:
-p: Preserve file attributes
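Example (illustrative; paths are placeholders; -p preserves the original attributes):
hdfs dfs -cp -p /data/raw/sales.csv /backup/raw/sales.csv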
mv - Move/Rename Files
Definition: Moves or renames files and directories within HDFS.
hdfs dfs -mv <src> ... <dest>
Purpose:
Reorganize data structure
Rename files with better naming conventions
Move data between different organizational hierarchies
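Example (an illustrative rename; paths are placeholders):
hdfs dfs -mv /data/raw/sales_tmp.csv /data/raw/sales_2024.csv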
rm - Remove Files and Directories
Definition: Deletes files and directories from HDFS.
hdfs dfs -rm [options] <path> ...
Purpose:
Clean up unnecessary files
Remove temporary processing files
Delete outdated or corrupted data
Options:
-r or -R: Recursive deletion for directories
-f: Do not report an error if the file does not exist
-skipTrash: Permanent deletion bypassing trash
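Example (illustrative; /data/tmp is a placeholder; -skipTrash makes the deletion immediate and unrecoverable):
hdfs dfs -rm -r -skipTrash /data/tmp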
4. File Content Viewing Commands
cat - Concatenate and Display Files
Definition: Displays the entire content of one or more files to stdout.
hdfs dfs -cat <path> ...
Purpose:
View small file contents
Combine multiple files for display
Quick content verification
Note: Not suitable for large files as it displays entire content.
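Example (illustrative; both paths are placeholders):
hdfs dfs -cat /data/raw/header.csv /data/raw/rows.csv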
head - Display Beginning of File
Definition: Shows the first 1KB of a file's content.
hdfs dfs -head <path>
Purpose:
Preview file structure and format
Check file headers
Verify data format without downloading entire file
Note: Unlike the Unix head command, it returns a fixed number of bytes rather than a number of lines.
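Example (illustrative; the path is a placeholder):
hdfs dfs -head /data/raw/sales_2024.csv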
tail - Display End of File
Definition: Shows the last 1KB of a file's content.
hdfs dfs -tail [options] <path>
Purpose:
View latest entries in log files
Check file endings
Monitor ongoing data writes
Options:
-f: Follow file (continuously display new content)
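Example (illustrative; the log path is a placeholder; -f keeps printing as the file grows):
hdfs dfs -tail -f /logs/app/current.log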
text - Display File as Text
Definition: Displays file content as text, automatically decompressing compressed files.
hdfs dfs -text <path> ...
Purpose:
View compressed files without manual decompression
Display various file formats as readable text
Handle different compression formats automatically
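Example (illustrative; the gzip-compressed path is a placeholder):
hdfs dfs -text /data/archive/events.gz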
5. File Information Commands
stat - Display File Statistics
Definition: Shows specific statistics about files or directories using format specifiers.
hdfs dfs -stat <format> <path> ...
Purpose:
Get specific file properties
Programmatically extract file metadata
Monitor file characteristics
Format Specifiers:
%b: File size in bytes
%o: Block size
%n: File name
%r: Replication factor
%y: Modification time
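Example (illustrative; the path is a placeholder; prints name, size, replication factor, and modification time):
hdfs dfs -stat "%n %b %r %y" /data/raw/sales_2024.csv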
du - Disk Usage
Definition: Shows space consumed by files and directories.
hdfs dfs -du [options] <path> ...
Purpose:
Monitor storage consumption
Identify large files or directories
Plan storage capacity
Options:
-h: Human-readable format (KB, MB, GB)
-s: Summary (total size only)
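Example (illustrative; /data is a placeholder; add -s to show only the total):
hdfs dfs -du -h /data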
df - Display File System Information
Definition: Shows HDFS file system capacity, used space, and available space.
hdfs dfs -df [options] [path]
Purpose:
Monitor overall cluster storage
Check available disk space
Plan data ingestion based on capacity
Options:
-h: Human-readable format
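Example (illustrative):
hdfs dfs -df -h /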
count - Count Files, Directories, and Bytes
Definition: Counts directories, files, and content size for specified paths.
hdfs dfs -count [options] <path> ...
Purpose:
Inventory data organization
Monitor data growth
Generate usage reports
Options:
-h: Human-readable sizes
-q: Show quota information
-u: Show quota usage
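Example (illustrative; /data is a placeholder):
hdfs dfs -count -q -h /data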
6. File Integrity and Testing Commands
checksum - Calculate File Checksum
Definition: Computes and displays checksums for file integrity verification.
hdfs dfs -checksum <path> ...
Purpose:
Verify file integrity after transfers
Detect data corruption
Compare file versions
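Example (illustrative; the path is a placeholder):
hdfs dfs -checksum /data/raw/sales_2024.csv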
test - Test File Properties
Definition: Tests various properties of files and directories, returns exit codes.
hdfs dfs -test <flag> <path>
Purpose:
Script-friendly file existence checking
Conditional operations based on file properties
Automated file validation
Flags:
-e: File exists
-f: Is a file
-d: Is a directory
-z: File is empty
-s: File is not empty
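Example (an illustrative shell check; the path is a placeholder; the exit code drives the branch):
hdfs dfs -test -e /data/raw/sales_2024.csv && echo "exists" || echo "missing"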
7. Permission and Ownership Commands
chmod - Change File Permissions
Definition: Modifies access permissions for files and directories.
hdfs dfs -chmod [options] <mode> <path> ...
Purpose:
Control file access security
Set appropriate read/write permissions
Implement data governance policies
Options:
-R: Recursive permission change
Modes: Octal (755) or symbolic (u+x, g-w, o=r)
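Example (illustrative; /data/shared is a placeholder):
hdfs dfs -chmod -R 750 /data/shared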
chown - Change Ownership
Definition: Changes the owner and/or group of files and directories.
hdfs dfs -chown [options] [owner][:group] <path> ...
Purpose:
Transfer file ownership
Assign data to appropriate teams
Implement organizational data structure
Options:
-R: Recursive ownership change
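Example (illustrative; the user, group, and path are placeholders):
hdfs dfs -chown -R etl_user:analytics /data/warehouse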
chgrp - Change Group Ownership
Definition: Changes only the group ownership of files and directories.
hdfs dfs -chgrp [options] <group> <path> ...
Purpose:
Modify group access without changing owner
Reorganize team-based access controls
Implement departmental data sharing
Options:
-R: Recursive group change
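Example (illustrative; the group and path are placeholders):
hdfs dfs -chgrp -R analytics /data/warehouse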
8. Advanced File Operations
appendToFile - Append to a File
Definition: Appends content from one or more local files (or from stdin when the source is -) to a file in HDFS, creating the destination file if it does not exist.
hdfs dfs -appendToFile <local_src> ... <hdfs_dest>
Purpose:
Add data to existing files
Implement incremental data loading
Append log entries to existing log files
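Example (illustrative; the file names are placeholders):
hdfs dfs -appendToFile new_events.log /logs/app/events.log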
touchz - Create Empty File
Definition: Creates empty files in HDFS (zero-length files).
hdfs dfs -touchz <path> ...
Purpose:
Create placeholder files
Mark completion of processes
Initialize files for later appending
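Example (illustrative; the marker path is a placeholder):
hdfs dfs -touchz /data/raw/_SUCCESS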
getmerge - Merge and Download
Definition: Merges multiple HDFS files into a single local file.
hdfs dfs -getmerge [options] <src> <local_dest>
Purpose:
Combine distributed processing results
Create single output file from multiple parts
Consolidate data for external systems
Options:
-nl: Add newline between merged files
-skip-empty-file: Skip empty files during merge
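Example (illustrative; the source directory and local file are placeholders; -nl separates the merged parts with newlines):
hdfs dfs -getmerge -nl /data/results ./results_merged.txt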
9. Replication Management
setrep - Set Replication Factor
Definition: Changes the replication factor for existing files.
hdfs dfs -setrep [options] <rep> <path> ...
Purpose:
Adjust data redundancy levels
Optimize storage usage
Improve data availability
Options:
-R: Accepted for backwards compatibility; has no effect (directories are already processed recursively)
-w: Wait for replication to complete
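Example (illustrative; /data/critical is a placeholder; -w blocks until the new replication level is reached):
hdfs dfs -setrep -w 3 /data/critical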
10. Administrative Commands
fsck - File System Check
Definition: Checks HDFS file system health and reports issues.
hdfs fsck [options] <path>
Purpose:
Diagnose file system problems
Identify corrupted files
Monitor cluster health
Options:
-files: Show file information
-blocks: Show block information
-locations: Show block locations
-list-corruptfileblocks: List corrupted files
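Example (illustrative; /data is a placeholder):
hdfs fsck /data -files -blocks -locations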
find - Find Files and Directories
Definition: Searches a directory tree for files and directories that match the given expressions (primarily name patterns).
hdfs dfs -find <path> <expression>
Purpose:
Locate files by name patterns
Search directory structures recursively
Generate path lists for scripts
Expressions:
-name pattern: Match by name (case sensitive)
-iname pattern: Match by name (case insensitive)
-print: Print matching paths (the default expression)
Note: Unlike Unix find, HDFS find currently supports only name-based matching; there are no -type, -size, or -mtime expressions.
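Example (illustrative; the path and pattern are placeholders):
hdfs dfs -find /data -name "*.csv" -print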
11. Access Control Lists (ACLs)
getfacl - Get Access Control List
Definition: Displays Access Control List information for files and directories.
hdfs dfs -getfacl [options] <path> ...
Purpose:
View detailed permission settings
Audit access controls
Understand current security configuration
Options:
-R: Recursive ACL display
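Example (illustrative; /data/warehouse is a placeholder):
hdfs dfs -getfacl -R /data/warehouse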
setfacl - Set Access Control List
Definition: Sets or modifies Access Control Lists for fine-grained permissions.
hdfs dfs -setfacl [options] <acl_spec> <path> ...
Purpose:
Implement complex permission schemes
Grant specific user/group access
Override default permission model
Options:
-m: Modify ACL
-x: Remove ACL entries
-b: Remove all ACLs
-R: Recursive ACL setting
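Example (illustrative; the user name and path are placeholders; grants analyst_1 read and execute access):
hdfs dfs -setfacl -m user:analyst_1:r-x /data/warehouse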
12. Data Transfer Between Clusters
distcp - Distributed Copy
Definition: Efficiently copies large amounts of data within or between Hadoop clusters.
hadoop distcp [options] <source> <destination>
Purpose:
Transfer data between clusters
Perform large-scale data migrations
Synchronize data across environments
Options:
-update: Copy only files that are missing at the destination or that differ in size/checksum
-delete: Delete files in destination not in source
-overwrite: Overwrite existing files
-m <num>: Number of mappers to use
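Example (illustrative; the NameNode addresses and paths are placeholders):
hadoop distcp -update -m 20 hdfs://nn1:8020/data/raw hdfs://nn2:8020/data/raw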
Summary of Command Categories
1. File Operations: put, get, cp, mv, rm
2. Directory Operations: ls, mkdir, rmdir
3. Content Viewing: cat, head, tail, text
4. Information Gathering: stat, du, df, count, checksum
5. Permissions: chmod, chown, chgrp, getfacl, setfacl
6. File Manipulation: appendToFile, touchz, getmerge
7. System Administration: fsck, setrep, distcp
8. Search and Test: find, test
Each command serves specific purposes in managing the Hadoop Distributed File System,
from basic file operations to advanced administrative tasks and security management.