Big Data Analytics
BTCSE DED 42
Dept of Computer Sc. & Engg.
Unit 1 (prescribed syllabus)
• Introduction To Big Data And Hadoop
• Types of Digital Data
• Introduction to Big Data
• Big Data Analytics
• History of Hadoop
• Apache Hadoop
• Analysing Data with Unix tools
• Analysing Data with Hadoop
• Hadoop Streaming
• Hadoop Echo System
• IBM Big Data Strategy
• Introduction to Infosphere BigInsights and Big Sheets
Big Data
• Big Data is data whose scale, distribution, diversity, and/or timeliness require the
use of new technical architectures and analytics to enable insights that unlock
new sources of business value - McKinsey Global report, 2011.
• Big Data refers to large amounts of massive data yet increases exponentially in
size over time.
• Data is so extensive and complicated that no usual data management methods
can effectively store or handle it.
• Big data is like data, but enormous.
• Structure of data dictates how to work with it and what insights it may provide
• Attributes of Big Data: Huge volume of data, Complexity of data types and
structures, Speed of new data creation and growth
Examples
Discovering consumer shopping habits
Finding new customer leads
Fuel optimization tools for the transportation
industry Live road mapping for autonomous vehicles
Monitoring health conditions through data from
wearables Personalised health plans for cancer patients
Personalised marketing
Predictive inventory ordering
Real-time data monitoring and cybersecurity
protocols Streamlined media streaming
User demand prediction for ridesharing companies
Benefits of Big Data Processing
• Businesses can utilize outside intelligence while taking decisions.
• Access to social data from search engines and sites like facebook,
twitter are enabling organizations to fine tune their business
strategies.
• Improved customer service - Big Data and natural language
processing technologies are being used to read and evaluate
consumer responses
• Early identification of risk to the product/services, if any
• Better operational efficiency
Why is Big Data Important? (Significance of Big Data)
• Saves Cost
• Saves Time
• Helps to gain a better grasp of market conditions
• Improve their online presence of companies
• Boost Customer Acquisition and Retention
• Solve Advertisers Problem and Offer Marketing Insights
• Driver of Innovations and Product Development
Applications
of Big Data
Challenges
with Big
Data
• Data privacy: The
Big Data we now
generate contains a
lot of information
about our personal
lives, much of which
we have a right to
keep private.
• Data security: Even
if we decide we are happy for someone to have our data for a
purpose, can we trust them to keep it safe?
• Data discrimination: When everything is known, will it become acceptable
to discriminate against people based on data we have on their lives? We
already use credit scoring to decide who can borrow money, and insurance
is heavily data-driven.
• Data quality: Not enough emphasis on quality and contextual relevance.
The trend with technology is collecting more raw data closer to the end
user. The danger is data in raw format has quality issues. Reducing the
gap between the end user and raw data increases issues in data quality.
Types of Digital Data
• Structured Data
• Unstructured Data
• Semi-Structured Data
• Quasi-structured data
Structured Data
• Data that has been arranged
and specified in terms of
length and format
• Data containing a defined
data type, format, and
structure • Highly ordered,
with parameters defining its
size
• Examples: Payroll database,
transaction data, online
analytical processing [OLAP]
data cubes, traditional
RDBMS, CSV files, and simple
spread sheets
Unstructured Data
• Data that has not been
arranged; has no
inherent structure
• Data which can’t be
stored in the form of
rows and columns
• Examples: text
documents, PDFs,
images, and video.
Semi-
Structured Data
• Data that is semi-organized
• Textual data files with a discernible pattern that enables
parsing • Examples:
• Extensible Markup Language [XML] data files that are self-describing. • Details
to an email; time it was sent, email addresses to and from, internet protocol
address of the device from which the email was sent, and other bits of
information associated with the email's content. Actual content (email text) is
unstructured. However, some components enable the data to be categorized.
Quasi-structured data
• Text data with erratic data formats that can be formatted with effort,
tools, and time.
• Example: web clickstream data that may contain inconsistencies in
data values and formats
Characteristics of Big Data
Three V’s:
• Volume
• Velocity
• Variety
Can be extended to fourth and
fifth V’s:
• Veracity
• Value
• Volume: refers to a large amount of
data. The magnitude of data plays a critical role in determining its worth. Example: In
2016, worldwide mobile traffic was predicted to be 6.2 Exabytes (6.2 billion GB) per
month. Predictions for 2020 - about 40000 ExaBytes of data.
• Velocity: refers to the rapid collection of data. Data comes in at a high rate from machines,
networks, social media, mobile phones, and other sources in Big Data velocity. A large and
constant influx of data exists. This influences the data's potential, or how quickly data is
created and processed. Example: Google receives more than 3.5 billion queries every day.
In addition, the number of Facebook users is growing at a rate of around 22% every year.
• Variety: Complexity of data types and structures: Big Data reflects the variety of new
data sources, formats, and structures, including digital traces being left on the web and
other digital repositories for subsequent analysis
• Veracity: It is equivalent to quality. We have all the data, but could we be missing
something? Are the data “clean” and accurate? Do they really have something to offer?
• Value: There is another V to take into account when looking at big data: Value having
access to big data is no good unless we can turn it into value. Companies are starting to
generate amazing value from their big data.
Tools used in Big Data
• Apache Hadoop
• HPCC
• Qubole
• Apache Cassandra
• Statwing
• CouchDB
• Apache Flink
• RapidMiner - best open-source data analytics tools
• Kaggle - world's largest big data community
History of Hadoop
• Hadoop is an Apache Software Foundation-managed open source framework
developed in Java for storing and analysing massive information on commodity
hardware clusters
• Hadoop serves as a solution to this issue of big data - storage and processing of
large amounts of data with specific additional capabilities.
• Hadoop is composed chiefly of Hadoop Distributed File System (HDFS) and Yet
Another Resource Negotiator (YARN)
Apache Hadoop
• The Apache® Hadoop® project develops open-source software for reliable,
scalable, distributed computing.
• The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using
simple programming models.
• It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
• Rather than rely on hardware to deliver high-availability, the library itself is
designed to detect and handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers, each of which may be
prone to failures.
• https://hadoop.apache.org/
Four significant modules of Hadoop
• HDFS — A distributed file system that works on commodity or low end
hardware. HDFS outperforms conventional file systems in data
performance, fault tolerance, and native support for massive datasets.
• YARN — manages and monitors cluster nodes and resource utilization. It
automates the scheduling of jobs and tasks.
• MapReduce — A framework that enables parallel computing on data by
programs. The map job turns the input data into a dataset calculated in
key-value pairs. It is reducing tasks that consume the output of the map
task in order to aggregate it and produce the desired result.
• Hadoop Common — Provides a set of shared Java libraries utilized by all
modules.
How Hadoop Operates
Due to Hadoop's flexibility, the ecosystem has evolved tremendously over the years. Today,
the Hadoop ecosystem comprises a variety of tools and applications that aid in the
collection, storage, processing, analysis, and management of large amounts of data. Several
of the most prominent tools include the following:
• Spark — An open-source distributed processing technology often used to handle large
amounts of data. Apache Spark provides general batch processing, streaming analytics,
machine learning, graph databases, and ad hoc queries through in-memory caching and
efficient execution.
• Presto — A distributed SQL query engine geared for low-latency, ad-hoc data processing.
It adheres to the ANSI SQL standard, which includes sophisticated searches, aggregations,
joins, and window functions. Presto can handle data from various sources, including
Hadoop Distributed File System (HDFS) and Amazon Simple Storage Service (S3).
• Hive — provides a SQL-based interface for using Hadoop MapReduce, allowing
massive
scale analytics as distributed and fault-tolerant data warehousing. • HBase — A
non-relational, versioned open-source database that works on Amazon S3 (through EMRFS)
or the Hadoop Distributed File System (HDFS). HBase is a massively scalable, distributed
large data store designed for real-time access to tables with billions of rows and millions of
columns through random, strictly consistent access. • Zeppelin — An interactive notebook
for exploratory data analysis.
A Weather Dataset
• Weather sensors collect data every hour at many locations across the globe and
gather a large volume of log data, which is a good candidate for analysis.
• The data is from the National Climatic Data Center, or NCDC.
• The data is stored using a line-oriented ASCII format, in which each line is a
record.
• The format supports a rich set of meteorological elements, many of which are
optional or with variable data lengths. For simplicity, we focus on the basic
elements, such as temperature, which are always present and are of fixed
width.
There is a directory for each year
from 1901 to 2001, each containing
a gzipped file for each weather
station with its readings for that
year. For example, the first entries
for 1990 shown above.
Data Analysis with Unix
• “What’s the highest recorded global
temperature for each year in the dataset?“ • We will answer this first without
using Hadoop.
• This analysis will
provide a
performance
baseline and a
useful means to
check our results.
The classic tool for
processing
line-oriented data is awk.
A small script to calculate the
maximum temperature for each
year:
The script loops through the compressed year files, first printing the year, and then processing each
file using awk.
The awk script extracts two fields from the data: the air temperature and the quality code.
The air temperature value is turned into an integer by adding 0.
Next, a test is applied to see whether the temperature is valid (the value 9999 signifies a missing
value in the NCDC dataset) and whether the quality code indicates that the reading is not suspect or
erroneous.
If the reading is OK, the value is compared with the maximum value seen so far, which is updated if a
new maximum is found.
The END block is executed after all the lines in the file have been processed, and it prints the
maximum value.
Here is the beginning of a run:
The temperature values in the source file are
scaled by a factor of 10, so this works out as a
maximum temperature of 31.7°C for 1901.
The complete run for the century took 42 minutes
in one run on a single EC2 High-CPU Extra Large
instance.
• To speed up the processing, we need to run parts of the program in parallel. In
theory, this is straightforward: we could process different years in different
processes, using all the available hardware threads on a machine. But there are a
few problems with this:
• First, dividing the work into equal-size pieces isn’t always easy or obvious.
• Combining the results from independent processes may require further
processing.
• You are still limited by the processing capacity of a single machine.
Conclusion: Although it’s feasible to parallelize the processing, in practice it’s
messy. Using a framework like Hadoop to take care of these issues is of great
help.
Analyzing the Data with Hadoop
• We need to express our query as a MapReduce job.
• After some local, small-scale testing, we will be able to run it on a cluster of
machines.
• MapReduce works by breaking the processing into two phases: the map phase
and the reduce phase.
Each phase has key-value pairs as input and output, the types of which may be
chosen by the programmer.
The programmer also specifies two functions: the map function and the reduce
function
• MapReduce is a programming model for data processing.
• MapReduce programs are inherently parallel, thus putting very large-scale
data analysis into the hands of anyone with enough machines at their
disposal.
• A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner. The
framework sorts the outputs of the maps, which are then input to the reduce
tasks. Typically both the input and the output of the job are stored in a file
system. The framework takes care of scheduling tasks, monitoring them and
re-executes the failed tasks.
• Unit-3 dedicated to MapReduce
• The input to our map phase is the raw NCDC data. We choose a text input format that gives us
each line in the dataset as a text value.
• The key is the offset of the beginning of the line from the beginning of the file, but as we have no
need for this, we ignore it.
• Our map function is simple. We pull out the year and the air temperature, because these are the
only fields we are interested in.
This is the final output: the maximum global temperature recorded in each year.
Java
MapReduce
Mapper for the
maximum
temperature
example
Frameworks resembling Hadoop
• Ceph
• Apache Storm
• Apache Spark: addresses the problem of MapReduce of requiring a significant amount of time to
complete specified jobs, by doing data processing in memory
• DataTorrentRTS
• Google BiqQuery
• Samza
• Flink: outperforms Hadoop and Spark in terms of performance
• HydraDataTorrentRTS
However, because of its scalability, cheap cost, and flexibility, Hadoop is the ideal platform for
Big Data analytics. It includes a slew of tools that data scientists need. Apache Hadoop, together
with YARN, converts a significant amount of raw data into an easily consumable feature matrix.
Hadoop simplifies the development of machine learning algorithms.
Hadoop Streaming (from textbook)
• Hadoop provides an API to MapReduce that allows you to write your map and reduce
functions in languages other than Java.
• Hadoop Streaming uses Unix standard streams as the interface between Hadoop and
your program, so you can use any language that can read standard input and write to
standard output to write your MapReduce program.
• Streaming is naturally suited for text processing. Map input data is passed over standard
input to your map function, which processes it line by line and writes lines to standard
output. A map output key-value pair is written as a single tab-delimited line. Input to the
reduce function is in the same format—a tab-separated key-value pair—passed over
standard input. The reduce function reads lines from standard input, which the frame
work guarantees are sorted by key, and writes its results to standard output.
We now simulate this python program using Unix
pipeline:
Run this same file using Hadoop:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \ -
input input/ncdc/sample.txt \
-output output \
-mapper ch02-mr-intro/src/main/ruby/max_temperature_map.rb \ -
reducer ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
Hadoop Streaming (from official Apache Hadoop website)
• https://hadoop.apache.org/docs/r1.2.1/streaming.html
• Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you
to create and run Map/Reduce jobs with any executable or script as the mapper and/or the
reducer. For example:
• both the mapper and the reducer are executables that read the input from stdin (line by
line) and emit the output to stdout
• The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and
monitor the progress of the job until it completes.
• When an executable is specified for mappers, each mapper task will launch the executable
as a separate process when the mapper is initialized. As the mapper task runs, it converts
its inputs into lines and feed the lines to the stdin of the process. In the meantime, the
mapper collects the line oriented outputs from the stdout of the process and converts
each line into a key/value pair, which is collected as the output of the mapper. By default,
the prefix of a line up to the first tab character is the key and the rest of the line (excluding
the tab character) will be the value. If there is no tab character in the line, then entire line
is considered as key and the value is null.
• When an executable is specified for reducers, each reducer task will launch the executable
as a separate process then the reducer is initialized. As the reducer task runs, it converts
its input key/values pairs into lines and feeds the lines to the stdin of the process. In the
meantime, the reducer collects the line oriented outputs from the stdout of the process,
converts each line into a key/value pair, which is collected as the output of the reducer.
By default, the prefix of a line up to the first tab character is the key and the rest of the
line (excluding the tab character) is the value.
This is the basis for the communication protocol between the Map/Reduce framework and
the streaming mapper/reducer.
Streaming Command Options
IBM’s Big Data Strategy
• International Business Machine (IBM) is an American company headquartered in New York. IBM is listed at # 43 in
Forbes list with a Market Capitalization of $162.4 billion as of May 2017.
• IBM has a sale of around $79.9 billion and a profit of $11.9 billion. In 2017, IBM holds most patents generated by
the business for 24 consecutive years.
• IBM is the biggest vendor for Big Data-related products and services. IBM Big Data solutions provide features such
as store data, manage data and analyze data.
• During the 1960s, American Airlines developed a flight reservation system using IBM computing systems and stored
around 807 Megabytes of data.
• Smarter Planet was a corporate initiative of IBM, which sought to highlight how government and business leaders
were capturing the potential of smarter systems to achieve economic and sustainable growth and societal
progress.
• In November 2008, in his speech at the Council on Foreign Relations, IBM’s Chairman, CEO and President Sam
Palmisano, outlined an agenda for building a ‘Smarter Planet’. He emphasized how the world’s various systems
– like traffic, water management, communication technology, smart grids, healthcare solutions, and rail
transportation – were struggling to function effectively.
• IBM committed itself to Big Data and Analytics through sustained investments and strategic acquisitions. In 2011, it
invested US$100 million in the research and development of services and solutions that facilitated Big Data analytics.
In addition, it had been bringing together as many Big Data technologies as possible under its roof.
• In 2013, IBM was awarded the contract to support Thames Water Utilities Limited’s (Thames Water) Big Data
project. The UK government planned to install smart meters in every home by 2020.
• The Big Data strategy of the company was to combine a wide array of the Big Data analytic solutions and
conquer the Big Data market.
IBM’s Big Data Solutions
1) Hadoop System: It is a storage platform that stores structured and unstructured data. It
is designed to process a large volume of data to gain business insights. 2) Stream
Computing: Stream Computing enables organizations to perform in-motion analytics
including the Internet of Things, real-time data processing, and analytics 3) Federated
discovery and Navigation: Federated discovery and navigation software help organizations
to analyze and access information across the enterprise. IBM provides below listed Big Data
products which will help to capture, analyze, and manage any structured and unstructured
data.
4) IBM® BigInsights for Apache Hadoop®: It enables organizations to analyze a huge
volume of data quickly and in a simple manner.
5) IBM BigInsights on Cloud: It provides Hadoop as a service
through the IBM SoftLayer cloud infrastructure.
6) IBM Streams: For critical Internet of Things applications, it helps organizations to
capture and analyze data in motion.
IBM InfoSphere BigInsights
• IBM® InfoSphere® BigInsights is an Apache Hadoop based, hardware-
agnostic software platform that provides new ways of using diverse and large-
scale data collections.
• InfoSphere BigInsights is focused on providing enterprises with the capabilities
they need to meet critical business requirements while maintaining compatibility
with the Hadoop project.
• InfoSphere BigInsights includes a variety of IBM technologies that enhance and
extend the value of open-source Hadoop software to facilitate faster time-to
value, including application accelerators, analytical facilities, development tools,
platform improvements and enterprise software integration.
Features
• Accelerating deployments by tapping into Hadoop community
innovation
• Leveraging existing SQL skills and solutions
• Enabling user-driven analytics and data provisioning
• Supporting human-oriented information discovery and topic
generation
• Leveraging in-motion and at-rest analytics
• Integrating with popular modeling and predictive analytics solutions
BigSheets
Spreadsheet style tool for IBM InfoSphere BigInsights
Unit 2 (prescribed syllabus)
HDFS(Hadoop Distributed File System), The Design of HDFS, HDFS Concepts,
Command Line Interface, Hadoop file system interfaces,
Data flow, Data Ingest with Flume and Scoop and Hadoop archives,
Hadoop I/O: Compression, Serialization, Avro and
File-Based Data structures.
HDFS
• Distributed filesystems: Filesystems that manage the storage across a network of
machines. When a dataset outgrows the storage capacity of a single physical
machine, it becomes necessary to partition it across a number of separate
machines.
• Since they are network based, all the complications of network programming kick
in, thus making distributed filesystems more complex than regular disk
filesystems. For example, one of the biggest challenges is making the filesystem
tolerate node failure without suffering data loss.
• Hadoop comes with a distributed filesystem called HDFS, which stands for
Hadoop Distributed Filesystem. HDFS is Hadoop’s flagship filesystem.
The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
HDFS is not a good fit in following scenarios:
• Low-latency data access
• Lots of small files
• Multiple writers, arbitrary file modifications
HDFS Design goals
Techniques to meet HDFS Design goals
POSIX, or Portable Operating System Interface, is a set of standards that
define how operating systems should work and interact with
applications.
POSIX is based on the Unix operating system and is intended to ensure
that applications can run on different versions of Unix.
HDFS Design concepts
HDFS Concepts - Blocks
• Block size of a disk- the
minimum amount of data that
it can read or write.
Filesystem
blocks are typically a few
kilobytes in size,
whereas disk
blocks are normally 512 bytes.
• Unit of Block in HDFS - 128
MB
by default
Benefits of
block
abstraction
1. A file can be larger than any single disk in the network
2. Making the unit of abstraction a block rather than a file simplifies the storage
subsystem - because blocks are a fixed size, it is easy to calculate how many
can be stored on a given disk
3. Blocks fit well with replication for providing fault tolerance and availability. If a
block becomes unavailable due to corruption or machine failure, a copy can be
read from another location in a way that is transparent to the client.
Basic architecture of HDFS
HDFS Architecture key components
Namenodes and Datanodes
• Namenode (the master): manages the filesystem namespace. It maintains the filesystem tree and
the metadata for all the files and directories in the tree. This information is stored persistently on
the local disk in the form of two files: the namespace image and the edit log.
• Datanodes (workers): workhorses of the filesystem. They store and retrieve blocks when they are
told to (by clients or the namenode), and they report back to the namenode periodically with lists
of blocks that they are storing.
• Without the namenode, the filesystem cannot be used. In fact, if the machine running the
namenode were obliterated, all the files on the filesystem would be lost since there would be no
way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it
is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for
this.
• back up the files that make up the persistent state of the filesystem metadata • run a
secondary namenode, which despite its name does not act as a namenode. Its main role is to
periodically merge the namespace image with the edit log to prevent the edit log from
becoming too large; runs on a separate physical machine because it requires plenty of CPU
and as much memory as the namenode
Block Caching
• Normally a datanode reads blocks from disk, but for frequently
accessed files the blocks may be explicitly cached in the datanode’s
memory, in an off-heap block cache.
• By default, a block is cached in only one datanode’s memory,
although the number is configurable on a per-file basis.
• Job schedulers (for MapReduce, Spark, and other frame works) can
take advantage of cached blocks by running tasks on the datanode
where a block is cached, for increased read performance.
HDFS Federation
• The namenode keeps a reference to every file and block in the filesystem in
memory, which means that on very large clusters with many files, memory
becomes the limiting factor for scaling
• HDFS federation, introduced in the 2.x release series, allows a cluster to scale by
adding namenodes, each of which manages a portion of the filesystem
namespace. For example, one namenode might manage all the files rooted under
/user, say, and a second name node might handle files under /share.
• Under federation, each namenode manages a namespace volume, which is made
up of the metadata for the namespace, and a block pool containing all the blocks
for the files in the namespace.
• To access a federated HDFS cluster, clients use client-side mount tables to map
file paths to namenodes. This is managed in configuration using
ViewFileSystem and the viewfs:// URIs.
HDFS High Availability
• Hadoop 2 remedied the situation of long recovery time in routine maintenance and
unexpected failure of the namenode, by adding support for HDFS high availability (HA).
• Architectural changes needed:
• The namenodes must use highly available shared storage to share the edit log. When a
standby namenode comes up, it reads up to the end of the shared edit log to
synchronize its state with the active namenode, and then continues to read new
entries as they are written by the active namenode.
• Datanodes must send block reports to both namenodes because the block mappings
are stored in a namenode’s memory, and not on disk.
• Clients must be configured to handle namenode failover, using a mechanism that is
transparent to users.
• The secondary namenode’s role is subsumed by the standby, which takes periodic
checkpoints of the active namenode’s namespace.
Data Replication
• HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a
sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are
replicated for fault tolerance. The block size and replication factor are configurable per file. An application
can specify the number of replicas of a file. The replication factor can be specified at file creation time and
can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
• The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and
a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode
is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
• Large HDFS instances run on a cluster of computers that commonly spread across many racks. For the
common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one
node in the local rack, another on a node in a different (remote) rack, and the last on a different node in
the same remote rack.
Data Disk Failure, Heartbeats and Re-Replication: Each DataNode sends a Heartbeat message to the
NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the
NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode
marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them.
Any data that was registered to a dead DataNode is not available to HDFS any more. The NameNode
constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The
necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica
may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be
increased.
The Command-Line Interface
• There are many other interfaces to HDFS, but the command line is one of the simplest
and, to many developers, the most familiar.
• Two properties that we set in the pseudodistributed configuration: • fs.defaultFS, set to
hdfs://localhost/, which is used to set a default filesystem for Hadoop
• dfs.replication property is set to 1 so that HDFS doesn’t replicate filesystem blocks
by the default factor of three.
Basic Filesystem Operations
• All of the usual filesystem operations, such as reading files, creating directories,
moving files, deleting data, and listing directories can be performed.
• hadoop fs -help to get detailed help on every command.
• Copying a file from the local filesystem to HDFS:
• This command invokes Hadoop’s filesystem shell command fs, which supports a
number of subcommands—in this case, we are running -copyFromLocal The local
file quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance
running on localhost. In fact, we could have omitted the scheme and host of the
URI and picked up the default, hdfs://localhost, as specified in core-site.xml:
• Copy the file back to the local filesystem and check whether it’s the
same:
• The MD5 digests are the same, showing that the file survived its trip to HDFS and
is back intact.
• HDFS file listing, create a directory to see how it is displayed in the
listing:
File Permissions in HDFS
• HDFS has a permissions model for files and directories that is much like the POSIX
model. There are three types of permission: the read permission (r), the write
permission (w), and the execute permission (x).
• Each file and directory has an owner, a group, and a mode. The mode is made up
of the permissions for the user who is the owner, the permissions for the users
who are members of the group, and the permissions for users who are neither
the owners nor members of the group.
• By default, Hadoop runs with security disabled, which means that a client’s
identity is not authenticated. Because clients are remote, it is possible for a client
to become an arbitrary user simply by creating an account of that name on the
remote system. This is not possible if security is turned on.
• There is a concept of a superuser, which is the identity of the namenode process.
Per missions checks are not performed for the superuser.
Hadoop Filesystems
• Hadoop has an abstract notion of filesystems, of which HDFS is just one
implementation. The Java abstract class org.apache.hadoop.fs.FileSystem
represents the client interface to a filesystem in Hadoop, and there are several
concrete implementations. The main ones that ship with Hadoop are described in
Table in next slide.
• Hadoop provides many interfaces to its filesystems, and it generally uses the URI
scheme to pick the correct filesystem instance to communicate with. For
example, the filesystem shell that we met in the previous section operates with
all Hadoop filesystems.
• Command to list the files in the root directory of the local filesystem:
% hadoop fs -ls file: ///