DSA Notes Unit-03
Technology Landscape
UNIT 3
NoSQL (NOT ONLY SQL)
The term NoSQL was first coined by Carlo Strozzi in 1998 to name his lightweight, open-source, relational database that did
not expose the standard SQL interface.
2. Social Media Networking: Social media is full of data, both structured and unstructured. As a
field loaded with enormous amounts of data to be discovered, social media is one of the most
effective applications of NoSQL databases.
A few popular NoSQL vendors
Company     Product       Most widely used by
Amazon      DynamoDB      LinkedIn, Mozilla
Facebook    Cassandra     Netflix, Twitter, eBay
Google      BigTable      Adobe Photoshop
SQL V/S NoSQL
SQL                                        NoSQL
Relational database                        Non-relational, distributed database
Relational model                           Model-less approach
Pre-defined schema                         Dynamic schema
Table-based databases                      Document-based, graph-based, or key-value pair databases
Vertically scalable                        Horizontally scalable
Uses SQL                                   Uses UnQL (Unstructured Query Language)
Not preferred for large datasets           Largely preferred for large datasets
Emphasis on ACID properties                Follows Brewer's CAP theorem
Examples: Oracle, MySQL, PostgreSQL        Examples: MongoDB, HBase, Redis, CouchDB
NewSQL
NewSQL is a modern relational database system that bridges the gap between SQL and NoSQL.
NewSQL databases aim to both scale and stay consistent: NoSQL databases scale, while standard
SQL databases are consistent. NewSQL attempts to provide both features and find a middle ground.
As a result, this type of database addresses the problems found in big data fields.
What is NewSQL?
NewSQL is a unique database system that combines ACID compliance with
horizontal scaling. The database system strives to keep the best of both worlds:
OLTP-based transactions and the high performance of NoSQL combined in a single
solution, giving enterprises the high data integrity they expect on large data volumes.
NewSQL Database Features
•In-memory storage and data processing supply fast query results.
•Partitioning scales the database into units (shards); queries execute on many shards and are combined into a single result.
•ACID properties preserve the features of an RDBMS.
•Secondary indexing results in faster query processing and information retrieval.
•High availability due to the database replication mechanism.
•A built-in crash recovery mechanism delivers fault tolerance and minimizes downtime.
Hadoop
What is HDFS?
•HDFS is a distributed file system that is fault tolerant, scalable, and extremely easy to expand.
•HDFS is the primary distributed storage for Hadoop applications.
•HDFS provides interfaces for applications to move themselves closer to data.
•HDFS is designed to ‘just work’; however, a working knowledge helps in diagnostics and improvements.
Refer Pg no: 69
Hadoop Ecosystem Components for Data Ingestion
Hadoop Ecosystem Components for Data Processing
Hadoop Ecosystem Components for Data Analysis
Hadoop Distributions
Big data analytics
What is Distributed Computing
Distributed computing refers to a system where processing and data storage are distributed
across multiple devices or systems, rather than being handled by a single central device.
In a distributed system, each device or system has its own processing capabilities and may
also store and manage its own data.
These devices or systems work together to perform tasks and share resources, with no
single device serving as the central hub.
One example of a distributed computing system is a cloud computing system, where
resources such as computing power, storage, and networking are delivered over the Internet
and accessed on demand. In this type of system, users can access and use shared resources
through a web browser or other client software.
Distributed Computing Challenges
1. Data Consistency and Synchronization:
•Ensuring that data is consistent across all nodes in a distributed system, especially when
multiple nodes are updating the same data, can be complex.
2. Fault Tolerance and Availability:
•Distributed systems are more prone to failures, and ensuring that the system continues to
operate even when some components fail is crucial.
3. Complexity:
•Distributed systems are more complex to design, implement, and manage than traditional
monolithic systems.
4. Scalability and Performance:
•As the workload on a distributed system increases, the system needs to be able to scale up or
down to meet the demands without performance degradation.
5. Security:
•Distributed systems have more points of entry for security breaches, and protecting sensitive
data and user privacy is crucial.
6. Network Latency and Bandwidth:
•Communication between nodes in a distributed system can be slow due to network latency
(delay) and limited bandwidth.
Hadoop common-core libraries
1. NameNode
2. DataNode
3. Secondary NameNode
4. HDFS Client
5. Block Structure
NameNode
The NameNode is the master server that manages the filesystem namespace and
controls access to files by clients. It performs operations such as opening,
closing, and renaming files and directories. Additionally, the NameNode maps
file blocks to DataNodes, maintaining the metadata and the overall structure of
the file system. This metadata is stored in memory for fast access and persisted
on disk for reliability.
Key Responsibilities:
•Maintaining the filesystem tree and metadata.
DataNode
DataNodes are the worker nodes in HDFS, responsible for storing and retrieving
actual data blocks as instructed by the NameNode. Each DataNode manages the
storage attached to it and periodically reports the list of blocks it stores to the
NameNode.
Key Responsibilities:
•Storing data blocks and serving read/write requests from clients.
•Performing block creation, deletion, and replication upon instruction from the
NameNode.
HDFS Client
The HDFS client is the interface through which users and applications interact
with the HDFS. It allows for file creation, deletion, reading, and writing
operations. The client communicates with the NameNode to determine which
DataNodes hold the blocks of a file and interacts directly with the DataNodes for
actual data read/write operations.
Key Responsibilities:
•Facilitating interaction between the user/application and HDFS.
•Communicating with the NameNode for metadata and with DataNodes for data
access.
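To make the client's role concrete, the following minimal Java sketch (the class name and paths are illustrative, not part of the notes) shows a client issuing metadata operations through the org.apache.hadoop.fs.FileSystem API; the NameNode answers these metadata calls, while actual block reads and writes would go to DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // core-site.xml / hdfs-site.xml on the classpath tell the client
        // where the NameNode is (fs.defaultFS).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/demo");            // illustrative path
        Path file = new Path("/user/demo/notes.txt"); // illustrative path

        fs.mkdirs(dir);                               // metadata call handled by the NameNode
        System.out.println("File exists? " + fs.exists(file));
        fs.delete(file, false);                       // false = non-recursive delete
        fs.close();
    }
}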
HDFS – Data Organization
Each file written into HDFS is split into data blocks
Each block is stored on one or more nodes
Each copy of the block is called a replica
Block placement policy
◦ First replica is placed on the local node
◦ Second replica is placed in a different rack
◦ Third replica is placed in the same rack as the second replica
Basic file system operations-HDFS
Starting HDFS
Initially you have to format the configured HDFS file system, open the namenode
(HDFS server), and execute the following command.
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system. The following command
will start the namenode as well as the data nodes as a cluster.
$ start-dfs.sh
Hadoop File Systems
Hadoop is capable of working with various file systems, and HDFS is just one
implementation among them. Hadoop provides an abstraction over a variety of
file systems that can be implemented concretely. The Java abstract
class org.apache.hadoop.fs.FileSystem represents a file system in Hadoop.
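As a rough illustration of this abstraction (the NameNode address and paths below are placeholders), the same FileSystem API can be bound to different concrete implementations depending on the URI scheme, e.g. hdfs:// for HDFS or file:/// for the local file system:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemAbstractionSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The concrete implementation is chosen by the URI scheme.
        FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf); // placeholder address
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);

        System.out.println("HDFS implementation:  " + hdfs.getClass().getName());
        System.out.println("Local implementation: " + local.getClass().getName());

        // The same listStatus() call works on either implementation.
        for (FileStatus status : local.listStatus(new Path("/tmp"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
    }
}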
Anatomy of File Read in HDFS
Step 1: The client opens the file it wishes to read by calling open() on the File System
Object(which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System( DFS) calls the name node, using remote procedure calls
(RPCs), to determine the locations of the first few blocks in the file. For each block, the
name node returns the addresses of the data nodes that have a copy of that block. The
DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and
name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the
data node addresses for the first few blocks in the file, connects to the first (closest)
data node holding the first block of the file.
Step 4: Data is streamed from the data node back to the client, which calls read()
repeatedly on the stream.
Step 5: When the end of a block is reached, DFSInputStream closes the connection to that
data node and then finds the best data node for the next block. This happens transparently
to the client, which from its point of view is simply reading a continuous stream. Blocks are
read in order, with the DFSInputStream opening new connections to data nodes as the
client reads through the stream. It will also call the name node to retrieve the data node
locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the
FSDataInputStream.
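The read path above can be sketched in Java roughly as follows (the file path is a placeholder); open() triggers the name node lookup (Steps 1–2), while copyBytes() drives the repeated read() calls against the data nodes (Steps 3–5):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // DistributedFileSystem when configured for HDFS

        FSDataInputStream in = fs.open(new Path("/user/demo/input.txt")); // Steps 1-2
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);               // Steps 3-5: repeated read()
        } finally {
            IOUtils.closeStream(in);                                      // Step 6: close()
        }
    }
}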
Anatomy of File Write in HDFS
Note: HDFS follows the Write once Read many times model. In HDFS we cannot edit
the files which are already stored in HDFS, but we can append data by reopening the
files.
Step 1: The client creates the file by calling create() on Distributed FileSystem (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file
system’s namespace, with no blocks associated with it. The name node performs
various checks to make sure the file doesn’t already exist and that the client has the
right permissions to create the file. If these checks pass, the name node makes a
record of the new file; otherwise, the file can’t be created and the client is
thrown an error, i.e. an IOException. The DFS returns an FSDataOutputStream for the
client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into
packets, which it writes to an internal queue called the data queue. The data queue
is consumed by the DataStreamer, which is responsible for asking the name node to
allocate new blocks by picking a list of suitable data nodes to store the
replicas. The list of data nodes forms a pipeline, and here we’ll assume the
replication level is three, so there are three nodes in the pipeline. The
DataStreamer streams the packets to the first data node in the pipeline,
which stores each packet and forwards it to the second data node in the
pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and
last) data node in the pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be
acknowledged by data nodes, called the “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes
all the remaining packets to the data node pipeline and waits for acknowledgments before
contacting the name node to signal that the file is complete.
HDFS follows the Write Once Read Many model. So, we can’t edit files that are already
stored in HDFS, but we can append to them by reopening the file. This design allows
HDFS to scale to a large number of concurrent clients because the data traffic is spread
across all the data nodes in the cluster. Thus, it increases the availability, scalability, and
throughput of the system.
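A corresponding write-side sketch (the path and content are illustrative): create() performs the name node checks (Steps 1–2), the writes are split into packets and streamed down the data node pipeline internally (Steps 3–5), and close() flushes and finalizes the file (Step 6):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: create() asks the name node to record the new file.
        FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"));

        // Steps 3-5: data is buffered into packets and pushed through the
        // replication pipeline by DFSOutputStream/DataStreamer behind the scenes.
        out.writeBytes("Hello HDFS\n");

        // Step 6: close() flushes remaining packets and signals completion.
        out.close();
        fs.close();
    }
}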
Coherency Model
Coherency Model: The Hadoop Distributed File System follows a write-once, read-many
access model for files. A file that has been written and then closed should not be changed;
only data can be appended. This assumption helps to minimize data coherency issues.
MapReduce fits perfectly with such a file model.
How does the coherency model work?
•When a file is created, it's visible in the filesystem namespace
•However, written content might not be visible until more than a block's worth of data
has been written
•The current block being written is not visible to other readers.
•The Simple Coherency Model assumes that files don't need to be changed after they're
created, except for appends and truncates
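To make newly written data visible to readers before a full block has been written, a writer can flush explicitly. A minimal sketch (the path is a placeholder): hflush() guarantees visibility to new readers, while hsync() additionally forces the data to disk on the data nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CoherencySketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/user/demo/log.txt"));

        out.writeBytes("first record\n");
        // Without a flush, this content may not yet be visible to other readers.
        out.hflush();   // now visible to new readers (not necessarily on disk yet)
        out.hsync();    // also forces the data to disk on the data nodes

        out.close();    // close() flushes and makes the file consistent
        fs.close();
    }
}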
Parallel Copying with distcp
DistCp is a tool that uses MapReduce to copy data in parallel between
clusters. It's useful for copying large amounts of data between clusters that use
different file systems
How DistCp works
•DistCp uses MapReduce for distribution, error handling, recovery, and reporting
•It can copy data between clusters running different file systems, such as HDFS
and S3
•DistCp supports compression and preserves file attributes
•DistCp is a useful tool for Hadoop cluster administrators
DistCp use cases
•It can be used for inter/intra-cluster copying
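For illustration (the cluster addresses and paths below are placeholders), a typical inter-cluster copy is started from the command line as:
$ hadoop distcp hdfs://namenode1:8020/source/dir hdfs://namenode2:8020/target/dir
DistCp then launches a MapReduce job whose map tasks copy the files in parallel.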
Hadoop Archives
Hadoop archives, or HAR files, are a special format used to reduce the load on the
Hadoop NameNode by combining multiple small files into a single, larger archive,
accessed via the har:// URI.
The primary goal of Hadoop archives is to reduce the number of files the NameNode
needs to manage, thereby improving performance, especially in clusters with a large
number of YARN aggregated logs.
A Hadoop archive maps to a file system directory and always has a .har extension.
It contains metadata (in the form of _index and _masterindex) and data (part-*)
files.
The HarFileSystem implements the FileSystem interface and provides access via
the har:// URI.
You access archived files using the har:// URI.
For example, har://hdfs/path/to/archive.har/path/to/file
You can create a Hadoop archive using the hadoop archive command.
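For illustration (the archive name and paths are placeholders), an archive is typically created with a command of the form:
$ hadoop archive -archiveName files.har -p /user/hadoop dir1 dir2 /user/hadoop/archives
which packs dir1 and dir2 (relative to the parent path /user/hadoop) into files.har under /user/hadoop/archives.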
Limitations: Hadoop Archives
Hadoop Archives (HAR) files, while designed to address small file issues in
HDFS, have limitations such as requiring a copy of the original files,
necessitating archive recreation for modifications, and potentially leading to
inefficient map tasks.
Disk Space Consumption:
Creating HAR files creates a copy of the original files, thus requiring the same
amount of disk space as the original files.
Modification Overhead:
Once a HAR archive is created, adding or removing files requires recreating the
entire archive, which can be time-consuming and resource-intensive.
Map Task Inefficiency:
HAR files can lead to an increased number of map tasks, which can be inefficient,
especially for large datasets.
No Real-time Processing:
Hadoop, in general, is not designed for real-time processing, and HAR files do not
change this fundamental limitation.
Batch Processing Focus:
Hadoop primarily supports batch processing, which can lead to slower processing
speeds compared to real-time or streaming solutions.
Yarn
YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop
2.0 to remove the bottleneck on Job Tracker which was present in Hadoop 1.0.
YARN was described as a “Redesigned Resource Manager” at the time of its
launch, but it has now evolved to be known as a large-scale distributed operating
system used for Big Data processing.
YARN architecture basically separates the resource management layer from the
processing layer. The responsibility of the Job Tracker of Hadoop 1.0 is now split
between the Resource Manager and the Application Master.
YARN also allows different data processing engines like graph processing,
interactive processing, stream processing as well as batch processing to run and
process data stored in HDFS (Hadoop Distributed File System) thus making the
system much more efficient. Through its various components, it can dynamically
allocate various resources and schedule the application processing. For large
volume data processing, it is quite necessary to manage the available resources
properly so that every application can leverage them
YARN Features: YARN gained popularity because of the following features-
•Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications.
The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler
to partition the cluster resources.
•Node Manager: It runs on every worker node in the cluster. Its primary job is to keep up
with the Resource Manager: it registers with the Resource Manager and sends heartbeats
with the health status of the node.
It monitors resource usage, performs log management, and also kills a container based on
directions from the Resource Manager.
It is also responsible for creating the container process and starting it at the request of
the Application Master.
•Application Master: An application is a single job submitted to a framework. The
application master is responsible for negotiating resources with the resource manager,
tracking the status and monitoring progress of a single application.
The application master requests the container from the node manager by sending a
Container Launch Context(CLC) which includes everything an application needs to
run. Once the application is started, it sends the health report to the resource manager
from time-to-time.
•Container: It is a collection of physical resources such as RAM, CPU cores and disk
on a single node.
The containers are invoked by Container Launch Context(CLC) which is a record that
contains information such as environment variables, security tokens, dependencies
etc.
Application workflow in Hadoop YARN
1. Client submits an application.
2. The Resource Manager allocates a container to start the Application Manager.
3. The Application Manager registers itself with the Resource Manager.
4. The Application Manager negotiates containers from the Resource Manager.
5. The Application Manager notifies the Node Manager to launch the containers.
6. The application code is executed in the container.
7. The client contacts the Resource Manager/Application Manager to monitor the application’s status.
8. Once the processing is complete, the Application Manager un-registers with the Resource
Manager.
Advantages
•Flexibility: YARN offers flexibility to run various types of distributed processing systems such as Apache
Spark, Apache Flink, Apache Storm, and others.
•Resource Management: YARN provides an efficient way of managing resources in the Hadoop cluster.
•Scalability: YARN is designed to be highly scalable and can handle thousands of nodes in a cluster.
•Improved Performance: YARN offers better performance by providing a centralized resource management
system.
•Security: YARN provides robust security features such as Kerberos authentication, Secure Shell (SSH)
access, and secure data transmission.
Disadvantages :
•Overhead: YARN introduces additional overhead, which can slow down the performance of the Hadoop
cluster.
•Latency: YARN can add latency to job execution, caused by resource allocation, application scheduling, and communication
between components.
•Single Point of Failure: YARN can be a single point of failure in the Hadoop cluster. If YARN fails, it can
cause the entire cluster to go down. To avoid this, administrators need to set up a backup YARN instance for
high availability.
•Limited Support: YARN has limited support for non-Java programming languages. Although it supports
multiple processing engines, some engines have limited language support, which can limit the usability of
YARN in certain environments.
Managing Resources and Applications with Hadoop YARN
Role of YARN (Yet Another Resource Negotiator) in Hadoop
YARN (Yet Another Resource Negotiator) is a core part of Hadoop that helps in
strengthening the architecture by effectively coordinating resources and scheduling jobs.
Here are the key roles of YARN in Hadoop:
Resource Management:
•Dynamic Allocation: YARN distributes the cluster resources (CPU, memory, disk) among a
number of applications based on their needs.
•Centralized Management: It balances the resources in the Hadoop cluster and also avoids
resource conflicts within the system.
Job Scheduling:
•Flexible Scheduling: YARN offers different scheduling policies to users, such as FIFO,
the Capacity Scheduler, and the Fair Scheduler, which help in proper scheduling of the
workload across the cluster.
•Decoupling from MapReduce: YARN provides for more flexible use of resources by the
system, as well as the separation of task scheduling from resource allocation, enabling
Hadoop to work with frameworks other than MapReduce, including Apache Spark, Apache
Flink, and Apache Tez.
•Efficient Cluster Utilization: With this setup, it allows the various processing engines to
run thereby improving the general utilization of the cluster.
Improved Performance:
•Optimal Resource Usage: Resources are provisioned to applications on the nodes as needed,
without being wasted, hence improving the performance of the system.
•YARN (Yet Another Resource Negotiator), as the name implies, is the component that helps
to manage the resources across the cluster. In short, it performs scheduling and
resource allocation for the Hadoop system.
•MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks
are:
• Map() performs sorting and filtering of data, thereby organizing the data
in the form of groups. Map() generates key-value pair based results
which are later processed by the Reduce() method.
• Reduce(), as the name suggests, does the summarization by aggregating
the mapped data. In simple terms, Reduce() takes the output generated by Map()
as input and combines those tuples into a smaller set of tuples.
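As a concrete illustration of Map() and Reduce() (class names are illustrative; see also the Word Count reference under Map-Reduce Programming below), a minimal Hadoop word count might look like this, with the reducer reused as a combiner for local pre-aggregation:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(): sums all counts emitted for the same word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // combiner = local pre-aggregation on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}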
PIG:
Pig was basically developed by Yahoo. It works on the Pig Latin language,
which is a query-based language similar to SQL.
•It is a platform for structuring the data flow, processing and analyzing huge data
sets.
•Pig does the work of executing commands and, in the background, all the activities
of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
•Pig Latin language is specially designed for this framework which runs on Pig
Runtime. Just the way Java runs on the JVM.
•Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
HIVE:
•With the help of an SQL-like methodology and interface, HIVE performs reading and
writing of large data sets. Its query language is called HQL (Hive Query
Language).
•It is highly scalable, as it allows both real-time processing and batch processing. Also,
all the SQL data types are supported by Hive, thus making query processing easier.
•Similar to the Query Processing frameworks, HIVE too comes with two
components: JDBC Drivers and HIVE Command Line.
•JDBC, along with ODBC drivers, works on establishing the data storage permissions and
connection, whereas the HIVE Command Line helps in the processing of queries.
Mahout:
•Mahout allows machine learning capability to be added to a system or application. Machine
Learning, as the name suggests, helps a system to develop itself based on some
patterns, user/environmental interaction, or on the basis of algorithms.
Apache Spark:
•It consumes in-memory resources, and is hence faster in terms of optimization than the
earlier MapReduce-based processing.
•Spark is best suited for real-time data whereas Hadoop is best suited for structured
data or batch processing; hence both are used in most companies
interchangeably.
Apache HBase:
•It’s a NoSQL database which supports all kinds of data and is thus capable of handling
anything within a Hadoop database. It provides the capabilities of Google’s BigTable, and is
thus able to work on big data sets effectively.
•At times when we need to search or retrieve the occurrences of something small in
a huge database, the request must be processed within a short span of time. At
such times, HBase comes in handy as it gives us a tolerant way of storing limited data.
Other Components: Apart from all of these, there are some other components too that
carry out a huge task in order to make Hadoop capable of processing large datasets. They
are as follows:
•Solr, Lucene: These are two services that perform the task of searching and indexing
with the help of some Java libraries. Lucene in particular is based on Java and also
provides a spell check mechanism. Solr is built on top of Lucene.
•Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding
them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and
Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a
sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered
when some data or an external stimulus is given to them.
Map-Reduce Programming
Introduction
Mapper
Reducer
Combiner
Partitioner, Searching, Sorting
Compression
Word Count Example
Refer experiment No:07
Hive
What is Hive?
History of Hive and Recent release of Hive and Features of Hive
Hive integration and Workflow
Hive Data units
Hive Architecture
Hive Data Types
Hive File Format
Hive Query Language (HQL)
RCfile Implementation
SERDE
User-Defined Function UDF