Big Data - Midsem

The document discusses an introduction to big data systems. It provides an outline of topics to be covered in the course, including big data analytics, distributed systems programming, Hadoop ecosystem technologies, NoSQL databases, and in-memory and streaming systems like Spark. It also lists recommended books and materials. The first part of the document covers motivation for big data in modern enterprises, challenges of scaling relational database management systems, and characteristics of big data systems and their architectures.

Uploaded by

Rachana Pandit

BIG DATA SYSTEMS

BITS Pilani
Pilani Campus Email: kanantharaman@wilp.bits-pilani.ac.in

10/20/2023 CCCSZG522 BDS S1-13-14 1


Introduction to Big Data,
Locality of Reference
Contact Session-1

Course outline

➢ S1: Introduction to Big Data and data locality
➢ S2: Parallel and Distributed Processing
➢ S3: Big Data Analytics and Big Data Systems
➢ S4: Consistency, Availability, Partition tolerance and Data Lifecycle
➢ S5: Distributed Systems Programming
➢ S6-S9: Hadoop ecosystem technologies
➢ S10: NoSQL Databases
➢ S11: Big Data on Cloud
➢ S12: Amazon storage services
➢ S13-16: In-memory and streaming - Spark
BITS Pilani, Pilani Campus
Books

T1 Seema Acharya and Subhashini Chellappan. Big Data and Analytics. Wiley India Pvt. Ltd. Second Edition.
R1 DT Editorial Services. Big Data - Black Book. Dreamtech Press. 2016.
R2 Kai Hwang, Jack Dongarra, and Geoffrey C. Fox. Distributed and Cloud Computing: From Parallel Processing to the Internet of Things. Morgan Kaufmann. 2011.
AR Additional Reading (as per topic)

Topics for today

➢ Motivation
– Why do modern Enterprises need to work with data
– What is Big Data and data classification
– Scaling RDBMS

➢ What is a Big Data System


– Characteristics
– Design challenges

➢ Architecture
– Data warehouse
– High level architecture of Big Data solutions

➢ Why locality of reference and storage organization matter?

Why do modern Enterprises need
to work with data?

Example of a data-driven Enterprise:
A large online retailer (1)

➢ What data is collected


– Millions of transactions and browsing clicks per day across
products, users
– Delivery tracking
– Reviews on multiple channels - website, social media, customer
support
– Support emails, logged calls
– Ad click and browsing data
– …

➢ Data is a mix of metrics, natural language text, logs, events, videos, images etc.

Example of a data-driven Enterprise:
A large online retailer (2)
➢ What is this data used for
– User profiling for better shopping experience
– Operations efficiency metrics
– Improve customer support experience, support training
– Demand forecasting
– Product marketing
– …

➢ Data is the only way to create competitive differentiators, retain customers and ensure growth

Data Volume Growth

• Facebook: 500+ TB/day of comments, images, videos etc.
• NYSE: 1 TB/day of trading data
• A jet engine: 20 TB/hour of sensor / log data
Source: https://www.guru99.com/what-is-big-data.html

Variety of data sources
Data → Information → Insights

Source: https://www.guru99.com/what-is-big-data.html

Type of Digital Data

Data Classification

• Digital data is classified into the following categories:

➢Structured data

➢Semi-structured data

➢Unstructured data

Approximate percentage
distribution of digital data

Structured Data

➢ This is the data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.

➢ Relationships exist between entities of data, such as classes and their objects.

➢ Data stored in databases is an example of structured data.

Sources of Structured Data

Ease with Structured Data

Semi-Structured Data

➢ This is data that does not conform to a formal data model but still has some structure. It is not in a form that a computer program can use easily.

➢ Examples: emails, XML, and markup languages such as HTML. Metadata for this data is available but is not sufficient.

Sources of Semi-Structured
Data

• XML (eXtensible Markup Language)
• Other markup languages: XHTML, SGML, ….
• JSON (JavaScript Object Notation)

Characteristics of Semi-structured Data

Unstructured Data

➢ This is data that does not conform to a data model or is not in a form that a computer program can use easily.

➢ About 80–90% of an organization's data is in this format.

➢ Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research, white papers, the body of an email, etc.

Sources of Unstructured Data

Issues with terminology –
Unstructured Data

Dealing with Unstructured
Data

Q&A

Topics – 2nd Half

1. Definition of big data.

2. Challenges of big data.

3. Why big data?

4. Traditional Business Intelligence versus big data.

Definition of Big Data
Big data is high-volume, high-velocity,
and high-variety information assets that
demand cost effective, innovative
forms of information processing for
enhanced insight and decision making.

Source: Gartner IT Glossary

Volume - A Mountain of Data

Velocity

Batch → Periodic → Near real-time → Real-time processing

Variety

➢ Structured data, for example traditional transaction processing systems and RDBMS.

➢ Semi-structured data, for example Hyper Text Markup Language (HTML) and eXtensible Markup Language (XML).

➢ Unstructured data, for example unstructured text documents, audio, video, email, photos, PDFs, social media, etc.

Other Characteristics of Data –
Which are not Definitional Traits of Big Data

• Veracity and Validity
• Refer to biases, noise, and abnormality in data.

• Volatility
• Concerns how long the data remains valid.

• Variability
• Data flows can be highly inconsistent, with periodic peaks.

Dealing with Unstructured
Data

• Data mining
✓ Association rule mining, e.g. market basket or affinity analysis
✓ Regression, e.g. predict dependent variable from independent variables
✓ Collaborative filtering, e.g. predict a user preference from group preferences
✓ NLP - e.g. Human to Machine interaction, conversational systems
✓ Text Analytics - e.g. sentiment analysis, search
✓ Noisy text analytics - e.g. spell correction, speech to text
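
Association rule mining from the list above can be made concrete with two standard measures, support and confidence. The sketch below is a minimal illustration, not a full mining algorithm such as Apriori; the class name, method names, and the four sample baskets are made up for this example.

```java
import java.util.List;
import java.util.Set;

class MarketBasketDemo {
    // Made-up transactions, purely for illustration.
    static final List<Set<String>> BASKETS = List.of(
        Set.of("bread", "milk"),
        Set.of("bread", "butter"),
        Set.of("bread", "milk", "butter"),
        Set.of("milk"));

    // Support: fraction of baskets that contain every item in `items`.
    static double support(List<Set<String>> baskets, Set<String> items) {
        long hits = baskets.stream().filter(b -> b.containsAll(items)).count();
        return (double) hits / baskets.size();
    }

    // Confidence of the rule antecedent -> consequent:
    // support(both items together) / support(antecedent alone).
    static double confidence(List<Set<String>> baskets, String antecedent, String consequent) {
        return support(baskets, Set.of(antecedent, consequent))
             / support(baskets, Set.of(antecedent));
    }

    public static void main(String[] args) {
        // bread and milk appear together in 2 of the 4 baskets
        System.out.println("support(bread, milk)     = " + support(BASKETS, Set.of("bread", "milk")));
        // 2 of the 3 baskets that contain bread also contain milk
        System.out.println("confidence(bread -> milk) = " + confidence(BASKETS, "bread", "milk"));
    }
}
```

A rule is usually kept only if both measures exceed thresholds chosen by the analyst; those thresholds are not specified in the slides.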

Challenges with Big Data

Sources of Big Data

There are a multitude of sources for big data. An XLS, a DOC, a PDF … is unstructured data. A video on YouTube, a chat conversation on Internet Messenger, a customer feedback form on an online retail website … is unstructured data. CCTV coverage and a weather forecast report are unstructured data too.

Why Big Data?

Traditional Business Intelligence (BI)
versus Big Data

A Typical Data Warehouse
Environment

In a typical DW environment, data is collected from multiple disparate sources, integrated, cleansed and transformed before being loaded into a data warehouse. A host of market-leading BI tools can then be used on top of the data warehouse for reporting/dashboarding, ad hoc querying and modeling.

A Typical Hadoop Environment

Hadoop takes care of storage and processing using the following:

a) HDFS (Hadoop Distributed File System) (distributed storage)
b) MapReduce (distributed processing)
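
To give a feel for the MapReduce processing model, here is a minimal single-machine sketch of word count in plain Java. It does not use the Hadoop API; the class and method names are illustrative assumptions, and the shuffle and reduce phases are collapsed into one step for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and reduce, collapsed: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line)); // on a cluster, different input splits are mapped on different nodes
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        // prints {analytics=1, big=2, data=2, systems=1}
        System.out.println(wordCount(List.of("big data systems", "big data analytics")));
    }
}
```

In a real Hadoop job the input lines would come from HDFS blocks and the framework would distribute the map and reduce calls; here everything runs in one process.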
Co-existence of Big Data and
Data Warehouse

10/20/2023 CCCSZG522 BDS S1-13-14 40


BITS Pilani, Pilani Campus
Systems perspective - Processing: In-memory vs. (from)
secondary storage vs. (over the) network

Processing Approaches for Big Data
Systems
• In-Memory Processing
✓ Data loaded into RAM for fast access
✓ Useful for real-time, low latency needs
✓ Limited by available RAM capacity
Use Cases: It is often used in scenarios where speed is critical, such as financial
trading, real-time recommendation engines, and online gaming analytics.
• Secondary Storage Processing
✓ Data stored on disk (SSD/HDD)
✓ Higher latency than in-memory
✓ Allows larger dataset storage
✓ Disk I/O can bottleneck
Use Cases: It suits workloads whose datasets are too large to fit in RAM, such as batch processing of historical data.

Network and Hybrid Processing

• Network Processing
✓ Analyze data as it streams over network
✓ Enables real-time analytics on remote data
✓ Network latency affects speed
Use Cases: It's often used in cloud-based or distributed systems where data is stored in multiple geographic locations. Examples include processing data from remote IoT devices and distributed data centers.
• Hybrid Strategies
✓ Combine in-memory, disk, and network processing
✓ Leverage benefits of each approach
✓ Balance real-time and batch needs
✓ Mitigate individual limitations
✓ Optimal mix depends on:
    – Data sizes, analytics needs
    – Infrastructure costs
    – Real-time vs batch requirements

Locality of Reference: principle and examples

Impact of Latency: algorithms and data structures that leverage locality, and data organization on disk for better locality

Locality of Reference

• Principle: Programs tend to reuse data they accessed recently and to access data stored near it, so data elements accessed close together in time or space tend to be accessed again in the near future.
• Temporal Locality - Data accessed recently is likely to be accessed again soon. Exhibited in code by accessing the same variable/data item multiple times.
    • For example, in an online shopping system, a user browsing a product catalog may return to the same product pages several times, showing temporal locality.
• Spatial Locality - Data elements stored close together are likely to be referenced close together in time. Exhibited in code by iterating through arrays or sequential data structures.
    • For instance, when a web page is loaded, the CSS and image files associated with it are usually requested along with it, demonstrating spatial locality.

Example: Temporal Locality

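// Temporal locality: data accessed recently is likely to be accessed again soon.
int x = getData();     // getData() is a placeholder; x is fetched once
int doubled = x + x;   // x is reused while it is still in a register or cache
int squared = x * x;   // and reused again

The variable x exhibits temporal locality.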

Example: Spatial Locality

// Array exhibiting spatial locality
int[] data = {1, 2, 3, 4, 5};

for (int i = 0; i < data.length; i++) {
    int x = data[i];   // neighboring elements are read one after another
    // use x
}

The array data exhibits spatial locality.
• The key point is that arrays and other sequential data structures store data elements close together in memory. Traversing the array accesses neighboring elements, exhibiting spatial locality. Storing the array contiguously takes advantage of this access pattern for better performance.
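
The effect of spatial locality can be observed by walking the same two-dimensional array in two different orders. The sketch below is an illustrative experiment, not a rigorous benchmark; the class name, method names, and the array size are assumptions, and actual timings depend on the machine and the JVM.

```java
class TraversalOrderDemo {
    // Build an n x n matrix whose cell (r, c) holds r + c.
    static int[][] filled(int n) {
        int[][] m = new int[n][n];
        for (int r = 0; r < n; r++) {
            for (int c = 0; c < n; c++) {
                m[r][c] = r + c;
            }
        }
        return m;
    }

    // Row-major walk: consecutive iterations touch neighboring elements of the same row.
    static long sumRowMajor(int[][] m) {
        long sum = 0;
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[r].length; c++) {
                sum += m[r][c];
            }
        }
        return sum;
    }

    // Column-major walk: consecutive iterations jump from one row array to another.
    static long sumColumnMajor(int[][] m) {
        long sum = 0;
        for (int c = 0; c < m[0].length; c++) {
            for (int r = 0; r < m.length; r++) {
                sum += m[r][c];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] m = filled(2000);
        long t0 = System.nanoTime();
        long a = sumRowMajor(m);
        long t1 = System.nanoTime();
        long b = sumColumnMajor(m);
        long t2 = System.nanoTime();
        System.out.println("row-major sum    " + a + " in " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("column-major sum " + b + " in " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Both methods return the same sum; only the access order differs. On typical hardware the row-major walk tends to be faster because it uses each cache line fully before moving on.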

Impact of Latency(1)

• Algorithms leveraging temporal locality keep frequently reused elements in faster memory to reduce access latency:
Cache cache = new Cache();   // Cache is a placeholder type
cache.add(x);                // keep x close at hand
// later uses of x are served from the cache
Caching: Caching is a common application of locality of reference. Web browsers, for example, cache web pages, images, and scripts to reduce load times. When a user revisits a webpage, the cached resources can be quickly retrieved from local storage rather than downloaded again.
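
One common way to exploit temporal locality in a cache is least-recently-used (LRU) eviction: when the cache is full, the entry that has gone unused the longest is dropped. The slides do not name an eviction policy, so LRU is an assumption here. The sketch relies on the access-order mode of the standard `java.util.LinkedHashMap`; the class names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A small LRU cache built on LinkedHashMap's access-order mode.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = order entries by most recent access
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}

class LruCacheDemo {
    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "page A");
        cache.put("b", "page B");
        cache.get("a");            // "a" becomes the most recently used entry
        cache.put("c", "page C");  // the cache is over capacity, so "b" is evicted
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

Recently used entries stay in the cache, which pays off exactly when the workload shows temporal locality.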

Disk Access Patterns: In big data systems, algorithms that process a large dataset often read it from disk. They benefit from spatial locality when the data they need sits in the same or neighboring disk blocks, because sequential reads avoid repeated seeks. Data processing frameworks like Hadoop are designed around this: HDFS stores files in large blocks that are read sequentially.
Impact of Latency(2)

• Data Organization on Disk: To mitigate latency, data is often organized in specific ways on disk. For example, in a database system, related data might be stored together in data pages. This minimizes the time required to read data, as reading one page that contains multiple relevant records is more efficient than seeking and reading data from different disk locations.

• Algorithms and Data Structures: Big data systems leverage algorithms and data structures that take advantage of locality of reference to optimize performance. For example, databases use B-trees, which pack many keys into each node so that a lookup touches only a few disk pages, reducing the number of disk I/O operations.

• MapReduce in Hadoop: Hadoop's MapReduce framework capitalizes on locality of reference. It processes data where it resides by scheduling tasks on nodes that store the data, reducing data transfer time and improving performance.
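
The idea of scheduling a task on a node that already stores its data can be sketched in a few lines. This is a deliberately simplified illustration, not Hadoop's scheduler (which weighs more factors, for example rack placement); the class name, method name, and node and block identifiers are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

class LocalityAwareScheduler {
    // Choose a node for the task that reads `block`.
    // Prefer an idle node that already holds a replica of the block;
    // otherwise fall back to any idle node, which means the data must cross the network.
    static String pickNode(String block, Map<String, Set<String>> replicas, List<String> idleNodes) {
        Set<String> holders = replicas.getOrDefault(block, Set.of());
        for (String node : idleNodes) {
            if (holders.contains(node)) {
                return node; // data-local: no network transfer needed
            }
        }
        return idleNodes.isEmpty() ? null : idleNodes.get(0);
    }

    public static void main(String[] args) {
        // Hypothetical cluster state: block "blk1" is stored on nodes n2 and n3.
        Map<String, Set<String>> replicas = Map.of("blk1", Set.of("n2", "n3"));
        System.out.println(pickNode("blk1", replicas, List.of("n1", "n3"))); // prints n3
        System.out.println(pickNode("blk1", replicas, List.of("n1")));       // prints n1
    }
}
```

In the second call no idle node holds the block, so the task still runs but pays the cost of moving the block over the network.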

Summary – Locality and
Impact of Latency
• Understanding and leveraging the locality of reference is
crucial in big data systems to reduce latency and
enhance performance.
• This principle influences how data is organized on disk,
the choice of data structures and algorithms, and the
design of distributed data processing frameworks.
• It plays a key role in making big data systems more
efficient and responsive to data access and processing
needs.

Q&A

BIG DATA SYSTEMS
Contact Session-2
BITS Pilani
Pilani Campus Email: kanantharaman@wilp.bits-pilani.ac.in
Big data Systems - Parallel
and Distributed Computing
➢ Big data is characterized by its sheer volume, velocity, variety, and complexity.

➢ Processing such data using traditional methods on a single machine can be extremely slow and often impractical.

➢ Parallel and distributed processing is motivated by the need to harness the processing power of multiple machines to handle big data.

12/20/2023 CCC ZG522 - BDS S1-23-24 2


Parallel and Distributed
system Organization

Shared Memory vs Message passing

Improvement in processor
and network technologies

Advances in CPU Processors

Example: Intel Core i7 990x has a reported execution rate of 159,000 MIPS.

• Schematic of a modern multicore CPU chip using a hierarchy of caches, where L1 cache is private to each core, on-chip L2 cache is shared, and L3 cache or DRAM is off the chip.

Multicore CPU and Many-Core GPU Architectures

• Five micro-architectures in modern CPU processors that exploit ILP and TLP supported by multicore and multithreading technologies.
GPU Programming Model

• Essentially, the CPU’s floating-point kernel computation role is largely offloaded to the many-core GPU. The CPU instructs the GPU to perform massive data processing.

The NVIDIA Fermi GPU Chip with
512 CUDA Cores
• CUDA® (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs).

➢ Memory, Storage, and Wide-Area Networking

• Improvement in memory and disk technologies over 33 years. The
Seagate Barracuda XT disk has a capacity of 3 TB in 2011
Memory (RAM)

Role in Big Data Systems:


– Memory is crucial for handling data processing tasks, as it allows for
quick access to frequently used data and reduces latency.
– In-memory processing involves keeping data in RAM, enabling faster
data retrieval and manipulation.
– It is used for caching frequently accessed data, improving query and
analysis performance in real-time and batch processing.
Examples:
– In-memory databases like Apache Ignite or Redis store data in RAM,
enabling real-time data analysis.
– Spark, a popular big data processing framework, utilizes memory for its
Resilient Distributed Datasets (RDDs) to accelerate data processing.

Storage (1)

• Storage refers to the non-volatile, long-term data storage components within a big data system.

• This includes hard disk drives (HDDs), solid-state drives (SSDs), and distributed file systems like Hadoop Distributed File System (HDFS).

Storage(2)

• Role in Big Data Systems:


– Storage is essential for persistently storing large volumes of data,
including historical and raw data.
– Distributed file systems like HDFS divide data into blocks and store
them across multiple nodes to provide fault tolerance and scalability.
– Data is often stored on disks for long-term retention, with retrieval when
needed for processing.
• Examples:
• HDFS in Hadoop divides data into blocks and distributes them across a
cluster of machines to provide reliable and scalable storage.
• Cloud-based storage solutions like Amazon S3 or Azure Blob Storage
provide cost-effective and scalable storage for big data.
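
How a distributed file system splits a file into blocks and spreads replicas across nodes can be sketched with simple arithmetic. The block size (128 MB), replication factor (3), and the round-robin placement rule below are assumptions chosen for illustration; real HDFS placement follows its own policy, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class BlockPlacementSketch {
    // Number of blocks needed to hold a file (ceiling division).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Toy placement rule: replica k of block b goes to node (b + k) mod nodeCount.
    // As long as replication <= nodeCount, the replicas of one block land on distinct nodes.
    static int[] replicaNodes(long block, int replication, int nodeCount) {
        int[] nodes = new int[replication];
        for (int k = 0; k < replication; k++) {
            nodes[k] = (int) ((block + k) % nodeCount);
        }
        return nodes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blocks = blockCount(1000 * mb, 128 * mb); // a 1000 MB file needs 8 blocks of 128 MB
        System.out.println("blocks: " + blocks);
        for (long b = 0; b < blocks; b++) {
            System.out.println("block " + b + " -> nodes " + Arrays.toString(replicaNodes(b, 3, 5)));
        }
    }
}
```

Because every block lives on several nodes, the loss of one node leaves at least one other copy of each block available, which is the fault tolerance the slide refers to.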

Wide-Area Networking
(WAN) (1)
• Wide-Area Networking (WAN) is the network
infrastructure that connects geographically dispersed
locations.
• It includes the use of public and private networks to
facilitate communication between data centers, offices,
or cloud services.

Wide-Area Networking
(WAN) (2)
• Role in Big Data Systems:
• WAN enables the transfer of data between distributed data centers, cloud
services, and remote locations.
• It supports the replication of data across multiple geographic regions for data
redundancy and disaster recovery.
• WAN connections are crucial for sending and receiving data in real-time or
batch data processing.
• Examples:
• Cloud services like AWS and Azure use WAN connections to enable
remote access to data stored in their data centers.
• Distributed big data systems use WAN connections to share data and
synchronize updates between nodes in different locations.

In big data systems, these three components (Memory, Storage, and Wide-Area Networking) work together to ensure efficient data processing, storage, and data transfer, especially when dealing with large and distributed datasets.
Virtual Machines and Virtualization
Middleware

• Three VM architectures in (b), (c), and (d), compared with the traditional
physical machine shown in (a).

VM Primitive Operations (1)

• The VMM provides the VM abstraction to the guest OS.
• With full virtualization, the VMM exports a VM abstraction identical to the physical machine so that a standard OS such as Windows 2000 or Linux can run just as it would on the physical hardware.

VM Primitive Operations (2)

• VM multiplexing, suspension, provision, and migration in a distributed computing environment.

Data Center Virtualization
for Cloud Computing

Cluster Architecture

• A cluster of servers interconnected by a high-bandwidth SAN or LAN with shared I/O devices and disk arrays; the cluster acts as a single computer attached to the Internet.

• At the platform level, MapReduce offers a new programming model that transparently handles data parallelism with natural fault tolerance capability.

• Iterative MapReduce extends MapReduce to support a broader range of data mining algorithms commonly used in scientific applications. The cloud runs on an extremely large cluster of commodity computers.

• Cloud Computing Over the Internet

Internet Clouds

The Cloud Landscape

• Three cloud service models in a cloud landscape of major providers.

Service-Oriented
Architecture (SOA) (1)
• These architectures build on the traditional seven Open
Systems Interconnection (OSI) layers that provide the
base networking abstractions.
• On top of this we have a base software environment,
which would be .NET or Apache Axis for web services,
the Java Virtual Machine for Java, and a broker network
for CORBA.
• On top of this base environment one would build a
higher level environment reflecting the special features
of the distributed computing environment.

Service-Oriented
Architecture (SOA) (2)

• Memory Hierarchy in Distributed Systems: In-node vs. over-the-network latencies, Locality, Communication Cost.

Memory Hierarchy in Distributed
Systems

• Levels of memory within each node - registers, caches, RAM, storage
• Faster memory used for frequently accessed data
• Slower memory used for less frequent data
• Optimizes data access and processing

In-Node vs Over-the-Network Latencies
• In-node latencies are lower - accesses stay within a node's memory hierarchy
• Over-network latencies are higher - data moves between nodes over the network
• Minimizing network transfer optimizes performance

Locality
• Locality of reference - nearby data is likely to be accessed together
• Data locality - processing happens near where the data resides
• Exploiting locality minimizes network transfer overhead

Communication Cost

• Overhead of transferring data between nodes
• Includes network latency, bandwidth limits, and protocol overhead
• High costs impact performance negatively
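
A first-order way to reason about communication cost is to estimate transfer time as a fixed latency plus the payload size divided by bandwidth. The sketch below applies that simple model; it ignores protocol overhead and congestion, and the latency and bandwidth figures are round illustrative numbers, not measurements.

```java
class TransferCostSketch {
    // Estimated seconds to move `bytes` over a link:
    // one-off latency plus payload size divided by sustained bandwidth.
    static double transferSeconds(double bytes, double latencySeconds, double bytesPerSecond) {
        return latencySeconds + bytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        double oneGigabyte = 1e9;

        // Assumed network link: 0.5 ms latency, 125 MB/s (roughly 1 Gbit/s).
        double overNetwork = transferSeconds(oneGigabyte, 0.0005, 125e6);

        // Assumed in-node memory copy: 100 ns latency, 10 GB/s.
        double inNode = transferSeconds(oneGigabyte, 100e-9, 10e9);

        System.out.printf("over the network: %.4f s%n", overNetwork);
        System.out.printf("within the node:  %.4f s%n", inNode);
    }
}
```

Under these assumed figures, moving 1 GB between nodes takes about 8 seconds against about 0.1 seconds inside a node, which is why frameworks try to move computation to the data rather than the reverse.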

Key Principles

• In-memory computing minimizes disk accesses
• Distributed processing near data (data locality)
• Efficient data partitioning and shuffling
• Minimize network transfer overhead
• In summary:
    • The memory hierarchy in distributed systems, in-node vs. over-the-network latencies, data locality, and communication cost are critical considerations in designing and optimizing big data systems.
    • Minimizing data transfer overhead and maximizing data locality are key principles for achieving efficient data processing and analysis in distributed environments.
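
Data partitioning, mentioned above, is often done by hashing a record's key so that all records with the same key land in the same partition. The sketch below shows that idea in plain Java; the class name, method names, and sample keys are illustrative assumptions rather than any framework's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HashPartitionSketch {
    // Map a key to a partition index in [0, numPartitions).
    // floorMod keeps the result non-negative even when hashCode() is negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Group keys by partition; equal keys always end up in the same partition.
    static Map<Integer, List<String>> partition(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> parts = new TreeMap<>();
        for (String key : keys) {
            parts.computeIfAbsent(partitionFor(key, numPartitions), p -> new ArrayList<>()).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("user-1", "user-2", "user-3", "user-1", "user-4");
        System.out.println(partition(keys, 3));
    }
}
```

Because both occurrences of "user-1" hash to the same partition, a per-key aggregation can run on one node without fetching that key's records from anywhere else, which keeps shuffle traffic down.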

• Distributed Systems: Motivation (size,
scalability, cost-benefit)

Distributed Systems:
Motivation
• Size:
• Big data involves massive datasets that are often too large to be
processed efficiently on a single machine. Distributed systems enable
the storage and processing of these datasets by distributing the
workload across multiple nodes.
• Scalability:
• As data volumes continue to grow, distributed systems provide
scalability by allowing organizations to add more computational and
storage resources as needed.
• Cost-Benefit:
• Distributed systems offer cost-effective solutions by utilizing commodity
hardware and open-source software, making it feasible to store and
process vast amounts of data without investing in expensive,
specialized infrastructure.

Client-Server vs. Peer-to-Peer
Models
Client-Server Model:
– In a client-server model, one or more central servers provide services
and resources to multiple clients.
Relevance to Big Data:
– Many big data processing frameworks, such as Hadoop and Spark, follow a master-worker variant of this model: a master node coordinates tasks across worker nodes for distributed data processing.
Peer-to-Peer Model:
– In a peer-to-peer model, individual nodes (peers) in the network have
equal status and share resources directly with one another.
Relevance to Big Data:
– While less common in big data processing, peer-to-peer models can be
used for distributed storage or data sharing in decentralized
environments.

Client-Server Model

Peer to Peer Network

Cluster Computing: Components
and Architectures (1)
Components:
• Nodes: Cluster computing involves multiple
interconnected nodes, each equipped with CPU,
memory, and storage resources.
• Interconnect: High-speed interconnects (e.g., Ethernet,
InfiniBand) enable fast data transfer and communication
between nodes.
• Software Stack: Cluster computing relies on specialized
software for resource management, workload
distribution, and coordination.

Cluster Computing: Components
and Architectures(2)
Homogeneous Clusters: All nodes in the cluster have
similar hardware configurations. Homogeneous clusters
are often used in high-performance computing (HPC)
environments.
Heterogeneous Clusters: Nodes in the cluster have
varying hardware specifications. Heterogeneous clusters
can be more cost-effective and can accommodate a
wider range of workloads.

Cluster Computing: Components
and Architectures(3)
Symmetric Multiprocessing (SMP): In SMP architectures, multiple processors share the same memory, offering a single-system image. SMP clusters are suited for multi-threaded applications.
Non-Uniform Memory Access (NUMA): In NUMA architectures, each processor has its own local memory, which it can access faster than memory attached to other processors. NUMA clusters are effective for memory-intensive tasks.

Relevance to Big Data

• Cluster computing is fundamental to big data systems, as it provides the infrastructure for parallel data processing and storage.
• Big data frameworks like Hadoop, Spark, and Apache Kafka are designed to run on cluster architectures.
• Hadoop clusters use the Hadoop Distributed File System (HDFS) for distributed storage and MapReduce for parallel data processing.

BIG DATA SYSTEMS
Contact Session-2
Big Data Analytics and Big Data System Characteristics
BITS Pilani Email: kanantharaman@wilp.bits-pilani.ac.in
Pilani Campus
Learning Objectives and Learning Outcomes: Big Data Analytics

Learning Objectives
1. What is big data analytics and what it isn’t?
2. Why is big data analytics important?
3. What is data science?
4. Getting familiar with the terminologies used in the big data environment.

Learning Outcomes
a) To understand the significance of big data analytics.
b) To understand the role of data scientist.
c) To understand the various terminologies used in the big data environment.

Agenda

➢ What is Big Data Analytics?


➢ What Big Data Analytics isn’t?
➢ Classification of Analytics
➢ Why is Big Data Analytics Important?
➢ Data Science
➢ Data Scientist … Your New Best Friend!!!
➢ Terminologies Used in Big Data Environment
➢ In Memory Analytics
➢ In Database Processing
➢ Massively Parallel Processing
➢ Difference between Parallel versus Distributed Systems
➢ Shared Nothing Architecture
➢ Consistency, Availability, Partition Tolerance (CAP): Theorem Explained
➢ Few Top Analytics Tools

What is Big Data Analytics?

What Big Data Analytics isn’t?

Analytics 1.0, 2.0 and 3.0

Data Science & Data Scientist

Terminologies Used in Big Data Environments
• In-Memory Analytics
• In-Database Processing
• Massively Parallel Processing
• Parallel System
• Distributed System
• Shared Nothing Architecture

Brewer’s CAP Theorem
CAP theorem:

A distributed system can guarantee at most two of the following three properties at the same time.

Consistency: Every read returns the most recent write.

Availability: Every request to a non-failing node receives a response; no read/write request is refused.

Partition tolerance: The distributed system continues to function even when a network partition occurs.
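
The trade-off can be illustrated with a toy model. The sketch below is not a real database: `TwoReplicaStore`, its `mode` flag and the `partitioned` switch are hypothetical names invented for this example. During a partition, the "CP" variant refuses writes to stay consistent, while the "AP" variant accepts them and lets the replicas diverge.

```python
class TwoReplicaStore:
    """Toy store with two replicas; `mode` is "CP" or "AP" (illustrative only)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [None, None]   # value held by each replica
        self.partitioned = False       # True when the replicas cannot talk

    def write(self, replica_id, value):
        if not self.partitioned:
            # Healthy network: update both replicas, so they stay consistent.
            self.replicas = [value, value]
        elif self.mode == "CP":
            # Keep consistency by refusing the write (availability is lost).
            raise RuntimeError("write refused during partition")
        else:
            # Stay available by accepting the write locally (replicas diverge).
            self.replicas[replica_id] = value

    def read(self, replica_id):
        return self.replicas[replica_id]
```

In "AP" mode a read from the other replica can return a stale value until the partition heals, which is exactly the consistency that was given up.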

Few Top Analytical Tools

• MS Excel
https://support.office.microsoft.com/en-in/article/Whats-new-in-Excel-2013-
1cbc42cd-bfaf-43d7-9031-5688ef1392fd?CorrelationId=1a2171cc-191f-47de-
8a55-08a5f2e9c739&ui=en-US&rs=en-IN&ad=IN
SAS
http://www.sas.com/en_us/home.htm
IBM SPSS Modeler
http://www-01.ibm.com/software/analytics/spss/products/modeler/

Q&A

• What are the key questions to be answered by all


organizations stepping into analytics?

• What is predictive and prescriptive analytics?

Answer(1)

• The key questions for any organization stepping into analytics are:
   • Should you be storing all of your big data? If “Yes”, where are you going to store it? If “No”, how will you know what to store and what to discard?
   • How will you sieve through your massive data to filter out the relevant from the irrelevant?
   • How long will you store this data?
   • How will you accommodate the peaks (variability in terms of data influx) in your data?
   • How will you analyze? Will you analyze all the data that is stored or analyze a sample?
   • What will you do with the insights generated from this analysis?

Answer(2)

• Predictive analytics helps you answer the questions “What will happen?” and “Why will it happen?”
• Prescriptive analytics goes beyond “What will happen?”, “Why will it happen?” and “When will it happen?” to answer “What action should be taken to take advantage of what will happen?”

2nd Half

Learning Objectives and Learning Outcomes

Learning Objectives (The big data technology landscape)
1. What are NoSQL databases?
2. Why NoSQL?
3. Key advantages of NoSQL
4. What is NewSQL?
5. SQL Vs. NoSQL
6. Getting familiar with Hadoop.

Learning Outcomes
a) To understand the significance of NoSQL databases
b) To understand the need for NewSQL
c) To understand the Hadoop platform and be able to appreciate the difference between Hadoop 1.0 and Hadoop 2.0

Agenda

• NoSQL
❖ What is it?
❖ Types of NoSQL Databases
❖ Why NoSQL?
❖ Advantages of NoSQL
❖ NoSQL Vendors
❖ SQL versus NoSQL
❖ NewSQL
❖ Comparison of SQL, NoSQL and NewSQL

• Hadoop
– Features of Hadoop
– Key Advantages of Hadoop
– Versions of Hadoop

What is NoSQL?

NoSQL databases are non-relational, distributed databases, many of which are open source.

They relax one or more ACID (Atomicity, Consistency, Isolation and Durability) properties of transactions. Like any distributed system, they are subject to Brewer’s CAP theorem.
Types of NoSQL

Use cases: Key-Value Pair NoSQL
1. Key-value or the big hash table.
2. Schema-less.

Use case:
Document Oriented NOSQL

Column Oriented NOSQL

For example:
Cassandra,
HBase, etc.

Graph Database- NOSQL

Advantages of NoSQL
• NoSQL has flexibility with respect to schema.
• One of the key advantages with NoSQL is its ability to scale up and down easily.

NOSQL Vendors

SQL Vs. NoSQL

NewSQL

SQL Vs. NoSQL Vs.
NewSQL

• Hadoop

Key Advantages of Hadoop

➢ Stores data in its native format


➢ Scalable
➢ Cost-effective
➢ Resilient to failure
➢ Flexibility
➢ Fast

Versions of Hadoop

Hadoop Ecosystem(1)

Hadoop Ecosystem(2)

• Components that help with Data Ingestion are:


1. Sqoop
2. Flume
• Components that help with Data Processing are:
1. MapReduce
2. Spark
• Components that help with Data Analysis are:
1. Pig
2. Hive
3. Impala

Three Differences between HBase and Hadoop/HDFS
• HDFS is a file system, whereas HBase is a Hadoop database. The relationship is like that between NTFS (a file system) and MySQL (a database).

• HDFS is WORM (write once, read many times). Later versions support appending data, but this feature is rarely used. HBase, in contrast, supports real-time random reads and writes.

• HDFS is based on the Google File System (GFS), whereas HBase is based on Google Bigtable.

Hadoop Ecosystem Components
for Data Ingestion
Sqoop:
Sqoop stands for SQL to Hadoop. It can import data from external systems into HDFS and populate tables in Hive and HBase.
Flume:
Flume is an important log aggregator in the Hadoop ecosystem: it collects logs from different machines and places them in HDFS.

Hadoop Ecosystem Components
for Data Processing
MapReduce:
It is a programming paradigm that allows distributed and parallel processing of huge datasets. It is based on Google MapReduce.
Spark:
It is both a programming model and a computing model. It is an open source big data processing framework written in Scala. It provides in-memory computing for Hadoop.
Spark can be used with Hadoop, coexisting smoothly with MapReduce (sitting on top of Hadoop YARN), or used independently of Hadoop (standalone).

Hadoop ecosystem components for Data Analysis

Pig:
It is a high-level scripting platform used with Hadoop. It serves as an alternative to writing MapReduce programs directly. It has two parts:
Pig Latin: a SQL-like scripting language.
Pig runtime: the runtime environment.
Hive:
Hive is a data warehouse software project built on top of Hadoop. The three main tasks performed by Hive are summarization, querying and analysis.
Impala:
It is a high-performance SQL engine that runs on a Hadoop cluster. It is ideal for interactive analysis, with low latency that can be measured in milliseconds. It supports a dialect of SQL called Impala SQL.

Q&A

1. MongoDB is ---------- and ----------.
2. Cassandra is ---------- and ----------.
3. ---------- has no support for ACID properties of transactions.
4. ---------- is a robust database that supports ACID properties of transactions and has the scalability of NoSQL.

1. MongoDB is Consistent and Partition Tolerant.
2. Cassandra is Available and Partition Tolerant.
3. NoSQL has no support for ACID properties of transactions.
4. NewSQL is a robust database that supports ACID properties of transactions and has the scalability of NoSQL.

BIG DATA SYSTEMS
Contact Session-4
CAP Theorem and Big Data Lifecycle
BITS Pilani Email: kanantharaman@wilp.bits-pilani.ac.in
Pilani Campus
Agenda

• What is CAP Theorem?
• Consistency, Availability, and Partition Tolerance defined.
• The inherent trade-offs between these three factors.
• Big Data Lifecycle:
• Data Acquisition, Data Extraction (Validation and Cleaning), Data Loading, Data Transformation, Data Analysis and Visualization.
• Case study: Big data application

11/2/23 CCC ZG522 - BDS S1-23-24 2


BITS Pilani, Pilani Campus
CAP Theorem Overview(1)

• CAP Theorem is a fundamental concept in distributed systems.
• Proposed by Eric Brewer in 2000 and formally proved by Seth Gilbert and Nancy Lynch in 2002.
• It states that in a distributed system:
➢ You can guarantee at most two of the three attributes: Consistency, Availability, and Partition Tolerance.

CAP Theorem Overview(2)

• Consistency refers to all clients seeing the same data at the same time.
• Availability means that any client making a request for data gets a response.
• Partition tolerance means that the cluster must continue to work despite any number of communication breakdowns between nodes in the system.
• In big data analytics, the CAP theorem has significant implications for designing distributed applications and choosing a NoSQL or relational data store.

Big Data Analytics Use Cases

• Example: Consistency is crucial in financial analytics, while availability may take precedence in real-time monitoring systems.

• Example: Apache Cassandra and its AP characteristics for distributed databases.

Balancing CAP in Big Data Systems

• How can big data systems strike a balance?

• Techniques like eventual consistency, replication, and data sharding.

1. Eventual Consistency:

• Explanation: Eventual consistency allows data to become consistent over time, even when there are temporary inconsistencies.

• Use Case: Eventual consistency is often acceptable in scenarios where immediate consistency is not a strict requirement, such as content delivery networks (CDNs) or social media platforms.
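
One simple way replicas can converge is a last-write-wins merge, sketched below. This is only an illustration under stated assumptions: each replica is modeled as a plain dictionary mapping a key to a `(timestamp, value)` pair, timestamps for the same key are assumed to be distinct, and the function name `merge_replicas` is made up for this example.

```python
def merge_replicas(a, b):
    """Merge two replica states; the entry with the later timestamp wins.

    Each state maps key -> (timestamp, value). Timestamps for the same key
    are assumed to be distinct (an assumption of this sketch).
    """
    merged = dict(a)
    for key, (ts, value) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged


# Two replicas that temporarily disagree about a post's counters.
replica_1 = {"likes": (1, 10)}
replica_2 = {"likes": (2, 11), "shares": (1, 3)}

# After exchanging state in either direction, both hold the same data.
converged = merge_replicas(replica_1, replica_2)
```

Until the exchange happens, a reader of `replica_1` sees the older `likes` value, which is the temporary inconsistency the slide describes.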

2. Replication:

• Explanation: Replicating data across multiple nodes can enhance availability while maintaining a degree of consistency.

• Use Case: Replication is commonly used in distributed databases like Apache Cassandra, which provides tunable consistency levels to balance between C and A.
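
The idea behind tunable consistency can be sketched with the standard quorum-overlap rule: with N replicas, a read that contacts R replicas is guaranteed to see the latest write acknowledged by W replicas when R + W > N. The helper names below are hypothetical and are not part of any database API.

```python
def quorum(n_replicas):
    """Smallest majority of n_replicas."""
    return n_replicas // 2 + 1


def read_sees_latest_write(n_replicas, write_acks, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_acks > n_replicas


N = 3
# Quorum writes plus quorum reads overlap in at least one replica.
strong = read_sees_latest_write(N, quorum(N), quorum(N))
# One ack and one read favors availability and latency; reads may be stale.
weak = read_sees_latest_write(N, 1, 1)
```

Lowering R or W makes requests succeed with fewer live replicas (more availability) at the price of possibly stale reads.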

3. Data Sharding:

• Explanation: Data sharding involves partitioning data into smaller subsets (shards) distributed across different nodes. Each shard is responsible for specific data, enhancing parallelism and availability.

• Use Case: Sharding is employed in systems like Hadoop, which distribute and process data across multiple nodes in parallel, optimizing A and P.
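
A minimal sketch of hash-based sharding follows. It is one common way to assign records to shards and is shown only as an illustration: the function names are invented here, and HDFS itself splits files into fixed-size blocks rather than hashing keys.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a string key to a shard number using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def partition(keys, num_shards):
    """Group keys into num_shards lists, one list per shard."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


user_ids = ["user-1", "user-2", "user-3", "user-4", "user-5"]
shards = partition(user_ids, num_shards=3)
```

Because the hash is stable, the same key always lands on the same shard, so a lookup only needs to contact one node.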

4. Smart Data Routing:

• Explanation: Implementing intelligent data routing strategies can ensure that read and write requests are directed to the most appropriate nodes based on the current state of the system.

• Use Case: Systems like Apache ZooKeeper use leader election to ensure availability, while routing data to the leader for writes, ensuring a level of consistency.

5. Trade-Off Analysis:

• Explanation: Assessing trade-offs between C, A, and P based on the unique requirements of your application is crucial. Evaluate which elements are non-negotiable and which can be relaxed.

• Use Case: Financial applications may prioritize strong consistency (C) to prevent data inconsistencies, while real-time monitoring systems may emphasize high availability (A) to ensure uninterrupted data collection.

Big Data Lifecycle

Big Data Lifecycle Stages

• High-level breakdown of the key stages:
1. Data Ingestion
2. Data Storage
3. Data Processing
4. Data Analysis
5. Data Visualization
6. Data Insights

Stack of layers in Big Data
Architecture

Data Ingestion


• Definition: Data ingestion is the process of collecting data from various sources, such as databases, logs, sensors, social media, and more.

• Variety of Data: Ingested data can be structured (relational databases), semi-structured (XML, JSON), or unstructured (text, images, videos), and can arrive in real-time or batch mode.

Functioning of Data Ingestion Layer (1)

• Identification: At this stage, data is categorized into various known data formats, and unstructured data is assigned default formats.

• Filtration: At this stage, the information relevant for the enterprise is filtered on the basis of the Enterprise Master Data Management (MDM) repository.

Functioning of Data Ingestion Layer (2)

• Validation: At this stage, the filtered data is analyzed against MDM metadata.

• Noise reduction: At this stage, data is cleaned by removing the noise and minimizing the related disturbances.

• Transformation: At this stage, data is split or combined on the basis of its type, contents, and the requirements of the organization.

Functioning of Data Ingestion Layer (3)

• Compression: At this stage, the size of the data is reduced without affecting its relevance for the required process. Compression should not affect the analysis results.

• Integration: At this stage, the refined dataset is integrated with the Hadoop storage layer, which consists of the Hadoop Distributed File System (HDFS) and NoSQL databases.
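
The ingestion stages can be sketched as a small pipeline. Everything specific here is an assumption made for illustration: the input is a list of text lines from hypothetical temperature sensors, `REQUIRED_FIELDS` stands in for MDM metadata, the plausible range of -50 to 60 is arbitrary, and the final integration step (writing to HDFS or a NoSQL store) is left out.

```python
import json
import zlib

REQUIRED_FIELDS = {"id", "temp_c"}  # hypothetical stand-in for MDM metadata


def identify(raw_lines):
    """Identification: tag each line as JSON or as plain text (default format)."""
    tagged = []
    for line in raw_lines:
        try:
            tagged.append(("json", json.loads(line)))
        except json.JSONDecodeError:
            tagged.append(("text", line))
    return tagged


def ingest(raw_lines):
    # Filtration: keep only JSON objects, the format relevant to this example.
    records = [rec for kind, rec in identify(raw_lines)
               if kind == "json" and isinstance(rec, dict)]
    # Validation: required fields must be present and the reading numeric.
    records = [r for r in records
               if REQUIRED_FIELDS <= r.keys()
               and isinstance(r["temp_c"], (int, float))]
    # Noise reduction: drop physically implausible readings.
    records = [r for r in records if -50 <= r["temp_c"] <= 60]
    # Transformation: derive a Fahrenheit field from the Celsius reading.
    records = [{**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in records]
    # Compression: lossless, so decompressing returns exactly these records.
    return zlib.compress(json.dumps(records).encode("utf-8"))


lines = ['{"id": 1, "temp_c": 20}', "not json", '{"id": 2}',
         '{"id": 3, "temp_c": 900}']
payload = ingest(lines)
restored = json.loads(zlib.decompress(payload))
```

Because the compression is lossless, `restored` holds exactly the records that passed the earlier stages, which is why compression does not change the analysis results.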

Methods and Tools for Data Ingestion

• ETL (Extract, Transform, Load): ETL processes are used to extract data from source systems, transform it into a suitable format, and load it into a target storage or processing system. Tools like Apache NiFi, Talend, and Apache Beam facilitate ETL processes.

• Real-Time Streaming: Data can be ingested in real time using streaming frameworks such as Apache Kafka or Apache Pulsar. This is crucial for applications that require immediate processing and analysis of data as it arrives.

• Data Connectors: Various connectors and APIs are used to fetch data from sources like databases, cloud storage, and web services.

Importance of Data Quality and Data Sources:

• Data Quality: Ensuring the quality, consistency, and accuracy of ingested data is critical. Poor data quality can lead to incorrect analysis and insights. Data cleansing and validation are often part of the ingestion process.

• Data Sources: Diverse data sources, including IoT devices, social media, customer transactions, and internal databases, contribute to the richness of big data. Understanding the sources and the data they produce is essential.

Data Ingestion: Summary

• Data ingestion sets the stage for the rest of the Big Data Lifecycle, making it imperative to gather, clean, and organize data from different sources for subsequent storage, processing, and analysis.

• The choice of ingestion method depends on the data's volume, velocity, variety, and the specific use case requirements.

Data Storage


• Definition: Data storage involves the management of ingested data and its organization into a format that is accessible for analysis. This stage addresses the challenge of managing vast amounts of data efficiently.

Different Storage Technologies (1)

• Data Warehouses: Traditional data warehouses like Oracle, Teradata, and Amazon Redshift are designed for structured data storage and retrieval. They are suitable for scenarios where data is primarily structured and relational.

• Data Lakes: Data lakes (e.g., Amazon S3, Hadoop HDFS) store raw, unstructured or semi-structured data at scale. They are ideal for retaining large volumes of data, whether structured or unstructured, and making it available for future analysis.

Different Storage Technologies (2)

• NoSQL Databases: NoSQL databases, including document stores, key-value stores, and column-family databases, offer flexible storage for unstructured or semi-structured data. They are optimized for high write throughput and fast access.

• In-Memory Databases: In-memory databases like Redis or Apache Cassandra with in-memory options provide high-speed access to frequently used data and are well-suited for real-time analytics.

Data Storage: Scalability and Data Retention:

• Scalability is a critical aspect of data storage. As data volume grows, the storage system should be able to scale horizontally to accommodate the increasing data load.

• Data retention policies must be defined to determine how long data should be retained, which data should be archived, and when data should be purged.

Data Storage: Data Compression and Optimization:

• Data compression techniques can be employed to reduce storage costs and improve data retrieval times.

• Storage optimization techniques, such as indexing and partitioning, help organize data for efficient access.

Data Storage: Data Security and
Access Control:
Ensuring data security and access control is essential in
data storage. Access to sensitive data should be
restricted, and data encryption should be implemented to
protect data at rest.

Data Storage: Metadata Management:

• Metadata, which describes the characteristics of the stored data, plays a crucial role in data storage. Effective metadata management helps in data discovery, governance, and tracking the lineage of data.

Data Storage: Data Lifecycle
Management:
• Data storage should be integrated with data lifecycle
management, allowing organizations to manage the
entire data journey, from ingestion to archiving or
deletion.

Data Storage: Summary

• Data storage serves as the repository for all ingested data, providing the foundation for data processing, analysis, and insights.

• The choice of storage technology depends on the nature of the data, scalability requirements, and the organization's data retention policies.

• An effective data storage strategy is crucial for optimizing data access and ensuring that data remains available for future analysis and decision-making.

Data Processing

• Definition: Data processing involves the manipulation, transformation, and analysis of the ingested data to extract meaningful information, patterns, and insights.

• Objective: The primary goal of data processing is to convert raw data into a format that is suitable for analysis and decision-making.

Batch and Real-Time Processing

• Batch Processing: In batch processing, data is collected and processed in fixed-size chunks or batches at scheduled intervals. It is well-suited for applications that can tolerate some latency in data analysis.

• Real-Time Processing: Real-time processing, or stream processing, involves the immediate analysis of data as it arrives. It is crucial for applications that require low-latency insights and quick responses.
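
The difference can be seen on a single statistic. The sketch below computes an average once over a collected batch, and incrementally as each value arrives; the function and class names are invented for this illustration and do not come from any framework.

```python
def batch_average(readings):
    """Batch style: wait until all readings are collected, then compute once."""
    return sum(readings) / len(readings)


class StreamingAverage:
    """Stream style: update the result as each reading arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental mean: shift the old mean toward the new value.
        self.mean += (value - self.mean) / self.count
        return self.mean


readings = [10, 20, 30, 40]

stream = StreamingAverage()
running = [stream.update(r) for r in readings]  # an answer after every event
final_batch = batch_average(readings)           # one answer at the end
```

Both approaches end at the same value, but the streaming version has a usable answer after every event instead of only at the end of the batch.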

Data Processing Frameworks and Tools (1)

• Hadoop: Hadoop, along with the MapReduce framework, was one of the pioneers in big data processing. It enables distributed batch processing of large datasets.

• Apache Spark: Apache Spark is a powerful, in-memory data processing framework that supports both batch and real-time data processing. It is known for its speed and ease of use.

Data Processing Frameworks and Tools (2)

• Apache Flink: Apache Flink is designed for high-throughput, low-latency stream processing, making it suitable for real-time analytics.

• Data Warehousing Solutions: Data warehousing solutions such as Amazon Redshift and Google BigQuery provide SQL-based batch processing capabilities for structured data.

Data Transformation and Enrichment

• Data processing often involves data transformation to convert data from one format to another, filter out irrelevant data, and enrich data with additional information.

• Data cleansing, normalization, and feature engineering are common data transformation tasks.
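
The three tasks named above can be sketched on a tiny, made-up customer table. The field names `income` and `spend` and all function names are assumptions for this example, and the sketch assumes every cleaned row has a positive income.

```python
def clean(rows):
    """Cleansing: drop rows with missing values."""
    return [r for r in rows
            if r.get("income") is not None and r.get("spend") is not None]


def min_max(values):
    """Normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def add_features(rows):
    """Feature engineering: add a normalized income and a spend ratio.

    Assumes income > 0 for every row (true for the sample data below).
    """
    norms = min_max([r["income"] for r in rows])
    return [{**r, "income_norm": n, "spend_ratio": r["spend"] / r["income"]}
            for r, n in zip(rows, norms)]


rows = [
    {"income": 100, "spend": 50},
    {"income": None, "spend": 5},   # incomplete row, removed by clean()
    {"income": 300, "spend": 30},
]
prepared = add_features(clean(rows))
```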

Parallel Processing and Distributed
Computing
• Data processing frameworks leverage parallel
processing and distributed computing to handle the
immense volumes of data. They distribute tasks across
multiple nodes for faster execution.

Data Quality and Validation

• Ensuring data quality is a critical part of data processing. It involves validating data for accuracy, completeness, and consistency.

• Data quality checks and validation rules are applied during processing.
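
Validation rules of this kind can be sketched as a function that returns a list of problems for each record. The order record layout (`order_id`, `quantity`, `unit_price`, `total`) and the rounding tolerance are hypothetical choices for this example.

```python
def validate(record):
    """Return a list of data quality problems; an empty list means valid."""
    errors = []
    # Completeness: every required field must be present.
    for field in ("order_id", "quantity", "unit_price", "total"):
        if record.get(field) is None:
            errors.append(f"missing {field}")
    if errors:
        return errors
    # Accuracy: a quantity must be a positive number.
    if record["quantity"] <= 0:
        errors.append("quantity must be positive")
    # Consistency: the total must agree with quantity * unit_price.
    expected = record["quantity"] * record["unit_price"]
    if abs(expected - record["total"]) > 0.01:
        errors.append("total does not match quantity * unit_price")
    return errors


good = {"order_id": 1, "quantity": 2, "unit_price": 5.0, "total": 10.0}
bad = {"order_id": 2, "quantity": 2, "unit_price": 5.0, "total": 12.0}
incomplete = {"order_id": 3, "quantity": 1}
```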

Data Processing: Summary

• Data processing is the stage where the value of big data is unlocked.
• It involves transforming raw data into insights, which can drive informed decision-making.
• The choice of data processing framework depends on factors like data volume, processing speed requirements, and the complexity of data transformations.
• Effective data processing is essential for generating actionable insights and extracting the full potential of big data.

Data Analysis


• Definition: Data analysis is the systematic process of examining, cleaning, transforming, and interpreting data to discover meaningful patterns and insights.

• Objective: The primary goal of data analysis is to derive actionable information from raw data that can be used for decision-making and problem-solving.

Data Analysis Techniques(1)

• Descriptive Analysis: Descriptive statistics and visualizations are used to summarize and present data, providing an overview of key characteristics.

• Diagnostic Analysis: This stage involves identifying patterns, anomalies, and potential issues in the data. It aims to answer questions like "Why did this happen?"
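
A small sketch of both steps using Python's standard `statistics` module follows. The response-time data is made up, and flagging values more than two standard deviations from the mean is just one simple convention chosen for this example.

```python
import statistics


def describe(values):
    """Descriptive analysis: summarize the data with a few statistics."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
    }


def find_anomalies(values, threshold=2.0):
    """Diagnostic aid: values more than `threshold` stdevs from the mean."""
    summary = describe(values)
    if summary["stdev"] == 0:
        return []
    return [v for v in values
            if abs(v - summary["mean"]) > threshold * summary["stdev"]]


response_times_ms = [10, 12, 11, 13, 12, 50]  # hypothetical measurements
summary = describe(response_times_ms)
outliers = find_anomalies(response_times_ms)
```

The summary describes what the data looks like; the flagged outliers are the starting point for asking why something happened.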

Data Analysis Techniques(2)

• Predictive Analysis: Predictive modeling and machine learning techniques are applied to make future predictions or classifications based on historical data.

• Prescriptive Analysis: This advanced stage recommends actions or decisions based on the insights generated, guiding organizations on what steps to take.
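
As a minimal sketch of the step from prediction to prescription, the code below fits a straight line to historical values by ordinary least squares and then applies a simple rule to the forecast. The monthly sales figures, the stock level and the reorder rule are all invented for this illustration; real predictive models are usually far richer.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    intercept, slope = model
    return intercept + slope * x


months = [1, 2, 3, 4]
units_sold = [3, 5, 7, 9]            # hypothetical history

model = fit_line(months, units_sold)
forecast = predict(model, 5)          # predictive: what will happen?

stock_on_hand = 8                     # hypothetical inventory level
# Prescriptive: recommend an action based on the forecast.
action = "reorder" if forecast > stock_on_hand else "hold"
```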

Data Mining and Machine Learning:

• Data mining techniques, such as clustering, classification, and association rule mining, are used to discover hidden patterns in data.

• Machine learning algorithms, including supervised and unsupervised learning, are applied to build predictive models.

Big Data Analytics Tools:

• Big data analytics tools include platforms like Apache Hadoop for distributed data processing and storage, Apache Spark for in-memory data processing, and languages like R and Python for statistical analysis.

• Data analytics platforms such as SAS and IBM SPSS, and open-source tools like Jupyter and scikit-learn, are commonly used for data analysis.

Visualization and Reporting

• Data visualization tools (e.g., Tableau, Power BI, D3.js) are employed to create interactive charts, graphs, and dashboards for presenting insights.

• Reporting solutions generate reports that summarize key findings and recommendations for stakeholders.

Iterative and Exploratory
Process
• Data analysis is often an iterative and exploratory
process. Analysts may need to revisit data processing
and analysis steps to refine their understanding of the
data and generate deeper insights.

Insights for Decision-Making

• The ultimate goal of data analysis is to provide actionable insights that support decision-making, inform strategies, and drive organizational improvements.

Data Analysis: Summary

• Data analysis is the stage where the raw data is transformed into knowledge.
• It involves exploring data, building models, and visualizing results.
• Effective data analysis helps organizations make informed decisions, optimize processes, and gain a competitive edge by leveraging the power of data-driven insights.

Q&A

End

BIG DATA SYSTEMS
Contact Session-2
BITS Pilani Email: kanantharaman@wilp.bits-pilani.ac.in
Pilani Campus
Agenda

➢ Distributed Computing - Design Strategy
➢ Divide-and-conquer for Parallel / Distributed Systems - Basic scenarios and Implications

Divide-and-conquer for Parallel / Distributed Systems

Distributed Computing Design Strategy:
• Think of distributed computing as teamwork for computers. Instead of one computer doing all the work, many computers work together to solve a big task.
• Example: It's like a team of chefs in a restaurant kitchen; each chef has a specific job to prepare a delicious meal.
• Some common scenarios where divide-and-conquer is used include sorting, searching, and matrix multiplication.

Divide-and-conquer Strategy
in Big Data

Divide-and-conquer Strategy
Basic Scenarios
1. Imagine you have a big pile of LEGO bricks. Instead of trying to build a
whole LEGO castle at once, you break it down into smaller parts, like walls,
towers, and a drawbridge. Then, you build those smaller parts one by one,
making it easier to assemble the entire castle.
2. When you shop online, your order needs to be processed quickly.
Websites use divide-and-conquer when they split your order into parts, like
finding items in the warehouse, checking them, and packing them. Different
computers handle each part, making your order ready faster.
3. In video games, making characters move realistically can be hard. Game
designers use divide-and-conquer to break the problem into pieces, like
walking, running, and jumping. Each action is handled by different parts of
the computer to make the game smooth.
4. Divide-and-conquer helps things work faster and better. It's like
teamwork in a relay race. One runner passes the baton to the next, and
they all work together to finish the race. In the end, the whole team is faster
because they divided the task.

Programming patterns

Parallel merge sort

• Parallel merge sort is a technique that takes advantage of multiple processors or threads to speed up the sorting process.
• The idea is to divide the array into smaller parts, sort each part independently, and then merge the sorted parts together.

• Here's a simplified pseudocode for parallel merge sort:

Pseudocode for parallel merge sort

• This algorithm sorts a list recursively by dividing the list into smaller pieces, sorting the smaller pieces, and merging them during reassembly of the list.

• Create two threads (or processes) to sort the parts in parallel.
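
The following is a minimal Python sketch of the idea, not the pseudocode from the original slide. It splits the list once, sorts the two halves in two worker threads, and merges the results. Threads are used here only to show the structure: in standard CPython they do not speed up CPU-bound sorting, so a process pool would be the usual choice for a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result


def merge_sort(items):
    """Ordinary sequential merge sort, used inside each worker."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


def parallel_merge_sort(items):
    """Sort the two halves in two threads, then merge the sorted halves."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(merge_sort, items[:mid])
        right = pool.submit(merge_sort, items[mid:])
        return merge(left.result(), right.result())


sorted_values = parallel_merge_sort([5, 2, 9, 1, 5, 6])
```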

Data-Parallel Programs

• Data-parallel programming is a pattern where the same operation is applied to multiple data elements in parallel. It's like an assembly line where each worker does the same task simultaneously.

• This pattern is useful for tasks like element-wise operations on arrays, such as adding numbers or filtering data points.

Map as a Construct

• The "map" operation is a fundamental concept in data-parallel programming.
• It involves applying a function to each item in a collection and generating a new collection of results.

• Think of it as applying the same operation to every item in a list.
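
In Python, the built-in `map` expresses this directly, and an executor's `map` applies the same function across worker threads. The helper name `parallel_map` and the `square` function are illustrative choices for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def parallel_map(func, items, workers=4):
    """Apply func to every item using a pool of threads; order is preserved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


numbers = [1, 2, 3, 4]
sequential = list(map(square, numbers))     # one item after another
in_parallel = parallel_map(square, numbers)  # same result, spread over workers
```

Because each item is processed independently, the work can be split across any number of workers without changing the result.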

Tree-Parallelism

• Tree-parallelism extends the idea of data-parallelism to hierarchical data structures, like trees. In this pattern, you process tree nodes in parallel, which is common in tasks like parsing XML or JSON data.

Reduce as a Construct

• The "reduce" operation is another key concept. It combines multiple values into a single result. It's like summarizing the results of parallel computations into one final output, such as finding the sum of all numbers.
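
A short sketch with `functools.reduce` follows. To mirror how parallel systems work, the list is first split into chunks, each chunk is reduced to a partial sum (the part that could run on separate workers), and the partial sums are then reduced to the final result. The helper names are invented for this example.

```python
from functools import reduce
import operator


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def chunked_sum(items, chunk_size=3):
    # Each chunk could be reduced by a different worker.
    partial_sums = [reduce(operator.add, chunk, 0)
                    for chunk in chunked(items, chunk_size)]
    # A final reduce combines the partial results into one value.
    return reduce(operator.add, partial_sums, 0)


total = chunked_sum(list(range(1, 11)))
```

This two-level scheme works because addition is associative, so the grouping of the values does not affect the final sum.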

Map-Reduce Model

• Map-reduce is a programming model for processing and generating large datasets.
• It's widely used in distributed computing. The model consists of two main operations: "map" and "reduce."

Map-Reduce Model: Examples

Map Example: Consider a dataset of words in which you want to count how many times each word appears.
• The "map" step would split the text into words and assign a count of 1 to each word. For example, "apple" -> 1, "banana" -> 1.

Reduce Example: After the "map" step, you have a list of words with counts.
• The "reduce" step would group all instances of the same word together and sum their counts. So, "apple" -> (1, 1, 1) -> 3.

11/2/23 CCC ZG522 - BDS S1-23-24 14


BITS Pilani, Pilani Campus
Map-Reduce Combinations(1)

• You can combine "map" and "reduce" operations in more
complex ways.

• For instance, in sentiment analysis, the "map" step might
assign positive or negative sentiment scores to words,
and the "reduce" step could aggregate the sentiment for
an entire document.
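
The sentiment example can be sketched in plain Python as follows. The lexicon, the documents, and the function names map_sentiment and reduce_sentiment are all hypothetical; a real system would use a proper sentiment model and a distributed runtime.

```python
from collections import defaultdict

# Hypothetical sentiment lexicon: word -> score.
LEXICON = {"good": 1, "great": 2, "bad": -1, "awful": -2}

# Hypothetical input documents keyed by document ID.
documents = {
    "doc1": "great phone good battery",
    "doc2": "bad screen awful support good price",
}

def map_sentiment(doc_id, text):
    """Map: emit (doc_id, score) per word; unknown words score 0."""
    return [(doc_id, LEXICON.get(word, 0)) for word in text.split()]

def reduce_sentiment(doc_id, scores):
    """Reduce: aggregate all word scores of one document."""
    return doc_id, sum(scores)

# Map phase
pairs = []
for doc_id, text in documents.items():
    pairs.extend(map_sentiment(doc_id, text))

# Group intermediate values by key (the shuffle step)
grouped = defaultdict(list)
for doc_id, score in pairs:
    grouped[doc_id].append(score)

# Reduce phase
document_sentiment = dict(reduce_sentiment(d, s) for d, s in grouped.items())
print(document_sentiment)
```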

11/2/23 CCC ZG522 - BDS S1-23-24 15


BITS Pilani, Pilani Campus
Map-Reduce Combinations(2)

• In some scenarios, you need to apply map-reduce
multiple times in an iterative fashion.

• For example, in graph processing, you may repeatedly
apply map-reduce to traverse and analyze a graph until
you find specific patterns or insights.
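
One way to picture iterative map-reduce is computing hop distances from a start node: each round maps every node to candidate distances for its neighbours, reduces by taking the minimum, and the loop stops when a round changes nothing. The graph and all function names below are assumptions made for this sketch; it runs in a single process rather than on a cluster.

```python
import math

# Hypothetical directed graph as adjacency lists.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}

def map_step(node, distance, neighbours):
    """Emit the node's own distance and a candidate for each neighbour."""
    yield node, distance
    if distance != math.inf:
        for neighbour in neighbours:
            yield neighbour, distance + 1

def reduce_step(candidates):
    """Keep the shortest candidate distance seen for a node."""
    return min(candidates)

def one_round(distances):
    """Run one map-reduce pass over the whole graph."""
    grouped = {}
    for node, dist in distances.items():
        for key, value in map_step(node, dist, graph[node]):
            grouped.setdefault(key, []).append(value)
    return {node: reduce_step(values) for node, values in grouped.items()}

distances = {node: math.inf for node in graph}
distances["A"] = 0  # start node

rounds = 0
while True:
    updated = one_round(distances)
    rounds += 1
    if updated == distances:  # converged: nothing changed in this round
        break
    distances = updated

print(distances, "after", rounds, "rounds")
```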

11/2/23 CCC ZG522 - BDS S1-23-24 16


BITS Pilani, Pilani Campus
Programming patterns:
Summary
• These programming patterns and the map-reduce model
are essential for efficiently processing large datasets in
parallel and distributed computing environments.

• They provide a framework for breaking down complex
tasks into smaller, parallelizable components and
aggregating their results to solve a wide range of
problems, from counting words in text to analyzing vast
datasets.

11/2/23 CCC ZG522 - BDS S1-23-24 17


BITS Pilani, Pilani Campus
BIG DATA SYSTEMS
Contact Session-6
Overview of Hadoop
BITS Pilani Email: kanantharaman@wilp.bits-pilani.ac.in
Pilani Campus
Learning Objectives and
Learning Outcomes

11/29/2023 CCC ZG522 - BDS S1-23-24 2


BITS Pilani, Pilani Campus
Agenda
➢ Hadoop - An Introduction
➢ RDBMS versus Hadoop
➢ Distributed Computing Challenges
➢ History of Hadoop
➢ Hadoop Overview
❖ Key Aspects of Hadoop
❖ Hadoop Components
❖ High Level Architecture of Hadoop
➢ Use case for Hadoop
➢ ClickStream Data
➢ Hadoop Distributors

HDFS
➢ HDFS Daemons
➢ Anatomy of File Read
➢ Anatomy of File Write
➢ Replica Placement Strategy
➢ Working with HDFS commands
➢ Special Features of HDFS

11/29/2023 CCC ZG522 - BDS S1-23-24 3


BITS Pilani, Pilani Campus
Hadoop

• Processing Data with Hadoop
➢ What is MapReduce Programming?
➢ How does MapReduce Work?
➢ MapReduce Word Count Example

• Managing Resources and Applications with Hadoop YARN
➢ Limitations of Hadoop 1.0 Architecture
➢ Hadoop 2 YARN: Taking Hadoop Beyond Batch
• Hadoop Ecosystem
➢ Pig
➢ Hive
➢ Sqoop
➢ HBase
11/29/2023 CCC ZG522 - BDS S1-23-24 4
BITS Pilani, Pilani Campus
What is Hadoop

Why has Hadoop been, and why does it remain, one of the most wanted
technologies?
• The key consideration (the rationale behind its huge popularity) is
its capability to handle massive amounts of data, and different
categories of data, fairly quickly.
• The other considerations are:

• Commodity hardware
• Distributed computing
• Scalable horizontally
• Flexible storage
• Failsafe

11/29/2023 CCC ZG522 - BDS S1-23-24 5


BITS Pilani, Pilani Campus
A Hadoop Cluster

11/29/2023 CCC ZG522 - BDS S1-23-24 6


BITS Pilani, Pilani Campus
Why not RDBMS?

• Not suitable for processing large files, images, and videos.

• Storage cost per GB is high.

11/29/2023 CCC ZG522 - BDS S1-23-24 7


BITS Pilani, Pilani Campus
RDBMS versus HADOOP

11/29/2023 CCC ZG522 - BDS S1-23-24 8


BITS Pilani, Pilani Campus
Distributed computing
challenges (1)
• Hardware failure
• Node/Disk Failure
• Hadoop handles it by using replication

11/29/2023 CCC ZG522 - BDS S1-23-24 9


BITS Pilani, Pilani Campus
Distributed computing
Challenges(2)
• How to Process This Gigantic Store of Data?
• In a distributed system, the data is spread across the
network on several machines.

• A key challenge here is to integrate the data available
on several machines prior to processing it.

• Hadoop solves this problem by using MapReduce
programming. It is a programming model to process the
data (MapReduce programming will be discussed a little
later).

11/29/2023 CCC ZG522 - BDS S1-23-24 10


BITS Pilani, Pilani Campus
The Name “Hadoop”

The name Hadoop is not an acronym; it’s a made-up name.

The project creator, Doug Cutting, explains how the name
came about: “The name my kid gave a stuffed yellow
elephant. Short, relatively easy to spell and pronounce,
meaningless, and not used elsewhere: those are my
naming criteria. Kids are good at generating such.
Googol is a kid’s term.”

11/29/2023 CCC ZG522 - BDS S1-23-24 11


BITS Pilani, Pilani Campus
History of Hadoop

11/29/2023 CCC ZG522 - BDS S1-23-24 12


BITS Pilani, Pilani Campus
Key Aspects of Hadoop

11/29/2023 CCC ZG522 - BDS S1-23-24 13


BITS Pilani, Pilani Campus
Hadoop Components(1)

11/29/2023 CCC ZG522 - BDS S1-23-24 14


BITS Pilani, Pilani Campus
Hadoop Components(2)

Hadoop Core Components:


HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.

MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.

11/29/2023 CCC ZG522 - BDS S1-23-24 15


BITS Pilani, Pilani Campus
Hadoop Eco System

Hadoop Ecosystem: The Hadoop ecosystem consists of supporting
projects that enhance the functionality of the Hadoop core
components.
The ecosystem projects are as follows:
1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE
7. MAHOUT

11/29/2023 CCC ZG522 - BDS S1-23-24 16


BITS Pilani, Pilani Campus
Hadoop Conceptual Layers

It is conceptually divided into:


• Data Storage Layer which stores huge volumes of data

• Data Processing Layer which processes data in
parallel to extract richer and more meaningful insights
from data.

11/29/2023 CCC ZG522 - BDS S1-23-24 17


BITS Pilani, Pilani Campus
Hadoop High Level
Architecture
Master HDFS: Its main responsibility is
partitioning the data storage across the
slave nodes. It also keeps track of the
locations of data on DataNodes.

Master MapReduce: It decides and
schedules computation tasks on slave
nodes.

11/29/2023 CCC ZG522 - BDS S1-23-24 18


BITS Pilani, Pilani Campus
Use case for Hadoop

• Clickstream data (mouse clicks) helps you to
understand the purchasing behavior of customers.

• Clickstream analysis helps online marketers to optimize
their product web pages, promotional content, etc. to
improve their business.

11/29/2023 CCC ZG522 - BDS S1-23-24 19


BITS Pilani, Pilani Campus
Hadoop Distributors

11/29/2023 CCC ZG522 - BDS S1-23-24 20


BITS Pilani, Pilani Campus
HDFS
(HADOOP DISTRIBUTED FILE SYSTEM)

11/29/2023 CCC ZG522 - BDS S1-23-24 21


BITS Pilani, Pilani Campus
Hadoop Distributed File
System
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after Google File System.
4. Optimized for high throughput (HDFS leverages large block size and
moves computation where data is stored).
5. You can replicate a file a configured number of times, which makes
HDFS tolerant of both software and hardware failures.
6. Automatically re-replicates data blocks that were stored on failed
nodes onto other nodes.
7. You can realize the power of HDFS when you perform reads or writes
on large files (gigabytes and larger).
8. Sits on top of a native file system such as ext3 or ext4.

11/29/2023 CCC ZG522 - BDS S1-23-24 22


BITS Pilani, Pilani Campus
HDFS Daemons

NameNode:
• Single NameNode per cluster.
• Keeps the metadata details

DataNode:
• Multiple DataNodes per cluster
• Read/Write operations

SecondaryNameNode:
• Housekeeping Daemon

11/29/2023 CCC ZG522 - BDS S1-23-24 23


BITS Pilani, Pilani Campus
Anatomy of File Read

11/29/2023 CCC ZG522 - BDS S1-23-24 24


BITS Pilani, Pilani Campus
Anatomy of File Write

11/29/2023 CCC ZG522 - BDS S1-23-24 25


BITS Pilani, Pilani Campus
Replica Placement Strategy

• As per the Hadoop replica placement strategy, the first replica is placed
on the same node as the client. The second replica is placed on a node on
a different rack. The third replica is placed on the same rack as the
second, but on a different node in that rack. Once replica locations have
been set, a pipeline is built. This strategy provides good reliability.
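
The placement rule can be sketched as a small function. This is an illustrative model only, not HDFS code: the topology dictionary, the node names, and the helpers rack_of and place_replicas are hypothetical, and the choice among eligible nodes is simply random here.

```python
import random

# Hypothetical cluster topology: rack name -> DataNodes on that rack.
topology = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8"],
}

def rack_of(node):
    """Return the rack that hosts the given DataNode."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node: {node}")

def place_replicas(client_node, rng=random):
    """Pick three replica locations following the rule on the slide."""
    # 1st replica: the node the client is running on.
    first = client_node
    # 2nd replica: some node on a different rack.
    remote_racks = [r for r in topology if r != rack_of(first)]
    second_rack = rng.choice(remote_racks)
    second = rng.choice(topology[second_rack])
    # 3rd replica: same rack as the second, but a different node.
    third = rng.choice([n for n in topology[second_rack] if n != second])
    return [first, second, third]

pipeline = place_replicas("dn2", random.Random(0))
print(pipeline)
```

With this layout the data survives the loss of a whole rack, while two of the three replicas share a rack so that only one copy has to cross racks during the write.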

11/29/2023 CCC ZG522 - BDS S1-23-24 26


BITS Pilani, Pilani Campus
Working with HDFS
Commands
Objective: To create a directory (say, sample) in HDFS.
Act:
hadoop fs -mkdir /sample

Objective: To copy a file from local file system to HDFS.


Act:
hadoop fs -put /root/sample/test.txt /sample/test.txt

Objective: To copy a file from HDFS to local file system.


Act:
hadoop fs -get /sample/test.txt /root/sample/testsample.txt

11/29/2023 CCC ZG522 - BDS S1-23-24 27


BITS Pilani, Pilani Campus
HDFS Snapshots

11/29/2023 CCC ZG522 - BDS S1-23-24 28


BITS Pilani, Pilani Campus
Special Features of HDFS

Data Replication: There is absolutely no need for a client
application to track all blocks. The NameNode directs the
client to the nearest replica to ensure high performance.

Data Pipeline: A client application writes a block to the first
DataNode in the pipeline. Then this DataNode takes
over and forwards the data to the next node in the
pipeline. This process continues for all the data blocks,
and subsequently all the replicas are written to disk.

11/29/2023 CCC ZG522 - BDS S1-23-24 29


BITS Pilani, Pilani Campus
Hadoop

Some Additional points about Hadoop

11/29/2023 CCC ZG522 - BDS S1-23-24 30


BITS Pilani, Pilani Campus
Hadoop

11/29/2023 CCC ZG522 - BDS S1-23-24 31


BITS Pilani, Pilani Campus
Contributing Organizations

11/29/2023 CCC ZG522 - BDS S1-23-24 32


BITS Pilani, Pilani Campus
Understanding Usage Mode of
Hadoop

• Development of New tools

11/29/2023 CCC ZG522 - BDS S1-23-24 33


BITS Pilani, Pilani Campus
Hadoop History

Hadoop is also used by many smaller companies because it is open source and cheap.

11/29/2023 CCC ZG522 - BDS S1-23-24 34


BITS Pilani, Pilani Campus
Hadoop Handles Big Data

11/29/2023 CCC ZG522 - BDS S1-23-24 35


BITS Pilani, Pilani Campus
Hadoop addresses Data
throughput mismatch (1)

• Due to slow disk transfer rate

11/29/2023 CCC ZG522 - BDS S1-23-24 36


BITS Pilani, Pilani Campus
Hadoop addresses Data
throughput mismatch (2)

1 TB = 10**6 MB
Transfer speed = 100 MB/sec
Time taken = 10**6 / 100 = 10**4 sec = 10,000 sec ≈ 167 minutes
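
The same arithmetic as a short script, extended with one assumption that is not on the slide: if the data were spread evenly over 100 disks read in parallel, the read time would drop proportionally. The function name read_time_seconds and the figure of 100 disks are illustrative.

```python
DATA_MB = 10**6        # 1 TB expressed in MB, as on the slide
RATE_MB_PER_S = 100    # transfer rate of a single disk

def read_time_seconds(data_mb, rate_mb_per_s, disks=1):
    """Time to read data split evenly over `disks` drives read in parallel."""
    return data_mb / (rate_mb_per_s * disks)

single = read_time_seconds(DATA_MB, RATE_MB_PER_S)
spread = read_time_seconds(DATA_MB, RATE_MB_PER_S, disks=100)  # assumed disk count

print(f"1 disk:    {single:.0f} s (~{single / 60:.0f} min)")
print(f"100 disks: {spread:.0f} s")
```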
11/29/2023 CCC ZG522 - BDS S1-23-24 37
BITS Pilani, Pilani Campus
Hadoop Design Principles

11/29/2023 CCC ZG522 - BDS S1-23-24 38


BITS Pilani, Pilani Campus
Hadoop Version -1

11/29/2023 CCC ZG522 - BDS S1-23-24 39


BITS Pilani, Pilani Campus
Hadoop-1 Eco System

11/29/2023 CCC ZG522 - BDS S1-23-24 40


BITS Pilani, Pilani Campus
Hadoop-1 limitations

11/29/2023 CCC ZG522 - BDS S1-23-24 41


BITS Pilani, Pilani Campus
YARN Overview
(We shall discuss more about YARN in the later part of the course)

11/29/2023 CCC ZG522 - BDS S1-23-24 42


BITS Pilani, Pilani Campus
YARN Solution (Hadoop-2
approach)

11/29/2023 CCC ZG522 - BDS S1-23-24 43


BITS Pilani, Pilani Campus
Hadoop Version-2

For instance, in version 1 of MapReduce, the number of mapper and reducer
slots was fixed. In version 2, the MapReduce framework, which is now a YARN
application, can dynamically adjust the number of mappers and reducers at
runtime.
11/29/2023 CCC ZG522 - BDS S1-23-24 44
BITS Pilani, Pilani Campus
Hadoop Eco System-2

11/29/2023 CCC ZG522 - BDS S1-23-24 45


BITS Pilani, Pilani Campus
Hadoop Yarn Beyond Map
reduce

11/29/2023 CCC ZG522 - BDS S1-23-24 46


BITS Pilani, Pilani Campus
Hadoop YARN - Features

• Move Beyond Java

11/29/2023 CCC ZG522 - BDS S1-23-24 47


BITS Pilani, Pilani Campus
Hadoop 2.x or Greater
Versions

11/29/2023 CCC ZG522 - BDS S1-23-24 48


BITS Pilani, Pilani Campus
Hadoop V2 components

11/29/2023 CCC ZG522 - BDS S1-23-24 49


BITS Pilani, Pilani Campus
Hadoop-2 Eco system
components (1)
Apache Pig is a high-level language for creating MapReduce programs
used with Hadoop.
Apache Hive is a data warehouse infrastructure built on top of Hadoop for
providing data summarization, ad-hoc queries, and the analysis of large
datasets using a SQL-like language called HiveQL.
Apache HCatalog is a table and storage management service for data
created using Apache Hadoop.
Apache HBase (Hadoop Database) is a distributed, scalable,
column-oriented database, similar to Google Bigtable.
Apache ZooKeeper is a centralized service used by applications for
maintaining configuration, health, etc. on and between nodes.

11/29/2023 CCC ZG522 - BDS S1-23-24 50


BITS Pilani, Pilani Campus
Hadoop-2 Eco system
components (2)
Apache Ambari is a tool for provisioning, managing, and monitoring
Apache Hadoop clusters.
Apache Oozie is a workflow/coordination system to manage multistage
Hadoop jobs.
Apache Avro is a remote procedure call and serialization framework.
Apache Cassandra is a distributed database designed to handle large
amounts of data across many commodity servers.
Apache Sqoop is a tool designed for efficiently transferring bulk data
between Hadoop and relational databases.
Apache Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data. We will
cover Flume in this tutorial.
Apache Mahout is a scalable machine learning library that implements
many different approaches to machine learning.

11/29/2023 CCC ZG522 - BDS S1-23-24 51


BITS Pilani, Pilani Campus
Apache Tez

11/29/2023 CCC ZG522 - BDS S1-23-24 52


BITS Pilani, Pilani Campus
Processing with Hadoop

11/29/2023 CCC ZG522 - BDS S1-23-24 53


BITS Pilani, Pilani Campus
Introduction- MapReduce
Programming
• In MapReduce programming, jobs (applications) are
split into a set of map tasks and reduce tasks. These
tasks are then executed in a distributed fashion on the
Hadoop cluster.

• Each task processes the small subset of data that has been
assigned to it. This way, Hadoop distributes the load
across the cluster.

• A MapReduce job takes a set of files stored in
HDFS (Hadoop Distributed File System) as input.

11/29/2023 CCC ZG522 - BDS S1-23-24 54


BITS Pilani, Pilani Campus
What is MapReduce
Programming?
• MapReduce programming is a software framework that
helps you to process massive amounts of data in parallel.

11/29/2023 CCC ZG522 - BDS S1-23-24 55


BITS Pilani, Pilani Campus
Mapper

11/29/2023 CCC ZG522 - BDS S1-23-24 56


BITS Pilani, Pilani Campus
Mapper

• A mapper maps the input key–value pairs into a set of
intermediate key–value pairs. Maps are individual tasks
that have the responsibility of transforming input records
into intermediate key–value pairs.

A mapper consists of the following phases:


• RecordReader
• Map
• Combiner
• Partitioner

11/29/2023 CCC ZG522 - BDS S1-23-24 57


BITS Pilani, Pilani Campus
Reducer

11/29/2023 CCC ZG522 - BDS S1-23-24 58


BITS Pilani, Pilani Campus
Reducer

• The primary chore of the Reducer is to reduce a set of
intermediate values (the ones that share a common key)
to a smaller set of values.

• The Reducer has three primary phases: Shuffle and
Sort, Reduce, and Output Format.

11/29/2023 CCC ZG522 - BDS S1-23-24 59


BITS Pilani, Pilani Campus
The chores of Mapper, Combiner,
Partitioner, and Reducer

11/29/2023 CCC ZG522 - BDS S1-23-24 60


BITS Pilani, Pilani Campus
How MapReduce Programming
Works

11/29/2023 CCC ZG522 - BDS S1-23-24 61


BITS Pilani, Pilani Campus
MapReduce – Word Count
Data flow

11/29/2023 CCC ZG522 - BDS S1-23-24 62


BITS Pilani, Pilani Campus
Understand how MapReduce
works (3 Nodes)

11/29/2023 CCC ZG522 - BDS S1-23-24 63


BITS Pilani, Pilani Campus
Add a combiner

• A combiner is an optimization technique for a MapReduce job. Generally,
the reducer class is set to be the combiner class. The difference between
the combiner class and the reducer class is as follows:
• Output generated by the combiner is intermediate data and is passed to
the reducer.
• Output of the reducer is passed to the output file on disk.
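
To see why a combiner helps, the sketch below pre-aggregates one mapper's output locally so that fewer key-value pairs have to travel to the reducer. It is a plain-Python model, not the Hadoop API; map_words and combine are names invented for this example.

```python
from collections import Counter

def map_words(line):
    """Map: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum counts locally on the mapper side before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

mapper_output = map_words("the cow jumps over the moon")
combined = combine(mapper_output)

print(len(mapper_output), "pairs before combining")
print(len(combined), "pairs after combining")
```

Reusing the reducer logic as the combiner is safe here because addition is associative and commutative; for operations without those properties (a plain average, for example) the combiner needs its own logic.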
11/29/2023 CCC ZG522 - BDS S1-23-24 64
BITS Pilani, Pilani Campus
MapReduce – Word count
Example: Python Code (1)
Sample Data
# Sample document data
documents = ["the cow jumps over the moon",
"the cow jumps over the moon again"]

➢ Next Define Mapper function

11/29/2023 CCC ZG522 - BDS S1-23-24 65


BITS Pilani, Pilani Campus
Define Mapper function

# Mapper
def mapper(document):
    # Split document string into words
    words = document.split()
    for word in words:
        # Emit each word with a count of 1
        print('%s\t%d' % (word, 1))

11/29/2023 CCC ZG522 - BDS S1-23-24 66


BITS Pilani, Pilani Campus
Apply Mapper function

# Process each document
for document in documents:
    mapper(document)

• Intermediate results
# Sample data mapped to words and counts
# ("the" occurs twice in each document, "moon" once in each)
words = {'the': [1, 1, 1, 1],
         'cow': [1, 1],
         'jumps': [1, 1],
         'over': [1, 1],
         'moon': [1, 1],
         'again': [1]}
11/29/2023 CCC ZG522 - BDS S1-23-24 67
BITS Pilani, Pilani Campus
Define Reducer function

# Reducer
def reducer(word, counts):
    # Sum the counts for each instance of a word
    # (named "total" to avoid shadowing the built-in sum)
    total = 0
    for count in counts:
        total += count
    # Emit word with final count
    print('%s\t%d' % (word, total))

11/29/2023 CCC ZG522 - BDS S1-23-24 68


BITS Pilani, Pilani Campus
Call Reducer for each word

# Call reducer on each word
# (dict.items() replaces the Python 2-only iteritems())
for word, counts in words.items():
    reducer(word, counts)
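
The slides show the mapper, the intermediate data, and the reducer as separate fragments. A runnable sketch that wires them together might look like the following; the shuffle helper is an assumption standing in for Hadoop's shuffle-and-sort phase, and the mapper and reducer return values instead of printing so that they can be chained.

```python
from collections import defaultdict

documents = ["the cow jumps over the moon",
             "the cow jumps over the moon again"]

def mapper(document):
    """Map: emit (word, 1) for each word in the document."""
    for word in document.split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted counts by word (stand-in for shuffle and sort)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reducer(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)

pairs = [pair for doc in documents for pair in mapper(doc)]
grouped = shuffle(pairs)
word_counts = dict(reducer(word, counts) for word, counts in grouped.items())

for word, total in sorted(word_counts.items()):
    print(f"{word}\t{total}")
```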

11/29/2023 CCC ZG522 - BDS S1-23-24 69


BITS Pilani, Pilani Campus
MapReduce – Word count
Example: Python Code (1)

11/29/2023 CCC ZG522 - BDS S1-23-24 70


BITS Pilani, Pilani Campus
MapReduce – Word count
Example: Python Code (2)


11/29/2023 CCC ZG522 - BDS S1-23-24 71


BITS Pilani, Pilani Campus
MapReduce – Word count
Example: Python Code (3)

11/29/2023 CCC ZG522 - BDS S1-23-24 72


BITS Pilani, Pilani Campus
• Understand Hadoop-2 version
MapReduce

11/29/2023 CCC ZG522 - BDS S1-23-24 73


BITS Pilani, Pilani Campus
MapReduce compatibility

11/29/2023 CCC ZG522 - BDS S1-23-24 74


BITS Pilani, Pilani Campus
Binary Compatibility

11/29/2023 CCC ZG522 - BDS S1-23-24 75


BITS Pilani, Pilani Campus
Source Code Compatibility

11/29/2023 CCC ZG522 - BDS S1-23-24 76


BITS Pilani, Pilani Campus
Compatibility of command line
scripts

11/29/2023 CCC ZG522 - BDS S1-23-24 77


BITS Pilani, Pilani Campus
MapReduce History Server

11/29/2023 CCC ZG522 - BDS S1-23-24 79


BITS Pilani, Pilani Campus
• Benchmarking programs test the performance and
health of a Hadoop cluster.
• Provided by Apache.org

We shall discuss five commonly used benchmark
programs.

11/29/2023 CCC ZG522 - BDS S1-23-24 80


BITS Pilani, Pilani Campus
Benchmark Programs (1)
(To Test workload)
TeraSort :
• Tests the cluster's ability to handle a large-scale sort
workload.
• It sorts 1 TB of randomly generated 100 byte records.
• This tests both the MapReduce framework's
performance as well as the HDFS throughput.
Expected results:
• Total time to sort 1 terabyte of data
• MapReduce job execution time
• Data throughput rate (GB/min)
• Processor utilization
11/29/2023 CCC ZG522 - BDS S1-23-24 81
BITS Pilani, Pilani Campus
Benchmark Programs(2)

TestDFSIO:
• Benchmark tool to test HDFS throughput for large file
reads/writes.
• It measures aggregate read/write throughput across the
cluster by testing them under varied conditions and file
sizes.
Expected Operational Metrics:
• Read throughput (MB/sec)
• Write throughput (MB/sec)
• Average I/O rate

11/29/2023 CCC ZG522 - BDS S1-23-24 82


BITS Pilani, Pilani Campus
Benchmark Programs(3)

MRBench :
• Helpful for testing incremental changes in the system.
• Micro benchmark suites to test various aspects of
MapReduce operation.
• Includes tests like sort, grep, aggregation, join, statistics
metrics etc.
Expected Operational Metrics:
• Completion time for sort, word count etc
• Processor utilization percentage
• Mapper/reducer task numbers
• Map input/output records
11/29/2023 CCC ZG522 - BDS S1-23-24 83
BITS Pilani, Pilani Campus
Benchmark Programs(4)

GridMix
• Creates a controllable MapReduce workload mix to
simulate production load.
• Workloads can be customized to match variety of use
cases based on needs.
• Helps test cluster performance for production
environments.
Expected Operational Metrics:
• MapReduce job latency distribution
• Overall execution time
• Network utilization
• CPU/memory usage per node
11/29/2023 CCC ZG522 - BDS S1-23-24 84
BITS Pilani, Pilani Campus
Benchmark Programs(5)

YARN benchmarking

• Tests the ResourceManager and the ability of YARN to
schedule under load.

Expected Operational Metrics:


• Scheduling latency
• Node utilization levels
• Job queue lengths
• Time for queues to drain under simulated load
• Improved resource allocation over time
11/29/2023 CCC ZG522 - BDS S1-23-24 85
BITS Pilani, Pilani Campus
Job tracking and Monitoring:

• YARN in Hadoop 2 helps track jobs and tasks, and provides
logs for debugging.

11/29/2023 CCC ZG522 - BDS S1-23-24 86


BITS Pilani, Pilani Campus
Job Tracking and Monitoring

• YARN ResourceManager has a built-in web UI and CLI
to track execution and status of MapReduce jobs.

• Provides details like job ID, start/end times, and
mapper/reducer details.

• Metrics on CPU, memory, and shuffle/sort numbers per job.

11/29/2023 CCC ZG522 - BDS S1-23-24 87


BITS Pilani, Pilani Campus
Task Tracking

• MapReduce and Monitoring Protocol tracks task
attempts from initialization to completion.
• Failed tasks are flagged and diagnostic information is
logged.
• Task logs are aggregated and made available for
debugging.

11/29/2023 CCC ZG522 - BDS S1-23-24 88


BITS Pilani, Pilani Campus
Logs

• YARN collects logs from containers running mapper and
reducer attempts.
• Logs include stdout, stderr, syslog, and ApplicationMaster
logs.
• Logs are aggregated and stored on HDFS.
• Can be accessed using the yarn logs CLI along with the
application ID (yarn logs -applicationId <application ID>).

11/29/2023 CCC ZG522 - BDS S1-23-24 89


BITS Pilani, Pilani Campus
End

11/29/2023 CCC ZG522 - BDS S1-23-24 90


BITS Pilani, Pilani Campus
BIG DATA SYSTEMS
Contact Session-7 HDFS YARN
BITS Pilani
Pilani Campus kanantharaman@wilp.bits-pilani.ac.in
Hadoop 2 YARN: Taking Hadoop
beyond Batch
(Job Tracker changes)

11/29/23 CCC ZG522 - BDS S1-23-24 2


BITS Pilani, Pilani Campus
Apache Hadoop-2 YARN (1)

Provides 5 new features

11/29/23 CCC ZG522 - BDS S1-23-24 3


BITS Pilani, Pilani Campus
Hadoop 2.2-YARN Release

11/29/23 CCC ZG522 - BDS S1-23-24 4


BITS Pilani, Pilani Campus
YARN Components

11/29/23 CCC ZG522 - BDS S1-23-24 5


BITS Pilani, Pilani Campus
YARN Architecture

MPI – Message Passing Interface, MRAM – MapReduce ApplicationMaster,
MPIAM – MPI ApplicationMaster
11/29/23 CCC ZG522 - BDS S1-23-24 6
BITS Pilani, Pilani Campus
Resource Manager

11/29/23 CCC ZG522 - BDS S1-23-24 7


BITS Pilani, Pilani Campus
YARN Containers

11/29/23 CCC ZG522 - BDS S1-23-24 8


BITS Pilani, Pilani Campus
Node Manager

11/29/23 CCC ZG522 - BDS S1-23-24 9


BITS Pilani, Pilani Campus
Application Master

11/29/23 CCC ZG522 - BDS S1-23-24 10


BITS Pilani, Pilani Campus
Job History server

11/29/23 CCC ZG522 - BDS S1-23-24 11


BITS Pilani, Pilani Campus
YARN workflow Scheduling

11/29/23 CCC ZG522 - BDS S1-23-24 12


BITS Pilani, Pilani Campus
YARN Scheduling options

11/29/23 CCC ZG522 - BDS S1-23-24 13


BITS Pilani, Pilani Campus
FIFO Scheduler

11/29/23 CCC ZG522 - BDS S1-23-24 14


BITS Pilani, Pilani Campus
Capacity Scheduler

11/29/23 CCC ZG522 - BDS S1-23-24 15


BITS Pilani, Pilani Campus
Fair Scheduler

11/29/23 CCC ZG522 - BDS S1-23-24 16


BITS Pilani, Pilani Campus
Understand the application life cycle in
YARN

11/29/23 CCC ZG522 - BDS S1-23-24 17


BITS Pilani, Pilani Campus
Client Request to Resource
Manager

11/29/23 CCC ZG522 - BDS S1-23-24 18


BITS Pilani, Pilani Campus
Application – Node Manager
Interaction

11/29/23 CCC ZG522 - BDS S1-23-24 19


BITS Pilani, Pilani Campus
YARN Summary

BITS Pilani, Pilani Campus


YARN (1)

• The fundamental idea behind this architecture is splitting
the JobTracker responsibilities of resource management
and job scheduling/monitoring into separate daemons.

• The daemons/processes that are part of the YARN
architecture are described below:

11/29/23 CCC ZG522 - BDS S1-23-24 21


BITS Pilani, Pilani Campus
YARN (2)

A Global ResourceManager: Its main responsibility is to
distribute resources among various applications in the
system. It has two main components:

(a) Scheduler: The pluggable scheduler of the
ResourceManager decides allocation of resources to
various running applications. The scheduler is just that,
a pure scheduler, meaning it does NOT monitor or track
the status of the application.

11/29/23 CCC ZG522 - BDS S1-23-24 22


BITS Pilani, Pilani Campus
YARN(3)

(b) ApplicationsManager: The ApplicationsManager does the
following:
• Accepts job submissions.
• Negotiates the first container for executing the
application-specific ApplicationMaster.
• Restarts the ApplicationMaster in case of failure.

11/29/23 CCC ZG522 - BDS S1-23-24 23


BITS Pilani, Pilani Campus
YARN(4)

• NodeManager: This is a per-machine slave daemon.
The NodeManager's responsibility is launching the application
containers for application execution.

• The NodeManager monitors resource usage such as
memory, CPU, disk, network, etc. It then reports the
usage of resources to the global ResourceManager.

11/29/23 CCC ZG522 - BDS S1-23-24 24


BITS Pilani, Pilani Campus
YARN(5)

Per-application ApplicationMaster: This is an
application-specific entity. Its responsibility is to negotiate
required resources for execution from the ResourceManager.

• It works along with the NodeManager for executing and
monitoring component tasks.

11/29/23 CCC ZG522 - BDS S1-23-24 25


BITS Pilani, Pilani Campus
Basic Concepts YARN

11/29/23 CCC ZG522 - BDS S1-23-24 26


BITS Pilani, Pilani Campus
YARN(6)

1. Application: a job submitted to the framework.
   Example: a MapReduce job.

2. Container:
   – Basic unit of allocation.
   – Fine-grained resource allocation across multiple resource types
     (memory, CPU, disk, network, etc.), for example:
     (a) container_0 = 2 GB, 1 CPU
     (b) container_1 = 1 GB, 6 CPU
   – Replaces the fixed map/reduce slots.
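
A toy model of container-based allocation, using the two container sizes from the slide. The Resources class, the 8 GB / 8 core node capacity, and the third request are assumptions added for illustration; YARN's real scheduler is considerably more involved.

```python
from dataclasses import dataclass

@dataclass
class Resources:
    memory_gb: int
    vcores: int

    def fits(self, request):
        """True if the request can be satisfied from these resources."""
        return (request.memory_gb <= self.memory_gb
                and request.vcores <= self.vcores)

    def subtract(self, request):
        """Resources left after granting the request."""
        return Resources(self.memory_gb - request.memory_gb,
                         self.vcores - request.vcores)

# Hypothetical free capacity on one node.
node_free = Resources(memory_gb=8, vcores=8)

requests = {
    "container_0": Resources(2, 1),  # from the slide
    "container_1": Resources(1, 6),  # from the slide
    "container_2": Resources(4, 4),  # hypothetical extra request
}

granted = []
for name, request in requests.items():
    if node_free.fits(request):
        node_free = node_free.subtract(request)
        granted.append(name)

print("granted:", granted)
print("remaining:", node_free)
```

Because each request names its own memory and CPU needs, differently shaped containers can share a node, which is not possible with fixed-size map and reduce slots.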

11/29/23 CCC ZG522 - BDS S1-23-24 27


BITS Pilani, Pilani Campus
YARN Architecture
1. A client program submits the application, which includes
the necessary specifications to launch the
application-specific ApplicationMaster itself.
2. The ResourceManager launches the ApplicationMaster by
assigning some container.
3. The ApplicationMaster, on boot-up, registers with the
ResourceManager.

4. During the normal course, the ApplicationMaster negotiates
appropriate resource containers via the resource-request
protocol.

11/29/23 CCC ZG522 - BDS S1-23-24 28


BITS Pilani, Pilani Campus
YARN- Summary

• The YARN architecture allows Hadoop to support
various types of processing models, beyond the
traditional MapReduce paradigm.
• It enables the execution of diverse workloads, such as
interactive querying, stream processing, and machine
learning, by providing a more flexible and efficient
resource management framework.
• This separation of resource management from job
execution has contributed to the scalability and versatility
of Hadoop clusters.

11/29/23 CCC ZG522 - BDS S1-23-24 29


BITS Pilani, Pilani Campus
Q&A

11/29/23 CCC ZG522 - BDS S1-23-24 30


BITS Pilani, Pilani Campus
END

11/29/23 CCC ZG522 - BDS S1-23-24 31


BITS Pilani, Pilani Campus
BIG DATA SYSTEMS
Contact Session-7
HDFS
BITS Pilani Email: kanantharaman@wilp.bits-pilani.ac.in
Pilani Campus
Hadoop Distributed File
System
Some key Points of Hadoop Distributed File System are as follows:
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after Google File System.
4. Optimized for high throughput (HDFS leverages large block size and moves
computation where data is stored).
5. You can replicate a file a configured number of times, which makes HDFS
tolerant of both software and hardware failures.
6. Automatically re-replicates data blocks that were stored on failed nodes
onto other nodes.
7. You can realize the power of HDFS when you perform reads or writes on large
files (gigabytes and larger).
8. Sits on top of a native file system such as ext3 or ext4.

12/9/23 CCC ZG522 - BDS S1-23-24 2


BITS Pilani, Pilani Campus
Hadoop Distributed file
Systems (1)
• Distributing data over multiple machines is needed when
the data exceeds the capacity of a single machine.
• Distributed filesystems manage storage across a network
of machines.
• They are more complex than regular disk filesystems due to
network complications.
• The main challenge is tolerating node failures without
losing data.

12/9/23 CCC ZG522 - BDS S1-23-24 3


BITS Pilani, Pilani Campus
What is Streaming Data
access Pattern (1)
• Data is continuously generated from various sources like
users, sensors, applications etc.

• Data elements arrive sequentially in a stream rather than
as a persistent store.

• Access is primarily sequential, accessing elements in
arrival order.

12/9/23 CCC ZG522 - BDS S1-23-24 4


BITS Pilani, Pilani Campus
What is Streaming Data
access Pattern (2)
• Often processing needs to happen in real-time or near-
real-time as data arrives

• It is difficult to go back to previous elements after
they've been processed.

• Streaming computations are long-running and
continuous rather than batch oriented.

12/9/23 CCC ZG522 - BDS S1-23-24 5


BITS Pilani, Pilani Campus
The Design Principles of
HDFS
• HDFS is a filesystem designed for:

• Very large files
• Streaming data access
• Commodity hardware

12/9/23 CCC ZG522 - BDS S1-23-24 6


BITS Pilani, Pilani Campus
The Design of HDFS
-Very Large files
• “Very large” in this context means files that are
hundreds of megabytes, gigabytes, or terabytes in size.
There are Hadoop clusters running today that store
petabytes of data.

12/9/23 CCC ZG522 - BDS S1-23-24 7


BITS Pilani, Pilani Campus
The Design of HDFS
-Streaming Data Access
• Write-once, read-many-times pattern
• A dataset is typically generated or copied from source.
• Then various analyses are performed on that dataset
over time
• Each analysis will involve a large proportion, if not all, of
the dataset.
• The time to read the whole dataset is more important
than the latency in reading the first record.

12/9/23 CCC ZG522 - BDS S1-23-24 8


BITS Pilani, Pilani Campus
The Design of HDFS
-Commodity Hardware
• Hadoop doesn’t require expensive, highly reliable
hardware.
• It’s designed to run on clusters of commodity hardware
(commonly available hardware that can be obtained from
multiple vendors), for which the chance of node failure
across the cluster is high, at least for large clusters.
• HDFS is designed to carry on working without a
noticeable interruption to the user in the face of such
failure.

12/9/23 CCC ZG522 - BDS S1-23-24 9


BITS Pilani, Pilani Campus
HDFS Does Not Work Well for: (1)

• Low-latency data access
– Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS.

• HDFS is optimized for delivering high throughput of data, and this
may come at the expense of latency.

12/9/23 CCC ZG522 - BDS S1-23-24 10


BITS Pilani, Pilani Campus
HDFS Does Not Work Well for: (2)

• Lots of small files
• The NameNode holds the filesystem metadata in memory.
• The number of files in the filesystem is therefore limited by the
amount of memory available on the NameNode.
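
A rough back-of-the-envelope estimate of NameNode memory can make the limit concrete. The figure of about 150 bytes per file or block object is a commonly quoted rule of thumb and is an assumption here, not a number from these slides; the function name is also illustrative.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb per file or block object

def namenode_memory_mb(num_files, blocks_per_file=1):
    """Estimate NameNode heap used by file and block metadata, in MB."""
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT / 10**6

one_million = namenode_memory_mb(1_000_000)
one_billion = namenode_memory_mb(1_000_000_000)

print(f"1 million small files: ~{one_million:.0f} MB")
print(f"1 billion small files: ~{one_billion / 1000:.0f} GB")
```

Under these assumptions the cost depends on the number of files and blocks, not on how many bytes they hold, which is why many small files are more expensive than a few large ones.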

12/9/23 CCC ZG522 - BDS S1-23-24 11


BITS Pilani, Pilani Campus
HDFS Does Not Work Well for: (3)

• Multiple writers, arbitrary file modifications
• Files in HDFS may be written to by only a single writer at a time.
• Writes are always made at the end of the file, in append-only
fashion.
• There is no support for multiple writers or for
modifications at arbitrary offsets in the file.

12/9/23 CCC ZG522 - BDS S1-23-24 12


BITS Pilani, Pilani Campus
HDFS Block Size (1)

• A disk has a block size, which is the minimum amount of
data that it can read or write.
• Normal disk block size: 512 bytes (filesystem blocks are
typically a few kilobytes)
• HDFS block size: 128 MB
• The reason for the larger block size is to minimize the cost of
seeks.
• If the block is large enough, the time it takes to transfer
the data from the disk can be significantly longer than
the time to seek to the start of the block.
• Thus, transferring a large file made of multiple blocks
operates at the disk transfer rate.

HDFS Block Size (2)

• Suppose disk seek time = 10 ms
• Transfer rate = 100 MB/s
• Block size = 128 MB
• Transfer time = 128 MB ÷ 100 MB/s ≈ 1.3 s
• Seek time ≈ 0.8% (roughly 1%) of transfer time
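The arithmetic on this slide can be checked with a small sketch. The function name and the extra block sizes are illustrative; the seek time and transfer rate are the figures given above.

```python
def seek_overhead(seek_ms: float, rate_mb_s: float, block_mb: float) -> float:
    """Return seek time as a fraction of the time needed to transfer one block."""
    transfer_s = block_mb / rate_mb_s
    return (seek_ms / 1000.0) / transfer_s


for block_mb in (1, 64, 128):
    pct = 100 * seek_overhead(seek_ms=10, rate_mb_s=100, block_mb=block_mb)
    print(f"{block_mb:>4} MB block: seek is {pct:.1f}% of transfer time")
```

The larger the block, the smaller the share of time lost to seeking, which is the motivation for the large HDFS block size.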

HDFS Tradeoff

HDFS Nodes

HDFS Nodes

Various roles in HDFS

Not a failover
node

HDFS Namespace

Key points about HDFS

• The client application interacts with the NameNode for metadata-related activities
and communicates with DataNodes to read and write files.
• DataNodes communicate with each other during pipelined reads and writes.

• Let us assume that the file “Sample.txt” is of size 192 MB. With a data block
size of 64 MB (the default in Hadoop 1.x; Hadoop 2.x and later default to 128 MB),
it will be split into three blocks, and each block will be replicated across
the nodes of the cluster according to the default replication factor.
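A minimal sketch of the block split for the 192 MB example. The function name is illustrative, and the replication factor of 3 is assumed to be the default.

```python
def split_into_blocks(file_mb: int, block_mb: int) -> list[int]:
    """Return the sizes (in MB) of the blocks that a file of file_mb is split into."""
    full, rest = divmod(file_mb, block_mb)
    return [block_mb] * full + ([rest] if rest else [])


REPLICATION = 3  # assumed default replication factor

for block_mb in (64, 128):
    blocks = split_into_blocks(192, block_mb)
    replicas = len(blocks) * REPLICATION
    print(f"block size {block_mb} MB -> blocks {blocks}, {replicas} block replicas in the cluster")
```

With a 128 MB block size the same file needs only two blocks, and the last one holds just the remaining 64 MB.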
HDFS Replication (Tunable)

HDFS Read

HDFS Write

NameNode (1)

• HDFS breaks a large file into smaller pieces called
blocks.
• The NameNode uses a rack ID to identify the DataNodes in a
rack. A rack is a collection of DataNodes within the
cluster.
• The NameNode keeps track of the blocks of a file as they are placed
on various DataNodes.
• The NameNode manages file-related operations such as
read, write, create, and delete. Its main job is managing
the file system namespace (stored in the FsImage).

NameNode (2)

• A file system namespace is the collection of files in the cluster.
The NameNode stores the HDFS namespace.
• The file system namespace includes the mapping of blocks to files and the file
properties, and is stored in a file called FsImage.
• The NameNode uses an EditLog (transaction log) to record every
transaction that happens to the file system metadata.
• When the NameNode starts up, it reads the FsImage and EditLog from disk
and applies all transactions from the EditLog to the in-memory
representation of the FsImage.
• It then flushes a new version of the FsImage to disk and truncates
the old EditLog, because its changes are now reflected in the FsImage.

• There is a single NameNode per cluster.
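The startup sequence described above (load the FsImage, replay the EditLog, write a new FsImage, truncate the log) can be sketched as follows. The operation names and the dictionary layout are hypothetical simplifications, not the real on-disk formats.

```python
def replay(fsimage, edit_log):
    """Apply every logged transaction to a copy of the namespace and return the result."""
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for op, path, *args in edit_log:
        if op == "create":
            namespace[path] = []
        elif op == "add_block":
            namespace[path].append(args[0])
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


# Namespace as of the last checkpoint: file path -> list of block IDs.
fsimage = {"/data/a.txt": ["blk_1", "blk_2"]}
# Transactions recorded since that checkpoint.
edit_log = [
    ("create", "/data/b.txt"),
    ("add_block", "/data/b.txt", "blk_3"),
    ("delete", "/data/a.txt"),
]

new_fsimage = replay(fsimage, edit_log)
edit_log = []  # the old log can be truncated once the new image is safely written
print(new_fsimage)
```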

Namenode (3)

DataNode

• There are multiple DataNodes per cluster.

• During pipelined reads and writes, DataNodes communicate
with each other.

• A DataNode also continuously sends a “heartbeat”
message to the NameNode to confirm the connectivity
between the NameNode and the DataNode.

• If there is no heartbeat from a DataNode, the
NameNode re-replicates the blocks held by that DataNode onto other
DataNodes in the cluster and keeps on running as if nothing had happened.
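A minimal sketch of heartbeat-based failure detection, assuming the NameNode remembers the time of each DataNode's last heartbeat. The timeout, node names, and block IDs are illustrative values, not Hadoop defaults.

```python
def dead_datanodes(last_heartbeat, now, timeout_s):
    """Return the DataNodes whose last heartbeat is older than timeout_s seconds."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout_s}


def under_replicated(block_locations, dead, replication):
    """Map each block to the number of extra replicas needed once dead nodes are excluded."""
    missing = {}
    for block, nodes in block_locations.items():
        live = nodes - dead
        if len(live) < replication:
            missing[block] = replication - len(live)
    return missing


last_heartbeat = {"dn1": 100.0, "dn2": 97.0, "dn3": 40.0, "dn4": 99.0}
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn1", "dn2", "dn4"},
}

dead = dead_datanodes(last_heartbeat, now=101.0, timeout_s=30.0)
todo = under_replicated(block_locations, dead, replication=3)
print(dead, todo)
```

Only the blocks that lost a replica need to be copied again; blocks stored entirely on live nodes are left alone.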

NameNode and DataNode
Communication

Secondary NameNode

• The Secondary NameNode takes a snapshot of the HDFS
metadata at intervals specified in the Hadoop
configuration.
• Since the memory requirements of the Secondary
NameNode are the same as those of the NameNode,
it is better to run the NameNode and the Secondary NameNode
on different machines.
• In case of failure of the NameNode, the Secondary
NameNode can be configured manually to bring up the
cluster.
• However, the Secondary NameNode does not record
any real-time changes that happen to the HDFS
metadata.
HADOOP HDFS commands
(Four node cluster-Screen shots)

Hadoop with four Nodes

1. Log in to the master node.
2. Type cat /etc/hosts to list the IP addresses of the master node and the data nodes.

List the local files in Master
Node
>ls

List the Hadoop file system
(HDFS)

To Copy a local file into HDFS

> -put command

3 is the replication factor
File size is 3.2 MB

HDFS Make Directory

> -mkdir command

Change the name of the HDFS
file
> mv command

To copy file from HDFS to
Local file system
> -get command

Copy Directory into HDFS file
system
> -put command

To Delete a HDFS directory

> -rmr command (deprecated in newer Hadoop releases; use -rm -r)

To Delete a file from HDFS file
system
> -rm command

HDFS Help option

> -help command
> hadoop dfs -help | more
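The commands from the preceding screenshot slides can be collected in one place. The sketch below only builds and prints the command lines, using the `hdfs dfs` form of the file system shell; the paths and file names are hypothetical.

```python
import shlex


def hdfs_dfs(*args):
    """Build the argument list for one `hdfs dfs` command; nothing is executed here."""
    return ["hdfs", "dfs", *args]


commands = [
    hdfs_dfs("-ls", "/"),                                   # list the HDFS root
    hdfs_dfs("-mkdir", "/demo"),                            # make a directory
    hdfs_dfs("-put", "sample.txt", "/demo/sample.txt"),     # copy local file into HDFS
    hdfs_dfs("-mv", "/demo/sample.txt", "/demo/data.txt"),  # rename a file
    hdfs_dfs("-get", "/demo/data.txt", "copy.txt"),         # copy from HDFS to local
    hdfs_dfs("-rm", "/demo/data.txt"),                      # delete a file
    hdfs_dfs("-rm", "-r", "/demo"),                         # delete a directory (replaces -rmr)
    hdfs_dfs("-help"),                                      # show help
]

for argv in commands:
    print(shlex.join(argv))
```

On a machine with a configured Hadoop client, each list could be passed to `subprocess.run(argv, check=True)` to run the command.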

Additional Features of HDFS

New HDFS command

HDFS high Availability(1)

HDFS High Availability

HDFS Federation

HDFS 1.0 limitation

• The NameNode keeps all of its file metadata in main memory.
• Although main memory today is not as small or as
expensive as it was two decades ago, there is still
a limit on the number of objects that can be held in
memory on a single NameNode.
• The NameNode can quickly become overwhelmed as the
load on the system increases. In Hadoop 2.x, this is
addressed with HDFS Federation.

HDFS federation

Hadoop 2.0 (HDFS 2.0)

• HDFS 2 consists of two major components:
(a) the namespace service
(b) the block storage service

• The namespace service takes care of file-related
operations, such as creating and modifying files and
directories.

• The block storage service handles DataNode cluster
management and replication.

HDFS 2 Features

1. Horizontal scalability.
2. High availability.
• HDFS Federation uses multiple independent
NameNodes for horizontal scalability. The NameNodes are
independent of each other.

• The DataNodes provide common storage for blocks and are
shared by all NameNodes. Every DataNode in the cluster
registers with each NameNode in the cluster.
• High availability of the NameNode is obtained with a passive standby NameNode.

Passive Standby NameNode

In Hadoop 2.x, an active-passive NameNode pair handles
failover automatically. All namespace edits are recorded
to shared NFS storage, and there is a single writer at
any point in time. The passive NameNode reads the edits from
shared storage.

Q&A

END

BITS Pilani presentation
BITS Pilani K Anantharaman
kanantharaman@wilp.bits-pilani.ac.in
Pilani Campus

12/16/2023 CCCSZG522 - Big Data Systems - S1-23-24


BITS Pilani
Pilani Campus

CCCSZG522 - Big Data Systems


Lecture No 8 - Introduction to Pig
12/16/2023 CCCSZG522 - Big Data Systems - S1-23-24
Learning Objectives and Learning Outcomes

Introduction to Pig

Learning Objectives
1. To study the key features and anatomy of Pig.
2. To study the execution modes of Pig.
3. To study the various relational operators in Pig.

Learning Outcomes
a) To have an easy comprehension of when to use and when NOT to use Pig.
b) To be able to differentiate between Pig and Hive.

Agenda-1
➢ What is Pig?
❖ Key Features of Pig
➢ Why do we Need Apache Pig?
➢ Features of Pig
➢ Pig vs MapReduce
➢ Pig Architecture
➢ Use case for Pig
➢ Pig Latin Overview
❖ Pig Latin Statements
❖ Pig Latin: Identifiers
❖ Pig Latin: Comments
➢ Data Types in Pig
❖ Simple Data Types
❖ Complex Data Type

Agenda-2

➢ Running Pig
➢ Execution Modes of Pig
➢ Relational Operators
➢ Eval Function
➢ Piggy Bank
➢ When to use Pig?
➢ When NOT to use Pig?
➢ Pig versus Hive

What is Pig?

• Apache Pig is an abstraction over MapReduce.
• Pig is used to handle structured, semi-structured, and unstructured data.
• It is a high-level data flow tool developed to execute queries on large
datasets that are stored in HDFS.
• Pig Latin is the high-level scripting language used by Apache Pig to write
data analysis programs.
• Pig was developed as a research project at Yahoo.

Why do we Need Apache Pig?(1)

Source: analyticsvidhya.

Why do we Need Apache Pig?(2)

• Pig Latin is easy to understand as it is a SQL-like language.

• Pig is also known as an operator-rich language because it offers multiple
built-in operators like joins, filters, ordering, etc.

Source: analyticsvidhya.

Pig vs MapReduce

Source: analyticsvidhya.

Architecture and Components of Pig (1)

Source: analyticsvidhya.

Architecture and Components of Pig (2)
(Parser)
Pig Latin Script Processing Stages:
First Stage: Parser
• A Pig Latin script submitted for execution is first processed by the Parser.
• The Parser conducts type, syntax, and other checks on the script.
• It outputs a Directed Acyclic Graph (DAG) that represents the Pig Latin
statements and their logical operators.
• In the DAG, the script's logical operators are nodes, and the data flows are edges.

Source: analyticsvidhya.

Architecture and Components of Pig (3)
(Parser)
• A DAG is a directed graph that has no cycles.
• In the context of Apache Pig, the DAG represents the logical flow of data in a
Pig Latin script.
• The nodes in the DAG represent the logical operators in the script, such as
LOAD, FILTER, GROUP, and JOIN.
• The edges in the DAG represent the data flows between the operators. For
example, the output of a LOAD operator might flow into a FILTER operator,
which in turn might flow into a GROUP operator.
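The LOAD, FILTER, GROUP example can be written down as a small graph. This is an illustrative sketch using Python's standard `graphlib` module; it does not reflect how Pig represents its logical plan internally.

```python
from graphlib import CycleError, TopologicalSorter

# Each operator maps to the set of operators whose output it consumes (its incoming edges).
logical_plan = {
    "LOAD": set(),
    "FILTER": {"LOAD"},
    "GROUP": {"FILTER"},
}

# A valid execution order exists only because the graph has no cycles.
order = list(TopologicalSorter(logical_plan).static_order())
print(" -> ".join(order))


def is_dag(plan):
    """Return True if the plan contains no cycle."""
    try:
        list(TopologicalSorter(plan).static_order())  # force evaluation so cycles are detected
        return True
    except CycleError:
        return False
```

A plan in which two operators fed each other would fail the `is_dag` check, because no execution order could satisfy both dependencies.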

Source: analyticsvidhya.

Architecture and Components of Pig (4)
Optimizer
Second Stage: Optimizer
• The logical plan (the DAG produced by the parser) is then submitted
to a logical optimizer.
• The optimizer carries out logical optimizations, which include
activities like transforming, splitting, merging, and reordering operators.
• The optimizer basically aims to reduce the quantity of data in the pipeline
while the data is being processed.

Source: analyticsvidhya.

Architecture and Components of Pig (5)
Optimizer
➢ The optimizer performs these optimizations automatically, using rules
such as:
✓ PushUpFilter: If a filter contains multiple conditions and can be split, Pig splits the conditions and pushes
each one up individually. Applying these conditions earlier reduces the number of records left in
the pipeline.

✓ LimitOptimizer: If the limit operator is applied just after a load or sort operator, Pig converts these
operators into a limit-sensitive implementation, which avoids processing the whole data set.

✓ ColumnPruner: This rule omits the columns that are never used; hence, it reduces the size of the
record. It can be applied after each operator to prune the fields aggressively and frequently.

✓ MapKeyPruner: This rule omits the map keys that are never used, hence reducing the size of the
record.
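The effect of pushing a filter up and pruning columns can be illustrated outside Pig. The sketch below is a plain-Python analogy with made-up sample data, not Pig's optimizer: both plans give the same result, but the optimized plan sends fewer and narrower records into the join.

```python
users = [(1, "Kiran", "Kolkata", 25), (2, "Aisha", "Pune", 20), (3, "Ravi", "Kolkata", 31)]
orders = [(1, 250), (1, 90), (2, 40), (3, 700)]  # (user_id, amount)


def join(left, right):
    """Join two lists of tuples on their first field."""
    return [l + r[1:] for l in left for r in right if l[0] == r[0]]


# Plan as written: join everything, then filter on city, then project name and amount.
joined = join(users, orders)
naive = [(row[1], row[4]) for row in joined if row[2] == "Kolkata"]

# Optimized plan: filter and prune columns first, then join.
kolkata_users = [(uid, name) for uid, name, city, _age in users if city == "Kolkata"]
pushed_join = join(kolkata_users, orders)
optimized = [(name, amount) for _uid, name, amount in pushed_join]

print(len(joined), "joined records without push-up,", len(pushed_join), "with push-up")
```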

Source: analyticsvidhya.

Architecture and Components of Apache
Pig (6) - Compiler
Third Stage: Compiler
➢ After receiving the optimizer’s output, the Compiler compiles the resultant
code into a series of MapReduce tasks.
➢ The Compiler is responsible for the conversion of Pig Script into MapReduce
jobs.

Source: analyticsvidhya.

Architecture and Components of Apache
Pig (7) – Execution Engine
Fourth Stage: Execution Engine
• Finally, the MapReduce jobs are handed to the Execution Engine, which
submits them to Hadoop for execution.
• The MapReduce jobs are then executed, and Hadoop produces the required
results.
• Output can be displayed on the screen by using the ‘DUMP’ statement and
can be stored in HDFS by the ‘STORE’ statement.

Source: analyticsvidhya.

Process flow of Pig Script execution

Use case for Pig: ETL Processing

• ETL (Extract, Transform, Load)

Pig Data Model

Pig Data Model (1)

➢ The data model of Pig Latin allows it to handle a variety of data.


➢ Pig Latin can handle simple atomic data types such as int, float, long, double,
etc., as well as complex non-atomic data types such as map, tuple, and bag.

Source: analyticsvidhya.

Pig Latin Data Model (2)
(Atom)
Atom
➢ An atom is a scalar primitive value: any single value in Pig Latin,
irrespective of its data type.
➢ The atomic values of Pig can be of type int, long, float, double, chararray (string), or
bytearray.
➢ A single atomic value, or piece of data, is known as a field.

✓ Examples of atoms: 2, ‘Kiran’, ‘25’, ‘Kolkata’, etc.

Source: analyticsvidhya.

Pig Latin Data Model (3)
Tuple
Tuple
➢ A tuple is a record formed by an ordered set of fields, each of which may have
a different data type.
➢ A tuple can be compared with a row in an RDBMS table.
➢ It is not mandatory to have a schema attached to the elements
inside a tuple. Parentheses ‘()’ are used to represent tuples.

✓ Example of tuple : (2, Kiran, 25, Kolkata)

Source: analyticsvidhya.

Pig Latin Data Model (4)
Bag

Source: analyticsvidhya.

Pig Latin Data Model (5)
Map
Map
➢ A map is a set of key-value pairs used to represent the data
elements.
➢ The key should be unique and must be of type chararray, whereas the
value can be of any type.
➢ Square brackets ‘[]’ are used to represent a map, and the hash ‘#’ symbol is
used to separate the key from the value in each pair.

✓ Example of maps: [name#Kiran, age#25 ], [name#Aisha, age#20 ]
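To see how the four data model elements nest, the sketch below mirrors them with Python values and renders them in Pig's textual notation. It is an analogy only: the bag notation (curly braces around a collection of tuples) is standard Pig notation that does not appear in the extracted slide text, and the second record is made-up sample data.

```python
def to_pig(value):
    """Render a Python value in Pig's textual notation for atoms, tuples, bags, and maps."""
    if isinstance(value, tuple):   # tuple -> (f1,f2,...)
        return "(" + ",".join(to_pig(v) for v in value) + ")"
    if isinstance(value, list):    # bag -> {(t1),(t2),...}
        return "{" + ",".join(to_pig(v) for v in value) + "}"
    if isinstance(value, dict):    # map -> [key#value,...]
        return "[" + ",".join(f"{k}#{to_pig(v)}" for k, v in value.items()) + "]"
    return str(value)              # atom


atom = "Kiran"
record = (2, "Kiran", 25, "Kolkata")
bag = [record, (3, "Aisha", 20, "Pune")]
mapping = {"name": "Kiran", "age": 25}

for value in (atom, record, bag, mapping):
    print(to_pig(value))
```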

Source: analyticsvidhya.

Example: Atom ,Tuple, Map, Bag

Source: Medium

Running Pig

Running Pig

Pig can run in two ways:

1. Interactive Mode.

2. Batch Mode.

1. Running Pig- Interactive Mode (1)

➢ You can run Pig in interactive mode by invoking grunt shell.


➢ Type pig to get grunt shell as shown below.

1. Running Pig- Interactive Mode (2)

➢ Once you get the grunt prompt, you can type the Pig Latin statement as
shown below.

2. Running Pig- Batch Mode (1)

➢ To run Pig in MapReduce mode, you need to have access to a Hadoop
cluster to read and write files. This is the default mode of Pig.

➢ Syntax
✓ pig filename

HDFS Commands

Pig Latin Overview

Pig Latin Statements

1. Pig Latin statements are the basic constructs used to process data with Pig.
2. A Pig Latin statement is an operator.
3. An operator in Pig Latin takes a relation as input and yields another relation
as output.
4. Pig Latin statements include schemas and expressions.
5. Pig Latin statements end with a semicolon ‘;’.

Example – Pig Latin Script

Pig Latin : Keywords

➢ Keywords are reserved. They cannot be used to name things.

✓ Examples: LOAD, FILTER, FOREACH, etc.

Pig Latin: Identifiers

1. Identifiers are names assigned to fields or other data structures.
2. An identifier should begin with a letter and should be followed only by letters, numbers,
and underscores.
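The identifier rule can be expressed as a regular expression. This is a minimal sketch of the rule as stated on the slide; it does not check whether a name clashes with a reserved keyword, and the sample names are illustrative.

```python
import re

# A letter first, then any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_identifier(name: str) -> bool:
    """Return True if name follows the identifier rule stated above."""
    return IDENTIFIER.fullmatch(name) is not None


for name in ("student_details", "A1", "2students", "_tmp", "first-name"):
    print(f"{name!r}: {is_valid_identifier(name)}")
```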

Pig Latin -Comments

In Pig Latin two types of comments are supported:


1. Single-line comments that begin with “--”.
2. Multiline comments that begin with “/*” and end with “*/”.

Pig Latin: Case Sensitivity

1. Keywords such as LOAD, STORE, GROUP, FOREACH, and DUMP are not case sensitive.
2. Relation names and paths are case sensitive.
3. Function names such as PigStorage and COUNT are case sensitive.

Relational Operators

LOAD

Reading Data
To read data in Pig, we first need to copy the data from the local file system into
HDFS. The steps are:
Step 1: Create a file using the cat command in the local file system.
Step 2: Transfer the file into HDFS using the put command.
Step 3: Read the data from HDFS into a Pig relation using the LOAD
operator.

LOAD Operator - Syntax
Syntax:
Relation = LOAD 'Input file path information' USING load_function AS schema;
Where,

Relation − The name of the relation into which the file content is loaded.

Input file path information − The path of the HDFS directory where the file is stored.

load_function − Apache Pig provides a variety of load functions such as BinStorage, JsonLoader, PigStorage, and
TextLoader. We choose one function from this set. PigStorage is the most commonly used
function as it is suited for loading structured text files.

Schema − The schema of the data being loaded, defined in parentheses.
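What a LOAD with a delimiter and a schema does to a text file can be mimicked in a few lines. The sketch below is a plain-Python analogy, not Pig code; the sample rows, field names, and the HDFS path in the comment are hypothetical.

```python
import csv
import io

# Roughly corresponds to the Pig Latin statement:
#   student = LOAD '/pig_data/student.txt' USING PigStorage(',')
#             AS (id:int, name:chararray, age:int, city:chararray);
SCHEMA = [("id", int), ("name", str), ("age", int), ("city", str)]


def load(lines, delimiter=",", schema=SCHEMA):
    """Split each line on the delimiter and cast every field to its schema type."""
    reader = csv.reader(lines, delimiter=delimiter)
    return [tuple(cast(value) for (_, cast), value in zip(schema, row)) for row in reader]


data = io.StringIO("2,Kiran,25,Kolkata\n3,Aisha,20,Pune\n")
relation = load(data)
print(relation)
```

The result plays the role of a relation: a collection of tuples whose fields have the names and types given in the schema.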

Example -Load

Source: Edureka.

STORE

Source: Edureka.

FOREACH

FILTER

DISTINCT (1)

DISTINCT (2)

LIMIT

ORDER BY

JOIN

UNION (1)
• It is used to merge the contents of two or more relations.

UNION(2)

SPLIT

• It is used to split a relation into two or more relations.

Aggregate Functions

AVG
• It is used to compute the average of the numeric values in a single-column bag.

MAX

COUNT
• It counts the number of elements in a bag.
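AVG, MAX, and COUNT each take a bag and return a single value, typically one bag per group. The sketch below is a plain-Python analogy with made-up sample records; it ignores how Pig treats null values.

```python
from collections import defaultdict
from statistics import mean

records = [("Kiran", "math", 80.0), ("Kiran", "physics", 90.0), ("Aisha", "math", 70.0)]

# Grouping by name yields one single-column bag of marks per student.
bags = defaultdict(list)
for name, _subject, marks in records:
    bags[name].append((marks,))

summary = {
    name: {
        "AVG": mean(m for (m,) in bag),
        "MAX": max(m for (m,) in bag),
        "COUNT": len(bag),
    }
    for name, bag in bags.items()
}
print(summary)
```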

Complex Datatypes

TUPLE

MAP

Piggy Bank

Piggy Bank

UDF (User Defined Function)

UDF (2)

Word Count Example using Pig (1)

Word Count Example using Pig (2)-
Output

When to use Pig?

Pig can be used in the following situations:

1. When data loads are time sensitive.

2. When processing various data sources.

3. When analytical insights are required through sampling

When NOT to use Pig?

Pig should not be used in the following situations:

1. When data is completely unstructured such as video, text, and audio.

2. When there is a time constraint because Pig is slower than MapReduce jobs.

Pig vs Hive

Pig Vs Hive

Source : geekforgeeks

Q&A

End

BITS Pilani
Pilani Campus

CCCSZG522 - Big Data Systems


Lecture No 8 – Introduction to Hive
12/16/2023 CCCSZG522 - Big Data Systems - S1-23-24
Learning Objectives and Learning Outcomes

Introduction to Hive

Learning Objectives
1. To study the Hive architecture.
2. To study the Hive file format.
3. To study the Hive Query Language.

Learning Outcomes
a) To understand the Hive architecture.
b) To create databases and tables and execute data manipulation language statements on
them.
c) To differentiate between static and dynamic partitions.
d) To differentiate between managed and external tables.

Agenda-1

➢ What is Hive?
➢ Hive Architecture
➢ Hive Data Types
➢ Primitive Data Types
➢ Collection Data Types
➢ Hive File Format
➢ Text File
➢ Sequence File
➢ RCFile (Record Columnar File)

Agenda-2

➢ Hive Query Language


– DDL (Data Definition Language) Statements
– DML (Data Manipulation Language) Statements
– Database
– Tables
– Partitions
– Buckets
– Aggregation
– Group BY and Having

➢ SerDe

What is Hive?

• An ETL and data warehousing tool developed on top of the Hadoop Distributed File
System (HDFS)
• Provides an SQL-like interface between the user and HDFS
• Hive makes it easy to perform operations like
✓ Data encapsulation
✓ Ad-hoc queries
✓ Analysis of huge datasets
✓ Data query and analysis using SQL
✓ Processing of structured and semi-structured data

▪ Facebook created Hive to manage its ever-growing volumes of data.
Source : towardsdatascience

What is Hive in Hadoop?

• Writing MapReduce jobs is tedious work!
• Hive allows users to submit SQL queries, which are run as MapReduce jobs
✓ If you are comfortable with SQL, then Hive is the right tool!

• Hive has its own language, called HiveQL (HQL)
✓ similar to SQL

• Hive translates HQL queries into MapReduce jobs, much as Pig does for Pig Latin
scripts, and uses HDFS for storage
✓ No need to learn Java to work with Hive

• The three important functionalities for which Hive is deployed are
✓ data summarization
✓ data analysis
✓ data query

Hive – in short

• Best suited for batch jobs
• Not suited to online transaction processing (OLTP) systems
• Does not provide real-time queries or row-level updates

Source : datadog

Characteristics of Hive

• Databases and tables are built before loading the data
• Hive
✓ as a data warehouse, is built to manage and query only structured data that resides in tables
✓ provides a framework with optimization and usability features
✓ can partition the data with directory structures to improve performance on certain queries
✓ is compatible with various file formats such as TEXTFILE, SEQUENCEFILE, RCFILE, etc.
✓ uses a Derby database for single-user metadata storage
✓ uses MySQL for multi-user or shared metadata
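Partitioning with directory structures means that each partition value gets its own subdirectory, named column=value, under the table directory. The sketch below illustrates why this helps certain queries; the table name, partition columns, and warehouse path are assumptions for illustration.

```python
from pathlib import PurePosixPath


def partition_dir(table_dir, **partition):
    """Build the directory of one partition: one column=value level per partition column."""
    path = PurePosixPath(table_dir)
    for column, value in partition.items():
        path = path / f"{column}={value}"
    return path


def prune(dirs, column, value):
    """Keep only the partition directories that a query on column = value has to read."""
    return [d for d in dirs if f"{column}={value}" in d.parts]


table = "/user/hive/warehouse/sales"  # assumed warehouse location and table name
dirs = [partition_dir(table, country=c, year=y) for c in ("IN", "US") for y in ("2022", "2023")]
to_scan = prune(dirs, "country", "IN")

for d in to_scan:
    print(d)
```

A query that filters on the partition column only needs to read the matching directories instead of scanning the whole table.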

Source : davidscoding
Applications of Apache Hive
You can use Apache Hive mainly for
✓Data Warehousing
✓Ad-hoc Analysis

It can also be used for


✓Data Mining
✓Log Processing
✓Document Indexing
✓Customer Facing Business Intelligence
✓Predictive Modelling
✓Hypothesis Testing

Not meant for


✓any real-time queries
✓Online transaction processing OLTP
Ref: Linkedin Learning: Analyzing Data with Hive

Hadoop vs Hive

Source : geekforgeeks
History of Hive and Recent Releases of
Hive

Hive Architecture

Architecture

RPC
Protocol

Source : guru99

Core Parts
Hive Clients
✓ Hive provides different drivers for communication with different types of applications
✓ For Thrift-based applications, it provides a Thrift client for communication
✓ For Java applications, it provides JDBC drivers
Hive Services
✓ Client interactions with Hive are performed through Hive Services
✓ If a client wants to perform any query-related operation in Hive, it has to communicate
through Hive Services
✓ The CLI (command-line interface) acts as the Hive service for DDL (Data Definition
Language) operations
✓ The driver present in Hive Services is the main driver
❖ it communicates with all types of JDBC, ODBC, and other client-specific applications
❖ it passes the requests from the different applications on to the metastore and file system for
further processing
Hive Storage and Computing
– Hive services such as the metastore, file system, and job client in turn communicate with
Hive storage and perform the following actions
✓ Metadata of the tables created in Hive is stored in the Hive metastore database
✓ Query results and data loaded into the tables are stored in the Hadoop cluster on
HDFS
Major Components of Hive
Architecture(1)
Clients
CLI, UI, and Thrift Server
✓ The command-line interface (CLI) and the user interface (UI) submit queries, monitor processes, and issue
instructions so that external users can interact with Hive
✓ Thrift Server lets other clients interact with Hive

Major Components of Hive Architecture (2)
Metastore
The Metastore stores information about the tables, their
partitions, and the columns within each table

The repository of metadata
✓ Metadata consists of data for each table, such as its location and schema
✓ Holds the partition metadata, which allows monitoring
of the progress of various distributed data in the cluster
✓ Generally kept in a relational database
✓ Metadata keeps track of the data, replicates it, and provides a backup in
the case of data loss
There are 3 ways of deploying the Metastore
– Embedded Metastore
– Local Metastore
– Remote Metastore (used in production mode)
Major Components of Hive Architecture (3)
Types of Metastore

Embedded metastore
✓ is a simple way to get started with Hive
✓ only one embedded Derby database can access the database files on disk at any
one time
✓ which means only one Hive session sharing the same
metastore can be open at a time

Local Metastore
✓ The solution to supporting multiple sessions (and therefore multiple users) is to use a
standalone database
✓ The metastore service still runs in the same process as the Hive service, but connects to
a database running in a separate process,
either on the same machine or on a remote machine.
Remote metastore
✓ One or more metastore servers run in separate processes from the Hive service
✓ This brings better manageability and security, since the database tier can be
completely firewalled off
Major Components of Hive
Architecture(4)
Driver
✓ Receives HiveQL statements and works like a controller
✓ Monitors the progress and life cycle of various executions by creating sessions
✓ Stores the metadata that is generated while executing the HiveQL statement
✓ Collects the data points and query results, when the reducing operation is completed by the MapReduce job

Major Components of Hive Architecture (5)
Driver Components
Compiler
✓ Converts a HiveQL query into input for MapReduce
✓ Works out the steps and tasks needed to produce the output that the HiveQL query requires from MapReduce
✓ Transforms the query into an execution plan that contains tasks
Optimizer
✓ Performs many transformations on the execution plan to produce an optimized DAG
✓ Transformations include aggregation and pipeline conversion, such as replacing multiple joins with a single join
✓ Splits tasks, for example transforming data before the reduce operations, for improved efficiency and scalability
Executor
✓ Executes tasks after the compilation and optimization steps
✓ Directly interacts with the Hadoop Job Tracker for scheduling the tasks to be run
✓ Responsible for pipelining the tasks
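
The plan that the compiler and optimizer hand to the executor can be inspected with Hive's EXPLAIN statement. A minimal sketch, assuming the STUDENT table (rollno, name, gpa) used elsewhere in these slides already exists:

```sql
-- Print the execution plan without running the query.
-- The output lists the stage dependencies and the operators inside each stage.
EXPLAIN
SELECT name, gpa
FROM STUDENT
WHERE gpa > 3.5;
```

The exact shape of the plan depends on the Hive version and the execution engine in use.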

Modes of Hive
• Hive runs in one of two modes, depending on the number of data nodes in Hadoop and the size of the data
Local Mode
✓ Used when Hadoop is installed in pseudo-distributed mode, with only one data node
✓ Used when the data is small enough to fit on a single local machine
✓ Processing is faster for small datasets that reside on the local machine

MapReduce Mode
✓ Used when Hadoop has multiple data nodes and the data is distributed across them
✓ Operates on huge datasets and executes queries in parallel
✓ Used to achieve better performance when processing large datasets
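
The mode can be chosen per session from the Hive shell. This is a sketch that assumes a Hadoop 2 or later installation; the property names may differ on older releases.

```sql
-- Run subsequent jobs in-process on the local machine (local mode).
SET mapreduce.framework.name=local;

-- Alternatively, let Hive decide per query: small inputs run locally,
-- larger ones are submitted to the cluster.
SET hive.exec.mode.local.auto=true;
```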

Hive Request Flow

Source : AnalyticsVidhya

Request Flow - Detailed

Source : geeksforgeeks
Request Flow - Detailed

Job execution flow in Hive with Hadoop

Step-1: Execute Query
✓ A Hive interface, such as the command line or the web user interface, delivers the query to the driver for execution
✓ The UI calls the driver's execute interface over a connection such as ODBC or JDBC

Step-2: Get Plan
✓ The driver creates a session handle for the query and passes the query to the compiler, which builds the execution plan

Step-3: Get Metadata
✓ The compiler sends a metadata request to the metastore, which may be backed by any supported database, to obtain the metadata it needs

Step-4: Send Metadata
✓ The metastore returns the metadata to the compiler in response

Request Flow – Detailed (2)

Step-5: Send Plan
✓ The compiler sends the execution plan it has built for the query back to the driver

Step-6: Execute Plan
✓ The driver sends the execution plan to the execution engine
✓ Execute Job
✓ Job Done
✓ DFS operation (metadata operation)

Step-7: Fetch Results
✓ The user interface (UI) fetches the results from the driver

Step-8: Send Results
✓ The driver requests the results from the execution engine
✓ Once the execution engine has retrieved the results from the data nodes, it returns them to the driver, which passes them on to the user interface (UI)

Hive Integration and Work Flow

Explanation of the workflow: Hourly Log Data can be stored directly into HDFS and then data cleansing
is performed on the log file. Finally, Hive table(s) can be created to query the log file.
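
The last step of this workflow can be sketched as follows. The directory /data/logs/cleaned, the table name web_logs, and its columns are hypothetical; they assume the cleansing step writes tab-separated files.

```sql
-- Expose the cleansed log files already sitting in HDFS as a Hive table.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  ip      STRING,
  request STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/cleaned';

-- Query the log data, for example the number of hits per HTTP status.
SELECT status, count(*) AS hits
FROM web_logs
GROUP BY status;
```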

Hive Data Model

Data Modelling

[Figure: the Hive data model, organized as Database, Table, Partition, and Bucket]

Hive Data Units

• Databases
– The namespace for tables.

• Tables
– Sets of records that share a similar schema.

• Partitions
– Logical separations of the data based on the values of specific attributes. Once Hive has partitioned the data on a specified key, it places records into the corresponding folders as they are inserted (for example, partition by Year, Country).

• Buckets (Clusters)
– Similar to partitions, but a hash function determines the bucket (cluster) into which each record is placed
– “CLUSTERED BY (customer_id) INTO XX BUCKETS”;
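
A table can combine both units. The sketch below uses a hypothetical customer_orders table and an arbitrary bucket count of 4.

```sql
-- Partition columns are declared separately from the regular columns.
-- Each (order_year, country) pair becomes a directory such as
-- .../customer_orders/order_year=2023/country=IN/
-- and rows inside it are hashed on customer_id into 4 bucket files.
CREATE TABLE IF NOT EXISTS customer_orders (
  order_id    INT,
  customer_id INT,
  amount      DOUBLE
)
PARTITIONED BY (order_year INT, country STRING)
CLUSTERED BY (customer_id) INTO 4 BUCKETS;
```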

Tables

Tables (1)
Two types
Managed Tables
✓ In a managed table, both the table data and the table schema are managed by Hive
✓ The data will be located in a folder named after the table within the Hive data warehouse
❖ which is essentially just a file location in HDFS
✓ The location is user-configurable when Hive is installed
✓ If you drop (delete) a managed table, then Hive will delete both the Schema (the description of the table) and
the data files associated with the table
✓ Default location is /user/hive/warehouse

Tables (2)
External Tables
✓ An external table is one where only the table schema is controlled by Hive
✓ In most cases, the user will set up the folder location within HDFS and copy the data file(s) there
✓ This location is included as part of the table definition statement
✓ When an external table is deleted, Hive will only delete the schema associated with the table
✓ The data files are not affected

Managed vs. External Table – What’s the Difference

Create Managed Table

To create a managed table named ‘STUDENT’:

CREATE TABLE IF NOT EXISTS STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Managed table location

• Hive creates the managed table in its warehouse directory (by default under /user/hive/warehouse).

Create External Table

To create an external table named ‘EXT_STUDENT’:

CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/STUDENT_INFO';

Table Location of External Table

• Hive stores the table's data files in the location given in the table definition (here, /STUDENT_INFO).
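
The type and location of each table can be checked from the Hive shell. A sketch, assuming both STUDENT and EXT_STUDENT were created as in the preceding slides:

```sql
-- In the output, look for the "Table Type:" and "Location:" rows.
DESCRIBE FORMATTED STUDENT;      -- Table Type: MANAGED_TABLE
DESCRIBE FORMATTED EXT_STUDENT;  -- Table Type: EXTERNAL_TABLE

-- Dropping behaves differently for the two types (left commented out
-- so that the later examples can still use both tables):
-- DROP TABLE STUDENT;      -- removes the schema and the data files
-- DROP TABLE EXT_STUDENT;  -- removes only the schema; files under /STUDENT_INFO remain
```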

Load Data into a Table

To load data into the table from a file named student.tsv:

LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE
INTO TABLE EXT_STUDENT;

The difference between INTO TABLE and OVERWRITE INTO TABLE
• Assume the “EXT_STUDENT” table already had 100 records and the
“student.tsv” file has 10 records.
• After issuing the LOAD DATA statement with the INTO TABLE clause, the
table “EXT_STUDENT” will contain 110 records;
• However, the same LOAD DATA statement with the OVERWRITE clause will
wipe out all the former content from the table and then load the 10 records
from the data file.
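
The two variants side by side, as a sketch. The row counts in the comments simply restate the scenario assumed above (100 existing records, 10 records in the file); they are not measured output.

```sql
-- Append: existing rows are kept.
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv'
INTO TABLE EXT_STUDENT;
SELECT count(*) FROM EXT_STUDENT;   -- 110 under the stated assumption

-- Replace: existing rows are removed before the file is loaded.
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv'
OVERWRITE INTO TABLE EXT_STUDENT;
SELECT count(*) FROM EXT_STUDENT;   -- 10 under the stated assumption
```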

To Retrieve Data from a Table

To retrieve the student details from “EXT_STUDENT” table.

SELECT * from EXT_STUDENT;

Partitions

Partitions
• Hive organizes tables into partitions, grouping similar data together based on a column known as the partition key
• Each table can have one or more partition keys to identify a particular partition
• Partitions allow faster queries on slices of the data

Partition - example
• Consider e-commerce data for India operations, in which the operations of all 28 states are stored together as a whole
• If we take the state column as the partition key and partition the India data on it, we get 28 partitions, one for each of the 28 states in India
• Each state's data can then be viewed separately in the partitioned table
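
This example could be expressed as below. The table name india_orders, its columns, and the state value in the query are hypothetical.

```sql
-- One directory per state, e.g. .../india_orders/state=Kerala/
CREATE TABLE IF NOT EXISTS india_orders (order_id INT, amount DOUBLE)
PARTITIONED BY (state STRING);

-- List the partitions that exist so far (one line per state once data is loaded).
SHOW PARTITIONS india_orders;

-- A filter on the partition key lets Hive read only that state's directory.
SELECT sum(amount) AS total_amount
FROM india_orders
WHERE state = 'Kerala';
```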

Partition

Partitions are of two types:

STATIC PARTITION: The user must specify the partition (the segregation unit) into which the data from the file is to be loaded.

DYNAMIC PARTITION: The user only states the column on which the partitioning is to take place. Hive then creates one partition for each unique value in that column.

Static Partition

• To create a table for static partitioning on the “gpa” column:

CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load Data into Static Partition

• Load data into the partitioned table from another table:

INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa = 4.0)
SELECT rollno, name FROM EXT_STUDENT WHERE gpa = 4.0;
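
With static partitioning, every further partition has to be named explicitly. A sketch, where the gpa values 3.5 and 3.0 are arbitrary illustrations:

```sql
-- Load a second, explicitly named partition.
INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa = 3.5)
SELECT rollno, name FROM EXT_STUDENT WHERE gpa = 3.5;

-- Add an empty partition ahead of time.
ALTER TABLE STATIC_PART_STUDENT ADD IF NOT EXISTS PARTITION (gpa = 3.0);

-- List the partitions that now exist.
SHOW PARTITIONS STATIC_PART_STUDENT;
```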

Dynamic Partition

• To create a table for dynamic partitioning on the “gpa” column:

CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

To load data into a dynamic partition

INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa)
SELECT rollno, name, gpa FROM EXT_STUDENT;
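
Dynamic partition inserts usually need two session settings first. In strict mode, which is the default in many Hive releases, at least one partition column must be given a static value, so a fully dynamic insert like this one is rejected. A sketch of the complete sequence:

```sql
-- Allow dynamic partitioning, and allow all partition columns to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column (gpa) must come last in the SELECT list.
INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa)
SELECT rollno, name, gpa FROM EXT_STUDENT;

-- One partition per distinct gpa value should now be listed.
SHOW PARTITIONS DYNAMIC_PART_STUDENT;
```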

Buckets

Partitioning Vs. Bucketing

Bucketing is similar to partitioning.

• However, there is a subtle difference between the two. With partitioning, a partition is created for each unique value of the column.
✓ This may lead to a situation where you end up with thousands of partitions.

• Bucketing avoids this by letting you limit the number of buckets that are created.
• A bucket is a file, whereas a partition is a directory.

When to Use Partitioning/Bucketing?

• Bucketing works well when the field has high cardinality (cardinality is the
number of distinct values a column or field can have) and the data is evenly distributed
among the buckets.

• Partitioning works best when the cardinality of the partitioning field is not too
high.

Create Buckets

• To create a bucketed table having 3 buckets.


CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT,name
STRING,grade FLOAT)
CLUSTERED BY (grade) into 3 buckets;

Load data into the bucketed table

FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, gpa;
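
A fuller sketch of the load. It assumes STUDENT is the managed table defined earlier, whose third column is named gpa; Hive matches INSERT ... SELECT columns by position, so gpa fills the grade column of STUDENT_BUCKET.

```sql
-- On Hive releases before 2.0, bucketing had to be switched on explicitly.
-- Later releases always enforce it and no longer have this property,
-- so the line is left commented out here.
-- SET hive.enforce.bucketing=true;

FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, gpa;
```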

• To display the content of the first bucket:

SELECT DISTINCT grade FROM STUDENT_BUCKET
TABLESAMPLE (BUCKET 1 OUT OF 3 ON grade);

Hive Data Types

Hive Datatypes (1)
Numeric Types
TINYINT     1-byte signed integer
SMALLINT    2-byte signed integer
INT         4-byte signed integer
BIGINT      8-byte signed integer
FLOAT       4-byte single-precision floating-point number
DOUBLE      8-byte double-precision floating-point number

String Types
STRING
VARCHAR     Only available starting with Hive 0.12.0
CHAR        Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (') or double quotes (")

Miscellaneous Types
BOOLEAN
BINARY      Only available starting with Hive
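
A sketch of a table declaration that uses each of these primitive types. The table and column names are hypothetical, and the VARCHAR and CHAR lengths are arbitrary.

```sql
CREATE TABLE IF NOT EXISTS type_demo (
  age_years     TINYINT,      -- 1-byte integer
  course_count  SMALLINT,     -- 2-byte integer
  rollno        INT,          -- 4-byte integer
  total_marks   BIGINT,       -- 8-byte integer
  gpa           FLOAT,        -- single precision
  fee_paid      DOUBLE,       -- double precision
  name          STRING,       -- variable length, no declared maximum
  city          VARCHAR(50),  -- variable length, at most 50 characters
  state_code    CHAR(2),      -- fixed length
  is_active     BOOLEAN,
  photo         BINARY
);
```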
Hive Datatypes(2)

Hive File Formats

Hive File Format

• Text File
• Sequence file
• RCFile (Record Columnar File)

Text file

• The default file format is the text file. In this format, each record is a line in the
file. Different control characters can be used as field delimiters, such as ",", "\t", or ^A
(octal \001), which is Hive's default.

• The supported text files include CSV and TSV. JSON or XML documents can also
be specified as text files.

Sequence file

• Sequence files are flat files that store binary key–value pairs. They include
compression support, which reduces storage and I/O requirements.

RCFile (Record Columnar File)

• RCFile stores data in a column-oriented manner, so it is efficient for
column-based queries.
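
The file format is chosen with the STORED AS clause when a table is created. A sketch using hypothetical table names:

```sql
-- Text file (the default), with an explicit tab delimiter.
CREATE TABLE IF NOT EXISTS student_text (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Sequence file: binary key-value pairs.
CREATE TABLE IF NOT EXISTS student_seq (rollno INT, name STRING, gpa FLOAT)
STORED AS SEQUENCEFILE;

-- RCFile: column-oriented storage.
CREATE TABLE IF NOT EXISTS student_rc (rollno INT, name STRING, gpa FLOAT)
STORED AS RCFILE;

-- LOAD DATA only moves files, so binary formats are normally filled
-- by selecting from a text-backed table.
INSERT OVERWRITE TABLE student_rc
SELECT rollno, name, gpa FROM student_text;
```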

HQL

Hive Query Language (HQL) can be used to:

• Create and manage tables and partitions.
• Apply various relational, arithmetic, and logical operators.
• Evaluate functions.
• Download the contents of a table to a local directory, or write the results of queries to an
HDFS directory.
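
The last point can be sketched as below. Both target directories are hypothetical; note that OVERWRITE replaces whatever the directory already contains.

```sql
-- Write the contents of a table to a directory on the local file system.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/student_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM STUDENT;

-- Write the result of a query to a directory in HDFS.
INSERT OVERWRITE DIRECTORY '/user/hive/output/top_students'
SELECT rollno, name FROM STUDENT WHERE gpa >= 3.5;
```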

DDL and DML statements

DDL (Data Definition Language) Statements
These statements are used to build and modify the tables and other objects in
the database.
The DDL commands are as follows:
1. Create/Drop/Alter Database
2. Create/Drop/Truncate Table
3. Alter Table/Partition/Column
4. Create/Drop/Alter View
5. Create/Drop/Alter Index
6. Show- Databases
7. Describe – a Database
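
A few of these commands in use, as a sketch against the STUDENT table from the earlier slides. The email column and the top_students view are hypothetical. Index statements are omitted because index support was removed in Hive 3.0.

```sql
-- Alter Table: add a column to an existing table.
ALTER TABLE STUDENT ADD COLUMNS (email STRING);

-- Create and drop a view.
CREATE VIEW IF NOT EXISTS top_students AS
SELECT rollno, name FROM STUDENT WHERE gpa >= 3.5;
DROP VIEW IF EXISTS top_students;

-- Truncate: remove all rows but keep the table definition
-- (applies to managed tables).
TRUNCATE TABLE STUDENT;
```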

DML (Data Manipulation Language) Statements
• These statements are used to retrieve, store, modify, delete, and update data
in the database.

The DML commands are as follows:
1. Loading files into tables.
2. Inserting data into Hive tables from queries.
3. Update, delete, and transactions (supported from Hive 0.14 onward).
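
UPDATE and DELETE work only on transactional (ACID) tables. The sketch below assumes a hypothetical student_acid table and sample rows, and assumes the two session settings are not already configured cluster-wide; further server-side settings may be needed depending on the release.

```sql
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Transactional tables are stored as ORC and flagged with a table property.
CREATE TABLE IF NOT EXISTS student_acid (rollno INT, name STRING, gpa FLOAT)
CLUSTERED BY (rollno) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO TABLE student_acid VALUES (101, 'Asha', 3.4), (102, 'Ravi', 3.9);

UPDATE student_acid SET gpa = 3.6 WHERE rollno = 101;
DELETE FROM student_acid WHERE rollno = 102;
```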

Starting Hive Shell

To start Hive, go to the installation path of Hive and type as below:

Important Hive Commands

Database (1)

• A database is like a container for data. It holds a collection of tables, which house the data.

To create a database named “STUDENTS” with a comment and database properties:

CREATE DATABASE IF NOT EXISTS STUDENTS
COMMENT 'STUDENT Details'
WITH DBPROPERTIES ('creator' = 'JOHN');

Explanation of the syntax:

• IF NOT EXISTS: An optional clause. With it, the CREATE DATABASE statement creates the database only if it does not already exist. If a database with the same name already exists, the statement leaves it unchanged and does not raise an error.

• COMMENT: Provides a short description of the database.

• WITH DBPROPERTIES: An optional clause used to specify properties of the database as (key, value) pairs. In the above example, 'creator' is the key and 'JOHN' is the value.

Note

• We have not specified the location where the Hive database will be created.
• By default, all Hive databases are created under the default warehouse
directory (set by the property hive.metastore.warehouse.dir) as
/user/hive/warehouse/database_name.db.
• To use a different location, add the optional LOCATION clause.
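
A sketch of the LOCATION clause. The database name STUDENTS_ARCHIVE and the HDFS path are hypothetical.

```sql
-- Tables created in this database are stored under the given directory
-- instead of the default warehouse directory.
CREATE DATABASE IF NOT EXISTS STUDENTS_ARCHIVE
COMMENT 'Archived student details'
LOCATION '/data/hive/students_archive';
```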

Database(2)

➢ Show Databases
➢ Objective: To display a list of all databases.
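
The statements themselves, as a short sketch. The pattern 'stu*' is only an illustration.

```sql
-- List all databases.
SHOW DATABASES;

-- List only the databases whose names match a pattern.
SHOW DATABASES LIKE 'stu*';

-- Switch to a database and list its tables.
USE STUDENTS;
SHOW TABLES;
```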

Database (3)

To describe a database.
➢ Shows only DB name, comment, and DB directory.

DESCRIBE DATABASE STUDENTS;
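
To also see the properties set with WITH DBPROPERTIES, the EXTENDED keyword can be added. A sketch, assuming the STUDENTS database was created as shown earlier:

```sql
-- Shows the database properties (such as 'creator') in addition to the
-- name, comment, and directory.
DESCRIBE DATABASE EXTENDED STUDENTS;
```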

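Database (4)

To drop a database.
• The DROP DATABASE command in Hive deletes an existing database. By default
(RESTRICT) it fails if the database still contains tables; with the CASCADE option it also
deletes all of the database's tables, partitions, and data.

DROP DATABASE STUDENTS;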
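
Two common variants, as a sketch. Hive's default drop behaviour is RESTRICT, which refuses to drop a database that still contains tables, so CASCADE is needed to remove the tables as well.

```sql
-- Do not raise an error if the database is missing.
DROP DATABASE IF EXISTS STUDENTS;

-- Drop the database together with any tables it still contains.
DROP DATABASE IF EXISTS STUDENTS CASCADE;
```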

Aggregations

Aggregation

Hive supports aggregation functions such as avg and count.

To use the average and count aggregation functions:
SELECT avg(gpa) FROM STUDENT;
SELECT count(*) FROM STUDENT;
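
Several aggregates can be computed in one pass over the table. A sketch against the same STUDENT table, with hypothetical column aliases:

```sql
SELECT min(gpa)            AS lowest_gpa,
       max(gpa)            AS highest_gpa,
       sum(gpa)            AS total_gpa,
       count(DISTINCT gpa) AS distinct_gpa_values
FROM STUDENT;
```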

Group by and Having

Group by Having

To write a query that uses the GROUP BY and HAVING clauses:

SELECT rollno, name, gpa
FROM STUDENT GROUP BY rollno, name, gpa
HAVING gpa > 4.0;
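
HAVING is more commonly used to filter on an aggregate computed per group. A sketch against the same STUDENT table, where the threshold of 1 is arbitrary:

```sql
-- Count the students at each gpa value and keep only the values
-- shared by more than one student.
SELECT gpa, count(*) AS student_count
FROM STUDENT
GROUP BY gpa
HAVING count(*) > 1;
```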

SerDe

SerDe

• SerDe stands for Serializer/Deserializer.
• Contains the logic to convert unstructured data into records.
• Implemented in Java.
• Serializers are used at the time of writing.
• Deserializers are used at query time (SELECT statement).
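
A SerDe is attached to a table with the ROW FORMAT SERDE clause. The sketch below uses Hive's built-in OpenCSVSerde (available from Hive 0.14); the table name student_csv and the location are hypothetical. This SerDe exposes every column as a string, so numeric columns are cast at query time.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS student_csv (
  rollno STRING,
  name   STRING,
  gpa    STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE
LOCATION '/data/student_csv';

-- The deserializer runs here, turning each CSV line into a record.
SELECT name, CAST(gpa AS FLOAT) AS gpa
FROM student_csv;
```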

Q&A

End
