
Sri Raghavendra Educational Institutions Society (R)

Sri Krishna Institute of Technology


(Approved by AICTE, Accredited by NAAC, Affiliated to VTU, Karnataka)

Title: Big Data And Analytics


Sub Code: 18CS72
Presented by: KAVYA M
Department: Computer Science & Engineering


Module-4

MapReduce, Hive and Pig

Dept. of ISE

Big Data & Analytics/18CS72 /skit.org.in



4.1.1 INTRODUCTION


Key Terms
• MapReduce programming model: a programming paradigm for processing Big Data sets in a parallel and distributed environment using map and reduce tasks.
• YARN: provisions, runs and schedules the parallel map and reduce tasks, and allocates the parallel-processing resources for the sub-tasks of a user application running on Hadoop.
• Script: a small program (up to a few thousand lines of code) used for purposes such as query processing or text processing, written in a dynamic, high-level, general-purpose language such as Python or Perl.

SQL-like scripting language:


A language for writing scripts that process queries in a manner similar to SQL. SQL lets us:
(i) write structured queries for processing in a DBMS,
(ii) create and modify schemas, and control data access,
(iii) create clients for sending query scripts, and create and manage server databases, and
(iv) view, query and change (update, insert, append or delete) databases.


4.2.1 MAPREDUCE MAP TASKS, REDUCE TASKS AND MAPREDUCE EXECUTION
• Big Data processing employs the MapReduce programming model.
• A job is a MapReduce program.
• Each job consists of several smaller units, called MapReduce tasks.
• The software execution framework in MapReduce programming defines and runs the parallel tasks.


MapReduce process on client submitting a job


• A user application specifies the locations of the input/output data and supplies the map and reduce functions.
• The Hadoop job client then submits the job (jar/executable, etc.) and its configuration to the JobTracker, which takes responsibility for distributing the software/configuration to the slaves, scheduling the tasks, monitoring them, and providing status and diagnostic information to the job client.
• The master (JobTracker) is responsible for scheduling the component tasks of a job onto the slaves, monitoring them and re-executing failed tasks. The slaves execute the tasks as directed by the master.


4.2.1 Map Tasks
• A map task implements map(), which runs user application code for each input key-value pair (k1, v1).
• The output of map() is zero (when no values are found) or more intermediate key-value pairs (k2, v2). The value v2 is the information later used at the reducer for the transformation operation, using aggregation or other reducing functions.
• A reduce task takes the output v2 from the map as input and combines those data pieces into a smaller set of data using a combiner.
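As a minimal sketch of this contract (in Python rather than Hadoop's Java API; the name map_task and the word-count logic are illustrative assumptions), a map task applied to one (k1, v1) pair emits intermediate pairs like this:

```python
# Illustrative sketch, not Hadoop's actual API: run the user's map()
# logic on a single input pair (k1, v1) and emit (k2, v2) pairs.
def map_task(k1, v1):
    """k1: document id, v1: document text; emits one (word, 1) pair per word."""
    return [(word, 1) for word in v1.split()]

pairs = map_task("doc1", "big data big analytics")
print(pairs)  # [('big', 1), ('data', 1), ('big', 1), ('analytics', 1)]
```

Each occurrence of a word yields its own (k2, v2) pair; the reducer later aggregates the v2 values for each distinct key.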


Logical View of map() Functioning


Hadoop Mapper Class


The Hadoop Java API includes the Mapper class.
An abstract function, map(), is present in the Mapper class.
Any specific Mapper implementation should be a subclass of this class and override the abstract function map().
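The subclass-and-override pattern can be mirrored in Python as a rough analogue (a sketch only; Hadoop's real Mapper is a Java class with a different signature):

```python
from abc import ABC, abstractmethod

class Mapper(ABC):
    """Rough Python analogue of the Hadoop Mapper class pattern."""
    @abstractmethod
    def map(self, key, value):
        """Override to emit a list of intermediate (k2, v2) pairs."""

class WordCountMapper(Mapper):
    # Concrete subclass overriding the abstract map(), as the slide describes.
    def map(self, key, value):
        return [(word, 1) for word in value.split()]

mapper = WordCountMapper()
print(mapper.map("doc1", "hive pig hive"))  # [('hive', 1), ('pig', 1), ('hive', 1)]
```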


4.2.2 Key-Value Pair


• MapReduce uses only key-value pairs as input and output.
• Hence, the available data must first be converted into key-value pairs before being passed to the Mapper, as the Mapper only understands key-value pairs of data.

• Key-value pairs in Hadoop MapReduce are generated as follows:


• InputSplit - Defines a logical representation of the data and presents the split data for processing at an individual map().
• As users, we do not deal with InputSplit in Hadoop directly; InputFormat is responsible for creating the InputSplit and dividing it into records. FileInputFormat, for example, breaks a file into 128 MB chunks.
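How an input format turns raw file contents into map() inputs can be sketched as follows (a simplification: Hadoop's TextInputFormat keys each line by its byte offset, and real splits are block-sized, e.g. 128 MB, not whole strings):

```python
def to_records(split_text):
    """Turn a text split into (byte offset, line) key-value pairs,
    roughly mirroring what TextInputFormat does for each InputSplit."""
    pairs, offset = [], 0
    for line in split_text.splitlines(keepends=True):
        pairs.append((offset, line.rstrip("\n")))
        offset += len(line)  # key of the next record = running byte offset
    return pairs

print(to_records("first line\nsecond line\n"))
# [(0, 'first line'), (11, 'second line')]
```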


Steps Involved in MapReduce key-value pairing


The map and reduce functions use key-value pairs at four places:


1. map() input,
2. map() output,
3. reduce() input and
4. reduce() output.
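The four stages can be traced end-to-end with a small word-count job in pure Python (a single-process sketch; the name run_job is ours):

```python
from itertools import groupby

def run_job(docs):
    """Trace the four key-value stages of a word-count job."""
    intermediate = []                          # 2. map() output (k2, v2)
    for k1, v1 in docs.items():                # 1. map() input (k1, v1)
        intermediate.extend((w, 1) for w in v1.split())
    intermediate.sort(key=lambda kv: kv[0])    # shuffle and sort by key
    out = {}
    for k2, group in groupby(intermediate, key=lambda kv: kv[0]):
        values = [v for _, v in group]         # 3. reduce() input (k2, list(v2))
        out[k2] = sum(values)                  # 4. reduce() output (k3, v3)
    return out

print(run_job({"d1": "big data", "d2": "big big"}))  # {'big': 3, 'data': 1}
```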


4.2.3 Grouping by Key


• The Mapper output is grouped by key, and each value v2 is appended to a list of values.
• A "Group By" operation on the intermediate keys creates this list of v2 values.

• Shuffle and Sorting Phase


• Shuffling in MapReduce is the process of transferring data from the mappers to the reducers, i.e., the process by which the system sorts the map output and transfers it to the reducers as input. The shuffle phase is therefore necessary for the reducers; otherwise, they would not have any input (or input from every mapper).
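In miniature, the shuffle can be sketched as merging the outputs of several mappers and grouping them by key (a single-process simplification; the function name shuffle is ours):

```python
from collections import defaultdict

def shuffle(mapper_outputs):
    """Merge per-mapper (k2, v2) lists into sorted (k2, list(v2)) groups."""
    grouped = defaultdict(list)
    for output in mapper_outputs:          # one list per mapper
        for k2, v2 in output:
            grouped[k2].append(v2)
    return dict(sorted(grouped.items()))   # reducers see keys in sorted order

m1 = [("pig", 1), ("hive", 1)]
m2 = [("hive", 1), ("pig", 1), ("pig", 1)]
print(shuffle([m1, m2]))  # {'hive': [1, 1], 'pig': [1, 1, 1]}
```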


4.2.4 Partitioning

A partitioner partitions the key-value pairs of the intermediate map outputs.


It partitions the data using a user-defined condition, which works like a hash function.
• The total number of partitions is the same as the number of reduce tasks for the job. Let us take an example to understand how the partitioner works.
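As one such example, a hash-style partitioner can be sketched in Python (the toy character-sum hash below stands in for Java's hashCode(); the idea mirrors Hadoop's default hash partitioner, which computes the key's hash modulo the number of reduce tasks):

```python
def partition(key, num_reducers):
    """Assign an intermediate key to one of num_reducers partitions."""
    h = sum(ord(c) for c in key)   # toy, stable hash of the key
    return h % num_reducers

# Route each intermediate pair to its reducer.
pairs = [("apple", 1), ("banana", 1), ("cherry", 1)]
by_reducer = {r: [] for r in range(2)}
for k, v in pairs:
    by_reducer[partition(k, 2)].append((k, v))
print(by_reducer)  # every pair with the same key lands at the same reducer
```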


4.2.5 Combiners
• A combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class.

• The main function of a combiner is to summarize the map output records with the same key. The output (key-value collection) of the combiner is sent over the network to the actual reduce task as input.

• The Combiner class is used between the Map class and the Reduce class to reduce the volume of data transferred between Map and Reduce. Usually, the output of the map task is large, so the volume of data transferred to the reduce task is high.
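The savings can be sketched as follows: a combiner locally sums the (word, 1) pairs of one map task before they cross the network (the function name combine is an illustrative assumption):

```python
from collections import Counter

def combine(map_output):
    """Locally aggregate same-key pairs from a single map task."""
    counts = Counter()
    for k2, v2 in map_output:
        counts[k2] += v2
    return sorted(counts.items())

map_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]
combined = combine(map_output)
print(combined)                               # [('big', 3), ('data', 1)]
print(len(map_output), "->", len(combined))   # 4 -> 2 records to transfer
```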


4.2.6 Reduce Tasks

• The Hadoop Java API includes the Reducer class. An abstract function, reduce(), is in the Reducer class. Any specific Reducer implementation should be a subclass of this class and override the abstract reduce().
• A reduce task implements reduce(), which takes the shuffled and sorted Mapper output, grouped by key as (k2, list(v2)), and applies the function in parallel to each group.
• The intermediate pairs arrive at the input of each Reducer sorted by key.
• The reduce function iterates over the list of values associated with a key and produces outputs such as aggregations and statistics.
• The reduce function sends zero or more output key-value pairs (k3, v3) to the final output file. Reduce: {(k2, list(v2))} -> list(k3, v3)
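A reduce() over one (k2, list(v2)) group can be sketched in Python, emitting aggregate statistics as the (k3, v3) output (the name reduce_task and the chosen statistics are illustrative):

```python
def reduce_task(k2, values):
    """Iterate over list(v2) for one key and emit an aggregate (k3, v3) pair."""
    return (k2, {"count": len(values),
                 "sum": sum(values),
                 "max": max(values)})

print(reduce_task("sales", [10, 25, 5]))
# ('sales', {'count': 3, 'sum': 40, 'max': 25})
```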


4.2.7 Details of MapReduce Processing Steps


4.2.8 Coping with Node Failures


Hadoop achieves fault tolerance by restarting tasks.
Each task node (TaskTracker) regularly communicates with the master node (JobTracker).
If a TaskTracker fails to communicate with the JobTracker for a predefined period (by default, 10 minutes), the JobTracker assumes node failure.
The JobTracker knows which map and reduce tasks were assigned to each TaskTracker.
• If a TaskTracker has already completed nine out of ten reduce tasks assigned to it, only the tenth task must execute at a different node.
• Map tasks are slightly more complicated. A node may have completed ten map tasks, but the reducers may not yet have copied all their inputs from the outputs of those map tasks. If the node fails, its Mapper outputs become inaccessible. Thus, even completed map tasks must be re-executed to make their results available to the remaining reduce nodes. Hadoop handles all of this automatically.


The following points summarize the coping mechanisms for distinct node failures:
• (i) Map TaskTracker failure:
Map tasks completed or in progress at the TaskTracker are reset to idle on failure.
Reduce TaskTrackers get a notice when a task is rescheduled on another TaskTracker.
• (ii) Reduce TaskTracker failure:
Only in-progress tasks are reset to idle.
• (iii) Master JobTracker failure:
The MapReduce task aborts and notifies the client (in the case of a single master node).


4.3.2 Matrix-Vector Multiplication by MapReduce
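The standard MapReduce formulation can be sketched in Python: the map phase emits one (row index, partial product) pair per matrix entry, and the reduce phase sums the partial products for each row (a single-process sketch; the name matvec is ours):

```python
from collections import defaultdict

def matvec(matrix, vector):
    """Matrix-vector multiplication in MapReduce style."""
    # Map phase: emit (i, m[i][j] * v[j]) for every matrix entry.
    intermediate = [(i, mij * vector[j])
                    for i, row in enumerate(matrix)
                    for j, mij in enumerate(row)]
    # Reduce phase: sum the partial products grouped by row index i.
    result = defaultdict(float)
    for i, prod in intermediate:
        result[i] += prod
    return [result[i] for i in sorted(result)]

print(matvec([[1, 2], [3, 4]], [10, 1]))  # [12.0, 34.0]
```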


4.3.3 Relational-Algebra Operations

Relational-algebra operations on large datasets using MapReduce:


1 Selection
2 Projection
3 Union
4 Intersection and Difference
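Selection and projection, the two simplest cases, can be sketched in Python: selection is a map-only filter, while projection emits each projected tuple as a key so that the reduce step eliminates duplicates (a single-process sketch; function names are ours):

```python
def select(tuples, predicate):
    """Selection: map emits each tuple t that satisfies the predicate."""
    return [t for t in tuples if predicate(t)]

def project(tuples, indices):
    """Projection: map emits the projected tuple as the key;
    the reduce step collapses duplicate keys."""
    projected = [tuple(t[i] for i in indices) for t in tuples]
    return sorted(set(projected))

rows = [("ann", "cs", 21), ("bob", "ise", 22), ("cal", "cs", 23)]
print(select(rows, lambda t: t[1] == "cs"))  # [('ann', 'cs', 21), ('cal', 'cs', 23)]
print(project(rows, [1]))                    # [('cs',), ('ise',)]
```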


4.4.1 HIVE
• Hive was created by Facebook.
• Hive is a data warehousing tool and also a data store on top of Hadoop.
• Enterprises use data warehouses as large data repositories designed to enable searching, managing, and analyzing data.
• Hive processes structured data and integrates well with heterogeneous sources.
• It also manages large volumes of data.


HIVE Features


Hive Characteristics
1. Has the capability to translate queries into MapReduce jobs. This makes Hive scalable, able to handle data warehouse applications, and therefore suitable for the analysis of extremely large, static datasets.
2. Supports web interfaces as well. Application APIs, as well as web-browser clients, can access the Hive DB server.
3. Provides an SQL dialect (Hive Query Language, abbreviated HiveQL or HQL).


HIVE Limitations
1. Not a full database. The main disadvantage is that Hive does not provide update, alter and delete operations on records in the database.
2. Not developed for unstructured data.
3. Not designed for real-time queries.
4. Always performs partitioning from the last column.


4.4.1 Hive Architecture


Hive architecture components are:


• Hive Server - An optional service that allows a remote client to submit requests to Hive and retrieve results. Requests can use a variety of programming languages. Hive Server exposes a very simple client API for executing HiveQL statements.
• Hive CLI (Command Line Interface) - The most popular interface for interacting with Hive. Hive runs in local mode, using local storage instead of HDFS, when the CLI is run on a Hadoop cluster.
• Web Interface - Hive can also be accessed through a web browser. This requires an HWI Server running on a designated node. The URL http://hadoop:<port no.>/hwi can be used to access Hive through the web.


• Metastore - It is the system catalog. All other components of Hive interact with the Metastore. It stores the
schema or metadata of tables, databases, columns in a table, their data types and HDFS mapping.
• Hive Driver - It manages the life cycle of a HiveQL statement during compilation, optimization and
execution.


4.4.2 Hive Installation


Hive can be installed on Windows 10 or Ubuntu 16.04, with MySQL as the metastore database. It requires three software packages:
• Java Development Kit, for the Java compiler (javac) and interpreter
• Hadoop
• A version of Hive compatible with the installed Java - Hive 1.2 onward supports Java 1.7 or newer.


Steps for installation of Hive on a Linux-based OS are as follows:


1. Install javac and Java from the Oracle Java download site. Download JDK 7 or a later version from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html, and extract the compressed file.
Make Java available to all users by moving it to the location "/usr/local/" using the required commands.
2. Set the path with the following commands for jdk1.7.0_71:
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
(Alternatively, register the installation with update-alternatives, mapping /usr/bin/java to /usr/local/java/bin/java with priority 2.)


3. Install Hadoop from http://apache.claz.org/hadoop/common/hadoop-2.4.1/

4. Make the shared HADOOP, MAPRED, COMMON, HDFS and all related files, configure HADOOP, and set properties such as the replication parameter.
5. Name the yarn.nodemanager.aux-services property and assign it the value mapreduce_shuffle. Set the namenode and datanode paths.
6. Download Hive from http://apache.petsads.us/hive/hive-0.14.0/. Extract it and use the ls command to verify the files: $ tar zxvf apache-hive-0.14.0-bin.tar.gz, $ ls
7. Use an external database server. Configure the metastore for that server.


4.4.3 Comparison with RDBMS (Traditional Database)


4.4.4 Hive Data Types and File Formats


Hive has three Collection data types


HIVE file formats and their descriptions


4.4.5 Hive Data Model


4.4.6 Hive Integration and Workflow Steps


Hive integrates with MapReduce and HDFS. The figure below shows the dataflow sequences and workflow steps between Hive and Hadoop.


4.4.7 Hive Built-in Functions


4.5 HIVEQL
Hive Query Language (abbreviated HiveQL) is used for querying the large datasets that reside in the HDFS environment.
HiveQL script commands enable data definition, data manipulation and query processing.
HiveQL supports the large base of SQL users who use SQL to extract information from data warehouses.


4.5.1 HiveQL Data Definition Language (DDL)


• HiveQL database commands for data definition for DBs and tables are:
• CREATE DATABASE,
• SHOW DATABASES (lists all DBs),
• CREATE SCHEMA,
• CREATE TABLE.


The following are HiveQL commands that create a table:


Creating a Database

Showing Database

Dropping a Database


4.5.2 HiveQL Data Manipulation Language (DML)

HiveQL commands for data manipulation are:


USE <database name>,
DROP DATABASE,
DROP SCHEMA,
ALTER TABLE,
DROP TABLE, and
LOAD DATA.


Loading Data into HIVE DB


4.5.3 HiveQL for Querying the Data


• Any data analysis application needs to partition and store its data.
• A data warehouse should have a large number of partitions where the tables, files and databases are stored.
• Querying then requires sorting, aggregating and joining functions.


4.5.3.1 Partitioning
• Table partitioning refers to dividing the table data into parts based on the values of a particular set of columns.
• Hive organizes tables into partitions.
• Partitioning makes querying easy and fast.
• This is because a SELECT then reads from a smaller number of column fields.
• The following example explains the concepts of partitioning, columnar formats and file-record formats.


Table Partitioning


Renaming the Partition


Add a partition to the Existing Table


Drop a partition


Partitioning: Fast Query Processing


The following example shows how a query is processed faster by using partitioning of a table.
A query processes faster when using a partition.
Selecting a product of a specific category from a table during query processing takes less time when the table has a partition based on category.


Advantages of Partitioning
1. Distributes the execution load horizontally.
2. Query response time becomes faster when processing a small part of the data instead of searching the entire dataset.
Limitations of Partitioning
1. Creating a large number of partitions in a table leads to a large number of files and directories in HDFS, which is an overhead for the NameNode, since it must keep all metadata for the file system in memory.
2. Partitions may optimize some queries based on WHERE clauses, but they may be less responsive for other important queries on grouping clauses.


4.5.3.2 Bucketing
• A partition itself may have a large number of columns when tables are very large.
• Tables or partitions can be sub-divided into buckets.
• The division is based on the hash of a column in the table.
• The CLUSTERED BY clause divides a table into buckets.
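The deck's original HiveQL bucketing example is not reproduced here; as a hedged Python sketch of the idea only, bucket assignment hashes a chosen column value modulo the bucket count (the toy character-sum hash below is an assumption, not Hive's actual hash function):

```python
def bucket_of(column_value, num_buckets):
    """Assign a row to a bucket by hashing one column's value."""
    h = sum(ord(c) for c in str(column_value))  # toy, stable hash
    return h % num_buckets

# Rows with equal column values always land in the same bucket.
user_ids = ["u101", "u102", "u103", "u104"]
buckets = {b: [] for b in range(2)}
for user_id in user_ids:
    buckets[bucket_of(user_id, 2)].append(user_id)
print(buckets)
```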


4.5.3.3 Views

Views provide ease of programming.


Complex queries are simplified using reusable views.
A view provides the following:
• Saves the query and reduces query complexity
• Can be used like a table, although a view does not store data like a table
• Hides the complexity by dividing the query into smaller, more manageable pieces
• Hive executes the view, and the planner then combines the information in the view definition with the remaining actions of the query


Hive supports the following built-in aggregation functions. The usage of these functions is the same as that of the SQL aggregate functions.


4.5.5 Join
• A JOIN clause combines columns of two or more tables, based on a relation between them.
• HiveQL Join is more or less similar to SQL JOINS.
• Example:


4.5.6 Group By Clause


4.6.1 PIG

Apache Pig:


• Is an abstraction over MapReduce
• Is an execution framework for parallel processing
• Reduces the complexity of writing a MapReduce program
• Is a high-level dataflow language. Dataflow means that a Pig operation node takes the inputs and generates the output for the next node
• Is mostly used in the HDFS environment
• Performs data manipulation operations on files at the data nodes in Hadoop.


Applications of Apache Pig


Applications of Pig are:
• Analyzing large datasets
• Executing tasks involving ad hoc processing
• Processing large data sources such as web logs and streaming online data
• Data processing for search platforms; Pig processes different types of data
• Processing time-sensitive data loads, where data must be extracted and analyzed quickly. For example, analyzing data from Twitter to find patterns of user behavior and make recommendations.


Differences between Pig and MapReduce


Differences between Pig and SQL


Differences between Pig and Hive


Pig Architecture


The three ways to execute Pig scripts are:


1. Grunt Shell: An interactive shell of Pig that executes the scripts.
2. Script File: Pig commands written in a script file that executes on the Pig server.
3. Embedded Script: Create UDFs for the functions unavailable as Pig built-in operators. UDFs can be written in other programming languages and embedded in a Pig Latin script file.


Installing Pig


Discussion
