Hive Query Optimization Infinity

Well-designed tables, using partitioning and bucketing, can improve query speed and reduce processing costs in Hive. The document discusses partitioning Hive tables horizontally by fields such as date or location so that related records are grouped together. It also covers bucketing tables to enable more efficient queries and sampling. Parallel query execution in Hive allows subqueries that are not interdependent to run simultaneously to improve performance.

Uploaded by

shashwat2010

dwivedishashwat@gmail.com http://helpmetocode.blogspot.com

Well-designed tables (partitioning, bucketing) and well-written queries can improve your query speed and reduce processing cost.

Optimization on Table side

Partitioning Hive Tables:
Partitioning is a kind of horizontal slicing of data. The slicing can be on a range, a single value, or a set of values. Imagine log files where each record includes a timestamp. If we partition by date, records for the same date are stored in the same partition. E.g.:
Partition on date.
Partition on geographic location.
Partition on number range.

Defining a table partition

Let's take an Apache log file example, where the web server generates a log entry on each client visit. These logs contain date and time information along with the browser and location (IP). We can create a table in Hive, partition the log data by date, and sub-partition it by location, which looks like:

CREATE TABLE logs (ts BIGINT, detail STRING) PARTITIONED BY (dt STRING, country STRING);

Log Table

Directory Structure

/user/hive/warehouse/logs
    /dt=2010-01-01
        /country=GB
            /file1
            /file2
        /country=US
            /file3
    /dt=2010-01-02
        /country=GB
            /file4
        /country=US
            /file5
            /file6
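The link between partition values and directories can be sketched in a few lines of Python. This is only an illustration of the idea, not Hive's implementation: the file list mirrors the directory structure above, and the helper names `partition_path` and `prune` are hypothetical.

```python
from pathlib import PurePosixPath

# Warehouse root and files taken from the directory structure above.
WAREHOUSE = PurePosixPath("/user/hive/warehouse/logs")
FILES = [
    ("2010-01-01", "GB", "file1"),
    ("2010-01-01", "GB", "file2"),
    ("2010-01-01", "US", "file3"),
    ("2010-01-02", "GB", "file4"),
    ("2010-01-02", "US", "file5"),
    ("2010-01-02", "US", "file6"),
]


def partition_path(dt, country, name):
    """Build a file path from its partition key values (hypothetical helper)."""
    return WAREHOUSE / f"dt={dt}" / f"country={country}" / name


def prune(dt=None, country=None):
    """Return only the files whose partition directories match the filter."""
    return [
        str(partition_path(d, c, n))
        for d, c, n in FILES
        if (dt is None or d == dt) and (country is None or c == country)
    ]


# Roughly the files a filter on dt = '2010-01-02' and country = 'US' has to read.
for path in prune(dt="2010-01-02", country="US"):
    print(path)
```

A filter on the partition columns touches only the matching directories, which is why queries restricted by date or location read less data.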

Hive Buckets
Bucketing Hive Tables:
Bucketing Hive tables results in more efficient queries.

Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. If two tables are bucketed in the same way on the join column, a mapper processing a bucket of the left table knows that the matching rows in the right table are in its corresponding bucket, so it need only retrieve that bucket. Buckets may additionally be sorted by one or more columns. This allows even more efficient map-side joins, since the join of each bucket becomes an efficient merge-sort.

It also makes sampling more efficient.
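A minimal Python sketch of the bucketing idea follows. It assumes integer keys and a bucket count of 4, both chosen only for illustration, and it uses a per-bucket lookup table instead of the merge step mentioned above. The function names are hypothetical.

```python
NUM_BUCKETS = 4  # assumed bucket count, chosen only for illustration


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Assign an integer key to a bucket; stands in for hash(key) mod buckets."""
    return key % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows into buckets, each bucket sorted by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return [sorted(bucket) for bucket in buckets]


def bucketed_join(left_rows, right_rows):
    """Join bucket by bucket: bucket i of the left only meets bucket i of the right."""
    joined = []
    for left_b, right_b in zip(bucketize(left_rows), bucketize(right_rows)):
        right_index = {}
        for key, value in right_b:
            right_index.setdefault(key, []).append(value)
        for key, value in left_b:
            for other in right_index.get(key, []):
                joined.append((key, value, other))
    return sorted(joined)


users = [(1, "ann"), (2, "bob"), (5, "eve")]
orders = [(1, "book"), (5, "pen"), (5, "ink"), (7, "cup")]
print(bucketed_join(users, orders))

# Sampling: reading a single bucket gives a slice of the table without a full scan.
print(bucketize(users)[1])
```

Because matching keys always land in the same bucket number, each bucket pair can be joined independently, and a single bucket can serve as a sample.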

Parallel execution of queries

Hadoop can execute MapReduce jobs in parallel, and Hive queries can make use of this parallelism. Queries or sub queries that are not interdependent, such as some join queries, can be executed in parallel.

The following setting turns this mode on:

SET hive.exec.parallel=true; -- turns parallel execution on

[Figure: query flow diagram with numbered steps. Boxes: Sub query 1, Sub query 2, Sub query 3, join of sub queries 1 & 2, join of (1 & 2) with 3, Main Query, Final Result.]

Misc
So in the above flow, 1, 2, and 4 can run in parallel as sub queries; they are then joined to 3 and then to 5 to give the final query result.
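The scheduling idea can be sketched in Python, with threads standing in for Hive stages. This is an analogy under stated assumptions, not how Hive runs jobs: the three sub queries, their data, and the `join` helper are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


# Three independent sub queries (hypothetical data: key -> tuple of columns).
def sub_query_1():
    return {1: ("a",), 2: ("b",), 3: ("c",)}


def sub_query_2():
    return {2: ("x",), 3: ("y",), 4: ("z",)}


def sub_query_3():
    return {3: ("p",), 4: ("q",)}


def join(left, right):
    """Inner join of two key -> tuple mappings on their shared keys."""
    return {k: left[k] + right[k] for k in left.keys() & right.keys()}


def run():
    # Independent sub queries are submitted together and may overlap in time.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(q) for q in (sub_query_1, sub_query_2, sub_query_3)]
        r1, r2, r3 = (f.result() for f in futures)
    # The joins depend on earlier results, so they run one after the other.
    return join(join(r1, r2), r3)


print(run())
```

Only the stages with no dependency on each other overlap; each join still waits for its inputs.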

Since a map join is faster than the common join, it's better to run a map join whenever possible. Previously, Hive users needed to give a hint in the query to specify the small table. For example, with src1 as the small table:

select /*+ mapjoin(x) */ * from src1 x join src2 y on x.key = y.key;

Newer Hive versions convert a normal join to a map join automatically when hive.auto.convert.join is enabled and the small table fits in memory.
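Why a map join avoids the reduce phase can be seen in a small Python sketch: the small table is loaded into an in-memory lookup and the large table is streamed past it. The function name and sample rows are hypothetical, and this is an illustration of the technique rather than Hive's code.

```python
def map_join(small_rows, large_rows):
    """Hash join: keep the small table in memory, stream the large one."""
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    # Each large-table row is matched locally; no shuffle or sort is needed.
    for key, value in large_rows:
        for small_value in lookup.get(key, []):
            yield key, small_value, value


src1 = [("k1", "a"), ("k2", "b")]               # small table
src2 = [("k1", "x"), ("k1", "y"), ("k3", "z")]  # large table
print(list(map_join(src1, src2)))
```

The approach only works while the small table fits in memory, which is why the choice of the small table matters.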

Some examples

Which query is faster? (column and table are placeholders.)

SELECT COUNT(DISTINCT column) FROM table;

Or

SELECT COUNT(*) FROM (SELECT DISTINCT column FROM table) t;

Answer

The 2nd one is usually faster on large tables.

In the first case:
Maps send each value to the reducer.
A single reducer counts them all (overhead).

In the second case:
Maps split the values across many reducers.
Each reducer generates a list of distinct values.
The final job adds up the sizes of the lists.

Note: A singleton reducer is not always good.
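The two cases can be mimicked in Python to show that they give the same answer while spreading the work differently. This is a simplified model: the reducer count of 3 and the function names are assumptions made for the example.

```python
from collections import defaultdict


def count_distinct_single_reducer(values):
    """First query: one reducer receives every value and deduplicates alone."""
    return len(set(values))


def count_distinct_two_stage(values, num_reducers=3):
    """Second query: each reducer deduplicates its share, a final job adds sizes."""
    shares = defaultdict(set)
    for value in values:
        # The same value always goes to the same reducer, so no value is counted twice.
        shares[hash(value) % num_reducers].add(value)
    return sum(len(share) for share in shares.values())


data = ["a", "b", "a", "c", "b", "d"]
print(count_distinct_single_reducer(data), count_distinct_two_stage(data))
```

In the two-stage version the deduplication work is divided among the reducers, and the final step only adds a few numbers.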

Tips
Hive does not know whether a query is bad.

So use EXPLAIN on queries you suspect are bad, and even on ones you don't. EXPLAIN tells you the following:
Number of jobs
Number of maps and reduces
What each job is sorting by
Which directories it will read
So EXPLAIN helps you see the difference between two or more queries written for the same purpose. Job configuration and history can also be studied to understand query performance.
