Dynamic Partition Pruning
in Apache Spark
Spark + AI Summit, Amsterdam
Bogdan Ghit and Juliusz Sompolski
About Us
BI Experience team in the
Databricks Amsterdam European Development Centre
● Working on improving the experience and performance of
Business Intelligence / SQL analytics workloads using
Databricks
○ JDBC / ODBC connectivity to Databricks clusters
○ Integrations with BI tools such as Tableau
○ But also: core performance improvements in
Apache Spark for common SQL analytics query
patterns
Bogdan Ghit
Juliusz Sompolski
TPCDS Q98 on 10 TB
How to Make a Query 100x Faster?
Static Partition Pruning
SELECT * FROM Sales WHERE day_of_week = 'Mon'
Diagram, three panels:
- Basic data-flow: Scan, then Filter
- Filter push-down: the filter is pushed into the Scan
- Data layout: partition files with multi-columnar data, so a pushed-down
  filter on the partition column can skip entire partition files
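A minimal sketch of what static pruning relies on, assuming a Sales table partitioned by day_of_week (the DDL and column names are illustrative, not from the talk):

CREATE TABLE Sales (item_sk BIGINT, price DECIMAL(7,2), day_of_week STRING)
USING parquet
PARTITIONED BY (day_of_week);

-- The literal predicate is pushed down into the scan, so only the
-- day_of_week=Mon partition directories are listed and read.
EXPLAIN SELECT * FROM Sales WHERE day_of_week = 'Mon';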
Table Denormalization
SELECT * FROM Sales JOIN Date
WHERE Date.day_of_week = 'Mon'
Static pruning not possible: the plan joins Scan Sales with
(Filter day_of_week = 'Mon' over Scan Date), so the filter never
touches the partition column of Sales.
Simple workaround: denormalize ahead of time, i.e. pre-join Sales
with Date via ETL, then filter day_of_week = 'Mon' over a Scan of
the pre-joined table.
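A sketch of that workaround in SQL, assuming a date_sk join key (names are illustrative):

-- One-off ETL: pre-join the dimension into the fact table and partition
-- the result by the filter column.
CREATE TABLE Sales_denorm
USING parquet
PARTITIONED BY (day_of_week)
AS SELECT s.*, d.day_of_week
FROM Sales s JOIN Date d ON s.date_sk = d.date_sk;

-- The filter is now an ordinary partition predicate, so static pruning
-- applies, at the cost of maintaining a duplicated, denormalized copy.
SELECT * FROM Sales_denorm WHERE day_of_week = 'Mon';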
This Talk
Dynamic pruning
SELECT * FROM Sales JOIN Date
WHERE Date.day_of_week = 'Mon'
Diagram: the Join still combines Scan Sales with (Filter day_of_week = 'Mon'
over Scan Date), but now the result of the dimension filter prunes the
Sales scan at runtime.
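Dynamic partition pruning ships with Spark 3.0; a small sketch of how to see it at work (the date_sk join key is an assumed column, not from the slide):

-- On by default in Spark 3.0; the SET only makes the assumption explicit.
SET spark.sql.optimizer.dynamicPartitionPruning.enabled = true;

-- Conceptually, the optimizer injects a runtime filter on the fact table's
-- partition column, derived from the filtered Date rows:
--   Sales.date_sk IN (SELECT date_sk FROM Date WHERE day_of_week = 'Mon')
EXPLAIN
SELECT * FROM Sales JOIN Date ON Sales.date_sk = Date.date_sk
WHERE Date.day_of_week = 'Mon';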
Spark In a Nutshell
Diagram: APIs → Query → Logical Plan → Physical Plan → RDD batches → Cluster slots
- Optimization of the logical plan: rule-based transformations
- Selection of the physical plan: stats-based cost model
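A quick way to see these stages for a concrete query (reusing the illustrative Sales table from above):

-- EXPLAIN EXTENDED prints the parsed and analyzed logical plans, the
-- optimized logical plan after the rule-based transformations, and the
-- selected physical plan.
EXPLAIN EXTENDED
SELECT * FROM Sales WHERE day_of_week = 'Mon';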
Optimization Opportunities
Data Layout
- Fact table: partition files with multi-columnar data (Scan FACT TABLE)
- Dimension table: non-partitioned dataset (Scan DIM TABLE)
Query Shape
- Join on partition id
- Filter on DIM
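A minimal layout matching these assumptions (all table and column names here are illustrative):

-- Fact table partitioned on the join key; dimension table unpartitioned.
CREATE TABLE fact_sales (item_sk BIGINT, price DECIMAL(7,2), date_sk BIGINT)
USING parquet
PARTITIONED BY (date_sk);

CREATE TABLE dim_date (date_sk BIGINT, day_of_week STRING)
USING parquet;

-- Query shape: join on the partition column, selective filter on the dimension.
SELECT *
FROM fact_sales f JOIN dim_date d ON f.date_sk = d.date_sk
WHERE d.day_of_week = 'Mon';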
A Simple Approach
Duplicate the dimension-side subquery (Scan DIM TABLE + Filter DIM) into the
scan of the partitioned fact table (Scan FACT TABLE), so that only the
partitions matching the filtered dimension keys are read.
- Work duplication may be expensive
- Heuristics based on inaccurate stats
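In SQL terms, the strawman rewrite amounts to something like the following (a sketch over the illustrative schema above, not the optimizer rule itself):

SELECT *
FROM fact_sales f JOIN dim_date d ON f.date_sk = d.date_sk
WHERE d.day_of_week = 'Mon'
  -- duplicated dimension filter, evaluated a second time just for pruning
  AND f.date_sk IN (SELECT date_sk FROM dim_date WHERE day_of_week = 'Mon');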
Broadcast Hash Join
Diagram: the dimension side (FileScan with Dim Filter over a non-partitioned
dataset) feeds a BroadcastExchange into the Broadcast Hash Join with the
fact-table FileScan.
- Execute the build side of the join
- Place the result in a broadcast variable
- Broadcast the build-side results
- Execute the join locally without a shuffle
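A broadcast hash join can also be requested explicitly with a join hint (sketch over the illustrative schema; Spark chooses it automatically when the build side fits under spark.sql.autoBroadcastJoinThreshold):

-- The filtered dimension table becomes the build side, is materialized into
-- a broadcast variable, and the join then runs locally in each task without
-- a shuffle.
SELECT /*+ BROADCAST(d) */ *
FROM fact_sales f JOIN dim_date d ON f.date_sk = d.date_sk
WHERE d.day_of_week = 'Mon';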
Reusing Broadcast Results
Diagram: the same dimension-side build (FileScan with Dim Filter over a
non-partitioned dataset, flowing through the BroadcastExchange into the
Broadcast Hash Join) is reused as a Dynamic Filter on the FileScan of the
partitioned fact table (partition files with multi-columnar data).
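One way to check that the reuse kicked in (the exact plan text is version dependent, so treat the comment below as indicative):

-- With DPP active, the fact-table FileScan in the physical plan carries the
-- reused broadcast result as a runtime partition filter; in Spark 3.0 it
-- typically shows up as a dynamicpruningexpression inside PartitionFilters.
EXPLAIN FORMATTED
SELECT *
FROM fact_sales f JOIN dim_date d ON f.date_sk = d.date_sk
WHERE d.day_of_week = 'Mon';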
Experimental Setup
Workload Selection
- TPC-DS scale factors 1-10 TB
Cluster Configuration
- 10 i3.xlarge machines
Data-Processing Framework
- Apache Spark 3.0
TPCDS 1 TB
60 out of 102 queries show speedups between 2x and 18x
Top Queries
Very good speedups for the top 10% of the queries
Data Skipped
Very effective in skipping data
TPCDS 10 TB
Even better speedups at 10x the scale
Query 98
SELECT i_item_desc, i_category, i_class, i_current_price,
sum(ss_ext_sales_price) as itemrevenue,
sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) over
(partition by i_class) as revenueratio
FROM
store_sales, item, date_dim
WHERE
ss_item_sk = i_item_sk
and i_category in ('Sports', 'Books', 'Home')
and ss_sold_date_sk = d_date_sk
and cast(d_date as date) between cast('1999-02-22' as date)
and (cast('1999-02-22' as date) + interval '30' day)
GROUP BY
i_item_id, i_item_desc, i_category, i_class, i_current_price
ORDER BY
i_category, i_class, i_item_id, i_item_desc, revenueratio
TPCDS 10 TB
Highly selective dimension filter that retains only
one month out of 5 years of data
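For this query, dynamic partition pruning effectively restricts the store_sales scan to the partitions whose ss_sold_date_sk matches the filtered date_dim rows; conceptually the runtime filter is:

SELECT d_date_sk FROM date_dim
WHERE cast(d_date as date) between cast('1999-02-22' as date)
  and (cast('1999-02-22' as date) + interval '30' day);
-- Only store_sales partitions with ss_sold_date_sk in that result set are
-- read, i.e. roughly one month out of the five years of data.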
Conclusion
Apache Spark 3.0 introduces Dynamic Partition Pruning
- Strawman approach at logical planning time
- Optimized approach during execution time
Significant speedups exhibited in many TPC-DS queries
With this optimization, Spark now works well with star-schema queries,
making it unnecessary to ETL denormalized tables.
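For reference, the main Spark 3.0 knobs behind this feature (values shown are the Spark 3.0 defaults; double-check against the version in use):

SET spark.sql.optimizer.dynamicPartitionPruning.enabled = true;
-- Use distinct-count statistics of the join column to estimate the benefit.
SET spark.sql.optimizer.dynamicPartitionPruning.useStats = true;
-- Fallback selectivity ratio used when column statistics are unavailable.
SET spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio = 0.5;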
Thanks!
Bogdan Ghit - linkedin.com/in/bogdanghit
Juliusz Sompolski - linkedin.com/in/juliuszsompolski
