Dynamic Partition Pruning
in Apache Spark
Spark + AI Summit, Amsterdam
Bogdan Ghit and Juliusz Sompolski
About Us
BI Experience team in the
Databricks Amsterdam European Development Centre
● Working on improving the experience and performance of
Business Intelligence / SQL analytics workloads using
Databricks
○ JDBC / ODBC connectivity to Databricks clusters
○ Integrations with BI tools such as Tableau
○ But also: core performance improvements in
Apache Spark for common SQL analytics query
patterns
Bogdan Ghit
Juliusz Sompolski
TPCDS Q98 on 10 TB
How to Make a Query 100x Faster?
Static Partition Pruning
SELECT * FROM Sales WHERE day_of_week = 'Mon'
Diagram, three panels:
- Basic data-flow: Scan, then Filter
- Filter push-down: the filter is pushed into the Scan
- Data layout: partition files with multi-columnar data, so a pushed-down
  filter on the partition column can skip entire partition files
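A minimal sketch of what static pruning relies on, assuming a Sales table partitioned by day_of_week (the DDL and column names are illustrative, not from the talk):

CREATE TABLE Sales (item_sk BIGINT, price DECIMAL(7,2), day_of_week STRING)
USING parquet
PARTITIONED BY (day_of_week);

-- The literal predicate is pushed down into the scan, so only the
-- day_of_week=Mon partition directories are listed and read.
EXPLAIN SELECT * FROM Sales WHERE day_of_week = 'Mon';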
Table Denormalization
SELECT * FROM Sales JOIN Date
WHERE Date.day_of_week = 'Mon'
Static pruning not possible: the plan joins Scan Sales with
(Filter day_of_week = 'Mon' over Scan Date), so the filter never
touches the partition column of Sales.
Simple workaround: denormalize ahead of time, i.e. pre-join Sales
with Date via ETL, then filter day_of_week = 'Mon' over a Scan of
the pre-joined table.
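A sketch of that workaround in SQL, assuming a date_sk join key (names are illustrative):

-- One-off ETL: pre-join the dimension into the fact table and partition
-- the result by the filter column.
CREATE TABLE Sales_denorm
USING parquet
PARTITIONED BY (day_of_week)
AS SELECT s.*, d.day_of_week
FROM Sales s JOIN Date d ON s.date_sk = d.date_sk;

-- The filter is now an ordinary partition predicate, so static pruning
-- applies, at the cost of maintaining a duplicated, denormalized copy.
SELECT * FROM Sales_denorm WHERE day_of_week = 'Mon';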
This Talk
Dynamic pruning
SELECT * FROM Sales JOIN Date
WHERE Date.day_of_week = 'Mon'
Diagram: the Join still combines Scan Sales with (Filter day_of_week = 'Mon'
over Scan Date), but now the result of the dimension filter prunes the
Sales scan at runtime.
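Dynamic partition pruning ships with Spark 3.0; a small sketch of how to see it at work (the date_sk join key is an assumed column, not from the slide):

-- On by default in Spark 3.0; the SET only makes the assumption explicit.
SET spark.sql.optimizer.dynamicPartitionPruning.enabled = true;

-- Conceptually, the optimizer injects a runtime filter on the fact table's
-- partition column, derived from the filtered Date rows:
--   Sales.date_sk IN (SELECT date_sk FROM Date WHERE day_of_week = 'Mon')
EXPLAIN
SELECT * FROM Sales JOIN Date ON Sales.date_sk = Date.date_sk
WHERE Date.day_of_week = 'Mon';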
Spark In a Nutshell
Diagram: APIs → Query → Logical Plan → Physical Plan → RDD batches → Cluster slots
- Optimization of the logical plan: rule-based transformations
- Selection of the physical plan: stats-based cost model
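A quick way to see these stages for a concrete query (reusing the illustrative Sales table from above):

-- EXPLAIN EXTENDED prints the parsed and analyzed logical plans, the
-- optimized logical plan after the rule-based transformations, and the
-- selected physical plan.
EXPLAIN EXTENDED
SELECT * FROM Sales WHERE day_of_week = 'Mon';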
Optimization Opportunities
Data Layout
- Fact table: partition files with multi-columnar data (Scan FACT TABLE)
- Dimension table: non-partitioned dataset (Scan DIM TABLE)
Query Shape
- Join on partition id
- Filter on DIM
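A minimal layout matching these assumptions (all table and column names here are illustrative):

-- Fact table partitioned on the join key; dimension table unpartitioned.
CREATE TABLE fact_sales (item_sk BIGINT, price DECIMAL(7,2), date_sk BIGINT)
USING parquet
PARTITIONED BY (date_sk);

CREATE TABLE dim_date (date_sk BIGINT, day_of_week STRING)
USING parquet;

-- Query shape: join on the partition column, selective filter on the dimension.
SELECT *
FROM fact_sales f JOIN dim_date d ON f.date_sk = d.date_sk
WHERE d.day_of_week = 'Mon';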
A Simple Approach
Duplicate the dimension-side subquery (Scan DIM TABLE + Filter DIM) into the
scan of the partitioned fact table (Scan FACT TABLE), so that only the
partitions matching the filtered dimension keys are read.
- Work duplication may be expensive
- Heuristics based on inaccurate stats
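In SQL terms, the strawman rewrite amounts to something like the following (a sketch over the illustrative schema above, not the optimizer rule itself):

SELECT *
FROM fact_sales f JOIN dim_date d ON f.date_sk = d.date_sk
WHERE d.day_of_week = 'Mon'
  -- duplicated dimension filter, evaluated a second time just for pruning
  AND f.date_sk IN (SELECT date_sk FROM dim_date WHERE day_of_week = 'Mon');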
Broadcast Hash Join
Diagram: the dimension side (FileScan with Dim Filter over a non-partitioned
dataset) feeds a BroadcastExchange into the Broadcast Hash Join with the
fact-table FileScan.
- Execute the build side of the join
- Place the result in a broadcast variable
- Broadcast the build-side results
- Execute the join locally without a shuffle
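A broadcast hash join can also be requested explicitly with a join hint (sketch over the illustrative schema; Spark chooses it automatically when the build side fits under spark.sql.autoBroadcastJoinThreshold):

-- The filtered dimension table becomes the build side, is materialized into
-- a broadcast variable, and the join then runs locally in each task without
-- a shuffle.
SELECT /*+ BROADCAST(d) */ *
FROM fact_sales f JOIN dim_date d ON f.date_sk = d.date_sk
WHERE d.day_of_week = 'Mon';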
Reusing Broadcast Results
Diagram: the same dimension-side build (FileScan with Dim Filter over a
non-partitioned dataset, flowing through the BroadcastExchange into the
Broadcast Hash Join) is reused as a Dynamic Filter on the FileScan of the
partitioned fact table (partition files with multi-columnar data).
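One way to check that the reuse kicked in (the exact plan text is version dependent, so treat the comment below as indicative):

-- With DPP active, the fact-table FileScan in the physical plan carries the
-- reused broadcast result as a runtime partition filter; in Spark 3.0 it
-- typically shows up as a dynamicpruningexpression inside PartitionFilters.
EXPLAIN FORMATTED
SELECT *
FROM fact_sales f JOIN dim_date d ON f.date_sk = d.date_sk
WHERE d.day_of_week = 'Mon';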
Experimental Setup
Workload Selection
- TPC-DS scale factors 1-10 TB
Cluster Configuration
- 10 i3.xlarge machines
Data-Processing Framework
- Apache Spark 3.0
TPCDS 1 TB
60 out of 102 queries show speedups between 2x and 18x
Top Queries
Very good speedups for the top 10% of the queries
Data Skipped
Very effective in skipping data
TPCDS 10 TB
Even better speedups at 10x the scale
Query 98
SELECT i_item_desc, i_category, i_class, i_current_price,
sum(ss_ext_sales_price) as itemrevenue,
sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) over
(partition by i_class) as revenueratio
FROM
store_sales, item, date_dim
WHERE
ss_item_sk = i_item_sk
and i_category in ('Sports', 'Books', 'Home')
and ss_sold_date_sk = d_date_sk
and cast(d_date as date) between cast('1999-02-22' as date)
and (cast('1999-02-22' as date) + interval '30' day)
GROUP BY
i_item_id, i_item_desc, i_category, i_class, i_current_price
ORDER BY
i_category, i_class, i_item_id, i_item_desc, revenueratio
TPCDS 10 TB
Highly selective dimension filter that retains only
one month out of 5 years of data
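For this query, dynamic partition pruning effectively restricts the store_sales scan to the partitions whose ss_sold_date_sk matches the filtered date_dim rows; conceptually the runtime filter is:

SELECT d_date_sk FROM date_dim
WHERE cast(d_date as date) between cast('1999-02-22' as date)
  and (cast('1999-02-22' as date) + interval '30' day);
-- Only store_sales partitions with ss_sold_date_sk in that result set are
-- read, i.e. roughly one month out of the five years of data.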
Conclusion
Apache Spark 3.0 introduces Dynamic Partition Pruning
- Strawman approach at logical planning time
- Optimized approach during execution time
Significant speedups exhibited in many TPC-DS queries
With this optimization, Spark now works well with star-schema queries,
making it unnecessary to ETL denormalized tables.
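For reference, the main Spark 3.0 knobs behind this feature (values shown are the Spark 3.0 defaults; double-check against the version in use):

SET spark.sql.optimizer.dynamicPartitionPruning.enabled = true;
-- Use distinct-count statistics of the join column to estimate the benefit.
SET spark.sql.optimizer.dynamicPartitionPruning.useStats = true;
-- Fallback selectivity ratio used when column statistics are unavailable.
SET spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio = 0.5;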
Thanks!
Bogdan Ghit - linkedin.com/in/bogdanghit
Juliusz Sompolski - linkedin.com/in/juliuszsompolski
