
Databricks Performance Optimization

The document outlines a performance optimization course for Databricks focusing on Spark architecture, code optimization, and cluster fine-tuning. It covers key topics such as data skipping, handling small file problems, and effective partitioning strategies to enhance data processing efficiency. The course includes lectures, demonstrations, and hands-on labs to equip participants with advanced data engineering skills using Databricks tools.


Databricks Performance Optimization
Databricks Academy
May 2025
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg
logo are trademarks of the Apache Software Foundation.
Agenda
1. Spark Architecture
   - Spark UI Introduction (Lecture)
2. Designing the Foundation
   - File Explosion (Demo)
   - Data Skipping and Liquid Clustering (Lecture, Lab)
3. Code Optimization
   - Skew (Lecture)
   - Shuffle (Lecture, Demo)
   - Spill (Lecture)
   - Exploding Join (Lab)
   - Serialization (Lecture)
   - User-Defined Functions (Demo)

Agenda (continued)
4. Fine-Tuning: Choosing the Right Cluster
   - Fine-Tuning: Choosing the Right Cluster (Lecture)
   - Pick the Best Instance Types

Introduction
We begin with Designing the Foundation, which establishes fundamental principles of Spark
programming. Next, Code Optimization covers strategies for improving code efficiency and
performance. We then examine the layers of Spark Architecture and how to optimize clusters for
diverse workloads in Fine-Tuning: Choosing the Right Cluster.

Beyond theory, the sessions offer hands-on experience. Engage in real-time simulation
through Follow Along - Spark Simulator, and dive into critical operational aspects such as
shuffles, spill, and skew, alongside the role of serialization in Spark.

This course aims to equip you with comprehensive expertise in advanced data engineering,
leveraging the tools and techniques offered by Databricks.

Building Performance Analytics
● File Layout
● Cluster Sizing
● Code Optimization
Spark Architecture

Databricks Performance Optimization
Spark Architecture

LECTURE

Spark UI Introduction

Executing a Spark Application
Data processing tasks run in parallel across a cluster of machines

(Diagram: a Spark application is divided into Jobs; each Job into Stages; each Stage into parallel Tasks.)

Spark Architecture

(Diagram: the Driver coordinates Executors running on worker nodes; each Executor has Cores, and each Core runs one Task at a time.)
Scenario: Filter out brown pieces from these candy bags

(Diagram: the Cluster contains a Driver and Executors with Cores; the Data is divided into Partitions, one per candy bag.)
"Student A, get bag #1; Student B, get bag #2; Student C, get bag #3… Remove the brown pieces from the bag and place the rest in the corner."

(Diagram: twelve students, A through L, each processing one bag in parallel.)

"Students A, E, H, J: process bags 13, 14, 15, and 16 on completion of your previous tasks."

(Diagram: students A through L; four of them pick up the remaining bags as they finish.)

All done!

(Diagram: students A through L, all bags processed.)

Scenario 2: Count Total Pieces in Candy Bags
Introducing Stages

Stage 1: Local Count
"We need to count the total pieces in these candy bags."

(Diagram: students A through L with the candy bags.)

Stage 1: Local Count
"Students B, E, I, L: count these four bags."

(Diagram: students A through L; four students each counting one bag.)

Stage 1: Local Count
"Students B, E, I, L: commit your findings."

(Diagram: the four students record local counts of 5, 6, 4, and 5.)

Stage 2: Global Count
"Student G: total the counts from students B, E, I, L."

(Diagram: the local counts 5, 6, 4, and 5 are passed to student G.)

Stage 1: Local Count → Stage 2: Global Count

(Diagram: the local counts 5, 6, 4, and 5 from Stage 1 are summed in Stage 2 to a global total of 20.)

Query Optimization
Adaptive Query Execution is enabled by default as of Spark 3.2.

The Catalyst optimizer pipeline:

1. ANALYSIS — the query's Unresolved Logical Plan is resolved into a Logical Plan using the Metadata Catalog.
2. LOGICAL OPTIMIZATION — Catalyst rewrites the Logical Plan into an Optimized Logical Plan.
3. PHYSICAL PLANNING — candidate Physical Plans are generated.
4. COST-BASED OPTIMIZATION — a Cost Model selects the best Physical Plan.
5. WHOLE-STAGE CODE GENERATION — the Selected Physical Plan is compiled down to RDDs.

ADAPTIVE QUERY EXECUTION feeds Runtime Statistics back into planning while the query runs.

Code Optimization Recommendations
1. Use DataFrames or SQL instead of RDD APIs.
2. In production jobs, avoid unnecessary operations that trigger an action
beyond reading and writing files, such as count(), display(), and collect().
3. Avoid operations that force all computation onto the driver node,
such as single-threaded Python/pandas code. Use the Pandas API on Spark
instead to distribute pandas functions.

Designing the
Foundation

Databricks Performance Optimization

Designing the Foundation

LECTURE

Introduction to Designing the Foundation

Fundamental Concepts
Why some schemas and queries perform faster than others

● Number of bytes read
● Query complexity/computation
● Number of files accessed
● Parallelism

Common Performance Bottlenecks
Encountered with any big data or MPP system

Small File Problem
● Listing and metadata operations for too many small files can be expensive
● Can also result in throttling from cloud storage I/O limits

Data Skew
● Large amounts of data skew can result in more work handled by a single executor
● Even if the data read in is not skewed, certain transformations can lead to in-memory skew

Processing More Than Needed
● Traditional data lake platforms often require rewriting entire datasets or partitions

(Diagram: partition sizes before aggregation vs. after aggregation by city.)

Avoiding the Small File Problem
Automatically handle this common performance challenge in data lakes

● Too many small files greatly increases overhead for reads
● Too few large files reduces parallelism on reads
● Over-partitioning is a common cause
● Databricks automatically tunes the file sizes of Delta Lake tables
● Databricks automatically compacts small files on write with auto-optimize

Designing the Foundation

DEMONSTRATION

File Explosion

Designing the Foundation

LECTURE

Data Skipping and Liquid Clustering

Data Skipping
A simple, well-known I/O pruning technique

● Track file-level stats such as per-column min and max
● Leverage them to avoid scanning irrelevant files

file_name   col_min   col_max
1.parquet   6         8
2.parquet   3         10
3.parquet   1         4

SELECT input_file_name() AS file_name,
       min(col) AS col_min,
       max(col) AS col_max
FROM table
GROUP BY input_file_name()
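The pruning logic can be sketched in a few lines of plain Python (a toy model, not Delta Lake's implementation): a file can be skipped whenever the filter value falls outside its [min, max] range.

```python
# Toy model of data skipping: per-file min/max stats let a filter
# decide which files it can skip entirely, using the table above.
file_stats = [
    {"file_name": "1.parquet", "col_min": 6, "col_max": 8},
    {"file_name": "2.parquet", "col_min": 3, "col_max": 10},
    {"file_name": "3.parquet", "col_min": 1, "col_max": 4},
]

def files_to_scan(stats, value):
    """Keep only files whose [col_min, col_max] range could contain value."""
    return [s["file_name"] for s in stats
            if s["col_min"] <= value <= s["col_max"]]

# A query with WHERE col = 7 never opens 3.parquet (range 1..4).
print(files_to_scan(file_stats, 7))  # ['1.parquet', '2.parquet']
```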

Z-Ordering

Optimize Table Z-Order by Column

Old Layout                            New Layout
file_name   col_min   col_max         file_name   col_min   col_max
1.parquet   6         8               1.parquet   1         3
2.parquet   3         10              2.parquet   4         7
3.parquet   1         4               3.parquet   8         10

Z-Ordering

SELECT * FROM table WHERE col = 7

Old Layout (2 files match)            New Layout (1 file matches)
file_name   col_min   col_max         file_name   col_min   col_max
1.parquet   6         8               1.parquet   1         3
2.parquet   3         10              2.parquet   4         7
3.parquet   1         4               3.parquet   8         10
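The benefit of the rewritten layout is measurable with a plain-Python comparison (illustrative only; file names and ranges are taken from the tables on this slide):

```python
# Toy comparison: the same point query WHERE col = 7 against the
# original layout and the Z-ordered layout.
old_layout = [("1.parquet", 6, 8), ("2.parquet", 3, 10), ("3.parquet", 1, 4)]
new_layout = [("1.parquet", 1, 3), ("2.parquet", 4, 7), ("3.parquet", 8, 10)]

def matching_files(layout, value):
    """Files whose [min, max] range could contain value."""
    return [name for name, lo, hi in layout if lo <= value <= hi]

# Overlapping ranges in the old layout force two file reads;
# the Z-ordered layout's disjoint ranges need only one.
print(matching_files(old_layout, 7))  # ['1.parquet', '2.parquet']
print(matching_files(new_layout, 7))  # ['2.parquet']
```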

Databricks Delta Lake and Stats
● Databricks Delta Lake collects stats on the first N columns
  ○ dataSkippingNumIndexedCols = 32
● These stats are used in queries:
  ○ Metadata-only queries: SELECT max(col) FROM table
    ■ Queries just the Delta log; doesn't need to look at the files if col has stats
  ○ Allows us to skip files
    ■ Partition filters, data filters, and pushed filters apply in that order
  ○ Timestamp and String types aren't always very useful
    ■ Precision/truncation prevents exact matches; sometimes we have to fall back to reading files
● Avoid collecting stats on long strings
  ○ Put them outside the first 32 columns, or collect stats on fewer columns:
    ■ ALTER TABLE table CHANGE COLUMN col AFTER col32
    ■ SET spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3

What about Partitioning?
● Generally not recommended!
  ○ Partitioning is usually misused/overused
  ○ Leads to tiny-file problems or skew
● Good use cases for partitioning
  ○ Isolating data for separate schemas (single -> multiplexing)
  ○ GDPR/CCPA use cases where you commonly delete a partition's worth of data
  ○ Use cases requiring a physical boundary to isolate data, e.g. SCD Type 2: partition on current-or-not for better performance
● If you partition
  ○ Choose a column with low cardinality
  ○ Try to keep each partition less than 1 TB and greater than 1 GB
  ○ Only partition tables expected to grow to terabytes
  ○ Partition (usually) on a date; Z-order on predicates commonly used in WHERE clauses
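The 1 GB to 1 TB guidance is easy to sanity-check with back-of-envelope arithmetic (a sketch; the bounds are the rough figures quoted above, not hard limits):

```python
def partition_size_ok(table_gb, n_partitions, lo_gb=1, hi_gb=1024):
    """Rough check: is the average partition within the ~1 GB..1 TB band?"""
    avg_gb = table_gb / n_partitions
    return lo_gb <= avg_gb <= hi_gb

# A 5 TB table partitioned by day over one year: ~13.7 GB each -- fine.
print(partition_size_ok(5000, 365))    # True
# A 100 GB table with 10,000 partitions: ~10 MB each -- over-partitioned.
print(partition_size_ok(100, 10_000))  # False
```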

Challenges with Disk Partitioning
Partition by customer ID and date + optimize

(Diagram: a grid of customers A through F by dates 2023-02-05 through 2023-02-07, with wildly varying file counts and sizes per cell.)

● Many small files
  ○ High metadata-operation overhead
  ○ Slow read operations
● Data skew
  ○ Inconsistent file sizes across partitions
Introducing Liquid Clustering

What it is
An innovative technique for clustering data layout to support efficient query access and reduce data management and tuning overhead. It is flexible and adaptive to data pattern changes, scaling, and data skew.

Benefits
● Best performance out of the box
  ○ Clustering on write
● Most consistent data skipping
  ○ Immune to data skew
● Minimal write amplification on table maintenance
  ○ True incremental optimize
● Row-level concurrency
  ○ Simplifies the logic of concurrent writers
● Reduced cognitive overhead
  ○ No worrying about cardinality
Liquid Clustering
Liquid cluster by customer ID and date

(Diagram: the same customer-by-date grid, now with consistent, target-size files.)

● Liquid is not subject to rigid boundaries
  ○ Liquid intelligently decides what ranges of data to combine
● Data skew is gone
  ○ Data sizes are consistent
● Liquid stores metadata
  ○ New data can be clustered into existing clusters on write

Table Statistics
Keep table statistics up to date for best results with the cost-based optimizer

● Collects statistics on all columns in the table
● Helps Adaptive Query Execution
  ○ Choose the proper join type
  ○ Select the correct build side in a hash join
  ○ Calibrate the join order in a multi-way join

ANALYZE TABLE mytable COMPUTE STATISTICS FOR ALL COLUMNS

Predictive Optimization
What is Predictive Optimization?

● Predictive optimization refers to using predictive analytics techniques to automatically optimize and enhance the performance of systems, processes, or workflows.

● It involves leveraging data-driven insights to proactively identify and implement optimizations, improving efficiency, cost-effectiveness, and overall system performance.

Predictive Optimization
Key Features

Automatic Maintenance
● Automates the execution of background maintenance tasks on Delta tables.

Supported Maintenance Operations
● Supports maintenance operations including OPTIMIZE, to improve query performance by optimizing file sizes, and VACUUM, to reduce storage costs by deleting unused data.

Set-and-Forget Approach
● Intelligently and automatically runs maintenance jobs without requiring ongoing user supervision.

Serverless Computing
● Utilizes serverless compute, eliminating the need for users to manually manage compute clusters.

Designing the Foundation

LAB EXERCISE

Data Skipping and Liquid Clustering

Code Optimization

Databricks Performance Optimization

Code Optimization
Four commonly seen performance problems associated with Spark:

● Skew
● Shuffles
● Spill
● Serialization

Plus: Adaptive Query Execution in action

Code Optimization

LECTURE

Skew

Skew - Before and After

(Diagram: partition sizes before aggregation vs. after aggregation by city.)

Handling Data Skew
Data skew is unavoidable; Databricks handles this automatically

● In MPP systems, data skew significantly impacts performance because some workers are processing much more data.
● Most cloud data warehouses require a manual, offline redistribution to solve for data skew.
● With Adaptive Query Execution, Spark automatically breaks down larger partitions into smaller, similarly sized partitions.

(Diagram: partitions 1-4 stay at 50 MB each; a 90 MB partition 5 is split into 5-A and 5-B at 45 MB each; a 150 MB partition 6 is split into 6-A, 6-B, and 6-C at 50 MB each.)
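AQE's partition splitting can be sketched in plain Python (a simplification: real AQE works on shuffle statistics and configurable thresholds, not a fixed 50 MB target):

```python
import math

def split_partition(size_mb, target_mb=50):
    """Split an oversized partition into n roughly equal chunks."""
    n = math.ceil(size_mb / target_mb)
    base, rem = divmod(size_mb, n)
    return [base + (1 if i < rem else 0) for i in range(n)]

print(split_partition(50))   # [50]          -- small enough, untouched
print(split_partition(90))   # [45, 45]      -- the 5-A / 5-B split
print(split_partition(150))  # [50, 50, 50]  -- the 6-A / 6-B / 6-C split
```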

Skew - Mitigation
Four "common" solutions

1. Adaptive Query Execution (enabled by default as of Spark 3.2)
2. Filter out skewed values
3. Databricks' [proprietary] skew hint
   • Easier to add a single hint than to salt your keys
   • Great option for Spark 2.x
4. Salt the join keys, forcing even distribution during the shuffle
   • If none of the other options are suitable, salting is the only alternative
   • It involves breaking a large skewed partition into smaller ones by adding random integers as suffixes
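Option 4 can be sketched in plain Python (illustrative only; in Spark you would add a salt column to both sides of the join, e.g. with rand()):

```python
import random
from collections import Counter

random.seed(42)
NUM_SALTS = 4

# A heavily skewed key: "NYC" has 100x the rows of "SF".
rows = [("NYC", i) for i in range(1000)] + [("SF", i) for i in range(10)]

# Append a random suffix 0..NUM_SALTS-1 so the hot key hashes to
# NUM_SALTS different shuffle partitions instead of one.
salted = [(f"{key}_{random.randrange(NUM_SALTS)}", val) for key, val in rows]

sizes = Counter(key for key, _ in salted)
# The 1000 "NYC" rows are now spread over NYC_0..NYC_3 (~250 each).
# Note: the other (smaller) side of the join must be exploded with one
# copy of each row per salt value so the salted keys still match.
```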

Code Optimization

LECTURE

Shuffles

Shuffles
Shuffling is a side effect of wide transformations
• join()
• distinct()
• groupBy()
• orderBy()

And technically some actions, e.g. count()
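A wide transformation needs every record with the same key in the same place, which is why data must move between partitions. A minimal plain-Python model of that redistribution step:

```python
# Toy model of the shuffle behind groupBy(): records are hash-partitioned
# by key, so each output partition holds all records for its keys.
input_partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("c", 1)],
]
NUM_OUTPUT = 2

output_partitions = [[] for _ in range(NUM_OUTPUT)]
for partition in input_partitions:        # "map" side: route each record
    for key, value in partition:
        output_partitions[hash(key) % NUM_OUTPUT].append((key, value))

# After the shuffle, a per-partition aggregation is safe:
# no key ever spans two output partitions.
```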

Shuffles at a Glance

(Diagram sequence, built up over several slides, illustrating how a shuffle redistributes records across the cluster.)
Shuffles - Mitigation

● Reduce network I/O by using fewer, larger workers
● Speed up shuffle reads and writes by using NVMe SSDs
● Reduce the amount of shuffled data
  ○ Remove unnecessary columns
  ○ Filter out unnecessary records preemptively
● Denormalize datasets, especially when the shuffle is rooted in a join

Re-evaluate the join strategy:
● Reordering the join
● Dynamically switching join strategies
● Broadcast hash join
● Shuffle hash join (default for Databricks Photon)
● Sort-merge join (default for open-source Spark)

Code Optimization

DEMONSTRATION

Shuffle

Code Optimization

LECTURE

Spill

Spill
● Spill is the term used to refer to the act of moving data from RAM to disk, and later back into RAM again

● This occurs when a given partition is simply too large to fit into RAM

● In this case, Spark is forced into [potentially] expensive disk reads and writes to free up local RAM

● All of this just to avoid the dreaded OOM error

Spill - Examples
● Setting spark.sql.files.maxPartitionBytes too high (default is 128 MB)
● The explode() of even a small array
● A join() or crossJoin() of two tables which generates lots of new rows
● A join() or crossJoin() of two tables by a skewed key
● A groupBy() where the column has low cardinality
● countDistinct() and size(collect_set())
● Setting spark.sql.shuffle.partitions too low, or wrong use of repartition()
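The first item can be made concrete with rough arithmetic (a simplification: the real partition count also depends on spark.sql.files.openCostInBytes and file boundaries):

```python
import math

def est_input_partitions(total_bytes, max_partition_bytes=128 * 1024 * 1024):
    """Roughly how many input partitions a scan of total_bytes produces."""
    return max(1, math.ceil(total_bytes / max_partition_bytes))

ten_gib = 10 * 1024**3
print(est_input_partitions(ten_gib))           # 80 partitions of ~128 MB
# Raising maxPartitionBytes to 1 GiB leaves 10 much larger partitions --
# each one must now fit in a task's memory, or it spills.
print(est_input_partitions(ten_gib, 1024**3))  # 10
```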

Spill - Memory & Disk
In the Spark UI, spill is represented by two values:

● Spill (Memory): for the partition that was spilled, this is the size of that data as it existed in memory

● Spill (Disk): likewise, for the partition that was spilled, this is the size of the data as it existed on disk

The two values are always presented together

The size on disk will always be smaller due to the natural compression gained in the act of serializing that data before writing it to disk
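That the disk copy comes out smaller can be demonstrated with stdlib serialization and compression (an analogy only; Spark uses its own serializers and codecs such as LZ4):

```python
import pickle
import zlib

# Repetitive row data, as shuffle/spill payloads typically are.
rows = [{"id": i, "city": "Amsterdam", "country": "NL"} for i in range(10_000)]

spill_memory = pickle.dumps(rows)          # serialized size, from RAM
spill_disk = zlib.compress(spill_memory)   # compressed size, on disk

print(len(spill_disk) < len(spill_memory))  # True: the disk copy is smaller
```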

Spill - Mitigations
● Allocate a cluster with more RAM per core

● Address data skew

● Manage the size of Spark partitions

● Avoid expensive operations like explode()

● Reduce the amount of data preemptively whenever possible

Code Optimization

LAB EXERCISE

Exploding Join

Code Optimization

LECTURE

Serialization

Performance Problems with Serialization
● Spark SQL and DataFrame instructions are highly optimized
● All UDFs must be serialized and distributed to each executor
● The parameters and return value of each UDF must be converted for each row of data
● Python UDFs take an even harder hit:
  ○ The Python code has to be pickled
  ○ Spark must instantiate a Python interpreter in each and every executor
  ○ The conversion of each row between Python and the DataFrame representation costs even more

Mitigating Serialization Issues
● Don't use UDFs
  ○ I challenge you to find a set of transformations that cannot be done with the built-in, continuously optimized, community-supported, higher-order functions

● If you have to use UDFs in Python (common for data scientists), use vectorized (pandas) UDFs or Apache Arrow-optimized Python UDFs as opposed to stock Python UDFs

● If you have to use UDFs in Scala, use typed transformations as opposed to stock Scala UDFs

● Resist the temptation to use UDFs to integrate Spark code with existing business logic: porting that logic to Spark almost always pays off
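The gap between a stock Python UDF and a vectorized one comes down to invocation granularity, which a plain-Python analogy makes visible (no PySpark involved; in Spark the per-row path also pays serialization costs on top):

```python
calls = {"row_udf": 0, "vectorized_udf": 0}

def row_udf(x):             # like a stock Python UDF: one call per row
    calls["row_udf"] += 1
    return x * 2

def vectorized_udf(batch):  # like a pandas UDF: one call per batch
    calls["vectorized_udf"] += 1
    return [x * 2 for x in batch]

data = list(range(1_000))
out_rows = [row_udf(x) for x in data]  # 1,000 interpreter round-trips
out_batch = vectorized_udf(data)       # 1 round-trip for the whole batch

print(out_rows == out_batch)           # True: same result, far fewer calls
```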

Code Optimization

DEMONSTRATION

User-Defined Functions

Fine-Tuning: Choosing the Right Cluster

Databricks Performance Optimization

Fine-Tuning: Choosing the Right Cluster

LECTURE

Fine-Tuning: Choosing the Right Cluster

Cluster Types

ALL-PURPOSE COMPUTE
● Designed to handle interactive workloads, including streaming workloads
● Enable autoscale to add capacity when needed and reduce time to answer
● Security must be considered, as autoscaling can introduce additional risks

JOBS COMPUTE
● Run on ephemeral clusters that are created for the job and terminate on completion
● Pre-scheduled or submitted via API
● Single-user; great for isolation and debugging
● Production and repeat workloads; lower cost

SQL WAREHOUSE
● Built for high-concurrency, ad-hoc SQL analytics and BI serving
● Photon included
● Recommended: a shared warehouse for ad-hoc SQL analytics, isolated warehouses for specific workloads
● Serverless available for instant startup and lower TCO

Autoscaling
● Dynamically resizes cluster based on workload
○ Can run faster than a statically-sized, under-provisioned cluster
○ Can reduce overall costs compared to a statically-sized cluster

● Setting range for the number of workers requires some experimenting

Use Case: Auto Scaling Range
● Ad-hoc usage or business analytics: large variance
● Production batch jobs: not needed, or a buffer on the upper limit
● Streaming: available in Delta Live Tables
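As a rough sketch of how the autoscaling range is set per cluster (field names follow the Databricks Clusters API; the runtime version, node type, and worker counts are illustrative assumptions, not recommendations):

```python
# Illustrative cluster spec with autoscaling enabled.
# "autoscale" replaces a fixed "num_workers" in the Clusters API.
cluster_spec = {
    "spark_version": "15.4.x-scala2.12",  # illustrative LTS runtime
    "node_type_id": "m7gd.xlarge",        # illustrative instance type
    "autoscale": {
        "min_workers": 2,  # floor for steady-state load
        "max_workers": 8,  # ceiling for peak demand
    },
}

def autoscale_range(spec):
    """Return the (min, max) worker range of a cluster spec."""
    a = spec["autoscale"]
    return a["min_workers"], a["max_workers"]

print(autoscale_range(cluster_spec))  # -> (2, 8)
```

A large variance (e.g. 2 to 8) suits ad-hoc use; a production batch job would instead pin `num_workers` or keep a small buffer on the upper limit.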
Spot Instances
● Use spot instances to use spare VM capacity at below-market rates
○ Great for ad-hoc/shared clusters
○ Not recommended for jobs with mission-critical SLAs
○ Never use spot for the driver!
● Combine on-demand and spot instances (with a custom spot price) to tailor clusters to different use cases

SLA: Spot or On-Demand
● Non-mission-critical jobs: driver on-demand, workers on spot
● Workflows with tight SLAs: spot instances with fallback to on-demand
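The "driver on-demand, workers on spot with fallback" pattern can be sketched with the AWS availability settings of a cluster spec (field names follow the Databricks Clusters API; the values are illustrative assumptions):

```python
# Illustrative AWS availability settings for a Databricks cluster.
aws_attributes = {
    "first_on_demand": 1,                  # the first node (the driver) is always on-demand
    "availability": "SPOT_WITH_FALLBACK",  # workers start on spot, fall back to on-demand
    "spot_bid_price_percent": 100,         # bid up to 100% of the on-demand price
}

# Sanity check: the driver should never run on spot.
assert aws_attributes["first_on_demand"] >= 1
print(aws_attributes["availability"])  # -> SPOT_WITH_FALLBACK
```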

Photon
World-record-setting query engine with zero tuning or setup

Save on compute costs
● ETL customers are saving up to 40% on their compute cost

Fast query performance
● Built for modern hardware with up to 12x better price/perf compared to other cloud data warehouses

No code changes
● Spark APIs that can do exploration, ETL, big data, small data, low latency, high concurrency, batch, and streaming

Broad language support
● Support for SQL, Python, Scala, R, and Java

Cluster Optimization Recommendations
1. DS & DE development: all-purpose compute, auto-scale and auto-stop enabled, develop
& test on a subset of the data
2. Ingestion & ETL jobs: jobs compute, size accordingly to job SLA
3. Ad-hoc SQL analytics: (serverless) SQL warehouse, auto-scale and auto-stop enabled
4. BI Reporting: isolated SQL warehouse, sized according to BI SLAs
5. Best practices:
a. Enable spot instances on worker nodes
b. Use the latest LTS Databricks Runtime when possible
c. Use Photon for best TCO when applicable
d. Use latest-generation VMs; start with general purpose, then test memory- or compute-optimized

Fine-Tuning: Choosing the Right Cluster

LECTURE

Pick the Best
Instance Types

Have an Open Mind When Picking
Machines
● For AWS, i3s aren't always the best. Explore m7gd and r7gd; enable caching if needed.
○ Graviton instances work well; try those first
○ m7gd and r7gd have better processors, similar (albeit smaller) local disk, and much more stable spot markets than the i-series
● For Azure, try the Eav4, Dav4, and F-series over the L-series
○ The ACU is very useful
● GCP defaults are pretty good
● You usually don't need network-optimized instance types; on some occasions they help with Photon

How to Choose the Right Machine Is Pretty
Simple
● Just a series of IFTTT questions and rules of thumb!
○ Side note - if you 2x the cluster and it runs in 1/2 the time, it costs the same

● Rules of thumb
○ First run: set spark.sql.shuffle.partitions to 2x the number of cores
○ Keep total memory available to the machine under 128 GB
○ Aim for a ratio of 1 core per 128 MB to 2 GB of reads (some caveats may apply)
○ Avoid setting any other configs at first (don't carry over configs from legacy platforms unless absolutely necessary)
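The first two rules of thumb above can be sketched as simple arithmetic (the cluster shape here is an illustrative assumption; in a notebook you would read the real core count from your cluster):

```python
# Illustrative cluster: 4 workers x 4 cores each.
total_cores = 16

# First run: shuffle partitions = 2x the total core count.
shuffle_partitions = 2 * total_cores  # 32

# 1 core per 128 MB to 2 GB of reads -> rough input size this cluster suits.
low_gb = total_cores * 128 / 1024  # 2.0 GB total at the low end
high_gb = total_cores * 2          # 32 GB total at the high end

print(shuffle_partitions, low_gb, high_gb)  # -> 32 2.0 32
# In a notebook you would then apply it with:
# spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
```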

What You Care about with the Instance
Type
● Core to Ram ratio
● Processor type
● Local vs remote storage
● Storage medium

Cloud   Family       Core:RAM      Processor                Storage
AWS     c5           1 core:2 GB   Intel Cascade 3.6 GHz    Local NVMe (d variants)
Azure   F-series     1 core:2 GB   Intel Xeon 2.4 GHz       Local SSD
GCP     n2-highcpu   1 core:1 GB   Intel Cascade 3.4 GHz    Local SSD
Sizing a Driver
● Leave it the same size as your worker unless you care about being the absolute cheapest; don't make things more complicated than they need to be
● Drivers typically do very little work in a Spark application; a 4-8 core, 16-32 GB RAM driver should be fine for most workloads
● Large commits to Delta tables use more memory
● This suggestion is voided when:
○ Running many streams/concurrent jobs on the same machine
○ Committing a very large (100k+ files) amount of data to a Delta table
○ Collecting large amounts of data to the driver to use in Pandas/R

Spot Market Considerations
● The spot market is a great way to save money on infrastructure.
● Each instance type has a different level of availability and price savings in each region.
● Example: i3s aren't great; r5ds look a lot better.

Reference: https://aws.amazon.com/ec2/spot/instance-advisor/
IFTTT - Step 1
Want to use Photon?

No: go to the next slide.

Yes:
Cloud   Family
AWS     m6gd/r6gd/i4i, m7gd/r7gd
Azure   Edsv4
GCP     n2-highmem, n2-standard
IFTTT - Step 2
Is your job an ETL job that uses joins/windows/groupbys/aggregations?

No:
Cloud   Family
AWS     c7g/c6g
Azure   fsv2
GCP     e2-highcpu

Yes:
Cloud   Family
AWS     c7gd/c6gd
Azure   fsv2
GCP     n2-highcpu
IFTTT - Step 3
Run the job with the instance type, follow our rules of thumb, and go to the SQL UI of the longest-running query. Do you see spill?

No: stop; this is good enough.

Yes: set spark.sql.shuffle.partitions to the largest shuffle read stage / 200 MB, or set spark.sql.shuffle.partitions=auto.

FYI: spill is much less impactful when using Photon.
IFTTT - Step 4
Run the job with the updated shuffle partitions. Do you still see spill?

No: stop; this is good enough.

Yes:
Cloud   Family
AWS     m7gd
Azure   dav4/dasv4
GCP     n2-standard
IFTTT - Step 5
Run the job with the updated instance type. Do you still see spill?

No: stop; this is good enough.

Yes:
Cloud   Family
AWS     r7gd/r6gd
Azure   Edsv4
GCP     n2-highmem
Reminder on Shuffle Partitions
spark.sql.shuffle.partitions = auto
OR
Go to the stage UI, find the largest shuffle read size, and divide it by 200 MB.
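The manual calculation amounts to one division (the 100 GB shuffle read below is an illustrative value; read the real number from the stage UI):

```python
import math

def shuffle_partitions_for(shuffle_read_bytes, target_bytes=200 * 1024**2):
    """Largest shuffle read size divided by a ~200 MB target partition size."""
    return max(1, math.ceil(shuffle_read_bytes / target_bytes))

# e.g. a 100 GB shuffle read
print(shuffle_partitions_for(100 * 1024**3))  # -> 512
```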

Don’t Forget to Double Check the Event
Log!
Spot failures happen, and they slow things down. Don't forget to double-check the event log; it's probably the first thing you should do.

Questions?

