Databricks Performance Optimization
Databricks Academy
May 2025
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg
logo are trademarks of the Apache Software Foundation.
Agenda
1. Spark Architecture
   ● Spark UI Introduction (Lecture)
2. Designing the Foundation
   ● File Explosion (Demo)
   ● Data Skipping and Liquid Clustering (Lecture, Lab)
3. Code Optimization
   ● Skew (Lecture)
   ● Shuffle (Lecture, Demo)
   ● Spill (Lecture)
   ● Exploding Join (Lab)
   ● Serialization (Lecture)
   ● User-Defined Functions (Demo)
Agenda
4. Fine-Tuning: Choosing the Right Cluster
   ● Fine-Tuning: Choosing the Right Cluster (Lecture)
   ● Pick the Best Instance Types (Lecture)
Introduction
We begin with Designing the Foundation, which establishes fundamental principles of Spark
programming. From there we move to Code Optimization, covering strategies to improve code
efficiency and performance. We also examine the layers of Spark Architecture and how to
optimize clusters for diverse workloads in Fine-Tuning: Choosing the Right Cluster.
Beyond theory, the sessions offer hands-on experience: follow along with the Spark Simulator
and dig into critical operational concerns such as shuffles, spill, and skew, alongside the
role of serialization in Spark.
This course aims to equip you with comprehensive expertise in advanced data engineering,
leveraging the tools and techniques offered by Databricks.
Building Performance Analytics
● File Layout
● Cluster Sizing
● Code Optimization
Spark Architecture
Spark Architecture
LECTURE
Spark UI Introduction
Executing a Spark Application
Data processing tasks run in parallel across a cluster of machines
Diagram: a Spark application consists of one or more jobs; each job is divided into stages, and each stage into tasks.
Spark Architecture
Driver
Worker nodes
Scenario: Filter out brown pieces from these candy bags
Diagram: a Cluster with a Driver and Executors; the Data is split into Partitions, each processed by a Core.
Driver: "Student A, get bag #1; Student B, get bag #2; Student C, get bag #3…"
Task: "Remove the brown pieces from the bag, place the rest in the corner."
A B C D E F
G H I J K L
Students A, E, H, J: process bags 13, 14, 15, 16 on completion of your previous tasks
A B C D E F
G H I J K L
All done!
A B C D E F
G H I J K L
Scenario 2: Count Total Pieces in Candy Bags (Introducing Stages)
Stage 1: Local Count
We need to count the total pieces in these candy bags
A B C D E F
G H I J K L
Stage 1: Local Count
Students B, E, I, L: count these four bags
A B C D E F
G H I J K L
Stage 1: Local Count
Students B, E, I, L: report your findings
Counts: B = 5, E = 6, I = 4, L = 5
A B C D E F
G H I J K L
Stage 2: Global Count
Student G: total the counts from students B, E, I, L (5, 6, 4, 5)
A B C D E F
G H I J K L
Stage 1: Local Count → Stage 2: Global Count
The local counts 5, 6, 4, 5 are combined into the global total of 20
A B C D E F
G H I J K L
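The two-stage pattern above can be sketched in plain Python (a toy model of local then global aggregation, not Spark code):

```python
# The candy-bag scenario as code: Stage 1 computes a local count per
# partition (each "bag" a student holds); Stage 2 combines the partial
# results into a global total - that exchange is the stage boundary.
bags = [[1] * 5, [1] * 6, [1] * 4, [1] * 5]  # four bags of candy pieces

# Stage 1: each worker counts its own partition locally.
local_counts = [len(bag) for bag in bags]     # [5, 6, 4, 5]

# Stage 2: one worker totals the partial counts (requires a shuffle).
global_count = sum(local_counts)
print(local_counts, global_count)             # [5, 6, 4, 5] 20
```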
Query Optimization
Adaptive Query Execution is enabled by default as of Spark 3.2

The Catalyst optimizer pipeline:
Query → Unresolved Logical Plan → ANALYSIS (using the Metadata Catalog) → Logical Plan →
Optimized Logical Plan → PHYSICAL PLANNING → Physical Plans → Cost Model →
Selected Physical Plan → WHOLE-STAGE CODE GENERATION → RDDs
ADAPTIVE QUERY EXECUTION feeds Runtime Statistics back into the optimizer at runtime.
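The runtime re-optimization described above is controlled by a few settings (defaults shown for recent Spark versions; verify against your runtime's documentation):

```sql
-- Adaptive Query Execution: re-optimizes plans using runtime statistics
SET spark.sql.adaptive.enabled = true;                    -- default since Spark 3.2
SET spark.sql.adaptive.coalescePartitions.enabled = true; -- merge small shuffle partitions
SET spark.sql.adaptive.skewJoin.enabled = true;           -- split skewed join partitions
```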
Code Optimization Recommendations
1. Use DataFrames or SQL instead of the RDD APIs.
2. In production jobs, avoid unnecessary operations that trigger an action
beyond reading and writing files. These include count(), display(), and collect().
3. Avoid operations that force all computation onto the driver node, such as
single-threaded Python/pandas. Use the Pandas API on Spark instead to
distribute pandas workloads.
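To illustrate point 3: the same aggregation written in single-threaded pandas needs only an import swap to run distributed with the Pandas API on Spark (the distributed variant is shown in comments since it needs a Spark runtime; data here is illustrative):

```python
import pandas as pd

# Single-threaded pandas: all work happens on the driver node.
pdf = pd.DataFrame({"id": [0, 1, 2, 3, 4, 5],
                    "amount": [10, 20, 30, 40, 50, 60]})
totals = pdf.groupby(pdf["id"] % 2)["amount"].sum()
print(totals.to_dict())  # {0: 90, 1: 120}

# Distributed equivalent with the Pandas API on Spark:
#   import pyspark.pandas as ps
#   psdf = ps.DataFrame({...})  # same groupby/sum code, executed by Spark
```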
Designing the
Foundation
Designing the Foundation
LECTURE
Introduction to Designing the Foundation
Fundamental Concepts
Why some schemas and queries perform faster than others
Common Performance Bottlenecks
Encountered with any big data or MPP system

Small File Problem
● Listing and metadata operations for too many small files can be expensive
● Can also result in throttling from cloud storage I/O limits

Data Skew
● Large amounts of data skew can result in more work handled by a single executor
● Even if the data read in is not skewed, certain transformations can lead to in-memory skew

Processing More Than Needed
● Traditional data lake platforms often require rewriting entire datasets or partitions
Avoiding the Small File Problem
Automatically handle this common performance challenge in Data Lakes
Designing the Foundation
DEMONSTRATION
File Explosion
Designing the Foundation
LECTURE
Data Skipping and Liquid Clustering
Data Skipping
Simple, well-known I/O pruning technique
● Track file-level stats like min & max
● Leverage them to avoid scanning irrelevant files

file_name | col_min | col_max
1.parquet |       6 |       8
2.parquet |       3 |      10
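The skipping decision itself is simple range logic; a toy version of it (not Delta Lake internals) using the stats from the slide:

```python
# File-level min/max stats, as in the slide's example table.
file_stats = [
    {"file_name": "1.parquet", "col_min": 6, "col_max": 8},
    {"file_name": "2.parquet", "col_min": 3, "col_max": 10},
]

def files_to_scan(stats, value):
    """Keep only files whose [min, max] range could contain `value`."""
    return [f["file_name"] for f in stats
            if f["col_min"] <= value <= f["col_max"]]

# A query for col = 4 can skip 1.parquet entirely.
print(files_to_scan(file_stats, 4))   # ['2.parquet']
```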
Z-Ordering
Before Z-ordering (overlapping ranges):
1.parquet [6, 8], 2.parquet [3, 10], 3.parquet [1, 4]
After Z-ordering (disjoint ranges):
1.parquet [1, 3], 2.parquet [4, 7], 3.parquet [8, 10]
Databricks Delta Lake and Stats
● Databricks Delta Lake collects stats about the first N columns
○ dataSkippingNumIndexedCols = 32
● These stats are used in queries:
○ Metadata only queries: select max(col) from table
■ Queries just the Delta Log, doesn’t need to look at the files if col has stats
○ Allows us to skip files
■ Partition Filters, Data Filters, and Pushed Filters apply in that order
○ Timestamp and String types aren't always very useful
■ Precision/truncation prevent exact matches; sometimes we have to fall back to reading the files
● Avoid collecting stats on long strings
○ Put them outside first 32 columns or collect stats on fewer columns
■ ALTER TABLE <table> ALTER COLUMN <col> AFTER col32
■ SET spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3
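Spelled out as SQL (table and column names are hypothetical; syntax per Delta Lake's ALTER TABLE):

```sql
-- Move a long string column beyond the indexed leading columns:
ALTER TABLE my_table ALTER COLUMN long_text AFTER col32;

-- Or collect stats on fewer leading columns for newly created tables:
SET spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3;
```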
What about Partitioning?
● Generally not recommended!
○ Partitioning is usually misused/overused
○ It causes tiny file problems or skew
● Good use cases for partitioning
○ Isolating data for separate schemas (single -> multiplexing)
○ GDPR/CCPA use cases where you commonly delete a partition's worth of data
○ Use cases requiring a physical boundary to isolate data, e.g. SCD Type 2: partition on
whether the row is current for better performance
● If you do partition
○ Choose a column with low cardinality
○ Try to keep each partition less than 1 TB and greater than 1 GB
○ Only partition tables expected to grow to terabytes
○ Partition (usually) on a date; Z-order on predicates commonly used in WHERE clauses
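Following those rules of thumb, a partitioned table typically looks like this (table and column names hypothetical):

```sql
-- Low-cardinality date partition; Z-order on common WHERE predicates.
CREATE TABLE events (
  event_date DATE,
  user_id    BIGINT,
  payload    STRING
) USING DELTA
PARTITIONED BY (event_date);

OPTIMIZE events ZORDER BY (user_id);
```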
Challenges with Disk Partitioning
Partition by customer ID and date + optimize
● Data skew
○ Inconsistent file sizes across partitions
(Diagram: partitions for Customer E and Customer F with uneven file sizes)
Introducing Liquid Clustering
What it is / Benefits
● Liquid stores clustering metadata
○ New data can be clustered into existing clusters on write
(Diagram: Customer E and Customer F data co-located by cluster)
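Liquid clustering is declared at table creation (or via ALTER TABLE) instead of a partition spec; names here are hypothetical:

```sql
CREATE TABLE events (
  event_date  DATE,
  customer_id BIGINT
) USING DELTA
CLUSTER BY (customer_id);

-- OPTIMIZE incrementally clusters newly written data; no full rewrite.
OPTIMIZE events;
```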
Table Statistics
Keeping table statistics up to date for best results with the cost-based optimizer
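Statistics for the cost-based optimizer are refreshed with ANALYZE TABLE (table name hypothetical):

```sql
ANALYZE TABLE my_table COMPUTE STATISTICS FOR ALL COLUMNS;
```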
Predictive Optimization
What is Predictive Optimization?
Predictive Optimization
Key Features
Designing the Foundation
LAB EXERCISE
Data Skipping and Liquid Clustering
Code Optimization
Code Optimization
Four commonly seen performance problems associated with Spark:
● Skew
● Shuffles
● Spill
● Serialization
Plus: Adaptive Query Execution in action
Code Optimization
LECTURE
Skew
Skew - Before and After
Handling Data Skew
Data skew is unavoidable; Databricks handles this automatically
● In MPP systems, data skew significantly degrades performance because some
workers process much more data than others
(Diagram: partition sizes before and after rebalancing, e.g. Partition 1 (50 MB) … Partition 4 (50 MB))
Skew - Mitigation
Three “common” solutions
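The slide's three solutions aren't reproduced here, but one widely used mitigation is salting: append a random suffix to a hot key so its rows spread across several shuffle partitions (the other side of a join is then replicated once per salt value). A toy model of the effect:

```python
import random
from collections import Counter

random.seed(0)  # deterministic for the example

# A skewed distribution: one hot key dominates.
keys = ["hot"] * 9000 + ["cold"] * 1000
NUM_SALTS = 8

# Salting: "hot" becomes "hot#0" .. "hot#7", spreading its rows out.
salted = [f"{k}#{random.randrange(NUM_SALTS)}" for k in keys]

print(max(Counter(keys).values()))    # 9000 rows behind a single key
print(max(Counter(salted).values()))  # roughly 9000 / 8 per salted key
```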
Code Optimization
LECTURE
Shuffles
Shuffles
Shuffling is a side effect of wide transformations
• join()
• distinct()
• groupBy()
• orderBy()
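What makes these transformations wide: every row with a given key must end up in the same output partition, so rows have to move between executors. A simplified stand-in for the hash partitioner (a toy, not Spark's actual hash function):

```python
NUM_PARTITIONS = 4

def target_partition(key: str) -> int:
    # Toy deterministic hash; Spark uses the key's real hash code.
    return sum(key.encode()) % NUM_PARTITIONS

rows = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4)]

# groupBy("key"): all "apple" rows land in the same partition, no matter
# which executor originally held them - that movement is the shuffle.
placement = {}
for key, value in rows:
    placement.setdefault(target_partition(key), []).append((key, value))
print(placement)
```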
Shuffles at a Glance
Shuffles - Mitigation
Code Optimization
DEMONSTRATION
Shuffle
Code Optimization
LECTURE
Spill
Spill
● Spill is the term used for moving data from RAM to disk, and later reading it back into RAM
● This occurs when a given partition is simply too large to fit into RAM
Spill - Examples
● Setting spark.sql.files.maxPartitionBytes too high (default is 128 MB)
● The explode() of even a small array
● The join() or crossJoin() of two tables which generates lots of new rows
● The join() or crossJoin() of two tables by a skewed key
● The groupBy() where the column has low cardinality
● The countDistinct() and size(collect_set())
● Setting spark.sql.shuffle.partitions too low, or wrong use of repartition()
Spill - Memory & Disk
In the Spark UI, spill is represented by two values: Spill (Memory) and Spill (Disk).
The size on disk will always be smaller due to the natural compression
gained in the act of serializing the data before writing it to disk.
Spill - Mitigations
● Allocate a cluster with more RAM per core
Code Optimization
LAB EXERCISE
Exploding Join
Code Optimization
LECTURE
Serialization
Performance Problems with Serialization
● Spark SQL and DataFrame instructions are highly optimized
● All UDFs must be serialized and distributed to each executor
● The parameters and return value of each UDF must be converted for
each row of data before distributing to executors
● Python UDFs take an even harder hit
○ The Python code has to be pickled
○ Spark must instantiate a Python interpreter in each and every Executor
○ The conversion of each row from Python to DataFrame costs even more
Mitigating Serialization Issues
● Don’t use UDFs
○ I challenge you to find a set of transformations that cannot be done with the built-in,
continuously optimized, community supported, higher-order functions
● If you have to use UDFs in Python (common for Data Scientists), use Vectorized (pandas)
UDFs or Apache Arrow-optimized Python UDFs instead of stock Python UDFs
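The difference in a sketch (pure pandas here, no Spark needed): a stock Python UDF crosses the JVM/Python boundary once per row, while a vectorized (pandas) UDF receives whole Arrow-backed batches:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

# Row-at-a-time, like a stock Python UDF: one Python call per value.
per_row = values.apply(lambda v: v * 2 + 1)

# Batch-at-a-time, like a pandas (vectorized) UDF: one call per Series.
vectorized = values * 2 + 1

print(per_row.equals(vectorized))  # same result, far fewer crossings
```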
Code Optimization
DEMONSTRATION
User-Defined Functions
Fine-Tuning:
Choosing the Right
Cluster
Fine-Tuning: Choosing the Right Cluster
LECTURE
Fine-Tuning:
Choosing the Right
Cluster
Cluster Types
Autoscaling
● Dynamically resizes cluster based on workload
○ Can run faster than a statically-sized, under-provisioned cluster
○ Can reduce overall costs compared to a statically-sized cluster
Photon
World-record-setting query engine with zero tuning or setup
Cluster Optimization Recommendations
1. DS & DE development: all-purpose compute, auto-scale and auto-stop enabled, develop
& test on a subset of the data
2. Ingestion & ETL jobs: jobs compute, sized according to the job SLA
3. Ad-hoc SQL analytics: (serverless) SQL warehouse, auto-scale and auto-stop enabled
4. BI Reporting: isolated SQL warehouse, sized according to BI SLAs
5. Best practices:
a. Enable spot instances on worker nodes
b. Use the latest LTS Databricks Runtime when possible
c. Use Photon for best TCO when applicable
d. Use latest gen VM, start with general purpose, then test memory/compute optimized
Fine-Tuning: Choosing the Right Cluster
LECTURE
Pick the Best Instance Types
Have an Open Mind When Picking Machines
● For AWS, i3s aren't always the best. Explore m7gd and r7gd; enable caching if needed
○ Graviton instances work well, try those first
○ m7gd and r7gd have better processors, similar (albeit smaller) local disk, and much
more stable spot markets than the i-series
● For Azure, try the eav4, dav4, and f-series over the L-series
○ The ACU is very useful
● GCP defaults are pretty good
● You usually don't need network-optimized instance types, though on some occasions
they help with Photon
How to Choose the Right Machine Is Pretty Simple
● Just a series of IFTTT questions and rules of thumb!
○ Side note: if you 2x the cluster and it runs in 1/2 the time, it costs the same
● Rules of thumb
○ First run: set spark.sql.shuffle.partitions = 2x the number of cores
○ Keep total memory available to each machine less than 128 GB
○ Number of cores should be a ratio of 1 core per 128 MB to 2 GB of reads (some
caveats may apply)
○ Avoid setting any other configs at first (don't carry over configs from legacy
platforms unless absolutely necessary)
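The two shuffle-partition rules of thumb are just arithmetic:

```python
TARGET_PARTITION_BYTES = 200 * 1024**2  # ~200 MB per shuffle partition

def first_run_shuffle_partitions(total_cores: int) -> int:
    """First run: 2x the total number of cores in the cluster."""
    return 2 * total_cores

def tuned_shuffle_partitions(largest_shuffle_read_bytes: int) -> int:
    """After checking the Spark UI: largest shuffle read / 200 MB."""
    return max(1, largest_shuffle_read_bytes // TARGET_PARTITION_BYTES)

print(first_run_shuffle_partitions(64))          # 128
print(tuned_shuffle_partitions(100 * 1024**3))   # 512 for a 100 GB shuffle
```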
What You Care About with the Instance Type
● Core-to-RAM ratio
● Processor type
● Local vs. remote storage
● Storage medium
Spot Market Considerations
● The spot market is a great way to save money on infrastructure
● Each instance type has a different level of availability and price savings in each region
● Example: i3s aren't great; r5ds look a lot better
Reference: https://aws.amazon.com/ec2/spot/instance-advisor/
IFTTT - Step 1
Want to use Photon?
No: Next slide
Yes: pick by cloud:
Cloud | Family
AWS   | m6gd/r6gd/i4i, m7gd/r7gd
Azure | Edsv4
GCP   | n2-highmem, n2-standard
IFTTT - Step 2
Is your job an ETL job that uses joins/windows/groupBys/aggregations?
No: Yes:
IFTTT - Step 3
Run the job with the instance type, follow our rules of thumb, and go to the
SQL UI of the longest-running query. Do you see spill?
No: Stop, this is good enough
Yes: Set spark.sql.shuffle.partitions to the largest shuffle read stage / 200 MB,
or set spark.sql.shuffle.partitions=auto
FYI: Spill is much less impactful when using Photon
IFTTT - Step 4
Run the job with the updated shuffle partitions. Do you still see spill?
No: Stop, this is good enough
Yes: switch instance family:
Cloud | Family
AWS   | m7gd
Azure | dav4/dasv4
GCP   | n2-standard
IFTTT - Step 5
Run the job with the updated instance type. Do you still see spill?
No: Stop, this is good enough
Yes: move to memory-optimized instances:
Cloud | Family
AWS   | r7gd/r6gd
Azure | Edsv4
GCP   | n2-highmem
Reminder on Shuffle Partitions
spark.sql.shuffle.partitions = auto
OR
Go to the stage UI, find the largest shuffle read size, and divide it by 200 MB
Don't Forget to Double Check the Event Log!
Spot failures happen. They slow things down. We know. Don’t forget to
double check the event log. It’s probably the first thing you should do.
Questions?