
Databricks Performance Optimization

The document outlines a performance optimization course for Databricks focusing on Spark architecture, code optimization, and cluster fine-tuning. It covers key topics such as data skipping, handling small file problems, and effective partitioning strategies to enhance data processing efficiency. The course includes lectures, demonstrations, and hands-on labs to equip participants with advanced data engineering skills using Databricks tools.


Databricks Performance Optimization
Databricks Academy
May 2025
© Databricks 2025. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg
logo are trademarks of the Apache Software Foundation.
Agenda
1. Spark Architecture
   - Spark UI Introduction (Lecture)
2. Designing the Foundation
   - File Explosion (Demo)
   - Data Skipping and Liquid Clustering (Lecture, Lab)
3. Code Optimization
   - Skew (Lecture)
   - Shuffle (Lecture, Demo)
   - Spill (Lecture)
   - Exploding Join (Lab)
   - Serialization (Lecture)
   - User-Defined Functions (Demo)

Agenda (continued)
4. Fine-Tuning: Choosing the Right Cluster
   - Fine-Tuning: Choosing the Right Cluster (Lecture)
   - Pick the Best Instance Types

Introduction
We begin with Designing the Foundation, which establishes fundamental principles of Spark
programming. Next, Code Optimization covers strategies for improving code efficiency and
performance. We then examine the layers of Spark Architecture and how to optimize clusters for
diverse workloads in Fine-Tuning: Choosing the Right Cluster.

Beyond theory, the sessions offer hands-on experience. Engage in real-time simulation
through Follow Along - Spark Simulator, and dive into critical operational aspects such as
shuffles, spill, and skew, alongside the role of serialization in Spark.

This course aims to equip you with comprehensive expertise in advanced data engineering,
leveraging the tools and techniques offered by Databricks.

Building Performance Analytics
● File Layout
● Cluster Sizing
● Code Optimization
Spark Architecture

Databricks Performance Optimization
Spark Architecture

LECTURE

Spark UI Introduction

Executing a Spark Application
Data processing tasks run in parallel across a cluster of machines

(Diagram: a Spark application is divided into Jobs; each Job into Stages; each Stage into parallel Tasks.)

Spark Architecture

(Diagram: the Driver coordinates Executors running on worker nodes; each Executor has Cores, and each Core runs one Task at a time.)
Scenario: Filter out brown pieces from these candy bags

(Diagram: the Cluster contains a Driver and Executors with Cores; the Data is divided into Partitions, one per candy bag.)
"Student A, get bag #1; Student B, get bag #2; Student C, get bag #3… Remove the brown pieces from the bag and place the rest in the corner."

(Diagram: twelve students, A through L, each processing one bag in parallel.)

"Students A, E, H, J: process bags 13, 14, 15, and 16 on completion of your previous tasks."

(Diagram: students A through L; four of them pick up the remaining bags as they finish.)

All done!

(Diagram: students A through L, all bags processed.)

Scenario 2: Count Total Pieces in Candy Bags
Introducing Stages

Stage 1: Local Count
"We need to count the total pieces in these candy bags."

(Diagram: students A through L with the candy bags.)

Stage 1: Local Count
"Students B, E, I, L: count these four bags."

(Diagram: students A through L; four students each counting one bag.)

Stage 1: Local Count
"Students B, E, I, L: commit your findings."

(Diagram: the four students record local counts of 5, 6, 4, and 5.)

Stage 2: Global Count
"Student G: total the counts from students B, E, I, L."

(Diagram: the local counts 5, 6, 4, and 5 are passed to student G.)

Stage 1: Local Count → Stage 2: Global Count

(Diagram: the local counts 5, 6, 4, and 5 from Stage 1 are summed in Stage 2 to a global total of 20.)

Query Optimization
Adaptive Query Execution is enabled by default as of Spark 3.2.

The Catalyst optimizer pipeline:

1. ANALYSIS — the query's Unresolved Logical Plan is resolved into a Logical Plan using the Metadata Catalog.
2. LOGICAL OPTIMIZATION — Catalyst rewrites the Logical Plan into an Optimized Logical Plan.
3. PHYSICAL PLANNING — candidate Physical Plans are generated.
4. COST-BASED OPTIMIZATION — a Cost Model selects the best Physical Plan.
5. WHOLE-STAGE CODE GENERATION — the Selected Physical Plan is compiled down to RDDs.

ADAPTIVE QUERY EXECUTION feeds Runtime Statistics back into planning while the query runs.

Code Optimization Recommendations
1. Use DataFrames or SQL instead of RDD APIs.
2. In production jobs, avoid unnecessary operations that trigger an action
beyond reading and writing files, such as count(), display(), and collect().
3. Avoid operations that force all computation onto the driver node,
such as single-threaded Python/pandas code. Use the Pandas API on Spark
instead to distribute pandas functions.

Designing the
Foundation

Databricks Performance Optimization

Designing the Foundation

LECTURE

Introduction to Designing the Foundation

Fundamental Concepts
Why some schemas and queries perform faster than others

● Number of bytes read
● Query complexity/computation
● Number of files accessed
● Parallelism

Common Performance Bottlenecks
Encountered with any big data or MPP system

Small File Problem
● Listing and metadata operations for too many small files can be expensive
● Can also result in throttling from cloud storage I/O limits

Data Skew
● Large amounts of data skew can result in more work handled by a single executor
● Even if the data read in is not skewed, certain transformations can lead to in-memory skew

Processing More Than Needed
● Traditional data lake platforms often require rewriting entire datasets or partitions

(Diagram: partition sizes before aggregation vs. after aggregation by city.)

Avoiding the Small File Problem
Automatically handle this common performance challenge in data lakes

● Too many small files greatly increases overhead for reads
● Too few large files reduces parallelism on reads
● Over-partitioning is a common cause
● Databricks automatically tunes the file sizes of Delta Lake tables
● Databricks automatically compacts small files on write with auto-optimize

Designing the Foundation

DEMONSTRATION

File Explosion

Designing the Foundation

LECTURE

Data Skipping and Liquid Clustering

Data Skipping
A simple, well-known I/O pruning technique

● Track file-level stats such as per-column min and max
● Leverage them to avoid scanning irrelevant files

file_name   col_min   col_max
1.parquet   6         8
2.parquet   3         10
3.parquet   1         4

SELECT input_file_name() AS file_name,
       min(col) AS col_min,
       max(col) AS col_max
FROM table
GROUP BY input_file_name()
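The pruning logic can be sketched in a few lines of plain Python (a toy model, not Delta Lake's implementation): a file can be skipped whenever the filter value falls outside its [min, max] range.

```python
# Toy model of data skipping: per-file min/max stats let a filter
# decide which files it can skip entirely, using the table above.
file_stats = [
    {"file_name": "1.parquet", "col_min": 6, "col_max": 8},
    {"file_name": "2.parquet", "col_min": 3, "col_max": 10},
    {"file_name": "3.parquet", "col_min": 1, "col_max": 4},
]

def files_to_scan(stats, value):
    """Keep only files whose [col_min, col_max] range could contain value."""
    return [s["file_name"] for s in stats
            if s["col_min"] <= value <= s["col_max"]]

# A query with WHERE col = 7 never opens 3.parquet (range 1..4).
print(files_to_scan(file_stats, 7))  # ['1.parquet', '2.parquet']
```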

Z-Ordering

Optimize Table Z-Order by Column

Old Layout                            New Layout
file_name   col_min   col_max         file_name   col_min   col_max
1.parquet   6         8               1.parquet   1         3
2.parquet   3         10              2.parquet   4         7
3.parquet   1         4               3.parquet   8         10

Z-Ordering

SELECT * FROM table WHERE col = 7

Old Layout (2 files match)            New Layout (1 file matches)
file_name   col_min   col_max         file_name   col_min   col_max
1.parquet   6         8               1.parquet   1         3
2.parquet   3         10              2.parquet   4         7
3.parquet   1         4               3.parquet   8         10
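The benefit of the rewritten layout is measurable with a plain-Python comparison (illustrative only; file names and ranges are taken from the tables on this slide):

```python
# Toy comparison: the same point query WHERE col = 7 against the
# original layout and the Z-ordered layout.
old_layout = [("1.parquet", 6, 8), ("2.parquet", 3, 10), ("3.parquet", 1, 4)]
new_layout = [("1.parquet", 1, 3), ("2.parquet", 4, 7), ("3.parquet", 8, 10)]

def matching_files(layout, value):
    """Files whose [min, max] range could contain value."""
    return [name for name, lo, hi in layout if lo <= value <= hi]

# Overlapping ranges in the old layout force two file reads;
# the Z-ordered layout's disjoint ranges need only one.
print(matching_files(old_layout, 7))  # ['1.parquet', '2.parquet']
print(matching_files(new_layout, 7))  # ['2.parquet']
```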

Databricks Delta Lake and Stats
● Databricks Delta Lake collects stats on the first N columns
  ○ dataSkippingNumIndexedCols = 32
● These stats are used in queries:
  ○ Metadata-only queries: SELECT max(col) FROM table
    ■ Queries just the Delta log; doesn't need to look at the files if col has stats
  ○ Allows us to skip files
    ■ Partition filters, data filters, and pushed filters apply in that order
  ○ Timestamp and String types aren't always very useful
    ■ Precision/truncation prevents exact matches; sometimes we have to fall back to reading files
● Avoid collecting stats on long strings
  ○ Put them outside the first 32 columns, or collect stats on fewer columns:
    ■ ALTER TABLE table CHANGE COLUMN col AFTER col32
    ■ SET spark.databricks.delta.properties.defaults.dataSkippingNumIndexedCols = 3

What about Partitioning?
● Generally not recommended!
  ○ Partitioning is usually misused/overused
  ○ Leads to tiny-file problems or skew
● Good use cases for partitioning
  ○ Isolating data for separate schemas (single -> multiplexing)
  ○ GDPR/CCPA use cases where you commonly delete a partition's worth of data
  ○ Use cases requiring a physical boundary to isolate data, e.g. SCD Type 2: partition on current-or-not for better performance
● If you partition
  ○ Choose a column with low cardinality
  ○ Try to keep each partition less than 1 TB and greater than 1 GB
  ○ Only partition tables expected to grow to terabytes
  ○ Partition (usually) on a date; Z-order on predicates commonly used in WHERE clauses
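The 1 GB to 1 TB guidance is easy to sanity-check with back-of-envelope arithmetic (a sketch; the bounds are the rough figures quoted above, not hard limits):

```python
def partition_size_ok(table_gb, n_partitions, lo_gb=1, hi_gb=1024):
    """Rough check: is the average partition within the ~1 GB..1 TB band?"""
    avg_gb = table_gb / n_partitions
    return lo_gb <= avg_gb <= hi_gb

# A 5 TB table partitioned by day over one year: ~13.7 GB each -- fine.
print(partition_size_ok(5000, 365))    # True
# A 100 GB table with 10,000 partitions: ~10 MB each -- over-partitioned.
print(partition_size_ok(100, 10_000))  # False
```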

Challenges with Disk Partitioning
Partition by customer ID and date + optimize

(Diagram: a grid of customers A through F by dates 2023-02-05 through 2023-02-07, with wildly varying file counts and sizes per cell.)

● Many small files
  ○ High metadata-operation overhead
  ○ Slow read operations
● Data skew
  ○ Inconsistent file sizes across partitions
Introducing Liquid Clustering

What it is
An innovative technique for clustering data layout to support efficient query access and reduce data management and tuning overhead. It is flexible and adaptive to data pattern changes, scaling, and data skew.

Benefits
● Best performance out of the box
  ○ Clustering on write
● Most consistent data skipping
  ○ Immune to data skew
● Minimal write amplification on table maintenance
  ○ True incremental optimize
● Row-level concurrency
  ○ Simplifies the logic of concurrent writers
● Reduced cognitive overhead
  ○ No worrying about cardinality
Liquid Clustering
Liquid cluster by customer ID and date

(Diagram: the same customer-by-date grid, now with consistent, target-size files.)

● Liquid is not subject to rigid boundaries
  ○ Liquid intelligently decides what ranges of data to combine
● Data skew is gone
  ○ Data sizes are consistent
● Liquid stores metadata
  ○ New data can be clustered into existing clusters on write

Table Statistics
Keep table statistics up to date for best results with the cost-based optimizer

● Collects statistics on all columns in the table
● Helps Adaptive Query Execution
  ○ Choose the proper join type
  ○ Select the correct build side in a hash join
  ○ Calibrate the join order in a multi-way join

ANALYZE TABLE mytable COMPUTE STATISTICS FOR ALL COLUMNS

Predictive Optimization
What is Predictive Optimization?

● Predictive optimization refers to using predictive analytics techniques to automatically optimize and enhance the performance of systems, processes, or workflows.

● It involves leveraging data-driven insights to proactively identify and implement optimizations, improving efficiency, cost-effectiveness, and overall system performance.

Predictive Optimization
Key Features

Automatic Maintenance
● Automates the execution of background maintenance tasks on Delta tables.

Supported Maintenance Operations
● Supports maintenance operations including OPTIMIZE, to improve query performance by optimizing file sizes, and VACUUM, to reduce storage costs by deleting unused data.

Set-and-Forget Approach
● Intelligently and automatically runs maintenance jobs without requiring ongoing user supervision.

Serverless Computing
● Utilizes serverless compute, eliminating the need for users to manually manage compute clusters.

Designing the Foundation

LAB EXERCISE

Data Skipping and Liquid Clustering

Code Optimization

Databricks Performance Optimization

Code Optimization
Four commonly seen performance problems associated with Spark:

● Skew
● Shuffles
● Spill
● Serialization

Plus: Adaptive Query Execution in action

Code Optimization

LECTURE

Skew

Skew - Before and After

(Diagram: partition sizes before aggregation vs. after aggregation by city.)

Handling Data Skew
Data skew is unavoidable; Databricks handles this automatically

● In MPP systems, data skew significantly impacts performance because some workers are processing much more data.
● Most cloud data warehouses require a manual, offline redistribution to solve for data skew.
● With Adaptive Query Execution, Spark automatically breaks down larger partitions into smaller, similarly sized partitions.

(Diagram: partitions 1-4 stay at 50 MB each; a 90 MB partition 5 is split into 5-A and 5-B at 45 MB each; a 150 MB partition 6 is split into 6-A, 6-B, and 6-C at 50 MB each.)
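AQE's partition splitting can be sketched in plain Python (a simplification: real AQE works on shuffle statistics and configurable thresholds, not a fixed 50 MB target):

```python
import math

def split_partition(size_mb, target_mb=50):
    """Split an oversized partition into n roughly equal chunks."""
    n = math.ceil(size_mb / target_mb)
    base, rem = divmod(size_mb, n)
    return [base + (1 if i < rem else 0) for i in range(n)]

print(split_partition(50))   # [50]          -- small enough, untouched
print(split_partition(90))   # [45, 45]      -- the 5-A / 5-B split
print(split_partition(150))  # [50, 50, 50]  -- the 6-A / 6-B / 6-C split
```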

Skew - Mitigation
Four "common" solutions

1. Adaptive Query Execution (enabled by default as of Spark 3.2)
2. Filter out skewed values
3. Databricks' [proprietary] skew hint
   • Easier to add a single hint than to salt your keys
   • Great option for Spark 2.x
4. Salt the join keys, forcing even distribution during the shuffle
   • If none of the other options are suitable, salting is the only alternative
   • It involves breaking a large skewed partition into smaller ones by adding random integers as suffixes
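Option 4 can be sketched in plain Python (illustrative only; in Spark you would add a salt column to both sides of the join, e.g. with rand()):

```python
import random
from collections import Counter

random.seed(42)
NUM_SALTS = 4

# A heavily skewed key: "NYC" has 100x the rows of "SF".
rows = [("NYC", i) for i in range(1000)] + [("SF", i) for i in range(10)]

# Append a random suffix 0..NUM_SALTS-1 so the hot key hashes to
# NUM_SALTS different shuffle partitions instead of one.
salted = [(f"{key}_{random.randrange(NUM_SALTS)}", val) for key, val in rows]

sizes = Counter(key for key, _ in salted)
# The 1000 "NYC" rows are now spread over NYC_0..NYC_3 (~250 each).
# Note: the other (smaller) side of the join must be exploded with one
# copy of each row per salt value so the salted keys still match.
```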

Code Optimization

LECTURE

Shuffles

Shuffles
Shuffling is a side effect of wide transformations
• join()
• distinct()
• groupBy()
• orderBy()

And technically some actions, e.g. count()
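A wide transformation needs every record with the same key in the same place, which is why data must move between partitions. A minimal plain-Python model of that redistribution step:

```python
# Toy model of the shuffle behind groupBy(): records are hash-partitioned
# by key, so each output partition holds all records for its keys.
input_partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("c", 1)],
]
NUM_OUTPUT = 2

output_partitions = [[] for _ in range(NUM_OUTPUT)]
for partition in input_partitions:        # "map" side: route each record
    for key, value in partition:
        output_partitions[hash(key) % NUM_OUTPUT].append((key, value))

# After the shuffle, a per-partition aggregation is safe:
# no key ever spans two output partitions.
```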

Shuffles at a Glance

(Diagram sequence, built up over several slides, illustrating how a shuffle redistributes records across the cluster.)
Shuffles - Mitigation

● Reduce network I/O by using fewer, larger workers
● Speed up shuffle reads and writes by using NVMe SSDs
● Reduce the amount of shuffled data
  ○ Remove unnecessary columns
  ○ Filter out unnecessary records preemptively
● Denormalize datasets, especially when the shuffle is rooted in a join

Re-evaluate the join strategy:
● Reordering the join
● Dynamically switching join strategies
● Broadcast hash join
● Shuffle hash join (default for Databricks Photon)
● Sort-merge join (default for open-source Spark)

Code Optimization

DEMONSTRATION

Shuffle

Code Optimization

LECTURE

Spill

Spill
● Spill is the term used to refer to the act of moving data from RAM to disk, and later back into RAM again

● This occurs when a given partition is simply too large to fit into RAM

● In this case, Spark is forced into [potentially] expensive disk reads and writes to free up local RAM

● All of this just to avoid the dreaded OOM error

Spill - Examples
● Setting spark.sql.files.maxPartitionBytes too high (default is 128 MB)
● The explode() of even a small array
● A join() or crossJoin() of two tables which generates lots of new rows
● A join() or crossJoin() of two tables by a skewed key
● A groupBy() where the column has low cardinality
● countDistinct() and size(collect_set())
● Setting spark.sql.shuffle.partitions too low, or wrong use of repartition()
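The first item can be made concrete with rough arithmetic (a simplification: the real partition count also depends on spark.sql.files.openCostInBytes and file boundaries):

```python
import math

def est_input_partitions(total_bytes, max_partition_bytes=128 * 1024 * 1024):
    """Roughly how many input partitions a scan of total_bytes produces."""
    return max(1, math.ceil(total_bytes / max_partition_bytes))

ten_gib = 10 * 1024**3
print(est_input_partitions(ten_gib))           # 80 partitions of ~128 MB
# Raising maxPartitionBytes to 1 GiB leaves 10 much larger partitions --
# each one must now fit in a task's memory, or it spills.
print(est_input_partitions(ten_gib, 1024**3))  # 10
```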

Spill - Memory & Disk
In the Spark UI, spill is represented by two values:

● Spill (Memory): for the partition that was spilled, this is the size of that data as it existed in memory

● Spill (Disk): likewise, for the partition that was spilled, this is the size of the data as it existed on disk

The two values are always presented together

The size on disk will always be smaller due to the natural compression gained in the act of serializing that data before writing it to disk
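That the disk copy comes out smaller can be demonstrated with stdlib serialization and compression (an analogy only; Spark uses its own serializers and codecs such as LZ4):

```python
import pickle
import zlib

# Repetitive row data, as shuffle/spill payloads typically are.
rows = [{"id": i, "city": "Amsterdam", "country": "NL"} for i in range(10_000)]

spill_memory = pickle.dumps(rows)          # serialized size, from RAM
spill_disk = zlib.compress(spill_memory)   # compressed size, on disk

print(len(spill_disk) < len(spill_memory))  # True: the disk copy is smaller
```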

Spill - Mitigations
● Allocate a cluster with more RAM per core

● Address data skew

● Manage the size of Spark partitions

● Avoid expensive operations like explode()

● Reduce the amount of data preemptively whenever possible

Code Optimization

LAB EXERCISE

Exploding Join

Code Optimization

LECTURE

Serialization

Performance Problems with Serialization
● Spark SQL and DataFrame instructions are highly optimized
● All UDFs must be serialized and distributed to each executor
● The parameters and return value of each UDF must be converted for each row of data
● Python UDFs take an even harder hit:
  ○ The Python code has to be pickled
  ○ Spark must instantiate a Python interpreter in each and every executor
  ○ The conversion of each row between Python and the DataFrame representation costs even more

Mitigating Serialization Issues
● Don't use UDFs
  ○ I challenge you to find a set of transformations that cannot be done with the built-in, continuously optimized, community-supported, higher-order functions

● If you have to use UDFs in Python (common for data scientists), use vectorized (pandas) UDFs or Apache Arrow-optimized Python UDFs as opposed to stock Python UDFs

● If you have to use UDFs in Scala, use typed transformations as opposed to stock Scala UDFs

● Resist the temptation to use UDFs to integrate Spark code with existing business logic: porting that logic to Spark almost always pays off
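The gap between a stock Python UDF and a vectorized one comes down to invocation granularity, which a plain-Python analogy makes visible (no PySpark involved; in Spark the per-row path also pays serialization costs on top):

```python
calls = {"row_udf": 0, "vectorized_udf": 0}

def row_udf(x):             # like a stock Python UDF: one call per row
    calls["row_udf"] += 1
    return x * 2

def vectorized_udf(batch):  # like a pandas UDF: one call per batch
    calls["vectorized_udf"] += 1
    return [x * 2 for x in batch]

data = list(range(1_000))
out_rows = [row_udf(x) for x in data]  # 1,000 interpreter round-trips
out_batch = vectorized_udf(data)       # 1 round-trip for the whole batch

print(out_rows == out_batch)           # True: same result, far fewer calls
```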

Code Optimization

DEMONSTRATION

User-Defined Functions

Fine-Tuning: Choosing the Right Cluster

Databricks Performance Optimization

Fine-Tuning: Choosing the Right Cluster

LECTURE

Fine-Tuning: Choosing the Right Cluster

Cluster Types

ALL-PURPOSE COMPUTE
● Designed to handle interactive workloads, including streaming workloads
● Enable autoscale to add capacity when needed and reduce time to answer
● Security must be considered, as autoscaling can introduce additional risks

JOBS COMPUTE
● Run on ephemeral clusters that are created for the job and terminate on completion
● Pre-scheduled or submitted via API
● Single-user; great for isolation and debugging
● Production and repeat workloads; lower cost

SQL WAREHOUSE
● Built for high-concurrency, ad-hoc SQL analytics and BI serving
● Photon included
● Recommended: a shared warehouse for ad-hoc SQL analytics, isolated warehouses for specific workloads
● Serverless available for instant startup and lower TCO

Autoscaling
● Dynamically resizes cluster based on workload
○ Can run faster than a statically-sized, under-provisioned cluster
○ Can reduce overall costs compared to a statically-sized cluster

● Setting range for the number of workers requires some experimenting

Use Case: Auto Scaling Range
● Ad-hoc usage or business analytics: large variance
● Production batch jobs: not needed, or a buffer on the upper limit
● Streaming: available in Delta Live Tables
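As a rough sketch of how the autoscaling range is set per cluster (field names follow the Databricks Clusters API; the runtime version, node type, and worker counts are illustrative assumptions, not recommendations):

```python
# Illustrative cluster spec with autoscaling enabled.
# "autoscale" replaces a fixed "num_workers" in the Clusters API.
cluster_spec = {
    "spark_version": "15.4.x-scala2.12",  # illustrative LTS runtime
    "node_type_id": "m7gd.xlarge",        # illustrative instance type
    "autoscale": {
        "min_workers": 2,  # floor for steady-state load
        "max_workers": 8,  # ceiling for peak demand
    },
}

def autoscale_range(spec):
    """Return the (min, max) worker range of a cluster spec."""
    a = spec["autoscale"]
    return a["min_workers"], a["max_workers"]

print(autoscale_range(cluster_spec))  # -> (2, 8)
```

A large variance (e.g. 2 to 8) suits ad-hoc use; a production batch job would instead pin `num_workers` or keep a small buffer on the upper limit.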
Spot Instances
● Use spot instances to use spare VM capacity at below-market rates
○ Great for ad-hoc/shared clusters
○ Not recommended for jobs with mission-critical SLAs
○ Never use spot for the driver!
● Combine on-demand and spot instances (with a custom spot price) to tailor clusters to different use cases

SLA: Spot or On-Demand
● Non-mission-critical jobs: driver on-demand, workers on spot
● Workflows with tight SLAs: spot instances with fallback to on-demand
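The "driver on-demand, workers on spot with fallback" pattern can be sketched with the AWS availability settings of a cluster spec (field names follow the Databricks Clusters API; the values are illustrative assumptions):

```python
# Illustrative AWS availability settings for a Databricks cluster.
aws_attributes = {
    "first_on_demand": 1,                  # the first node (the driver) is always on-demand
    "availability": "SPOT_WITH_FALLBACK",  # workers start on spot, fall back to on-demand
    "spot_bid_price_percent": 100,         # bid up to 100% of the on-demand price
}

# Sanity check: the driver should never run on spot.
assert aws_attributes["first_on_demand"] >= 1
print(aws_attributes["availability"])  # -> SPOT_WITH_FALLBACK
```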

Photon
World-record-setting query engine with zero tuning or setup

Save on compute costs
● ETL customers are saving up to 40% on their compute cost

Fast query performance
● Built for modern hardware with up to 12x better price/perf compared to other cloud data warehouses

No code changes
● Spark APIs that can do exploration, ETL, big data, small data, low latency, high concurrency, batch, and streaming

Broad language support
● Support for SQL, Python, Scala, R, and Java

Cluster Optimization Recommendations
1. DS & DE development: all-purpose compute, auto-scale and auto-stop enabled, develop
& test on a subset of the data
2. Ingestion & ETL jobs: jobs compute, size accordingly to job SLA
3. Ad-hoc SQL analytics: (serverless) SQL warehouse, auto-scale and auto-stop enabled
4. BI Reporting: isolated SQL warehouse, sized according to BI SLAs
5. Best practices:
a. Enable spot instances on worker nodes
b. Use the latest LTS Databricks Runtime when possible
c. Use Photon for best TCO when applicable
d. Use latest-generation VMs; start with general purpose, then test memory- or compute-optimized

Fine-Tuning: Choosing the Right Cluster

LECTURE

Pick the Best
Instance Types

Have an Open Mind When Picking
Machines
● For AWS, i3s aren't always the best. Explore m7gd and r7gd; enable caching if needed.
○ Graviton instances work well; try those first
○ m7gd and r7gd have better processors, similar (albeit smaller) local disk, and much more stable spot markets than the i-series
● For Azure, try the Eav4, Dav4, and F-series over the L-series
○ The ACU is very useful
● GCP defaults are pretty good
● You usually don't need network-optimized instance types; on some occasions they help with Photon

How to Choose the Right Machine Is Pretty
Simple
● Just a series of IFTTT questions and rules of thumb!
○ Side note - if you 2x the cluster and it runs in 1/2 the time, it costs the same

● Rules of thumb
○ First run: set spark.sql.shuffle.partitions to 2x the number of cores
○ Keep total memory available to the machine under 128 GB
○ Aim for a ratio of 1 core per 128 MB to 2 GB of reads (some caveats may apply)
○ Avoid setting any other configs at first (don't carry over configs from legacy platforms unless absolutely necessary)
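The first two rules of thumb above can be sketched as simple arithmetic (the cluster shape here is an illustrative assumption; in a notebook you would read the real core count from your cluster):

```python
# Illustrative cluster: 4 workers x 4 cores each.
total_cores = 16

# First run: shuffle partitions = 2x the total core count.
shuffle_partitions = 2 * total_cores  # 32

# 1 core per 128 MB to 2 GB of reads -> rough input size this cluster suits.
low_gb = total_cores * 128 / 1024  # 2.0 GB total at the low end
high_gb = total_cores * 2          # 32 GB total at the high end

print(shuffle_partitions, low_gb, high_gb)  # -> 32 2.0 32
# In a notebook you would then apply it with:
# spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
```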

What You Care about with the Instance
Type
● Core to Ram ratio
● Processor type
● Local vs remote storage
● Storage medium

Cloud   Family       Core:RAM      Processor                Storage
AWS     c5           1 core:2 GB   Intel Cascade 3.6 GHz    Local NVMe (d variants)
Azure   F-series     1 core:2 GB   Intel Xeon 2.4 GHz       Local SSD
GCP     n2-highcpu   1 core:1 GB   Intel Cascade 3.4 GHz    Local SSD
Sizing a Driver
● Leave it the same size as your worker unless you care about being the absolute cheapest; don't make things more complicated than they need to be
● Drivers typically do very little work in a Spark application; a 4-8 core, 16-32 GB RAM driver should be fine for most workloads
● Large commits to Delta tables use more memory
● This suggestion is voided when:
○ Running many streams/concurrent jobs on the same machine
○ Committing a very large (100k+ files) amount of data to a Delta table
○ Collecting large amounts of data to the driver to use in Pandas/R

Spot Market Considerations
● The spot market is a great way to save money on infrastructure.
● Each instance type has a different level of availability and price savings in each region.
● Example: i3s aren't great; r5ds look a lot better.

Reference: https://aws.amazon.com/ec2/spot/instance-advisor/
IFTTT - Step 1
Want to use Photon?

No: go to the next slide.

Yes:
Cloud   Family
AWS     m6gd/r6gd/i4i, m7gd/r7gd
Azure   Edsv4
GCP     n2-highmem, n2-standard
IFTTT - Step 2
Is your job an ETL job that uses joins/windows/groupbys/aggregations?

No:
Cloud   Family
AWS     c7g/c6g
Azure   fsv2
GCP     e2-highcpu

Yes:
Cloud   Family
AWS     c7gd/c6gd
Azure   fsv2
GCP     n2-highcpu
IFTTT - Step 3
Run the job with the instance type, follow our rules of thumb, and go to the SQL UI of the longest-running query. Do you see spill?

No: stop; this is good enough.

Yes: set spark.sql.shuffle.partitions to the largest shuffle read stage / 200 MB, or set spark.sql.shuffle.partitions=auto.

FYI: spill is much less impactful when using Photon.
IFTTT - Step 4
Run the job with the updated shuffle partitions. Do you still see spill?

No: stop; this is good enough.

Yes:
Cloud   Family
AWS     m7gd
Azure   dav4/dasv4
GCP     n2-standard
IFTTT - Step 5
Run the job with the updated instance type. Do you still see spill?

No: stop; this is good enough.

Yes:
Cloud   Family
AWS     r7gd/r6gd
Azure   Edsv4
GCP     n2-highmem
Reminder on Shuffle Partitions
spark.sql.shuffle.partitions = auto
OR
Go to the stage UI, find the largest shuffle read size, and divide it by 200 MB.
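The manual calculation amounts to one division (the 100 GB shuffle read below is an illustrative value; read the real number from the stage UI):

```python
import math

def shuffle_partitions_for(shuffle_read_bytes, target_bytes=200 * 1024**2):
    """Largest shuffle read size divided by a ~200 MB target partition size."""
    return max(1, math.ceil(shuffle_read_bytes / target_bytes))

# e.g. a 100 GB shuffle read
print(shuffle_partitions_for(100 * 1024**3))  # -> 512
```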

Don’t Forget to Double Check the Event
Log!
Spot failures happen, and they slow things down. Don't forget to double-check the event log; it's probably the first thing you should do.

Questions?

