Build Data Pipelines with Delta Live Tables
Module 04
Agenda
Build Data Pipelines with Delta Live Tables
The Medallion Architecture
Introduction to Delta Live Tables
DE 4.1 - DLT UI Walkthrough
DE 4.1A - SQL Pipelines
DE 4.1B - Python Pipelines
DE 4.2 - Python vs SQL
DE 4.3 - Pipeline Results
DE 4.4 - Pipeline Event Logs
The Medallion Architecture
Medallion Architecture in the Lakehouse
Diagram: streaming sources (e.g. Kinesis), raw files (CSV, JSON, TXT, …), and the data lake feed the Bronze layer (raw ingestion and history), which is refined into the Silver layer (filtered, cleaned, augmented) and then the Gold layer (business-level aggregates). The Gold layer powers streaming analytics, BI & reporting, and data science & ML. Data quality & governance and data sharing span the architecture.
Multi-Hop in the Lakehouse
Bronze Layer
Typically just a raw copy of ingested data
Replaces traditional data lake
Provides efficient storage and querying of full, unprocessed history of data
Multi-Hop in the Lakehouse
Silver Layer
Reduces data storage complexity, latency, and redundancy
Optimizes ETL throughput and analytic query performance
Preserves grain of original data (without aggregations)
Eliminates duplicate records
Production schema enforced
Data quality checks, corrupt data quarantined
Multi-Hop in the Lakehouse
Gold Layer
Powers ML applications, reporting, dashboards, ad hoc analytics
Refined views of data, typically with aggregations
Reduces strain on production systems
Optimizes query performance for business-critical data
Introduction to Delta Live Tables
Multi-Hop in the Lakehouse
Diagram: CSV, JSON, and TXT files are ingested with Databricks Auto Loader into the Bronze layer (raw ingestion and history), refined into the Silver layer (filtered, cleaned, augmented), and then into the Gold layer (business-level aggregates), with data quality enforced along the way. The Gold layer feeds streaming analytics, AI, and reporting.
The Reality is Not so Simple
Diagram: many interdependent Bronze, Silver, and Gold tables.
Large-scale ETL is complex and brittle

Complex pipeline development
• Hard to build and maintain table dependencies
• Difficult to switch between batch and stream processing

Data quality and governance
• Difficult to monitor and enforce data quality
• Impossible to trace data lineage

Difficult pipeline operations
• Poor observability at the granular, data level
• Error handling and recovery is laborious
Introducing Delta Live Tables
Make reliable ETL easy on Delta Lake
Operate with agility
• Declarative tools to build batch and streaming data pipelines

Trust your data
• DLT has built-in declarative quality controls
• Declare quality expectations and actions to take

Scale with reliability
• Easily scale infrastructure alongside your data
What is a LIVE TABLE?
What is a Live Table?
Live Tables are materialized views for the lakehouse.
A live table is:
• Defined by a SQL query
• Created and kept up to date by a pipeline

Live tables provide tools to:
• Manage dependencies
• Control quality
• Automate operations
• Simplify collaboration
• Save costs
• Reduce latency

CREATE OR REFRESH LIVE TABLE report
AS SELECT sum(profit)
FROM prod.sales
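A minimal Python sketch of the same declaration, assuming the dlt module (prod.sales is the source table from the SQL example above):

import dlt
from pyspark.sql import functions as F

@dlt.table(name="report")
def report():
    # Defined by a query; created and kept up to date by the pipeline.
    return spark.read.table("prod.sales").agg(F.sum("profit").alias("profit"))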
What is a Streaming Live Table?
Based on Spark™ Structured Streaming

A streaming live table is "stateful":
• Ensures exactly-once processing of input rows
• Inputs are only read once

• Streaming live tables compute results over append-only streams such as Kafka, Kinesis, or Auto Loader (files on cloud storage)
• Streaming live tables allow you to reduce costs and latency by avoiding reprocessing of old data

CREATE STREAMING LIVE TABLE report
AS SELECT sum(profit)
FROM cloud_files(prod.sales)
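In Python, a table backed by a streaming read becomes a streaming table. A hedged sketch over a Kafka source (broker address and topic are hypothetical):

import dlt

@dlt.table(name="report_stream")
def report_stream():
    # Append-only stream; each input record is read exactly once.
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
        .option("subscribe", "sales")                       # hypothetical topic
        .load()
    )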
When should I use streaming?
Using Spark Structured Streaming for ingestion
Easily ingest files from cloud storage as they are uploaded
This example creates a table with all the JSON data stored in "/data":

CREATE STREAMING LIVE TABLE raw_data
AS SELECT *
FROM cloud_files("/data", "json")

• cloud_files keeps track of which files have been read to avoid duplication and wasted work
• Supports both listing and notifications for arbitrary scale
• Configurable schema inference and schema evolution
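The same Auto Loader ingestion in Python looks roughly like this (cloudFiles is the Python name of the cloud_files source):

import dlt

@dlt.table(name="raw_data")
def raw_data():
    # Incrementally load new JSON files from /data as they arrive.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data")
    )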
Using the SQL STREAM() function
Stream data from any Delta table
CREATE STREAMING LIVE TABLE mystream
AS SELECT *
FROM STREAM(my_table)

• STREAM(my_table) reads a stream of new records, instead of a snapshot
• Streaming tables must be append-only
• Any append-only Delta table can be read as a stream (i.e. from the LIVE schema, from the catalog, or just from a path)

Pitfall: my_table must be an append-only source. For example, it may not:
• be the target of APPLY CHANGES INTO
• define an aggregate function
• be a table on which you've executed DML to delete/update a row (see GDPR section)
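A hedged Python equivalent (my_table as in the SQL example; for a table defined in the same pipeline, dlt.read_stream plays the same role):

import dlt

@dlt.table(name="mystream")
def mystream():
    # Stream new records from an append-only Delta table instead of reading a snapshot.
    return spark.readStream.table("my_table")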
How do I use DLT?
Creating Your First Live Table Pipeline
SQL to DLT in three easy steps…
1. Write CREATE LIVE TABLE
• Table definitions are written (but not run) in notebooks.
• Databricks Repos allow you to version control your table definitions.

2. Create a pipeline
• A pipeline picks one or more notebooks of table definitions, as well as any configuration required.

3. Click start
• DLT will create or update all the tables in the pipeline.
BEST PRACTICE
Development vs Production
Fast iteration or enterprise-grade reliability

Development Mode
• Reuses a long-running cluster for fast iteration.
• No retries on errors, enabling faster debugging.

Production Mode
• Cuts costs by turning off clusters as soon as they are done (within 5 minutes).
• Escalating retries, including cluster restarts, ensure reliability in the face of transient issues.
In the Pipelines UI, a toggle switches the pipeline between Development and Production modes.
What if I have dependent tables?
Declare LIVE Dependencies
Using the LIVE virtual schema.
CREATE LIVE TABLE events
AS SELECT … FROM prod.raw_data

CREATE LIVE TABLE report
AS SELECT … FROM LIVE.events

Resulting graph: events → report

• Dependencies owned by other producers are just read from the catalog or Spark data source as normal.
• LIVE dependencies, from the same pipeline, are read from the LIVE schema.
• DLT detects LIVE dependencies and executes all operations in correct order.
• DLT handles parallelism and captures the lineage of the data.
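The same dependency pattern sketched in Python, using dlt.read for the in-pipeline LIVE dependency:

import dlt

@dlt.table
def events():
    # Dependency owned by another producer: read from the catalog as normal.
    return spark.read.table("prod.raw_data")

@dlt.table
def report():
    # Same-pipeline dependency: DLT records the events -> report edge and orders execution.
    return dlt.read("events")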
How do I ensure data quality?
BEST PRACTICE
Ensure correctness with Expectations
Expectations are tests that ensure data quality in production
CONSTRAINT valid_timestamp
EXPECT (timestamp > '2012-01-01')
ON VIOLATION DROP ROW

@dlt.expect_or_drop(
  "valid_timestamp",
  col("timestamp") > '2012-01-01')

Expectations are true/false expressions that are used to validate each row during processing.

DLT offers flexible policies on how to handle records that violate expectations:
• Track number of bad records
• Drop bad records
• Abort processing for a single bad record
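A hedged Python sketch showing all three policies on one table (table and column names are illustrative):

import dlt
from pyspark.sql.functions import col

@dlt.table
@dlt.expect("has_id", col("id").isNotNull())                              # track bad records only
@dlt.expect_or_drop("valid_timestamp", col("timestamp") > '2012-01-01')  # drop bad records
@dlt.expect_or_fail("positive_profit", col("profit") >= 0)               # abort the update on violation
def validated_sales():
    return dlt.read_stream("raw_sales")  # hypothetical upstream streaming table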
What about operations?
Pipelines UI
A one-stop shop for ETL debugging and operations
• Visualize data flows between tables
• Discover metadata and quality of each table
• Access to historical updates
• Control operations
• Dive deep into events
The Event Log
The event log automatically records all pipeline operations.

Operational Statistics
• Time and current status for all operations
• Pipeline and cluster configurations
• Row counts

Provenance
• Table schemas, definitions, and declared properties
• Table-level lineage
• Query plans used to update tables

Data Quality
• Expectation pass / failure / drop statistics
• Input/output rows that caused expectation failures
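The event log is stored as a Delta table and can be queried like any other. A minimal sketch, assuming a pipeline configured with the storage location /pipelines/my_pipeline (the log lives under system/events):

# Read the pipeline's event log (the storage path is an assumption for this example).
event_log = spark.read.format("delta").load("/pipelines/my_pipeline/system/events")

# Summarize events by type (e.g. flow_progress, user_action) and inspect recent messages.
event_log.groupBy("event_type").count().show()
event_log.select("timestamp", "event_type", "message").orderBy("timestamp", ascending=False).show(truncate=False)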
How can I use parameters?
Modularize your code with configuration
Avoid hard coding paths, topic names, and other constants in your code.
A pipeline's configuration is a map of key-value pairs that can be used to parameterize your code:
• Improve code readability/maintainability
• Reuse code in multiple pipelines for different data

CREATE STREAMING LIVE TABLE data AS
SELECT * FROM cloud_files("${my_etl.input_path}", "json")

import dlt

@dlt.table
def data():
    input_path = spark.conf.get("my_etl.input_path")
    return spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").load(input_path)
How can I do change data capture (CDC)?
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

A stream of {INSERT}, {UPDATE}, and {DELETE} change events is applied to maintain an up-to-date snapshot.
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

LIVE.cities is the target the changes are applied to.
Example record in city_updates: {"id": 1, "ts": 1, "city": "Bekerly, CA"}
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

STREAM(LIVE.city_updates) is the source of changes; currently this has to be a stream.
Example record in city_updates: {"id": 1, "ts": 1, "city": "Bekerly, CA"}
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

KEYS (id) specifies a unique key that can be used to identify a given row in the target cities table.
Example record in city_updates: {"id": 1, "ts": 1, "city": "Bekerly, CA"}
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

SEQUENCE BY ts specifies a sequence that can be used to order changes:
• Log sequence number (lsn)
• Timestamp
• Ingestion time
Example record in city_updates: {"id": 1, "ts": 100, "city": "Bekerly, CA"}
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

Records in city_updates:
{"id": 1, "ts": 100, "city": "Bekerly, CA"}
{"id": 1, "ts": 200, "city": "Berkeley, CA"}

Result in cities: the later change (ts = 200) wins, so the row for id 1 is updated from "Bekerly, CA" to "Berkeley, CA".
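For reference, a Python sketch of the same CDC flow, assuming the dlt Python API's create_streaming_table and apply_changes (city_updates is assumed to be a streaming table defined elsewhere in the pipeline):

import dlt
from pyspark.sql.functions import col

# Target table that APPLY CHANGES keeps up to date.
dlt.create_streaming_table("cities")

# Apply inserts, updates, and deletes keyed by id, ordered by ts.
dlt.apply_changes(
    target="cities",
    source="city_updates",
    keys=["id"],
    sequence_by=col("ts"),
)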
REFERENCE ARCHITECTURE
Change Data Capture (CDC) from RDBMS
A variety of 3rd party tools can provide a streaming change feed
• Amazon RDS → Amazon DMS to S3 → cloud_files → APPLY CHANGES INTO → replicated_table
• MySQL or Postgres → Debezium → APPLY CHANGES INTO → replicated_table
• Oracle → Golden Gate → APPLY CHANGES INTO → replicated_table
What do I no longer need to manage with DLT?
Automated Data Management
DLT automatically optimizes data for performance & ease-of-use
Best Practices
What: DLT encodes Delta best practices automatically when creating DLT tables.
How: DLT sets the following properties:
• optimizeWrite
• autoCompact
• tuneFileSizesForRewrites

Physical Data
What: DLT automatically manages your physical data to minimize cost and optimize performance.
How:
• Runs vacuum daily
• Runs optimize daily
You can still tell us how you want it organized (i.e. ZORDER).

Schema Evolution
What: Schema evolution is handled for you.
How: Modifying a live table transformation to add/remove/rename a column will automatically do the right thing. When removing a column in a streaming live table, old values are preserved.
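For example, a hedged sketch of telling DLT how a table should be organized, assuming the pipelines.autoOptimize.zOrderCols table property (table and column names are illustrative):

import dlt

@dlt.table(
    # Assumption: this table property requests Z-ordering by the listed column(s).
    table_properties={"pipelines.autoOptimize.zOrderCols": "customer_id"}
)
def sales_gold():
    return dlt.read("sales_silver")  # hypothetical upstream live table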
DE 4.1 - Using the Delta Live Tables UI
Deploy a DLT pipeline
Explore the resultant DAG
Execute an update of the pipeline
DE 4.1.1 - Fundamentals of DLT Syntax
Declaring Delta Live Tables
Ingesting data with Auto Loader
Using parameters in DLT Pipelines
Enforcing data quality with constraints
Adding comments to tables
Describing differences in syntax and execution of live tables and streaming live tables
DE 4.1.2 - More DLT SQL Syntax
Processing CDC data with APPLY CHANGES INTO
Declaring live views
Joining live tables
Describing how DLT library notebooks work together in a pipeline
Scheduling multiple notebooks in a DLT pipeline
DE 4.2 - Delta Live Tables: Python vs SQL
Identify key differences between the Python and SQL implementations of Delta Live Tables
DE 4.3 - Exploring the Results of a DLT Pipeline
DE 4.4 - Exploring the Pipeline Event Logs
DE 4.1.3 - Troubleshooting DLT Syntax Lab
Identifying and troubleshooting DLT syntax
Iteratively developing DLT pipelines with notebooks