Build Data Pipelines with Delta Live Tables
Module 04
Agenda
Build Data Pipelines with Delta Live Tables
The Medallion Architecture
Introduction to Delta Live Tables
DE 4.1 - DLT UI Walkthrough
DE 4.1A - SQL Pipelines
DE 4.1B - Python Pipelines
DE 4.2 - Python vs SQL
DE 4.3 - Pipeline Results
DE 4.4 - Pipeline Event Logs
The Medallion Architecture
Medallion Architecture in the Lakehouse
Diagram: streaming sources (e.g. Kinesis), raw files (CSV, JSON, TXT, …), and the data lake feed the Bronze layer (raw ingestion and history), which is refined into the Silver layer (filtered, cleaned, augmented) and then the Gold layer (business-level aggregates). The Gold layer powers streaming analytics, BI & reporting, and data science & ML. Data quality & governance and data sharing span the architecture.
Multi-Hop in the Lakehouse
Bronze Layer
Typically just a raw copy of ingested data
Replaces traditional data lake
Provides efficient storage and querying of full, unprocessed history of data
Multi-Hop in the Lakehouse
Silver Layer
Reduces data storage complexity, latency, and redundancy
Optimizes ETL throughput and analytic query performance
Preserves grain of original data (without aggregations)
Eliminates duplicate records
Production schema enforced
Data quality checks, corrupt data quarantined
Multi-Hop in the Lakehouse
Gold Layer
Powers ML applications, reporting, dashboards, ad hoc analytics
Refined views of data, typically with aggregations
Reduces strain on production systems
Optimizes query performance for business-critical data
Introduction to Delta Live Tables
Multi-Hop in the Lakehouse
Diagram: CSV, JSON, and TXT files are ingested with Databricks Auto Loader into the Bronze layer (raw ingestion and history), refined into the Silver layer (filtered, cleaned, augmented), and then into the Gold layer (business-level aggregates), with data quality enforced along the way. The Gold layer feeds streaming analytics, AI, and reporting.
The Reality is Not so Simple
Diagram: many interdependent Bronze, Silver, and Gold tables.
Large-scale ETL is complex and brittle

Complex pipeline development
• Hard to build and maintain table dependencies
• Difficult to switch between batch and stream processing

Data quality and governance
• Difficult to monitor and enforce data quality
• Impossible to trace data lineage

Difficult pipeline operations
• Poor observability at the granular, data level
• Error handling and recovery is laborious
Introducing Delta Live Tables
Make reliable ETL easy on Delta Lake
Operate with agility
• Declarative tools to build batch and streaming data pipelines

Trust your data
• DLT has built-in declarative quality controls
• Declare quality expectations and actions to take

Scale with reliability
• Easily scale infrastructure alongside your data
What is a LIVE TABLE?
What is a Live Table?
Live Tables are materialized views for the lakehouse.
A live table is:
• Defined by a SQL query
• Created and kept up to date by a pipeline

Live tables provide tools to:
• Manage dependencies
• Control quality
• Automate operations
• Simplify collaboration
• Save costs
• Reduce latency

CREATE OR REFRESH LIVE TABLE report
AS SELECT sum(profit)
FROM prod.sales
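A minimal Python sketch of the same declaration, assuming the dlt module (prod.sales is the source table from the SQL example above):

import dlt
from pyspark.sql import functions as F

@dlt.table(name="report")
def report():
    # Defined by a query; created and kept up to date by the pipeline.
    return spark.read.table("prod.sales").agg(F.sum("profit").alias("profit"))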
What is a Streaming Live Table?
Based on Spark™ Structured Streaming

A streaming live table is "stateful":
• Ensures exactly-once processing of input rows
• Inputs are only read once

• Streaming live tables compute results over append-only streams such as Kafka, Kinesis, or Auto Loader (files on cloud storage)
• Streaming live tables allow you to reduce costs and latency by avoiding reprocessing of old data

CREATE STREAMING LIVE TABLE report
AS SELECT sum(profit)
FROM cloud_files(prod.sales)
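In Python, a table backed by a streaming read becomes a streaming table. A hedged sketch over a Kafka source (broker address and topic are hypothetical):

import dlt

@dlt.table(name="report_stream")
def report_stream():
    # Append-only stream; each input record is read exactly once.
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
        .option("subscribe", "sales")                       # hypothetical topic
        .load()
    )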
When should I use streaming?
Using Spark Structured Streaming for ingestion
Easily ingest files from cloud storage as they are uploaded
This example creates a table with all the JSON data stored in "/data":

CREATE STREAMING LIVE TABLE raw_data
AS SELECT *
FROM cloud_files("/data", "json")

• cloud_files keeps track of which files have been read to avoid duplication and wasted work
• Supports both listing and notifications for arbitrary scale
• Configurable schema inference and schema evolution
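The same Auto Loader ingestion in Python looks roughly like this (cloudFiles is the Python name of the cloud_files source):

import dlt

@dlt.table(name="raw_data")
def raw_data():
    # Incrementally load new JSON files from /data as they arrive.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data")
    )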
Using the SQL STREAM() function
Stream data from any Delta table
CREATE STREAMING LIVE TABLE mystream
AS SELECT *
FROM STREAM(my_table)

• STREAM(my_table) reads a stream of new records, instead of a snapshot
• Streaming tables must be append-only
• Any append-only Delta table can be read as a stream (i.e. from the LIVE schema, from the catalog, or just from a path)

Pitfall: my_table must be an append-only source. For example, it may not:
• be the target of APPLY CHANGES INTO
• define an aggregate function
• be a table on which you've executed DML to delete/update a row (see GDPR section)
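A hedged Python equivalent (my_table as in the SQL example; for a table defined in the same pipeline, dlt.read_stream plays the same role):

import dlt

@dlt.table(name="mystream")
def mystream():
    # Stream new records from an append-only Delta table instead of reading a snapshot.
    return spark.readStream.table("my_table")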
How do I use DLT?
Creating Your First Live Table Pipeline
SQL to DLT in three easy steps…
1. Write CREATE LIVE TABLE
• Table definitions are written (but not run) in notebooks.
• Databricks Repos allow you to version control your table definitions.

2. Create a pipeline
• A pipeline picks one or more notebooks of table definitions, as well as any configuration required.

3. Click start
• DLT will create or update all the tables in the pipeline.
BEST PRACTICE
Development vs Production
Fast iteration or enterprise-grade reliability

Development Mode
• Reuses a long-running cluster for fast iteration.
• No retries on errors, enabling faster debugging.

Production Mode
• Cuts costs by turning off clusters as soon as they are done (within 5 minutes).
• Escalating retries, including cluster restarts, ensure reliability in the face of transient issues.
In the Pipelines UI, a toggle switches the pipeline between Development and Production modes.
What if I have dependent tables?
Declare LIVE Dependencies
Using the LIVE virtual schema.
CREATE LIVE TABLE events
AS SELECT … FROM prod.raw_data

CREATE LIVE TABLE report
AS SELECT … FROM LIVE.events

Resulting graph: events → report

• Dependencies owned by other producers are just read from the catalog or Spark data source as normal.
• LIVE dependencies, from the same pipeline, are read from the LIVE schema.
• DLT detects LIVE dependencies and executes all operations in correct order.
• DLT handles parallelism and captures the lineage of the data.
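The same dependency pattern sketched in Python, using dlt.read for the in-pipeline LIVE dependency:

import dlt

@dlt.table
def events():
    # Dependency owned by another producer: read from the catalog as normal.
    return spark.read.table("prod.raw_data")

@dlt.table
def report():
    # Same-pipeline dependency: DLT records the events -> report edge and orders execution.
    return dlt.read("events")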
How do I ensure data quality?
BEST PRACTICE
Ensure correctness with Expectations
Expectations are tests that ensure data quality in production
CONSTRAINT valid_timestamp
EXPECT (timestamp > '2012-01-01')
ON VIOLATION DROP ROW

@dlt.expect_or_drop(
  "valid_timestamp",
  col("timestamp") > '2012-01-01')

Expectations are true/false expressions that are used to validate each row during processing.

DLT offers flexible policies on how to handle records that violate expectations:
• Track number of bad records
• Drop bad records
• Abort processing for a single bad record
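A hedged Python sketch showing all three policies on one table (table and column names are illustrative):

import dlt
from pyspark.sql.functions import col

@dlt.table
@dlt.expect("has_id", col("id").isNotNull())                              # track bad records only
@dlt.expect_or_drop("valid_timestamp", col("timestamp") > '2012-01-01')  # drop bad records
@dlt.expect_or_fail("positive_profit", col("profit") >= 0)               # abort the update on violation
def validated_sales():
    return dlt.read_stream("raw_sales")  # hypothetical upstream streaming table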
What about operations?
Pipelines UI
A one-stop shop for ETL debugging and operations
• Visualize data flows between tables
• Discover metadata and quality of each table
• Access to historical updates
• Control operations
• Dive deep into events
The Event Log
The event log automatically records all pipeline operations.

Operational Statistics
• Time and current status for all operations
• Pipeline and cluster configurations
• Row counts

Provenance
• Table schemas, definitions, and declared properties
• Table-level lineage
• Query plans used to update tables

Data Quality
• Expectation pass / failure / drop statistics
• Input/output rows that caused expectation failures
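The event log is stored as a Delta table and can be queried like any other. A minimal sketch, assuming a pipeline configured with the storage location /pipelines/my_pipeline (the log lives under system/events):

# Read the pipeline's event log (the storage path is an assumption for this example).
event_log = spark.read.format("delta").load("/pipelines/my_pipeline/system/events")

# Summarize events by type (e.g. flow_progress, user_action) and inspect recent messages.
event_log.groupBy("event_type").count().show()
event_log.select("timestamp", "event_type", "message").orderBy("timestamp", ascending=False).show(truncate=False)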
How can I use parameters?
Modularize your code with configuration
Avoid hard coding paths, topic names, and other constants in your code.
A pipeline's configuration is a map of key-value pairs that can be used to parameterize your code:
• Improve code readability/maintainability
• Reuse code in multiple pipelines for different data

CREATE STREAMING LIVE TABLE data AS
SELECT * FROM cloud_files("${my_etl.input_path}", "json")

import dlt

@dlt.table
def data():
    input_path = spark.conf.get("my_etl.input_path")
    return spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").load(input_path)
How can I do change data capture (CDC)?
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

A stream of {INSERT}, {UPDATE}, and {DELETE} change events is applied to maintain an up-to-date snapshot.
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

LIVE.cities is the target the changes are applied to.
Example record in city_updates: {"id": 1, "ts": 1, "city": "Bekerly, CA"}
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

STREAM(LIVE.city_updates) is the source of changes; currently this has to be a stream.
Example record in city_updates: {"id": 1, "ts": 1, "city": "Bekerly, CA"}
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

KEYS (id) specifies a unique key that can be used to identify a given row in the target cities table.
Example record in city_updates: {"id": 1, "ts": 1, "city": "Bekerly, CA"}
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

SEQUENCE BY ts specifies a sequence that can be used to order changes:
• Log sequence number (lsn)
• Timestamp
• Ingestion time
Example record in city_updates: {"id": 1, "ts": 100, "city": "Bekerly, CA"}
APPLY CHANGES INTO for CDC
Maintain an up-to-date replica of a table stored elsewhere
APPLY CHANGES INTO LIVE.cities
FROM STREAM(LIVE.city_updates)
KEYS (id)
SEQUENCE BY ts

Records in city_updates:
{"id": 1, "ts": 100, "city": "Bekerly, CA"}
{"id": 1, "ts": 200, "city": "Berkeley, CA"}

Result in cities: the later change (ts = 200) wins, so the row for id 1 is updated from "Bekerly, CA" to "Berkeley, CA".
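For reference, a Python sketch of the same CDC flow, assuming the dlt Python API's create_streaming_table and apply_changes (city_updates is assumed to be a streaming table defined elsewhere in the pipeline):

import dlt
from pyspark.sql.functions import col

# Target table that APPLY CHANGES keeps up to date.
dlt.create_streaming_table("cities")

# Apply inserts, updates, and deletes keyed by id, ordered by ts.
dlt.apply_changes(
    target="cities",
    source="city_updates",
    keys=["id"],
    sequence_by=col("ts"),
)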
REFERENCE ARCHITECTURE
Change Data Capture (CDC) from RDBMS
A variety of 3rd party tools can provide a streaming change feed
• Amazon RDS → Amazon DMS to S3 → cloud_files → APPLY CHANGES INTO → replicated_table
• MySQL or Postgres → Debezium → APPLY CHANGES INTO → replicated_table
• Oracle → Golden Gate → APPLY CHANGES INTO → replicated_table
What do I no longer need to manage with DLT?
Automated Data Management
DLT automatically optimizes data for performance & ease-of-use
Best Practices
What: DLT encodes Delta best practices automatically when creating DLT tables.
How: DLT sets the following properties:
• optimizeWrite
• autoCompact
• tuneFileSizesForRewrites

Physical Data
What: DLT automatically manages your physical data to minimize cost and optimize performance.
How:
• Runs vacuum daily
• Runs optimize daily
You can still tell us how you want it organized (i.e. ZORDER).

Schema Evolution
What: Schema evolution is handled for you.
How: Modifying a live table transformation to add/remove/rename a column will automatically do the right thing. When removing a column in a streaming live table, old values are preserved.
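For example, a hedged sketch of telling DLT how a table should be organized, assuming the pipelines.autoOptimize.zOrderCols table property (table and column names are illustrative):

import dlt

@dlt.table(
    # Assumption: this table property requests Z-ordering by the listed column(s).
    table_properties={"pipelines.autoOptimize.zOrderCols": "customer_id"}
)
def sales_gold():
    return dlt.read("sales_silver")  # hypothetical upstream live table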
DE 4.1 - Using the Delta Live Tables UI
Deploy a DLT pipeline
Explore the resultant DAG
Execute an update of the pipeline
DE 4.1.1 - Fundamentals of DLT Syntax
Declaring Delta Live Tables
Ingesting data with Auto Loader
Using parameters in DLT Pipelines
Enforcing data quality with constraints
Adding comments to tables
Describing differences in syntax and execution of live tables and streaming live tables
DE 4.1.2 - More DLT SQL Syntax
Processing CDC data with APPLY CHANGES INTO
Declaring live views
Joining live tables
Describing how DLT library notebooks work together in a pipeline
Scheduling multiple notebooks in a DLT pipeline
DE 4.2 - Delta Live Tables: Python vs SQL
Identify key differences between the Python and SQL implementations of Delta Live Tables
DE 4.3 - Exploring the Results of a DLT Pipeline
DE 4.4 - Exploring the Pipeline Event Logs
DE 4.1.3 - Troubleshooting DLT Syntax Lab
Identifying and troubleshooting DLT syntax
Iteratively developing DLT pipelines with notebooks