Data Engineering with Databricks
Course
Objectives
• Leverage the Databricks Lakehouse Platform to perform core responsibilities for data pipeline development
• Use SQL and Python to write production data pipelines to extract, transform, and load data into tables and views in the lakehouse
• Simplify data ingestion and incremental change propagation using Databricks-native features and syntax
• Orchestrate production pipelines to deliver fresh results for ad-hoc analytics and dashboarding
Course Agenda
Module : Databricks Workspace and Services
Module : Delta Lake
Module : Relational Entities on Databricks
Module : ETL With Spark SQL
Module : OPTIONAL Python for Spark SQL
Module : Incremental Data Processing
Module : Multi-Hop Architecture
Module : Delta Live Tables
Module : Task Orchestration with Jobs
Module : Running a DBSQL Query
Module : Managing Permissions
Module : Productionalizing Dashboards and Queries in DBSQL
Databricks Certified Data Engineer Associate
Certification helps you gain industry recognition, competitive differentiation, greater productivity, and results.
• This course helps you prepare for the Databricks Certified Data Engineer Associate exam
• Please see the Databricks Academy for additional prep materials
For more information visit: databricks.com/learn/certification
© Databricks Inc. — All rights reserved
The
Databricks
Lakehouse
Platform
Using the
Databricks
Lakehouse
Platform
Learning Objectives
• Describe the components of the Databricks Lakehouse
• Complete basic code development tasks using services of the Databricks Data Science and Engineering Workspace
• Perform common table operations using Delta Lake in the Lakehouse
Using the
Databricks
Lakehouse
Platform
Agenda
• Introduction to the Databricks Lakehouse Platform
• Introduction to the Databricks Workspace and Services
  • Using clusters, files, notebooks, and repos
• Introduction to Delta Lake
  • Manipulating and optimizing data in Delta tables
Lakehouse
One simple platform to unify all of your data, analytics, and AI workloads
7000+ customers across the globe
Original creators of:
Supporting enterprises in every industry
Healthcare & Life Sciences
Public Sector
Manufacturing & Automotive
Retail & CPG
Media & Entertainment
Energy & Utilities
Financial Services
Digital Native
Most enterprises struggle with data
Four separate stacks: Data Warehousing, Data Engineering, Streaming, Data Science and ML
• Siloed stacks increase data architecture complexity
  Data warehouse: structured data; extract, transform, load; data marts; analytics and BI
  Data lake: structured, semi-structured and unstructured data; data prep; data science and machine learning
  Streaming data sources: real-time database, streaming data engine
• Disconnected systems and proprietary data formats make integration difficult
  Data Warehousing: Amazon Redshift, Azure Synapse, Snowflake, SAP, Teradata, Google BigQuery, IBM Db, Oracle Autonomous Data Warehouse
  Data Engineering: Hadoop, Amazon EMR, Google Dataproc, Apache Airflow, Apache Spark, Cloudera
  Streaming: Apache Kafka, Apache Flink, Azure Stream Analytics, Tibco Spotfire, Apache Spark, Amazon Kinesis, Google Dataflow, Confluent
  Data Science and ML: Jupyter, Azure ML Studio, Domino Data Labs, TensorFlow, Amazon SageMaker, MATLAB, SAS, PyTorch
• Siloed data teams decrease productivity
  Data Analysts, Data Engineers, Data Scientists
Data Lake
Data Warehouse
Lakehouse
One platform to unify all of your data,
analytics, and AI workloads
Data Lake
Data Warehouse
An open approach to bringing data management and governance to data lakes
• Better reliability with transactions
• 48x faster data processing with indexing
• Data governance at scale with fine-grained access control lists
The
Databricks
Lakehouse
Platform
Data Engineering
BI and SQL Analytics
Data Science and ML
Real-Time Data Applications
✓ Simple
✓ Open
✓ Collaborative
Databricks Lakehouse Platform
Data Management and Governance
Open Data Lake
Platform Security & Administration
Unstructured, semi-structured, structured, and
streaming data
The
Databricks
Lakehouse
Platform
Databricks Lakehouse Platform
Data Management and Governance
Open Data Lake
Platform Security & Administration
Unstructured, semi-structured, structured, and
streaming data
✓ Simple
Unify your data, analytics, and AI on one common platform for all data use cases
Data Engineering
BI and SQL Analytics
Data Science and ML
Real-Time Data Applications
The
Databricks
Lakehouse
Platform
✓ Open
Unify your data ecosystem with open source standards and formats.
Built on the innovation of some of the most successful open source data projects in the world
30 Million+ monthly downloads
The
Databricks
Lakehouse
Platform
✓ Open
Unify your data ecosystem with open source standards and formats.
450+ partners across the data landscape
Diagram labels: Lakehouse Platform, Visual ETL & Data Ingestion, Business Intelligence, Machine Learning, Data Providers, Centralized Governance, Top Consulting & SI Partners
Named services: Azure Data Factory, AWS Glue, Azure Synapse, Google BigQuery, Amazon Redshift, Amazon SageMaker, Azure Machine Learning, Google AI Platform
The
Databricks
Lakehouse
Platform
✓ Collaborative
Unify your data teams to collaborate across the entire data and AI workflow
Teams: Data Analysts, Data Engineers, Data Scientists
Shared assets: Datasets, Notebooks, Dashboards, Models
Databricks Architecture and Services
Databricks
Architecture
Control Plane (Databricks Cloud Account)
• Web Application
• Repos / Notebooks
• Job Scheduling
• Cluster Management
Data Plane (Customer Cloud Account)
• Data processing with Apache Spark Clusters
• Data Sources
• Databricks File System (DBFS)
Databricks
Services
Control Plane in Databricks
Manage customer accounts,
datasets, and clusters
• Databricks Web Application
• Cluster Management
• Clusters
Clusters
Overview
• Clusters are made up of one or more virtual machine (VM) instances
• Driver coordinates activities of executors
• Executors run tasks composing a Spark job
CLUSTER
Driver
Executor: Core, Core, Memory, Local Storage
Executor: Core, Core, Memory, Local Storage
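The driver and executor roles can be pictured with a toy analogy in plain Python. This is not Spark code: the `run_job` function, the thread pool standing in for executors, and the partition count are all assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(numbers, num_executors=2, partitions=4):
    """Toy 'driver': split the data into tasks, hand them to workers, combine results."""
    # Each chunk plays the role of one task in the job.
    chunks = [numbers[i::partitions] for i in range(partitions)]
    # The thread pool stands in for the executors that run the tasks.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_sums = list(executors.map(sum, chunks))
    # The driver combines the partial results into the final answer.
    return sum(partial_sums)


total = run_job(list(range(1, 101)))
print(total)
```

In a real cluster the executors are separate processes on separate VMs, and Spark decides how the work is partitioned.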
Clusters
Types
All-purpose Clusters
Analyze data collaboratively
using interactive notebooks
Create clusters from the
Workspace or API
Retains up to clusters for
up to days.
Job Clusters
Run automated jobs
The Databricks job scheduler
creates job clusters when
running jobs.
Retains up to clusters.
Git
Versioning
with
Databricks
Repos
Databricks
Repos
Overview
Git Versioning
• Native integration with GitHub, GitLab, Bitbucket and Azure DevOps
• UI-based workflows
CI/CD Integration
• API surface to integrate with automation
• Simplifies the dev/staging/prod multi-workspace story
Enterprise ready
• Allow lists to avoid exfiltration
• Secret detection to avoid leaking keys
Databricks
Repos
CI/CD Integration
Control Plane in Databricks: manage customer accounts, datasets, and clusters
Repos / Notebooks, Jobs, Repos Service
Git and CI/CD Systems: Version, Review, Test
Databricks
Repos
Best practices for CI/CD workflows
Admin workflow
• Set up top-level Repos folders (example: Production)
• Set up Git automation to update Repos on merge
User workflow in Databricks
• Clone remote repository to user folder
• Create new branch based on main branch
• Create and edit code
• Commit and push to feature branch
Merge workflow in Git provider
• Pull request and review process
• Merge into main branch
• Git automation calls Databricks Repos API
Production job workflow in Databricks
• API call brings Repo in Production folder to latest version
• Run Databricks job based on Repo in Production folder
Legend: Steps in Databricks, Steps in your Git provider
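The "Git automation calls Databricks Repos API" step can be sketched as a single HTTP request. This is a minimal sketch, assuming the Repos API update endpoint (`PATCH /api/2.0/repos/{repo_id}` with a `branch` field); the workspace URL, token, and repo ID below are hypothetical placeholders, and the request is only built, not sent.

```python
import json
import urllib.request


def build_repo_update_request(host, token, repo_id, branch="main"):
    """Build (but do not send) a request asking a Repo to check out the latest `branch`."""
    url = f"{host.rstrip('/')}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values: replace with a real workspace URL, token, and repo ID.
request = build_repo_update_request(
    "https://example.cloud.databricks.com", "dapi-EXAMPLE", 123
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.get_method(), request.full_url)
```

A CI system would typically run something like this after a merge into the main branch, so the Repo in the Production folder tracks the merged code.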
What is Delta Lake?
Delta Lake is an open-source project that enables building a data lakehouse on top of existing storage systems
Delta Lake Is Not...
▪ Proprietary technology
▪ Storage format
▪ Storage medium
▪ Database service or data warehouse
Delta Lake Is...
▪ Open source
▪ Builds upon standard data formats
▪ Optimized for cloud object storage
▪ Built for scalable metadata handling
Delta Lake brings ACID to object storage
▪ Atomicity
▪ Consistency
▪ Isolation
▪ Durability
Problems solved by ACID
▪ Hard to append data
▪ Modification of existing data difficult
▪ Jobs failing midway
▪ Real-time operations hard
▪ Costly to keep historical data versions
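The "jobs failing midway" problem can be illustrated with a toy commit log in plain Python. This is not Delta Lake's implementation (Delta keeps its own transaction log in a `_delta_log` directory); the `_log` directory, file names, and helper functions here are hypothetical. The idea shown is that readers only see data files listed in a committed log entry, so files from a job that never committed stay invisible.

```python
import json
import os
import tempfile


def write_data_file(table_dir, name, rows):
    """Write a data file; it is not visible to readers until committed."""
    with open(os.path.join(table_dir, name), "w") as f:
        json.dump(rows, f)
    return name


def commit(table_dir, version, added_files):
    """Record a commit; mode 'x' fails if this version already exists."""
    log_dir = os.path.join(table_dir, "_log")
    os.makedirs(log_dir, exist_ok=True)
    with open(os.path.join(log_dir, f"{version:020d}.json"), "x") as f:
        json.dump({"add": added_files}, f)


def visible_files(table_dir):
    """Readers list only the files named in committed log entries."""
    log_dir = os.path.join(table_dir, "_log")
    files = []
    if os.path.isdir(log_dir):
        for entry in sorted(os.listdir(log_dir)):
            with open(os.path.join(log_dir, entry)) as f:
                files.extend(json.load(f)["add"])
    return files


table = tempfile.mkdtemp()
committed = write_data_file(table, "part-0.json", [{"id": 1}])
commit(table, 0, [committed])
# A second job writes a file but "fails" before committing.
write_data_file(table, "part-1.json", [{"id": 2}])
print(visible_files(table))
```

Because each version number can be claimed only once, two writers cannot both commit the same version, which hints at how a log can also provide isolation.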
Delta Lake is the default for all tables created in Databricks
ETL with Spark SQL and Python
ETL With Spark
SQL and
Python
Learning Objectives
• Leverage Spark SQL DDL to create and manipulate relational entities on Databricks
• Use Spark SQL to extract, transform, and load data to support production workloads and analytics in the Lakehouse
• Leverage Python for advanced code functionality needed in production applications
ETL With Spark
SQL and
Python
Agenda
• Working with Relational Entities on Databricks
  • Managing databases, tables, and views
• ETL with Spark SQL
  • Extracting data from external sources, loading and updating data in the lakehouse, and common transformations
• Just Enough Python for Spark SQL
  • Building extensible functions with Python-wrapped SQL
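One way to read "Python-wrapped SQL" is a small Python function that builds a SQL string from parameters. The sketch below is an illustration under that assumption; the function name, the table `sales.orders`, and the filter are hypothetical. It only builds the string, so it runs without a Spark session.

```python
def count_rows_query(table, where=None):
    """Return a SQL string that counts rows in `table`, optionally filtered."""
    # Basic guard: every dot-separated part must look like a plain identifier.
    if not all(part.isidentifier() for part in table.split(".")):
        raise ValueError(f"unexpected table name: {table!r}")
    query = f"SELECT count(*) AS n FROM {table}"
    if where:
        query += f" WHERE {where}"
    return query


query = count_rows_query("sales.orders", where="order_date >= '2022-01-01'")
print(query)
# In a Databricks notebook the string could then be run with spark.sql(query).
```

Wrapping the query in a function makes it reusable across tables; the `where` text is inserted as-is, so it should come from trusted code rather than user input.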
Incremental Data and Delta Live Tables
Incremental
Data and Delta
Live Tables
Learning Objectives
• Incrementally process data to power analytic insights with Spark Structured Streaming and Auto Loader
• Propagate new data through multiple tables in the data lakehouse
• Leverage Delta Live Tables to simplify productionalizing SQL data pipelines with Databricks
Incremental
Data and Delta
Live Tables
Agenda
• Incremental Data Processing with Structured Streaming and Auto Loader
  • Processing and aggregating data incrementally in near real time
• Multi-hop in the Lakehouse
  • Propagating changes through a series of tables to drive production systems
• Using Delta Live Tables
  • Simplifying deployment of production pipelines and infrastructure using SQL
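The core idea behind incremental processing, handling only data that has not been seen before, can be imitated with a toy checkpoint in plain Python. This is an analogy, not how Auto Loader or Structured Streaming are implemented; the function name and file names are hypothetical.

```python
def process_new_files(available_files, checkpoint):
    """Return only files not yet recorded in `checkpoint` (a set), then record them."""
    new_files = sorted(set(available_files) - checkpoint)
    checkpoint.update(new_files)
    return new_files


checkpoint = set()
first_run = process_new_files(["a.json", "b.json"], checkpoint)
second_run = process_new_files(["a.json", "b.json", "c.json"], checkpoint)
print(first_run, second_run)
```

The second run picks up only the file that arrived after the first run, which is what keeps repeated runs cheap as the source directory grows.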
Multi-hop Architecture
Multi-Hop in
the Lakehouse
Sources: CSV, JSON, TXT (Databricks Auto Loader)
Bronze, Silver, Gold (Data quality)
Outputs: Streaming analytics, AI and reporting
Multi-Hop in
the Lakehouse
Bronze Layer
• Typically just a raw copy of ingested data
• Replaces traditional data lake
• Provides efficient storage and querying of full, unprocessed history of data
Multi-Hop in
the Lakehouse
Silver Layer
• Reduces data storage complexity, latency, and redundancy
• Optimizes ETL throughput and analytic query performance
• Preserves grain of original data (without aggregations)
• Eliminates duplicate records
• Production schema enforced
• Data quality checks, corrupt data quarantined
Multi-Hop in
the Lakehouse
Gold Layer
• Powers ML applications, reporting, dashboards, ad hoc analytics
• Refined views of data, typically with aggregations
• Reduces strain on production systems
• Optimizes query performance for business-critical data
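The three layers can be walked through with a tiny in-memory example. This is a sketch in plain Python rather than Spark or Delta tables; the order records, field names, and helper functions are hypothetical. Bronze holds raw rows, silver removes duplicates and quarantines corrupt rows while keeping the original grain, and gold aggregates.

```python
from collections import defaultdict

# Bronze: raw copy of ingested data, including a duplicate and a corrupt row.
bronze = [
    {"order_id": 1, "customer": "a", "amount": "10.0"},
    {"order_id": 1, "customer": "a", "amount": "10.0"},
    {"order_id": 2, "customer": "b", "amount": "oops"},
    {"order_id": 3, "customer": "a", "amount": "5.5"},
]


def to_silver(rows):
    """Enforce types, quarantine corrupt rows, and drop duplicate order IDs."""
    silver, quarantine, seen = [], [], set()
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            quarantine.append(row)
            continue
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({**row, "amount": amount})
    return silver, quarantine


def to_gold(silver_rows):
    """Aggregate silver rows into a business-level total per customer."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver, quarantine = to_silver(bronze)
gold = to_gold(silver)
print(silver, quarantine, gold)
```

Keeping the quarantined rows instead of discarding them means bad records can be inspected and replayed later from the bronze history.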
Introducing Delta Live Tables
Multi-Hop in
the Lakehouse
Sources: CSV, JSON, TXT (Databricks Auto Loader)
Bronze: Raw Ingestion and History
Silver: Filtered, Cleaned, Augmented
Gold: Business-level Aggregates
Data quality
Outputs: Streaming analytics, AI and reporting
The Reality is
Not so Simple
Bronze Silver Gold
Large scale ETL is complex and brittle
Complex pipeline development
• Hard to build and maintain table dependencies
• Difficult to switch between batch and stream processing
Data quality and governance
• Difficult to monitor and enforce data quality
• Impossible to trace data lineage
Difficult pipeline operations
• Poor observability at granular, data level
• Error handling and recovery is laborious
Introducing Delta Live Tables
Make reliable ETL easy on Delta Lake
Operate with agility
• Declarative tools to build batch and streaming data pipelines
Trust your data
• DLT has built-in declarative quality controls
• Declare quality expectations and actions to take
Scale with reliability
• Easily scale infrastructure alongside your data
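"Declare quality expectations and actions to take" can be pictured with a toy checker in plain Python. This is not the DLT API; the `apply_expectation` function and the action names `warn`, `drop`, and `fail` are assumptions chosen to mirror the idea of keeping, dropping, or failing on rows that violate a rule.

```python
def apply_expectation(rows, name, predicate, on_violation="warn"):
    """Return (kept_rows, violation_count) for one named expectation."""
    if on_violation not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown action: {on_violation!r}")
    kept, violations = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        violations += 1
        if on_violation == "fail":
            raise ValueError(f"expectation {name!r} violated by {row}")
        if on_violation == "warn":
            kept.append(row)  # keep the row, only count the violation
        # on "drop" the row is simply left out
    return kept, violations


rows = [{"id": 1, "price": 9.5}, {"id": 2, "price": -1.0}]
kept, violations = apply_expectation(
    rows, "valid_price", lambda r: r["price"] >= 0, on_violation="drop"
)
print(kept, violations)
```

Declaring the rule once, next to the table it protects, is what lets a pipeline report violation counts without hand-written checks in every job.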
Managing Data Access and Production Pipelines
Managing Data
Access and
Production
Pipelines
Learning Objectives
• Orchestrate tasks with Databricks Jobs
• Use Databricks SQL for on-demand queries
• Configure Databricks Access Control Lists to provide groups with secure access to production and development databases
• Configure and schedule dashboards and alerts to reflect updates to production data pipelines
Managing Data
Access and
Production
Pipelines
Agenda
• Task Orchestration with Databricks Jobs
  • Scheduling notebooks and DLT pipelines with dependencies
• Running Your First Databricks SQL Query
  • Navigating, configuring, and executing queries in Databricks SQL
• Managing Permissions in the Lakehouse
  • Configuring permissions for databases, tables, and views in the data lakehouse
• Productionalizing Dashboards and Queries in DBSQL
  • Scheduling queries, dashboards, and alerts for end-to-end analytic pipelines
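Scheduling tasks "with dependencies" amounts to running each task only after the tasks it depends on. The sketch below uses Python's standard `graphlib` module to compute such an order; it is not the Databricks Jobs API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "ingest_raw": set(),
    "run_dlt_pipeline": {"ingest_raw"},
    "refresh_dashboard": {"run_dlt_pipeline"},
    "send_report": {"run_dlt_pipeline"},
}

run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The two downstream tasks have no dependency on each other, so a scheduler would be free to run them in parallel once the pipeline task completes.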
Introducing Unity Catalog
Data
Governance
Overview
Four key functional areas
• Data Access Control: control who has access to which data
• Data Access Audit: capture and record all access to data
• Data Lineage: capture upstream sources and downstream consumers
• Data Discovery: ability to search for and discover authorized assets
Data
Governance
Overview
Challenges
Data: Structured, Semi-structured, Unstructured, Streaming
Teams and workloads: Data Analysts, Data Scientists, Data Engineers, Machine Learning
Spread across multiple clouds
Databricks
Unity Catalog
Overview
Unify governance across clouds
• Fine-grained governance for data lakes across clouds, based on open standard ANSI SQL.
Unify data and AI assets
• Centrally share, audit, secure and manage all data types with one simple interface.
Unify existing catalogs
• Works in concert with existing data, storage, and catalogs; no hard migration required.
Databricks
Unity Catalog
Three-layer namespace
Traditional two-layer namespace
SELECT * FROM schema.table
Three-layer namespace with Unity Catalog
SELECT * FROM catalog.schema.table
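The difference between two-layer and three-layer names can be shown with a small parser in plain Python. This is an illustration only: the `qualify` function and the default catalog name `main` are assumptions of this sketch, not Unity Catalog behavior.

```python
def qualify(table_name, default_catalog="main"):
    """Split a dotted table name into catalog, schema, and table parts."""
    parts = table_name.split(".")
    if len(parts) == 2:
        # Two-layer name: fill in the (assumed) default catalog.
        parts = [default_catalog, *parts]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected schema.table or catalog.schema.table: {table_name!r}")
    return dict(zip(("catalog", "schema", "table"), parts))


two_layer = qualify("sales.orders")
three_layer = qualify("prod.sales.orders")
print(two_layer, three_layer)
```

The extra catalog level lets the same schema and table names exist side by side, for example in separate development and production catalogs.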
Databricks
Unity Catalog
Security Model
Traditional Query Lifecycle
Components: Workspace, Cluster or SQL Endpoint, Table ACL, Hive Metastore, Cloud Storage
Query: SELECT * FROM table
Steps labeled on the diagram:
• Submit query
• Filter unauthorized data
• Check grants
• Lookup location
• Return path to table
• Cloud-specific credentials
Databricks
Unity Catalog
Security Model
Query Life Cycle with Unity Catalog
Components: Workspace, Cluster or SQL Endpoint, Unity Catalog, Audit Log, Cloud Storage
Query: SELECT * FROM table
Steps labeled on the diagram:
• Submit query
• Filter unauthorized data
• Check namespace
• URLs and short-lived tokens
• Write to log
• Check grants and Delta metadata
• Ingest from array of URLs
• Cloud-specific credentials
Course
Recap
Course
Objectives
• Leverage the Databricks Lakehouse Platform to perform core responsibilities for data pipeline development
• Use SQL and Python to write production data pipelines to extract, transform, and load data into tables and views in the lakehouse
• Simplify data ingestion and incremental change propagation using Databricks-native features and syntax
• Orchestrate production pipelines to deliver fresh results for ad-hoc analytics and dashboarding