Data Engineering with Databricks
©2022 Databricks Inc. — All rights reserved
Course Objectives
• Leverage the Databricks Lakehouse Platform to perform core responsibilities for data pipeline development
• Use SQL and Python to write production data pipelines to extract, transform, and load data into tables and views in the lakehouse
• Simplify data ingestion and incremental change propagation using Databricks-native features and syntax
• Orchestrate production pipelines to deliver fresh results for ad-hoc analytics and dashboarding
Course Agenda
• Module 1: Databricks Workspace and Services
• Module 2: Delta Lake
• Module 3: Relational Entities on Databricks
• Module 4: ETL With Spark SQL
• Module 5: OPTIONAL Python for Spark SQL
• Module 6: Incremental Data Processing
• Module 7: Multi-Hop Architecture
• Module 8: Delta Live Tables
• Module 9: Task Orchestration with Jobs
• Module 10: Running a DBSQL Query
• Module 11: Managing Permissions
• Module 12: Productionalizing Dashboards and Queries in DBSQL
The Databricks Lakehouse Platform
©2021 Databricks Inc. — All rights reserved
Using the Databricks Lakehouse Platform
Module 1 Learning Objectives
• Describe the components of the Databricks Lakehouse
• Complete basic code development tasks using services of the Databricks Data Science and Engineering Workspace
• Perform common table operations using Delta Lake in the Lakehouse
Using the Databricks Lakehouse Platform
Module 1 Agenda
• Introduction to the Databricks Lakehouse Platform
• Introduction to the Databricks Workspace and Services
• Using clusters, files, notebooks, and repos
• Introduction to Delta Lake
• Manipulating and optimizing data in Delta tables
Customers
7,000+ lakehouse customers across the globe
One simple platform to unify all of your data, analytics, and AI workloads
Original creators of: (open source project logos)
Supporting enterprises in every industry
Healthcare & Life Sciences, Manufacturing & Automotive, Media & Entertainment, Financial Services, Public Sector, Retail & CPG, Energy & Utilities, Digital Native
Most enterprises struggle with data
Siloed stacks increase data architecture complexity
(diagram: four siloed stacks: data warehousing for analytics and BI over structured data; data engineering with extract, load, transform, and data prep over a data lake; streaming with real-time engines over streaming data sources; and data science and ML over structured, semi-structured, and unstructured data)
Most enterprises struggle with data
Disconnected systems and proprietary data formats make integration difficult
(diagram: the same siloed stacks, each with its own tools: Amazon Redshift, Teradata, Azure Synapse, Google BigQuery, Snowflake, IBM Db2, SAP, and Oracle Autonomous Data Warehouse for warehousing; Hadoop, Apache Airflow, Amazon EMR, Apache Spark, Google Dataproc, and Cloudera for data engineering; Apache Kafka, Apache Flink, Amazon Kinesis, Azure Stream Analytics, Google Dataflow, Tibco Spotfire, and Confluent for streaming; Jupyter, Amazon SageMaker, Azure ML Studio, MATLAB, Domino Data Labs, SAS, TensorFlow, and PyTorch for data science and ML)
Most enterprises struggle with data
Siloed data teams decrease productivity
(diagram: each stack staffed by its own team: data analysts, data engineers, and data scientists working in disconnected systems)
Lakehouse
One platform to unify all of your data, analytics, and AI workloads, combining the data lake and the data warehouse
An open approach to bringing data management and governance to data lakes:
• Better reliability with transactions
• 48x faster data processing with indexing
• Data governance at scale with fine-grained access control lists
The Databricks Lakehouse Platform
✓ Simple ✓ Open ✓ Collaborative
(diagram: platform layers: Data Engineering, BI and SQL Analytics, Data Science and ML, and Real-Time Data Applications on top of Data Management and Governance, an Open Data Lake, and Platform Security & Administration, over unstructured, semi-structured, structured, and streaming data)
The Databricks Lakehouse Platform
✓ Simple: unify your data, analytics, and AI on one common platform for all data use cases
The Databricks Lakehouse Platform
✓ Open: unify your data ecosystem with open source standards and formats. Built on the innovation of some of the most successful open source data projects in the world, with 30 million+ monthly downloads.
The Databricks Lakehouse Platform
✓ Open: 450+ partners across the data landscape
(diagram: the Lakehouse Platform integrating with visual ETL and data ingestion tools such as Azure Data Factory; business intelligence via Azure Synapse, Google BigQuery, and Amazon Redshift; machine learning via Amazon SageMaker, Azure Machine Learning, and Google AI Platform; centralized governance with AWS Glue; plus data providers and top consulting & SI partners)
The Databricks Lakehouse Platform
✓ Collaborative: unify your data teams (data analysts, data engineers, and data scientists) to collaborate across the entire data and AI workflow, sharing datasets, notebooks, dashboards, and models
Databricks Architecture and Services
Databricks Architecture
• Control Plane (in the Databricks cloud account): web application, repos and notebooks, job scheduling, and cluster management
• Data Plane (in the customer cloud account): data processing with Apache Spark clusters, the Databricks File System (DBFS), and connections to data sources
Databricks Services
The Control Plane in Databricks manages customer accounts, datasets, and clusters through its core services: the web application, repos and notebooks, jobs, and cluster management.
Clusters
Clusters run in the Data Plane, in the customer's cloud account, and perform data processing with Apache Spark.
Clusters
Overview
• Clusters are made up of one or more virtual machine (VM) instances
• The driver coordinates the activities of the executors
• Executors run the tasks that compose a Spark job
(diagram: a cluster with a driver and executors, each with cores, memory, and local storage)
Clusters
Types
All-purpose clusters
• Analyze data collaboratively using interactive notebooks
• Create clusters from the Workspace or API
• Retains up to 70 clusters for up to 30 days
Job clusters
• Run automated jobs
• The Databricks job scheduler creates job clusters when running jobs
• Retains up to 30 clusters
Git Versioning with Databricks Repos
Databricks Repos
Overview
• Git versioning: native integration with GitHub, GitLab, Bitbucket, and Azure DevOps; UI-based workflows
• CI/CD integration: API surface to integrate with automation; simplifies the dev/staging/prod multi-workspace story
• Enterprise ready: allow lists to avoid exfiltration; secret detection to avoid leaking keys
Databricks Repos
CI/CD Integration
(diagram: the Repos service in the Databricks Control Plane, alongside the web application, jobs, and cluster management, integrates with Git and CI/CD systems for version, review, and test workflows)
Databricks Repos
Best practices for CI/CD workflows
• Admin workflow: set up top-level Repos folders (example: Production); set up Git automation to update Repos on merge
• User workflow in Databricks: clone the remote repository to a user folder; create a new branch based on the main branch; create and edit code; commit and push to a feature branch
• Merge workflow in the Git provider: open a pull request and run the review process; merge into the main branch; Git automation calls the Databricks Repos API
• Production job workflow in Databricks: an API call brings the Repo in the Production folder to the latest version; run the Databricks job based on the Repo in the Production folder
What is Delta Lake?
Delta Lake is an open-source project that enables building a data lakehouse on top of existing storage systems.
Delta Lake Is Not…
• Proprietary technology
• Storage format
• Storage medium
• Database service or data warehouse
Delta Lake Is…
• Open source
• Builds upon standard data formats
• Optimized for cloud object storage
• Built for scalable metadata handling
Delta Lake brings ACID to object storage
• Atomicity
• Consistency
• Isolation
• Durability
Problems solved by ACID
1. Appending data is hard
2. Modifying existing data is difficult
3. Jobs fail midway
4. Real-time operations are hard
5. Keeping historical data versions is costly
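As a rough illustration, these are the kinds of operations that Delta Lake's ACID guarantees make safe on object storage. A hedged sketch in Spark SQL as run on Databricks; the table and column names are invented:

```sql
-- Atomic append: readers never see a partially written batch.
INSERT INTO events SELECT * FROM events_staging;

-- In-place modification of existing data.
UPDATE events SET status = 'done' WHERE event_id = 42;

-- If a job fails midway, the transaction never commits and
-- the table is left unchanged.

-- Time travel: query an earlier version without keeping manual copies.
SELECT * FROM events VERSION AS OF 3;
```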
Delta Lake is the default for all
tables created in Databricks
ETL with Spark SQL and Python
ETL With Spark SQL and Python
Module 2 Learning Objectives
• Leverage Spark SQL DDL to create and manipulate relational entities on Databricks
• Use Spark SQL to extract, transform, and load data to support production workloads and analytics in the Lakehouse
• Leverage Python for advanced code functionality needed in production applications
ETL With Spark SQL and Python
Module 2 Agenda
• Working with Relational Entities on Databricks
• Managing databases, tables, and views
• ETL with Spark SQL
• Extracting data from external sources, loading and updating data in the lakehouse, and common transformations
• Just Enough Python for Spark SQL
• Building extensible functions with Python-wrapped SQL
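For a flavor of the Python-wrapped SQL pattern, here is a minimal sketch: a plain Python helper that generates a parameterized CTAS statement. The function, table, and column names are invented for illustration; on Databricks the resulting string would be passed to spark.sql:

```python
def create_table_as(table: str, source: str, predicate: str = "true") -> str:
    """Build a CREATE TABLE AS SELECT statement that filters `source`."""
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM {source} WHERE {predicate}"
    )

# In a Databricks notebook this would be executed with:
#   spark.sql(create_table_as("sales_gold", "sales_silver", "region = 'EMEA'"))
stmt = create_table_as("sales_gold", "sales_silver", "region = 'EMEA'")
```

Wrapping SQL generation in a function like this lets one template serve many tables, which is the extensibility the module refers to.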
Incremental Data and Delta Live Tables
Incremental Data and Delta Live Tables
Module 3 Learning Objectives
• Incrementally process data to power analytic insights with Spark Structured Streaming and Auto Loader
• Propagate new data through multiple tables in the data lakehouse
• Leverage Delta Live Tables to simplify productionalizing SQL data pipelines with Databricks
Incremental Data and Delta Live Tables
Module 3 Agenda
• Incremental Data Processing with Structured Streaming and Auto Loader
• Processing and aggregating data incrementally in near real time
• Multi-hop in the Lakehouse
• Propagating changes through a series of tables to drive production systems
• Using Delta Live Tables
• Simplifying deployment of production pipelines and infrastructure using SQL
Multi-Hop Architecture
Multi-Hop in the Lakehouse
(diagram: CSV, JSON, and TXT sources are ingested by Databricks Auto Loader into Bronze, refined into Silver, then aggregated into Gold to power streaming analytics, data quality, AI, and reporting)
Multi-Hop in the Lakehouse
Bronze Layer
• Typically just a raw copy of ingested data
• Replaces the traditional data lake
• Provides efficient storage and querying of the full, unprocessed history of data
Multi-Hop in the Lakehouse
Silver Layer
• Reduces data storage complexity, latency, and redundancy
• Optimizes ETL throughput and analytic query performance
• Preserves the grain of the original data (no aggregations)
• Eliminates duplicate records
• Enforces the production schema
• Runs data quality checks and quarantines corrupt data
Multi-Hop in the Lakehouse
Gold Layer
• Powers ML applications, reporting, dashboards, and ad hoc analytics
• Provides refined views of data, typically with aggregations
• Reduces strain on production systems
• Optimizes query performance for business-critical data
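The bronze, silver, and gold responsibilities above can be sketched in plain Python, with lists of dicts standing in for Delta tables. The records and field names are invented; on Databricks each hop would be a Spark query:

```python
# Bronze: a raw, unprocessed copy of ingested data, history included.
bronze = [
    {"id": 1, "amount": "10", "region": "emea"},
    {"id": 1, "amount": "10", "region": "emea"},   # duplicate record
    {"id": 2, "amount": "oops", "region": "amer"}, # corrupt record
    {"id": 3, "amount": "30", "region": "amer"},
]

# Silver: enforce the schema, quarantine corrupt rows, deduplicate;
# the grain of the original data is preserved (no aggregation).
silver, quarantine, seen = [], [], set()
for row in bronze:
    try:
        clean = {"id": row["id"], "amount": int(row["amount"]),
                 "region": row["region"].upper()}
    except ValueError:
        quarantine.append(row)  # corrupt data is set aside, not dropped
        continue
    if clean["id"] not in seen:
        seen.add(clean["id"])
        silver.append(clean)

# Gold: a business-level aggregate (total amount per region).
gold = {}
for row in silver:
    gold[row["region"]] = gold.get(row["region"], 0) + row["amount"]
```

Each layer only reads from the one before it, which is the dependency structure the medallion architecture relies on.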
Introducing Delta Live Tables
Multi-Hop in the Lakehouse
(diagram: the same flow, with layer roles labeled: Bronze for raw ingestion and history, Silver for filtered, cleaned, and augmented data, Gold for business-level aggregates)
The Reality is Not so Simple
(diagram: many interconnected bronze, silver, and gold tables rather than a single linear pipeline)
Large scale ETL is complex and brittle
• Complex pipeline development: hard to build and maintain table dependencies; difficult to switch between batch and stream processing
• Data quality and governance: difficult to monitor and enforce data quality; impossible to trace data lineage
• Difficult pipeline operations: poor observability at a granular, data level; error handling and recovery is laborious
Introducing Delta Live Tables
Make reliable ETL easy on Delta Lake
• Operate with agility: declarative tools to build batch and streaming data pipelines
• Trust your data: DLT has built-in declarative quality controls; declare quality expectations and the actions to take
• Scale with reliability: easily scale infrastructure alongside your data
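A minimal sketch of what a DLT pipeline's declarative SQL can look like, combining Auto Loader ingestion with a quality expectation. The table names, path, and constraint are invented for illustration:

```sql
-- Bronze: stream raw JSON files in with Auto Loader.
CREATE OR REFRESH STREAMING LIVE TABLE orders_bronze
AS SELECT * FROM cloud_files("/data/orders", "json");

-- Silver: declare a quality expectation and the action to take.
CREATE OR REFRESH STREAMING LIVE TABLE orders_silver (
  CONSTRAINT valid_order EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.orders_bronze);
```

The table definitions state what each table should contain; DLT infers the dependency graph and manages the infrastructure.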
Managing Data Access and Production Pipelines
Managing Data Access and Production Pipelines
Module 4 Learning Objectives
• Orchestrate tasks with Databricks Jobs
• Use Databricks SQL for on-demand queries
• Configure Databricks Access Control Lists to provide groups with secure access to production and development databases
• Configure and schedule dashboards and alerts to reflect updates to production data pipelines
Managing Data Access and Production Pipelines
Module 4 Agenda
• Task Orchestration with Databricks Jobs
• Scheduling notebooks and DLT pipelines with dependencies
• Running Your First Databricks SQL Query
• Navigating, configuring, and executing queries in Databricks SQL
• Managing Permissions in the Lakehouse
• Configuring permissions for databases, tables, and views in the data lakehouse
• Productionalizing Dashboards and Queries in DBSQL
• Scheduling queries, dashboards, and alerts for end-to-end analytic pipelines
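For a flavor of the access control syntax covered in this module, a hedged sketch; the database, table, and group names are invented:

```sql
-- Give the analysts group read access to the production database.
GRANT USAGE ON DATABASE sales_prod TO `analysts`;
GRANT SELECT ON DATABASE sales_prod TO `analysts`;

-- Give engineers write access to a development table.
GRANT MODIFY ON TABLE sales_dev.orders TO `engineers`;
```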
Introducing Unity Catalog
Data Governance Overview
Four key functional areas
• Data access control: control who has access to which data
• Data access audit: capture and record all access to data
• Data lineage: capture upstream sources and downstream consumers
• Data discovery: ability to search for and discover authorized assets
Data Governance Overview
Challenges
(diagram: structured, semi-structured, unstructured, and streaming data spread across multiple clouds, consumed by data analysts, data scientists, data engineers, and machine learning workloads)
Databricks Unity Catalog
Overview
• Unify governance across clouds: fine-grained governance for data lakes across clouds, based on open standard ANSI SQL
• Unify data and AI assets: centrally share, audit, secure, and manage all data types with one simple interface
• Unify existing catalogs: works in concert with existing data, storage, and catalogs; no hard migration required
Databricks Unity Catalog
Three-layer namespace
Traditional two-layer namespace: SELECT * FROM schema.table
Three-layer namespace with Unity Catalog: SELECT * FROM catalog.schema.table
Databricks Unity Catalog
Security Model
Traditional query lifecycle:
1. A query (SELECT * FROM table) is submitted to a cluster or SQL endpoint
2. Grants are checked against the table ACL
3. The table location is looked up in the Hive metastore
4. The path to the table is returned
5. Cloud storage is accessed with cloud-specific credentials
6. Unauthorized data is filtered
Databricks Unity Catalog
Security Model
Query lifecycle with Unity Catalog:
1. A query (SELECT * FROM table) is submitted to a cluster or SQL endpoint
2. The namespace and Delta metadata are checked
3. The access is written to the audit log
4. Grants are checked in Unity Catalog
5. Cloud-specific credentials are obtained
6. URLs and short-lived tokens are returned to the cluster
7. Data is ingested from the array of URLs in cloud storage
8. Unauthorized data is filtered
Course Recap
Course Objectives
• Leverage the Databricks Lakehouse Platform to perform core responsibilities for data pipeline development
• Use SQL and Python to write production data pipelines to extract, transform, and load data into tables and views in the lakehouse
• Simplify data ingestion and incremental change propagation using Databricks-native features and syntax
• Orchestrate production pipelines to deliver fresh results for ad-hoc analytics and dashboarding