Welcome
Advanced Data Engineering
with Databricks
©2022 Databricks Inc. — All rights reserved
Learning Objectives
1. Design databases and pipelines optimized for the Databricks Lakehouse
Platform.
2. Implement efficient incremental data processing to validate and enrich data
driving business decisions and applications.
3. Leverage Databricks-native features for managing access to sensitive data
and fulfilling right-to-be-forgotten requests.
4. Manage error troubleshooting, code promotion, task orchestration, and
production job monitoring using Databricks tools.
©2022 Databricks Inc. — All rights reserved
Course Tools
Lecture TA Help + Discussion Lab Notebooks
Breakout Resources Solutions
Rooms
©2022 Databricks Inc. — All rights reserved
Modules
1. Architecting for the Lakehouse
2. Bronze Ingestion Patterns
3. Promoting to Silver
4. Gold Query Layer
5. Storing Data Securely
6. Propagating Updates and Deletes
7. Orchestration and Scheduling
©2022 Databricks Inc. — All rights reserved
Welcome!
Your Instructor - Your Name
•
PHOTO
•
/in/profile
©2022 Databricks Inc. — All rights reserved
Welcome!
Let’s get to know you
• Name
• Role and team
• Length of experience with Spark and Databricks
• Motivation for attending
• Fun fact or favorite mobile app
©2022 Databricks Inc. — All rights reserved
Course Objective
Design and implement a
multi-pipeline multi-hop architecture
to enable the Lakehouse paradigm.
©2022 Databricks Inc. — All rights reserved
Our Company
moovio
©2022 Databricks Inc. — All rights reserved
Architecting for the
Lakehouse
Adopting the Lakehouse Architecture
Lakehouse Medallion Architecture
Streaming Design Patterns
©2022 Databricks Inc. — All rights reserved
Adopting the
Lakehouse
Architecture
©2022 Databricks Inc. — All rights reserved
Lakehouse
One platform to unify all of
Data your data, analytics, and AI Data
Lake workloads Warehouse
©2022 Databricks Inc. — All rights reserved
The Databricks Lakehouse Platform
Databricks Lakehouse Platform
Data BI and SQL Data Science Real-Time Data
Engineering Analytics and ML Applications
Data Management and Governance
✓ Simple
Open Data Lake
✓ Open
Platform Security & Administration
✓ Collaborative
Unstructured, semi-structured, structured, and streaming data
©2022 Databricks Inc. — All rights reserved
An open approach to bringing
data management and
governance to data lakes
Data Better reliability with transactions Data
Lake 48x faster data processing with indexing Warehouse
Data governance at scale with
fine-grained access control lists
©2022 Databricks Inc. — All rights reserved
Delta Lake brings ACID to object storage
▪ Atomicity
▪ Consistency
▪ Isolation
▪ Durability
©2022 Databricks Inc. — All rights reserved
Delta Lake provides ACID
guarantees scoped to tables
©2022 Databricks Inc. — All rights reserved
Problems solved by ACID transactions in Delta
1. Hard to append data
2. Modification of existing data difficult
3. Jobs failing mid way
4. Real-time operations hard
5. Costly to keep historical data versions
©2022 Databricks Inc. — All rights reserved
The Lakehouse
Medallion Architecture
©2022 Databricks Inc. — All rights reserved
Multi-hop Pipeline
Source:
Files or integrated systems
Bronze:
Raw data and metadata
Silver:
Validated data with atomic
grain
Gold:
Refined, aggregated data
©2022 Databricks Inc. — All rights reserved
Bronze Layer
©2022 Databricks Inc. — All rights reserved
Why is the Bronze Layer Important?
• Bronze layer replaces the traditional data lake
• Represents the full, unprocessed history of the data
• Captures the provenance (what, when, and from where) of data loaded into
the lakehouse
• Data is stored efficiently using Delta Lake
• If downstream layers discover later they need to ingest more, they can come
back to the Bronze source to obtain it.
©2022 Databricks Inc. — All rights reserved
Bronze Layer Guiding Principles
• The goal of this layer is data capture and provenance:
• Capture exactly what is ingested, without parsing or change.
• Typically a Delta Lake table with these fields in each row:
• Date received/ingested
• Data source (filename, external system, etc)
• Text field with raw unparsed JSON, CSV, or other data
• Other metadata
• Should be append only (batch or streaming)
• Plan ahead if data must be deleted for regulatory purposes
©2022 Databricks Inc. — All rights reserved
Processing Deletes
• Retain all records when possible
• Soft-deletes if necessary
• Hard-deletes may be required by regulatory processes
©2022 Databricks Inc. — All rights reserved
Silver Layer
©2022 Databricks Inc. — All rights reserved
Why is the Silver Layer important?
• Easier to query than the non-curated Bronze “data lake”
• Data is clean
• Transactions have ACID guarantees
• Represents the “Enterprise Data Model”
• Captures the full history of business action modeled
• Each record processed is preserved
• All records can be efficiently queried
• Reduces data storage complexity, latency, and redundancy
• Built for both ETL throughput AND analytic query performance
©2022 Databricks Inc. — All rights reserved
Silver Layer Guiding Principles
• Uses Delta Lake tables (with SQL table names)
• Preserves grain of original data (no aggregation)
• Eliminates duplicate records
• Production schema enforced
• Data quality checks passed
• Corrupt data quarantined
• Data stored to support production workloads
• Optimized for long-term retention and ad-hoc queries
©2022 Databricks Inc. — All rights reserved
Gold Layer
©2022 Databricks Inc. — All rights reserved
Why is the Gold Layer important?
• Powers ML applications, reporting, dashboards, ad hoc analytics
• Reduces costs associated with ad hoc queries on silver tables
• Allows fine grained permissions
• Reduces strain on production systems
• Shifts query updates to production workloads
©2022 Databricks Inc. — All rights reserved
Notebook
Streaming Design
Patterns
©2022 Databricks Inc. — All rights reserved
Bronze Ingestion
Patterns
Bronze Ingestion Patterns
Auto Load to Multiplex Bronze
Streaming from Multiplex Bronze
©2022 Databricks Inc. — All rights reserved
Bronze Ingestion
Patterns
©2022 Databricks Inc. — All rights reserved
Singleplex Ingestion
©2022 Databricks Inc. — All rights reserved
Multiplex Ingestion
©2022 Databricks Inc. — All rights reserved
Don't Use Kafka as Bronze
• Data retention limited by Kafka;
expensive to keep full history
• All processing happens on ingest
• If stream gets too far behind,
data is lost
• Cannot recover data (no history
to replay)
©2022 Databricks Inc. — All rights reserved
Delta Lake Bronze
©2022 Databricks Inc. — All rights reserved
Notebook
Auto Load to
Multiplex Bronze
©2022 Databricks Inc. — All rights reserved
Notebook
Streaming from
Multiplex Bronze
©2022 Databricks Inc. — All rights reserved
Promoting to Silver
Streaming Deduplication
Quality Enforcement
Slowly Changing Dimensions
Streaming Joins and Statefulness
©2022 Databricks Inc. — All rights reserved
Promoting Bronze to
Silver
©2022 Databricks Inc. — All rights reserved
Silver Layer Objectives
• Validate data quality and schema
• Enrich and transform data
• Optimize data layout and storage for downstream queries
• Provide single source of truth for analytics
©2022 Databricks Inc. — All rights reserved
Schema Enforcement & Evolution
• Enforcement prevents bad records from entering table
• Mismatch in type or field name
• Evolution allows new fields to be added
• Useful when schema changes in production/new fields added to nested data
• Cannot use evolution to remove fields
• All previous records will show newly added field as Null
• For previously written records, the underlying file isn’t modified.
• The additional field is simply defined in the metadata and dynamically read as null
©2022 Databricks Inc. — All rights reserved
Delta Lake Constraints
• Check NOT NULL or arbitrary boolean condition
• Throws exception on failure
ALTER TABLE tableName ADD CONSTRAINT constraintName
CHECK heartRate >= 0;
©2022 Databricks Inc. — All rights reserved
Alternative Quality Check Approaches
• Add a “validation” field that captures any validation errors and a null value
means validation passed.
• Quarantine data by filtering non-compliant data to alternate location
• Warn without failing by writing additional fields with constraint check results
to Delta tables
©2022 Databricks Inc. — All rights reserved
Notebook
Streaming
Deduplication
©2022 Databricks Inc. — All rights reserved
Notebook
Quality
Enforcement
©2022 Databricks Inc. — All rights reserved
Lab
Promoting to Silver
©2022 Databricks Inc. — All rights reserved
Slowly Changing
Dimensions in the
Lakehouse
©2022 Databricks Inc. — All rights reserved
Fact Tables as Incremental Data
• Often is a time series
• No intermediate aggregations
• No overwrite/update/delete operations
• Append-only operations
©2022 Databricks Inc. — All rights reserved
Using Dimension Tables in Incremental Updates
• Delta Lake enables stream-static
joins
• Each micro-batch captures the
most recent state of joined Delta
table
• Allows modification of dimension
while maintaining downstream
composability
©2022 Databricks Inc. — All rights reserved
Slowly Changing Dimensions (SCD)
• Type 0: No changes allowed (static/append only)
E.g. static lookup table
• Type 1: Overwrite (no history retained)
E.g. do not care about historic comparisons other than quite recent
(use Delta Time Travel)
• Type 2: Adding a new row for each change and marking the old as
obsolete
E.g. Able to record product price changes over time, integral to
business logic.
©2022 Databricks Inc. — All rights reserved
Type 0 and Type 1
user_id street name
1 123 Oak Ave Sam
2 99 Jump St Abhi
3 1000 Rodeo Dr Kasey
©2022 Databricks Inc. — All rights reserved
Type 2
user_id street name valid_from current
1 123 Oak Ave Sam 2020-01-01 true
2 99 Jump St Abhi 2020-01-01 false
3 1000 Rodeo Dr Kasey 2020-01-01 false
2 430 River Rd Abhi 2021-10-10 true
3 1000 Rodeo Dr Casey 2021-10-10 true
©2022 Databricks Inc. — All rights reserved
Applying SCD Principles to Facts
• Fact table usually append-only (Type 0)
• Can leverage event and processing times for append-only history
order_id user_id occurred_at action processed_time
123 1 2021-10-01 10:05:00 ORDER_CANCELLED 2021-10-01 10:05:30
123 1 2021-10-01 10:00:00 ORDER_PLACED 2021-10-01 10:06:30
©2022 Databricks Inc. — All rights reserved
Notebook
Type 2 SCD
©2022 Databricks Inc. — All rights reserved
Streaming Joins and
Statefulness
©2022 Databricks Inc. — All rights reserved
The Components of a Stateful Stream
windowedDF =
(eventsDF
.groupBy(window("eventTime",
"10 minutes",
"5 minutes"))
.count()
.writeStream
.trigger(processingTime=“5 minutes”)
)
©2022 Databricks Inc. — All rights reserved
Output Modes
Mode When Stateful Results Materialize
Append (default) Only materialize after watermark +
lateness passed
Complete Materialize every trigger, outputs
complete table
Update Materialize every trigger, outputs
only new values
©2022 Databricks Inc. — All rights reserved
Statefulness vs. Query Progress
• Many operations as specifically stateful (stream-stream joins, deduplication,
aggregation)
• Some operations just need to store incremental query progress and are not
stateful (appends with simple transformations, stream-static joins, merge)
• Progress and state are stored in checkpoints and managed by driver during
query processing
©2022 Databricks Inc. — All rights reserved
Managing Stream Parameters
GOAL: Balance parameters for sustainable, optimized throughput
• Input Parameters
• Control amount of data in each micro-batch
• State Parameters
• Control amount of data required to calculate query results
• Output Parameters
• Control number and size of files written
©2022 Databricks Inc. — All rights reserved
Reasoning about Stream Dimensions
RECORD 1
STRUCTURED
Input RECORD 2
...
STREAMING
RECORD N
State size m
(records to be
compared against)
RECORD 1
State RECORD 2
...
RECORD M
Input size n
(records in
mini-batch)
©2022 Databricks Inc. — All rights reserved
Input Parameters
©2022 Databricks Inc. — All rights reserved
Limiting the input dimension
RECORD 1
STRUCTURED
RECORD 2 Limit n in O(n⨯m)
STREAMING
...
RECORD N
Input
State size m
(records to be
RECORD 1 compared against)
RECORD 2
...
RECORD M
State
Input size n
(records in
mini-batch)
©2022 Databricks Inc. — All rights reserved
Why are input parameters important?
▪ Allows you to control the mini-batch size
▪ Defaults are large
▪ Delta Lake: 1000 files per micro-batch
▪ Pub/Sub & files: No limit to input batch size
▪ Optimal mini-batch size → Optimal cluster usage
▪ Suboptimal mini-batch size → performance cliff
▪ Shuffle Spill
©2022 Databricks Inc. — All rights reserved
Per Trigger Settings
• File Source
• maxFilesPerTrigger
• Delta Lake and Auto Loader
• maxFilesPerTrigger
• maxBytesPerTrigger
• Kafka
• maxOffsetsPerTrigger
©2022 Databricks Inc. — All rights reserved
Shuffle Partitions with Structured Streaming
• Should match the number of cores in the largest cluster size that might be
used in production
• Number of shuffle partitions == max parallelism
• Cannot be changed without new checkpoint
• Will lose query progress and state information
• Higher shuffle partitions == more files written
• Best practice: use Delta Live Tables for streaming jobs with variable volume
©2022 Databricks Inc. — All rights reserved
Tuning maxFilesPerTrigger
Base it on shuffle partition size
• Rule of thumb 1: Optimal shuffle partition size ~100-200 MB
• Rule of thumb 2: Set shuffle partitions equal to # of cores
• Use Spark UI to tune maxFilesPerTrigger until you get ~100-200 MB per
partition
• Note: Size on disk is not a good proxy for size in memory
• Reason is that file size is different from the size in cluster memory
©2022 Databricks Inc. — All rights reserved
State Parameters
©2022 Databricks Inc. — All rights reserved
Limiting the state dimension
Limit m in O(n⨯m)
RECORD 1
STRUCTURED
Input RECORD 2
...
STREAMING
RECORD N
State size m
(records to be
compared against)
RECORD 1
RECORD 2
...
RECORD M
State
Input size n
(records in
mini-batch)
©2022 Databricks Inc. — All rights reserved
Limiting the state dimension
RECORD 1
STRUCTURED
Input RECORD 2
...
STREAMING
RECORD N
▪ State Store backed operations
RECORD 1
RECORD 2
▪ Stateful (windowed) aggregations
...
RECORD M ▪ Drop duplicates
State ▪ Stream-Stream Joins
▪ Delta Lake table or external system
▪ Stream-Static Join / Merge
©2022 Databricks Inc. — All rights reserved
Why are state parameters important?
▪ Optimal parameters → Optimal cluster usage
▪ If not controlled, state explosion can occur
▪ Slower stream performance over time
▪ Heavy shuffle spill (Joins/Merge)
▪ Out of memory errors (State Store backed operations)
©2022 Databricks Inc. — All rights reserved
Example Query 1. Main input stream
salesSDF = (
▪ Static Delta Lake table used in spark
.readStream
stream-static join .format("delta")
.table("sales")
▪ State Store-backed windowed )
stateful aggregation
2. Join item category lookup
itemSalesSDF = (
salesSDF
.join( spark.table(“items”), “item_id”)
)
3. Aggregate sales per item per
hour
itemSalesPerHourSDF = (
itemSalesSDF
.groupBy(window(..., “1 hour”),
“item_category”)
.sum(“revenue”)
)
©2022 Databricks Inc. — All rights reserved
State Store Parameters
• Watermarking
• How much history to compare against
• Granularity
• The more granular the aggregate key / window, the more
state
• State store backend
• RocksDB / Default
©2022 Databricks Inc. — All rights reserved
Stream-Static Join & Merge
• Join driven by streaming data
1. Main input stream
• Join triggers shuffle
salesSDF = (
• Join itself is stateless spark
.readStream
.format("delta")
• Control state information with .table("sales")
)
predicate
• Goal is to broadcast static table to 2. Join item category lookup
streaming data itemSalesSDF = (
salesSDF
• Broadcasting puts all data on each
.join(
spark.table(“items”)
node .filter(“category=’Food’), # Predicate
on=[“item_id”]
)
)
©2022 Databricks Inc. — All rights reserved
Output Parameters
©2022 Databricks Inc. — All rights reserved
Limiting the output dimension
Limit m in O(n⨯m)
RECORD 1
STRUCTURED
Input RECORD 2
...
STREAMING
RECORD N
State size m
(records to be
compared against)
RECORD 1
RECORD 2
...
RECORD M
State
Input size n
(records in
mini-batch)
©2022 Databricks Inc. — All rights reserved
Why are output parameters important?
• Streaming jobs tend to create many small files
• Reading a folder with many small files is slow
• Poor performance for downstream jobs, self-joins, and merge
• Output type can impact state information retained
• Merge statements with full table scans increase state
©2022 Databricks Inc. — All rights reserved
Delta Lake Output Optimizations
• Optimized Writes
• Auto Compaction
• delta.tuneFileSizesForRewrites
• Insert-only merge
©2022 Databricks Inc. — All rights reserved
Notebook
Stream Static Joins
©2022 Databricks Inc. — All rights reserved
Gold Query Layer
Making Data Available for Analytics
Stored Views
Materialized Gold Tables
©2022 Databricks Inc. — All rights reserved
Lakehouse and the
Query Layer
©2022 Databricks Inc. — All rights reserved
What is the Query Layer?
• Stores refined datasets for use by data scientists
• Serves results for pre-computed ML models
• Contains enriched, aggregated views for use by analysts
• Star-schemas and data marts for BI queries
• Powers data-driven applications, dashboards, and reports
Also called the serving layer; gold tables exist at this level.
©2022 Databricks Inc. — All rights reserved
Tables and Views in the Query Layer
• Gold tables
• Saved views
• Databricks SQL saved queries
• Tables in RDS/NoSQL database
©2022 Databricks Inc. — All rights reserved
Gold Tables
• Refined, typically aggregated views of data saved using Delta Lake
• Can be updated with batch or stream processing
• Configured and scheduled as part of ETL workloads
• Results computed on write
• Read is simple deserialization; additional filters can be applied with
pushdowns
©2022 Databricks Inc. — All rights reserved
Saved Views
• Views can be registered to databases and made available to users using
ACLs
• Views are logical queries against source tables
• Logic is executed each time a view is queried
• Views registered against Delta tables will always query the most current
valid version of the table
©2022 Databricks Inc. — All rights reserved
Databricks SQL Saved Queries
• Similar to saved views in when logic is executed
• Auto-detect changes in upstream Delta tables
• Uses new feature Query Result Cache
• Caching allows reviewing dashboards and downloading CSVs without an
active SQL endpoint
• Easy to identify data sources (SQL present in saved query)
• Can be scheduled using Databricks SQL functionality
• Can automatically refresh Databricks SQL dashboards
©2022 Databricks Inc. — All rights reserved
Databricks SQL Endpoints
• Clusters optimized for SQL queries
• Serverless option for quick cluster startup and autoscaling
• Photon-enabled for vectorized execution
• Enhanced throughput for exchanging data with external SQL systems
• Optimized connectors for popular BI tools
©2022 Databricks Inc. — All rights reserved
Tables in External Systems
• Many downstream applications
may require refined data in a
different system
• NoSQL databases
• RDS
• Pub/sub messaging
• Must decide where single source of
truth lives
©2022 Databricks Inc. — All rights reserved
Recommendations
• Use saved views when filtering silver tables
CREATE VIEW sales_florida_2020 AS
SELECT *
FROM sales
WHERE state = ‘FL’ and year = 2020;
©2022 Databricks Inc. — All rights reserved
Recommendations
• Use Delta tables for common partial aggregates
CREATE TABLE store_item_sales AS
SELECT store_id, item_id, department, date,
city, state, region, country,
SUM(quantity) AS items_sold,
SUM(price) AS revenue
FROM sales
INNER JOIN stores ON sales.store_id = stores.store_id
INNER JOIN items ON sales.item_id = items.item_id
GROUP BY store_id, item_id, department, date,
city, state, region, country
©2022 Databricks Inc. — All rights reserved
Recommendations
• Share Databricks SQL queries and dashboards within teams
SELECT date, hour, SUM(quantity * price) hourly_sales
FROM sales
WHERE store_id = 42
AND date > date_sub(current_date(), 14)
GROUP BY date, hour;
©2022 Databricks Inc. — All rights reserved
Recommendations
• Create views with column aliases
CREATE VIEW sales_marketing AS
SELECT
id cust_id,
date deal_date,
total sale_amount
FROM sales;
©2022 Databricks Inc. — All rights reserved
Recommendations
• Analyze query history to identify new candidate gold tables
• Admins can access Databricks SQL query history
• Running analytics against query history can help identify
• Queries that are long-running
• Queries that are run regularly
• Queries that use common datasets
• Transitioning these queries to gold tables and scheduling as engineering
jobs may reduce total operating costs
• Query history can also be useful for identifying predicates used most
frequently; useful for ZORDER indexing during optimization
©2022 Databricks Inc. — All rights reserved
Notebook
Stored Views
©2022 Databricks Inc. — All rights reserved
Notebook
Materialized Gold
Tables
©2022 Databricks Inc. — All rights reserved
Storing Data Securely
PII & Regulatory Compliance
Storing PII Securely
Granting Privileged Access to PII
©2022 Databricks Inc. — All rights reserved
PII & Regulatory
Compliance
©2022 Databricks Inc. — All rights reserved
Regulatory Compliance
• EU = GDPR (General Data Protection Regulation)
• US = CCPA (California Consumer Privacy Act)
• Simplified Compliance Requirements
• Inform customers what personal information is collected
• Delete, update, or export personal information as requested
• Process request in a timely fashion (30 days)
©2022 Databricks Inc. — All rights reserved
How Lakehouse Simplifies Compliance
• Reduce copies of your PII
• Find personal information quickly
• Reliably change, delete, or export data
• Use transaction logs for auditing
©2022 Databricks Inc. — All rights reserved
Manage Access to PII
• Control access to storage locations with cloud permissions
• Limit human access to raw data
• Pseudonymize records on ingestion
• Use table ACLs to manage user permissions
• Configure dynamic views for data redaction
• Remove identifying details from demographic views
©2022 Databricks Inc. — All rights reserved
Pseudonymization
©2022 Databricks Inc. — All rights reserved
Pseudonymization
• Switches original data point with pseudonym for later re-identification
• Only authorized users will have access to keys/hash/table for
re-identification
• Protects datasets on record level for machine learning
• A pseudonym is still considered to be personal data according to the GDPR
©2022 Databricks Inc. — All rights reserved
Hashing
• Apply SHA or other hash to all PII
• Add random string "salt" to values before hashing
• Databricks secrets can be leveraged for obfuscating salt value
• Leads to some increase in data size
• Some operations will be less efficient
©2022 Databricks Inc. — All rights reserved
Tokenization
• Converts all PII to keys
• Values are stored in a secure lookup table
• Slow to write, but fast to read
• De-identified data stored in fewer bytes
©2022 Databricks Inc. — All rights reserved
Tokenization
©2022 Databricks Inc. — All rights reserved
GDPR with
vault
GDPR
without vault
©2022 Databricks Inc. — All rights reserved
Anonymization
©2022 Databricks Inc. — All rights reserved
Anonymization
• Protects entire tables, databases or entire data catalogues mostly for
Business Intelligence
• Personal data is irreversibly altered in such a way that a data subject can no
longer be identified directly or indirectly
• Usually a combination of more than one technique used in real-world
scenarios
©2022 Databricks Inc. — All rights reserved
Data Suppression
• Exclude columns with PII from views
• Remove rows where demographic groups are too small
• Use dynamic access controls to provide conditional access to full data
©2022 Databricks Inc. — All rights reserved
Generalization
• Categorical generalization
• Binning
• Truncating IP addresses
• Rounding
©2022 Databricks Inc. — All rights reserved
Categorical Generalization
• Removes precision from data
• Move from specific categories to more general
• Retain level of specificity that still provides insight without revealing identity
©2022 Databricks Inc. — All rights reserved
Binning
• Identify meaningful divisions in data and group on boundaries
• Allows access to demographic groups without being able to identify
individual PII
• Can use domain expertise to identify groups of interest
©2022 Databricks Inc. — All rights reserved
Truncating IP addresses
• Rounding IP address to /24 CIDR
• Replace last byte with 0
• Generalizes IP geolocation to city or
neighbourhood level
©2022 Databricks Inc. — All rights reserved
Rounding
• Apply generalized rounding rules to all number data, based on required
precision for analytics
• Provides gross estimates without specific values
• Example:
• Integers are rounded to multiples of 5
• Values less than 2.5 are rounded to 0 or omitted from reports
• Consider suppressing outliers
©2022 Databricks Inc. — All rights reserved
Notebook
Creating a
Pseudonymized PII
Lookup Table
©2022 Databricks Inc. — All rights reserved
Notebook
Storing PII Securely
©2022 Databricks Inc. — All rights reserved
Managing ACLs for the
Enterprise Lakehouse
©2022 Databricks Inc. — All rights reserved
Challenges with Analytics in the Data Lake
● Difficult to provide access on need-to-know basis
● Unclear governance & ownership of data assets
● Data proliferation resulting in possible compliance issues
● Difficult to ensure integrity of data and reporting
©2022 Databricks Inc. — All rights reserved
The Goal
Provide access to valuable data to users across the
company in a secure manner.
Ensure users are only able to access the data they’re
entitled to.
Detect whether any data has been altered or manipulated.
©2022 Databricks Inc. — All rights reserved
High Level Design DEV Data
Full Access
DEV Data Eng WS
Read Only
SCIM Provisioning Access
from SSO
Full Access
PROD Data Eng WS PROD Data
Read Only
Access
Analytics WS
Full Access
Derivative
Data
©2022 Databricks Inc. — All rights reserved
Grant Access to Production Datasets
Assumptions
● End-users need read-only access
● Datasets organized by database
GRANT USAGE, SELECT, READ_METADATA ON DATABASE hr TO
`HR`;
Alternative, grant access on specific tables:
GRANT USAGE ON DATABASE hr TO `HR`;
GRANT SELECT, READ_METADATA ON TABLE employees TO `HR`;
GRANT SELECT, READ_METADATA ON TABLE addresses TO `HR`;
©2022 Databricks Inc. — All rights reserved
Enable Secure Data Sharing
Assumptions
● Teams/depts need a private area to collaborate in
● Datasets must not be shared outside team/dept
● New tables are not automatically shared to other members
GRANT USAGE, CREATE ON DATABASE project_data TO `Data
Analysts`;
Alternatively, members automatically see all new tables:
GRANT USAGE, SELECT, READ_METADATA, CREATE ON DATABASE
project_data TO `Data Analysts`;
©2022 Databricks Inc. — All rights reserved
Enable Private User Sandboxes
Assumptions
● Users need a private area to store derivative datasets
● Datasets must not be shared with other users
GRANT USAGE, CREATE ON DATABASE user1 TO
`user1@databricks.com`;
©2022 Databricks Inc. — All rights reserved
Delegate Administration of Access Policies
Assumptions
● Need to delegate management of ACLs to data stewards/owners
● Monitor using SQL Analytics query logs and/or workspace audit logs
ALTER DATABASE hr OWNER TO `HR Admins`;
©2022 Databricks Inc. — All rights reserved
Delegate Database Creation to Trusted Users
Assumptions
● Delegate database creation to trusted users
● Database creators manage ACLs
● Monitor grants using SQL Analytics query logs and/or workspace audit logs
GRANT CREATE ON CATALOG TO `Data Admins`;
©2022 Databricks Inc. — All rights reserved
Give All Users Read-Only Access To All
Datasets
Assumptions
● Simple security model where all users have same level of access
● All new datasets are automatically accessible
GRANT USAGE, SELECT, READ_METADATA ON CATALOG TO `users`;
©2022 Databricks Inc. — All rights reserved
Dynamic Views on Databricks
● Need to redact fields based on user’s identity
● Do not give access to underlying table, only view
● Uses existing group membership to filter rows or columns
©2022 Databricks Inc. — All rights reserved
Rollout Plan
● SCIM provisioning with groups
● Identify & classify initial set of datasets
● Register in new workspace and apply ACLs
● For Fine-Grained
○ Define access policies (e.g. “Sales only see their own accounts…”, “Support technicians
can only see accounts within their region”, etc.)
○ Identify missing attributes needed for dynamic views
● Lifecycle policy for derivative datasets
● Define process to “promote” new or derivative datasets
○ Need to classify data & define access policies
○ Legal & compliance review
©2022 Databricks Inc. — All rights reserved
Notebook
Deidentified PII
Access
©2022 Databricks Inc. — All rights reserved
Propagating Updates
and Deletes
Processing Records from Change Data Feed
Deleting Data in the Lakehouse
©2022 Databricks Inc. — All rights reserved
Propagating Changes
with Delta Change
Data Feed
©2022 Databricks Inc. — All rights reserved
Multi-Hop in the Lakehouse
Bronze Silver Gold
Streaming
CSV, Analytics
JSON,
TXT…
Data Lake
AI & Reporting
Data quality
©2022 Databricks Inc. — All rights reserved
What is Stream Composability?
• Structured Streaming expects append-only sources
• Delta tables are composable if new streams can be initiated from them
©2022 Databricks Inc. — All rights reserved
Operations that break stream composability
● Complete aggregations
● Delete
● UPDATE/MERGE
Data is changed in place, breaking append-only expectations
©2022 Databricks Inc. — All rights reserved
Workarounds for Deleting Data
ignoreChanges ignoreDeletes
Allows deletion of full partitions Allows stream to be executed against
Delta table with upstream changes
No new data files are written with full
partition removal Must implement logic to avoid
processing duplicate records
Subsumes ignoreDeletes
©2022 Databricks Inc. — All rights reserved
What Delta Change Data Feed Does for You
Improve ETL Unify batch and BI on your Meet regulatory
pipelines streaming data lake needs
Process less data during Common change format Incrementally update Full history available
ETL to increase efficiency for batch and streaming the data supporting of changes made to
of your pipelines updates, appends, and your BI tool of choice the data, including
deletes deleted information
Delta Change Data Feed
©2022 Databricks Inc. — All rights reserved
Where Delta Change Data Feed Applies
BRONZE SILVER GOLD
Raw Ingestion Filtered, Cleaned, Business-level
and History Augmented Aggregates
External feeds,
Other CDC output,
Extracts
CDF
©2022 Databricks Inc. — All rights reserved
How Does Delta Change Data Feed Work?
Original Table (v1) Change data Change Data Feed Output
(Merged as v2)
Change Versio
PK B PK B Time
PK B Type n
A1 B1 12:00:0
A2 B2 Preimage 2
0
A2 B2 A2 Z2
Postimag 12:00:0
A2 Z2 2
A3 B3 A3 B3 e 0
12:00:0
A4 B4 A3 B3 Delete 2
0
12:00:0
A4 B4 Insert 2
0
A1 record did not receive an update or delete.
So it will not be output by CDF.
©2022 Databricks Inc. — All rights reserved
Notebook
Processing Records
from Change Data
Feed
©2022 Databricks Inc. — All rights reserved
Notebook
Propagating Deletes
with Change Data
Feed
©2022 Databricks Inc. — All rights reserved
Notebook
Deleting at Partition
Boundaries
©2022 Databricks Inc. — All rights reserved
Orchestration and
Scheduling
Multi-Task Jobs
Promoting Code with Repos
CLI and REST API
Deploying Workloads
©2022 Databricks Inc. — All rights reserved
Orchestration and
Scheduling with
Multi-Task Jobs
©2022 Databricks Inc. — All rights reserved
What is a Job?
Job
Task Schedule Cluster
What? When? How?
©2022 Databricks Inc. — All rights reserved
Orchestration with Multi-Task Jobs
Serial
1 2 3 4
Parallel
2
1 4
3
©2022 Databricks Inc. — All rights reserved
Jobs revisited
Silver tables pipeline
Job
Task 1 Task 2 Task 3
Task + Cluster Task + Cluster Task + Cluster
Schedule
Task 4
Task + Cluster
©2022 Databricks Inc. — All rights reserved
Common Jobs Patterns
Sequence Funnel Fan-out
Sequence Funnel Fan-out, star pattern
● Data transformation/ ● Multiple data sources ● Single data source
processing/cleaning ● Data collection ● Data ingestion and
● Bronze/Silver/Gold distribution
©2022 Databricks Inc. — All rights reserved
Demo
Creating a
Multi-Task Job
©2022 Databricks Inc. — All rights reserved
Jobs UI Lab
Task-1 Task-5*
Create Database Create Task 5
Task-2* Task-3* Task-6*
From Task 2 From Task 3 Errors
Task-4* Task-7*
Name-Param Cleanup
* Make sure to use a different cluster for Tasks #2 & #3
* Add the parameter “name” with some value (e.g. your name)
* Make sure to use the same cluster for Tasks #5, #6 & #7
©2022 Databricks Inc. — All rights reserved
* After running once with the error, update Task-6 to pass
Promoting Code
with Databricks Repos
©2022 Databricks Inc. — All rights reserved
CI/CD Integration Development /
Experimentatio Production Jobs
n
▲
▼
Git / CI/CD
Systems Version Review Test
Supported Git
Providers
©2022 Databricks Inc. — All rights reserved
Enterprise Readiness
Admin Console | Workspace Settings
©2022 Databricks Inc. — All rights reserved
Lab
Import a Git Repo
https://github.com/databricks-academy/cli-demo
©2022 Databricks Inc. — All rights reserved
Lab
Orchestration with
the Databricks CLI
©2022 Databricks Inc. — All rights reserved
Lab
Using the Databricks
REST API
©2022 Databricks Inc. — All rights reserved
Lab
Deploying Batch and
Streaming Workloads
©2022 Databricks Inc. — All rights reserved