Data Analysis with Databricks SQL (Version 2)

Summary: This document outlines a course on Data Analysis with Databricks SQL, detailing its goals, agenda, and learning objectives. Participants will learn about Databricks SQL's capabilities, including data ingestion, querying, visualization, and dashboard creation, as well as the Unity Catalog for data governance. The course also emphasizes the benefits of the lakehouse architecture and Delta Lake for efficient data processing and analytics.

Data Analysis with Databricks SQL

Databricks Academy
2023
Meet your classmates
• Where is everyone joining us from today (city, country)?



Meet your classmates
• How long have you been working with Databricks SQL?



Meet your classmates
• What data analysis tools have you worked with in the past?



Meet your classmates
• What are you hoping to get out of this class?



Getting Started with
the Course



Course goals

1. Describe how Databricks SQL:
   ○ Works in the lakehouse architecture
   ○ Implements data security
2. Use Databricks SQL to:
   ○ Ingest data into the platform
   ○ Use SQL commands specific to Databricks
   ○ Query data
   ○ Create visualizations and dashboards
   ○ Share queries and dashboards with others
   ○ Use automation and integration capabilities
Agenda
Module Name Duration

Databricks SQL Services and Capabilities 1 hour, 50 min

Data Management in Databricks SQL 1 hour, 40 min

Data Visualization and Dashboarding 3 hours, 30 min

● We will take 10-minute breaks about every hour



Activities used in this course

Lectures Demos Labs Knowledge checks



Technical Environment Overview
The Databricks SQL workspace
• Everyone:
• is in the same workspace
• has their own catalog
• is using the same SQL warehouse
• Only the instructor has administrator privileges in the workspace
• only a select few tasks in this course require admin privileges
• you will see these tasks in the slides in order to provide context
• the labs do not require admin privileges



Databricks SQL
Services and
Capabilities
Databricks Academy 2023



Databricks SQL Services and Capabilities

Lesson Name Duration

Lecture: Getting Started with Databricks SQL 15 min

Follow Along Demo: Setting Up a Catalog and Schema 20 min

Follow Along Demo: Data Importing 20 min

Follow Along Demo: A Simple (but Quick) Query, Visualization, and Dashboard 15 min

Lecture: Unity Catalog in Databricks SQL 20 min

Lecture: Lakehouse Architecture 20 min

Lecture: Integrations 10 min



Getting Started with
Databricks SQL



Learning Objectives
By the end of this lesson, you should be able to:

1 Describe what Databricks SQL is.

2 Describe the benefits of Databricks SQL.



Databricks SQL
Delivering analytics on the freshest data with data warehouse performance and data lake economics

(Diagram: the Databricks platform spans Data Science & AI, ETL & Orchestration, Real-time Analytics, and Data Warehousing through Databricks AI, Delta Live Tables, Workflows, and Databricks SQL, all built on the Data Intelligence Engine, Unity Catalog, and Delta Lake over an open data lake holding all raw data: logs, texts, audio, video, and images. The Data Intelligence Engine uses generative AI to understand the semantics of your data, lets you securely get insights in natural language, and automatically optimizes data layout based on usage patterns.)

■ Better price/performance than other cloud data warehouses
■ Simplify discovery and sharing of new insights
■ Connect to familiar BI tools, like Tableau or Power BI
■ Simplified administration and governance


Better price / performance
Run SQL queries on your
lakehouse and analyze your
freshest data with up to 6x
better price/performance than
traditional cloud data
warehouses.

Source: Performance Benchmark with Barcelona Supercomputing Center



Better together | Broad integration with BI tools
Connect your preferred BI tools with optimized connectors that provide fast performance, low latency, and high user concurrency to your data lake.


A new home for data analysts
Enable data analysts to quickly perform ad-hoc and exploratory data analysis with a new SQL query editor, visualizations, and dashboards. Automatic alerts can be triggered for critical changes, allowing you to respond to business needs faster.


Simple administration and governance
Quickly set up SQL/BI-optimized compute with SQL warehouses. Databricks automatically determines instance types and configuration for the best price/performance. Then, easily manage usage and perform quick auditing and troubleshooting with query history.


Use Cases
● Connect existing BI tools to one source of truth for all your data
● Collaboratively explore the latest and freshest data
● Build data-enhanced applications


Follow Along Demo:
Setting Up a Catalog and
Schema



Follow Along Demo
Setting Up a Catalog and Schema
• Get your username for the course
• Create a catalog
• Create a schema
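
For reference, a minimal sketch of these steps in SQL, assuming a Unity Catalog-enabled workspace; the catalog and schema names are hypothetical stand-ins for your course username:

-- Create a catalog and a schema inside it (names are hypothetical)
CREATE CATALOG IF NOT EXISTS odl_user_1234;
CREATE SCHEMA IF NOT EXISTS odl_user_1234.dawd_v2;

-- Make them the defaults for subsequent statements
USE CATALOG odl_user_1234;
USE SCHEMA dawd_v2;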



Follow Along Demo:
Data Importing

DAWD 01-5



Follow Along Demo
Data Importing
• Upload a .csv file
• Use the Catalog Explorer
• Create a table with data from object store



Follow Along Demo:
A Simple (but Quick)
Query, Visualization, and
Dashboard



Follow Along Demo
A Simple (but Quick) Query, Visualization, and
Dashboard
• Query data
• Create a visualization
• Create a dashboard
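
A minimal sketch of the kind of query behind this demo, assuming a hypothetical sales table; the visualization and dashboard are then built on its result in the UI:

-- Aggregate query whose result feeds a visualization (table and columns are hypothetical)
SELECT customer_name,
       SUM(total_price) AS total_sales
FROM sales
GROUP BY customer_name
ORDER BY total_sales DESC;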



Knowledge Check



Knowledge check
Think about this question and volunteer an answer

Which of the following statements about Databricks SQL is true? Select one response.
A. With Databricks SQL, queries deliver up to 2x
better price/performance than other cloud data
warehouses.
B. Delta Live Tables can be created in Databricks
SQL.
C. Databricks SQL automatically configures scaling
when creating SQL warehouses.
D. Databricks SQL clusters are powered by Photon.
Knowledge check
Think about this question and volunteer an answer

Which of the following features is used by Databricks SQL to ensure your data is secure? Select one response.
A. Built-in data governance
B. Delta sharing
C. Integration with 3rd party tools
D. Automatically scalable cloud infrastructure



Knowledge check
Think about this question and volunteer an answer

Which of the following features of Databricks is used for running queries in Databricks SQL? Select one response.
A. Dashboards
B. Job scheduler
C. SQL Editor
D. SQL warehouses



Knowledge check
Think about this question and volunteer an answer

What is the primary purpose of Databricks SQL?


A. To provide better price/performance and simplify
discovery for BI tools.
B. To manage administration and governance of data
warehouses.
C. To support a broad set of BI tools, including Tableau
and Power BI.
D. All of the above.



Unity Catalog in
Databricks SQL

DAWD 02-1



Learning Objectives
By the end of this lesson, you should be able to:

1. Describe the three-level namespacing system provided by Unity Catalog.
2. Describe persistence and scope of catalogs, schemas (databases), tables, and views on Databricks.
3. Compare and contrast the behavior of managed and unmanaged tables.



Learning Objectives, cont.
By the end of this lesson, you should be able to:

4. Identify that the legacy hive_metastore appears as a catalog, keeping existing data compatible with Unity Catalog.
5. Describe how and where Unity Catalog stores the data behind its catalogs, schemas, and granular data objects.
6. Explain the impact of Unity Catalog on existing external storage locations.



Unity Catalog
Architecture

● Implements access control on data
● Access control is always enabled
● Works across multiple workspaces
● Grants permissions to users at the account level

(Diagram: an identity provider feeds user/group management in the Unity Catalog metastore, which governs compute resources across multiple workspaces.)


Unity Catalog
Comparison to workspace security model

Workspace security model                           Unity Catalog security model
Object/privilege/principal access control model    Object/privilege/principal access control model
Open by default                                    Secure by default
Local to workspace                                 Works across multiple workspaces
Grants privileges to workspace-level principals    Grants privileges to account-level principals


Catalog Explorer UI
Single pane of glass for all of your data

● UI-driven access control to simplify secure data permissioning
● Browse and understand data assets stored in your lakehouse
● Data lineage: end-to-end table and column lineage


Data Lineage
Mapping the flow of data in the lakehouse

● Auto-capture runtime data lineage across all languages
● Track lineage down to the table and column level
● Leverage the common permission model from Unity Catalog


Object Model



Metastore

• Stores data assets and their permissions
• Created with a default storage location (external object store)
• Managed by a metastore admin


Catalog

• First level of organization


• Users can see all catalogs where
USAGE is granted



Schema

• aka, Database
• Second level of organization
• Users can see all schemas where
USAGE is granted on both the
schema and the catalog



Managed Table

• Third level of organization


• Supported format: Delta
• Data is written to a new directory in
the metastore’s default location
• Created using CREATE TABLE
statement with no LOCATION
clause
• Example:
CREATE TABLE table1 …
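
A fuller managed-table sketch, reusing the department example from later in this lesson; with no LOCATION clause, the data is written to the metastore's managed storage:

CREATE TABLE department
(
  deptcode INT,
  deptname STRING,
  location STRING
);

-- Dropping a managed table also deletes its underlying data files
DROP TABLE department;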



External Table

• Third level of organization


• Data stored in a location outside
the managed storage location
• DROP TABLE does not delete data
• Can easily clone a table to a new
schema or table name without
moving data
• Supported formats:
• Delta, csv, json, avro, parquet, orc, text



Creating External Tables

• Two credential types:


• Storage Credential or External Location
• Use the LOCATION clause
• Example using External Location only
CREATE TABLE table2
LOCATION 's3://<bucket_path>/<table_directory>'
...

• Example using Storage Credential


CREATE TABLE table2
LOCATION 's3://<bucket_path>/<table_directory>'
...
WITH CREDENTIAL <credential-name>;



View

• Third level of organization


• Can be composed from tables and
views in multiple schemas or
catalogs
• Created using CREATE VIEW:

CREATE VIEW view1 AS


SELECT column1, column2
FROM table1 ...



Three-Level Namespace Notation

• Data objects must be specified with three elements, depending on the granularity required: catalog, schema, and table
• Example:
CREATE TABLE main.default.department
(
deptcode INT,
deptname STRING,
location STRING
);
• Or, with a USE statement:
USE main.default;
SELECT * FROM department;



Lakehouse
Architecture

DAWD 01-5



Learning Objectives
By the end of this lesson, you should be able to:

1. Describe the benefits of using Databricks SQL for in-platform data processing.
2. Describe the medallion architecture as a sequential data organization and pipeline system of progressively cleaner data.
3. Identify that bronze- and silver-layer data requires additional processing and cleaning.
4. Describe the data in the gold layer of the medallion architecture.


Learning Objectives, cont.
By the end of this lesson, you should be able to:

5. Describe last-mile ETL workflows fully within the gold layer for specific use cases.
6. Identify the gold layer as the most common layer for data analysts using Databricks SQL.
7. Describe the benefits of working with streaming data.


The Lakehouse Architecture
● Full ACID transactions: focus on your data flow, instead of worrying about failures.
● Open standards, open source: store petabytes of data without worries of lock-in, with a growing community including Presto, Spark, and more.
● Powered by the data lake.
● Unifies streaming and batch: convert existing jobs with minimal modifications.*

(Diagram: data stream and batch sources (CSV, JSON, TXT…) land in the data lake and feed streaming analytics and AI & reporting.)

*Streaming is not available in SQL warehouses.
The Delta Lake Architecture
Data quality levels

(Diagram: raw sources (CSV, JSON, TXT…) are ingested into the Bronze layer (raw ingestion), refined into the Silver layer (filtered, cleaned, augmented), and aggregated into the Gold layer (business-level aggregates), which feeds streaming analytics and AI & reporting. Data quality increases from Bronze to Gold. Tables at each level are maintained with standard operations: INSERT, UPDATE, DELETE, MERGE, and OVERWRITE.)
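
To make the refinement step concrete, a minimal sketch that promotes bronze records into a silver table with MERGE; all table and column names are hypothetical:

-- Upsert cleaned bronze records into silver
MERGE INTO sales_silver AS s
USING (
  SELECT order_id, customer_id, CAST(order_ts AS DATE) AS order_date, amount
  FROM sales_bronze
  WHERE amount IS NOT NULL
) AS b
ON s.order_id = b.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;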


Knowledge Check



Knowledge check
Think about this question and volunteer an answer

Which of the following statements about the lakehouse medallion architecture is true? Select one response.
A. The data in a single upstream table could be used to generate
multiple downstream tables.
B. The silver layer is for reporting and uses more de-normalized
and read-optimized data models with fewer joins.
C. The gold layer provides a broad view of all key business entities,
concepts and transactions.
D. Only minimal or "just-enough" transformations and data
cleansing rules are applied to each layer in the medallion
architecture.



Knowledge check
Think about this question and volunteer an answer

Which of the following describes the data quality of the gold layer of data in the lakehouse medallion architecture? Select one response.
A. The gold layer brings the data from different sources into an
Enterprise view.
B. The gold layer is comprised of clean aggregated data, ready to
use in production for a specific use case.
C. The table structures in the gold layer correspond to the source
system table structures "as-is".
D. The focus of the gold layer is quick Change Data Capture and
the ability to provide a historical archive if needed without
rereading the data from the source system.
Knowledge check
Think about this question and volunteer an answer

What is the primary purpose of the bronze layer in the "bronze-silver-gold medallion" paradigm in Delta Lake?
A. To store data in a format suitable for individual business
projects or reports.
B. To perform data cleansing, joining, and enrichment on raw data.
C. To provide a "single source of truth" for the enterprise across
various projects.
D. To ingest raw data quickly, keeping it in its original format for
both current and future projects.



Knowledge check
Think about this question and volunteer an answer

Which of the following statements describes the relationship between the silver and gold layers of data? Select one response.
A. The gold layer has less clean data than the silver layer.
B. Project-specific business rules are applied from the silver to
gold layer.
C. Self-service analytics are enabled for the gold layer for ad-hoc
reporting in the silver layer.
D. The gold layer is where we land all the data from external
source systems, which are represented by the silver layer.



Integrations

DAWD 01-6



Learning Objectives
By the end of this lesson, you should be able to:

1. Identify Databricks SQL as a complementary tool for BI partner tool workflows.
2. Identify Databricks SQL as a quick opportunity to create queries, visualizations, and dashboards from within the lakehouse.
3. Identify Partner Connect as a tool for implementing simple integrations with a number of other data products.


Databricks Partner Connect
Databricks Partner Connect is a dedicated ecosystem of integrations that allows users to easily connect with popular data ingestion, transformation, and BI partner products.

This helps data analysts get useful data into their lakehouse faster, without the need to manually configure each product, so they can get data-driven insights.


Databricks Partner Connect
Helps Data Analysts who:
● Struggle to connect to their choice of BI tools
● Struggle to bring data from SaaS apps (Google Analytics,
Facebook, etc.) to run SQL queries
● Have to wait on eng/ops to generate data for analysis



Partner Connect Makes it Easy
"How do I get the data from SFDC into Delta Lake?" "What tools can I use to ingest data into Delta?" "I heard Fivetran is great! How do I connect it to Databricks?"

DATABRICKS PARTNER CONNECT
■ Many partner integrations take as few as 6 clicks
■ No context or page switches required
■ Automatically launches a cluster, calls the partner API to pass on a PAT token and the cluster configuration details
■ Sets up all the necessary configs for an optimized user experience
■ Creates a trial account in the partner product if an account doesn't exist
Databricks Partner Connect

Data Ingestion | BI and Visualization | Data Prep and Transformation | Machine Learning


Databricks Partner Connect

Data Governance | Data Quality | Reverse ETL | Semantic Layer


Built on an open foundation
Easily integrate with the entire data and AI ecosystem



Follow Along Demo:
Integrations

DAWD 01-5



Follow Along Demo
Integrations
• Connecting to outside data
• Connecting to BI tools
• Partner Connect



Data
Management in
Databricks SQL

Databricks Academy 2023



Data Management in Databricks SQL

Lesson Name Duration

Lecture: Databricks SQL Warehouses 15 min

Follow Along Demo: Delta Lake in Databricks SQL 20 min

Lecture: Data Security 20 min



Databricks SQL
Warehouses



Learning Objectives
By the end of this lesson, you should be able to:

1. Describe the purpose of Databricks SQL warehouses.
2. Compare and contrast Classic, Pro, and Serverless Databricks SQL warehouses.
3. Identify Serverless Databricks SQL warehouses as a quick-starting option.


Learning Objectives, cont.
By the end of this lesson, you should be able to:

4. Describe basic Databricks SQL warehouse sizing and scaling guidelines in response to slow-running single-user queries and multi-user environments.
5. Describe the impact of Databricks SQL warehouse permissions on query history availability.


Successful data management platforms rely on efficient infrastructure
(Diagram: ML engineers, data scientists, data engineers, and data analysts (all the users) share common infrastructure over the data lake and data warehouse (all the data).)
Problems with Managing Infrastructure
● Users: lost productivity, waiting for results while clusters start up / scale up.
● Admins: manual effort to configure versions, cluster sizes, and instance types.
● Finance: pressure to reduce costs.


Databricks SQL Serverless
● Managed servers: an always-running server fleet, patched and upgraded automatically.
● Optimized capacity: idle clusters are removed 10 minutes after the last query (configurable).
● Instant compute: < 10 s allocation to users, with no waiting for clusters to start up or scale up.
● Secure: 3-layer isolation with data encryption.


Industry-leading security architecture supporting production workloads
1. Container isolation: hardened container images per industry best practice; privileged access disabled in the container.
2. VM isolation: workloads separated by VM boundaries; reusing VMs among customers is blocked.
3. Network isolation: all node egress is blocked except to nodes in the same cluster; federated access through temporary security tokens; ingress traffic from other customers is blocked.


Databricks SQL Serverless Benefits
● Higher user productivity: user queries start instantly with no waiting for cluster start-up; add more concurrent users with instant cluster scaling.
● Zero management: no configuration, no performance tuning, no capacity management; automatic upgrades and patching.
● Lower cost: pay for what you consume and eliminate idle cluster time; no over-provisioning of resources; idle capacity is removed 10 minutes after the last query.


Serverless Compute Architecture
(Diagram: the Databricks account hosts the control plane and workspaces; serverless compute runs in a Databricks-managed VPC/VNet and accesses customer storage in the customer account.)

Benefits:
● Production-ready environment
● Robust security foundation: data isolation and encryption
Warehouse Configuration
(Screenshots: the SQL warehouse configuration UI on AWS, Azure, and GCP.)


Warehouse Configuration

In this course, SQL warehouses have the following settings:
• Cluster size: 2X-Small
• Scaling: Min 1, Max 1
• Auto-stop: after ten minutes


Delta Lake



Follow Along Demo:
Delta Lake in
Databricks SQL
DAWD 01-5



Follow Along Demo
Delta Lake in Databricks SQL
• Create a schema
• Create views that withhold data from
unauthorized groups
• Optimize delta tables
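
A minimal sketch of the view and optimization steps, assuming hypothetical table and group names; is_account_group_member() is the Unity Catalog function commonly used in such views:

-- View that withholds a column from users outside an authorized group
CREATE OR REPLACE VIEW sales_redacted AS
SELECT customer_name,
       CASE WHEN is_account_group_member('finance') THEN payment_details
            ELSE 'REDACTED' END AS payment_details,
       total_price
FROM sales;

-- Compact small files and co-locate frequently filtered data
OPTIMIZE sales ZORDER BY (customer_name);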



Knowledge Check



Knowledge check
Think about this question and volunteer an answer

Which of the following statements describes the purpose of Databricks SQL warehouses? Select one response.
A. SQL warehouses enable data analysts to find and share
dashboards.
B. SQL warehouses are a declarative framework for building data
processing pipelines.
C. SQL warehouses provide data discovery capabilities across
Databricks workspaces.
D. SQL warehouses allow users to run SQL commands on data
objects within Databricks SQL.



Knowledge check
Think about this question and volunteer an answer

What are the benefits of Delta Lake within the Lakehouse Architecture?
A. Real-time data processing with low latency
B. Exclusive support for batch processing
C. ACID transactions, metadata scalability, and
storage improvement
D. Data isolation for multiple software development
environments



Knowledge check
Think about this question and volunteer an answer

Which of the following statements about SQL warehouse sizing and scaling is true? Select two responses.
A. Increasing maximum scaling allows for multiple users to use
the same warehouse at the same time.
B. Scaling is set to a minimum of 1 and a maximum of 1 by
default.
C. The higher the cluster size, the higher the latency in your
queries.
D. The auto-stop feature will restart the warehouse if it remains
idle during the auto-stop period.



Knowledge check
Think about this question and volunteer an answer

Which feature of the platform provides users with the ability to quickly connect to third-party tools with simple-to-implement integrations? Select one response.
A. SQL Editor
B. Partner Connect
C. Workflows
D. Features



Lab:
Lakehouse SQL

DAWD 01-5



Lab
Lakehouse SQL
• Time Travel
• COPY INTO
• MERGE INTO
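
A minimal sketch of the three lab topics, assuming a hypothetical sales table and bucket path:

-- Time Travel: query an earlier version of a Delta table
SELECT * FROM sales VERSION AS OF 1;

-- COPY INTO: incrementally load new files from object storage
COPY INTO sales
  FROM 's3://<bucket>/sales/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true');

-- MERGE INTO: upsert staged changes into the target table
MERGE INTO sales AS t
USING sales_updates AS u
ON t.order_id = u.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;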



Data Security

DAWD 03-1



Learning Objectives
By the end of this lesson, you should be able to:

1. Describe the different levels of data object access available with Unity Catalog.
2. Identify that catalogs, schemas, and tables can all have unique owners.
3. Describe how to organize owned data objects for the purposes of security.
4. Identify that the creator of a data object becomes the owner of that data object.
Learning Objectives, cont.
By the end of this lesson, you should be able to:

5. Identify the responsibilities of data object ownership.
6. Update data object permissions to address user access needs in a variety of common scenarios.
7. Identify PII data objects as needing additional, organization-specific considerations.


The Life of a Query (Without Unity Catalog)
Per workspace
(Diagram: 1. A user submits "SELECT * FROM Sales2020;" to a cluster or SQL warehouse. 2. The cluster checks grants against the table ACL. 3. It looks up the table location. 4. The Hive metastore returns the path to the table (s3://sales/sales2020). 5. Storage is accessed via an instance profile / service principal / service account. 6. The cluster filters unauthorized data.)


Unity Catalog Overview
(Diagram: clusters and SQL warehouses go through Unity Catalog (cross-workspace), which enforces SQL access controls, writes an audit log, and brokers access to managed data sources in cloud storage, to external tables via defined credentials, and to other existing data sources via user identity passthrough.)
Databricks Unity Catalog
(Diagram: users access tables, views, models, and SQL databases through Unity Catalog, which maps them to data files on S3/ADLS/GCS and writes an audit log.)

● Fine-grained permissions on tables, fields, views: not files
● Industry-standard interface: ANSI SQL grants
● Uniform permission model for all data assets: tables, ML models, Delta Shares
● Centrally audited


Attribute-Based Access Control
True policy-driven ABAC that flows from your metadata

(Diagram: a Sales 2020 table with columns userID, purchase, purchaseAMT, and paymentDetails; the paymentDetails column is tagged pii. The pii policy makes the column visible to Finance, redacted for Data Eng, and masked for Analysts.)

A record of all sales transactions is ingested from several data sources, and only the finance team should be able to see payment information. The PII policy flows down from the column being tagged as PII.


Attribute-Based Access Control
CREATE ATTRIBUTE pii

ALTER TABLE iot_events ADD ATTRIBUTE pii ON email
ALTER TABLE users ADD ATTRIBUTE pii ON phone
...

-- Set permission on all columns tagged pii together
GRANT SELECT ON SCHEMA iot_data
HAVING ATTRIBUTE NOT IN (pii)
TO product_managers


Unity Catalog: External Table with Defined Credentials

CREATE CREDENTIAL iot_role TYPE AWS_ROLE ...

CREATE TABLE iot_data LOCATION 's3://...'
WITH CREDENTIAL iot_role

(Diagram: Unity Catalog stores the defined credential and uses it to access external tables.)


Unity Catalog: External Files with Passthrough

-- If a direct file path is specified, passthrough is performed with the user's cloud credentials
SELECT * FROM csv.`adls:/.../myfolder`

CREATE VIEW v AS SELECT * FROM csv.`adls:/.../myfolder`

(Diagram: the user's credentials are passed through to ADLS to read external files.)


Data
Visualization and
Dashboarding
Databricks Academy 2023



Data Visualization and Dashboarding

Lesson Name Duration

Follow Along Demo: Data Visualizations 40 min

Lab: Data Visualizations 30 min

Follow Along Demo: Dashboarding Basics 40 min

Lab: Dashboards 30 min

Follow Along Demo: Databricks SQL in Production 40 min

Lab: Databricks SQL in Production 30 min



Follow Along Demo:
Data Visualizations

DAWD 01-5



Follow Along Demo
Data Visualizations
• Create common visualizations
• Create map visualizations



Knowledge Check



Knowledge check
Think about this question and volunteer an answer

What types of customized visualizations can be created using Databricks SQL?
A. Pie charts, bar graphs, and line charts
B. Counter and funnels
C. Bar charts, combo charts, and geographical maps
D. All of the above



Knowledge check
Think about this question and volunteer an answer

How can you enable aggregation in a Databricks SQL visualization?
A. Modify the underlying SQL query to add an aggregation
column.
B. Select the aggregation type directly in the visualization editor.
C. Use the Aggregation drop-down menu in the Visualization Type
options.
D. Aggregation is not supported in Databricks SQL visualizations.



Knowledge check
Think about this question and volunteer an answer

How can you add a query visualization to a Databricks dashboard?
A. Select a query and choose the visualization type.
B. Drag and drop a visualization from the sidebar.
C. Copy and paste the visualization code into the
dashboard.
D. Query visualizations cannot be added to a Databricks
dashboard.



Knowledge check
Think about this question and volunteer an answer

What is the benefit of setting a refresh schedule for a Databricks dashboard?

A. To change the color palette of visualizations.


B. To organize and label workspace objects.
C. To keep the data underlying visualizations up-to-date.
D. To create query parameters for customization.



Lab:
Data Visualizations

DAWD 01-5



Lab
Data Visualizations
• Create a boxplot
• Create a funnel



Follow Along Demo:
Dashboarding Basics

DAWD 01-5



Follow Along Demo
Dashboarding Basics
• Create a dashboard
• Parameterize queries
• Organize Databricks SQL assets



Knowledge Check



Knowledge check
Think about this question and volunteer an answer

Which of the following can be added to a query so that the code can be rerun with different variable inputs? Select one response.
A. User-defined functions
B. Parameters
C. Vectors
D. SQL warehouses



Knowledge check
Think about this question and volunteer an answer
A data analyst needs to create a visualization out of the following query:

SELECT order_date
FROM sales
WHERE order_date >= to_date('2020-01-01')
  AND order_date <= to_date('2021-01-01');

Which of the following visualization types is best suited to depict the results of this query? Select one response.
A. Funnel
B. Stacked bar chart
C. Bar chart
D. Boxplot



Knowledge check
Think about this question and volunteer an answer

A team of stakeholders needs to be notified of changes in a dashboard’s statistics on a daily basis. Which of the following actions can be taken to ensure they always have the newest information? Select one response.
A. A refresh schedule can be configured and stakeholders
can be subscribed to the dashboard's output.
B. A trigger alert can be created for the dashboard and
stakeholders can be added to the alert notification list.
C. A webhook can be created and shared with
stakeholders.
D. None of the above
Knowledge check
Think about this question and volunteer an answer

Which of the following data visualizations displays a single number by default? Select one response.
A. Bar chart
B. Counter
C. Map - markers
D. Funnel



Lab:
Dashboards

DAWD 01-5



Lab
Dashboards
• Create your own dashboard



Follow Along Demo:
Databricks SQL in
Production
DAWD 01-5



Follow Along Demo
Databricks SQL in Production
• Automation in Databricks SQL
• Sharing Databricks SQL assets
• Creating gold-level tables
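
A minimal sketch of a gold-level table materialized from silver data with CREATE TABLE AS SELECT; all names are hypothetical:

-- Gold table: business-level aggregate built from the silver layer
CREATE OR REPLACE TABLE sales_gold AS
SELECT customer_id,
       SUM(amount) AS lifetime_value,
       COUNT(*) AS order_count
FROM sales_silver
GROUP BY customer_id;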



Knowledge Check



Knowledge check
Think about this question and volunteer an answer

Which of the following automations are available in Databricks SQL? Select one response.
A. Query refresh schedules
B. Dashboard refresh schedules
C. Alerts
D. All of the above



Knowledge check
Think about this question and volunteer an answer

What is the purpose of Alerts in Databricks SQL?
A. To automatically execute SQL queries.
B. To organize queries within a folder structure.
C. To trigger notifications based on specific
conditions in scheduled queries.
D. To share dashboards with other team members.



Knowledge check
Think about this question and volunteer an answer

What is the purpose of configuring a refresh schedule for a query in Databricks SQL?
A. To automatically pull new data into a table.
B. To create a new table based on specified
criteria.
C. To manually execute queries on-demand.
D. To edit existing data in the database.



Knowledge check
Think about this question and volunteer an answer

What level of permissions is the owner of a query granted on their query? Select one response.
A. Can View
B. Can Run
C. Can Edit
D. Can Manage



Lab:
Databricks SQL in
Production
DAWD 01-5



Lab
Databricks SQL in Production
• Create automations
• Share a dashboard



Summary and Next
Steps



Earn a Databricks certification!
Certification helps you gain industry recognition, competitive differentiation, greater productivity, and results.

• This course helps you prepare for the Databricks Certified Data Analyst Associate exam
• Recommended self-paced courses:
  • Ingesting Data for Databricks SQL
  • Integrating BI Tools with Databricks SQL
• Please see the Databricks Academy for additional prep materials

For more information, visit: databricks.com/learn/certification


Thank you!



©2023 Databricks Inc. — All rights reserved
