Transform Data with Spark
Module 02
Module Objectives
Transform Data with Spark
1. Extract data from a variety of file formats and data sources using Spark
2. Apply a number of common transformations to clean data using Spark
3. Reshape and manipulate complex data using advanced built-in functions in Spark
4. Leverage UDFs for reusable code and apply best practices for performance in Spark
Module Agenda
Transform Data with Spark
Data Objects in the Lakehouse
DE 2.1 - Querying Files Directly
DE 2.2 - Options for External Sources
DE 2.3L - Extract Data Lab
DE 2.4 - Cleaning Data
DE 2.5 - Complex Transformations
DE 2.6L - Reshape Data Lab
DE 2.7A - SQL UDFs and Control Flow
DE 2.7B - Python UDFs
Data Objects in the Lakehouse
Data objects in the Lakehouse
Data objects are organized in a hierarchy: the metastore contains catalogs, a catalog contains schemas (databases), and a schema contains tables, views, and functions.
Data objects in the Lakehouse
Within a schema, a table can be either a managed table or an external table.
Managed Tables
A managed table is registered in the metastore, and its underlying data files are stored in the metastore-managed storage location.
External Tables
An external table's data files live in external storage, registered to the metastore through a storage credential and an external location.
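A hedged sketch of the difference in DDL (table names and the storage path are hypothetical):

-- Managed: data files live in metastore-managed storage
CREATE TABLE sales_managed (order_id INT, amount DOUBLE)

-- External: data files stay at the external location you specify
CREATE TABLE sales_external (order_id INT, amount DOUBLE)
LOCATION 's3://my-bucket/sales'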
Data objects in the Lakehouse
Views come in three scopes: views registered in the schema, temporary views, and global temporary views. Schemas also contain registered functions.
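A hedged sketch of the three view scopes (view and source table names are hypothetical):

-- Persisted in the schema
CREATE OR REPLACE VIEW events_view AS SELECT * FROM events

-- Visible only within the current Spark session
CREATE OR REPLACE TEMP VIEW events_temp AS SELECT * FROM events

-- Registered in the global_temp schema, shared across sessions on the cluster
CREATE OR REPLACE GLOBAL TEMP VIEW events_global AS SELECT * FROM events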
Extracting Data
Query files directly
SELECT * FROM file_format.`path/to/file`
Files can be queried directly using SQL
• SELECT * FROM json.`path/to/files/`
• SELECT * FROM text.`path/to/files/`
Processing depends on the specified file format:
• json pulls schema from underlying data
• binaryFile and text file formats have fixed data schemas
• text → string value column (row for each line)
• binaryFile → path, modificationTime, length, content columns (row for each file)
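For illustration, a hedged sketch against a hypothetical directory of JSON event files (the path is an assumption):

-- One row per JSON record; schema inferred from the underlying data
SELECT * FROM json.`/mnt/raw/events/`

-- One row per line of text, in a single string column named value
SELECT * FROM text.`/mnt/raw/events/`

-- One row per file, with path, modificationTime, length, and content columns
SELECT * FROM binaryFile.`/mnt/raw/events/`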
Configure external tables with read options
CREATE TABLE USING data_source OPTIONS (...)
Many data sources require a schema declaration and other options to correctly read data:
• CSV: options for delimiter, header, etc.
• JDBC: options for url, user, password, etc.
• Note: using the JDBC driver pulls RDBMS tables dynamically for Spark processing
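Two hedged sketches of the pattern (table names, paths, and connection details are hypothetical):

-- CSV: declare the schema and the options the reader needs
CREATE TABLE sales_csv (order_id INT, email STRING, purchases INT)
USING CSV
OPTIONS (path "/mnt/raw/sales/", header "true", delimiter "|")

-- JDBC: queries against this table read from the external RDBMS dynamically
CREATE TABLE customers_jdbc
USING JDBC
OPTIONS (url "jdbc:postgresql://db-host:5432/shop", dbtable "customers", user "reader", password "<password>")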
DE 2.1: Querying Files Directly
Use Spark SQL to directly query JSON data files
Leverage text and binaryFile methods to review raw file contents
DE 2.2: Providing Options for External Sources
Use Spark SQL to configure options for extracting data from external sources
Create tables against external data sources for various file formats
Describe behavior when querying tables defined against external RDBMS sources
DE 2.3L: Extract Data Lab
DE 2.4: Cleaning Data
Summarize datasets and describe NULL behaviors
Retrieve and remove duplicate records
Validate datasets for expected counts, missing values, and duplicate records
Apply date_format and regexp_extract to clean and transform data
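A minimal sketch of the two functions named above (table and column names are hypothetical):

SELECT
  date_format(first_touch_time, 'MMM d, yyyy') AS first_touch_date, -- format a timestamp for display
  regexp_extract(email, '@(.+)$', 1) AS email_domain                -- capture group 1 of the regex
FROM users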
Complex Transformations
Interact with Nested Data
Use built-in syntax to traverse nested data with Spark SQL
Use “:” (colon) syntax in queries to access subfields in JSON strings
SELECT value:device, value:geo ...
Use “.” (dot) syntax in queries to access subfields in STRUCT types
SELECT value.device, value.geo ...
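Completing the snippets above, assuming hypothetical tables events_raw (whose value column holds JSON strings) and events (whose value column is a STRUCT):

SELECT value:device, value:geo FROM events_raw
SELECT value.device, value.geo FROM events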
Complex Types
Nested data types storing multiple values
• Array: arbitrary number of elements of same data type
• Map: set of key-value pairs
• Struct: fixed, ordered collection of named fields, each of any data type
Example table with complex types
CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street: STRING, city: STRING, state: STRING, zip: INT>)
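A hedged sketch of reading these fields back (the map key 'federal' and the sample query are assumptions):

SELECT
  name,
  subordinates[0] AS first_subordinate,    -- array element by index
  deductions['federal'] AS federal_amount, -- map value by key
  address.city AS city                     -- struct field by name
FROM employees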
DE 2.5: Complex Transformations
Use : and . syntax to traverse nested data in strings and structs
Use .* syntax to flatten and query struct types (see the sketch after this list)
Parse JSON string fields
Flatten/unpack arrays and structs
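Illustrating the .* syntax against the employees table defined earlier (a sketch, not lab code):

-- Promote every field of the address struct to a top-level column
SELECT name, address.* FROM employees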
explode lab
explode outputs the elements of an array field into a separate row for each element
SELECT
user_id, event_timestamp, event_name,
explode(items) AS item
FROM events
Each item in the items array is exploded into its own row; an items array holding three elements yields three rows.
flatten lab
collect_set returns an array of unique values from a field for each group of rows
flatten returns an array that flattens multiple arrays into one
SELECT user_id,
collect_set(event_name) AS event_history,
array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id
Collection example
collect_set returns an array with duplicate elements eliminated
collect_list returns an array with duplicate elements intact
For a DataFrame df with an age column, df.agg(collect_set('age')) collapses the ages into an array of distinct values, while df.agg(collect_list('age')) keeps every occurrence.
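The same pair expressed in Spark SQL, reusing the events table from the previous slide (a sketch):

SELECT user_id,
       collect_set(event_name) AS distinct_events, -- duplicates eliminated
       collect_list(event_name) AS all_events      -- duplicates intact
FROM events
GROUP BY user_id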
Parse JSON strings into structs
Create the schema used to parse the JSON strings by providing an example JSON string from a row that has no nulls.
from_json uses the JSON schema returned by schema_of_json to convert a column of JSON strings into structs.
The example JSON string is taken from the value field of a single row of data.
The result is a STRUCT column, here containing an ARRAY of nested STRUCTs.
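A minimal sketch of the pattern (the sample JSON literal and table name are assumptions):

-- schema_of_json derives a schema from one complete example string;
-- from_json then parses every row of the column using that schema
SELECT from_json(value, schema_of_json('{"device":"macOS","items":[{"item_id":"M_STAN_Q","price":59.0}]}')) AS value_struct
FROM events_raw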
DE 2.6L: Reshape Data Lab (Optional)
DE 2.7A: SQL UDFs and Control Flow (Optional)
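As a hedged preview of the lab's topic, a SQL UDF whose body uses CASE / WHEN control flow (function, table, and rule details are hypothetical):

CREATE OR REPLACE FUNCTION price_tier(price DOUBLE)
RETURNS STRING
RETURN CASE
  WHEN price > 100 THEN 'premium'
  WHEN price > 0 THEN 'standard'
  ELSE 'invalid'
END

-- Usage:
SELECT item_id, price_tier(price) AS tier FROM items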
DE 2.7B: Python UDFs (Optional)