The document discusses building ETL (Extract, Transform, Load) pipelines using Apache Spark, highlighting its capabilities in data extraction and transformation, especially for dirty or complex datasets. It covers key features like multi-line JSON and CSV support, structured streaming, and performance improvements in Spark 2.3, along with examples of ETL queries. The session aims to illustrate how Spark simplifies the complexities of managing data pipelines and enhances data processing efficiency.
About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
About Me
• Apache Spark Committer
• Software Engineer at Databricks
• Ph.D. from University of Florida
• Previously, IBM Master Inventor, QRep, GDPS A/A and STC
• Spark SQL, Database Replication, Information Integration
• Github: gatorsmile
Overview
1. What’s an ETL Pipeline?
2. Using Spark SQL for ETL
- Extract: Dealing with Dirty Data (Bad Records or Files)
- Extract: Multi-line JSON/CSV Support
- Transformation: Higher-order functions in SQL
- Load: Unified write paths and interfaces
3. New Features in Spark 2.3
- Performance (Data Source API v2, Python UDF)
What is a Data Pipeline?
1. Sequence of transformations on data
2. Source data is typically semi-structured/unstructured (JSON, CSV, etc.) and structured (JDBC, Parquet, ORC, and other Hive-serde tables)
3. Output data is integrated, structured and curated.
– Ready for further data processing, analysis and reporting
Example of a Data Pipeline
[Diagram: sources such as Kafka, logs, databases and a cloud warehouse feeding aggregate reporting applications, an ML model, and ad-hoc queries]
ETL is the First Step in a Data Pipeline
1. ETL stands for EXTRACT, TRANSFORM and LOAD
2. Goal is to clean or curate the data
- Retrieve data from sources (EXTRACT)
- Transform data into a consumable format (TRANSFORM)
- Transmit data to downstream consumers (LOAD)
Why is ETL Hard?
ETL is too complex, error-prone, too slow and too expensive, because of:
1. Various sources/formats
2. Schema mismatch
3. Different representations
4. Corrupted files and data
5. Scalability
6. Schema evolution
7. Continuous ETL
This is why ETL is important
Consumers of this data don’t want to deal with this
messiness and complexity
Spark SQL's flexible APIs, support for a wide variety of data sources, built-in support for Structured Streaming, the state-of-the-art Catalyst optimizer and the Tungsten execution engine make it a great framework for building end-to-end ETL pipelines.
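As a hedged illustration of that end-to-end claim, here is a minimal streaming ETL sketch in Scala; the paths, checkpoint location and eventSchema are hypothetical, and spark is assumed to be the active SparkSession.

import org.apache.spark.sql.types._

// Hypothetical schema for the incoming JSON events
val eventSchema = new StructType().add("a", LongType).add("b", StringType)

// EXTRACT: continuously read newly arriving JSON files (hypothetical path)
val events = spark.readStream
  .schema(eventSchema)
  .json("/data/incoming/")

// TRANSFORM + LOAD: project the columns we need and write curated Parquet
events.select("a", "b")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/chk/etl")   // hypothetical checkpoint dir
  .option("path", "/data/curated/")           // hypothetical output dir
  .start()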
Data Source Support
1. Built-in connectors in Spark:
– JSON, CSV, Text, Hive, Parquet, ORC, JDBC
2. Third-party data source connectors:
– https://spark-packages.org
3. Define your own data source connectors with the Data Source APIs
– Ref link: https://youtu.be/uxuLRiNoDio
Dealing with Bad Data: Skip Corrupt Files
Typical corrupt-file errors:
- java.io.IOException, for example java.io.EOFException: Unexpected end of input stream at org.apache.hadoop.io.compress.DecompressorStream.decompress
- java.lang.RuntimeException: file:/temp/path/c000.json is not a Parquet file (too small)
spark.sql.files.ignoreCorruptFiles = true
[SPARK-17850] If true, Spark jobs continue to run even when they encounter corrupt files; the contents that have been read are still returned.
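A minimal sketch of that flag in use, assuming spark is the active SparkSession and the input directory is hypothetical:

// Skip unreadable files instead of failing the whole job (SPARK-17850)
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// Corrupt or truncated files under this (hypothetical) path are skipped;
// rows read from healthy files are still returned.
val events = spark.read.json("/data/events/")
events.count()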
{"a":1, "b":2, "c":3}
{"a":{,b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "_corrupt_record")
.json(corruptRecords)
.show() The default can be configured via
spark.sql.columnNameOfCorruptRecord
Json: Dealing with Corrupt Records
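In PERMISSIVE mode the unparsable line survives in the corrupt-record column, so bad rows can be split out for later repair. A sketch, assuming the same corruptRecords input as above and spark as the active SparkSession:

import spark.implicits._

// Parse leniently; malformed lines get null data columns plus the raw text
// in _corrupt_record. Cached because newer Spark versions disallow queries
// that reference only the corrupt-record column on the raw file scan.
val parsed = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(corruptRecords)
  .cache()

val bad  = parsed.filter($"_corrupt_record".isNotNull)                        // quarantine for inspection
val good = parsed.filter($"_corrupt_record".isNull).drop("_corrupt_record")   // clean rows continue downstream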
{"a":1, "b":2, "c":3}
{"a":{,b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "DROPMALFORMED")
.json(corruptRecords)
.show()
Json: Dealing with Corrupt Records
{"a":1, "b":2, "c":3}
{"a":{,b:3}
{"a":5, "b":6, "c":7}
spark.read
.option("mode", "FAILFAST")
.json(corruptRecords)
.show()
org.apache.spark.sql.catalyst.json
.SparkSQLJsonProcessingException:
Malformed line in FAILFAST mode:
{"a":{, b:3}
Json: Dealing with Corrupt Records
Functionality: Better Corruption Handling
badRecordsPath: a user-specified path for storing exception files that record information about bad records/files.
- A unified interface for both corrupt records and corrupt files
- Enables multi-phase data cleaning
- DROPMALFORMED + exception files
- No need for an extra column for corrupt records
- Records the exception data, reason and time
Availability: Databricks Runtime 3.0
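A sketch of the option in use, assuming a Databricks Runtime where badRecordsPath is available; the paths are hypothetical:

// Bad records/files are dropped from the result and logged as exception
// files under the path below (Databricks Runtime 3.0+).
val df = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .json("/data/input/")

// The exception files under /tmp/badRecordsPath capture the bad data, the
// failure reason and the time, enabling a later cleaning phase.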
Functionality: Better JSON and CSV Support
[SPARK-18352] [SPARK-19610] Multi-line JSON and CSV Support
- Spark SQL reads JSON/CSV one line at a time by default
- Before 2.2, handling multi-line records required custom ETL
spark.read
  .option("multiLine", true)
  .json(path)
Availability: Apache Spark 2.2
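The same option applies to the CSV reader; a brief sketch with a hypothetical file path:

// Read a CSV file whose quoted fields contain embedded newlines (Spark 2.2+)
val df = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("/data/multiline.csv")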
Transformation: Higher-order Functions in SQL
Transformations on complex objects such as arrays, maps and structs inside columns.
Why not a UDF? Expensive data serialization.
tbl_nested
|-- key: long (nullable = false)
|-- values: array (nullable = false)
| |-- element: long (containsNull = false)
Transformation: Higher-order Functions in SQL
1) Check for element existence
SELECT EXISTS(values, e -> e > 30) AS v
FROM tbl_nested;
2) Transform an array
SELECT TRANSFORM(values, e -> e * e) AS v
FROM tbl_nested;
Transformation: Higher-order Functions in SQL
3) Filter an array
SELECT FILTER(values, e -> e > 30) AS v
FROM tbl_nested;
4) Aggregate an array
SELECT REDUCE(values, 0, (value, acc) -> value + acc) AS sum
FROM tbl_nested;
Ref Databricks Blog: http://dbricks.co/2rUKQ1A
More cool features available in DB Runtime 3.0: http://dbricks.co/2rhPM4c
Availability: Databricks Runtime 3.0
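A self-contained sketch of these queries follows; the sample data is hypothetical, and it uses the lowercase built-in forms (transform, filter, exists, aggregate) that open-source Spark gained in 2.4, where aggregate plays the role of the deck's REDUCE.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hof-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for tbl_nested: one key with an array of longs
Seq((1L, Seq(10L, 20L, 40L))).toDF("key", "values")
  .createOrReplaceTempView("tbl_nested")

spark.sql("SELECT exists(values, e -> e > 30) AS v FROM tbl_nested").show()    // 1) element existence
spark.sql("SELECT transform(values, e -> e * e) AS v FROM tbl_nested").show()  // 2) map over the array
spark.sql("SELECT filter(values, e -> e > 30) AS v FROM tbl_nested").show()    // 3) keep matching elements
spark.sql("SELECT aggregate(values, 0L, (acc, e) -> acc + e) AS s FROM tbl_nested").show()  // 4) fold to a sum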
New Format in the DataFrameWriter API
Users can create Hive-serde tables using the DataFrameWriter APIs
Availability: Apache Spark 2.2
CREATE data source tables:
df.write.format("parquet")
  .saveAsTable("tab")
CREATE Hive-serde tables:
df.write.format("hive")
  .option("fileFormat", "avro")
  .saveAsTable("tab")
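A minimal end-to-end sketch of the Hive-serde path, with the session, sample data and table name all hypothetical (format("hive") requires Hive support on the SparkSession):

import org.apache.spark.sql.SparkSession

// Hive support is needed for the format("hive") write path
val spark = SparkSession.builder()
  .appName("writer-demo")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Persist as a Hive-serde table stored as Avro (Spark 2.2+)
df.write.format("hive")
  .option("fileFormat", "avro")
  .saveAsTable("tab_hive")

// Read the table back through the catalog
spark.table("tab_hive").show()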
Unified CREATE TABLE [AS SELECT]
Availability: Apache Spark 2.2
CREATE data source tables:
CREATE TABLE t1(a INT, b INT)
USING ORC
CREATE Hive-serde tables:
CREATE TABLE t1(a INT, b INT)
USING hive
OPTIONS(fileFormat 'ORC')
Traditional Hive-serde syntax:
CREATE TABLE t1(a INT, b INT)
STORED AS ORC
Unified CREATE TABLE [AS SELECT]
Apache Spark preferred syntax
Availability: Apache Spark 2.2
CREATE [TEMPORARY] TABLE [IF NOT EXISTS]
  [db_name.]table_name
  USING table_provider
  [OPTIONS table_property_list]
  [PARTITIONED BY (col_name, col_name, ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)]
    INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [AS select_statement];
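A hypothetical instance of this syntax, wrapped in spark.sql for consistency with the other snippets; the table names, partition column and path are made up:

// Unified CTAS (Spark 2.2+): data source table, partitioned, at an explicit location
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales_curated
  USING parquet
  PARTITIONED BY (dt)
  LOCATION '/warehouse/sales_curated'
  COMMENT 'Curated sales records'
  AS SELECT id, amount, dt FROM sales_raw
""")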
[SPARK-15689] Data Source API v2
1. [SPARK-20960] An efficient column batch interface for data
exchanges between Spark and external systems.
o Cost for conversion to and from RDD[Row]
o Cost for serialization/deserialization
o Publish the columnar binary formats
2. Filter pushdown and column pruning
3. Additional pushdown: limit, sampling and so on.
Target: Apache Spark 2.3
Performance: Python UDFs
1. Python is the most popular language for ETL
2. Python UDFs are often used to express elaborate data conversions/transformations
3. Any improvements to Python UDF processing will ultimately improve ETL
4. Improve data exchange between Python and JVM
5. Block-level UDFs
o Block-level arguments and return types
Target: Apache Spark 2.3
Recap
1. What’s an ETL Pipeline?
2. Using Spark SQL for ETL
- Extract: Dealing with Dirty Data (Bad Records or Files)
- Extract: Multi-line JSON/CSV Support
- Transformation: Higher-order functions in SQL
- Load: Unified write paths and interfaces
3. New Features in Spark 2.3
- Performance (Data Source API v2, Python UDF)
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (community edition)
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today.
databricks.com