CIT650 Introduction to Big Data
Spark SQL
Project Tungsten
1
Spark SQL7
RDD APIs, although richer and more concise than MapReduce, are still
considered low-level
We still need to benefit from the in-memory execution model of Spark
but make it accessible to more people
Similar to Hive and Impala for MapReduce/HDFS, Spark SQL wraps
RDD API calls with an SQL-like shell
Spark SQL uses the DataFrame and Dataset abstractions
It has an advanced query optimizer, called Catalyst
7 Armbrust, Michael, et al. "Spark SQL: Relational Data Processing in Spark."
Proceedings of the 2015 ACM SIGMOD International Conference on Management of
Data. ACM, 2015.
2
Programming Interface
source: Armbrust, Michael, et al. “Spark SQL: Relational data processing in spark.” Proceedings of the 2015 ACM SIGMOD
International Conference on Management of Data. ACM, 2015.
3
DataFrame
A distributed collection of rows organized into named columns
A DataFrame is similar to a table (relation) in relational databases
Supports complex types
Structs, maps, unions
Data can be selected, projected, joined, and aggregated
Similar to Python's Pandas
4
Creating DataFrames
From a structured file source
JSON or Parquet files
From an existing RDD
By performing an operation on another DataFrame
By programmatically defining a schema
Example
peopleDF = sqlCtx.jsonFile("people.json")
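A minimal PySpark sketch of the creation paths listed above, using the newer SparkSession entry point in place of the slides' sqlCtx (file name and sample values are hypothetical):

from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("create-dataframes").getOrCreate()

# From a structured file source (modern equivalent of sqlCtx.jsonFile)
peopleDF = spark.read.json("people.json")

# From an existing RDD of Row objects
rowsRDD = spark.sparkContext.parallelize(
    [Row(name="John", age=29), Row(name="Jane", age=21)])
peopleFromRDD = spark.createDataFrame(rowsRDD)

# By programmatically defining a schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
peopleWithSchema = spark.createDataFrame([("John", 29), ("Jane", 21)], schema)

# By performing an operation on another DataFrame
adultsDF = peopleDF.where("age > 21")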
5
Basic DataFrame Operations
Operations to deal with DataFrame metadata (a short sketch follows below)
schema: returns a schema object describing the data
printSchema: displays the schema as a tree
cache/persist: persists the DataFrame to disk or memory
columns: returns an array of column names
dtypes: returns an array of pairs of column names and types
explain: prints debug information about the DataFrame to the console
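A short PySpark sketch of these calls, assuming the peopleDF from the previous slide (example outputs are illustrative):

peopleDF.schema          # StructType object describing the data
peopleDF.printSchema()   # prints the schema as a tree
peopleDF.columns         # e.g. ['age', 'name']
peopleDF.dtypes          # e.g. [('age', 'bigint'), ('name', 'string')]
peopleDF.cache()         # persist the DataFrame in memory
peopleDF.explain()       # print the physical plan to the console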
6
Manipulating Data in DataFrames
Queries create new DataFrames
DataFrames are immutable
Analogous to RDD transformations
Some query methods (a sketch follows after this list)
select: returns a new DataFrame with only the selected columns from the base DataFrame
join: joins the base DataFrame with another DataFrame
where: keeps only the records that match the condition in the new DataFrame
Actions return data
Lazy execution as with RDDs
Some actions: take(n), collect(), count()
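A sketch of these methods, assuming peopleDF from before and a hypothetical addressesDF that shares a name column:

# Transformations build new, immutable DataFrames (lazy, like RDD transformations)
namesDF  = peopleDF.select("name", "age")
adultsDF = peopleDF.where("age > 21")
joinedDF = peopleDF.join(addressesDF, "name")

# Actions trigger execution and return data to the driver
adultsDF.take(5)
adultsDF.count()
joinedDF.collect()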
7
DataFrame Query String
You can pass column names as strings
peopleDF.select("name", "age")
peopleDF.where("age > 21")
8
Querying DataFrames Using Columns
You can also refer to Column objects.
In Python:
peopleDF.select(peopleDF.age, peopleDF.name)
peopleDF.select(peopleDF.age + 10, upper(peopleDF.name))    # upper is from pyspark.sql.functions
peopleDF.sort(peopleDF.age.desc())
In Scala:
peopleDF.select(peopleDF("age"), peopleDF("name"))
peopleDF.select(peopleDF("age") + 10, upper(peopleDF("name")))   // upper is from org.apache.spark.sql.functions
9
SQL Queries
It is possible to query a DataFrame using SQL
First, register the DataFrame as a temp table
peopleDF.registerTempTable("people")
sqlCtx.sql("SELECT * FROM people WHERE name LIKE 'A%'")
10
RDD vs. DataFrames vs. Spark SQL
Statement | Operation | Example Output
sc.textFile(...) | Read data into an RDD | ["John\t29", "John\t31", "Jane\t21"]
.map(lambda line: line.split("\t")) | Split lines into arrays | [["John", "29"], ["John", "31"], ["Jane", "21"]]
.map(lambda x: (x[0], [int(x[1]), 1])) | Map to key-value pairs | [("John", [29, 1]), ("John", [31, 1]), ("Jane", [21, 1])]
.reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) | Aggregate values by key | [("John", [60, 2]), ("Jane", [21, 1])]
.map(lambda x: [x[0], x[1][0] / x[1][1]]) | Calculate average age | [("John", 30), ("Jane", 21)]
.collect() | Collect results to the driver | [("John", 30), ("Jane", 21)]
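The same pipeline written out as runnable PySpark against a hypothetical tab-separated file; note that the split step must be wrapped in a map:

from pyspark import SparkContext

sc = SparkContext(appName="average-age")

avg_ages = (sc.textFile("people.txt")                                # "John\t29", "John\t31", "Jane\t21"
              .map(lambda line: line.split("\t"))                    # ["John", "29"], ...
              .map(lambda x: (x[0], (int(x[1]), 1)))                 # ("John", (29, 1)), ...
              .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # ("John", (60, 2)), ("Jane", (21, 1))
              .map(lambda x: (x[0], x[1][0] / x[1][1]))              # ("John", 30.0), ("Jane", 21.0)
              .collect())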
11
RDD vs. DataFrames vs. Spark SQL
12
RDD vs. DataFrames vs. Spark SQL
13
Major Milestones in Spark SQL
14
DataFrames vs. Datasets
Type Safety
  DataFrame API: not type-safe; errors in column names or data types are caught at runtime.
  Dataset API: type-safe; errors in column names or data types are caught at compile time.
API Usage
  DataFrame API: operates with an untyped API (Dataset<Row>), which uses column names as strings.
  Dataset API: operates with a typed API, which uses Java classes to represent rows and compile-time checked lambda functions.
Data Representation
  DataFrame API: represents data as rows without any compile-time type information.
  Dataset API: represents data as objects of a specified class, providing compile-time type information.
Lambda Functions
  DataFrame API: uses lambda functions that work with Row objects, which provide no compile-time type checking.
  Dataset API: uses lambda functions that work with typed objects (e.g., Person), allowing for compile-time type checking.
Serialization/Deserialization
  DataFrame API: implicitly converts data to and from Row objects.
  Dataset API: uses Encoders to convert data to and from Java objects, which can be more efficient.
Data Source
  DataFrame API: reads JSON directly into a DataFrame (Dataset<Row>).
  Dataset API: reads JSON and converts it into a Dataset of Java objects using Encoders and a Java Bean class.
15
Spark RDD API Example
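A minimal sketch of what an RDD-level version of the people example could look like, assuming people.json contains one JSON record per line:

import json
from pyspark import SparkContext

sc = SparkContext(appName="people-rdd")

adultNames = (sc.textFile("people.json")
                .map(json.loads)                        # parse each line as a JSON record
                .filter(lambda p: p.get("age", 0) > 21)
                .map(lambda p: p["name"])
                .collect())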
16
Spark DataFrame Example - SQL
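A sketch of the SQL route, using the modern SparkSession entry point (createOrReplaceTempView is the newer equivalent of registerTempTable):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("people-sql").getOrCreate()

peopleDF = spark.read.json("people.json")
peopleDF.createOrReplaceTempView("people")

adults = spark.sql("SELECT name, age FROM people WHERE age > 21")
adults.show()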
17
Spark DataFrame Example
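The same query through the DataFrame API instead of SQL (same assumptions as above):

peopleDF = spark.read.json("people.json")   # spark: the SparkSession from the previous sketch

adults = peopleDF.where(peopleDF.age > 21).select("name", "age")
adults.show()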
18
Spark Dataset Example
19
Catalyst: Plan Optimization and Execution8
An extensible query optimizer
Builds on Scala’s pattern matching capabilities
Plans are represented as trees
Optimization rules transform trees
8 Armbrust, Michael, et al. "Spark SQL: Relational Data Processing in Spark."
Proceedings of the 2015 ACM SIGMOD International Conference on Management of
Data. ACM, 2015.
20
Analysis
An attribute is unresolved if its type is not
known or it is not matched to an input table.
To resolve attributes:
Look up relations by name from the catalog.
Map named attributes to the input provided, given the operator's children.
Assign a unique ID (UID) to references to the same value.
Propagate and coerce types through expressions (e.g., 1 + col).
21
Logical Optimization
Applies standard rule-based optimizations:
constant folding,
predicate pushdown,
projection pruning,
null propagation,
Boolean expression simplification,
subquery elimination, etc.
22
Logical Optimization
Constant Folding: pre-evaluates constant expressions during planning. Example: SELECT 1+2; becomes SELECT 3;
Predicate Pushdown: applies filters early, close to the data source. Example: filters data during the read operation, not afterwards.
Projection Pruning: retrieves only the columns needed by the query. Example: selects specific columns from a table.
Null Propagation: simplifies expressions when null values are involved, based on nullability rules. Example: parts of expressions with null become null.
Boolean Expression Simplification: optimizes Boolean logic by eliminating redundancies. Example: simplifies col = TRUE AND TRUE to col = TRUE.
Subquery Elimination: replaces complex subqueries with more efficient constructs like joins. Example: converts correlated subqueries to joins.
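These rewrites can be observed by asking Spark for its plans; a small sketch (the exact plan text varies across Spark versions, and the Parquet file name is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# Constant folding: the optimized logical plan contains the literal 3, not 1 + 2
spark.sql("SELECT 1 + 2 AS x").explain(True)

# Predicate pushdown and projection pruning: with a Parquet source, the filter and
# the selected columns are pushed into the scan shown in the physical plan
peopleDF = spark.read.parquet("people.parquet")
peopleDF.where(peopleDF.age > 21).select("name").explain(True)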
23
Trees
• A tree is the main data type in the Catalyst optimizer.
• A tree contains node objects.
• A node can have one or more children.
• New nodes are defined as subclasses of the TreeNode class.
• These objects are immutable.
• They can be manipulated using functional transformations.
24
Tree Abstractions
Expression
An expression represents a new value, computed based on input values, e.g. 1 + 2 + t1.value
Attribute
A column of a dataset (e.g. t1.id) or a column generated by a specific data operation (e.g. v)
25
SQL to Logical Plan
Expression
26
Logical Plan
A Logical Plan describes computation on datasets without defining how to conduct the computation
output: a list of attributes generated by this Logical Plan, e.g. [id, v]
constraints: a set of invariants about the rows generated by this plan, e.g., t2.id > 50 * 1000
27
Physical Plan
A Physical Plan describes computation on datasets with specific definitions on how to conduct the computation
A physical plan is executable
28
Optimization
Transformations are used to optimize plans
Plans should be logically equivalent
Transformation is done via rules
Tree type preserving
Expression to Expression
Logical Plan to Logical Plan
Physical Plan to Physical Plan
Non-type preserving
Logical Plan to Physical Plan
29
Transforms
A transform is defined as a partial function
A partial function is a function that is defined for only a subset of its possible inputs
val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}
30
Combining Multiple Rules
31
Combining Multiple Rules
32
Combining Multiple Rules
33
Combining Multiple Rules
34
Combining Multiple Rules
35
Project Tungsten
Explicit memory management: leverage application semantics and data schema to
eliminate JVM GC overhead
Cache-aware computation: exploit the memory hierarchy
Code generation: exploit modern CPUs by generating code; it has been observed
that Spark workloads are more constrained by CPU and memory than by network or I/O
36
Off-Heap Memory Management
JVM object and GC overhead is non-negligible.
Java objects have a large inherent memory overhead
Example: the string "abcd", which would take 4 bytes using UTF-8 encoding,
takes 48 bytes when stored using the JVM's native String data type:
a String object (24 bytes) wrapping the character array (24 bytes)
37
Off-Heap Memory Management
Spark stores its data in a binary format
Spark serializes/deserializes its own data
It exploits the data schema to reduce the overhead
Built on sun.misc.Unsafe to give C-like memory access
38
Off-Heap Memory Management
39
Off-Heap Memory Management
40
Cache-aware Computation
It was observed that Spark applications spend many CPU cycles
waiting for data to be fetched from main memory
Why not exploit the memory hierarchy and prefetch data?
3x faster
41
Code Generation
Avoid interpreting expressions row by row at runtime
Imagine an expression (x + y) + 1: without code generation, evaluating it involves
many virtual function calls, which add up to a large overhead in total
With code generation, at query compilation time and utilizing the known data types,
Spark can generate bytecode optimized for the specific data types
42
Example
Consider the case where we need to filter a DataFrame by the column year:
year > 2015
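As a toy Python analogy (not Spark internals; Spark actually generates Java bytecode): without code generation, the predicate is evaluated by walking a generic expression tree for every row, paying several dynamic dispatches per row, whereas generated code collapses it into one direct, type-specialized comparison.

# Interpreted evaluation: a generic expression tree, walked for every row
class ColumnRef:
    def __init__(self, name):
        self.name = name
    def eval(self, row):
        return row[self.name]

class Literal:
    def __init__(self, value):
        self.value = value
    def eval(self, row):
        return self.value

class GreaterThan:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, row):
        return self.left.eval(row) > self.right.eval(row)   # one dispatch per node, per row

predicate = GreaterThan(ColumnRef("year"), Literal(2015))
predicate.eval({"year": 2017})          # True, reached via several dynamic calls

# What code generation effectively emits for this specific expression
def generated_predicate(row):
    return row["year"] > 2015           # one direct comparison, no expression tree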
43
Whole Stage Code Generation
Traditionally, Spark followed the Volcano iterator model
44
Volcano Iterator Model
All operators (scan, filter, project, etc.) implement an Iterator
interface
A physical query plan is a chain of operators where each operator
iterates over the output of its child operator; hence, data moves up
like lava in a volcano
Operator-specific logic is applied and results are emitted to the parent
operator
Every handshake between two operators costs one virtual function call,
plus reading the child's output from memory and writing the new output
back to memory
45
Issues with Volcano Model
Too many virtual function calls
Extensive memory access
Unable to leverage many modern techniques: pipelining,
prefetching, SIMD, loop unrolling, etc.
46
How would dedicated code look?
A college freshman could write the following code to implement the
same query
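Something along these lines, shown here as a hand-written Python loop over the same hypothetical data, purely for contrast with the operator-based execution:

def filter_by_year(rows, threshold=2015):
    """Hand-written, single-purpose version of: keep the rows with year > 2015."""
    result = []
    for row in rows:                     # one tight loop, no virtual calls between operators
        if row["year"] > threshold:
            result.append(row)
    return result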
47
Dedicated versus Volcano
48
Why?
49
Whole-stage Code Generation
The target was to keep the functionality of a general-purpose execution engine,
like the Volcano model, while performing just like a hand-built system that does
exactly what the user wants to do
A technique now popular in the database literature
Simply fuse the operators together so that the generated code looks
hand-optimized
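Whether whole-stage code generation kicked in can be checked from the physical plan: operators fused into a generated stage are marked with an asterisk (a sketch; the exact output format varies by Spark version).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wscg-demo").getOrCreate()

df = spark.range(1000).selectExpr("id + 1 AS x").where("x > 10")
df.explain()    # fused operators appear under a WholeStageCodegen block, prefixed with "*"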
50
Vectorized Execution: in-memory columnar storage
After WSCG, we can still speed up the execution of the generated code.
How? Vectorization.
The idea is to take advantage of data-level parallelism (DLP) in
modern CPUs, that is, to process data in batches of rows rather
than one row at a time
Shift from row-based to column-based storage
51
Scalar versus Vector Processing
52
Data availability for vectorized processing
Columnar data storage benefits:
Simple access versus complex offset computation in row-based formats
Denser storage
Compatibility with in-memory caches
Enables harnessing more benefits from hardware, e.g. GPUs
Avoiding CPU stalls:
Keep data as close to the CPU registers as possible to avoid idle cycles
We have four pipeline stages, F = fetch, D = decode, E = execute, W = write; if
data is missing, it has to be fetched from lower (slower) memory.
53
Ideal execution without CPU stalls
54
Execution with CPU stalls
55
Benchmarking Big SQL Systems
Several systems provide SQL-like access to big data
How do they perform?
What aspects affect their performance?
Benchmarking studies are important to give researchers and
practitioners insight into the capabilities of the different systems
One such study was conducted by Victor Aluko and Sherif Sakr: "Big
SQL Systems: An Experimental Evaluation," 2018
56
Benchmark Scope
Systems: Hive, Impala, Spark SQL, and PrestoDB
Benchmarks: TPC-H, TPC-DS
Hardware setup: a cluster of 5 nodes, each with a 2.4 GHz Intel Xeon
(16 cores), 64 GB of RAM, and a 1.2 TB SSD with a disk speed of
300 MB/s
Metrics: Response time, CPU, memory, disk and network utilization
Data Formats: text, ORC, Parquet
57
TPC-H Results
58