Spark SQL Deep Dive
Introduction to Spark SQL
➢ Spark SQL integrates relational processing with Spark's functional programming API.
➢ It is a module for structured data processing in Apache Spark.
➢ Allows querying data via SQL as well as the DataFrame and Dataset APIs.
➢ Provides a programming interface for working with structured data.
➢ Enables execution of SQL queries on structured data using Spark's powerful engine.
Benefits of Spark SQL
• Unified Data Access: Combines SQL queries with Spark programs.
• Optimized Execution: Uses Catalyst optimizer for query optimization.
• Interoperability: Works with various data sources like Hive, Parquet, JSON, etc. (see the sketch below).
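For instance, a minimal sketch of mixing SQL with the DataFrame API, assuming an active SparkSession named spark (as in a notebook) and a hypothetical JSON file path:

# Read a structured source into a DataFrame, then query it with SQL
orders = spark.read.json("/data/orders.json")
orders.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) AS n FROM orders").show()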
SPARK SQL ARCHITECTURE
1. User / Application / GUI
Entry point for submitting SQL queries. Could be notebooks (like Databricks), applications, JDBC clients, etc.
2. SQL Query Processor
This consists of:
➤ DataFrame Interface:
• Converts high-level SQL queries or DataFrame operations into logical plans.
• Acts as a wrapper for user inputs, providing flexibility in both SQL and programmatic APIs.
➤ Catalyst Optimizer:
• Spark’s internal query optimizer.
• Transforms SQL or DataFrame queries into optimized logical and physical plans.
• Performs predicate pushdown, constant folding, type coercion, etc.
➤ Query Interface (Catalyst + DataFrame interface):
• Entire query translation and planning process occurs here.
• This is what makes Spark SQL efficient and scalable.
3. Global Catalog
• Manages metadata.
• Keeps track of table names, schemas, database locations, and more. Works like a metastore (e.g., the Hive Metastore).
4. Spark SQL Core Layer
• Translates the optimized query plans into RDD operations.
• Uses Spark's Java interface (RDD API) underneath for distributed execution (see the small sketch below).
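As a small illustration, the RDD underlying any DataFrame can be inspected via .rdd (assuming an active SparkSession named spark):

# Every DataFrame query ultimately runs as RDD operations; .rdd exposes the underlying RDD of Row objects
df = spark.range(5)
print(df.rdd.collect())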
5. Data Sources / Storage Layer
• Spark can connect to various data systems using interfaces (connectors), summarized in the table below; a JDBC example follows.
• The query output is then returned through this same pipeline to the User/App/GUI.

Interface      Connector               Target
Interface 1    JDBC Driver             NoSQL systems like Cassandra, MongoDB
Interface 2    JDBC Driver             RDBMS like MySQL, PostgreSQL
-              Spark Java Interface    HDFS, S3, Parquet, etc.
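As an illustration of the JDBC connector, a minimal sketch (the URL, table name, and credentials below are placeholders, and the JDBC driver jar must be on the Spark classpath):

# Read a table from an RDBMS over JDBC into a DataFrame
mysql_df = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/shop")
    .option("dbtable", "orders")
    .option("user", "reader")
    .option("password", "secret")
    .load())
mysql_df.show()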
End-to-End Flow
• SQL Parser: Converts SQL queries into logical plans.
• Catalyst Optimizer: Optimizes logical plans.
• Physical Planner: Converts logical plans into physical execution plans.
• Execution Engine: Executes the physical plans.
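These stages can be inspected for any query with explain(). A minimal sketch, assuming an active SparkSession named spark:

# extended=True prints the parsed, analyzed, and optimized logical plans plus the physical plan
df = spark.range(10)
df.createOrReplaceTempView("nums")
spark.sql("SELECT id FROM nums WHERE id > 5").explain(extended=True)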
Key concepts
1. SparkSession
• Definition: The SparkSession is the entry point to programming Spark with the Dataset and DataFrame API.
• Purpose: It replaces the older SQLContext and HiveContext in Spark 2.0 and later versions.
• To create a SparkSession, use the SparkSession.builder method (example below).
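For example, a minimal sketch (the application name is arbitrary):

from pyspark.sql import SparkSession

# Create the session, or return the existing one if it is already running
spark = SparkSession.builder \
    .appName("SparkSQLDeepDive") \
    .getOrCreate()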
2. Creating DataFrames
• Definition: A DataFrame is a distributed collection of data organized into named columns.
• Purpose: DataFrames provide a domain-specific language for structured data manipulation.
• Usage: You can create a DataFrame from a variety of data sources, including collections of data (example below).
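For example, a minimal sketch that builds a DataFrame from an in-memory collection (the rows and column names are made up):

# Create a DataFrame from a local Python collection with explicit column names
people = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Cara")], ["id", "name"])
people.printSchema()
people.show()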
REGISTERING TEMP & GLOBAL TEMP VIEWS
1. Temp View
➢ Session-scoped: Only available to the current SparkSession.
➢ It exists only for the duration of the SparkSession and is automatically dropped when the session ends.
➢ Useful for sharing data within the same session.
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE id > 1").show()
2. Global Temp View
➢ The global temporary view is available to all SparkSessions within the same Spark application and is registered in the system database global_temp.
➢ It exists until the Spark application terminates.
➢ Usage: Useful for sharing data across different sessions in the same application.
df.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()
Integration with Hive
➢ Description: Spark SQL can read data from Hive tables.
➢ Configuration: Requires a Hive metastore configuration (see the sketch after this list).
SQL queries can be run over:
• DataFrames registered as temp/global views
• External tables (Hive, JDBC, etc.) if configured
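A minimal sketch of enabling Hive support, assuming a configured Hive metastore and an existing Hive table named sales (both hypothetical here):

from pyspark.sql import SparkSession

# enableHiveSupport() connects the session to the cluster's Hive metastore
spark = SparkSession.builder \
    .appName("HiveDemo") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SELECT * FROM sales LIMIT 10").show()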
SQLContext (Legacy)
• Used in Spark 1.x for SQL functionalities.
• In Spark 2.0+, use SparkSession instead.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sparkContext)
CATALOG API
The Spark Catalog API provides methods to interact with the metadata of tables, databases, functions, and other data objects within a Spark session. This API is useful for managing and querying the structure and organization of your data.
spark.catalog.listTables()
spark.catalog.listDatabases()
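A few more calls, as a minimal sketch assuming the people temp view registered earlier:

# Inspect the columns of a registered view/table and the current database
spark.catalog.listColumns("people")
print(spark.catalog.currentDatabase())
# Drop the temp view once it is no longer needed
spark.catalog.dropTempView("people")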
Summary
• Spark SQL enables SQL-like queries over structured data
• Use spark.sql() to run queries
• Understand Temp View vs Global Temp View
• Use Catalog API to explore your tables and DBs
Tomorrow we will cover: Broadcast Variables & Accumulators in Spark for sharing data across nodes and debugging computations.