Spark SQL Deep Dive
Introduction to Spark SQL
➢ Spark SQL integrates relational processing with Spark's functional programming API.
➢ It is a module for structured data processing in Apache Spark.
➢ Allows querying data via SQL as well as the DataFrame and Dataset APIs.
➢ Provides a programming interface for working with structured data.
➢ Enables execution of SQL queries on structured data using Spark's powerful engine.
Benefits of Spark SQL
• Unified Data Access: Combines SQL queries with Spark programs.
• Optimized Execution: Uses Catalyst optimizer for query optimization.
• Interoperability: Works with various data sources like Hive, Parquet, JSON, etc. (see the sketch below).
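For instance, a minimal sketch of mixing SQL with the DataFrame API, assuming an active SparkSession named spark (as in a notebook) and a hypothetical JSON file path:

# Read a structured source into a DataFrame, then query it with SQL
orders = spark.read.json("/data/orders.json")
orders.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) AS n FROM orders").show()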
SPARK SQL ARCHITECTURE
1. User / Application / GUI
Entry point for submitting SQL queries. Could be notebooks (like Databricks), applications, JDBC clients, etc.
2. SQL Query Processor
This consists of:
➤ DataFrame Interface:
• Converts high-level SQL queries or DataFrame operations into logical plans.
• Acts as a wrapper for user inputs, providing flexibility in both SQL and programmatic APIs.
➤ Catalyst Optimizer:
• Spark’s internal query optimizer.
• Transforms SQL or DataFrame queries into optimized logical and physical plans.
• Performs predicate pushdown, constant folding, type coercion, etc.
➤ Query Interface (Catalyst + DataFrame interface):
• Entire query translation and planning process occurs here.
• This is what makes Spark SQL efficient and scalable.
3. Global Catalog
• Manages metadata.
• Keeps track of table names, schemas, database locations, and more. Works like a metastore (e.g., the Hive Metastore).
4. Spark SQL Core Layer
• Translates the optimized query plans into RDD operations.
• Uses Spark's Java interface (RDD API) underneath for distributed execution (see the small sketch below).
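As a small illustration, the RDD underlying any DataFrame can be inspected via .rdd (assuming an active SparkSession named spark):

# Every DataFrame query ultimately runs as RDD operations; .rdd exposes the underlying RDD of Row objects
df = spark.range(5)
print(df.rdd.collect())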
5. Data Sources / Storage Layer
• Spark can connect to various data systems using interfaces (connectors), summarized in the table below; a JDBC example follows.
• The query output is then returned through this same pipeline to the User/App/GUI.

Interface      Connector               Target
Interface 1    JDBC Driver             NoSQL systems like Cassandra, MongoDB
Interface 2    JDBC Driver             RDBMS like MySQL, PostgreSQL
-              Spark Java Interface    HDFS, S3, Parquet, etc.
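As an illustration of the JDBC connector, a minimal sketch (the URL, table name, and credentials below are placeholders, and the JDBC driver jar must be on the Spark classpath):

# Read a table from an RDBMS over JDBC into a DataFrame
mysql_df = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/shop")
    .option("dbtable", "orders")
    .option("user", "reader")
    .option("password", "secret")
    .load())
mysql_df.show()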
End-to-End Flow
• SQL Parser: Converts SQL queries into logical plans.
• Catalyst Optimizer: Optimizes logical plans.
• Physical Planner: Converts logical plans into physical execution plans.
• Execution Engine: Executes the physical plans.
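These stages can be inspected for any query with explain(). A minimal sketch, assuming an active SparkSession named spark:

# extended=True prints the parsed, analyzed, and optimized logical plans plus the physical plan
df = spark.range(10)
df.createOrReplaceTempView("nums")
spark.sql("SELECT id FROM nums WHERE id > 5").explain(extended=True)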
Key concepts
1. SparkSession
• Definition: The SparkSession is the entry point to programming Spark with the Dataset and DataFrame API.
• Purpose: It replaces the older SQLContext and HiveContext in Spark 2.0 and later versions.
• To create a SparkSession, use the SparkSession.builder method (example below).
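For example, a minimal sketch (the application name is arbitrary):

from pyspark.sql import SparkSession

# Create the session, or return the existing one if it is already running
spark = SparkSession.builder \
    .appName("SparkSQLDeepDive") \
    .getOrCreate()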
2. Creating DataFrames
• Definition: A DataFrame is a distributed collection of data organized into named columns.
• Purpose: DataFrames provide a domain-specific language for structured data manipulation.
• Usage: You can create a DataFrame from a variety of data sources, including collections of data (example below).
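For example, a minimal sketch that builds a DataFrame from an in-memory collection (the rows and column names are made up):

# Create a DataFrame from a local Python collection with explicit column names
people = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Cara")], ["id", "name"])
people.printSchema()
people.show()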
REGISTERING TEMP & GLOBAL TEMP VIEWS
1. Temp View
➢ Session-scoped: Only available to the current SparkSession.
➢ It exists only for the duration of the SparkSession and is automatically dropped when the session ends.
➢ Useful for sharing data within the same session.
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE id > 1").show()
2. Global Temp View
➢ The global temporary view is available to all SparkSessions within the same Spark application and is registered in the system database global_temp.
➢ It exists until the Spark application terminates.
➢ Usage: Useful for sharing data across different sessions in the same application.
df.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()
Integration with Hive
➢ Description: Spark SQL can read data from Hive tables.
➢ Configuration: Requires a Hive metastore configuration (see the sketch after this list).
SQL queries can be run over:
• DataFrames registered as temp/global views
• External tables (Hive, JDBC, etc.) if configured
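A minimal sketch of enabling Hive support, assuming a configured Hive metastore and an existing Hive table named sales (both hypothetical here):

from pyspark.sql import SparkSession

# enableHiveSupport() connects the session to the cluster's Hive metastore
spark = SparkSession.builder \
    .appName("HiveDemo") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SELECT * FROM sales LIMIT 10").show()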
SQLContext (Legacy)
• Used in Spark 1.x for SQL functionalities.
• In Spark 2.0+, use SparkSession instead.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sparkContext)
CATALOG API
The Spark Catalog API provides methods to interact with the metadata of tables, databases, functions, and other data objects within a Spark session. This API is useful for managing and querying the structure and organization of your data.
spark.catalog.listTables()
spark.catalog.listDatabases()
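A few more calls, as a minimal sketch assuming the people temp view registered earlier:

# Inspect the columns of a registered view/table and the current database
spark.catalog.listColumns("people")
print(spark.catalog.currentDatabase())
# Drop the temp view once it is no longer needed
spark.catalog.dropTempView("people")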
Summary
• Spark SQL enables SQL-like queries over structured data
• Use spark.sql() to run queries
• Understand Temp View vs Global Temp View
• Use Catalog API to explore your tables and DBs
Tomorrow we will cover: Broadcast Variables & Accumulators in Spark for sharing data across nodes and debugging computations.