
Spark SQL Deep Dive

Introduction to Spark SQL
➢ Spark SQL integrates relational processing with Spark's functional programming API.

➢ It is a module for structured data processing in Apache Spark.

➢ Allows querying data via SQL as well as the DataFrame and Dataset APIs.

➢ Provides a programming interface for working with structured data.

➢ Enables execution of SQL queries on structured data using Spark's powerful engine.

Benefits of Spark SQL

• Unified Data Access: Combines SQL queries with Spark programs.

• Optimized Execution: Uses Catalyst optimizer for query optimization.

• Interoperability: Works with various data sources like Hive, Parquet, JSON, etc.
SPARK SQL ARCHITECTURE
1. User / Application / GUI
Entry point for submitting SQL queries. Could be notebooks (like Databricks), applications, JDBC clients, etc.
2. SQL Query Processor
This consists of:
➤ DataFrame Interface:
• Converts high-level SQL queries or DataFrame operations into logical plans.
• Acts as a wrapper for user inputs, providing flexibility in both SQL and programmatic APIs.
➤ Catalyst Optimizer:
• Spark’s internal query optimizer.
• Transforms SQL or DataFrame queries into optimized logical and physical plans.
• Performs predicate pushdown, constant folding, type coercion, etc.
➤ Query Interface (Catalyst + DataFrame interface):
• Entire query translation and planning process occurs here.
• This is what makes Spark SQL efficient and scalable.
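A quick way to see Catalyst at work is to print a query's plans. A minimal sketch (the data and column names are invented for illustration; the filter is written as a SQL string so the folding happens in Catalyst, not in Python):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"])

# Catalyst folds the constant expression 1 + 1 to 2 and pushes the
# filter toward the scan; compare the parsed plan with the optimized
# logical plan in the printed output.
df.filter("id > 1 + 1").explain(extended=True)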
3. Global Catalog
• Manages metadata.
• Keeps track of table names, schemas, database locations, and more. Works like a metastore (e.g., the Hive Metastore).
4. Spark SQL Core Layer
• Translates the optimized query plans into RDD operations.
• Uses Spark's Java interface (RDD API) underneath for distributed execution.
5. Data Sources / Storage Layer
• Spark can connect to various data systems using interfaces (connectors), listed in the table below.
• The query output is then returned through this same pipeline to the User/App/GUI.

Interface      Connector               Target
Interface 1    JDBC Driver             NoSQL systems like Cassandra, MongoDB
Interface 2    JDBC Driver             RDBMS like MySQL, PostgreSQL
(direct)       Spark Java Interface    HDFS, S3, Parquet, etc.
End-to-End Flow

• SQL Parser: Converts SQL queries into logical plans.
• Catalyst Optimizer: Optimizes logical plans.
• Physical Planner: Converts logical plans into physical execution plans.
• Execution Engine: Executes the physical plans.
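Each stage's output can be inspected with explain: passing extended=True prints the parsed, analyzed, and optimized logical plans followed by the physical plan. A minimal, self-contained sketch (view name and data invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-demo").getOrCreate()
spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"]) \
    .createOrReplaceTempView("demo")

# Prints one section per stage of the flow above.
spark.sql("SELECT name FROM demo WHERE id > 1").explain(extended=True)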

Key concepts
1. SparkSession
• Definition: The SparkSession is the entry point to programming Spark with the Dataset and DataFrame API.
• Purpose: It replaces the older SQLContext and HiveContext in Spark 2.0 and later versions.
• To create a SparkSession, use the SparkSession.builder method, as shown below.
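A minimal sketch (the application name is arbitrary):

from pyspark.sql import SparkSession

# getOrCreate() returns the active session if one already exists,
# otherwise it builds a new one from this configuration.
spark = SparkSession.builder \
    .appName("spark-sql-demo") \
    .getOrCreate()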

2. Creating DataFrames
• Definition: A DataFrame is a distributed collection of data organized into named columns.
• Purpose: DataFrames provide a domain-specific language for structured data manipulation.
• Usage: You can create a DataFrame from a variety of data sources, including collections of data, as in the sketch below.
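For example, from an in-memory collection (rows and column names invented for illustration; spark is the session from the previous sketch):

# A list of tuples plus a list of column names is enough
# for createDataFrame to infer a schema.
df = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob"), (3, "Cara")],
    ["id", "name"],
)
df.show()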

REGISTERING TEMP & GLOBAL TEMP VIEWS
1. Temp View

➢ Session-scoped: Only available to the current SparkSession.

➢ It exists only for the duration of the SparkSession and is automatically dropped when the session ends.

➢ Useful for sharing data within the same session.

df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE id > 1").show()

2. Global Temp View

➢ A global temporary view is available across all SparkSessions within the same Spark application and is registered under the global_temp database.
➢ It exists until the Spark application terminates.
➢ Usage: Useful for sharing data across different sessions.

df.createGlobalTempView("people")
spark.sql("SELECT * FROM global_temp.people").show()

Integration with Hive
➢ Description: Spark SQL can read data from Hive tables.
➢ Configuration: Requires Hive metastore configuration; see the sketch below.
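A minimal sketch of enabling Hive support when building the session (this assumes a Hive metastore is reachable from the cluster):

from pyspark.sql import SparkSession

# enableHiveSupport() connects the session to the Hive metastore,
# making existing Hive tables queryable via spark.sql().
spark = SparkSession.builder \
    .appName("hive-demo") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW TABLES").show()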

SQL queries can be run over:

• DataFrames registered as temp/global views
• External tables (Hive, JDBC, etc.) if configured

SQLContext (Legacy)
• Used in Spark 1.x for SQL functionalities.
• In Spark 2.0+, use SparkSession instead.
from pyspark.sql import SQLContext

# Legacy pattern: wraps an existing SparkContext (here, sparkContext).
# In Spark 2.0+, prefer SparkSession instead.
sqlContext = SQLContext(sparkContext)
CATALOG API
The Spark Catalog API provides methods to interact with the metadata of tables,
databases, functions, and other data objects within a Spark session. This API is useful
for managing and querying the structure and organization of your data.
spark.catalog.listTables()      # tables and views in the current database
spark.catalog.listDatabases()   # all databases known to the catalog
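A few more calls that are handy for exploration (a sketch; "people" refers to the temp view registered earlier, and listColumns support for temp views varies by Spark version):

spark.catalog.currentDatabase()       # name of the database currently in use
spark.catalog.listColumns("people")   # column names and types for a table/view
spark.catalog.dropTempView("people")  # explicitly remove a temp view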

Summary
• Spark SQL enables SQL-like queries over structured data
• Use spark.sql() to run queries
• Understand Temp View vs Global Temp View
• Use the Catalog API to explore your tables and DBs

Tomorrow we will cover: Broadcast Variables & Accumulators in Spark for sharing data across nodes and
debugging computations.
