Comparison of SQL, Spark SQL, and PySpark
| Feature | SQL | Spark SQL | PySpark |
| --- | --- | --- | --- |
| Language Type | Declarative (SQL syntax) | SQL syntax + DataFrame API | Python API for Spark |
| Execution Environment | Typically runs on a single machine/server | Distributed (across multiple machines/nodes) | Distributed (across a Spark cluster, uses Spark's API) |
| Data Handling | Works with structured data in relational DBs | Works with structured and semi-structured data in distributed file systems (HDFS, S3, etc.) | Works with large datasets in distributed environments |
| Use Case | Small to medium-sized transactional systems | Big data processing, ETL, analytics, data lakes | Big data, machine learning, data science, analytics |
| Processing | Single-node processing | Distributed processing across a cluster | Distributed processing using Spark and Python |
| Optimization | Indexing, query optimization (manual) | Catalyst optimizer for automatic query optimization | Optimized query plans via Spark SQL's Catalyst |
| Abstractions | Tables and views | RDDs, DataFrames, Datasets, tables (via SQL) | DataFrames, Datasets, SQL queries in Python |
| Fault Tolerance | Limited (ACID transactions) | Fault tolerance with data replication in Spark | Inherited from Spark (RDDs, Datasets) |
| Integration | Works with traditional relational DBs | Works with Hadoop, Hive, cloud storage, and more | Integrates with the Python ecosystem (e.g., Pandas, Scikit-learn) |
| Performance | Limited for large datasets | High performance for big data with parallel execution | High performance, leverages Spark's distributed power |
| Library Support | Limited to DB-specific functions | SQL functions, UDFs, window functions, etc. | MLlib for machine learning, GraphX, Pandas, NumPy |
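Both Spark SQL and PySpark route queries through the Catalyst optimizer noted in the Optimization row. A minimal sketch of inspecting the plans Catalyst produces for a DataFrame query (assuming an existing DataFrame df with salary and department columns):
# PySpark: show the parsed, analyzed, optimized, and physical plans for a query
df.filter(df['salary'] > 50000).groupBy("department").count().explain(True)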
Summary:
SQL is ideal for working with traditional relational databases and
small-to-medium-sized datasets where performance isn’t impacted
by single-node limitations.
Spark SQL is suited for big data processing and works well in
distributed environments, allowing you to run SQL queries over
large datasets spread across a cluster.
PySpark is the Python API for Spark, providing a more Pythonic
interface to leverage Spark’s power for distributed data processing
and analytics, and it includes machine learning capabilities through
MLlib.
A list of SQL, Spark SQL, and PySpark practice examples, covering basic to advanced operations.
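The PySpark snippets below assume an active SparkSession named spark. A minimal sketch of creating one locally (the application name is arbitrary):
# PySpark: create the SparkSession assumed by the examples that follow
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("practice-examples").getOrCreate()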
1. Loading Data
-- SQL / Spark SQL:
SELECT * FROM employees;
-- PySpark:
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df.show()
2. Basic SELECT Query
SELECT name, age FROM employees WHERE age > 30;
-- PySpark:
df.filter(df['age'] > 30).select('name', 'age').show()
3. COUNT Aggregation
SELECT COUNT(*) FROM employees;
-- PySpark:
df.count()
4. SUM Aggregation
SELECT SUM(salary) FROM employees;
-- PySpark:
df.agg({"salary": "sum"}).show()
5. AVG Aggregation
SELECT AVG(salary) FROM employees;
-- PySpark:
df.agg({"salary": "avg"}).show()
6. GROUP BY Clause
SELECT department, COUNT(*) FROM employees GROUP BY department;
-- PySpark:
df.groupBy("department").count().show()
7. HAVING Clause
SELECT department, AVG(salary)
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
-- PySpark:
df.groupBy("department").avg("salary").filter("avg(salary) >
50000").show()
8. JOIN Operation
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;
-- PySpark:
df_employees = spark.read.csv("employees.csv", header=True,
inferSchema=True)
df_departments = spark.read.csv("departments.csv", header=True,
inferSchema=True)
df_employees.join(df_departments, df_employees.department_id ==
df_departments.department_id).select('name',
'department_name').show()
9. INNER JOIN
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;
-- PySpark:
df_employees.join(df_departments, df_employees.department_id ==
df_departments.department_id, "inner").show()
10. LEFT JOIN
SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;
-- PySpark:
df_employees.join(df_departments, df_employees.department_id ==
df_departments.department_id, "left").show()
11. RIGHT JOIN
SELECT e.name, d.department_name
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.department_id;
-- PySpark:
df_employees.join(df_departments, df_employees.department_id ==
df_departments.department_id, "right").show()
12. FULL OUTER JOIN
SELECT e.name, d.department_name
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.department_id;
-- PySpark:
df_employees.join(df_departments, df_employees.department_id ==
df_departments.department_id, "outer").show()
13. CONCAT Function (Concatenate Strings)
SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM
employees;
-- PySpark:
from pyspark.sql.functions import concat, lit
df.withColumn("full_name", concat(df['first_name'], lit(" "), df['last_name'])).show()
14. CASE WHEN (Conditional Logic)
SELECT name,
CASE WHEN salary > 50000 THEN 'High' ELSE 'Low' END AS
salary_group
FROM employees;
-- PySpark:
from pyspark.sql.functions import when
df.withColumn("salary_group", when(df['salary'] > 50000,
'High').otherwise('Low')).show()
15. Subqueries
SELECT employee_id
FROM employees
WHERE salary = (SELECT MAX(salary) FROM employees);
-- PySpark:
max_salary = df.agg({"salary": "max"}).collect()[0][0]
df.filter(df['salary'] == max_salary).select('employee_id').show()
16. Using EXISTS with Subquery
SELECT name FROM employees e
WHERE EXISTS (SELECT * FROM departments d WHERE e.department_id =
d.department_id);
-- PySpark:
# left_semi join mirrors SQL EXISTS (each employee row is returned at most once)
df_employees.join(df_departments, df_employees.department_id == df_departments.department_id, "left_semi").select("name").show()
17. WINDOW Functions (ROW_NUMBER)
SELECT name, salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS rank
FROM employees;
-- PySpark:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window_spec = Window.orderBy(df['salary'].desc())
df.withColumn('rank', row_number().over(window_spec)).show()
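Window functions are usually partitioned as well as ordered; a sketch of the same ranking restarted within each department (assuming a department column exists):
# PySpark: row numbers restart for each department
window_spec = Window.partitionBy("department").orderBy(df['salary'].desc())
df.withColumn('rank', row_number().over(window_spec)).show()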
18. Add a New Column
SELECT *, age + 1 AS age_in_next_year FROM employees;
-- PySpark:
df.withColumn("age_in_next_year", df['age'] + 1).show()
19. Drop a Column
ALTER TABLE employees DROP COLUMN age;
-- PySpark:
df.drop("age").show()
20. Rename a Column
SELECT first_name AS name FROM employees;
-- PySpark:
df.withColumnRenamed("first_name", "name").show()
21. DISTINCT
SELECT DISTINCT department FROM employees;
-- PySpark:
df.select("department").distinct().show()
22. LIMIT Clause
SELECT * FROM employees LIMIT 10;
-- PySpark:
df.limit(10).show()
23. IS NULL
SELECT * FROM employees WHERE age IS NULL;
-- PySpark:
df.filter(df['age'].isNull()).show()
24. Filtering with LIKE
SELECT * FROM employees WHERE name LIKE 'J%';
-- PySpark:
df.filter(df['name'].like('J%')).show()
25. Using BETWEEN for Range
SELECT * FROM employees WHERE age BETWEEN 30 AND 40;
-- PySpark:
df.filter(df['age'].between(30, 40)).show()
26. Group By with Aggregation
SELECT department, AVG(salary)
FROM employees
GROUP BY department;
-- PySpark:
df.groupBy("department").avg("salary").show()
27. Filtering Aggregated Data
SELECT department, AVG(salary)
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
-- PySpark:
df.groupBy("department").agg({"salary": "avg"}).filter("avg(salary)
> 50000").show()
28. UNION Operation
SELECT * FROM employees
UNION
SELECT * FROM contractors;
-- PySpark:
# union() keeps duplicates; distinct() matches SQL UNION semantics
df_employees.union(df_contractors).distinct().show()
29. INTERSECT Operation
SELECT * FROM employees
INTERSECT
SELECT * FROM contractors;
-- PySpark:
df_employees.intersect(df_contractors).show()
30. EXCEPT Operation
SELECT * FROM employees
EXCEPT
SELECT * FROM contractors;
-- PySpark:
df_employees.exceptAll(df_contractors).show()
31. String Functions (UPPER, LOWER, TRIM)
SELECT UPPER(name), LOWER(name), TRIM(name) FROM employees;
-- PySpark:
from pyspark.sql.functions import upper, lower, trim
df.select(upper("name"), lower("name"), trim("name")).show()
32. Math Functions (ROUND, CEIL, FLOOR)
SELECT ROUND(salary, 2), CEIL(salary), FLOOR(salary) FROM
employees;
-- PySpark:
from pyspark.sql.functions import round, ceil, floor
df.select(round("salary", 2), ceil("salary"),
floor("salary")).show()
33. Date Functions (CURRENT_DATE, DATEDIFF)
SELECT CURRENT_DATE, DATEDIFF(CURRENT_DATE, hire_date) FROM
employees;
-- PySpark:
from pyspark.sql.functions import current_date, datediff
df.select(current_date(), datediff(current_date(),
df['hire_date'])).show()
34. Handling NULL Values (COALESCE)
SELECT COALESCE(salary, 0) FROM employees;
-- PySpark:
from pyspark.sql.functions import coalesce, lit
df.select(coalesce(df['salary'], lit(0))).show()
35. Ranking Functions (RANK, DENSE_RANK)
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rank
FROM employees;
-- PySpark:
from pyspark.sql.functions import rank
from pyspark.sql.window import Window
window_spec = Window.orderBy(df['salary'].desc())
df.withColumn("rank", rank().over(window_spec)).show()
36. Advanced Aggregation (GROUP_CONCAT)
SELECT department, GROUP_CONCAT(name) FROM employees GROUP BY
department;
-- PySpark:
from pyspark.sql.functions import collect_list
df.groupBy("department").agg(collect_list("name")).show()
37. CREATE Table
CREATE TABLE new_employees (
employee_id INT,
name VARCHAR(100),
department_id INT,
salary DECIMAL
);
-- PySpark:
df = spark.createDataFrame([(1, "John", 101, 50000)],
["employee_id", "name", "department_id", "salary"])
df.show()
38. INSERT Data into Table
INSERT INTO employees (employee_id, name, department_id, salary)
VALUES (1, 'John', 101, 50000);
-- PySpark:
new_data = [(1, "John", 101, 50000)]
new_df = spark.createDataFrame(new_data, ["employee_id", "name",
"department_id", "salary"])
df = df.union(new_df)
df.show()
39. ALTER Table to Add Column
ALTER TABLE employees ADD COLUMN age INT;
-- PySpark:
from pyspark.sql.functions import lit
df = df.withColumn("age", lit(None).cast("int"))
df.show()
40. Removing Duplicate Rows
SELECT DISTINCT * FROM employees;
-- PySpark:
df.dropDuplicates().show()
41. Calculating Percentiles
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS
median_salary FROM employees;
-- PySpark:
from pyspark.sql.functions import expr
df.select(expr('percentile_approx(salary, 0.5)').alias('median_salary')).show()
42. Create Temporary Views (SQL Queries in Spark SQL)
-- Spark SQL:
CREATE OR REPLACE TEMPORARY VIEW employees_view AS SELECT * FROM
employees;
-- PySpark:
df.createOrReplaceTempView("employees")
spark.sql("SELECT * FROM employees").show()
43. Regular Expressions
SELECT * FROM employees WHERE name REGEXP '^[J].*';
-- PySpark:
df.filter(df['name'].rlike('^[J].*')).show()
44. Merge/Upsert Data (Delta Lake in PySpark)
-- SQL / Spark SQL (Delta Lake):
MERGE INTO employees AS e
USING updates AS u
ON e.employee_id = u.employee_id
WHEN MATCHED THEN UPDATE SET e.salary = u.salary
WHEN NOT MATCHED THEN INSERT (employee_id, salary) VALUES
(u.employee_id, u.salary);
-- PySpark (Delta Lake):
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/path/to/delta-table")
(delta_table.alias("e")
    .merge(updates.alias("u"), "e.employee_id = u.employee_id")
    .whenMatchedUpdate(set={"salary": "u.salary"})
    .whenNotMatchedInsert(values={"employee_id": "u.employee_id", "salary": "u.salary"})
    .execute())
45. Pivoting Data
SELECT department,
SUM(CASE WHEN gender = 'M' THEN salary ELSE 0 END) AS
male_salary,
SUM(CASE WHEN gender = 'F' THEN salary ELSE 0 END) AS
female_salary
FROM employees
GROUP BY department;
-- PySpark:
df.groupBy("department").pivot("gender").sum("salary").show()
46. Data Skipping Optimization
-- Spark SQL:
SET spark.sql.files.maxPartitionBytes=134217728;
-- PySpark:
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728)
47. Save Data to Parquet Format
-- Spark SQL:
CREATE TABLE employees_parquet USING parquet AS SELECT * FROM
employees;
-- PySpark:
df.write.parquet("employees.parquet")
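To verify the write, the Parquet output can be read back into a DataFrame (same path as above):
# PySpark: read the Parquet files written above
spark.read.parquet("employees.parquet").show()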
48. Caching Data
-- Spark SQL:
CACHE TABLE employees;
-- PySpark:
df.cache()
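A cached DataFrame occupies executor memory until it is released; a minimal counterpart for clearing the cache:
# PySpark: release the cached DataFrame when it is no longer needed
df.unpersist()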
49. Broadcast Join (Handling Large Datasets)
-- Spark SQL:
SELECT /*+ BROADCAST(d) */ e.name, d.department_name
FROM employees e JOIN departments d ON e.department_id =
d.department_id;
-- PySpark:
from pyspark.sql.functions import broadcast
df_employees.join(broadcast(df_departments), df_employees.department_id == df_departments.department_id).show()
50. DataFrame Operations with Functions
# PySpark:
from pyspark.sql.functions import col
df.filter(col('salary') > 50000).show()