Question PySpark Query SQL Query
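
The PySpark snippets below assume an active SparkSession named spark and an existing DataFrame df (or df1, df2, transactions, etc. where referenced). A minimal setup sketch; the table and column names here are illustrative assumptions, not part of the original sheet:

    from pyspark.sql import SparkSession

    # Assumed local session and a small sample table for trying the snippets.
    spark = SparkSession.builder.appName("pyspark_sql_cheatsheet").getOrCreate()
    df = spark.createDataFrame(
        [("Alice", 34, "HR", 50000), ("Bob", 28, "IT", 60000), ("Cara", 41, "IT", 75000)],
        ["name", "age", "department", "salary"],
    )
    df.createOrReplaceTempView("table_name")  # lets spark.sql() queries refer to table_name
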
Question: Select all columns from a table.
PySpark: df.select("*").show()
SQL: SELECT * FROM table_name;

Question: Select specific columns (e.g., name, age) from a table.
PySpark: df.select("name", "age").show()
SQL: SELECT name, age FROM table_name;

Question: Filter rows where age is greater than 30.
PySpark: df.filter(df.age > 30).show()
SQL: SELECT * FROM table_name WHERE age > 30;

Question: Count the number of rows in a table.
PySpark: df.count()
SQL: SELECT COUNT(*) FROM table_name;

Question: Group by a column (e.g., department) and count the number of rows.
PySpark: df.groupBy("department").count().show()
SQL: SELECT department, COUNT(*) FROM table_name GROUP BY department;

Question: Calculate the average of a column (e.g., salary).
PySpark:
    from pyspark.sql.functions import avg
    df.select(avg("salary")).show()
SQL: SELECT AVG(salary) FROM table_name;

Question: Join two tables (df1 and df2) on a common column (e.g., id).
PySpark: df1.join(df2, "id", "inner").show()
SQL: SELECT * FROM table1 INNER JOIN table2 ON table1.id = table2.id;

Question: Perform a left join on two tables.
PySpark: df1.join(df2, "id", "left").show()
SQL: SELECT * FROM table1 LEFT JOIN table2 ON table1.id = table2.id;

Question: Find duplicate rows based on a column (e.g., email).
PySpark: df.groupBy("email").count().filter("count > 1").show()
SQL: SELECT email, COUNT(*) FROM table_name GROUP BY email HAVING COUNT(*) > 1;

Question: Rank rows based on a column (e.g., salary) using window functions.
PySpark:
    from pyspark.sql.window import Window
    from pyspark.sql.functions import rank
    window = Window.orderBy("salary")
    df.withColumn("rank", rank().over(window)).show()
SQL: SELECT *, RANK() OVER (ORDER BY salary) AS rank FROM table_name;

Question: Calculate cumulative sum of a column (e.g., sales).
PySpark:
    from pyspark.sql.window import Window
    from pyspark.sql.functions import sum
    window = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    df.withColumn("cumulative_sum", sum("sales").over(window)).show()
SQL: SELECT date, sales, SUM(sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum FROM table_name;

Question: Pivot a table to transform rows into columns (e.g., year as columns).
PySpark:
    from pyspark.sql.functions import sum
    df.groupBy("product").pivot("year").agg(sum("sales")).show()
SQL: SELECT product, SUM(CASE WHEN year = 2021 THEN sales END) AS "2021", SUM(CASE WHEN year = 2022 THEN sales END) AS "2022" FROM table_name GROUP BY product;

Question: Find the third highest value in a column (e.g., salary).
PySpark:
    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, desc, col
    window = Window.orderBy(desc("salary"))
    df.withColumn("row_num", row_number().over(window)).filter(col("row_num") == 3).show()
SQL:
    WITH RankedSalaries AS (
        SELECT salary,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
        FROM employees
    )
    SELECT salary
    FROM RankedSalaries
    WHERE dense_rank = 3;

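Note that the PySpark snippet uses row_number() while the SQL uses DENSE_RANK(); with tied salaries they can return different rows. A sketch that mirrors the SQL's tie handling, assuming the same df:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import dense_rank, desc, col

    # dense_rank gives tied salaries the same rank, so rank 3 is the third distinct salary.
    window = Window.orderBy(desc("salary"))
    df.withColumn("dr", dense_rank().over(window)).filter(col("dr") == 3).show()
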
Question: Calculate the difference between two columns (e.g., revenue - cost).
PySpark: df.withColumn("profit", df.revenue - df.cost).show()
SQL: SELECT revenue - cost AS profit FROM table_name;

Question: Filter rows where a column value is in a list (e.g., id in [1, 2, 3]).
PySpark: df.filter(df.id.isin([1, 2, 3])).show()
SQL: SELECT * FROM table_name WHERE id IN (1, 2, 3);

Question: Find the top N rows based on a column (e.g., salary).
PySpark:
    from pyspark.sql.functions import desc
    df.orderBy(desc("salary")).limit(5).show()
SQL: SELECT * FROM table_name ORDER BY salary DESC LIMIT 5;

Question: Replace null values in a column with a default value (e.g., 0).
PySpark: df.na.fill(0, subset=["column_name"]).show()
SQL: SELECT COALESCE(column_name, 0) FROM table_name;

Question: Concatenate two columns (e.g., first_name and last_name).
PySpark:
    from pyspark.sql.functions import concat
    df.withColumn("full_name", concat(df.first_name, df.last_name)).show()
SQL: SELECT CONCAT(first_name, last_name) AS full_name FROM table_name;

Question: Extract year from a date column.
PySpark:
    from pyspark.sql.functions import year
    df.withColumn("year", year("date_column")).show()
SQL: SELECT EXTRACT(YEAR FROM date_column) AS year FROM table_name;

Question: Calculate the percentage of total for each row (e.g., sales).
PySpark:
    from pyspark.sql.window import Window
    from pyspark.sql.functions import sum
    window = Window.partitionBy()
    df.withColumn("percentage", (df.sales / sum("sales").over(window)) * 100).show()
SQL: SELECT sales, (sales / SUM(sales) OVER ()) * 100 AS percentage FROM table_name;

Question: Find the nth highest salary department-wise.
PySpark:
    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, desc, col
    window = Window.partitionBy("department").orderBy(desc("salary"))
    # n is the desired rank (e.g., n = 3)
    df.withColumn("row_num", row_number().over(window)).filter(col("row_num") == n).show()
SQL:
    WITH RankedSalaries AS (
        SELECT department, salary,
               ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num
        FROM table_name
    )
    SELECT * FROM RankedSalaries WHERE row_num = n;

Question: Use a CTE to find employees with salary greater than the average salary.
PySpark:
    from pyspark.sql.functions import avg
    avg_salary = df.select(avg("salary")).collect()[0][0]
    df.filter(df.salary > avg_salary).show()
SQL:
    WITH AvgSalary AS (
        SELECT AVG(salary) AS avg_salary FROM table_name
    )
    SELECT * FROM table_name WHERE salary > (SELECT avg_salary FROM AvgSalary);

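The PySpark version above brings the average back to the driver with collect(); a sketch that keeps the comparison inside Spark using a global window, assuming the same df:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import avg, col

    # Compute the overall average as a window aggregate, then filter against it.
    w = Window.partitionBy()  # one global partition
    (df.withColumn("avg_salary", avg("salary").over(w))
       .filter(col("salary") > col("avg_salary"))
       .drop("avg_salary")
       .show())
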
Question: Find duplicates using a subquery.
PySpark:
    df.createOrReplaceTempView("temp_table")
    spark.sql("SELECT * FROM temp_table WHERE (name, age) IN (SELECT name, age FROM temp_table GROUP BY name, age HAVING COUNT(*) > 1)").show()
SQL: SELECT * FROM table_name WHERE (name, age) IN (SELECT name, age FROM table_name GROUP BY name, age HAVING COUNT(*) > 1);

Question: Find the 3rd highest salary in each department using a window function.
PySpark:
    from pyspark.sql.window import Window
    from pyspark.sql.functions import dense_rank, desc, col
    window = Window.partitionBy("department").orderBy(desc("salary"))
    df.withColumn("rank", dense_rank().over(window)).filter(col("rank") == 3).show()
SQL:
    WITH RankedSalaries AS (
        SELECT department, salary,
               DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
        FROM table_name
    )
    SELECT * FROM RankedSalaries WHERE rank = 3;

Question: Query to obtain the third transaction of every user.
SQL:
    WITH cte AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_date) AS row_num
        FROM transactions
    )
    SELECT user_id, spend, transaction_date
    FROM cte
    WHERE row_num = 3;

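This entry lists only the SQL; a hedged PySpark equivalent, assuming a transactions DataFrame with user_id, spend, and transaction_date columns:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, col

    # Number each user's transactions by date and keep the third one.
    w = Window.partitionBy("user_id").orderBy("transaction_date")
    (transactions.withColumn("row_num", row_number().over(w))
        .filter(col("row_num") == 3)
        .select("user_id", "spend", "transaction_date")
        .show())
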
Question: Find the second highest salary among all employees.
SQL:
    WITH cte AS (
        SELECT salary,
               ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num
        FROM employee
    )
    SELECT salary AS second_highest_salary
    FROM cte
    WHERE row_num = 2;

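A PySpark sketch for the same question, assuming an employee DataFrame with a salary column:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, desc, col

    # Rank salaries in descending order and keep the second row.
    w = Window.orderBy(desc("salary"))
    (employee.withColumn("row_num", row_number().over(w))
        .filter(col("row_num") == 2)
        .select(col("salary").alias("second_highest_salary"))
        .show())
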
Question: Tweets' rolling averages (3-day rolling average of tweet count per user).
SQL:
    SELECT user_id, tweet_date,
           ROUND(AVG(tweet_count) OVER (
               PARTITION BY user_id
               ORDER BY tweet_date
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ), 2) AS rolling_avg_3d
    FROM tweets;
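
A PySpark sketch of the same rolling average, assuming a tweets DataFrame with user_id, tweet_date, and tweet_count columns:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import avg, round as spark_round

    # Rolling average per user: the current row plus the two preceding rows by date.
    w = (Window.partitionBy("user_id")
         .orderBy("tweet_date")
         .rowsBetween(-2, Window.currentRow))
    (tweets.withColumn("rolling_avg_3d", spark_round(avg("tweet_count").over(w), 2))
        .select("user_id", "tweet_date", "rolling_avg_3d")
        .show())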