Must-Know PySpark Coding Before Your Next
Databricks Interview
Document by – Siddhartha Subudhi
Visit my LinkedIn profile
1. Find the second highest salary in a DataFrame using PySpark.
Scenario: You have a DataFrame of employee salaries and want to find the second highest salary.
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank
windowSpec = Window.orderBy(col("salary").desc())
df_with_rank = df.withColumn("rank", dense_rank().over(windowSpec))
second_highest_salary = df_with_rank.filter(col("rank") == 2).select("salary")
second_highest_salary.show()
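If salaries can tie, dense_rank keeps every employee at the second-highest distinct salary. To pull just the scalar value back to the driver, a minimal sketch reusing the result above:
second_highest = second_highest_salary.distinct().first()["salary"]  # assumes at least two distinct salaries exist
print(second_highest)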
2. Count the number of null values in each column of a PySpark DataFrame.
Scenario: Given a DataFrame, identify how many null values each column contains.
from pyspark.sql.functions import col, isnan, when, count
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
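Note that isnan only applies to float/double columns, so the one-liner above can fail on string or date columns. A type-aware sketch that checks isnan only for numeric columns:
from pyspark.sql.functions import col, isnan, when, count
from pyspark.sql.types import DoubleType, FloatType
df.select([
    count(when(isnan(c) | col(c).isNull(), c)).alias(c)
    if isinstance(df.schema[c].dataType, (DoubleType, FloatType))  # isnan is defined only for float/double
    else count(when(col(c).isNull(), c)).alias(c)
    for c in df.columns
]).show()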
3. Calculate the moving average over a window of 3 rows.
Scenario: For a stock price dataset, calculate a moving average over the last 3 days.
from pyspark.sql import Window
from pyspark.sql.functions import avg
windowSpec = Window.orderBy("date").rowsBetween(-2, 0)
df_with_moving_avg = df.withColumn("moving_avg", avg("price").over(windowSpec))
df_with_moving_avg.show()
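Because the window has no partitionBy, Spark evaluates it in a single task (and logs a warning). For per-symbol moving averages, partition the window first; a sketch assuming a hypothetical "ticker" column:
from pyspark.sql import Window
from pyspark.sql.functions import avg
windowSpec = Window.partitionBy("ticker").orderBy("date").rowsBetween(-2, 0)  # "ticker" is an assumed column
df_with_moving_avg = df.withColumn("moving_avg", avg("price").over(windowSpec))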
4. Remove duplicate rows based on a subset of columns in a PySpark DataFrame.
Scenario: You need to remove duplicates from a DataFrame based on certain columns.
df = df.dropDuplicates(["column1", "column2"])
df.show()
5. Split a single column with comma-separated values into multiple columns.
Scenario: Your DataFrame contains a column with comma-separated values. You want to split this into multiple
columns.
from pyspark.sql.functions import split
df_split = df.withColumn("new_column1", split(df["column"], ",").getItem(0)) \
.withColumn("new_column2", split(df["column"], ",").getItem(1))
df_split.show()
6. Group data by a specific column and calculate the sum of another column.
Scenario: Group sales data by "product" and calculate the total sales.
df.groupBy("product").sum("sales").show()
7. Join two DataFrames on a specific condition.
Scenario: You have two DataFrames: one for customer data and one for orders. Join these DataFrames on the
customer ID.
df_joined = df_customers.join(df_orders, df_customers.customer_id == df_orders.customer_id, "inner")
df_joined.show()
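Joining on an expression keeps both customer_id columns in the output. If the key has the same name on both sides, passing the column name instead deduplicates it; a minimal alternative sketch:
df_joined = df_customers.join(df_orders, on="customer_id", how="inner")
df_joined.show()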
8. Create a new column based on conditions from existing columns.
Scenario: Add a new column "category" that assigns "high", "medium", or "low" based on the value of the "sales"
column.
from pyspark.sql.functions import when
df = df.withColumn("category", when(df.sales > 500, "high")
.when((df.sales <= 500) & (df.sales > 200), "medium")
.otherwise("low"))
df.show()
9. Calculate the percentage contribution of each value in a column to the total.
Scenario: For a sales dataset, calculate the percentage contribution of each product's sales to the total sales.
from pyspark.sql.functions import sum, col
total_sales = df.agg(sum("sales").alias("total_sales")).collect()[0]["total_sales"]
df = df.withColumn("percentage", (col("sales") / total_sales) * 100)
df.show()
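The collect() call pulls the total to the driver. The same result can stay entirely in Spark by computing the total over an unpartitioned window; a minimal sketch:
from pyspark.sql import Window
from pyspark.sql.functions import sum as spark_sum, col
total_window = Window.partitionBy()  # empty partitionBy = one global window over all rows
df = df.withColumn("percentage", col("sales") / spark_sum("sales").over(total_window) * 100)
df.show()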
10. Find the top N records from a DataFrame based on a column.
Scenario: You need to find the top 5 highest-selling products.
from pyspark.sql.functions import col
df.orderBy(col("sales").desc()).limit(5).show()
11. Write PySpark code to pivot a DataFrame.
Scenario: You have sales data by "year" and "product", and you want to pivot the table to show "product" sales by
year.
df_pivot = df.groupBy("product").pivot("year").sum("sales")
df_pivot.show()
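If the pivot values are known up front, listing them skips the extra job Spark runs to discover the distinct years; a sketch assuming the data covers 2021 and 2022 stored as strings (use integer literals if "year" is numeric):
df_pivot = df.groupBy("product").pivot("year", ["2021", "2022"]).sum("sales")
df_pivot.show()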
12. Add row numbers to a PySpark DataFrame based on a specific ordering.
Scenario: Add row numbers to a DataFrame ordered by "sales" in descending order.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col
windowSpec = Window.orderBy(col("sales").desc())
df_with_row_number = df.withColumn("row_number", row_number().over(windowSpec))
df_with_row_number.show()
13. Filter rows based on a condition.
Scenario: You want to filter only those customers who made purchases over ₹1000.
df_filtered = df.filter(df.purchase_amount > 1000)
df_filtered.show()
14. Flatten a JSON column in PySpark.
Scenario: Your DataFrame contains a JSON column, and you want to extract specific fields from it.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
StructField("name", StringType(), True),
StructField("age", StringType(), True)
])
df = df.withColumn("json_data", from_json(col("json_column"), schema))
df.select("json_data.name", "json_data.age").show()
15. Convert a PySpark DataFrame column to a list.
Scenario: Convert a column from your DataFrame into a list for further processing.
column_list = df.select("column_name").rdd.flatMap(lambda x: x).collect()
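An equivalent that stays on the DataFrame API is a list comprehension over collect(); either way the column is pulled to the driver, so this only suits modest result sizes:
column_list = [row["column_name"] for row in df.select("column_name").collect()]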
16. Handle NULL values by replacing them with a default value.
Scenario: Replace all NULL values in the "sales" column with 0.
df = df.na.fill({"sales": 0})
df.show()
17. Perform a self-join on a PySpark DataFrame.
Scenario: You have a hierarchy of employees and want to find each employee's manager.
df_self_join = df.alias("e1").join(df.alias("e2"), col("e1.manager_id") == col("e2.employee_id"), "inner") \
.select(col("e1.employee_name"), col("e2.employee_name").alias("manager_name"))
df_self_join.show()
18. Write PySpark code to unpivot a DataFrame.
Scenario: You have a DataFrame with "year" columns and want to convert them to rows.
df_unpivot = df.selectExpr("id", "stack(2, '2021', sales_2021, '2022', sales_2022) as (year, sales)")
df_unpivot.show()
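On Spark 3.4+, the built-in unpivot (also aliased as melt) does the same reshape without hand-writing the stack expression; a minimal sketch:
df_unpivot = df.unpivot("id", ["sales_2021", "sales_2022"], "year", "sales")
df_unpivot.show()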
19. Write a PySpark code to group data based on multiple columns and calculate aggregate functions.
Scenario: Group data by "product" and "region" and calculate the average sales for each group.
df.groupBy("product", "region").agg({"sales": "avg"}).show()
20. Write PySpark code to remove fully duplicated rows.
Scenario: You want to remove rows that are exact duplicates across all columns.
df_cleaned = df.dropDuplicates()
df_cleaned.show()
21. Write PySpark code to read a CSV file and infer its schema.
Scenario: You need to load a CSV file into a DataFrame, ensuring the schema is inferred.
df = spark.read.option("header", "true").option("inferSchema", "true").csv("path_to_csv")
df.show()
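Schema inference costs an extra pass over the file; for large or frequently read CSVs, supplying an explicit schema is a common alternative. A sketch with a hypothetical two-column layout:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
schema = StructType([
    StructField("product", StringType(), True),   # assumed column
    StructField("sales", DoubleType(), True)      # assumed column
])
df = spark.read.option("header", "true").schema(schema).csv("path_to_csv")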
22. Write PySpark code to merge multiple small files into a single file.
Scenario: You have multiple small files in HDFS, and you want to consolidate them into one large file.
df.coalesce(1).write.mode("overwrite").csv("output_path")
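coalesce(1) funnels the whole dataset through a single task, which is fine for small outputs but can bottleneck larger ones; repartition(1) adds a shuffle but keeps the upstream work parallel. A minimal alternative sketch:
df.repartition(1).write.mode("overwrite").option("header", "true").csv("output_path")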
23. Write PySpark code to calculate the cumulative sum of a column.
Scenario: You want to calculate a cumulative sum of sales in your DataFrame.
from pyspark.sql.window import Window
from pyspark.sql.functions import sum
windowSpec = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)
df_with_cumsum = df.withColumn("cumulative_sum", sum("sales").over(windowSpec))
df_with_cumsum.show()
24. Write PySpark code to find outliers in a dataset.
Scenario: Detect outliers in the "sales" column based on the 1.5 * IQR rule.
from pyspark.sql.functions import col
q1 = df.approxQuantile("sales", [0.25], 0.01)[0]
q3 = df.approxQuantile("sales", [0.75], 0.01)[0]
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
df_outliers = df.filter((col("sales") < lower_bound) | (col("sales") > upper_bound))
df_outliers.show()
25. Write PySpark code to convert a DataFrame to a Pandas DataFrame.
Scenario: Convert your PySpark DataFrame into a Pandas DataFrame for local processing.
pandas_df = df.toPandas()
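toPandas() collects the entire DataFrame to the driver, so it is only appropriate when the data fits in driver memory. Enabling Arrow usually speeds up the conversion; a minimal sketch using the Spark 3.x config key:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # on Spark 2.x the key is spark.sql.execution.arrow.enabled
pandas_df = df.toPandas()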