PySpark Interview: 20 Situational
STAR-Based Questions
Real-World Scenarios for Data Engineers
By Shambhav Kumar
Q1 - A PySpark job is taking 4x longer after
adding a new join. What do you do?
Situation: A new team member added a join to a production
job, and it now runs very slowly.
Task: Your goal is to identify the performance bottleneck and
fix it.
Action: Check the join type, the partitioning of the join keys, and data skew in the Spark UI. Broadcast the smaller table or repartition on the join key as needed.
Result: Job performance improved by reducing shuffle and leveraging a broadcast join where applicable.
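A minimal sketch of the broadcast-join fix, assuming a large `orders` fact table joined to a small `customers` lookup (table names and paths are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("s3://bucket/orders")        # large table (hypothetical path)
customers = spark.read.parquet("s3://bucket/customers")  # small enough to broadcast

# Broadcasting the small side avoids shuffling the large table across the cluster.
joined = orders.join(F.broadcast(customers), on="customer_id", how="left")
```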
Swipe for more --->>>
Q2 - A data pipeline is silently dropping
some records. What would you investigate?
Situation: Downstream reports show missing rows despite no
errors in the logs.
Task: Debug why records are missing from a PySpark
transformation.
Action: Investigate filters, joins with nulls, and inner joins
causing drops. Look for `dropDuplicates()` or incorrect `filter()`
conditions.
Result: Identified a filter on a nullable column that silently discarded null rows and corrected the logic to preserve valid data.
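One way to locate silently dropped rows is to diff the pipeline input against its output with a left anti join; a sketch, assuming both sides share an `id` key (names and paths hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source = spark.read.parquet("s3://bucket/raw_events")       # pipeline input (hypothetical)
output = spark.read.parquet("s3://bucket/curated_events")   # pipeline output (hypothetical)

# Rows present in the source but missing from the output.
dropped = source.join(output, on="id", how="left_anti")
dropped.show(20, truncate=False)

# Compare counts to size the loss before digging into individual filters/joins.
print(source.count(), output.count(), dropped.count())
```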
Swipe for more --->>>
Q3 - Your team receives a complaint about
nulls appearing after a groupBy(). What
might cause this?
Situation: After applying groupBy and aggregation, some rows
have nulls in unexpected places.
Task: Trace how nulls are introduced post-aggregation.
Action: Check groupBy columns, aggregation defaults, and
handling of missing keys. Validate input data.
Result: Realized some grouping keys were null. Added pre-validation and filled the nulls before aggregation.
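A sketch of guarding the grouping key before aggregating, assuming a `sales` DataFrame with a nullable `region` column (names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.createDataFrame(
    [("east", 10.0), (None, 5.0), ("west", 7.5)], ["region", "amount"]
)

# Null grouping keys collapse into their own null group; label them explicitly instead.
cleaned = sales.fillna({"region": "UNKNOWN"})
summary = cleaned.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()
```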
Swipe for more --->>>
Q4 - A job reading from a CSV file fails
intermittently. What steps would you take?
Situation: A PySpark job reading from a raw CSV source
sometimes fails with schema issues.
Task: Identify the root cause and make the job resilient.
Action: Read with `mode='PERMISSIVE'` and a `columnNameOfCorruptRecord` column to capture bad rows, inspect those records, and enforce an explicit schema instead of inferring it on every run.
Result: The job now runs without crashing and logs bad
records for future review.
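A sketch of a resilient CSV read that keeps malformed rows for review instead of failing; the path and schema are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Explicit schema; the extra string column captures rows that fail to parse.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),
])

df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("s3://bucket/raw/*.csv"))   # hypothetical path

df.cache()  # Spark asks for caching before queries that touch only the corrupt column
bad_rows = df.filter(df["_corrupt_record"].isNotNull())
```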
Swipe for more --->>>
Q5 - You see high shuffle in Spark UI. What
next?
Situation: Spark UI shows large shuffle read/write stages
causing slowness.
Task: Optimize the job to reduce shuffle.
Action: Repartition on the keys you join or aggregate on, use `coalesce()` when only reducing partitions, leverage bucketing for repeated joins, and push narrow transformations (filters, projections) ahead of wide ones so less data gets shuffled.
Result: Shuffle reduced significantly and job runtime improved
by 40%.
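A small sketch of two common shuffle-reduction levers: tuning `spark.sql.shuffle.partitions` and filtering/projecting before the wide aggregation (path and column names hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Right-size the shuffle partition count for the data volume (default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")

events = spark.read.parquet("s3://bucket/events")   # hypothetical path

# Narrow transformations first: filter and project before the wide groupBy,
# so far less data crosses the shuffle boundary.
daily = (events
         .filter(F.col("event_date") >= "2024-01-01")
         .select("event_date", "user_id")
         .groupBy("event_date")
         .agg(F.countDistinct("user_id").alias("users")))
```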
Swipe for more --->>>
Q6 - You're tasked to deduplicate records
based on a timestamp. How would you do it?
Situation: You have multiple versions of records per key and
must keep the latest one.
Task: Deduplicate using business logic.
Action: Use `row_number()` over a window partitioned by the key and ordered by the timestamp descending, then keep rows where row_number == 1.
Result: Clean, deduplicated dataset preserved the latest
record per entity.
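A sketch of the window-based dedup, assuming records keyed by `entity_id` with an `updated_at` timestamp (hypothetical names and data):

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
records = spark.createDataFrame(
    [(1, "a", "2024-01-01"), (1, "b", "2024-02-01"), (2, "c", "2024-01-15")],
    ["entity_id", "payload", "updated_at"],
)

# Rank versions per key, newest first, and keep only the top row.
w = Window.partitionBy("entity_id").orderBy(F.col("updated_at").desc())
latest = (records
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn"))
latest.show()
```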
Swipe for more --->>>
Q7 - A pipeline breaks because of schema
evolution in a Parquet file. How do you fix it?
Situation: Upstream added new columns to Parquet files,
causing downstream jobs to fail.
Task: Make your reader resilient to schema changes.
Action: Enable the Parquet `mergeSchema` option or pin the expected columns explicitly with `selectExpr()`.
Result: Downstream job handled schema changes gracefully
without errors.
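A sketch of a schema-tolerant Parquet read (path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# mergeSchema reconciles files written before and after the new columns appeared.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("s3://bucket/events/"))   # hypothetical path

# Alternatively, pin the downstream contract to exactly the columns you need,
# so extra upstream columns are simply ignored.
expected = df.selectExpr("event_id", "event_ts", "user_id")
```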
Swipe for more --->>>
Q8 - You're asked to profile a new dataset
before use. What's your approach?
Situation: A new data source is being ingested into the lake.
Task: Generate a quick summary for validation and schema
checks.
Action: Use `df.describe()`, then check the per-column null distribution, data types, and distinct-value counts.
Result: Flagged unexpected nulls in critical columns before
integration.
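A quick profiling sketch: schema, summary stats, per-column null counts, and the cardinality of a key column (source path and the `id` column are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/new_source")   # hypothetical path

df.printSchema()
df.describe().show()   # basic count/mean/min/max per column

# Null count per column.
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Cardinality of a key column.
df.select(F.countDistinct("id").alias("distinct_ids")).show()
```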
Swipe for more --->>>
Q9 - Your PySpark job failed in production
due to memory error. How do you debug?
Situation: The job crashes during a `collect()` operation.
Task: Prevent driver OOM while still sampling data.
Action: Replace `collect()` with `limit()` and `toPandas()` on a
sample. Use Spark UI to trace memory usage.
Result: Fixed with smarter sampling and added monitoring for
large actions.
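A sketch of replacing the full `collect()` with a bounded sample (path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/big_table")   # hypothetical path

# Never pull the full dataset to the driver; cap the rows first.
sample_pdf = df.limit(1000).toPandas()

# For an even lighter peek, take() returns a small list of Rows.
preview = df.take(20)
```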
Swipe for more --->>>
Q10 - You observe duplicate records even
after `dropDuplicates()`. Why might this
happen?
Situation: A colleague reports seeing duplicates after using
`dropDuplicates()`.
Task: Investigate the cause of duplicates.
Action: Check whether the correct subset of columns was passed to `dropDuplicates()`. Also validate the row content; extra whitespace or inconsistent casing can make logically equal rows look distinct.
Result: Cleaned the values with `trim()`/`lower()` so the duplicates became detectable and were removed.
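A sketch of normalising the comparison columns before deduplicating (column names and sample rows are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(" Alice ", "NY"), ("alice", "NY"), ("Bob", "CA")], ["name", "state"]
)

# Normalise whitespace and casing so logically equal rows compare equal.
normalised = df.withColumn("name", F.lower(F.trim(F.col("name"))))
deduped = normalised.dropDuplicates(["name", "state"])
deduped.show()
```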
Swipe for more --->>>
Q11 - A stakeholder needs a pivoted report.
How would you implement it in PySpark?
Situation: Monthly sales data needs to be converted into a
department-wise pivot.
Task: Transform long format into wide.
Action: Use `groupBy().pivot().agg()` and pass an explicit list of pivot values so the number of output columns stays bounded.
Result: Clean, summarized pivot table ready for reporting.
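A pivot sketch, assuming monthly sales rows with a `department` column (column names and sample data are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.createDataFrame(
    [("2024-01", "toys", 100.0), ("2024-01", "books", 80.0), ("2024-02", "toys", 120.0)],
    ["month", "department", "amount"],
)

# Passing the pivot values explicitly avoids an extra pass over the data
# and keeps the number of output columns bounded.
report = (sales.groupBy("month")
          .pivot("department", ["toys", "books"])
          .agg(F.sum("amount")))
report.show()
```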
Swipe for more --->>>
Q12 - Your PySpark job is producing too
many small files. What's the impact and fix?
Situation: You observe 10,000+ small Parquet files in S3.
Task: Reduce file count for performance.
Action: Use `coalesce()` or `repartition()` before the write to control the file count, and pick an output partition count sized to the data volume.
Result: Query speed improved and storage costs reduced.
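A sketch of compacting output into a controlled number of files (paths and the partition count are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/small_files/")   # hypothetical source

# Each partition becomes one output file, so fix the partition count before writing.
(df.repartition(32)
   .write
   .mode("overwrite")
   .parquet("s3://bucket/compacted/"))

# coalesce(n) is cheaper when only reducing the partition count, since it avoids a full shuffle.
```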
Swipe for more --->>>
Q13 - You're importing a column with
timestamps in multiple formats. How do you
standardize?
Situation: Input has a mix of 'yyyy-MM-dd' and 'dd-MM-yyyy'
formats.
Task: Clean and convert to consistent timestamp format.
Action: Use `to_timestamp()` inside a `when()` expression that detects each format (for example with a regex) and parses it with the matching pattern.
Result: Unified timestamp column with accurate parsing logic.
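A sketch of the format-detection approach using a regex inside `when()` (column name and sample values are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2024-03-05",), ("05-03-2024",)], ["event_date"])

# Detect each pattern and parse with the matching format; unmatched values stay null.
parsed = df.withColumn(
    "event_ts",
    F.when(F.col("event_date").rlike(r"^\d{4}-\d{2}-\d{2}$"),
           F.to_timestamp("event_date", "yyyy-MM-dd"))
     .when(F.col("event_date").rlike(r"^\d{2}-\d{2}-\d{4}$"),
           F.to_timestamp("event_date", "dd-MM-yyyy"))
     .otherwise(F.lit(None)),
)
parsed.show(truncate=False)
```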
Swipe for more --->>>
Q14 - You must join two datasets on a string
key, but one has trailing spaces. What's your
fix?
Situation: Join returns fewer records than expected.
Task: Ensure both keys match correctly.
Action: Use `trim()` or `regexp_replace()` to sanitize the keys on both sides before joining.
Result: Join returns accurate results with full record matching.
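A minimal sketch of sanitising the key on both sides before the join (names and sample values are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([("A1 ",)], ["key"])
right = spark.createDataFrame([("A1",)], ["key"])

# Trim both sides so 'A1 ' and 'A1' match.
joined = (left.withColumn("key", F.trim("key"))
              .join(right.withColumn("key", F.trim("key")), on="key", how="inner"))
joined.show()
```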
Swipe for more --->>>
Q15 - You're asked to mask sensitive
columns before delivery. How do you do it?
Situation: You’re preparing data to share externally.
Task: Obfuscate or mask sensitive PII.
Action: Use `sha2()` for hashing or replace values using
`withColumn()` + `lit()`.
Result: Delivered compliant, masked dataset without leaking
personal data.
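A masking sketch: hashing for identifiers that must stay joinable, and a constant for fully redacted fields (column names and sample data are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
users = spark.createDataFrame(
    [("alice@example.com", "555-1234")], ["email", "phone"]
)

masked = (users
          # Hashing keeps the column joinable without exposing the raw value.
          .withColumn("email", F.sha2(F.col("email"), 256))
          # Fully redact fields that are never needed downstream.
          .withColumn("phone", F.lit("***REDACTED***")))
masked.show(truncate=False)
```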
Swipe for more --->>>
Q16 - You're seeing skewed partition sizes in
Spark UI. What does it mean?
Situation: One stage takes significantly longer due to a few
big partitions.
Task: Address data skew.
Action: Identify the skewed keys, use salting, or apply `repartition()` on a less skewed column.
Result: Balanced partition load and reduced job duration.
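A salting sketch for a skewed join, assuming a large `events` table skewed on `key` and a small `dims` lookup (all names and paths hypothetical); the small side is replicated once per salt value:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
NUM_SALTS = 8

events = spark.read.parquet("s3://bucket/events")   # large, skewed on "key"
dims = spark.read.parquet("s3://bucket/dims")       # small lookup table

# Spread each hot key across NUM_SALTS sub-keys on the large side.
salted_events = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate the small side once per salt value so every sub-key finds a match.
salted_dims = dims.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)]))
)

joined = salted_events.join(salted_dims, on=["key", "salt"]).drop("salt")
```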
Swipe for more --->>>
Q17 - You need to validate that a column is
always numeric. How do you check it?
Situation: You’re unsure if a string column can be cast to
numeric safely.
Task: Validate and clean the data.
Action: Use `cast()` and filter out rows where the cast results in null even though the original value is not null. Log the failures.
Result: Clean numeric column ensured downstream type
compatibility.
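A validation sketch: cast, then separate the rows where the cast failed (column name and sample values are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("12.5",), ("abc",), (None,)], ["amount"])

checked = df.withColumn("amount_num", F.col("amount").cast("double"))

# Cast failures: the original value was present but did not parse as a number.
bad = checked.filter(F.col("amount_num").isNull() & F.col("amount").isNotNull())
good = checked.filter(F.col("amount_num").isNotNull() | F.col("amount").isNull())

bad.show()   # log or quarantine these rows
```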
Swipe for more --->>>
Q18 - You’re asked to implement an Airflow
DAG for a PySpark job. What would you
ensure?
Situation: Your team wants to schedule a PySpark batch in
production.
Task: Make DAG reliable and maintainable.
Action: Use `BashOperator` or `SparkSubmitOperator`, add
retries, and capture logs to S3 or external systems.
Result: Robust, alert-monitored DAG deployed in Airflow.
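A sketch of an Airflow 2.x DAG around `SparkSubmitOperator` with retries, assuming the `apache-airflow-providers-apache-spark` package and a configured `spark_default` connection (DAG id, paths, and config values are hypothetical):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,                           # retry transient cluster failures
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_pyspark_batch",           # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    run_job = SparkSubmitOperator(
        task_id="run_pyspark_job",
        application="/opt/jobs/etl_job.py",  # hypothetical script path
        conn_id="spark_default",
        conf={"spark.sql.shuffle.partitions": "200"},
    )
```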
Swipe for more --->>>
Q19 - You're integrating Hive tables with
PySpark. What should you be careful about?
Situation: You’re running Spark SQL on external
Hive-managed tables.
Task: Ensure compatibility and stability.
Action: Sync the metastore configuration, check schema compatibility, and use the correct SerDe formats.
Result: Hive tables are queryable and integrated seamlessly
with Spark.
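A sketch of enabling Hive support so Spark uses the shared metastore (the table name is hypothetical):

```python
from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark use the Hive metastore configured in hive-site.xml.
spark = (SparkSession.builder
         .appName("hive_integration")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("SELECT * FROM analytics.daily_sales LIMIT 10")  # hypothetical table
df.printSchema()
```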
Swipe for more --->>>