PySpark Interview: 20 Situational
STAR-Based Questions
Real-World Scenarios for Data Engineers
By Shambhav Kumar
Q1 - A PySpark job is taking 4x longer after
adding a new join. What do you do?
Situation: A new team member added a join to a production
job, and it now runs very slowly.
Task: Your goal is to identify the performance bottleneck and
fix it.
Action: Check the join type, the partitioning of the join keys, and data skew in the Spark UI. Broadcast the smaller table or repartition on the join key as needed.
Result: Job performance improved by reducing shuffle and leveraging a broadcast join where applicable.
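A minimal sketch of the broadcast-join fix, assuming a large `orders` fact table joined to a small `customers` lookup (table names and paths are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("s3://bucket/orders")        # large table (hypothetical path)
customers = spark.read.parquet("s3://bucket/customers")  # small enough to broadcast

# Broadcasting the small side avoids shuffling the large table across the cluster.
joined = orders.join(F.broadcast(customers), on="customer_id", how="left")
```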
Swipe for more --->>>
Q2 - A data pipeline is silently dropping
some records. What would you investigate?
Situation: Downstream reports show missing rows despite no
errors in the logs.
Task: Debug why records are missing from a PySpark
transformation.
Action: Investigate filters, joins with nulls, and inner joins
causing drops. Look for `dropDuplicates()` or incorrect `filter()`
conditions.
Result: Identified a filter on a nullable column that silently discarded null rows and corrected the logic to preserve valid data.
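One way to locate silently dropped rows is to diff the pipeline input against its output with a left anti join; a sketch, assuming both sides share an `id` key (names and paths hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source = spark.read.parquet("s3://bucket/raw_events")       # pipeline input (hypothetical)
output = spark.read.parquet("s3://bucket/curated_events")   # pipeline output (hypothetical)

# Rows present in the source but missing from the output.
dropped = source.join(output, on="id", how="left_anti")
dropped.show(20, truncate=False)

# Compare counts to size the loss before digging into individual filters/joins.
print(source.count(), output.count(), dropped.count())
```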
Swipe for more --->>>
Q3 - Your team receives a complaint about
nulls appearing after a groupBy(). What
might cause this?
Situation: After applying groupBy and aggregation, some rows
have nulls in unexpected places.
Task: Trace how nulls are introduced post-aggregation.
Action: Check groupBy columns, aggregation defaults, and
handling of missing keys. Validate input data.
Result: Realized some grouping keys were null. Added pre-validation and filled the nulls before aggregation.
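A sketch of guarding the grouping key before aggregating, assuming a `sales` DataFrame with a nullable `region` column (names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.createDataFrame(
    [("east", 10.0), (None, 5.0), ("west", 7.5)], ["region", "amount"]
)

# Null grouping keys collapse into their own null group; label them explicitly instead.
cleaned = sales.fillna({"region": "UNKNOWN"})
summary = cleaned.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()
```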
Swipe for more --->>>
Q4 - A job reading from a CSV file fails
intermittently. What steps would you take?
Situation: A PySpark job reading from a raw CSV source
sometimes fails with schema issues.
Task: Identify the root cause and make the job resilient.
Action: Read with `mode='PERMISSIVE'` and a `columnNameOfCorruptRecord` column to capture bad rows, inspect those records, and enforce an explicit schema instead of inferring it on every run.
Result: The job now runs without crashing and logs bad
records for future review.
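A sketch of a resilient CSV read that keeps malformed rows for review instead of failing; the path and schema are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Explicit schema; the extra string column captures rows that fail to parse.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),
])

df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("s3://bucket/raw/*.csv"))   # hypothetical path

df.cache()  # Spark asks for caching before queries that touch only the corrupt column
bad_rows = df.filter(df["_corrupt_record"].isNotNull())
```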
Swipe for more --->>>
Q5 - You see high shuffle in Spark UI. What
next?
Situation: Spark UI shows large shuffle read/write stages
causing slowness.
Task: Optimize the job to reduce shuffle.
Action: Repartition on the keys you join or aggregate on, use `coalesce()` when only reducing partitions, leverage bucketing for repeated joins, and push narrow transformations (filters, projections) ahead of wide ones so less data gets shuffled.
Result: Shuffle reduced significantly and job runtime improved
by 40%.
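A small sketch of two common shuffle-reduction levers: tuning `spark.sql.shuffle.partitions` and filtering/projecting before the wide aggregation (path and column names hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Right-size the shuffle partition count for the data volume (default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")

events = spark.read.parquet("s3://bucket/events")   # hypothetical path

# Narrow transformations first: filter and project before the wide groupBy,
# so far less data crosses the shuffle boundary.
daily = (events
         .filter(F.col("event_date") >= "2024-01-01")
         .select("event_date", "user_id")
         .groupBy("event_date")
         .agg(F.countDistinct("user_id").alias("users")))
```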
Swipe for more --->>>
Q6 - You're tasked to deduplicate records
based on a timestamp. How would you do it?
Situation: You have multiple versions of records per key and
must keep the latest one.
Task: Deduplicate using business logic.
Action: Use `row_number()` over a window partitioned by the key and ordered by the timestamp descending, then keep rows where row_number == 1.
Result: Clean, deduplicated dataset preserved the latest
record per entity.
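A sketch of the window-based dedup, assuming records keyed by `entity_id` with an `updated_at` timestamp (hypothetical names and data):

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
records = spark.createDataFrame(
    [(1, "a", "2024-01-01"), (1, "b", "2024-02-01"), (2, "c", "2024-01-15")],
    ["entity_id", "payload", "updated_at"],
)

# Rank versions per key, newest first, and keep only the top row.
w = Window.partitionBy("entity_id").orderBy(F.col("updated_at").desc())
latest = (records
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn"))
latest.show()
```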
Swipe for more --->>>
Q7 - A pipeline breaks because of schema
evolution in a Parquet file. How do you fix it?
Situation: Upstream added new columns to Parquet files,
causing downstream jobs to fail.
Task: Make your reader resilient to schema changes.
Action: Enable the Parquet `mergeSchema` option or pin the expected columns explicitly with `selectExpr()`.
Result: Downstream job handled schema changes gracefully
without errors.
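A sketch of a schema-tolerant Parquet read (path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# mergeSchema reconciles files written before and after the new columns appeared.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("s3://bucket/events/"))   # hypothetical path

# Alternatively, pin the downstream contract to exactly the columns you need,
# so extra upstream columns are simply ignored.
expected = df.selectExpr("event_id", "event_ts", "user_id")
```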
Swipe for more --->>>
Q8 - You're asked to profile a new dataset
before use. What's your approach?
Situation: A new data source is being ingested into the lake.
Task: Generate a quick summary for validation and schema
checks.
Action: Use `df.describe()`, then check the per-column null distribution, data types, and distinct-value counts.
Result: Flagged unexpected nulls in critical columns before
integration.
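A quick profiling sketch: schema, summary stats, per-column null counts, and the cardinality of a key column (source path and the `id` column are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/new_source")   # hypothetical path

df.printSchema()
df.describe().show()   # basic count/mean/min/max per column

# Null count per column.
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Cardinality of a key column.
df.select(F.countDistinct("id").alias("distinct_ids")).show()
```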
Swipe for more --->>>
Q9 - Your PySpark job failed in production
due to memory error. How do you debug?
Situation: The job crashes during a `collect()` operation.
Task: Prevent driver OOM while still sampling data.
Action: Replace `collect()` with `limit()` and `toPandas()` on a
sample. Use Spark UI to trace memory usage.
Result: Fixed with smarter sampling and added monitoring for
large actions.
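A sketch of replacing the full `collect()` with a bounded sample (path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/big_table")   # hypothetical path

# Never pull the full dataset to the driver; cap the rows first.
sample_pdf = df.limit(1000).toPandas()

# For an even lighter peek, take() returns a small list of Rows.
preview = df.take(20)
```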
Swipe for more --->>>
Q10 - You observe duplicate records even
after `dropDuplicates()`. Why might this
happen?
Situation: A colleague reports seeing duplicates after using
`dropDuplicates()`.
Task: Investigate the cause of duplicates.
Action: Check whether the correct subset of columns was passed to `dropDuplicates()`. Also validate the row content; extra whitespace or inconsistent casing can make logically equal rows look distinct.
Result: Cleaned the values with `trim()`/`lower()` so the duplicates became detectable and were removed.
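A sketch of normalising the comparison columns before deduplicating (column names and sample rows are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(" Alice ", "NY"), ("alice", "NY"), ("Bob", "CA")], ["name", "state"]
)

# Normalise whitespace and casing so logically equal rows compare equal.
normalised = df.withColumn("name", F.lower(F.trim(F.col("name"))))
deduped = normalised.dropDuplicates(["name", "state"])
deduped.show()
```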
Swipe for more --->>>
Q11 - A stakeholder needs a pivoted report.
How would you implement it in PySpark?
Situation: Monthly sales data needs to be converted into a
department-wise pivot.
Task: Transform long format into wide.
Action: Use `groupBy().pivot().agg()` and pass an explicit list of pivot values so the number of output columns stays bounded.
Result: Clean, summarized pivot table ready for reporting.
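A pivot sketch, assuming monthly sales rows with a `department` column (column names and sample data are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.createDataFrame(
    [("2024-01", "toys", 100.0), ("2024-01", "books", 80.0), ("2024-02", "toys", 120.0)],
    ["month", "department", "amount"],
)

# Passing the pivot values explicitly avoids an extra pass over the data
# and keeps the number of output columns bounded.
report = (sales.groupBy("month")
          .pivot("department", ["toys", "books"])
          .agg(F.sum("amount")))
report.show()
```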
Swipe for more --->>>
Q12 - Your PySpark job is producing too
many small files. What's the impact and fix?
Situation: You observe 10,000+ small Parquet files in S3.
Task: Reduce file count for performance.
Action: Use `coalesce()` or `repartition()` before the write to control the file count, and pick an output partition count sized to the data volume.
Result: Query speed improved and storage costs reduced.
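A sketch of compacting output into a controlled number of files (paths and the partition count are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/small_files/")   # hypothetical source

# Each partition becomes one output file, so fix the partition count before writing.
(df.repartition(32)
   .write
   .mode("overwrite")
   .parquet("s3://bucket/compacted/"))

# coalesce(n) is cheaper when only reducing the partition count, since it avoids a full shuffle.
```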
Swipe for more --->>>
Q13 - You're importing a column with
timestamps in multiple formats. How do you
standardize?
Situation: Input has a mix of 'yyyy-MM-dd' and 'dd-MM-yyyy'
formats.
Task: Clean and convert to consistent timestamp format.
Action: Use `to_timestamp()` inside a `when()` expression that detects each format (for example with a regex) and parses it with the matching pattern.
Result: Unified timestamp column with accurate parsing logic.
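A sketch of the format-detection approach using a regex inside `when()` (column name and sample values are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2024-03-05",), ("05-03-2024",)], ["event_date"])

# Detect each pattern and parse with the matching format; unmatched values stay null.
parsed = df.withColumn(
    "event_ts",
    F.when(F.col("event_date").rlike(r"^\d{4}-\d{2}-\d{2}$"),
           F.to_timestamp("event_date", "yyyy-MM-dd"))
     .when(F.col("event_date").rlike(r"^\d{2}-\d{2}-\d{4}$"),
           F.to_timestamp("event_date", "dd-MM-yyyy"))
     .otherwise(F.lit(None)),
)
parsed.show(truncate=False)
```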
Swipe for more --->>>
Q14 - You must join two datasets on a string
key, but one has trailing spaces. What's your
fix?
Situation: Join returns fewer records than expected.
Task: Ensure both keys match correctly.
Action: Use `trim()` or `regexp_replace()` to sanitize the keys on both sides before joining.
Result: Join returns accurate results with full record matching.
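A minimal sketch of sanitising the key on both sides before the join (names and sample values are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([("A1 ",)], ["key"])
right = spark.createDataFrame([("A1",)], ["key"])

# Trim both sides so 'A1 ' and 'A1' match.
joined = (left.withColumn("key", F.trim("key"))
              .join(right.withColumn("key", F.trim("key")), on="key", how="inner"))
joined.show()
```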
Swipe for more --->>>
Q15 - You're asked to mask sensitive
columns before delivery. How do you do it?
Situation: You’re preparing data to share externally.
Task: Obfuscate or mask sensitive PII.
Action: Use `sha2()` for hashing or replace values using
`withColumn()` + `lit()`.
Result: Delivered compliant, masked dataset without leaking
personal data.
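A masking sketch: hashing for identifiers that must stay joinable, and a constant for fully redacted fields (column names and sample data are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
users = spark.createDataFrame(
    [("alice@example.com", "555-1234")], ["email", "phone"]
)

masked = (users
          # Hashing keeps the column joinable without exposing the raw value.
          .withColumn("email", F.sha2(F.col("email"), 256))
          # Fully redact fields that are never needed downstream.
          .withColumn("phone", F.lit("***REDACTED***")))
masked.show(truncate=False)
```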
Swipe for more --->>>
Q16 - You're seeing skewed partition sizes in
Spark UI. What does it mean?
Situation: One stage takes significantly longer due to a few
big partitions.
Task: Address data skew.
Action: Identify the skewed keys, use salting, or apply `repartition()` on a less skewed column.
Result: Balanced partition load and reduced job duration.
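A salting sketch for a skewed join, assuming a large `events` table skewed on `key` and a small `dims` lookup (all names and paths hypothetical); the small side is replicated once per salt value:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
NUM_SALTS = 8

events = spark.read.parquet("s3://bucket/events")   # large, skewed on "key"
dims = spark.read.parquet("s3://bucket/dims")       # small lookup table

# Spread each hot key across NUM_SALTS sub-keys on the large side.
salted_events = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate the small side once per salt value so every sub-key finds a match.
salted_dims = dims.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)]))
)

joined = salted_events.join(salted_dims, on=["key", "salt"]).drop("salt")
```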
Swipe for more --->>>
Q17 - You need to validate that a column is
always numeric. How do you check it?
Situation: You’re unsure if a string column can be cast to
numeric safely.
Task: Validate and clean the data.
Action: Use `cast()` and filter out rows where the cast results in null even though the original value is not null. Log the failures.
Result: Clean numeric column ensured downstream type
compatibility.
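A validation sketch: cast, then separate the rows where the cast failed (column name and sample values are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("12.5",), ("abc",), (None,)], ["amount"])

checked = df.withColumn("amount_num", F.col("amount").cast("double"))

# Cast failures: the original value was present but did not parse as a number.
bad = checked.filter(F.col("amount_num").isNull() & F.col("amount").isNotNull())
good = checked.filter(F.col("amount_num").isNotNull() | F.col("amount").isNull())

bad.show()   # log or quarantine these rows
```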
Swipe for more --->>>
Q18 - You’re asked to implement an Airflow
DAG for a PySpark job. What would you
ensure?
Situation: Your team wants to schedule a PySpark batch in
production.
Task: Make DAG reliable and maintainable.
Action: Use `BashOperator` or `SparkSubmitOperator`, add
retries, and capture logs to S3 or external systems.
Result: Robust, alert-monitored DAG deployed in Airflow.
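A sketch of an Airflow 2.x DAG around `SparkSubmitOperator` with retries, assuming the `apache-airflow-providers-apache-spark` package and a configured `spark_default` connection (DAG id, paths, and config values are hypothetical):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,                           # retry transient cluster failures
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_pyspark_batch",           # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    run_job = SparkSubmitOperator(
        task_id="run_pyspark_job",
        application="/opt/jobs/etl_job.py",  # hypothetical script path
        conn_id="spark_default",
        conf={"spark.sql.shuffle.partitions": "200"},
    )
```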
Swipe for more --->>>
Q19 - You're integrating Hive tables with
PySpark. What should you be careful about?
Situation: You’re running Spark SQL on external
Hive-managed tables.
Task: Ensure compatibility and stability.
Action: Sync the metastore configuration, check schema compatibility, and use the correct SerDe formats.
Result: Hive tables are queryable and integrated seamlessly
with Spark.
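A sketch of enabling Hive support so Spark uses the shared metastore (the table name is hypothetical):

```python
from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark use the Hive metastore configured in hive-site.xml.
spark = (SparkSession.builder
         .appName("hive_integration")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("SELECT * FROM analytics.daily_sales LIMIT 10")  # hypothetical table
df.printSchema()
```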
Swipe for more --->>>