
PySpark Interview: 20 Situational STAR-Based Questions
Real-World Scenarios for Data Engineers
By Shambhav Kumar
Q1 - A PySpark job is taking 4x longer after adding a new join. What do you do?

Situation: A new team member added a join to a production job, and it now runs very slowly.

Task: Your goal is to identify the performance bottleneck and fix it.

Action: Check the join type, partitioning, and skew. Use broadcast joins or repartition as needed.

Result: Job performance improved by reducing shuffle and leveraging a broadcast join where applicable.
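
For illustration, a minimal sketch of the broadcast-join fix, assuming a large `orders` DataFrame and a small `customers` dimension table (names and paths are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast_join_demo").getOrCreate()

orders = spark.read.parquet("s3://bucket/orders/")        # large fact table (placeholder path)
customers = spark.read.parquet("s3://bucket/customers/")  # small dimension table

# Broadcasting the small side ships it to every executor and avoids
# shuffling the large table for the join.
joined = orders.join(F.broadcast(customers), on="customer_id", how="left")
```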



Q2 - A data pipeline is silently dropping some records. What would you investigate?

Situation: Downstream reports show missing rows despite no errors in the logs.

Task: Debug why records are missing from a PySpark transformation.

Action: Investigate filters, joins with nulls, and inner joins causing drops. Look for `dropDuplicates()` or incorrect `filter()` conditions.

Result: Identified a filter on a nullable column and corrected logic to preserve valid data.
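
A small sketch of the kind of bug described, assuming a DataFrame `df` with a nullable `status` column (names are illustrative):

```python
from pyspark.sql import functions as F

# This silently drops rows where status IS NULL, because NULL != 'cancelled'
# evaluates to NULL and filter() treats NULL as false.
dropped = df.filter(F.col("status") != "cancelled")

# Keep the NULL rows explicitly if they are valid records.
fixed = df.filter((F.col("status") != "cancelled") | F.col("status").isNull())
```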



Q3 - Your team receives a complaint about nulls appearing after a groupBy(). What might cause this?

Situation: After applying groupBy and aggregation, some rows have nulls in unexpected places.

Task: Trace how nulls are introduced post-aggregation.

Action: Check groupBy columns, aggregation defaults, and handling of missing keys. Validate input data.

Result: Realized some grouping keys were missing. Added pre-validation and filled nulls before aggregation.
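
A minimal sketch of filling missing grouping keys before aggregating, assuming hypothetical `region` and `amount` columns on a DataFrame `df`:

```python
from pyspark.sql import functions as F

# NULL region values would otherwise surface as a NULL group in the output.
cleaned = df.fillna({"region": "UNKNOWN"})

agg = cleaned.groupBy("region").agg(F.sum("amount").alias("total_amount"))
```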



Q4 - A job reading from a CSV file fails intermittently. What steps would you take?

Situation: A PySpark job reading from a raw CSV source sometimes fails with schema issues.

Task: Identify the root cause and make the job resilient.

Action: Enable `mode='PERMISSIVE'`, inspect corrupt records, and apply schema inference only on sample files.

Result: The job now runs without crashing and logs bad records for future review.
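
A sketch of a permissive CSV read that quarantines bad rows instead of failing; the schema, path, and column names are placeholders:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
    StructField("_corrupt_record", StringType()),  # captures malformed rows
])

df = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .schema(schema)
    .csv("s3://bucket/raw/input/")  # placeholder path
)

df.cache()  # some Spark versions require this before filtering on _corrupt_record

bad_rows = df.filter(F.col("_corrupt_record").isNotNull())
good_rows = df.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")
```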



Q5 - You see high shuffle in Spark UI. What next?

Situation: Spark UI shows large shuffle read/write stages causing slowness.

Task: Optimize the job to reduce shuffle.

Action: Repartition logically, use coalesce, leverage bucketing if applicable, and apply narrow transformations such as filters before wide ones so less data is shuffled.

Result: Shuffle reduced significantly and job runtime improved by 40%.
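
One way this can look in practice, assuming an `orders` DataFrame; the partition counts and path are illustrative:

```python
from pyspark.sql import functions as F

# Project and filter early so less data reaches the wide (shuffle) stages.
slim = (
    orders
    .select("customer_id", "amount", "order_date")
    .filter(F.col("order_date") >= "2024-01-01")
)

# Repartition by the aggregation key so related rows land together.
totals = (
    slim.repartition(200, "customer_id")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total"))
)

# coalesce() reduces output partitions without another shuffle.
totals.coalesce(16).write.mode("overwrite").parquet("s3://bucket/totals/")  # placeholder path
```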



Q6 - You're tasked to deduplicate records based on a timestamp. How would you do it?

Situation: You have multiple versions of records per key and must keep the latest one.

Task: Deduplicate using business logic.

Action: Use `row_number()` over a window ordered by timestamp, and filter where row_number == 1.

Result: Clean, deduplicated dataset preserved the latest record per entity.
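
A minimal sketch of the window-based dedup, assuming a DataFrame `df` with key column `entity_id` and timestamp column `updated_at` (names are hypothetical):

```python
from pyspark.sql import Window, functions as F

w = Window.partitionBy("entity_id").orderBy(F.col("updated_at").desc())

latest = (
    df.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)   # keep only the newest row per key
    .drop("rn")
)
```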



Q7 - A pipeline breaks because of schema evolution in a Parquet file. How do you fix it?

Situation: Upstream added new columns to Parquet files, causing downstream jobs to fail.

Task: Make your reader resilient to schema changes.

Action: Enable `mergeSchema=True` or explicitly define schema evolution logic using `selectExpr()`.

Result: Downstream job handled schema changes gracefully without errors.
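
A short sketch of both options; the path and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1: let Spark merge schemas across Parquet files with evolving columns.
df = spark.read.option("mergeSchema", "true").parquet("s3://bucket/events/")

# Option 2: pin the columns the job depends on, so new upstream columns cannot break it.
stable = df.selectExpr("event_id", "event_ts", "cast(amount as double) as amount")
```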



Q8 - You're asked to profile a new dataset before use. What's your approach?

Situation: A new data source is being ingested into the lake.

Task: Generate a quick summary for validation and schema checks.

Action: Use `df.describe()`, check `null` distribution, data types, and unique values.

Result: Flagged unexpected nulls in critical columns before integration.
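
A quick profiling pass might look like this, assuming a DataFrame `df`; the key column names are placeholders:

```python
from pyspark.sql import functions as F

# Summary statistics for all columns.
df.describe().show()

# Null count per column.
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Approximate distinct counts for key columns (cheap compared to exact counts).
df.agg(*[F.approx_count_distinct(c).alias(c) for c in ["customer_id", "country"]]).show()
```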



Q9 - Your PySpark job failed in production due to a memory error. How do you debug?

Situation: The job crashes during a `collect()` operation.

Task: Prevent driver OOM while still sampling data.

Action: Replace `collect()` with `limit()` and `toPandas()` on a sample. Use the Spark UI to trace memory usage.

Result: Fixed with smarter sampling and added monitoring for large actions.
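
A sketch of replacing the unbounded `collect()` with bounded sampling, assuming a DataFrame `df`:

```python
# rows = df.collect()  # pulls every row to the driver and risks OOM

# Inspect a bounded sample on the driver instead.
sample_pdf = df.limit(1000).toPandas()

# Or keep the work distributed and sample a fraction.
df.sample(fraction=0.01, seed=42).show(20, truncate=False)
```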



Q10 - You observe duplicate records even after `dropDuplicates()`. Why might this happen?

Situation: A colleague reports seeing duplicates after using `dropDuplicates()`.

Task: Investigate the cause of duplicates.

Action: Check whether the correct subset of columns was used. Also validate the row content; extra whitespace or inconsistent casing can make otherwise identical rows look distinct.

Result: Cleaned whitespace and used `trim()`/`lower()` to make duplicates detectable.
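
A minimal sketch of normalizing the key columns before deduplicating, assuming hypothetical `email` and `name` columns on a DataFrame `df`:

```python
from pyspark.sql import functions as F

normalized = (
    df.withColumn("email", F.lower(F.trim(F.col("email"))))
      .withColumn("name", F.lower(F.trim(F.col("name"))))
)

deduped = normalized.dropDuplicates(["email", "name"])
```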



Q11 - A stakeholder needs a pivoted report. How would you implement it in PySpark?

Situation: Monthly sales data needs to be converted into a department-wise pivot.

Task: Transform long format into wide.

Action: Use `groupBy().pivot().agg()` and make sure the pivot column has a limited number of distinct values to avoid column explosion.

Result: Clean, summarized pivot table ready for reporting.
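
A sketch of the pivot, assuming a `sales_df` with `department`, `month`, and `sales` columns (all hypothetical); passing the distinct values explicitly avoids an extra job and caps the number of output columns:

```python
from pyspark.sql import functions as F

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]  # illustrative value list

report = (
    sales_df
    .groupBy("department")
    .pivot("month", months)
    .agg(F.sum("sales"))
)
```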



Q12 - Your PySpark job is producing too many small files. What's the impact and fix?

Situation: You observe 10,000+ small Parquet files in S3.

Task: Reduce file count for performance.

Action: Use `coalesce()` or `repartition()` to control file count. Set optimal number of output partitions.

Result: Query speed improved and storage costs reduced.
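
A minimal sketch of compacting the output of a DataFrame `df`; the partition count and path are illustrative:

```python
# coalesce() lowers the number of output files without a full shuffle.
df.coalesce(32).write.mode("overwrite").parquet("s3://bucket/curated/table/")

# repartition(32) is the alternative when the data also needs rebalancing (it does shuffle).
```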



Q13 - You're importing a column with timestamps in multiple formats. How do you standardize?

Situation: Input has a mix of 'yyyy-MM-dd' and 'dd-MM-yyyy' formats.

Task: Clean and convert to consistent timestamp format.

Action: Use `to_timestamp()` with `when()` clause to detect and parse each format.

Result: Unified timestamp column with accurate parsing logic.
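
A sketch of the format detection, assuming the raw column is `event_date` and that ANSI mode is off so a non-matching `to_timestamp()` returns NULL rather than raising:

```python
from pyspark.sql import functions as F

parsed = df.withColumn(
    "event_ts",
    F.when(
        F.col("event_date").rlike(r"^\d{4}-\d{2}-\d{2}"),      # looks like yyyy-MM-dd
        F.to_timestamp("event_date", "yyyy-MM-dd"),
    ).otherwise(F.to_timestamp("event_date", "dd-MM-yyyy")),
)
```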



Q14 - You must join two datasets on a string key, but one has trailing spaces. What's your fix?

Situation: Join returns fewer records than expected.

Task: Ensure both keys match correctly.

Action: Use `trim()` or `regexp_replace()` to sanitize before join.

Result: Join returns accurate results with full record matching.
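
A minimal sketch of sanitizing the join key on both sides; the DataFrame and column names are hypothetical:

```python
from pyspark.sql import functions as F

left_clean = left_df.withColumn("product_code", F.trim(F.col("product_code")))
right_clean = right_df.withColumn("product_code", F.trim(F.col("product_code")))

joined = left_clean.join(right_clean, on="product_code", how="inner")
```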



Q15 - You're asked to mask sensitive columns before delivery. How do you do it?

Situation: You’re preparing data to share externally.

Task: Obfuscate or mask sensitive PII.

Action: Use `sha2()` for hashing or replace values using `withColumn()` + `lit()`.

Result: Delivered compliant, masked dataset without leaking personal data.
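
A sketch of hashing and redacting, assuming a DataFrame `df` with illustrative `email` and `phone` columns:

```python
from pyspark.sql import functions as F

masked = (
    df.withColumn("email_hash", F.sha2(F.col("email"), 256))  # one-way hash of the identifier
      .withColumn("phone", F.lit("REDACTED"))
      .drop("email")
)
```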



Q16 - You're seeing skewed partition sizes in Spark UI. What does it mean?

Situation: One stage takes significantly longer due to a few big partitions.

Task: Address data skew.

Action: Identify skewed keys, use salting, or apply `repartition()` on a less skewed column.

Result: Balanced partition load and reduced job duration.
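
A minimal salting sketch for a skewed join, assuming a large `large_df` skewed on `skewed_key` and a small `small_df` (names and the bucket count are illustrative):

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 10

# Spread the hot keys on the large side across salt buckets.
large_salted = large_df.withColumn(
    "salt", F.floor(F.rand(seed=42) * SALT_BUCKETS).cast("int")
)

# Replicate the small side across every salt value so all buckets can match.
small_salted = small_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = (
    large_salted.join(small_salted, on=["skewed_key", "salt"], how="inner")
    .drop("salt")
)
```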



Q17 - You need to validate that a column is always numeric. How do you check it?

Situation: You’re unsure if a string column can be cast to numeric safely.

Task: Validate and clean the data.

Action: Use `cast()` and filter out rows where cast results in null. Log failures.

Result: Clean numeric column ensured downstream type compatibility.
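
A sketch of the cast-and-check pattern, assuming a string column `amount` on a DataFrame `df` and ANSI mode off (so a failed cast yields NULL instead of an error):

```python
from pyspark.sql import functions as F

checked = df.withColumn("amount_num", F.col("amount").cast("double"))

# Rows where the source was non-null but the cast produced NULL are invalid.
invalid = checked.filter(F.col("amount_num").isNull() & F.col("amount").isNotNull())
valid = checked.filter(F.col("amount_num").isNotNull() | F.col("amount").isNull())

print(f"Rows failing numeric cast: {invalid.count()}")
```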



Q18 - You’re asked to implement an Airflow DAG for a PySpark job. What would you ensure?

Situation: Your team wants to schedule a PySpark batch in production.

Task: Make DAG reliable and maintainable.

Action: Use `BashOperator` or `SparkSubmitOperator`, add retries, and capture logs to S3 or external systems.

Result: Robust, alert-monitored DAG deployed in Airflow.
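
A minimal DAG sketch, assuming Airflow 2.4+ with the apache-airflow-providers-apache-spark package installed and a Spark connection named `spark_default`; the DAG id and application path are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="daily_pyspark_batch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},
) as dag:
    run_job = SparkSubmitOperator(
        task_id="run_pyspark_job",
        application="/opt/jobs/etl_job.py",  # placeholder path to the PySpark script
        conn_id="spark_default",
        conf={"spark.yarn.maxAppAttempts": "1"},
    )
```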



Q19 - You're integrating Hive tables with PySpark. What should you be careful about?

Situation: You’re running Spark SQL on external Hive-managed tables.

Task: Ensure compatibility and stability.

Action: Sync metastore configs, check schema compatibility, and use correct SerDe formats.

Result: Hive tables are queryable and integrated seamlessly with Spark.
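
A short sketch of enabling Hive support; the database and table names are illustrative, and `hive-site.xml` is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive_integration")
    .enableHiveSupport()   # use the shared Hive metastore as the catalog
    .getOrCreate()
)

df = spark.sql("SELECT * FROM analytics_db.orders LIMIT 10")
df.printSchema()  # confirm the schema matches downstream expectations
```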


