PySpark Cheatsheet
1. PySpark Overview
• Definition: PySpark is the Python API for Apache Spark, an open-source,
distributed computing framework.
• Core Components:
o RDD (Resilient Distributed Dataset): Immutable distributed
collections of objects.
o DataFrame: Distributed table with named columns; optimized for SQL
queries.
o Dataset: Strongly typed, distributed data structure (available in
Scala/Java).
• Languages Supported: Python, Scala, Java, and R.
2. Core Spark Concepts
• Driver: Manages the execution of tasks across the cluster.
• Executor: Performs computations and stores data on worker nodes.
• Partition: Logical division of data for parallel processing.
• Transformations: Create a new RDD/DataFrame from an existing one (e.g.,
map, filter).
• Actions: Trigger execution of transformations and return results (e.g., count, collect); see the sketch below.
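• Example: a minimal sketch of lazy evaluation (app name and sample data are illustrative); transformations only build a plan, and nothing runs until an action is called.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
df = spark.createDataFrame([(1, 50), (2, 150)], ["id", "amount"])
filtered = df.filter(df["amount"] > 100)  # transformation: builds a plan, nothing executes yet
print(filtered.count())                   # action: triggers execution and returns 1 to the driver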
3. PySpark Architecture
• Cluster Manager:
o YARN, Mesos, or Standalone cluster.
• Execution Process:
1. Job submitted by Driver.
2. Tasks are divided into Stages at shuffle boundaries (see the .explain() sketch below).
3. Tasks run on Executors.
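• Example: a hedged sketch (sample data and app name are illustrative) showing a stage boundary; the Exchange node in the .explain() output marks the shuffle that splits the job into stages.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-demo").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
# groupBy forces a shuffle, so the physical plan contains an Exchange node,
# which is the boundary the scheduler uses when splitting the job into stages.
df.groupBy("key").count().explain()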
4. Common PySpark Operations
• Transformations:
o map: Applies a function to each element.
o filter: Filters elements based on a condition.
o groupBy: Groups data by a key.
o join: Joins two DataFrames based on a condition.
• Actions:
o show: Displays DataFrame.
o collect: Brings data to the driver.
o count: Counts the number of elements (a combined example follows this list).
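• Example: a brief sketch combining these operations (DataFrame and column names are illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ops-demo").getOrCreate()
orders = spark.createDataFrame([(1, "books", 20.0), (2, "games", 35.0)], ["id", "category", "price"])
depts = spark.createDataFrame([("books", "Books Dept"), ("games", "Games Dept")], ["category", "dept_name"])

cheap = orders.filter(orders["price"] < 30)       # transformation
joined = cheap.join(depts, "category")            # transformation
orders.groupBy("category").count().show()         # groupBy followed by an action
joined.show()                                     # action: displays rows
print(joined.count())                             # action: returns a number to the driver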
5. PySpark SQL
• Creating a Temporary View:
• df.createOrReplaceTempView("table_name")
• spark.sql("SELECT * FROM table_name")
• Common SQL Functions:
o agg: Perform aggregations.
o alias: Rename columns.
o distinct: Remove duplicates (see the sketch below).
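• Example: a hedged sketch of agg, alias, and distinct (table data and column names are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_, col

spark = SparkSession.builder.appName("sql-funcs-demo").getOrCreate()
sales = spark.createDataFrame([("east", 10), ("east", 20), ("west", 5)], ["region", "amount"])

sales.select(col("region").alias("sales_region")).distinct().show()       # alias + distinct
sales.groupBy("region").agg(sum_("amount").alias("total_amount")).show()  # agg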
6. Window Functions
• Definition: Perform operations over a window of rows.
• Types:
o Ranking: row_number, rank, dense_rank.
o Aggregations: sum, avg, max, min.
• Example:
• from pyspark.sql.window import Window
• from pyspark.sql.functions import rank
• window_spec = Window.partitionBy("col1").orderBy("col2")
• df.withColumn("rank", rank().over(window_spec))
7. DataFrame API vs. SQL API
• DataFrame API:
o Pythonic syntax.
o Example: df.filter(df["col3"] > 10).select("col1", "col2")
• SQL API:
o SQL-like syntax.
o Example: spark.sql("SELECT col1, col2 FROM table WHERE col3 > 10")
8. Persisting and Caching
• Caching: Stores data in memory for faster reuse.
• df.cache()
• Persistence: Allows control over storage levels (e.g., MEMORY_AND_DISK).
• from pyspark import StorageLevel
• df.persist(StorageLevel.DISK_ONLY)
9. Joins in PySpark
• Types of Joins:
o Inner, Left, Right, Full Outer, Semi, Anti (see the sketch below).
• Broadcast Join: Optimized join when one DataFrame is small.
• from pyspark.sql.functions import broadcast
• df = large_df.join(broadcast(small_df), "key")
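• Example: a short sketch of a few join types (data and column names are illustrative); the broadcast hint above can be combined with any of them.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()
orders = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["order_id", "cust"])
customers = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["cust", "name"])

orders.join(customers, "cust", "left").show()       # keep all orders; unmatched names are null
orders.join(customers, "cust", "left_semi").show()  # only orders that have a matching customer
orders.join(customers, "cust", "left_anti").show()  # only orders with no matching customer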
10. File Formats
• Supported Formats: CSV, JSON, Parquet, Avro, ORC.
• Reading Data:
• df = spark.read.format("csv").option("header", True).load("path")
• Writing Data:
• df.write.format("parquet").save("path")
11. Performance Optimization
• Repartitioning: Adjust the number of partitions for parallelism.
• df.repartition(10)
• Coalesce: Reduce the number of partitions without a shuffle.
• df.coalesce(1)
• Predicate Pushdown: Pushes filters down to the data source so less data is read (see the sketch below).
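• Example: a hedged sketch of predicate pushdown with Parquet (the /tmp path and sample data are placeholders); PushedFilters in the .explain() output confirms the filter is applied at the data source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()
spark.createDataFrame([(1, "2024-01-01"), (2, "2023-06-01")], ["id", "event_date"]) \
    .write.mode("overwrite").parquet("/tmp/events_demo")

df = spark.read.parquet("/tmp/events_demo")
# The comparison is pushed to the Parquet scan; look for PushedFilters in the plan output.
df.filter(df["event_date"] >= "2024-01-01").explain()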
12. Streaming with PySpark
• Reading Streams:
• df = (spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .load())
• Writing Streams:
• query = df.writeStream.format("console").start()
• query.awaitTermination()
13. Error Handling
• Common Exceptions:
o AnalysisException: Invalid query or missing columns.
o Py4JJavaError: Java exception in Spark operations.
• Debugging:
o Use .explain() to understand the query execution plan (see the sketch below).
o Check Spark logs for detailed error messages.
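• Example: a small sketch of catching AnalysisException for a missing column (data and names are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName("error-demo").getOrCreate()
df = spark.createDataFrame([(1, "Alice")], ["id", "name"])
try:
    df.select("missing_column").show()
except AnalysisException as e:
    print("Query failed:", e)

df.select("id").explain()  # inspect the plan when a query is slow or behaves unexpectedly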
14. Common PySpark Interview Questions
• What are the differences between RDD, DataFrame, and Dataset?
• How does Spark handle fault tolerance?
• Explain the concept of lazy evaluation in PySpark.
• How do you optimize joins in PySpark?
• Explain Spark's execution process (job, stages, and tasks).
1. PySpark Basics
Initialize SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AppName").getOrCreate()
Create DataFrame:
data = [(1, "Alice"), (2, "Bob")]
columns = ["id", "name"]
df = spark.createDataFrame(data, columns)
Inspect DataFrame:
df.show() # Display rows
df.printSchema() # Show schema
df.describe().show() # Summary statistics
Read/Write Data:
# Read
df = spark.read.csv("file_path", header=True, inferSchema=True)
# Write
df.write.csv("output_path", header=True)
2. PySpark SQL
SQL Queries:
df.createOrReplaceTempView("table")
spark.sql("SELECT * FROM table WHERE id > 1").show()
Joins:
df1.join(df2, df1["key"] == df2["key"], "inner").show() # Types: inner, left, right, outer
3. Transformations
Basic Transformations:
df.select("column1", "column2").show() # Select columns
df.filter(df["column"] > 10).show() # Filter rows
df.withColumn("new_col", df["col"] * 2).show() # Add column
GroupBy and Aggregations:
from pyspark.sql.functions import count, avg, sum
df.groupBy("column").agg(count("*").alias("count"), avg("col2")).show()
Window Functions:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
window = Window.partitionBy("category").orderBy("sales")
df.withColumn("rank", rank().over(window)).show()
4. PySpark Functions
Common Functions:
from pyspark.sql.functions import col, lit, concat, when
df = df.withColumn("new_col", concat(col("col1"), lit("_"), col("col2"))) # Concatenate
df = df.withColumn("status", when(df["col"] > 10, "High").otherwise("Low")) #
Conditional
Date Functions:
from pyspark.sql.functions import current_date, datediff
df = df.withColumn("today", current_date())
df = df.withColumn("days_diff", datediff(df["date_col"], df["today"]))
5. PySpark RDD Operations
Basic RDD Operations:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
mapped_rdd = rdd.map(lambda x: x * 2)
filtered_rdd = mapped_rdd.filter(lambda x: x > 4)
print(filtered_rdd.collect())
Actions:
print(rdd.count())
print(rdd.collect())
Transformations:
rdd1 = spark.sparkContext.parallelize([1, 2, 3])
rdd2 = spark.sparkContext.parallelize([3, 4, 5])
union_rdd = rdd1.union(rdd2)
intersection_rdd = rdd1.intersection(rdd2)
6. PySpark Optimization
Persist and Cache:
df.cache() # Cache in memory
df.persist() # Persist in memory and disk
df.unpersist() # Remove from cache
Repartition:
df = df.repartition(4) # Increase partitions
df = df.coalesce(2) # Decrease partitions
7. PySpark Interview Patterns
1. Self Join Example:
df.alias("df1").join(df.alias("df2"), col("df1.id") == col("df2.supervisor"), "inner").show()
2. Window Function Example:
from pyspark.sql.functions import row_number
window = Window.partitionBy("category").orderBy("sales")
df.withColumn("row_number", row_number().over(window)).show()
3. Aggregate Example:
df.groupBy("department").agg(
count("*").alias("count"),
avg("salary").alias("avg_salary")
).show()
8. PySpark Advanced Topics
Broadcast Joins:
from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "key").show()
UDF (User-Defined Functions):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def uppercase(name):
    return name.upper()
uppercase_udf = udf(uppercase, StringType())
df.withColumn("uppercase_name", uppercase_udf(df["name"])).show()
Accumulators:
acc = spark.sparkContext.accumulator(0)
def add_to_acc(value):
    acc.add(value)
rdd.foreach(add_to_acc)
print(acc.value)
11. Broadcast Joins
• Definition: Optimizes join operations when one DataFrame is small enough to fit
in memory.
• Syntax:
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
• Use Case: Useful for improving performance by avoiding shuffle operations.
12. Window Functions
• Usage: Perform operations like ranking, cumulative sums, etc., over a specific
window of rows.
• Example:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col
window_spec = Window.partitionBy("department").orderBy("salary")
ranked_df = employees.withColumn("rank", rank().over(window_spec))
ranked_df.show()
• Common Functions: row_number, rank, dense_rank, lag, lead, ntile (lag and lead are sketched below).
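• Example (lag/lead): a brief sketch on illustrative department/salary data; each row sees the previous and next salary within its department.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import lag, lead

spark = SparkSession.builder.appName("lag-lead-demo").getOrCreate()
emp = spark.createDataFrame(
    [("hr", 3000), ("hr", 3500), ("it", 4000), ("it", 4200)],
    ["department", "salary"],
)
w = Window.partitionBy("department").orderBy("salary")
emp.withColumn("prev_salary", lag("salary").over(w)) \
   .withColumn("next_salary", lead("salary").over(w)) \
   .show()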
13. Data Partitioning
• Repartitioning: Changes the number of partitions.
df_repartitioned = df.repartition(4)
• Coalesce: Reduces the number of partitions without shuffling.
df_coalesced = df.coalesce(2)
14. Accumulators
• Definition: Shared variables that executors can only add to; useful for counters and simple sums.
• Syntax:
acc = spark.sparkContext.accumulator(0)
rdd.foreach(lambda x: acc.add(1))
print(acc.value)
15. Caching and Persistence
• Caching: Stores RDD/DataFrame in memory for reuse.
df.cache()
• Persistence: Allows specifying storage levels (e.g., memory, disk).
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
16. Skew Handling
• Salting: Add random prefixes to keys to distribute data evenly during joins.
• Example (random_id stands for a salt column you add; a fuller sketch follows):
df_with_salt = df.withColumn("salted_key", concat(col("key"), lit("_"), col("random_id")))
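• Fuller sketch (assumes existing big_df and small_df DataFrames that share a "key" column; the salt count is illustrative): the skewed side gets a random salt, the small side is replicated across all salt values, and the join runs on the salted key.
from pyspark.sql.functions import concat, col, lit, rand, floor, explode, array

NUM_SALTS = 10  # illustrative salt count

# Skewed (large) side: append a random salt 0..NUM_SALTS-1 to the key.
big_salted = big_df.withColumn("salt", floor(rand() * NUM_SALTS).cast("string")) \
                   .withColumn("salted_key", concat(col("key").cast("string"), lit("_"), col("salt")))

# Small side: replicate each row once per salt value so every salted key can find a match.
small_salted = small_df.withColumn("salt", explode(array(*[lit(str(i)) for i in range(NUM_SALTS)]))) \
                       .withColumn("salted_key", concat(col("key").cast("string"), lit("_"), col("salt")))

result = big_salted.join(small_salted, "salted_key")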
17. Fault Tolerance
• RDD Lineage: RDDs record the chain of transformations used to build them, so lost partitions can be recomputed automatically after a failure (see the sketch below).
• Task Retry: Spark automatically retries failed tasks (up to spark.task.maxFailures attempts).
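• Example: a tiny sketch of inspecting lineage (data is illustrative); toDebugString shows the chain of transformations Spark would replay to recompute lost partitions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 5)
print(rdd.toDebugString().decode())  # prints the lineage: parallelize -> map -> filter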
18. Integration with Other Tools
• Integration with Hive (requires a SparkSession built with enableHiveSupport()):
spark.sql("SELECT * FROM hive_table")
• Reading/Writing to Kafka:
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1")
      .load())
(df.writeStream.format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")
   .option("topic", "topic2")
   .option("checkpointLocation", "checkpoint_path")  # required by the Kafka sink
   .start())
19. Advanced File Formats
• Avro (requires the external spark-avro package):
df.write.format("avro").save("path")
• ORC:
df.write.format("orc").save("path")
20. Performance Tuning
• Common Parameters:
o spark.sql.shuffle.partitions: Adjust for better parallelism.
o spark.executor.memory: Increase memory for executors.
o spark.executor.cores: Set the number of cores per executor (see the builder sketch below).
• Example:
spark.conf.set("spark.sql.shuffle.partitions", 50)
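• Executor memory and cores must be set before the application starts (e.g., at session build or spark-submit time), unlike spark.sql.shuffle.partitions, which can be changed at runtime. A hedged sketch with illustrative values:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuned-app")
         .config("spark.executor.memory", "4g")      # illustrative value
         .config("spark.executor.cores", "2")        # illustrative value
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())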