Pyspark - Notes 1

The document is a comprehensive tutorial on PySpark, covering its definition, advantages over Hadoop MapReduce, architecture, and core concepts like RDDs and DataFrames. It emphasizes the speed and developer-friendliness of Spark, along with practical guidance on setting up Databricks, performing data transformations, and executing SQL queries. Additionally, it includes tips for self-practice and real-world projects to enhance learning.

Complete PySpark Tutorial | Learn PySpark from Basics to Advanced Step-by-Step

https://youtu.be/1J7qZ5SNGaQ?si=vIiTxx-QCA5BoYAS

1. What is PySpark?
- PySpark is the Python API for Apache Spark, used for big data processing and analytics.
- Spark is a general-purpose, in-memory computation engine.
- General-purpose: You can do multiple types of data tasks (cleaning, querying, ML, etc.) in
Spark, not just one thing.
- In-memory: Spark processes data in RAM (fast), not just from hard disks (slow like
MapReduce).
- Computation engine: Spark’s main job is to process data, not store it.

Why use Spark over Hadoop MapReduce?


- In Hadoop, data goes repeatedly from disk to disk.
- In Spark, once data is loaded, most processing happens in memory, making it much faster (up
to 100x).

2. Spark’s Advantages
- Speed: Much faster than MapReduce (in-memory, fewer disk reads/writes).
- Developer-Friendly: Supports APIs in Python, Scala, Java; writing distributed code is easier.
- Supports Multiple Workloads: Interactive SQL, real-time stream processing, machine learning,
graph processing—all in one.

3. Spark Architecture Overview


- Master-Slave Architecture:
- Cluster = group of computers (“nodes”) working together.
- Master node: Coordinates jobs.
- Worker nodes: Do the actual computation.
- Distributed Processing: Tasks are split across several nodes for faster results.

Example:
If one worker needs 5 days to build a dashboard, 5 workers splitting the work can finish
in about 1 day - that is parallel processing.

How Job Scheduling Works


- Driver program: Entry point, creates the Spark context.
- Cluster manager: Allocates resources (memory, CPUs) across nodes.
- Driver splits your code into jobs and tasks, then sends them to worker nodes (executors).

4. RDD (Resilient Distributed Dataset)


- RDD is Spark’s core data structure.
- Resilient: If a part fails, Spark can recompute it from its “lineage” (history).
- Distributed: Data is split across many nodes.
- Dataset: Collection of objects/data you process.
RDD Operations:
1. Transformations:
- e.g., `map`, `filter`
- Do NOT immediately compute; just build a plan (DAG).
2. Actions:
- e.g., `collect`, `count`
- Trigger the computation and bring results back.

Lazy Evaluation
- Nothing happens until you call an action. Spark remembers all transformations and executes
them together, which is efficient.

5. DAG (Directed Acyclic Graph) & Lazy Evaluation


- Spark builds a DAG from your transformations; it records the execution plan and the
dependencies between steps.
- Lazy evaluation: Spark only processes data when you run an action (ex: `collect()`, `count()`).
Transformations are just registered until then.

6. Setting Up Databricks with PySpark


- Databricks provides a free Community Edition for Spark practice in the cloud.
- Sign up at Databricks Community Edition.
- Create a cluster (compute): can use up to 15GB RAM for free, but clusters terminate after
1-2 hours. Just spin up a new one.

Uploading Data
- You can upload CSV/JSON files and easily create tables/dataframes.
- Two ways to create data tables:
1. Use notebook and PySpark code.
2. Use the UI to upload and set schema/columns.

7. DataFrames in PySpark
- DataFrames: A more user-friendly, structured version of RDDs.
- Easy transformations like `.select()`, `.withColumn()`, `.filter()`, `.drop()`, `.sort()`, `.groupBy()`,
`.join()`, `.union()`, `.fillna()`.
- Handle various file types: CSV, JSON, Parquet.
- DataFrames are the most common way to work with data in PySpark.

8. PySpark SQL
- Use Spark SQL to run SQL queries on your DataFrames or tables.
- You can create “views” or temporary tables from DataFrames and write SQL to
manipulate/join/filter data.

9. More Topics Covered


- StructType: Define custom DataFrame schemas.
- Pivot/Unpivot: Reshape data.
- User Defined Functions (UDFs): Write custom Python functions to use inside DataFrames.
- Window Functions: Calculations across rows related to the current one.
- Partitioning: Split data for optimized processing.
- Cache vs. Persist: Control how/if data stays in memory.
- Explode: Flatten arrays/structs in columns.

10. Practice Projects & Real Interview Q&A


- Several real-world use-cases and projects are included.
- Multiple frequently asked PySpark interview questions are explained.

11. Tips for Self-Practice


- For hands-on:
- Use Databricks Community Edition (free) or install Spark locally.
- Upload sample files and test transformations/actions.
- Practice building RDDs, then DataFrames, try SQL too.
- Use documentation and “cheat sheets” for PySpark syntax.
- Focus on understanding DAGs, transformations vs actions, DataFrame APIs, and basic cluster
resource concepts.
