Sure! Here's a basic tutorial on PySpark, the Python API for Apache Spark. Spark is a
distributed computing engine for large-scale data processing, and PySpark lets you
drive it from Python.
1. Setting Up PySpark
Before diving into the code, you need to set up PySpark. You can install it via
pip:
```bash
pip install pyspark
```
If you're using Jupyter notebooks, you might also want to install findspark, which
helps find your Spark installation.
```bash
pip install findspark
```
Then you can initialize Spark's entry point for execution in your script or notebook.
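For example, in a Jupyter notebook you might run something like this before creating a
session (a minimal sketch, assuming a local Spark installation that findspark can locate):

```python
import findspark

# Locate the local Spark installation and add PySpark to sys.path,
# so that `import pyspark` works inside the notebook
findspark.init()
```

After that, you create the SparkSession exactly as shown in the next section.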
2. SparkSession
Spark 2.0 introduced the SparkSession as the unified entry point for working with data,
and it's generally preferred over using the older SparkContext directly. Let's start by
initializing it:
```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("PySpark Tutorial") \
    .getOrCreate()

# Verify the session is created
print(spark.version)
```
3. RDDs (Resilient Distributed Datasets)
RDDs are the basic data structure in Spark and support parallel operations across the
cluster. You can create them from data sources such as local files, or by parallelizing
existing Python collections.
Creating an RDD from a Python list:
```python
# Parallelize a Python list into an RDD
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Show the RDD contents
print(rdd.collect())  # Output: [1, 2, 3, 4, 5]
```
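You can also build an RDD from a text file, as mentioned above. A minimal sketch with a
placeholder path (each line of the file becomes one element of the RDD):

```python
# Read a text file into an RDD, one element per line
lines_rdd = spark.sparkContext.textFile("path/to/file.txt")
print(lines_rdd.take(5))  # first five lines
```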
Operations on RDDs
You can perform various operations on RDDs, such as map(), filter(), and reduce().
- map() applies a function to each element.
- filter() filters elements based on a condition.
- reduce() aggregates elements using a function.
```python
# Apply map to square each element
squared_rdd = rdd.map(lambda x: x ** 2)
print(squared_rdd.collect())  # Output: [1, 4, 9, 16, 25]

# Apply filter to keep even numbers only
even_rdd = rdd.filter(lambda x: x % 2 == 0)
print(even_rdd.collect())  # Output: [2, 4]

# Apply reduce to sum all elements (returns a value, not an RDD)
total = rdd.reduce(lambda x, y: x + y)
print(total)  # Output: 15
```
4. DataFrames (Preferred API)
Although RDDs are powerful, Spark's DataFrame API provides a higher-level abstraction
whose queries go through Spark's Catalyst optimizer, so it generally performs better.
DataFrames are similar to pandas DataFrames but are distributed across the cluster.
Creating a DataFrame
You can create a DataFrame by loading data from various formats like CSV, JSON, or
Parquet, or by converting an RDD.
```python
# Create a DataFrame from a list of tuples
data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()
```
Output:
```text
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
+-------+---+
```
Performing Operations on DataFrames
DataFrames allow you to use SQL-like queries and operations such as select(),
filter(), groupBy(), etc.
```python
# Select specific columns
df.select("Name").show()

# Filter rows based on a condition
df.filter(df.Age > 30).show()

# Group by a column and perform aggregation
df.groupBy("Age").count().show()
```
Using SQL Queries
PySpark allows you to run SQL queries on DataFrames using spark.sql().
```python
# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Run a SQL query against it
result = spark.sql("SELECT Name FROM people WHERE Age > 30")
result.show()
```
5. Reading and Writing Data
PySpark can read from and write to various data formats, including CSV, JSON,
Parquet, etc.
Read Data from CSV
```python
# Load data from a CSV file into a DataFrame
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Show the DataFrame
df.show()
```
Write Data to CSV
```python
# Write the DataFrame to a CSV file
df.write.csv("path/to/output.csv", header=True)
```
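Note that Spark writes a directory of part files at that path rather than a single CSV
file, and by default it raises an error if the path already exists; passing a save mode,
for example `df.write.mode("overwrite").csv(...)`, changes that behavior.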
Read/Write Parquet Files
Parquet is an optimized, columnar format for storing large datasets.
```python
# Read a Parquet file
df_parquet = spark.read.parquet("path/to/file.parquet")

# Write to Parquet
df.write.parquet("path/to/output.parquet")
```
6. PySpark Machine Learning (MLlib)
PySpark provides a library called MLlib for machine learning. You can use it for
classification, regression, clustering, and more.
Example: Logistic Regression
```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

# Sample data
data = [(0, 1.0, 0.1), (1, 2.0, 1.1), (0, 3.0, 3.0), (1, 4.0, 4.5)]
columns = ["label", "feature1", "feature2"]
df = spark.createDataFrame(data, columns)

# VectorAssembler combines the feature columns into a single vector column
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")

# Logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Create a pipeline
pipeline = Pipeline(stages=[assembler, lr])

# Fit the model
model = pipeline.fit(df)

# Make predictions
predictions = model.transform(df)
predictions.show()
```
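If you want a quick sanity check on the fitted model, you can follow this with an
evaluator (a minimal sketch; on a four-row toy dataset the metric isn't very meaningful):

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Area under the ROC curve, computed from the model's rawPrediction column
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
print(evaluator.evaluate(predictions))
```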
7. Performance Optimization
Caching/Persistence: If you reuse a DataFrame or RDD several times, you can cache it in
memory so later actions don't recompute it from scratch.
```python
# Mark the DataFrame to be kept in memory after it is first computed
df.cache()
```
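Note that cache() is lazy: the data is only materialized the first time an action (such
as count() or show()) runs on the DataFrame, and you can free the memory later with
df.unpersist().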
Avoid Shuffling: Shuffling data across nodes is expensive. Where you can, avoid or
minimize operations that trigger a shuffle, such as join() and groupBy(), or pick the
less shuffle-heavy variant, as in the sketch below.
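A classic RDD-level illustration of this (a minimal sketch using the pair-RDD API):
reduceByKey() combines values within each partition before shuffling, so it moves far
less data across the network than groupByKey() followed by a sum.

```python
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Shuffles every individual value across the network, then sums per key
sums_group = pairs.groupByKey().mapValues(sum)

# Pre-aggregates within each partition, then shuffles only partial sums
sums_reduce = pairs.reduceByKey(lambda x, y: x + y)

print(sums_reduce.collect())  # e.g. [('a', 4), ('b', 2)]
```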
Broadcast Variables: If you need to join a small dataset with a large one, consider
broadcasting the small dataset so that a full copy is sent to every worker node instead
of shuffling the large dataset; see the sketch below.
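For DataFrames, the usual way to do this is a broadcast join hint (a minimal sketch;
small_df and large_df are hypothetical DataFrames sharing an "id" column):

```python
from pyspark.sql.functions import broadcast

# Hint Spark to ship small_df to every executor instead of shuffling large_df
joined = large_df.join(broadcast(small_df), on="id", how="left")
```

For RDD code, the analogous tool is spark.sparkContext.broadcast(), which ships a
read-only Python object to all worker nodes.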
8. Conclusion
PySpark is a powerful tool for distributed data processing. It supports RDDs, DataFrames,
and machine learning via MLlib. With its scalability and relative ease of use, it's
widely used for big data processing and analysis.
If you're just getting started, I recommend exploring the PySpark documentation
further and experimenting with real-world datasets to get a deeper understanding.
Is there a particular part of PySpark you'd like to explore more, or any specific
example you'd like to see?
2. Key Parts of Hadoop (Simple Definitions)
| Component | What it does |
| --- | --- |
| HDFS | A distributed file system that stores files across many machines. |
| MapReduce | A processing model that breaks work into small tasks run in parallel. |
| YARN | Manages cluster resources and schedules jobs. |