BDS - Apache Spark Assignment
Big Data Analysis - NYC Taxi Trips
Abstract:
This assignment aims to provide hands-on experience with Apache Spark using a real-world dataset.
Students will explore various capabilities of Spark including DataFrame operations, SQL querying,
filtering, aggregations, window functions, joins, and data persistence. The dataset used is based on
New York City's publicly available taxi trip records. Through this assignment, students will gain practical
skills in large-scale data processing and analytics with PySpark, a critical skill set for data scientists and
engineers.
Dataset:
- Source: NYC TLC Trip Record Data
- File: yellow_tripdata_2025-01.parquet
[https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet]
- Additional: taxi_zone_lookup.csv for join-based exercises
[https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv]
Environment:
- Apache Spark 3.x+
- PySpark (via Jupyter, Databricks, Colab, or local environment)
- Python 3.8+
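For reference, a minimal session setup could look like the sketch below (assuming a local installation; the app name is arbitrary, and the master URL should match your environment):

    from pyspark.sql import SparkSession

    # Create (or reuse) a SparkSession; "local[*]" uses all local cores.
    spark = (SparkSession.builder
             .appName("nyc-taxi-assignment")
             .master("local[*]")
             .getOrCreate())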
Learning Objectives:
1. Load and process large datasets using PySpark
2. Apply various DataFrame and SQL operations
3. Perform aggregations and analytics using window functions
4. Join datasets and write partitioned outputs
Assignment Questions:
Use Spark DataFrame transformations for Q1 - Q9. Leverage Spark SQL for Q10.
Basic Loading & Exploration
Q1. Load the dataset into a Spark DataFrame with proper schema inference and print the schema.
Q2. Display the row count and the top 10 records of the dataset.
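As a starting hint, the loading step could look like the sketch below (the file path is an assumption; note that Parquet files carry an embedded schema, so Spark picks up column types without a separate inference option):

    # Parquet stores its schema, so column types are read automatically.
    df = spark.read.parquet("yellow_tripdata_2025-01.parquet")
    df.printSchema()

    print(df.count())   # total number of rows
    df.show(10)         # top 10 records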
Filtering and Column Operations
Q3. Filter trips where 'trip_distance' > 10 miles and 'passenger_count' >= 7.
Q4. Add a new column 'trip_duration_minutes' by calculating the difference between the 'dropoff'
and 'pickup' timestamps. List the first 5 rows showing only the dropoff time, the pickup time, and
the new 'trip_duration_minutes' column.
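Both tasks combine 'filter' and 'withColumn'; a sketch, assuming the 'df' from the loading step and the standard yellow-taxi timestamp columns 'tpep_pickup_datetime' and 'tpep_dropoff_datetime':

    from pyspark.sql import functions as F

    # Q3: filter on distance and passenger count.
    long_trips = df.filter((F.col("trip_distance") > 10) &
                           (F.col("passenger_count") >= 7))

    # Q4: duration in minutes from the two timestamps.
    with_duration = df.withColumn(
        "trip_duration_minutes",
        (F.unix_timestamp("tpep_dropoff_datetime") -
         F.unix_timestamp("tpep_pickup_datetime")) / 60)
    with_duration.select("tpep_dropoff_datetime", "tpep_pickup_datetime",
                         "trip_duration_minutes").show(5)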
Aggregation & Grouping
Q5. Compute the average 'trip_distance' and 'total_amount' grouped by 'passenger_count'. List the
values in the order of passenger_count.
Q6. Identify the top 5 pickup dates with the highest total fare (total_amount) collected.
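One possible shape for these aggregations (a sketch continuing from the same 'df'; 'to_date' extracts the calendar date from the pickup timestamp):

    from pyspark.sql import functions as F

    # Q5: averages per passenger_count, listed in passenger_count order.
    (df.groupBy("passenger_count")
       .agg(F.avg("trip_distance").alias("avg_distance"),
            F.avg("total_amount").alias("avg_total"))
       .orderBy("passenger_count")
       .show())

    # Q6: top 5 pickup dates by total fare collected.
    (df.withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
       .groupBy("pickup_date")
       .agg(F.sum("total_amount").alias("daily_total"))
       .orderBy(F.desc("daily_total"))
       .show(5))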
Window Functions
Q7. For each day, rank the top 3 longest trips by 'trip_distance' using window functions. List 12
records (the 3 longest trips per day for the first 4 days).
Q8. Add a column showing the running total of 'total_amount' for each day. Display
'tpep_pickup_datetime', 'total_amount', and the 'running_total' for the top 10 records.
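Both questions use a window partitioned by pickup date; a sketch under the same column-name assumptions as above:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    day_df = df.withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))

    # Q7: rank trips within each day by distance, keep the top 3.
    rank_w = Window.partitionBy("pickup_date").orderBy(F.desc("trip_distance"))
    top3 = (day_df.withColumn("rank", F.row_number().over(rank_w))
                  .filter(F.col("rank") <= 3))

    # Q8: running total of total_amount within each day, ordered by pickup time.
    run_w = (Window.partitionBy("pickup_date")
                   .orderBy("tpep_pickup_datetime")
                   .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    running = day_df.withColumn("running_total",
                                F.sum("total_amount").over(run_w))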
Joins
Q9. Join the dataset with 'taxi_zone_lookup.csv' on 'PULocationID' and report the average fare by
'Borough'. Use an inner join.
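A sketch of the join (the lookup table's key column is 'LocationID'; whether "fare" means 'fare_amount' or 'total_amount' is left to you, and this sketch assumes 'fare_amount'):

    from pyspark.sql import functions as F

    zones = (spark.read.option("header", True)
                  .option("inferSchema", True)
                  .csv("taxi_zone_lookup.csv"))

    # Inner join on pickup location, then average fare per Borough.
    (df.join(zones, df.PULocationID == zones.LocationID, "inner")
       .groupBy("Borough")
       .agg(F.avg("fare_amount").alias("avg_fare"))   # 'fare_amount' is an assumption
       .show())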
Spark SQL Queries
Q10. Register the DataFrame as a temporary SQL view and use Spark SQL to:
- Compute hourly trip volume and list it in the order of pickup_hour (list all).
- Compare average fares between weekdays and weekends.
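A sketch of the SQL route (the view name 'trips' is an arbitrary choice; note that Spark's 'dayofweek' returns 1 for Sunday and 7 for Saturday):

    df.createOrReplaceTempView("trips")

    # Hourly trip volume, one row per hour of the day.
    spark.sql("""
        SELECT hour(tpep_pickup_datetime) AS pickup_hour,
               COUNT(*) AS trip_count
        FROM trips
        GROUP BY hour(tpep_pickup_datetime)
        ORDER BY pickup_hour
    """).show(24)

    # Weekday vs. weekend average fare ('fare_amount' is an assumption).
    spark.sql("""
        SELECT day_type, AVG(fare_amount) AS avg_fare
        FROM (SELECT CASE WHEN dayofweek(tpep_pickup_datetime) IN (1, 7)
                          THEN 'weekend' ELSE 'weekday' END AS day_type,
                     fare_amount
              FROM trips) t
        GROUP BY day_type
    """).show()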
Write to Output
Q11. Persist the top 100 highest-fare trips into a CSV file.
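One way to produce the file (a sketch; 'write.csv' creates a directory of part-files, so 'coalesce(1)' is used here to keep the result in a single part, and the output path is an assumption):

    from pyspark.sql import functions as F

    # Q11: top 100 trips by total_amount, written with a header row.
    (df.orderBy(F.desc("total_amount"))
       .limit(100)
       .coalesce(1)                  # single part-file inside the output dir
       .write.mode("overwrite")
       .option("header", True)
       .csv("output/top100_fares"))  # path is an assumption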
Q12. If the data is re-partitioned by pickup date, evaluate the potential for skewed partitioning in the
resulting dataset. Justify your answer.
Q13. Repartition the data by pickup date and write it to partitioned output files. Show the folder
structure of the repartitioned output and list the files in one of the folders.
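A sketch of the partitioned write (assuming the 'day_df' with its 'pickup_date' column from the window-function sketch; Spark creates one 'pickup_date=YYYY-MM-DD' subfolder per date, which you can then inspect for the folder structure):

    # Q13: one subfolder per pickup_date under the output root.
    (day_df.repartition("pickup_date")       # align in-memory partitions with dates
           .write.mode("overwrite")
           .partitionBy("pickup_date")
           .parquet("output/trips_by_date")) # path is an assumption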
Submission Requirements:
Deliverables:
1. Report (PDF/Word):
• Document the following:
a. Group members and contribution percentage
b. Problem-solving approach.
c. Details of the development environment and setup.
d. Code/query and results of each analytical query
Note: The primary evaluation will be based on the submitted report. Therefore, it
must be comprehensive and include all required details. For each question, relevant
code along with its corresponding output should be clearly documented. Outputs
should be easily readable: either presented in well-formatted tables, clear images
of execution results, or preferably a combination of both.
2. Jupyter Notebook:
• Develop in any of the specified environments.
• Ensure high code quality with inline documentation.
• Submit the notebook file (.ipynb) containing outputs for all cells.
3. CSV file generated as the output of Q11
4. Video (mp4):
• Record the data loading and execution of code/queries and their outcomes in a
.mp4 file.
Submission File Format:
• Submit the four files specified in the 'Deliverables' section, viz. the report
(.pdf/.docx), the Jupyter notebook (.ipynb), the CSV output of Q11, and the video
(.mp4), named as follows:
o Report: Asgn2_Grp_<your_group_no>_report.pdf OR
Asgn2_Grp_<your_group_no>_report.docx
o Notebook: Asgn2_Grp_<your_group_no>_code.ipynb
o CSV output of Q11: Asgn2_Grp_<your_group_no>_Q11.csv
o Video: Asgn2_Grp_<your_group_no>_video.mp4
Evaluation Criteria:
Component               Weight
Correctness             40%
Code readability        20%
Output interpretation   20%
Spark features          20%
Academic Integrity:
All work must be your group's own. Collaboration is allowed only for discussion, not for code sharing.
Violations will be handled as per university policy.
Appendix:
Useful functions: 'withColumn', 'groupBy', 'agg', 'row_number()', 'sum', 'avg', 'join', 'window', 'orderBy',
'to_date', 'unix_timestamp', 'sql', 'write', 'partitionBy', etc.