BDS - Apache Spark Assignment
Big Data Analysis - NYC Taxi Trips
Abstract:
This assignment aims to provide hands-on experience with Apache Spark using a real-world dataset.
Students will explore various capabilities of Spark including DataFrame operations, SQL querying,
filtering, aggregations, window functions, joins, and data persistence. The dataset used is based on
New York City's publicly available taxi trip records. Through this assignment, students will gain practical
skills in large-scale data processing and analytics with PySpark, a critical skill set for data scientists and
engineers.
Dataset:
- Source: NYC TLC Trip Record Data
- File: yellow_tripdata_2025-01.parquet
[https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet]
- Additional: taxi_zone_lookup.csv for join-based exercises
[https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv]
Environment:
- Apache Spark 3.x+
- PySpark (via Jupyter, Databricks, Colab, or local environment)
- Python 3.8+
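For reference, a minimal session setup could look like the sketch below (assuming a local installation; the app name is arbitrary, and the master URL should match your environment):

    from pyspark.sql import SparkSession

    # Create (or reuse) a SparkSession; "local[*]" uses all local cores.
    spark = (SparkSession.builder
             .appName("nyc-taxi-assignment")
             .master("local[*]")
             .getOrCreate())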
Learning Objectives:
1. Load and process large datasets using PySpark
2. Apply various DataFrame and SQL operations
3. Perform aggregations and analytics using window functions
4. Join datasets and write partitioned outputs
Assignment Questions:
Use Spark DataFrame transformations for Q1 - Q9. Leverage Spark SQL for Q10.
Basic Loading & Exploration
Q1. Load the dataset into a Spark DataFrame with proper schema inference and print the schema.
Q2. Display the row count and the top 10 records of the dataset.
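As a starting hint, the loading step could look like the sketch below (the file path is an assumption; note that Parquet files carry an embedded schema, so Spark picks up column types without a separate inference option):

    # Parquet stores its schema, so column types are read automatically.
    df = spark.read.parquet("yellow_tripdata_2025-01.parquet")
    df.printSchema()

    print(df.count())   # total number of rows
    df.show(10)         # top 10 records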
Filtering and Column Operations
Q3. Filter trips where 'trip_distance' > 10 miles and 'passenger_count' >= 7.
Q4. Add a new column 'trip_duration_minutes' by calculating the difference between the 'dropoff'
and 'pickup' timestamps. List the first 5 rows showing only the dropoff time, the pickup time, and
the new 'trip_duration_minutes' column.
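Both tasks combine 'filter' and 'withColumn'; a sketch, assuming the 'df' from the loading step and the standard yellow-taxi timestamp columns 'tpep_pickup_datetime' and 'tpep_dropoff_datetime':

    from pyspark.sql import functions as F

    # Q3: filter on distance and passenger count.
    long_trips = df.filter((F.col("trip_distance") > 10) &
                           (F.col("passenger_count") >= 7))

    # Q4: duration in minutes from the two timestamps.
    with_duration = df.withColumn(
        "trip_duration_minutes",
        (F.unix_timestamp("tpep_dropoff_datetime") -
         F.unix_timestamp("tpep_pickup_datetime")) / 60)
    with_duration.select("tpep_dropoff_datetime", "tpep_pickup_datetime",
                         "trip_duration_minutes").show(5)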
Aggregation & Grouping
Q5. Compute the average 'trip_distance' and 'total_amount' grouped by 'passenger_count'. List the
values in the order of passenger_count.
Q6. Identify the top 5 pickup dates with the highest total fare (total_amount) collected.
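One possible shape for these aggregations (a sketch continuing from the same 'df'; 'to_date' extracts the calendar date from the pickup timestamp):

    from pyspark.sql import functions as F

    # Q5: averages per passenger_count, listed in passenger_count order.
    (df.groupBy("passenger_count")
       .agg(F.avg("trip_distance").alias("avg_distance"),
            F.avg("total_amount").alias("avg_total"))
       .orderBy("passenger_count")
       .show())

    # Q6: top 5 pickup dates by total fare collected.
    (df.withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
       .groupBy("pickup_date")
       .agg(F.sum("total_amount").alias("daily_total"))
       .orderBy(F.desc("daily_total"))
       .show(5))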
Window Functions
Q7. For each day, rank the top 3 longest trips by 'trip_distance' using window functions. List 12
records (the 3 longest trips per day for the first 4 days).
Q8. Add a column showing the running total of 'total_amount' for each day. Display
'tpep_pickup_datetime', 'total_amount', and the 'running_total' for the top 10 records.
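Both questions use a window partitioned by pickup date; a sketch under the same column-name assumptions as above:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    day_df = df.withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))

    # Q7: rank trips within each day by distance, keep the top 3.
    rank_w = Window.partitionBy("pickup_date").orderBy(F.desc("trip_distance"))
    top3 = (day_df.withColumn("rank", F.row_number().over(rank_w))
                  .filter(F.col("rank") <= 3))

    # Q8: running total of total_amount within each day, ordered by pickup time.
    run_w = (Window.partitionBy("pickup_date")
                   .orderBy("tpep_pickup_datetime")
                   .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    running = day_df.withColumn("running_total",
                                F.sum("total_amount").over(run_w))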
Joins
Q9. Join the dataset with 'taxi_zone_lookup.csv' on 'PULocationID' and report the average fare by
'Borough'. Use an inner join.
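A sketch of the join (the lookup table's key column is 'LocationID'; whether "fare" means 'fare_amount' or 'total_amount' is left to you, and this sketch assumes 'fare_amount'):

    from pyspark.sql import functions as F

    zones = (spark.read.option("header", True)
                  .option("inferSchema", True)
                  .csv("taxi_zone_lookup.csv"))

    # Inner join on pickup location, then average fare per Borough.
    (df.join(zones, df.PULocationID == zones.LocationID, "inner")
       .groupBy("Borough")
       .agg(F.avg("fare_amount").alias("avg_fare"))   # 'fare_amount' is an assumption
       .show())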
Spark SQL Queries
Q10. Register the DataFrame as a temporary SQL view and use Spark SQL to:
- Compute hourly trip volume and list it in the order of pickup_hour (list all).
- Compare average fares between weekdays and weekends.
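A sketch of the SQL route (the view name 'trips' is an arbitrary choice; note that Spark's 'dayofweek' returns 1 for Sunday and 7 for Saturday):

    df.createOrReplaceTempView("trips")

    # Hourly trip volume, one row per hour of the day.
    spark.sql("""
        SELECT hour(tpep_pickup_datetime) AS pickup_hour,
               COUNT(*) AS trip_count
        FROM trips
        GROUP BY hour(tpep_pickup_datetime)
        ORDER BY pickup_hour
    """).show(24)

    # Weekday vs. weekend average fare ('fare_amount' is an assumption).
    spark.sql("""
        SELECT day_type, AVG(fare_amount) AS avg_fare
        FROM (SELECT CASE WHEN dayofweek(tpep_pickup_datetime) IN (1, 7)
                          THEN 'weekend' ELSE 'weekday' END AS day_type,
                     fare_amount
              FROM trips) t
        GROUP BY day_type
    """).show()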
Write to Output
Q11. Persist the top 100 highest-fare trips into a CSV file.
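One way to produce the file (a sketch; 'write.csv' creates a directory of part-files, so 'coalesce(1)' is used here to keep the result in a single part, and the output path is an assumption):

    from pyspark.sql import functions as F

    # Q11: top 100 trips by total_amount, written with a header row.
    (df.orderBy(F.desc("total_amount"))
       .limit(100)
       .coalesce(1)                  # single part-file inside the output dir
       .write.mode("overwrite")
       .option("header", True)
       .csv("output/top100_fares"))  # path is an assumption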
Q12. If the data is re-partitioned by pickup date, evaluate the potential for skewed partitioning in the
resulting dataset. Justify your answer.
Q13. Repartition the data by pickup date and write it to partitioned output files. Show the folder
structure of the repartitioned output and list the files in one of the folders.
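A sketch of the partitioned write (assuming the 'day_df' with its 'pickup_date' column from the window-function sketch; Spark creates one 'pickup_date=YYYY-MM-DD' subfolder per date, which you can then inspect for the folder structure):

    # Q13: one subfolder per pickup_date under the output root.
    (day_df.repartition("pickup_date")       # align in-memory partitions with dates
           .write.mode("overwrite")
           .partitionBy("pickup_date")
           .parquet("output/trips_by_date")) # path is an assumption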
Submission Requirements:
Deliverables:
1. Report (PDF/Word):
• Document the following:
a. Group members and contribution percentage
b. Problem-solving approach.
c. Details of the development environment and setup.
d. Code/query and results of each analytical query
Note: The primary evaluation will be based on the submitted report. Therefore, it
must be comprehensive and include all required details. For each question, relevant
code along with its corresponding output should be clearly documented. Outputs
should be easily readable: either presented in well-formatted tables, clear images
of execution results, or preferably a combination of both.
2. Jupyter Notebook:
• Develop in any of the specified environments.
• Ensure high code quality with inline documentation.
• Submit the notebook file (.ipynb) containing outputs for all cells.
3. CSV file generated as the output of Q11
4. Video (mp4):
• Record the data loading and execution of code/queries and their outcomes in a
.mp4 file.
Submission File Format:
• Submit the four files specified in the 'Deliverables' section, viz. the report
(.pdf/.docx), the Jupyter notebook (.ipynb), the CSV output of Q11, and the video
(.mp4), named as follows:
o Report: Asgn2_Grp_<your_group_no>_report.pdf OR
Asgn2_Grp_<your_group_no>_report.docx
o Notebook: Asgn2_Grp_<your_group_no>_code.ipynb
o CSV output of Q11: Asgn2_Grp_<your_group_no>_Q11.csv
o Video: Asgn2_Grp_<your_group_no>_video.mp4
Evaluation Criteria:
Component               Weight
Correctness             40%
Code readability        20%
Output interpretation   20%
Spark features          20%
Academic Integrity:
All work must be your group's own. Collaboration is allowed only for discussion, not for code sharing.
Violations will be handled as per university policy.
Appendix:
Useful functions: 'withColumn', 'groupBy', 'agg', 'row_number()', 'sum', 'avg', 'join', 'window', 'orderBy',
'to_date', 'unix_timestamp', 'sql', 'write', 'partitionBy', etc.