Taxi Trip Analysis Using Hive
This case study requires analyzing a large dataset using Hive for exploratory data analysis. Here’s a
structured approach to tackle the tasks:
1. Setup and Table Creation
Create the Table
Run the following DDL script in Hive to create the table schema for storing the taxi data:
CREATE TABLE IF NOT EXISTS taxidata (
vendor_id STRING,
pickup_datetime STRING,
dropoff_datetime STRING,
passenger_count INT,
trip_distance DECIMAL(9,6),
pickup_longitude DECIMAL(9,6),
pickup_latitude DECIMAL(9,6),
rate_code INT,
store_and_fwd_flag STRING,
dropoff_longitude DECIMAL(9,6),
dropoff_latitude DECIMAL(9,6),
payment_type STRING,
fare_amount DECIMAL(9,6),
extra DECIMAL(9,6),
mta_tax DECIMAL(9,6),
tip_amount DECIMAL(9,6),
tolls_amount DECIMAL(9,6),
total_amount DECIMAL(9,6),
trip_time_in_secs INT
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");
2. Load the Data
Place the CSV file (2018_Yellow_Taxi_Trip_Data.csv) in Hadoop's file system (HDFS).
Steps to Load Data into HDFS and Hive
1. Upload the CSV file into HDFS:
2. hdfs dfs -put 2018_Yellow_Taxi_Trip_Data.csv /user/hive/taxidata/
3. Load the data into the Hive table:
4. LOAD DATA INPATH '/user/hive/taxidata/2018_Yellow_Taxi_Trip_Data.csv'
5. INTO TABLE taxidata;
3. Run Basic Queries to Verify the Data
Run simple queries to validate that the data is loaded correctly:
1. Check Row Count:
2. SELECT COUNT(*) FROM taxidata;
3. Preview Data:
4. SELECT * FROM taxidata LIMIT 10;
4. Analysis Queries
4.1. Total Number of Trips
SELECT COUNT(*) AS total_trips FROM taxidata;
4.2. Total Revenue Generated
SELECT SUM(total_amount) AS total_revenue FROM taxidata;
4.3. Fraction Paid for Tolls
SELECT SUM(tolls_amount) / SUM(total_amount) AS toll_fraction FROM taxidata;
4.4. Fraction Paid as Tips
SELECT SUM(tip_amount) / SUM(total_amount) AS tip_fraction FROM taxidata;
4.5. Average Trip Amount
SELECT AVG(total_amount) AS average_trip_amount FROM taxidata;
4.6. Average Trip Distance
SELECT AVG(trip_distance) AS average_trip_distance FROM taxidata;
4.7. Number of Different Payment Types
SELECT COUNT(DISTINCT payment_type) AS num_payment_types FROM taxidata;
4.8. Metrics for Each Payment Type
SELECT
payment_type,
AVG(total_amount) AS average_fare,
AVG(tip_amount) AS average_tip,
AVG(mta_tax) AS average_tax
FROM taxidata
GROUP BY payment_type;
4.9. Hourly Revenue Analysis
To find the hour of the day with the highest average revenue, extract the hour from
pickup_datetime:
SELECT
HOUR(TO_TIMESTAMP(pickup_datetime)) AS hour_of_day,
AVG(total_amount) AS average_revenue
FROM taxidata
GROUP BY HOUR(TO_TIMESTAMP(pickup_datetime))
ORDER BY average_revenue DESC
LIMIT 1;
5. Notes
• Ensure the file is formatted correctly and accessible in HDFS before loading.
• Verify Hive connectivity and configurations (like Hadoop and Hive services being active).
• Use TO_TIMESTAMP in Hive 2.2 or above for date-time manipulation.
• Always test queries on smaller datasets if needed for performance tuning.
This workflow should give you insights into the taxi dataset while using Hive effectively!