KEMBAR78
Taxi Trip Analysis Using Hive | PDF | Information Retrieval | Computing
0% found this document useful (0 votes)
85 views3 pages

Taxi Trip Analysis Using Hive

Uploaded by

Monika Dukiya
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
85 views3 pages

Taxi Trip Analysis Using Hive

Uploaded by

Monika Dukiya
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 3

Taxi Trip Analysis Using Hive

This case study requires analyzing a large dataset using Hive for exploratory data analysis. Here’s a
structured approach to tackle the tasks:

1. Setup and Table Creation

Create the Table

Run the following DDL script in Hive to create the table schema for storing the taxi data:

CREATE TABLE IF NOT EXISTS taxidata (

vendor_id STRING,

pickup_datetime STRING,

dropoff_datetime STRING,

passenger_count INT,

trip_distance DECIMAL(9,6),

pickup_longitude DECIMAL(9,6),

pickup_latitude DECIMAL(9,6),

rate_code INT,

store_and_fwd_flag STRING,

dropoff_longitude DECIMAL(9,6),

dropoff_latitude DECIMAL(9,6),

payment_type STRING,

fare_amount DECIMAL(9,6),

extra DECIMAL(9,6),

mta_tax DECIMAL(9,6),

tip_amount DECIMAL(9,6),

tolls_amount DECIMAL(9,6),

total_amount DECIMAL(9,6),

trip_time_in_secs INT

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','


STORED AS TEXTFILE

TBLPROPERTIES ("skip.header.line.count"="1");

2. Load the Data

Place the CSV file (2018_Yellow_Taxi_Trip_Data.csv) in Hadoop's file system (HDFS).

Steps to Load Data into HDFS and Hive

1. Upload the CSV file into HDFS:

2. hdfs dfs -put 2018_Yellow_Taxi_Trip_Data.csv /user/hive/taxidata/

3. Load the data into the Hive table:

4. LOAD DATA INPATH '/user/hive/taxidata/2018_Yellow_Taxi_Trip_Data.csv'

5. INTO TABLE taxidata;

3. Run Basic Queries to Verify the Data

Run simple queries to validate that the data is loaded correctly:

1. Check Row Count:

2. SELECT COUNT(*) FROM taxidata;

3. Preview Data:

4. SELECT * FROM taxidata LIMIT 10;

4. Analysis Queries

4.1. Total Number of Trips

SELECT COUNT(*) AS total_trips FROM taxidata;

4.2. Total Revenue Generated

SELECT SUM(total_amount) AS total_revenue FROM taxidata;

4.3. Fraction Paid for Tolls

SELECT SUM(tolls_amount) / SUM(total_amount) AS toll_fraction FROM taxidata;

4.4. Fraction Paid as Tips

SELECT SUM(tip_amount) / SUM(total_amount) AS tip_fraction FROM taxidata;

4.5. Average Trip Amount

SELECT AVG(total_amount) AS average_trip_amount FROM taxidata;

4.6. Average Trip Distance


SELECT AVG(trip_distance) AS average_trip_distance FROM taxidata;

4.7. Number of Different Payment Types

SELECT COUNT(DISTINCT payment_type) AS num_payment_types FROM taxidata;

4.8. Metrics for Each Payment Type

SELECT

payment_type,

AVG(total_amount) AS average_fare,

AVG(tip_amount) AS average_tip,

AVG(mta_tax) AS average_tax

FROM taxidata

GROUP BY payment_type;

4.9. Hourly Revenue Analysis

To find the hour of the day with the highest average revenue, extract the hour from
pickup_datetime:

SELECT

HOUR(TO_TIMESTAMP(pickup_datetime)) AS hour_of_day,

AVG(total_amount) AS average_revenue

FROM taxidata

GROUP BY HOUR(TO_TIMESTAMP(pickup_datetime))

ORDER BY average_revenue DESC

LIMIT 1;

5. Notes

• Ensure the file is formatted correctly and accessible in HDFS before loading.

• Verify Hive connectivity and configurations (like Hadoop and Hive services being active).

• Use TO_TIMESTAMP in Hive 2.2 or above for date-time manipulation.

• Always test queries on smaller datasets if needed for performance tuning.

This workflow should give you insights into the taxi dataset while using Hive effectively!

You might also like