FORECASTING AIRLINE DELAYS
On any given day, more than 87,000 flights take place in the United States alone. About one-third of
these flights are commercial flights, operated by companies like United, American Airlines, and
JetBlue. While about 80% of commercial flights take off and land as scheduled, the other 20% suffer
from delays due to various reasons. A certain number of delays are unavoidable, due to unexpected
events, but some delays could hopefully be avoided if the factors causing delays were better
understood and addressed.
In this problem, we'll use a dataset of 9,381 flights that occurred in June through August of a specific
year between the three busiest US airports -- Atlanta (ATL), Los Angeles (LAX), and Chicago (ORD) --
to predict flight delays. The dataset AirlineDelay.csv includes the following 23 variables:
1. Flight = the origin-destination pair (LAX-ORD, ATL-LAX, etc.)
2. Carrier = the carrier operating the flight (American Airlines, Delta Air Lines, etc.)
3. Month = the month of the flight (June, July, or August)
4. DayOfWeek = the day of the week of the flight (Monday, Tuesday, etc.)
5. NumPrevFlights = the number of previous flights taken by this aircraft in the same day
6. PrevFlightGap = the amount of time between when this flight's aircraft is scheduled to arrive
at the airport and when it's scheduled to depart for this flight
7. HistoricallyLate = the proportion of time this flight has been late historically
8. InsufficientHistory = whether or not we have enough data to determine the historical record
of the flight (equal to 1 if we don't have at least 3 records, equal to 0 if we do)
9. OriginInVolume = the amount of incoming traffic volume at the origin airport, normalized by
the typical volume during the flight's time and day of the week
10. OriginOutVolume = the amount of outgoing traffic volume at the origin airport, normalized
by the typical volume during the flight's time and day of the week
11. DestInVolume = the amount of incoming traffic volume at the destination airport,
normalized by the typical volume during the flight's time and day of the week
12. DestOutVolume = the amount of outgoing traffic volume at the destination airport,
normalized by the typical volume during the flight's time and day of the week
13. OriginPrecip = the amount of rain at the origin over the course of the day, in tenths of
millimeters
14. OriginAvgWind = average daily wind speed at the origin, in miles per hour
15. OriginWindGust = fastest wind speed during the day at the origin, in miles per hour
16. OriginFog = whether or not there was fog at some point during the day at the origin (1 if
there was, 0 if there wasn't)
17. OriginThunder = whether or not there was thunder at some point during the day at the
origin (1 if there was, 0 if there wasn't)
18. DestPrecip = the amount of rain at the destination over the course of the day, in tenths of
millimeters
19. DestAvgWind = average daily wind speed at the destination, in miles per hour
20. DestWindGust = fastest wind speed during the day at the destination, in miles per hour
21. DestFog = whether or not there was fog at some point during the day at the destination (1 if
there was, 0 if there wasn't)
22. DestThunder = whether or not there was thunder at some point during the day at the
destination (1 if there was, 0 if there wasn't)
23. TotalDelay = the amount of time the aircraft was delayed, in minutes (this is our dependent
variable)
PROBLEM 1 - LOADING THE DATA
1. Load the data
2. Split the data into training and testing sets (70% / 30%)
3. Check for missing values, outliers, etc., and impute where required
Answer - Done
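The three steps above can be sketched in Python using only the standard library. This is a minimal illustration, not the method actually used for the answer: the inline sample rows stand in for AirlineDelay.csv, and the helper names (`load_rows`, `count_missing`, `split_rows`) and the seed are assumptions.

```python
import csv
import io
import random

# Hypothetical sample standing in for AirlineDelay.csv (3 of the 23 columns).
SAMPLE = """Flight,Carrier,TotalDelay
LAX-ORD,American Airlines,12
ATL-LAX,Delta Air Lines,-3
ORD-ATL,Delta Air Lines,
ATL-ORD,American Airlines,45
LAX-ATL,Delta Air Lines,0
ORD-LAX,American Airlines,7
ATL-LAX,Delta Air Lines,31
LAX-ORD,American Airlines,-1
ORD-ATL,Delta Air Lines,18
ATL-ORD,American Airlines,2
"""

def load_rows(handle):
    """Read CSV rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))

def count_missing(rows):
    """Count empty cells per column (columns with no gaps are omitted)."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            if value is None or value.strip() == "":
                counts[name] = counts.get(name, 0) + 1
    return counts

def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = load_rows(io.StringIO(SAMPLE))
missing = count_missing(rows)
train, test = split_rows(rows)
print(len(train), len(test), missing)
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle; fixing the seed keeps the 70/30 split reproducible across runs.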
PROBLEM 2 - A LINEAR REGRESSION MODEL
Build a linear regression model to predict "TotalDelay" using all of the other variables as
independent variables. Use the training set to build the model.
Answer – R² is 8.9%
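For reference, and assuming the figure above is the usual coefficient of determination computed on the training set, it is defined as follows, where y_i is the observed TotalDelay of flight i, \hat{y}_i is the model's prediction, and \bar{y} is the mean observed delay:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Under that reading, a value of 8.9% means the model accounts for roughly 8.9% of the variance in TotalDelay.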
PROBLEM 3 - CORRELATIONS
Check for correlations between the numerical variables. Would you like to modify your model built
in PROBLEM 2?
Answer – No, all variables are kept in the model
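One way to run such a check is to compute the Pearson correlation for every pair of numerical variables and flag the pairs above a cutoff. The sketch below is illustrative only: the column values are made up, and the 0.8 threshold and the function names are assumptions rather than part of the assignment.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.8):
    """Return the column-name pairs whose |r| exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]

# Hypothetical values for three of the numerical variables.
columns = {
    "OriginAvgWind": [5, 8, 12, 7, 10],
    "OriginWindGust": [12, 18, 27, 15, 22],
    "OriginPrecip": [0, 30, 0, 5, 0],
}

flagged = correlated_pairs(columns)
print(flagged)
```

Pairs that are flagged are candidates for dropping one of the two variables; whether to do so remains a modelling judgment.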
PROBLEM 4 - PREDICTIONS ON THE TEST SET
Make predictions on the test set using your linear regression model. What is the Sum of Squared
Errors (SSE) on the test set?
Answer – SSE = 5101504
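The SSE is the sum of squared differences between observed and predicted delays. The sketch below shows the calculation on three made-up flights and then converts the reported SSE into a root mean squared error (RMSE). The test-set size is an assumption (30% of 9,381 flights, rounded), since the exact split is not stated.

```python
import math

def sse(actual, predicted):
    """Sum of squared errors between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def rmse_from_sse(total_sse, n):
    """Root mean squared error implied by an SSE over n observations."""
    return math.sqrt(total_sse / n)

# Toy example with hypothetical delays, in minutes.
actual = [10, 0, 35]
predicted = [7, 4, 30]
toy_sse = sse(actual, predicted)  # 3^2 + 4^2 + 5^2

# Assumed test-set size: 30% of the 9,381 flights.
assumed_test_size = round(0.30 * 9381)
reported_rmse = rmse_from_sse(5101504, assumed_test_size)
print(toy_sse, assumed_test_size, round(reported_rmse, 1))
```

Under that assumption, the reported SSE corresponds to a typical prediction error of about 42.6 minutes per flight, which is easier to interpret than the raw sum.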
PROBLEM 5 - A CLASSIFICATION PROBLEM
Let's turn this problem into a multi-class classification problem by creating a new dependent
variable. Our new dependent variable will take three different values: "No Delay", "Minor Delay",
and "Major Delay". If the delay is 0 minutes or less, call it "No Delay"; if it is more than 0 and
up to 30 minutes, call it "Minor Delay"; and if it is more than 30 minutes, call it "Major Delay".
Create this variable, called "DelayClass", in your dataset (you should do this in Excel):
1. How many flights in the dataset Airlines had no delay? - 4688
2. How many flights in the dataset Airlines had a minor delay? - 3150
3. How many flights in the dataset Airlines had a major delay? - 1543
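The assignment asks for this step in Excel; as an equivalent illustration, the same thresholds can be written as a small Python function. The function name and the sample delays are hypothetical.

```python
from collections import Counter

def delay_class(total_delay):
    """Map a delay in minutes to one of the three DelayClass labels."""
    if total_delay <= 0:
        return "No Delay"
    if total_delay <= 30:
        return "Minor Delay"
    return "Major Delay"

# Hypothetical delays covering both boundaries (0 and 30 minutes).
sample_delays = [-5, 0, 12, 30, 31, 90]
counts = Counter(delay_class(d) for d in sample_delays)
print(counts)

# Sanity check on the reported counts: they should cover all 9,381 flights.
reported_total = 4688 + 3150 + 1543
print(reported_total)
```

The three reported counts add up to 9,381, so every flight in the dataset falls into exactly one class.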
PROBLEM 6 - A MULTI-CLASS BOOSTED TREE MODEL
Build a model to predict "DelayClass" using all of the other variables as independent variables.
Answer – Overall accuracy is 52.18%
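Overall accuracy is the share of flights whose predicted class matches the actual class. The sketch below computes it, along with a count of each (actual, predicted) pair, on a handful of made-up labels; it does not reproduce the boosted tree model itself.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels for five flights.
actual = ["No Delay", "Minor Delay", "Major Delay", "No Delay", "Minor Delay"]
predicted = ["No Delay", "No Delay", "Major Delay", "No Delay", "Minor Delay"]

acc = accuracy(actual, predicted)
confusion = confusion_counts(actual, predicted)
print(acc)
print(confusion[("Minor Delay", "No Delay")])
```

As a point of comparison, 4,688 of the 9,381 flights (about 50.0%) are "No Delay", so always predicting that class would score roughly 50% on the full dataset; the class mix in the test split may differ slightly.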
PROBLEM 7 - A TWO-CLASS MODEL
Using Excel again, convert the above problem to a two-class problem. Any delay of up to 30 minutes
is called "No Delay", and any delay of more than 30 minutes is called "Delay".
Answer - Overall accuracy is 81.9%
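The two-class labels follow directly from the three-class ones: "No Delay" and "Minor Delay" merge into "No Delay", and "Major Delay" becomes "Delay". The sketch below applies that mapping to the class counts reported in Problem 5 and computes the share of the larger class; the function name is hypothetical.

```python
def to_two_class(delay_class):
    """Collapse the three DelayClass labels into "No Delay" / "Delay"."""
    return "Delay" if delay_class == "Major Delay" else "No Delay"

# Class counts reported in Problem 5.
three_class_counts = {"No Delay": 4688, "Minor Delay": 3150, "Major Delay": 1543}

two_class_counts = {}
for label, count in three_class_counts.items():
    key = to_two_class(label)
    two_class_counts[key] = two_class_counts.get(key, 0) + count

# Share of flights in the larger class (accuracy of always predicting it).
majority_share = max(two_class_counts.values()) / sum(two_class_counts.values())
print(two_class_counts, round(majority_share, 4))
```

On the full dataset this gives 7,838 "No Delay" flights against 1,543 "Delay" flights, a majority share of about 83.6%. That share is a useful yardstick for the accuracy figures here, keeping in mind that the test split may have a slightly different class mix.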
Using Azure Studio, compare the accuracy of the following models (split the data into training and test sets):
1. Logistic Regression – 8.9%
2. SVM – 84.4%
3. GBM – 81.7%
4. Random Forest – 52.18%