Flight Arrival Delay Prediction


18th May 2020

Abstract

 

The main objective of this paper is to address our research question: will a particular flight arrive at its destination late or not? Based on an initial review of the dataset, a supervised learning approach will be used to answer this question, which means we will split the dataset into training and testing components.

The model we develop will be trained on the Airline On-Time Performance dataset, which includes all commercial flight arrival and departure details in the USA between October 1987 and February 2019. This is a relatively large dataset with almost 186 million records. Given the timeline and scale of the Capstone project, we will select a subset of the dataset for our analysis.

Logistic regression, Naïve Bayes and SVM models are some of the techniques we will consider using for training and testing our model in R programming language. 

The following questions will also be considered using descriptive statistical methods:

  1. What is the best day of the week/time of year to fly to minimize delays?
  2. Which carrier suffers from more delays?
  3. How well do departure delays predict arrival delays?

The dataset is publicly hosted at Stat Computing and is originally sourced from RITA, a unit of U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS).

Introduction

The inconvenience resulting from flight delays has been a long-standing challenge for passengers, airports and airlines. A study conducted by the U.S. Federal Aviation Administration (FAA) in 2010 analyzed data from 2007 in order to quantify the economic impact of flight delays. It found that a cost of 32.9 billion USD was borne by American passengers and airlines. You can review the study here.

Over the years, numerous papers and studies have been devoted to addressing this challenge. The purpose of this paper is to use the dataset that is thoroughly explained here to train and test a machine learning model that predicts flight arrival delays from the features most relevant to the topic. These features will be chosen on the basis of descriptive statistical analysis of the data.

The aim is to predict whether a flight will arrive at the destination with delay or not, given the circumstances.

Literature Review

 

A preliminary review of the literature from various publications and papers on the topic is summarized below:

 

1. Predicting Airline Delays [1] (Bandyopadhyay & Guerrero, 2012)

The paper uses a dataset originally sourced from the Bureau of Transportation Statistics. The objective is to analyze and predict flight departure delays for a sample of flights in the USA, the main goals being:

  1. Identify the most influencing factors causing flight delays
  2. Predict if a specific flight will be delayed or not,
  3. Estimate the magnitude and impact in case of a delay.

Linear regression is used to identify the most influencing factors causing flight delays. Next, an SVM (Support Vector Machine) classifier is used to predict if there will be any delays. Finally, a non-parametric quadratic regression algorithm is proposed to estimate the magnitude of delays. 

 

 

2. Estimating Flight Departure Delay Distributions [2] (Tu, Ball, & Jank, 2006)

The paper aims to develop a strategic departure delay prediction model for estimating flight departure delays, required as part of air traffic congestion prediction models based on the identification of major factors influencing flight departure delays. The model employs nonparametric methods for daily and seasonal trends, and uses a mixture distribution to estimate the residual error. To overcome problems with local optima in the mixture distribution, a global optimization version of the Expectation Maximization algorithm borrowing ideas from Genetic Algorithms is developed. The model demonstrates reasonable goodness of fit, robustness and predictive capabilities. Flight data from Denver International Airport in the year 2000 was used.

 

 

3. Predicting Departure Delays of US Domestic Flights [3] (Cole & Donoghue, 2017)

This project trains a logistic regression model to predict flight delays of more than 15 minutes, based on statistics of past flights. Features of the flights known at the time of booking, such as the airline, month, week, and hour of departure, were used to train the model. The best approach trained separate models for each airport and achieved an area under the receiver operating characteristic curve (AUC) of 0.689.

 

4. Characterization and Prediction of Air Traffic Delays [4] (Rebollo & Balakrishnan, 2014)

This paper proposes a new set of models predicting flight delays over a two-year period (2007 and 2008) in the USA, using the 100 most-delayed links in the system. The primary objective is to predict departure delays on a specific link (network of flights) or at a particular airport at some time in the future. The models include temporal and spatial delay states as explanatory variables. Random Forest algorithms were adapted to predict departure delays between 2 and 24 hours ahead. In addition to local delay variables, the paper proposes incorporating new network delay variables, which characterize the global delay state of the entire National Airspace System at the time of prediction. The performance of the proposed models is analyzed both in classifying delays as above or below a certain threshold and in predicting delay values. For a 2-hour forecast, the average test error across the 100 links is 19% when classifying delays as above or below the 60-minute threshold.

5. Predictive Modeling of Aircraft Flight Delay [5] (Kalliguddi & Leboulluec, 2017)

This paper investigates the significant factors responsible for flight delays in 2016 based on data extracted from the Bureau of Transportation Statistics (BTS) comprising one million instances across 8 attributes. Machine learning techniques and statistical models such as Decision Trees, Random Forest and Multiple Linear Regressions were used to develop a predictive model in order to identify delays in advance. By identifying critical parameters responsible for flight delay, the model attempts to put forth a solution to the delay losses incurred by the airline industry.

 

6. Analysis of Aircraft Arrival Delay and Airport On-Time Performance [6] (Bai, 2006)

This research paper develops statistical models of airport delay and single flight arrival delay, using data sourced from the Federal Aviation Administration (FAA). Multivariate regression, ANOVA, neural networks and logistic regression were used to detect the pattern of airport delay, aircraft arrival delay and schedule performance. These models are then integrated in the form of a system for aircraft delay analysis and airport delay assessment. The assessment of an airport’s schedule performance is discussed.

 

7. Multi-Factor Model for Predicting Delays at U.S. Airports [7] (Xu, Sherry, & Laskey, 2008)

This project uses multi-factor models to predict airport delays in 15-minute periods across thirty-four U.S. airports. The models are developed with piece-wise linear regressions and Multivariate Adaptive Regression Splines (MARS) for generated delays and absorbed delays at each airport. The models were generated based on historic data for each airport. After application of several test datasets, accuracy evaluation shows a mean absolute prediction error of 5.3 minutes for generated delay and 2.2 minutes for absorbed delay across all the airports. A summary of the factors that influence the performance of each airport is provided and the implications of each are discussed.

8. Flight Delay Prediction [8] (Martinez, 2012)

The project proposes estimating the probability distribution of flight delays using kernel density estimation models. It does not try to model the underlying processes; rather, it only analyzes past observations. The models, of increasing complexity, have been implemented, optimized and evaluated on a large scale, using several years of US domestic flight delay records. As part of the evaluation, the performance of some of the models in predicting delay distributions is analyzed.

 

9. Modeling Flight Delays [9] (Sauvestre, Duperier & Leaf, 2016)

Using publicly available flight information and weather data, the paper aims to predict whether a flight will be delayed by more than 15 minutes across the 40 largest airports in the United States. A flight’s delay can arise as a result of a previous flight’s delay, hence features to capture these second-order behaviors were incorporated in the analysis. Data was classified using Random Forest, Gaussian Naive Bayes, Logistic Regression, and Neural Networks, and achieved a best overall F1-score of 82% using a Random Forest classifier.

 

10. Application of ML Algorithms to Predict Flight Arrival Delays [10] (Kuhn & Jamadagni, 2017)

Recognizing the harmful economic and environmental impact of growth in the aviation industry, this paper applies machine learning algorithms such as decision tree, logistic regression and neural network classifiers to predict whether a given flight’s arrival will be delayed. It simplifies the analysis and predicts with a test accuracy of approximately 91% for all three classifiers, using only 3 critical attributes chosen from a selection of attributes such as departure date, departure delay, distance between the two airports and scheduled arrival time. A comparison of the decision tree classifier with logistic regression and a simple neural network for various figures of merit is also provided.

Dataset

As mentioned in the abstract, the Airline On-Time Performance dataset includes all commercial flight arrival and departure details in the USA between October 1987 and February 2019. This is a relatively large dataset with almost 186 million records. Given the timeline and scale of the Capstone project, we have elected to use 2 years’ worth of data, from 2007 and 2008.

The dataset is publicly hosted at Stat Computing and is originally sourced from RITA. The selected subset comprises the following 29 features [11] with 14,462,943 observations prior to data cleaning.

| S. No | Name | Description |
| --- | --- | --- |
| 1 | Year | 2007-2008 |
| 2 | Month | 1-12 |
| 3 | DayofMonth | 1-31 |
| 4 | DayOfWeek | 1 (Monday) – 7 (Sunday) |
| 5 | DepTime | actual departure time (local, hhmm) |
| 6 | CRSDepTime | scheduled departure time (local, hhmm) |
| 7 | ArrTime | actual arrival time (local, hhmm) |
| 8 | CRSArrTime | scheduled arrival time (local, hhmm) |
| 9 | UniqueCarrier | unique carrier code |
| 10 | FlightNum | flight number |
| 11 | TailNum | plane tail number |
| 12 | ActualElapsedTime | in minutes |
| 13 | CRSElapsedTime | in minutes |
| 14 | AirTime | in minutes |
| 15 | ArrDelay | arrival delay, in minutes |
| 16 | DepDelay | departure delay, in minutes |
| 17 | Origin | origin IATA airport code |
| 18 | Dest | destination IATA airport code |
| 19 | Distance | in miles |
| 20 | TaxiIn | taxi in time, in minutes |
| 21 | TaxiOut | taxi out time, in minutes |
| 22 | Cancelled | was the flight cancelled? |
| 23 | CancellationCode | reason for cancellation (A = carrier, B = weather, C = NAS, D = security) |
| 24 | Diverted | 1 = yes, 0 = no |
| 25 | CarrierDelay | in minutes |
| 26 | WeatherDelay | in minutes |
| 27 | NASDelay | in minutes |
| 28 | SecurityDelay | in minutes |
| 29 | LateAircraftDelay | in minutes |

Table 1 – Dataset Variables Description

 

Approach

 

Our approach follows a simple five-step process, with each stage building upon the one before.

  1. First, we examine and clean the data.
  2. Next, we perform descriptive analysis to better understand the salient features of the data and answer our research questions.
  3. Feature selection for data modelling then identifies the most influential attributes.
  4. We then test three different algorithms to identify the best-performing model.
  5. Finally, we train, tune and test our model for the chosen machine learning algorithm.

We will utilize R, Excel and Tableau as the main tools to help with our analytics.

Figure 1 – Approach Steps

 

Step 1 – Data Cleaning

 

The focus of this step is to remove the ‘noise’ in the data and make it ready for our analysis:

  1. Features were converted to the most appropriate and relevant format.
  2. Records comprising NAs were removed.
  3. Delays with negative values were converted to zero.
  4. “Arrival delay” and “Departure delay” were transformed to binary variables making it easier to identify if an aircraft was delayed or not.
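The cleaning rules listed above can be sketched in a few lines. The original work was done in R; the Python below is only an illustration. It assumes that "delayed" means a delay greater than zero minutes after clamping, and it checks only the two delay columns for missing values, whereas the essay removes any record containing an NA. The function names are hypothetical; the column names ArrDelL and DepDelL follow the binary variables used later in the essay.

```python
def clean_delay(minutes):
    """Clamp negative delays (early flights) to zero."""
    return max(0, minutes)


def to_binary(minutes):
    """Return 1 if the flight was delayed, else 0 (assumed threshold: > 0 minutes)."""
    return 1 if minutes > 0 else 0


def clean_records(records):
    """Drop records with missing delays, clamp negatives, add binary flags."""
    cleaned = []
    for rec in records:
        # Remove records with missing values (only delay columns checked here).
        if rec.get("ArrDelay") is None or rec.get("DepDelay") is None:
            continue
        arr = clean_delay(rec["ArrDelay"])
        dep = clean_delay(rec["DepDelay"])
        cleaned.append({**rec, "ArrDelay": arr, "DepDelay": dep,
                        "ArrDelL": to_binary(arr), "DepDelL": to_binary(dep)})
    return cleaned


# Made-up sample rows, not taken from the dataset.
sample = [
    {"ArrDelay": -5, "DepDelay": 0},
    {"ArrDelay": 30, "DepDelay": 12},
    {"ArrDelay": None, "DepDelay": 3},
]
cleaned = clean_records(sample)
print(cleaned)
```

The third sample row is dropped because its arrival delay is missing, and the early arrival in the first row is recorded as a zero-minute delay.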

R code for the data cleaning process can be accessed here.

After cleaning the data and removing features that are not relevant and records with “NA” values, the dataset is reduced to 18 features and 14,130,317 records with the following data types:

Figure 2 – Dataset Structure

A more detailed and thorough description of the other remaining steps in our approach and our findings will follow in the Final Project Report.

 

Step 2 – Descriptive Analytics

As part of our exploratory data analysis, we performed numerous data visualizations to understand the most influencing factors impacting arrival delays and identify any hidden patterns. In addition, this analysis will further help answer the following research questions introduced in our abstract:

  1. Best day of week/time of year to fly to minimize delays?
  2. Which carrier suffers from more delays?
  3. How well does departure delays predict arrival delays?

R code for the descriptive analytics process can be accessed here.

Step 3 – Feature Selection

The Boruta package in R was used to perform the feature selection. Boruta is an all-relevant feature selection wrapper algorithm, capable of working with any classification method that outputs a variable importance measure (VIM); by default, Boruta uses Random Forest. The method performs a top-down search for relevant features by comparing the importance of the original attributes with the importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features to stabilize that test.

In other words, the algorithm iteratively compares the importance of attributes with the importance of shadow attributes, created by shuffling the original ones. Attributes that are significantly less important than the shadow ones are successively dropped. On the other hand, attributes that are significantly more important than the shadows are confirmed.
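The shadow-attribute idea can be illustrated with a much simplified sketch. This is not the Boruta implementation: it swaps Random Forest importance for the absolute Pearson correlation with the target, and it replaces Boruta's statistical test with a plain hit-rate threshold (the hypothetical `keep_fraction` parameter). The feature values are synthetic.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def shadow_selection(features, target, n_iter=50, keep_fraction=0.9, seed=0):
    """Confirm features that beat the best shadow in most iterations."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shadow attributes: shuffled copies that keep each feature's
        # distribution but break any link with the target.
        shadow_scores = []
        for values in features.values():
            shadow = list(values)
            rng.shuffle(shadow)
            shadow_scores.append(abs_corr(shadow, target))
        best_shadow = max(shadow_scores)
        for name, values in features.items():
            if abs_corr(values, target) > best_shadow:
                hits[name] += 1
    return {name: ("confirmed" if count / n_iter >= keep_fraction else "rejected")
            for name, count in hits.items()}


# Synthetic data: one feature tied to the target, one unrelated to it.
target = [i % 2 for i in range(200)]
features = {
    "dep_delay": [30 * t + (i % 7) for i, t in enumerate(target)],
    "noise": [(i // 2) % 2 for i in range(200)],
}
result = shadow_selection(features, target)
print(result)
```

In this toy setup the feature tied to the target is confirmed and the unrelated one is rejected, which mirrors how "Departure Delay" stands out in the essay's results.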

Out of 9 initial features, 2 were rejected and 7 were confirmed. The most influencing factors on “Arrival delays” are:

  1. Departure Delay
  2. Arrival Time
  3. Departure Time
  4. Distance
  5. Carrier
  6. Month
  7. Destination

R code for the feature selection process can be accessed here.

Step 4 – Model Selection

 

The primary objective of this research project is to classify future flight arrivals as either “Arrive on-time” or “Arrive with delay”. To that end, 3 different algorithms were tested and the corresponding findings documented:

  1. Logistic Regression,
  2. Naïve Bayes, and
  3. Support Vector Machine (SVM)

SVM turns out to be the highest performing model with an accuracy of 76.7%.

R code for the model selection process can be accessed here (in progress)

Step 5 – Tuning, Training, Testing

Our final step of the approach involved performing a tuning of the SVM model to decide the best performing parameters of gamma and cost. Testing the model on different sizes of the test datasets varying between 12,500 and 125,000 did not have a significant impact on the accuracy of the model. Introduction of new variables and adjustment of cost improved the accuracy significantly.

A summary of the different tests and trainings executed using SVM and the results is available here.

 

Results

 

Step 2 – Descriptive Analysis

R code for descriptive analytics can be accessed here. The overall delay time (in minutes) for arriving and departing flights fell by 24% and 18% respectively in 2008 compared with 2007.

Figure 3 – Annual Delays

The number of flights in 2008 fell 6% from 2007, while total delay minutes fell by 21%, so the improvement is not explained by lower traffic alone: the average delay per flight dropped from 21.5 minutes in 2007 to 18.1 minutes in 2008.
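The per-flight averages are consistent with the percentage changes quoted above. A quick check, using only the rounded figures from the text (so the result is approximate):

```python
avg_2007 = 21.5          # average delay per flight in 2007, in minutes
flights_change = -0.06   # number of flights fell 6%
delay_change = -0.21     # total delay minutes fell 21%

# average = total delay / number of flights, so the 2008 average is the
# 2007 average scaled by (1 + delay_change) / (1 + flights_change).
avg_2008 = avg_2007 * (1 + delay_change) / (1 + flights_change)
print(round(avg_2008, 1))  # 18.1
```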

Figure 4 – Annual Number of Flights

Most arrival delays appear to happen on Fridays, Thursdays, Mondays and Sundays. Passengers are therefore likely to face fewer delays on Tuesdays, Wednesdays and Saturdays.

Figure 5 – Daily Delays

The total number of flights per weekday follows the same pattern as total delay minutes. Across arrivals and departures, flights are delayed on average by 20.5 minutes on Mondays, 21.9 on Thursdays, 24.9 on Fridays and 21.7 on Sundays.

 

Figure 6 – Daily Number of Flights

December, June, July, February and August experience the longest delays. These are primarily high travel seasons with the highest number of flights.

Figure 7 – Monthly Delays

 

July, August, May and March saw the highest number of flights, which does not fully match the ranking of months by total delay minutes.

 

Figure 8 – Monthly Number of Flights

Based on the above analysis, we can now address the first research question, “When is the best day of week/time of year to fly to minimize delays?”

Passengers who travel on Tuesdays, Wednesdays and Saturdays are less likely to experience flight delays than on other days of the week. The same applies to months: passengers travelling in April, May, September, October and November are less likely to experience flight delays than in other months of the year.

To address the second research question, “Which carrier suffers from more delays?”, we first look at the top 10 airlines with the greatest number of delayed arrival and departure minutes. We then separately look at the overall outlook of the top 10 carriers with the most delayed minutes. AA (American Airlines) and WN (Southwest Airlines) are the carriers at the top of this list.

Figure 9 – Top 10 Carriers Delays

 

Looking at the number of flights operated by each carrier, we note that WN (Southwest Airlines), AA (American Airlines) and OO (SkyWest Airlines) stand out.

 

 

Figure 10 – Total Number of Flights per Carrier

 

Atlanta, Orlando and Dallas are among the top 3 high traffic airport destinations as illustrated by the number of flights in Figure 11.

Figure 11 – Highest Traffic Destinations

Phoenix, Atlanta and Kentucky are the top 3 airports with the greatest number of on-time flights as illustrated in Figure 12.

Figure 12 – Airports with the highest number of on-time flights

An illustration of the 2007 arrival and departure delays per airport indicates that Orlando, Atlanta, Dallas and Newark are the most congested airports.

Figure 13 – Arrival and Departure Delay per Airport

Airports with the highest number of delays in 2007.

-  Colors represent total departure delay.

-  The size of each bubble represents total arrival delay.

Analyzing Airlines, Airports and Arrival Delays simultaneously, we can see from Figure 14 that EV (ExpressJet Airlines) experiences its longest delays at Atlanta, AA (American Airlines) and MQ (Envoy Air) at Orlando and Dallas. XE (ExpressJet) experiences its longest delays at Newark.

Figure 14 – Arrival Delay per Airport

-  Colors represent carriers.

-  The size of each bubble represents total arrival delay per destination.

The number of arrival and departure delays in excess of 15 minutes decreased by ~16% in 2008.

Figure 15 – Sum of Annual Delays exceeding 15 minutes

Such delays mostly tend to happen on Fridays, Thursdays and Mondays.

Figure 16 – Sum of Weekday Delays exceeding 15 minutes

From a monthly perspective, passengers traveling in December, June and July are more likely to face arrival and departure delays over 15 minutes.

Figure 17 – Sum of Monthly Delays exceeding 15 minutes

WN (Southwest Airlines), AA (American Airlines), MQ (Envoy Air), UA (United Airlines) and OO (SkyWest Airlines) are carriers with the greatest number of arrival delays exceeding 15 minutes.

Figure 18 – Airline Carriers with most no. of delays exceeding 15 minutes

Flights with no delays in 2008 amounted to 24,576 which is a 17% decline from 29,620 flights in 2007.

Figure 19 – Annual On-Time Flights

WN (Southwest Airlines), YV (Mesa Airlines) and OH (Comair) are the top 3 carriers by number of flights with no delays.

Figure 20 – Airlines with most On-Time Flights

Figure 21 illustrates that the Late Aircraft category is the biggest contributor to all delays, followed by National Aviation System (NAS), Carrier, Weather and Security delays. A Late Aircraft delay arises when the aircraft’s previous flight was itself delayed.

Figure 21 – Most Influential Delay Causes

An analysis of arrival and departure times shows that flights departing between 0600 and 2000 are more likely to arrive with delays, with delays peaking for flights departing at 1700. Figure 22b shows that the number of departing flights that arrive late increases between 0600 and 1700. This makes sense, as most departing air traffic also falls between these hours, as illustrated in Figure 22a. The best hours to fly, with the lowest probability of an arrival delay, are between 2000 and 0400.

 

Figure 22 a – Arrivals and Departures by Hour

 

Figure 22 b – Best hours of the day to fly

Our last research question asks how well departure delays predict arrival delays. To address this, we tabulate the different combinations of “Arrival delays” and “Departure delays”.

The pie chart in Figure 23 below shows that in almost 77% of cases the two indicators agree: either there is a departure delay (DepDelL = ‘1’) and an arrival delay (ArrDelL = ‘1’), or the departure is on time (DepDelL = ‘0’) and there is no arrival delay (ArrDelL = ‘0’).

Figure 23 – Arrival and Departure Delays Dependency

 

 

Interpreting the Phi Coefficient

To answer our last research question more rigorously, we calculated the phi coefficient, first introduced by Karl Pearson.

 

The phi coefficient is a measure of the degree of association between two binary variables, in our case: Arrival delays and Departure delays. This measure is similar to the correlation coefficient in its interpretation.

The phi coefficient is a symmetrical statistic, which means the independent variable (Departure delays) and the dependent variable (Arrival delays) are interchangeable.

The interpretation of the phi coefficient is similar to that of the Pearson Correlation Coefficient. The range is from -1 to 1, where:

  • 0 is no relationship.
  • 1 is a perfect positive relationship: all of our data falls in the diagonal cells.
  • -1 is a perfect negative relationship: all of our data falls in the off-diagonal cells.

In this case our phi coefficient is 0.53, which shows a moderate positive association between Arrival delays and Departure delays. As such, we can say that Departure delays are one of the positive influencing factors on Arrival delays.
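For reference, the phi coefficient of a 2x2 table can be computed directly from its four cell counts. The Python sketch below uses the standard formula; the function names are illustrative and the counts are made up, since the essay's actual table is not reproduced in the text.

```python
import math


def phi_coefficient(n11, n10, n01, n00):
    """Phi for a 2x2 table: n11 = both delayed, n00 = both on time,
    n10 and n01 = the two mismatched combinations."""
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    denom = math.sqrt(row1 * row0 * col1 * col0)
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (n11 * n00 - n10 * n01) / denom


def agreement_rate(n11, n10, n01, n00):
    """Share of flights where the two binary indicators match."""
    return (n11 + n00) / (n11 + n10 + n01 + n00)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
phi = phi_coefficient(40, 10, 10, 40)
agree = agreement_rate(40, 10, 10, 40)
print(phi, agree)
```

When delayed and on-time flights are equally common in both variables, phi equals twice the agreement rate minus one. Under that assumption a 77% agreement rate would correspond to a phi of about 0.54, which is in line with the 0.53 reported here.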

 

Predictive Analysis

 

Step 3 – Feature Selection

 

The first step towards building our model is feature selection. As mentioned before, the Boruta package in R was used to perform the feature selection; please refer to Step 3 – Feature Selection in the Approach section for more details on Boruta. R code for feature selection is available here.

In order to predict arrival delays, nine candidate attributes were considered in feature selection, of which “DayofWeek” and “Origin” appear to be the least influential on delays. These are therefore excluded from the model.

The charts below illustrate and summarize the Boruta feature selection results. Figure 24a demonstrates all the features we ran the feature selection test on. “Departure delay” by far has the greatest impact on “Arrival Delay”. To have better visibility over all other attributes and their level of importance, Figure 24b shows a zoomed in version of Figure 24a. The green boxplots are our confirmed features and the red ones are the rejected features. “Day of week” and “Origin” are not significantly affecting “Arrival delay”.

Figure 24a – Boruta Result Plot

 

Figure 24b – Boruta Result Plot (Zoomed In)

At this stage of our analysis and model selection, we aim to predict whether an aircraft will be delayed or not at the destination airport given the selected features:

  1. Departure Delay
  2. Arrival Time
  3. Departure Time
  4. Distance
  5. Carrier
  6. Month
  7. Destination

Put differently, we want to know if an aircraft will be delayed on “Arrival” or not. Since this is a classification problem, we have chosen the following models:

  1. Logistic Regression
  2. Naïve Bayes
  3. SVM (Support Vector Machines)

 

Step 4 – Model Selection

R code for model selection is available here.

Logistic Regression

 

Running logistic regression on our dataset with 7 features, with 12,500 records each in the training and testing sets, gives around 50% accuracy, which is no better than random guessing on a two-class problem. Before tuning and testing different features, we try the Naïve Bayes and SVM models as well.

| Method | Accuracy | Sensitivity | Specificity |
| --- | --- | --- | --- |
| Logistic Regression | 50.09% | 57.64% | 41.06% |

Table 2 – Logit Regression Model Results
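The three measures reported in Table 2 are all derived from a confusion matrix. A minimal Python sketch follows; which class the essay treats as "positive" is not stated, so the sketch assumes that "delayed" is the positive class. The function name and the counts are hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp: delayed flights predicted delayed      fn: delayed flights predicted on time
    fp: on-time flights predicted delayed      tn: on-time flights predicted on time
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)   # share of actual positives that were found
    specificity = tn / (tn + fp)   # share of actual negatives that were found
    return accuracy, sensitivity, specificity


# Made-up counts for illustration only.
accuracy, sensitivity, specificity = classification_metrics(tp=80, fn=20, fp=30, tn=70)
print(accuracy, sensitivity, specificity)
```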

 

Naïve Bayes

Running Naïve Bayes on the same dataset resulted in a much lower accuracy of around 37%, which is not good enough.

| Method | Accuracy | Sensitivity | Specificity |
| --- | --- | --- | --- |
| Naïve Bayes | 36.99% | 38.05% | 35.73% |

Table 3 – Naïve Bayes Model Results

 

Support Vector Machines (SVMs)

Support-Vector Machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a gap that is as wide as possible. This gap is the margin around the separating boundary, known as the hyperplane. New examples are then mapped into that same space and predicted to belong to a category based on which side of the hyperplane they fall.
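The geometric picture above can be made concrete for the linear case. In the Python sketch below, a hyperplane is described by a weight vector `w` and an offset `b`; a point is classified by the sign of w·x + b, and the margin has width 2/||w||. The weights are hypothetical values chosen for illustration; in practice they are found by the SVM training algorithm.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Width of the gap between the two margin boundaries: 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0.
w, b = (3.0, 4.0), -5.0
print(predict(w, b, (2.0, 2.0)), predict(w, b, (0.0, 0.0)), margin_width(w))
```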

The first step is to find the best gamma and cost values for our SVM model. To do so, we run the tune.svm command over the selected parameter ranges with 10-fold cross-validation on a sample of 12,500 data points. To select the best kernel, we ran SVM with all 4 kernel options, and “Radial” proved to be the best.

Based on the results, the best gamma and cost values are both equal to 0.01, which gives a best performance of 0.2018586; for a classification task, tune.svm reports this performance as the cross-validated classification error. The SVM model is now run with the best selected parameters.
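To show what the tuning step does conceptually, the Python sketch below defines the radial (RBF) kernel and a plain grid search with k-fold cross-validation. It is not the tune.svm implementation: the folds are contiguous rather than randomly sampled, and the `cv_error` callback, which would train an SVM on the training indices and return its error on the held-out indices, is a hypothetical placeholder replaced here by a toy function.

```python
import itertools
import math


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search(n, gammas, costs, cv_error, k=10):
    """Return (mean_error, gamma, cost) for the pair with the lowest mean CV error."""
    folds = k_fold_indices(n, k)
    best = None
    for gamma, cost in itertools.product(gammas, costs):
        errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(n) if i not in held]
            errors.append(cv_error(train, held_out, gamma, cost))
        mean_error = sum(errors) / len(errors)
        if best is None or mean_error < best[0]:
            best = (mean_error, gamma, cost)
    return best


def toy_cv_error(train_idx, test_idx, gamma, cost):
    # Hypothetical stand-in for "fit an SVM and measure its error":
    # pretends the error is lowest at gamma = cost = 0.01.
    return abs(gamma - 0.01) + abs(cost - 0.01)


best_error, best_gamma, best_cost = grid_search(
    n=100, gammas=[0.001, 0.01, 0.1], costs=[0.01, 0.1, 1.0], cv_error=toy_cv_error)
print(best_error, best_gamma, best_cost)
```

A smaller gamma makes the kernel decay more slowly with distance, so each training point influences a wider region; the cost parameter controls how heavily misclassified training points are penalized.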

Refer to Figure 25 for details of the output results above.

Figure 25 – SVM Performance on Parameter Selection Plot (10-fold Cross Validation)

Following the parameter selection shown in Figure 25 above, the SVM model was trained on 12,500 training records and tested on 12,500 test records. The training and testing sets are sampled independently and there is no overlap between the two.

SVM results in an accuracy of 76.68%, i.e. the model correctly predicts whether an aircraft arrives on time or with a delay in about 77% of cases.

| Method | Accuracy | Sensitivity | Specificity |
| --- | --- | --- | --- |
| SVM | 76.68% | 83.21% | 68.74% |

Table 4 – SVM Model Results

Our prediction accuracy could potentially improve if we include other influencing factors such as “weather”. This influencing factor has been studied in “Predicting Airline Delays” and “Multi-Factor Model for Predicting Delays at U.S. Airports” papers as mentioned in the literature review section.

Step 5 – Tuning, Training and Testing

Comparing the accuracy of the three methods, SVM proves to be the leading model with an accuracy of 76.68%. This is higher than the accuracy obtained for the Logistic Regression and Naïve Bayes models, making SVM our preferred classification model. The results summarized below were all based on training and testing datasets of 12,500 records each.

| Method | Accuracy | Sensitivity | Specificity |
| --- | --- | --- | --- |
| SVM | 76.68% | 83.21% | 68.74% |
| Logistic Regression | 50.09% | 57.64% | 41.06% |
| Naïve Bayes | 36.99% | 38.05% | 35.73% |

Table 5 – Overall Model Performance Comparison

Testing the model on different sizes of the test datasets varying between 12,500 and 125,000 does not have a significant impact on the accuracy of the model. Table 6 summarizes the different tests and trainings that were run using SVM. Introduction of new variables and adjustment of cost improved the accuracy significantly.

Table 6 – Summary of SVM Tests

Conclusion

Based on our assessment of the descriptive analytics performed, we can conclude the following:

  • Tuesdays, Wednesdays and Saturdays are the best days to take a flight.
  • April, May, September, October and November experience significantly fewer flight delays than other months.
  • Flights departing between 2000 and 0400 are less likely to arrive with a delay.

Most of the above time factors are influenced by air traffic, so if air traffic patterns shift, these timings are likely to be affected.

  • AA (American Airlines), WN (Southwest Airlines) and MQ (Envoy Air) are the carriers with the most delayed minutes.
  • These airlines also carry the most traffic and have the highest frequency of flights.

In almost 77% of cases, either there is both a departure delay and an arrival delay, or the departure is on time and there is no arrival delay. The phi coefficient is 0.53, which shows a moderate positive association between Arrival delays and Departure delays. As such, we can say that Departure delays are one of the positive influencing factors on Arrival delays.

After assessing the 3 classification models, SVM is the preferred method. The SVM model was run on training and testing datasets of 12,500 records each, resulting in an accuracy of 76.68%, i.e. the model correctly predicts whether an aircraft arrives on time or with a delay in about 77% of cases. The model parameters gamma and cost are both 0.01, and the kernel used is radial.

Our prediction accuracy could potentially improve if we include other strong influencing factors such as “weather” in our model.

 

References

[1] Predicting Airline Delays (Bandyopadhyay & Guerrero, 2012)

      http://cs229.stanford.edu/proj2012/BandyopadhyayGuerrero-PredictingFlightDelays.pdf

[2] Estimating Flight Departure Delay Distributions (Tu, Ball, & Jank, 2006)

      http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.132.1147&rep=rep1&type=pdf

[3] Predicting Departure Delays of US Domestic Flights (Cole & Donoghue, 2017)

      https://srcole.github.io/assets/flight_delay/report.pdf

[4] Characterization and Prediction of Air Traffic Delays (Rebollo & Balakrishnan, 2014)

      http://www.mit.edu/~hamsa/pubs/RebolloBalakrishnanTRC2014.pdf

[5] Predictive Modeling of Aircraft Flight Delay (Kalliguddi & Leboulluec, 2017)

      http://www.hrpub.org/download/20171130/UJM3-12110417.pdf

[6] Analysis of Aircraft Arrival Delay and Airport On-Time Performance (Bai, 2006)

      http://etd.fcla.edu/CF/CFE0001049/Bai_Yuqiong_200605_MS.pdf

[7] Multi-Factor Model for Predicting Delays at U.S. Airports (Xu, Sherry, & Laskey, 2008)

      http://catsr.ite.gmu.edu/pubs/XuMultiFactorModelAirportDelaysTRBv6.pdf

[8] Flight Delay Prediction (Martinez, 2012)

      https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/153312/eth-5404-01.pdf

[9] Modeling Flight Delays (Sauvestre, Duperier & Leaf, 2016)

      http://cs229.stanford.edu/proj2016/report/DuperierSauvestreLeaf-ModelingFlightDelays-report.pdf

[10] Application of ML Algorithms to Predict Flight Arrival Delays (Kuhn & Jamadagni, 2017)

      http://cs229.stanford.edu/proj2017/final-reports/5243248.pdf

[11] Airline On-Time Performance dataset, variable descriptions

      http://stat-computing.org/dataexpo/2009/the-data.html


 
