Path: blob/main/Netezza/Airline_Delays.ipynb
1928 views
Airline Delay Analysis with Jupyter Notebook
Global Travel Associates (GTA) has been receiving feedback from their air travel clients within the United States concerning the increase in flight delays during the return to travel in 2023 after the pandemic. GTA has learned that the U.S. Department of Transportation, Bureau of Transportation Statistics has a data set available that contains flight delay information for the United States from 2003 through June 2023. GTA will perform analysis on this data using a Jupyter notebook to determine if: (1) flight delays have increased in 2023 over prior years as their customers claim, and (2) arrive at a list of airlines and airports that can be recommended to their customers where flight delays are at a minimum (to reduce the chances of their customers encountering a flight delay).
GTA wants to keep the costs of this analysis to a minimum, so the current data (2019 through April 2023) will be stored in a fully managed Netezza Performance Server as a Service (NPSaaS) database on Microsoft Azure and the historical data (2003 through 2018) will be stored in a Parquet format file on AWS S3 storage (low cost, resilient cloud storage) to reduce the overall cost of the data analysis effort.
Netezza python driver and python libraries
Netezza Cloud Connection and Verify Available Tables
The table creation and data loading portion of the demonstration has already been completed. The first query in this notebook checks that the tables are defined and available.
Quick View of the Data
The next query is a snapshot of the first couple of rows in the AIRLINE_DELAY_CAUSE_CURRENT table. The first objective in this analysis is to assess the current state of airline delays, to do this the data between 2019 and 2023 contained in the AIRLINE_DELAY_CAUSE_CURRENT table will be accessed. This query selects the first 5 rows of the table to verify the table exists and contains data.
Defining Columns
Year
Month
Carrier: Abbreviation
Carrier_Name: Carrier name full
Airport: Abbreviation
Airport_City
Airport_State: State Abbreviation
Airport_Name: Airport Name Full
Arr_Flights: Total Arrived Flights
Arr_Del15: Total Delayed Flights as defined by 15 minutes or more
Carrier_CT: Total delayed flights due to air carrier issues
Weather_CT: Total delayed flights due to weather
NAS_CT: Total delayed flights due to National Aviation System
Security_CT: Count of delayed flights caused by security
Late_Aircraft_CT: Count of delayed flights caused by late arriving aircraft
Arr_Cancelled: Count of cancelled flights
Arr_Diverted: Count of aircrafts diverted
Arr_Delay: Total minutes delayed
Carrier_Delay: Total minutes delayed due to the carrier
Weather_Delay: Total minutes delayed due to weather
NAS_Delay: Total minutes delayed due to National Aviation System
Security_Delay: Total minutes delayed due to Security
Late_Aircraft_Delay: Total minutes delayed due to late arriving aircraft
Please note: "count" (_CT) columns are pro-rated by minutes. For example, if the total delay was 45 minutes and 15 minutes was due to Weather and 30 minutes was due to Late Aircraft the count would be .33 for Weather and .66 for Late Aircraft.
How are these categories defined?
Air Carrier: The cause of the cancellation or delay was due to circumstances within the airline's control (for example; maintenance or crew problems, aircraft cleaning, baggage loading, fueling, and other reasons.).
Extreme Weather: Significant meteorological conditions (actual or forecasted) that, in the judgment of the carrier, delays or prevents the operation of a flight such as a thunderstorm, tornado, blizzard or hurricane.
National Aviation System (NAS): Delays and cancellations attributable to the national aviation system that refer to a broad set of conditions, such as non-extreme weather conditions, airport operations, heavy traffic volume, and air traffic control.
Late-arriving aircraft: A previous flight using the same aircraft arrived late, causing the present flight to depart late.
Security: Delays or cancellations caused by evacuation of a terminal or concourse, re-boarding of aircraft because of security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas.
Analysis 1 - comparing total delayed flights by reason
The end goal is to get an understanding of how airline delays have changed over time. The analysis begins with Query 1, comparing total delayed flights by reason (carrier, weather, NAS, security, and late aircraft).
Causes of Delays Over Peak Covid
The following code displays a bar chart visualizing the different delay counts by reason. This allows a quick assessment of why flights may have been having issues.
With the data visualized, it is clear there are some major descrepancies both by year as well as reason. Security days are minumal across all years. While Carrier and Late Aircrafts are the number 1 and 2 cause for flight delays consistently. An uninformed viewer may interpret 2020 and 2023 as being a particularly great years to travel since there were significantly less delays. However, due to the impacts of COVID, this assumption may be inaccurate. Additionally the year 2023 is in progress (not yet a complete year) so there is missing data that does not allow a full year comparison.
Because 2023 is a partial year, the data needs to be normalized to allow flight delays to be accurately compared between years. By examining the total flight delays compared to the total number of flights for the period, a more accurate analysis and comparison of flight delay data can be achieved.
Analysis 2 - normalized view of the flight delay data
The following code cells are a build to the final graph of flight delay data. First the sums are collected for the Total Cancelled Flights, Total Delayed Flights, Total Flights, Carrier Delays, Weather Delays, NAS Delays, Security Delays and Late Aircraft Delays. Next, using the pandas and numpy python libraries, the dataframe can calculate ratios for Delayed, Cancelled, and total Distrupted Flights (Delayed and Canceled combined). Lastly, a new data frame can be created from the original data frame indicating the percentage of delayed flights that had to do with Carrier, Weather, NAS, Security, and Late Aircraft Delays. The final cell combines all this data into a normalized and informed graph that displays distrupted flight ratios between 2019 and 2023.
Understanding Delays with Context
Calculate the Flight Ratios to account for partial 2023 data and COVID distruptions
Delayed_Total_Ratio, Cancelled_Total_Ratio, Distrupted_Total_Ratio
With this new normalized view of the flight delay data, it can be seen that the initial analysis of flight delays was not completely fair. Although 2020 was not a great year to travel due to COVID-19, of the flights that did operate, the overall chance of a flight delay was relatively low compared to prior years. 2023 looked as though it had a low quantity of flight delays. However, when analyzing the amount of flight delays compared to the total number of flights scheduled, it is clear that 2023 has not been performing well. To date, 2023 has the highest percentage of distrupted flights overall.
This data analysis provides a strong understanding of how the airline industry has been performing with respect to flight delays for the past couple of years. The analysis within this demonstration will now be extended to include historical data and determine if the current 2019 through mid-year 2023 data trends are identical to past trends or whether flight delays have been improving over time.
Analysis 3 - current data to historical data
It's good practice to get an overall understanding of the data prior to diving deeper. The first query run using historical data will look at the total flight delays by year from 2003 through 2018.
Historical data stored in AWS S3 - Netezza Parquet Tables used to access history (AIRLINE_DELAY_CAUSE_HISTORY)
Comparing Current to Historical
A quick and simple line graph shows the overall trend has been a decline in flight delays from 2003 through 2018. Lets see if a closer look at the data supports this fact.
Analysis 4 - delays been trending from the early 2000s until the 2018 timeframe
The goal of analysis 4 is to get the total average delays by month from 2003-2012 and compare that to the total avereage delays by month from 2013-2018. This data analysis can answer the question, "Overall how have flight delays been trending from the early 2000s until the 2018 timeframe?"
First, a query is run to gather the total flight delays per month of each year from 2002 through 2018.
Total Delays by Month and Year
From this data the average flight delays by month from 2003-2012 as well as the average flight delays by month from 2013-2018 are determined.
Average Delays by Month from 2003-2012 compared to 2013-2018
At a glance, this numeric output in tabular format provides a good comparison but lets perform some additional analysis and view it in a graph format to get a better understanding of the data.
Above, the flight delay information for each month-year combination has been calculated and this data highlights that the flight delays for every month from 2003-2012 were higher than the same months from 2012-2018. In summary, the data shows a 20% decline in flight delays between the two historical time periods (2003-2012 and 2013-2018). This confirms the initial thinking that airlines delays were in decline leading up to 2019.
Delays Compared 2003-2012 and 2013-2018
Data visualization using bar chart confirms the smaller number of flight delays for the latter of the historical time periods.
Analysis 5 - Impact of COVID and other events on Flight Delays
Now that we have seen a decline in flight delays leading up to 2019, lets compare to our current data with the historical data to answer the question, "Are flight delays still in decline or have they risen since the impact of COVID-19 and other events?"
Now a simple query is used to determine the number of delays for each month in 2023 (up to April 2023 which is the most recent data).
Delay Trends for 2023
Total Compared Delays over Time
Combining the 2023 monthly data with the original historical data provides this result ...
As mentioned earlier, 2023 has been trending very poorly for delays when compared to the earlier years (2019, 2020, 2021, and 2022) in the current time period data. The data in the graph highlights that 2023 flight delays are trending very poorly against the averages in the late 2010s when delays were improving. In some cases, 2023 is even worse than the average flight delaysin the earlier 2000s (2003-2012). This is a fantastic visual to understand how the airline industry has not completely recovered after the COVID-19 pandemic or has been impacted by other outside facters causing worsening flight delays.
Analysis 6 - Which air carriers and airports are best?
For the last portion of the analysis, the data analysis will determine which air carriers and airports are best to use if a passenger is trying to avoid delays. Since many of the airlines have merged or no longer exist, the air carrier recommendation should suggest only those air carriers still in operation as well as major airports with over 100,000 arriving flights overall.
Delays by carrier
From the current collections of data, a list of all active carries will be extracted.
Of the currently operating air carriers, the recommendation for the air carrier with least percentage of flight delays would be Delta.
A similar query will now be executed to determine the major airports and their flight delays in order from the most flight delays to the least flight delays.
Airport Delays
Of the major airports operating 100,000 flights or more, the major airports to avoid for encountering more flight delays are Newark, LaGuardia, Fort Lauderdale, and Orlando.
Conclusion
The data analysis for flight delay data is now complete. To review, the initial analysis was based on current flight delay data from 2019 through April 2023. Due to awareness of the COVID-19 pandemic as a possible distruptor for the flight delay data, the flight delay data was normalized to compare distrupted flight ratios across all three years and this analysis found that 2023 was trending with an increase in delays. Next, there were several queries performed to compare delay trends from the historical periods 2003-2012 and 2013-2018. By taking the average delays per month, the analysis showed that flight delays were decreasing in the late 2010s (2012-2018). Recovery from the COVID-19 pandemic and other global factors did impact flight delays when looking at 2023 flight data compared to pre-pandemic flight delay trends, as there has been a definite increase in flight delays during 2023. Lastly, this data analysis included information about airports and air carriers that were operating flights between 2003 and 2023. The final queries were able to determine which air carriers operate with the least amount of flight delays and which major aiports encounter the most flight delays. This allows a recomendation to be provided to travelers on which air carriers to fly and which airports to use.