GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/Netezza/Airline_Delays.ipynb

Kernel: Python 3

Airline Delay Analysis with Jupyter Notebook

Global Travel Associates (GTA) has been receiving feedback from its air travel clients within the United States about an increase in flight delays during the return to travel in 2023 after the pandemic. GTA has learned that the U.S. Department of Transportation, Bureau of Transportation Statistics has a data set available that contains flight delay information for the United States from 2003 through June 2023. GTA will analyze this data in a Jupyter notebook to: (1) determine whether flight delays have increased in 2023 over prior years, as its customers claim, and (2) arrive at a list of airlines and airports with the fewest flight delays that can be recommended to its customers (to reduce their chances of encountering a flight delay).

GTA wants to keep the cost of this analysis to a minimum, so the current data (2019 through April 2023) will be stored in a fully managed Netezza Performance Server as a Service (NPSaaS) database on Microsoft Azure, and the historical data (2003 through 2018) will be stored in a Parquet format file on AWS S3 storage (low cost, resilient cloud storage).

Netezza python driver and python libraries

In [ ]:

# Install the Netezza Python driver and pandasql, then import the libraries
!pip install nzpy
!pip install pandasql
import datetime
import warnings

import matplotlib.pyplot as plt
import numpy as np
import nzpy
import pandas as pd
import seaborn as sns
from pandasql import sqldf

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

Netezza Cloud Connection and Verify Available Tables

In [ ]:

# Netezza Cloud Connection Information
nz_host             = ""
nz_port             = 5480
nz_database         = ""
nz_user             = ""
nz_password         = ""

In [ ]:

# Connect to Netezza Performance Server on Cloud
nzcon = nzpy.connect(user=nz_user, password=nz_password, host=nz_host, database=nz_database, port=nz_port)

if bool(nzcon):
    print("Host     : " + nz_host)
    print("Port     :", nz_port)
    print("User     : " + nz_user)
    print("Password : ********")
    print("Database : " + nz_database)
    print()
    print("Connection successful.")
    print()
    print("Notebook is ready.")
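
The connection cell above leaves the credentials as empty strings. One possible way to fill them without typing secrets into the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names (NZ_HOST, NZ_PORT, NZ_DATABASE, NZ_USER, NZ_PASSWORD) and the helper load_nz_settings are hypothetical and not part of the original notebook.

```python
import os


def load_nz_settings(env=None):
    """Collect Netezza connection settings from environment variables.

    The variable names used here are hypothetical; substitute whatever
    names your own environment defines.
    """
    env = os.environ if env is None else env
    settings = {
        "host": env.get("NZ_HOST", ""),
        "port": int(env.get("NZ_PORT", "5480")),  # 5480 is the port used in this notebook
        "database": env.get("NZ_DATABASE", ""),
        "user": env.get("NZ_USER", ""),
        "password": env.get("NZ_PASSWORD", ""),
    }
    # Fail early if any text setting is still empty.
    missing = [key for key, value in settings.items() if value == ""]
    if missing:
        raise ValueError("Missing connection settings: " + ", ".join(missing))
    return settings


# Example with an explicit mapping instead of the real environment.
example = load_nz_settings({
    "NZ_HOST": "example-host",
    "NZ_DATABASE": "EXAMPLE_DB",
    "NZ_USER": "example_user",
    "NZ_PASSWORD": "example_password",
})
print(example["host"], example["port"])
```

The dictionary keys match the keyword arguments passed to nzpy.connect in the cell above, so the settings could be unpacked into that call.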

The table creation and data loading portion of the demonstration has already been completed. The first query in this notebook checks that the tables are defined and available.

In [ ]:

# List Netezza tables
# AIRLINE_DELAY_CAUSE_CURRENT - Netezza Table with 2019 - April 2023
# AIRLINE_DELAY_CAUSE_HISTORY - Netezza Parquet Table with 2003 - 2018

q0 = f"""
select OBJNAME as TABLENAME,
case
when objclass = 4905 then 'NETEZZA TABLE'
when objclass = 4911 then 'NETEZZA TABLE'
when objclass = 4999 then 'PARQUET TABLE'
when objclass = 4996 then 'EXTERNAL DATASOURCE'
end as OBJ_TYPE
from _t_object 
where objname like '%AIRLINE%' 
and objclass in (4905, 4911, 4999, 4996)
and objdb =
(select objid from _t_object where objname = '{nz_database.upper()}')
"""

q1 = """
select tablename from _v_table
"""
# if Netezza user with select on _t_object
df = pd.read_sql_query(q0, nzcon)

#df = pd.read_sql_query(q1, nzcon)
df

Quick View of the Data

The first objective in this analysis is to assess the current state of airline delays. To do this, the data between 2019 and 2023 contained in the AIRLINE_DELAY_CAUSE_CURRENT table will be accessed. The next queries list the years present in each table and then select the first 10 rows of the AIRLINE_DELAY_CAUSE_CURRENT table to verify that the table exists and contains data.

In [ ]:

# years in Netezza table
q = """
SELECT 
DISTINCT YEAR
FROM AIRLINE_DELAY_CAUSE_CURRENT
ORDER BY 1
"""

df = pd.read_sql_query(q, nzcon)
pd.set_option('display.max_columns', None)
#df.columns = df.columns.str.decode('utf-8')
df

In [ ]:

# years in parquet table
q = """
SELECT 
DISTINCT YEAR
FROM AIRLINE_DELAY_CAUSE_HISTORY
ORDER BY 1
"""

df = pd.read_sql_query(q, nzcon)
pd.set_option('display.max_columns', None)
#df.columns = df.columns.str.decode('utf-8')
df

In [ ]:

# first 10 rows of the Netezza table
q = """
SELECT 
*
FROM AIRLINE_DELAY_CAUSE_CURRENT
LIMIT 10
"""

df = pd.read_sql_query(q, nzcon)
pd.set_option('display.max_columns', None)
#df.columns = df.columns.str.decode('utf-8')
df

Defining Columns

Year
Month
Carrier: Abbreviation
Carrier_Name: Carrier name full
Airport: Abbreviation
Airport_City
Airport_State: State Abbreviation
Airport_Name: Airport Name Full
Arr_Flights: Total Arrived Flights
Arr_Del15: Total Delayed Flights as defined by 15 minutes or more
Carrier_CT: Total delayed flights due to air carrier issues
Weather_CT: Total delayed flights due to weather
NAS_CT: Total delayed flights due to National Aviation System
Security_CT: Count of delayed flights caused by security
Late_Aircraft_CT: Count of delayed flights caused by late arriving aircraft
Arr_Cancelled: Count of cancelled flights
Arr_Diverted: Count of diverted flights
Arr_Delay: Total minutes delayed
Carrier_Delay: Total minutes delayed due to the carrier
Weather_Delay: Total minutes delayed due to weather
NAS_Delay: Total minutes delayed due to National Aviation System
Security_Delay: Total minutes delayed due to Security
Late_Aircraft_Delay: Total minutes delayed due to late arriving aircraft

Please note: "count" (_CT) columns are pro-rated by minutes. For example, if the total delay was 45 minutes, of which 15 minutes was due to Weather and 30 minutes was due to Late Aircraft, the count would be 0.33 for Weather and 0.67 for Late Aircraft.
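
The pro-rating rule can be illustrated with a few lines of Python. This is a minimal sketch of the arithmetic described in the note, not code from the data provider; the function name prorate_delay_counts is hypothetical.

```python
def prorate_delay_counts(minutes_by_cause):
    """Split one delayed flight across causes in proportion to delay minutes."""
    total = sum(minutes_by_cause.values())
    if total == 0:
        # No delay minutes recorded: nothing to attribute.
        return {cause: 0.0 for cause in minutes_by_cause}
    return {cause: minutes / total for cause, minutes in minutes_by_cause.items()}


# The 45-minute example from the note: 15 minutes Weather, 30 minutes Late Aircraft.
shares = prorate_delay_counts({"weather": 15, "late_aircraft": 30})
print({cause: round(share, 2) for cause, share in shares.items()})
# prints {'weather': 0.33, 'late_aircraft': 0.67}
```

The shares for one flight always add up to 1, which is why summing a _CT column gives a (fractional) number of delayed flights rather than a number of minutes.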

How are these categories defined?

Air Carrier: The cause of the cancellation or delay was due to circumstances within the airline's control (for example, maintenance or crew problems, aircraft cleaning, baggage loading, fueling, and other reasons).

Extreme Weather: Significant meteorological conditions (actual or forecasted) that, in the judgment of the carrier, delay or prevent the operation of a flight, such as a thunderstorm, tornado, blizzard or hurricane.

National Aviation System (NAS): Delays and cancellations attributable to the national aviation system that refer to a broad set of conditions, such as non-extreme weather conditions, airport operations, heavy traffic volume, and air traffic control.

Late-arriving aircraft: A previous flight using the same aircraft arrived late, causing the present flight to depart late.

Security: Delays or cancellations caused by evacuation of a terminal or concourse, re-boarding of aircraft because of security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas.

Analysis 1 - comparing total delayed flights by reason

The end goal is to get an understanding of how airline delays have changed over time. The analysis begins with Query 1, comparing total delayed flights by reason (carrier, weather, NAS, security, and late aircraft).

Causes of Delays Over Peak Covid

In [ ]:

# SQL delays
q = """
SELECT year,
SUM(CAST(carrier_ct AS FLOAT)) as Carrier_Delay,
SUM(CAST(weather_ct AS FLOAT)) as Weather_Delay,
SUM(CAST(nas_ct AS FLOAT)) as NAS_Delay,
SUM(CAST(security_ct AS FLOAT)) as Security_Delay,
SUM(CAST(late_aircraft_ct AS FLOAT)) as Late_Aircraft_Delay
FROM AIRLINE_DELAY_CAUSE_CURRENT
GROUP BY year ORDER BY YEAR"""

pd.options.display.float_format = '{:,.2f}'.format
delays_cause_df = pd.read_sql_query(q, nzcon)
#delays_cause_df.columns = delays_cause_df.columns.str.decode('utf-8')
delays_cause_df

The following code displays a bar chart visualizing the different delay counts by reason. This allows a quick assessment of why flights may have been having issues.

In [ ]:

# set width of bars
barWidth = 0.15

# Set position of bar on X axis
r1 = np.arange(len(delays_cause_df['CARRIER_DELAY']))
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
r4 = [x + barWidth for x in r3]
r5 = [x + barWidth for x in r4]

plt.figure(figsize=(15, 6))  # width: 15, height: 6

# Make the plot
plt.bar(r1, delays_cause_df['CARRIER_DELAY'], color='blue', width=barWidth, edgecolor='white', label='Carrier Delay')
plt.bar(r2, delays_cause_df['WEATHER_DELAY'], color='orange', width=barWidth, edgecolor='white', label='Weather Delay')
plt.bar(r3, delays_cause_df['NAS_DELAY'], color='pink', width=barWidth, edgecolor='white', label='NAS Delay')
plt.bar(r4, delays_cause_df['LATE_AIRCRAFT_DELAY'], color='purple', width=barWidth, edgecolor='white', label='Late Aircraft Delay')
plt.bar(r5, delays_cause_df['SECURITY_DELAY'], color='green', width=barWidth, edgecolor='white', label='Security Delay')

# Add xticks in the middle of each group of five bars
plt.xlabel('Year', fontweight='bold')
plt.ylabel('Delayed Flights', fontweight='bold')
plt.xticks(r1 + 2 * barWidth, delays_cause_df['YEAR'])



# Create legend & Show graphic
plt.legend()
plt.show()

With the data visualized, it is clear there are some major discrepancies both by year and by reason. Security delays are minimal across all years, while Carrier and Late Aircraft are consistently the number 1 and 2 causes of flight delays. An uninformed viewer may interpret 2020 and 2023 as particularly great years to travel, since there were significantly fewer delays. However, due to the impacts of COVID, this assumption may be inaccurate. Additionally, the year 2023 is in progress (not yet a complete year), so there is missing data that does not allow a full year comparison.

Because 2023 is a partial year, the data needs to be normalized to allow flight delays to be accurately compared between years. By examining the total flight delays compared to the total number of flights for the period, a more accurate analysis and comparison of flight delay data can be achieved.

Analysis 2 - normalized view of the flight delay data

The following code cells build up to the final graph of flight delay data. First, the sums are collected for the Total Cancelled Flights, Total Delayed Flights, Total Flights, Carrier Delays, Weather Delays, NAS Delays, Security Delays and Late Aircraft Delays. Next, a SQL query calculates ratios for Delayed, Cancelled, and total Disrupted Flights (Delayed and Cancelled combined), along with the portion of the delayed-flight percentage attributable to Carrier, Weather, NAS, Security, and Late Aircraft delays. The final cell combines all this data into a normalized graph that displays disrupted flight ratios between 2019 and 2023.
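
To make the normalization concrete, here is a minimal pandas sketch of the same ratio arithmetic. The yearly totals below are hypothetical values chosen for illustration, not figures from the data set, and the column names are assumptions modeled on the queries in this notebook.

```python
import pandas as pd

# Hypothetical yearly totals, for illustration only (not values from the data set).
totals = pd.DataFrame({
    "YEAR": [2022, 2023],
    "TOTAL_FLIGHTS": [1000.0, 400.0],
    "TOTAL_DELAYED_FLIGHTS": [200.0, 100.0],
    "TOTAL_CANCELLED_FLIGHTS": [20.0, 8.0],
})

# Express each count as a percentage of all flights in that year.
ratios = pd.DataFrame({"YEAR": totals["YEAR"]})
ratios["DELAYED_TOTAL_RATIO"] = 100 * totals["TOTAL_DELAYED_FLIGHTS"] / totals["TOTAL_FLIGHTS"]
ratios["CANCELLED_TOTAL_RATIO"] = 100 * totals["TOTAL_CANCELLED_FLIGHTS"] / totals["TOTAL_FLIGHTS"]
# Disrupted = delayed + cancelled, as a percentage of all flights.
ratios["DISRUPTED_TOTAL_RATIO"] = ratios["DELAYED_TOTAL_RATIO"] + ratios["CANCELLED_TOTAL_RATIO"]
print(ratios)
```

In this made-up example the partial year has half as many delayed flights as the full year (100 versus 200) but a higher delayed percentage (25 versus 20), which is the effect the normalization is meant to expose.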

Understanding Delays with Context

In [ ]:

q = """
SELECT year,
SUM(CAST(arr_cancelled AS FLOAT)) as Total_Cancelled_Flights,
SUM(CAST(arr_del15 AS FLOAT)) as Total_Delayed_Flights,
SUM(CAST(arr_flights AS FLOAT)) as Total_Flights,
SUM(CAST(carrier_ct AS FLOAT)) as Carrier_Delay,
SUM(CAST(weather_ct AS FLOAT)) as Weather_Delay,
SUM(CAST(nas_ct AS FLOAT)) as NAS_Delay,
SUM(CAST(security_ct AS FLOAT)) as Security_Delay,
SUM(CAST(late_aircraft_ct AS FLOAT)) as Late_Aircraft_Delay
FROM AIRLINE_DELAY_CAUSE_CURRENT
GROUP BY year ORDER BY YEAR"""

pd.options.display.float_format = '{:,.2f}'.format
delays_cause_df2 = pd.read_sql_query(q, nzcon)
#delays_cause_df2.columns = delays_cause_df2.columns.str.decode('utf-8')
delays_cause_df2

Calculate the Flight Ratios to account for partial 2023 data and COVID disruptions

Delayed_Total_Ratio, Cancelled_Total_Ratio, Distrupted_Total_Ratio

In [ ]:

q = """
WITH FLIGHT_DELAYS AS(
    SELECT
        YEAR,
        SUM(CAST(arr_cancelled AS FLOAT)) AS Total_Cancelled_Flights,
        SUM(CAST(arr_del15 AS FLOAT)) AS Total_Delayed_Flights,
        SUM(CAST(arr_flights AS FLOAT)) AS Total_Flights,
        SUM(CAST(carrier_ct AS FLOAT)) AS Carrier_Delay_sum,
        SUM(CAST(weather_ct AS FLOAT)) AS Weather_Delay_sum,
        SUM(CAST(nas_ct AS FLOAT)) AS NAS_Delay_sum,
        SUM(CAST(security_ct AS FLOAT)) AS Security_Delay_sum,
        SUM(CAST(late_aircraft_ct AS FLOAT)) AS Late_aircraftd_delay_sum
    FROM
        AIRLINE_DELAY_CAUSE_CURRENT
    GROUP BY
        YEAR
)
SELECT
    YEAR,
    --Total_Cancelled_Flights,
    --Total_Delayed_Flights,
    --Total_Flights,
    --Carrier_Delay_SUM,
    --Weather_Delay,
    --NAS_Delay,
    --Security_Delay,
    --Late_aircraftd_delay,
    Total_Delayed_Flights/Total_Flights*100 AS Delayed_Total_Ratio,
    100 * TOTAL_CANCELLED_FLIGHTS/TOTAL_FLIGHTS AS Cancelled_Total_Ratio,
    (TOTAL_CANCELLED_FLIGHTS + TOTAL_DELAYED_FLIGHTS) / TOTAL_FLIGHTS * 100 AS Distrupted_Total_Ratio,
    Carrier_Delay_SUM / TOTAL_DELAYED_FLIGHTS * Delayed_Total_Ratio AS CARRIER_DELAY,
    Weather_Delay_SUM / TOTAL_DELAYED_FLIGHTS * Delayed_Total_Ratio AS WEATHER_DELAY,
    NAS_Delay_SUM / TOTAL_DELAYED_FLIGHTS * Delayed_Total_Ratio AS NAS_DELAY,
    Security_Delay_SUM / TOTAL_DELAYED_FLIGHTS * Delayed_Total_Ratio AS Security_DELAY,
    Late_aircraftd_delay_sum / TOTAL_DELAYED_FLIGHTS * Delayed_Total_Ratio AS Late_aircraft_delay
FROM
    FLIGHT_DELAYS AS Cancelled_Total_Ratio
ORDER BY
    YEAR
"""

pd.options.display.float_format = '{:,.2f}'.format
delays_ratio_df = pd.read_sql_query(q, nzcon)
#delays_ratio_df.columns = delays_cause_df2.columns.str.decode('utf-8')
delays_ratio_df

In [ ]:

# normalized view of the flight delay data
barWidth = 0.15

# Set position of bar on X axis
r1 = np.arange(len(delays_cause_df2['YEAR']))
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]

plt.figure(figsize=(14, 6))  # width: 14, height: 6

# Make the plot
#plt.bar(r1, delays_cause_df2['Delayed:Total Ratio'], color='#d0c4ef', width=barWidth, edgecolor='white', label='Delay Ratio')
plt.bar(r2, delays_ratio_df['CANCELLED_TOTAL_RATIO'], color='#c75f44', width=barWidth, edgecolor='white', label='Cancelled Ratio')
plt.bar(r3, delays_ratio_df['DISTRUPTED_TOTAL_RATIO'], color='#7e9b89', width=barWidth, edgecolor='white', label='Disrupted Ratio')
plt.bar(r1, delays_ratio_df['CARRIER_DELAY'], color='#d0c4ef', width=barWidth, edgecolor='white', label='CARRIER_DELAY')
plt.bar(r1, delays_ratio_df['WEATHER_DELAY'], bottom=delays_ratio_df['CARRIER_DELAY'], color='#7e9bb9', width=barWidth, edgecolor='white', label='WEATHER_DELAY')
plt.bar(r1, delays_ratio_df['NAS_DELAY'], bottom=delays_ratio_df['CARRIER_DELAY']+delays_ratio_df['WEATHER_DELAY'], color='#4e9bc9', width=barWidth, edgecolor='white', label='NAS_DELAY')
plt.bar(r1, delays_ratio_df['SECURITY_DELAY'], bottom=delays_ratio_df['CARRIER_DELAY']+delays_ratio_df['WEATHER_DELAY']+delays_ratio_df['NAS_DELAY'], color='#2e9b89', width=barWidth, label='SECURITY_DELAY')
plt.bar(r1, delays_ratio_df['LATE_AIRCRAFT_DELAY'], bottom=delays_ratio_df['CARRIER_DELAY']+delays_ratio_df['WEATHER_DELAY']+delays_ratio_df['NAS_DELAY']+delays_ratio_df['SECURITY_DELAY'], color='#3eab89', width=barWidth, edgecolor='white', label='LATE_AIRCRAFT_DELAY')




# Add xticks on the middle of the group bars
plt.xlabel('Year', fontweight='bold')
plt.ylabel('Percentage', fontweight='bold')
plt.xticks([r + barWidth for r in range(len(delays_cause_df2['YEAR']))], ['2019', '2020', '2021', '2022', '2023'])


plt.title(label = "Normalized Disruption Ratios")
# Create legend & Show graphic
plt.legend(bbox_to_anchor=(1.01, 1), loc='upper left', borderaxespad=0)
plt.show()

With this new normalized view of the flight delay data, it can be seen that the initial reading of the flight delay counts was not completely fair. Although 2020 was not a great year to travel due to COVID-19, of the flights that did operate, the overall chance of a flight delay was relatively low compared to the other years. 2023 looked as though it had a low number of flight delays. However, when the number of flight delays is compared to the total number of flights, it is clear that 2023 has not been performing well. To date, 2023 has the highest percentage of disrupted flights overall.

This data analysis provides a strong understanding of how the airline industry has been performing with respect to flight delays for the past couple of years. The analysis within this demonstration will now be extended to include historical data and determine if the current 2019 through mid-year 2023 data trends are identical to past trends or whether flight delays have been improving over time.

Analysis 3 - current data to historical data

It's good practice to get an overall understanding of the data prior to diving deeper. The first query run using historical data will look at the total flight delays by year from 2003 through 2018.

Historical data stored in AWS S3 - Netezza Parquet Tables used to access history (AIRLINE_DELAY_CAUSE_HISTORY)

Comparing Current to Historical

In [ ]:

# SQL - Total Delays by Year 2003-2018 (Parquet table)
q = """
SELECT 
year, SUM(CAST(arr_del15 as float)) as "Total Delays"
FROM AIRLINE_DELAY_CAUSE_HISTORY GROUP BY year order by year asc
"""

all_delay_df = pd.read_sql_query(q, nzcon)
#all_delay_df.columns = all_delay_df.columns.str.decode('utf-8')
all_delay_df

In [ ]:

# Graph - Total Delays by Year
plt.figure(figsize=(17, 5))  # set the size before plotting so it applies to this chart
plt.plot(all_delay_df["YEAR"], all_delay_df["Total Delays"])
plt.title('Total Delays by Year')
plt.xlabel('Year')
plt.ylabel('Total Delays')
plt.show()

A quick and simple line graph shows that the overall trend has been a decline in flight delays from 2003 through 2018. Let's see if a closer look at the data supports this.

Analysis 4 - how delays have been trending from the early 2000s until 2018

The goal of Analysis 4 is to get the average total delays by month from 2003-2012 and compare them to the average total delays by month from 2013-2018. This data analysis can answer the question, "Overall, how have flight delays been trending from the early 2000s until the 2018 timeframe?"

First, a query is run to gather the total flight delays per month of each year from 2003 through 2018.

Total Delays by Month and Year

In [ ]:

# SQL - Total Delays by Month and Year 2003-2018 (Parquet table)
q = """
SELECT 
year, month,
SUM(arr_del15) as TOTAL_DELAYS_PER_MONTH
FROM AIRLINE_DELAY_CAUSE_HISTORY GROUP BY year, month order by year, month
"""

delay_by_month_df1 = pd.read_sql_query(q, nzcon)
#delay_by_month_df1.columns = delay_by_month_df1.columns.str.decode('utf-8')

delay_by_month_df1

From this data the average flight delays by month from 2003-2012 as well as the average flight delays by month from 2013-2018 are determined.
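
The notebook computes these averages in SQL in the next cell. As an alternative view of the same idea, the sketch below does the calculation in pandas on a small hypothetical frame shaped like delay_by_month_df1; the values are illustrative only, and the PERIOD column is an assumption introduced for this example.

```python
import pandas as pd

# Hypothetical monthly totals shaped like delay_by_month_df1 (illustrative values only).
monthly = pd.DataFrame({
    "YEAR": [2011, 2012, 2013, 2014, 2011, 2012, 2013, 2014],
    "MONTH": [1, 1, 1, 1, 2, 2, 2, 2],
    "TOTAL_DELAYS_PER_MONTH": [100.0, 120.0, 80.0, 90.0, 60.0, 80.0, 50.0, 70.0],
})

# Label each row with its period, then average the monthly totals within each period.
monthly["PERIOD"] = monthly["YEAR"].apply(
    lambda year: "2003-2012" if year <= 2012 else "2013-2018"
)
averages = monthly.pivot_table(
    index="MONTH",
    columns="PERIOD",
    values="TOTAL_DELAYS_PER_MONTH",
    aggfunc="mean",
)
print(averages)
```

Each cell of the result is the mean of that calendar month's yearly totals within the period, which is the quantity the SQL query labels "2003-2012" and "2013-2018".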

Average Delays by Month from 2003-2012 compared to 2013-2018

In [ ]:

# avg delays compared by time period

q = """
SELECT t1.FINAL_MONTH1 as Month, t1."2003-2012", t2."2013-2018"

FROM 

(SELECT month as FINAL_MONTH1, AVG(SUM1) as "2003-2012" 
FROM
(SELECT 
month, year,
SUM(CAST(arr_del15 as FLOAT)) as SUM1
FROM AIRLINE_DELAY_CAUSE_HISTORY 
GROUP BY year, month order by year, month) as t02
WHERE year <= 2012 GROUP BY FINAL_MONTH1 ORDER BY FINAL_MONTH1) as t1

INNER JOIN 

(SELECT month as FINAL_MONTH2, AVG(SUM2) as "2013-2018" 
FROM
(SELECT 
month, year,
SUM(CAST(arr_del15 as FLOAT)) as SUM2
FROM AIRLINE_DELAY_CAUSE_HISTORY
GROUP BY year, month order by year, month) as t02
WHERE year >= 2013 and year <= 2018 GROUP BY FINAL_MONTH2 ORDER BY FINAL_MONTH2) as t2

ON t1.FINAL_MONTH1=t2.FINAL_MONTH2 ORDER BY t1.FINAL_MONTH1

"""
pd.options.display.max_rows = None
delay_by_month_df2 = pd.read_sql_query(q, nzcon)
#delay_by_month_df2.columns = delay_by_month_df2.columns.str.decode('utf-8')

delay_by_month_df2

At a glance, this numeric output in tabular format provides a good comparison, but let's perform some additional analysis and view it in a graph format to get a better understanding of the data.

In [ ]:

# compare the early period (2003-2012) with the later period (2013-2018)
early = delay_by_month_df2['2003-2012']
late = delay_by_month_df2['2013-2018']

# count the months in which the early period averaged more delays
months_early_higher = int((early > late).sum())

early_sum = early.sum()
late_sum = late.sum()
print(str(months_early_higher) + " out of 12 months 2003-2012 had more delays than 2013-2018")
# percentage change between the periods, expressed relative to the 2013-2018 total
print(((late_sum - early_sum)/late_sum)*100)

Above, the average flight delays for each month have been calculated for both periods, and this data highlights that the flight delays for every month from 2003-2012 were higher than the same months from 2013-2018. In summary, the data shows a 20% decline in flight delays between the two historical time periods (2003-2012 and 2013-2018). This confirms the initial thinking that airline delays were in decline leading up to 2019.
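
When reading the percentage printed by the cell above, note that it divides the difference by the 2013-2018 total. A percentage change can also be measured against the 2003-2012 total, and the two baselines give different numbers. The sketch below uses hypothetical totals purely to show the difference; percent_change is an illustrative helper, not part of the notebook.

```python
def percent_change(early_total, late_total, relative_to="early"):
    """Percentage change from the early period to the late period.

    relative_to selects which period's total is used as the baseline.
    """
    base = early_total if relative_to == "early" else late_total
    return (late_total - early_total) / base * 100


# Hypothetical totals, chosen only to show that the two baselines disagree.
early, late = 1000.0, 800.0
vs_early = percent_change(early, late, relative_to="early")  # baseline: 2003-2012
vs_late = percent_change(early, late, relative_to="late")    # baseline: 2013-2018
print(vs_early, vs_late)
```

A negative result means delays fell between the two periods; which baseline to report is a presentation choice, as long as it is stated alongside the number.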

Delays Compared 2003-2012 and 2013-2018

In [ ]:

# set width of bars
barWidth = 0.25

# Set position of bar on X axis
r1 = np.arange(len(delay_by_month_df2['MONTH']))
r2 = [x + barWidth for x in r1]

plt.figure(figsize=(15, 6))  # width: 15, height: 6

# Make the plot
plt.bar(r1, delay_by_month_df2['2003-2012'], color='blue', width=barWidth, edgecolor='white', label='Delays 2003-2012')
plt.bar(r2, delay_by_month_df2['2013-2018'], color='green', width=barWidth, edgecolor='white', label='Delays 2013-2018')

# Add xticks on the middle of the group bars
plt.xlabel('Month', fontweight='bold')
plt.ylabel('Total Delays', fontweight='bold')
plt.xticks([r + barWidth for r in range(len(delay_by_month_df2['MONTH']))], delay_by_month_df2['MONTH'])



# Create legend & Show graphic
plt.legend()
plt.show()

Visualizing the data in a bar chart confirms the smaller number of flight delays in the later of the two historical time periods.

Analysis 5 - Impact of COVID and other events on Flight Delays

Now that we have seen a decline in flight delays leading up to 2019, let's compare our current data with the historical data to answer the question, "Are flight delays still in decline or have they risen since the impact of COVID-19 and other events?"

Now a simple query is used to determine the number of delays for each month in 2023 (up to April 2023 which is the most recent data).

Delay Trends for 2023

In [ ]:

q = """
SELECT month,

SUM(case when year = 2023 then SUM else 0 end) as "2023" FROM 

(SELECT 
month, year,
SUM(CAST(arr_del15 as FLOAT)) as SUM
FROM AIRLINE_DELAY_CAUSE_CURRENT GROUP BY year, month order by year, month) as t1 

GROUP BY month order by month
"""

delay_by_month_df3 = pd.read_sql_query(q, nzcon)
#delay_by_month_df3.columns = delay_by_month_df3.columns.str.decode('utf-8')
delay_by_month_df3

Total Compared Delays over Time

Combining the 2023 monthly data with the original historical data provides this result ...

In [ ]:

# set width of bars
barWidth = 0.25

# Set position of bar on X axis
r1 = np.arange(len(delay_by_month_df2['MONTH']))
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]

plt.figure(figsize=(15, 6))  # width: 15, height: 6

# Make the plot
plt.bar(r1, delay_by_month_df2['2003-2012'], color='#4169e1', width=barWidth, edgecolor='white', label='Delays 2003-2012')
plt.bar(r2, delay_by_month_df2['2013-2018'], color='#93c47d', width=barWidth, edgecolor='white', label='Delays 2013-2018')
plt.bar(r3, delay_by_month_df3['2023'], color='#ca4c4c', width=barWidth, edgecolor='white', label='Delays 2023')

# Add xticks on the middle of the group bars
plt.xlabel('Month', fontweight='bold')
plt.ylabel('Total Delays', fontweight='bold')
plt.xticks([r + barWidth for r in range(len(delay_by_month_df2['MONTH']))], delay_by_month_df2['MONTH'])



# Create legend & Show graphic
plt.legend()
plt.show()

As mentioned earlier, 2023 has been trending very poorly for delays when compared to the earlier years (2019, 2020, 2021, and 2022) in the current time period data. The data in the graph highlights that 2023 flight delays are also trending very poorly against the 2013-2018 averages, when delays were improving. In some cases, 2023 is even worse than the average flight delays in the earlier period (2003-2012). This visual helps show that the airline industry has not completely recovered after the COVID-19 pandemic, or has been impacted by other outside factors causing worsening flight delays.

Analysis 6 - Which air carriers and airports are best?

For the last portion, the analysis will determine which air carriers and airports are best to use if a passenger is trying to avoid delays. Since many of the airlines have merged or no longer exist, the recommendation should include only those air carriers still in operation, as well as major airports with over 100,000 arriving flights overall.

Delays by carrier

In [ ]:

q = """
SELECT carrier_name, sum(arr_del15)
FROM AIRLINE_DELAY_CAUSE_HISTORY GROUP BY carrier_name ORDER BY carrier_name

"""

# use a separate name so the 2023 monthly data in delay_by_month_df3 is not overwritten
carrier_totals_df = pd.read_sql_query(q, nzcon)
#carrier_totals_df.columns = carrier_totals_df.columns.str.decode('utf-8')
carrier_totals_df

From the current collection of data, a list of all active carriers will be extracted.

In [ ]:

## find Carriers that are still operating 

q = """
SELECT UNIQUE(carrier_name)
FROM AIRLINE_DELAY_CAUSE_CURRENT
"""
carrier_names_df = pd.read_sql_query(q, nzcon)
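carrier_list = carrier_names_df['CARRIER_NAME'].to_list()
# double any single quotes so the names are safe inside the SQL IN ('...') list
joined_carrier_list = "','".join(name.replace("'", "''") for name in carrier_list)
joined_carrier_list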

In [ ]:

q = """
SELECT 
t1.carrier_name,
SUM(t1.TOTAL_FLIGHTS) as TOTAL_FLIGHTS,
SUM(t1.Total_Delayed_Flights) as Total_Delayed_Flights,
SUM(t1.Diverted_Flights) as Diverted_Flights,
SUM(t1.Cancelled_Flights) as Cancelled_Flights,
(SUM(t1.Diverted_Flights) + SUM(t1.Total_Delayed_Flights) + SUM(t1.Cancelled_Flights))/SUM(t1.TOTAL_FLIGHTS) as Delay_Percentage

FROM 
(SELECT carrier_name,
SUM(CAST(arr_flights as FLOAT)) as TOTAL_FLIGHTS,
SUM(CAST(arr_del15 as FLOAT)) as Total_Delayed_Flights,
SUM(CAST(arr_diverted as FLOAT)) as Diverted_Flights,
SUM(CAST(arr_cancelled as FLOAT)) as Cancelled_Flights
FROM 
AIRLINE_DELAY_CAUSE_CURRENT WHERE AIRLINE_DELAY_CAUSE_CURRENT.CARRIER_NAME in ('{}')
GROUP BY AIRLINE_DELAY_CAUSE_CURRENT.CARRIER_NAME

UNION ALL

SELECT carrier_name,
SUM(CAST(arr_flights as FLOAT)) as TOTAL_FLIGHTS,
SUM(CAST(arr_del15 as FLOAT)) as Total_Delayed_Flights,
SUM(CAST(arr_diverted as FLOAT)) as Diverted_Flights,
SUM(CAST(arr_cancelled as FLOAT)) as Cancelled_Flights
FROM 
AIRLINE_DELAY_CAUSE_HISTORY WHERE AIRLINE_DELAY_CAUSE_HISTORY.CARRIER_NAME in ('{}')

GROUP BY AIRLINE_DELAY_CAUSE_HISTORY.CARRIER_NAME
) as t1

GROUP BY t1.carrier_name ORDER BY Delay_Percentage ASC
""".format(joined_carrier_list, joined_carrier_list)


carrier_delay_df = pd.read_sql_query(q, nzcon)
#carrier_delay_df.columns = carrier_delay_df.columns.str.decode('utf-8')
carrier_delay_df

In [ ]:

q = """
SELECT 
t1.carrier_name,
SUM(t1.TOTAL_FLIGHTS) as TOTAL_FLIGHTS,
SUM(t1.Total_Delayed_Flights) as Total_Delayed_Flights,
SUM(t1.Diverted_Flights) as Diverted_Flights,
SUM(t1.Cancelled_Flights) as Cancelled_Flights,
(SUM(t1.Diverted_Flights) + SUM(t1.Total_Delayed_Flights) + SUM(t1.Cancelled_Flights))/SUM(t1.TOTAL_FLIGHTS) as Delay_Percentage

FROM 
(SELECT carrier_name,
SUM(CAST(arr_flights as FLOAT)) as TOTAL_FLIGHTS,
SUM(CAST(arr_del15 as FLOAT)) as Total_Delayed_Flights,
SUM(CAST(arr_diverted as FLOAT)) as Diverted_Flights,
SUM(CAST(arr_cancelled as FLOAT)) as Cancelled_Flights
FROM 
AIRLINE_DELAY_CAUSE_CURRENT GROUP BY AIRLINE_DELAY_CAUSE_CURRENT.CARRIER_NAME

UNION ALL

SELECT carrier_name,
SUM(CAST(arr_flights as FLOAT)) as TOTAL_FLIGHTS,
SUM(CAST(arr_del15 as FLOAT)) as Total_Delayed_Flights,
SUM(CAST(arr_diverted as FLOAT)) as Diverted_Flights,
SUM(CAST(arr_cancelled as FLOAT)) as Cancelled_Flights
FROM 
AIRLINE_DELAY_CAUSE_HISTORY

GROUP BY AIRLINE_DELAY_CAUSE_HISTORY.CARRIER_NAME
) as t1

GROUP BY t1.carrier_name ORDER BY Delay_Percentage ASC
"""


carrier_delay_df = pd.read_sql_query(q, nzcon)
#carrier_delay_df.columns = carrier_delay_df.columns.str.decode('utf-8')
carrier_delay_df

Of the currently operating air carriers, the recommendation for the air carrier with the lowest percentage of flight delays would be Delta.

A similar query will now be executed to list the major airports, ordered from the highest percentage of disrupted flights to the lowest.

Airport Delays

In [ ]:

q = """
SELECT 
t1.airport_name,
SUM(t1.TOTAL_FLIGHTS) as TOTAL_FLIGHTS,
SUM(t1.Total_Delayed_Flights) as Total_Delayed_Flights,
SUM(t1.Diverted_Flights) as Diverted_Flights,
SUM(t1.Cancelled_Flights) as Cancelled_Flights,
(SUM(t1.Diverted_Flights) + SUM(t1.Total_Delayed_Flights) + SUM(t1.Cancelled_Flights))/SUM(t1.TOTAL_FLIGHTS) as Distrupted_Percentage

FROM 
(SELECT airport_name,
SUM(CAST(arr_flights as FLOAT)) as TOTAL_FLIGHTS,
SUM(CAST(arr_del15 as FLOAT)) as Total_Delayed_Flights,
SUM(CAST(arr_diverted as FLOAT)) as Diverted_Flights,
SUM(CAST(arr_cancelled as FLOAT)) as Cancelled_Flights
FROM 
AIRLINE_DELAY_CAUSE_CURRENT GROUP BY AIRLINE_DELAY_CAUSE_CURRENT.airport_name

UNION ALL

SELECT airport_name,
SUM(CAST(arr_flights as FLOAT)) as TOTAL_FLIGHTS,
SUM(CAST(arr_del15 as FLOAT)) as Total_Delayed_Flights,
SUM(CAST(arr_diverted as FLOAT)) as Diverted_Flights,
SUM(CAST(arr_cancelled as FLOAT)) as Cancelled_Flights
FROM 
AIRLINE_DELAY_CAUSE_HISTORY WHERE arr_flights > 10000

GROUP BY AIRLINE_DELAY_CAUSE_HISTORY.airport_name
) as t1 WHERE t1.TOTAL_FLIGHTS > 100000

GROUP BY t1.airport_name ORDER BY Distrupted_Percentage DESC
"""

airport_delay_df = pd.read_sql_query(q, nzcon)
#airport_delay_df.columns = airport_delay_df.columns.str.decode('utf-8')
airport_delay_df

Of the major airports operating 100,000 flights or more, the ones to avoid because they encounter the most flight delays are Newark, LaGuardia, Fort Lauderdale, and Orlando.
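
In the query above, the 100,000-flight filter is evaluated on each table's subtotal before the outer GROUP BY combines the current and historical rows. If the intent is to apply the threshold to an airport's combined total instead, one possible approach is sketched below in pandas. The airport names and numbers are hypothetical, and DISRUPTED_FLIGHTS is an assumed column standing in for delayed, diverted, and cancelled flights added together.

```python
import pandas as pd

# Hypothetical per-table airport totals (illustrative values only).
current = pd.DataFrame({
    "AIRPORT_NAME": ["Airport A", "Airport B"],
    "TOTAL_FLIGHTS": [60000.0, 30000.0],
    "DISRUPTED_FLIGHTS": [15000.0, 3000.0],
})
history = pd.DataFrame({
    "AIRPORT_NAME": ["Airport A", "Airport B"],
    "TOTAL_FLIGHTS": [70000.0, 40000.0],
    "DISRUPTED_FLIGHTS": [14000.0, 4000.0],
})

# Combine both sources first, so each airport has one row of overall totals.
combined = (
    pd.concat([current, history])
    .groupby("AIRPORT_NAME", as_index=False)
    .sum()
)

# Apply the size threshold to the combined total, then rank by disrupted share.
major = combined[combined["TOTAL_FLIGHTS"] > 100000].copy()
major["DISRUPTED_PERCENTAGE"] = major["DISRUPTED_FLIGHTS"] / major["TOTAL_FLIGHTS"]
major = major.sort_values("DISRUPTED_PERCENTAGE", ascending=False)
print(major)
```

In this made-up example neither table alone reaches 100,000 flights for Airport A, yet its combined total does, so it is kept only when the threshold is applied after combining.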

Conclusion

The data analysis for flight delay data is now complete. To review, the initial analysis was based on current flight delay data from 2019 through April 2023. Because the COVID-19 pandemic was a possible disruptor of the flight delay data, the data was normalized to compare disrupted flight ratios across all five years (2019 through 2023), and this analysis found that 2023 was trending with an increase in delays. Next, several queries compared delay trends between the historical periods 2003-2012 and 2013-2018. By taking the average delays per month, the analysis showed that flight delays were decreasing in the later period (2013-2018). Recovery from the COVID-19 pandemic and other global factors did impact flight delays when looking at 2023 flight data compared to pre-pandemic flight delay trends, as there has been a definite increase in flight delays during 2023. Lastly, this data analysis included information about airports and air carriers that were operating flights between 2003 and 2023. The final queries were able to determine which air carriers operate with the fewest flight delays and which major airports encounter the most flight delays. This allows a recommendation to be provided to travelers on which air carriers to fly and which airports to use.

In [ ]:

%%time
q = """
SELECT 
month,
SUM(arr_del15) as TOTAL_DELAYS_PER_MONTH
FROM EXT_AIRLINE_DELAY_CAUSE_2018 GROUP BY month ORDER BY TOTAL_DELAYS_PER_MONTH
"""

df = pd.read_sql_query(q, nzcon)
df

In [ ]:

%%time
q = """
SELECT 
month,
SUM(arr_del15) as TOTAL_DELAYS_PER_MONTH FROM AIRLINE_DELAY_CAUSE_2018_LOCAL GROUP BY month ORDER BY TOTAL_DELAYS_PER_MONTH
"""

df = pd.read_sql_query(q, nzcon)
df
