Analysis of flight delays for Newark Airport

This is my final end to end data project for CodeClan. Written in Python in Jupyter Notebooks. The aim was to provide data-driven exploratory analysis to provide insight on the impact of weather on flight delays. To achieve this I did an exploratory data analysis and modelling. This entry contains the exploratory data analysis.

Newark airport has requested an investigation into flight departure delays.
Flight delays are a key performance indicator of airport operations.
The following three business questions are explored:

  • How does Newark compare to other NY airports
  • What is the impact of weather on flight departure delays
  • What other reasons impact flight departure delays
# import libraries
import pandas as pd
import numpy as np
import pandas_profiling as pp
import datetime as dt
import seaborn as sns
from matplotlib import pyplot as plt
import folium
import branca
# read in data
ewr_flights = pd.read_csv('data/clean_data/ewr_flights.csv')

# create combined flights_all data
flight_files = ['data/clean_data/ewr_flights.csv',
                'data/clean_data/jfk_flights.csv',
                'data/clean_data/lga_flights.csv']

flights_all = pd.concat(map(pd.read_csv, flight_files))

Preliminary analysis

Investigating weather data

# profile report for weather variables
pp.ProfileReport(ewr_flights.loc[:, ['time_hour', 'dep_delay_true', 'wind_dir',
       'wind_speed', 'wind_gust', 'visib', 'temp_max', 'temp_min',
       'temp_ave', 'temp_departure', 'ppt', 'new_snow', 'snow_depth']])
Summarize dataset: 100%|██████████| 27/27 [00:14<00:00,  1.88it/s, Completed]                       
Generate report structure: 100%|██████████| 1/1 [00:02<00:00,  2.18s/it]
Render HTML: 100%|██████████| 1/1 [00:02<00:00,  2.27s/it]

Summary of profile report on weather.

  • wind_speed and wind_gust are highly positively correlated. As are all the temperature variables, temp_max, temp_min, temp_ave.
  • Precipitation (ppt) has a very skewed distribution and 48% of the data is zero. The two snow variables, new_snow and snow_depth have 92% zeros in their data columns.
  • Wind direction has a bimodal distribution.

24% of the flights are delayed and 76% of the flights are not delayed.

Investigating plane information

A pandas profiling report was also run for plane information. (see code)

Summary of profile report on plane information

  • The type variable is highly correlated with manufacturer and engine, remove this variable.
  • The manufacturer variable has high cardinality (lots of different categories). Recode this to the top 5 manufacturers, recode NaNs to unknown and set the rest to other.

Feature engineering and data wrangling

Random forest is known to be a good model for flight delay data. I will prepare the data for this model.

Feature engineering

The following features will be changed.

  • convert wind_speed into a categorical wind_scale using the Beaufort scale
  • drop columns with high correlation identified in pandas profiling report (wind_gust, temp_max, temp_min)
  • remove type
  • recode manufacturer to the top 5 manufacturers, recode NaNs to unknown and set the rest to other
  • recode hour, month and day as categorical variables
  • remove other columns not thought to be important for flight delay model (keep dest_airport):
    dest_timezone, dest_alt, dest_lon, dest_lat, dest_faa, ori_timezone, ori_lat, ori_lon,
    ori_airport, ori_faa, time_hour, minute, hour, dest, tailnum, flight, carrier, arr_delay,
    sched_arr_time, arr_time, dep_delay, sched_dep_time, dep_time, day.

Convert weather data types

# add categorical wind_scale for wind_speed
wind_labels = ['Calm', 'Light Breeze', 'Gentle Breeze',
'Moderate Breeze', 'Fresh Breeze', 'Strong Breeze', 'Near Gale',
'Gale']

wind_bins = [0, 3, 7, 12, 18, 24, 31, 38, 46]

flights_all['wind_scale'] = pd.cut(flights_all['wind_speed'],
bins = wind_bins, labels=wind_labels,
include_lowest = True)
weather_drop = ['wind_gust', 'temp_max', 'temp_min']

flights_all_trim = flights_all.drop(columns=weather_drop)

Convert plane data types

flights_all_trim.drop(columns=['type', 'model'], inplace = True)

man_keep = ['Unknown', 'BOEING', 'AIRBUS', 'AIRBUS INDUSTRIE',
 'EMBRAER', 'BOMBARDIER INC', 'MCDONNELL DOUGLAS AIRCRAFT CO', 'MCDONNELL DOUGLAS']

flights_all_trim.manufacturer = np.where(flights_all_trim.manufacturer.isin(man_keep),
flights_all_trim.manufacturer, 'Other')
#fix manufacturer doubling for airbus and mcdonnell douglas
flights_all_trim.manufacturer = flights_all_trim.manufacturer.map({
    'AIRBUS INDUSTRIE': 'AIRBUS',
    'AIRBUS': 'AIRBUS',
    'MCDONNELL DOUGLAS': 'MCDONNELL DOUGLAS',
    'MCDONNELL DOUGLAS AIRCRAFT CO': 'MCDONNELL DOUGLAS',
    'BOEING': 'BOEING',
    'EMBRAER': 'EMBRAER',
    'BOMBARDIER INC': 'BOMBARDIER INC',
    'Unknown': 'Unknown',
    'Other': 'Other'})

flights_all_trim.head()

monthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflight...dest_faadest_airportdest_latdest_londest_altdest_timezonecarrier_namemanufacturerenginewind_scale
011545.05450.0833.08312.0B6527...MCOOrlando International Airport28.429399-81.30899896.0America/New_YorkJetBlue AirwaysAIRBUSTurbo-fanGentle Breeze
111556.05551.0709.0714-5.0UA320...IADWashington Dulles International Airport38.944500-77.455803312.0America/New_YorkUnited Air Lines Inc.AIRBUSTurbo-fanGentle Breeze
211559.0600-1.0848.0904-16.0AA1301...MIAMiami International Airport25.793200-80.2906048.0America/New_YorkAmerican Airlines Inc.UnknownUnknownGentle Breeze
311557.0600-3.0853.0903-10.0B6605...FLLFort Lauderdale Hollywood International Airport26.072599-80.1527029.0America/New_YorkJetBlue AirwaysAIRBUSTurbo-fanGentle Breeze
411731.065536.0830.080822.0UA2016...BOSGeneral Edward Lawrence Logan International Ai...42.364300-71.00520320.0America/New_YorkUnited Air Lines Inc.AIRBUSTurbo-fanGentle Breeze

5 rows × 43 columns

flights_all_trim.shape
(294167, 43)

Convert timeseries data into categories

flights_all_trim['month_flag'] = flights_all_trim.month.astype(str)

flights_all_trim['month_flag'] = flights_all_trim['month_flag'].map({
    '1': 'Jan',
    '2': 'Feb',
    '3': 'Mar',
    '4': 'Apr',
    '5': 'May',
    '6': 'Jun',
    '7': 'Jul',
    '8': 'Aug',
    '9': 'Sep',
    '10': 'Oct',
    '11': 'Nov',
    '12': 'Dec'})
flights_all_trim.loc[:, 'time_hour'] = pd.to_datetime(flights_all_trim.time_hour, yearfirst=True)

flights_all_trim['weekday'] = flights_all_trim['time_hour'].dt.day_name()

flights_all_trim['hour'] = pd.Categorical(flights_all_trim['hour'], ordered = True)
to_drop=['dest_timezone', 'ori_alt', 'dest_lon', 'dest_lat', 'dest_faa',
'ori_timezone', 'ori_lat', 'ori_lon', 'ori_airport', 'ori_faa', 'time_hour',
'minute', 'dest', 'tailnum', 'flight', 'carrier', 'arr_delay', 'sched_arr_time',
'arr_time', 'dep_delay', 'month', 'sched_dep_time','dep_time', 'day']

flights_model_data = flights_all_trim.drop(columns=to_drop)

Remove data for EWR (Newark) - save the rest of the data for future work, can run the model on the data for the other airports to see how it performs

# split into EWR, and the others
ewr_model_data = flights_model_data.loc[(flights_model_data['origin'] == 'EWR')].copy()

Exploratory Data Analysis

I will start with an overview of departure delays across all airports and then move to delays at EWR, focusing on weather and other factors.

# set figure sizes for all plots
plt.rcParams['figure.figsize'] = [5, 3]
plt.rcParams['figure.dpi'] = 120
plt.rcParams['font.size'] = 10
flights_all_trim.loc[:, 'dep_delay_count'] = flights_all_trim.groupby(['origin', 'month']).dep_delay_true.transform('sum')
flights_all_trim.loc[:, 'total_flight_count'] = flights_all_trim.groupby(['origin', 'month']).dep_delay.transform('count')
flights_all_trim.loc[:, 'percent_delay'] = (flights_all_trim.dep_delay_count / flights_all_trim.total_flight_count) * 100


hue_order = ['EWR', 'JFK', 'LGA']
all_flights_pal = sns.color_palette(['#005083', '#6876a4', '#ffd28f'])

sns.barplot(
    x = 'month_flag', y = 'total_flight_count',
    hue = 'origin', palette = all_flights_pal,
    hue_order = hue_order, data = flights_all_trim)
plt.title("Total flights per month for NY airports in 2017")
plt.xlabel("Month")
plt.ylabel("Total flights")
plt.legend(bbox_to_anchor=(1.01, 1), loc = 2,
           borderaxespad=0)
plt.tight_layout()
plt.savefig('figures/all_airports_total_plot.png', pad_inches=0.5);

png

sns.barplot(
    x = 'month_flag', y = 'percent_delay',
    hue = 'origin', hue_order = hue_order,
    palette = all_flights_pal, data = flights_all_trim)
plt.title("Delayed flights for NY airports in 2017")
plt.xlabel("Month")
plt.ylabel("Percentage of delayed flights")
plt.legend(bbox_to_anchor=(1.01, 1), loc = 2,
           borderaxespad=0)
plt.tight_layout()
plt.savefig('figures/all_airports_delays_plot.png', pad_inches=0.5);

png

sns.pairplot(ewr_model_data,
x_vars=['air_time', 'distance', 'dest_alt'],
y_vars=['air_time', 'distance', 'carrier_name', 'dest_alt'],
hue = 'dep_delay_true',
corner = True);

png

ewr_all = flights_all_trim.loc[(flights_all_trim['origin'] == 'EWR')].copy().reset_index(drop = True)
# function to create a column with percentage of delayed flights per grouping column (col)
def percent_delay(col):
    ewr_all.loc[:, 'dep_delay_per_'+ col] = ewr_all.groupby([col]).dep_delay_true.transform('sum')
    ewr_all.loc[:, 'total_flights_per_' + col] = ewr_all.groupby([col]).dep_delay.transform('count')
    ewr_all.loc[:, 'percent_delay_per_' + col] = (ewr_all['dep_delay_per_'+ col] / ewr_all['total_flights_per_' + col]) * 100

EDA weather and flight delays

# set custom graph options

line_pal = sns.color_palette(['#f9ab00', '#49006a'])
visib_plot = sns.lineplot(
    x = 'month_flag',
    y = 'visib',
    hue = 'dep_delay_true',
    marker = 'o',
    ci = False,
    palette = line_pal,
    data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Visibility (miles)')
plt.title('Average monthly visibility for departures')
plt.legend(title='', labels=['On time', 'Delayed'])
plt.tight_layout()
plt.savefig('figures/visib_plot.png', pad_inches=0.5);

png

sns.lineplot(x = 'month_flag',
y = 'ppt',
hue = 'dep_delay_true',
marker='o',
palette = line_pal,
ci=False,
legend=False,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Rainfall (inches)')
plt.title('Average monthly rainfall for departures')
#plt.legend(title='', labels=['Not delayed', 'Delayed'])
plt.tight_layout()
plt.savefig('figures/ppt_plot.png', pad_inches=0.5);

png

sns.lineplot(x = 'month_flag',
y = 'wind_speed',
hue = 'dep_delay_true',
marker='o',
palette = line_pal,
ci = False,
legend = False,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Wind speed (mph)')
plt.title('Average monthly windspeed for departures')
#plt.legend(title='', labels=['Not delayed', 'Delayed'])
plt.tight_layout()
plt.savefig('figures/windspeed_plot.png', pad_inches=0.5);

png

sns.lineplot(x = 'month_flag',
y = 'wind_dir',
hue = 'dep_delay_true',
marker='o',
palette = line_pal,
ci = False,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Wind direction (degrees)')
plt.title('Average monthly wind direction for departures')
plt.legend(title='', labels=['On-time', 'Delayed'])
plt.tight_layout()
plt.savefig('figures/wind_dir_plot.png', pad_inches=0.7);

png

sns.lineplot(x = 'month_flag',
y = 'temp_departure',
hue = 'dep_delay_true',
marker='o',
palette = line_pal,
data = ewr_all)
plt.xticks(rotation=90)
plt.legend(title='Departure Delay', labels=['No', 'Yes']);

png

sns.lineplot(x = 'month_flag',
y = 'temp_ave',
marker = 'o',
hue = 'dep_delay_true',
palette = line_pal,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Temperature (F)')
plt.title('Average monthly temperature for departures')
plt.legend(title='', labels=['On-time', 'Delayed'])
plt.tight_layout()
plt.savefig('./figures/ave_temp_plot.png', pad_inches=0.7);

png

sns.lineplot(x = 'month_flag',
y = 'snow_depth',
marker='o',
hue = 'dep_delay_true',
palette = line_pal,
ci = False,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Snow depth (inches)')
plt.title('Depth of snow per month for Newark departures')
plt.legend(title='', labels=['Not delayed', 'Delayed']);

png

sns.lineplot(x = 'month_flag',
y = 'new_snow',
hue = 'dep_delay_true',
marker='o',
palette = line_pal,
ci = False,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Amount of new snow (inches)')
plt.title('Snowfall per month for Newark departures')
plt.legend(title='', labels=['Not delayed', 'Delayed']);

png

Summary

  • Weather which affects departure delays year round includes wind_speed, rainfall, wind_direction and visib (visibility).

  • For the winter months, new_snow and snow_depth impact on flight delays.

For the business presentation, graphs were selected which show differences in flight delays over different months.

EDA for other reasons for delays

Carrier delays

percent_delay('carrier_name')
carrier = ewr_all.drop_duplicates(subset=['carrier_name']).reset_index(drop = True)

ordered_data = carrier.sort_values('percent_delay_per_carrier_name').carrier_name
sns.barplot(
    x = 'carrier_name',
    y = 'percent_delay_per_carrier_name',
    color = '#0868ac',
    order = ordered_data,
    data = carrier)
plt.xticks(rotation=90)
plt.xlabel('')
plt.ylabel('Delayed flights (%)')
plt.title('Delays per airline carrier')
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plt.savefig('figures/carrier_delays_plot.png', dpi = 400, pad_inches=0.85, bbox_inches='tight');

png

percent_delay('weekday')

sns.barplot(y = 'percent_delay_per_weekday',
x = 'weekday',
color = '#7fcdbb',
data = ewr_all)
plt.xticks(rotation=90)
plt.ylabel('Delayed flights (%)')
plt.xlabel('')
plt.title('Delays per day of the week')
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plt.savefig('figures/weekdays_plot.png', dpi = 400, pad_inches=0.85, bbox_inches='tight');

png

percent_delay('hour')

sns.barplot(
    y = 'percent_delay_per_hour',
    x = 'hour',
    color = '#41b6c4',
    data = ewr_all)
plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 ['1am', '5am', '6am', '7am', '8am', '9am',
 '10am', '11am', '12am', '1pm', '2pm', '3pm',
 '4pm', '5pm', '6pm', '7pm', '8pm', '9pm', '10pm', '11pm'],
 rotation=90)
plt.ylabel('Delayed flights (%)')
plt.xlabel('Hour')
plt.title('Delays per hour')
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plt.savefig('figures/hour_plot.png', dpi = 400, pad_inches=0.85, bbox_inches='tight');

png

percent_delay('manufacturer')
#8c96c6

sns.barplot(y = 'percent_delay_per_manufacturer',
x = 'manufacturer',
color = '#3b528b',
data = ewr_all)
plt.xticks(rotation=90)
plt.ylabel('Delayed flights (%)')
plt.xlabel('')
plt.title('Delays per manufacturer')
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plt.savefig('figures/manufacturer_plot.png', dpi = 400, pad_inches=0.85, bbox_inches='tight');

png

percent_delay('engine')

sns.barplot(y = 'percent_delay_per_engine',
x = 'engine',
color = '#8c6bb1',
data = ewr_all)
plt.xticks(rotation=90)
plt.ylabel('Delayed flights (%)')
plt.xlabel('')
plt.title('Delays per type of plane engine')
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plt.savefig('figures/engine_plot.png', dpi = 400, pad_inches=0.85, bbox_inches='tight');

png

Summary

  • Many other factors are also involved in departure delays.

  • Airline carrier, time and month of departure and plane engine show the most variation in terms of total percentage flights which were delayed.

Geospatial data exploration

The data per destination airport needs to be summarised so that it can be plotted on a map.

There are 88 different destination airports in total, all domestic (US).

Possible useful summary visuals:

  • percentage delayed flights per destination airport sized by total number of flights
  • percentage delayed flights per destination airport coloured/sized by delay time
  • percentage delayed flights selecting top 5 delayed airlines on the route
# select all columns useful for geospatial analysis
ewr_geo = ewr_flights.loc[:, ['ori_airport', 'ori_lat', 'ori_lon',
'ori_alt','dest_airport', 'dest_lat', 'dest_lon', 'carrier_name', 'dep_delay',
'dep_delay_true', 'air_time', 'distance', 'hour', 'wind_speed', 'visib']].copy()
# generate percentage delay column per destination for plotting

ewr_geo.loc[:, 'dep_delay_count'] = ewr_geo.groupby(['dest_airport']).dep_delay_true.transform('sum')
ewr_geo.loc[:, 'total_flights_per_dest_airport'] = ewr_geo.groupby(['dest_airport']).dep_delay.transform('count')
ewr_geo.loc[:, 'percent_delay_per_dest_airport'] = (ewr_geo.dep_delay_count / ewr_geo.total_flights_per_dest_airport) * 100


ewr_unique_dest = ewr_geo.drop_duplicates(['dest_airport', 'percent_delay_per_dest_airport']).reset_index(drop = True)
ewr_unique_dest['percent_delay_per_dest_airport'] = ewr_unique_dest['percent_delay_per_dest_airport'].round()
ewr_unique_dest.dropna(inplace = True)
# remove unwanted columns
ewr_unique_dest = ewr_unique_dest.loc[:, ['ori_airport', 'ori_lat', 'ori_lon',
'dest_airport', 'dest_lat', 'dest_lon', 'total_flights_per_dest_airport',
'percent_delay_per_dest_airport']].copy()


ewr_unique_dest = ewr_unique_dest.reset_index(drop = True)
# scale total flights to a value between 1 and 20 for plotting

ewr_unique_dest['scaled_flights_per_dest'] = \
    1 + (ewr_unique_dest['total_flights_per_dest_airport'] - \
         ewr_unique_dest['total_flights_per_dest_airport'].min())\
        /(ewr_unique_dest['total_flights_per_dest_airport'].max() - \
          ewr_unique_dest['total_flights_per_dest_airport'].min())\
             * 19
# make a color column in the dataset to iterate through in the below folium script
# bin percent delay into 9 and 5

color_hex_labels= ['#f7fcfd', '#e0ecf4', '#bfd3e6',
                        '#9ebcda', '#8c96c6', '#8c6bb1', '#88419d',
                        '#810f7c', '#4d004b']


ewr_unique_dest['color_hex_delay'] = pd.cut(ewr_unique_dest['percent_delay_per_dest_airport'],
bins = 9, labels=color_hex_labels,
include_lowest = True)
us_map = folium.Map(location=[40.6925, -74.1687],
zoom_start = 5,
tiles="cartodbpositron")

#iterrows returns index (stored in _) and row which we need
for _, row in ewr_unique_dest.iterrows():

    folium.CircleMarker([row['ori_lat'], row['ori_lon']],
                        radius=5,
                        fill_color='#005083', # ewr airport colour
                        stroke=False,
                        fill_opacity = 1
                       ).add_to(us_map)

    folium.CircleMarker([row['dest_lat'], row['dest_lon']],
                        radius=10,
                        fill_color=[row['color_hex_delay']], # destination colour
                        stroke=False,
                        fill_opacity = 1
                       ).add_to(us_map)

    folium.PolyLine([[row['ori_lat'], row['ori_lon']],
                     [row['dest_lat'], row['dest_lon']]],
                     color = 'black',
                     weight= [row['scaled_flights_per_dest']], # size line on total flights
                     opacity = 0.1
                     ).add_to(us_map)
# create legend for map

legend = branca.colormap.StepColormap(['#f7fcfd', '#e0ecf4', '#bfd3e6',
                        '#9ebcda', '#8c96c6', '#8c6bb1', '#88419d',
                        '#810f7c', '#4d004b'],
                        vmin=6, vmax=54, index=[6, 11, 16, 22, 27, 32, 38, 43, 48, 54],
                        caption='Percentage of delayed flights')

legend.add_to(us_map)

654

#view map

us_map
Make this Notebook Trusted to load map: File -> Trust Notebook

Summary of EDA

  • Many factors involved in departure delays - weather, time of day, airline carrier, destination airport
  • Uncertain of the importance of each one
  • Move to a model to gain insight on factors contribution to departure delay
Anne Braae
Anne Braae
Freelance Data Scientist

Related