Analysis of flight delays for Newark Airport
This is my final end to end data project for CodeClan. Written in Python in Jupyter Notebooks. The aim was to provide data-driven exploratory analysis to provide insight on the impact of weather on flight delays. To achieve this I did an exploratory data analysis and modelling. This entry contains the exploratory data analysis.
Newark airport has requested an investigation into flight departure delays.
Flight delays are a key performance indicator of airport operations.
The following three business questions are explored:
- How does Newark compare to other NY airports
- What is the impact of weather on flight departure delays
- What other reasons impact flight departure delays
# import libraries
import pandas as pd
import numpy as np
import pandas_profiling as pp
import datetime as dt
import seaborn as sns
from matplotlib import pyplot as plt
import folium
import branca
# read in data
ewr_flights = pd.read_csv('data/clean_data/ewr_flights.csv')
# create combined flights_all data
flight_files = ['data/clean_data/ewr_flights.csv',
'data/clean_data/jfk_flights.csv',
'data/clean_data/lga_flights.csv']
flights_all = pd.concat(map(pd.read_csv, flight_files))
Preliminary analysis
Investigating weather data
# profile report for weather variables
pp.ProfileReport(ewr_flights.loc[:, ['time_hour', 'dep_delay_true', 'wind_dir',
'wind_speed', 'wind_gust', 'visib', 'temp_max', 'temp_min',
'temp_ave', 'temp_departure', 'ppt', 'new_snow', 'snow_depth']])
Summarize dataset: 100%|██████████| 27/27 [00:14<00:00, 1.88it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:02<00:00, 2.18s/it]
Render HTML: 100%|██████████| 1/1 [00:02<00:00, 2.27s/it]
Summary of profile report on weather.
wind_speed
andwind_gust
are highly positively correlated. As are all the temperature variables,temp_max
,temp_min
,temp_ave
.- Precipitation (
ppt
) has a very skewed distribution and 48% of the data is zero. The two snow variables,new_snow
andsnow_depth
have 92% zeros in their data columns. - Wind direction has a bimodal distribution.
24% of the flights are delayed and 76% of the flights are not delayed.
Investigating plane information
A pandas profiling report was also run for plane information. (see code)
Summary of profile report on plane information
- The
type
variable is highly correlated withmanufacturer
andengine
, remove this variable. - The
manufacturer
variable has high cardinality (lots of different categories). Recode this to the top 5 manufacturers, recode NaNs to unknown and set the rest to other.
Feature engineering and data wrangling
Random forest is known to be a good model for flight delay data. I will prepare the data for this model.
Feature engineering
The following features will be changed.
- convert
wind_speed
into a categoricalwind_scale
using the Beaufort scale - drop columns with high correlation identified in pandas profiling report (
wind_gust
,temp_max
,temp_min
) - remove
type
- recode
manufacturer
to the top 5 manufacturers, recode NaNs to unknown and set the rest to other - recode
hour
,month
andday
as categorical variables - remove other columns not thought to be important for flight delay model (keep
dest_airport
):dest_timezone
,dest_alt
,dest_lon
,dest_lat
,dest_faa
,ori_timezone
,ori_lat
,ori_lon
,ori_airport
,ori_faa
,time_hour
,minute
,hour
,dest
,tailnum
,flight
,carrier
,arr_delay
,sched_arr_time
,arr_time
,dep_delay
,sched_dep_time
,dep_time
,day
.
Convert weather data types
# add categorical wind_scale for wind_speed
wind_labels = ['Calm', 'Light Breeze', 'Gentle Breeze',
'Moderate Breeze', 'Fresh Breeze', 'Strong Breeze', 'Near Gale',
'Gale']
wind_bins = [0, 3, 7, 12, 18, 24, 31, 38, 46]
flights_all['wind_scale'] = pd.cut(flights_all['wind_speed'],
bins = wind_bins, labels=wind_labels,
include_lowest = True)
weather_drop = ['wind_gust', 'temp_max', 'temp_min']
flights_all_trim = flights_all.drop(columns=weather_drop)
Convert plane data types
flights_all_trim.drop(columns=['type', 'model'], inplace = True)
man_keep = ['Unknown', 'BOEING', 'AIRBUS', 'AIRBUS INDUSTRIE',
'EMBRAER', 'BOMBARDIER INC', 'MCDONNELL DOUGLAS AIRCRAFT CO', 'MCDONNELL DOUGLAS']
flights_all_trim.manufacturer = np.where(flights_all_trim.manufacturer.isin(man_keep),
flights_all_trim.manufacturer, 'Other')
#fix manufacturer doubling for airbus and mcdonnell douglas
flights_all_trim.manufacturer = flights_all_trim.manufacturer.map({
'AIRBUS INDUSTRIE': 'AIRBUS',
'AIRBUS': 'AIRBUS',
'MCDONNELL DOUGLAS': 'MCDONNELL DOUGLAS',
'MCDONNELL DOUGLAS AIRCRAFT CO': 'MCDONNELL DOUGLAS',
'BOEING': 'BOEING',
'EMBRAER': 'EMBRAER',
'BOMBARDIER INC': 'BOMBARDIER INC',
'Unknown': 'Unknown',
'Other': 'Other'})
flights_all_trim.head()
month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | ... | dest_faa | dest_airport | dest_lat | dest_lon | dest_alt | dest_timezone | carrier_name | manufacturer | engine | wind_scale | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 545.0 | 545 | 0.0 | 833.0 | 831 | 2.0 | B6 | 527 | ... | MCO | Orlando International Airport | 28.429399 | -81.308998 | 96.0 | America/New_York | JetBlue Airways | AIRBUS | Turbo-fan | Gentle Breeze |
1 | 1 | 1 | 556.0 | 555 | 1.0 | 709.0 | 714 | -5.0 | UA | 320 | ... | IAD | Washington Dulles International Airport | 38.944500 | -77.455803 | 312.0 | America/New_York | United Air Lines Inc. | AIRBUS | Turbo-fan | Gentle Breeze |
2 | 1 | 1 | 559.0 | 600 | -1.0 | 848.0 | 904 | -16.0 | AA | 1301 | ... | MIA | Miami International Airport | 25.793200 | -80.290604 | 8.0 | America/New_York | American Airlines Inc. | Unknown | Unknown | Gentle Breeze |
3 | 1 | 1 | 557.0 | 600 | -3.0 | 853.0 | 903 | -10.0 | B6 | 605 | ... | FLL | Fort Lauderdale Hollywood International Airport | 26.072599 | -80.152702 | 9.0 | America/New_York | JetBlue Airways | AIRBUS | Turbo-fan | Gentle Breeze |
4 | 1 | 1 | 731.0 | 655 | 36.0 | 830.0 | 808 | 22.0 | UA | 2016 | ... | BOS | General Edward Lawrence Logan International Ai... | 42.364300 | -71.005203 | 20.0 | America/New_York | United Air Lines Inc. | AIRBUS | Turbo-fan | Gentle Breeze |
5 rows × 43 columns
flights_all_trim.shape
(294167, 43)
Convert timeseries data into categories
flights_all_trim['month_flag'] = flights_all_trim.month.astype(str)
flights_all_trim['month_flag'] = flights_all_trim['month_flag'].map({
'1': 'Jan',
'2': 'Feb',
'3': 'Mar',
'4': 'Apr',
'5': 'May',
'6': 'Jun',
'7': 'Jul',
'8': 'Aug',
'9': 'Sep',
'10': 'Oct',
'11': 'Nov',
'12': 'Dec'})
flights_all_trim.loc[:, 'time_hour'] = pd.to_datetime(flights_all_trim.time_hour, yearfirst=True)
flights_all_trim['weekday'] = flights_all_trim['time_hour'].dt.day_name()
flights_all_trim['hour'] = pd.Categorical(flights_all_trim['hour'], ordered = True)
to_drop=['dest_timezone', 'ori_alt', 'dest_lon', 'dest_lat', 'dest_faa',
'ori_timezone', 'ori_lat', 'ori_lon', 'ori_airport', 'ori_faa', 'time_hour',
'minute', 'dest', 'tailnum', 'flight', 'carrier', 'arr_delay', 'sched_arr_time',
'arr_time', 'dep_delay', 'month', 'sched_dep_time','dep_time', 'day']
flights_model_data = flights_all_trim.drop(columns=to_drop)
Remove data for EWR (Newark) - save the rest of the data for future work, can run the model on the data for the other airports to see how it performs
# split into EWR, and the others
ewr_model_data = flights_model_data.loc[(flights_model_data['origin'] == 'EWR')].copy()
Exploratory Data Analysis
I will start with an overview of departure delays across all airports and then move to delays at EWR, focusing on weather and other factors.
# set figure sizes for all plots
plt.rcParams['figure.figsize'] = [5, 3]
plt.rcParams['figure.dpi'] = 120
plt.rcParams['font.size'] = 10
flights_all_trim.loc[:, 'dep_delay_count'] = flights_all_trim.groupby(['origin', 'month']).dep_delay_true.transform('sum')
flights_all_trim.loc[:, 'total_flight_count'] = flights_all_trim.groupby(['origin', 'month']).dep_delay.transform('count')
flights_all_trim.loc[:, 'percent_delay'] = (flights_all_trim.dep_delay_count / flights_all_trim.total_flight_count) * 100
hue_order = ['EWR', 'JFK', 'LGA']
all_flights_pal = sns.color_palette(['#005083', '#6876a4', '#ffd28f'])
sns.barplot(
x = 'month_flag', y = 'total_flight_count',
hue = 'origin', palette = all_flights_pal,
hue_order = hue_order, data = flights_all_trim)
plt.title("Total flights per month for NY airports in 2017")
plt.xlabel("Month")
plt.ylabel("Total flights")
plt.legend(bbox_to_anchor=(1.01, 1), loc = 2,
borderaxespad=0)
plt.tight_layout()
plt.savefig('figures/all_airports_total_plot.png', pad_inches=0.5);
sns.barplot(
x = 'month_flag', y = 'percent_delay',
hue = 'origin', hue_order = hue_order,
palette = all_flights_pal, data = flights_all_trim)
plt.title("Delayed flights for NY airports in 2017")
plt.xlabel("Month")
plt.ylabel("Percentage of delayed flights")
plt.legend(bbox_to_anchor=(1.01, 1), loc = 2,
borderaxespad=0)
plt.tight_layout()
plt.savefig('figures/all_airports_delays_plot.png', pad_inches=0.5);
sns.pairplot(ewr_model_data,
x_vars=['air_time', 'distance', 'dest_alt'],
y_vars=['air_time', 'distance', 'carrier_name', 'dest_alt'],
hue = 'dep_delay_true',
corner = True);
ewr_all = flights_all_trim.loc[(flights_all_trim['origin'] == 'EWR')].copy().reset_index(drop = True)
# function to create a column with percentage of delayed flights per grouping column (col)
def percent_delay(col):
ewr_all.loc[:, 'dep_delay_per_'+ col] = ewr_all.groupby([col]).dep_delay_true.transform('sum')
ewr_all.loc[:, 'total_flights_per_' + col] = ewr_all.groupby([col]).dep_delay.transform('count')
ewr_all.loc[:, 'percent_delay_per_' + col] = (ewr_all['dep_delay_per_'+ col] / ewr_all['total_flights_per_' + col]) * 100
EDA weather and flight delays
# set custom graph options
line_pal = sns.color_palette(['#f9ab00', '#49006a'])
visib_plot = sns.lineplot(
x = 'month_flag',
y = 'visib',
hue = 'dep_delay_true',
marker = 'o',
ci = False,
palette = line_pal,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Visibility (miles)')
plt.title('Average monthly visibility for departures')
plt.legend(title='', labels=['On time', 'Delayed'])
plt.tight_layout()
plt.savefig('figures/visib_plot.png', pad_inches=0.5);
sns.lineplot(x = 'month_flag',
y = 'ppt',
hue = 'dep_delay_true',
marker='o',
palette = line_pal,
ci=False,
legend=False,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Rainfall (inches)')
plt.title('Average monthly rainfall for departures')
#plt.legend(title='', labels=['Not delayed', 'Delayed'])
plt.tight_layout()
plt.savefig('figures/ppt_plot.png', pad_inches=0.5);
sns.lineplot(x = 'month_flag',
y = 'wind_speed',
hue = 'dep_delay_true',
marker='o',
palette = line_pal,
ci = False,
legend = False,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Wind speed (mph)')
plt.title('Average monthly windspeed for departures')
#plt.legend(title='', labels=['Not delayed', 'Delayed'])
plt.tight_layout()
plt.savefig('figures/windspeed_plot.png', pad_inches=0.5);
sns.lineplot(x = 'month_flag',
y = 'wind_dir',
hue = 'dep_delay_true',
marker='o',
palette = line_pal,
ci = False,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Wind direction (degrees)')
plt.title('Average monthly wind direction for departures')
plt.legend(title='', labels=['On-time', 'Delayed'])
plt.tight_layout()
plt.savefig('figures/wind_dir_plot.png', pad_inches=0.7);
sns.lineplot(x = 'month_flag',
y = 'temp_departure',
hue = 'dep_delay_true',
marker='o',
palette = line_pal,
data = ewr_all)
plt.xticks(rotation=90)
plt.legend(title='Departure Delay', labels=['No', 'Yes']);
sns.lineplot(x = 'month_flag',
y = 'temp_ave',
marker = 'o',
hue = 'dep_delay_true',
palette = line_pal,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Temperature (F)')
plt.title('Average monthly temperature for departures')
plt.legend(title='', labels=['On-time', 'Delayed'])
plt.tight_layout()
plt.savefig('./figures/ave_temp_plot.png', pad_inches=0.7);
sns.lineplot(x = 'month_flag',
y = 'snow_depth',
marker='o',
hue = 'dep_delay_true',
palette = line_pal,
ci = False,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Snow depth (inches)')
plt.title('Depth of snow per month for Newark departures')
plt.legend(title='', labels=['Not delayed', 'Delayed']);
sns.lineplot(x = 'month_flag',
y = 'new_snow',
hue = 'dep_delay_true',
marker='o',
palette = line_pal,
ci = False,
data = ewr_all)
plt.xticks(rotation=90)
plt.xlabel('Month')
plt.ylabel('Amount of new snow (inches)')
plt.title('Snowfall per month for Newark departures')
plt.legend(title='', labels=['Not delayed', 'Delayed']);
Summary
Weather which affects departure delays year round includes
wind_speed
,rainfall
,wind_direction
andvisib
(visibility).For the winter months,
new_snow
andsnow_depth
impact on flight delays.
For the business presentation, graphs were selected which show differences in flight delays over different months.
EDA for other reasons for delays
Carrier delays
percent_delay('carrier_name')
carrier = ewr_all.drop_duplicates(subset=['carrier_name']).reset_index(drop = True)
ordered_data = carrier.sort_values('percent_delay_per_carrier_name').carrier_name
sns.barplot(
x = 'carrier_name',
y = 'percent_delay_per_carrier_name',
color = '#0868ac',
order = ordered_data,
data = carrier)
plt.xticks(rotation=90)
plt.xlabel('')
plt.ylabel('Delayed flights (%)')
plt.title('Delays per airline carrier')
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plt.savefig('figures/carrier_delays_plot.png', dpi = 400, pad_inches=0.85, bbox_inches='tight');
percent_delay('weekday')
sns.barplot(y = 'percent_delay_per_weekday',
x = 'weekday',
color = '#7fcdbb',
data = ewr_all)
plt.xticks(rotation=90)
plt.ylabel('Delayed flights (%)')
plt.xlabel('')
plt.title('Delays per day of the week')
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plt.savefig('figures/weekdays_plot.png', dpi = 400, pad_inches=0.85, bbox_inches='tight');
percent_delay('hour')
sns.barplot(
y = 'percent_delay_per_hour',
x = 'hour',
color = '#41b6c4',
data = ewr_all)
plt.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
['1am', '5am', '6am', '7am', '8am', '9am',
'10am', '11am', '12am', '1pm', '2pm', '3pm',
'4pm', '5pm', '6pm', '7pm', '8pm', '9pm', '10pm', '11pm'],
rotation=90)
plt.ylabel('Delayed flights (%)')
plt.xlabel('Hour')
plt.title('Delays per hour')
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plt.savefig('figures/hour_plot.png', dpi = 400, pad_inches=0.85, bbox_inches='tight');
percent_delay('manufacturer')
#8c96c6
sns.barplot(y = 'percent_delay_per_manufacturer',
x = 'manufacturer',
color = '#3b528b',
data = ewr_all)
plt.xticks(rotation=90)
plt.ylabel('Delayed flights (%)')
plt.xlabel('')
plt.title('Delays per manufacturer')
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plt.savefig('figures/manufacturer_plot.png', dpi = 400, pad_inches=0.85, bbox_inches='tight');
percent_delay('engine')
sns.barplot(y = 'percent_delay_per_engine',
x = 'engine',
color = '#8c6bb1',
data = ewr_all)
plt.xticks(rotation=90)
plt.ylabel('Delayed flights (%)')
plt.xlabel('')
plt.title('Delays per type of plane engine')
plt.tight_layout()
plt.subplots_adjust(bottom=0.15)
plt.savefig('figures/engine_plot.png', dpi = 400, pad_inches=0.85, bbox_inches='tight');
Summary
Many other factors are also involved in departure delays.
Airline carrier, time and month of departure and plane engine show the most variation in terms of total percentage flights which were delayed.
Geospatial data exploration
The data per destination airport needs to be summarised so that it can be plotted on a map.
There are 88 different destination airports in total, all domestic (US).
Possible useful summary visuals:
- percentage delayed flights per destination airport sized by total number of flights
- percentage delayed flights per destination airport coloured/sized by delay time
- percentage delayed flights selecting top 5 delayed airlines on the route
# select all columns useful for geospatial analysis
ewr_geo = ewr_flights.loc[:, ['ori_airport', 'ori_lat', 'ori_lon',
'ori_alt','dest_airport', 'dest_lat', 'dest_lon', 'carrier_name', 'dep_delay',
'dep_delay_true', 'air_time', 'distance', 'hour', 'wind_speed', 'visib']].copy()
# generate percentage delay column per destination for plotting
ewr_geo.loc[:, 'dep_delay_count'] = ewr_geo.groupby(['dest_airport']).dep_delay_true.transform('sum')
ewr_geo.loc[:, 'total_flights_per_dest_airport'] = ewr_geo.groupby(['dest_airport']).dep_delay.transform('count')
ewr_geo.loc[:, 'percent_delay_per_dest_airport'] = (ewr_geo.dep_delay_count / ewr_geo.total_flights_per_dest_airport) * 100
ewr_unique_dest = ewr_geo.drop_duplicates(['dest_airport', 'percent_delay_per_dest_airport']).reset_index(drop = True)
ewr_unique_dest['percent_delay_per_dest_airport'] = ewr_unique_dest['percent_delay_per_dest_airport'].round()
ewr_unique_dest.dropna(inplace = True)
# remove unwanted columns
ewr_unique_dest = ewr_unique_dest.loc[:, ['ori_airport', 'ori_lat', 'ori_lon',
'dest_airport', 'dest_lat', 'dest_lon', 'total_flights_per_dest_airport',
'percent_delay_per_dest_airport']].copy()
ewr_unique_dest = ewr_unique_dest.reset_index(drop = True)
# scale total flights to a value between 1 and 20 for plotting
ewr_unique_dest['scaled_flights_per_dest'] = \
1 + (ewr_unique_dest['total_flights_per_dest_airport'] - \
ewr_unique_dest['total_flights_per_dest_airport'].min())\
/(ewr_unique_dest['total_flights_per_dest_airport'].max() - \
ewr_unique_dest['total_flights_per_dest_airport'].min())\
* 19
# make a color column in the dataset to iterate through in the below folium script
# bin percent delay into 9 and 5
color_hex_labels= ['#f7fcfd', '#e0ecf4', '#bfd3e6',
'#9ebcda', '#8c96c6', '#8c6bb1', '#88419d',
'#810f7c', '#4d004b']
ewr_unique_dest['color_hex_delay'] = pd.cut(ewr_unique_dest['percent_delay_per_dest_airport'],
bins = 9, labels=color_hex_labels,
include_lowest = True)
us_map = folium.Map(location=[40.6925, -74.1687],
zoom_start = 5,
tiles="cartodbpositron")
#iterrows returns index (stored in _) and row which we need
for _, row in ewr_unique_dest.iterrows():
folium.CircleMarker([row['ori_lat'], row['ori_lon']],
radius=5,
fill_color='#005083', # ewr airport colour
stroke=False,
fill_opacity = 1
).add_to(us_map)
folium.CircleMarker([row['dest_lat'], row['dest_lon']],
radius=10,
fill_color=[row['color_hex_delay']], # destination colour
stroke=False,
fill_opacity = 1
).add_to(us_map)
folium.PolyLine([[row['ori_lat'], row['ori_lon']],
[row['dest_lat'], row['dest_lon']]],
color = 'black',
weight= [row['scaled_flights_per_dest']], # size line on total flights
opacity = 0.1
).add_to(us_map)
# create legend for map
legend = branca.colormap.StepColormap(['#f7fcfd', '#e0ecf4', '#bfd3e6',
'#9ebcda', '#8c96c6', '#8c6bb1', '#88419d',
'#810f7c', '#4d004b'],
vmin=6, vmax=54, index=[6, 11, 16, 22, 27, 32, 38, 43, 48, 54],
caption='Percentage of delayed flights')
legend.add_to(us_map)
#view map
us_map
Summary of EDA
- Many factors involved in departure delays - weather, time of day, airline carrier, destination airport
- Uncertain of the importance of each one
- Move to a model to gain insight on factors contribution to departure delay