The original dataset contains about 120 million flight records from 1987 to 2008. This presentation uses a curated sample of 5 million records.
The sample includes the attributes discussed in this presentation, such as flight name, airline name, actual departure time, actual arrival time, and date.
The analysis starts with simple questions and moves on to more complex ones as the presentation progresses.
Some of these questions are:
and more...
plt.figure(figsize=(12,7))
colors = [base_color if (x < Year_counts.max()) else alt for x in Year_counts ]
sns.barplot(x=Year_counts.index,y=Year_counts, palette=colors,edgecolor='black');
plt.xlabel('Year')
plt.ylabel('Count')
plt.title('What was the busiest year for flights ?',size=15)
for i in range(Year_counts.shape[0]):
    plt.text(i,
             Year_counts.values[i],
             year_text[i],
             ha='center',
             va='bottom')
2007 was the busiest year for flights.
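The plotting cell above uses `Year_counts`, `year_text`, `base_color`, and `alt` without showing where they come from. Below is a minimal sketch of how such inputs could be built. It assumes a `Year` column like the one in `fly`, percentage labels, and two arbitrary colors; the toy data and these choices are illustrative assumptions, not the notebook's actual definitions.

```python
import pandas as pd

# Toy stand-in for the flights table (the notebook itself works on `fly`).
toy = pd.DataFrame({"Year": [2005, 2006, 2006, 2007, 2007, 2007]})

# Flights per year, sorted by year so the bars appear chronologically.
Year_counts = toy["Year"].value_counts().sort_index()

# One label per bar: share of all flights as a percentage (assumed label format).
year_text = [f"{100 * n / Year_counts.sum():.1f}%" for n in Year_counts]

# Assumed colors: a base color for every bar and an accent for the tallest one.
base_color, alt = "tab:blue", "tab:orange"
colors = [base_color if n < Year_counts.max() else alt for n in Year_counts]

# The busiest year is the index label of the largest count.
busiest_year = Year_counts.idxmax()
```

The same pattern would apply to the month, day-of-week, flight, and carrier charts by swapping the column name.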
plt.figure(figsize=(12,7))
colors = [base_color if (x < Month_counts.max()) else alt for x in Month_counts ]
sns.barplot(x=Month_counts.index,y=Month_counts, palette=colors);
plt.xlabel('Month')
plt.xticks(ticks=[0,1,2,3,4,5,6,7,8,9,10,11],
labels=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
plt.ylabel('Count')
plt.title('What was the busiest month for flights ?',size=15)
for i in range(Month_counts.shape[0]):
    plt.text(i,
             Month_counts.values[i],
             Month_text[i],
             ha='center',
             va='bottom')
October was the busiest month for flights.
plt.figure(figsize=(12,7))
colors = [base_color if (x < DayOfWeek_counts.max()) else alt for x in DayOfWeek_counts ]
sns.barplot(x=DayOfWeek_counts.index,y=DayOfWeek_counts, palette=colors);
plt.xlabel('DayOfWeek')
plt.xticks(ticks=[0,1,2,3,4,5,6],
labels=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
plt.ylabel('Count')
plt.title('What was the busiest Day Of Week for flights ?',size=15)
for i in range(DayOfWeek_counts.shape[0]):
    plt.text(i,
             DayOfWeek_counts.values[i],
             DayOfWeek_text[i],
             ha='center',
             va='bottom')
Tuesday was the busiest day, although Monday and Wednesday had similar traffic. Thursday, Friday, and Saturday show a gradual decrease compared with the first three days.
plt.figure(figsize=(12,7))
colors = [base_color if (x < flight_counts.max()) else alt for x in flight_counts ]
sns.barplot(x=flight_counts.index,y=flight_counts,order=flight_counts.index, palette=colors);
plt.xlabel('Flight Name')
plt.ylabel('Count')
plt.title('Top 10 most frequent Flights',size=15)
for i in range(flight_counts.shape[0]):
    plt.text(i,
             flight_counts.values[i],
             flight_text[i],
             ha='center',
             va='bottom')
In the top spot: Alaska Airlines flight number 65.
The six most frequent flights all belong to Alaska Airlines.
The remaining four belong to Southwest Airlines.
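The ranking above depends on how a "flight name" is defined, which the slides do not show. The sketch below assumes it is the carrier code joined with the flight number (columns `UniqueCarrier` and `FlightNum`), and then tallies how many of the top flights each carrier holds. The toy rows and the `FlightName` column are hypothetical.

```python
import pandas as pd

# Toy stand-in for the flights table; values are made up for illustration.
toy = pd.DataFrame({
    "UniqueCarrier": ["AS", "AS", "AS", "WN", "WN", "DL"],
    "FlightNum": [65, 65, 65, 12, 12, 400],
})

# Assumed definition of a flight name: carrier code plus flight number.
toy["FlightName"] = toy["UniqueCarrier"] + " " + toy["FlightNum"].astype(str)

# Ten most frequent flights (value_counts sorts in descending order).
flight_counts = toy["FlightName"].value_counts().head(10)

# How many of those top flights belong to each carrier.
carriers_in_top = (
    pd.Series(flight_counts.index).str.split(" ").str[0].value_counts()
)
```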
fig=plt.figure(figsize=(15,4))
plt.subplot(1, 2, 1)
sns.regplot(data=fly.sample(10000), x='DepDelay', y ='ArrDelay',
x_jitter=0.4, scatter_kws={'alpha':1/30},fit_reg= False);
plt.axhline(0,color=sns.color_palette()[7],lw=0.3)
plt.axvline(0,color=sns.color_palette()[7],lw=0.3)
plt.title('Scatter chart')
plt.xlabel('Departure Delay (min)')
plt.ylabel('Arrival Delay (min)')
plt.subplot(1, 2, 2)
bins_x = np.arange(-15, 15+1,1)
bins_y = np.arange(-30, 40+1, 1)
plt.hist2d(data = fly.sample(10000), x = 'DepDelay', y = 'ArrDelay',
cmin=0.5, cmap='viridis_r',bins=[bins_x,bins_y])
plt.axhline(0,color=sns.color_palette()[7],lw=0.3)
plt.axvline(0,color=sns.color_palette()[7],lw=0.3)
plt.colorbar()
plt.title('Heatmap')
plt.xlabel('Departure Delay (min)')
plt.ylabel('Arrival Delay (min)');
- The dark vertical band of points at zero departure delay represents flights that departed on time. The spread along the y-axis at this line shows that these flights tend to arrive early more often than late.
- The third quadrant (lower left) has a higher concentration of points, meaning that early departures are associated with early arrivals.
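The two visual observations above can also be checked numerically. The following sketch, run on made-up values, computes the share of flights in the third quadrant and compares early against late arrivals among on-time departures. It assumes "on time" means `DepDelay == 0` and "early" means a negative delay.

```python
import pandas as pd

# Toy delays in minutes; negative values mean early.
toy = pd.DataFrame({
    "DepDelay": [-5, -3, 0, 0, 0, 4, 10, -2],
    "ArrDelay": [-10, -1, -6, -2, 3, 8, -4, 5],
})

# Share of flights that both departed and arrived early (third quadrant).
early_both = ((toy["DepDelay"] < 0) & (toy["ArrDelay"] < 0)).mean()

# Among on-time departures, how often was the arrival early or late?
on_time = toy[toy["DepDelay"] == 0]
early_arrival_share = (on_time["ArrDelay"] < 0).mean()
late_arrival_share = (on_time["ArrDelay"] > 0).mean()
```

On the real data, `early_arrival_share` exceeding `late_arrival_share` would support the first bullet.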
g=sns.FacetGrid(data=fly[fly.Year>1990].sample(10000), col='Year',sharey=True,col_wrap=6)
g.map(plt.scatter,'DepDelay','ArrDelay',alpha=0.3);
g.map(plt.axhline,color='grey', y=0,lw=0.3);
g.map(plt.axvline,color='grey' ,x=0,lw=0.3);
plt.suptitle('Time Progression',va='bottom',size=15,y=1);
# Set the labels last: g.map(plt.scatter, ...) resets them to the column names.
g.set_axis_labels("Departure Delay (min)","Arrival Delay (min)");
In more recent years, early departures and early arrivals become larger in magnitude: the points creep deeper into the third quadrant (lower left) as the years progress, depicting an overall improvement in both departure and arrival delays.
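A per-year summary is one way to back up the trend read from the facet grid. The sketch below, on made-up values, computes for each year the share of flights that were early on both departure and arrival, along with the median delays. The column names follow the plotting code; the data is hypothetical.

```python
import pandas as pd

# Toy data for two years; negative delays mean early.
toy = pd.DataFrame({
    "Year": [1995, 1995, 1995, 1995, 2005, 2005, 2005, 2005],
    "DepDelay": [5, -1, 10, 0, -4, -2, 3, -6],
    "ArrDelay": [8, -2, 12, 1, -9, -3, 2, -10],
})

# True where a flight both departed and arrived early.
early = (toy["DepDelay"] < 0) & (toy["ArrDelay"] < 0)

# Fraction of such flights per year.
early_share_by_year = early.groupby(toy["Year"]).mean()

# Median departure and arrival delay per year.
median_by_year = toy.groupby("Year")[["DepDelay", "ArrDelay"]].median()
```

A rising `early_share_by_year` together with falling medians would be consistent with the improvement described above.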
plt.figure(figsize=(13,7))
colors = [base_color if (x < carrier_counts.max()) else alt for x in carrier_counts ]
sns.barplot(x=carrier_counts.index,y=carrier_counts,order=carrier_counts.index, palette=colors);
plt.xlabel('Airline')
plt.ylabel('Count')
plt.title('Which Airline had most traffic?',size=15)
for i in range(carrier_counts.shape[0]):
    plt.text(i,
             carrier_counts.values[i],
             carrier_text[i],
             ha='center',
             va='bottom')
WN (Southwest Airlines) had the highest total air traffic from 1987 to 2008.
DL (Delta Air Lines Inc.) takes second place, trailing by only a small relative margin.
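The size of the gap between the top two carriers can be expressed as a relative difference. The sketch below shows the arithmetic on made-up counts; the numbers are hypothetical and do not reflect the actual dataset.

```python
import pandas as pd

# Toy carrier column: 5 WN flights, 4 DL flights, 2 AA flights.
toy = pd.Series(["WN"] * 5 + ["DL"] * 4 + ["AA"] * 2, name="UniqueCarrier")

# Flights per carrier, largest first.
carrier_counts = toy.value_counts()

# Gap between first and second place, relative to first place.
first, second = carrier_counts.iloc[0], carrier_counts.iloc[1]
relative_gap = (first - second) / first
```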
g=sns.FacetGrid(data=top_fly,hue='UniqueCarrier',hue_order=top10car,height=6,aspect=1.8)
g.map(sns.pointplot,'Year','DepDelay',order=yearlist, ci=None, linestyles="-",scale=0.5)
plt.axhline(0,color=sns.color_palette()[7],lw=0.3)
plt.title("Top 10 air-traffic airlines' departure delays through the years",size=15)
plt.xlabel('Year')
plt.ylabel('Avg. Departure Delay (min)');
plt.legend(title='Airline',labels=top10car,loc='upper right');
Around 2001, departure delays started to decrease, and by 2003 most airlines had early departures on average.
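The point plot relies on `top_fly`, `top10car`, and `yearlist`, which are defined elsewhere in the notebook. The sketch below shows one plausible way to derive equivalents and to tabulate the average departure delay per carrier and year. The toy data, the `n_top` parameter, and the names `top_carriers` and `avg_delay` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the flights table.
toy = pd.DataFrame({
    "UniqueCarrier": ["WN", "WN", "WN", "WN", "DL", "DL", "DL", "AA"],
    "Year": [2000, 2000, 2004, 2004, 2000, 2004, 2004, 2000],
    "DepDelay": [6, 10, -2, -4, 5, -1, -3, 20],
})

n_top = 2  # the slides use the top 10; 2 keeps the toy example small

# Carriers with the most flights (plays the role of `top10car`).
top_carriers = toy["UniqueCarrier"].value_counts().head(n_top).index.tolist()

# Flights operated by those carriers only (plays the role of `top_fly`).
top_fly = toy[toy["UniqueCarrier"].isin(top_carriers)]

# Sorted list of years, usable as the `order` argument of the point plot.
yearlist = sorted(top_fly["Year"].unique())

# Mean departure delay per carrier (rows) and year (columns).
avg_delay = (
    top_fly.groupby(["UniqueCarrier", "Year"])["DepDelay"].mean().unstack("Year")
)
```

Negative entries in `avg_delay` correspond to years in which a carrier departed early on average.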