Data Expo 2009: Airline on time data
Flight Data Analysis
1987 to 2008

Dataset overview

The original dataset contains around 120 million flight records from 1987 to 2008. This presentation uses a curated sample of 5 million records.

The sample includes the attributes discussed in this presentation, such as flight name, airline name, actual departure time, actual arrival time, and date.

Investigation overview

The analysis starts with simple questions and moves on to more complex ones as the presentation progresses.

Some of these questions are:

  • Which year was the busiest for flights?
  • Which month was the busiest for flights?
  • What is the relation between departure delay and arrival delay?
  • How does the progression of time relate to departure and arrival delays?

and more.

Which year was the busiest for flights from 1987 to 2008?

In [6]:
plt.figure(figsize=(12, 7))
# Highlight the tallest bar with `alt`; all other bars use `base_color`.
colors = [base_color if (x < Year_counts.max()) else alt for x in Year_counts]
sns.barplot(x=Year_counts.index, y=Year_counts, palette=colors, edgecolor='black')
plt.xlabel('Year')
plt.ylabel('Count')
plt.title('What was the busiest year for flights?', size=15)

# Print each bar's label just above the bar.
for i in range(Year_counts.shape[0]):
    plt.text(i,
             Year_counts.values[i],
             year_text[i],
             ha='center',
             va='bottom')

2007 was the busiest year for flights.
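
The cell above relies on `Year_counts`, `year_text`, `base_color` and `alt`, which are defined in notebook cells not shown here. The following is a minimal sketch of how they might be built, assuming `fly` is the flights DataFrame with a `Year` column. The tiny frame, the percentage label format and the two color values are illustrative assumptions, not the notebook's own definitions.

```python
import pandas as pd

# Hypothetical stand-in for the real `fly` DataFrame (made-up rows).
fly = pd.DataFrame({"Year": [2005, 2006, 2006, 2007, 2007, 2007]})

# Flights per year, sorted chronologically for the x-axis.
Year_counts = fly["Year"].value_counts().sort_index()

# One label per bar: the year's share of all flights (assumed label format).
year_text = [f"{100 * n / Year_counts.sum():.1f}%" for n in Year_counts]

# Placeholder colors; the notebook may use seaborn palette entries instead.
base_color, alt = "tab:blue", "tab:orange"

# Every bar gets the base color except the tallest, which gets `alt`.
colors = [base_color if n < Year_counts.max() else alt for n in Year_counts]

print(Year_counts.idxmax(), year_text, colors)
```

The same pattern would apply to the month, day-of-week, flight and carrier counts used in the later charts.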

Which month was the busiest for flights from 1987 to 2008?

In [11]:
plt.figure(figsize=(12, 7))
# Highlight the tallest bar with `alt`; all other bars use `base_color`.
colors = [base_color if (x < Month_counts.max()) else alt for x in Month_counts]
sns.barplot(x=Month_counts.index, y=Month_counts, palette=colors)
plt.xlabel('Month')
plt.xticks(ticks=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
           labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.ylabel('Count')
plt.title('What was the busiest month for flights?', size=15)

# Print each bar's label just above the bar.
for i in range(Month_counts.shape[0]):
    plt.text(i,
             Month_counts.values[i],
             Month_text[i],
             ha='center',
             va='bottom')

October was the busiest month for flights.

Which day of the week was the busiest for flights from 1987 to 2008?

In [14]:
plt.figure(figsize=(12, 7))
# Highlight the tallest bar with `alt`; all other bars use `base_color`.
colors = [base_color if (x < DayOfWeek_counts.max()) else alt for x in DayOfWeek_counts]
sns.barplot(x=DayOfWeek_counts.index, y=DayOfWeek_counts, palette=colors)
plt.xlabel('DayOfWeek')
plt.xticks(ticks=[0, 1, 2, 3, 4, 5, 6],
           labels=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
plt.ylabel('Count')
plt.title('What was the busiest day of the week for flights?', size=15)

# Print each bar's label just above the bar.
for i in range(DayOfWeek_counts.shape[0]):
    plt.text(i,
             DayOfWeek_counts.values[i],
             DayOfWeek_text[i],
             ha='center',
             va='bottom')

Tuesday was the busiest day, although Monday and Wednesday have similar traffic. Thursday, Friday and Saturday, on the other hand, show a somewhat gradual decrease compared to the first three days.

What were the top 10 most frequent flights from 1987 to 2008?

In [28]:
plt.figure(figsize=(12, 7))
# Highlight the tallest bar with `alt`; all other bars use `base_color`.
colors = [base_color if (x < flight_counts.max()) else alt for x in flight_counts]
sns.barplot(x=flight_counts.index, y=flight_counts, order=flight_counts.index, palette=colors)
plt.xlabel('Flight Name')
plt.ylabel('Count')
plt.title('Top 10 most frequent flights', size=15)

# Print each bar's label just above the bar.
for i in range(flight_counts.shape[0]):
    plt.text(i,
             flight_counts.values[i],
             flight_text[i],
             ha='center',
             va='bottom')

The top spot goes to Alaska Airlines flight number 65.
The six most frequent flights belong to Alaska Airlines.
The remaining four belong to Southwest Airlines.
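
The cell above plots `flight_counts`, which is prepared in a cell not shown here. As a rough sketch, a "flight name" could be formed by joining the carrier code and the flight number and then counting occurrences. The `FlightNum` column, the `"AS 65"` label format and the sample rows below are assumptions for illustration; the notebook may build the name differently.

```python
import pandas as pd

# Made-up rows standing in for the real data.
fly = pd.DataFrame({
    "UniqueCarrier": ["AS", "AS", "AS", "WN", "WN", "DL"],
    "FlightNum":     [65,   65,   65,   12,   12,   7],
})

# Join carrier code and flight number into one label, e.g. "AS 65".
flight_name = fly["UniqueCarrier"] + " " + fly["FlightNum"].astype(str)

# value_counts() sorts in descending order, so head(10) keeps the
# ten most frequent flights.
flight_counts = flight_name.value_counts().head(10)

print(flight_counts)
```

Passing `order=flight_counts.index` to `sns.barplot`, as the cell above does, keeps the bars in this descending order.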

Is there a relation between Departure Delay and Arrival Delay?

In [30]:
fig = plt.figure(figsize=(15, 4))

# Draw one random sample so that both panels show the same flights.
sample = fly.sample(10000)

# Left panel: jittered, highly transparent scatter plot.
plt.subplot(1, 2, 1)
sns.regplot(data=sample, x='DepDelay', y='ArrDelay',
            x_jitter=0.4, scatter_kws={'alpha': 1/30}, fit_reg=False)
plt.axhline(0, color=sns.color_palette()[7], lw=0.3)
plt.axvline(0, color=sns.color_palette()[7], lw=0.3)
plt.title('Scatter chart')
plt.xlabel('Departure Delay (min)')
plt.ylabel('Arrival Delay (min)')

# Right panel: 2D histogram with 1-minute bins, zoomed in around zero delay.
plt.subplot(1, 2, 2)
bins_x = np.arange(-15, 15 + 1, 1)
bins_y = np.arange(-30, 40 + 1, 1)
plt.hist2d(data=sample, x='DepDelay', y='ArrDelay',
           cmin=0.5, cmap='viridis_r', bins=[bins_x, bins_y])
plt.axhline(0, color=sns.color_palette()[7], lw=0.3)
plt.axvline(0, color=sns.color_palette()[7], lw=0.3)
plt.colorbar()
plt.title('Heatmap')
plt.xlabel('Departure Delay (min)')
plt.ylabel('Arrival Delay (min)');
  • The dark vertical blob of points shows flights that departed on time. The spread along the y-axis at that line shows that these flights tend to arrive early more often than late.
  • The third quadrant (lower left) has a higher concentration of points, meaning that early departures are associated with early arrivals.
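
The two observations above are read off the plots; they could also be checked numerically. The sketch below computes the Pearson correlation between the two delays, the share of on-time departures that arrived early, and the share of flights in the third quadrant. The delay values are made up for illustration, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up delays in minutes; negative values mean "early".
fly = pd.DataFrame({
    "DepDelay": [-5, -3, 0, 0, 0, 4, 10, 20],
    "ArrDelay": [-12, -6, -8, -2, 3, 1, 12, 25],
})

# Pearson correlation between departure and arrival delay.
r = fly["DepDelay"].corr(fly["ArrDelay"])

# Among on-time departures (DepDelay == 0), the share that arrived early.
on_time = fly[fly["DepDelay"] == 0]
early_share = (on_time["ArrDelay"] < 0).mean()

# Share of all flights in the third quadrant:
# early departure and early arrival.
third_quadrant = ((fly["DepDelay"] < 0) & (fly["ArrDelay"] < 0)).mean()

print(round(r, 2), early_share, third_quadrant)
```

On the real data, `fly` would be the notebook's DataFrame, and rows with missing delays would be ignored by `corr` automatically.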

How does progression of time relate to Departure and Arrival Delays?

In [57]:
# One scatter panel per year (1991 onward), drawn from a 10,000-flight sample.
g = sns.FacetGrid(data=fly[fly.Year > 1990].sample(10000), col='Year', sharey=True, col_wrap=6)
g.map(plt.scatter, 'DepDelay', 'ArrDelay', alpha=0.3)
# Zero lines separate early (negative) from late (positive) values.
g.map(plt.axhline, y=0, color='grey', lw=0.3)
g.map(plt.axvline, x=0, color='grey', lw=0.3)
plt.suptitle('Time Progression', va='bottom', size=15, y=1)
# Set the axis labels last so that g.map() does not overwrite them.
g.set_axis_labels("Departure Delay (min)", "Arrival Delay (min)");

The more recent the year, the larger the early departures and early arrivals become. The points creep deeper into the third quadrant (lower left) as the years progress, depicting an overall improvement in both departure and arrival delays.

Which airline had the most air traffic from 1987 to 2008?

In [ ]:
plt.figure(figsize=(13, 7))
# Highlight the tallest bar with `alt`; all other bars use `base_color`.
colors = [base_color if (x < carrier_counts.max()) else alt for x in carrier_counts]
sns.barplot(x=carrier_counts.index, y=carrier_counts, order=carrier_counts.index, palette=colors)
plt.xlabel('Airline')
plt.ylabel('Count')
plt.title('Which airline had the most traffic?', size=15)

# Print each bar's label just above the bar.
for i in range(carrier_counts.shape[0]):
    plt.text(i,
             carrier_counts.values[i],
             carrier_text[i],
             ha='center',
             va='bottom')

WN (Southwest Airlines) had the highest total air traffic from 1987 to 2008.
DL (Delta Air Lines Inc.) comes second by a relatively small margin.

How bad were departure delays through the years for the top 10 airlines by air traffic?

In [63]:
g=sns.FacetGrid(data=top_fly,hue='UniqueCarrier',hue_order=top10car,height=6,aspect=1.8)
g.map(sns.pointplot,'Year','DepDelay',order=yearlist, ci=None, linestyles="-",scale=0.5)
plt.axhline(0,color=sns.color_palette()[7],lw=0.3)
plt.title("Top 10 air-traffic airlines' departure delays through the years",size=15)
plt.xlabel('Year')
plt.ylabel('Avg. Departure Delay (min)');
plt.legend(title='Airline',labels=top10car,loc='upper right');

Around 2001, departure delays started to decrease, and by 2003 most airlines had early departures on average.
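
`sns.pointplot` plots the mean of `DepDelay` for each year and carrier by default. The same averages can be tabulated directly, which makes it easier to check in which years the carriers departed early on average. The sketch below assumes `top_fly` holds only flights of the ten busiest carriers; its rows are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the notebook's `top_fly` DataFrame.
top_fly = pd.DataFrame({
    "UniqueCarrier": ["WN", "WN", "WN", "DL", "DL", "DL"],
    "Year":          [2000, 2000, 2003, 2000, 2003, 2003],
    "DepDelay":      [10,   6,    -2,   4,    -1,   -3],
})

# Mean departure delay: one row per year, one column per carrier.
avg_delay = (
    top_fly.groupby(["Year", "UniqueCarrier"])["DepDelay"]
    .mean()
    .unstack("UniqueCarrier")
)

# Years in which every carrier departed early on average.
early_years = avg_delay.index[(avg_delay < 0).all(axis=1)].tolist()

print(avg_delay)
print(early_years)
```

Replacing `.all(axis=1)` with `.mean(axis=1) > 0.5` would instead list the years in which most, rather than all, carriers departed early on average.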

Thank you
