Airplane Crash Data Analysis Example

Source: Internet · Editor: 程序博客网 · Time: 2024/04/28 23:25

Dataset: a public Kaggle dataset of airplane crashes and fatalities recorded since 1908

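Project tasks:

  • Number of crashes per year
  • Number of passengers aboard
  • Survivors and fatalities
  • Which operators have the most crashes?
  • Which aircraft types have the most crashes?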
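# -*- coding: utf-8 -*-
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from bokeh.io import output_notebook, output_file, show
# Note: bokeh.charts is not available in recent Bokeh releases;
# it was split out into the separate bkcharts package.
from bokeh.charts import Bar, TimeSeries
from bokeh.layouts import column
from math import pi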
- Inspect the dataset
data_path = './dataset/Airplane_Crashes_and_Fatalities_Since_1908.csv'
df_data = pd.read_csv(data_path)
print('Dataset summary:')
df_data.info()    # info() prints its report directly and returns None

Dataset summary:
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
Date            5268 non-null object
Time            3049 non-null object
Location        5248 non-null object
Operator        5250 non-null object
Flight #        1069 non-null object
Route           3562 non-null object
Type            5241 non-null object
Registration    4933 non-null object
cn/In           4040 non-null object
Aboard          5246 non-null float64
Fatalities      5256 non-null float64
Ground          5246 non-null float64
Summary         4878 non-null object
dtypes: float64(3), object(10)
memory usage: 535.1+ KB
print('The dataset has %i rows and %i columns' % (df_data.shape[0], df_data.shape[1]))

The dataset has 5268 rows and 13 columns
print('Data preview:')
df_data.head()

Data preview:

  | Date | Time | Location | Operator | Flight # | Route | Type | Registration | cn/In | Aboard | Fatalities | Ground | Summary
0 | 09/17/1908 | 17:18 | Fort Myer, Virginia | Military - U.S. Army | NaN | Demonstration | Wright Flyer III | NaN | 1 | 2.0 | 1.0 | 0.0 | During a demonstration flight, a U.S. Army fly…
1 | 07/12/1912 | 06:30 | AtlantiCity, New Jersey | Military - U.S. Navy | NaN | Test flight | Dirigible | NaN | NaN | 5.0 | 5.0 | 0.0 | First U.S. dirigible Akron exploded just offsh…
2 | 08/06/1913 | NaN | Victoria, British Columbia, Canada | Private | - | NaN | Curtiss seaplane | NaN | NaN | 1.0 | 1.0 | 0.0 | The first fatal airplane accident in Canada oc…
3 | 09/09/1913 | 18:30 | Over the North Sea | Military - German Navy | NaN | NaN | Zeppelin L-1 (airship) | NaN | NaN | 20.0 | 14.0 | 0.0 | The airship flew into a thunderstorm and encou…
4 | 10/17/1913 | 10:30 | Near Johannisthal, Germany | Military - German Navy | NaN | NaN | Zeppelin L-2 (airship) | NaN | NaN | 30.0 | 30.0 | 0.0 | Hydrogen gas which was being vented was sucked…
  • Handle missing data
# def process_missing_data(df_data):
#     """
#     Handle missing data.
#     """
#     if df_data.isnull().values.any():
#         # Missing data is present
#         print('Missing data found!')
#         df_data = df_data.fillna(0.)    # fill NaN with 0
#         # df_data = df_data.dropna()    # or drop rows containing NaN
#     return df_data.reset_index()
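The commented-out helper above fills every NaN with 0, which would also write 0 into text columns such as Time or Route. A more selective variant might look like the sketch below. The helper name `fill_missing` and the toy frame are hypothetical; only the column names come from the `info()` listing.

```python
import numpy as np
import pandas as pd


def fill_missing(df, numeric_cols=("Aboard", "Fatalities", "Ground")):
    """Return a copy with NaN in the numeric count columns replaced by 0.

    Text columns are left untouched so that a missing value stays
    distinguishable from real data. (Hypothetical helper, not from the post.)
    """
    out = df.copy()
    cols = [c for c in numeric_cols if c in out.columns]
    out[cols] = out[cols].fillna(0.0)
    return out


# Toy frame standing in for the real CSV (values are made up).
toy = pd.DataFrame({
    "Operator": ["A", None, "C"],
    "Aboard": [2.0, np.nan, 30.0],
    "Fatalities": [1.0, 5.0, np.nan],
    "Ground": [0.0, 0.0, 0.0],
})

print(toy.isnull().sum())   # number of missing values per column
cleaned = fill_missing(toy)
print(cleaned)
```

Whether 0 is a sensible fill value depends on the question being asked; for yearly sums it makes no difference, because pandas skips NaN when summing.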
- Data transformation
df_data['Date'] = pd.to_datetime(df_data['Date'])
# df_data['Date']
# Extract the year from the parsed date
df_data['Year'] = df_data['Date'].dt.year
df_data.head()

  | Date | Time | Location | Operator | Flight # | Route | Type | Registration | cn/In | Aboard | Fatalities | Ground | Summary | Year
0 | 1908-09-17 | 17:18 | Fort Myer, Virginia | Military - U.S. Army | NaN | Demonstration | Wright Flyer III | NaN | 1 | 2.0 | 1.0 | 0.0 | During a demonstration flight, a U.S. Army fly… | 1908
1 | 1912-07-12 | 06:30 | AtlantiCity, New Jersey | Military - U.S. Navy | NaN | Test flight | Dirigible | NaN | NaN | 5.0 | 5.0 | 0.0 | First U.S. dirigible Akron exploded just offsh… | 1912
2 | 1913-08-06 | NaN | Victoria, British Columbia, Canada | Private | - | NaN | Curtiss seaplane | NaN | NaN | 1.0 | 1.0 | 0.0 | The first fatal airplane accident in Canada oc… | 1913
3 | 1913-09-09 | 18:30 | Over the North Sea | Military - German Navy | NaN | NaN | Zeppelin L-1 (airship) | NaN | NaN | 20.0 | 14.0 | 0.0 | The airship flew into a thunderstorm and encou… | 1913
4 | 1913-10-17 | 10:30 | Near Johannisthal, Germany | Military - German Navy | NaN | NaN | Zeppelin L-2 (airship) | NaN | NaN | 30.0 | 30.0 | 0.0 | Hydrogen gas which was being vented was sucked… | 1913
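The dates in the preview look like MM/DD/YYYY, so the format can be passed explicitly rather than left for pandas to guess. The sketch below assumes that format holds for the whole file. The last sample value is deliberately malformed to show what `errors="coerce"` does.

```python
import pandas as pd

# Sample values copied from the preview; the last one is deliberately malformed.
raw = pd.DataFrame({"Date": ["09/17/1908", "07/12/1912", "08/06/1913", "not a date"]})

# An explicit format avoids day/month guessing; errors="coerce" yields NaT
# instead of raising on malformed input.
raw["Date"] = pd.to_datetime(raw["Date"], format="%m/%d/%Y", errors="coerce")

# .dt.year is vectorised; rows with NaT become NaN, so the column is float here.
raw["Year"] = raw["Date"].dt.year
print(raw)
```

Rows that end up as NaT can then be inspected or dropped before grouping by year.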
  • Analysis and visualization: number of crashes vs. year

a) seaborn

plt.figure(figsize=(15.0, 10.0))
sns.countplot(x='Year', data=df_data)

<matplotlib.axes._subplots.AxesSubplot at 0x1158ce610>

# The SimHei font is only needed when labels contain Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

plt.title('Number of crashes vs. year')
plt.xlabel('Year')
plt.ylabel('Number of crashes')
plt.xticks(rotation=90)
plt.show()

[Figure: number of crashes per year, seaborn count plot]
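The count plot tallies rows per year. The same numbers can be obtained as a table with `value_counts`, which is handy for checking the chart. The sketch below uses a made-up `Year` column in place of `df_data['Year']`.

```python
import pandas as pd

# Toy Year column (made-up values) standing in for df_data['Year'].
years = pd.Series([1908, 1912, 1913, 1913, 1913, 1915], name="Year")

# value_counts() tallies rows per year; sort_index() puts the years in order.
crashes_per_year = years.value_counts().sort_index()
print(crashes_per_year)

# Year with the most crashes in the toy data.
print(crashes_per_year.idxmax())
```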

b) bokeh

p = Bar(df_data, 'Year', title='Number of crashes vs. year', plot_width=1000, legend=False,
        xlabel='Year', ylabel='Number of crashes')
p.xaxis.major_label_orientation = pi / 2
output_notebook()
show(p)

[Figure: number of crashes per year, Bokeh bar chart]
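Since `bokeh.charts` is no longer part of recent Bokeh releases, the `Bar` call may fail to import. A rough equivalent built on `bokeh.plotting` could look like the sketch below. It assumes a Bokeh version that provides `figure.vbar`, and the `years` data is made up.

```python
from math import pi

import pandas as pd
from bokeh.plotting import figure

# Toy Year column (made-up values) standing in for df_data['Year'].
years = pd.Series([1908, 1912, 1913, 1913, 1913, 1915], name="Year")
counts = years.value_counts().sort_index()

# One vertical bar per year, with the bar height equal to the crash count.
p = figure(title="Crashes vs. Year",
           x_axis_label="Year",
           y_axis_label="Number of crashes")
p.vbar(x=counts.index.tolist(), top=counts.values.tolist(), width=0.8)
p.xaxis.major_label_orientation = pi / 2

# In a notebook, display with:
#   from bokeh.io import output_notebook, show
#   output_notebook(); show(p)
```

Unlike `Bar`, `vbar` does no aggregation, so the counts have to be computed with pandas first.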

  • Analysis and visualization: passengers aboard vs. fatalities vs. year
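# Sum only the numeric count columns; recent pandas versions no longer
# silently drop text and datetime columns in groupby(...).sum()
grouped_year_sum_data = df_data.groupby('Year', as_index=False)[['Aboard', 'Fatalities', 'Ground']].sum()
grouped_year_sum_data.head()

   Year  Aboard  Fatalities  Ground
0  1908     2.0         1.0     0.0
1  1912     5.0         5.0     0.0
2  1913    51.0        45.0     0.0
3  1915    60.0        40.0     0.0
4  1916   109.0       108.0     0.0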
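The task list mentions survivors, which the yearly totals do not contain directly. One way to derive them, assuming that everyone aboard who is not counted in `Fatalities` survived, is sketched below using the first rows of the table above. The column names `Survivors` and `FatalityRate` are made up for the illustration.

```python
import pandas as pd

# First rows of the yearly totals shown above.
yearly = pd.DataFrame({
    "Year": [1908, 1912, 1913, 1915, 1916],
    "Aboard": [2.0, 5.0, 51.0, 60.0, 109.0],
    "Fatalities": [1.0, 5.0, 45.0, 40.0, 108.0],
})

# Assumption: survivors = people aboard minus fatalities.
yearly["Survivors"] = yearly["Aboard"] - yearly["Fatalities"]

# Share of people aboard who died, per year.
yearly["FatalityRate"] = yearly["Fatalities"] / yearly["Aboard"]
print(yearly)
```

On the full dataset the two columns have missing values in different rows, so a per-year difference of sums may be slightly off unless rows with a missing `Aboard` or `Fatalities` value are dropped first.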

a) seaborn

plt.title('Passengers aboard vs. fatalities vs. year')
plt.xlabel('Year')
plt.ylabel('Passengers aboard vs. fatalities')
plt.xticks(rotation=90)
plt.show()

[Figure: passengers aboard and fatalities per year, seaborn]
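The seaborn cell above only sets the title and axis labels; the call that draws the two series is not shown. One possible way to draw them, not necessarily the one used for the original figure, is a plain matplotlib line plot. The sketch below uses the first rows of the yearly totals and a headless backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# First rows of the yearly totals (Year, Aboard, Fatalities).
yearly = pd.DataFrame({
    "Year": [1908, 1912, 1913, 1915, 1916],
    "Aboard": [2.0, 5.0, 51.0, 60.0, 109.0],
    "Fatalities": [1.0, 5.0, 45.0, 40.0, 108.0],
})

# One line per series, sharing the Year axis.
fig, ax = plt.subplots(figsize=(15.0, 10.0))
ax.plot(yearly["Year"], yearly["Aboard"], marker="o", label="Aboard")
ax.plot(yearly["Year"], yearly["Fatalities"], marker="o", label="Fatalities")
ax.set_title("Passengers aboard vs. fatalities vs. year")
ax.set_xlabel("Year")
ax.set_ylabel("People")
ax.legend()
# In a notebook, plt.show() would render the figure here.
```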

b) bokeh

tsline = TimeSeries(data=grouped_year_sum_data, x='Year', y=['Aboard', 'Fatalities'],
                    color=['Aboard', 'Fatalities'], dash=['Aboard', 'Fatalities'],
                    title='Passengers aboard vs. fatalities vs. year',
                    xlabel='Year', ylabel='Passengers aboard vs. fatalities',
                    legend=True)
tspoint = TimeSeries(data=grouped_year_sum_data, x='Year', y=['Aboard', 'Fatalities'],
                     color=['Aboard', 'Fatalities'], dash=['Aboard', 'Fatalities'],
                     builder_type='point',
                     title='Passengers aboard vs. fatalities vs. year',
                     xlabel='Year', ylabel='Passengers aboard vs. fatalities',
                     legend=True)
output_notebook()
show(column(tsline, tspoint))

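[Figure: passengers aboard and fatalities per year, Bokeh line chart]
[Figure: passengers aboard and fatalities per year, Bokeh point chart]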

  • Top-n analysis
grouped_data = df_data.groupby(by='Type', as_index=False)['Date'].count()
grouped_data.rename(columns={'Date': 'Count'}, inplace=True)
top_n = 10
top_n_grouped_data = grouped_data.sort_values('Count', ascending=False).iloc[:top_n, :]
top_n_grouped_data
      Type                                      Count
1178  Douglas DC-3                                334
2388  de Havilland Canada DHC-6 Twin Otter 300     81
1097  Douglas C-47A                                74
1089  Douglas C-47                                 62
1230  Douglas DC-4                                 40
2340  Yakovlev YAK-40                              37
125   Antonov AN-26                                36
1598  Junkers JU-52/3m                             32
1119  Douglas C-47B                                29
1045  De Havilland DH-4                            28

- Visualize the results

plt.figure(figsize=(15.0, 10.0))
sns.barplot(x='Count', y='Type', data=top_n_grouped_data)

plt.title('Count vs Type', fontsize=20)
# The bars are horizontal: counts run along the x axis, types along the y axis
plt.xlabel('Count')
plt.ylabel('Type')
plt.show()

[Figure: top 10 aircraft types by number of crashes, seaborn bar plot]
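The task list also asks which operators have the most crashes. The same top-n pattern should carry over to the `Operator` column; the sketch below uses `value_counts` as a shorter alternative to the groupby/sort steps. The operator names are placeholders, not values from the dataset.

```python
import pandas as pd

# Toy Operator column (placeholder names) standing in for df_data['Operator'].
ops = pd.DataFrame({
    "Operator": ["Operator A", "Operator A", "Operator B", "Operator C",
                 "Operator A", None, "Operator C"],
})

top_n = 2
top_ops = (
    ops["Operator"]
    .value_counts()              # sorted by count, descending; NaN is skipped
    .head(top_n)
    .rename_axis("Operator")
    .reset_index(name="Count")
)
print(top_ops)
```

The resulting frame has the same `Count` column as `top_n_grouped_data`, so it can be passed to `sns.barplot` with `y='Operator'`.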
