Data Exploration (Part 1)

Source: Internet | Published by: 好压mac | Editor: 程序博客网 | Time: 2024/06/11 16:22

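Data exploration is the first thing to do after getting hold of a dataset. The goal is a rough understanding of the data to be analyzed: the size of the dataset, the number of features and samples, the data types, the probability distributions of the values, and so on. Below I walk through these steps on the Mercedes-Benz car data, partly as a record of my own learning.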

import numpy as np
import pandas as pd

# Load the training and test sets and compare their shapes
train_df = pd.read_csv('train_b.csv')
test_df = pd.read_csv('test_b.csv')
print(train_df.shape, test_df.shape)

(4209, 378) (4209, 377)

train_df.head()

   ID       y  X0  X1  X2  X3  X4  X5  X6  X8  …  X375  X376  X377  X378  X379  X380  X382  X383  X384  X385
0   0  130.81   k   v  at   a   d   u   j   o  …     0     0     1     0     0     0     0     0     0     0
1   6   88.53   k   t  av   e   d   y   l   o  …     1     0     0     0     0     0     0     0     0     0
2   7   76.26  az   w   n   c   d   x   j   x  …     0     0     0     0     0     0     1     0     0     0
3   9   80.62  az   t   n   f   d   x   l   e  …     0     0     0     0     0     0     0     0     0     0
4  13   78.02  az   v   n   f   d   h   d   n  …     0     0     0     0     0     0     0     0     0     0

5 rows × 378 columns

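Feature transformations and other data processing have to be applied to both the training set and the test set. When the data needs to be normalized, fit the transformation on the training set and then apply that same fitted transformation to the test set. The training and test sets are merged here to make the later feature operations more convenient.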
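The "fit on train, apply to test" rule can be sketched with plain pandas. This is only a minimal illustration: the column name `feat` and the toy values are hypothetical and are not part of the Mercedes-Benz data.

```python
import pandas as pd

# Toy numeric feature (hypothetical values, not from the real dataset)
train = pd.DataFrame({'feat': [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({'feat': [2.0, 6.0]})

# Fit: the statistics come from the training set only
mean = train['feat'].mean()
std = train['feat'].std(ddof=0)

# Transform: the same statistics are applied to both sets
train_scaled = (train['feat'] - mean) / std
test_scaled = (test['feat'] - mean) / std

print(train_scaled.round(3).tolist())
print(test_scaled.round(3).tolist())
```

The test set never contributes to `mean` or `std`, so a test value far outside the training range (6.0 here) simply lands far from zero after scaling.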

all_data = pd.concat([train_df.drop('y', axis=1), test_df])

print(all_data.shape)

(8418, 377)

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

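To look at the distribution of the dependent variable y, visualization is the most direct way. Plot the sorted y values to see the trend, and plot the distribution of y.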

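plt.figure(figsize=(12, 12))
# Top: y values sorted in ascending order
plt.subplot(211)
plt.scatter(range(train_df.shape[0]), np.sort(train_df.y))
# Bottom: distribution of y after replacing the outlier (y > 175) with the mean
plt.subplot(212)
train_df.loc[train_df['y'] > 175, 'y'] = train_df['y'].mean()
sns.histplot(train_df['y'], bins=70, kde=True)  # sns.distplot is deprecated in recent seaborn
plt.show()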

[Figure: sorted y values (top) and distribution of y (bottom)]

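In the scatter plot, y varies continuously and rises sharply once the time goes above 125; the histogram confirms that few samples lie above 125. There is only one outlier (y > 175), and it is replaced with the mean. The histogram also shows that the distribution of y has more than one peak, so it may be a mixture of several distributions. Clustering is worth trying later.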
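The outlier replacement described above can be checked on a toy series. The values below are hypothetical; only the cutoff of 175 comes from the post. Note that the mean is computed before the replacement, so it still includes the outlier, as in the code above.

```python
import pandas as pd

# Hypothetical y values with a single extreme point
df = pd.DataFrame({'y': [80.0, 90.0, 100.0, 110.0, 270.0]})

threshold = 175                     # cutoff used in the post
mean_y = df['y'].mean()             # computed with the outlier still present
outlier_mask = df['y'] > threshold  # boolean mask of rows to replace

# .loc with a row mask and a column label assigns in place on df itself
df.loc[outlier_mask, 'y'] = mean_y

print(outlier_mask.sum(), df['y'].tolist())
```

Assigning through `df.loc[mask, 'y']` avoids the chained form `df['y'].loc[mask] = ...`, which may write to a temporary copy instead of the original frame.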

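Next, look at the distribution and variation of the independent variables. The ones that stand out are the first few features, which are of string type.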

train_df.describe(include=['O'])

          X0    X1    X2    X3    X4    X5    X6    X8
count   4209  4209  4209  4209  4209  4209  4209  4209
unique    47    27    44     7     4    29    12    25
top        z    aa    as     c     d     v     g     j
freq     360   833  1659  1942  4205   231  1042   277

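X4 has four distinct values, and the single value d accounts for 4205 of the 4209 rows, so its variance is extremely small. If all (or nearly all) values of a feature are the same, the feature can be assumed to contribute little.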
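One way to flag features like X4 automatically is to measure the share of rows taken by the most frequent value in each column. The sketch below uses a toy frame with hypothetical column names `A` and `B`, and the 0.95 cutoff is an arbitrary assumption rather than a value from the post. For X4 itself the share would be 4205 / 4209, about 0.999.

```python
import pandas as pd

# Toy categorical data: 'A' is dominated by one value, 'B' is balanced
df = pd.DataFrame({
    'A': ['d'] * 98 + ['a', 'b'],
    'B': ['x', 'y'] * 50,
})

# Share of rows taken by the most frequent value in each column
top_share = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])

cutoff = 0.95  # hypothetical threshold for "nearly constant"
near_constant = top_share[top_share > cutoff].index.tolist()

print(top_share)
print(near_constant)
```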

Again, plot these features to see how they affect the dependent variable:
Scatter plot

# for f in train_df.columns:
#     if train_df[f].dtype == 'object':
#         plt.figure(figsize=(12,8))
#         sns.stripplot(x=f, y='y', data=train_df)
#         plt.xlabel(f, fontsize=12)
#         plt.ylabel('y', fontsize=12)
#         plt.title('Distribution of y variable with ' + f, fontsize=15)
#         plt.show()
plt.figure(figsize=(12, 8))
sns.stripplot(x='X0', y='y', data=train_df)
plt.xlabel('X0', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Distribution of y variable with X0', fontsize=15)
plt.show()

[Figure: strip plot of y by X0]

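Box plot
A box plot is built from six kinds of data points. Sort the data, then compute the upper whisker, the upper quartile Q3, the median, the lower quartile Q1, the lower whisker, and the outliers.
A box plot shows whether each group of data is skewed and how concentrated it is.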
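The six quantities can be computed by hand with NumPy. The sketch below assumes the common 1.5 × IQR convention for the whiskers, which is also matplotlib's default; the post does not state which rule it has in mind, and the sample values are hypothetical.

```python
import numpy as np

# Hypothetical sample with one extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Quartiles and interquartile range
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond them are treated as outliers (1.5 * IQR rule, assumed)
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the fences
inside = data[(data >= low_fence) & (data <= high_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()
outliers = data[(data < low_fence) | (data > high_fence)]

print(lower_whisker, q1, median, q3, upper_whisker, outliers)
```

A median that sits off-center between Q1 and Q3, or whiskers of very different lengths, are the visual signs of skew mentioned above.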

plt.figure(figsize=(12, 8))
sns.boxplot(x='X3', y='y', data=train_df)
plt.xlabel('X3', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('boxplot of y variable with X3', fontsize=15)
plt.show()

[Figure: box plot of y by X3]

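Violin plot
A violin plot can be seen as a variant of the box plot. The white dot is the median, the top and bottom of the thick black bar are the upper and lower quartiles, and the width of the violin shows the density. The thin black line extending above and below the bar marks the whisker range, as in a box plot, rather than the outliers themselves. Compared with a box plot, it additionally shows how densely the data is distributed.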
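The width of a violin is a kernel density estimate of the values. As a rough sketch of what that means, the hand-written Gaussian KDE below averages one Gaussian bump per sample. The helper `gaussian_kde_1d`, the fixed bandwidth of 2.0, and the synthetic two-peak sample are all assumptions made for illustration; seaborn selects its bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample, evaluated on grid."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Synthetic sample with two groups, loosely mimicking a multi-peak target
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(90, 3, 300), rng.normal(110, 3, 100)])

grid = np.linspace(70, 130, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=2.0)

# The widest part of the violin sits where the density is highest
peak = grid[np.argmax(density)]
print(round(float(peak), 1))
```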

plt.figure(figsize=(12, 8))
sns.violinplot(x='X3', y='y', data=train_df)
plt.xlabel('X3', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Violinplot of y variable with X3', fontsize=15)
plt.show()

[Figure: violin plot of y by X3]

# Group the numeric features by their set of distinct values
unique_values_dict = {}
for col in train_df.columns:
    if train_df[col].dtype != 'object' and col != 'ID' and col != 'y':
        unique_value = str(np.sort(train_df[col].unique()).tolist())
        tlist = unique_values_dict.get(unique_value, [])
        tlist.append(col)
        unique_values_dict[unique_value] = tlist[:]
# for unique_val, columns in unique_values_dict.items():
#     print(unique_val, columns)

# Plot the counts of 0 and 1 for each binary feature
zero_count_list = []
one_count_list = []
# On Python 3 the key is '[0, 1]'; the original Python 2 run used '[0L, 1L]'
cols_list = unique_values_dict['[0, 1]']
for col in cols_list:
    zero_count_list.append((train_df[col] == 0).sum())
    one_count_list.append((train_df[col] == 1).sum())
N = len(cols_list)
ind = np.arange(N)
width = 0.35
plt.figure(figsize=(6, 100))
# Stacked horizontal bars: zero counts first, one counts to their right
p1 = plt.barh(ind, zero_count_list, width, color='red')
p2 = plt.barh(ind, one_count_list, width, left=zero_count_list, color='blue')
plt.yticks(ind, cols_list)
plt.legend((p1[0], p2[0]), ('Zero count', 'One count'))
plt.show()

[Figure: stacked horizontal bars of zero and one counts for each binary feature]
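The grouping step above is easier to follow on a toy frame. The column names and values here are hypothetical. Under Python 3 the key for binary columns reads '[0, 1]', and a column that only ever takes the value 0 falls into its own '[0]' group, which makes constant features easy to spot.

```python
import numpy as np
import pandas as pd

# Hypothetical binary features; X11 is constant
df = pd.DataFrame({
    'X10': [0, 1, 0, 1],
    'X11': [0, 0, 0, 0],
    'X12': [1, 0, 0, 0],
})

# Map each sorted set of distinct values to the columns that share it
groups = {}
for col in df.columns:
    key = str(np.sort(df[col].unique()).tolist())
    groups.setdefault(key, []).append(col)

# Fraction of zeros in each truly binary column
zero_share = {col: float((df[col] == 0).mean()) for col in groups['[0, 1]']}

print(groups)
print(zero_share)
```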
