Kaggle learning notes

Source: Internet | Editor: 程序博客网 | Date: 2024/06/06 23:30

House price prediction is a regression problem: what we care about is how each feature relates to the target variable. At first I simply fit a model on all the features, but the results were poor: training was slow and accuracy was low. So I turned to the kernel "Comprehensive data exploration with Python". It doesn't cover any particular model; instead it teaches how to work with the data.

Trick 1

Look at all the variable names with df_train.columns. Once the data is loaded, we naturally want to see what features there are and what basic properties they have: for example, which features are numeric and which are categorical, and the mean, variance, and so on of the numeric ones.
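As a minimal sketch of this first step (using a tiny hypothetical frame in place of the real train.csv, whose contents aren't reproduced here):

```python
import pandas as pd

# A tiny stand-in for the real train.csv (hypothetical values).
df_train = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500],
    "GrLivArea": [1710, 1262, 1786],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
})

print(df_train.columns)  # all feature names

# Split features into numeric and categorical.
numeric = df_train.select_dtypes(include="number").columns
categorical = df_train.select_dtypes(exclude="number").columns
print(list(numeric), list(categorical))

# Basic statistics for the numeric features: mean, std, quartiles.
print(df_train[numeric].describe())
```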

Trick 2

Examine the target variable. Here the task is house price prediction, so the target is the sale price. First look at its basic statistics with df_train['SalePrice'].describe(): min, max, mean, variance, and so on are all things we want to know. Then plot its density with sns.distplot(df_train['SalePrice']); this makes the shape of the distribution obvious at a glance: it is right-skewed, with a long tail of expensive houses.
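A sketch of the same check with hypothetical prices (the skewness statistic gives the same information numerically that the density plot shows visually):

```python
import pandas as pd

# Hypothetical SalePrice values; the real column comes from train.csv.
price = pd.Series([120000, 135000, 150000, 160000, 180000, 450000],
                  name="SalePrice")

print(price.describe())  # count, mean, std, min, quartiles, max
print(price.skew())      # positive skew -> long right tail
# To see the density plot itself: sns.distplot(price)  (requires seaborn)
```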

Trick 3

Next, look at the relationship between the price and each feature. Features come in numeric and categorical types, so we treat them separately. For a numeric feature we can plot a scatter plot against the price directly, but that doesn't work for a categorical feature (the scatter would just be a row of vertical stripes). For those we use a box plot instead, which shows outliers, the spread of the distribution, and so on.
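A sketch of both views on hypothetical rows; the plotting calls are shown as comments, and the per-category quartiles printed at the end are exactly the numbers a box plot draws:

```python
import pandas as pd

# Hypothetical sample rows standing in for train.csv.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 2198, 1362],    # numeric feature
    "OverallQual": [7, 6, 7, 8, 6],                 # categorical/ordinal
    "SalePrice": [208500, 181500, 223500, 250000, 140000],
})

# Numeric feature vs. target -> scatter plot:
#   df.plot.scatter(x="GrLivArea", y="SalePrice")
# Categorical feature vs. target -> box plot per category:
#   sns.boxplot(x="OverallQual", y="SalePrice", data=df)

# The box plot's numeric backbone: per-category quartiles of the target.
summary = df.groupby("OverallQual")["SalePrice"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])
```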

Trick 4

Plotting the features one at a time is tedious, and what we really care about are the features most correlated with the target. So we can draw a heatmap of the correlations between all the features, pick the "hottest" (most correlated) ones, and then plot only those against the target.
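A sketch of the selection step with hypothetical numeric columns; the heatmap itself is just a rendering of the correlation matrix computed here:

```python
import pandas as pd

# Hypothetical numeric slice of train.csv.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 2198, 1362],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "SalePrice": [208500, 181500, 223500, 250000, 140000],
})

corr = df.corr()  # pairwise correlation matrix
# To render it as a heatmap: sns.heatmap(corr, annot=True)  (requires seaborn)

# Rank features by absolute correlation with the target.
top = corr["SalePrice"].abs().sort_values(ascending=False)
print(top)
```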

Trick 5

What about missing data? Count the missing values for each feature, compute the missing ratio, and sort; it then becomes obvious which features are missing a lot and which are missing only a single value. The method in the kernel is brutally crude: any feature with more than one missing value is dropped outright, and any sample whose one feature value is missing is dropped, so the dataset ends up with no missing values at all. That is clearly not acceptable in general.
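The counting step can be sketched like this, on a hypothetical frame with gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, standing in for train.csv.
df = pd.DataFrame({
    "PoolQC":     [np.nan, np.nan, np.nan, "Gd"],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "GrLivArea":  [1710, 1262, 1786, 2198],
})

# Missing count and missing ratio per feature, worst first.
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum() / len(df)).sort_values(ascending=False)
missing = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
print(missing)
```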

Trick 6

Outliers. For various reasons (noise, human error, and so on) the collected data contains some anomalous points; how do we find and remove them? First, for a single variable such as the price, standardize it, sort it, and inspect the few largest and smallest values as candidates for deletion. Then do a bivariate analysis: plot a feature against the price in a scatter plot, pick out the points that clearly break the trend, and drop them.
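A sketch of both checks on hypothetical values; the z-score threshold of 2 is an illustrative choice, not from the original:

```python
import pandas as pd

# Hypothetical target values; one suspiciously large entry.
price = pd.Series([120000, 135000, 150000, 160000, 180000, 755000])

# Univariate check: standardize, then look at the extremes.
z = (price - price.mean()) / price.std()
print(z.sort_values().tail(3))   # the largest standardized values

# Flag points far from the mean (threshold 2 is an arbitrary example).
outliers = price[z.abs() > 2]
print(outliers)

# Bivariate check: scatter a feature against the target and drop points
# that clearly break the trend, e.g.
#   df.plot.scatter(x="GrLivArea", y="SalePrice")
#   df = df.drop(df[(df.GrLivArea > 4000) & (df.SalePrice < 300000)].index)
```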

Trick 7

The trick I didn't fully understand: the author transforms all the features toward a normal distribution and then repeats the analysis. After the transformation, the features and the transformed price show a much stronger linear relationship.
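One common way to do this normalization for right-skewed positive variables is a log transform (a sketch with hypothetical values; whether the author used exactly this transform isn't stated above):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices.
price = pd.Series([100000, 120000, 130000, 150000, 180000, 600000])

print(price.skew())          # strongly positive before the transform
log_price = np.log1p(price)  # log(1 + x); log() also works when all values > 0
print(log_price.skew())      # much closer to 0 afterwards
```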

Trick 8

On dummy variables: df_train = pd.get_dummies(df_train) converts the categorical columns into numeric dummy columns directly.
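A sketch of what that call does, again on a hypothetical frame: every object column is expanded into one 0/1 indicator column per category, while numeric columns pass through untouched.

```python
import pandas as pd

# Hypothetical frame with one categorical column.
df_train = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "GrLivArea": [1710, 1262, 1786],
})

df_train = pd.get_dummies(df_train)  # one-hot encode all object columns
print(df_train.columns.tolist())
```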


Titanic

Workflow goals
The data science solutions workflow solves for seven major goals.

Classifying. We may want to classify or categorize our samples. We may
also want to understand the implications or correlation of different
classes with our solution goal.

Correlating. One can approach the problem based on available features
within the training dataset. Which features within the dataset
contribute significantly to our solution goal? Statistically speaking,
is there a correlation between a feature and the solution goal? As the
feature values change, does the solution state change as well, and
vice versa? This can be tested both for numerical and categorical
features in the given dataset. We may also want to determine
correlation among features other than survival for subsequent goals
and workflow stages. Correlating certain features may help in
creating, completing, or correcting features.

Converting. For the modeling stage, one needs to prepare the data.
Depending on the choice of model algorithm, one may require all
features to be converted to numerical equivalents, for instance
converting text categorical values to numeric values.

Completing. Data preparation may also require us to estimate any
missing values within a feature. Model algorithms may work best when
there are no missing values.
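The completing goal can be sketched like this, on hypothetical Titanic-style columns (the median/mode choices are common conventions, not prescribed by the text above):

```python
import numpy as np
import pandas as pd

# Hypothetical Titanic-style frame with missing values.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Embarked": ["S", "C", np.nan, "S", "S"],
})

# Numeric gap -> fill with the median; categorical gap -> the mode.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
print(df.isnull().sum())  # no missing values remain
```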

Correcting. We may also analyze the given training dataset for errors
or possibly inaccurate values within features and try to correct these
values or exclude the samples containing the errors. One way to do
this is to detect any outliers among our samples or features. We may
also completely discard a feature if it is not contributing to the
analysis or may significantly skew the results.

Creating. Can we create new features based on an existing feature or a
set of features, such that the new feature follows the correlating,
converting, and completing goals?
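A sketch of the creating goal on hypothetical Titanic-style columns; deriving a family-size feature from SibSp and Parch is a common example, not something the text above prescribes:

```python
import pandas as pd

# Hypothetical Titanic-style columns.
df = pd.DataFrame({
    "SibSp": [1, 0, 3],  # siblings/spouses aboard
    "Parch": [0, 0, 1],  # parents/children aboard
})

# New feature: total family size, counting the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
print(df["FamilySize"].tolist())
```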

Charting. How do we select the right visualization plots and charts,
depending on the nature of the data and the solution goals?