Feature Engineering: Notes

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 18:28

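A data scientist's work is usually highly exploratory. Data science projects often start from a vague goal, or from assumptions about which data and methods might be usable.
Often all you can do is test your ideas and get to know your data. Data scientists write a lot of code, but much of it exists only to test ideas and never ends up in the final solution.
Developers, by contrast, spend more of their effort on writing code. Their goal is to build systems: programs that deliver the required functionality. Developers sometimes do exploratory work too,
such as prototyping, proofs of concept, or benchmarking, but their main job is writing code.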
#Feature engineering http://blog.csdn.net/a353833082/article/details/50765671
Data wrangling
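### kaggle2014: Use different models for different targets, and blend several of them with weights to avoid overfitting, because the number of features exceeds the number of samples. The SVM penalty parameter was set very large (regularization). Blending models reduces the weaknesses of each individual model.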
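The weighted blending described above can be sketched roughly as follows. This is only an illustration: the `blend` helper, the three sets of predictions, and the weights are hypothetical and are not taken from the competition write-up.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of per-model predictions (one row per model)."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    return weights @ predictions

# Hypothetical predictions from three models on four samples.
model_preds = [
    [0.9, 0.2, 0.6, 0.4],  # e.g. an SVM
    [0.8, 0.1, 0.7, 0.5],  # e.g. a linear model
    [0.7, 0.3, 0.5, 0.3],  # e.g. a tree ensemble
]

# The first model gets twice the weight of the other two (an arbitrary choice).
blended = blend(model_preds, weights=[2, 1, 1])
print(blended)
```

In practice the weights would be chosen on a validation set rather than fixed by hand.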

### kaggle2015: For a supervised learning method, I used ensemble selection to generate an ensemble from a model library. The model library was built with models trained using various algorithms, various parameter settings, and various feature sets. I used Hyperopt (usually used in parameter tuning) to choose parameter settings from a pre-defined parameter space for training different models. For the feature engineering part, I relied heavily on pandas and NumPy for data manipulation, and on TfidfVectorizer and SVD in Sklearn for extracting text features. For the model training part, I mostly used XGBoost, Sklearn, Keras and RGF (a tree ensemble learning method).
### The combination of the best features and a prudent validation technique is what won the competition. All it takes to win is one or two very good features. Sometimes the training set and the test set have different distributions, so learn to analyze the data distribution, remove noisy data, and add relevant features.
Data preprocessing: text cleaning. Feature extraction: features useful to the task; explore new features.
Use ensemble methods: a model library containing hundreds or thousands of diverse models to combat overfitting and stabilize the results.
Do not ever underestimate the power of linear models. They can be much better than tree-based models or SVR with RBF/poly kernels when using raw TF-IDF features. They can be even better if you introduce appropriate nonlinearities.
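Parameter tuning: Hyperopt
Keep your implementation flexible and scalable. This allows you to add new models to the model library very easily.
Blending models that perform well but are only weakly correlated with each other gives clearly better results than blending models that are strongly correlated.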
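A small simulation can illustrate why weakly correlated models blend better. Everything here is synthetic and assumed for illustration: the "models" are just the true values plus Gaussian noise, and the correlation `rho` between the errors of models A and B is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
truth = rng.normal(size=n)

def rmse(pred):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# Three synthetic models with equally large errors (unit variance).
noise_a = rng.normal(size=n)
rho = 0.95  # correlation between the errors of model A and model B
noise_b = rho * noise_a + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
noise_c = rng.normal(size=n)  # independent of model A's errors

model_a = truth + noise_a
model_b = truth + noise_b
model_c = truth + noise_c

rmse_a = rmse(model_a)
rmse_ab = rmse((model_a + model_b) / 2)  # blend of correlated models
rmse_ac = rmse((model_a + model_c) / 2)  # blend of uncorrelated models
print(rmse_a, rmse_ab, rmse_ac)
```

Averaging A with the independent model C cancels a large part of the error, while averaging A with the nearly identical model B barely helps.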

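Feature selection: a feature is considered important if it is highly correlated with the target variable. Correlation coefficients and other univariate methods (which treat each variable as independent of the others) are the most common approach. A more sophisticated approach is to score features with a predictive model. Some models have such a feature selection mechanism built in, for example MARS, random forests, and gradient boosting machines. These models can also report variable importance.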
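The univariate approach can be sketched as ranking features by the absolute value of their correlation with the target. The data below is synthetic and the column names (`strong`, `weak`, `noise`) are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# The target depends heavily on "strong", slightly on "weak", not at all on "noise".
df["target"] = 3.0 * df["strong"] + 1.0 * df["weak"] + rng.normal(size=n)

# Absolute Pearson correlation of every feature with the target, highest first.
scores = (
    df.drop(columns="target")
    .corrwith(df["target"])
    .abs()
    .sort_values(ascending=False)
)
print(scores)
```

Because each feature is scored on its own, this ranking ignores interactions between features, which is where the model-based scores mentioned above can do better.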

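Stepwise regression selects features automatically while building the model. Regularization methods can also act as feature selection algorithms: while the model is being fit, they remove unimportant features or shrink their contribution.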

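Domain knowledge (look up relevant research papers); read the experience and tricks that others share; practice every day, practice makes perfect! Keep asking which kinds of features will affect the result.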
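1. Rental prices:
First fill NULL values with 0: df['a_column'] = df['a_column'].fillna(0)
Then remove rows that are clearly meaningless (number of rooms, beds, or price equal to 0): df = df[df['price']!=0]   or df = df[df.price!=0]. Note that df = df.dropna(axis=0) does something different: it drops every row that contains any missing value.
#Use a regular expression to remove the $ from the price, then convert to float with astype(float)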

df['price'] = df['price'].replace(r'[\$,)]','',regex=True).replace(r'[(]','-', regex=True).astype(float)
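Put together, the cleaning steps above might look like the sketch below. The four-row table is invented for illustration, and the `beds` column stands in for whichever columns need the zero check.

```python
import pandas as pd

# Hypothetical raw listing data; prices arrive as formatted strings.
df = pd.DataFrame({
    "price": ["$1,200.00", "$85.00", "($50.00)", "$0.00"],
    "beds": [2, None, 1, 1],
})

# Fill missing values in one column with 0.
df["beds"] = df["beds"].fillna(0)

# Strip "$", "," and ")", turn "(" into a minus sign, then cast to float.
df["price"] = (
    df["price"]
    .replace(r"[\$,)]", "", regex=True)
    .replace(r"[(]", "-", regex=True)
    .astype(float)
)

# Keep only rows where neither the price nor the number of beds is 0.
df = df[(df["price"] != 0) & (df["beds"] != 0)]
print(df)
```

The parenthesised price survives as a negative number, so a stricter filter such as `df["price"] > 0` may be a better fit if negative prices are not meaningful in the data.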

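One-hot encoding converts non-numeric variables into numeric ones: rt_dummies = pd.get_dummies(df.room_type)
#Replace the original column with its one-hot encoding
#axis=1 means column-wise; to replace several columns, pass a list of column names
alldata = pd.concat((df.drop('room_type', axis=1),rt_dummies.astype(int)),axis=1)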
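As a runnable illustration of the one-hot step, the sketch below uses a made-up four-row table. It also shows the `columns=` argument of `pd.get_dummies`, which is one way to encode several columns at once; that variant prefixes the new column names with the original column name.

```python
import pandas as pd

# Hypothetical listings with one categorical column.
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "price": [150.0, 60.0, 30.0, 75.0],
})

# One indicator column per distinct room type.
rt_dummies = pd.get_dummies(df["room_type"])

# Swap the original column for its indicator columns (axis=1: column-wise).
alldata = pd.concat([df.drop("room_type", axis=1), rt_dummies.astype(int)], axis=1)
print(alldata)

# Alternative: let get_dummies replace a list of columns in one call.
alldata2 = pd.get_dummies(df, columns=["room_type"], dtype=int)
print(alldata2.columns.tolist())
```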
allcols = alldata.columns
#The plot shows every pair of the selected features against each other
scattercols = ['price','accommodates', 'number_of_reviews', 'reviews_per_month', 'beds', 'availability_30', 'review_scores_rating']
#scatter_matrix draws the matrix of pairwise scatter plots (it lives in pd.plotting in current pandas)
axs = pd.plotting.scatter_matrix(alldata[scattercols],figsize=(12, 12), c='red')

#np.argsort(array) returns the indices that would sort the array
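A short example of `np.argsort`, here used to rank a few made-up scores (for instance, feature importances):

```python
import numpy as np

scores = np.array([0.30, 0.90, 0.10, 0.60])

# Indices that would sort the array in ascending order.
order = np.argsort(scores)
print(order)            # positions of the smallest to the largest value

# Reverse the order to get the indices of the two largest values.
top2 = order[::-1][:2]
print(top2, scores[top2])
```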

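Value validation: months, weekdays, and so on.
Field validation: units and real-world meaning, e.g. values must be greater than zero, or must have the form of an IP address.

City-block (Manhattan) distance is the sum of the absolute differences along each coordinate. It is more stable in some cases, but the feature values must not differ too much in scale.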

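Euclidean distance requires the feature values to be standardized; otherwise features with very large values dominate the distance and accuracy suffers.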
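The effect of scale on these distances can be seen in a small sketch. The three listings and their `[price, beds]` features are hypothetical, and z-score standardization is just one common choice of scaling.

```python
import numpy as np

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return float(np.abs(np.asarray(a) - np.asarray(b)).sum())

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return float(np.sqrt(((np.asarray(a) - np.asarray(b)) ** 2).sum()))

# Hypothetical listings described by [price, beds].
X = np.array([[100.0, 1.0], [110.0, 4.0], [400.0, 1.0]])

# Raw features: the price column dominates both distances.
raw_01 = euclidean(X[0], X[1])  # differs mostly in beds
raw_02 = euclidean(X[0], X[2])  # differs only in price

# Z-score standardization: zero mean and unit variance per column.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_01 = euclidean(Z[0], Z[1])
std_02 = euclidean(Z[0], Z[2])

print(manhattan(X[0], X[1]), raw_01, raw_02, std_01, std_02)
```

Before scaling, the pair that differs only in price looks far more distant; after scaling, a three-bed difference counts about as much as a 300 price difference.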

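Cosine distance is better suited to handling outliers, sparse data, and feature vectors with very many dimensions, but it discards length (magnitude) information that is useful in some cases, so it is more commonly used in text similarity analysis.

Handling imbalanced data: see whether more data can be collected; resample (downsample when data is plentiful, upsample when it is scarce, using random or stratified sampling); generate synthetic data with SMOTE (scikit-learn-contrib-imbalanced-learn); try different algorithms (DT and RF work well on imbalanced data); add regularization; try a different model evaluation metric, such as Kappa (a consistency test: whether different models or analysis methods agree in their predictions, whether a model's results agree with the actual results, and so on).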
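To show why Kappa is a useful metric on imbalanced data, here is a minimal hand-written version of Cohen's kappa. The `cohen_kappa` helper and the ten labels are illustrative assumptions; a library implementation would normally be used instead.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Agreement between two label sequences, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected if both label sequences were assigned independently.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Undefined (division by zero) when expected agreement is exactly 1.
    return (observed - expected) / (1 - expected)

# Imbalanced labels: nine negatives and one positive.
y_true = [0] * 9 + [1]

# A model that always predicts the majority class is 90% accurate...
always_zero = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)
kappa_majority = cohen_kappa(y_true, always_zero)

# ...while a perfect model reaches kappa = 1.
kappa_perfect = cohen_kappa(y_true, y_true)
print(accuracy, kappa_majority, kappa_perfect)
```

The majority-class predictor scores 90% accuracy but a kappa of 0, because all of its agreement with the true labels is what chance alone would produce.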

0 0