
Source: Internet  Editor: 程序博客网  Time: 2024/06/05 18:28
# Feature engineering  http://blog.csdn.net/a353833082/article/details/50765671
Data wrangling
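### kaggle2014: Different targets were fit with different models, and several models were blended with weights to avoid overfitting, because the number of features exceeded the number of samples. The SVM penalty parameter was set very large (regularization). Blending models is meant to offset the weaknesses of each individual model.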
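The weighted blending described above can be sketched as a weighted average of per-model predictions. This is a minimal illustration, not the competition code: the function name `blend_predictions`, the three prediction arrays, and the weights are all made up for the example.

```python
import numpy as np

def blend_predictions(predictions, weights):
    """Weighted average of per-model predictions.

    predictions: list of 1-D arrays, one per model, all the same length.
    weights: one non-negative weight per model (normalized here to sum to 1).
    """
    preds = np.vstack(predictions)   # shape: (n_models, n_samples)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                  # normalize so the blend stays on the same scale
    return w @ preds                 # shape: (n_samples,)

# Hypothetical predictions from three models on four samples
svm_pred = np.array([1.0, 2.0, 3.0, 4.0])
ridge_pred = np.array([1.2, 1.8, 3.1, 3.9])
tree_pred = np.array([0.8, 2.4, 2.9, 4.2])

blended = blend_predictions([svm_pred, ridge_pred, tree_pred], weights=[0.5, 0.3, 0.2])
```

In practice the weights would be chosen on a held-out validation set rather than by hand.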

### kaggle2015: For the supervised learning method, I used ensemble selection to generate an ensemble from a model library. The model library was built from models trained with various algorithms, parameter settings, and feature sets. I used Hyperopt (usually used for parameter tuning) to choose parameter settings from a pre-defined parameter space for training the different models. For the feature engineering part, I relied heavily on pandas and NumPy for data manipulation, and on TfidfVectorizer and SVD in Sklearn for extracting text features. For the model training part, I mostly used XGBoost, Sklearn, Keras and rgf (a tree ensemble learning method).
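As a rough illustration of the text-feature step mentioned above, the sketch below builds TF-IDF features and compresses them with SVD. It assumes scikit-learn's `TruncatedSVD` is the SVD in question (the notes only say "SVD in Sklearn"); the toy documents, the n-gram range, and the number of components are invented for the example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the competition's text fields
docs = [
    "red leather sofa with two seats",
    "blue fabric sofa bed",
    "wooden dining table with four chairs",
    "glass coffee table",
]

# Sparse TF-IDF matrix over unigrams and bigrams: shape (n_docs, n_terms)
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)

# Dense low-dimensional representation: shape (n_docs, 2)
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X_tfidf)
```

Both `X_tfidf` (raw) and `X_svd` (compressed) can be kept as separate feature sets, which fits the idea of training models on various feature sets.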
### It was the combination of the best features and a prudent validation technique that won the competition. All it takes to win is one or two very good features. Sometimes the training set and the test set follow different distributions, so learn to analyze the data distribution, remove noisy data, and add relevant features.
Data preprocessing: text cleaning. Feature extraction: find features useful to the task and explore new features.
Use ensemble methods: a model library containing hundreds or thousands of diverse models to combat overfitting and stabilize my results.
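One common way to build an ensemble from a model library is greedy forward selection with replacement on a validation set. The sketch below is only an assumption about what such a procedure might look like, not the winner's actual code; `ensemble_selection`, `rmse`, and the toy predictions are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def ensemble_selection(library_preds, y_valid, n_rounds=10):
    """Greedy forward selection with replacement.

    library_preds: array of shape (n_models, n_samples) holding each
    model's predictions on the validation set.
    Returns (selected indices, validation RMSE of their average).
    A model picked several times gets a proportionally larger weight.
    """
    library_preds = np.asarray(library_preds, dtype=float)
    selected = []
    running_sum = np.zeros(library_preds.shape[1])
    best_score = np.inf
    for _ in range(n_rounds):
        # Validation score of the ensemble if each candidate were added
        scores = [
            rmse(y_valid, (running_sum + p) / (len(selected) + 1))
            for p in library_preds
        ]
        best = int(np.argmin(scores))
        if scores[best] >= best_score:
            break  # no candidate improves the validation score
        best_score = scores[best]
        selected.append(best)
        running_sum += library_preds[best]
    return selected, best_score

# Toy library: two biased models whose errors cancel, plus one noisy model
y_valid = np.array([1.0, 2.0, 3.0, 4.0])
library = np.array([
    y_valid + 0.5,
    y_valid - 0.5,
    y_valid + np.array([2.0, -2.0, 2.0, -2.0]),
])
selected, score = ensemble_selection(library, y_valid)
```

On this toy data the two biased models are selected together because averaging them cancels their errors, while the noisy model is never picked.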
Do not ever underestimate the power of linear models. They can be much better than tree-based models or SVR with RBF/poly kernels when using raw TF-IDF features. They can be even better if you introduce appropriate nonlinearities.
Parameter tuning: Hyperopt
Keep your implementation flexible and scalable. This allows you to add new models to the model library very easily.


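Domain knowledge (look up relevant research papers); read the experience and tricks that others share; practice every day, practice makes perfect! Ask what kinds of features will affect the result.
Then remove obviously meaningless rows, such as those where the number of rooms, the number of beds, or the price is 0:   df[df['price']!=0]   or df[df.price!=0]. Rows with missing values can be dropped with df = df.dropna(axis=0)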
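The filtering step above can be sketched on a small table. The DataFrame below is invented for illustration; `bedrooms` is a hypothetical column name standing in for the room count.

```python
import numpy as np
import pandas as pd

# Hypothetical listings table
df = pd.DataFrame({
    'price': [120.0, 0.0, 85.0, 60.0],
    'bedrooms': [2, 1, 0, 1],
    'beds': [2, 1, 1, np.nan],
})

# Keep only rows where none of the checked columns is 0
check_cols = ['price', 'bedrooms', 'beds']
df = df[(df[check_cols] != 0).all(axis=1)]

# Drop rows that still contain a missing value
df = df.dropna(axis=0)
```

Note that a NaN compares as "not equal to 0", so the row with a missing `beds` value passes the first filter and is only removed by `dropna`.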
#Use a regular expression to remove the $ from the price, then convert to float with astype(float)

df['price'] = df['price'].replace(r'[\$,)]', '', regex=True).replace(r'[(]', '-', regex=True).astype(float)
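To see what the two regular expressions do, here is the same chain applied to a few made-up price strings. The first pattern strips `$`, `,` and `)`; the second turns a leading `(` into a minus sign, so accounting-style negatives such as `($20.00)` become `-20.00`.

```python
import pandas as pd

# Made-up price strings in the formats the patterns are meant to handle
prices = pd.Series(['$1,250.00', '$85.00', '($20.00)'])

cleaned = (
    prices
    .replace(r'[\$,)]', '', regex=True)   # drop "$", "," and ")"
    .replace(r'[(]', '-', regex=True)     # "(" marks a negative amount
    .astype(float)
)
# cleaned holds 1250.0, 85.0, -20.0
```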

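One-hot encoding: convert non-numeric variables into numeric ones:  rt_dummies = pd.get_dummies(df.room_type)
#Replace the original column with its one-hot encoding
alldata = pd.concat((df.drop('room_type', axis=1), rt_dummies.astype(int)), axis=1)  # axis=1 means column-wise; to replace several columns, pass them as a list in []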
allcols = alldata.columns
scattercols = ['price','accommodates', 'number_of_reviews', 'reviews_per_month', 'beds', 'availability_30', 'review_scores_rating']
axs = pd.plotting.scatter_matrix(alldata[scattercols], figsize=(12, 12), c='red')
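The one-hot replacement above handles a single column. When several categorical columns need replacing, the same pattern works with a list of column names. The table below is invented, and `bed_type` is a hypothetical second categorical column.

```python
import pandas as pd

df = pd.DataFrame({
    'price': [120.0, 85.0, 60.0],
    'room_type': ['Entire home/apt', 'Private room', 'Shared room'],
    'bed_type': ['Real Bed', 'Futon', 'Real Bed'],  # hypothetical column
})

cat_cols = ['room_type', 'bed_type']

# get_dummies on a DataFrame prefixes each new column with the source column name
dummies = pd.get_dummies(df[cat_cols]).astype(int)

# Drop the original categorical columns and append the 0/1 columns
alldata = pd.concat((df.drop(cat_cols, axis=1), dummies), axis=1)
```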

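#np.argsort(array) returns the indices that would sort the array, i.e. the position in the original array of each element of the sorted result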
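A small example of `np.argsort`, using made-up scores, including the common trick of reversing the result to get the indices of the largest values:

```python
import numpy as np

scores = np.array([0.3, 0.9, 0.1, 0.5])

order = np.argsort(scores)   # indices in ascending order of value: [2, 0, 3, 1]
sorted_scores = scores[order]  # [0.1, 0.3, 0.5, 0.9]

top2 = order[::-1][:2]       # indices of the two largest values: [1, 3]
```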



