A Small Feature Engineering Case Study
Source: Internet  Published: 致远软件合肥  Editor: 程序博客网  Time: 2024/05/16 15:55
In [29]:
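# Load the data first
import pandas as pd
# on_bad_lines='skip' replaces the original error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv('kaggle_bike_competition_train.csv', header=0, on_bad_lines='skip')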
In [30]:
# Take a quick look at the data
data.head()
Out[30]:
In [32]:
# Process the time field: split the timestamp into a date part and a time part
temp = pd.DatetimeIndex(data['datetime'])
data['date'] = temp.date
data['time'] = temp.time
data.head()
Out[32]:
In [33]:
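# Build the hour field
# astype(str) makes sure the datetime.time objects are parsed as "HH:MM:SS" strings
data['hour'] = pd.to_datetime(data['time'].astype(str), format="%H:%M:%S")
data['hour'] = pd.Index(data['hour']).hour
data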
Out[33]:
In [35]:
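# Process the time features: derive a categorical day-of-week variable
data['dayofweek'] = pd.DatetimeIndex(data.date).dayofweek
# Process the time features: derive an elapsed-time variable (days since the first record)
# .dt.days stands in for the original .astype('timedelta64[D]'), which pandas 2.x no longer supports
dates = pd.to_datetime(data['date'])
data['dateDays'] = (dates - dates.iloc[0]).dt.days
data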
Out[35]:
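The time features built above can also be derived in one pass with the pandas `.dt` accessor. The following is only a minimal sketch on a made-up three-row frame (the `toy` frame and its values are assumptions for illustration, not rows from the competition file):

```python
import pandas as pd

# Toy stand-in for the competition CSV (values are made up for illustration).
toy = pd.DataFrame({"datetime": ["2011-01-01 00:00:00",
                                 "2011-01-01 13:00:00",
                                 "2011-01-03 08:00:00"]})

ts = pd.to_datetime(toy["datetime"])   # parse the strings into datetime64 values
toy["hour"] = ts.dt.hour               # hour of day, 0-23
toy["dayofweek"] = ts.dt.dayofweek     # Monday=0 ... Sunday=6

# Whole days elapsed since the calendar date of the first row
day = ts.dt.normalize()                # drop the time-of-day part
toy["dateDays"] = (day - day.iloc[0]).dt.days

print(toy)
```

Working on the parsed timestamps directly avoids the round trip through separate `date` and `time` object columns, at the cost of not keeping those columns around for inspection.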
In [36]:
byday = data.groupby('dayofweek')
# Total rentals by unregistered (casual) users for each day of the week
byday['casual'].sum().reset_index()
Out[36]:
In [37]:
# Total rentals by registered users for each day of the week
byday['registered'].sum().reset_index()
Out[37]:
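The two per-weekday totals are easier to compare when they sit in the same table. A minimal sketch, using a made-up frame (the numbers are assumptions chosen only to make the pattern visible, not real competition data):

```python
import pandas as pd

# Made-up rental counts; dayofweek uses Monday=0 ... Sunday=6.
toy = pd.DataFrame({
    "dayofweek":  [0, 0, 5, 5, 6],
    "casual":     [3, 5, 20, 30, 25],
    "registered": [50, 60, 10, 20, 15],
})

# Sum both user types per weekday in a single groupby
totals = toy.groupby("dayofweek")[["casual", "registered"]].sum().reset_index()
print(totals)
```

If casual and registered users behave differently on weekends in the real data, a side-by-side table like this is where it would show up, which is the motivation for adding explicit weekend indicators.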
In [38]:
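# Add indicator columns for Saturday and Sunday (dayofweek: Monday=0, so Saturday=5, Sunday=6)
# .loc avoids the chained assignment data.Saturday[...] = 1, which may not write back to data
data['Saturday'] = 0
data.loc[data.dayofweek == 5, 'Saturday'] = 1
data['Sunday'] = 0
data.loc[data.dayofweek == 6, 'Sunday'] = 1
data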
Out[38]:
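The same indicator columns can be written without any assignment through a mask, by casting a boolean comparison to integers. A small sketch on a toy frame (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"dayofweek": [0, 4, 5, 6]})  # Monday, Friday, Saturday, Sunday

# A boolean comparison cast to int gives a 0/1 indicator column directly
toy["Saturday"] = (toy["dayofweek"] == 5).astype(int)
toy["Sunday"] = (toy["dayofweek"] == 6).astype(int)

print(toy)
```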
In [39]:
# remove old data features
dataRel = data.drop(['datetime', 'count','date','time','dayofweek'], axis=1)
dataRel.head()
Out[39]:
In [40]:
from sklearn.feature_extraction import DictVectorizer
# Put the continuous-valued attributes into one dict
featureConCols = ['temp','atemp','humidity','windspeed','dateDays','hour']
dataFeatureCon = dataRel[featureConCols]
dataFeatureCon = dataFeatureCon.fillna( 'NA' ) #in case I missed any
X_dictCon = dataFeatureCon.T.to_dict().values()
# Put the discrete-valued attributes into another dict
featureCatCols = ['season','holiday','workingday','weather','Saturday', 'Sunday']
dataFeatureCat = dataRel[featureCatCols]
dataFeatureCat = dataFeatureCat.fillna( 'NA' ) #in case I missed any
X_dictCat = dataFeatureCat.T.to_dict().values()
# Vectorize the features (the same vectorizer is refit for each group)
vec = DictVectorizer(sparse = False)
X_vec_cat = vec.fit_transform(X_dictCat)
X_vec_con = vec.fit_transform(X_dictCon)
In [41]:
dataFeatureCon.head()
Out[41]:
In [42]:
X_vec_con
Out[42]:
In [43]:
dataFeatureCat.head()
Out[43]:
In [44]:
X_vec_cat
Out[44]:
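To see what `DictVectorizer` does with each row dict, here is a minimal sketch on two made-up rows (the `rows` list and the `weather` strings are assumptions for illustration; the real `weather` column holds integer codes). Numeric values pass through as one column per key, while string values are expanded into one `key=value` column each:

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical rows: one numeric feature, one string-valued feature
rows = [
    {"temp": 9.84, "weather": "clear"},
    {"temp": 14.76, "weather": "rain"},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

print(vec.feature_names_)  # column order is sorted by feature name
print(X)
```

This is also why filling missing numeric values with the string `'NA'` deserves some care: a string in an otherwise numeric column would show up as its own `key=NA` indicator column rather than as a number.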
In [18]:
from sklearn import preprocessing
# Standardize the continuous-valued data
scaler = preprocessing.StandardScaler().fit(X_vec_con)
X_vec_con = scaler.transform(X_vec_con)
X_vec_con
Out[18]:
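As a sanity check on what standardization means here, the sketch below compares `StandardScaler` with the manual formula, subtracting each column's mean and dividing by its population standard deviation. The small matrix is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up matrix: 3 samples, 2 continuous features on different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)

# Manual version: (x - column mean) / column std, with ddof=0
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std, manual))
print(X_std.mean(axis=0), X_std.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so features such as `humidity` and `windspeed` end up on a comparable scale.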
In [20]:
from sklearn import preprocessing
# One-hot encode the discrete-valued data
enc = preprocessing.OneHotEncoder()
enc.fit(X_vec_cat)
X_vec_cat = enc.transform(X_vec_cat).toarray()
X_vec_cat
Out[20]:
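The effect of the one-hot step is easiest to see on a tiny matrix. In this sketch the two columns stand in for `season` and `holiday` (the values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer-coded categories: column 0 ~ season, column 1 ~ holiday
X_cat = np.array([[1, 0],
                  [2, 1],
                  [4, 0]])

enc = OneHotEncoder()
X_onehot = enc.fit_transform(X_cat).toarray()  # fit_transform returns a sparse matrix

print(enc.categories_)  # distinct values found in each input column
print(X_onehot)
```

Each input column becomes one output column per distinct value, so three seasons plus two holiday values give five columns. Note that columns that are already 0/1, such as `holiday` or the `Saturday` and `Sunday` flags, are also split in two, which is redundant but harmless for many models.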
In [22]:
import numpy as np
# combine cat & con features
X_vec = np.concatenate((X_vec_con,X_vec_cat), axis=1)
X_vec
Out[22]:
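As an alternative to scaling, encoding, and concatenating by hand, scikit-learn's `ColumnTransformer` can apply both transformations and stack the results in one object. This is not what the notebook does, only a sketch of the same idea on a made-up frame (the `toy` frame and the step names `"con"` and `"cat"` are assumptions for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with two continuous columns and one categorical column
toy = pd.DataFrame({
    "temp":     [9.84, 14.76, 20.5],
    "humidity": [81, 75, 60],
    "season":   [1, 2, 4],
})

pre = ColumnTransformer(
    [
        ("con", StandardScaler(), ["temp", "humidity"]),  # standardize continuous columns
        ("cat", OneHotEncoder(), ["season"]),             # one-hot encode categorical columns
    ],
    sparse_threshold=0,  # always return a dense array
)

X = pre.fit_transform(toy)
print(X.shape)  # 2 scaled columns + 3 one-hot columns
```

The output columns follow the order of the listed transformers, continuous first and categorical second, matching the order used in the `np.concatenate` call.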
In [23]:
# Vectorize the targets Y: one vector for registered users, one for casual users
Y_vec_reg = dataRel['registered'].values.astype(float)
Y_vec_cas = dataRel['casual'].values.astype(float)
In [24]:
# Take a look at the processed target values
Y_vec_reg
Out[24]:
In [25]:
Y_vec_cas
Out[25]: