sklearn PCA in Practice
This post records how to apply PCA in sklearn; the theoretical derivation is at http://blog.csdn.net/huangyi_906/article/details/75578213.
The description given in the official documentation:
class sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
Parameters:
- n_components=None: the target dimensionality after reduction. If a float in (0, 1) is given, enough components are kept to explain that fraction of the total variance. By default all components are kept. (A short sketch follows the method list below.)
Attributes:
- components_: the principal axes in feature space, an array of shape (n_components, n_features)
- explained_variance_ratio_: the fraction of total variance explained by each component
- n_components_: an integer, the number of components actually kept
Methods:
- fit(x): fit the model
- transform(x): apply the dimensionality reduction
- fit_transform(x): fit the model, then reduce
- inverse_transform(x): map reduced data back to the original space (only an approximate reconstruction once components have been discarded)
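A minimal sketch of this API on toy data (the array X below is invented for illustration); it shows the float form of n_components and the approximate nature of inverse_transform:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 5)                       # toy data: 200 samples, 5 features
X[:, 3] = X[:, 0] + 0.01 * rng.randn(200)   # make one feature nearly redundant

pca = PCA(n_components=0.99)    # float in (0, 1): keep 99% of the variance
X_red = pca.fit_transform(X)    # fit the model, then project the data

print(pca.n_components_)                # number of components actually kept
print(pca.explained_variance_ratio_)    # variance fraction per kept component
X_back = pca.inverse_transform(X_red)   # map back to the original space
print(np.abs(X - X_back).max())         # small but nonzero: approximate reconstruction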
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Build all_data by concatenating train and test so that feature
# transformations are applied consistently to both.
train_df = pd.read_csv('train_b.csv')
test_df = pd.read_csv('test_b.csv')
all_data = pd.concat([train_df.drop('y', axis=1), test_df])
# Some columns hold strings; use LabelEncoder (fitted on the union of
# train and test values) to map them to integers.
for c in train_df.columns:
    if train_df[c].dtype == 'object':
        lbl = LabelEncoder()
        lbl.fit(list(train_df[c].values) + list(test_df[c].values))
        train_df[c] = lbl.transform(list(train_df[c].values))
        test_df[c] = lbl.transform(list(test_df[c].values))
print(train_df.shape, test_df.shape, all_data.shape)
(4209, 378) (4209, 377) (8418, 377)
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
x = train_df.drop('y', axis=1)
y = train_df.y
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=1)
First, skip PCA and fit a Lasso model on the raw features directly, using the R² score to gauge performance:
las = LassoCV(alphas=np.logspace(-3, 0, 10), cv=3, normalize=True)
las.fit(x_train, y_train)
y_hat = las.predict(x_test)
r2 = r2_score(y_test, y_hat)
print(r2)
0.600606371001
Now reduce the data with PCA first:
pca = PCA(n_components=100)
x_pca = pca.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_pca, y, train_size=0.7, random_state=1)
las = LassoCV(alphas=np.logspace(-3, 0, 10), cv=3, normalize=True)
las.fit(x_train, y_train)
y_hat = las.predict(x_test)
r2 = r2_score(y_test, y_hat)
print(las.score(x_test, y_test))  # a regressor's score() is the R² score, so this equals r2
print(r2)
0.585078356854
0.585078356854
# Attributes:
# - components_: the principal axes, shape (n_components, n_features)
# - explained_variance_ratio_: fraction of variance explained by each component
# - n_components_: number of components kept
print(pca.components_.shape)
print(pca.components_)
print(pca.explained_variance_ratio_)
print(pca.n_components_)
(100, 377)
[[  9.99997229e-01  -7.03051423e-05   2.99568128e-04 ...,  -1.56006146e-07  -1.37294911e-07   4.49824033e-07]
 ...,
 [  4.75846061e-06  -3.18865700e-04  -1.91695177e-03 ...,   2.27787722e-02   8.69893184e-04  -9.42942931e-03]]
[  9.99904441e-01   4.13130289e-05   2.19765863e-05 ...,   3.12481222e-09   3.01224773e-09   2.99584028e-09]
100
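Since the first component already explains about 99.99% of the variance here, the cumulative ratio shows how little the remaining components add; a quick sketch reusing the pca object fitted above:

cum = np.cumsum(pca.explained_variance_ratio_)   # running total of explained variance
print(cum[0], cum[9], cum[-1])                   # after 1, 10, and all 100 components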
PCA requires us to choose how many components to keep, so the result depends on that choice. Below, a Pipeline combined with GridSearchCV searches for the number of components with the best score.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=1)
# Package the processing steps into a Pipeline
steps = [("pca", PCA()), ("las", LassoCV(alphas=np.logspace(-3, 0, 10), cv=3, normalize=True))]
pip = Pipeline(steps)
# Candidate n_components: every 10th value in [1, 370), 37 candidates in total
gsea = GridSearchCV(pip, param_grid={'pca__n_components': np.arange(1, 370, 10)}, cv=3)
gsea.fit(x_train, y_train)
print(gsea.score(x_test, y_test))
print(gsea.best_params_)
0.586531700728
{'pca__n_components': 141}
We find that after PCA the model's R² score is slightly lower. Possible reasons:
- PCA amounts to diagonalizing the covariance matrix. That matrix records the pairwise covariances between features, and after diagonalization all off-diagonal entries are zero, so the correlations between the transformed features are removed (see the check after this list).
- PCA is an unsupervised compression algorithm: it keeps only the highest-variance directions of the inputs and ignores the target, so compression can make the relationship between predictors and target harder to model.
- PCA tends to work well when the features are strongly correlated.
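To make the decorrelation claim concrete, here is a small self-contained check (on random correlated data invented for the example, not the competition data) that the covariance matrix of PCA outputs is diagonal up to numerical noise:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
a = rng.randn(500)
X = np.c_[a, a + 0.1 * rng.randn(500), rng.randn(500)]   # first two columns strongly correlated

Z = PCA(n_components=3).fit_transform(X)
C = np.cov(Z, rowvar=False)     # covariance matrix of the transformed features
print(np.round(C, 6))           # off-diagonal entries ~ 0: features are decorrelated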