Machine Learning


PCA is a linear dimensionality-reduction algorithm: it multiplies the original data matrix data by a matrix W, i.e., it applies a linear transformation. For example:

data: m * n, i.e., m samples with n features; W: n * (n/2). Then data * W is m * (n/2), which achieves the dimensionality reduction.
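As a quick illustration of the shape arithmetic (a minimal sketch with random data; the W here is an arbitrary matrix, not a trained projection):

import numpy as np

m, n = 100, 8
data = np.random.randn(m, n)      # m samples, n features
W = np.random.randn(n, n // 2)    # an arbitrary n x (n/2) matrix
reduced = data @ W                # linear transformation
print(reduced.shape)              # (100, 4): m samples, n/2 features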

For a detailed derivation of PCA, see: http://download.csdn.net/download/zk_j1994/9927042


1. Algorithm Overview

1) Let the dataset be X, where each row is a sample and each column is a feature;
2) Compute the eigenvalues and eigenvectors of X'X (X must be centered first, so that X'X is proportional to the covariance matrix);
3) Take the eigenvectors corresponding to the largest eigenvalues to form the matrix W, which is the projection matrix;
4) XW is the dimensionality-reduced data; note that X here is the centered X. A compact NumPy sketch of these four steps follows.
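As referenced above, here is a compact sketch of the four steps (a minimal reference version; it uses np.linalg.eigh, which is suited to symmetric matrices and returns eigenvalues in ascending order, whereas the class-based implementation in section 2 uses np.linalg.eig):

import numpy as np

def pca_reduce(X0, k):
    X = X0 - X0.mean(axis=0)                  # 1) center the data
    C = np.cov(X, rowvar=False)               # 2) covariance matrix, X'X / (m - 1)
    vals, vecs = np.linalg.eigh(C)            # 3) eigendecomposition (ascending)
    W = vecs[:, np.argsort(vals)[::-1][:k]]   #    top-k eigenvectors as projection matrix
    return X @ W                              # 4) project the centered data

X0 = np.random.randn(100, 8)
print(pca_reduce(X0, 2).shape)                # (100, 2)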


2. Implementing PCA

2.1 Utility functions: loading data and plotting

import numpy as np
import matplotlib.pyplot as plt

def load_data():
    """Load the Iris data and encode the class label (column 4) as 0 / 1 / 2."""
    with open("../PCA/data/Iris.txt", "r") as f:
        iris = []
        for line in f.readlines():
            temp = line.strip().split(",")
            if temp[4] == "Iris-setosa":
                temp[4] = 0
            elif temp[4] == "Iris-versicolor":
                temp[4] = 1
            elif temp[4] == "Iris-virginica":
                temp[4] = 2
            else:
                raise ValueError("data error.")
            iris.append(temp)
    iris = np.array(iris, dtype=float)
    return iris

def draw_result(new_trainX, iris):
    """
    new_trainX:     dimensionality-reduced data
    iris:           original data
    """
    plt.figure()
    # Iris-setosa
    setosa = new_trainX[iris[:, 4] == 0]
    plt.scatter(setosa[:, 0], setosa[:, 1], color="red", label="Iris-setosa")

    # Iris-versicolor
    versicolor = new_trainX[iris[:, 4] == 1]
    plt.scatter(versicolor[:, 0], versicolor[:, 1], color="orange", label="Iris-versicolor")

    # Iris-virginica
    virginica = new_trainX[iris[:, 4] == 2]
    plt.scatter(virginica[:, 0], virginica[:, 1], color="blue", label="Iris-virginica")
    plt.legend()
    plt.show()


2.2 The core PCA algorithm

Algorithm steps:

1) Center the data X0: X = X0 - mean(X0);

2) Compute the covariance matrix of X0: cov(X0) = X'X / (m - 1), where X is the centered data and m is the number of samples;

3) Perform an eigendecomposition of the covariance matrix;

4) Take the eigenvectors corresponding to the K largest eigenvalues to form the k-eigenvector matrix;

5) Multiply the (centered) data by the k-eigenvector matrix. A quick numerical check of the identity in step 2 follows, before the implementation.
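As a sanity check of step 2 (a throwaway sketch with random data; note that np.cov centers its input internally, so it can be fed the raw X0 directly):

import numpy as np

X0 = np.random.randn(50, 4)
X = X0 - X0.mean(axis=0)                  # centered data
m = X.shape[0]
manual_cov = X.T @ X / (m - 1)            # X'X / (m - 1)
print(np.allclose(manual_cov, np.cov(X0, rowvar=False)))  # True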

class PCA:
    def __init__(self, dimension):
        # dimensionality of the reduced data
        self.dimension = dimension

    def _data_centering(self, train_x):
        """ 1. Center the data """
        return train_x - np.mean(train_x, axis=0)

    def _cal_covMat(self, trainX_centered):
        """ 2. Compute the covariance matrix
        trainX_centered:    the centered trainX data
        """
        return np.cov(trainX_centered, rowvar=False)

    def _eig_decomposition(self, trainX_covMat):
        """ 3. Eigendecomposition """
        featureVal, featureVec = np.linalg.eig(trainX_covMat)
        return featureVal, featureVec

    def _gen_result_data(self, trainX_centered, featureVal, featureVec):
        """ 4. Generate the dimensionality-reduced data
        featureVal:     eigenvalues
        featureVec:     eigenvectors
        W:              linear transformation (projection) matrix
        """
        # np.linalg.eig does not guarantee any ordering, so sort the
        # eigenvectors by eigenvalue in descending order before truncating
        order = np.argsort(featureVal)[::-1]
        W = featureVec[:, order[:self.dimension]]
        return np.dot(trainX_centered, W)

2.3 main

def main(dimension=2):
    iris = load_data()

    # reduce to 2 dimensions
    pca = PCA(dimension)

    # center the samples
    iris_centered = pca._data_centering(iris[:, 0:4])

    # covariance matrix of the centered data
    iris_covMat = pca._cal_covMat(iris_centered)

    # eigenvalues and eigenvectors
    featureVal, featureVec = pca._eig_decomposition(iris_covMat)

    # dimensionality-reduced data
    new_trainX = pca._gen_result_data(iris_centered, featureVal, featureVec)

    # visualize the reduced data
    draw_result(new_trainX, iris)


2.4 Full code

# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

class PCA:
    def __init__(self, dimension):
        # dimensionality of the reduced data
        self.dimension = dimension

    def _data_centering(self, train_x):
        """ 1. Center the data """
        return train_x - np.mean(train_x, axis=0)

    def _cal_covMat(self, trainX_centered):
        """ 2. Compute the covariance matrix
        trainX_centered:    the centered trainX data
        """
        return np.cov(trainX_centered, rowvar=False)

    def _eig_decomposition(self, trainX_covMat):
        """ 3. Eigendecomposition """
        featureVal, featureVec = np.linalg.eig(trainX_covMat)
        return featureVal, featureVec

    def _gen_result_data(self, trainX_centered, featureVal, featureVec):
        """ 4. Generate the dimensionality-reduced data
        featureVal:     eigenvalues
        featureVec:     eigenvectors
        W:              linear transformation (projection) matrix
        """
        # sort eigenvectors by eigenvalue in descending order before truncating
        order = np.argsort(featureVal)[::-1]
        W = featureVec[:, order[:self.dimension]]
        return np.dot(trainX_centered, W)

def load_data():
    """Load the Iris data and encode the class label (column 4) as 0 / 1 / 2."""
    with open("../PCA/data/Iris.txt", "r") as f:
        iris = []
        for line in f.readlines():
            temp = line.strip().split(",")
            if temp[4] == "Iris-setosa":
                temp[4] = 0
            elif temp[4] == "Iris-versicolor":
                temp[4] = 1
            elif temp[4] == "Iris-virginica":
                temp[4] = 2
            else:
                raise ValueError("data error.")
            iris.append(temp)
    iris = np.array(iris, dtype=float)
    return iris

def draw_result(new_trainX, iris):
    """
    new_trainX:     dimensionality-reduced data
    iris:           original data
    """
    plt.figure()
    # Iris-setosa
    setosa = new_trainX[iris[:, 4] == 0]
    plt.scatter(setosa[:, 0], setosa[:, 1], color="red", label="Iris-setosa")

    # Iris-versicolor
    versicolor = new_trainX[iris[:, 4] == 1]
    plt.scatter(versicolor[:, 0], versicolor[:, 1], color="orange", label="Iris-versicolor")

    # Iris-virginica
    virginica = new_trainX[iris[:, 4] == 2]
    plt.scatter(virginica[:, 0], virginica[:, 1], color="blue", label="Iris-virginica")
    plt.legend()
    plt.show()

def main(dimension=2):
    iris = load_data()

    # reduce to 2 dimensions
    pca = PCA(dimension)

    # center the samples
    iris_centered = pca._data_centering(iris[:, 0:4])

    # covariance matrix of the centered data
    iris_covMat = pca._cal_covMat(iris_centered)

    # eigenvalues and eigenvectors
    featureVal, featureVec = pca._eig_decomposition(iris_covMat)

    # dimensionality-reduced data
    new_trainX = pca._gen_result_data(iris_centered, featureVal, featureVec)

    # visualize the reduced data
    draw_result(new_trainX, iris)

if __name__ == "__main__":
    main(dimension=2)

3. PCA in practice with sklearn

1)fit

2)transform

# -*- coding: utf-8 -*-
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

from pca import load_data

iris = load_data()

clf = PCA(n_components=2)
clf.fit(iris[:, 0:4])
new_trainX = clf.transform(iris[:, 0:4])

plt.figure()
# Iris-setosa
setosa = new_trainX[iris[:, 4] == 0]
plt.scatter(setosa[:, 0], setosa[:, 1], color="red", label="Iris-setosa")

# Iris-versicolor
versicolor = new_trainX[iris[:, 4] == 1]
plt.scatter(versicolor[:, 0], versicolor[:, 1], color="orange", label="Iris-versicolor")

# Iris-virginica
virginica = new_trainX[iris[:, 4] == 2]
plt.scatter(virginica[:, 0], virginica[:, 1], color="blue", label="Iris-virginica")

plt.legend()
plt.show()
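The fitted estimator also exposes explained_variance_ratio_, which helps decide how many components to keep (a small add-on, reusing the clf fitted above):

import numpy as np

# fraction of total variance captured by each of the 2 components
print(clf.explained_variance_ratio_)
# cumulative variance retained as components are added
print(np.cumsum(clf.explained_variance_ratio_))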

4. Notes
1) As you can see, our result above differs from sklearn's: the Y-axis coordinates are exactly negatives of each other. This is caused by the non-uniqueness of eigenvectors: if v is an eigenvector, so is -v, so each projected coordinate is only determined up to sign. A small script that checks the two results agree up to sign follows after these notes.
2) PCA's optimization objective is to maximize the variance of the samples after projection; the projection matrix is made up of eigenvectors of the covariance matrix. The more eigenvectors the projection matrix contains, the larger the variance after projection, but the less pronounced the dimensionality reduction.
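To make point 1 concrete, here is a hypothetical check. The name new_trainX stands for the output of the hand-rolled PCA from section 2; sk_trainX is a name introduced here (not used above) for sklearn's output, assuming both were computed on the same data in one script: flip the sign of each column where necessary, then compare.

import numpy as np

# new_trainX: output of the hand-rolled PCA (section 2)
# sk_trainX:  sklearn's output (section 3); hypothetical name
for j in range(new_trainX.shape[1]):
    col, sk_col = new_trainX[:, j], sk_trainX[:, j]
    if np.dot(col, sk_col) < 0:   # columns point in opposite directions
        col = -col                # undo the arbitrary sign flip
    print(j, np.allclose(col, sk_col))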