PCA: Principal Component Analysis [1]

Source: Internet. Editor: 程序博客网. Time: 2024/06/11 03:08

Wikipedia defines it as follows:

Principal component analysis (PCA) is a statistical procedure that uses (1) orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of (2) linearly uncorrelated variables called principal components. (3) The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that (4) the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are (5) guaranteed to be independent if the data set is jointly normally distributed. PCA is sensitive to the relative scaling of the original variables.

Here is my own understanding of the definition above:

【(1)】

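PCA is simply an orthogonal transformation. The space in which the samples live is described by a set of orthogonal basis vectors, and each sample's coordinates are expressed in terms of that basis. PCA uses an orthogonal transformation to move the sample data from the original space into another one: on the surface, the basis changes from one set of orthogonal vectors to another, and concretely the coordinate values of the samples change. As I understand it, however the space and the coordinates change, the relationships between samples do not, because once the samples are fixed so are the relationships between them (an orthogonal transformation preserves the distances and angles between samples). PCA only changes how the samples are represented.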
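To make the "relationships between samples do not change" claim concrete, here is a minimal sketch that rotates a few points and compares pairwise distances before and after. The points, the 30 degree angle, and the helper names `rotate` and `distance` are all made up for illustration; a 2-D rotation is just one convenient example of an orthogonal transformation.

```python
import math

def rotate(point, theta):
    """Apply a 2-D rotation (one kind of orthogonal transformation) to a point."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical sample points and an arbitrary rotation angle.
samples = [(1.0, 2.0), (3.0, -1.0), (-2.0, 0.5)]
theta = math.radians(30)
rotated = [rotate(p, theta) for p in samples]

# The coordinates change, but every pairwise distance stays the same.
for i in range(len(samples)):
    for j in range(i + 1, len(samples)):
        before = distance(samples[i], samples[j])
        after = distance(rotated[i], rotated[j])
        print(f"pair ({i}, {j}): before={before:.6f}, after={after:.6f}")
```

The printed "before" and "after" values should agree up to floating-point rounding, even though every coordinate has changed.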

【(2)】

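So why change how the samples are represented, or in other words, why transform the space at all? This brings in the notion of correlation, which I had misunderstood for a long time. I used to think it meant the relationship between sample a and sample b. But the correlation here is not between samples; it is between the dimensions (the individual variables) of the joint probability distribution behind the sample data. Take two dimensions as an example: suppose we have a joint distribution P(X, Y) and draw independent random samples from it to obtain {(x_1, y_1), (x_2, y_2), ......, (x_n, y_n)}. What we want is to use these samples to characterize P(X, Y), that is, to measure the relationship between the random variables X and Y. This leads to covariance. What does covariance do? Wikipedia defines it as follows: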

Covariance is a measure of how much two random variables change together. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the smaller values, i.e., the variables tend to show similar behavior, the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the smaller values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret.

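Covariance measures how much two random variables vary together, and its sign indicates positive or negative correlation. It is defined as:

$$\operatorname{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$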

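If two variables are independent, their covariance is 0, but the converse does not hold. For example, let y = x² with x uniformly distributed on [-1, 1]. The covariance of x and y is 0, yet the two are obviously dependent. This is because covariance only measures the linear relationship between two variables and cannot capture a nonlinear, quadratic relationship like this one! However, Wikipedia says that if two random variables are jointly normally distributed, then zero covariance does imply that they are independent. (I don't fully understand this point yet.......)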
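The y = x² counterexample can be checked numerically. The sketch below is only an illustration: it replaces the uniform distribution on [-1, 1] with an evenly spaced symmetric grid (an assumption made to keep the result deterministic), and the helper `covariance` is a hypothetical name that divides by n.

```python
def covariance(xs, ys):
    """Population covariance: the mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Evenly spaced grid on [-1, 1], standing in for a uniform distribution.
xs = [i / 1000 for i in range(-1000, 1001)]

# y is completely determined by x, yet the covariance is (numerically) zero.
ys = [x * x for x in xs]
cov_xy = covariance(xs, ys)

# For comparison, a linear relationship gives a clearly non-zero covariance.
cov_linear = covariance(xs, [2 * x + 1 for x in xs])

print("cov(x, x^2)    =", cov_xy)
print("cov(x, 2x + 1) =", cov_linear)
```

The first value comes out as zero up to rounding because the positive and negative halves of the grid cancel, while the linear case does not cancel.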

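Covariance measures the linear relationship between two variables. In a multi-dimensional space, expressing the pairwise relationships among several dimensions requires a covariance matrix. A vector in n-dimensional space consists of n random variables X = [X1, X2, ......, Xn], backed by a joint distribution P(X1, X2, ......, Xn) over those n variables, and the relationships among the dimensions are expressed by the covariance matrix:

$$\Sigma_{ij} = \operatorname{cov}(X_i, X_j) = E\big[(X_i - E[X_i])(X_j - E[X_j])\big], \quad i, j = 1, \dots, n$$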

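A two-dimensional example:

For P1(X, Y), suppose Y = X, with X and Y taking values in {1, -1}; sampling it gives {(1, 1), (-1, -1)}.

For P2(X, Y), suppose X and Y are independent, with X and Y taking values in {1, -1}; sampling it gives {(1, 1), (-1, -1), (1, -1), (-1, 1)}.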

We then compute the covariance matrix of each.
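The two covariance matrices can be reproduced with a short sketch. It assumes the listed points are equally likely and divides by n (the population convention); dividing by n - 1 instead would rescale the entries but leave the zero pattern unchanged. The helper name `covariance_matrix` is made up for this illustration.

```python
def covariance_matrix(samples):
    """Population covariance matrix (divide by n) of a list of equal-length tuples."""
    n = len(samples)
    dim = len(samples[0])
    means = [sum(s[k] for s in samples) / n for k in range(dim)]
    return [
        [sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / n
         for j in range(dim)]
        for i in range(dim)
    ]

p1_samples = [(1, 1), (-1, -1)]                   # Y = X
p2_samples = [(1, 1), (-1, -1), (1, -1), (-1, 1)]  # X and Y independent

cov_p1 = covariance_matrix(p1_samples)
cov_p2 = covariance_matrix(p2_samples)

print(cov_p1)  # [[1.0, 1.0], [1.0, 1.0]]
print(cov_p2)  # [[1.0, 0.0], [0.0, 1.0]]
```

For P1 the off-diagonal entries equal the variances, reflecting that Y is fully determined by X; for P2 they vanish.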

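As can be seen, for P2(X, Y), where X and Y are independent, every entry of the covariance matrix off the diagonal is 0. Two important properties of covariance matrices are worth stating here:

1) A covariance matrix is symmetric.

2) A covariance matrix is positive semi-definite.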
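Both properties can be checked on a small example. The sketch below uses made-up sample points and hypothetical helper names; it relies on the identity that the quadratic form v^T Σ v equals the variance of the samples projected onto v, which is why it can never be negative.

```python
import math

def covariance_matrix(samples):
    """Population covariance matrix (divide by n) of a list of equal-length tuples."""
    n = len(samples)
    dim = len(samples[0])
    means = [sum(s[k] for s in samples) / n for k in range(dim)]
    return [
        [sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / n
         for j in range(dim)]
        for i in range(dim)
    ]

def quadratic_form(v, matrix):
    """Compute v^T * matrix * v."""
    size = len(v)
    return sum(v[i] * matrix[i][j] * v[j] for i in range(size) for j in range(size))

def projected_variance(samples, v):
    """Population variance of the samples projected onto direction v."""
    proj = [sum(vi * si for vi, si in zip(v, s)) for s in samples]
    mean = sum(proj) / len(proj)
    return sum((p - mean) ** 2 for p in proj) / len(proj)

# Hypothetical 2-D samples and a handful of arbitrary test directions.
samples = [(2.0, 1.0), (0.0, -1.0), (-1.0, 0.5), (3.0, 2.5)]
directions = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0), (0.3, -2.0)]

cov = covariance_matrix(samples)

# Property 1: symmetry, cov[i][j] == cov[j][i].
is_symmetric = all(math.isclose(cov[i][j], cov[j][i]) for i in range(2) for j in range(2))

# Property 2: v^T cov v is the variance along v, hence non-negative.
checks = [(quadratic_form(v, cov), projected_variance(samples, v)) for v in directions]

print("symmetric:", is_symmetric)
for v, (q, var) in zip(directions, checks):
    print(f"v={v}: v^T cov v={q:.6f}, variance along v={var:.6f}")
```

Testing a few directions is only a spot check, not a proof; the identity with the projected variance is what makes the property hold for every v.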

【(3)，(4)，(5)】

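Now back to why we perform the orthogonal transformation. Through it we can find an orthogonal basis such that, in the transformed space, the covariance between any two of the random variables (the dimension components) of P(X1, X2, ......, Xn) is 0 (this does not mean the variables are independent). More importantly, the diagonal entries of the transformed covariance matrix give the variance of each random variable of the joint distribution. Based on the size of these diagonal entries, we can reasonably discard the components with small variance. This reduces the cost of storing the data and, more importantly, can sometimes remove noise (this still needs checking; it seems the distribution must satisfy certain assumptions, as in the signal example that follows). We know that noise gets mixed into a signal as it propagates, but it is usually assumed that the variance of the signal along its true direction is far greater than the variance of the noise, so the covariance matrix lets us discard the interference from the noise!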
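The whole idea can be sketched in two dimensions using only the standard library. Everything here is an illustrative assumption: the synthetic "signal plus noise" data, the 40 degree signal direction, the fixed random seed, and the helper names. For a symmetric 2x2 matrix [[a, b], [b, c]] the direction of largest variance has the closed-form angle 0.5 * atan2(2b, a - c), which avoids a general eigen-solver.

```python
import math
import random

def covariance_matrix(samples):
    """Population covariance matrix (divide by n) of 2-D samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / n
    syy = sum((y - my) ** 2 for _, y in samples) / n
    sxy = sum((x - mx) * (y - my) for x, y in samples) / n
    return [[sxx, sxy], [sxy, syy]]

def principal_axes(cov):
    """Orthonormal eigenvectors of a symmetric 2x2 matrix, largest variance first."""
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    theta = 0.5 * math.atan2(2 * b, a - c)
    first = (math.cos(theta), math.sin(theta))
    second = (-math.sin(theta), math.cos(theta))
    return first, second

# Synthetic data: a strong signal along a 40 degree direction plus weak noise
# in the perpendicular direction (both assumed Gaussian for this sketch).
rng = random.Random(0)
angle = math.radians(40)
data = []
for _ in range(200):
    t = rng.gauss(0, 3.0)      # signal, large variance
    noise = rng.gauss(0, 0.3)  # noise, small variance
    data.append((t * math.cos(angle) - noise * math.sin(angle),
                 t * math.sin(angle) + noise * math.cos(angle)))

cov = covariance_matrix(data)
mean = (sum(x for x, _ in data) / len(data), sum(y for _, y in data) / len(data))
first, second = principal_axes(cov)

# Express each centered sample in the new orthogonal basis.
scores = [((x - mean[0]) * first[0] + (y - mean[1]) * first[1],
           (x - mean[0]) * second[0] + (y - mean[1]) * second[1])
          for x, y in data]
cov_scores = covariance_matrix(scores)

# Discard the low-variance component and map back to the original space.
reduced = [(mean[0] + s1 * first[0], mean[1] + s1 * first[1]) for s1, _ in scores]

print("covariance before:", cov)
print("covariance after: ", cov_scores)
```

In the transformed coordinates the off-diagonal entries should be zero up to rounding, and the two diagonal entries are the variances along the signal and noise directions. `reduced` keeps only the first component, which is the "drop the small-variance direction" step described above.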

Note:

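One flaw in the example here is that in PCA, when the covariance matrix is computed from sample data, the values along each dimension need to be normalized. Why? Each entry of the covariance matrix measures the relationship between two dimensions, but if one dimension has a large variance of its own, that variance masks the relationship when the covariance is computed! After normalization, the covariance matrix shows a clear relationship between X and Y, whereas before normalization the Y component can look so large that the X direction seems negligible!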
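A small made-up data set illustrates the effect of normalization. The numbers below are hypothetical, and "normalization" is taken here to mean centering each dimension and dividing by its standard deviation, which is one common choice and an assumption on my part.

```python
import math

def covariance_matrix(samples):
    """Population covariance matrix (divide by n) of 2-D samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / n
    syy = sum((y - my) ** 2 for _, y in samples) / n
    sxy = sum((x - mx) * (y - my) for x, y in samples) / n
    return [[sxx, sxy], [sxy, syy]]

def standardize(samples):
    """Center each dimension and scale it to unit (population) variance."""
    cov = covariance_matrix(samples)
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sd_x, sd_y = math.sqrt(cov[0][0]), math.sqrt(cov[1][1])
    return [((x - mx) / sd_x, (y - my) / sd_y) for x, y in samples]

# Hypothetical data: Y grows with X but lives on a much larger scale.
samples = [(1, 110), (2, 190), (3, 320), (4, 390), (5, 510)]

before = covariance_matrix(samples)
after = covariance_matrix(standardize(samples))

print("before normalization:", before)
print("after normalization: ", after)
```

Before normalization the Y variance dwarfs every other entry, so the X direction looks negligible. After normalization both variances are 1 and the off-diagonal entry is the correlation coefficient, which is close to 1 for this data.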

0 0