PCA: Principal Component Analysis [1]



Principal component analysis (PCA) is a statistical procedure that uses (1) an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of (2) linearly uncorrelated variables called principal components. (3) The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that (4) the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are (5) guaranteed to be independent if the data set is jointly normally distributed. PCA is sensitive to the relative scaling of the original variables.
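The definition above can be illustrated with a minimal sketch that performs PCA through an eigendecomposition of the sample covariance matrix. This is only one common way to compute PCA, not necessarily the method the quoted text has in mind; the function name `pca`, the synthetic two-variable data set, and the random seed are all assumptions made for illustration.

```python
import numpy as np

def pca(data):
    """PCA of an (n_samples, n_features) array via the covariance matrix.

    Returns (components, variances, scores). The name and return layout
    are illustrative choices, not taken from the original text.
    """
    centered = data - data.mean(axis=0)          # remove the mean of each variable
    cov = np.cov(centered, rowvar=False)         # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # reorder so the largest variance comes first
    variances = eigvals[order]
    components = eigvecs[:, order]               # columns are orthonormal directions
    scores = centered @ components               # data expressed in the new basis
    return components, variances, scores

# Synthetic, strongly correlated two-variable data (an assumption for the demo).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])

components, variances, scores = pca(data)
score_cov = np.cov(scores, rowvar=False)         # covariance of the principal components

print("variances:", variances)
print("covariance of components:\n", score_cov)
```

In this sketch the off-diagonal entries of `score_cov` come out as zero up to rounding error, which corresponds to point (2), and `variances` is sorted in descending order, which corresponds to point (4).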





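So why change the way the samples are represented, or in other words, why apply a spatial transformation at all? The answer lies in the correlation within the samples. I misunderstood this for a long time: I thought the correlation meant the relationship between sample a and sample b (if the two are related, then of course a relationship exists). What it actually refers to is not the relationship between samples, but the relationship between the dimensions (that is, the variables) of the joint probability distribution behind the sample data. Take the two-dimensional case: suppose we have a joint distribution P(X, Y) and draw independent random samples from it to obtain a data set {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}. We want to use these samples to characterize P(X, Y), or more precisely, to measure the relationship between the random variable X and the random variable Y. This is where covariance comes in. What does covariance do? Wikipedia defines it as follows: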

covariance is a measure of how much two random variables change together. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the smaller values, i.e., the variables tend to show similar behavior, the covariance is positive.[1] In the opposite case, when the greater values of one variable mainly correspond to the smaller values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret. 
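The sign behavior described in this definition can be checked with a small sketch. The helper name `covariance` and the toy data are assumptions for illustration; the helper uses the population form (dividing by n) of the usual formula.

```python
import numpy as np

def covariance(x, y):
    """Population covariance: mean of the products of deviations from the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

same_direction = covariance(x, 2 * x + 1)   # y grows when x grows
opposite_direction = covariance(x, -x)      # y shrinks when x grows

print(same_direction)      # 4.0
print(opposite_direction)  # -2.0
```

The first value is positive because large x pairs with large y, and the second is negative because large x pairs with small y. The magnitudes depend on the units of the data, which is why the definition calls them hard to interpret.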


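If two variables are independent, their covariance is 0; the converse, however, does not hold. For example, let y = x^2 with x uniformly distributed on [-1, 1]. The covariance of x and y is 0, yet the two are clearly dependent. This is because covariance only measures the linear relationship between two variables and cannot capture a nonlinear relationship such as this quadratic one. Wikipedia does say, however, that if two random variables are jointly normally distributed, then being uncorrelated (zero covariance) does imply that they are independent. (I still do not fully understand this point.)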
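The y = x^2 counterexample can be verified numerically. The sketch below replaces random sampling with an evenly spaced, symmetric grid on [-1, 1] as a deterministic stand-in for the uniform distribution; that substitution and the variable names are assumptions made for illustration.

```python
import numpy as np

# Symmetric, evenly spaced grid standing in for x ~ Uniform[-1, 1].
x = np.linspace(-1.0, 1.0, 2001)
y = x ** 2                                   # y is fully determined by x

# Population covariance of x and y: zero up to floating-point rounding.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Covariance of x^2 and y: clearly nonzero, so the dependence is real,
# it just is not linear in x.
cov_x2_y = np.mean((x ** 2 - np.mean(x ** 2)) * (y - y.mean()))

print("cov(x, y)   =", cov_xy)
print("cov(x^2, y) =", cov_x2_y)
```

The first number is zero up to rounding because the positive and negative halves of the interval cancel exactly. The second is close to 4/45, the variance of x^2 under the uniform distribution.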

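Covariance measures the linear relationship between two variables. In a multi-dimensional space, expressing the pairwise relationships between all dimensions requires a covariance matrix. An n-dimensional random vector consists of n random variables, X = [X1, X2, ..., Xn], backed by a joint distribution P(X1, X2, ..., Xn) over those n variables, and the relationships between the dimensions are expressed by the covariance matrix.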
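For reference, the standard definition of the covariance matrix can be written out as below. The symbols \Sigma and \mu_i are notation assumed here; they do not appear in the original text.

```latex
\mu_i = \mathrm{E}[X_i], \qquad
\Sigma_{ij} = \operatorname{cov}(X_i, X_j)
            = \mathrm{E}\big[(X_i - \mu_i)(X_j - \mu_j)\big]

\Sigma =
\begin{bmatrix}
\operatorname{cov}(X_1, X_1) & \operatorname{cov}(X_1, X_2) & \cdots & \operatorname{cov}(X_1, X_n) \\
\operatorname{cov}(X_2, X_1) & \operatorname{cov}(X_2, X_2) & \cdots & \operatorname{cov}(X_2, X_n) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{cov}(X_n, X_1) & \operatorname{cov}(X_n, X_2) & \cdots & \operatorname{cov}(X_n, X_n)
\end{bmatrix}
```

The diagonal entries are the variances of the individual variables, and the matrix is symmetric because cov(X_i, X_j) = cov(X_j, X_i).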



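For P1(X, Y), suppose Y = X, where X and Y take values in {1, -1}; sampling from it yields {(1,1), (-1,-1)}.

For P2(X, Y), suppose X and Y are independent, where X and Y take values in {1, -1}; sampling from it yields {(1,1), (-1,-1), (1,-1), (-1,1)}.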
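The text does not say what is done with these two sample sets next. Assuming they are meant to be compared through their covariance matrices, a minimal sketch looks like this. The variable names are illustrative, and `bias=True` is chosen so that the sets are treated as the full populations listed above.

```python
import numpy as np

# Samples listed in the text: each row is one (x, y) observation.
p1_samples = np.array([[1, 1], [-1, -1]], dtype=float)
p2_samples = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)

# rowvar=False: columns are variables; bias=True: divide by n instead of n - 1.
cov_p1 = np.cov(p1_samples, rowvar=False, bias=True)
cov_p2 = np.cov(p2_samples, rowvar=False, bias=True)

print(cov_p1)  # [[1. 1.] [1. 1.]]
print(cov_p2)  # [[1. 0.] [0. 1.]]
```

For P1 the off-diagonal entry equals the variances, which reflects the perfect linear relationship Y = X. For P2 the off-diagonal entry is 0, which is consistent with X and Y being independent.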








