12-Dimensionality Reduction

Source: Internet. Editor: 程序博客网. Time: 2024/06/04 18:01

Dimensionality reduction of data

Motivations:
- Data compression
- Data visualization

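Dimensionality reduction:
- From 2 dimensions to 1: find a vector $u^{(1)} \in \mathbb{R}^2$ such that projecting the 2-D data in the plane onto it gives the smallest projection error.
- From n dimensions to k: find k vectors $u^{(1)}, u^{(2)}, \dots, u^{(k)}$ such that projecting the data onto the subspace they span gives the smallest projection error.

PCA is not linear regression: linear regression minimizes the vertical distance between each point and the fitted line (the error in predicting y), whereas PCA minimizes the orthogonal projection distance and treats all features alike.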

[Figure: linear regression (left) versus PCA (right)]
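The difference between the two error measures can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original notes; the toy data and all variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 2-D toy data: x2 depends roughly linearly on x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + 0.3 * rng.normal(size=200)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)  # mean normalization

# Linear regression of x2 on x1 minimizes squared VERTICAL distances.
slope = (X[:, 0] @ X[:, 1]) / (X[:, 0] @ X[:, 0])
vertical_sq_err = np.mean((X[:, 1] - slope * X[:, 0]) ** 2)

# PCA: the first principal direction minimizes squared ORTHOGONAL distances.
Sigma = X.T @ X / X.shape[0]
U, S, Vt = np.linalg.svd(Sigma)
u1 = U[:, 0]
proj = np.outer(X @ u1, u1)  # projection of each point onto the line spanned by u1
orthogonal_sq_err = np.mean(np.sum((X - proj) ** 2, axis=1))

# Orthogonal error of the regression line, for comparison with PCA.
r = np.array([1.0, slope]) / np.hypot(1.0, slope)
proj_r = np.outer(X @ r, r)
regression_orth_err = np.mean(np.sum((X - proj_r) ** 2, axis=1))

print(vertical_sq_err, regression_orth_err, orthogonal_sq_err)
```

The PCA line never has a larger orthogonal error than the regression line, and its mean squared orthogonal error equals the second (smaller) singular value of the covariance matrix.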

Data preprocessing: apply mean normalization (and feature scaling when the features have different ranges) before running PCA.
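A minimal sketch of this preprocessing step in Python with NumPy. The function name `normalize_features` and the sample matrix are hypothetical, and scaling by the standard deviation is one common choice assumed here; the notes do not fix a particular scaling.

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize and scale each column of X (shape m x n, one example per row)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # Guard against division by zero: a constant feature is left at zero after centering.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / sigma, mu, sigma

# Hypothetical data: two features on very different scales.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_norm, mu, sigma = normalize_features(X)
print(X_norm.mean(axis=0), X_norm.std(axis=0))
```

Returning `mu` and `sigma` lets the same transformation be reused later on new examples.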

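Mapping the n-dimensional $x^{(i)}$ to the k-dimensional $z^{(i)}$:

Compute the covariance matrix of X:
$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T, \quad x^{(i)} \in \mathbb{R}^{n \times 1}, \quad \Sigma \in \mathbb{R}^{n \times n}$
Compute the eigenvectors of the covariance matrix $\Sigma$:
$[U, S, V] = \mathrm{svd}(\Sigma)$ (svd: singular value decomposition), $U = \begin{bmatrix} | & | & & | \\ u^{(1)} & u^{(2)} & \cdots & u^{(n)} \\ | & | & & | \end{bmatrix}, \quad U \in \mathbb{R}^{n \times n}$
Take the first k columns $u^{(1)}, u^{(2)}, \dots, u^{(k)}$ of U to form $U_{reduce} \in \mathbb{R}^{n \times k}$.
$z^{(i)} = U_{reduce}^T x^{(i)}, \quad z^{(i)} \in \mathbb{R}^{k \times 1}$
This completes the reduction $x^{(i)} \to z^{(i)}$.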

In code (Octave/MATLAB syntax):

    Sigma = (1/m) * X' * X;    % same as (1/m) * sum over i of x(i) * x(i)', with one example per row of X
    [U, S, V] = svd(Sigma);
    Ureduce = U(:, 1:k);
    z = Ureduce' * x;
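For readers working in Python, the same steps can be sketched with NumPy. This is an illustrative translation, not the notes' own code; the function name `pca_project` and the random data are assumptions. Note that `np.linalg.svd` returns the singular values as a 1-D array, not as a diagonal matrix.

```python
import numpy as np

def pca_project(X, k):
    """Project mean-normalized X (m x n, one example per row) onto k dimensions."""
    m = X.shape[0]
    Sigma = (X.T @ X) / m              # n x n covariance matrix
    U, S, Vt = np.linalg.svd(Sigma)    # columns of U are the principal directions
    U_reduce = U[:, :k]                # first k columns, shape n x k
    Z = X @ U_reduce                   # row i is z(i) = U_reduce^T x(i)
    return Z, U_reduce, S

# Hypothetical data: 50 examples with 3 features, mean-normalized first.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)

Z, U_reduce, S = pca_project(X, k=2)
X_approx = Z @ U_reduce.T              # reconstruction back in n dimensions
print(Z.shape, X_approx.shape)
```

Storing the examples as rows means the whole data set is projected with one matrix product instead of one product per example.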

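Choosing the number of principal components k:

Average squared projection error: $\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - x_{approx}^{(i)} \|^2$, where $x_{approx}^{(i)} = U_{reduce} z^{(i)}$ is the reconstruction of $x^{(i)}$
Total variation: $\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} \|^2$
Selection criterion: choose the smallest k such that
$\frac{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - x_{approx}^{(i)} \|^2}{\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} \|^2} \leq 0.01 \quad (\text{or } 0.05, 0.1, \dots)$
that is, at least 99% (or 95%, 90%, ...) of the variance is retained.
Evaluating this directly is tedious, because the projection has to be recomputed for every candidate k. The result of the earlier svd call gives a simpler computation.
With $S = \begin{bmatrix} S_{11} & & & \\ & S_{22} & & \\ & & \ddots & \\ & & & S_{nn} \end{bmatrix}$ the criterion becomes:
$1 - \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \leq 0.01 \quad \text{or} \quad \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \geq 0.99$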

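In summary, k is chosen as follows: call svd once on $\Sigma$, then pick the smallest k for which $\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \geq 0.99$.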
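The selection rule can be sketched in a few lines of Python. The function name `choose_k` and the example singular values are hypothetical; the input is assumed to be the diagonal of S sorted in decreasing order, as svd returns it.

```python
def choose_k(S, retained=0.99):
    """Return the smallest k whose first k singular values keep at least
    the fraction `retained` of the total variance."""
    total = float(sum(S))
    running = 0.0
    for k, s in enumerate(S, start=1):
        running += float(s)
        if running / total >= retained:
            return k
    return len(S)  # fallback in case of floating-point round-off

# Hypothetical diagonal of S; cumulative ratios are 0.5, 0.8, 0.95, 1.0.
S_diag = [5.0, 3.0, 1.5, 0.5]
k_99 = choose_k(S_diag, retained=0.99)
k_90 = choose_k(S_diag, retained=0.90)
print(k_99, k_90)
```

Because only running sums of the singular values are needed, every candidate k is evaluated from a single svd call.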

The mapping $x^{(i)} \to z^{(i)}$ should be defined by running PCA only on the training set. The same mapping can then be applied to the examples $x_{cv}^{(i)}$ and $x_{test}^{(i)}$ in the cross-validation and test sets.
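One way to keep the mapping tied to the training set is to separate a fit step from an apply step. The sketch below uses Python with NumPy; the names `fit_pca` and `apply_pca` and the random splits are assumptions for illustration, and the normalization statistics are treated as part of the mapping.

```python
import numpy as np

def fit_pca(X_train, k):
    """Learn normalization statistics and U_reduce from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # avoid dividing by zero
    Xn = (X_train - mu) / sigma
    Sigma = Xn.T @ Xn / Xn.shape[0]
    U, S, Vt = np.linalg.svd(Sigma)
    return mu, sigma, U[:, :k]

def apply_pca(X, mu, sigma, U_reduce):
    """Reuse the training-set parameters on any split (train, cv, or test)."""
    return ((X - mu) / sigma) @ U_reduce

# Hypothetical splits with 5 features each.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 5))
X_cv = rng.normal(size=(20, 5))
X_test = rng.normal(size=(20, 5))

mu, sigma, U_reduce = fit_pca(X_train, k=2)
Z_train = apply_pca(X_train, mu, sigma, U_reduce)
Z_cv = apply_pca(X_cv, mu, sigma, U_reduce)
Z_test = apply_pca(X_test, mu, sigma, U_reduce)
print(Z_train.shape, Z_cv.shape, Z_test.shape)
```

Nothing is re-estimated from the cross-validation or test data, so those sets stay untouched by the fitting step.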
Applications of PCA:
- Compression
  - Reduce memory/disk needed to store data
  - Speed up learning algorithm
- Visualization

Do not use PCA as a way to prevent overfitting; use regularization instead.
Before implementing PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$. Only if that does not do what you want should you implement PCA and consider using $z^{(i)}$.

0 0