Preprocessing in Feature Engineering

Source: Internet. Editor: 程序博客网. Time: 2024/09/21 09:27

MNIST dataset

Preprocessing (dimensionality reduction)

https://www.kaggle.com/arthurtok/interactive-intro-to-dimensionality-reduction

Dimensionality reduction addresses the "curse of dimensionality" problem. If we are able to project our data from a higher-dimensional space to a lower one while keeping most of the relevant information, that would make life a lot easier for our learning methods.

Python imports:

from sklearn.manifold import TSNE  # Nonlinear, probabilistic method
from sklearn.decomposition import PCA  # Unsupervised, linear method
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA  # Supervised, linear method

1. Principal Component Analysis (PCA) - Unsupervised, linear method

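PCA is an unsupervised linear transformation algorithm. It converts the original feature set of the data into a smaller feature set, which is to say it reduces dimensionality. PCA does not select or discard individual features; it constructs new features as linear combinations of all the original ones. To find the most suitable directions (the principal components), PCA maximizes the variance in the new subspace. Why maximize the variance?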
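As a minimal sketch of the point that PCA builds new features from linear combinations of all original features, the snippet below fits scikit-learn's PCA and reproduces the projection by hand. It assumes the small 8x8 digits set bundled with scikit-learn as a stand-in for MNIST, and the choice of two components and of svd_solver="full" is only for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: the bundled 8x8 digits set stands in for MNIST here.
X, y = load_digits(return_X_y=True)   # X has shape (1797, 64)

# Keep two components; the exact ("full") SVD solver is chosen for illustration.
pca = PCA(n_components=2, svd_solver="full")
X_pca = pca.fit_transform(X)          # shape (1797, 2)

# Each new feature is a linear combination of all 64 centered original features:
# the weights are the rows of pca.components_.
X_manual = (X - pca.mean_) @ pca.components_.T

print(np.allclose(X_pca, X_manual))
print(pca.explained_variance_ratio_)  # share of total variance per component
```

The explained_variance_ratio_ attribute reports how much of the total variance each retained component carries, which is a common way to decide how many components to keep.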

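To answer that, first recall how principal component analysis works. The principal components are mutually orthogonal, and the data projected onto them are linearly uncorrelated. In the new subspace, the off-diagonal entries of the covariance matrix (which measures how pairs of variables vary together) are 0, while the diagonal entries (the eigenvalues) are generally nonzero. These diagonal values are the variances along the principal components, that is, the variability of the new features. This is why PCA maximizes variance: the directions (principal components) it keeps are the ones that carry the most information (variance) about the data points.

For a brilliant and detailed description of this, check out this StackExchange thread:

"PCA and proportion of variance explained", by amoeba, who gives a very good explanation.

An example article that implements PCA on the UCI Iris dataset:

"Principal Component Analysis in 3 Simple Steps" by Sebastian Raschka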
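The covariance argument can be checked numerically. The following is a sketch using only NumPy on hypothetical random data (the array names and sizes are assumptions for illustration): after projecting onto the eigenvectors of the covariance matrix, the covariance of the new features is diagonal, and the diagonal holds the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 samples of 3 correlated features.
mixing = rng.normal(size=(3, 3))
X = rng.normal(size=(500, 3)) @ mixing
Xc = X - X.mean(axis=0)                      # center the data

# Eigendecomposition of the (symmetric) covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the principal components and inspect the new covariance matrix.
Z = Xc @ eigvecs
cov_Z = np.cov(Z, rowvar=False)
off_diag = cov_Z - np.diag(np.diag(cov_Z))

print(np.allclose(off_diag, 0.0, atol=1e-10))   # new features are uncorrelated
print(np.allclose(np.diag(cov_Z), eigvals))     # variances equal the eigenvalues
```

Keeping only the first k columns of eigvecs gives the k directions with the largest variance, which is the dimensionality reduction step.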

2. Linear Discriminant Analysis (LDA) - Supervised, linear method

Both Linear Discriminant Analysis (LDA) and PCA are linear transformation methods. PCA yields the directions (principal components) that maximize the variance of the data, whereas LDA aims to find the directions that maximize the separation (or discrimination) between different classes, which can be useful in pattern classification problems.
In other words, PCA projects the entire dataset onto a different feature (sub)space, and LDA tries to determine a suitable feature (sub)space in order to distinguish between patterns that belong to different classes.
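The practical difference shows up in the API: PCA is fit on the features alone, while LDA also needs the class labels. A minimal sketch, again assuming scikit-learn's bundled digits set as a stand-in for MNIST and two output dimensions chosen for plotting:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Assumption: the bundled 8x8 digits set stands in for MNIST here.
X, y = load_digits(return_X_y=True)

# Unsupervised: only X is used, directions maximize variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised: the labels y are required, directions maximize class separation.
# n_components can be at most (number of classes - 1), i.e. 9 for ten digits.
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```

Because some pixels are constant across samples, scikit-learn may warn about collinear variables when fitting LDA on image data; the projection is still computed.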

3. t-SNE (t-Distributed Stochastic Neighbour Embedding) - Nonlinear, probabilistic method

https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding

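t-SNE converts the Euclidean distances between points into conditional probabilities that represent similarities. In the low-dimensional embedding, a Student-t distribution is then used to measure the similarity between one data point and another.

From the t-SNE scatter plot, the first thing that strikes you is that clusters (and even subclusters) are very well defined and segregated, resulting in Jackson Pollock-like modern art visuals, even more so than with the PCA and LDA methods. t-SNE's ability to provide very good cluster visualizations can be attributed to the topology-preserving property of the algorithm.

Drawback of t-SNE: when the algorithm identifies clusters and subclusters, it can settle into one of several local minima. This is visible in the scatter plot, where a cluster of a single color appears as two subclusters in different regions of the plot.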
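The sensitivity to local minima can be seen by running t-SNE from different random initializations. This is a sketch under stated assumptions: it uses a 300-sample slice of scikit-learn's bundled digits set instead of MNIST to keep the run short, and the perplexity value and seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Assumption: a small slice of the bundled digits set stands in for MNIST.
X, y = load_digits(return_X_y=True)
X_small = X[:300]

embeddings = {}
kl_values = {}
for seed in (0, 1):
    # Random initialization, so each seed starts the optimization elsewhere.
    tsne = TSNE(n_components=2, init="random", perplexity=30.0, random_state=seed)
    embeddings[seed] = tsne.fit_transform(X_small)   # shape (300, 2)
    kl_values[seed] = tsne.kl_divergence_            # final value of the cost

# Different seeds generally end in different layouts and different final costs.
print(np.allclose(embeddings[0], embeddings[1]))
print(kl_values)
```

A common way to reduce this variability is to fix random_state and use init="pca", so the optimization starts from the same, data-driven layout on every run.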
