R Clustering & Dimension Reduction

Source: Internet | Published by: wps办公软件官网 | Editor: 程序博客网 | Time: 2024/06/05 15:21

Definition:

Exploratory data analysis is a "rough cut" or filter which helps you find the most beneficial areas of questioning so you can set your priorities accordingly.

Typical questions include:
1. "Is the correlation between the measurements and activities good enough to train a machine?"
2. "Given a set of 561 measurements, would a trained machine be able to determine which of the 6 activities the person was doing?"

-------------------------------------------------------

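Hierarchical Clustering

Find the two closest points and merge them (into a "super point"), then find the next closest pair, and repeat until everything is joined into a tree structure.
(For background, see the book Data Mining: Concepts and Techniques.)

hclust ## the standard hierarchical clustering function (stats package)
myplclust ## an enhanced dendrogram plotting helper (e.g. colors different groups of the tree); not part of base R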

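distxy<-dist(dataframe) # pairwise distances between rows; Euclidean by default
hc<-hclust(distxy) # build the cluster tree
plot(hc)
plot(as.dendrogram(hc)) # plot the tree as a dendrogram object (simplified display)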

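Number of clusters: determined by where you cut the tree (the chosen distance cutoff)
Distance between clusters: the distance between their farthest points (complete linkage, the hclust default)
The result is deterministic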
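
As a minimal sketch of the steps above: the simulated data, the seed, the three assumed centers and the name dataFrame are illustrative assumptions, not part of the original notes. cutree() is one way to cut the tree into a fixed number of clusters.

```r
set.seed(1234)
# Simulated data: 12 points scattered around three assumed centers
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)

distxy <- dist(dataFrame)        # pairwise Euclidean distances between rows
hc <- hclust(distxy)             # complete linkage by default
clusters <- cutree(hc, k = 3)    # cut the tree into 3 clusters
table(clusters)                  # cluster sizes

# Running the same steps again gives the same tree
sameTree <- identical(hclust(dist(dataFrame))$merge, hc$merge)
```

cutree() also accepts a height (h = ...) instead of k, which corresponds to choosing the distance cutoff line directly.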

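heatmap(dataMatrix,col=cm.colors(25))
## clustered heat map: each row is an observation and each column a variable; rows and columns are both clustered and reordered
Suited to matrices with many variables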
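
A small sketch of what the reordering looks like in practice. The 40 x 10 random matrix and the artificial shift added to some rows are assumptions made up for illustration.

```r
set.seed(12345)
# Assumed example data: 40 observations x 10 variables of random noise
dataMatrix <- matrix(rnorm(400), nrow = 40)
# Shift the last 5 columns for a random subset of rows to create structure
shifted <- sample(c(TRUE, FALSE), 40, replace = TRUE)
dataMatrix[shifted, 6:10] <- dataMatrix[shifted, 6:10] + 3

# heatmap() clusters rows and columns, then draws the reordered matrix
hm <- heatmap(dataMatrix, col = cm.colors(25))
hm$rowInd   # row order after clustering
hm$colInd   # column order after clustering
```

The returned rowInd and colInd are permutations of the original row and column indices, so they can be used to reorder the matrix outside the plot.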

-----------------------------------------------------

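K-means Clustering

Pick a few centers at random, assign each point to its nearest center, recompute each cluster's center, reassign the points, and repeat until convergence. Suitable for high-dimensional data.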

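distTmp<-mdist(x,y,cx,cy) # distances from every point to the initial centers (Euclidean); mdist is a helper function, not part of base R
newClust<-apply(distTmp,2,which.min) # assign each point to its nearest center
tapply(x,newClust,mean) # recompute the x-coordinates of the cluster centers
tapply(y,newClust,mean) # recompute the y-coordinates of the cluster centers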
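
One assignment-and-update step can be sketched with base R only. centerDist below is a hypothetical stand-in for mdist, and the data and starting centers are assumed values for illustration.

```r
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
cx <- c(1, 1.8, 2.5)   # assumed starting center x-coordinates
cy <- c(2, 1, 1.5)     # assumed starting center y-coordinates

# Hypothetical stand-in for mdist(): one row per center, one column per point
centerDist <- function(x, y, cx, cy) {
  t(sapply(seq_along(cx), function(k) sqrt((x - cx[k])^2 + (y - cy[k])^2)))
}

distTmp <- centerDist(x, y, cx, cy)
newClust <- apply(distTmp, 2, which.min)   # index of the nearest center per point
newCx <- tapply(x, newClust, mean)         # updated center x-coordinates
newCy <- tapply(y, newClust, mean)         # updated center y-coordinates
```

Repeating the last three lines with newCx and newCy as the centers, until newClust stops changing, gives the full iteration.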

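kmObj<-kmeans(dataFrame,centers=3,nstart=100) # run k-means with 100 random starts and keep the best result
plot(x,y,col=kmObj$cluster,pch=19,cex=2)
table(kmObj$cluster,classLabels) ## cross-tabulate clusters against the known class labels (classLabels is a placeholder)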

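Number of clusters: specified up front
Distance between clusters: the distance between cluster centers
The result varies, depending on the initial cluster centers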
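
A runnable sketch of the calls above on simulated data. The data, the seed and the label vector trueLabels are assumptions for illustration.

```r
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)
trueLabels <- rep(1:3, each = 4)   # assumed known class labels

# nstart = 100 tries 100 random initializations and keeps the best one
kmObj <- kmeans(dataFrame, centers = 3, nstart = 100)
kmObj$centers                      # final cluster centers
kmObj$tot.withinss                 # total within-cluster sum of squares
table(kmObj$cluster, trueLabels)   # clusters versus labels
```

Cluster numbers are arbitrary, so cluster 1 need not correspond to label 1; what matters in the table is whether each label falls mostly into a single cluster.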

laying<-which(kClust$size==29)
plot(kClust$centers[laying,1:12],pch=19,ylab="Laying Cluster") # inspect the key attributes of one cluster (first 12 coordinates of its center)

image() ## heat map
-----------------------------------------------------

Dimension Reduction

Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)

# find the best matrix created with fewer variables (that is, a lower rank matrix) that explains the original data
Explain as much of the variance as possible with as few variables as possible

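##SVD##

X=UDV^T # the columns of V are the principal component directions (when X is centered and scaled)
svd1<-svd(scale(datamatrix))
par(mfrow=c(1,3))
image(t(datamatrix)[,nrow(datamatrix):1])
plot(svd1$u[,i],40:1,xlab="Row",ylab="First left singular vector",pch=19) ## weight of each observation (row) on component i; 40:1 assumes 40 rows
plot(svd1$v[,i],xlab="Column",ylab="First right singular vector",pch=19) ## weight of each variable (column) on component i
plot(svd1$d,xlab="Column",ylab="Singular value",pch=19) ## singular value of each component
plot(svd1$d^2/sum(svd1$d^2),xlab="Column",ylab="Prop. of variance explained",pch=19) ## proportion of variance explained by each component
a2 <- svd1$u[,1:2] %*% diag(svd1$d[1:2]) %*% t(svd1$v[,1:2]) ## rank-2 approximation built from the first two components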
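
To make the decomposition concrete, here is a minimal sketch on an assumed 40 x 10 random matrix (the data and seed are illustrative). It checks that U D V^T reproduces the scaled matrix and builds a rank-2 approximation.

```r
set.seed(678910)
dataMatrix <- matrix(rnorm(400), nrow = 40)   # assumed example matrix
scaled <- scale(dataMatrix)
svd1 <- svd(scaled)

# Full reconstruction: U D V^T recovers the scaled matrix
full <- svd1$u %*% diag(svd1$d) %*% t(svd1$v)
reconErr <- max(abs(full - scaled))           # numerically close to zero

# Proportion of variance explained by each component (sums to 1)
varExplained <- svd1$d^2 / sum(svd1$d^2)

# Rank-2 approximation from the first two components
approx2 <- svd1$u[, 1:2] %*% diag(svd1$d[1:2]) %*% t(svd1$v[, 1:2])
```

Keeping only the first few components in this way is the idea behind using SVD for compression: the approximation has the same dimensions as the original but can be stored as the smaller factors.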

##PCA##
prcomp(datamatrix,scale=TRUE) # matches the SVD of the scaled matrix (up to sign)
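
The agreement between the two can be checked directly. This is a sketch on an assumed random matrix; the comparison uses absolute values because the sign of each component is arbitrary.

```r
set.seed(678910)
dataMatrix <- matrix(rnorm(400), nrow = 40)   # assumed example matrix
svd1 <- svd(scale(dataMatrix))
pca1 <- prcomp(dataMatrix, scale. = TRUE)

# Loadings from prcomp equal the right singular vectors, up to sign
rotDiff <- max(abs(abs(pca1$rotation) - abs(svd1$v)))

# Standard deviations relate to singular values: sdev = d / sqrt(n - 1)
sdevDiff <- max(abs(pca1$sdev - svd1$d / sqrt(nrow(dataMatrix) - 1)))
```

Both differences should be at the level of floating-point rounding error.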

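## Neither PCA nor SVD works on data with missing values
impute ## http://bioconductor.org package for imputing missing values
impute.knn ## k-nearest-neighbor imputation function in that package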
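
The sketch below illustrates the problem and one workaround using base R only. Note that it swaps in simple column-mean imputation instead of impute.knn, so that it runs without the Bioconductor package; the data and the 20 missing cells are assumptions for illustration.

```r
set.seed(678910)
dataMatrix <- matrix(rnorm(400), nrow = 40)   # assumed example matrix
dataMatrix2 <- dataMatrix
dataMatrix2[sample(1:400, size = 20, replace = FALSE)] <- NA   # 20 missing cells

# svd() stops with an error when the input contains NA
res <- tryCatch(svd(scale(dataMatrix2)), error = function(e) "error")

# Stand-in imputation (not impute.knn): replace each NA with its column mean
colMeansNA <- colMeans(dataMatrix2, na.rm = TRUE)
naIdx <- which(is.na(dataMatrix2), arr.ind = TRUE)
dataMatrix2[naIdx] <- colMeansNA[naIdx[, "col"]]

svd2 <- svd(scale(dataMatrix2))   # runs once the gaps are filled
```

With the impute package installed, the analogous step would presumably be dataMatrix2 <- impute.knn(dataMatrix2)$data, which fills each gap from the nearest rows rather than from the column mean.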

## Principal component analysis can be used to compress data and images

##Notes
1. Standardize the data first
2. The computation is expensive for large matrices

## Related resources

0 0