DM01 - A Further Look at TSVD


Abstract: A closer look at TSVD and how it relates to LSA, using sklearn's TruncatedSVD as the example. With the help of the sklearn documentation, we build up an understanding of TSVD and then work through a hands-on example to see what is going on.

While studying LSA I came across TSVD, also known as truncated singular value decomposition, and later ran into it again in sklearn. TSVD performs dimensionality reduction much like PCA does, and in text processing TSVD is precisely the model algorithm that implements LSA.

In sklearn, TSVD is described as "Dimensionality reduction using truncated SVD (aka LSA)." Relative to PCA, the documentation says: "This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with scipy.sparse matrices efficiently." In other words, compared with PCA, it handles scipy sparse matrices more efficiently. On text processing: "In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA)." That is, in text processing, TSVD simply is LSA. For more details, see the sklearn documentation [0].
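To make that concrete, here is a minimal LSA sketch (the toy corpus and variable names are my own, not taken from the sklearn docs): TruncatedSVD is applied directly to a sparse tf-idf matrix, something plain PCA cannot do because centering would densify the sparse matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# A tiny illustrative corpus
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "logs and mats are objects",
]

# Sparse scipy CSR matrix of tf-idf weights, one row per document
tfidf = TfidfVectorizer().fit_transform(docs)

# TSVD / LSA: reduce each document to 2 latent "topic" coordinates
lsa = TruncatedSVD(n_components=2)
topics = lsa.fit_transform(tfidf)
print(topics.shape)  # (4, 2)
```

Each row of `topics` is a document expressed in the 2-dimensional latent semantic space.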

In my view, there are two steps to understanding a model: first, use it, so theory and practice line up; second, simulate it yourself. Here I want to simulate this model:

```python
import time

import numpy as np
from numpy.linalg import svd
from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD

print('#### Load data ####')
iris = load_iris()
iris_data = iris.data
print('iris original data (%d*%d):' % (iris_data.shape[0], iris_data.shape[1]))
print(iris_data[:10])

print('#### sklearn TSVD [arpack] ####')
t = time.time()
tsvd_arpack = TruncatedSVD(2, algorithm='arpack')
iris_transformed_arpack = tsvd_arpack.fit_transform(iris_data)
print('cost:', str(time.time() - t))
print('sklearn transformed data (%d*%d):' % (iris_transformed_arpack.shape[0], iris_transformed_arpack.shape[1]))
print(iris_transformed_arpack[:10])

print('#### sklearn TSVD [randomized] ####')
t = time.time()
tsvd = TruncatedSVD(2, algorithm='randomized')
iris_transformed = tsvd.fit_transform(iris_data)
print('cost:', str(time.time() - t))
print('sklearn transformed data (%d*%d):' % (iris_transformed.shape[0], iris_transformed.shape[1]))
print(iris_transformed[:10])

print('#### Full SVD ####')
U, S, V = svd(np.array(iris_data), full_matrices=False)
print('U(%d*%d):' % (U.shape[0], U.shape[1]))
print(U[:10])
print('S(%d*%d):' % (np.diag(S).shape[0], np.diag(S).shape[1]))
print(np.diag(S))
print('V:(%d*%d):' % (V.shape[0], V.shape[1]))
print(V)

# Truncate: keep the two largest singular values with the matching parts of U and V
new_U = U[:, 0:2]
new_S = S[0:2]
new_V = V[0:2, :]  # truncated V keeps all columns (original had V[0:2, 0:2], which was unused)
print('new_U(%d*%d):' % (new_U.shape[0], new_U.shape[1]))
print(new_U[:10])
print('new_S(%d*%d):' % (np.diag(new_S).shape[0], np.diag(new_S).shape[1]))
print(np.diag(new_S))

# Simulated TSVD: the transformed data is U_k * S_k
rs_tsvd = new_U.dot(np.diag(new_S))
print('Simulated TSVD [U*S](%d*%d):' % (rs_tsvd.shape[0], rs_tsvd.shape[1]))
print(rs_tsvd[:10])

print('#### SVD reconstruction ####')
rs_svd = np.dot(U.dot(np.diag(S)), V)
print('(%d*%d):' % (rs_svd.shape[0], rs_svd.shape[1]))
print(rs_svd[:10])
```

Output:

```
#### Load data ####
iris original data (150*4):
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]]
#### sklearn TSVD [arpack] ####
cost: 0.032000064849853516
sklearn transformed data (150*2):
[[ 5.91220352  2.30344211]
 [ 5.57207573  1.97383104]
 [ 5.4464847   2.09653267]
 [ 5.43601924  1.87168085]
 [ 5.87506555  2.32934799]
 [ 6.47699043  2.32552598]
 [ 5.51542859  2.07156181]
 [ 5.85042297  2.14948016]
 [ 5.15851287  1.77642658]
 [ 5.64458172  1.99190598]]
#### sklearn TSVD [randomized] ####
cost: 0.26200008392333984
sklearn transformed data (150*2):
[[ 5.91220352  2.30344211]
 [ 5.57207573  1.97383104]
 [ 5.4464847   2.09653267]
 [ 5.43601924  1.87168085]
 [ 5.87506555  2.32934799]
 [ 6.47699043  2.32552598]
 [ 5.51542859  2.07156181]
 [ 5.85042297  2.14948016]
 [ 5.15851287  1.77642658]
 [ 5.64458172  1.99190598]]
#### Full SVD ####
U(150*4):
[[ -6.16171172e-02   1.29969428e-01  -5.58364155e-05   1.05847972e-03]
 [ -5.80722977e-02   1.11371452e-01   6.84386629e-02   5.21149461e-02]
 [ -5.67633852e-02   1.18294769e-01   2.31062793e-03   9.07826254e-03]
 [ -5.66543140e-02   1.05607729e-01   4.21768760e-03  -4.22153145e-02]
 [ -6.12300644e-02   1.31431142e-01  -3.39084839e-02  -3.32538281e-02]
 [ -6.75033389e-02   1.31215489e-01  -7.05769279e-02  -1.27200659e-02]
 [ -5.74819200e-02   1.16885813e-01  -6.81501228e-02  -2.80545702e-02]
 [ -6.09732389e-02   1.21282279e-01   3.42848316e-03  -2.46471653e-02]
 [ -5.37621363e-02   1.00233102e-01   1.59181065e-02  -1.68628083e-02]
 [ -5.88279568e-02   1.12391313e-01   6.29780195e-02  -3.04349879e-02]]
S(4*4):
[[ 95.95066751   0.           0.           0.        ]
 [  0.          17.72295328   0.           0.        ]
 [  0.           0.           3.46929666   0.        ]
 [  0.           0.           0.           1.87891236]]
V:(4*4):
[[-0.75116805 -0.37978837 -0.51315094 -0.16787934]
 [ 0.28583096  0.54488976 -0.70889874 -0.34475845]
 [ 0.49942378 -0.67502499 -0.05471983 -0.54029889]
 [ 0.32345496 -0.32124324 -0.48077482  0.74902286]]
new_U(150*2):
[[-0.06161712  0.12996943]
 [-0.0580723   0.11137145]
 [-0.05676339  0.11829477]
 [-0.05665431  0.10560773]
 [-0.06123006  0.13143114]
 [-0.06750334  0.13121549]
 [-0.05748192  0.11688581]
 [-0.06097324  0.12128228]
 [-0.05376214  0.1002331 ]
 [-0.05882796  0.11239131]]
new_S(2*2):
[[ 95.95066751   0.        ]
 [  0.          17.72295328]]
Simulated TSVD [U*S](150*2):
[[-5.91220352  2.30344211]
 [-5.57207573  1.97383104]
 [-5.4464847   2.09653267]
 [-5.43601924  1.87168085]
 [-5.87506555  2.32934799]
 [-6.47699043  2.32552598]
 [-5.51542859  2.07156181]
 [-5.85042297  2.14948016]
 [-5.15851287  1.77642658]
 [-5.64458172  1.99190598]]
#### SVD reconstruction ####
(150*4):
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]]
```

This mainly uses sklearn's built-in iris dataset. Step one applies sklearn's TruncatedSVD to decompose the iris data and inspects the result.
Step two uses svd from the numpy.linalg library to decompose the same data into three matrices (U, S, V), truncates to the largest N singular values, keeps the corresponding parts of U and S, drops V, and computes U_k * S_k. The computed values match TruncatedSVD's output in absolute value, differing only in the sign of some columns. This sign difference is expected: singular vectors are only determined up to sign, so negating the i-th column of U together with the i-th row of V leaves the decomposition unchanged.
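The sign relationship can be verified numerically: each column of U_k * S_k may differ from TruncatedSVD's output only by a global sign flip. A small sketch of that check (my own code, with assumed variable names):

```python
import numpy as np
from numpy.linalg import svd
from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD

X = load_iris().data
transformed = TruncatedSVD(2, algorithm='arpack').fit_transform(X)

U, S, V = svd(X, full_matrices=False)
us = U[:, :2] * S[:2]  # manual truncation: U_k * S_k via broadcasting

# Align each column's sign with the sklearn result, then compare.
# The per-column dot product is positive when the signs already agree.
signs = np.sign(np.sum(transformed * us, axis=0))
print(np.allclose(transformed, us * signs))  # True
```

After the per-column sign alignment the two results agree to numerical precision.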
TruncatedSVD supports two solution algorithms: "This estimator supports two algorithms: a fast randomized SVD solver, and a 'naive' algorithm that uses ARPACK as an eigensolver on (X * X.T) or (X.T * X), whichever is more efficient."
One is fast randomized SVD, described in paper [1]; the other is the "naive" ARPACK algorithm, implemented via scipy.sparse.linalg.svds (see the scipy documentation [3]). sklearn branches between these two implementations inside its fit_transform code.
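In rough outline, that branch between the two solvers looks like this (a simplified sketch of my own, not sklearn's exact source code):

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import randomized_svd

def truncated_svd(X, n_components, algorithm='randomized'):
    """Simplified sketch of TruncatedSVD.fit_transform's solver dispatch."""
    if algorithm == 'arpack':
        U, Sigma, VT = svds(X, k=n_components)
        # svds returns singular values in ascending order; flip to descending
        Sigma = Sigma[::-1]
        U, VT = U[:, ::-1], VT[::-1]
    else:  # 'randomized'
        U, Sigma, VT = randomized_svd(X, n_components, random_state=0)
    return U * Sigma  # the transformed data, i.e. U_k * S_k

X = np.random.RandomState(0).rand(20, 5)
print(truncated_svd(X, 2, algorithm='arpack').shape)  # (20, 2)
```

Both paths return the data projected onto the top n_components singular directions; only the numerical method differs.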
Here is an example of svds:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds, eigs

A = csc_matrix([[1, 0, 0], [5, 0, 2], [0, -1, 0], [0, 0, 3]], dtype=float)

# Top-2 singular values via sparse SVD (returned in ascending order)
u, s, vt = svds(A, k=2)
print(s)

# The same values as square roots of the eigenvalues of A * A.T
# (eigs returns the largest-magnitude eigenvalues first)
r = np.sqrt(eigs(A.dot(A.T), k=2)[0]).real
print(r)
```

Output:

```
[ 2.75193379  5.6059665 ]
[ 5.6059665   2.75193379]
```

Gensim, by contrast, uses a randomized streamed algorithm based on random projections; see paper [2].

References
[0] scikit-learn documentation: Truncated singular value decomposition and latent semantic analysis. http://scikit-learn.org/stable/modules/decomposition.html#truncated-singular-value-decomposition-and-latent-semantic-analysis
[1] Halko et al., 2009. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions (arXiv:0909.4061). http://arxiv.org/pdf/0909.4061
[2] Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms. http://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf
[3] scipy documentation: scipy.sparse.linalg.svds. https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html

[Author: happyprince, http://blog.csdn.net/ld326/article/details/78682104]