
Source: Internet. Editor: 程序博客网. Time: 2024/06/05 22:49



In sklearn, TSVD is described as "Dimensionality reduction using truncated SVD (aka LSA)." Its relationship to PCA is explained this way: "This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with scipy.sparse matrices efficiently." In other words, because it skips the centering step (which would turn a sparse matrix into a dense one), it handles scipy sparse matrices more efficiently than PCA. For text processing, the documentation adds: "In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA)." That is, in text processing TSVD is LSA. For more details, see the sklearn documentation 【0】.
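To make the LSA use case above concrete, here is a minimal sketch that feeds a sparse tf-idf matrix straight into TruncatedSVD. The four-sentence corpus and the variable names are made up for illustration, and `random_state=0` is only there to make the randomized solver repeatable.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
docs = [
    "truncated svd reduces the dimension of sparse data",
    "pca centers the data before the decomposition",
    "latent semantic analysis works on tf-idf matrices",
    "sparse matrices stay sparse when they are not centered",
]

# The vectorizer returns a scipy.sparse matrix (documents x terms).
tfidf = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD accepts the sparse matrix directly, with no densifying step.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(tfidf)

print(type(tfidf).__name__, tfidf.shape)
print(doc_topics.shape)
```

Each row of `doc_topics` is a document expressed in two latent dimensions instead of one column per term.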


import time

import numpy as np
from numpy.linalg import svd
from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD

print('####Load data####')
iris = load_iris()
iris_data = iris.data
print('iris raw data (%d*%d):' % (iris_data.shape[0], iris_data.shape[1]))
print(iris_data[:10])

print('####sklearn TSVD decomposition [arpack]####')
t = time.time()
tsvd_arpack = TruncatedSVD(2, algorithm='arpack')
iris_transformed_arpack = tsvd_arpack.fit_transform(iris_data)
print('cost:', str(time.time() - t))
print('sklearn model output (%d*%d):' % (iris_transformed_arpack.shape[0], iris_transformed_arpack.shape[1]))
print(iris_transformed_arpack[:10])

print('####sklearn TSVD decomposition [randomized]####')
t = time.time()
tsvd = TruncatedSVD(2, algorithm='randomized')
iris_transformed = tsvd.fit_transform(iris_data)
print('cost:', str(time.time() - t))
print('sklearn model output (%d*%d):' % (iris_transformed.shape[0], iris_transformed.shape[1]))
print(iris_transformed[:10])

print('####SVD decomposition####')
# Note: numpy returns the third factor already transposed (V here is V^T).
U, S, V = svd(np.array(iris_data), full_matrices=False)
print('U(%d*%d):' % (U.shape[0], U.shape[1]))
print(U[:10])
print('S(%d*%d):' % (np.diag(S).shape[0], np.diag(S).shape[1]))
print(np.diag(S[:10]))
print('V:(%d*%d):' % (V.shape[0], V.shape[1]))
print(V[:10])

# Truncate: keep only the two largest singular values and their vectors.
new_U = U[:, 0:2]
new_S = S[0:2]
new_V = V[0:2, 0:2]
print('new_U(%d*%d):' % (new_U.shape[0], new_U.shape[1]))
print(new_U[:10])
print('new_S(%d*%d):' % (np.diag(new_S).shape[0], np.diag(new_S).shape[1]))
print(np.diag(new_S[:10]))

# Simulate TSVD by multiplying the truncated U and S.
rs_tsvd = np.dot(new_U, np.diag(new_S))
print('Simulated TSVD [U*S] (%d*%d):' % (rs_tsvd.shape[0], rs_tsvd.shape[1]))
print(rs_tsvd[:10])

print('####SVD computation####')
# Multiplying all three full factors recovers the original data.
rs_svd = np.dot(np.dot(U, np.diag(S)), V)
print('(%d*%d):' % (rs_svd.shape[0], rs_svd.shape[1]))
print(rs_svd[:10])

Output:

####Load data####
iris raw data (150*4):
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]]
####sklearn TSVD decomposition [arpack]####
cost: 0.032000064849853516
sklearn model output (150*2):
[[ 5.91220352  2.30344211]
 [ 5.57207573  1.97383104]
 [ 5.4464847   2.09653267]
 [ 5.43601924  1.87168085]
 [ 5.87506555  2.32934799]
 [ 6.47699043  2.32552598]
 [ 5.51542859  2.07156181]
 [ 5.85042297  2.14948016]
 [ 5.15851287  1.77642658]
 [ 5.64458172  1.99190598]]
####sklearn TSVD decomposition [randomized]####
cost: 0.26200008392333984
sklearn model output (150*2):
[[ 5.91220352  2.30344211]
 [ 5.57207573  1.97383104]
 [ 5.4464847   2.09653267]
 [ 5.43601924  1.87168085]
 [ 5.87506555  2.32934799]
 [ 6.47699043  2.32552598]
 [ 5.51542859  2.07156181]
 [ 5.85042297  2.14948016]
 [ 5.15851287  1.77642658]
 [ 5.64458172  1.99190598]]
####SVD decomposition####
U(150*4):
[[ -6.16171172e-02   1.29969428e-01  -5.58364155e-05   1.05847972e-03]
 [ -5.80722977e-02   1.11371452e-01   6.84386629e-02   5.21149461e-02]
 [ -5.67633852e-02   1.18294769e-01   2.31062793e-03   9.07826254e-03]
 [ -5.66543140e-02   1.05607729e-01   4.21768760e-03  -4.22153145e-02]
 [ -6.12300644e-02   1.31431142e-01  -3.39084839e-02  -3.32538281e-02]
 [ -6.75033389e-02   1.31215489e-01  -7.05769279e-02  -1.27200659e-02]
 [ -5.74819200e-02   1.16885813e-01  -6.81501228e-02  -2.80545702e-02]
 [ -6.09732389e-02   1.21282279e-01   3.42848316e-03  -2.46471653e-02]
 [ -5.37621363e-02   1.00233102e-01   1.59181065e-02  -1.68628083e-02]
 [ -5.88279568e-02   1.12391313e-01   6.29780195e-02  -3.04349879e-02]]
S(4*4):
[[ 95.95066751   0.           0.           0.        ]
 [  0.          17.72295328   0.           0.        ]
 [  0.           0.           3.46929666   0.        ]
 [  0.           0.           0.           1.87891236]]
V:(4*4):
[[-0.75116805 -0.37978837 -0.51315094 -0.16787934]
 [ 0.28583096  0.54488976 -0.70889874 -0.34475845]
 [ 0.49942378 -0.67502499 -0.05471983 -0.54029889]
 [ 0.32345496 -0.32124324 -0.48077482  0.74902286]]
new_U(150*2):
[[-0.06161712  0.12996943]
 [-0.0580723   0.11137145]
 [-0.05676339  0.11829477]
 [-0.05665431  0.10560773]
 [-0.06123006  0.13143114]
 [-0.06750334  0.13121549]
 [-0.05748192  0.11688581]
 [-0.06097324  0.12128228]
 [-0.05376214  0.1002331 ]
 [-0.05882796  0.11239131]]
new_S(2*2):
[[ 95.95066751   0.        ]
 [  0.          17.72295328]]
Simulated TSVD [U*S] (150*2):
[[-5.91220352  2.30344211]
 [-5.57207573  1.97383104]
 [-5.4464847   2.09653267]
 [-5.43601924  1.87168085]
 [-5.87506555  2.32934799]
 [-6.47699043  2.32552598]
 [-5.51542859  2.07156181]
 [-5.85042297  2.14948016]
 [-5.15851287  1.77642658]
 [-5.64458172  1.99190598]]
####SVD computation####
(150*4):
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]]

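In the second step, svd from the numpy.linalg library decomposes the iris data into three matrices (U, S, V). Keeping only the N largest singular values, that is, the corresponding parts of U and S, and dropping V, yields values that are equal in absolute value to those from TruncatedSVD. The sign difference in the first column most likely comes from the sign ambiguity of the SVD: a left singular vector and its matching right singular vector can be negated together without changing the product, and sklearn fixes the signs with its own convention (sklearn.utils.extmath.svd_flip). If anyone knows of another cause, please let me know.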
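The sign difference noted above is consistent with the fact that singular vectors are determined only up to sign. The sketch below checks this with plain numpy on a small random matrix (a stand-in for the iris data, not the author's dataset): negating one column of U together with the matching row of V^T leaves the reconstruction unchanged and changes the U*S scores only in sign.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # small stand-in matrix, not the iris data

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Negate the first singular-vector pair: column 0 of U and row 0 of Vt.
U_flipped, Vt_flipped = U.copy(), Vt.copy()
U_flipped[:, 0] *= -1
Vt_flipped[0, :] *= -1

# Both factorizations reconstruct X.
same_reconstruction = np.allclose(
    U @ np.diag(S) @ Vt, U_flipped @ np.diag(S) @ Vt_flipped
)

# The rank-2 scores U*S agree in magnitude and differ only in sign.
scores = U[:, :2] * S[:2]
scores_flipped = U_flipped[:, :2] * S[:2]
same_magnitude = np.allclose(np.abs(scores), np.abs(scores_flipped))

# U*S is also the projection of X onto the top right singular vectors.
projection_matches = np.allclose(scores, X @ Vt[:2].T)

print(same_reconstruction, same_magnitude, projection_matches)
```

So two correct SVD routines can legitimately return mirrored columns; comparing absolute values, as done in the text, is a reasonable way to check agreement.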
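TruncatedSVD offers two algorithms for computing the decomposition: "This estimator supports two algorithms: a fast randomized SVD solver, and a "naive" algorithm that uses ARPACK as an eigensolver on (X * X.T) or (X.T * X), whichever is more efficient."
The first is a fast randomized SVD, described in paper 【1】. The second is the "naive" ARPACK algorithm, implemented with scipy.sparse.linalg.svds, which can also be called directly (see the scipy documentation 【3】). sklearn implements the two methods as separate branches in its code. The svds route can be tried on its own: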

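import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds, eigs

A = csc_matrix([[1, 0, 0], [5, 0, 2], [0, -1, 0], [0, 0, 3]], dtype=float)

# The two largest singular values of A.
u, s, vt = svds(A, k=2)
print(s)

# The same values as square roots of the largest eigenvalues of A * A.T.
r = np.sqrt(eigs(A.dot(A.T), k=2)[0]).real
print(r)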


[ 2.75193379  5.6059665 ]
[ 5.6059665   2.75193379]
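The randomized solver mentioned above can be sketched in a few lines of numpy. This is a simplified illustration of the general idea in 【1】 (random projection, a few power iterations, then a small exact SVD), not sklearn's actual implementation; the function name `randomized_svd_sketch` and its parameters are chosen here for illustration.

```python
import numpy as np


def randomized_svd_sketch(A, k, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top-k SVD of A with a random range finder."""
    rng = np.random.default_rng(seed)
    # Project A onto a few random directions to capture its dominant range.
    omega = rng.standard_normal((A.shape[1], k + n_oversamples))
    Q, _ = np.linalg.qr(A @ omega)
    # Power iterations sharpen the estimate; QR keeps them numerically stable.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small projected matrix, then map back to the full space.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k]


# Test matrix with a known, rapidly decaying spectrum (1, 1/2, 1/4, ...).
rng = np.random.default_rng(42)
U0, _ = np.linalg.qr(rng.standard_normal((200, 50)))
V0, _ = np.linalg.qr(rng.standard_normal((50, 50)))
true_s = 2.0 ** -np.arange(50)
A = (U0 * true_s) @ V0.T

U, s, Vt = randomized_svd_sketch(A, k=3)
exact_s = np.linalg.svd(A, compute_uv=False)[:3]
print(s)
print(exact_s)
```

The expensive exact SVD runs only on the small matrix `B`, which is where the speed-up on large inputs comes from. On a tiny dataset such as iris that advantage does not show, which is in line with the timings printed earlier.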


【1】 Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Halko, et al., 2009 (arXiv:909)
【2】 Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms

【Author: happyprince, 】