LSA: Overview and Example

Source: Internet. Published by: IP address locator software. Edited by: 程序博客网. Time: 2024/06/06 10:09

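LSA Overview

Put simply, Latent Semantic Analysis (LSA) projects words and documents into a concept space and then clusters them in that space, which enables features such as retrieval at the semantic level.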

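The core of LSA consists of the following steps:

parse stage: represent each document as a bag of words, ignoring stop words and punctuation. For example, the parse(self, doc) function in the example fills a dictionary whose keys are words and whose values are lists of the indices of the documents in which each word occurs (a word may occur several times in the same document, so the values in a list are not necessarily unique).
build stage: construct the count matrix, with one row per word and one column per document; each entry is the number of times that word occurs in that document. In the example, only words that occur more than once overall are kept.
SVD: decompose the count matrix with SVD, keep the dimensions with the largest singular values, and project words and documents into the concept space.
picture: visualize the result in a two-dimensional space to reveal clustering patterns.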
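
The parse, build and SVD steps above can be sketched on a tiny corpus as follows. This is a minimal illustration rather than the example's own code: the helper names build_count_matrix and project are hypothetical, the four short documents are made up, and numpy.linalg.svd is used in place of scipy.linalg.svd.

```python
import numpy as np

def build_count_matrix(docs, stopwords=()):
    """Return (vocab, A), where A[i, j] counts how often vocab[i] occurs in docs[j]."""
    tokenized = [[w for w in d.lower().split() if w not in stopwords] for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for w in toks:
            A[index[w], j] += 1
    return vocab, A

def project(A, k):
    """Truncated SVD: keep the k largest singular values and their vectors."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

# Made-up documents, for illustration only.
docs = ["stock market investing", "value investing",
        "real estate investing", "real estate"]
vocab, A = build_count_matrix(docs)
word_xy, sing, doc_xy = project(A, k=2)
print(vocab)
print(A)
```

Each row of word_xy places a word, and each column of doc_xy places a document, in the same two-dimensional concept space.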

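Using LSA rests on the following assumptions:

Documents are represented as bags of words: only the frequency of the words in a document matters, not their order.
Words for the same concept (words with the same or similar meaning) always end up clustered together.
Polysemy is ignored: each word is assumed to have exactly one meaning.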
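
The bag-of-words assumption can be illustrated with a short sketch. The helper bag_of_words is a hypothetical name and the two sentences are made up; the point is only that two texts with the same words in a different order become indistinguishable.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences; the order of the words is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("rich dad poor dad")
b = bag_of_words("poor dad rich dad")
print(a == b)      # True: the same bag despite the different word order
print(a["dad"])    # 2: only the frequency is kept
```

The single-meaning assumption shows up the same way: a word such as "stock" gets exactly one entry in the bag, whatever sense it carries in a given document.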

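Notes on LSA

After obtaining the count matrix, it is best to apply TF-IDF weighting to determine how important each word is in each document.
The example below omits the first dimension, because it only captures an average: roughly how many words a document contains, or in how many documents a word appears, which carries little meaning. A more general approach is to center the columns of the count matrix first (subtract each column's mean), in which case the first dimension need not be dropped; the drawback is that this turns a sparse matrix into a dense one.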
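
Both notes can be sketched with NumPy. The tfidf function below is a vectorized form of the weighting used in the example's TFIDF method (term frequency within the document times the log of the inverse document frequency). center_columns reads the column normalization as mean-centering, which is the variant that actually makes a sparse matrix dense. Both function names are hypothetical, and the sketch assumes the matrix has no all-zero rows or columns.

```python
import numpy as np

def tfidf(A):
    """Weight counts by (count / words in document) * log(n_docs / docs containing word)."""
    A = np.asarray(A, dtype=float)
    words_per_doc = A.sum(axis=0)          # column sums
    docs_per_word = (A > 0).sum(axis=1)    # in how many documents each word occurs
    n_docs = A.shape[1]
    tf = A / words_per_doc                 # broadcasts over rows
    idf = np.log(n_docs / docs_per_word)
    return tf * idf[:, None]

def center_columns(A):
    """Subtract each column's mean; zero entries generally become nonzero."""
    A = np.asarray(A, dtype=float)
    return A - A.mean(axis=0)

counts = np.array([[1, 1],
                   [1, 0]])
weighted = tfidf(counts)
centered = center_columns(np.array([[1, 0],
                                    [3, 2]]))
print(weighted)
print(centered)
```

A word that occurs in every document gets an IDF of log(1) = 0, so its whole row is weighted to zero, while the centered matrix no longer contains the zero that the input had.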

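Advantages and Disadvantages of LSA

Advantages

Words and documents are mapped into the same concept space, so clustering can be done in that space, and words and documents can be queried against each other (for example, retrieving documents based on a word's position in the concept space).
The concept space has far fewer dimensions than the original matrix, and those dimensions carry more information and less noise.
LSA is a global algorithm, so it can reveal patterns that are otherwise hard to observe.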
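
The word-to-document query mentioned above can be sketched as a cosine similarity in the concept space. This is one common way to do it, not necessarily what the example code intends: the names concept_space and rank_docs_for_word are hypothetical, both word and document vectors are scaled by the singular values, and the small word-by-document matrix is made up.

```python
import numpy as np

def concept_space(A, k):
    """Return word vectors (rows) and document vectors (rows) in k dimensions."""
    U, S, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    word_vecs = U[:, :k] * S[:k]
    doc_vecs = Vt[:k, :].T * S[:k]
    return word_vecs, doc_vecs

def rank_docs_for_word(A, word_index, k=2):
    """Rank documents by cosine similarity to one word in the concept space."""
    word_vecs, doc_vecs = concept_space(A, k)
    q = word_vecs[word_index]
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-12
    sims = doc_vecs @ q / denom
    return np.argsort(-sims), sims

# Rows are words, columns are documents; two obvious topics.
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
order, sims = rank_docs_for_word(A, word_index=0)
print(order, sims)
```

Word 0 only occurs in the first two documents, so those rank first and the other two documents get a similarity close to zero.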

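Disadvantages

LSA assumes a Gaussian distribution and the Frobenius norm, which does not suit every problem. For example, word counts in documents follow a Poisson distribution rather than a Gaussian one.
It cannot handle polysemy; it assumes every word has only one meaning.
It relies heavily on SVD, which is computationally expensive.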
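
The link to the Frobenius norm can be checked numerically. By the Eckart-Young theorem, truncating the SVD to k singular values gives the best rank-k approximation in the Frobenius norm (a least-squares criterion, which is where the Gaussian assumption enters), and the error equals the square root of the sum of the squared discarded singular values. The sketch below uses a made-up matrix of Poisson-distributed counts, and rank_k_approx is a hypothetical helper name.

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation of A in the Frobenius norm, plus all singular values."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vt[:k, :], S

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(8, 5)).astype(float)   # made-up count data

A_k, S = rank_k_approx(A, k=2)
err = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt(np.sum(S[2:] ** 2))
print(err, expected)   # the two values agree up to rounding
```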

LSA Example

The nine document titles used are:

The Neatest Little Guide to Stock Market Investing
Investing For Dummies, 4th Edition
The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
The Little Book of Value Investing
Value Investing: From Graham to Buffett and Beyond
Rich Dad’s Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
Investing in Real Estate, 5th Edition
Stock Investing For Dummies
Rich Dad’s Advisors: The ABC’s of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss

The count matrix is:
[Figure: count matrix]

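After the SVD, the singular values on the diagonal of the matrix S are ranked by importance according to their squares. The result is shown below:
[Figure: bar chart of the importance of each singular value]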
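
The importance measure plotted here is each squared singular value divided by the sum of all squared singular values, as in the picture0 method of the listing. A minimal sketch follows; the function name singular_value_importance is hypothetical and the 2x2 matrix is made up.

```python
import numpy as np

def singular_value_importance(A):
    """Share of each squared singular value in the total."""
    S = np.linalg.svd(np.asarray(A, dtype=float), compute_uv=False)
    return S ** 2 / np.sum(S ** 2)

importance = singular_value_importance(np.diag([3.0, 4.0]))
print(importance)   # singular values 4 and 3 give shares 16/25 and 9/25
```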

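Clustering based on the book title matrix, using dimensions 2 and 3 for a simple clustering, gives the following result:
[Figure: top 3 dimensions of each book title]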

Dim2    Dim3    Titles
red     red     7, 9
red     blue    6
blue    red     2, 4, 5, 8
blue    blue    1, 3
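
A grouping like the one in this table can be derived from the signs of the title coordinates in dimensions 2 and 3. The sketch below assumes that red marks a positive value and blue a negative one, as in matplotlib's bwr colormap; group_by_sign is a hypothetical helper and the coordinates are made up, not the values from the example. The sign of each SVD dimension is arbitrary, so the colors may flip between implementations.

```python
import numpy as np
from collections import defaultdict

def group_by_sign(coords, dims=(1, 2)):
    """Group columns (titles, numbered from 1) by the sign pattern in the given rows."""
    groups = defaultdict(list)
    for j in range(coords.shape[1]):
        key = tuple("red" if coords[d, j] > 0 else "blue" for d in dims)
        groups[key].append(j + 1)
    return dict(groups)

# Made-up coordinates: row 0 = Dim1, row 1 = Dim2, row 2 = Dim3.
coords = np.array([[0.0,  0.0,  0.0],
                   [1.0, -1.0,  1.0],
                   [1.0,  1.0, -1.0]])
groups = group_by_sign(coords)
print(groups)
```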

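Clustering based on both the book title matrix and the word matrix, again using dimensions 2 and 3 for a simple clustering, gives the following result:

[Figure: XY plot of words and titles]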

import numpy as np
import matplotlib.pyplot as plt
from math import log               # needed for TF-IDF
from scipy.linalg import svd

titles = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss",
]
stopwords = ['and', 'edition', 'for', 'in', 'little', 'of', 'the', 'to']
ignorechars = ''',:'!'''


class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords
        self.ignorechars = ignorechars
        self.wdict = {}     # word -> list of indices of the documents it occurs in
        self.dcount = 0     # number of documents parsed so far

    def parse(self, doc):
        # Translation table that deletes every character in ignorechars.
        table = str.maketrans('', '', self.ignorechars)
        for w in doc.split():
            w = w.lower().translate(table)
            if w in self.stopwords:
                continue
            # If 'book' occurs twice in document 0, wdict['book'] holds [0, 0].
            self.wdict.setdefault(w, []).append(self.dcount)
        self.dcount += 1

    def build(self):
        # Keep only words that occur more than once across all documents.
        self.keys = sorted(k for k in self.wdict if len(self.wdict[k]) > 1)
        self.A = np.zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i, d] += 1

    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)

    def picture0(self):
        '''Bar chart of the importance of each singular value (share of its square).'''
        plt.bar(range(len(self.S)), (self.S**2) / np.sum(self.S**2), align="center")
        plt.xticks(range(len(self.S)))
        plt.title("The Importance of Each Singular Value")
        plt.xlabel("Singular Values")
        plt.ylabel("Importance")

    def picture1(self):
        '''Heat map of the top 3 dimensions of each book title.'''
        plt.set_cmap('bwr')
        plt.pcolor(-1 * self.Vt[0:3, :])
        plt.colorbar()
        plt.yticks(np.arange(3) + 0.5, ['Dim1', 'Dim2', 'Dim3'])
        plt.xticks(np.arange(9) + 0.5, ['T%d' % i for i in range(1, 10)])
        plt.gca().invert_yaxis()
        plt.gca().set_aspect('equal')
        plt.xlabel("Book Titles")
        plt.ylabel("Dimensions")
        plt.title("Top 3 Dimensions of Each Book Title")

    def picture2(self):
        '''Annotated scatter plot of words and titles projected into the concept space.'''
        TitleX = -1 * self.Vt[1, :]
        TitleY = -1 * self.Vt[2, :]
        WordX = -1 * self.U[:, 1]
        WordY = -1 * self.U[:, 2]
        # Plot the words with their labels.
        Words = self.keys
        plt.plot(WordX, WordY, 'rs')
        for i in range(len(Words)):
            plt.annotate(Words[i], xy=(WordX[i], WordY[i]), xytext=(2, 6),
                         textcoords='offset points', color='red')
        # Plot the titles with their labels.
        Titles = ['T%d' % i for i in range(1, 10)]
        plt.plot(TitleX, TitleY, 'bo')
        for i in range(len(TitleX)):
            plt.annotate(Titles[i], xy=(TitleX[i], TitleY[i]), xytext=(2, 2),
                         textcoords='offset points', color='blue')
        plt.title('XY plots of Words and Titles')
        plt.xlabel('Dimension 2')
        plt.ylabel('Dimension 3')   # the Y values come from the third dimension

    def TFIDF(self):
        WordsPerDoc = np.sum(self.A, axis=0)
        DocsPerWord = np.sum(np.asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i, j] = (self.A[i, j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])

    def printA(self):
        print('Here is the count matrix')
        print(self.A)

    def printSVD(self):
        print('Here are the singular values')
        print(self.S)
        print('Here are the first 3 columns of the U matrix')
        print(-1 * self.U[:, 0:3])
        print('Here are the first 3 rows of the Vt matrix')
        print(-1 * self.Vt[0:3, :])


# Assumed usage; the listing defines no entry point of its own.
mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
mylsa.printA()
mylsa.calc()
mylsa.printSVD()

References

An excellent resource; most of the content here is drawn from it.

0 0