Latent Semantic Analysis (LSA) Tutorial: An Introduction to LSA (repost)

Source: Internet  Published by: 中国南海知乎  Editor: 程序博客网  Time: 2024/05/14 23:56

Latent Semantic Analysis (LSA) Tutorial

Reposted from: http://blog.csdn.net/yihucha166/article/details/6783212

Translation of: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33.html

        

Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), literally means analyzing documents to find the underlying meaning or concepts of those documents. If each word meant only one concept, and each concept was described by only one word, then LSA would be easy, since there would be a simple mapping from words to concepts.

[Figure: one-to-one mapping between words and concepts]

Unfortunately, this problem is difficult because English has different words that mean the same thing (synonyms), words with multiple meanings, and all sorts of ambiguities that obscure the concepts to the point where even people can have a hard time understanding.

[Figure: confused mapping between words and concepts]

For example, the word bank when used together with mortgage, loans, and rates probably means a financial institution. However, the word bank when used together with lures, casting, and fish probably means a stream or river bank.

How Latent Semantic Analysis Works

Latent Semantic Analysis arose from the problem of how to find relevant documents from search words. The fundamental difficulty arises when we compare words to find relevant documents, because what we really want to do is compare the meanings or concepts behind the words. LSA attempts to solve this problem by mapping both words and documents into a "concept" space and doing the comparison in this space.

(Translator's note: in other words, LSA is a kind of dimensionality reduction technique.)

Since authors have a wide choice of words available when they write, the concepts can be obscured due to different word choices from different authors. This essentially random choice of words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out some of this noise and also attempts to find the smallest set of concepts that spans all the documents.

(Translator's note: why the smallest set?)

In order to make this difficult problem solvable, LSA introduces some dramatic simplifications.

1.     Documents are represented as "bags of words", where the order of the words in a document is not important, only how many times each word appears in a document.

2.     Concepts are represented as patterns of words that usually appear together in documents. For example "leash", "treat", and "obey" might usually appear in documents about dog training.

3.     Words are assumed to have only one meaning. This is clearly not the case (banks could be river banks or financial banks) but it makes the problem tractable.
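The first simplification can be illustrated with a minimal sketch. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for this illustration, not part of the original tutorial; a real tokenizer would also handle punctuation.

```python
from collections import Counter


def bag_of_words(text):
    # Hypothetical helper: lowercase the text and split on whitespace.
    # Only the count of each word is kept; word order is discarded.
    return Counter(text.lower().split())


a = bag_of_words("the dog will obey the trainer")
b = bag_of_words("the trainer the dog will obey")

print(a == b)    # True: same words and counts, different order
print(a["the"])  # 2
```

Two documents with the same words in a different order get the same representation, which is exactly the information LSA chooses to throw away.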

To see a small example of LSA, take a look at the next section.

(Translator's note on simplification 3: what drawbacks does this simplification introduce?)

A Small Example

As a small example, I searched for books using the word “investing” at Amazon.com and took the top 10 book titles that appeared. One of these titles was dropped because it had only one index word in common with the other titles. An index word is any word that:

  • appears in 2 or more titles, and
  • is not a very common word such as “and”, “the”, and so on (known as stop words). These words are excluded because they contribute little (if any) meaning.

In this example we have removed the following stop words: “and”, “edition”, “for”, “in”, “little”, “of”, “the”, “to”.

Here are the 9 remaining titles. The index words (words that appear in 2 or more titles and are not stop words) are: book, dads, dummies, estate, guide, investing, market, real, rich, stock, value.

1.     The Neatest Little Guide to Stock Market Investing

2.     Investing For Dummies, 4th Edition

3.     The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns

4.     The Little Book of Value Investing

5.     Value Investing: From Graham to Buffett and Beyond

6.     Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!

7.     Investing in Real Estate, 5th Edition

8.     Stock Investing For Dummies

9.     Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
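The index-word rule (appears in 2 or more titles, is not a stop word) can be checked mechanically against the 9 titles. The sketch below is one way to do it; the names `tokenize` and `index_words` and the tokenization rule (lowercase, drop apostrophes so that "Dad's" becomes "dads", split on non-alphanumerics) are assumptions for this illustration.

```python
import re
from collections import Counter

TITLES = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee "
    "Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor "
    "and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of "
    "Finding Hidden Profits Most Investors Miss",
]

STOP_WORDS = {"and", "edition", "for", "in", "little", "of", "the", "to"}


def tokenize(title):
    # Assumed rule: lowercase, drop apostrophes, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", title.lower().replace("'", ""))


def index_words(titles, stop_words):
    # Count how many titles each non-stop word occurs in (document frequency),
    # then keep the words that occur in at least two titles.
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(tokenize(title)) - stop_words)
    return sorted(word for word, n in doc_freq.items() if n >= 2)


INDEX_WORDS = index_words(TITLES, STOP_WORDS)
print(len(INDEX_WORDS), INDEX_WORDS)
```

With this tokenization the result is 11 words: book, dads, dummies, estate, guide, investing, market, real, rich, stock, value. No stemming is applied, so "invest" and "investors" are treated as different words from "investing".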

Once Latent Semantic Analysis has been run on this example, we can plot the index words and titles on an XY graph and identify clusters of titles. The 9 titles are plotted with blue circles and the 11 index words are plotted with red squares. Not only can we spot clusters of titles, but since index words can be plotted along with titles, we can label the clusters. For example, the blue cluster, containing titles T7 and T9, is about real estate. The green cluster, with titles T2, T4, T5, and T8, is about value investing, and finally the red cluster, with titles T1 and T3, is about the stock market. The T6 title is an outlier, off on its own.
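As a rough preview of how such XY coordinates could be obtained, the sketch below builds a word-by-title count matrix and applies a singular value decomposition with NumPy. This is an assumption-laden illustration, not the tutorial's own procedure: keeping the first two singular dimensions and scaling them by the singular values are choices made here, and the original plot may use different dimensions or scaling. `TITLE_WORDS` lists, by hand, the index-word occurrences of each title.

```python
import numpy as np

INDEX_WORDS = ["book", "dads", "dummies", "estate", "guide", "investing",
               "market", "real", "rich", "stock", "value"]

# Index-word occurrences per title, T1..T9 ("rich" occurs twice in T6).
TITLE_WORDS = [
    ["guide", "stock", "market", "investing"],
    ["investing", "dummies"],
    ["book", "investing", "stock", "market"],
    ["book", "value", "investing"],
    ["value", "investing"],
    ["rich", "dads", "guide", "investing", "rich"],
    ["investing", "real", "estate"],
    ["stock", "investing", "dummies"],
    ["rich", "dads", "real", "estate", "investing"],
]


def count_matrix(title_words, words):
    # Rows are index words, columns are titles, cells are raw counts.
    row = {w: i for i, w in enumerate(words)}
    A = np.zeros((len(words), len(title_words)))
    for j, tokens in enumerate(title_words):
        for tok in tokens:
            A[row[tok], j] += 1
    return A


A = count_matrix(TITLE_WORDS, INDEX_WORDS)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                            # assumed: keep two dimensions for an XY plot
word_xy = U[:, :k] * s[:k]       # one 2-D point per index word
title_xy = Vt[:k, :].T * s[:k]   # one 2-D point per title

for name, (x, y) in zip(INDEX_WORDS, word_xy):
    print(f"{name:10s} {x:7.2f} {y:7.2f}")
```

Because words and titles end up in the same low-dimensional space, they can be drawn on one graph, which is what allows a cluster of titles to be labeled by the words that fall near it. The signs of the coordinates can differ between SVD implementations, so only relative positions are meaningful.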

(Translator's note: the plot uses only two dimensions.)


In the next few sections, we'll go through all steps needed to run Latent Semantic Analysis on this example.
