Latent Semantic Analysis (LSA) Tutorial, Part 2

Source: Internet. Editor: 程序博客网. Time: 2024/05/14 22:22

A Small Example

As a small example, I searched for books using the word “investing” at Amazon.com and took the top 10 book titles that appeared. One of these titles was dropped because it had only one index word in common with the other titles. An index word is any word that:

  • appears in 2 or more titles, and
  • is not a very common word such as “and”, “the”, and so on (known as stop words). These words are excluded because they contribute little, if any, meaning.

In this example we have removed the following stop words: “and”, “edition”, “for”, “in”, “little”, “of”, “the”, “to”.

Here are the 9 remaining titles. The index words (words that appear in 2 or more titles and are not stop words) are: book, dad's, dummies, estate, guide, investing, market, real, rich, stock, and value.

1.     The Neatest Little Guide to Stock Market Investing

2.     Investing For Dummies, 4th Edition

3.     The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns

4.     The Little Book of Value Investing

5.     Value Investing: From Graham to Buffett and Beyond

6.     Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!

7.     Investing in Real Estate, 5th Edition

8.     Stock Investing For Dummies

9.     Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
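
The index-word rule can be checked against the nine titles with a short Python sketch. This is only an illustration, not the tutorial's own code: the names `tokenize` and `index_words` are made up here, and dropping apostrophes (so that "Dad's" becomes "dads") is an assumed normalization.

```python
import re
from collections import Counter

TITLES = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee "
    "Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor "
    "and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of "
    "Finding Hidden Profits Most Investors Miss",
]

# Stop words listed in the text above.
STOP_WORDS = {"and", "edition", "for", "in", "little", "of", "the", "to"}


def tokenize(title):
    """Lowercase a title and split it into alphanumeric tokens.

    Assumption: apostrophes are dropped, so "Dad's" becomes "dads".
    """
    cleaned = title.lower().replace("'", "").replace("\u2019", "")
    return re.findall(r"[a-z0-9]+", cleaned)


def index_words(titles, stop_words):
    """Return the sorted non-stop words that occur in 2 or more titles."""
    doc_freq = Counter()
    for title in titles:
        # Use a set so a word repeated inside one title counts only once.
        doc_freq.update(set(tokenize(title)) - stop_words)
    return sorted(word for word, n in doc_freq.items() if n >= 2)


if __name__ == "__main__":
    words = index_words(TITLES, STOP_WORDS)
    print(len(words), words)
```

Counting each word once per title (rather than once per occurrence) matters here: "rich" appears twice in title 6, but it qualifies as an index word only because it also appears in title 9.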

Once Latent Semantic Analysis has been run on this example, we can plot the index words and titles on an XY graph and identify clusters of titles. The 9 titles are plotted with blue circles and the 11 index words are plotted with red squares. Not only can we spot clusters of titles, but since index words can be plotted along with titles, we can label the clusters. For example, the blue cluster, containing titles T7 and T9, is about real estate. The green cluster, with titles T2, T4, T5, and T8, is about value investing, and finally the red cluster, with titles T1 and T3, is about the stock market. The T6 title is an outlier, off on its own.


In the next few sections, we'll go through all steps needed to run Latent Semantic Analysis on this example.


Reposted from: http://blog.csdn.net/yihucha166/article/details/6783234
