Short-Text Keyword Extraction with RAKE and TextRank, and an Improvement

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 01:01

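A recent project of mine involved extracting keywords from short texts (Twitter and LinkedIn posts). I mainly used two algorithms, TextRank and RAKE. Their approaches differ considerably, but for short-text keyword extraction RAKE gave noticeably better results.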


Introduction to TextRank

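 TextRank is a graph-based ranking algorithm for text. Its basic idea comes from Google's PageRank: the text is split into units (words or sentences), a graph model is built over them, and a voting mechanism ranks the important components of the text. Keyword extraction and summarization need only the information in the single document itself. Unlike models such as LDA and HMM, TextRank requires no prior training on a collection of documents, and it is widely used because it is simple and effective.
  The general TextRank model can be represented as a directed weighted graph G = (V, E), consisting of a vertex set V and an edge set E, where E is a subset of V × V. The edge from vertex Vj to vertex Vi has weight wji. For a given vertex Vi, In(Vi) is the set of vertices that point to it, and Out(Vi) is the set of vertices that Vi points to. The score of vertex Vi is defined as:

  $$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

  Here d is the damping factor, with a value between 0 and 1. It builds into the model the probability of jumping from a given vertex to another random vertex in the graph (the jump itself happens with probability 1 - d), and it is usually set to 0.85. When computing scores with TextRank, the vertices are given arbitrary initial values and the formula is applied iteratively until convergence, that is, until the change in every vertex's score between two successive iterations falls below a given threshold. This threshold is typically 0.0001.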
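The update rule can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `textrank_scores`, the nested-dict graph layout, and the `max_iter` safety cap are assumptions made for the example.

```python
def textrank_scores(weights, d=0.85, tol=1e-4, max_iter=200):
    """Iterate the weighted TextRank update until every score moves by less than tol.

    weights[j][i] is the weight w_ji of the edge from vertex j to vertex i.
    """
    vertices = set(weights)
    for targets in weights.values():
        vertices.update(targets)
    # Total outgoing weight of each vertex: the sum of w_jk over Out(Vj).
    out_sum = {v: sum(weights.get(v, {}).values()) for v in vertices}
    # Incoming edges, so that In(Vi) can be looked up directly.
    incoming = {v: [] for v in vertices}
    for j, targets in weights.items():
        for i, w in targets.items():
            incoming[i].append((j, w))

    scores = {v: 1.0 for v in vertices}  # arbitrary initial value
    for _ in range(max_iter):
        new_scores = {}
        for i in vertices:
            rank = sum(w / out_sum[j] * scores[j]
                       for j, w in incoming[i] if out_sum[j] > 0)
            new_scores[i] = (1 - d) + d * rank
        delta = max(abs(new_scores[v] - scores[v]) for v in vertices)
        scores = new_scores
        if delta < tol:  # converged: no score changed by tol or more
            break
    return scores


# A small star-shaped graph: "a" is linked to both "b" and "c".
graph = {"a": {"b": 1.0, "c": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0}}
scores = textrank_scores(graph)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy graph the hub "a" receives votes from both neighbors and ends up with the highest score.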

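Keyword extraction with TextRank

  Keyword extraction is the task of automatically extracting a number of meaningful words or phrases from a given text. TextRank ranks candidate keywords using the relations between nearby words (a co-occurrence window), working directly from the text itself. The main steps are: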

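  1. Split the given text T into complete sentences, i.e. T = [S1, S2, ..., Sm].

  2. For each sentence Si, perform word segmentation and part-of-speech tagging, remove stopwords, and keep only words with the specified parts of speech, such as nouns, verbs, and adjectives, i.e. Si = [ti,1, ti,2, ..., ti,n], where the ti,j are the candidate keywords that remain.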

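  3. Build the candidate keyword graph G = (V, E), where the vertex set V consists of the candidate keywords produced in step 2. Edges are built from co-occurrence: two vertices are connected only if the corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur.

  4. Using the formula above, propagate the vertex weights iteratively until convergence.

  5. Sort the vertices by weight in descending order and take the T most important words as candidate keywords (T here is a count, not the text T from step 1).

  6. Mark the T words from step 5 in the original text; where they form adjacent runs, combine them into multi-word keywords. For example, given the sentence "Matlab code for plotting ambiguity function", if "Matlab" and "code" are both candidate keywords, they are combined into "Matlab code" and added to the keyword list.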
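The six steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation used in the project: a tiny hand-written stopword list stands in for the part-of-speech filter of step 2, the co-occurrence window is applied to the filtered words of each sentence, edges are unweighted, and all function names are made up for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); a real run would use a full
# list plus the part-of-speech filter described in step 2.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())


def add_cooccurrences(neighbors, words, window):
    """Step 3: link two candidates that fall inside the same window of `window` words."""
    for idx, word in enumerate(words):
        for other in words[idx + 1: idx + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)


def rank(neighbors, d=0.85, tol=1e-4, max_iter=200):
    """Step 4: unweighted TextRank iteration on the undirected graph."""
    scores = {w: 1.0 for w in neighbors}
    for _ in range(max_iter):
        new = {
            w: (1 - d) + d * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
        delta = max((abs(new[w] - scores[w]) for w in neighbors), default=0.0)
        scores = new
        if delta < tol:
            break
    return scores


def merge_adjacent(tokens, keywords):
    """Step 6: join runs of consecutive top-ranked tokens into multi-word keywords."""
    phrases, run = [], []
    for tok in tokens + [None]:  # the trailing None flushes the last run
        if tok in keywords:
            run.append(tok)
            continue
        if run and " ".join(run) not in phrases:
            phrases.append(" ".join(run))
        run = []
    return phrases


def extract_keywords(text, window=2, top_t=5):
    neighbors = defaultdict(set)
    tokens = []
    for sentence in re.split(r"[.!?]+", text):  # step 1
        sent_tokens = tokenize(sentence)
        tokens.extend(sent_tokens + [None])  # None keeps phrases inside one sentence
        candidates = [t for t in sent_tokens if t not in STOPWORDS]  # step 2
        add_cooccurrences(neighbors, candidates, window)
    scores = rank(neighbors)
    top = set(sorted(scores, key=scores.get, reverse=True)[:top_t])  # step 5
    return merge_adjacent(tokens, top)


# The merging example from step 6: "Matlab" and "code" are adjacent candidates.
print(merge_adjacent(tokenize("Matlab code for plotting ambiguity function"),
                     {"matlab", "code"}))  # -> ['matlab code']
```

Calling `extract_keywords(text, window=2, top_t=5)` runs the whole pipeline; the window size and the number of words kept are the two parameters worth tuning for short texts.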

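    TextRank can also be used to generate automatic summaries of articles. That is not covered here, so I will not go into detail.
    GitHub repository for the TextRank algorithm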


Introduction to RAKE (Rapid Automatic Keyword Extraction)

The idea behind RAKE

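  RAKE extracts keywords, but what it actually extracts are key phrases, and it tends to favor longer ones. In English, a keyword usually consists of several words but rarely contains punctuation or stopwords such as "and", "the", and "of", or other words that carry no semantic information.

  RAKE first uses punctuation (such as ASCII periods, question marks, exclamation marks, and commas) to split a document into clauses. Within each clause it then uses stopwords as delimiters to split the clause into phrases, and these phrases are the candidates for the final extracted keywords.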
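The splitting step can be sketched as below. The stopword list here is a small hand-picked set chosen for the example (an assumption; real RAKE implementations ship a much longer list), and `candidate_phrases` is a name made up for this sketch.

```python
import re

# Small illustrative stopword list (an assumption, not the list used in the project).
STOPWORDS = {"a", "an", "and", "by", "for", "here", "in", "of", "the", "to", "with"}


def candidate_phrases(text):
    """Split text into clauses at punctuation, then into phrases at stopwords."""
    phrases = []
    for clause in re.split(r"[.!?,;:]+", text.lower()):
        current = []
        for word in clause.split():
            if word in STOPWORDS:
                # A stopword ends the current phrase.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:  # flush the phrase at the end of the clause
            phrases.append(" ".join(current))
    return phrases


text = ("Yesterday, Desktop Metal CEO Ric Fulop joined Bloomberg Radio to discuss "
        "the future of metal 3D printing. Listen to the interview here")
print(candidate_phrases(text))
# -> ['yesterday', 'desktop metal ceo ric fulop joined bloomberg radio',
#     'discuss', 'future', 'metal 3d printing', 'listen', 'interview']
```

Because no stopword or punctuation mark falls between "desktop" and "radio", the whole eight-word run survives as a single candidate, which is why RAKE tends to produce long phrases on short texts.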

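  Each phrase can in turn be split on whitespace into words. Each word is assigned a score, and the score of a phrase is the sum of the scores of its words. The key point is to take the co-occurrence of the words within a phrase into account. The formula finally used is:

  • wordScore(w) = wordDegree(w) / wordFrequency(w)

That is, the score of word w is its degree divided by its frequency. Degree is a concept from networks: it increases by 1 each time the word co-occurs with a word in a phrase, the word itself included. Frequency is the total number of times the word occurs in the document.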

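  For each candidate key phrase, the scores of its words are then summed and the phrases are sorted by score. RAKE takes the top third of the candidate phrases as the extracted keywords.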

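  One more note on the score computation: wordDegree(w) is in fact the number of times the word co-occurs with the other words in each phrase, plus the word's frequency. For the full algorithm see the attached paper, "Automatic Keyword Extraction from Individual Documents".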
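A minimal sketch of the scoring, assuming the candidate phrases have already been extracted. The function name `rake_scores` and the hard-coded phrase list are illustrative; the degree is accumulated as "phrase length per occurrence", which equals co-occurrences plus frequency as described above.

```python
from collections import defaultdict


def rake_scores(phrases):
    """Score each phrase as the sum of wordDegree(w) / wordFrequency(w) over its words."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        words = phrase.split()
        for w in words:
            freq[w] += 1
            # len(words) - 1 co-occurring words, plus 1 for the word itself.
            degree[w] += len(words)
    word_score = {w: degree[w] / freq[w] for w in freq}
    phrase_score = {p: sum(word_score[w] for w in p.split()) for p in phrases}
    return sorted(phrase_score.items(), key=lambda kv: kv[1], reverse=True)


phrases = ["yesterday", "desktop metal ceo ric fulop joined bloomberg radio",
           "discuss", "future", "metal 3d printing", "listen", "interview"]
ranked = rake_scores(phrases)
# Keep the top third of the candidate phrases as keywords.
keywords = ranked[: max(1, len(ranked) // 3)]
print(keywords)
# -> [('desktop metal ceo ric fulop joined bloomberg radio', 61.5), ('metal 3d printing', 11.5)]
```

Here "metal" occurs in two phrases, so its score is (8 + 3) / 2 = 5.5, while every other word in the eight-word phrase scores 8. Long phrases therefore dominate the ranking.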
GitHub repository for the RAKE algorithm


Short-text keyword extraction experiments

RAKE experiments

Test text 1:

“Great interview by Gerry Dick with Ball State University’s new president, Geoffrey Mearns, who recognizes the need to offer curriculum that meets students’ needs. Aidex would welcome the opportunity to introduce our latest learning technologies, including Desktop Metal, metal 3D printing; SynDaver Labs and its lifelike human cadavers; and FANUC America Corporation robotics and CNC technology. These technologies elevate the educational experience and prepare students for fantastic careers. We hope to visit Muncie soon to present these and other STEM technologies.”

Result:

[('fanuc america corporation robotics', 16.0), ('ball state university', 9.0), ('lifelike human cadavers', 9.0), ('including desktop metal', 9.0), ('metal 3d printing', 9.0), ('latest learning technologies', 8.333333333333334), ('stem technologies', 4.333333333333334), ('technologies elevate', 4.333333333333334), ('educational experience', 4.0), ('geoffrey mearns', 4.0), ('syndaver labs', 4.0), ('great interview', 4.0), ('prepare students', 4.0), ('visit muncie', 4.0), ('meets students', 4.0), ('cnc technology', 4.0), ('offer curriculum', 4.0), ('gerry dick', 4.0), ('fantastic careers', 4.0), ('aidex', 1.0), ('recognizes', 1.0), ('introduce', 1.0), ('president', 1.0), ('opportunity', 1.0), ('present', 1.0), ('hope', 1.0)]

The results are sorted by score in descending order.

Test text 2:

“Yesterday, Desktop Metal CEO Ric Fulop joined Bloomberg Radio to discuss the future of metal 3D printing. Listen to the interview here”

Result:

[('desktop metal ceo ric fulop joined bloomberg radio', 61.5), ('metal 3d printing', 11.5), ('yesterday', 1.0), ('interview', 1.0), ('future', 1.0), ('discuss', 1.0), ('listen', 1.0)]

Test text 3:

“3D printing metal on a desktop FDM printer, exclusive interview with The Virtual Foundry founder : Is 2017 going to be the year for 3D printing metal? Recently 3D Printing Industry reported announcements from Markforged about their forthcoming Metal X 3D “

Result:

[('recently 3d printing industry reported announcements', 31.25), ('3d printing metal', 9.916666666666666), ('virtual foundry founder', 9.0), ('desktop fdm printer', 9.0), ('forthcoming metal', 4.666666666666666), ('exclusive interview', 4.0), ('3d', 3.25), ('year', 1.0), ('markforged', 1.0), ('2017', 0)]
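As a check on the scoring rule, some of these scores can be reproduced by hand. This is a sketch that assumes the candidate phrases are exactly those listed in the output, with "3d printing metal" occurring twice in the text. Each numerator adds the length of every phrase occurrence that contains the word, and each denominator is the word's frequency.

```latex
\begin{align*}
\mathrm{score}(\text{3d}) &= \frac{3 + 3 + 6 + 1}{4} = 3.25 \\
\mathrm{score}(\text{printing}) &= \frac{3 + 3 + 6}{3} = 4 \\
\mathrm{score}(\text{metal}) &= \frac{3 + 3 + 2}{3} \approx 2.667 \\
\mathrm{score}(\text{3d printing metal}) &= 3.25 + 4 + 2.667 \approx 9.917 \\
\mathrm{score}(\text{forthcoming metal}) &= \frac{2}{1} + 2.667 \approx 4.667 \\
\mathrm{score}(\text{recently 3d printing industry reported announcements}) &= 4 \cdot 6 + 3.25 + 4 = 31.25
\end{align*}
```

Words that appear only inside one long phrase keep the full phrase length as their score, while frequent words such as "3d" and "metal" are pulled down by their frequency.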

Comparing the two algorithms

Test text:

  • Desktop Metal is proud to welcome Morris Group, Inc.. as an authorized reseller of its metal 3D printing systems in 30 states. With the addition of Desktop Metal’s Studio System™ to its existing lineup of CNC machine tools, Morris Group’s extensive distributor network provides an end-to-end suite of advanced solutions to manufacturers of precision metal parts.

  • In the latest episode of podcast, The Digital Factory, Desktop Metal CEO Ric Fulop shares his thoughts on the state of the metal 3D printing industry.

  • We’re excited to announce our Series D Funding with support from our strategic partners NEA, GV, GE Ventures, among others.

  • Register now for ‘s Metal 3D Printing webinar featuring Desktop Metal and the Studio System, the world’s first office-friendly metal 3D printing system.

  • The latest issue of examines how recent advances make 3D printing a powerful competitor to conventional mass production. Read the full article here, including commentary from Desktop Metal CEO Ric Fulop.

  • We’re honored to be recognized as one of ‘s 50 Smartest Companies of 2017.

  • Desktop Metal is honored to join the prestigious roster of recipients of the World Economic Forum Technology Pioneers program. For the press release, please visit: .

  • See the full list of Technology Pioneers 2017 here: .

  • At RAPID+TCT last month, Desktop Metal CTO Jonah Myerberg spoke with about leveraging metal 3D printing for the full product life cycle, from prototyping to mass production.

  • At RAPID+TCT, Desktop Metal CTO Jonah Myerberg talked to TechCrunch about our metal 3D printing solutions. Check out the video here:

  • This past weekend, Desktop Metal was honored to be recognized as Startup of the Year by the 3D Printing Industry awards. Thank you to all who voted!

  • Yesterday, Desktop Metal CEO Ric Fulop joined Bloomberg Radio to discuss the future of metal 3D printing. Listen to the interview here:

  • Today in the Wall Street Journal: 3D printing is transforming manufacturing, from prototyping to mass production.

  • Desktop Metal CEO Ric Flop joined CNBC’s Squawk Box to discuss the latest in metal 3D printing–from prototyping to mass production.

Keywords extracted with RAKE:

phrase    score
desktop metal ceo ric fulop joined bloomberg radio    52.1515151515
desktop metal ceo ric flop joined cnbc    43.8181818182
metal 3d printing webinar featuring desktop metal    36.1090909091
desktop metal ceo ric fulop shares    34.6515151515
desktop metal cto jonah myerberg spoke    33.3181818182
desktop metal cto jonah myerberg talked    33.3181818182
world economic forum technology pioneers program    29.5
desktop metal ceo ric fulop    28.6515151515
recent advances make 3d printing    23.2909090909
office-friendly metal 3d printing system    20.7909090909
metal 3d printing industry    16.7909090909
metal 3d printing systems    16.7909090909
leveraging metal 3d printing    16.7909090909
3d printing industry awards    16.2909090909
metal 3d printing solutions    15.7909090909
full product life cycle    14.6666666667
metal 3d printing    12.7909090909
metal 3d printing–    11.5909090909
precision metal parts    10.5
desktop metal    9.31818181818
desktop metal’    9.31818181818
strategic partners nea    9.0
extensive distributor network    9.0
wall street journal    9.0
cnc machine tools    9.0
3d printing    8.29090909091
technology pioneers 2017    8.0
conventional mass production    7.5
studio system    5.0
advanced solutions    5.0
studio system™    5.0
full list    4.66666666667
full article    4.66666666667
mass production    4.5
50 smartest companies    4.0
transforming manufacturing    4.0
including commentary    4.0
authorized reseller    4.0
ge ventures    4.0
squawk box    4.0
end-to-end suite    4.0
prestigious roster    4.0
digital factory    4.0
morris group    4.0
past weekend    4.0
press release    4.0
existing lineup    4.0
morris group’    4.0
powerful competitor    4.0
latest episode    3.66666666667
latest issue    3.66666666667
world    3.5
latest    1.66666666667
gv    1.0
month    1.0
voted    1.0
announce    1.0
techcrunch    1.0
recipients    1.0
read    1.0
discuss    1.0
honored    1.0
series    1.0
startup    1.0
prototyping    1.0
year    1.0
funding    1.0
state    1.0
rapid+tct    1.0
recognized    1.0
visit    1.0
addition    1.0
support    1.0
today    1.0
listen    1.0
manufacturers    1.0
30 states    1.0
podcast    1.0
join    1.0
excited    1.0
future    1.0
video    1.0
proud    1.0
examines    1.0
check    1.0
interview    1.0
yesterday    1.0
thoughts    1.0
register    1.0
2017    0

Results with TextRank:

desktop metal d printing production product join joined morris myerberg latest advances advanced solutions fulop distributor network ric machine partners ge pioneers economic

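As you can see, neither algorithm's results are very good, but overall RAKE does somewhat better. I therefore modified the RAKE algorithm for this problem, and the results became: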

phrase    mean score
desktop metal desktop metal    49.5083333333
desktop metal    49.5083333333
morris    42.1666666667
metal 3d printing tool    38.9375
morris group    37.0
metal 3d printing solutions    36.8333333333
make metal 3d printing    35.1583333333
represent desktop metal    34.1166666667
desktop metal offers    34.1166666667
desktop metal products    34.1166666667
metal additive manufacturing    33.3388888889
office-friendly metal 3d printing system    32.7966666667
end-to-end metal 3d printing solutions    30.8066666667
innovative metal 3d printing systems    29.9566666667
metal cutting manufacturers    29.6722222222
precision metal parts    29.0611111111
morris group distributor    28.3055555556
morris company    27.25
bound metal deposition    26.8388888889
3d printing process    23.9
morris group distribution network    23.125
local morris group distributor    22.2916666667
groundbreaking 3d printing technology    21.5083333333
studio system    21.3

(Partial results shown above.)


The code for the improved version is on my GitHub.
