A Very Efficient Python Snippet for Extracting Keywords from Content
Source: Internet  Editor: 程序博客网  Date: 2024/06/05 11:55
# coding=UTF-8
import nltk
from nltk.corpus import brown

# This is a fast and simple noun phrase extractor (based on NLTK)
# Feel free to use it, just keep a link back to this post
# http://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/
# Created by Shlomi Babluki
# May, 2013

# Note: the Brown corpus and the NLTK tokenizer data must be installed
# first, e.g. with nltk.download('brown').

# This is our fast Part of Speech tagger
#############################################################################
brown_train = brown.tagged_sents(categories='news')
regexp_tagger = nltk.RegexpTagger(
    [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # numbers, with optional decimals
     (r'(-|:|;)$', ':'),
     (r'\'*$', 'MD'),
     (r'(The|the|A|a|An|an)$', 'AT'),
     (r'.*able$', 'JJ'),
     (r'^[A-Z].*$', 'NNP'),
     (r'.*ness$', 'NN'),
     (r'.*ly$', 'RB'),
     (r'.*s$', 'NNS'),
     (r'.*ing$', 'VBG'),
     (r'.*ed$', 'VBD'),
     (r'.*', 'NN')                       # default: tag everything else as a noun
     ])
# Bigram tagger backs off to the unigram tagger, which backs off to the regexps
unigram_tagger = nltk.UnigramTagger(brown_train, backoff=regexp_tagger)
bigram_tagger = nltk.BigramTagger(brown_train, backoff=unigram_tagger)
#############################################################################

# This is our semi-CFG; Extend it according to your own needs
#############################################################################
cfg = {}
cfg["NNP+NNP"] = "NNP"
cfg["NN+NN"] = "NNI"
cfg["NNI+NN"] = "NNI"
cfg["JJ+JJ"] = "JJ"
cfg["JJ+NN"] = "NNI"
#############################################################################


class NPExtractor(object):

    def __init__(self, sentence):
        self.sentence = sentence

    # Split the sentence into single words/tokens
    def tokenize_sentence(self, sentence):
        tokens = nltk.word_tokenize(sentence)
        return tokens

    # Normalize brown corpus' tags ("NN", "NN-PL", "NNS" > "NN")
    def normalize_tags(self, tagged):
        n_tagged = []
        for t in tagged:
            if t[1] == "NP-TL" or t[1] == "NP":
                n_tagged.append((t[0], "NNP"))
                continue
            if t[1].endswith("-TL"):
                n_tagged.append((t[0], t[1][:-3]))
                continue
            if t[1].endswith("S"):
                n_tagged.append((t[0], t[1][:-1]))
                continue
            n_tagged.append((t[0], t[1]))
        return n_tagged

    # Extract the main topics from the sentence
    def extract(self):
        tokens = self.tokenize_sentence(self.sentence)
        tags = self.normalize_tags(bigram_tagger.tag(tokens))

        # Keep merging adjacent pairs until no cfg rule applies any more
        merge = True
        while merge:
            merge = False
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = "%s+%s" % (t1[1], t2[1])
                value = cfg.get(key, '')
                if value:
                    merge = True
                    tags.pop(x)
                    tags.pop(x)
                    match = "%s %s" % (t1[0], t2[0])
                    pos = value
                    tags.insert(x, (match, pos))
                    break  # restart the scan after every merge

        matches = []
        for t in tags:
            # To also keep single nouns, use this condition instead:
            # if t[1] == "NNP" or t[1] == "NNI" or t[1] == "NN":
            if t[1] == "NNP" or t[1] == "NNI":
                matches.append(t[0])
        return matches


# Main method, just run "python np_extractor.py"
def main():
    sentence = "Swayy is a beautiful new dashboard for discovering and curating online content."
    np_extractor = NPExtractor(sentence)
    result = np_extractor.extract()
    print("This sentence is about: %s" % ", ".join(result))


if __name__ == '__main__':
    main()
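The merging step inside extract() is easier to follow in isolation. Below is a minimal sketch of it, not the author's code: the rule table copies the cfg dictionary from the snippet, while the function names merge_tags and topics and the hand-written tags in sample are assumptions made for illustration (a real run would get its tags from bigram_tagger). It needs no NLTK data, so it can be run directly.

```python
# Stand-alone sketch of the tag-merging step; no NLTK required.
# The rule table mirrors the cfg dictionary from the post.
CFG = {
    "NNP+NNP": "NNP",
    "NN+NN": "NNI",
    "NNI+NN": "NNI",
    "JJ+JJ": "JJ",
    "JJ+NN": "NNI",
}


def merge_tags(tagged, rules=CFG):
    """Repeatedly merge adjacent (word, tag) pairs that match a rule."""
    tags = list(tagged)  # work on a copy, leave the input untouched
    merged = True
    while merged:
        merged = False
        for i in range(len(tags) - 1):
            (w1, t1), (w2, t2) = tags[i], tags[i + 1]
            new_tag = rules.get(t1 + "+" + t2)
            if new_tag:
                # Replace the two entries with one merged entry
                tags[i:i + 2] = [(w1 + " " + w2, new_tag)]
                merged = True
                break  # restart the scan from the left
    return tags


def topics(tagged):
    """Keep only proper nouns (NNP) and merged noun phrases (NNI)."""
    return [word for word, tag in merge_tags(tagged) if tag in ("NNP", "NNI")]


# Hand-written tags (an assumption, not real tagger output)
sample = [
    ("Swayy", "NNP"), ("is", "BEZ"), ("a", "AT"),
    ("beautiful", "JJ"), ("new", "JJ"), ("dashboard", "NN"),
    ("for", "IN"), ("online", "JJ"), ("content", "NN"),
]
print(topics(sample))
# prints ['Swayy', 'beautiful new dashboard', 'online content']
```

Restarting the scan after each merge keeps the loop simple at the cost of rescanning the list; for single sentences that cost is negligible.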
Reposted from: http://www.open-open.com/code/view/1428844470065/