[NLP - Keyword Extraction] Some hands-on practice with NLP

Source: Internet · Editor: 程序博客网 · Time: 2024/05/23 20:17

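I currently work at an education-focused internet company. The company has launched an internal AI lab, and after coming across some of its internal NLP interfaces I did a little exploring of my own.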

## Quick keyword extraction with rake_nltk

rake_nltk is a library hosted on GitHub: https://github.com/csurfer/rake-nltk

To quickly extract keywords from a passage of text, you can use the following interface:

    from rake_nltk import Rake

    r = Rake()
    my_test = 'My father was a self-taught mandolin player. He was one of the best string instrument players in our town. He could not read music, but if he heard a tune a few times, he could play it. When he was younger, he was a member of a small country music band. They would play at local dances and on a few occasions would play for the local radio station. He often told us how he had auditioned and earned a position in a band that featured Patsy Cline as their lead singer. He told the family that after he was hired he never went back. Dad was a very religious man. He stated that there was a lot of drinking and cursing the day of his audition and he did not want to be around that type of environment.'
    # The text has to be passed to the extractor before the ranking is requested
    r.extract_keywords_from_text(my_test)
    r.get_ranked_phrases_with_scores()

Output:

    [(16.0, 'best string instrument players'), (13.5, 'small country music band'),
     (9.0, 'taught mandolin player'), (9.0, 'never went back'), (9.0, 'featured patsy cline'),
     (8.5, 'local radio station'), (8.0, 'often told us'), (7.833333333333334, 'occasions would play'),
     (5.0, 'read music'), (4.833333333333334, 'would play'), (4.5, 'local dances'),
     (4.0, 'religious man'), (4.0, 'lead singer'), (3.8333333333333335, 'could play'),
     (2.5, 'band'), (2.0, 'told'), (1.5, 'could'), (1.0, 'younger'), (1.0, 'want'),
     (1.0, 'type'), (1.0, 'tune'), (1.0, 'town'), (1.0, 'times'), (1.0, 'stated'),
     (1.0, 'self'), (1.0, 'position'), (1.0, 'one'), (1.0, 'member'), (1.0, 'lot'),
     (1.0, 'hired'), (1.0, 'heard'), (1.0, 'father'), (1.0, 'family'), (1.0, 'environment'),
     (1.0, 'earned'), (1.0, 'drinking'), (1.0, 'day'), (1.0, 'dad'), (1.0, 'cursing'),
     (1.0, 'auditioned'), (1.0, 'audition'), (1.0, 'around')]

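The keyword extraction algorithm works as follows:

1. Split the document into sentences at punctuation marks (such as half-width periods, question marks, exclamation marks, and commas).
2. Split each sentence into phrases, using stopwords as the delimiters.
3. Treat these phrases as the candidates for the final extracted keywords.
4. Score each candidate phrase from the weights of the words it contains, and output the highest-scoring ones as the keywords of the whole text.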
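Steps 1 to 3 can be sketched in plain Python. This is only an illustration of the idea, not rake_nltk's actual implementation: the function name `candidate_phrases`, the sample `TEXT`, and the tiny `STOPWORDS` set are all assumptions made up for this example (rake_nltk uses NLTK's full English stopword list and its own tokenizer).

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"he", "was", "of", "the", "in", "our", "not", "but", "it"}

TEXT = ("He was one of the best string instrument players in our town. "
        "He could not read music, but he could play it.")


def candidate_phrases(text, stopwords):
    """Return candidate phrases as tuples of lower-cased words."""
    phrases = []
    # Step 1: split the text into clauses at punctuation marks.
    for clause in re.split(r"[.!?,;:]+", text.lower()):
        current = []
        # Step 2: walk through the words; a stopword closes the current phrase.
        for word in re.findall(r"[a-z]+", clause):
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                    current = []
            else:
                current.append(word)
        # Step 3: whatever is left at the end of the clause is also a candidate.
        if current:
            phrases.append(tuple(current))
    return phrases


phrases = candidate_phrases(TEXT, STOPWORDS)
for phrase in phrases:
    print(" ".join(phrase))
```

Running it prints one candidate per line, for example `best string instrument players` and `read music`. Because stopwords and punctuation are the only delimiters, runs of consecutive content words stay together, which is why the candidates are often multi-word phrases.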

    print(r.stopwords)  # print the stopwords built into the library; a custom list can also be supplied

Output:

    ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
     'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
     'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
     'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
     'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
     'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by',
     'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before',
     'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
     'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',
     'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such',
     'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can',
     'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain',
     'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn',
     'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

Some further material on keyword extraction: http://blog.csdn.net/mpk_no1/article/details/75201546

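The keywords RAKE extracts are not necessarily single words; they may be phrases. The score of a phrase is the sum of the scores of the words that make it up, and the score of a word depends on its degree and its frequency: score = degree / freq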

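The sentence above is quoted from the article linked earlier. Each phrase is made up of several candidate words, and the score of each word depends on how many times it occurs together with other words.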
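To make score = degree / freq concrete, here is a minimal sketch of the scoring step. It is an illustration under stated assumptions rather than the library's code: `rake_scores` and the hard-coded `phrases` list are hypothetical names and data invented for this example, and a word's degree is taken here as the total length of the phrases it appears in (so a word co-occurs with itself).

```python
from collections import defaultdict

# Hypothetical candidate phrases, as stopword splitting might produce them.
phrases = [
    ("one",),
    ("best", "string", "instrument", "players"),
    ("town",),
    ("could",),
    ("read", "music"),
    ("could", "play"),
]


def rake_scores(phrases):
    """Return (word_score, ranked) where ranked is a list of (score, phrase)."""
    freq = defaultdict(int)    # how many times each word occurs
    degree = defaultdict(int)  # co-occurrence count, including the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    # Word score: degree divided by frequency.
    word_score = {word: degree[word] / freq[word] for word in freq}
    # Phrase score: sum of its word scores; the set removes duplicate phrases.
    scored = {(sum(word_score[w] for w in p), " ".join(p)) for p in phrases}
    return word_score, sorted(scored, reverse=True)


word_score, ranked = rake_scores(phrases)
for score, phrase in ranked:
    print(score, phrase)
```

Here `could` occurs twice, once alone (phrase length 1) and once in `could play` (phrase length 2), so its degree is 3 and its score is 3 / 2 = 1.5. The same arithmetic appears consistent with the rake_nltk output earlier in this post, where `band` scores 2.5: it occurs once alone and once inside `small country music band`, giving a degree of 1 + 4 = 5 over a frequency of 2.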
// TF-IDF keyword extraction in practice
TF-IDF
// Topic-model
Keyword extraction with the LDA topic model in practice
