Using NLPIR for Chinese Word Segmentation and POS Tagging
Background
In many cases, getting useful structure out of text takes more than simply segmenting it and removing stop words. Besides extracting keywords and new words, we often need extra information about each token, such as its part of speech. In Python, NLPIR handles this task well. If you do not have NLPIR set up yet, you can refer to my earlier article on getting NLPIR running quickly (NLPIR快速搭建), or directly download the Chinese NLP package I have already prepared (NLP源码集合).
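To give a quick feel for what that looks like, here is a minimal sketch, assuming the same nlpir Python wrapper used in the full code below (its seg() returns a list of (word, tag) pairs); the sample sentence is the one reused in the practical example further down.

# -*- coding: utf-8 -*-
# Minimal sketch: segment one sentence with the nlpir wrapper and print word/tag pairs.
import nlpir

sentence = '接待钟世镇院士,筹备杨东奇部长接待事宜。'
for word, tag in nlpir.seg(sentence):
    print word, '/', tag, '|',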
The code, which doubles as my notes
# -*- coding: utf-8 -*-
#
# Author: 田丰 (FontTian)
# Created: 2017/7/3
# Email: fonttian@Gmaill.com
# CSDN: http://blog.csdn.net/fontthrone
import nltk
import sys
import nlpir

sys.path.append("../")
reload(sys)
sys.setdefaultencoding('utf-8')

import jieba
from jieba import posseg


def cutstrpos(txt):
    # Segmentation plus POS tagging with jieba
    cutstr = posseg.cut(txt)
    result = ""
    for word, flag in cutstr:
        result += word + "/" + flag + ' '
    return result


def cutstring(txt):
    # Segmentation only
    cutstr = jieba.cut(txt)
    result = " ".join(cutstr)
    return result


# Read the input file
txtfileobject = open('txt/nltest1.txt')
textstr = ""
try:
    filestr = txtfileobject.read()
finally:
    txtfileobject.close()


# Word segmentation with NLPIR2016
def ChineseWordsSegmentationByNLPIR2016(text):
    txt = nlpir.seg(text)
    seg_list = []
    for t in txt:
        seg_list.append(t[0].encode('utf-8'))
    return seg_list


stopwords_path = 'stopwords/stopwords1893.txt'  # stop-word list


# Remove stop words
def ClearStopWordsWithListByNLPIR2016(seg_list):
    mywordlist = []
    liststr = "/ ".join(seg_list)
    f_stop = open(stopwords_path)
    try:
        f_stop_text = f_stop.read()
        f_stop_text = unicode(f_stop_text, 'utf-8')
    finally:
        f_stop.close()
    f_stop_seg_list = f_stop_text.split('\n')
    for myword in liststr.split('/'):
        if not (myword.strip() in f_stop_seg_list) and len(myword.strip()) > 1:
            mywordlist.append(myword)
    return ''.join(mywordlist)


# print filestr
filestr2 = ClearStopWordsWithListByNLPIR2016(ChineseWordsSegmentationByNLPIR2016(filestr)).replace(' ', '')

# Chinese word segmentation plus POS tagging (jieba)
posstr = cutstrpos(filestr2)

print '**** show is end ****'
print ' '
print 'This is posstr'
print posstr

strtag = [nltk.tag.str2tuple(word) for word in posstr.split()]
# for item in strtag:
#     print item

strsBySeg = nlpir.seg(filestr)
strsBySeg2 = nlpir.seg(filestr2)
strsByParagraphProcess = nlpir.ParagraphProcess(filestr, 1)
strsByParagraphProcessA = nlpir.ParagraphProcessA(filestr, ChineseWordsSegmentationByNLPIR2016(filestr)[0], 1)

print ' '
print ' '
print '**** strtag ****'
for word, tag in strtag:
    print word, "/", tag, "|",
print ' '
print ' '
print '**** strsBySeg ****'
for word, tag in strsBySeg:
    print word, "/", tag, "|",
print ' '
print ' '
print '**** strsBySeg2 ****'
for word, tag in strsBySeg2:
    print word, "/", tag, "|",
print ' '
print ' '
print '**** strsByParagraphProcess ****'
print strsByParagraphProcess
# print ' '
# print ' '
# print '**** strsByParagraphProcessA ****'
# for item in strsByParagraphProcessA:
#     print item,
print ' '
print ' '
print '**** show is end ****'
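One step in the script above is easy to miss: nltk.tag.str2tuple turns each word/tag token produced by cutstrpos back into a (word, TAG) tuple (NLTK upper-cases the tag part), which is what the strtag loop then iterates over. A tiny self-contained illustration:

# -*- coding: utf-8 -*-
import nltk

# str2tuple splits on the last '/' and upper-cases the tag part
word, tag = nltk.tag.str2tuple('学习/v')
print word, tag  # -> 学习 V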
A practical example
NLPIR automatically segments and tags person names. Using that feature we can collect custom new words, or pull out the sentences that mention a particular kind of person. Below is test code I wrote recently while working on a project demo.
# -*- coding: utf-8 -*-
#
# Author: 田丰 (FontTian)
# Created: 2017/7/11
# Email: fonttian@Gmaill.com
# CSDN: http://blog.csdn.net/fontthrone
from os import path
from scipy.misc import imread
import matplotlib.pyplot as plt
import jieba
from nlpir import *
from wordcloud import WordCloud, ImageColorGenerator
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

d = path.dirname(__file__)

text = '接待钟世镇院士,筹备杨东奇部长接待事宜。'
stopwords_path = 'stopwords/CNENstopwords.txt'  # stop-word list
number = 10


def ShowByItem(List):
    # Print every item of a list on one line, framed by a header and footer
    print '********* show ', str(List), ' end *********'
    for item in List:
        print item,
    print
    print '********* show ', str(List), ' end *********'


# Use NLPIR2016 to extract academician names
def FindAcademicianNameByNLPIR2016(text, isAddYuanShi):
    txt = seg(text)
    seg_list = []
    for i in range(len(txt)):
        # a person name (tag 'nr') immediately followed by the word '院士' (academician)
        if txt[i][1] == 'nr' and txt[i + 1][0] == '院士':
            if isAddYuanShi == 1:
                seg_list.append(txt[i][0].encode('utf-8') + '院士')
            else:
                seg_list.append(txt[i][0].encode('utf-8'))
    return seg_list


str2 = FindAcademicianNameByNLPIR2016(text, 1)
ShowByItem(str2)

# Output:
# ********* show  ['\xe9\x92\x9f\xe4\xb8\x96\xe9\x95\x87\xe9\x99\xa2\xe5\xa3\xab']  end *********
# 钟世镇院士
# ********* show  ['\xe9\x92\x9f\xe4\xb8\x96\xe9\x95\x87\xe9\x99\xa2\xe5\xa3\xab']  end *********
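One caveat with FindAcademicianNameByNLPIR2016 as written: the loop indexes txt[i + 1], so it will raise an IndexError whenever a person name happens to be the last token of the input. A hedged variant (the helper name is hypothetical) that simply stops one token early:

# Hypothetical variant: iterate only to len(txt) - 1 so txt[i + 1] is always valid.
def FindAcademicianNameSafe(text, isAddYuanShi=1):
    txt = seg(text)
    names = []
    for i in range(len(txt) - 1):
        if txt[i][1] == 'nr' and txt[i + 1][0] == '院士':
            name = txt[i][0].encode('utf-8')
            names.append(name + '院士' if isAddYuanShi == 1 else name)
    return names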
As used in the demo
# Fragment from the demo script; fullContent, df and df1 come from earlier steps not shown here
import pandas as pd


# Use NLPIR2016 to extract academician names (same helper as above)
def FindAcademicianNameByNLPIR2016(text, isAddYuanShi):
    txt = seg(text)
    seg_list = []
    for i in range(len(txt)):
        if txt[i][1] == 'nr' and txt[i + 1][0] == '院士':
            if isAddYuanShi == 1:
                seg_list.append(txt[i][0].encode('utf-8') + '院士')
            else:
                seg_list.append(txt[i][0].encode('utf-8'))
    return seg_list


strAcademicianName = FindAcademicianNameByNLPIR2016(fullContent, 1)
strAcademicianName = list(set(strAcademicianName))

# Store with pandas
dfAcademicianName = pd.DataFrame(strAcademicianName)
dfAcademicianName.columns = ['AcademicianName']
dfAcademicianName.to_csv('csv/dfAcademicianName')

# Load with pandas
dfNewWords = pd.read_csv("csv/dfNewWords")
dfAcademicianName = pd.read_csv("csv/dfAcademicianName")

# You can also add the names to the user dictionary
# add_word(dfAcademicianName['AcademicianName'])


# Extract every report that mentions an academician
def GetAcademicianCSV(df, strColumn, df1):
    dfAcademicianName = pd.read_csv("csv/dfAcademicianName")
    listAcademicianName = list(dfAcademicianName['AcademicianName'])
    print type(listAcademicianName)
    mywordlistAcademicianName = []
    mywordlisttime = []
    mywordAca = []
    df1 = df1.copy()
    numlen = len(df1.index)
    for i in range(numlen):
        for myword in df1.loc[i, strColumn].split():
            if (myword in listAcademicianName) and len(myword) > 1:
                print myword
                mywordlistAcademicianName.append(df.loc[i, strColumn])
                mywordAca.append(myword)
                mywordlisttime.append(df.loc[i, 'time'])
    return mywordlistAcademicianName, mywordlisttime, mywordAca


# Returned information
mywordlistAcademicianName, mywordlisttime, mywordAca = GetAcademicianCSV(df, 'content', df1)
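If you do want to push the extracted names into the user dictionary (the commented-out add_word call above), passing a whole pandas column in one call is probably not what you want; a hedged sketch, assuming the wrapper's add_word accepts a single word per call, is to add them one by one:

# Hedged sketch: assumes the nlpir wrapper's add_word() takes one word per call.
for name in dfAcademicianName['AcademicianName']:
    add_word(name)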
The result is as follows.