Natural Language Processing---New Word Discovery---Generating Bigrams

Source: Internet | Published by: 网络危机公关流程 | Editor: 程序博客网 | Time: 2024/04/28 06:27
record:
Generating bigrams
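# coding: utf-8
'''
Created on 2014-10-15
@author: shifeng
'''
import codecs

# ----------------------
# Read the input file line by line
# 2012_7after_preproces
# testdata
# Note: the file must be UTF-8 encoded, otherwise decoding fails!
with codecs.open(u"D:/shifengworld/NLP/NLP_project/新词发现/data/data_preproces/test/2012_7after_preprocesN.txt", encoding='utf-8') as f:
    text = f.readlines()
# ----------------------
file_object = codecs.open(u"D:/shifengworld/NLP/NLP_project/新词发现/data/data_preproces/test/data_getCW300.txt", 'w', encoding='utf-8')
# ----------------------
'''
Generate bigrams and keep several records along the way:
1. The frequency of each bigram, used for frequency filtering
2. The line numbers on which each bigram occurs, so that its sentences can be retrieved later; the sentiment-orientation step needs this
3. The bigram's neighbours (?), plus the statistics needed for statistical-feature filtering
4.
'''
bigram_freq = {}
record_line_num = 1
for line in text:
    # print(line)
    line = line.rstrip('\r\n')              # drop the line ending so that no bigram contains it
    for i in range(len(line) - 1):
        bigram = line[i:i + 2]
        if bigram in bigram_freq:           # seen before: just add 1
            bigram_freq[bigram] = bigram_freq[bigram] + 1
        else:                               # otherwise, check further
            if ' ' in bigram:               # skip bigrams that contain a space
                pass
            else:                           # first occurrence without a space: set the frequency to 1
                bigram_freq[bigram] = 1
                # ---------------
                # Record the line number of each bigram so that the original sentence
                # can be found again; useful for the later sentiment-orientation task
                # ---------------
    print('Line %d' % record_line_num)
    record_line_num = record_line_num + 1   # this line is done; move on to the next one
'''
-------------------------------------------------
Filter the resulting bigrams.
A. Filtering methods to consider:
1. Frequency filtering
2. Filtering out words that cannot form collocations
3. Statistical-feature filtering
4.
B. In addition, collect statistics for each filtering method, such as how effective it is and how many items it removes
C.
-------------------------------------------------
'''
# ----------------------------1. Frequency filtering--------------------------
lower_bound_fre = 300                       # frequency threshold: bigrams seen fewer than lower_bound_fre times are dropped
for j in list(bigram_freq.keys()):          # iterate over a copy of the keys because entries are deleted
    if bigram_freq[j] < lower_bound_fre:
        del bigram_freq[j]
    else:
        pass
        # print(j, bigram_freq[j])
# After sorting, the result is a list of (bigram, frequency) pairs
dict_sort = sorted(bigram_freq.items(), key=lambda d: d[1], reverse=True)
for k in dict_sort:
    # print(k[0], " ", k[1])
    file_object.write(k[0] + " " + str(k[1]) + "\n")
# -----------------------------2. Dictionary-based filtering of stop words, particles, and other words that cannot form collocations----
# -----------------------------3. Statistical-feature filtering------------------------
# -----------------------------4. Dictionary filtering---------------------------
# -----------------------------5.--------------------------------
file_object.close()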
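The script leaves the line-number bookkeeping (item 2 of its first docstring) as a placeholder comment. Below is a minimal sketch of one way it might be filled in, assuming Python 3. The function names `count_bigrams` and `filter_by_frequency`, the `min_freq` parameter, and the two sample sentences are hypothetical and not part of the original script.

```python
from collections import defaultdict


def count_bigrams(lines):
    """Count character bigrams and remember the lines each one occurs on.

    Returns (freq, where): freq maps bigram -> count, and where maps
    bigram -> sorted list of 1-based line numbers.
    """
    freq = defaultdict(int)
    where = defaultdict(set)
    for line_num, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")          # the line ending is not part of the text
        for i in range(len(line) - 1):
            bigram = line[i:i + 2]
            if " " in bigram:               # skip bigrams that contain a space
                continue
            freq[bigram] += 1
            where[bigram].add(line_num)
    return dict(freq), {b: sorted(nums) for b, nums in where.items()}


def filter_by_frequency(freq, min_freq):
    """Keep bigrams seen at least min_freq times, most frequent first."""
    kept = [(b, n) for b, n in freq.items() if n >= min_freq]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample input; the real script reads a preprocessed corpus file.
sample = ["今天天气很好\n", "天气预报说明天天气也好\n"]
freq, where = count_bigrams(sample)
top = filter_by_frequency(freq, min_freq=2)
```

With this sample, `top` holds 天气 (3 occurrences) followed by 天天 (2), and `where["天气"]` lists both line numbers, so the sentences containing a candidate word can be looked up again for the sentiment step. Storing line numbers in a set avoids duplicates when a bigram occurs several times on the same line.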
0 0