Handling Chinese Characters in Python

Source: Internet  Editor: 程序博客网  Date: 2024/05/20 19:31

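Recently I used Python for the following task: split a large block of text into sentences at the Chinese full stop and question mark, then use a list of patterns to find the sentences that match. The task is simple, but garbled characters kept showing up in the output. After reading many other people's posts, I finally worked out what was going on. Below is a condensed version of the code, with notes on the points that need attention.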

# -*- coding: utf-8 -*-
# The declared encoding must match how this file is actually saved
# (the original post used cp936 and mentions utf-8 as an alternative).

from __future__ import print_function

import re
import sys

if __name__ == '__main__':
    # Three arguments are required: the text file, the pattern list, the output file.
    if len(sys.argv) < 4:
        print("usage: python search.py test.txt word.list outText")
        sys.exit(1)

    # Open every file in binary mode so that each decode/encode step is explicit.
    fin1 = open(sys.argv[2], 'rb')
    fout = open(sys.argv[3], 'wb')
    for curLine in fin1:
        # Decode the pattern from GBK bytes into a Unicode string.
        searchReg = curLine.rstrip().decode('gbk')
        # Decoded text must be encoded again before it is written.
        fout.write(searchReg.encode('gbk'))
        fout.write(b'\n')
        countN = 1
        fin = open(sys.argv[1], 'rb')
        for testLine in fin:
            # Only lines that start with 'GET' carry text to search.
            if not testLine.startswith(b'GET'):
                continue
            # Decode the line so that it has the same type as the pattern.
            testLine = testLine.decode('gbk')
            # Keep the text after the last 'ON:' marker.
            transLine = testLine.rstrip().split(u'ON:')[-1]
            # Split into sentences at the Chinese question mark and full stop.
            transList = re.split(u'[?。]', transLine)
            for item in transList:
                if re.search(searchReg, item):
                    outLine = u'(%d):%s\n' % (countN, item)
                    fout.write(outLine.encode('gbk'))
                    countN += 1
        fin.close()
        fout.write(b'\n')
    fin1.close()
    fout.close()
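
The core step of the script, splitting text at the Chinese full stop and question mark and keeping the sentences that match a pattern, can be sketched on its own. The version below is only an illustration and assumes Python 3, where strings are already Unicode. The names `split_sentences` and `find_matches` and the sample sentence are made up for this example, and unlike the script it drops empty pieces after splitting.

```python
import re

# Sentence terminators named in the post: the Chinese full stop and the
# full-width question mark.
SENTENCE_END = re.compile("[。?]")


def split_sentences(text):
    """Split text at Chinese sentence terminators, dropping empty pieces."""
    return [piece for piece in SENTENCE_END.split(text) if piece]


def find_matches(pattern, text):
    """Return the sentences in text that the regular expression matches."""
    return [s for s in split_sentences(text) if re.search(pattern, s)]


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the original post.
    sample = "今天天气很好。你吃饭了吗?我们去公园散步。"
    matches = find_matches("天气|公园", sample)
    for number, sentence in enumerate(matches, start=1):
        print("(%d):%s" % (number, sentence))
```

Compiling the terminator set once keeps the splitting rule in one place, so adding another terminator (for example the full-width exclamation mark) would only change the character class.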

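Note where encode and decode are called in the script above. The rule is that the pattern and the text being matched must be in the same encoding. If content has been decoded, it must be encoded again before it is written to a file.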
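
As a rough illustration of this rule, the sketch below assumes Python 3 and uses made-up sample data. It decodes both the pattern and the line before matching, shows that Python 3's `re` module refuses to mix a text pattern with undecoded bytes, and encodes the text again before it would be written out. The variable names are chosen for this example only.

```python
import re

# Bytes as they would arrive from GBK-encoded files (sample data, assumed).
pattern_bytes = "天气".encode("gbk")
line_bytes = "今天天气很好。".encode("gbk")

# Decode both sides with the same codec so they can be compared as text.
pattern = pattern_bytes.decode("gbk")
line = line_bytes.decode("gbk")
matched = re.search(pattern, line) is not None

# A text pattern cannot be applied to undecoded bytes: re raises TypeError.
try:
    re.search(pattern, line_bytes)
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoded text has to be encoded again before it is written as bytes.
round_trip = line.encode("gbk")
```

Here `matched` is true only because both sides were decoded with the same codec, and `round_trip` equals the bytes that were read in, which is what keeps the output file free of garbled characters.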
