Python word segmentation script: mind how Python encodes Chinese text
Source: Internet  Editor: 程序博客网  Time: 2024/05/21 05:59
Take care when handling Chinese text and Windows paths that contain Chinese characters, especially with regard to encoding.
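ASCII cannot represent Chinese characters.
Unicode is how Chinese text is represented in memory (the unicode type in Python 2).
UTF-8 is how Chinese text is typically encoded on disk.
You have to convert between the two, especially when reading from or writing to storage.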
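The points above can be checked with a minimal sketch. It assumes Python 3, where str holds Unicode text and bytes holds encoded data (the scripts in this post are Python 2, where the same roles are played by unicode and str). The variable names are only illustrative.

```python
# Assumption: Python 3 (str = Unicode text, bytes = encoded data).
text = "中文"                    # Unicode text held in memory
data = text.encode("utf-8")      # the bytes that would be written to disk
restored = data.decode("utf-8")  # decoding the bytes gives the text back

# ASCII has no code points for Chinese characters, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(text), len(data))  # 2 characters, 6 bytes
print(restored == text, ascii_ok)
```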
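The code below calls decode first so that the UTF-8 bytes read from disk are parsed into Unicode; each character is then encoded back to UTF-8 when it is written out.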
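Why decode before looping over a word? A rough sketch, assuming Python 3 and a throwaway sample file (the file name and its content are made up for illustration): iterating over decoded text yields whole characters, while the raw bytes on disk use three bytes per Chinese character.

```python
import io
import os
import tempfile

# Hypothetical sample file standing in for the corpus; not from the original post.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"北京/ns\n")

# Read the raw bytes exactly as they are stored on disk.
with open(path, "rb") as f:
    raw = f.read()

text = raw.decode("utf-8")       # bytes -> Unicode text
word = text.split("/")[0]        # the part before the tag
chars = list(word)               # one entry per character
encoded = [c.encode("utf-8") for c in chars]  # back to bytes for storage

print(chars)
print([len(b) for b in encoded])
```

Passing an explicit encoding to io.open, as done when writing the sample file, avoids most manual encode and decode calls.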
split is also very useful for word segmentation.
Remember that Python indices start at 0.
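A small sketch of both points, assuming Python 3. The sample line is hypothetical: it only imitates the "word/tag" format that the scripts expect, with a leading ID field at index 0.

```python
import re

# Hypothetical line in the "word/tag" format; not taken from the real corpus.
line = "19980101-01-001-001/m 江泽民/nr 在/p 北京/ns 讲话/v\n"

fields = line.split(" ")   # split the line on single spaces
first = fields[0]          # index 0 is the first field (the ID), not index 1
tokens = fields[1:]        # everything after the ID

# Split each token into [word, tag]; strip() drops the trailing newline.
pairs = [re.split("/", t.strip()) for t in tokens]

print(first)
print(pairs)
```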
# -*- coding: UTF-8 -*-
# Python 2 script: write one character per line with a label.
# First character of a named entity gets e.g. "NR-B", the rest get "I",
# every other character gets "O".
import os, sys
import re

NE_TAGS = ('nr', 'ns', 'nz', 'nt')  # person, place, other proper noun, organization

str2 = 'C:/Users/Hit/Desktop/文本/199801.txt'
path = unicode(str2, "utf8")  # decode the path so Windows can find the Chinese folder
fo = open(path)
fw = open('new.txt', 'w')
count = 0
done = 0
while not done:
    line = fo.readline()
    if line:
        count = count + 1
        if count != 0:  # always true here, so every line is processed
            split_line = line.split(" ")
            clear_time = 1
            for item in split_line:
                if clear_time == 1:
                    # skip the first field of the line
                    clear_time = clear_time + 1
                    continue
                else:
                    term = re.split('/', item)  # [word, tag]
                    if term[0] != '\n':
                        for word in term[1].split():
                            if word in NE_TAGS:
                                count_nr = 0
                                isfirst = 1
                                # decode so the loop yields characters, not bytes
                                for contain in term[0].decode('utf-8'):
                                    count_nr = count_nr + 1
                                    if count_nr == 1 and contain == '[':
                                        continue  # drop a leading bracket
                                    else:
                                        fw.write(contain.encode('utf-8'))
                                        fw.write(' ')
                                        if isfirst == 1:
                                            fw.write(word.upper())
                                            fw.write('-B')
                                            isfirst = isfirst + 1
                                        else:
                                            fw.write('I')
                                        fw.write('\n')
                            else:
                                for contain in term[0].decode('utf-8'):
                                    fw.write(contain.encode('utf-8'))
                                    fw.write(' O\n')
            fw.write('\n')  # blank line between sentences
    else:
        done = 1
fw.close()
fo.close()

# -*- coding: UTF-8 -*-
# Python 2 script, second version: processes only line 4 of the file, prints
# debug output, and compares each tag with the previous one ("pre") so that a
# repeated entity tag is labelled "I" instead of starting a new "-B".
import os, sys
import re

NE_TAGS = ('nr', 'ns', 'nz', 'nt')

str2 = 'C:/Users/Hit/Desktop/文本/199801.txt'
path = unicode(str2, "utf8")
fo = open(path)
fw = open('new.txt', 'w')
count = 0
done = 0
while not done:
    line = fo.readline()
    if line:
        count = count + 1
        if count == 4:
            split_line = line.split(" ")
            print len(split_line)
            pre = ''      # tag of the previous token
            preterm = []
            for num in range(len(split_line)):
                if num == 0:
                    continue  # skip the first field of the line
                else:
                    print "NEW ITERATION :",
                    print num
                    term = re.split('/', split_line[num])
                    print term[0]
                    if term[0] != '\n':
                        word = term[1]
                        if word in NE_TAGS:
                            count_nr = 0
                            isfirst = 1
                            for contain in term[0].decode('utf-8'):
                                count_nr = count_nr + 1
                                if count_nr == 1 and contain == '[':
                                    continue  # drop a leading bracket
                                else:
                                    fw.write(contain.encode('utf-8'))
                                    fw.write(' ')
                                    if isfirst == 1 and word != pre:
                                        # new entity: first character gets "-B"
                                        fw.write(word.upper())
                                        fw.write('-B')
                                    else:
                                        # same tag as before, or a later character
                                        fw.write('I')
                                    isfirst = isfirst + 1
                                    fw.write('\n')
                        else:
                            for contain in term[0].decode('utf-8'):
                                fw.write(contain.encode('utf-8'))
                                fw.write(' O\n')
                        if num == 1:
                            continue
                        # remember this token's tag for the next iteration
                        preterm = re.split('/', split_line[num])
                        pre = preterm[1]
            fw.write('\n')
    else:
        done = 1
fw.close()
fo.close()
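
For readers on Python 3, the labelling idea of the first script can be sketched as a small function. This is a simplified illustration, not the author's code: the function name tag_line, the ENTITY_TAGS set, and the sample line are assumptions, and bracketed compound entities are not handled beyond dropping a leading "[".

```python
# Assumption: Python 3; names and sample data are hypothetical.
ENTITY_TAGS = {"nr", "ns", "nz", "nt"}

def tag_line(line):
    """Return (character, label) pairs for one 'word/tag' line."""
    rows = []
    for item in line.split()[1:]:           # skip the leading ID field
        word, _, tag = item.partition("/")  # "北京/ns" -> ("北京", "/", "ns")
        if tag in ENTITY_TAGS:
            word = word.lstrip("[")         # drop a leading bracket
            for i, ch in enumerate(word):
                # first character: e.g. "NS-B"; following characters: "I"
                label = (tag.upper() + "-B") if i == 0 else "I"
                rows.append((ch, label))
        else:
            rows.extend((ch, "O") for ch in word)
    return rows

rows = tag_line("19980101-01-001-001/m 江泽民/nr 在/p 北京/ns\n")
for ch, label in rows:
    print(ch, label)
```

Because str is already Unicode in Python 3, no decode or encode calls are needed inside the loop; the encoding only has to be given once when the file is opened, for example open(path, encoding="utf-8").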