A Java tool for reading Chinese word-segmented text (Part 1)

Source: Internet  Editor: 程序博客网  Time: 2024/05/21 23:34


import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.util.StringTokenizer;

/*
 * Input format: Chinese text that has already been segmented, with words
 * separated by spaces. The file has a number of lines; each line is one paragraph.
 * Function: walk through the document and return the words one at a time.
 * Two modes:
 * 1. Stop on reaching the end of the document.
 * 2. On reaching the end of the document, start reading again from the beginning.
 */
public class WordReader {
    static final int normalMode = 0;
    static final int againMode = 1;
    int currentMode = 0;
    // BufferedReader br = null;
    RandomAccessFile raf = null;
    StringTokenizer tokenizer = null;
    String nextWord = null;
    int currentLine = 0;
    int allCounts = 0;

    public WordReader(String fileName) throws IOException {
        File file = new File(fileName);
        // br = new BufferedReader(new InputStreamReader(new FileInputStream(file), "utf-8"));
        raf = new RandomAccessFile(file, "r");
    }

    private boolean hasNextWord() throws IOException {
        if (nextWord != null) {
            return true; // a word is already buffered; do not consume another one
        }
        boolean rewound = false;
        // A loop instead of recursion, so long runs of blank lines cannot overflow the stack.
        while (true) {
            if (tokenizer != null && tokenizer.hasMoreTokens()) {
                nextWord = tokenizer.nextToken();
                return true;
            }
            String line = raf.readLine();
            if (line == null) {
                // End of file. In againMode, rewind once; if a full pass after the
                // rewind still finds no word, the file is empty, so give up.
                if (currentMode == normalMode || rewound) {
                    return false;
                }
                raf.seek(0); // start again from the beginning
                rewound = true;
                continue;
            }
            // readLine() maps each byte to one char, so re-decode the bytes as UTF-8.
            line = new String(line.getBytes("iso8859-1"), "utf-8");
            tokenizer = new StringTokenizer(line, " ");
        }
    }

    private String getNextWord() throws IOException {
        if (nextWord != null) {
            String word = nextWord;
            nextWord = null;
            allCounts++;
            return word;
        } else if (hasNextWord()) {
            return getNextWord();
        } else {
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        WordReader wordReader = new WordReader("/home/linger/sources/ParaModel/electronic_seg.txt");
        wordReader.currentMode = WordReader.againMode;
        // while (wordReader.hasNextWord()) // 10329309 words in total
        for (int i = 0; i < 10329319; i++) { // past the end, so the text is read again from the start
            System.out.println(wordReader.getNextWord());
        }
        System.out.println(wordReader.allCounts);
    }
}


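The RandomAccessFile class makes it easy to manipulate the file pointer: calling seek(0) is all it takes to start reading again from the beginning of the file.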
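
As a small illustration of the file-pointer point, here is a minimal sketch. The class name SeekDemo, the helper readFirstLineTwice, and the temporary file are made up for this example and are not part of the tool above. It reads the first line, rewinds with seek(0), and reads the same line again.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical demo class, only for illustration.
public class SeekDemo {

    // Reads the first line, moves the file pointer back to offset 0,
    // then reads the first line a second time.
    static String[] readFirstLineTwice(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            String first = raf.readLine();
            raf.seek(0); // rewind to the beginning of the file
            String again = raf.readLine();
            return new String[] { first, again };
        }
    }

    public static void main(String[] args) throws IOException {
        // A throwaway ASCII file, so no encoding issue gets in the way here.
        File tmp = File.createTempFile("seek-demo", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "a b c\nd e f\n".getBytes(StandardCharsets.US_ASCII));

        String[] lines = readFirstLineTwice(tmp);
        System.out.println(lines[0]);
        System.out.println(lines[1]);
    }
}
```

A BufferedReader has no direct equivalent of seek(0); as far as I know, rewinding one usually means reopening the file or relying on mark/reset with a large enough buffer.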

However, I ran into garbled Chinese characters. Following http://blog.chinaunix.net/uid-15490606-id-211958.html, I fixed it by re-decoding each line:

line = new String(line.getBytes("iso8859-1"),"utf-8");
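
My understanding of why this works: RandomAccessFile.readLine() does not decode text at all. It turns every byte into one char, with the high eight bits set to zero, which is the same mapping ISO-8859-1 uses. Encoding the string back to ISO-8859-1 therefore recovers the original bytes, and those can then be decoded as UTF-8. Below is a minimal sketch of that round trip. The class DecodeDemo and its helper names are hypothetical, and it assumes the file really is UTF-8.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical demo class, only for illustration.
public class DecodeDemo {

    // Recovers the raw bytes of a readLine() result and decodes them as UTF-8.
    static String fixLine(String raw) {
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Reads the first line of a UTF-8 file through RandomAccessFile.
    static String readFirstLineUtf8(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            String raw = raf.readLine(); // one char per byte, not yet real text
            return raw == null ? null : fixLine(raw);
        }
    }

    public static void main(String[] args) throws IOException {
        // Two segmented words, written with Unicode escapes so the source
        // file encoding does not matter.
        String text = "\u4e2d\u6587 \u5206\u8bcd";
        File tmp = File.createTempFile("decode-demo", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), (text + "\n").getBytes(StandardCharsets.UTF_8));

        String fixed = readFirstLineUtf8(tmp);
        System.out.println(fixed.equals(text)); // compares the decoded line with the original
    }
}
```

The StandardCharsets constants are used here instead of the string names "iso8859-1" and "utf-8"; both forms select the same charsets, but the constants avoid the checked UnsupportedEncodingException.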

I am not very proficient with character encodings; when I have time, I will read through this: http://blog.sina.com.cn/s/blog_673c81990100t1lc.html.



