Tokenization with IKAnalyzer

Source: Internet  Editor: 程序博客网  Time: 2024/05/05 11:00

English tokenization

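```java
import java.io.File;
import java.io.FileReader;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class Test {
    public static void main(String[] args) {
        try {
            // Backslashes in Windows paths must be escaped inside Java string literals.
            File file = new File("E:\\workspace\\try\\src\\english.txt");
            FileReader stopWords = new FileReader("E:\\workspace\\try\\src\\stopword.txt");
            Reader reader = new FileReader(file);
            // StandardAnalyzer reads its stop-word list from the given reader.
            Analyzer a = new StandardAnalyzer(Version.LUCENE_20, stopWords);
            TokenStream ts = a.tokenStream("", reader);
            // Fetch the term attribute once; its contents change on each incrementToken().
            CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            int n = 0;
            while (ts.incrementToken()) {
                n++;
                System.out.println("Token " + n + ": " + charTermAttribute.toString());
            }
            ts.close();
            System.out.println("== Total tokens: " + n + " ==");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```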

Chinese tokenization (with stop words loaded)

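```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

// The wrapper class and method names are illustrative; str is the text to segment.
public class ChineseSegmenter {
    public static List<String> segment(String str) throws IOException {
        // Load the stop-word table, one word per line.
        String stopWordTable = "src\\stopword.txt";
        Set<String> stopWordSet = new HashSet<String>();
        BufferedReader stopWordFileBr = new BufferedReader(new FileReader(stopWordTable));
        try {
            String stopWord;
            while ((stopWord = stopWordFileBr.readLine()) != null) {
                stopWordSet.add(stopWord);
            }
        } finally {
            stopWordFileBr.close();
        }

        // Start tokenizing; passing true selects IK's smart segmentation mode.
        IKAnalyzer analyzer = new IKAnalyzer(true);
        List<String> cutedString = new ArrayList<String>();
        StringReader reader = new StringReader(str);
        TokenStream tokenStream = analyzer.tokenStream("text", reader);
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        try {
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                String word = charTermAttribute.toString();
                // Skip tokens that appear in the stop-word set.
                if (stopWordSet.contains(word)) {
                    continue;
                }
                cutedString.add(word);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        reader.close();

        System.out.print("Segmentation result: ");
        for (String word : cutedString) {
            System.out.print(word + '|');
        }
        return cutedString;
    }
}
```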
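The stop-word filtering step does not depend on IKAnalyzer itself, so it can be tried in isolation. Below is a minimal sketch that uses only the JDK. The class name `StopWordFilter` and its methods `load` and `filter` are hypothetical names chosen for illustration, and the hard-coded token list merely stands in for what an analyzer would produce. Unlike the code above, it also trims each line and skips blank ones, which is an added assumption about how the stop-word file is laid out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: stop-word loading and filtering without any Lucene dependency.
public class StopWordFilter {

    // Reads one stop word per line, trimming whitespace and skipping blank lines.
    public static Set<String> load(Reader source) {
        Set<String> stopWords = new HashSet<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    stopWords.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stopWords;
    }

    // Returns the tokens that are not stop words, preserving their order.
    public static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the stop-word file.
        Set<String> stopWords = load(new StringReader("的\n了\n\n是\n"));
        // Stand-in for tokens produced by an analyzer.
        List<String> tokens = Arrays.asList("今天", "的", "天气", "是", "晴天");
        System.out.println(String.join("|", filter(tokens, stopWords)));
    }
}
```

To read a real stop-word list, pass a `FileReader` for the file to `load` instead of the `StringReader`.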
