Chinese Word Segmentation: Selecting the Candidate Set



I. Basic idea:

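0. The dictionary-based segmentation described earlier works because the dictionary's entries are finite, individual words. A candidate term, however, can be made up of several words, for example "Android/x 系统/n 平台/n" or "蓝/n 牙/n". If these words appear consecutively in an article fetched by the crawler, it makes sense to treat "Android系统平台" (Android system platform) and "蓝牙" (Bluetooth) as single terms. In other words, token sequences whose part-of-speech tags follow a pattern such as "x n n" or "n n" are joined into candidate terms. Later stages ("candidate-set filtering", "left/right completeness analysis", and "stability") then evaluate whether each candidate qualifies as a target word.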
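The joining step can be illustrated with a minimal sketch. It assumes the segmenter emits space-separated "word/tag" tokens as in the examples above; the class name `CandidateJoinSketch` and the method `joinTokens` are hypothetical and do not come from the original code.

```java
class CandidateJoinSketch {
    // Joins a run of "word/tag" tokens into one candidate term by dropping the tags.
    static String joinTokens(String taggedSpan) {
        StringBuilder sb = new StringBuilder();
        for (String token : taggedSpan.trim().split("\\s+")) {
            // The tag is everything after the last slash; keep only the word part.
            int slash = token.lastIndexOf('/');
            sb.append(slash >= 0 ? token.substring(0, slash) : token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinTokens("Android/x 系统/n 平台/n")); // Android系统平台
        System.out.println(joinTokens("蓝/n 牙/n"));               // 蓝牙
    }
}
```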

1. First, define rule.txt, which lists the extraction rules, one part-of-speech pattern per line:

n n

n v

v n

ng n p

n n n

rzv q n

x n n

These rules are taken from 《中文产品评价对象的识别研究》.

2. Build a regular expression for each rule

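Approach: read the rule file line by line, build one regular expression per line, and store the results in a List. The regular expression for a run of Chinese characters, letters, and digits is: String regex = "[\u4E00-\u9FA5a-zA-Z0-9]*";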
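As a small illustration of this idea, the sketch below turns a single rule line into its regular expression and checks it against a tagged phrase. It assumes the separator between word and tag is "/" (the value of `ConstantString.slash` is not shown in this post); `RuleRegexSketch` and `buildRegex` are hypothetical names. The backslashes are doubled here so that the regex engine, rather than the Java compiler, interprets the `\u` escapes; both forms match the same characters.

```java
import java.util.regex.Pattern;

class RuleRegexSketch {
    // One word: Chinese characters, ASCII letters, and digits.
    static final String WORD = "[\\u4E00-\\u9FA5a-zA-Z0-9]*";

    // Turns a rule such as "n n" into "<WORD>/n <WORD>/n " (note the trailing space).
    static String buildRegex(String ruleLine) {
        StringBuilder sb = new StringBuilder();
        for (String tag : ruleLine.trim().split("\\s+")) {
            sb.append(WORD).append('/').append(tag).append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String regex = buildRegex("x n n");
        System.out.println(regex);
        // The tagged phrase needs a trailing space to satisfy the final " " in the pattern.
        boolean ok = Pattern.compile(regex).matcher("Android/x 系统/n 平台/n ").matches();
        System.out.println(ok);
    }
}
```

Because every tag is followed by a literal space, the segmented text has to end each token with a space as well, including the last one.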

Implementation:

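import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

/**
 * Reads the extraction rules from a file and returns one regular expression per rule.
 *
 * @param inputPath path to the rule file (one part-of-speech pattern per line)
 * @return the list of regular expressions built from the rules
 * @throws Exception if the rule file does not exist or cannot be read
 */
public static List<String> getRulesFromFile(String inputPath) throws Exception {
    List<String> results = new ArrayList<String>();
    File inputFile = new File(inputPath);
    if (!inputFile.exists()) {
        throw new Exception("inputPath does not exist");
    }
    // Matches one word made of Chinese characters, letters, and digits.
    String regex = "[\u4E00-\u9FA5a-zA-Z0-9]*";
    // try-with-resources closes the reader even if reading fails.
    try (BufferedReader br = new BufferedReader(new FileReader(inputFile))) {
        String temp;
        while ((temp = br.readLine()) != null) {
            // Skip blank lines so they do not produce an empty rule.
            if (temp.trim().isEmpty()) {
                continue;
            }
            String[] tags = temp.trim().split(" ");
            StringBuilder buildRegex = new StringBuilder();
            for (String tag : tags) {
                // Each tag contributes "<word>/<tag> " to the pattern.
                buildRegex.append(regex).append(ConstantString.slash).append(tag).append(" ");
            }
            System.out.println(buildRegex);
            results.add(buildRegex.toString());
        }
    }
    return results;
}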

Result:

3. Use the constructed regular expressions to recursively match the segmented text, and write the matches to a file.

public static void ExtratorWorld2File(String rulesPath, String inputPath, String outputPath) throws Exception {
    if (StringUtil.isEmpty(inputPath) || StringUtil.isEmpty(rulesPath) || StringUtil.isEmpty(outputPath)) {
        throw new Exception("inputPath/rulesPath/outputPath is empty");
    }
    // Create the output directory if it does not exist yet.
    File file = new File(outputPath);
    if (!file.exists()) {
        file.mkdirs();
    }
    List<String> rules = ExtratorWordUtil.getRulesFromFile(rulesPath);
    ArrayList<String> paths = new StringUtil().getAllPath(inputPath);
    for (String path : paths) {
        StringBuilder result = new StringBuilder();
        String content = StringUtil.getContent(path);
        String name = StringUtil.getNameFromPath(path);
        String outputFile = outputPath + ConstantString.slash + name + ConstantString.postText;
        // Apply every rule to the segmented text and append its matches to the output.
        for (String rule : rules) {
            result.append(StringUtil.getContentUseRegex(rule, content, 0, ConstantString.WIN_NextLine))
                  .append(ConstantString.WIN_NextLine);
        }
        StringUtil.String2File(result.toString(), outputFile);
    }
}
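The helper StringUtil.getContentUseRegex is not shown in this post, so its exact behavior is unknown. As a rough sketch of what the matching step might look like, the code below uses an iterative `Matcher.find()` loop in place of recursion and collects non-overlapping matches. The class `CandidateMatchSketch`, the method `findCandidates`, and the sample sentence are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CandidateMatchSketch {
    // Returns every non-overlapping span of the tagged text that matches the rule regex.
    static List<String> findCandidates(String regex, String taggedText) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(taggedText);
        while (m.find()) {
            // Drop the trailing space that the pattern requires after the last tag.
            hits.add(m.group().trim());
        }
        return hits;
    }

    public static void main(String[] args) {
        String word = "[\\u4E00-\\u9FA5a-zA-Z0-9]*";
        String nn = word + "/n " + word + "/n "; // regex for the rule "n n"
        String text = "新/a 的/u 蓝/n 牙/n 功能/n 很/d 好/a ";
        for (String hit : findCandidates(nn, text)) {
            System.out.println(hit);
        }
    }
}
```

Since `find()` resumes after the previous match, overlapping candidates (for instance "牙/n 功能/n" in the sample sentence) are skipped; a recursive version that restarts one token later would report them too.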

Result:

0 0