A Search-based Chinese Word Segmentation Method



 

ABSTRACT

     In this paper, we propose a novel Chinese word segmentation method that leverages the huge volume of Web documents and search technology. It simultaneously addresses the ambiguous phrase boundary resolution and unknown word identification problems. Evaluations demonstrate its effectiveness.

Keywords: Chinese word segmentation, search.


1. INTRODUCTION

      Automatic Chinese word segmentation is an important technique for many areas, including speech synthesis and text categorization [3]. It is challenging because 1) there is no standard definition of words in Chinese, and 2) word boundaries are not marked by spaces. Two main research issues are involved: ambiguous phrase boundary resolution and unknown word identification.

      Previous approaches fall roughly into four categories:

      1) Dictionary-based methods, which segment sentences by matching entries in a dictionary [3]. Their accuracy is determined by the coverage of the dictionary, and drops sharply as new words appear.

      2) Statistical machine learning methods [1], which are typically based on co-occurrences of character sequences. Large annotated Chinese corpora are generally required for model training, and these methods lack the flexibility to adapt to different segmentation standards.

      3) Transformation-based methods [4]. Initially used in POS tagging and parsing, these methods learn a set of n-gram rules from a training corpus and then apply them to new text.

     4) Combining methods [3], which combine two or more of the above methods.

     As the Web prospers, it brings new opportunities to solve many previously "unsolvable" problems. In this paper, we propose to leverage the Web and search technology to segment Chinese words. The typical advantages of this approach include:

     1) It is free from the out-of-vocabulary (OOV) problem, a natural benefit of leveraging Web documents.

     2) It is adaptive to different segmentation standards, since ideally we can obtain all valid character sequences by searching the Web.

     3) It can be entirely unsupervised, requiring no training corpora.


 


2. THE PROPOSED APPROACH


       The approach contains three steps (a code sketch follows the list):

       1) segments collecting,

       2) segments scoring,

       3) segmentation scheme ranking.
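       The following is a minimal Python sketch of this three-step pipeline, intended only as a reading aid; the helper names (collect_segments, score_segments, rank_schemes) are hypothetical placeholders for the steps described in the rest of Section 2 and are not part of the original paper.

```python
from typing import Dict, List


def collect_segments(sentence: str) -> List[str]:
    """Step 1: collect candidate segments for the sentence from Web search (Section 2.1)."""
    raise NotImplementedError


def score_segments(segments: List[str]) -> Dict[str, float]:
    """Step 2: assign each candidate segment a score."""
    raise NotImplementedError


def rank_schemes(sentence: str, scores: Dict[str, float]) -> List[str]:
    """Step 3: rank the segmentation schemes built from scored segments and return the best one."""
    raise NotImplementedError


def segment(sentence: str) -> List[str]:
    """Run the three steps in order: collect, score, rank."""
    segments = collect_segments(sentence)
    scores = score_segments(segments)
    return rank_schemes(sentence, scores)
```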


2.1 Segments Collecting

     The segments are collected in two steps:

     1) First, the query sentence is segmented at punctuation marks, which yields several sub-sentences.

     2) Then each sub-sentence is submitted to a search engine for segment collection. Technically, if the search engine's inverted indices are inaccessible, as is the case for commercial search engines such as Google and Yahoo!, we collect the highlights (the red words in Figure 1) from the returned snippets as the segments. Otherwise, we check the characters' positions indicated by the inverted indices and find those that neighbor each other in the query (see the sketch below).
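     As an illustration of the snippet-based collection path, here is a minimal Python sketch that splits the query sentence at punctuation and then harvests the highlighted terms from the returned snippet HTML. The punctuation set, the <b>/<em> highlight tags, and the search callable are assumptions made for illustration; the paper does not fix a particular search engine API.

```python
import re
from typing import Callable, List

# Assumed punctuation set used to split a sentence into sub-sentences.
PUNCTUATION = "，。！？；：、,.!?;:"


def split_sub_sentences(sentence: str) -> List[str]:
    """Split the query sentence at punctuation into non-empty sub-sentences."""
    parts = re.split("[" + re.escape(PUNCTUATION) + "]", sentence)
    return [p.strip() for p in parts if p.strip()]


def highlights_from_snippets(snippets: List[str]) -> List[str]:
    """Collect highlighted character sequences (<b>/<em> spans) from snippet HTML."""
    segments: List[str] = []
    for html in snippets:
        segments.extend(re.findall(r"<(?:b|em)>(.*?)</(?:b|em)>", html))
    return segments


def collect_segments(sentence: str, search: Callable[[str], List[str]]) -> List[str]:
    """Submit each sub-sentence to `search` (a hypothetical callable returning
    snippet HTML strings) and gather the highlighted segments."""
    segments: List[str] = []
    for sub in split_sub_sentences(sentence):
        segments.extend(highlights_from_snippets(search(sub)))
    return segments
```

     For instance, search("他高兴地说") would be expected to return snippets in which the matched characters are wrapped in highlight tags, as in Figure 1.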


     Although search engines generally have local segmentors, we argue that their performance normally will not affect our results. For example, Figure 1 shows the search results for “他高兴地说” (he said happily); our method assumes that the highlight “他高兴地” (he happily) is a segment. However, by checking the HTML source, we found that Yahoo!’s local segmentor gives “<b>
