English Tokenization in JavaScript

Source: Internet  Editor: 程序博客网  Time: 2024/04/26 21:49

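English tokenization is a commonly used technique in search and text analysis. I recently had an assignment to implement a tokenization system and tried it in JavaScript. Having finished it, I think I would use Python next time, since it probably has more libraries for this kind of work.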

First, lowercase the text and strip whitespace and punctuation.
// Split at word boundaries, swallowing any whitespace before each boundary.
var re = /\s*\b/g;
var string = "In the response, the evidence Profile element contains the sources for " +
             "the evidence and is unique to the Watson pipeline for which it was " +
             "configured. The evidence passage and related information is contained " +
             "in the evidence section.";

// Lowercase everything so that "The" and "the" count as the same token.
string = string.toLowerCase();

// Splitting at word boundaries leaves punctuation marks as separate tokens.
var output = string.split(re);

// Keep only the tokens that are not bare punctuation marks.
var output2 = [];
for (var i = 0; i < output.length; i++) {
    if (output[i] !== ',' && output[i] !== ';' && output[i] !== '.')
        output2.push(output[i]);
}
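
The same cleanup can also be written as a single pass that pulls out runs of letters instead of splitting and then filtering. The following is only a sketch, not the approach used above: the function name `tokenize` and the decision to treat digits and internal apostrophes as part of a word are assumptions made for illustration.

```javascript
// Hypothetical helper: lowercase the text and return its word tokens.
// Assumption: a "word" is a run of ASCII letters or digits, optionally
// with internal apostrophes (so "isn't" stays one token).
function tokenize(text) {
    var matches = text.toLowerCase().match(/[a-z0-9]+(?:'[a-z0-9]+)*/g);
    // String.prototype.match returns null when nothing matches.
    return matches === null ? [] : matches;
}

console.log(tokenize("The evidence isn't empty."));
// → ['the', 'evidence', "isn't", 'empty']
```

Splitting on `\b` would break "isn't" into three pieces, so keeping contractions whole may matter if the tokens are later compared against a stop word list that contains entries such as "isn't".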
Next, remove stop words such as a/an/and/are/then.
var stopWords = ['a','about','above','after','again','against','all','am','an','and','any','are','aren\'t','as','at','be','because',
                 'been','before','being','below','between','both','but','by','can\'t','cannot','could','couldn\'t','did','didn\'t',
                 'do','does','doesn\'t','doing','don\'t','down','during','each','few','for','from','further','had','hadn\'t','has',
                 'hasn\'t','have','haven\'t','having','he','he\'d','he\'ll','he\'s','her','here','here\'s','hers','herself','him',
                 'himself','his','how','how\'s','i','i\'d','i\'ll','i\'m','i\'ve','if','in','into','is','isn\'t','it','it\'s','its',
                 'itself','let\'s','me','more','most','mustn\'t','my','myself','no','nor','not','of','off','on','once','only','or',
                 'other','ought','our','ours','ourselves','out','over','own','same','shan\'t','she','she\'d','she\'ll','she\'s',
                 'should','shouldn\'t','so','some','such','than','that','that\'s','the','their','theirs','them','themselves','then',
                 'there','there\'s','these','they','they\'d','they\'ll','they\'re','they\'ve','this','those','through','to','too',
                 'under','until','up','very','was','wasn\'t','we','we\'d','we\'ll','we\'re','we\'ve','were','weren\'t','what',
                 'what\'s','when','when\'s','where','where\'s','which','while','who','who\'s','whom','why','why\'s','with','won\'t',
                 'would','wouldn\'t','you','you\'d','you\'ll','you\'re','you\'ve','your','yours','yourself','yourselves'];
stopWords.forEach(function(element){
    for(var i = output2.length - 1; i >= 0; i--) {
        if(output2[i] === element) {
            output2.splice(i, 1);
        }
    }
})
console.log(output2);
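
The nested loop above scans the whole token list once per stop word. As a sketch of an alternative, the stop words can be placed in a `Set` and the tokens filtered in one pass. The names `removeStopWords`, `smallStopList`, and `tokens` are hypothetical, and the short list here only stands in for the full list above.

```javascript
// Hypothetical helper: return a new array without the stop words.
// The input array is left unmodified.
function removeStopWords(tokens, stopList) {
    var stopSet = new Set(stopList); // fast membership checks
    return tokens.filter(function (token) {
        return !stopSet.has(token);
    });
}

// Abbreviated stop list, for illustration only.
var smallStopList = ['a', 'an', 'and', 'are', 'in', 'is', 'the', 'then'];
var tokens = ['in', 'the', 'response', 'the', 'evidence', 'profile', 'element'];
console.log(removeStopWords(tokens, smallStopList));
// → [ 'response', 'evidence', 'profile', 'element' ]
```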

This prints the remaining tokens to the console.


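Finally, extract the stems.
Stemming truncates the different forms of a word, removing some prefixes and suffixes while keeping the main part of the word.
There are three mainstream stemming algorithms: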
* Porter Stemming
* Lovins Stemming
* Lancaster Stemming
This post uses the Porter Stemming algorithm.


Porter Stemming Algorithm (http://tartarus.org/~martin/PorterStemmer/js.txt)
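
To give a feel for how rule-based suffix stripping works, here is a sketch of only the first rule group of the Porter algorithm (step 1a, which handles plurals). It is not the linked implementation, and the function name `porterStep1a` is made up for this example; the full algorithm has several more steps, most of them with conditions on the remaining stem.

```javascript
// Sketch of Porter step 1a only: the first matching rule wins.
//   sses -> ss   (caresses -> caress)
//   ies  -> i    (ponies   -> poni)
//   ss   -> ss   (caress   -> caress)
//   s    ->      (cats     -> cat)
function porterStep1a(word) {
    if (/sses$/.test(word)) return word.slice(0, -2);
    if (/ies$/.test(word))  return word.slice(0, -2);
    if (/ss$/.test(word))   return word;
    if (/s$/.test(word))    return word.slice(0, -1);
    return word;
}

['caresses', 'ponies', 'caress', 'cats', 'evidence'].forEach(function (w) {
    console.log(w + ' -> ' + porterStep1a(w));
});
```

Note that the result is a stem, not necessarily a dictionary word: "ponies" becomes "poni".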


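English tokenization has in fact been studied extensively. There are many well-known algorithms, as well as Lucene, the well-known open-source full-text search engine toolkit, which makes it easy for developers to add full-text indexing to different systems. Lucene itself includes three stemming algorithms: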
* EnglishMinimalStemmer
* Porter Stemming
* KStemmer


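One more thing worth mentioning is lemmatisation.
Lemmatisation reduces words in different tenses and persons to their base form. It is very powerful, but it is generally not needed in ordinary use, search included; it mostly comes up in computational linguistics research.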
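
The difference from stemming is easiest to see with irregular forms, which no suffix rule can handle. A minimal sketch, assuming a hand-written lookup table: the table `lemmaTable` and the function `lemmatize` are hypothetical, and real lemmatisers rely on a full dictionary and usually on part-of-speech information as well.

```javascript
// Toy lemma dictionary, for illustration only.
var lemmaTable = {
    'is': 'be', 'are': 'be', 'was': 'be', 'were': 'be',
    'went': 'go', 'goes': 'go',
    'contained': 'contain', 'contains': 'contain'
};

// Hypothetical helper: look the word up, fall back to the lowercased word.
function lemmatize(word) {
    var key = word.toLowerCase();
    return Object.prototype.hasOwnProperty.call(lemmaTable, key)
        ? lemmaTable[key]
        : key;
}

console.log(['was', 'went', 'contains', 'evidence'].map(lemmatize));
// → [ 'be', 'go', 'contain', 'evidence' ]
```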
