English Word Tokenization in JavaScript


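English word tokenization is a technique used very often in search and text analysis. I recently had an assignment to implement a tokenization system and tried it in JavaScript. Having finished it, I think I would probably use Python next time, since it likely has more libraries for this kind of work.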

First, strip out whitespace and punctuation.
// Split on word boundaries, swallowing any whitespace in front of them.
var re = /\s*\b/g;
// Matches the uppercase ASCII letters that need lowercasing.
var re2 = /[A-Z]/g;
var string = "In the response, the evidence Profile element contains the sources for \
the evidence and is unique to the Watson pipeline for which it was \
configured. The evidence passage and related information is contained \
in the evidence section.";

// Lowercase every uppercase letter so that "The" and "the" compare equal.
function convert(propertyName) {
    function upperToLower(match) {
        return match.toLowerCase();
    }
    return propertyName.replace(re2, upperToLower);
}

string = convert(string);
// Punctuation marks come out of split() as tokens of their own.
var output = string.split(re);
// Keep everything except the punctuation tokens.
var output2 = [];
for (var i = 0; i < output.length; i++) {
    if (output[i] !== ',' && output[i] !== ';' && output[i] !== '.') {
        output2.push(output[i]);
    }
}
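One limitation of splitting on word boundaries is that a contraction is broken apart at its apostrophe, so a stop word such as "don't" can never match a token. As a rough alternative sketch (not the approach used in this post), the tokens can be collected with match() instead of split(). The helper name tokenize and the choice to keep only the letters a-z are assumptions made for illustration; digits and non-ASCII letters are simply dropped.

```javascript
// Hypothetical helper: tokenize() is not part of the original code.
function tokenize(text) {
  // Lowercase first, then collect runs of letters. An inner apostrophe is
  // allowed so that contractions such as "wasn't" stay in one piece.
  var matches = text.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g);
  // match() returns null when nothing is found, so fall back to [].
  return matches || [];
}

console.log(tokenize("In the response, it wasn't configured."));
// → [ 'in', 'the', 'response', 'it', "wasn't", 'configured' ]
```

Because punctuation never matches the pattern, no separate punctuation filter is needed with this variant.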
Next, remove stop words such as a/an/and/are/then.
var stopWords = ['a','about','above','after','again','against','all','am','an','and','any','are','aren\'t','as','at','be','because',
'been','before','being','below','between','both','but','by','can\'t','cannot','could','couldn\'t','did','didn\'t',
'do','does','doesn\'t','doing','don\'t','down','during','each','few','for','from','further','had','hadn\'t','has',
'hasn\'t','have','haven\'t','having','he','he\'d','he\'ll','he\'s','her','here','here\'s','hers','herself','him',
'himself','his','how','how\'s','i','i\'d','i\'ll','i\'m','i\'ve','if','in','into','is','isn\'t','it','it\'s','its',
'itself','let\'s','me','more','most','mustn\'t','my','myself','no','nor','not','of','off','on','once','only','or',
'other','ought','our','ours','ourselves','out','over','own','same','shan\'t','she','she\'d','she\'ll','she\'s',
'should','shouldn\'t','so','some','such','than','that','that\'s','the','their','theirs','them','themselves','then',
'there','there\'s','these','they','they\'d','they\'ll','they\'re','they\'ve','this','those','through','to','too',
'under','until','up','very','was','wasn\'t','we','we\'d','we\'ll','we\'re','we\'ve','were','weren\'t','what',
'what\'s','when','when\'s','where','where\'s','which','while','who','who\'s','whom','why','why\'s','with','won\'t',
'would','wouldn\'t','you','you\'d','you\'ll','you\'re','you\'ve','your','yours','yourself','yourselves'];
// Walk backwards so that splice() does not shift the items still to be checked.
stopWords.forEach(function (element) {
    for (var i = output2.length - 1; i >= 0; i--) {
        if (output2[i] === element) {
            output2.splice(i, 1);
        }
    }
});
console.log(output2);
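The nested loop above rescans the token array once per stop word. A possible alternative, sketched here under the assumption that a Set is available (ES2015 or later), makes a single pass over the tokens. The helper name removeStopWords and the short sample lists are hypothetical; in practice the full stopWords array above would be passed in.

```javascript
// Hypothetical helper; the short lists below stand in for real data.
function removeStopWords(tokens, stopList) {
  // A Set gives constant-time membership checks.
  var stopSet = new Set(stopList);
  // filter() returns a new array and leaves the input untouched.
  return tokens.filter(function (token) {
    return !stopSet.has(token);
  });
}

var sampleTokens = ['the', 'evidence', 'is', 'unique', 'to', 'the', 'watson', 'pipeline'];
var sampleStops = ['the', 'is', 'to'];
console.log(removeStopWords(sampleTokens, sampleStops));
// → [ 'evidence', 'unique', 'watson', 'pipeline' ]
```

Unlike the splice() version, this does not modify the original array, which can make the steps easier to test separately.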

This prints the tokens that remain after stop-word removal.

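The last step is stemming.
Stemming truncates the different forms of a word: it strips some prefixes and suffixes but keeps the main part of the word.
There are three mainstream stemming algorithms: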
* Porter Stemming
* Lovins Stemming
* Lancaster Stemming
Here we use the Porter Stemming algorithm.

Porter Stemming Algorithm(http://tartarus.org/~martin/PorterStemmer/js.txt)
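To give a feel for how suffix stripping works, here is a minimal sketch of only the first rule group of the Porter algorithm (step 1a, which handles plural endings). It is not the full stemmer linked above, which applies several further steps; the function name stemPlural is made up for this illustration.

```javascript
// Step 1a of the Porter algorithm only: plural endings.
// stemPlural is a hypothetical name; the full stemmer has more steps.
function stemPlural(word) {
  if (/sses$/.test(word)) return word.slice(0, -2); // caresses -> caress
  if (/ies$/.test(word)) return word.slice(0, -2);  // ponies -> poni
  if (/ss$/.test(word)) return word;                // caress -> caress
  if (/s$/.test(word)) return word.slice(0, -1);    // cats -> cat
  return word;                                      // no plural ending
}

console.log(['caresses', 'ponies', 'caress', 'cats', 'sources'].map(stemPlural));
// → [ 'caress', 'poni', 'caress', 'cat', 'source' ]
```

Note that a stem such as "poni" is not a dictionary word. That is expected: a stemmer only needs to map related forms to the same string.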

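English tokenization has in fact been studied extensively, and there are many well-known algorithms. There is also Lucene, the well-known open-source full-text search engine toolkit, which makes it easy for developers to add full-text indexing to different systems. Lucene itself ships several English stemmers, among them: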
* EnglishMinimalStemmer
* Porter Stemming
* KStemmer

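A side note on lemmatisation
Lemmatisation reduces words in different tenses and persons to their base form. It is very powerful, but it is generally not needed in everyday use, search included; it comes up mainly in computational linguistics research.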
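The difference from stemming can be sketched with a lookup table: instead of chopping off suffixes, each inflected form is mapped to its base form. The table below is a tiny hand-written example and the names lemmaTable and lemmatize are hypothetical; a real lemmatizer relies on a full dictionary and usually on part-of-speech information as well.

```javascript
// Hypothetical, hand-written table for illustration only.
var lemmaTable = {
  is: 'be', was: 'be', were: 'be', are: 'be',
  went: 'go', gone: 'go',
  contains: 'contain', contained: 'contain'
};

function lemmatize(word) {
  // Use hasOwnProperty so inherited keys such as "constructor" are ignored,
  // and fall back to the word itself when the table has no entry.
  return Object.prototype.hasOwnProperty.call(lemmaTable, word)
    ? lemmaTable[word]
    : word;
}

console.log(['was', 'went', 'contained', 'evidence'].map(lemmatize));
// → [ 'be', 'go', 'contain', 'evidence' ]
```

Irregular forms such as "was" and "went" are exactly the cases that suffix stripping cannot handle, which is why lemmatisation needs dictionary data.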

0 0