词形还原-European Language Lemmatizer使用笔记
来源:互联网 发布:mac怎么修改hosts文件 编辑:程序博客网 时间:2024/06/16 09:28
所谓词形还原(lemmatizer),即把一个任何形式的英语单词还原到一般形式,与词根还原不同(stemmer),后者是抽取一个单词的词根。
最近在做跨语言文本分类的研究,需要词形还原,穷搜一番后找到了这个俄罗斯人写的工具,貌似还不错。现在把自己的探索笔记列在这里,以方便来人。
1、下载地址:http://lemmatizer.org/en/download.html
2、依次安装4个组件,其中有一个是俄文词典,可以不装。安装过程很简单:http://lemmatizer.org/en/setup.html。安装时要用到cmake,下载地址:http://www.cmake.org/
3、装英文词典时遇到问题,make报错,找不到turglem/txml.h,跑到usr/local/include/turglem下一看,果然没有,在usr/local/include/txml下有个txml.hpp,拷贝到前一个目录下,名字改成txml.h,再make。还有错:‘void txml::determination_object::load_from_file(const std::string&)’ 的原型不匹配类 ‘txml::determination_object’ 中的任何一个。大意是说txml.cpp的实现中,这个函数与头文件中的声明不一致。于是打开turglem-english-0.2/txml.cpp一看,也没有太大的问题,头文件中声明的方法参数是char *,cpp文件中是string&,于是自己把源代码txml.cpp中的实现改成char *版本。再make,顺利通过!
4、下载例子程序:Usage examples ,cmake、make,run,运行成功。
5、自己写个程序,直接改例子中的test1.c,命名为lemmatizer.cpp,放到例子文件夹下,再改下例子文件夹中的cmakelists.txt,在最后加上一句:
ADD_EXECUTABLE(lemmatizer lemmatizer.cpp)
TARGET_LINK_LIBRARIES(lemmatizer ${TURGLEM_LIBRARY} ${MAFSA_LIBRARY}
${TURGLEM_ENGLISH_LIBRARY})
cmake、make、run,运行成功!
6、测试句子:
Mrs. Clinton, speaking Wednesday afternoon in Seoul after conferring with President Lee Myung-bak, said the South s conclusion that one of its warships had been sunk by a North Korean torpedo was inescapable, and she endorsed Mr. Lee s moves to cut trade ties and redesignate the North its archenemy.
转换后:
Mrs. Clinton, SPEAKING WEDNESDAY AFTERNOON IN SEOUL AFTER CONFER WITH PRESIDENT LEE Myung-bak, SAY THE SOUTH CONCLUSION THAT ONE OF IT WARSHIP HAVE BE SINK BY A NORTH KOREAN TORPEDO BE inescapable, AND SHE ENDORSED Mr. LEE MOVE TO CUT TRADE TIE AND REDESIGNATE THE NORTH IT archenemy.
注意到,有一些词,比如speaking、endorsed没有被转换,前者可以被认为是名词形式,后者则可能是在词典里面没有。
总之European Language Lemmatizer功能比较完善,方便使用,而且本身支持输出一个单词的所有形式,如果再结合pos进行选择的话,效果会更好。
//附:英语文本词形还原程序lemmatizer.cpp#include <turglem/lemmatizer.h>#include <stdio.h>#include <errno.h>#include <string.h>#include <string>#include <sstream>#include <iostream>#include <fstream>#include <turglem/english/charset_adapters.h>#include <MAFSA/charset_adapter.h>using namespace std;MAFSA_conv_string_to_letters my_s2l = 0;MAFSA_conv_letters_to_string my_l2s = 0;string convert(turglem lem, string word){const char * s = word.c_str();MAFSA_letter l[32];ssize_t ssz_src;size_t sz_lem_res;size_t sz;int lem_res[32 * 2];ssz_src = my_s2l(s, l, 32);if (ssz_src < 0){ //含有非英语字符,直接返回,不转换 //printf("can't convert word <%s> to letters! Is it english word in UTF-8?/n", s); return word;} sz_lem_res = turglem_lemmatize(lem, l, ssz_src, lem_res, 32, ENGLISH_LETTER_DELIM, 1); if (sz_lem_res == 0){ //没有转换结果,直接返回 return word;}MAFSA_letter out_letters[32];char buf[500];sz = turglem_build_form(lem, l, ssz_src, out_letters, 32, lem_res[0], lem_res[1], 0);sz = my_l2s(out_letters, sz, buf, 500);return string(buf);}string strToLower(string str){ for(int i = 0; i < str.size(); i++) { if(isalpha(str[i])) { str[i] = tolower(str[i]); } } return str;}int main(int argc, char **argv){turglem lem = 0;int err_no;int err_what;if (argc != 3){ cout << "command infile ofile" << endl; return 1;} string inpath(argv[1]); string opath(argv[2]); fstream infile(inpath.c_str(), ios_base::in); fstream ofile(opath.c_str(), ios_base::out); if(!infile || !ofile) { cout << "open file error!" << endl; return 1; }//加载词典cout << "we try to load english dictionaries from /usr/local/share/turglem/english/n";lem = turglem_load("/usr/local/share/turglem/english/dict_english.auto", "/usr/local/share/turglem/english/prediction_english.auto", "/usr/local/share/turglem/english/paradigms_english.bin", &err_no, &err_what);if (0 == lem){ printf("FAILED: err_no/err_what = %d/%d/n (error loading %s: %s: %s)/n", err_what, err_no, turglem_error_what_string(err_what), turglem_error_no_string(err_no), strerror(errno)); return -1;}my_s2l = LEM_ENGLISH_conv_string_to_letters_utf8;my_l2s = LEM_ENGLISH_conv_letters_to_string_utf8; string rmchars = ",./?<>[]{}/"/';:1234567890-+=_()"; string line; //此处的文件格式为,每行一个文本,文本中各个单词之间用空格分开 while(getline(infile, line)) { line = strToLower(line); for(int j = 0; j < line.size(); j++) { for(int i = 0; i < rmchars.size(); i++) { if(line[j] == rmchars[i]) { line[j] = ' '; } } } istringstream iss(line); string word; while(iss >> word) { string normWord = convert(lem, word); ofile << strToLower(normWord) << " "; } ofile << endl; }turglem_close(lem);return 0;}
- 词形还原-European Language Lemmatizer使用笔记
- 如何使用European english lemmatizer
- 英文单词词形还原程序
- 词干提取和词形还原
- 词干提取和词形还原
- 【java】使用Stanford CoreNLP处理英文(词性标注/词形还原/解析等)
- 采用Stanford CoreNLP实现英文单词词形还原
- 短语高级识别和词形还原
- 采用Stanford CoreNLP实现英文单词词形还原
- 词干提取(stemming)和词形还原(lemmatization)
- 词干提取(stemming)和词形还原(lemmatization)
- 词干提取(stemming)和词形还原(lemmatization)
- 词干提取(stemming)和词形还原(lemmatization)
- 词干提取(stemming)和词形还原(lemmatization)
- 词干提取(stemming)与词形还原(lemmatization)
- 词形还原(lemmatization)和词性辨别(PartOfSpeech)工具
- 词干提取(stemming)和词形还原(lemmatization)比较
- 词形归并
- Dos命令大全
- 用JDBC封装CRUD操作(个人总结)
- Red Hat Linux下python lxml的配置安装
- 汇编笔记--循环
- android源码下载
- 词形还原-European Language Lemmatizer使用笔记
- C语言库函数access的使用
- suse sendmail
- POJ_1014_Dividing
- 说下新浪和腾讯微博的现状
- Tomcat独立布暑项目
- 获得本地所有的网络适配器及其信息
- 我来了
- Androidd 的第六感幻想曲