Stopword lists for 43 languages, all in one place

Source: Internet. Editor: 程序博客网. Date: 2024/04/28 18:39

https://github.com/6/stopwords


stopwords

Stopwords for various languages in JSON format. Per Wikipedia:

Stop words are words which are filtered out prior to, or after, processing of natural language data [...] these are some of the most common, short function words, such as the, is, at, which, and on.

You can load every list at once from stopwords-all.json (keyed by ISO 639-1 language code), or use the table below to find the stopword file for an individual language.
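As a rough illustration of how the combined file might be used, here is a minimal Python sketch. It assumes stopwords-all.json is a JSON object mapping each language code to an array of words (for example {"en": ["a", "about", ...]}); that shape is inferred from the "keyed by language" description and is not verified here. The helper names load_stopwords and remove_stopwords are hypothetical and are not part of the repository.

```python
import json
from pathlib import Path


def load_stopwords(path, lang):
    """Return the stopword set for one language code.

    Assumes the file is a JSON object of the form
    {"en": ["a", "about", ...], "de": [...], ...}.
    """
    with Path(path).open(encoding="utf-8") as fh:
        by_lang = json.load(fh)
    # Raises KeyError if the language code is not in the file.
    return set(by_lang[lang])


def remove_stopwords(tokens, stopwords):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [tok for tok in tokens if tok.lower() not in stopwords]
```

With a local copy of the file, usage could look like remove_stopwords("The cat is on the mat".split(), load_stopwords("stopwords-all.json", "en")). Converting the list to a set keeps each membership check constant-time, which matters for the larger lists such as Hungarian or Finnish.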

Languages

There are a total of 43 supported languages:

Language      Stopword count  Filename
Arabic        162             ar.json
Armenian      45              hy.json
Basque        98              eu.json
Bengali       116             bn.json
Breton        126             br.json
Bulgarian     259             bg.json
Catalan       218             ca.json
Chinese       542             zh.json
Croatian      179             hr.json
Czech         346             cs.json
Danish        101             da.json
Dutch         275             nl.json
English       570             en.json
Esperanto     173             eo.json
Estonian      35              et.json
Finnish       772             fi.json
French        606             fr.json
Galician      160             gl.json
German        596             de.json
Greek         75              el.json
Hebrew        194             he.json
Hindi         225             hi.json
Hungarian     781             hu.json
Indonesian    355             id.json
Irish         109             ga.json
Italian       623             it.json
Japanese      109             ja.json
Korean        679             ko.json
Latin         49              la.json
Latvian       161             lv.json
Marathi       99              mr.json
Norwegian     172             no.json
Persian       332             fa.json
Polish        260             pl.json
Portuguese    408             pt.json
Romanian      282             ro.json
Russian       539             ru.json
Slovak        110             sk.json
Slovenian     446             sl.json
Spanish       577             es.json
Swedish       401             sv.json
Thai          115             th.json
Turkish       279             tr.json

Sources

  • Apache Lucene - Apache 2.0 License
  • Carrot2 - License
  • cue.language - Apache 2.0 License
  • Jacques Savoy - BSD License
  • SMART Information Retrieval System: ftp://ftp.cs.cornell.edu/pub/smart/

License and Copyright

Copyright (c) 2015 Peter Graham, contributors. Released under the Apache-2.0 license.

