Commonly Used Datasets for Natural Language Processing

Source: Internet  Editor: 程序博客网  Date: 2024/05/21 19:24

Treebanks and annotated corpora useful for training POS taggers, parsers, etc.

  • Penn Treebank http://www.cis.upenn.edu/~treebank/home.html
  • WSJ Corpus https://catalog.ldc.upenn.edu/LDC2000T43
  • NEGRA German corpus http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/
  • Tiger corpus http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
  • Alpino Treebank http://odur.let.rug.nl/~vannoord/trees/
  • BulTreeBank http://www.bultreebank.org/
  • Turin University Treebank http://www.di.unito.it/~tutreeb/
  • Prague Dependency Treebank http://ufal.mff.cuni.cz/pdt2.0/
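
As a rough illustration of how a treebank can feed a POS tagger, the sketch below pulls (word, tag) pairs out of a bracketed parse in the style used by the Penn Treebank. The function name `read_tagged_words` and the sample sentence are hypothetical, and the regular expression is a simplification: it keeps every preterminal it finds, including trace elements such as `-NONE-` nodes, which a real pipeline would probably filter out.

```python
import re


def read_tagged_words(bracketed):
    """Return (word, tag) pairs from a Penn Treebank-style bracketed parse."""
    # A preterminal looks like "(TAG word)": a tag followed by a single token,
    # with no nested parentheses inside it.
    pairs = re.findall(r"\(([^\s()]+)\s+([^\s()]+)\)", bracketed)
    return [(word, tag) for tag, word in pairs]


# Hypothetical sentence written for this example, not taken from any corpus.
sample = "(S (NP (DT The) (NN parser)) (VP (VBZ reads) (NP (NNS trees))) (. .))"
tagged = read_tagged_words(sample)
print(tagged)
```

The resulting pairs are the kind of supervised input a tagger is trained on; the phrase-level labels (S, NP, VP) are ignored here and would matter only for parser training.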

Corpora annotated with semantic relations
  • PropBank
  • NomBank http://nlp.cs.nyu.edu/meyers/NomBank.html
  • FrameNet http://framenet.icsi.berkeley.edu/
  • SALSA http://www.coli.uni-saarland.de/projects/salsa/page.php?id=index

Text classification corpora
  • Reuters-21578 dataset http://www.daviddlewis.com/resources/testcollections/reuters21578/
  • 20 Newsgroups dataset http://people.csail.mit.edu/jrennie/20Newsgroups/

Parallel corpora used in machine translation
  • EMILLE http://www.lancs.ac.uk/fass/projects/corpus/emille/

Text summarization

  • DUC-2001, 2002, 2003, 2004, 2005, 2006, 2007 http://www-nlpir.nist.gov/projects/duc/data.html
  • TAC-2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015 http://tac.nist.gov/data/
  • Gigaword https://catalog.ldc.upenn.edu/LDC2012T21
  • LCSTS http://icrc.hitsz.edu.cn/Article/show/139.html

Machine Reading

  • NewsQA (built on CNN articles) http://datasets.maluuba.com/NewsQA
  • MS MARCO (paper) https://arxiv.org/abs/1611.09268
  • MS MARCO (website) http://www.msmarco.org/
  • SQuAD https://www.aclweb.org/anthology/D16-1264
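
Machine reading datasets of this kind usually pair a passage with questions and answer spans. The sketch below walks a record laid out like the SQuAD v1.1 JSON release (`data`, `paragraphs`, `context`, `qas`, `answers`, `answer_start`). The record itself is made up for illustration, and `iter_examples` is a hypothetical helper, not part of any official toolkit; a real file would be read with `json.load` first.

```python
# Hypothetical record that follows the SQuAD v1.1 layout; the text is invented.
sample = {
    "version": "1.1",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "The corpus contains forty documents.",
            "qas": [{
                "id": "q1",
                "question": "How many documents does the corpus contain?",
                "answers": [{"text": "forty", "answer_start": 20}],
            }],
        }],
    }],
}


def iter_examples(dataset):
    """Yield (question, context, answer_text, answer_start) tuples."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (qa["question"], context,
                           answer["text"], answer["answer_start"])


examples = list(iter_examples(sample))
for question, context, text, start in examples:
    # The character offset should point at the answer span inside the context.
    assert context[start:start + len(text)] == text
    print(question, "->", text)
```

Checking that each offset really lands on the answer text is a cheap sanity test worth running before training on any span-based dataset.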

Others
  • TREC
  • SemEval http://alt.qcri.org/semeval2017/index.php?id=tasks
  • Microsoft COCO http://mscoco.org/