Commonly Used NLP Corpora


1. Sogou News Corpus

The Sogou News corpus (搜狗新闻语料库) contains 2,909,551 news articles in total, spread across various topic channels.
Reference [1] describes and uses it as follows:

There are a large number categories but most of them contain only few articles. We choose 5 categories – “sports”, “finance”, “entertainment”, “automobile” and “technology”. The number of training samples selected for each class is 90,000 and testing 12,000.
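The selection described in the quote (keep five categories, then take a fixed number of articles per class) can be sketched roughly as below. This is only an illustration, not the procedure from reference [1]: the `(category, text)` input layout, the function name `balanced_split`, and the random shuffling are all assumptions. With the quoted numbers, it would yield 5 × 90,000 = 450,000 training samples and 5 × 12,000 = 60,000 test samples.

```python
import random
from collections import defaultdict

# The five categories named in the quoted passage.
CATEGORIES = ("sports", "finance", "entertainment", "automobile", "technology")


def balanced_split(articles, categories=CATEGORIES,
                   n_train=90_000, n_test=12_000, seed=0):
    """Return (train, test) lists with a fixed number of samples per class.

    `articles` is assumed to be an iterable of (category, text) pairs;
    this layout is hypothetical, not the corpus's real file format.
    """
    by_class = defaultdict(list)
    for category, text in articles:
        if category in categories:      # drop all other categories
            by_class[category].append(text)

    rng = random.Random(seed)           # fixed seed for a reproducible split
    train, test = [], []
    for category in categories:
        texts = by_class[category]
        if len(texts) < n_train + n_test:
            raise ValueError(f"not enough articles for {category!r}")
        rng.shuffle(texts)
        train += [(category, t) for t in texts[:n_train]]
        test += [(category, t) for t in texts[n_train:n_train + n_test]]
    return train, test
```

Passing smaller values for `n_train` and `n_test` makes it easy to try the function on a toy list before running it on the full corpus.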

2. YFCC 100M

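A multimedia dataset from Yahoo Labs; its uses are not limited to NLP. The link is given in reference [3].
It contains roughly 100 million images and about 1 million videos, each with a title, a caption (description), and tags.
Its annotation is multi-label: a photo of a small dog, for example, may be tagged animal / puppy / pet / Pekingese, and so on.
The FastText paper uses it for tag prediction.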
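Because each item carries several tags, tag prediction is a multi-label task. As a minimal sketch, the helper below writes one example in the plain-text layout that fastText's supervised mode reads, where every label is a token starting with the default `__label__` prefix, followed by the text. The helper name `to_fasttext_line` and the sample caption are made up for illustration; how the FastText paper actually preprocessed YFCC 100M is not described here.

```python
def to_fasttext_line(tags, text, prefix="__label__"):
    """Format one multi-label example as a single fastText-style training line."""
    # Labels must not contain whitespace, so join multi-word tags with underscores.
    labels = " ".join(prefix + "_".join(tag.split()) for tag in tags)
    # One example per line: collapse newlines and repeated spaces in the text.
    clean_text = " ".join(text.split())
    return f"{labels} {clean_text}"


# Hypothetical example: a dog photo with three tags and a two-line caption.
print(to_fasttext_line(["animal", "puppy", "pet"], "my dog playing\nin the park"))
# prints: __label__animal __label__puppy __label__pet my dog playing in the park
```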

References

  1. Character-level Convolutional Networks for Text Classification
  2. Sogou Labs (搜狗实验室)
  3. YFCC 100M