数据集汇总

来源：互联网发布：免费淘客软件编辑：程序博客网时间：2024/05/18 02:38

图形图像处理：

1. CIFAR-10 & CIFAR-100

CIFAR-10包含10个类别，50,000个训练图像，彩色图像大小：32x32，10,000个测试图像。

（类别：airplane，automobile, bird, cat, deer, dog, frog, horse, ship, truck）

（作者：Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton）

（数据格式：Python版本、Matlab版本、二进制版本<for C程序>）

CIFAR-100与CIFAR-10类似，包含100个类，每类有600张图片，其中500张用于训练，100张用于测试；这100个类分组成20个超类。每个图像有一个"find" label和一个"coarse"label。

2. 图像分类结果及对应的论文

图像分类结果及应的论文，包含数据集：MNIST、CIFAR-10、CIFAR-100、STL-10、SVHN、ILSVRC2012 task 1

ILSVRC： ImageNet Large Scale Visual Recognition Challenge

3. ImageNet

ImageNet相关信息如下：

1）Total number of non-empty synsets: 21841
2）Total number of images: 14,197,122
3）Number of images with bounding box annotations: 1,034,908
4）Number of synsets with SIFT features: 1000
5）Number of images with SIFT features: 1.2 million

4. COCO

COCO(Common Objects in Context)是一个新的图像识别、分割、和字幕数据集，它有如下特点：

1）Object segmentation

2）Recognition in Context
3）Multiple objects per image
4）More than 300,000 images
5）More than 2 Million instances
6）80 object categories
7）5 captions per image
8）Keypoints on 100,000 people

COCO 2016 Detection Challenge(2016.6.1-2016.9.9)和COCO 2016 Keypoint Challenge(2016.6.1-2016.9.9)已经由Microsoft发起由ECCV 2016(ECCV：European Conference On Computer Vision )。

4. 3D数据

1）RGB-D People Dataset

2）NYU Hand Pose Dataset code

3）Human3.6M (3D Human Pose Dataset)

- 《Iterated Second-Order Label Sensitive Pooling for 3D Human Pose Estimation》

5. 人脸Dataset

1）LFW (Labeled Faces in the Wild)

6. Stereo Datasets

2）Middlebury Stereo Datasets

3）KITTI Vision Benchmark Suite

7. 普林斯顿大学人工智能自动驾驶汽车项目

1）Deep Drive

2）Source Code and Data

关于源代码，网上有很多公开源码的算法包，例如最为著名的Weka，MLC++等。Weka还在不断的更新其算法，下载地址：
http://www.cs.waikato.ac.nz/ml/weka/
UCI收集的机器学习数据集
ftp://pami.sjtu.edu.cn
http://www.ics.uci.edu/~mlearn/\\MLRepository.htm

statlib
http://liama.ia.ac.cn/SCILAB/scilabindexgb.htm
http://lib.stat.cmu.edu/

样本数据库
http://kdd.ics.uci.edu/
http://www.ics.uci.edu/~mlearn/MLRepository.html

关于基金的数据挖掘的网站
http://www.gotofund.com/index.asp

http://lans.ece.utexas.edu/~strehl/

reuters数据集
http://www.research.att.com/~lewis/reuters21578.html

各种数据集：
http://kdd.ics.uci.edu/summary.data.type.html
http://www.mlnet.org/cgi-bin/mlnetois.pl/?File=datasets.html
http://lib.stat.cmu.edu/datasets/
http://dctc.sjtu.edu.cn/adaptive/datasets/
http://fimi.cs.helsinki.fi/data/
http://www.almaden.ibm.com/software/quest/Resources/index.shtml
http://miles.cnuce.cnr.it/~palmeri/datam/DCI/

进行文本分类&WEB
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

http://www.w3.org/TR/WD-logfile-960221.html
http://www.w3.org/Daemon/User/Config/Logging.html#AccessLog
http://www.w3.org/1998/11/05/WC-workshop/Papers/bala2.html
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
http://www.web-caching.com/traces-logs.html
http://www-2.cs.cmu.edu/webkb
http://www.cs.auc.dk/research/DP/tdb/TimeCenter/TimeCenterPublications/TR-75.pdf
http://www.cs.cornell.edu/projects/kddcup/index.html

时间序列数据的网址
http://www.stat.wisc.edu/~reinsel/bjr-data/

apriori算法的测试数据
http://www.almaden.ibm.com/cs/quest/syndata.html

数据生成器的链接
http://www.cse.cuhk.edu.hk/~kdd/data_collection.html
http://www.almaden.ibm.com/cs/quest/syndata.html

关联：
http://flow.dl.sourceforge.net/sourceforge/weka/regression-datasets.jar
http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html#assocSynData

WEKA：
http://flow.dl.sourceforge.net/sourceforge/weka/regression-datasets.jar
1。A jarfile containing 37 classification problems, originally obtained from the UCI repository
http://prdownloads.sourceforge.net/weka/datasets-UCI.jar
2。A jarfile containing 37 regression problems, obtained from various sources
http://prdownloads.sourceforge.net/weka/datasets-numeric.jar
3。A jarfile containing 30 regression datasets collected by Luis Torgo
http://prdownloads.sourceforge.net/weka/regression-datasets.jar

癌症基因：
http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi

金融数据：
http://lisp.vse.cz/pkdd99/Challenge/chall.htm

kdnuggets 相关链接数据集（借花献佛了）：
http://www.kdnuggets.com/datasets/index.html
另一个人提供的
http://www.cs.toronto.edu/~roweis/data.html
http://kdd.ics.uci.edu/summary.task.type.html
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
http://www.phys.uni.torun.pl/~duch/software.html
在下面的网址可以找到reuters数据集
http://www.research.att.com/~lewis/reuters21578.html

以下网址上有各种数据集：
http://kdd.ics.uci.edu/summary.data.type.html

进行文本分类，还有一个数据集是可以用的，即rainbow的数据集
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
Download the Financial Data (~17.5M zipped file, ~67M unzipped data)
Download the Medical Data (~2M zipped file, ~6M unzipped data)
http://lisp.vse.cz/pkdd99/Challenge/chall.htm

最流行的4个机器学习数据集

字数887 阅读12333 评论5 喜欢28

机器学习算法需要作用于数据，而数据的本质则决定了应用的机器学习算法是否合适，而数据的质量也会决定算法表现的好坏程度。所以会研究数据，会分析数据很重要。本文作为学习研究数据系列博文的开篇，列举了4个最流行的机器学习数据集。

Iris

Iris也称鸢尾花卉数据集，是一类多重变量分析的数据集。通过花萼长度，花萼宽度，花瓣长度，花瓣宽度4个属性预测鸢尾花卉属于（Setosa，Versicolour，Virginica）三个种类中的哪一类。

数据集特征:多变量记录数:150领域:生活属性特征:实数属性数目:4捐赠日期1988-07-01相关应用:分类缺失值?无网站点击数:563347

Adult

该数据从美国1994年人口普查数据库抽取而来，可以用来预测居民收入是否超过50K$/year。该数据集类变量为年收入是否超过50k$，属性变量包含年龄，工种，学历，职业，人种等重要信息，值得一提的是，14个属性变量中有7个类别型变量。

数据集特征:多变量记录数:48842领域:社会属性特征:类别型，整数属性数目:14捐赠日期1996-05-01相关应用:分类缺失值?有网站点击数:393977

Wine

这份数据集包含来自3种不同起源的葡萄酒的共178条记录。13个属性是葡萄酒的13种化学成分。通过化学分析可以来推断葡萄酒的起源。值得一提的是所有属性变量都是连续变量。

数据集特征:多变量记录数:178领域:物理属性特征:整数，实数属性数目:13捐赠日期1991-07-01相关应用:分类缺失值?无网站点击数:337319

Car Evaluation

这是一个关于汽车测评的数据集，类别变量为汽车的测评，（unacc，ACC，good，vgood）分别代表（不可接受，可接受，好，非常好），而6个属性变量分别为「买入价」，「维护费」，「车门数」，「可容纳人数」，「后备箱大小」，「安全性」。值得一提的是6个属性变量全部是有序类别变量，比如「可容纳人数」值可为「2，4，more」，「安全性」值可为「low, med, high」。

数据集特征:多变量记录数:1728领域:N/A属性特征:类别型属性数目:6捐赠日期1997-06-01相关应用:分类缺失值?无网站点击数:272901

小结

通过比较以上4个数据集的差异，简单地总结：当需要试验较大量的数据时，我们可以想到「Adult」；当想研究变量之间的相关性时，我们可以选择变量值只为整数或实数的「Iris」和「Wine」；当想研究logistic回归时，我们可以选择类变量值只有两种的「Adult」；当想研究类别变量转换时，我们可以选择属性变量为有序类别的「Car Evaluation」。更多的尝试还需要对这些数据集了解更多才行。

0 0