Top 10 Open Dataset Resources on Github
The top open dataset repositories on Github include a variety of data, freely available for use by researchers, practitioners, and students alike.
Over the past several months we have had a look at a number of top Github repository collections, such as:
- Top 10 Machine Learning Projects on Github
- Top 10 Deep Learning Projects on Github
- Top 10 Data Visualization Projects on Github
- Top 10 Data Science Resources on Github
- Top 10 IPython Notebook Tutorials for Data Science and Machine Learning
This post will be a bit different, in that we are looking at the top open dataset repositories that Github has to offer. The post was inspired by the Github Open Data Showcase, which is good, but which is not very large. Ideally, I would like to make a list of the top open datasets on Github, period; however, this gets tricky, since searching for "open data," or any variant of this search term, is going to lead to complications on a site set up with the explicit goal of sharing open source projects and their data.
I decided to take the offerings in this showcase which were not explicitly noted as being out of date and add in 3 additional strictly-dataset repos with the highest numbers of stars I could find from simple search, rank them all accordingly, and present them here. We have found at KDnuggets that datasets are one of the most sought-after pieces of the data science puzzle for many readers, and hopefully this fresh batch (at least, fresh from our perspective) is of use to some of our readers.
We are currently conducting our latest Annual KDnuggets Analytics Software Poll, and so the particular percentages from last year may change, but we know that open source tools have been used by 73% of data scientists in the past 12 months. While this number reflects software, and not data, it is easy to surmise that open data is a heavily relied-upon commodity in data science and related data-oriented disciplines for research, practice, and production alike, for myriad reasons.
So here they are, the open dataset repos with the highest number of stars as of the time of writing.
1. Awesome Public Datasets
Stars: 14137, Forks: 1573
Brought to us by Xiaming (Sammy) Chen, this seems to be the undisputed leader of the open dataset collections available on Github. This curated list is organized by such topics as biology, sports, museums, and natural language, and appears to include several hundred datasets. Most are free, but there is a disclaimer at the top of the list that some are not. Xiaming also points out 2 other awesome-branded repo lists that contain more datasets; however, since those lists contain all sorts of other big data/machine learning/data science links, they will not be included in the list below, despite their high number of stars. Feel free to explore them on your own... obviously.
2. OpenAddresses
Stars: 529, Forks: 510
This is the official repo of OpenAddresses.io, the free and open global address collection. Why addresses?
Street address data is essential infrastructure. Street names, house numbers and zip codes, when combined with geographic coordinates, are the hub that connects digital to physical places. Precisely because of their connecting role, free and open addresses are rocket fuel for civic and commercial innovation.
3. Congress Legislators
Stars: 417, Forks: 187
This repo is summed up by its description:
Members of the United States Congress, 1789-Present, in YAML, as well as committees, presidents, and vice presidents.
4. Open Exoplanet Catalogue
Stars: 300, Forks: 88
This is a catalog of all known discovered planets existing outside of our solar system. The database is generally updated within 24 hours of new discoveries, too, which means this is about as up-to-date as one could imagine; that the repo was last updated 20 days ago is encouraging in this respect. The README also points to this repo, should you be interested in a simple CSV of the data.
5. CitySDK
Stars: 274, Forks: 92
CitySDK is described as a "[u]ser-friendly [J]avascript SDK for US Census Bureau data," which also includes a number of samples detailing integration of the data with other open datasets. It refers to itself as a "toolbox" for civic hackers, and boasts latitude/longitude and ZIP code translation, and a modular architecture which makes integration with other data services straightforward. Use the API to create your own, custom dataset.
6. openFDA
Stars: 236, Forks: 53
openFDA is a project by the FDA, which aims to bring a collection of FDA public datasets to researchers and developers via APIs, raw data, usage examples, and documentation. Data is noted as not being suited for clinical use, and one should assume no specific validity of any data results included within. Even with these disclaimers, there is no doubt that the data here would be great practice for those interested in the domain.
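For readers who want to try the API side of openFDA, here is a minimal sketch of how a query URL might be put together. It assumes the drug adverse event endpoint at `https://api.fda.gov/drug/event.json` and its `search` and `limit` query parameters; the helper name `build_query_url` and the example search expression are illustrative only, so check the openFDA documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL for the openFDA API; verify against the official docs.
BASE_URL = "https://api.fda.gov"


def build_query_url(endpoint, search=None, limit=1):
    """Build an openFDA-style query URL (hypothetical helper).

    endpoint: path such as "drug/event" (assumed endpoint name).
    search:   optional field:term expression, passed through as-is.
    limit:    maximum number of records to request.
    """
    params = {}
    if search:
        params["search"] = search
    params["limit"] = limit
    # urlencode percent-encodes reserved characters such as ":".
    return f"{BASE_URL}/{endpoint}.json?{urlencode(params)}"


if __name__ == "__main__":
    # Illustrative search expression; the field name is an assumption.
    print(build_query_url("drug/event",
                          search="patient.drug.medicinalproduct:aspirin",
                          limit=5))
```

The sketch only builds the URL, so it runs without network access; fetching the result with your HTTP client of choice and decoding the JSON body is left to the reader.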
7. Food Inspections Evaluation
Stars: 100, Forks: 44
In case the name "Chicago Food Inspections Evaluation" didn't give it away, here's what to expect from this repo:
This repository contains the code to generate predictions of critical violations at food establishments in Chicago. It also contains the results of an evaluation of the effectiveness of those predictions.
8. GSA Data
Stars: 92, Forks: 40
This contains various data published by the General Services Administration, which handles the basic functioning of federal agencies (offices, supplies, and the like). Specifically, it contains a collection of over 5000 .gov domains and their data.
9. US Congressional Districts
Stars: 82, Forks: 21
From the repo's README:
Historic and current US Congressional districts as GeoJSON, versioned within Git
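Since the districts are stored as GeoJSON, they can be read with nothing more than a JSON parser. The sketch below computes a bounding box per district from a small made-up FeatureCollection; the `district` property name and the sample coordinates are assumptions for illustration and do not reflect the repo's actual schema.

```python
import json

# Hypothetical sample data; property names are illustrative only.
SAMPLE = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"district": "Example-1"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-100.0, 40.0], [-99.0, 40.0], [-99.0, 41.0],
                         [-100.0, 41.0], [-100.0, 40.0]]]
      }
    }
  ]
}
"""


def bounding_box(geometry):
    """Return (min_lon, min_lat, max_lon, max_lat) for a Polygon or MultiPolygon."""
    if geometry["type"] == "Polygon":
        polygons = [geometry["coordinates"]]
    elif geometry["type"] == "MultiPolygon":
        polygons = geometry["coordinates"]
    else:
        raise ValueError(f"unsupported geometry type: {geometry['type']}")
    # Flatten polygons -> rings -> [lon, lat] positions.
    points = [pt for poly in polygons for ring in poly for pt in ring]
    lons = [pt[0] for pt in points]
    lats = [pt[1] for pt in points]
    return (min(lons), min(lats), max(lons), max(lats))


def district_boxes(text):
    """Map each feature's (assumed) 'district' property to its bounding box."""
    collection = json.loads(text)
    return {
        feature["properties"]["district"]: bounding_box(feature["geometry"])
        for feature in collection["features"]
    }


if __name__ == "__main__":
    print(district_boxes(SAMPLE))
```

GeoJSON positions are ordered longitude first, then latitude, which is why the box is reported in that order.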
10. CERN Open Data Portal
Stars: 79, Forks: 34
This is the source code for the CERN Open Data Portal, described as "the access point to a growing range of data produced through the research performed at CERN."
Related:
- Awesome Public Datasets on GitHub
- 9 Must-Have Datasets for Investigating Recommender Systems
- 5 Machine Learning Projects You Can No Longer Overlook
- Original link: http://www.kdnuggets.com/2016/05/top-10-datasets-github.html