Advanced Topics in Data Mining Spring 2011


Books (PDFs):

Mining Massive Datasets by A. Rajaraman, J. Ullman.
Networks, Crowds, and Markets: Reasoning About a Highly Connected World by D. Easley, J. Kleinberg.
Data-Intensive Text Processing with MapReduce by J. Lin, C. Dyer.

SNAP network datasets

Wikipedia

Complete edit history of Wikipedia articles: Which user edited what article at what time.
Wikipedia page-to-page link data.
DBpedia: A richly labeled graph of Wikipedia entities.
Freebase: An entity graph of people, places and things.

Ratings and purchases (movies, music, etc.)

Yahoo! Webscope Catalog of datasets

Yahoo! Webscope dataset collection. Contains Language Data, Graph and Social Data, Ratings Data, Advertising and Market Data, and Competition Data.
Note: Jure Leskovec will have to apply for any sets you want, and we must agree not to distribute them further. There may be a delay, so get requests in early.

Co-authorship and Citation Networks

Internet (Autonomous Systems) topology

Who trusts whom data at Trustlet

Instant messenger buddy graph from March 2005. There are 227 million nodes and 7.3 billion undirected edges.
Altavista web graph from 2002. 1.4 billion nodes, 5.5 billion edges.
Memetracker2: 1 million blog posts, news media articles, tweets, and Facebook wall posts per hour for the period from August 1 to August 31, 2010. 181 GB of compressed data.
The New York Times Annotated Corpus: over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata.
TheFind: product information data (price, category, related products) extracted from 239 different websites.
Twitter: About 500 million tweets over a 7 month period. Data description.
Wikipedia: Complete revision history of Wikipedia -- every edit of every article with full content.
Wikipedia webserver logs: Hourly Wikipedia page access statistics.
Yahoo! Messenger: Instant messenger graph with some additional information.

Data can be accessed here. Email Jure if you do not have a password.

Yannis Antonellis and Jawed Karim offer a file that contains information about the search queries that were used to reach pages on the Stanford Web server. See http://www.stanford.edu/~antonell/tags_dataset.html
The Stanford WebBase project provides a crawl, and may even be talked into providing a specialized crawl if you have a need. Find description here. Find how to access web pages in the repository here.