Advanced Topics in Data Mining Spring 2011

Books (PDFs):

  • Mining Massive Datasets by A. Rajaraman, J. Ullman.

  • Networks, Crowds, and Markets: Reasoning About a Highly Connected World by D. Easley, J. Kleinberg.

  • Data-Intensive Text Processing with MapReduce by J. Lin, C. Dyer.

Datasets:

SNAP network datasets

  • 60 large social and information network datasets

Wikipedia

  • Complete edit history of Wikipedia articles: Which user edited what article at what time.
  • Wikipedia page to page link data
  • DBpedia: A richly labeled graph of Wikipedia entities.
  • Freebase: An entity graph of people, places and things.

Ratings and purchases (movies, music, etc.)

  • Amazon product co-purchasing network: 600k products and all their metadata.
  • KDD Cup 2011: 300M ratings from 1M users on 600k songs, albums and artists.
  • IMDB database: Everything about every movie ever made.
  • Movielens: User movie rating data.

Yahoo! Webscope Catalog of datasets

  • Yahoo! Webscope dataset collection. Contains Language Data, Graph and Social Data, Ratings Data, Advertising and Market Data, and Competition Data.
  • Note: Jure Leskovec will have to apply for any sets you want, and we must agree not to distribute them further. There may be a delay, so get requests in early.

Co-authorship and Citation Networks

  • DBLP: Digital Bibliography & Library Project. More info.
  • Arxiv citation and co-authorship networks: Data is from KDD Cup 2003.

Internet (Autonomous Systems) topology

  • AS Graphs

Who-trusts-whom data at Trustlet

  • Trust network datasets from Trustlet.org

Stanford only datasets

  • Instant messenger buddy graph from March 2005. There are 227 million nodes and 7.3 billion undirected edges.
  • Altavista web graph from 2002. 1.4 billion nodes, 5.5 billion edges.
  • Memetracker2: 1 million blog posts, news media articles, tweets and Facebook wall posts per hour for the period from August 1 to August 31, 2010. 181 GB of compressed data.
  • The New York Times Annotated Corpus: over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata.
  • TheFind: product information data (price, category, related products) extracted from 239 different websites.
  • Twitter: About 500 million tweets over a 7-month period. Data description.
  • Wikipedia: Complete revision history of Wikipedia -- every edit of every article with full content.
  • Wikipedia webserver logs: Hourly Wikipedia page access statistics.
  • Yahoo! Messenger: Instant Messenger graph with some additional information.

Data can be accessed here. Email Jure if you do not have a password.

Other Datasets

  • Yannis Antonellis and Jawed Karim offer a file that contains information about the search queries that were used to reach pages on the Stanford Web server. See http://www.stanford.edu/~antonell/tags_dataset.html
  • The Stanford WebBase project provides a crawl, and may even be talked into providing a specialized crawl if you have a need. Find description here. Find how to access web pages in the repository here.