Beijing Institute of Technology Machine Learning Course: Project Topic Summary

Source: Internet  Editor: 程序博客网  Time: 2024/04/28 16:42

Project:

1. Inferring Networks of Diffusion and Influence

Data:

Download the data at http://snap.stanford.edu/netinf/#data.

The data contains the connectivity of the who-copies-from-whom (or who-repeats-after-whom) network of news media sites and blogs, as inferred by NETINF, an algorithm for inferring such networks.

The dataset used by NETINF is called MemeTracker. It can be downloaded from here.

MemeTracker contains two datasets. The first is phrase cluster data: for each phrase cluster, the data contains all the phrases in the cluster and a list of URLs where the phrases appeared. The second is the raw MemeTracker phrase data, which contains phrases and hyperlinks extracted from each article or blog post.

Project idea: Information diffusion and virus propagation are fundamental processes taking place in networks. In many applications, the underlying network over which the diffusions and propagations spread is hard to find. Finding such an underlying network using MemeTracker data would be an interesting and challenging project. Gomez-Rodriguez et al. (2010) have recently published a paper on this topic and made their code publicly accessible. It would be interesting to replicate their result and further improve the proposed algorithm by making use of more informative features (e.g., the textual content of postings).
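As a starting point before replicating the paper, a much cruder baseline can be sketched: count how often one site mentions a phrase before another across cascades. This is not NETINF; it is only a naive precedence count, and the cascade format (lists of (site, timestamp) pairs) and the site names are assumptions made for illustration.

```python
from collections import Counter


def precedence_counts(cascades):
    """Count, for each ordered pair (u, v), the cascades in which u posted before v."""
    counts = Counter()
    for cascade in cascades:
        # Keep only the first time each site mentions the phrase.
        first_seen = {}
        for site, timestamp in cascade:
            if site not in first_seen or timestamp < first_seen[site]:
                first_seen[site] = timestamp
        ordered = sorted(first_seen.items(), key=lambda item: item[1])
        for i, (earlier, t_early) in enumerate(ordered):
            for later, t_late in ordered[i + 1:]:
                if t_late > t_early:  # ignore simultaneous mentions
                    counts[(earlier, later)] += 1
    return counts


# Hypothetical toy cascades: each is a list of (site, timestamp) pairs.
toy_cascades = [
    [("a.com", 1), ("b.com", 2), ("c.com", 3)],
    [("a.com", 1), ("b.com", 5)],
    [("b.com", 1), ("c.com", 2)],
]
edge_counts = precedence_counts(toy_cascades)
top_edges = edge_counts.most_common(2)  # candidate who-copies-from-whom edges
```

Pairs with high counts are candidate edges; comparing them with the network published with the paper would show how much the likelihood-based approach adds over simple counting.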

References:

Inferring Networks of Diffusion and Influence, Gomez-Rodriguez et al., 2010

2. Apply NetInf to Other Domains

Data:

Download the data at http://snap.stanford.edu/netinf/#data.

The data contains the connectivity of the who-copies-from-whom (or who-repeats-after-whom) network of news media sites and blogs, as inferred by NETINF, an algorithm for inferring such networks.

The dataset used by NETINF is called MemeTracker. It can be downloaded from here.

MemeTracker contains two datasets. The first is phrase cluster data: for each phrase cluster, the data contains all the phrases in the cluster and a list of URLs where the phrases appeared. The second is the raw MemeTracker phrase data, which contains phrases and hyperlinks extracted from each article or blog post.

Project idea: In their 2010 paper, Gomez-Rodriguez et al. applied NetInf to MemeTracker and found that clusters of sites related to similar topics emerge (politics, gossip, technology, etc.), and that a few sites with social capital interconnect these clusters, allowing a potential diffusion of information among sites in different clusters. It would be interesting to see how the proposed algorithm could be used in other networks, and what knowledge we could gain from those networks. For example, can we discover users that share similar interests in a social network? Network datasets of different domains can be found here. Different networks may take different forms, and thus the algorithm may not be directly applicable. How can the existing algorithm be modified to support other networks?

References:

Inferring Networks of Diffusion and Influence, Gomez-Rodriguez et al., 2010

3. Dynamically Inferring Networks of Diffusion and Influence

4. Relational Information Retrieval

Data:

2010, yeast2: updated yeast data with extra information about MeSH headings, chemicals, affiliations, etc. (321K entities and 6.1M links)

2010, fly: a biological literature graph with 770K entities and 3.5M links

2010, yeast: a biological literature graph with 164K entities and 2.8M links

All these datasets are relational graph-based datasets. Nodes in the graph are of different types (e.g. author, paper, gene, protein, title word, journal, year). Edges between nodes describe relations between two nodes (e.g. AuthorOf, Cites, Mentions).

Project idea: Scientific literature with rich metadata can be represented as a labeled directed graph. Given this graph, can we suggest related work to authors? Can we retrieve relevant papers given some keywords? All of these tasks can be formulated as relational retrieval tasks in the graph. How can we efficiently retrieve items in the graph given some specific nodes as queries? Random walk with restart (RWR) has been used to model these tasks. Potential projects include implementing different RWR variants from related work and further improving them to achieve better retrieval quality.
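A minimal sketch of plain random walk with restart by power iteration may help fix ideas. It ignores edge types (which the path-constrained variant in the reference below exploits), and the toy graph, node names and the restart value of 0.15 are assumptions for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Score every node by its stationary visit probability for a walk restarting at seed.

    adj maps each node to a list of its out-neighbours; every neighbour must be a key.
    """
    nodes = list(adj)
    scores = {n: 0.0 for n in nodes}
    scores[seed] = 1.0
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            out = adj[n]
            if out:
                share = (1.0 - restart) * scores[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its walking mass back to the seed.
                nxt[seed] += (1.0 - restart) * scores[n]
        nxt[seed] += restart  # restart mass (scores always sum to 1)
        delta = sum(abs(nxt[n] - scores[n]) for n in nodes)
        scores = nxt
        if delta < tol:
            break
    return scores


# Hypothetical toy literature graph: an author, two papers and a title word.
toy_graph = {
    "author1": ["paper1"],
    "paper1": ["author1", "word1"],
    "word1": ["paper1", "paper2"],
    "paper2": ["word1"],
}
rwr_scores = random_walk_with_restart(toy_graph, "author1")
ranked = sorted(rwr_scores, key=rwr_scores.get, reverse=True)
```

Ranking nodes of the target type (for example papers) by score gives a retrieval result for the query node; on graphs with millions of links a sparse matrix representation would be needed instead of dictionaries.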

References:

Ni Lao, William W. Cohen, Relational retrieval using a combination of path-constrained random walks, Machine Learning, 2010, Volume 81, Number 1, Pages 53-67 (ECML, 2010)

5. Image Categorization

Project idea: Image categorization/object recognition has been one of the most important research problems in the computer vision community. Researchers have developed a wide spectrum of different local descriptors, feature coding schemes, and classification methods.

In this project, you will implement your own object recognition system. You could use any code from the web for computing image features, such as SIFT, HoG, etc.
For computing SIFT features, you could use http://www.vlfeat.org/~vedaldi/code/sift.html.
Following is a list of data sets you could use.

A list of datasets:

[1] Caltech 101/256: http://www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html
[2] The PASCAL Object Recognition Database Collection: http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html
[3] LabelMe: http://labelme.csail.mit.edu/
[4] CMU face databases: http://vasc.ri.cmu.edu/idb/html/face/
[5] Face in the wild: http://vis-www.cs.umass.edu/lfw/
[6] ImageNet: http://www.image-net.org/index
[7] TinyImage: http://groups.csail.mit.edu/vision/TinyImages/

6. Human Action Recognition

Project idea: Applications such as surveillance, video retrieval and human-computer interaction require methods for recognizing human actions in various scenarios.
In this project, you will implement your own human action recognition system. You could use any code from the web for computing spatio-temporal features. One good example is the spatio-temporal interest point proposed by Piotr Dollar. Source code is available at http://vision.ucsd.edu/~pdollar/research/research.html.
Following is a list of data sets you could use.

A list of datasets:

[1] KTH: http://www.nada.kth.se/cvap/actions/
[2] Weizmann: http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html
[3] Hollywood Human Actions dataset: http://www.irisa.fr/vista/Equipe/People/Laptev/download.html
[4] VIRAT Video Dataset: http://www.viratdata.org/

7. Single Class Object Detector

Project idea: Given a random photo shot of a book cover, movie poster, wine bottle, etc., which may have lighting, scale, and angle variation, find the standard image of the poster, logo, etc. in a database that matches the query. This is a real application for an iPhone user to identify what they see. For example, if I see a movie poster, a book cover, or a wine bottle, I can take a picture, hit search, and find the original image online along with other relevant information about the movie, book, or wine.
One intuition here lies in the fact that there is a limited number of books in the world. If we have a database containing all book covers in the world, the recognition problem reduces to a duplicate detection problem, which is much simpler to solve than general-purpose object recognition.

In this project, you are encouraged to design an object detector for a single image class using duplicate detection. For example, in the book cover case, you could crawl all pages about books from Amazon.com and store the images as your database of book covers.

Following is a list of possible image classes you could consider in this fashion:
[1] Book cover
[2] Landmark (e.g., Eiffel Tower, Great Wall, White House, etc)
[3] Movie Posters (e.g., crawl images from http://www.movieposter.com)
[4] Wine/beer bottle labels
[5] Logos
[6] Art pieces (e.g., painting, sculpture)

Once you have the database, recognition/detection could be solved using near-duplicate image detection.
You could use any algorithm or source code on the Internet, e.g. http://www.mit.edu/~andoni/LSH/.
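To illustrate the idea of hashing for near-duplicate lookup, here is a minimal sketch of random-hyperplane LSH over image feature vectors. It is a generic cosine-similarity scheme, not the method implemented in the package linked above, and the feature vectors, names and bit count are assumptions for illustration.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def build_index(database, planes):
    """Group database items into buckets keyed by their signature."""
    index = {}
    for name, vec in database.items():
        index.setdefault(signature(vec, planes), []).append(name)
    return index


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def nearest(query_vec, database, planes, top_k=1):
    """Rank database items by Hamming distance between signatures."""
    q = signature(query_vec, planes)
    return sorted(
        database, key=lambda name: hamming(q, signature(database[name], planes))
    )[:top_k]


# Hypothetical 4-dimensional descriptors standing in for real image features.
covers = {
    "book_a": [0.9, 0.1, 0.3, 0.5],
    "book_b": [-0.2, 0.8, -0.7, 0.1],
    "book_c": [0.1, -0.6, 0.4, -0.9],
}
planes = make_hyperplanes(dim=4, n_bits=16)
index = build_index(covers, planes)
brighter_copy = [2.0 * x for x in covers["book_a"]]  # same direction, larger scale
candidates = index.get(signature(brighter_copy, planes), [])
```

A photo whose descriptor points in the same direction as a stored cover lands in the same bucket, so only that bucket needs to be checked exactly. Real descriptors are noisy, so several hash tables or a Hamming-distance ranking such as `nearest` are usually needed.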

8. Object based action recognition

9. Data Mining for Social Media

Project idea: In this project, we encourage students to infer the underlying relations between different modalities of information on the Web. Here are some examples.

(1) Given a photo of a movie poster (image), can we retrieve related trailers (video) or the latest news articles (documents) about the movie?
To make the project simpler, we recommend focusing on fewer than five movies (e.g. 'Rise of the planet of the apes' and 'The smurfs'). You first download posters and trailers from well-organized sites such as imdb.com or itunes.com. They will be used as training data to learn your classifiers. Your job is then to gather raw data from youtube.com or Flickr and classify it. In this project, we encourage you to explore the possibility of building classifiers that are learned from one information modality (e.g. images) and are applicable to other modalities (e.g. trailer videos).

(2) Given a beer label (image), can we find the frames of a given video clip in which the logo or bottle appears?
Suppose that you are a big fan of Guinness beer. You can easily download clean Guinness logo or cup images through Google image search. These images can be used to learn your detector, which can discover the frames in which the logo appears in the video clips. For testing, you can download some video clips from youtube.com.

The above examples are just two possible candidates, and any new ideas or problem definitions are welcome.
For this purpose, one may take advantage of source code available on the Web as unit modules (e.g. near-duplicate image detection, object recognition, action recognition in video).
Another interesting direction is to improve the current state-of-the-art methods by considering more practical scenarios.

Related Papers and Software:

- A good example of how a machine learning technique is successfully applied to real systems (e.g. Google news recommendation).
[1] Das, Datar, Garg, Rajaram. Google news personalization: scalable online collaborative filtering. WWW 2007.
- One of the most popular approaches to near-duplicate image detection is the family of LSH methods.
[2] http://www.mit.edu/~andoni/LSH/ (This webpage links to several introductory articles and source code).
- Various hashing techniques in computer vision (papers and source codes).
[3] Spectral Hashing (http://www.cs.huji.ac.il/~yweiss/SpectralHashing/)
[4] Kernelized LSH (http://www.eecs.berkeley.edu/~kulis/klsh/klsh.htm)
- Recognition in video
[5] Naming of Characters in Video (http://www.robots.ox.ac.uk/~vgg/data/nface/index.html)
[6] Action recognition in Video (http://www.robots.ox.ac.uk/~vgg/data/stickmen/index.html)
- Recognition in images
[7] Human pose detection (Poselet) (http://www.eecs.berkeley.edu/~lbourdev/poselets/)
[8] General object detection (http://people.cs.uchicago.edu/~pff/latent/)

10. Object Recognition, Scene Understanding, and More on Twitter

Project idea: Currently, Twitter does not provide photo-sharing functionality, which is instead supported by several third-party services such as twitpic, yfrog, lockerz, and instagram. (See the current market share of these services at http://techcrunch.com/2011/06/02/a-snapshot-of-photo-sharing-market-share-on-twitter/.) The main goal of this project is to recognize objects or scenes in user photos by using contextual information such as author, time taken, and associated tweets. Students may gather data by using Twipho or the built-in search engines of the services (e.g. http://web1.twitpic.com/search/).
In practice, it is extremely difficult to completely understand the photos on Twitter. Hence, we encourage students to come up with good problem definitions so that they are not only solvable as course projects but also usable in real applications. Here are some examples.

(1) The photos retrieved by querying 'superman' at http://web1.twitpic.com/search/ are highly variable. But, given an image, you can build a classifier to tell whether 'superman' logos appear in the image or not.
(2) Download the photos returned by the query 'beach'. Observing the images, you may identify what objects are usually shown. Choose some of them as target objects, such as human faces, sand, sea, and sky, and learn a classifier for each object category. Then, your goal is to tell what objects appear where in a Twitter image.

Related Papers and Software:

- Some object recognition competition sites will be very helpful.
[1] PASCAL VOC (http://pascallin.ecs.soton.ac.uk/challenges/VOC/)
[2] ImageNet (http://www.image-net.org/challenges/LSVRC/2011/)
[3] MIRFLICKR (http://press.liacs.nl/mirflickr/)
[4] SUN database (http://groups.csail.mit.edu/vision/SUN/)
- Some object detection source code is available.
[5] Most popular object detection (http://people.cs.uchicago.edu/~pff/latent/)
[6] Object recognition short course: pLSA and Boosting (http://people.csail.mit.edu/torralba/shortCourseRLOC/index.html)
[7] Human pose detection (Poselet) (http://www.eecs.berkeley.edu/~lbourdev/poselets/)

11. Association analysis

Project idea: The goal of population association studies is to identify patterns of polymorphisms that vary systematically between individuals with different disease states and could therefore represent the effects of risk-enhancing or protective alleles. These studies make use of the statistical correlation between the polymorphism and the trait of interest (usually the presence or absence of disease) to identify these patterns. This project will make use of data from the Personal Genome Project. It contains information about many traits of the individuals from whom the genetic data was obtained. Using techniques such as statistical tests, sparsity-based methods, and eigenanalysis, you can try to find the genetic polymorphisms that are likely to be responsible for a particular trait.
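As one concrete instance of the statistical tests mentioned above, the following sketch computes a Pearson chi-square test on a 2x2 table of allele counts in cases versus controls. The counts are made up for illustration, and a real study would also need to correct for multiple testing and population stratification.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) for the table [[a, b], [c, d]].

    Rows: cases / controls. Columns: counts of allele A / allele B.
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical allele counts for a single polymorphism.
no_effect = chi_square_2x2(10, 10, 10, 10)
strong_effect = chi_square_2x2(30, 10, 10, 30)
```

Running the test for every polymorphism and ranking by p-value gives a first list of candidates for the trait.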

Data:

· Personal Genome Project (http://www.personalgenomes.org/) - The Personal Genome Project makes available genetic and phenotypic data from volunteers for analysis. The data can be downloaded at http://www.personalgenomes.org/public/.

References:

· D.J. Balding, (2006) A tutorial on statistical methods for population association studies, Nature Reviews Genetics.

· Stephens M, Balding DJ. (2009) Bayesian statistical methods for genetic association studies, Nature Reviews Genetics.

· Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick and David Reich, (2006), Principal components analysis corrects for stratification in genome-wide association studies, Nature Genetics.

12. Using text and network data to predict and to understand

Description:

Many data sets are heterogeneous, comprising feature vectors, textual data (bag of words), network links, image data, and more. For example, Wikipedia pages contain text, links and images. The challenge is figuring out how to use all these types of data for some machine learning task: data exploration, prediction, etc. A typical machine learning approach to this problem is "multi-view learning", in which the different data types are assumed to be multiple "views" of the entities of interest (webpages in the case of Wikipedia).

Multi-view learning opens up applications not normally available with single-view datasets. For example, consider a citation recommendation service for academics, which suggests papers you should cite based on the text of your paper draft. Such a service would be trained on a corpus of academic papers, learning how the citations relate to the paper texts. Another example would be interest prediction and advertising in social networks: given a user's friend list, determine what things that user is interested in.

In this project, you will focus on datasets with text and network data, such as (but not limited to) citation networks. As our examples suggest, your primary goal is to design a machine learning algorithm that trains on a subset of the text and network data, and, given text (or network links) from test entities, outputs network link (or text) predictions for them. Alternatively, you could design an algorithm that converts text and network data into "latent space" feature vectors suitable for data visualization (similar to methods such as Latent Dirichlet Allocation and the Mixed-Membership Stochastic Blockmodel). Note that these goals are not exclusive; your proposed method could even do both. The key challenge in this project is figuring out how to learn from text and network data jointly, even though the two data types are fundamentally different.

Recommended reading:

· Joint Latent Topic Models for Text and Citations (Nallapati, Ahmed, Xing, Cohen, 2008)

· Multi-view learning over structured and non-identical outputs (Ganchev, Graca, Blitzer, Taskar, 2008)

Suggested Datasets:

· ACL Anthology citation network

· arXiv High-Energy Physics citation network (from KDD Cup 2003)

13. Efficient methods for understanding large networks

14. Cognitive State Classification with Magnetoencephalography Data (MEG)

Data:

A zip file containing some example preprocessing of the data into features, along with some text file descriptions: LanguageFiles.zip
The raw time data (12 GB) for two subjects (DP/RG_mats) and the FFT data (DP/RG_avgPSD) are located at:
/afs/cs.cmu.edu/project/theo-23/meg_pilot
You should access this directly through AFS space.

This data set contains a time series of images of brain activation, measured using MEG. Human subjects viewed 60 different objects divided into 12 categories (tools, foods, animals, etc.). There are 8 presentations of each object, and each presentation lasts 3-4 seconds. Each second has hundreds of measurements from 300 sensors. The data is currently available for 2 different human subjects.

Project A: Building a cognitive state classifier
Project idea: We would like to build classifiers to distinguish between the different categories of objects (e.g. tools vs. foods) or even the objects themselves if possible (e.g. bear vs. cat). The exciting thing is that no one really knows how well this will work (or if it's even possible). This is because the data was only gathered a few weeks ago (Aug-Sept 08). One of the main challenges is figuring out how to make good features from the raw data. Should the raw data just be used? Or maybe it should first be passed through a low-pass filter? Perhaps an FFT should convert the time series to the frequency domain first? Should the features represent absolute sensor values or should they represent changes from some baseline? If so, what baseline? Another challenge is discovering what features are useful for what tasks. For example, the features that distinguish foods from animals may be different from those that distinguish tools from buildings. What are good ways to discover these features?

This project is more challenging and risky than the others because it is not known what the results will be. But this is also good because no one else knows either, meaning that a good result could lead to a possible publication.
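Two of the feature options raised above, baseline subtraction and conversion to the frequency domain, can be sketched for a single sensor trace as follows. The naive DFT is used only to keep the example dependency-free (an FFT library would be used in practice), and the synthetic signal, sample count and baseline length are assumptions, not properties of the MEG files.

```python
import cmath
import math


def baseline_corrected(signal, n_baseline):
    """Subtract the mean of the first n_baseline samples (e.g. a pre-stimulus window)."""
    base = sum(signal[:n_baseline]) / n_baseline
    return [x - base for x in signal]


def power_spectrum(signal):
    """Power at each non-negative frequency bin, via a naive O(n^2) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


# Synthetic trace: 64 samples containing exactly 5 cycles of a sine wave.
trace = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
spectrum = power_spectrum(trace)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)

corrected = baseline_corrected([3.0, 3.0, 5.0, 7.0], n_baseline=2)
```

Per-sensor power in a few frequency bands, or baseline-corrected amplitudes in a few time windows, would then form the feature vector handed to a classifier.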
Papers to read:
Relevant but in the fMRI domain:
Learning to Decode Cognitive States from Brain Images, Mitchell et al., 2004
Predicting Human Brain Activity Associated with the Meanings of Nouns, Mitchell et al., 2008
MEG paper:
Predicting the recognition of natural scenes from single trial MEG recordings of brain activity, Rieger et al. 2008 (access from CMU domain)

15. Educational Data Mining on Predicting Student Performance

Data:

Register at the KDD Cup 2010: Educational Data Mining Challenge website, and click on "Get Data".

There are two types of data sets available: development data sets and challenge data sets. Development data sets differ from challenge sets in that the actual student performance values for the prediction column, "Correct First Attempt", are provided for all steps.

The data takes the form of records of interactions between students and computer-aided tutoring systems. The students solve problems in the tutor, and each interaction between the student and computer is logged as a transaction. Four key terms form the building blocks of the data: problem, step, knowledge component, and opportunity.

Project idea: How generally or narrowly do students learn? How quickly or slowly? Will the rate of improvement vary between students? What does it mean for one problem to be similar to another? It might depend on whether the knowledge required for one problem is the same as the knowledge required for another. But is it possible to infer the knowledge requirements of problems directly from student performance data, without human analysis of the tasks?

We would like to ask you to predict whether a student is likely to be correct or not on each step, based on previous log data. The problem can be formalized as a classification problem. You could also build a model of students' learning behavior and predict the probability of making an error. The challenge here is to select the classifier/model that best represents the data. Moreover, not all given features may be informative, and models that are over-complicated may overfit. How to find the relevant features and make good use of them are interesting topics.
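Before trying richer models, a very simple baseline is worth having: predict each student's smoothed rate of correct first attempts. The sketch below assumes the log has been reduced to hypothetical (student_id, correct_first_attempt) pairs; the student identifiers and the smoothing strength are illustrative, not taken from the challenge data.

```python
from collections import defaultdict


def fit_student_rates(rows, prior_strength=5.0):
    """Estimate P(correct first attempt) per student, shrunk toward the global rate.

    rows: iterable of (student_id, correct) with correct in {0, 1}.
    """
    rows = list(rows)
    global_rate = sum(correct for _, correct in rows) / len(rows)
    counts = defaultdict(lambda: [0, 0])  # student -> [n_correct, n_total]
    for student, correct in rows:
        counts[student][0] += correct
        counts[student][1] += 1
    rates = {
        student: (n_correct + prior_strength * global_rate) / (n_total + prior_strength)
        for student, (n_correct, n_total) in counts.items()
    }
    return global_rate, rates


def predict_correct(student, global_rate, rates):
    """Fall back to the global rate for students never seen in training."""
    return rates.get(student, global_rate)


# Hypothetical training log.
log = [("s1", 1), ("s1", 1), ("s1", 1), ("s1", 1),
       ("s2", 0), ("s2", 0), ("s2", 0), ("s2", 1)]
global_rate, rates = fit_student_rates(log)
```

The same shrinkage can be applied per step or per knowledge component, and any classifier proposed for the project should at least beat this kind of baseline.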

References:

Feature Engineering and Classifier Ensemble for KDD Cup 2010, Yu et al., 2010
Using HMMs and bagged decision trees to leverage rich features of user and skill from an intelligent tutoring system dataset, Pardos and Heffernan, 2010
Collaborative Filtering Applied to Educational Data Mining, Toscher and Jahrer, 2010

16. Brain imaging data (fMRI)

This data is available here

This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects.

Available software: we can provide Matlab software for reading the data, manipulating and visualizing it, and for training some types of classifiers (Gaussian Naive Bayes, SVM).

Project A: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent) by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include for classifier input (5000 voxels times 8 seconds of images is a lot of features), whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
Papers to read: "Learning to Decode Cognitive States from Brain Images", Mitchell et al., 2004; "Bayesian Network Classifiers", Friedman et al., 1997.
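For reference, the Gaussian Naive Bayes baseline that Project A aims to improve on can be sketched in a few lines. This is a generic implementation, not the Matlab code mentioned above; the two-feature toy data and the variance floor are assumptions for illustration.

```python
import math
from collections import defaultdict


def fit_gnb(X, y, var_floor=1e-6):
    """Fit per-class feature means and variances plus log class priors."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The variance floor avoids division by zero for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gnb(model, row):
    """Return the class with the highest log posterior, features treated as independent."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(row, means, variances)
        )
        return log_prior + log_lik
    return max(model, key=log_posterior)


# Hypothetical two-voxel activity windows labelled by task phase.
X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.1]]
y = ["sentence", "sentence", "picture", "picture"]
gnb = fit_gnb(X, y)
```

A TAN classifier keeps this structure but lets each feature depend on one other feature in addition to the class, so the per-feature Gaussian becomes a conditional Gaussian.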

17. Hierarchical Bayes Topic Models

Statistical topic models have recently gained much popularity in managing large collections of text documents. These models make the fundamental assumption that a document is a mixture of topics (as opposed to clustering, in which we assume that a document is generated from a single topic), where the mixture proportions are document-specific and signify how important each topic is to the document. Moreover, each topic is a multinomial distribution over a given vocabulary, which in turn dictates how important each word is for a topic. The document-specific mixture proportions provide a low-dimensional representation of the document in the topic space. This representation captures the latent semantics of the collection and can then be used for tasks like classification and clustering, or merely as a tool to structurally browse the otherwise unstructured collection. The most famous of such models is known as LDA, Latent Dirichlet Allocation (Blei et al. 2003). LDA has been the basis for many extensions in text, vision, bioinformatics, and social networks. These extensions incorporate more dependency structures in the generative process, like modeling author-topic dependency, or implement more sophisticated ways of representing inter-topic relationships.

Potential projects include

· Implement one of the models listed below or propose a new latent topic model that suits a data set in your area of interest

· Implement and compare approximate inference algorithms for LDA, which include: variational inference (Blei et al. 2003), collapsed Gibbs sampling (Griffiths et al. 2004) and (optionally) collapsed variational inference (Teh et al. 2006). You should compare them over simulated data by varying the corpus generation parameters (number of topics, size of vocabulary, document length, etc.) in addition to comparison over several real-world datasets.
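Of the inference algorithms listed above, collapsed Gibbs sampling is the shortest to write down. The following is a minimal sketch for a tiny corpus of integer word ids; the hyperparameters, iteration count and toy documents are assumptions for illustration, and no convergence diagnostics or hyperparameter estimation are included.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional distribution over topics given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical corpus: word ids 0-2 and 3-5 tend to co-occur.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [5, 4, 3, 3]]
doc_topic, topic_word = lda_gibbs(toy_docs, n_topics=2, vocab_size=6, n_iter=20)
```

Normalising the rows of `doc_topic` (plus alpha) and `topic_word` (plus beta) gives point estimates of the document mixtures and topic distributions, which can then be compared against the other inference methods on the same simulated corpora.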

Papers:

Inference:

· D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

· Griffiths, T., Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228-5235.

· Y.W. Teh, D. Newman and M. Welling. A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation. In NIPS 2006.

Expressive Models:

· Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. The Author-Topic Model for authors and documents. In UAI 2004.

· Jun Zhu, Amr Ahmed and Eric Xing. MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification. International Conference on Machine Learning (ICML), 2009.

· D. Blei, J. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems 21, 2007.

· Wei Li and Andrew McCallum. Pachinko Allocation: Scalable Mixture Models of Topic Correlations. Submitted to the Journal of Machine Learning Research (JMLR), 2008.

Application in Vision:

· L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. IEEE Comp. Vis. Patt. Recog. 2005.

· L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent object segmentation and classification. IEEE Intern. Conf. in Computer Vision (ICCV). 2007.

Application in Social Networks/relational data:

· Ramesh Nallapati, Amr Ahmed, Eric P. Xing, and William W. Cohen, Joint Latent Topic Models for Text and Citations. Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008).

· Erosheva, Elena A., Fienberg, Stephen E., and Lafferty, John (2004). Mixed-membership models of scientific publications, Proceedings of the National Academy of Sciences, 97, No. 22, 11885-11892.

· E. Airoldi, D. Blei, E.P. Xing and S. Fienberg, Mixed Membership Model for Relational Data. JMLR 2008.

· Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang. The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. Technical Report UM-CS-2004-096, 2004.

· E.P. Xing, W. Fu, and L. Song, A State-Space Mixed Membership Blockmodel for Dynamic Network Tomography, Annals of Applied Statistics, 2009.

Application in Biology/Biological Text:

· S. Shringarpure and E. P. Xing, mStruct: A New Admixture Model for Inference of Population Structure in Light of Both Genetic Admixing and Allele Mutations, Proceedings of the 25th International Conference on Machine Learning (ICML 2008).

· Amr Ahmed, Eric P. Xing, William W. Cohen, Robert F. Murphy. Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature. Proceedings of the Fifteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009).

18. Image Segmentation Dataset

The goal is to segment images in a meaningful way. Berkeley collected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations). Two hundred of these images are training images, and the remaining 100 are test images. The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions. It also includes code for a segmentation example. This dataset is new and the problem unsolved, so there is a chance that you could come up with the leading algorithm for your project.
http://www.cs.berkeley.edu/projects/vision/grouping/segbench/

Project ideas:
Project: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or on discontinuity of color and texture. The ground truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Graphical models can be used to represent smoothness in clusters by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.
Papers to read: Some segmentation papers from Berkeley are available here

19. Twenty Newsgroups text data

20. Handwriting Recognition (Lisa Anthony http://www.cs.cmu.edu/~lanthony/)

A general overview of our data: we have approximately 16,000 labeled character samples from 39 middle and high school students, consisting of x-coord, y-coord, and time per point in each stroke. They are grouped into sets of 45 equations that each student copied. The symbols in our dataset are: 0-9, x, y, a, b, c, +, -, _ (fraction bar), =, (, ).

There are 3 main ideas for projects:

1. HOW MUCH DATA: All our data is currently hand-labeled, and we have lots of it. One question might be, if the data wasn't labeled, what would be the added value of additional data? That is, what would be the optimal or minimal dataset? This could be defined along several axes: the number of users, the number of samples per character, or the number of samples per symbol per user. We have done a few preliminary experiments where it is clear that there is a leveling-off point for test accuracy, likely caused by the increase in variability from adding new samples (especially by new users with differing handwriting styles), which harms the classification algorithm (see #3). For future studies and domains it might be useful to get a general sense of "data saturation", that is, a recommended canonical corpus size.

2. HOW MUCH LABELED DATA AND/OR AUTOMATIC LABELING: Hand-labeling all our data took quite a bit of time. What possibilities exist for an automated, semi-supervised labeling algorithm that could tell us how much data we need to label in advance and how much human verification is needed on the automatically labeled data? A side note is that the collection of this data (for the sake of the users) was in the form of one equation at a time rather than one character at a time, so the characters needed to be segmented at the time of labeling since the strokes all ran together in the logs. An automated segmenting approach would be very helpful to us in the future!

3. MULTIPLE CLASSIFIERS: Finally, there is quite a bit of variance between users, in that their handwriting styles differ and the particular means of executing a style differs across users. We hypothesize that multiple classifiers trained per user would have higher walk-up-and-use accuracy on a set of independent users than one classifier that has to generalize across all user styles. So this could also be an interesting area to explore.

21. Character recognition (digits) data

22. NBA statistics data

This download contains 2004-2005 NBA and ABA stats for:

-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - NBA coaching records by season
-coaches_career.txt - NBA career coaching records

Currently all of the regular season

Project idea:

· outlier detection on the players; find out who the outstanding players are.

· predict the game outcome.
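For the outlier-detection idea, a simple first pass is to flag players whose value on some statistic lies far from the mean in standard-deviation units. The sketch below uses made-up points-per-game values and a threshold of 2.0, both assumptions for illustration; a median-based score would be more robust when the outliers themselves inflate the standard deviation.

```python
import statistics


def zscore_outliers(values_by_player, threshold=2.0):
    """Return players whose statistic is more than `threshold` std devs above the mean."""
    values = list(values_by_player.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [
        player
        for player, value in values_by_player.items()
        if (value - mean) / std > threshold
    ]


# Hypothetical points-per-game figures, not taken from the dataset.
points_per_game = {
    "player_a": 10, "player_b": 11, "player_c": 9, "player_d": 10,
    "player_e": 12, "player_f": 10, "player_g": 11, "player_h": 35,
}
outstanding = zscore_outliers(points_per_game)
```

Applying the same idea to several statistics at once (for example with a multivariate distance) would separate all-round outstanding players from those who are extreme on a single number.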

23. Precipitation data

This dataset includes 45 years of daily precipitation data from the Northwest of the US:

http://www.jisao.washington.edu/data_sets/widmann/

Project ideas:

Weather prediction: Learn a probabilistic model to predict rain levels

Sensor selection: Where should you place sensors to best predict rain?

24. WebKB

25. Deduplication

26. Email Annotation

The datasets provided below are sets of emails. The goal is to identify which parts of the email refer to a person name. This task is an example of the general problem area of Information Extraction.

http://www.cs.cmu.edu/~einat/datasets.html

Project Ideas:

· Model the task as a Sequential Labeling problem, where each email is a sequence of tokens, and each token can have either a label of "person-name" or "not-a-person-name".
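The data representation for this formulation can be sketched as follows: per-token features that a sequence labeler could consume, and a helper that turns predicted token labels back into name spans. The feature set and the greeting word list are assumptions for illustration and are not taken from the paper below.

```python
PERSON = "person-name"
OTHER = "not-a-person-name"


def token_features(tokens, i):
    """Simple features for token i; a sequence model would also use neighbouring labels."""
    token = tokens[i]
    prev_token = tokens[i - 1].lower() if i > 0 else "<start>"
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "prev_lower": prev_token,
        "after_greeting": prev_token in {"hi", "dear", "hello"},
    }


def person_spans(tokens, labels):
    """Merge consecutive tokens labelled as person names into name strings."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == PERSON:
            current.append(token)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hypothetical tokenised email opening with gold labels.
tokens = ["Hi", "John", "Smith", ",", "thanks"]
labels = [OTHER, PERSON, PERSON, OTHER, OTHER]
features = [token_features(tokens, i) for i in range(len(tokens))]
names = person_spans(tokens, labels)
```

Feeding such feature dictionaries to an HMM, CRF or per-token classifier, and scoring the recovered spans against the annotated ones, gives a complete evaluation loop for the task.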

Papers: http://www.cs.cmu.edu/~einat/email.pdf

27. Netflix Prize Dataset

28. Physiological Data Modeling (bodymedia)

29. Object Recognition

The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, varying lighting conditions, and with different backgrounds.
http://www.vision.caltech.edu/Image_Datasets/Caltech256/

Project ideas:

· You can try to create an object recognition system which can identify which object category is the best match for a given test image.

· Apply clustering to learn object categories without supervision

30. Enron E-mail Dataset

The Enron E-mail data set contains about 500,000 e-mails from about 150 users. The data set is available here: Enron Data

Project ideas:

· Can you classify the text of an e-mail message to decide who sent it?