Natural-language classification (machine learning)

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 03:30

Stack Overflow:

From:https://stackoverflow.com/questions/2696392/i-want-a-machine-to-learn-to-categorize-short-texts

Question :

I want a machine to learn to categorize short texts

I have a ton of short stories about 500 words long and I want to categorize them into one of, let’s say, 20 categories:

Entertainment
Food
Music
etc
I can hand-classify a bunch of them, but I want to implement machine learning to guess the categories eventually. What’s the best way to approach this? Is there a standard approach to machine learning I should be using? I don’t think a decision tree would work well since it’s text data…I’m completely new in this field.

Any help would be appreciated, thanks!

Answer :

Naive Bayes

A naive Bayes classifier will most probably work for you. The method works like this:

Fix a number of categories and get a training data set of (document, category) pairs.
Represent each document as a data vector, something like a bag of words. For example, take the 100 most common words, excluding words like “the”, “and” and such. Each word gets a fixed position in the data vector (e.g. “food” is position 5). The feature vector is then an array of booleans, each indicating whether that word occurs in the corresponding document.
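The bag-of-words step above can be sketched roughly as follows. This is only an illustration: the stop-word list, the tokenizer regex and the function names (tokenize, build_vocabulary, to_feature_vector) are assumptions made for the example and are not part of the answer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is", "it"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents, size=100):
    """Return the `size` most common non-stop words across all documents."""
    counts = Counter(
        word
        for doc in documents
        for word in tokenize(doc)
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(size)]

def to_feature_vector(text, vocabulary):
    """Boolean vector: element i is True if vocabulary[i] occurs in the text."""
    present = set(tokenize(text))
    return [word in present for word in vocabulary]

docs = [
    "The food was great and the music was loud",
    "Music and more music in the park",
]
vocab = build_vocabulary(docs, size=3)
vector = to_feature_vector("I like food and music", ["music", "was", "food"])
```

Each word keeps the same index in every vector, which is what lets the training step count per-position statistics.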
Training:

For your training set, calculate the prior probability of every class: p(C) = number of documents of class C / total number of documents.
Calculate the probability of a feature in a class: p(F|C) = number of documents of the class with the given feature (e.g. the word “food” is in the text) / number of documents in the given class.
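A minimal sketch of these two counting formulas, assuming the training set is a list of (feature_vector, category) pairs with boolean feature vectors. The function name train and the alpha parameter are assumptions for illustration. With alpha=0 the code computes exactly the ratios above; a positive alpha adds Laplace-style smoothing, which the answer does not mention but which avoids probabilities of exactly 0 or 1.

```python
from collections import Counter

def train(examples, alpha=0.0):
    """Estimate p(C) and p(F|C) from (feature_vector, category) pairs."""
    class_counts = Counter(category for _, category in examples)
    total = len(examples)
    n_features = len(examples[0][0])

    # feature_counts[c][i] = number of documents of class c that contain word i
    feature_counts = {c: [0] * n_features for c in class_counts}
    for vector, category in examples:
        for i, present in enumerate(vector):
            if present:
                feature_counts[category][i] += 1

    # p(C) = documents of class C / total documents
    priors = {c: count / total for c, count in class_counts.items()}

    # p(F|C) = documents of class C with feature F / documents of class C.
    # alpha is an assumed smoothing term (not in the answer); alpha=0 disables it.
    conditionals = {
        c: [(k + alpha) / (class_counts[c] + 2 * alpha) for k in feature_counts[c]]
        for c in class_counts
    }
    return priors, conditionals

examples = [
    ([True, False], "Food"),
    ([True, True], "Food"),
    ([False, True], "Music"),
]
priors, conditionals = train(examples)
```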
Decision:

Given an unclassified document with features F1, …, Fn, the probability of it belonging to class C, P(C|F1, …, Fn), is proportional to P(C) * P(F1|C) * P(F2|C) * … * P(Fn|C). Here n is the size of the feature vector (100 in the example above). Pick the C that maximizes this term.
Since multiplying many small probabilities can underflow to zero in floating point, you can use the sum of the logs instead, which is maximized at the same C: log P(C) + log P(F1|C) + log P(F2|C) + … + log P(Fn|C).
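The decision rule in log form could look something like the sketch below. It assumes the priors and per-feature probabilities are already available as plain dictionaries (the numbers here are made up for the example), and it reads each boolean feature as a present/absent outcome, so an absent word contributes log(1 - p(F|C)). That reading, and the names safe_log, log_score and classify, are assumptions rather than something the answer spells out.

```python
import math

def safe_log(p):
    """log(p), with log(0) treated as negative infinity instead of an error."""
    return math.log(p) if p > 0.0 else float("-inf")

def log_score(vector, prior, cond):
    """log P(C) plus the sum of log P(F_i|C) over all features, for one class."""
    score = safe_log(prior)
    for present, p in zip(vector, cond):
        # Assumption: an absent word contributes 1 - p(word | C).
        score += safe_log(p if present else 1.0 - p)
    return score

def classify(vector, priors, conditionals):
    """Return the class with the highest log score."""
    return max(priors, key=lambda c: log_score(vector, priors[c], conditionals[c]))

# Made-up probabilities for two classes and two features.
priors = {"Food": 0.5, "Music": 0.5}
conditionals = {"Food": [0.9, 0.2], "Music": [0.1, 0.8]}
label = classify([True, False], priors, conditionals)
```

Because the logarithm is monotonic, the class that maximizes the sum of logs is the same one that maximizes the product, but the sum stays representable even with hundreds of features.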
