Natural language classification (machine learning)
Source: Internet  Editor: 程序博客网  Time: 2024/06/06 03:30
Stack Overflow:
From: https://stackoverflow.com/questions/2696392/i-want-a-machine-to-learn-to-categorize-short-texts
Question :
I want a machine to learn to categorize short texts
I have a ton of short stories about 500 words long and I want to categorize them into one of, let’s say, 20 categories:
Entertainment
Food
Music
etc
I can hand-classify a bunch of them, but I want to implement machine learning to guess the categories eventually. What’s the best way to approach this? Is there a standard approach to machine learning I should be using? I don’t think a decision tree would work well since it’s text data…I’m completely new in this field.
Any help would be appreciated, thanks!
Answer :
Naive Bayes
A naive Bayes classifier will most probably work for you. The method is as follows:
Fix a number of categories and get a training data set of (document, category) pairs.
The data vector for your document will be something like a bag of words: for example, take the 100 most common words, excluding stop words such as “the” and “and”. Each word gets a fixed position in your data vector (e.g. “food” is position 5). A feature vector is then an array of booleans, each indicating whether that word occurs in the corresponding document.
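As a rough illustration of the bag-of-words step, here is a minimal Python sketch. The function names (tokenize, build_vocabulary, to_feature_vector), the tiny stop-word list and the sample documents are all made up for this example and are not part of the original answer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is", "it"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents, size=100):
    """Return the `size` most common non-stop words across all documents."""
    counts = Counter(
        word
        for doc in documents
        for word in tokenize(doc)
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(size)]

def to_feature_vector(text, vocabulary):
    """Boolean vector: element i is True if vocabulary[i] occurs in the text."""
    present = set(tokenize(text))
    return [word in present for word in vocabulary]

# Made-up sample documents, only to show the calls.
docs = ["The food was great and the music was loud", "Music and more music"]
vocab = build_vocabulary(docs, size=3)
vector = to_feature_vector("I like food", vocab)
```

The vocabulary is built once from the training documents and then reused unchanged for every new document, so that each word keeps its fixed position in the vector.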
Training:
For your training set, calculate the prior probability of every class: p(C) = number of documents of class C / total number of documents.
Then calculate the probability of each feature within each class: p(F|C) = number of documents of class C that have the given feature (e.g. the word “food” is in the text) / number of documents in class C.
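The two training formulas can be sketched in Python roughly as below. The function name train and the alpha parameter are assumptions made for this example: alpha is an add-alpha smoothing constant that the answer does not mention, and alpha=0 gives exactly the plain ratios described above.

```python
from collections import Counter, defaultdict

def train(examples, alpha=0.0):
    """Estimate p(C) and p(F|C) from (feature_vector, category) pairs.

    alpha is an optional smoothing constant (an assumption, not in the
    answer); with alpha=0 the results are the plain count ratios.
    """
    class_counts = Counter(category for _, category in examples)
    n_features = len(examples[0][0])

    # feature_counts[c][i] = number of documents of class c containing feature i
    feature_counts = defaultdict(lambda: [0] * n_features)
    for features, category in examples:
        for i, present in enumerate(features):
            if present:
                feature_counts[category][i] += 1

    total = len(examples)
    priors = {c: n / total for c, n in class_counts.items()}
    likelihoods = {
        c: [
            (feature_counts[c][i] + alpha) / (class_counts[c] + 2 * alpha)
            for i in range(n_features)
        ]
        for c in class_counts
    }
    return priors, likelihoods

# Made-up training set with two features, e.g. ("food", "music").
examples = [
    ([True, False], "Food"),
    ([True, True], "Food"),
    ([False, True], "Music"),
]
priors, likelihoods = train(examples)
```

With unsmoothed counts a word that never appears in some class gets probability 0 for that class, which is one reason a small positive alpha is often used in practice.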
Decision:
Given an unclassified document, the probability of it belonging to class C, P(C|F1, …, Fn), is proportional to P(C) * P(F1|C) * P(F2|C) * … * P(Fn|C), where n is the number of features in your vector. Pick the C that maximizes this product.
Since multiplying many small probabilities is numerically unstable (the product can underflow to zero), you can use the sum of the logs instead, which is maximized at the same C: log P(C) + log P(F1|C) + log P(F2|C) + … + log P(Fn|C).
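The decision rule with summed logs might look like the following Python sketch. It assumes the priors and per-class feature probabilities have already been estimated and are strictly between 0 and 1 (for example through smoothing), since log(0) is undefined. The function name classify and the numbers are made up for illustration. One interpretation choice here is not spelled out in the answer: when a word is absent from the document, the sketch uses 1 - p(F|C) for that feature.

```python
import math

def classify(features, priors, likelihoods):
    """Return the category with the highest log score for a boolean vector."""
    best_category, best_score = None, -math.inf
    for category, prior in priors.items():
        score = math.log(prior)
        for present, p in zip(features, likelihoods[category]):
            # Use p(F|C) if the word occurs, 1 - p(F|C) if it does not.
            score += math.log(p if present else 1.0 - p)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Made-up probabilities for two features ("food", "music").
priors = {"Food": 0.5, "Music": 0.5}
likelihoods = {"Food": [0.9, 0.2], "Music": [0.1, 0.8]}

print(classify([True, False], priors, likelihoods))  # prints Food
```

Because the logarithm is monotonic, the category with the largest log sum is the same one that would maximize the original product.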