Basic Problems and Basic Methods of Data Mining

Source: Internet | Posted by: java.long.ill | Editor: 程序博客网 | Time: 2024/06/05 04:09

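My English is not very good, but I will still try to express my points in English. Key points get an extra note in parentheses (originally written in Chinese).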

(I) Basic problems in data mining:

Data mining is the study of how to mine information from huge amounts of data. Four kinds of problems are studied in this field.

(1) Classification:

You already have a lot of data and you know which group each item belongs to (this labeled data is called the training set). Your task is to find the common features of a group that distinguish it from the other groups. (In the feature vector space, this means the points of different groups should be as far apart as possible.)

You can then use these features to build a decision tree that classifies newly arriving data. The difficulty is how to find these features.
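To make "finding the feature that separates two groups" a bit more concrete, here is a minimal sketch. It assumes a single numeric feature and two groups labeled "A" and "B"; the training data and the function names (`learn_threshold`, `classify`) are made up for illustration and are not from any particular library.

```python
def learn_threshold(training_set):
    """Pick the threshold on one numeric feature that misclassifies the
    fewest training examples. Assumption: group "A" lies at or below the
    threshold and group "B" lies above it."""
    values = sorted(x for x, _ in training_set)
    # Candidate thresholds: midpoints between neighbouring feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        # Count examples whose label disagrees with the rule "x <= t means A".
        return sum((label == "A") != (x <= t) for x, label in training_set)

    return min(candidates, key=errors)


# Hypothetical training set: (feature value, known group).
training_set = [(1.0, "A"), (2.0, "A"), (3.0, "A"),
                (7.0, "B"), (8.0, "B"), (9.0, "B")]
threshold = learn_threshold(training_set)


def classify(x):
    """Classify a new data point with the learned one-question tree."""
    return "A" if x <= threshold else "B"
```

This is a decision tree with a single question. Real data usually needs several features and several questions, which is where the difficulty mentioned above comes from.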

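(2) Clustering:

You have a lot of data and you know little about them. Now decide how many groups exist and, for each data item, which group it belongs to.

(Clustering: partitioning a pile of data into several specific classes. For example, given a pile of news web pages, you can divide them by title into politics, entertainment, sports, economy, and technology.)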
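As a rough illustration of clustering, the sketch below uses k-means on one-dimensional numbers. K-means is my own choice of example here, not a method named in these notes, and it assumes the number of groups `k` is already given, which sidesteps the "how many groups exist" half of the problem. The function name and the data are hypothetical.

```python
def kmeans_1d(points, k, iterations=10):
    """Tiny k-means for 1-D numbers. The caller must choose k."""
    ordered = sorted(points)
    # Assumption: start from evenly spaced values of the sorted data so the
    # result is deterministic (real implementations often start randomly).
    step = max(1, len(ordered) // k)
    centers = ordered[::step][:k]
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point into the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters


# Made-up unlabeled data with three visible groups.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
groups = kmeans_1d(data, k=3)
```

Note that nothing in the input says which point belongs where. The groups come out of the data itself, which is the difference from classification.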

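(3) Association rules (association or relationship):

Find whether a relationship exists between two data attributes. This is often used in e-commerce.

(Discover whether associated features exist across different data. For example, in Western societies people who buy diapers often also buy beer <this has to do with Western family life>.)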
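One common way to put a number on a rule such as "diapers => beer" is to compute its support and confidence. The sketch below does that on a handful of invented shopping baskets; the data is purely illustrative and the function names are my own.

```python
# Made-up transactions, one set of items per shopping basket.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
```

With this toy data, 2 of the 5 baskets contain both items, and 2 of the 3 baskets with diapers also contain beer.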

(4) Data prediction (forecasting)

Using the data you have in hand, forecast the next data point.
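A very simple way to forecast the next value is to fit a straight line through the past values and extend it by one step. This is only one possible approach, picked here for illustration. It assumes the data is an evenly spaced series with a roughly linear trend and at least two points; the function name and the numbers are made up.

```python
def forecast_next(series):
    """Fit y = intercept + slope * x by least squares, where x is the
    position 0, 1, 2, ... in the series, and return the value at x = n."""
    n = len(series)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * n


# Hypothetical past observations that grow by 2 at each step.
history = [10, 12, 14, 16]
next_value = forecast_next(history)
```

For this perfectly linear toy series the line continues to 18. Real data is noisier, and the linear assumption may not hold at all.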

(II) Basic methods of data mining:

(1) Classification:

Decision tree: it is just a big name for a simple concept. Once a decision tree has been built, the problem is easy to solve. To a coder, it is just a pile of if...else statements. How to build the decision tree, however, is an important topic.

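(Decision tree: an intimidating name. Once the decision tree has been built, the problem is simple. To a programmer it is nothing more than a pile of if...else statements. Building a good decision tree, however, is hard. For example: if an angle is greater than 90 degrees it is obtuse, if it equals 90 degrees it is a right angle, and if it is less than 90 degrees it is acute. That is a decision tree, and you can implement it directly in code.)

(The point of classification is that real problems use other common features as well, not just the single 90-degree test.)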
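The angle example above really is just a few if...else statements. A minimal sketch, assuming the input is an angle strictly between 0 and 180 degrees (the function name is my own):

```python
def classify_angle(degrees):
    """Hand-built decision tree for the angle example.
    Assumption: 0 < degrees < 180."""
    if degrees > 90:
        return "obtuse"
    elif degrees == 90:
        return "right"
    else:
        return "acute"


labels = [classify_angle(a) for a in (30, 90, 120)]
```

Here the tree was written by hand because the rule is already known. The hard part in data mining is deriving such rules from the training set.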

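Neural network: it adjusts the connection strengths between units in response to externally supplied data. The problems are that the learning process is relatively slow and the learned knowledge is hard to interpret.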
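To show what "adjusting the connection strengths" can look like, here is a sketch of the smallest possible case: a single unit (a perceptron) that learns the logical AND of two inputs. The perceptron rule and the AND task are my own choice of illustration, not something these notes specify, and all names are hypothetical.

```python
def train_perceptron(samples, epochs=20, rate=1):
    """Train one unit with two inputs using the perceptron update rule.
    The weights play the role of connection strengths."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
            error = target - output
            # Strengthen or weaken each connection in proportion to the error.
            weights[0] += rate * error * x1
            weights[1] += rate * error * x2
            bias += rate * error
    return weights, bias


def predict(weights, bias, x1, x2):
    return 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0


# Externally supplied data: the truth table of logical AND.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
```

Even in this tiny case the learned weights are just numbers that happen to work, which hints at why the knowledge inside a larger network is hard to interpret.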

(2) Clustering methods: there are many clustering methods; I will summarize them gradually later.

(3) Association rules: unknown

(4) Data prediction: unknown