Basic Problems and Basic Methods of Data Mining

Source: Internet  Published: java.long.ill  Editor: 程序博客网  Date: 2024/06/05 04:09

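My English is not very good, but I will still try to express my ideas in English. Key points come with an extra note in parentheses.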

(I) Basic problems in data mining:

Data mining is the study of how to extract information from large amounts of data. Four kinds of problems are studied in this field.

(1) Classification:

       You already have a lot of data, and you know which group each item belongs to (this labeled data is called the training set). Your task is to find the common features of a group that distinguish it from the other groups. (In other words, in the feature vector space, points from different groups should lie as far apart as possible.)

       You can then use these features to build a decision tree that classifies newly arriving data. The difficulty is finding these features.
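As a rough illustration of "finding the feature that separates the groups", the sketch below tries every feature and every threshold on a tiny training set and keeps the split that misclassifies the fewest items. The data, the two features (height and weight), and the function name `best_split` are all made up for this example; real decision-tree learners use more refined criteria.

```python
def best_split(samples, labels):
    """Return (feature_index, threshold, errors) of the best single-feature split.

    A sample is predicted as group 1 when its feature value is greater than
    the threshold, otherwise as group 0.
    """
    best = None
    for f in range(len(samples[0])):
        values = sorted({s[f] for s in samples})
        # Candidate thresholds: midpoints between neighbouring distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            errors = sum(
                (1 if s[f] > threshold else 0) != y
                for s, y in zip(samples, labels)
            )
            if best is None or errors < best[2]:
                best = (f, threshold, errors)
    return best


# Hypothetical training set: (height_cm, weight_kg) -> group 0 or 1.
training_samples = [(150, 50), (160, 80), (155, 52), (180, 55), (175, 78), (158, 82)]
training_labels = [0, 1, 0, 0, 1, 1]

feature, threshold, errors = best_split(training_samples, training_labels)
print(feature, threshold, errors)
```

On this toy data the second feature (weight) separates the two groups without any training error, while height alone does not. That single test would become the root node of a decision tree.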

 

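(2) Clustering:

      You have a lot of data and you know little about it. You have to decide how many groups exist and, for each item, which group it belongs to.

(Clustering: partitioning a pile of data into several distinct classes. For example, given a pile of news web pages, you can divide them by title into politics, entertainment, sports, economy, and technology.)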
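One common clustering method is k-means, which this post does not name; the sketch below is only meant to show what "assigning each item to a group" looks like in code. It assumes the number of groups is already chosen (two starting centroids are passed in by hand), so it does not answer the harder question of how many groups exist. The points are invented.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [points[0], points[3]])
print(clusters)
```

The result depends on the starting centroids; here they are picked by hand to keep the example deterministic.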

 

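(3) Association rules:

       Find whether a relationship exists between two data attributes. This is often used in e-commerce.

       (Discover whether related features exist across different data. For example, in Western societies people who buy diapers often also buy beer <this has to do with Western family life>.)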
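The diapers-and-beer idea is usually measured with two numbers, support and confidence. The sketch below computes both on a handful of shopping baskets that I made up for illustration; the function names are my own and not from any library.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]

print(support(baskets, {"diapers", "beer"}))       # 2 of the 5 baskets
print(confidence(baskets, {"diapers"}, {"beer"}))  # 2 of the 3 diaper baskets
```

A rule such as "diapers => beer" is considered interesting only when both numbers are high enough; the cut-off values are a choice the analyst has to make.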

 

(4) Forecasting:

          Using the data you have in hand, forecast the next data point.
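The simplest forecast I can think of is a moving average: predict the next value as the mean of the last few values. This is only one possible method, chosen here because it is short; the sales figures and the window size are made up.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` values."""
    if window <= 0 or len(series) < window:
        raise ValueError("need a positive window and at least `window` observations")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical daily sales figures.
sales = [10, 12, 11, 13, 15, 14]
print(moving_average_forecast(sales))  # mean of 13, 15 and 14
```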

               

(II) Basic methods of data mining:

    (1) Classification:

             Decision tree: it is a big name for a simple concept. Once a decision tree has been built, the problem is easy to solve. To a programmer, it is just a pile of if...else statements. How to build a decision tree, however, is an important topic.

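            (Decision tree: an intimidating name. Once the tree has been built, the problem is simple; to a programmer it is just a pile of if...else statements. Building a good decision tree, however, is hard. For example: if an angle is greater than 90 degrees it is obtuse, if it equals 90 degrees it is a right angle, and if it is less than 90 degrees it is acute. That is a decision tree, and you can easily implement it in code.)

             (The point of classification is that real problems use other shared features as well, not just this single 90-degree test.)

            Neural network: it learns by adjusting the connection strengths between its units in response to externally supplied data. The problems are that learning is relatively slow and the learned knowledge is hard to interpret.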
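The angle example can be written directly as if...else statements. This is just a sketch; the function name `classify_angle` is my own, and it assumes the input is an angle strictly between 0 and 180 degrees.

```python
def classify_angle(degrees):
    """Classify an angle, assumed to lie strictly between 0 and 180 degrees."""
    if degrees > 90:
        return "obtuse"
    elif degrees == 90:
        return "right"
    else:
        return "acute"


for angle in (30, 90, 120):
    print(angle, classify_angle(angle))
```

Here the tree was written by hand because the rule is known in advance. In data mining, the thresholds (like 90) have to be learned from the training set.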
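To make "adjusting the connection strengths" concrete, here is a minimal sketch of a single unit trained with the classic perceptron rule. It is far simpler than a real neural network, and the task (learning logical AND), the learning rate, and the number of epochs are all my own choices for illustration.

```python
def train_perceptron(samples, targets, learning_rate=0.1, epochs=20):
    """Train a single unit with a step activation using the perceptron rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, targets):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            output = 1 if activation > 0 else 0
            error = target - output
            # Strengthen or weaken each connection in proportion to the error.
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Output of the trained unit for one input vector."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Logical AND as a tiny, linearly separable training set.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]
weights, bias = train_perceptron(inputs, targets)
print([predict(weights, bias, x) for x in inputs])
```

Even this toy unit needs several passes over four samples before it stops making mistakes, and the final weights are just numbers with no obvious meaning, which hints at both problems mentioned above.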

    

   (2) Clustering: there are many clustering methods; I will summarize them later.

   (3) Association rules: unknown

   (4) Forecasting: unknown