Data Mining: Top 10 Algorithms (Preface)

http://www.tnove.com/?p=209

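I have long wanted to write an analysis of the algorithms in "Top 10 Algorithms in Data Mining", partly as a review for myself. I never had the time until now, so here are some notes at last.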

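First, some background on the event. It took place at the 2006 IEEE International Conference on Data Mining (ICDM), and the results were later written up as a paper. The PDF version of the paper briefly describes the origin and contribution of each of the 10 selected algorithms, and it also describes in detail how the top 10 were chosen. For convenience, I summarize the process here: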

The process had three steps:

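A. Nomination

At ICDM 2006, winners of the ACM KDD Innovation Award and the IEEE ICDM Research Contributions Award were invited to nominate candidates for the top 10. Each of them nominated up to 10 algorithms they considered most important, and for each one gave the reason for the nomination and a representative paper. Every nominated algorithm had to be widely studied and cited in the field.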

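B. Verification

The citation count of each nominated algorithm was checked on Google Scholar, and nominations with fewer than 50 citations were removed from the list. This left 18 algorithms.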

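C. Voting

The following were invited to vote:

(a) Program Committee members of KDD06, ICDM06, and SDM06

(b) Winners of the ACM KDD Innovation Award and the IEEE ICDM Research Contributions Award

The Top 10 algorithms were then selected by ranking the votes.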
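The verification and voting steps above can be sketched in a few lines of Python. This is only an illustration of the procedure as described here, not the organizers' actual tooling: the function names `review` and `tally`, the placeholder algorithm names, the citation counts, and the ballot format (each voter submits a list of names) are all assumptions made for the example.

```python
from collections import Counter

MIN_CITATIONS = 50  # threshold used in the verification step
TOP_N = 10          # size of the final list


def review(nominations, min_citations=MIN_CITATIONS):
    """Keep nominations whose citation count is at least min_citations."""
    return {name for name, cites in nominations.items() if cites >= min_citations}


def tally(ballots, candidates, top_n=TOP_N):
    """Count one vote per voter per shortlisted candidate and rank the results."""
    counts = Counter()
    for ballot in ballots:
        # A voter counts at most once per candidate; names that did not
        # survive the verification step are ignored.
        counts.update(set(ballot) & candidates)
    # Sort by votes (descending), then by name so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical data: placeholder names and made-up citation counts.
nominations = {"alg_a": 120, "alg_b": 49, "alg_c": 50, "alg_d": 300}
shortlist = review(nominations)
print(sorted(shortlist))  # ['alg_a', 'alg_c', 'alg_d']

ballots = [["alg_a", "alg_d"], ["alg_d", "alg_b"], ["alg_c", "alg_d", "alg_a"]]
print(tally(ballots, shortlist, top_n=2))  # [('alg_d', 3), ('alg_a', 2)]
```

In this sketch `alg_b` is dropped for having 49 citations, so the vote cast for it is not counted. How the real vote handled ties is not covered here; the alphabetical tie-break is only there to make the output reproducible.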

For reference, here are the 18 algorithms that remained after the verification step:

A. Classification

C4.5 (1993) C4.5: Programs for Machine Learning
CART (1984) Classification and Regression Trees
K Nearest Neighbors (KNN) (1996) Discriminant Adaptive Nearest Neighbor Classification
Naïve Bayes (2001) Idiot's Bayes: Not So Stupid After All?

B. Statistical Learning

SVM (1995) The Nature of Statistical Learning Theory
EM (2000) Finite Mixture Models

C. Association Analysis

Apriori (1994) Fast Algorithms for Mining Association Rules
FP-Tree (2000) Mining Frequent Patterns without Candidate Generation

D. Link Mining

PageRank (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine
HITS (1998) Authoritative Sources in a Hyperlinked Environment

E. Clustering

K-Means (1967) Some Methods for Classification and Analysis of Multivariate Observations
BIRCH (1996) BIRCH: An Efficient Data Clustering Method for Very Large Databases

F. Bagging and Boosting

AdaBoost (1997) A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

G. Sequential Patterns

GSP (1996) Mining Sequential Patterns: Generalizations and Performance Improvements
PrefixSpan (2001) PrefixSpan: Mining Sequential Patterns Efficiently by Projected Pattern Growth

H. Integrated Mining

CBA (1998) Integrating Classification and Association Rule Mining

I. Rough Sets

Finding reduct (1992) Rough Sets: Theoretical Aspects of Reasoning about Data

J. Graph Mining

gSpan (2002) gSpan: Graph-Based Substructure Pattern Mining

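The Top 10 algorithms produced by the final vote were: C4.5, K-Means, SVM, Apriori, EM, PageRank, AdaBoost, KNN, Naïve Bayes, and CART.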

A guide to using these 10 algorithms has since been published as a book with nearly the same name: The Top Ten Algorithms in Data Mining.

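The top 10 from the vote held at the ICDM 2006 conference itself matched this result, which suggests that the list is broadly accepted in the data mining community. In the posts that follow, I will introduce the algorithms in order of their ranking.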