Big data is all the rage these days, and I can hardly afford not to keep up with the times...
This post summarizes the basic ideas behind three fundamental data-mining algorithms: decision trees (ID3, C4.5), K-means, and Naive Bayes.
1. Decision Tree
· Source data: the goal is to predict Rain or not
No.   Cloud   Time    Tempera   Rain
1     Lot     Day     15.6      Y
2     Lot     Night   30.2      Y
3     Lot     Night   1.2       N
4     Bit     Night   -2.3      N
5     Lot     Day     10.2      Y
6     Bit     Day     13.2      N
7     Some    Day     28.3      N
8     Lot     Day     32.2      Y
9     Lot     Night   11.6      Y
10    Some    Day     10.7      N
· Decision tree - partial counts (Y/N):
Total: Y/N = 5/5
Cloud: Lot 5/1, Some 0/2, Bit 0/2
Time:  Day 3/3, Night 2/2
· Algorithm (log is the base-2 logarithm; the Cloud attribute is used as the example)
<1> ID3 (ID3 cannot handle the continuous attribute Tempera, so that attribute is ignored here)
I = -[0.5*log(0.5) + 0.5*log(0.5)] = 1;
G(Cloud) = I - [ (6/10)*(-(5/6)*log(5/6) - (1/6)*log(1/6)) + (2/10)*(-(0/2)*log(0/2) - (2/2)*log(2/2)) + (2/10)*(-(0/2)*log(0/2) - (2/2)*log(2/2)) ], where 0*log(0) is taken to be 0;
The attribute with the highest G value is used as the split at this node, i.e., the data are partitioned by that attribute's distinct values.
For example, if G(Cloud) is the highest, splitting on Cloud gives three branches: Lot, Some, and Bit.
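A minimal sketch of this computation in plain Java (the class and helper names here are my own, and 0*log(0) is treated as 0):
public class ID3Gain
{
    // Entropy of a Y/N count pair, base-2 log, with 0*log(0) taken as 0.
    static double entropy(double yes, double no)
    {
        double total = yes + no;
        double h = 0.0;
        if (yes > 0) h -= (yes / total) * (Math.log(yes / total) / Math.log(2));
        if (no > 0) h -= (no / total) * (Math.log(no / total) / Math.log(2));
        return h;
    }

    public static void main(String[] args)
    {
        // Overall: 5 rainy / 5 dry rows, so I = 1.0.
        double i = entropy(5, 5);
        // Cloud counts from the table: Lot 5/1, Some 0/2, Bit 0/2.
        double remainder = (6.0 / 10) * entropy(5, 1)
                         + (2.0 / 10) * entropy(0, 2)
                         + (2.0 / 10) * entropy(0, 2);
        double gain = i - remainder; // G(Cloud), roughly 0.61
        System.out.println("I = " + i + ", G(Cloud) = " + gain);
    }
}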
<2> C4.5
* Gain ratio
Split(Cloud) = -[ (6/10)*log(6/10) + (2/10)*log(2/10) + (2/10)*log(2/10) ];
GRatio(Cloud) = G(Cloud) / Split(Cloud);
* Splitting a continuous attribute
Tempera offers 9 candidate cut points (one between each pair of adjacent sorted values). For each candidate, compute the entropy of the two-valued discrete attribute that the cut produces, and keep the n best cut points (those giving the highest gain) as required.
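A rough sketch of both steps (again plain Java; the counts are read off the table above, the cut point 11.0 is just one of the nine candidates, and all names are my own):
public class C45Sketch
{
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy of a Y/N count pair, with 0*log(0) taken as 0.
    static double entropy(double yes, double no)
    {
        double t = yes + no;
        double h = 0.0;
        if (yes > 0) h -= (yes / t) * log2(yes / t);
        if (no > 0) h -= (no / t) * log2(no / t);
        return h;
    }

    public static void main(String[] args)
    {
        // Gain ratio for Cloud: its value counts are 6, 2 and 2 out of 10 rows.
        double gain = 1.0 - ((6.0 / 10) * entropy(5, 1)
                           + (2.0 / 10) * entropy(0, 2)
                           + (2.0 / 10) * entropy(0, 2));   // G(Cloud)
        double split = -((6.0 / 10) * log2(6.0 / 10)
                       + (2.0 / 10) * log2(2.0 / 10)
                       + (2.0 / 10) * log2(2.0 / 10));      // Split(Cloud)
        System.out.println("GRatio(Cloud) = " + gain / split);

        // Continuous attribute: one candidate cut of Tempera at 11.0.
        // Rows with Tempera <= 11.0: {-2.3 N, 1.2 N, 10.2 Y, 10.7 N} -> 1Y/3N
        // Rows with Tempera  > 11.0: the remaining six rows          -> 4Y/2N
        double cutRemainder = (4.0 / 10) * entropy(1, 3) + (6.0 / 10) * entropy(4, 2);
        System.out.println("Gain of the cut at 11.0 = " + (1.0 - cutRemainder));
    }
}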
2. K-means
Pseudocode:
a. Assume the number of clusters class_Num = 3 is known. Take the first three records of the data as the initial centers of the three classes (a random or other initialization scheme could be used here instead).
b. Compute the distance from every remaining record to each of the three centers and assign it to the cluster whose center is closest.
c. Once all records have been assigned, recompute each cluster's center as the mean of its members.
d. If any new center differs from the old one, repeat steps b and c.
e. If the centers are unchanged, the algorithm terminates.
Java code:
I ran my own MapReduce K-means program on Hadoop. It is actually the solution to a K-means exercise; the original problem is at http://cloudcomputing.ruc.edu.cn/Chinese/problempage.jsp?id=1009
The code (package mine) follows.
K_means.java:
package mine;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// Driver: wires MyMapper and MyReducer into a single "k-means" job.
public class K_means
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2)
        {
            System.err.println("Usage: k-means <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "k-means");
        job.setJarByClass(K_means.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
MyMapper.java:
package mine;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Emits every input point under the same key "1" so that a single
// reducer receives the whole data set and can cluster it in one pass.
public class MyMapper extends Mapper<LongWritable, Text, Text, Text>
{
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException
    {
        String source = value.toString();
        // Strip the surrounding parentheses, e.g. "(1,2)" -> "1,2".
        context.write(new Text("1"), new Text(source.substring(1, source.length() - 1)));
    }
}
MyReducer.java:
package mine;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Receives every point (under the single key "1") and runs a plain
// 2-means clustering over them, then emits the two centers and the
// points assigned to each.
public class MyReducer extends Reducer<Text, Text, Text, Text>
{
    // Euclidean distance between two 2-D points.
    public double countLength(DoubleWritable[] a, DoubleWritable[] b)
    {
        double dx = b[0].get() - a[0].get();
        double dy = b[1].get() - a[1].get();
        return Math.sqrt(dx * dx + dy * dy);
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException
    {
        // Parse every "x,y" value into a point; start with all points in class1.
        List<DoubleWritable[]> class1 = new ArrayList<DoubleWritable[]>();
        List<DoubleWritable[]> class2 = new ArrayList<DoubleWritable[]>();
        for (Text val : values)
        {
            String[] s = val.toString().split(",");
            DoubleWritable[] point = new DoubleWritable[2];
            point[0] = new DoubleWritable(Integer.parseInt(s[0]));
            point[1] = new DoubleWritable(Integer.parseInt(s[1]));
            class1.add(point);
        }
        // Use the first two points as the initial centers (k = 2).
        DoubleWritable[] label1 = { class1.get(0)[0], class1.get(0)[1] };
        DoubleWritable[] label2 = { class1.get(1)[0], class1.get(1)[1] };
        int change = 1;
        while (change == 1)
        {
            change = 0;
            // Re-assign every point to the nearer of the two centers.
            int c1 = class1.size();
            int c2 = class2.size();
            for (int j = 0; j < c1; j++)
            {
                DoubleWritable[] p = class1.remove(0);
                if (countLength(p, label1) <= countLength(p, label2))
                {
                    class1.add(p);
                } else
                {
                    class2.add(p);
                }
            }
            for (int j = 0; j < c2; j++)
            {
                DoubleWritable[] p = class2.remove(0);
                if (countLength(p, label1) <= countLength(p, label2))
                {
                    class1.add(p);
                } else
                {
                    class2.add(p);
                }
            }
            // Recompute each center as the mean of its members;
            // if a center moved, do another pass.
            double sumX = 0.0;
            double sumY = 0.0;
            for (DoubleWritable[] p : class1)
            {
                sumX += p[0].get();
                sumY += p[1].get();
            }
            if (!new DoubleWritable(sumX / class1.size()).equals(label1[0]))
            {
                label1[0] = new DoubleWritable(sumX / class1.size());
                change = 1;
            }
            if (!new DoubleWritable(sumY / class1.size()).equals(label1[1]))
            {
                label1[1] = new DoubleWritable(sumY / class1.size());
                change = 1;
            }
            sumX = 0.0;
            sumY = 0.0;
            for (DoubleWritable[] p : class2)
            {
                sumX += p[0].get();
                sumY += p[1].get();
            }
            if (!new DoubleWritable(sumX / class2.size()).equals(label2[0]))
            {
                label2[0] = new DoubleWritable(sumX / class2.size());
                change = 1;
            }
            if (!new DoubleWritable(sumY / class2.size()).equals(label2[1]))
            {
                label2[1] = new DoubleWritable(sumY / class2.size());
                change = 1;
            }
        }
        // Format the result: each center followed by the points in its cluster.
        StringBuilder result = new StringBuilder();
        result.append("\n(").append(label1[0]).append(",").append(label1[1]).append(") ::\n");
        for (DoubleWritable[] p : class1)
        {
            result.append("(").append(p[0]).append(",").append(p[1]).append(")\n");
        }
        result.append("(").append(label2[0]).append(",").append(label2[1]).append(") ::\n");
        for (DoubleWritable[] p : class2)
        {
            result.append("(").append(p[0]).append(",").append(p[1]).append(")\n");
        }
        // All output goes out under a single separator key.
        context.write(new Text("|"), new Text(result.toString()));
    }
}
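To try the job, the input would be one point per line in the form "(x,y)" (the mapper simply strips the first and last characters), and it could be launched with something along the lines of "hadoop jar kmeans.jar mine.K_means <in> <out>"; the jar name here is only a placeholder.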
3. Naive Bayes
· For each attribute value, compute the probability of each outcome of the target attribute; then, using Bayes' theorem, combine these probabilities to predict the target attribute. (Target attribute: e.g., the Rain attribute in the decision-tree data above.)
· Classification based on Bayes' theorem
Given:
an object X = {A0, A1, A2, ..., An}, where the Ak are the attributes of X, e.g. A0=3, A1=0, A2=Y, ..., Ak="bb";
and classes C = {C0, C1, C2, ..., Cn}. Based on Bayes' theorem, compute:
P(X)*P(C0|X), P(X)*P(C1|X), P(X)*P(C2|X), ..., P(X)*P(Cn|X).
If P(X)*P(Ck|X) is the maximum of these, the object X is assigned to class Ck.
Here P(X)*P(Ck|X) = P(X) * [P(Ck&X)/P(X)] = P(Ck&X) = P(Ck)*P(X|Ck), i.e.:
P(X)*P(Ck|X) = P(Ck)*P(X|Ck); ...........................................Formula 3.1
· Example (the data are the same as in the decision-tree section above):
== Based on Formula 3.1 ==
From the data we get:
probability of rain: R = P(Rain) = 50%;
probability of no rain: N = P(NotRain) = 50%.
Then (counting within the Rain rows and the NotRain rows respectively):
probability of daytime given rain: P(Day|Rain) = 60.0%; likewise P(Day|NotRain) = 60.0%;
probability of heavy cloud given rain: P(LotCloud|Rain) = 100.0%; likewise P(LotCloud|NotRain) = 20.0%.
Let (using the naive assumption that the attributes are independent given the class):
A = P(Day & LotCloud | Rain) = 60% * 100%;
B = P(Day & LotCloud | NotRain) = 60% * 20%.
Then:
the score for "daytime, heavy cloud, rain" is R*A = 50% * 60% * 100% = 30%,
the score for "daytime, heavy cloud, no rain" is N*B = 50% * 60% * 20% = 6%.
If R*A > N*B,
the object (Day, LotCloud) is assigned to the "Rain" class (as it is here);
otherwise,
it is assigned to the "NotRain" class.
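As a minimal sketch of the same calculation in plain Java (the counts come from the table above; the class and variable names are my own):
public class NaiveBayesSketch
{
    public static void main(String[] args)
    {
        // Priors from the table: 5 rainy rows, 5 dry rows out of 10.
        double pRain = 5.0 / 10;
        double pNotRain = 5.0 / 10;
        // Per-attribute conditional probabilities, counted within each class.
        double pDayGivenRain = 3.0 / 5;   // P(Day | Rain)
        double pDayGivenNot = 3.0 / 5;    // P(Day | NotRain)
        double pLotGivenRain = 5.0 / 5;   // P(LotCloud | Rain)
        double pLotGivenNot = 1.0 / 5;    // P(LotCloud | NotRain)
        // Naive assumption: attributes are independent given the class,
        // so P(X|Ck) is the product of the per-attribute probabilities.
        double scoreRain = pRain * pDayGivenRain * pLotGivenRain;      // 0.30
        double scoreNotRain = pNotRain * pDayGivenNot * pLotGivenNot;  // 0.06
        System.out.println(scoreRain > scoreNotRain ? "Rain" : "NotRain");
    }
}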