Data Mining: Algorithm Summary 1


 (2013-01-23 12:59:27)
Tags: big data, data mining, algorithms, decision tree, clustering
Category: 代码控

  Big data is getting hotter and hotter these days; how could yours truly not keep up with the times......

  This post collects the basic ideas of three fundamental data-mining algorithms: decision trees (ID3, C4.5), K-means, and Naive Bayes.


1. Decision Tree

·The data: our goal is to predict Rain or not.

No.   Cloud   Time    Tempera   Rain
1     Lot     Day     15.6      Y
2     Lot     Night   30.2      Y
3     Lot     Night   1.2       N
4     Bit     Night   -2.3      N
5     Lot     Day     10.2      Y
6     Bit     Day     13.2      N
7     Some    Day     28.3      N
8     Lot     Day     32.2      Y
9     Lot     Night   11.6      Y
10    Some    Day     10.7      N

·Decision tree - partial counts (format: Y/N, i.e. rain/no rain):

 Total
        Y/N = 5/5

 Cloud
        Lot      5/1
        Some     0/2
        Bit      0/2

 Time
        Day      3/3
        Night    2/2
 

·Computation (all logs are base 2; the Cloud attribute serves as the example)

 

<1> ID3 (ID3 cannot handle the continuous attribute Tempera, so that attribute is ignored here)

              I        = -[0.5*log(0.5) + 0.5*log(0.5)] = 1;

              G(Cloud) = I - [ -(6/10)*((5/6)*log(5/6) + (1/6)*log(1/6))
                               -(2/10)*((0/2)*log(0/2) + (2/2)*log(2/2))
                               -(2/10)*((0/2)*log(0/2) + (2/2)*log(2/2)) ];

(0*log(0) is taken to be 0 by convention)

 

The attribute with the highest gain G is chosen as the split criterion, i.e. the data is partitioned by that attribute's distinct values.

For example, if G(Cloud) is the highest, splitting on Cloud yields three branches: Lot, Some, Bit.
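A minimal sketch of the gain computation above (the class and helper names are mine, not from the original post):

package mine;

public class InfoGain
{
    // Entropy of a yes/no count pair, with the 0*log(0) = 0 convention.
    static double entropy(int yes, int no)
    {
        double h = 0.0;
        int total = yes + no;
        for(int count : new int[] {yes, no})
        {
            if(count > 0)
            {
                double p = (double)count / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    public static void main(String args[])
    {
        double I = entropy(5, 5); // 1.0 bit: 5 rainy vs. 5 dry days overall
        // Cloud partitions from the counts above: Lot 5/1, Some 0/2, Bit 0/2
        double rest = (6.0 / 10) * entropy(5, 1)
                    + (2.0 / 10) * entropy(0, 2)
                    + (2.0 / 10) * entropy(0, 2);
        System.out.println("G(Cloud) = " + (I - rest)); // ~0.61
    }
}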

 

<2> C4.5

        * Gain ratio

        Split(Cloud)  = -[ (6/10)*log(6/10) + (2/10)*log(2/10) + (2/10)*log(2/10) ];

        GRatio(Cloud) = G(Cloud) / Split(Cloud);
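A minimal sketch of the gain-ratio computation (hypothetical class name; G(Cloud) is the approximate value from the ID3 calculation above):

package mine;

public class GainRatio
{
    static double log2(double x)
    {
        return Math.log(x) / Math.log(2);
    }

    public static void main(String args[])
    {
        double gCloud = 0.61; // approximate G(Cloud) from the ID3 section
        // Split info depends only on the partition sizes: Lot 6, Some 2, Bit 2 of 10.
        double split = -(0.6 * log2(0.6) + 0.2 * log2(0.2) + 0.2 * log2(0.2)); // ~1.37
        System.out.println("GRatio(Cloud) = " + gCloud / split); // ~0.44
    }
}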

 

        * Splitting a continuous attribute

         Tempera can be cut at 9 positions (one between each pair of adjacent sorted values). For each candidate position, compute the entropy of the attribute after the cut turns it into two discrete ranges. Depending on the requirements, keep the n cut points with the lowest resulting entropy (highest gain) to discretize the attribute. A sketch of this search follows.
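A minimal sketch of the cut-point search (hypothetical class name; for simplicity it keeps only the single best cut, i.e. n = 1, and hard-codes the (Tempera, Rain) pairs from the table, sorted by temperature):

package mine;

public class BestCut
{
    // Entropy of a yes/no count pair, with the 0*log(0) = 0 convention.
    static double entropy(int yes, int no)
    {
        double h = 0.0;
        int total = yes + no;
        for(int count : new int[] {yes, no})
        {
            if(count > 0)
            {
                double p = (double)count / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    public static void main(String args[])
    {
        // (Tempera, Rain) pairs from the table, sorted by temperature.
        double temp[] = {-2.3, 1.2, 10.2, 10.7, 11.6, 13.2, 15.6, 28.3, 30.2, 32.2};
        boolean rain[] = {false, false, true, false, true, false, true, false, true, true};

        double bestCut = 0.0, bestH = Double.MAX_VALUE;
        for(int i = 0; i < temp.length - 1; i++) // 9 candidate cut points
        {
            double cut = (temp[i] + temp[i + 1]) / 2; // midpoint between neighbors
            int ly = 0, ln = 0, ry = 0, rn = 0;
            for(int j = 0; j < temp.length; j++)
            {
                if(temp[j] <= cut) { if(rain[j]) ly++; else ln++; }
                else               { if(rain[j]) ry++; else rn++; }
            }
            int n = temp.length;
            double h = (double)(ly + ln) / n * entropy(ly, ln)
                     + (double)(ry + rn) / n * entropy(ry, rn);
            if(h < bestH) { bestH = h; bestCut = cut; }
        }
        System.out.println("best cut = " + bestCut + ", entropy = " + bestH);
    }
}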

 

2. K-means

 

Pseudocode:

a. Assume the number of clusters is known, class_Num = 3. Take the first three records of the source data as the three initial centers (a random or other seeding scheme could also be used here).

b. Compute the distance from each remaining record to the three centers, and assign each record to the cluster whose center is nearest.

c. After all records have been assigned, recompute each cluster's center as the mean of its members.

d. If any new center differs from the old one, repeat steps b and c.

e. If all centers are unchanged, the algorithm terminates.

Java code:

  I ran my own MapReduce K-means program on Hadoop. It was written to solve a K-means exercise; the original problem statement is at http://cloudcomputing.ruc.edu.cn/Chinese/problempage.jsp?id=1009

The code follows, in package mine.

K_means.java:

 

package mine;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// Driver: wires MyMapper and MyReducer into a single Hadoop job.
public class K_means
{
    public static void main(String args[])
        throws Exception
    {
        Configuration conf = new Configuration();
        String otherArgs[] = (new GenericOptionsParser(conf, args)).getRemainingArgs();
        if(otherArgs.length != 2)
        {
            System.err.println("Usage: k-means <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "k-means");
        job.setJarByClass(K_means.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

MyMapper.java:

 

package mine;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: sends every point to a single reducer under the constant key "1",
// stripping the first and last characters (the parentheses of an "(x,y)" line).
public class MyMapper extends Mapper<LongWritable, Text, Text, Text>
{
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException
    {
        String source = value.toString();
        context.write(new Text("1"), new Text(source.substring(1, source.length() - 1)));
    }
}

MyReducer.java:
package mine;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: receives every point under the single key "1" and runs
// k-means with k = 2 until the two centers stop moving.
public class MyReducer extends Reducer<Text, Text, Text, Text>
{
    // Euclidean distance between two 2-D points.
    public double countLength(DoubleWritable a[], DoubleWritable b[])
    {
        return Math.sqrt((b[1].get() - a[1].get()) * (b[1].get() - a[1].get())
                       + (b[0].get() - a[0].get()) * (b[0].get() - a[0].get()));
    }

    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException
    {
        int change = 1;
        List<DoubleWritable[]> class1 = new ArrayList<DoubleWritable[]>();
        List<DoubleWritable[]> class2 = new ArrayList<DoubleWritable[]>();
        // Parse every "x,y" value; all points start out in class1.
        for(Text val : values)
        {
            String s[] = val.toString().split(",");
            DoubleWritable miw[] = new DoubleWritable[2];
            miw[0] = new DoubleWritable(Integer.parseInt(s[0]));
            miw[1] = new DoubleWritable(Integer.parseInt(s[1]));
            class1.add(miw);
        }

        // Take the first two points as the initial centers.
        DoubleWritable label1[] = {class1.get(0)[0], class1.get(0)[1]};
        DoubleWritable label2[] = {class1.get(1)[0], class1.get(1)[1]};
        while(change == 1)
        {
            change = 0;
            // Re-assign every point to the nearer of the two centers.
            int c1 = class1.size();
            int c2 = class2.size();
            for(int j = 0; j < c1; j++)
            {
                DoubleWritable zan[] = class1.remove(0);
                if(countLength(zan, label1) <= countLength(zan, label2))
                    class1.add(zan);
                else
                    class2.add(zan);
            }
            for(int j = 0; j < c2; j++)
            {
                DoubleWritable zan[] = class2.remove(0);
                if(countLength(zan, label1) <= countLength(zan, label2))
                    class1.add(zan);
                else
                    class2.add(zan);
            }

            // Recompute each center as the mean of its cluster; if either
            // coordinate moved, flag another pass.
            double bit_a = 0.0D;
            double bit_b = 0.0D;
            for(DoubleWritable p[] : class1)
            {
                bit_a += p[0].get();
                bit_b += p[1].get();
            }
            if(!(new DoubleWritable(bit_a / class1.size())).equals(label1[0]))
            {
                label1[0] = new DoubleWritable(bit_a / class1.size());
                change = 1;
            }
            if(!(new DoubleWritable(bit_b / class1.size())).equals(label1[1]))
            {
                label1[1] = new DoubleWritable(bit_b / class1.size());
                change = 1;
            }
            bit_a = 0.0D;
            bit_b = 0.0D;
            for(DoubleWritable p[] : class2)
            {
                bit_a += p[0].get();
                bit_b += p[1].get();
            }
            if(!(new DoubleWritable(bit_a / class2.size())).equals(label2[0]))
            {
                label2[0] = new DoubleWritable(bit_a / class2.size());
                change = 1;
            }
            if(!(new DoubleWritable(bit_b / class2.size())).equals(label2[1]))
            {
                label2[1] = new DoubleWritable(bit_b / class2.size());
                change = 1;
            }
        }

        // Emit each final center followed by the points assigned to it.
        StringBuilder result = new StringBuilder();
        result.append("\n(").append(label1[0]).append(",").append(label1[1]).append(") ::\n");
        for(DoubleWritable val[] : class1)
            result.append("(").append(val[0]).append(",").append(val[1]).append(")\n");
        result.append("(").append(label2[0]).append(",").append(label2[1]).append(") ::\n");
        for(DoubleWritable val[] : class2)
            result.append("(").append(val[0]).append(",").append(val[1]).append(")\n");
        context.write(new Text("|"), new Text(result.toString()));
    }
}


 

3. Naive Bayes

·For each attribute value, compute the probability of each outcome of the target attribute; then apply Bayes' theorem to pick the most probable outcome. (Target attribute: e.g. the Rain attribute in the decision-tree data above.)

·Bayesian classification principle

      Given

an object X = {A0, A1, A2, ..., An}, where each Ak is an attribute of X, e.g. A0 = 3, A1 = 0, A2 = 'Y', ..., Ak = "bb",

      and classes C = {C0, C1, C2, ..., Cn}, by Bayes' theorem compute, for each class:

         P(X)*P(C0|X), P(X)*P(C1|X), P(X)*P(C2|X), ..., P(X)*P(Cn|X)

If the maximum is P(X)*P(Ck|X), then object X belongs to class Ck.

         Here P(X)*P(Ck|X) = P(X)*P(Ck&X)/P(X) = P(Ck&X) = P(Ck)*P(X|Ck), i.e.:

               P(X)*P(Ck|X) = P(Ck)*P(X|Ck); ........................................... Formula 3.1

 

·Example (the data here is the same as in the decision-tree section above):

   == Based on Formula 3.1, with the naive assumption that attributes are independent given the class ==

   From the data:

   Probability of rain:     R = P(Rain)    = 50%;
   Probability of no rain:  N = P(NotRain) = 50%;

   Then:

   Probability of daytime given rain:      P(Day|Rain)      = 60.0%;  likewise, P(Day|NotRain)      = 60.0%
   Probability of heavy cloud given rain:  P(LotCloud|Rain) = 100.0%; likewise, P(LotCloud|NotRain) = 20.0%

   Let:

   A = P(Day & LotCloud | Rain)    = 60% * 100% = 60%
   B = P(Day & LotCloud | NotRain) = 60% * 20%  = 12%

   Then:

   the score for "daytime, heavy cloud, rain"    is R*A = 50% * 60% = 30%;
   the score for "daytime, heavy cloud, no rain" is N*B = 50% * 12% = 6%.

   Since R*A > N*B, the object (Day, LotCloud) is assigned to the "Rain" class;

   had it been the other way round, it would be assigned to the "NotRain" class.
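A minimal sketch of this calculation (hypothetical class name; the counts are hard-coded from the table in section 1):

package mine;

public class NaiveBayesRain
{
    public static void main(String args[])
    {
        // Priors from the 10-row table: 5 rainy days, 5 dry days.
        double pRain = 5.0 / 10;
        double pNotRain = 5.0 / 10;
        // Conditional probabilities estimated from the same table.
        double pDayGivenRain = 3.0 / 5, pDayGivenNot = 3.0 / 5;
        double pLotGivenRain = 5.0 / 5, pLotGivenNot = 1.0 / 5;

        // Naive assumption: attributes are independent given the class.
        double scoreRain = pRain * pDayGivenRain * pLotGivenRain;  // 0.30
        double scoreNot = pNotRain * pDayGivenNot * pLotGivenNot;  // 0.06

        System.out.println(scoreRain > scoreNot ? "Rain" : "NotRain");
    }
}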
