[Study] Data Mining - Bayesian Classification (Example, Code)

Source: Internet  Editor: 程序博客网  Time: 2024/04/28 18:36

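What is Bayesian classification:

Start with a classic example, a test for disease A: 1 in 100 people test positive for A, 1 in 1000 people on Earth actually have A, and a person who has A tests positive with 90% probability. Given that a person has tested positive, what is the probability that they actually have A?
The answer is 1/1000 * 0.9 * 100 = 0.09, i.e. a 9% probability. Proofs are easy to find online, for example: http://www.cnblogs.com/leoo2sk/archive/2010/09/17/1829190.html
The relevant formula:
P(B|A) = P(A|B)P(B)/P(A)
Here A is "tests positive" and B is "has disease A", so P(B|A) = 0.9 * (1/1000) / (1/100) = 0.09.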
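
As a quick check of the arithmetic in the disease example, the formula can be evaluated in a few lines of Java. This is only a sketch: the class and method names (`BayesRuleExample`, `posterior`) are made up for illustration, and the three inputs are the numbers quoted in the example.

```java
// Sketch: Bayes' rule applied to the disease-test example.
class BayesRuleExample {
    /** Returns P(B|A) = P(A|B) * P(B) / P(A). */
    static double posterior(double pAGivenB, double pB, double pA) {
        return pAGivenB * pB / pA;
    }

    public static void main(String[] args) {
        double pPositive = 1.0 / 100;        // P(A): a person tests positive
        double pDisease = 1.0 / 1000;        // P(B): a person has disease A
        double pPositiveGivenDisease = 0.9;  // P(A|B): positive test given the disease
        double p = posterior(pPositiveGivenDisease, pDisease, pPositive);
        // The value is approximately 0.09, i.e. about 9%
        System.out.println("P(disease | positive) = " + p);
    }
}
```

The point the numbers make is that a positive result is still far more likely to come from a healthy person, simply because healthy people vastly outnumber sick ones.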

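Typical usage:
Use a large amount of data as a training set to predict some attribute of newly arriving data.
Principle:
For every candidate value n of the attribute to predict, compute P(n)*P(a|n)*P(b|n)*... from the known attribute values a, b, ...; the n with the largest product is the prediction.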
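
The principle can be sketched in memory before any Hadoop is involved. The snippet below is an illustration under assumptions, not the code used later in this post: the class name `NaiveBayesArgmax`, the labels `n1`/`n2` and all probabilities are made up, and each observed value is looked up by its name alone.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class NaiveBayesArgmax {
    /** Computes P(n) * P(a|n) * P(b|n) * ... for every n and returns the n with the largest product. */
    static String predict(Map<String, Double> prior,
                          Map<String, Map<String, Double>> likelihood,
                          String... observed) {
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Double> e : prior.entrySet()) {
            double score = e.getValue();
            Map<String, Double> cond =
                    likelihood.getOrDefault(e.getKey(), Collections.<String, Double>emptyMap());
            for (String value : observed) {
                // A value never seen together with this n contributes probability 0
                score *= cond.getOrDefault(value, 0.0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Made-up numbers, for illustration only. */
    static String demo() {
        Map<String, Double> prior = new LinkedHashMap<>();
        prior.put("n1", 0.5);
        prior.put("n2", 0.5);

        Map<String, Double> givenN1 = new HashMap<>();
        givenN1.put("a", 0.6);
        givenN1.put("b", 0.5);
        Map<String, Double> givenN2 = new HashMap<>();
        givenN2.put("a", 0.1);
        givenN2.put("b", 0.2);

        Map<String, Map<String, Double>> likelihood = new HashMap<>();
        likelihood.put("n1", givenN1);
        likelihood.put("n2", givenN2);

        // n1: 0.5 * 0.6 * 0.5 = 0.15, n2: 0.5 * 0.1 * 0.2 = 0.01
        return predict(prior, likelihood, "a", "b");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints n1
    }
}
```

Note that P(a) and P(b) never appear: they are the same for every n, so dropping them does not change which n wins.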

Since this is basic data mining, let's write an example to play with, and mark it here for later reference.

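Example:

Generate a table of hobby and income: women like shopping and have lower incomes, men like playing ball and have higher incomes.
    // Writes 20000 records of the form "<sex> <hobby> <income>" through FileUnit.write.
    // Values: 男 = male, 女 = female; 购物 = shopping, 打球 = playing ball,
    // 电影 = movies, 吃饭 = eating out.
    public static void genTestBayes(String path) {
        for (int i = 0; i < 20000; i++) {
            String data = "";
            if (Math.random() <= 0.5) {
                data = data + "男 ";
                // son: random draw that selects the hobby
                double son = Math.random();
                if (son > 0.95) {
                    data = data + "购物 ";   // 5%
                } else if (son <= 0.6) {
                    data = data + "打球 ";   // 60%
                } else if (son > 0.6 && son <= 0.85) {
                    data = data + "电影 ";   // 25%
                } else {
                    data = data + "吃饭 ";   // 10%
                }
                // daughter: random draw that selects the income
                double daughter = Math.random();
                if (daughter > 0.5) {
                    data = data + "3000";    // 50%
                } else if (daughter <= 0.2) {
                    data = data + "1000";    // 20%
                } else {
                    data = data + "2000";    // 30%
                }
            } else {
                data = data + "女 ";
                // son: random draw that selects the hobby
                double son = Math.random();
                if (son > 0.3) {
                    data = data + "购物 ";   // 70%
                } else if (son <= 0) {
                    data = data + "打球 ";   // practically never
                } else if (son > 0 && son <= 0.15) {
                    data = data + "电影 ";   // 15%
                } else {
                    data = data + "吃饭 ";   // 15%
                }
                // daughter: random draw that selects the income
                double daughter = Math.random();
                if (daughter > 0.8) {
                    data = data + "3000";    // 20%
                } else if (daughter <= 0.5) {
                    data = data + "1000";    // 50%
                } else {
                    data = data + "2000";    // 30%
                }
            }
            FileUnit.write(data, path);
        }
    }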
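
The `FileUnit` helper called by the generator is not shown in this post. To run the example, something like the following could stand in for it. This is a hypothetical sketch whose behaviour is only inferred from how the helper is called: `write` is assumed to append one line per call, and `readList` is assumed to return all lines of a file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the FileUnit helper, which the post does not show.
class FileUnit {
    /** Appends one line to the file at path, creating the file if it does not exist. */
    public static void write(String line, String path) {
        try {
            Files.write(Paths.get(path),
                    Collections.singletonList(line),
                    StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads all lines of the file at path. */
    public static List<String> readList(String path) {
        try {
            return Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String path = System.getProperty("java.io.tmpdir")
                + "/fileunit-demo-" + System.nanoTime() + ".txt";
        write("男 打球 3000", path);
        write("女 购物 1000", path);
        System.out.println(readList(path)); // the two records, in the order written
    }
}
```

Opening the file on every call is slow for 20000 records, but it keeps the stand-in compatible with the one-line-per-call usage in the generator.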

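Run Map Reduce over the generated data:
Map:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits a count of 1 for every field of a record.
// Field 0 (the sex) gets the key "0.-<sex>"; field i > 0 gets "<i>.<sex>-<value>".
public class BayesMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] param = value.toString().split(" ");
        Text text = new Text();
        for (int i = 0; i < param.length; i++) {
            text.set(i + "." + (i == 0 ? "" : param[0]) + "-" + param[i]);
            context.write(text, one);
        }
    }
}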
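
To see what the mapper emits without starting a Hadoop job, the key-building expression can be run on a single record. The class `BayesKeyDemo` and its method `keysFor` are names made up for this sketch; only the expression inside the loop is taken from `BayesMap`.

```java
import java.util.ArrayList;
import java.util.List;

class BayesKeyDemo {
    /** Builds the same keys that BayesMap writes for one input line. */
    static List<String> keysFor(String line) {
        String[] param = line.split(" ");
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < param.length; i++) {
            keys.add(i + "." + (i == 0 ? "" : param[0]) + "-" + param[i]);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(keysFor("男 打球 3000"));
        // prints [0.-男, 1.男-打球, 2.男-3000]
    }
}
```

The part before the "-" identifies the group a count belongs to (all records, or one attribute within one sex), which is what the probability step later divides by.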

Reduce:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts emitted for each key.
public class BayesReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

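Run the Map-Reduce job and compute the probabilities:
import java.io.IOException;
import java.util.HashMap;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Bayes {
    // genTestBayes(String path) from the example above also belongs to this class.

    // Turns the counts in part-r-00000 into probabilities: every count is divided by
    // the total of all keys that share the same prefix (the part before "-").
    public static void calculRate(String path) {
        List<String> array = FileUnit.readList(path + "/part-r-00000");
        HashMap<String, Integer> v = new HashMap<String, Integer>();   // key -> count
        HashMap<String, Integer> k = new HashMap<String, Integer>();   // prefix -> total
        // Sum the counts per prefix
        for (String line : array) {
            String[] keyValue = line.split("\\s");
            int count = Integer.parseInt(keyValue[1]);
            v.put(keyValue[0], count);
            String[] totalKey = keyValue[0].split("-");
            Integer total = k.get(totalKey[0]);
            if (total == null) {
                k.put(totalKey[0], count);
            } else {
                k.put(totalKey[0], count + total);
            }
        }
        // Second pass: build the probability map
        for (String key : v.keySet()) {
            String[] totalKey = key.split("-");
            double a = v.get(key);
            double b = k.get(totalKey[0]);
            double c = a / b;
            FileUnit.write(key + " " + c, path + "/part-my-00000");
        }
    }

    public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException {
        String path = "/home/jiangww/1";
        String path_temp = "/tmp/test";
        // Generator call, commented out
        //Bayes.genTestBayes(path);

        Configuration conf = new Configuration();
        conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
        Job job = new Job(conf, "Bayes");
        job.setJarByClass(Bayes.class);

        job.setMapperClass(BayesMap.class);
        job.setReducerClass(BayesReduce.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setNumReduceTasks(1);
        FileInputFormat.addInputPath(job, new Path(path));
        FileInputFormat.setMinInputSplitSize(job, 5 * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 100 * 1024 * 1024);
        FileOutputFormat.setOutputPath(job, new Path(path_temp));
        job.waitForCompletion(true);

        Bayes.calculRate(path_temp);
    }
}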
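
The normalisation done by `calculRate` is easier to follow on a handful of lines than on a real job output. The sketch below repeats the same two passes in memory; the class `RateDemo`, the method `toRates` and all counts are made up for illustration, and the input uses the tab-separated "key, count" layout that `TextOutputFormat` writes by default.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RateDemo {
    /** Divides each count by the total of all keys sharing its prefix (the text before "-"). */
    static Map<String, Double> toRates(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        Map<String, Integer> totals = new HashMap<>();
        // First pass: remember every count and add it to the total of its prefix
        for (String line : lines) {
            String[] kv = line.split("\\s");
            int count = Integer.parseInt(kv[1]);
            counts.put(kv[0], count);
            totals.merge(kv[0].split("-")[0], count, Integer::sum);
        }
        // Second pass: count / total of the prefix
        Map<String, Double> rates = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int total = totals.get(e.getKey().split("-")[0]);
            rates.put(e.getKey(), e.getValue() / (double) total);
        }
        return rates;
    }

    public static void main(String[] args) {
        // Made-up counts, not taken from a real run
        List<String> lines = Arrays.asList(
                "0.-男\t40", "0.-女\t60",
                "1.男-打球\t30", "1.男-电影\t10",
                "1.女-购物\t45", "1.女-吃饭\t15");
        for (Map.Entry<String, Double> e : toRates(lines).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

With these counts the keys starting with "0." become the priors (0.4 and 0.6), and each "1.男-..." or "1.女-..." key becomes a conditional probability within that sex (0.75 and 0.25 in both cases).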

That completes the work on the training set.

Next comes the prediction:
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BayesForecast {
    // P: prior probability of each sex, e.g. "男" -> 0.5
    static Map<String, Double> P = new HashMap<String, Double>();
    // Px: conditional probabilities, keyed like "1.男-打球"
    static Map<String, Double> Px = new HashMap<String, Double>();

    // Loads the "key probability" lines written by Bayes.calculRate.
    public static void loadMap(String path) {
        List<String> array = FileUnit.readList(path);
        for (String line : array) {
            String[] data = line.split(" ");
            if (data[0].contains(".-")) {
                // "0.-男" -> "男"
                P.put(data[0].replaceAll("\\d\\.-", ""), Double.parseDouble(data[1]));
            } else {
                Px.put(data[0], Double.parseDouble(data[1]));
            }
        }
    }

    // Prints the sex k that maximises P(k) * P(in[1]|k) * P(in[2]|k) * ...
    // in[0] is a placeholder for the unknown sex.
    public static void forecast(String input) {
        String[] in = input.split(" ");
        // Apply the formula
        double rate = 0;
        String sex = "";
        for (String k : P.keySet()) {
            double nRate = P.get(k);
            for (int i = 1; i < in.length; i++) {
                String a = i + "." + k + "-" + in[i];
                Double p = Px.get(a);
                // A combination never seen in training counts as probability 0
                nRate = nRate * (p == null ? 0 : p);
            }
            if (nRate > rate) {
                rate = nRate;
                sex = k;
            }
        }
        System.out.println(sex);
    }

    public static void main(String args[]) {
        String path = "/tmp/test/part-my-00000";
        BayesForecast.loadMap(path);
        BayesForecast.forecast("? 打球 3000");
        BayesForecast.forecast("? 购物 1000");
        BayesForecast.forecast("? 电影 2000");
        BayesForecast.forecast("? 吃饭 2000");
    }
}
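
One property of `forecast` worth knowing: a combination that never occurred in training (in the generated data, women practically never have 打球 as a hobby) is treated as probability 0, which zeroes the whole product for that sex. A common alternative, which this post does not use, is add-one (Laplace) smoothing on the raw counts. The sketch below only shows the estimate itself; the class name `SmoothingDemo` and the counts are made up for illustration.

```java
class SmoothingDemo {
    /**
     * Add-one (Laplace) estimate of P(value | class):
     * (count + 1) / (classTotal + number of distinct values of the attribute).
     */
    static double smoothed(int count, int classTotal, int distinctValues) {
        return (count + 1.0) / (classTotal + distinctValues);
    }

    public static void main(String[] args) {
        // Hypothetical counts: 0 of 10000 women play ball, and there are 4 hobbies
        double unsmoothed = 0 / 10000.0;
        double withSmoothing = smoothed(0, 10000, 4);
        System.out.println(unsmoothed);     // 0.0, which wipes out the whole product
        System.out.println(withSmoothing);  // small but non-zero, 1 / 10004
    }
}
```

Applying this would mean keeping the counts rather than only the ratios, so it would change the probability step as well as the lookup in `forecast`.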

Test output:
男女男女 (male, female, male, female)

Done, and the results match our expectations.
