Computing segmented-word weights with MapReduce

Source: Internet  Editor: 程序博客网  Time: 2024/05/17 21:38

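Requirement:

Compute the weight of every word in every post (here, every Weibo microblog).

Excerpt from the raw input file (one post per line: post ID, then the post text):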

3823890201582094   今天我约了豆浆，油条。约了电饭煲几小时后饭就自动煮好，还想约豆浆机，让我早晨多睡一小时，豆浆就自然好。起床就可以喝上香喷喷的豆浆了。
3823890210294392   今天我约了豆浆，油条
3823890235477306   一会儿带儿子去动物园约起～
3823890239358658   继续支持
3823890256464940   约起来！次饭去！
3823890264861035   我约了吃饭哦
3823890281649563   和家人一起相约吃个饭！
3823890285529671   今天约了广场一起滑旱冰
3823890294242412   九阳双预约豆浆机即将全球首发啦，我要约你一起吃早餐
3823890314914825   今天天气晴好，姐妹们约起，一起去逛街。
3823890323625419   全国包邮！九阳（Joyoung）JYL-
3823890335901756   今天是今年最暖和的一天，果断出来逛街！
3823890364788305   春天来了，约好友一起出去去踏青，去赏花！
3823890369489295   我在平湖，让你开挂练九阳真经，走火入魔毁了三叉神经了吧，改练九阴真经吧小子。   (免费下载 )
3823890373686361   约了小伙伴一起去理发！
3823890378201539   今天约了姐妹去逛街吃美食，周末玩得很开心啊！
3823890382081678   这几天一直在约，因为感冒发烧了，所以和老公约好了陪我去打针，求九阳安慰，我想喝豆浆，药好苦的
3823890399188850   和吃货的约会么就是吃
3823890419856548   全国包邮！九阳（Joyoung）JYK-
3823890436963972   我亲爱的

Excerpt from the result (post ID, followed by word:weight pairs):

3823890201582094   我:2.19722   香喷喷:5.5835   多:3.13549   睡:3.66356   豆浆:4.15888   早晨:3.55535   想约:4.66344   喝上:4.66344   就可以:4.56435   煮:4.56435   约:2.19722       电饭煲:5.02388   的:0   就:5.88888   好:5.27811   后:3.49651       自动:5.36129   今天:2.19722   油条:4.07754   饭:4.89035   豆浆机:1.38629   一小时:4.77068   几小时:5.87212   自然:5.87212   还:3.58352   让:3.21888   了:4.15888   起床:4.02535
3823890210294392   约:1.09861   我:1.09861   豆浆:1.38629       今天:2.19722       了:1.38629   油条:4.07754
3823890235477306   一会儿:6.97073   去:2.30259       儿子:4.26268   约:1.09861   动物园:5.5835   带:4.07754       起:3.82864
3823890239358658   继续:4.89035       支持:3.04452
3823890256464940   次:5.87212   起来:2.83321       约:1.09861       饭:4.89035   去:2.30259
3823890264861035   约:1.09861   我:1.09861       了:1.38629   吃饭:3.97029       哦:2.89037
3823890281649563   和家人:4.89035       一起:2.30259       吃个:5.36129   相约:3.68888   饭:4.89035
3823890285529671   了:1.38629   今天:2.19722       广场:5.5835   滑旱冰:6.97073       一起:2.30259   约:1.09861
3823890294242412   九阳:0   我:1.09861   全球:5.5835   早餐:2.56495   双:2.19722   你:2.56495   一起:2.30259       首发:2.56495   预约:2.07944   啦:2.70805   即将:5.02388       吃:2.70805   要约:4.39445   豆浆机:1.38629
3823890314914825   一起:2.30259   约:1.09861       去:2.30259   逛街:4.33073   姐妹:5.17615       今天:2.19722   天气晴好:6.27664   起:3.82864   们:3.13549
3823890323625419   邮:4.89035       全国:5.36129   jyl-:6.97073   包:4.07754       joyoung:4.12713   九阳:0
3823890335901756   的:0   今年:5.5835   暖和:5.5835       果断:6.97073   出来:5.02388   逛街:4.33073       一天:4.18965   最:3.49651   今天是:4.66344
3823890364788305   出:4.89035   来了:3.3673   去去:6.27664   去:2.30259       好友:5.36129       赏花:5.17615   踏青:4.18965   约:1.09861   一起:2.30259   春天:2.89037
3823890369489295   让:3.21888   练:11.74424   下载:5.87212       九阳:0   吧:2.63906   我:1.09861   九阴真经:6.27664   免费:5.5835   挂:6.97073   了吧:6.97073   平湖:6.97073   走火入魔:5.02388   真经:4.77068   小子:6.97073   开:4.56435   你:2.56495       三叉神经:6.97073   在:2.70805   毁了:6.97073   改:6.97073
3823890373686361   一起:2.30259       约:1.09861   理发:5.5835   了:1.38629       小伙伴:4.02535   去:2.30259
3823890378201539   吃:2.70805   得很:5.87212   啊:3.4012   今天:2.19722   姐妹:5.17615   开心:3.68888   去:2.30259       玩:3.52636   周末:3.46574       逛街:4.33073   了:1.38629   约:1.09861   美食:3.78419
3823890382081678   老公:3.4012   所以:4.12713   好了:4.12713   我:1.09861   约:2.19722   感冒:6.27664   陪我:5.87212   豆浆:1.38629   发:5.36129       去:2.30259   和:1.79176   打针:6.97073   因为:4.47734   安慰:5.02388   一直在:5.5835   好苦:6.27664   这几:6.27664       想喝:5.36129   九阳:0   求:3.66356   烧了:6.27664   药:6.27664   的:0   天:4.18965
3823890399188850   就是:3.04452   么:5.17615   约会:2.99573   的:0       货:4.77068       吃:5.4161   和:1.79176
3823890419856548   全国:5.36129       九阳:0   joyoung:4.12713       jyk-:6.97073   包:4.07754   邮:4.89035
3823890436963972   亲爱的:4.18965       我:1.09861

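Approach:

Formula: TF * ln(N / DF)

TF: the number of times the word occurs in the current post

N: the total number of posts

DF: the number of posts in which the word occurs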
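
To make the formula concrete, here is a minimal sketch of the weight computation in plain Java. The class name `TfIdfSketch` and the numbers in `main` are illustrative assumptions, not values from the data set.

```java
// Minimal sketch of the weight formula TF * ln(N / DF).
final class TfIdfSketch {

    // tf: occurrences of the word in one post
    // n:  total number of posts
    // df: number of posts that contain the word
    static double weight(int tf, int n, int df) {
        // Cast before dividing: int / int would truncate the ratio.
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // Hypothetical numbers, for illustration only.
        System.out.println(weight(2, 1000, 10));   // 2 * ln(100)
        System.out.println(weight(1, 1000, 1000)); // word present in every post
    }
}
```

A word that occurs in every post gets weight 0, because ln(N / N) = ln(1) = 0. This is why very common words carry little weight.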

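Three MapReduce jobs:
   First MR: term frequency (TF) of each word in each post, plus the total number of posts (N)
   mapper:
       今天_3823890201582094:1 ------ emitted once per occurrence of a word
       count:1 ---- emitted once per post
   combiner:
       sums the values
   partitioner: 4 reducers
       count ---- always goes to reducer 3
       every other key ---- hash(key) % 3 = 0, 1, 2

   Reduce:
       sums the values

   Output:
   count:988873
   今天_3823890201582094:2
   豆浆_3823890201582094:4

   Second MR: document frequency (DF), i.e. the number of posts in which each word occurs
   mapper (reads the output of the first MR):
       count:988873 ---- skipped (the mapper ignores the count file part-r-00003)
       今天:1
       豆浆:1

   combiner:
       sums the values
   Reduce:
       今天:245
       豆浆:4567

   Third MR: takes the output of the first and second MRs as input and computes TF * ln(N / DF) for each word in each post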
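
The three stages can be sketched in memory, without Hadoop, to show how the data flows from one stage to the next. This is only an illustration under stated assumptions: the class `PipelineSketch` is hypothetical, and the posts are assumed to be already segmented into words.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory sketch of the three stages (no Hadoop involved).
final class PipelineSketch {

    // docs: post ID -> already-segmented words. Returns "word_id" -> weight.
    static Map<String, Double> weights(Map<String, List<String>> docs) {
        Map<String, Integer> tf = new HashMap<>(); // stage 1: "word_id" -> TF
        Map<String, Integer> df = new HashMap<>(); // stage 2: word -> DF
        int n = docs.size();                       // stage 1: the "count" record

        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            Set<String> seen = new HashSet<>();
            for (String word : doc.getValue()) {
                tf.merge(word + "_" + doc.getKey(), 1, Integer::sum);
                if (seen.add(word)) {              // count each post once per word
                    df.merge(word, 1, Integer::sum);
                }
            }
        }

        // Stage 3: combine TF, DF and N into the final weight.
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            String key = e.getKey();
            String word = key.substring(0, key.lastIndexOf('_'));
            result.put(key, e.getValue() * Math.log((double) n / df.get(word)));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("1", Arrays.asList("豆浆", "油条", "豆浆"));
        docs.put("2", Arrays.asList("油条"));
        System.out.println(weights(docs));
    }
}
```

In the real jobs each stage is a separate MapReduce pass because neither DF nor N is known until all posts have been scanned.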

Code:

JAR dependencies: hadoop-2.5.1 and IKAnalyzer2012_FF.jar

Input and output paths:

public class Paths {
   public static final String HAD_MR_INPUT = "/had/mr/input";
   public static final String HAD_OUTPUT1 = "/had/output1";
   public static final String HAD_OUTPUT2 = "/had/output2";
   public static final String HAD_OUTPUT3 = "/had/output3";
}

First MapReduce job:

import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class FirstMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
   @Override
   protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
       String[] line = value.toString().split("\t");
       if (line.length >= 2) {
           String id = line[0].trim();
           String content = line[1].trim();
           StringReader sr = new StringReader(content);
           IKSegmenter iks = new IKSegmenter(sr, true);
           Lexeme lexeme = null;
           while ((lexeme = iks.next()) != null) {
               String word = lexeme.getLexemeText();
               context.write(new Text(word + "_" + id), new IntWritable(1));
           }
           sr.close();
           context.write(new Text("count"), new IntWritable(1));
       } else {
           System.err.println("error:" + value.toString() + "------------------------");
       }
   }
}
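
The line-parsing step of the mapper can be tried on its own. The sketch below is a hypothetical helper (`LineParseSketch` is not part of the job). It makes one assumption beyond the original: it splits on the first tab only, so a tab inside the post text would stay part of the content.

```java
// Stand-alone sketch of the "id<TAB>content" parsing done in the mapper.
final class LineParseSketch {

    // Returns {id, content}, or null for a malformed line.
    static String[] parse(String line) {
        // Limit 2: only the first tab separates the ID from the text.
        String[] parts = line.split("\t", 2);
        if (parts.length < 2 || parts[0].trim().isEmpty()) {
            return null;
        }
        return new String[] { parts[0].trim(), parts[1].trim() };
    }

    public static void main(String[] args) {
        String[] p = parse("3823890210294392\t今天我约了豆浆，油条");
        System.out.println(p[0] + " -> " + p[1]);
    }
}
```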

--------------------------------------------------------------------------------------------------------------------------------------------------------

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FirstReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
   @Override
   protected void reduce(Text text, Iterable<IntWritable> iterable, Context context) throws IOException, InterruptedException {
       int sum = 0;
       for (IntWritable intWritable : iterable) {
           sum += intWritable.get();
       }
       // Compare as strings: Text.equals(String) is always false.
       if ("count".equals(text.toString())) {
           System.out.println(text.toString() + "==" + sum);
       }
       context.write(text, new IntWritable(sum));
   }
}

----------------------------------------------------------------------------------------------------------------------------------------------------------------

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class FirstPartition extends HashPartitioner<Text, IntWritable> {

   @Override
   public int getPartition(Text key, IntWritable value, int numReduceTasks) {
       if (key.equals(new Text("count"))) {
           return 3;
       } else {
           return super.getPartition(key, value, numReduceTasks - 1);
       }
   }

}
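
The routing rule can be checked without a cluster. The sketch below uses a plain `String` key instead of `Text`, so the actual partition numbers may differ from a real run; that substitution and the class name `PartitionSketch` are assumptions. The hash expression mirrors the one used by Hadoop's `HashPartitioner`. Unlike the class above, the sketch derives the partition for "count" from `numReduceTasks` rather than hard-coding 3, and it guards the single-reducer case.

```java
// Stand-alone sketch of the partitioning rule: "count" gets the last
// partition to itself, every other key is hashed over the remaining ones.
final class PartitionSketch {

    static int partition(String key, int numReduceTasks) {
        if (numReduceTasks < 2) {
            return 0; // a single reducer receives everything
        }
        if ("count".equals(key)) {
            return numReduceTasks - 1;
        }
        // Mask the sign bit so the result of % is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }

    public static void main(String[] args) {
        System.out.println(partition("count", 4));
        System.out.println(partition("今天_3823890201582094", 4));
    }
}
```

With 4 reducers, "count" lands in partition 3, which is why the later jobs look for the file part-r-00003.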
-------------------------------------------------------------------------------------------------------------------------------------------

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstJob {
   public static void main(String[] args) {
       Configuration conf = new Configuration();
       conf.set("yarn.resourcemanager.hostname", "hadoop010");
       try {
           Job job = Job.getInstance(conf, "weibo1");
           job.setJarByClass(FirstJob.class);
           // Output key/value types (the map output uses the same types, since no separate map output types are set)
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);
           // Four reducers: FirstPartition sends "count" to reducer 3 and every other key to reducers 0-2
           job.setNumReduceTasks(4);
           job.setPartitionerClass(FirstPartition.class);
           job.setMapperClass(FirstMapper.class);
           job.setCombinerClass(FirstReducer.class);
           job.setReducerClass(FirstReducer.class);
           // Input and output directories
           FileInputFormat.addInputPath(job, new Path(Paths.HAD_MR_INPUT));
           FileOutputFormat.setOutputPath(job, new Path(Paths.HAD_OUTPUT1));
           if (job.waitForCompletion(true)) {
               System.out.println("FirstJob finished!");
               TwoJob.mainJob();
           }
       } catch (Exception e) {
           e.printStackTrace();
       }
   }
}

---------------------------------------------------------------------------------------------------------------------------------------------------------

Second MR:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TwoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

   protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

       FileSplit fs = (FileSplit) context.getInputSplit();

       if (!fs.getPath().getName().contains("part-r-00003")) {

           String[] line = value.toString().trim().split("\t");
           if (line.length >= 2) {
               String[] ss = line[0].split("_");
               if (ss.length >= 2) {
                   String w = ss[0];
                   context.write(new Text(w), new IntWritable(1));
               }
           } else {
               System.out.println("error:" + value.toString() + "-------------");
           }
       }
   }
}
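
One detail worth a sketch is how the `word_id` key is taken apart. `split("_")` works as long as no word contains an underscore. If the segmenter ever emitted such a token, splitting at the last underscore would be more robust. The helper below is hypothetical and assumes post IDs never contain an underscore, which holds for the numeric IDs in the sample.

```java
// Sketch: split a "word_id" key at the LAST underscore, so a word that
// itself contains '_' is kept intact.
final class KeySplitSketch {

    // Returns {word, id}, or null if the key has no usable underscore.
    static String[] splitKey(String key) {
        int pos = key.lastIndexOf('_');
        if (pos <= 0 || pos == key.length() - 1) {
            return null;
        }
        return new String[] { key.substring(0, pos), key.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] parts = splitKey("今天_3823890201582094");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```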

----------------------------------------------------------------------------------------------------------------

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TwoReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

   protected void reduce(Text key, Iterable<IntWritable> arg1, Context context) throws IOException, InterruptedException {

       int sum = 0;
       for (IntWritable i : arg1) {
           sum = sum + i.get();
       }

       context.write(key, new IntWritable(sum));
   }

}

--------------------------------------------------------------------------------------------------------------

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoJob {

   public static void mainJob() {
       Configuration config = new Configuration();
       config.set("yarn.resourcemanager.hostname", "hadoop010");
       try {
           Job job = Job.getInstance(config, "weibo2");
           job.setJarByClass(TwoJob.class);
           // Output key/value types (also used for the map output)
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);
           // Mapper, combiner and reducer classes
           job.setMapperClass(TwoMapper.class);
           job.setCombinerClass(TwoReducer.class);
           job.setReducerClass(TwoReducer.class);
           // HDFS directories the job reads its input from and writes its output to
           FileInputFormat.addInputPath(job, new Path(Paths.HAD_OUTPUT1));
           FileOutputFormat.setOutputPath(job, new Path(Paths.HAD_OUTPUT2));
           if (job.waitForCompletion(true)) {
               System.out.println("TwoJob finished!");
               LastJob.mainJob();
           }
       } catch (Exception e) {
           e.printStackTrace();
       }
   }
}

----------------------------------------------------------------------------------------------------------------------

Third MR:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.text.NumberFormat;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class LastMapper extends Mapper<LongWritable, Text, Text, Text> {
   public static Map<String, Integer> cmap = null;
   public static Map<String, Integer> df = null;

   // Runs once before the first map() call: loads N (the count) and the DF table from the cached files
   protected void setup(Context context) throws IOException, InterruptedException {
       if (cmap == null || cmap.size() == 0 || df == null || df.size() == 0) {

           URI[] ss = context.getCacheFiles();
           if (ss != null) {
               for (int i = 0; i < ss.length; i++) {
                   URI uri = ss[i];
                   if (uri.getPath().endsWith("part-r-00003")) {
                       Path path = new Path(uri.getPath());
                       BufferedReader br = new BufferedReader(new FileReader(path.getName()));
                       String line = br.readLine();
                       if (line.startsWith("count")) {
                           String[] ls = line.split("\t");
                           cmap = new HashMap<String, Integer>();
                           cmap.put(ls[0], Integer.parseInt(ls[1].trim()));
                       }
                       br.close();
                   } else {
                       df = new HashMap<String, Integer>();
                       Path path = new Path(uri.getPath());
                       BufferedReader br = new BufferedReader(new FileReader(path.getName()));
                       String line;
                       while ((line = br.readLine()) != null) {
                           String[] ls = line.split("\t");
                           df.put(ls[0], Integer.parseInt(ls[1].trim()));
                       }
                       br.close();
                   }
               }
           }
       }
   }

   protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
       FileSplit fs = (FileSplit) context.getInputSplit();

       if (!fs.getPath().getName().contains("part-r-00003")) {

           String[] v = value.toString().trim().split("\t");
           if (v.length >= 2) {
               int tf = Integer.parseInt(v[1].trim());
               String[] ss = v[0].split("_");
               if (ss.length >= 2) {
                   String w = ss[0];
                   String id = ss[1];

                   // Cast to double: int / int would truncate N / DF before the log is taken.
                   double s = tf * Math.log((double) cmap.get("count") / df.get(w));
                   NumberFormat nf = NumberFormat.getInstance();
                   nf.setMaximumFractionDigits(5);
                   context.write(new Text(id), new Text(w + ":" + nf.format(s)));
               }
           } else {
               System.out.println(value.toString() + "-------------");
           }
       }
   }
}
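
The formatting step can be isolated as well. `NumberFormat.getInstance()` uses the JVM's default locale and groups digits, so on some machines a weight could be printed with a comma. The sketch below is a hypothetical variant (not the author's code) that pins the locale to `Locale.US` and turns grouping off so the output stays easy to parse.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Sketch: format a weight with at most 5 fraction digits, independent of
// the machine's default locale and without grouping separators.
final class WeightFormatSketch {

    static String format(double weight) {
        NumberFormat nf = NumberFormat.getInstance(Locale.US);
        nf.setMaximumFractionDigits(5);
        nf.setGroupingUsed(false);
        return nf.format(weight);
    }

    public static void main(String[] args) {
        System.out.println(format(Math.log(9)));
        System.out.println(format(12345.678));
    }
}
```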

-----------------------------------------------------------------------------

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LastReduce extends Reducer<Text, Text, Text, Text> {

   protected void reduce(Text key, Iterable<Text> arg1, Context context) throws IOException, InterruptedException {

       StringBuffer sb = new StringBuffer();

       for (Text i : arg1) {
           sb.append(i.toString() + "\t");
       }

       context.write(key, new Text(sb.toString()));
   }

}
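
The reducer appends a tab after every value, which leaves a trailing tab on each output line. As a possible alternative, `String.join` puts the separator only between elements. The sketch below uses plain strings instead of `Text`; the class name `JoinSketch` is an assumption.

```java
import java.util.Arrays;

// Sketch: join "word:weight" pairs with tabs, without a trailing separator.
final class JoinSketch {

    static String join(Iterable<String> pairs) {
        return String.join("\t", pairs);
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("继续:4.89035", "支持:3.04452")));
    }
}
```

In the real reducer the values arrive as `Iterable<Text>`, so each one would first be converted with `toString()`.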

----------------------------------------------------------------------------------

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LastJob {

   public static void mainJob() {
       Configuration config = new Configuration();
       config.set("yarn.resourcemanager.hostname", "hadoop010");
       try {
           Job job = Job.getInstance(config, "weibo3");
           job.setJarByClass(LastJob.class);
           // Ship the count file (N) and the DF file to every map task through the distributed cache
           job.addCacheFile(new Path(Paths.HAD_OUTPUT1 + "/part-r-00003").toUri());
           job.addCacheFile(new Path(Paths.HAD_OUTPUT2 + "/part-r-00000").toUri());
           // Output key/value types (also used for the map output)
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(Text.class);
           job.setMapperClass(LastMapper.class);
           job.setCombinerClass(LastReduce.class);
           job.setReducerClass(LastReduce.class);
           // HDFS directories the job reads its input from and writes its output to
           FileInputFormat.addInputPath(job, new Path(Paths.HAD_OUTPUT1));
           FileOutputFormat.setOutputPath(job, new Path(Paths.HAD_OUTPUT3));
           if (job.waitForCompletion(true)) {
               System.out.println("LastJob finished!");
               System.out.println("All jobs finished!");
           }
       } catch (Exception e) {
           e.printStackTrace();
       }
   }
}

0 0