Computing segmented-word weights with MapReduce
Requirement:
Compute the weight of each word in each post.
Sample of the raw input (each line: post id, a tab, then the post text):
3823890201582094 今天我约了豆浆,油条。约了电饭煲几小时后饭就自动煮好,还想约豆浆机,让我早晨多睡一小时,豆浆就自然好。起床就可以喝上香喷喷的豆浆了。
3823890210294392 今天我约了豆浆,油条
3823890235477306 一会儿带儿子去动物园约起~
3823890239358658 继续支持
3823890256464940 约起来!次饭去!
3823890264861035 我约了吃饭哦
3823890281649563 和家人一起相约吃个饭!
3823890285529671 今天约了广场一起滑旱冰
3823890294242412 九阳双预约豆浆机即将全球首发啦,我要约你一起吃早餐
3823890314914825 今天天气晴好,姐妹们约起,一起去逛街。
3823890323625419 全国包邮!九阳(Joyoung)JYL-
3823890335901756 今天是今年最暖和的一天,果断出来逛街!
3823890364788305 春天来了,约好友一起出去去踏青,去赏花!
3823890369489295 我在平湖,让你开挂练九阳真经,走火入魔毁了三叉神经了吧,改练九阴真经吧小子。 (免费下载 )
3823890373686361 约了小伙伴一起去理发!
3823890378201539 今天约了姐妹去逛街吃美食,周末玩得很开心啊!
3823890382081678 这几天一直在约,因为感冒发烧了,所以和老公约好了陪我去打针,求九阳安慰,我想喝豆浆,药好苦的
3823890399188850 和吃货的约会么就是吃
3823890419856548 全国包邮!九阳(Joyoung)JYK-
3823890436963972 我亲爱的
Sample of the result (each line: post id, then word:weight pairs):
3823890201582094 我:2.19722 香喷喷:5.5835 多:3.13549 睡:3.66356 豆浆:4.15888 早晨:3.55535 想约:4.66344 喝上:4.66344 就可以:4.56435 煮:4.56435 约:2.19722 电饭煲:5.02388 的:0 就:5.88888 好:5.27811 后:3.49651 自动:5.36129 今天:2.19722 油条:4.07754 饭:4.89035 豆浆机:1.38629 一小时:4.77068 几小时:5.87212 自然:5.87212 还:3.58352 让:3.21888 了:4.15888 起床:4.02535
3823890210294392 约:1.09861 我:1.09861 豆浆:1.38629 今天:2.19722 了:1.38629 油条:4.07754
3823890235477306 一会儿:6.97073 去:2.30259 儿子:4.26268 约:1.09861 动物园:5.5835 带:4.07754 起:3.82864
3823890239358658 继续:4.89035 支持:3.04452
3823890256464940 次:5.87212 起来:2.83321 约:1.09861 饭:4.89035 去:2.30259
3823890264861035 约:1.09861 我:1.09861 了:1.38629 吃饭:3.97029 哦:2.89037
3823890281649563 和家人:4.89035 一起:2.30259 吃个:5.36129 相约:3.68888 饭:4.89035
3823890285529671 了:1.38629 今天:2.19722 广场:5.5835 滑旱冰:6.97073 一起:2.30259 约:1.09861
3823890294242412 九阳:0 我:1.09861 全球:5.5835 早餐:2.56495 双:2.19722 你:2.56495 一起:2.30259 首发:2.56495 预约:2.07944 啦:2.70805 即将:5.02388 吃:2.70805 要约:4.39445 豆浆机:1.38629
3823890314914825 一起:2.30259 约:1.09861 去:2.30259 逛街:4.33073 姐妹:5.17615 今天:2.19722 天气晴好:6.27664 起:3.82864 们:3.13549
3823890323625419 邮:4.89035 全国:5.36129 jyl-:6.97073 包:4.07754 joyoung:4.12713 九阳:0
3823890335901756 的:0 今年:5.5835 暖和:5.5835 果断:6.97073 出来:5.02388 逛街:4.33073 一天:4.18965 最:3.49651 今天是:4.66344
3823890364788305 出:4.89035 来了:3.3673 去去:6.27664 去:2.30259 好友:5.36129 赏花:5.17615 踏青:4.18965 约:1.09861 一起:2.30259 春天:2.89037
3823890369489295 让:3.21888 练:11.74424 下载:5.87212 九阳:0 吧:2.63906 我:1.09861 九阴真经:6.27664 免费:5.5835 挂:6.97073 了吧:6.97073 平湖:6.97073 走火入魔:5.02388 真经:4.77068 小子:6.97073 开:4.56435 你:2.56495 三叉神经:6.97073 在:2.70805 毁了:6.97073 改:6.97073
3823890373686361 一起:2.30259 约:1.09861 理发:5.5835 了:1.38629 小伙伴:4.02535 去:2.30259
3823890378201539 吃:2.70805 得很:5.87212 啊:3.4012 今天:2.19722 姐妹:5.17615 开心:3.68888 去:2.30259 玩:3.52636 周末:3.46574 逛街:4.33073 了:1.38629 约:1.09861 美食:3.78419
3823890382081678 老公:3.4012 所以:4.12713 好了:4.12713 我:1.09861 约:2.19722 感冒:6.27664 陪我:5.87212 豆浆:1.38629 发:5.36129 去:2.30259 和:1.79176 打针:6.97073 因为:4.47734 安慰:5.02388 一直在:5.5835 好苦:6.27664 这几:6.27664 想喝:5.36129 九阳:0 求:3.66356 烧了:6.27664 药:6.27664 的:0 天:4.18965
3823890399188850 就是:3.04452 么:5.17615 约会:2.99573 的:0 货:4.77068 吃:5.4161 和:1.79176
3823890419856548 全国:5.36129 九阳:0 joyoung:4.12713 jyk-:6.97073 包:4.07754 邮:4.89035
3823890436963972 亲爱的:4.18965 我:1.09861
Approach:
Formula: weight = TF × ln(N / DF)
TF: the number of times the word occurs in the current post
N: the total number of posts
DF: the number of posts the word occurs in
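A quick sanity check with made-up numbers: if a word occurs TF = 2 times in a post, the corpus has N = 1000 posts, and the word appears in DF = 50 of them, the weight is 2 × ln(1000/50) = 2 × ln 20 ≈ 5.99. The same computation in Java (all values are hypothetical, for illustration only):
public class WeightDemo {
    public static void main(String[] args) {
        // Hypothetical values, for illustration only
        double tf = 2, n = 1000, df = 50;
        // Math.log is the natural log: 2 * ln(20) ≈ 5.99146
        double weight = tf * Math.log(n / df);
        System.out.println(weight); // prints 5.991464547107982
    }
}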
Three MapReduce jobs:
First MR: compute the TF of each word per post, plus the total post count N.
mapper:
今天_3823890201582094:1 ------ emitted once per word occurrence
count:1 ---- emitted once per post
combiner:
sums the counts
partitioner: 4 reducers
the special key count goes to partition 3
all other keys are hashed into partitions 0, 1, 2 (hash % 3)
reducer:
sums the counts
count:988873
今天_3823890201582094:2
豆浆_3823890201582094:4
Second MR: for each word, count how many posts it appears in (the DF).
mapper (skips part-r-00003, which holds only the record count:988873):
今天:1
豆浆:1
combiner:
sums the counts
reducer:
今天:245
豆浆:4567
Third MR: takes the outputs of the first and second MR as input. The TF records stream through the mapper, while the post count N (part-r-00003) and the DF table are loaded from the distributed cache in setup(), so each record can be scored as TF × ln(N/DF).
Code:
Dependencies: the hadoop-2.5.1 jars and IKAnalyzer2012_FF.jar
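The mappers below rely on just two IKAnalyzer calls: the IKSegmenter constructor and next(). A minimal standalone sketch of that API, handy for checking the segmentation before running the jobs (the sample sentence is taken from the input above; true enables smart, coarse-grained mode):
import java.io.IOException;
import java.io.StringReader;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKDemo {
    public static void main(String[] args) throws IOException {
        StringReader sr = new StringReader("今天我约了豆浆,油条");
        IKSegmenter iks = new IKSegmenter(sr, true); // true = smart (coarse-grained) mode
        Lexeme lexeme;
        // next() returns one Lexeme per segmented word, null at end of input
        while ((lexeme = iks.next()) != null) {
            System.out.println(lexeme.getLexemeText());
        }
        sr.close();
    }
}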
Define the input/output paths:
public class Paths {
    public static final String HAD_MR_INPUT = "/had/mr/input";
    public static final String HAD_OUTPUT1 = "/had/output1";
    public static final String HAD_OUTPUT2 = "/had/output2";
    public static final String HAD_OUTPUT3 = "/had/output3";
}
First MapReduce job:
import java.io.IOException;
import java.io.StringReader;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class FirstMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line is "<post id>\t<post text>"
        String[] line = value.toString().split("\t");
        if (line.length >= 2) {
            String id = line[0].trim();
            String content = line[1].trim();
            // Segment the post text with IKAnalyzer (smart mode)
            StringReader sr = new StringReader(content);
            IKSegmenter iks = new IKSegmenter(sr, true);
            Lexeme lexeme = null;
            while ((lexeme = iks.next()) != null) {
                String word = lexeme.getLexemeText();
                // One record per word occurrence: "<word>_<post id>" -> 1
                context.write(new Text(word + "_" + id), new IntWritable(1));
            }
            sr.close();
            // One record per post, used to compute the total post count N
            context.write(new Text("count"), new IntWritable(1));
        } else {
            System.err.println("error:" + value.toString() + "------------------------");
        }
    }
}
--------------------------------------------------------------------------------------------------------------------------------------------------------
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FirstReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text text, Iterable<IntWritable> iterable, Context context) throws IOException, InterruptedException {
        // Sum the 1s: for "<word>_<post id>" keys this yields the TF,
        // for the "count" key it yields the total post count N
        int sum = 0;
        for (IntWritable intWritable : iterable) {
            sum += intWritable.get();
        }
        // Compare against the String form: Text.equals(String) is always false
        if (text.toString().equals("count")) {
            System.out.println(text.toString() + "==" + sum);
        }
        context.write(text, new IntWritable(sum));
    }
}
----------------------------------------------------------------------------------------------------------------------------------------------------------------
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class FirstPartition extends HashPartitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Route the special "count" key to the last reducer (partition 3),
        // and hash every other key across the remaining reducers 0..2.
        // Note: this assumes the job runs with exactly 4 reduce tasks.
        if (key.equals(new Text("count"))) {
            return 3;
        } else {
            return super.getPartition(key, value, numReduceTasks - 1);
        }
    }
}
-------------------------------------------------------------------------------------------------------------------------------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstJob {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("yarn.resourcemanager.hostname", "hadoop010");
        try {
            Job job = Job.getInstance(conf, "weibo1");
            job.setJarByClass(FirstJob.class);
            // Output key/value types (these also cover the map output here)
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Four reducers: 0-2 for word keys, 3 for the "count" key
            job.setNumReduceTasks(4);
            job.setPartitionerClass(FirstPartition.class);
            job.setMapperClass(FirstMapper.class);
            // Summation is associative, so the reducer doubles as the combiner
            job.setCombinerClass(FirstReducer.class);
            job.setReducerClass(FirstReducer.class);
            // Input and output directories on HDFS
            FileInputFormat.addInputPath(job, new Path(Paths.HAD_MR_INPUT));
            FileOutputFormat.setOutputPath(job, new Path(Paths.HAD_OUTPUT1));
            if (job.waitForCompletion(true)) {
                System.out.println("FirstJob finished!");
                TwoJob.mainJob();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
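Since FirstJob.main() chains TwoJob and LastJob, only the first job needs to be submitted. A sketch of the launch, assuming the classes above are packaged into a jar (the jar name weibo-tfidf.jar is hypothetical; the classes sit in the default package):
hadoop jar weibo-tfidf.jar FirstJob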
---------------------------------------------------------------------------------------------------------------------------------------------------------
Second MapReduce job:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TwoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Skip part-r-00003, which holds only the "count" record
        FileSplit fs = (FileSplit) context.getInputSplit();
        if (!fs.getPath().getName().contains("part-r-00003")) {
            // Input line: "<word>_<post id>\t<tf>"; emit "<word>" -> 1.
            // There is one such line per post the word occurs in, so
            // summing these 1s yields the word's DF.
            String[] line = value.toString().trim().split("\t");
            if (line.length >= 2) {
                String[] ss = line[0].split("_");
                if (ss.length >= 2) {
                    String w = ss[0];
                    context.write(new Text(w), new IntWritable(1));
                }
            } else {
                System.out.println("error:" + value.toString() + "-------------");
            }
        }
    }
}
----------------------------------------------------------------------------------------------------------------
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TwoReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> arg1, Context context) throws IOException, InterruptedException {
        // Sum the 1s per word: the result is the document frequency (DF)
        int sum = 0;
        for (IntWritable i : arg1) {
            sum = sum + i.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
--------------------------------------------------------------------------------------------------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoJob {
    public static void mainJob() {
        Configuration config = new Configuration();
        config.set("yarn.resourcemanager.hostname", "hadoop010");
        try {
            Job job = Job.getInstance(config, "weibo2");
            job.setJarByClass(TwoJob.class);
            // Output key/value types for the map and reduce tasks
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Mapper, combiner, reducer
            job.setMapperClass(TwoMapper.class);
            job.setCombinerClass(TwoReducer.class);
            job.setReducerClass(TwoReducer.class);
            // Read the first job's output from HDFS
            FileInputFormat.addInputPath(job, new Path(Paths.HAD_OUTPUT1));
            FileOutputFormat.setOutputPath(job, new Path(Paths.HAD_OUTPUT2));
            if (job.waitForCompletion(true)) {
                System.out.println("TwoJob finished!");
                LastJob.mainJob();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
----------------------------------------------------------------------------------------------------------------------
Third MapReduce job:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.text.NumberFormat;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class LastMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Total post count N (under key "count") and per-word document frequencies
    public static Map<String, Integer> cmap = null;
    public static Map<String, Integer> df = null;

    // Runs once per task before the first map() call: load the two cache
    // files, which YARN localizes into the task's working directory
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        if (cmap == null || cmap.size() == 0 || df == null || df.size() == 0) {
            URI[] ss = context.getCacheFiles();
            if (ss != null) {
                for (int i = 0; i < ss.length; i++) {
                    URI uri = ss[i];
                    if (uri.getPath().endsWith("part-r-00003")) {
                        // part-r-00003 holds the single "count\tN" record
                        Path path = new Path(uri.getPath());
                        BufferedReader br = new BufferedReader(new FileReader(path.getName()));
                        String line = br.readLine();
                        if (line.startsWith("count")) {
                            String[] ls = line.split("\t");
                            cmap = new HashMap<String, Integer>();
                            cmap.put(ls[0], Integer.parseInt(ls[1].trim()));
                        }
                        br.close();
                    } else {
                        // The DF table: one "word\tdf" record per line
                        df = new HashMap<String, Integer>();
                        Path path = new Path(uri.getPath());
                        BufferedReader br = new BufferedReader(new FileReader(path.getName()));
                        String line;
                        while ((line = br.readLine()) != null) {
                            String[] ls = line.split("\t");
                            df.put(ls[0], Integer.parseInt(ls[1].trim()));
                        }
                        br.close();
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The TF records come from the first job; skip its count file
        FileSplit fs = (FileSplit) context.getInputSplit();
        if (!fs.getPath().getName().contains("part-r-00003")) {
            String[] v = value.toString().trim().split("\t");
            if (v.length >= 2) {
                int tf = Integer.parseInt(v[1].trim());
                String[] ss = v[0].split("_");
                if (ss.length >= 2) {
                    String w = ss[0];
                    String id = ss[1];
                    // TF * ln(N / DF); cast to double, otherwise N / DF
                    // would be truncated by integer division
                    double s = tf * Math.log((double) cmap.get("count") / df.get(w));
                    NumberFormat nf = NumberFormat.getInstance();
                    nf.setMaximumFractionDigits(5);
                    context.write(new Text(id), new Text(w + ":" + nf.format(s)));
                }
            } else {
                System.out.println(value.toString() + "-------------");
            }
        }
    }
}
-----------------------------------------------------------------------------
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LastReduce extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> arg1, Context context) throws IOException, InterruptedException {
        // Concatenate all word:weight pairs of one post into a single line
        StringBuffer sb = new StringBuffer();
        for (Text i : arg1) {
            sb.append(i.toString() + "\t");
        }
        context.write(key, new Text(sb.toString()));
    }
}
----------------------------------------------------------------------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LastJob {
    public static void mainJob() {
        Configuration config = new Configuration();
        config.set("yarn.resourcemanager.hostname", "hadoop010");
        try {
            Job job = Job.getInstance(config, "weibo3");
            job.setJarByClass(LastJob.class);
            // Ship the post count (N) and the DF table via the distributed cache
            job.addCacheFile(new Path(Paths.HAD_OUTPUT1 + "/part-r-00003").toUri());
            job.addCacheFile(new Path(Paths.HAD_OUTPUT2 + "/part-r-00000").toUri());
            // Output key/value types for the map and reduce tasks
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(LastMapper.class);
            // Concatenation can safely be pre-applied, so the reducer doubles as combiner
            job.setCombinerClass(LastReduce.class);
            job.setReducerClass(LastReduce.class);
            // Read the first job's output (the TF records) from HDFS
            FileInputFormat.addInputPath(job, new Path(Paths.HAD_OUTPUT1));
            FileOutputFormat.setOutputPath(job, new Path(Paths.HAD_OUTPUT3));
            if (job.waitForCompletion(true)) {
                System.out.println("LastJob finished!");
                System.out.println("All jobs finished!");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
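Once the chain has finished, the final result can be spot-checked directly on HDFS. LastJob keeps the default single reducer, so its output lands in part-r-00000:
hdfs dfs -cat /had/output3/part-r-00000 | head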