Classic Hadoop Cases Implemented in Spark (6): Finding the Top K Values in Sorted Order
Source: Internet | Editor: 程序博客网 | Time: 2024/05/01 13:39
I. Requirements Analysis
#orderid,userid,payment,productid
Find the top N payment values.
a.txt
1,9819,100,121
2,8918,2000,111
3,2813,1234,22
4,9100,10,1101
5,3210,490,111
6,1298,28,1211
7,1010,281,90
8,1818,9000,20
b.txt
100,3333,10,100
101,9321,1000,293
102,3881,701,20
103,6791,910,30
104,8888,11,39
Expected result (Top N = 5):
1	9000
2	2000
3	1234
4	1000
5	910
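As a quick sanity check of the expected result, the sample rows can be processed in plain Java without any cluster. This is only a minimal sketch: the class name `TopNCheck`, the method `topPayments`, and the in-memory arrays standing in for a.txt and b.txt are illustrative assumptions, not part of the original jobs.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: checks the expected Top 5 against the sample rows.
class TopNCheck {
    // In-memory copies of the sample files (orderid,userid,payment,productid).
    static final String[] A_TXT = {
        "1,9819,100,121", "2,8918,2000,111", "3,2813,1234,22", "4,9100,10,1101",
        "5,3210,490,111", "6,1298,28,1211", "7,1010,281,90", "8,1818,9000,20"
    };
    static final String[] B_TXT = {
        "100,3333,10,100", "101,9321,1000,293", "102,3881,701,20",
        "103,6791,910,30", "104,8888,11,39"
    };

    // Extract the third field (payment), sort descending, keep the first n.
    static List<Integer> topPayments(int n, String[]... files) {
        return Arrays.stream(files)
                .flatMap(Arrays::stream)
                .map(line -> line.split(","))
                .filter(fields -> fields.length == 4)
                .map(fields -> Integer.parseInt(fields[2]))
                .sorted(Comparator.reverseOrder())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> top = topPayments(5, A_TXT, B_TXT);
        for (int i = 0; i < top.size(); i++) {
            System.out.println((i + 1) + "\t" + top.get(i));
        }
    }
}
```

Running `main` prints the rank and payment separated by a tab, in the same layout as the expected result.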
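II. MapReduce Implementation
MapReduce sorts keys in ascending order by default, so a custom key type is needed to sort in descending order.
Custom integer key type with descending order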
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // Integer key that sorts in descending order during the shuffle.
    public class MyIntWritable implements WritableComparable<MyIntWritable> {
        private int num;

        // Hadoop needs a no-argument constructor for deserialization.
        public MyIntWritable() {
        }

        public MyIntWritable(int num) {
            this.num = num;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(num);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            this.num = in.readInt();
        }

        @Override
        public int compareTo(MyIntWritable o) {
            // Reversed arguments give descending order; Integer.compare avoids
            // the overflow that a subtraction-based comparison can produce.
            return Integer.compare(o.num, this.num);
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(num);
        }

        @Override
        public boolean equals(Object obj) {
            // Anything that is not a MyIntWritable can never be equal.
            if (!(obj instanceof MyIntWritable)) {
                return false;
            }
            return this.num == ((MyIntWritable) obj).num;
        }

        @Override
        public String toString() {
            return Integer.toString(num);
        }
    }
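The descending order of the key type comes entirely from its `compareTo` method. The sketch below illustrates that idea in plain Java, without Hadoop on the classpath. `DescInt`, `sortDescending`, and `compareBySubtraction` are hypothetical names introduced only for this illustration. It also shows why a subtraction-based comparison is fragile: for extreme values the subtraction overflows and the sign of the result flips.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DescendingOrderDemo {
    // Hypothetical stand-in for the custom key: larger values sort first.
    static final class DescInt implements Comparable<DescInt> {
        final int value;

        DescInt(int value) {
            this.value = value;
        }

        @Override
        public int compareTo(DescInt other) {
            // Reversed arguments yield descending order without overflow.
            return Integer.compare(other.value, this.value);
        }
    }

    // Sorts the values using DescInt's natural (descending) ordering.
    static List<Integer> sortDescending(int... values) {
        List<DescInt> keys = new ArrayList<>();
        for (int v : values) {
            keys.add(new DescInt(v));
        }
        Collections.sort(keys);
        List<Integer> out = new ArrayList<>();
        for (DescInt k : keys) {
            out.add(k.value);
        }
        return out;
    }

    // Subtraction-based descending comparison, shown only to expose the overflow.
    static int compareBySubtraction(int a, int b) {
        return (a - b) * -1;
    }

    public static void main(String[] args) {
        System.out.println(sortDescending(10, 9000, 490));
        // Integer.MIN_VALUE - 1 wraps around, so the result claims that
        // MIN_VALUE should sort before 1 in descending order, which is wrong.
        System.out.println(compareBySubtraction(Integer.MIN_VALUE, 1) < 0);
    }
}
```

For payment amounts in a realistic range the two approaches agree; the difference only matters near the limits of `int`.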
Map task code
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TopNMapper extends Mapper<LongWritable, Text, MyIntWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.length() > 0) {
                // Example record: 1,9819,100,121
                String[] arr = line.split(",");
                if (arr.length == 4) {
                    // The third field is the payment; emit it as the sort key.
                    int payment = Integer.parseInt(arr[2]);
                    context.write(new MyIntWritable(payment), new Text(""));
                }
            }
        }
    }
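The mapper's per-line logic (trim, split on commas, require four fields, read the third as the payment) can be isolated into a small function that is easy to test outside Hadoop. The sketch below is one possible way to do that. `PaymentParser` and `parsePayment` are hypothetical names, and skipping `#` header lines and non-numeric payments is an added assumption, not behavior of the mapper above.

```java
import java.util.OptionalInt;

class PaymentParser {
    // Returns the payment field of a record, or empty if the line is unusable.
    static OptionalInt parsePayment(String line) {
        String trimmed = line.trim();
        // Assumption: blank lines and '#' header lines carry no payment.
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return OptionalInt.empty();
        }
        String[] fields = trimmed.split(",");
        if (fields.length != 4) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(fields[2].trim()));
        } catch (NumberFormatException e) {
            // Assumption: malformed payments are skipped rather than failing the task.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePayment("1,9819,100,121"));
        System.out.println(parsePayment("#orderid,userid,payment,productid"));
    }
}
```

A mapper built on such a helper would call `context.write` only when the returned value is present.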
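Reduce code
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopNReducer extends Reducer<MyIntWritable, Text, Text, MyIntWritable> {
        // Rank of the current key; keys arrive in descending order.
        // This counter is only meaningful when the job uses a single reducer.
        private int idx = 0;

        @Override
        protected void reduce(MyIntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            idx++;
            // Emit only the first five keys, i.e. the five largest payments.
            if (idx <= 5) {
                context.write(new Text(idx + ""), key);
            }
        }
    }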
Job submission
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobMain {
        public static void main(String[] args) throws Exception {
            Configuration configuration = new Configuration();
            // Job.getInstance replaces the deprecated Job constructor.
            Job job = Job.getInstance(configuration, "topn_job");
            job.setJarByClass(JobMain.class);

            job.setMapperClass(TopNMapper.class);
            job.setMapOutputKeyClass(MyIntWritable.class);
            job.setMapOutputValueClass(Text.class);

            job.setReducerClass(TopNReducer.class);
            // Must match the reducer's output types: Text key, MyIntWritable value.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(MyIntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));

            // Delete the output directory if it already exists.
            Path path = new Path(args[1]);
            FileSystem fs = FileSystem.get(configuration);
            if (fs.exists(path)) {
                fs.delete(path, true);
            }
            FileOutputFormat.setOutputPath(job, path);

            // A single reducer sees all keys in one globally sorted sequence.
            job.setNumReduceTasks(1);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
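III. Spark Implementation (Scala Version)
    val six = sc.textFile("/tmp/spark/six")

    six.map(_.trim)
      // Keep only non-empty lines with exactly four comma-separated fields.
      .filter(line => line.nonEmpty && line.split(",").length == 4)
      // Use the payment (third field) as the key.
      .map(line => (line.split(",")(2).toInt, ""))
      // false means descending order.
      .sortByKey(false)
      .map(_._1)
      .take(5)
      // take returns a local array, so the ranking runs on the driver.
      .zipWithIndex
      .foreach { case (payment, i) => println(s"${i + 1}\t$payment") }
In Spark, passing false to sortByKey sorts the keys in descending order.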
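Sorting the whole data set just to keep five values does more work than strictly necessary. A common alternative is a bounded min-heap that never holds more than N elements; Spark's `RDD.top` is a built-in in the same spirit. The following plain-Java sketch illustrates the heap idea only and is not the approach used in the article's code. `BoundedTopN` and `topN` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class BoundedTopN {
    // Returns the n largest values in descending order using a size-n min-heap.
    static List<Integer> topN(int n, int... values) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // PriorityQueue is a min-heap: peek() is the smallest value kept so far.
        PriorityQueue<Integer> heap = new PriorityQueue<>(n);
        for (int v : values) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {
                // v beats the current smallest survivor, so replace it.
                heap.poll();
                heap.offer(v);
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        // Payments from the sample files a.txt and b.txt.
        System.out.println(topN(5, 100, 2000, 1234, 10, 490, 28, 281, 9000,
                10, 1000, 701, 910, 11));
    }
}
```

Unlike the MapReduce reducer, which sees each distinct key once, this sketch keeps duplicate values if they occur among the largest N.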