Classic Hadoop Cases in Spark (6): Finding the Top K Values, Sorted



I. Requirements Analysis

#orderid,userid,payment,productid

Each input record has the four comma-separated fields shown above. The task: find the top N payment values across all input files.

a.txt
1,9819,100,121
2,8918,2000,111
3,2813,1234,22
4,9100,10,1101
5,3210,490,111
6,1298,28,1211
7,1010,281,90
8,1818,9000,20

b.txt
100,3333,10,100
101,9321,1000,293
102,3881,701,20
103,6791,910,30
104,8888,11,39

Expected result (for Top N = 5):
1	9000
2	2000
3	1234
4	1000
5	910
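As a sanity check, the expected output can be reproduced in plain Scala from the thirteen payment values in the sample files above (no Spark required):

val payments = Seq(
  100, 2000, 1234, 10, 490, 28, 281, 9000, // payments from a.txt
  10, 1000, 701, 910, 11                   // payments from b.txt
)
println(payments.sorted(Ordering.Int.reverse).take(5))
// List(9000, 2000, 1234, 1000, 910)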

II. MapReduce Implementation

MapReduce sorts map output keys in ascending order by default, so we need a custom key type that sorts in descending order.

A custom descending integer key type:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A WritableComparable integer key that sorts in descending order.
public class MyIntWritable implements WritableComparable<MyIntWritable> {

    private Integer num;

    public MyIntWritable() {
    }

    public MyIntWritable(Integer num) {
        this.num = num;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(num);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.num = in.readInt();
    }

    @Override
    public int compareTo(MyIntWritable o) {
        // Negate the natural comparison so larger values sort first.
        return -this.num.compareTo(o.num);
    }

    @Override
    public int hashCode() {
        return this.num.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof MyIntWritable)) {
            return false;
        }
        MyIntWritable other = (MyIntWritable) obj;
        return this.num.equals(other.num);
    }

    @Override
    public String toString() {
        return num + "";
    }
}
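The key line is compareTo: negating the natural comparison flips ascending into descending. The same idea in two lines of Scala, for intuition (values taken from the sample data):

println(Seq(100, 2000, 1234).sorted)                       // List(100, 1234, 2000)
println(Seq(100, 2000, 1234).sorted(Ordering.Int.reverse)) // List(2000, 1234, 100)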

The mapper emits each record's payment as the key, so the framework sorts records by payment:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TopNMapper extends Mapper<LongWritable, Text, MyIntWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.length() > 0) {
            // Record layout: orderid,userid,payment,productid (e.g. 1,9819,100,121)
            String[] arr = line.split(",");
            if (arr.length == 4) {
                int payment = Integer.parseInt(arr[2]);
                // Emit the payment as the key; the value carries no information.
                context.write(new MyIntWritable(payment), new Text(""));
            }
        }
    }
}

Because MyIntWritable sorts descending, the reducer sees keys from largest to smallest; the first five reduce calls therefore correspond to the five largest distinct payments:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNReducer extends Reducer<MyIntWritable, Text, Text, MyIntWritable> {

    private int idx = 0;

    @Override
    protected void reduce(MyIntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Keys arrive in descending order; keep only the first five.
        idx++;
        if (idx <= 5) {
            context.write(new Text(idx + ""), key);
        }
    }
}

The driver wires everything together. The input and output paths come from args[0] and args[1], the final output types must match the reducer's (Text key, MyIntWritable value), and a single reducer guarantees one globally sorted result file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobMain {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "topn_job");
        job.setJarByClass(JobMain.class);

        job.setMapperClass(TopNMapper.class);
        job.setMapOutputKeyClass(MyIntWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(TopNReducer.class);
        // Final output types must match the reducer: Text key, MyIntWritable value.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(MyIntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Delete the output directory if it already exists.
        Path path = new Path(args[1]);
        FileSystem fs = FileSystem.get(configuration);
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        // One reducer, so the result is a single globally sorted file.
        job.setNumReduceTasks(1);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

III. Spark Implementation (Scala)

val six = sc.textFile("/tmp/spark/six")
var idx = 0
six.filter(x => x.trim().length > 0 && x.split(",").length == 4)
  .map(_.split(",")(2))     // extract the payment field
  .map(x => (x.toInt, ""))  // make the payment the sort key
  .sortByKey(false)         // false = descending
  .map(x => x._1)
  .take(5)
  .foreach { x =>
    idx = idx + 1
    println(idx + "\t" + x)
  }

In Spark, sorting in descending order only requires passing false to sortByKey.
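A side note: for a small N, a full sortByKey shuffle does more work than necessary. A minimal alternative sketch, assuming the same /tmp/spark/six input path, uses RDD.top, which keeps only the N largest elements per partition and merges them on the driver:

val top5 = sc.textFile("/tmp/spark/six")
  .filter(x => x.trim().nonEmpty && x.split(",").length == 4)
  .map(_.split(",")(2).toInt)
  .top(5) // the 5 largest values, already in descending order

top5.zipWithIndex.foreach { case (payment, i) =>
  println((i + 1) + "\t" + payment)
}

On the sample data this prints the same five lines as the expected result in section I.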

