Spark Series Practice --- Beginner Notes 24
Core content:
1. Hands-on with the basic Top N algorithm in Spark
2. Hands-on with the grouped Top N algorithm in Spark
3. Inside the sorting machinery: the RangePartitioner behind sortByKey
Lately the people around me have been heading off to job interviews one after another. At first that stirred up a bit of restlessness and mild panic in me, but as I kept digging into Hadoop 1.0, Hadoop 2.0 with YARN, HBase, Hive and the rest, the restlessness gradually faded, because I can feel I have built up real ability. I have now reached lecture 20 of my Spark studies. Teacher Wang Jialin says lecture 20 is a watershed: finish it and a monthly salary of 15,000 is within reach, and finish the next 15 lectures on Spark internals and performance tuning and 20,000+ is within reach. I am getting more and more confident, but the old motto still applies: stay grounded and take it one step at a time.
Now, on to the topic of this post. "Top N" can be understood as finding things like the hottest products; it has important applications in e-commerce, social networks, media and similar fields. Top N algorithms (which always involve sorting first) come in two forms: the basic Top N algorithm and the grouped Top N algorithm. Grouped Top N is the more common and more general of the two, and the one worth mastering: given data of several different types, we want to find the top N elements within each type. A tiny sketch below pins down the two terms before we touch Spark.
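Here is that sketch on plain Scala collections (the sample values simply echo the input data used later in this post):

// Basic top-N: one flat list of numbers, take the N largest.
val nums = List(1, 4, 2, 5, 7, 3, 2, 7, 9, 1, 4, 5)
val basicTop5 = nums.sorted(Ordering[Int].reverse).take(5)   // List(9, 7, 7, 5, 5)

// Grouped top-N: pairs of (type, value), take the N largest within each type.
val scores = List(("spark", 100), ("hadoop", 65), ("spark", 99), ("hadoop", 98))
val groupedTop2 = scores.groupBy(_._1).map { case (k, vs) =>
  k -> vs.map(_._2).sorted(Ordering[Int].reverse).take(2)
}                                                            // Map(spark -> List(100, 99), hadoop -> List(98, 65))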
OK, let's start with the hands-on basic Top N algorithm.
Example program 1: the basic Top N algorithm in Spark
Input data (topn.txt, one number per line):
1
4
2
5
7
3
2
7
9
1
4
5
Requirement: sort from largest to smallest and output the top 5 values.
Code:
package com.spark.topn

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by hp on 2016/12/16.
 * This program implements the basic Top N sort.
 */
object TopNBasic {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("TopNBasic")
    conf.setMaster("local") // local run mode
    val sc = new SparkContext(conf)
    val lines: RDD[String] = sc.textFile("C:\\topn.txt")
    // Build key-value pairs so that sortByKey can be used for sorting
    val pairs: RDD[(Int, Null)] = lines.map(line => (line.toInt, null))
    // Sort in descending order (sortByKey sorts ascending by default)
    val sortedPairs: RDD[(Int, Null)] = pairs.sortByKey(false)
    // Strip the placeholder values, keeping only the sorted numbers
    val result: RDD[Int] = sortedPairs.map(pair => pair._1)
    // Take the top 5 elements; note that take returns an Array, not an RDD
    val top5: Array[Int] = result.take(5)
    top5.foreach(println)
    sc.stop()
  }
}
Result:
9
7
7
5
5
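As an aside, if we only need the N largest elements and not a fully sorted RDD, Spark's RDD.top(n) does the same job more directly. A minimal sketch, assuming the same topn.txt input as above:

val top5: Array[Int] = sc.textFile("C:\\topn.txt")
  .map(_.trim.toInt)
  .top(5)          // returns the 5 largest elements, already in descending order
top5.foreach(println)

top avoids shuffling the whole data set: each partition keeps only its local top N, and the partial results are merged on the driver.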
Now let's implement the same thing with MapReduce. The input data is the same as above; the code is as follows:
package topN;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hands-on basic Top N in MapReduce
public class TopNBasic {
    public static String path1 = "C:\\topn.txt";
    public static String path2 = "C:\\dirout\\";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(new Path(path2))) { // delete the output path if it already exists
            fileSystem.delete(new Path(path2), true);
        }
        Job job = Job.getInstance(conf, "TopNBasic");
        job.setJarByClass(TopNBasic.class);

        // Driver configuration
        FileInputFormat.setInputPaths(job, new Path(path1));
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        // The shuffle phase happens between map and reduce
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(path2));

        // Submit the job and wait for completion
        job.waitForCompletion(true);

        // Print the job output
        FSDataInputStream fr = fileSystem.open(new Path("C:\\dirout\\part-r-00000"));
        IOUtils.copyBytes(fr, System.out, 1024, true);
    }

    public static class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable k1, Text v1, Context context) throws IOException, InterruptedException {
            String line = v1.toString();
            context.write(new Text(line), NullWritable.get());
        }
    }

    // Records reach the reducer after partitioning, sorting, grouping and shuffling
    public static class MyReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        public static ArrayList<Integer> arr = new ArrayList<Integer>();

        @Override // e.g. <7, {NullWritable, NullWritable}>
        protected void reduce(Text k2, Iterable<NullWritable> v2s, Context context) throws IOException, InterruptedException {
            for (NullWritable v2 : v2s) {
                arr.add(Integer.valueOf(k2.toString()));
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            Collections.sort(arr);      // sort the collected values in ascending order
            Collections.reverse(arr);   // then reverse to get descending order
            // Emit the 5 largest values
            String line = String.valueOf(arr.get(0)) + "\t" + String.valueOf(arr.get(1)) + "\t"
                    + String.valueOf(arr.get(2)) + "\t" + String.valueOf(arr.get(3))
                    + "\t" + String.valueOf(arr.get(4));
            context.write(new Text(line), NullWritable.get());
        }
    }
}
Result:
9 7 7 5 5
Honestly, MapReduce in Hadoop feels rather cumbersome by comparison.
Next, let's look at the grouped Top N algorithm in Spark.
Input data (word3.txt, one "name score" pair per line):
spark 100
Hadoop 65
spark 99
Hadoop 61
spark 195
Hadoop 60
spark 98
Hadoop 69
spark 91
Hadoop 64
spark 89
Hadoop 98
spark 88
Hadoop 99
spark 100
Hadoop 68
spark 60
Hadoop 79
spark 97
Hadoop 200
Requirement: compute the top N within each group, and sort the output by the group key.
Example code:
package com.spark.topn

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by hp on 2016/12/17.
 * This program computes the grouped Top N.
 */
object TopNGroup {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("TopNGroup")
    val sc = new SparkContext(conf)
    val lines: RDD[String] = sc.textFile("C:\\word3.txt")
    val pairs: RDD[(String, Int)] = lines.map(line => {
      val splited: Array[String] = line.split(" ")
      val first = splited(0)
      val second = splited(1)
      (first, second.toInt)
    })
    // Collect all values of each key into one collection, then sort by key
    val sortKeyPairs: RDD[(String, Iterable[Int])] = pairs.groupByKey().sortByKey()
    val results: RDD[(String, List[Int])] = sortKeyPairs.map(line => {
      val list: List[Int] = line._2.toList.sortWith(_ > _)
      (line._1, list.take(3))
    })
    results.collect().foreach(println)
    sc.stop()
  }
}
Result:
(Hadoop,List(200, 99, 98))
(spark,List(195, 100, 100))
Of course, the program above can also be written more compactly:
package com.spark.topn

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by hp on 2016/12/17.
 * Grouped Top N, compact version.
 */
object TopNGroup {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("TopNGroup")
    conf.setMaster("local")
    val sc: SparkContext = new SparkContext(conf)
    val lines = sc.textFile("C:\\word3.txt")
    lines.map(line => {
      val splited = line.split(" ")
      (splited(0), splited(1).toInt)
    }).groupByKey().sortByKey().map(line => {
      val arr: List[Int] = line._2.toList.sortWith(_ > _).take(3)
      (line._1, arr)
    }).collect().foreach(println)
    sc.stop()
  }
}
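One caveat applies to both versions: groupByKey pulls every value of a key into memory at once, which can hurt when a single key has a huge number of values. A hedged alternative sketch (my own variation, not from the original program) that keeps only a running top 3 per key while aggregating; it is meant as a drop-in replacement for the transformation chain inside the same main method, reusing the lines RDD defined above:

val top3PerKey = lines.map(line => {
  val splited = line.split(" ")
  (splited(0), splited(1).toInt)
}).aggregateByKey(List.empty[Int])(
  (acc, v) => (v :: acc).sorted(Ordering[Int].reverse).take(3),   // fold one value into a partition-local top-3 list
  (a, b)   => (a ++ b).sorted(Ordering[Int].reverse).take(3)      // merge top-3 lists coming from different partitions
).sortByKey()

top3PerKey.collect().foreach(println)

Because only three values per key ever cross the shuffle, no key's full value list has to fit in memory.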
Next, let's implement the same functionality with MapReduce; the input data is the same as above.
Example code:
package topN;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Grouped Top N in MapReduce
public class TopNGroup {
    public static String path1 = "C:\\word3.txt";
    public static String path2 = "C:\\dirout\\";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(new Path(path2))) { // delete the output path if it already exists
            fileSystem.delete(new Path(path2), true);
        }
        Job job = Job.getInstance(conf, "TopNGroup");
        job.setJarByClass(TopNGroup.class);

        // Driver configuration
        FileInputFormat.setInputPaths(job, new Path(path1));
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // The shuffle phase happens between map and reduce
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(path2));

        // Submit the job and wait for completion
        job.waitForCompletion(true);

        // Print the job output
        FSDataInputStream fr = fileSystem.open(new Path("C:\\dirout\\part-r-00000"));
        IOUtils.copyBytes(fr, System.out, 1024, true);
    }

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable k1, Text v1, Context context) throws IOException, InterruptedException {
            String[] splited = v1.toString().split(" ");
            String word = splited[0];
            String num = splited[1];
            context.write(new Text(word), new LongWritable(Long.parseLong(num)));
        }
    }

    // Records reach the reducer after partitioning, sorting, grouping and shuffling
    public static class MyReducer extends Reducer<Text, LongWritable, Text, Text> {
        @Override // e.g. <spark, {100, 99, 98, 91, ...}>
        protected void reduce(Text k2, Iterable<LongWritable> v2s, Context context) throws IOException, InterruptedException {
            ArrayList<Long> arr = new ArrayList<Long>();
            for (LongWritable v2 : v2s) {
                arr.add(v2.get());
            }
            // The Collection implementations themselves do not provide sorting or reversing;
            // those utilities live in the Collections class.
            Collections.sort(arr);
            Collections.reverse(arr);
            String line = arr.get(0) + "\t" + arr.get(1) + "\t" + arr.get(2);
            context.write(k2, new Text(line));
        }
    }
}
Result:
Hadoop 200 99 98
spark 195 100 100
Once again, MapReduce feels rather cumbersome.
OK, with both Top N algorithms covered, let's look at the internals of sortByKey and the RangePartitioner it uses:
The relevant Spark source:
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
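So sortByKey does not sort everything in one place: it builds a RangePartitioner, shuffles each key into the partition that owns its key range, and sorts within each partition. A small illustrative sketch (the toy data is hypothetical, not from the post) makes the range partitioning visible:

// Nine keys spread over 3 input partitions.
val data = sc.parallelize(Seq(5, 1, 9, 3, 7, 2, 8, 4, 6).map(k => (k, 1)), 3)
val sorted = data.sortByKey(ascending = true, numPartitions = 3)

// Show which keys landed in which partition: partition 0 holds the smallest key
// range, partition 2 the largest, and keys are already sorted within each.
sorted.mapPartitionsWithIndex { (idx, iter) =>
  Iterator((idx, iter.map(_._1).toList))
}.collect().foreach(println)
// Expected shape: (0,List(1, 2, 3)), (1,List(4, 5, 6)), (2,List(7, 8, 9)),
// though the exact boundaries can vary because they come from sampling.

Those partition boundaries are chosen by sampling each input partition, which is exactly what the sketch method below does: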
def sketch[K : ClassTag](
    rdd: RDD[K],
    sampleSizePerPartition: Int): (Long, Array[(Int, Long, Array[K])]) = {
  val shift = rdd.id
  // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
  val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
    val seed = byteswap32(idx ^ (shift << 16))
    val (sample, n) = SamplingUtils.reservoirSampleAndCount(
      iter, sampleSizePerPartition, seed)
    Iterator((idx, n, sample))
  }.collect()
  val numItems = sketched.map(_._2).sum
  (numItems, sketched)
}
def reservoirSampleAndCount[T: ClassTag](
    input: Iterator[T],
    k: Int,
    seed: Long = Random.nextLong())
  : (Array[T], Long) = {
  val reservoir = new Array[T](k)
  // Put the first k elements in the reservoir.
  var i = 0
  while (i < k && input.hasNext) {
    val item = input.next()
    reservoir(i) = item
    i += 1
  }
  // ... (the rest of the method, which replaces reservoir slots at random, is omitted here)
}
From this source we can see that RangePartitioner samples each partition with the reservoir sampling algorithm!
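Since the snippet above is cut off before the interesting part, here is a minimal standalone sketch of reservoir sampling (my own illustration, not the Spark source):

import scala.reflect.ClassTag
import scala.util.Random

// Sample k elements uniformly from a stream whose length is not known in advance,
// and also return how many elements were seen in total.
def reservoirSample[T: ClassTag](input: Iterator[T], k: Int, seed: Long = 42L): (Array[T], Long) = {
  val rand = new Random(seed)
  val reservoir = new Array[T](k)
  var i = 0L
  while (input.hasNext) {
    val item = input.next()
    if (i < k) {
      reservoir(i.toInt) = item                       // the first k elements fill the reservoir directly
    } else {
      val j = (rand.nextDouble() * (i + 1)).toLong    // pick a slot in [0, i]
      if (j < k) reservoir(j.toInt) = item            // element i survives with probability k / (i + 1)
    }
    i += 1
  }
  // If the stream had fewer than k elements, trim the unused slots.
  (reservoir.take(math.min(i, k.toLong).toInt), i)
}

// e.g. reservoirSample((1 to 1000).iterator, 5) returns 5 uniformly chosen numbers plus the count 1000

This is why sortByKey only needs one extra pass over the data before the shuffle: each partition is sampled with a bounded amount of memory, and RangePartitioner then derives the partition boundaries from those samples.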
OK, that's it for today. Keep pushing forward!