Simulating with Spark: finding the most-visited URLs on a website


Suppose an IT education website has several sections, such as Java, PHP, and .NET. A simulated access log for the site is shown below.

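The first field is the access time (a yyyyMMddHHmmss timestamp) and the second field is the visited URL. The code further down splits each line on a tab, so the two fields are assumed to be tab-separated in the actual file. Each section has its own domain: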

java.aaaaaaa.cn
net.aaaaaaa.cn
php.aaaaaaa.cn

20160321101954  http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101954  http://java.aaaaaaa.cn/java/video.shtml
20160321101954  http://java.aaaaaaa.cn/java/teacher.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101954  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101954  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/base.shtml
20160321101954  http://net.aaaaaaa.cn/net/course.shtml
20160321101954  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101954  http://net.aaaaaaa.cn/net/video.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/base.shtml
20160321101954  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101954  http://java.aaaaaaa.cn/java/video.shtml
20160321101954  http://java.aaaaaaa.cn/java/video.shtml
20160321101954  http://net.aaaaaaa.cn/net/video.shtml
20160321101954  http://net.aaaaaaa.cn/net/course.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101954  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101955  http://php.aaaaaaa.cn/php/course.shtml
20160321101955  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101955  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/base.shtml
20160321101955  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955  http://php.aaaaaaa.cn/php/video.shtml
20160321101955  http://net.aaaaaaa.cn/net/course.shtml
20160321101955  http://php.aaaaaaa.cn/php/video.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955  http://net.aaaaaaa.cn/net/video.shtml
20160321101955  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101955  http://java.aaaaaaa.cn/java/teacher.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101955  http://net.aaaaaaa.cn/net/video.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101956  http://net.aaaaaaa.cn/net/video.shtml
20160321101956  http://net.aaaaaaa.cn/net/video.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/android.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101956  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101956  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/base.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101956  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101956  http://net.aaaaaaa.cn/net/course.shtml
20160321101956  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101956  http://php.aaaaaaa.cn/php/video.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101956  http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101957  http://java.aaaaaaa.cn/java/teacher.shtml
20160321101957  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101957  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101957  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101957  http://php.aaaaaaa.cn/php/teacher.shtml
20160321101957  http://php.aaaaaaa.cn/php/course.shtml
20160321101957  http://java.aaaaaaa.cn/java/course/base.shtml
20160321101957  http://net.aaaaaaa.cn/net/course.shtml
20160321101957  http://java.aaaaaaa.cn/java/video.shtml
20160321101957  http://php.aaaaaaa.cn/php/video.shtml
20160321101957  http://net.aaaaaaa.cn/net/teacher.shtml
20160321101957  http://java.aaaaaaa.cn/java/video.shtml
20160321101957  http://net.aaaaaaa.cn/net/video.shtml
20160321101957  http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101957  http://net.aaaaaaa.cn/net/course.shtml
20160321101957  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101957  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101958  http://net.aaaaaaa.cn/net/course.shtml
20160321101958  http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101958  http://php.aaaaaaa.cn/php/video.shtml
20160321101958  http://php.aaaaaaa.cn/php/course.shtml
20160321101958  http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101958  http://net.aaaaaaa.cn/net/video.shtml
20160321101958  http://java.aaaaaaa.cn/java/course/base.shtml
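
Before counting anything, it may help to see how a single record breaks down. The sketch below is plain Scala with no Spark dependency. It assumes the two fields are separated by exactly one tab, and the object name LogLine and its methods are hypothetical helpers introduced only for illustration.

```scala
import java.net.URL

// Hypothetical helper for illustration; not part of the original program.
object LogLine {
  // Splits one record into (timestamp, url).
  // Assumption: the two fields are separated by a single tab character.
  def parse(line: String): (String, String) = {
    val fields = line.split("\t")
    (fields(0), fields(1))
  }

  // Extracts the host part of a URL, which identifies the site section.
  def host(url: String): String = new URL(url).getHost

  def main(args: Array[String]): Unit = {
    val (time, url) = parse("20160321101954\thttp://java.aaaaaaa.cn/java/course/javaee.shtml")
    println(s"time=$time host=${host(url)} url=$url")
  }
}
```

The host returned by getHost is what later serves as the grouping key for each section.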

Requirement: for each domain, find the three most visited URLs.
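
The logic behind this requirement can be sketched with ordinary Scala collections, independent of Spark: count the visits per URL, group the counts by host, sort each group by count in descending order, and keep the first three. The names TopUrlsSketch and topN are hypothetical, and breaking ties by URL is an assumption added here only to make the result deterministic.

```scala
import java.net.URL

// Hypothetical in-memory sketch of the requirement; not the Spark program itself.
object TopUrlsSketch {
  // Returns, for each host, the n most visited URLs with their visit counts.
  def topN(urls: Seq[String], n: Int): Map[String, List[(String, Int)]] = {
    // Step 1: number of visits per URL
    val counts = urls.groupBy(identity).map { case (url, hits) => (url, hits.size) }
    // Step 2: group by host; step 3: sort by count (descending) and keep n.
    // Assumption: ties are broken by URL in ascending order.
    counts.toList
      .groupBy { case (url, _) => new URL(url).getHost }
      .map { case (host, entries) =>
        host -> entries.sortBy { case (url, count) => (-count, url) }.take(n)
      }
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "http://java.aaaaaaa.cn/java/video.shtml",
      "http://java.aaaaaaa.cn/java/video.shtml",
      "http://java.aaaaaaa.cn/java/teacher.shtml",
      "http://php.aaaaaaa.cn/php/video.shtml"
    )
    topN(sample, 3).foreach { case (host, top) => println(s"$host -> $top") }
  }
}
```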


Code:


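import java.net.URL

import org.apache.spark.{SparkConf, SparkContext}

object UrlCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UrlCount").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Map each log line to (url, 1); the fields are tab-separated
    val rdd1 = sc.textFile("E:\\aaaaaa.log").map(line => {
      val f = line.split("\t")
      (f(1), 1)
    })

    // Sum the visits per URL
    val rdd2 = rdd1.reduceByKey(_ + _)

    // Attach the host of each URL: (host, url, count)
    val rdd3 = rdd2.map(t => {
      val url = t._1
      val host = new URL(url).getHost
      (host, url, t._2)
    })

    // Group by host, sort each group by count in descending order, keep the top three
    val rdd4 = rdd3.groupBy(_._1).mapValues(it => {
      it.toList.sortBy(_._3).reverse.take(3)
    })

    // Transformations are lazy; collect() is the action that runs the job
    println(rdd4.collect().toBuffer)

    sc.stop()
  }
}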
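
One thing to note about the groupBy step: it.toList brings every URL of a host into memory before sorting. That is harmless for this small sample, but a bounded merge that never holds more than n entries is a possible alternative. The sketch below is plain Scala; BoundedTopN, add, and merge are hypothetical names, and the tie-break by URL is an assumption.

```scala
// Hypothetical helper for illustration; not part of the original program.
object BoundedTopN {
  type Entry = (String, Int) // (url, count)

  // Adds one entry to a partial result and keeps only the n largest counts.
  def add(n: Int)(acc: List[Entry], e: Entry): List[Entry] =
    (e :: acc).sortBy { case (url, count) => (-count, url) }.take(n)

  // Merges two partial results, again keeping only the n largest counts.
  def merge(n: Int)(a: List[Entry], b: List[Entry]): List[Entry] =
    (a ++ b).sortBy { case (url, count) => (-count, url) }.take(n)

  def main(args: Array[String]): Unit = {
    val entries = List(("u1", 4), ("u2", 7), ("u3", 6), ("u4", 5))
    val top3 = entries.foldLeft(List.empty[Entry])((acc, e) => add(3)(acc, e))
    println(top3)
  }
}
```

In Spark, functions of this shape could be passed as the seqOp and combOp arguments of aggregateByKey on an RDD keyed by host, with an empty list as the zero value. This is offered as a design option, not as something the original code does.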