Using Spark to find a website's most-visited URLs (simulated example)
Source: Internet  Editor: 程序博客网  Time: 2024/05/17 08:43
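Suppose there is an IT education website with several sections, such as Java, PHP, and .NET. Below is a simulated access log for that site.
In each log record, the first field is the access time and the second field is the visited URL. Each section has its own domain, as follows: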
java.aaaaaaa.cn
net.aaaaaaa.cn
php.aaaaaaa.cn
20160321101954 http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101954 http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101954 http://java.aaaaaaa.cn/java/course/android.shtml
20160321101954 http://java.aaaaaaa.cn/java/video.shtml
20160321101954 http://java.aaaaaaa.cn/java/teacher.shtml
20160321101954 http://java.aaaaaaa.cn/java/course/android.shtml
20160321101954 http://php.aaaaaaa.cn/php/teacher.shtml
20160321101954 http://net.aaaaaaa.cn/net/teacher.shtml
20160321101954 http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101954 http://java.aaaaaaa.cn/java/course/base.shtml
20160321101954 http://net.aaaaaaa.cn/net/course.shtml
20160321101954 http://php.aaaaaaa.cn/php/teacher.shtml
20160321101954 http://net.aaaaaaa.cn/net/video.shtml
20160321101954 http://java.aaaaaaa.cn/java/course/base.shtml
20160321101954 http://net.aaaaaaa.cn/net/teacher.shtml
20160321101954 http://java.aaaaaaa.cn/java/video.shtml
20160321101954 http://java.aaaaaaa.cn/java/video.shtml
20160321101954 http://net.aaaaaaa.cn/net/video.shtml
20160321101954 http://net.aaaaaaa.cn/net/course.shtml
20160321101954 http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101954 http://java.aaaaaaa.cn/java/course/android.shtml
20160321101955 http://php.aaaaaaa.cn/php/course.shtml
20160321101955 http://net.aaaaaaa.cn/net/teacher.shtml
20160321101955 http://php.aaaaaaa.cn/php/teacher.shtml
20160321101955 http://java.aaaaaaa.cn/java/course/base.shtml
20160321101955 http://net.aaaaaaa.cn/net/teacher.shtml
20160321101955 http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955 http://php.aaaaaaa.cn/php/video.shtml
20160321101955 http://net.aaaaaaa.cn/net/course.shtml
20160321101955 http://php.aaaaaaa.cn/php/video.shtml
20160321101955 http://java.aaaaaaa.cn/java/course/android.shtml
20160321101955 http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955 http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955 http://net.aaaaaaa.cn/net/video.shtml
20160321101955 http://net.aaaaaaa.cn/net/teacher.shtml
20160321101955 http://java.aaaaaaa.cn/java/teacher.shtml
20160321101955 http://java.aaaaaaa.cn/java/course/android.shtml
20160321101955 http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101955 http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101955 http://net.aaaaaaa.cn/net/video.shtml
20160321101956 http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101956 http://net.aaaaaaa.cn/net/video.shtml
20160321101956 http://net.aaaaaaa.cn/net/video.shtml
20160321101956 http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101956 http://java.aaaaaaa.cn/java/course/android.shtml
20160321101956 http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101956 http://java.aaaaaaa.cn/java/course/javaee.shtml
20160321101956 http://java.aaaaaaa.cn/java/course/javaeeadvanced.shtml
20160321101956 http://php.aaaaaaa.cn/php/teacher.shtml
20160321101956 http://net.aaaaaaa.cn/net/teacher.shtml
20160321101956 http://java.aaaaaaa.cn/java/course/base.shtml
20160321101956 http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101956 http://php.aaaaaaa.cn/php/teacher.shtml
20160321101956 http://net.aaaaaaa.cn/net/course.shtml
20160321101956 http://net.aaaaaaa.cn/net/teacher.shtml
20160321101956 http://php.aaaaaaa.cn/php/video.shtml
20160321101956 http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101956 http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101956 http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101957 http://java.aaaaaaa.cn/java/teacher.shtml
20160321101957 http://php.aaaaaaa.cn/php/teacher.shtml
20160321101957 http://net.aaaaaaa.cn/net/teacher.shtml
20160321101957 http://net.aaaaaaa.cn/net/teacher.shtml
20160321101957 http://php.aaaaaaa.cn/php/teacher.shtml
20160321101957 http://php.aaaaaaa.cn/php/course.shtml
20160321101957 http://java.aaaaaaa.cn/java/course/base.shtml
20160321101957 http://net.aaaaaaa.cn/net/course.shtml
20160321101957 http://java.aaaaaaa.cn/java/video.shtml
20160321101957 http://php.aaaaaaa.cn/php/video.shtml
20160321101957 http://net.aaaaaaa.cn/net/teacher.shtml
20160321101957 http://java.aaaaaaa.cn/java/video.shtml
20160321101957 http://net.aaaaaaa.cn/net/video.shtml
20160321101957 http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101957 http://net.aaaaaaa.cn/net/course.shtml
20160321101957 http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101957 http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101958 http://net.aaaaaaa.cn/net/course.shtml
20160321101958 http://java.aaaaaaa.cn/java/course/hadoop.shtml
20160321101958 http://php.aaaaaaa.cn/php/video.shtml
20160321101958 http://php.aaaaaaa.cn/php/course.shtml
20160321101958 http://java.aaaaaaa.cn/java/course/cloud.shtml
20160321101958 http://net.aaaaaaa.cn/net/video.shtml
20160321101958 http://java.aaaaaaa.cn/java/course/base.shtml
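Each log record above consists of a 14-digit timestamp followed by a URL, and the domain of the URL identifies the section. As a rough sketch of how one record could be parsed, the snippet below splits a line on whitespace and extracts the host with `java.net.URL`. The `Visit` case class and the `LogLine.parse` helper are hypothetical names introduced here for illustration, and the `yyyyMMddHHmmss` timestamp pattern is an assumption inferred from the sample data.

```scala
import java.net.URL
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical record type for illustration; not part of the original program.
final case class Visit(time: LocalDateTime, url: String, host: String)

object LogLine {
  // Assumed timestamp layout, inferred from values such as 20160321101954.
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Returns None when the line does not have exactly two fields
  // or when the timestamp or URL cannot be parsed.
  def parse(line: String): Option[Visit] =
    line.trim.split("\\s+") match {
      case Array(ts, url) =>
        Try(Visit(LocalDateTime.parse(ts, fmt), url, new URL(url).getHost)).toOption
      case _ => None
    }
}
```

Splitting on any whitespace accepts both tab-separated and space-separated records. Returning an `Option` lets a caller skip malformed lines instead of failing the whole job.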
Requirement: for each domain, find the three URLs with the most visits.
Code:
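import java.net.URL

import org.apache.spark.{SparkConf, SparkContext}

object UrlCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UrlCount").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Split each line on a tab and emit (url, 1)
    val rdd1 = sc.textFile("E:\\aaaaaa.log").map(line => {
      val f = line.split("\t")
      (f(1), 1)
    })

    // Sum the visits for each URL: (url, count)
    val rdd2 = rdd1.reduceByKey(_ + _)

    // Attach the host of each URL: (host, url, count)
    val rdd3 = rdd2.map(t => {
      val url = t._1
      val host = new URL(url).getHost
      (host, url, t._2)
    })

    // Group by host and keep the three URLs with the highest counts
    val rdd4 = rdd3.groupBy(_._1).mapValues(it => {
      it.toList.sortBy(_._3).reverse.take(3)
    })

    // RDD transformations are lazy, so an action such as collect is needed
    // to run the job and show the result
    println(rdd4.collect().toBuffer)

    sc.stop()
  }
}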
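To check the counting logic on a small sample without starting Spark, the same steps (count per URL, group by host, keep the top N) can be sketched with plain Scala collections. This is an illustrative sketch, not the original program: `TopUrlsSketch` and `topPerHost` are hypothetical names, and breaking ties by URL is an assumption added here to make the ordering deterministic.

```scala
import java.net.URL

object TopUrlsSketch {
  // urls: the second field of each log record.
  // Returns, for every host, up to n (url, count) pairs with the highest counts.
  def topPerHost(urls: Seq[String], n: Int): Map[String, List[(String, Int)]] =
    urls
      .groupBy(identity)
      .map { case (url, hits) => (url, hits.size) } // visits per URL
      .toList
      .groupBy { case (url, _) => new URL(url).getHost } // group the counts by host
      .map { case (host, counts) =>
        // Highest count first; ties are broken by URL (an assumption, see above).
        host -> counts.sortBy { case (url, c) => (-c, url) }.take(n)
      }
}

object TopUrlsDemo {
  val sample: Seq[String] = Seq(
    "http://java.aaaaaaa.cn/java/video.shtml",
    "http://java.aaaaaaa.cn/java/video.shtml",
    "http://java.aaaaaaa.cn/java/teacher.shtml",
    "http://php.aaaaaaa.cn/php/course.shtml"
  )

  def main(args: Array[String]): Unit =
    TopUrlsSketch.topPerHost(sample, 3).foreach(println)
}
```

Holding all counts for a host in one list is fine for a sample like this one. On a large log, the `groupBy` step in the Spark version moves every (host, url, count) tuple for a host to a single task, so an approach that keeps only the top N per host while aggregating may scale better.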