User behavior log statistics: archived Java MapReduce and Scala Spark code
The original plan was just to archive a Spark word-count, but word count alone does not show much, and writing the Spark version involved stepping into quite a few pitfalls. So I took a user behavior log statistics job I had previously written in Java MapReduce and re-implemented it, roughly, in Scala with Spark (the two are not identical; some implementation details differ), as a proof that I can put together a first Spark program. The code is only meant to illustrate the map, reduce, and file read/write flow; because the referenced packages are missing, it will not run on its own. Both versions are Maven projects, so pay attention to the dependencies in the pom, and make sure the Spark version declared there matches the Spark version deployed on the cluster.
Whether you write MapReduce or Spark, my impression is that the input file is effectively already split into initial <K, V> pairs: the key is something like a placeholder (with Hadoop's default TextInputFormat it is the byte offset of the line, and the mapper below simply declares it as Object), and each value is one line of the input file (the splitting rule can be customized).
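For illustration (my own sketch with a made-up path, not part of the archived job; runnable in the spark-shell, where sc already exists):

// sc.textFile yields an RDD[String] with one element per input line, i.e. the
// "value" side of the initial split; Hadoop's TextInputFormat would instead hand
// the mapper (byte offset, line) pairs.
val lines = sc.textFile("hdfs:///tmp/example_input.txt") // made-up path
lines.take(3).foreach(println)                           // print the first few lines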
The logic of the Java MapReduce version is fairly straightforward. In Spark, mind the difference between flatMap and map: flatMap applies the mapping to each line and then flattens the result one level. Take word count as an example; if the input file is
123 456 123
123 456
then mapping the file directly produces
Array[Array[(K,V)]] = {
Array[(K,V)] = { (123,1), (456,1), (123, 1) }
Array[(K,V)] = { (123,1), (456,1) }
}
whereas flatMap produces
Array[(K,V)] = { (123,1), (456,1), (123, 1), (123,1), (456,1) }
which is the flat form the subsequent shuffle and reduce steps need.
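For reference, a minimal word-count sketch along these lines (my own, not part of the archive; input and output paths come from the command-line arguments):

import org.apache.spark.{SparkConf, SparkContext}

// flatMap -> map -> reduceByKey word count
object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    sc.textFile(args(0))                     // one RDD element per input line
      .flatMap(line => line.split("\\s+"))   // split each line into words and flatten
      .map(word => (word, 1L))               // one (K, V) pair per word occurrence
      .reduceByKey(_ + _)                    // sum the counts of identical words
      .saveAsTextFile(args(1))
    sc.stop()
  }
}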
As for the reduce step in Spark, besides the reduceByKey used below, there are other ways to customize how the values of a key are combined. The function you pass in takes two values and merges them into one; I am not yet sure whether it can be written as a foreach-style summation over all of a key's values, the way the Java MapReduce reducer does it.
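A hedged sketch of the two styles (names and sample data are mine, not from the archived code; sc is an existing SparkContext):

// pairs: RDD[(String, Long)] of (key, count) records
val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))
// reduceByKey merges two values at a time into one
val byReduce = pairs.reduceByKey((x, y) => x + y)
// groupByKey + mapValues iterates over all values of a key, the closest analogue
// to the MapReduce reducer's foreach-style summation (but shuffles every value)
val byGroup = pairs.groupByKey().mapValues(vs => vs.sum)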
Also, be careful not to casually return null from a Spark map function; it can make the job fail at runtime. Return an empty object of the expected type instead.
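For example (a sketch of the idea only, not the archived code), a line that fails to parse can be mapped to an empty pair of the expected type and filtered out before the reduce:

// return an empty pair instead of null for unusable lines
def safeMap(line: String): (String, String) =
  if (line == null || line.trim.isEmpty) ("", "")
  else (line.split("\t")(0), "1")

// hypothetical usage: inData.map(safeMap).filter(_._1.nonEmpty).reduceByKey(...)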
Overall, the Spark code is shorter than the MapReduce code and easier to extend (for example, it is straightforward to chain another map/reduce stage after the first one). Of course, this only counts as a first Spark program; there is still a lot to learn.
Java MapReduce:
package com.news.rec.monitor;

import com.newsRec.model.UserActionLog;
import com.sohu.newsRec.parser.UserLogParser;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import java.io.IOException;

/**
 * Created by jacobzhou on 2016/9/1.
 */
public class UserActiveCount extends Configured implements Tool {
    private static String DELIMA = "\t";

    // Mapper: parse one log line and emit expo/pv/tm/readTime counters keyed by net + gbCode
    public static class MRMapper extends Mapper<Object, Text, Text, Text> {
        private UserLogParser userActorParser = new UserLogParser();

        // extract the numeric readTime field from the raw line
        private long getReadTime(String line) {
            String readTime = "";
            int index = line.indexOf("readTime") + 9;
            while (line.charAt(index) >= '0' && line.charAt(index) <= '9') {
                readTime += line.charAt(index);
                index++;
            }
            if (!readTime.equals("")) return Long.parseLong(readTime);
            else return 0;
        }

        protected void map(Object key, Text value, Mapper.Context context) throws IOException, InterruptedException {
            String line = value.toString();
            UserActionLog userLog = userActorParser.parseKV(line);
            String act = userLog.getAct();
            long gbCode = userLog.getgbCode();
            long pvNum = 0;
            long expoNum = 0;
            long tmNum = 0;
            long readTime = getReadTime(line);
            if (readTime < 4 || readTime > 3000) readTime = 0;
            if (act.equals("expo")) expoNum = 1;
            else if (act.equals("pv")) pvNum = 1;
            else if (act.equals("tm")) {
                tmNum = 1;
                if (readTime == 0) return;
            }
            String net = userLog.getNet();
            if (net == null || net.trim().equals("")) {
                net = "blank";
            }
            String wKey = "net" + DELIMA + net + DELIMA + "gbCode" + DELIMA + gbCode;
            String wValue = expoNum + DELIMA + pvNum + DELIMA + tmNum + DELIMA + readTime;
            context.write(new Text(wKey), new Text(wValue));
        }

        protected void cleanup(Context context) throws IOException, InterruptedException {}
    }

    // Reducer: sum the expo/pv/tm/readTime counters per key
    public static class MRReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            String sKey[] = key.toString().split(DELIMA);
            long expoNum, pvNum, tmNum, readTime;
            String result;
            expoNum = pvNum = tmNum = readTime = 0;
            for (Text val : values) {
                String data[] = val.toString().split(DELIMA);
                expoNum += Long.parseLong(data[0]);
                pvNum += Long.parseLong(data[1]);
                tmNum += Long.parseLong(data[2]);
                readTime += Long.parseLong(data[3]);
            }
            result = expoNum + DELIMA + pvNum + DELIMA + tmNum + DELIMA + readTime;
            context.write(key, new Text(result));
        }
    }

    // job configuration and submission
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        conf.set("mapreduce.job.queuename", "datacenter");
        conf.set("mapred.max.map.failures.percent", "5");
        int reduceTasksMax = 10;
        Job job = new Job(conf);
        job.setJobName("userActiveStatistic job");
        job.setNumReduceTasks(reduceTasksMax);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(MRMapper.class);
        job.setReducerClass(MRReducer.class);
        job.setJarByClass(UserActiveCount.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        try {
            System.out.println("start run job!");
            int ret = ToolRunner.run(new UserActiveCount(), args);
            System.exit(ret);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Scala Spark, map: turn each line of input into a single (key, value) pair
package zzy

import com.newsRec.parser.UserLogParser
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by jacobzhou on 2016/10/11.
 */
object newsMonitor {
  private val DELIMA: String = "\t"
  private val userActorParser = new UserLogParser
  var num = 0

  // parse one log line and emit (net + gbCode, expo/pv/tm counters)
  def mapData(line: String): (String, String) = {
    if (num < 100) {
      println(line)
      num = num + 1
    }
    val userLog = UserLogParser.parseKV(line)
    val act: String = userLog.getAct
    val gbCode: Long = userLog.getgbCode
    var pvNum: Long = 0
    var expoNum: Long = 0
    var tmNum: Long = 0
    if (act == "expo") expoNum = 1
    else if (act == "pv") pvNum = 1
    else if (act == "tm") tmNum = 1
    var net: String = userLog.getNet
    if (net == null || net.trim == "") net = "blank"
    val wKey: String = "net" + DELIMA + net + DELIMA + "gbCode" + DELIMA + gbCode
    val wValue: String = expoNum + DELIMA + pvNum + DELIMA + tmNum
    (wKey, wValue)
  }

  // merge two value strings by summing them field by field
  def reduceData(a: String, b: String): String = {
    var expoNum: Long = 0L
    var pvNum: Long = 0L
    var tmNum: Long = 0L
    val dataA: Array[String] = a.split(DELIMA)
    val dataB: Array[String] = b.split(DELIMA)
    expoNum = dataA(0).toLong + dataB(0).toLong
    pvNum = dataA(1).toLong + dataB(1).toLong
    tmNum = dataA(2).toLong + dataB(2).toLong
    return expoNum + DELIMA + pvNum + DELIMA + tmNum
  }

  def main(args: Array[String]): Unit = {
    println("Running")
    val conf = new SparkConf()
    conf.setAppName("SparkTest")
    val input = args(0)
    val output = args(1)
    val sc = new SparkContext(conf)
    val inData = sc.textFile(input)
    val tmp = inData.map(line => mapData(line)).reduceByKey((x, y) => reduceData(x, y)) //.collect().foreach(println)
    tmp.saveAsTextFile(output)
  }
}
Scala Spark, flatMap: more extensible; for example, to split one line into several (key, value) pairs, first combine them into a collection such as a List[(key, value)] and then expand it with flatMap
package zzy

import com.newsRec.parser.UserLogParser
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by jacobzhou on 2016/9/18.
 */
object newsMonitor {
  private val DELIMA: String = "\t"
  private val userActorParser = new UserLogParser
  var num = 0

  // parse one log line into a one-element Map, so flatMap can unroll it into (key, value) pairs
  def mapData(line: String): Map[String, String] = {
    if (num < 100) {
      println(line)
      num = num + 1
    }
    val userLog = UserLogParser.parseKV(line)
    val act: String = userLog.getAct
    val gbCode: Long = userLog.getgbCode
    var pvNum: Long = 0
    var expoNum: Long = 0
    var tmNum: Long = 0
    if (act == "expo") expoNum = 1
    else if (act == "pv") pvNum = 1
    else if (act == "tm") tmNum = 1
    var net: String = userLog.getNet
    if (net == null || net.trim == "") net = "blank"
    val wKey: String = "net" + DELIMA + net + DELIMA + "gbCode" + DELIMA + gbCode
    val wValue: String = expoNum + DELIMA + pvNum + DELIMA + tmNum
    return Map(wKey -> wValue)
  }

  // merge two value strings by summing them field by field
  def reduceData(a: String, b: String): String = {
    var expoNum: Long = 0L
    var pvNum: Long = 0L
    var tmNum: Long = 0L
    val dataA: Array[String] = a.split(DELIMA)
    val dataB: Array[String] = b.split(DELIMA)
    expoNum = dataA(0).toLong + dataB(0).toLong
    pvNum = dataA(1).toLong + dataB(1).toLong
    tmNum = dataA(2).toLong + dataB(2).toLong
    return expoNum + DELIMA + pvNum + DELIMA + tmNum
  }

  def main(args: Array[String]): Unit = {
    println("Running")
    val conf = new SparkConf()
    conf.setAppName("SparkTest")
    val input = args(0)
    val output = args(1)
    val sc = new SparkContext(conf)
    val inData = sc.textFile(input)
    val tmp = inData.flatMap(line => mapData(line)).reduceByKey((x, y) => reduceData(x, y)) //.collect().foreach(println)
    tmp.saveAsTextFile(output)
  }
}
Example script for launching the Spark job:
output=zeyangzhou/count
input=zeyangzhou/data
hadoop fs -rmr $output
jar=/opt/develop/zeyangzhou/zzy-1.0-SNAPSHOT-jar-with-dependencies.jar
SPARK=/usr/lib/spark/bin/spark-submit
${SPARK} --queue datacenter \
         --class zzy.newsMonitor \
         --executor-memory 15g \
         --master yarn-cluster \
         --driver-memory 20g \
         --num-executors 30 \
         --executor-cores 15 \
         $jar $input $output