Spark Log Analysis Project Demo (4) -- RDD Usage, User Behavior Statistics
First, the requirements: log mining.
(1) Randomly sample 100 sessions; compute each session's visit length (session duration) and step length (number of pages visited)
(2) Compute the top 10 popular categories
(3) Compute the top 10 users within each of the top 10 popular categories
The workflow for analyzing user behavior from the logs is described below.
(1) When a J2EE application receives a user's request to create a task, it inserts the task into the MySQL task table, with the task parameters stored as JSON in the task_param column. This is a precondition of the project, not part of this demo.
The J2EE platform then executes our spark-submit shell script, passing the taskid as an argument. A spark-submit shell script can accept arguments and forwards them to the Spark job's main function, where they arrive in the args array.
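As a minimal illustration of reading the forwarded argument inside the Spark job, here is a rough stand-in for the project's ParamUtils.getTaskIdFromArgs helper (the class and method names are assumptions, not the project's actual code):

```java
// Hypothetical sketch: parse the taskid that the spark-submit shell script
// forwarded into the main method's args array.
public class TaskIdArgs {
    public static long getTaskIdFromArgs(String[] args) {
        if (args == null || args.length == 0) {
            throw new IllegalArgumentException("taskid argument is missing");
        }
        // the first argument is expected to be the numeric taskid
        return Long.parseLong(args[0]);
    }
}
```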
The conditions a user may specify are:
* 1. Date range: start date to end date
* 2. Gender: male or female
* 3. Age range
* 4. Profession: multi-select
* 5. City: multi-select
* 6. Search keywords: multiple keywords; a session qualifies if any of its actions searched for any of the specified keywords
* 7. Clicked categories: multiple categories; a session qualifies if any of its actions clicked any of the specified categories
Using the arguments received in the main function, the program loads the query parameters from the database.
(2) First, query the behavior data within the specified date range from the user_visit_action in-memory temporary table.
(3) Map the queried rows to pairs keyed by sessionid.
(4) Aggregate the data at session granularity, then join the session-level data with the user-info data, yielding session-level data enriched with each session's user information.
(5) Filter the data by the parameters the user specified on the J2EE platform, producing a shared RDD: the visit detail data of the sessions that passed the filters. While filtering, compute each session's visit length (time from first to last action) and step length (number of pages visited).
(6) Sample 100 sessions: compute how many sessions to sample per day, compute each hour's share of that day's sessions, and use these two numbers to pick sessions at random. For the sampled sessions, compute visit-length and step-length statistics and write them to the database.
(7) From the result of (5), compute the top 10 popular categories.
(8) From the result of (7), compute the top 10 most active users for each of the top 10 categories.
The implementation of each step, with code, follows.
(1)
    // Create the DAO components we need
    ITaskDAO taskDAO = DAOFactory.getTaskDAO();

    // First look up the specified task and get its query parameters
    long taskid = ParamUtils.getTaskIdFromArgs(args, Constants.SPARK_LOCAL_TASKID_SESSION);
    Task task = taskDAO.findById(taskid);
    if (task == null) {
        System.out.println(new Date() + ": cannot find this task with id [" + taskid + "].");
        return;
    }

    JSONObject taskParam = JSONObject.parseObject(task.getTaskParam());
ITaskDAO, Task, and ParamUtils above are utility classes implemented by the project itself; their purpose is to load the query parameters.
(2) Filter the data with SQL.
JavaRDD<Row> actionRDD = SparkUtils.getActionRDDByDateRange(sqlContext, taskParam);
    /**
     * Get an RDD of user behavior data within the specified date range
     * @param sqlContext
     * @param taskParam task parameters
     * @return behavior data RDD
     */
    public static JavaRDD<Row> getActionRDDByDateRange(
            SQLContext sqlContext, JSONObject taskParam) {
        String startDate = ParamUtils.getParam(taskParam, Constants.PARAM_START_DATE);
        String endDate = ParamUtils.getParam(taskParam, Constants.PARAM_END_DATE);

        String sql = "select * "
                + "from user_visit_action "
                + "where date>='" + startDate + "' "
                + "and date<='" + endDate + "'";

        DataFrame actionDF = sqlContext.sql(sql);
        return actionDF.javaRDD();
    }
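Note that the WHERE clause above compares the date column as strings. This is safe only because the dates are zero-padded yyyy-MM-dd values, for which lexicographic order coincides with chronological order. A minimal pure-Java sketch of the same predicate (the class and method names are illustrative, not part of the project):

```java
// Demonstrates why the string comparison in the SQL above works:
// zero-padded yyyy-MM-dd dates sort lexicographically in date order.
public class DateRangeCheck {
    // mirrors the WHERE clause: date >= startDate AND date <= endDate
    public static boolean inRange(String date, String startDate, String endDate) {
        return date.compareTo(startDate) >= 0 && date.compareTo(endDate) <= 0;
    }
}
```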
(3) Perform the map operation, keying by sessionid.
    JavaPairRDD<String, Row> sessionid2actionRDD = getSessionid2ActionRDD(actionRDD);
    sessionid2actionRDD = sessionid2actionRDD.persist(StorageLevel.MEMORY_ONLY());

    /**
     * Map each behavior row to a (sessionid, row) pair
     * @param actionRDD behavior data RDD
     * @return RDD of (sessionid, row) pairs
     */
    public static JavaPairRDD<String, Row> getSessionid2ActionRDD(JavaRDD<Row> actionRDD) {
        return actionRDD.mapPartitionsToPair(new PairFlatMapFunction<Iterator<Row>, String, Row>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<Tuple2<String, Row>> call(Iterator<Row> iterator) throws Exception {
                List<Tuple2<String, Row>> list = new ArrayList<Tuple2<String, Row>>();
                while (iterator.hasNext()) {
                    Row row = iterator.next();
                    list.add(new Tuple2<String, Row>(row.getString(2), row));
                }
                return list;
            }
        });
    }
(4) Aggregate the rows sharing the same sessionid key, then join the result with the user-info data.
    JavaPairRDD<String, String> sessionid2AggrInfoRDD =
            aggregateBySession(sc, sqlContext, sessionid2actionRDD);

    /**
     * Aggregate behavior data at session granularity
     * @param sessionid2actionRDD behavior data keyed by sessionid
     * @return session-level aggregated data
     */
    private static JavaPairRDD<String, String> aggregateBySession(
            JavaSparkContext sc, SQLContext sqlContext,
            JavaPairRDD<String, Row> sessionid2actionRDD) {
        // Group the behavior data by session
        JavaPairRDD<String, Iterable<Row>> sessionid2ActionsRDD = sessionid2actionRDD.groupByKey();

        // Aggregate each session group, collecting all of the session's search
        // keywords and clicked category ids. At this point the data format is
        // <userid, partAggrInfo(sessionid,searchKeywords,clickCategoryIds)>
        JavaPairRDD<Long, String> userid2PartAggrInfoRDD = sessionid2ActionsRDD.mapToPair(
                new PairFunction<Tuple2<String, Iterable<Row>>, Long, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<Long, String> call(Tuple2<String, Iterable<Row>> tuple) throws Exception {
                String sessionid = tuple._1;
                Iterator<Row> iterator = tuple._2.iterator();

                StringBuffer searchKeywordsBuffer = new StringBuffer("");
                StringBuffer clickCategoryIdsBuffer = new StringBuffer("");

                Long userid = null;

                // start and end time of the session
                Date startTime = null;
                Date endTime = null;
                // step length of the session
                int stepLength = 0;

                // iterate over all actions in the session
                while (iterator.hasNext()) {
                    // extract each action's search keyword and clicked category id
                    Row row = iterator.next();
                    if (userid == null) {
                        userid = row.getLong(1);
                    }
                    String searchKeyword = row.getString(5);
                    // A note on the data: not every action row carries both a
                    // searchKeyword and a clickCategoryId. Only search actions
                    // have searchKeyword, and only category-click actions have
                    // clickCategoryId; no row has both, so either may be null.
                    // (Row.getLong() fails on null, so check the raw value first.)
                    Object clickCategoryIdObj = row.get(6);
                    Long clickCategoryId =
                            (clickCategoryIdObj == null ? null : row.getLong(6));

                    // Append a keyword or category id to its buffer only when it
                    // is non-null and not already present
                    if (StringUtils.isNotEmpty(searchKeyword)) {
                        if (!searchKeywordsBuffer.toString().contains(searchKeyword)) {
                            searchKeywordsBuffer.append(searchKeyword + ",");
                        }
                    }
                    if (clickCategoryId != null) {
                        if (!clickCategoryIdsBuffer.toString().contains(
                                String.valueOf(clickCategoryId))) {
                            clickCategoryIdsBuffer.append(clickCategoryId + ",");
                        }
                    }

                    // track the session's start and end time
                    Date actionTime = DateUtils.parseTime(row.getString(4));
                    if (startTime == null) {
                        startTime = actionTime;
                    }
                    if (endTime == null) {
                        endTime = actionTime;
                    }
                    if (actionTime.before(startTime)) {
                        startTime = actionTime;
                    }
                    if (actionTime.after(endTime)) {
                        endTime = actionTime;
                    }

                    // count the session's step length
                    stepLength++;
                }

                String searchKeywords = StringUtils.trimComma(searchKeywordsBuffer.toString());
                String clickCategoryIds = StringUtils.trimComma(clickCategoryIdsBuffer.toString());

                // visit length of the session, in seconds
                long visitLength = (endTime.getTime() - startTime.getTime()) / 1000;

                // Why key by userid rather than sessionid? After this
                // aggregation we still need to join each record with its user
                // info, and that join requires a <userid, Row> key. Returning
                // <sessionid, partAggrInfo> here would force an extra mapToPair
                // just to re-key by userid. So we return <userid, partAggrInfo>
                // directly, join with the user info, and only then re-key the
                // joined result by sessionid, giving <sessionid, fullAggrInfo>.

                // The aggregated fields are concatenated as key=value|key=value
                String partAggrInfo = Constants.FIELD_SESSION_ID + "=" + sessionid + "|"
                        + Constants.FIELD_SEARCH_KEYWORDS + "=" + searchKeywords + "|"
                        + Constants.FIELD_CLICK_CATEGORY_IDS + "=" + clickCategoryIds + "|"
                        + Constants.FIELD_VISIT_LENGTH + "=" + visitLength + "|"
                        + Constants.FIELD_STEP_LENGTH + "=" + stepLength + "|"
                        + Constants.FIELD_START_TIME + "=" + DateUtils.formatTime(startTime);

                return new Tuple2<Long, String>(userid, partAggrInfo);
            }
        });

        // Query all user data and map it to <userid, Row>
        String sql = "select * from user_info";
        JavaRDD<Row> userInfoRDD = sqlContext.sql(sql).javaRDD();

        JavaPairRDD<Long, Row> userid2InfoRDD = userInfoRDD.mapToPair(
                new PairFunction<Row, Long, Row>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<Long, Row> call(Row row) throws Exception {
                return new Tuple2<Long, Row>(row.getLong(0), row);
            }
        });

        /**
         * This join is a good candidate for converting a reduce join into a map
         * join: userid2PartAggrInfoRDD may be large (say, ten million records),
         * while userid2InfoRDD is small (say, a hundred thousand users).
         */
        // Join the session-level aggregated data with the user info
        JavaPairRDD<Long, Tuple2<String, Row>> userid2FullInfoRDD =
                userid2PartAggrInfoRDD.join(userid2InfoRDD);

        // Concatenate the joined data and return it as <sessionid, fullAggrInfo>
        JavaPairRDD<String, String> sessionid2FullAggrInfoRDD = userid2FullInfoRDD.mapToPair(
                new PairFunction<Tuple2<Long, Tuple2<String, Row>>, String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, String> call(
                    Tuple2<Long, Tuple2<String, Row>> tuple) throws Exception {
                String partAggrInfo = tuple._2._1;
                Row userInfoRow = tuple._2._2;

                String sessionid = StringUtils.getFieldFromConcatString(
                        partAggrInfo, "\\|", Constants.FIELD_SESSION_ID);

                int age = userInfoRow.getInt(3);
                String professional = userInfoRow.getString(4);
                String city = userInfoRow.getString(5);
                String sex = userInfoRow.getString(6);

                String fullAggrInfo = partAggrInfo + "|"
                        + Constants.FIELD_AGE + "=" + age + "|"
                        + Constants.FIELD_PROFESSIONAL + "=" + professional + "|"
                        + Constants.FIELD_CITY + "=" + city + "|"
                        + Constants.FIELD_SEX + "=" + sex;

                return new Tuple2<String, String>(sessionid, fullAggrInfo);
            }
        });

        return sessionid2FullAggrInfoRDD;
    }
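The key=value|key=value concatenation above is read back throughout the project via StringUtils.getFieldFromConcatString. Its implementation is not shown in this post; the following is a plausible pure-Java sketch of what such a helper does (an assumption about the project's code, not a copy of it):

```java
// Hypothetical re-implementation of the project's
// StringUtils.getFieldFromConcatString helper: extract one field's value
// from a "key=value|key=value" concatenated string such as partAggrInfo.
public class ConcatStringUtil {
    public static String getFieldFromConcatString(
            String str, String delimiter, String field) {
        // the delimiter is passed as a regex, e.g. "\\|"
        String[] fields = str.split(delimiter);
        for (String f : fields) {
            int eq = f.indexOf('=');
            if (eq > 0 && f.substring(0, eq).equals(field)) {
                return f.substring(eq + 1);
            }
        }
        return null; // field not present
    }
}
```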
(5) Filter and compute statistics in a single pass.
    // This custom Spark Accumulator collects the statistics; values are
    // accumulated through its add method
    Accumulator<String> sessionAggrStatAccumulator = sc.accumulator(
            "", new SessionAggrStatAccumulator());

    JavaPairRDD<String, String> filteredSessionid2AggrInfoRDD = filterSessionAndAggrStat(
            sessionid2AggrInfoRDD, taskParam, sessionAggrStatAccumulator);
    filteredSessionid2AggrInfoRDD =
            filteredSessionid2AggrInfoRDD.persist(StorageLevel.MEMORY_ONLY());
    /**
     * Filter session data and accumulate statistics
     * @param sessionid2AggrInfoRDD session-level aggregated data
     * @return filtered session data
     */
    private static JavaPairRDD<String, String> filterSessionAndAggrStat(
            JavaPairRDD<String, String> sessionid2AggrInfoRDD,
            final JSONObject taskParam,
            final Accumulator<String> sessionAggrStatAccumulator) {
        // Concatenate all filter parameters into a single string so they can be
        // consumed by ValidUtils below. This is not redundant work: it also lays
        // the groundwork for a later performance optimization.
        String startAge = ParamUtils.getParam(taskParam, Constants.PARAM_START_AGE);
        String endAge = ParamUtils.getParam(taskParam, Constants.PARAM_END_AGE);
        String professionals = ParamUtils.getParam(taskParam, Constants.PARAM_PROFESSIONALS);
        String cities = ParamUtils.getParam(taskParam, Constants.PARAM_CITIES);
        String sex = ParamUtils.getParam(taskParam, Constants.PARAM_SEX);
        String keywords = ParamUtils.getParam(taskParam, Constants.PARAM_KEYWORDS);
        String categoryIds = ParamUtils.getParam(taskParam, Constants.PARAM_CATEGORY_IDS);

        String _parameter = (startAge != null ? Constants.PARAM_START_AGE + "=" + startAge + "|" : "")
                + (endAge != null ? Constants.PARAM_END_AGE + "=" + endAge + "|" : "")
                + (professionals != null ? Constants.PARAM_PROFESSIONALS + "=" + professionals + "|" : "")
                + (cities != null ? Constants.PARAM_CITIES + "=" + cities + "|" : "")
                + (sex != null ? Constants.PARAM_SEX + "=" + sex + "|" : "")
                + (keywords != null ? Constants.PARAM_KEYWORDS + "=" + keywords + "|" : "")
                + (categoryIds != null ? Constants.PARAM_CATEGORY_IDS + "=" + categoryIds : "");

        // strip a trailing "|" (note: endsWith takes a literal string, not a
        // regex, so the original endsWith("\\|") check would never match)
        if (_parameter.endsWith("|")) {
            _parameter = _parameter.substring(0, _parameter.length() - 1);
        }

        final String parameter = _parameter;

        // Filter by the specified parameters
        JavaPairRDD<String, String> filteredSessionid2AggrInfoRDD = sessionid2AggrInfoRDD.filter(
                new Function<Tuple2<String, String>, Boolean>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Boolean call(Tuple2<String, String> tuple) throws Exception {
                // get the aggregated data from the tuple
                String aggrInfo = tuple._2;

                // Apply the filter conditions one by one.
                // Filter by age range (startAge, endAge)
                if (!ValidUtils.between(aggrInfo, Constants.FIELD_AGE,
                        parameter, Constants.PARAM_START_AGE, Constants.PARAM_END_AGE)) {
                    return false;
                }

                // Filter by profession (professionals),
                // e.g. filter "internet,IT,software" vs session "internet"
                if (!ValidUtils.in(aggrInfo, Constants.FIELD_PROFESSIONAL,
                        parameter, Constants.PARAM_PROFESSIONALS)) {
                    return false;
                }

                // Filter by city (cities),
                // e.g. filter "Beijing,Shanghai,Guangzhou,Shenzhen" vs session "Chengdu"
                if (!ValidUtils.in(aggrInfo, Constants.FIELD_CITY,
                        parameter, Constants.PARAM_CITIES)) {
                    return false;
                }

                // Filter by sex (male/female)
                if (!ValidUtils.equal(aggrInfo, Constants.FIELD_SEX,
                        parameter, Constants.PARAM_SEX)) {
                    return false;
                }

                // Filter by search keywords. A session may have searched for
                // "hotpot,cake,barbecue" while the filter specifies
                // "hotpot,skewers,iphone": the in check passes as long as any
                // keyword the session searched matches any keyword in the filter.
                if (!ValidUtils.in(aggrInfo, Constants.FIELD_SEARCH_KEYWORDS,
                        parameter, Constants.PARAM_KEYWORDS)) {
                    return false;
                }

                // Filter by clicked category ids
                if (!ValidUtils.in(aggrInfo, Constants.FIELD_CLICK_CATEGORY_IDS,
                        parameter, Constants.PARAM_CATEGORY_IDS)) {
                    return false;
                }

                // A session that reaches this point has passed every filter the
                // user specified and must be kept, so accumulate its
                // visit-length and step-length into the matching range counters
                sessionAggrStatAccumulator.add(Constants.SESSION_COUNT);

                // determine the session's visit-length and step-length ranges
                // and accumulate them
                long visitLength = Long.valueOf(StringUtils.getFieldFromConcatString(
                        aggrInfo, "\\|", Constants.FIELD_VISIT_LENGTH));
                long stepLength = Long.valueOf(StringUtils.getFieldFromConcatString(
                        aggrInfo, "\\|", Constants.FIELD_STEP_LENGTH));
                calculateVisitLength(visitLength);
                calculateStepLength(stepLength);

                return true;
            }

            /**
             * Accumulate the visit-length range counter
             * @param visitLength
             */
            private void calculateVisitLength(long visitLength) {
                if (visitLength >= 1 && visitLength <= 3) {
                    sessionAggrStatAccumulator.add(Constants.TIME_PERIOD_1s_3s);
                } else if (visitLength >= 4 && visitLength <= 6) {
                    sessionAggrStatAccumulator.add(Constants.TIME_PERIOD_4s_6s);
                } else if (visitLength >= 7 && visitLength <= 9) {
                    sessionAggrStatAccumulator.add(Constants.TIME_PERIOD_7s_9s);
                } else if (visitLength >= 10 && visitLength <= 30) {
                    sessionAggrStatAccumulator.add(Constants.TIME_PERIOD_10s_30s);
                } else if (visitLength > 30 && visitLength <= 60) {
                    sessionAggrStatAccumulator.add(Constants.TIME_PERIOD_30s_60s);
                } else if (visitLength > 60 && visitLength <= 180) {
                    sessionAggrStatAccumulator.add(Constants.TIME_PERIOD_1m_3m);
                } else if (visitLength > 180 && visitLength <= 600) {
                    sessionAggrStatAccumulator.add(Constants.TIME_PERIOD_3m_10m);
                } else if (visitLength > 600 && visitLength <= 1800) {
                    sessionAggrStatAccumulator.add(Constants.TIME_PERIOD_10m_30m);
                } else if (visitLength > 1800) {
                    sessionAggrStatAccumulator.add(Constants.TIME_PERIOD_30m);
                }
            }

            /**
             * Accumulate the step-length range counter
             * @param stepLength
             */
            private void calculateStepLength(long stepLength) {
                if (stepLength >= 1 && stepLength <= 3) {
                    sessionAggrStatAccumulator.add(Constants.STEP_PERIOD_1_3);
                } else if (stepLength >= 4 && stepLength <= 6) {
                    sessionAggrStatAccumulator.add(Constants.STEP_PERIOD_4_6);
                } else if (stepLength >= 7 && stepLength <= 9) {
                    sessionAggrStatAccumulator.add(Constants.STEP_PERIOD_7_9);
                } else if (stepLength >= 10 && stepLength <= 30) {
                    sessionAggrStatAccumulator.add(Constants.STEP_PERIOD_10_30);
                } else if (stepLength > 30 && stepLength <= 60) {
                    sessionAggrStatAccumulator.add(Constants.STEP_PERIOD_30_60);
                } else if (stepLength > 60) {
                    sessionAggrStatAccumulator.add(Constants.STEP_PERIOD_60);
                }
            }
        });

        return filteredSessionid2AggrInfoRDD;
    }
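The SessionAggrStatAccumulator used above is a project class whose implementation is not shown in this post. Its core operation can be sketched in pure Java as follows: the add step locates a named counter inside the key=value|key=value statistics string and increments it by one (the real class would additionally implement Spark's accumulator interface; the class and method names here are assumptions):

```java
// Hypothetical sketch of the core of SessionAggrStatAccumulator: increment
// one named counter inside a "key=value|key=value" statistics string.
public class SessionAggrStatOps {
    public static String addToField(String stats, String field) {
        String[] parts = stats.split("\\|");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            String[] kv = parts[i].split("=");
            if (kv[0].equals(field)) {
                // found the counter: parse it, add one, and write it back
                long newValue = Long.parseLong(kv[1]) + 1;
                parts[i] = field + "=" + newValue;
            }
            if (i > 0) sb.append("|");
            sb.append(parts[i]);
        }
        return sb.toString();
    }
}
```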
(6)
    // random extraction
    randomExtractSession(sc, task.getTaskid(),
            filteredSessionid2AggrInfoRDD, sessionid2detailRDD);

    /**
     * Note: the session aggregation statistics from the previous step can only
     * be read from the Accumulator after some action has triggered a job;
     * before any job runs the Accumulator is empty. We therefore place the
     * random-extraction code before the final computation and persistence of
     * the aggregation statistics, because the extraction uses countByKey, an
     * action that triggers a job.
     */
    // Compute each range's share of the sessions and write it to MySQL
    calculateAndPersistAggrStat(sessionAggrStatAccumulator.value(), task.getTaskid());
    /**
     * Randomly extract sessions
     * @param sessionid2AggrInfoRDD session-level aggregated data
     */
    private static void randomExtractSession(
            JavaSparkContext sc,
            final long taskid,
            JavaPairRDD<String, String> sessionid2AggrInfoRDD,
            JavaPairRDD<String, Row> sessionid2actionRDD) {
        /**
         * Step 1: count the sessions per day and hour
         */
        // Build an RDD in <yyyy-MM-dd_HH, aggrInfo> format
        JavaPairRDD<String, String> time2sessionidRDD = sessionid2AggrInfoRDD.mapToPair(
                new PairFunction<Tuple2<String, String>, String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, String> call(
                    Tuple2<String, String> tuple) throws Exception {
                String aggrInfo = tuple._2;
                String startTime = StringUtils.getFieldFromConcatString(
                        aggrInfo, "\\|", Constants.FIELD_START_TIME);
                String dateHour = DateUtils.getDateHour(startTime);
                return new Tuple2<String, String>(dateHour, aggrInfo);
            }
        });

        // Think before writing code: we need per-day-per-hour session counts,
        // then per-hour extraction indexes, then a pass over each day-and-hour's
        // sessions. The extracted sessions' aggregated data is written to the
        // session_random_extract table, so the value of this first RDD must be
        // the session aggregated data.

        // Count the sessions per day and hour. Note that countByKey can suffer
        // from data skew: most hours may see around 100 thousand visits while
        // the noon peak sees 10 million in a single hour.
        Map<String, Object> countMap = time2sessionidRDD.countByKey();

        /**
         * Step 2: use time-proportional random sampling to compute the
         * extraction indexes for each day and hour
         */
        // Convert the <yyyy-MM-dd_HH, count> map into <yyyy-MM-dd, <HH, count>> format
        Map<String, Map<String, Long>> dateHourCountMap =
                new HashMap<String, Map<String, Long>>();
        for (Map.Entry<String, Object> countEntry : countMap.entrySet()) {
            String dateHour = countEntry.getKey();
            String date = dateHour.split("_")[0];
            String hour = dateHour.split("_")[1];
            long count = Long.valueOf(String.valueOf(countEntry.getValue()));

            Map<String, Long> hourCountMap = dateHourCountMap.get(date);
            if (hourCountMap == null) {
                hourCountMap = new HashMap<String, Long>();
                dateHourCountMap.put(date, hourCountMap);
            }
            hourCountMap.put(hour, count);
        }

        // Time-proportional random sampling: 100 sessions in total, split
        // evenly across the days first
        int extractNumberPerDay = 100 / dateHourCountMap.size();

        // <date, <hour, (3,5,20,102)>>
        // The extraction-index map is a fairly large variable. Using it
        // directly inside an operator would ship a copy of the map to every
        // task, which costs memory and network bandwidth, so it is turned into
        // a broadcast variable below.
        Map<String, Map<String, List<Integer>>> dateHourExtractMap =
                new HashMap<String, Map<String, List<Integer>>>();

        Random random = new Random();

        for (Map.Entry<String, Map<String, Long>> dateHourCountEntry :
                dateHourCountMap.entrySet()) {
            String date = dateHourCountEntry.getKey();
            Map<String, Long> hourCountMap = dateHourCountEntry.getValue();

            // total number of sessions on this day
            long sessionCount = 0L;
            for (long hourCount : hourCountMap.values()) {
                sessionCount += hourCount;
            }

            Map<String, List<Integer>> hourExtractMap = dateHourExtractMap.get(date);
            if (hourExtractMap == null) {
                hourExtractMap = new HashMap<String, List<Integer>>();
                dateHourExtractMap.put(date, hourExtractMap);
            }

            // iterate over each hour
            for (Map.Entry<String, Long> hourCountEntry : hourCountMap.entrySet()) {
                String hour = hourCountEntry.getKey();
                long count = hourCountEntry.getValue();

                // The number of sessions to extract in this hour is the hour's
                // share of the day's sessions multiplied by the per-day quota
                int hourExtractNumber = (int)(((double)count / (double)sessionCount)
                        * extractNumberPerDay);
                if (hourExtractNumber > count) {
                    hourExtractNumber = (int) count;
                }

                // get this hour's list of random indexes
                List<Integer> extractIndexList = hourExtractMap.get(hour);
                if (extractIndexList == null) {
                    extractIndexList = new ArrayList<Integer>();
                    hourExtractMap.put(hour, extractIndexList);
                }

                // generate the computed number of distinct random indexes
                for (int i = 0; i < hourExtractNumber; i++) {
                    int extractIndex = random.nextInt((int) count);
                    while (extractIndexList.contains(extractIndex)) {
                        extractIndex = random.nextInt((int) count);
                    }
                    extractIndexList.add(extractIndex);
                }
            }
        }

        /**
         * Using fastutil is straightforward: a List<Integer> corresponds to an
         * IntList in fastutil
         */
        Map<String, Map<String, IntList>> fastutilDateHourExtractMap =
                new HashMap<String, Map<String, IntList>>();

        for (Map.Entry<String, Map<String, List<Integer>>> dateHourExtractEntry :
                dateHourExtractMap.entrySet()) {
            String date = dateHourExtractEntry.getKey();
            Map<String, List<Integer>> hourExtractMap = dateHourExtractEntry.getValue();

            Map<String, IntList> fastutilHourExtractMap = new HashMap<String, IntList>();

            for (Map.Entry<String, List<Integer>> hourExtractEntry :
                    hourExtractMap.entrySet()) {
                String hour = hourExtractEntry.getKey();
                List<Integer> extractList = hourExtractEntry.getValue();

                IntList fastutilExtractList = new IntArrayList();
                for (int i = 0; i < extractList.size(); i++) {
                    fastutilExtractList.add(extractList.get(i));
                }

                fastutilHourExtractMap.put(hour, fastutilExtractList);
            }

            fastutilDateHourExtractMap.put(date, fastutilHourExtractMap);
        }

        /**
         * Creating a broadcast variable is simple: call SparkContext's
         * broadcast() method with the variable to broadcast
         */
        final Broadcast<Map<String, Map<String, IntList>>> dateHourExtractMapBroadcast =
                sc.broadcast(fastutilDateHourExtractMap);

        /**
         * Step 3: iterate over each day and hour's sessions and extract by the
         * random indexes
         */
        // groupByKey gives <dateHour, (session aggrInfo)>
        JavaPairRDD<String, Iterable<String>> time2sessionsRDD =
                time2sessionidRDD.groupByKey();

        // flatMap over all <dateHour, (session aggrInfo)> records, iterating
        // over each day-and-hour's sessions. Whenever a session falls on one of
        // the random extraction indexes for that day and hour, extract it and
        // write it to MySQL's random_extract_session table. The extracted
        // session ids come back as a new RDD, which is finally joined against
        // the visit detail data and written to the session detail table.
        JavaPairRDD<String, String> extractSessionidsRDD = time2sessionsRDD.flatMapToPair(
                new PairFlatMapFunction<Tuple2<String, Iterable<String>>, String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<Tuple2<String, String>> call(
                    Tuple2<String, Iterable<String>> tuple) throws Exception {
                List<Tuple2<String, String>> extractSessionids =
                        new ArrayList<Tuple2<String, String>>();

                String dateHour = tuple._1;
                String date = dateHour.split("_")[0];
                String hour = dateHour.split("_")[1];
                Iterator<String> iterator = tuple._2.iterator();

                /**
                 * Reading a broadcast variable is just a call to
                 * value() / getValue() on the Broadcast object
                 */
                Map<String, Map<String, IntList>> dateHourExtractMap =
                        dateHourExtractMapBroadcast.value();
                List<Integer> extractIndexList = dateHourExtractMap.get(date).get(hour);

                ISessionRandomExtractDAO sessionRandomExtractDAO =
                        DAOFactory.getSessionRandomExtractDAO();

                int index = 0;
                while (iterator.hasNext()) {
                    String sessionAggrInfo = iterator.next();

                    if (extractIndexList.contains(index)) {
                        String sessionid = StringUtils.getFieldFromConcatString(
                                sessionAggrInfo, "\\|", Constants.FIELD_SESSION_ID);

                        // write the record to MySQL
                        SessionRandomExtract sessionRandomExtract = new SessionRandomExtract();
                        sessionRandomExtract.setTaskid(taskid);
                        sessionRandomExtract.setSessionid(sessionid);
                        sessionRandomExtract.setStartTime(StringUtils.getFieldFromConcatString(
                                sessionAggrInfo, "\\|", Constants.FIELD_START_TIME));
                        sessionRandomExtract.setSearchKeywords(StringUtils.getFieldFromConcatString(
                                sessionAggrInfo, "\\|", Constants.FIELD_SEARCH_KEYWORDS));
                        sessionRandomExtract.setClickCategoryIds(StringUtils.getFieldFromConcatString(
                                sessionAggrInfo, "\\|", Constants.FIELD_CLICK_CATEGORY_IDS));

                        sessionRandomExtractDAO.insert(sessionRandomExtract);

                        // add the sessionid to the list
                        extractSessionids.add(new Tuple2<String, String>(sessionid, sessionid));
                    }

                    index++;
                }

                return extractSessionids;
            }
        });

        /**
         * Step 4: get the detail data of the extracted sessions
         */
        JavaPairRDD<String, Tuple2<String, Row>> extractSessionDetailRDD =
                extractSessionidsRDD.join(sessionid2actionRDD);
        extractSessionDetailRDD.foreachPartition(
                new VoidFunction<Iterator<Tuple2<String, Tuple2<String, Row>>>>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(
                    Iterator<Tuple2<String, Tuple2<String, Row>>> iterator) throws Exception {
                List<SessionDetail> sessionDetails = new ArrayList<SessionDetail>();

                while (iterator.hasNext()) {
                    Tuple2<String, Tuple2<String, Row>> tuple = iterator.next();

                    Row row = tuple._2._2;

                    SessionDetail sessionDetail = new SessionDetail();
                    sessionDetail.setTaskid(taskid);
                    sessionDetail.setUserid(row.getLong(1));
                    sessionDetail.setSessionid(row.getString(2));
                    sessionDetail.setPageid(row.getLong(3));
                    sessionDetail.setActionTime(row.getString(4));
                    sessionDetail.setSearchKeyword(row.getString(5));
                    sessionDetail.setClickCategoryId(row.getLong(6));
                    sessionDetail.setClickProductId(row.getLong(7));
                    sessionDetail.setOrderCategoryIds(row.getString(8));
                    sessionDetail.setOrderProductIds(row.getString(9));
                    sessionDetail.setPayCategoryIds(row.getString(10));
                    sessionDetail.setPayProductIds(row.getString(11));

                    sessionDetails.add(sessionDetail);
                }

                ISessionDetailDAO sessionDetailDAO = DAOFactory.getSessionDetailDAO();
                sessionDetailDAO.insertBatch(sessionDetails);
            }
        });
    }

    /**
     * Compute each session range's share and write it to MySQL
     * @param value the Accumulator's statistics string
     */
    private static void calculateAndPersistAggrStat(String value, long taskid) {
        // read the counters out of the Accumulator's statistics string
        long session_count = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.SESSION_COUNT));

        long visit_length_1s_3s = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.TIME_PERIOD_1s_3s));
        long visit_length_4s_6s = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.TIME_PERIOD_4s_6s));
        long visit_length_7s_9s = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.TIME_PERIOD_7s_9s));
        long visit_length_10s_30s = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.TIME_PERIOD_10s_30s));
        long visit_length_30s_60s = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.TIME_PERIOD_30s_60s));
        long visit_length_1m_3m = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.TIME_PERIOD_1m_3m));
        long visit_length_3m_10m = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.TIME_PERIOD_3m_10m));
        long visit_length_10m_30m = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.TIME_PERIOD_10m_30m));
        long visit_length_30m = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.TIME_PERIOD_30m));

        long step_length_1_3 = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.STEP_PERIOD_1_3));
        long step_length_4_6 = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.STEP_PERIOD_4_6));
        long step_length_7_9 = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.STEP_PERIOD_7_9));
        long step_length_10_30 = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.STEP_PERIOD_10_30));
        long step_length_30_60 = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.STEP_PERIOD_30_60));
        long step_length_60 = Long.valueOf(StringUtils.getFieldFromConcatString(
                value, "\\|", Constants.STEP_PERIOD_60));

        // compute the ratio of each visit-length and step-length range
        double visit_length_1s_3s_ratio = NumberUtils.formatDouble(
                (double)visit_length_1s_3s / (double)session_count, 2);
        double visit_length_4s_6s_ratio = NumberUtils.formatDouble(
                (double)visit_length_4s_6s / (double)session_count, 2);
        double visit_length_7s_9s_ratio = NumberUtils.formatDouble(
                (double)visit_length_7s_9s / (double)session_count, 2);
        double visit_length_10s_30s_ratio = NumberUtils.formatDouble(
                (double)visit_length_10s_30s / (double)session_count, 2);
        double visit_length_30s_60s_ratio = NumberUtils.formatDouble(
                (double)visit_length_30s_60s / (double)session_count, 2);
        double visit_length_1m_3m_ratio = NumberUtils.formatDouble(
                (double)visit_length_1m_3m / (double)session_count, 2);
        double visit_length_3m_10m_ratio = NumberUtils.formatDouble(
                (double)visit_length_3m_10m / (double)session_count, 2);
        double visit_length_10m_30m_ratio = NumberUtils.formatDouble(
                (double)visit_length_10m_30m / (double)session_count, 2);
        double visit_length_30m_ratio = NumberUtils.formatDouble(
                (double)visit_length_30m / (double)session_count, 2);

        double step_length_1_3_ratio = NumberUtils.formatDouble(
                (double)step_length_1_3 / (double)session_count, 2);
        double step_length_4_6_ratio = NumberUtils.formatDouble(
                (double)step_length_4_6 / (double)session_count, 2);
        double step_length_7_9_ratio = NumberUtils.formatDouble(
                (double)step_length_7_9 / (double)session_count, 2);
        double step_length_10_30_ratio = NumberUtils.formatDouble(
                (double)step_length_10_30 / (double)session_count, 2);
        double step_length_30_60_ratio = NumberUtils.formatDouble(
                (double)step_length_30_60 / (double)session_count, 2);
        double step_length_60_ratio = NumberUtils.formatDouble(
                (double)step_length_60 / (double)session_count, 2);

        // wrap the statistics in a domain object
        SessionAggrStat sessionAggrStat = new SessionAggrStat();
        sessionAggrStat.setTaskid(taskid);
        sessionAggrStat.setSession_count(session_count);
        sessionAggrStat.setVisit_length_1s_3s_ratio(visit_length_1s_3s_ratio);
        sessionAggrStat.setVisit_length_4s_6s_ratio(visit_length_4s_6s_ratio);
        sessionAggrStat.setVisit_length_7s_9s_ratio(visit_length_7s_9s_ratio);
        sessionAggrStat.setVisit_length_10s_30s_ratio(visit_length_10s_30s_ratio);
        sessionAggrStat.setVisit_length_30s_60s_ratio(visit_length_30s_60s_ratio);
        sessionAggrStat.setVisit_length_1m_3m_ratio(visit_length_1m_3m_ratio);
        sessionAggrStat.setVisit_length_3m_10m_ratio(visit_length_3m_10m_ratio);
        sessionAggrStat.setVisit_length_10m_30m_ratio(visit_length_10m_30m_ratio);
        sessionAggrStat.setVisit_length_30m_ratio(visit_length_30m_ratio);
        sessionAggrStat.setStep_length_1_3_ratio(step_length_1_3_ratio);
        sessionAggrStat.setStep_length_4_6_ratio(step_length_4_6_ratio);
        sessionAggrStat.setStep_length_7_9_ratio(step_length_7_9_ratio);
        sessionAggrStat.setStep_length_10_30_ratio(step_length_10_30_ratio);
        sessionAggrStat.setStep_length_30_60_ratio(step_length_30_60_ratio);
        sessionAggrStat.setStep_length_60_ratio(step_length_60_ratio);

        // insert the statistics through the corresponding DAO
        ISessionAggrStatDAO sessionAggrStatDAO = DAOFactory.getSessionAggrStatDAO();
        sessionAggrStatDAO.insert(sessionAggrStat);
    }
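The time-proportional allocation in step 2 above can be isolated into a small standalone sketch: each hour receives a share of the per-day quota proportional to its share of that day's sessions, capped at the hour's own session count (class and method names here are illustrative, not the project's):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch of the time-proportional allocation used in step (6).
public class ProportionalAllocation {
    public static Map<String, Integer> allocate(
            Map<String, Long> hourCountMap, int extractNumberPerDay) {
        // total number of sessions on this day
        long sessionCount = 0L;
        for (long c : hourCountMap.values()) sessionCount += c;

        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : hourCountMap.entrySet()) {
            long count = e.getValue();
            // hour's share of the day's sessions times the per-day quota
            int n = (int)(((double) count / (double) sessionCount)
                    * extractNumberPerDay);
            // never ask for more samples than the hour actually has
            if (n > count) n = (int) count;
            result.put(e.getKey(), n);
        }
        return result;
    }
}
```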
```java
/**
 * Compute the proportion of sessions in each visit-length and step-length
 * bucket, and write the result to MySQL
 * @param value  the accumulator's concatenated statistics string
 * @param taskid the current task id
 */
private static void calculateAndPersistAggrStat(String value, long taskid) {
    // Extract each counter from the accumulator's concatenated string
    long session_count = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.SESSION_COUNT));

    long visit_length_1s_3s = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.TIME_PERIOD_1s_3s));
    long visit_length_4s_6s = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.TIME_PERIOD_4s_6s));
    long visit_length_7s_9s = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.TIME_PERIOD_7s_9s));
    long visit_length_10s_30s = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.TIME_PERIOD_10s_30s));
    long visit_length_30s_60s = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.TIME_PERIOD_30s_60s));
    long visit_length_1m_3m = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.TIME_PERIOD_1m_3m));
    long visit_length_3m_10m = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.TIME_PERIOD_3m_10m));
    long visit_length_10m_30m = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.TIME_PERIOD_10m_30m));
    long visit_length_30m = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.TIME_PERIOD_30m));

    long step_length_1_3 = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.STEP_PERIOD_1_3));
    long step_length_4_6 = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.STEP_PERIOD_4_6));
    long step_length_7_9 = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.STEP_PERIOD_7_9));
    long step_length_10_30 = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.STEP_PERIOD_10_30));
    long step_length_30_60 = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.STEP_PERIOD_30_60));
    long step_length_60 = Long.valueOf(StringUtils.getFieldFromConcatString(
            value, "\\|", Constants.STEP_PERIOD_60));

    // Compute the ratio of each visit-length and step-length range
    double visit_length_1s_3s_ratio = NumberUtils.formatDouble(
            (double)visit_length_1s_3s / (double)session_count, 2);
    double visit_length_4s_6s_ratio = NumberUtils.formatDouble(
            (double)visit_length_4s_6s / (double)session_count, 2);
    double visit_length_7s_9s_ratio = NumberUtils.formatDouble(
            (double)visit_length_7s_9s / (double)session_count, 2);
    double visit_length_10s_30s_ratio = NumberUtils.formatDouble(
            (double)visit_length_10s_30s / (double)session_count, 2);
    double visit_length_30s_60s_ratio = NumberUtils.formatDouble(
            (double)visit_length_30s_60s / (double)session_count, 2);
    double visit_length_1m_3m_ratio = NumberUtils.formatDouble(
            (double)visit_length_1m_3m / (double)session_count, 2);
    double visit_length_3m_10m_ratio = NumberUtils.formatDouble(
            (double)visit_length_3m_10m / (double)session_count, 2);
    double visit_length_10m_30m_ratio = NumberUtils.formatDouble(
            (double)visit_length_10m_30m / (double)session_count, 2);
    double visit_length_30m_ratio = NumberUtils.formatDouble(
            (double)visit_length_30m / (double)session_count, 2);

    double step_length_1_3_ratio = NumberUtils.formatDouble(
            (double)step_length_1_3 / (double)session_count, 2);
    double step_length_4_6_ratio = NumberUtils.formatDouble(
            (double)step_length_4_6 / (double)session_count, 2);
    double step_length_7_9_ratio = NumberUtils.formatDouble(
            (double)step_length_7_9 / (double)session_count, 2);
    double step_length_10_30_ratio = NumberUtils.formatDouble(
            (double)step_length_10_30 / (double)session_count, 2);
    double step_length_30_60_ratio = NumberUtils.formatDouble(
            (double)step_length_30_60 / (double)session_count, 2);
    double step_length_60_ratio = NumberUtils.formatDouble(
            (double)step_length_60 / (double)session_count, 2);

    // Wrap the statistics in a domain object
    SessionAggrStat sessionAggrStat = new SessionAggrStat();
    sessionAggrStat.setTaskid(taskid);
    sessionAggrStat.setSession_count(session_count);
    sessionAggrStat.setVisit_length_1s_3s_ratio(visit_length_1s_3s_ratio);
    sessionAggrStat.setVisit_length_4s_6s_ratio(visit_length_4s_6s_ratio);
    sessionAggrStat.setVisit_length_7s_9s_ratio(visit_length_7s_9s_ratio);
    sessionAggrStat.setVisit_length_10s_30s_ratio(visit_length_10s_30s_ratio);
    sessionAggrStat.setVisit_length_30s_60s_ratio(visit_length_30s_60s_ratio);
    sessionAggrStat.setVisit_length_1m_3m_ratio(visit_length_1m_3m_ratio);
    sessionAggrStat.setVisit_length_3m_10m_ratio(visit_length_3m_10m_ratio);
    sessionAggrStat.setVisit_length_10m_30m_ratio(visit_length_10m_30m_ratio);
    sessionAggrStat.setVisit_length_30m_ratio(visit_length_30m_ratio);
    sessionAggrStat.setStep_length_1_3_ratio(step_length_1_3_ratio);
    sessionAggrStat.setStep_length_4_6_ratio(step_length_4_6_ratio);
    sessionAggrStat.setStep_length_7_9_ratio(step_length_7_9_ratio);
    sessionAggrStat.setStep_length_10_30_ratio(step_length_10_30_ratio);
    sessionAggrStat.setStep_length_30_60_ratio(step_length_30_60_ratio);
    sessionAggrStat.setStep_length_60_ratio(step_length_60_ratio);

    // Insert the statistics via the corresponding DAO
    ISessionAggrStatDAO sessionAggrStatDAO = DAOFactory.getSessionAggrStatDAO();
    sessionAggrStatDAO.insert(sessionAggrStat);
}
```
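The method above leans on the project's own `StringUtils.getFieldFromConcatString` helper, whose source is not shown here. A minimal sketch of how such a helper might behave (the class name and behavior are assumptions for illustration, not the project's actual implementation): the accumulator serializes counters as `"session_count=10|1s_3s=2|..."`, and the helper extracts one field's value by name.

```java
// Hypothetical sketch of a getFieldFromConcatString-style helper.
class ConcatStringUtil {
    public static String getFieldFromConcatString(
            String str, String delimiter, String field) {
        // delimiter is a regex, e.g. "\\|" for a '|'-separated string
        for (String kv : str.split(delimiter)) {
            String[] pair = kv.split("=");
            if (pair.length == 2 && pair[0].equals(field)) {
                return pair[1];
            }
        }
        // field not present in the concatenated string
        return null;
    }
}
```

Note the delimiter is passed as a regular expression, which is why the calls above write `"\\|"` rather than `"|"`.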
(7) Get the top 10 hot categories
```java
// Get the top 10 hot categories
List<Tuple2<CategorySortKey, String>> top10CategoryList =
        getTop10Category(task.getTaskid(), sessionid2detailRDD);
```
```java
/**
 * Get the top 10 hot categories
 * @param taskid              the current task id
 * @param sessionid2detailRDD action detail rows keyed by sessionid
 */
private static List<Tuple2<CategorySortKey, String>> getTop10Category(
        long taskid,
        JavaPairRDD<String, Row> sessionid2detailRDD) {
    /**
     * Step 1: get all categories visited by the qualifying sessions
     */
    // Get the ids of all categories the sessions have "visited":
    // a visit here means a click, an order, or a payment on the category
    JavaPairRDD<Long, Long> categoryidRDD = sessionid2detailRDD.flatMapToPair(

            new PairFlatMapFunction<Tuple2<String,Row>, Long, Long>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Iterable<Tuple2<Long, Long>> call(
                        Tuple2<String, Row> tuple) throws Exception {
                    Row row = tuple._2;

                    List<Tuple2<Long, Long>> list = new ArrayList<Tuple2<Long, Long>>();

                    // row.getLong(6) returns a primitive long, so a null check on the
                    // boxed value could never fail; check the raw column instead
                    if (row.get(6) != null) {
                        long clickCategoryId = row.getLong(6);
                        list.add(new Tuple2<Long, Long>(clickCategoryId, clickCategoryId));
                    }

                    String orderCategoryIds = row.getString(8);
                    if (orderCategoryIds != null) {
                        String[] orderCategoryIdsSplited = orderCategoryIds.split(",");
                        for (String orderCategoryId : orderCategoryIdsSplited) {
                            list.add(new Tuple2<Long, Long>(Long.valueOf(orderCategoryId),
                                    Long.valueOf(orderCategoryId)));
                        }
                    }

                    String payCategoryIds = row.getString(10);
                    if (payCategoryIds != null) {
                        String[] payCategoryIdsSplited = payCategoryIds.split(",");
                        for (String payCategoryId : payCategoryIdsSplited) {
                            list.add(new Tuple2<Long, Long>(Long.valueOf(payCategoryId),
                                    Long.valueOf(payCategoryId)));
                        }
                    }

                    return list;
                }

            });

    /**
     * Deduplication is mandatory here.
     * Without it there would be duplicate categoryids; the sort would rank the
     * duplicates together with their countInfo, and we could well end up taking
     * duplicate entries in the top 10.
     */
    categoryidRDD = categoryidRDD.distinct();

    /**
     * Step 2: count clicks, orders and payments per category
     */
    // The action details contain three relevant behaviors: click, order and pay.
    // To count each one per category, first filter the detail data down to
    // that behavior, then count with mapToPair and reduceByKey.

    // Count clicks per category
    JavaPairRDD<Long, Long> clickCategoryId2CountRDD =
            getClickCategoryId2CountRDD(sessionid2detailRDD);
    // Count orders per category
    JavaPairRDD<Long, Long> orderCategoryId2CountRDD =
            getOrderCategoryId2CountRDD(sessionid2detailRDD);
    // Count payments per category
    JavaPairRDD<Long, Long> payCategoryId2CountRDD =
            getPayCategoryId2CountRDD(sessionid2detailRDD);

    /**
     * Step 3: join each category with its click, order and payment counts
     *
     * categoryidRDD contains the ids of every category visited by the
     * qualifying sessions.
     *
     * The three count RDDs computed above may not cover every category:
     * some categories were only clicked, never ordered or paid for.
     *
     * So a plain join won't do; we need leftOuterJoin. If categoryidRDD
     * cannot join one of its entries to a click, order or payment count,
     * the entry is still kept, and the missing count simply becomes 0.
     */
    JavaPairRDD<Long, String> categoryid2countRDD = joinCategoryAndData(
            categoryidRDD, clickCategoryId2CountRDD, orderCategoryId2CountRDD,
            payCategoryId2CountRDD);

    /**
     * Step 4: define a custom secondary-sort key (CategorySortKey)
     */

    /**
     * Step 5: map the data into <CategorySortKey, info> pairs,
     * then do the secondary sort (descending)
     */
    JavaPairRDD<CategorySortKey, String> sortKey2countRDD = categoryid2countRDD.mapToPair(

            new PairFunction<Tuple2<Long,String>, CategorySortKey, String>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Tuple2<CategorySortKey, String> call(
                        Tuple2<Long, String> tuple) throws Exception {
                    String countInfo = tuple._2;
                    long clickCount = Long.valueOf(StringUtils.getFieldFromConcatString(
                            countInfo, "\\|", Constants.FIELD_CLICK_COUNT));
                    long orderCount = Long.valueOf(StringUtils.getFieldFromConcatString(
                            countInfo, "\\|", Constants.FIELD_ORDER_COUNT));
                    long payCount = Long.valueOf(StringUtils.getFieldFromConcatString(
                            countInfo, "\\|", Constants.FIELD_PAY_COUNT));

                    CategorySortKey sortKey = new CategorySortKey(clickCount,
                            orderCount, payCount);

                    return new Tuple2<CategorySortKey, String>(sortKey, countInfo);
                }

            });

    JavaPairRDD<CategorySortKey, String> sortedCategoryCountRDD =
            sortKey2countRDD.sortByKey(false);

    /**
     * Step 6: take(10) to get the top 10 hot categories, and write them to MySQL
     */
    ITop10CategoryDAO top10CategoryDAO = DAOFactory.getTop10CategoryDAO();

    List<Tuple2<CategorySortKey, String>> top10CategoryList =
            sortedCategoryCountRDD.take(10);

    for (Tuple2<CategorySortKey, String> tuple : top10CategoryList) {
        String countInfo = tuple._2;
        long categoryid = Long.valueOf(StringUtils.getFieldFromConcatString(
                countInfo, "\\|", Constants.FIELD_CATEGORY_ID));
        long clickCount = Long.valueOf(StringUtils.getFieldFromConcatString(
                countInfo, "\\|", Constants.FIELD_CLICK_COUNT));
        long orderCount = Long.valueOf(StringUtils.getFieldFromConcatString(
                countInfo, "\\|", Constants.FIELD_ORDER_COUNT));
        long payCount = Long.valueOf(StringUtils.getFieldFromConcatString(
                countInfo, "\\|", Constants.FIELD_PAY_COUNT));

        Top10Category category = new Top10Category();
        category.setTaskid(taskid);
        category.setCategoryid(categoryid);
        category.setClickCount(clickCount);
        category.setOrderCount(orderCount);
        category.setPayCount(payCount);

        top10CategoryDAO.insert(category);
    }

    return top10CategoryList;
}
```
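The `CategorySortKey` used in step 5 is the project's custom secondary-sort key, and its source is not shown in this post. A minimal sketch of such a key (an assumed shape for illustration; the project's actual class may instead implement Scala's `Ordered` trait, which `sortByKey` also accepts): it compares click count first, breaking ties by order count, then pay count, so `sortByKey(false)` puts the most-clicked category first.

```java
import java.io.Serializable;

// Hypothetical sketch of a secondary-sort key for the top 10 category ranking.
class CategorySortKey implements Comparable<CategorySortKey>, Serializable {

    private static final long serialVersionUID = 1L;

    private final long clickCount;
    private final long orderCount;
    private final long payCount;

    public CategorySortKey(long clickCount, long orderCount, long payCount) {
        this.clickCount = clickCount;
        this.orderCount = orderCount;
        this.payCount = payCount;
    }

    @Override
    public int compareTo(CategorySortKey other) {
        // Primary criterion: click count
        if (clickCount != other.clickCount) {
            return Long.compare(clickCount, other.clickCount);
        }
        // Tie-breaker 1: order count
        if (orderCount != other.orderCount) {
            return Long.compare(orderCount, other.orderCount);
        }
        // Tie-breaker 2: pay count
        return Long.compare(payCount, other.payCount);
    }
}
```

Because `compareTo` defines the natural ascending order, passing `false` to `sortByKey` reverses it into the descending ranking the top 10 needs.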
```java
/**
 * Get the per-category click-count RDD
 * @param sessionid2detailRDD action detail rows keyed by sessionid
 * @return per-category click counts
 */
private static JavaPairRDD<Long, Long> getClickCategoryId2CountRDD(
        JavaPairRDD<String, Row> sessionid2detailRDD) {
    /**
     * A note on this step:
     *
     * Here we filter the full detail data down to click actions only.
     * Click actions are just a small fraction of the total data, so after
     * the filter the amount of data per partition is very likely to be
     * uneven, and the overall volume much smaller.
     *
     * That is exactly the situation where coalesce is worth using:
     * reduce the number of partitions after the filter.
     */
    JavaPairRDD<String, Row> clickActionRDD = sessionid2detailRDD.filter(

            new Function<Tuple2<String,Row>, Boolean>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Boolean call(Tuple2<String, Row> tuple) throws Exception {
                    Row row = tuple._2;
                    return row.get(6) != null;
                }

            });
            // .coalesce(100);

    /**
     * About the commented-out coalesce:
     *
     * We are running in local mode here, mainly for testing, so there is no
     * need to set partition counts or parallelism: local mode simulates the
     * cluster inside a single process, performs well on its own, and already
     * applies some internal optimization of parallelism and partition counts.
     *
     * Setting it again ourselves would be gilding the lily; the point is
     * simply to show where coalesce would be used.
     */

    JavaPairRDD<Long, Long> clickCategoryIdRDD = clickActionRDD.mapToPair(

            new PairFunction<Tuple2<String,Row>, Long, Long>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Tuple2<Long, Long> call(Tuple2<String, Row> tuple)
                        throws Exception {
                    long clickCategoryId = tuple._2.getLong(6);
                    return new Tuple2<Long, Long>(clickCategoryId, 1L);
                }

            });

    /**
     * Count the clicks per category.
     *
     * Note: if one category is clicked 10 million times while every other
     * category gets 100 thousand, this reduceByKey will also suffer from
     * data skew.
     */
    JavaPairRDD<Long, Long> clickCategoryId2CountRDD = clickCategoryIdRDD.reduceByKey(

            new Function2<Long, Long, Long>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Long call(Long v1, Long v2) throws Exception {
                    return v1 + v2;
                }

            });

    return clickCategoryId2CountRDD;
}
```
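Stripped of the Spark plumbing, the filter → mapToPair → reduceByKey pipeline above is just a word count over category ids. A driver-side sketch of the same counting logic in plain Java (hypothetical sample data, no Spark) may make the data flow easier to see:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local-collection equivalent of the click-count pipeline above.
class ClickCountSketch {

    // Drop null category ids (the filter step), then count occurrences
    // per id (the mapToPair + reduceByKey steps).
    public static Map<Long, Long> countClicks(List<Long> clickCategoryIds) {
        Map<Long, Long> counts = new HashMap<>();
        for (Long id : clickCategoryIds) {
            if (id == null) {
                continue;                    // filter: keep click actions only
            }
            counts.merge(id, 1L, Long::sum); // reduceByKey: sum the 1L's per id
        }
        return counts;
    }
}
```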
```java
/**
 * Join the category RDD with the count data
 * @param categoryidRDD
 * @param clickCategoryId2CountRDD
 * @param orderCategoryId2CountRDD
 * @param payCategoryId2CountRDD
 * @return categoryid mapped to a concatenated count string
 */
private static JavaPairRDD<Long, String> joinCategoryAndData(
        JavaPairRDD<Long, Long> categoryidRDD,
        JavaPairRDD<Long, Long> clickCategoryId2CountRDD,
        JavaPairRDD<Long, Long> orderCategoryId2CountRDD,
        JavaPairRDD<Long, Long> payCategoryId2CountRDD) {
    // With leftOuterJoin, the right-hand RDD may have no value to contribute,
    // so the second element of the Tuple is an Optional<Long>: it may hold a
    // value, or it may be absent.
    JavaPairRDD<Long, Tuple2<Long, Optional<Long>>> tmpJoinRDD =
            categoryidRDD.leftOuterJoin(clickCategoryId2CountRDD);

    JavaPairRDD<Long, String> tmpMapRDD = tmpJoinRDD.mapToPair(

            new PairFunction<Tuple2<Long,Tuple2<Long,Optional<Long>>>, Long, String>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Tuple2<Long, String> call(
                        Tuple2<Long, Tuple2<Long, Optional<Long>>> tuple)
                        throws Exception {
                    long categoryid = tuple._1;
                    Optional<Long> optional = tuple._2._2;
                    long clickCount = 0L;

                    if (optional.isPresent()) {
                        clickCount = optional.get();
                    }

                    String value = Constants.FIELD_CATEGORY_ID + "=" + categoryid + "|" +
                            Constants.FIELD_CLICK_COUNT + "=" + clickCount;

                    return new Tuple2<Long, String>(categoryid, value);
                }

            });

    tmpMapRDD = tmpMapRDD.leftOuterJoin(orderCategoryId2CountRDD).mapToPair(

            new PairFunction<Tuple2<Long,Tuple2<String,Optional<Long>>>, Long, String>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Tuple2<Long, String> call(
                        Tuple2<Long, Tuple2<String, Optional<Long>>> tuple)
                        throws Exception {
                    long categoryid = tuple._1;
                    String value = tuple._2._1;

                    Optional<Long> optional = tuple._2._2;
                    long orderCount = 0L;

                    if (optional.isPresent()) {
                        orderCount = optional.get();
                    }

                    value = value + "|" + Constants.FIELD_ORDER_COUNT + "=" + orderCount;

                    return new Tuple2<Long, String>(categoryid, value);
                }

            });

    tmpMapRDD = tmpMapRDD.leftOuterJoin(payCategoryId2CountRDD).mapToPair(

            new PairFunction<Tuple2<Long,Tuple2<String,Optional<Long>>>, Long, String>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Tuple2<Long, String> call(
                        Tuple2<Long, Tuple2<String, Optional<Long>>> tuple)
                        throws Exception {
                    long categoryid = tuple._1;
                    String value = tuple._2._1;

                    Optional<Long> optional = tuple._2._2;
                    long payCount = 0L;

                    if (optional.isPresent()) {
                        payCount = optional.get();
                    }

                    value = value + "|" + Constants.FIELD_PAY_COUNT + "=" + payCount;

                    return new Tuple2<Long, String>(categoryid, value);
                }

            });

    return tmpMapRDD;
}
```
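The default-to-zero pattern repeated three times above is worth isolating. In this Spark version, `leftOuterJoin` wraps the right-hand value in an `Optional` (Guava's, in Spark 1.x); the sketch below uses `java.util.Optional` as a stand-in to show the same logic:

```java
import java.util.Optional;

// After a leftOuterJoin, a category with no matching count row comes back
// with an absent Optional; treat that as a count of 0.
class LeftJoinDefault {

    public static long countOrZero(Optional<Long> joined) {
        return joined.isPresent() ? joined.get() : 0L;
    }
}
```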
(8) Get the top 10 active sessions for each of the top 10 hot categories
```java
getTop10Session(sc, task.getTaskid(), top10CategoryList, sessionid2detailRDD);
```
```java
/**
 * Get the top 10 active sessions
 * @param sc                  the JavaSparkContext
 * @param taskid              the current task id
 * @param top10CategoryList   the top 10 hot categories
 * @param sessionid2detailRDD action detail rows keyed by sessionid
 */
private static void getTop10Session(
        JavaSparkContext sc,
        final long taskid,
        List<Tuple2<CategorySortKey, String>> top10CategoryList,
        JavaPairRDD<String, Row> sessionid2detailRDD) {
    /**
     * Step 1: turn the ids of the top 10 hot categories into an RDD
     */
    List<Tuple2<Long, Long>> top10CategoryIdList =
            new ArrayList<Tuple2<Long, Long>>();

    for (Tuple2<CategorySortKey, String> category : top10CategoryList) {
        long categoryid = Long.valueOf(StringUtils.getFieldFromConcatString(
                category._2, "\\|", Constants.FIELD_CATEGORY_ID));
        top10CategoryIdList.add(new Tuple2<Long, Long>(categoryid, categoryid));
    }

    JavaPairRDD<Long, Long> top10CategoryIdRDD =
            sc.parallelizePairs(top10CategoryIdList);

    /**
     * Step 2: count each session's clicks on the top 10 categories
     */
    JavaPairRDD<String, Iterable<Row>> sessionid2detailsRDD =
            sessionid2detailRDD.groupByKey();

    JavaPairRDD<Long, String> categoryid2sessionCountRDD =
            sessionid2detailsRDD.flatMapToPair(

            new PairFlatMapFunction<Tuple2<String,Iterable<Row>>, Long, String>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Iterable<Tuple2<Long, String>> call(
                        Tuple2<String, Iterable<Row>> tuple) throws Exception {
                    String sessionid = tuple._1;
                    Iterator<Row> iterator = tuple._2.iterator();

                    Map<Long, Long> categoryCountMap = new HashMap<Long, Long>();

                    // Count this session's clicks on each category
                    while (iterator.hasNext()) {
                        Row row = iterator.next();

                        if (row.get(6) != null) {
                            long categoryid = row.getLong(6);

                            Long count = categoryCountMap.get(categoryid);
                            if (count == null) {
                                count = 0L;
                            }

                            count++;

                            categoryCountMap.put(categoryid, count);
                        }
                    }

                    // Return the result as <categoryid, "sessionid,count"> pairs
                    List<Tuple2<Long, String>> list =
                            new ArrayList<Tuple2<Long, String>>();

                    for (Map.Entry<Long, Long> categoryCountEntry :
                            categoryCountMap.entrySet()) {
                        long categoryid = categoryCountEntry.getKey();
                        long count = categoryCountEntry.getValue();
                        String value = sessionid + "," + count;
                        list.add(new Tuple2<Long, String>(categoryid, value));
                    }

                    return list;
                }

            });

    // Get the click counts on the top 10 hot categories, per session
    JavaPairRDD<Long, String> top10CategorySessionCountRDD = top10CategoryIdRDD
            .join(categoryid2sessionCountRDD)
            .mapToPair(new PairFunction<Tuple2<Long,Tuple2<Long,String>>, Long, String>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Tuple2<Long, String> call(
                        Tuple2<Long, Tuple2<Long, String>> tuple) throws Exception {
                    return new Tuple2<Long, String>(tuple._1, tuple._2._2);
                }

            });

    /**
     * Step 3: grouped top-N: get the top 10 active sessions of each category
     */
    JavaPairRDD<Long, Iterable<String>> top10CategorySessionCountsRDD =
            top10CategorySessionCountRDD.groupByKey();

    JavaPairRDD<String, String> top10SessionRDD =
            top10CategorySessionCountsRDD.flatMapToPair(

            new PairFlatMapFunction<Tuple2<Long,Iterable<String>>, String, String>() {

                private static final long serialVersionUID = 1L;

                @Override
                public Iterable<Tuple2<String, String>> call(
                        Tuple2<Long, Iterable<String>> tuple) throws Exception {
                    long categoryid = tuple._1;
                    Iterator<String> iterator = tuple._2.iterator();

                    // Fixed-size array used for the top-N insertion sort
                    String[] top10Sessions = new String[10];

                    while (iterator.hasNext()) {
                        String sessionCount = iterator.next();
                        long count = Long.valueOf(sessionCount.split(",")[1]);

                        // Walk the sorted array
                        for (int i = 0; i < top10Sessions.length; i++) {
                            // If slot i is empty, put the current sessionCount there
                            if (top10Sessions[i] == null) {
                                top10Sessions[i] = sessionCount;
                                break;
                            } else {
                                long _count = Long.valueOf(
                                        top10Sessions[i].split(",")[1]);

                                // If the current count is larger than slot i's count
                                if (count > _count) {
                                    // shift everything from the last slot down to i
                                    // one position to the right
                                    for (int j = 9; j > i; j--) {
                                        top10Sessions[j] = top10Sessions[j - 1];
                                    }
                                    // put the current sessionCount into slot i
                                    top10Sessions[i] = sessionCount;
                                    break;
                                }

                                // otherwise it is smaller: keep scanning
                            }
                        }
                    }

                    // Write the results to the MySQL table
                    List<Tuple2<String, String>> list =
                            new ArrayList<Tuple2<String, String>>();

                    for (String sessionCount : top10Sessions) {
                        if (sessionCount != null) {
                            String sessionid = sessionCount.split(",")[0];
                            long count = Long.valueOf(sessionCount.split(",")[1]);

                            // Insert the top 10 session into the MySQL table
                            Top10Session top10Session = new Top10Session();
                            top10Session.setTaskid(taskid);
                            top10Session.setCategoryid(categoryid);
                            top10Session.setSessionid(sessionid);
                            top10Session.setClickCount(count);

                            ITop10SessionDAO top10SessionDAO =
                                    DAOFactory.getTop10SessionDAO();
                            top10SessionDAO.insert(top10Session);

                            list.add(new Tuple2<String, String>(sessionid, sessionid));
                        }
                    }

                    return list;
                }

            });

    /**
     * Step 4: get the detail data of the top 10 active sessions, write it to MySQL
     */
    JavaPairRDD<String, Tuple2<String, Row>> sessionDetailRDD =
            top10SessionRDD.join(sessionid2detailRDD);

    sessionDetailRDD.foreach(new VoidFunction<Tuple2<String,Tuple2<String,Row>>>() {

        private static final long serialVersionUID = 1L;

        @Override
        public void call(Tuple2<String, Tuple2<String, Row>> tuple) throws Exception {
            Row row = tuple._2._2;

            SessionDetail sessionDetail = new SessionDetail();
            sessionDetail.setTaskid(taskid);
            sessionDetail.setUserid(row.getLong(1));
            sessionDetail.setSessionid(row.getString(2));
            sessionDetail.setPageid(row.getLong(3));
            sessionDetail.setActionTime(row.getString(4));
            sessionDetail.setSearchKeyword(row.getString(5));
            sessionDetail.setClickCategoryId(row.getLong(6));
            sessionDetail.setClickProductId(row.getLong(7));
            sessionDetail.setOrderCategoryIds(row.getString(8));
            sessionDetail.setOrderProductIds(row.getString(9));
            sessionDetail.setPayCategoryIds(row.getString(10));
            sessionDetail.setPayProductIds(row.getString(11));

            ISessionDetailDAO sessionDetailDAO = DAOFactory.getSessionDetailDAO();
            sessionDetailDAO.insert(sessionDetail);
        }
    });
}
```
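The core of step 3 is the fixed-size-array insertion used to keep the 10 largest `"sessionid,count"` strings per category. The same algorithm, lifted out of the Spark closure as a hypothetical standalone helper (generalized to any array length), can be sketched and tested on its own:

```java
// Standalone version of the top-N insertion loop from getTop10Session.
class TopNArray {

    // Insert a "sessionid,count" string into an array kept in descending
    // order of count; entries that fall off the end are discarded.
    public static void insert(String[] topN, String sessionCount) {
        long count = Long.valueOf(sessionCount.split(",")[1]);

        for (int i = 0; i < topN.length; i++) {
            // Empty slot: place the entry here
            if (topN[i] == null) {
                topN[i] = sessionCount;
                break;
            }

            long existing = Long.valueOf(topN[i].split(",")[1]);
            if (count > existing) {
                // Shift everything from the last slot down to i one to the right
                for (int j = topN.length - 1; j > i; j--) {
                    topN[j] = topN[j - 1];
                }
                topN[i] = sessionCount;
                break;
            }
            // Smaller: keep scanning toward the tail
        }
    }
}
```

For N = 10 this is O(N) per insertion, which is fine here; for a large N, a bounded min-heap (`PriorityQueue`) would be the more scalable choice.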