Using Spark to count production log requests grouped by IP


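The production log format is as follows.


Every request logs the client IP in the x-forwarded-for field (the part marked in red in the original screenshot).

Take one line and see how to parse it: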

2017-12-01 09:57:11.970 [http-nio-8082-exec-2] INFO  - com.fullshare.common.aop.ControllerAop                       [ 144] - 请求head:{content-type=application/json, platform=ios, requestsign=f8ea2ff2af562ac5665ada231317a66b, accept-language=zh-Hans-CN;q=1, en-GB;q=0.9, host=tapi.fshtop.com, x-forwarded-for=192.168.132.167, accept=application/json, appid=123456, appversion=2.5, user-agent=FullShareTop/2.5 (iPhone; iOS 10.3.2; Scale/2.00), authorization=072a2431f2bd6cf8108eb3231488cb6dfcc6e11eead3d04283f67762313b2259b937d07358140ef1acf6c6963f8ad42bb088f3223638244e, osversion=10.3.2, mode=iPhone7,2, deviceid=88793C63-994E-44FF-A8BA-506B3897C963, clienttime=1512093518629, content-length=67, brand=iphone, channel=appstore, idfa=CC2E3934-6C3E-4E64-9894-02603E7CED3A}
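
Before bringing Spark in, the parsing step can be tried on a single line in plain Java. The following is only a minimal sketch, assuming the header fields are separated by commas and written as key=value, as in the sample line above. The class name IpExtractor, the method extractIp and the "unknown" fallback are hypothetical names chosen for illustration, and the sample string is an abbreviated copy of the log line.

```java
// Hypothetical helper for illustration; these names are not from the original post.
final class IpExtractor {
    // Fallback returned when the line has no usable x-forwarded-for field (an assumption).
    static final String UNKNOWN = "unknown";

    private IpExtractor() {}

    /** Splits the line on commas and returns the value of the x-forwarded-for field. */
    static String extractIp(String line) {
        for (String field : line.split(",")) {
            if (field.contains("x-forwarded-for")) {
                // Limit 2 keeps any '=' inside the value intact.
                String[] kv = field.split("=", 2);
                if (kv.length > 1) {
                    String value = kv[1].trim();
                    // If the field is the last one, drop the closing brace of the header map.
                    if (value.endsWith("}")) {
                        value = value.substring(0, value.length() - 1).trim();
                    }
                    if (!value.isEmpty()) {
                        return value;
                    }
                }
            }
        }
        return UNKNOWN;
    }

    public static void main(String[] args) {
        // Abbreviated version of the sample log line.
        String sample = "{content-type=application/json, host=tapi.fshtop.com, "
                + "x-forwarded-for=192.168.132.167, accept=application/json}";
        System.out.println(extractIp(sample)); // prints 192.168.132.167
    }
}
```

If a proxy chain writes several addresses into x-forwarded-for separated by commas, this sketch returns only the first one, because the line is split on commas before the field is examined.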


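Now we can write the code.

I won't cover installing Hadoop and Spark here, since there are plenty of guides online. Adding the required jars is covered in another article of mine.

The code is as follows: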

package test.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

/**
 * Counts the requests in the application log, grouped by client IP.
 *
 * @author huangjiangnan
 */
public class FilterLine {

    @SuppressWarnings("resource")
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("spark://192.168.7.202:7077")
                .setAppName(FilterLine.class.getName());
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> inputRDD = sc.textFile("hdfs://192.168.7.202:900/test/nohup.out");

        // Java lambda expressions (JDK 8 and later) save a lot of code.
        // Transform the RDD: keep only the lines that log the request headers.
        JavaRDD<String> reqRDD = inputRDD.filter(x -> x.contains("请求head"));

        // Build a JavaPairRDD of (ip, 1) pairs, then sum the counts for each IP.
        JavaPairRDD<String, Integer> pairRDD = reqRDD.mapToPair(x -> {
            String ip = "未知ip"; // placeholder key meaning "unknown IP"
            for (String st : x.split(",")) {
                if (st.contains("x-forwarded-for")) {
                    String[] ipStr = st.split("=");
                    if (ipStr.length > 1) {
                        ip = ipStr[1];
                        break;
                    }
                }
            }
            return new Tuple2<String, Integer>(ip, 1);
        }).reduceByKey((num1, num2) -> num1 + num2);

        // Write the (ip, count) pairs back to HDFS.
        pairRDD.saveAsTextFile("hdfs://192.168.7.202:900/test/FilterLine-spark");
    }
}
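
The filter, pair and reduce-by-key steps can also be checked locally on a handful of lines before packaging anything for the cluster. The sketch below is not Spark code: it swaps in java.util.stream with Collectors.groupingBy and Collectors.counting to play the role of mapToPair plus reduceByKey. The class LocalIpCount, its methods, the marker parameter and the "unknown" fallback are all hypothetical names used for illustration. The sample lines in main are made up and use the marker "head:{" as an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical local stand-in for the Spark job; not part of the original post.
final class LocalIpCount {
    private LocalIpCount() {}

    /** Returns the x-forwarded-for value of a log line, or "unknown" if absent. */
    static String ipOf(String line) {
        for (String field : line.split(",")) {
            String[] kv = field.trim().split("=", 2);
            if (kv.length == 2 && kv[0].equals("x-forwarded-for")) {
                return kv[1].trim();
            }
        }
        return "unknown";
    }

    /** Keeps the lines containing marker, then counts them per IP (sorted by key). */
    static Map<String, Long> countByIp(List<String> lines, String marker) {
        return lines.stream()
                .filter(line -> line.contains(marker))            // same role as RDD.filter
                .collect(Collectors.groupingBy(LocalIpCount::ipOf, // key, like mapToPair
                        TreeMap::new,
                        Collectors.counting()));                   // sum, like reduceByKey
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration only.
        List<String> lines = Arrays.asList(
                "head:{platform=ios, x-forwarded-for=10.0.0.1, appid=123456}",
                "head:{platform=ios, x-forwarded-for=10.0.0.2, appid=123456}",
                "head:{platform=ios, x-forwarded-for=10.0.0.1, appid=123456}",
                "some other log line");
        System.out.println(countByIp(lines, "head:{"));
    }
}
```

This only validates the parsing and counting logic on one machine. The distributed behaviour (partitioning, shuffling during reduceByKey, writing part files) still has to be tested with the real job.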





Package the project as a jar and submit it to the cluster for execution.

The final result is as follows: