Using Spark to find the province of an IP address, with an example of broadcast variables

Source: Internet  Editor: 程序博客网  Time: 2024/06/03 18:21

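This requires a lookup table that maps IP ranges to locations. Its contents look roughly like this (I have uploaded the file to CSDN for download):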

1,3708713472,3708715007,"河南省","信阳市","联通","221.14.122.0","221.14.127.255"
2,3708649472,3708813311,"河南省",,"联通","221.13.128.0","221.15.255.255"
3,3720390656,3720391679,"河北省","邢台市","联通","221.192.168.0","221.192.171.255"
4,1038992128,1038992383,"黑龙江省","齐齐哈尔市","铁通","61.237.195.0","61.237.195.255"

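Each line can be split on commas. Column 1 is the ID, column 2 is the lower bound of the IP range converted to a long, column 3 is the upper bound as a long, column 4 is the province, column 5 is the city, column 6 is the ISP, column 7 is the first IP of the range in dotted form, and column 8 is the last IP of the range.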
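As a rough illustration of that column layout, one row could be parsed as sketched below. The names `IpRange` and `IpRangeParser` are hypothetical and not part of the original project, and the sketch assumes that no field contains a comma inside its quotes.

```scala
// Hypothetical record type; the field names are illustrative only.
final case class IpRange(id: Int, lower: Long, upper: Long,
                         province: String, city: String, isp: String,
                         firstIp: String, lastIp: String)

object IpRangeParser {
  // Remove the double quotes that surround the text columns.
  private def unquote(s: String): String = s.stripPrefix("\"").stripSuffix("\"")

  // A limit of -1 keeps empty columns, such as a missing city.
  def parse(line: String): IpRange = {
    val f = line.split(",", -1)
    IpRange(f(0).toInt, f(1).toLong, f(2).toLong,
      unquote(f(3)), unquote(f(4)), unquote(f(5)), unquote(f(6)), unquote(f(7)))
  }

  def main(args: Array[String]): Unit = {
    // Row 2 of the sample, which has an empty city column.
    val row = parse("2,3708649472,3708813311,\"河南省\",,\"联通\",\"221.13.128.0\",\"221.15.255.255\"")
    println(row)
  }
}
```

The Bootstrap code in this post keeps the raw strings instead, so its output still carries the quotes around the province and ISP.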

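A broadcast variable works much like a map-side join in Hadoop: the data is shipped to every executor node and held in memory there. In Spark, calling
sc.broadcast broadcasts a variable to all executor nodes. The project below shows how it is used.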
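Before the full project, here is a minimal sketch of the broadcast pattern on its own. It assumes Spark is on the classpath and runs in local mode; the object name `BroadcastSketch` and the tiny lookup map are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch")
    val sc = new SparkContext(conf)

    // A small lookup table built on the driver (illustrative values).
    val lookup = Map(1 -> "one", 2 -> "two")

    // Ship the table to the executors once, instead of with every task closure.
    val bd = sc.broadcast(lookup)

    // Tasks read the broadcast copy through bd.value.
    val labelled = sc.parallelize(Seq(1, 2, 3))
      .map(k => (k, bd.value.getOrElse(k, "unknown")))
      .collect()

    labelled.foreach(println)
    sc.stop()
  }
}
```

This only pays off when the lookup data is small enough to fit in each executor's memory, which is the case for the IP table here.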

The project is as follows.

Overview: look up the province for each visiting IP, and sort the results by number of visits.


The Bootstrap object:

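package cn.lijie.business

import org.apache.spark.{SparkConf, SparkContext}

/**
  * User: lijie
  */
object Bootstrap {

  /**
    * Binary search over the IP ranges.
    *
    * @param arr (lower bound, upper bound, province, ISP), sorted ascending by numeric lower bound
    * @param ip  the IP address as a long
    * @return the index of the range that contains ip, or -1 if there is none
    */
  def binarySearch(arr: Array[(String, String, String, String)], ip: Long): Int = {
    var l = 0
    var h = arr.length - 1
    while (l <= h) {
      val m = (l + h) / 2
      if ((ip >= arr(m)._1.toLong) && (ip <= arr(m)._2.toLong)) {
        return m
      } else if (ip < arr(m)._1.toLong) {
        h = m - 1
      } else {
        l = m + 1
      }
    }
    -1
  }

  /**
    * Convert a dotted IPv4 string to a long.
    *
    * @param ip the IP address, e.g. "221.14.122.0"
    * @return the address as a long
    */
  def ip2Long(ip: String): Long = {
    val arr = ip.split("[.]")
    var num = 0L
    for (i <- 0 until arr.length) {
      // Shift the accumulated value left by one byte, then add the next octet
      num = arr(i).toLong | (num << 8L)
    }
    num
  }

  def main(args: Array[String]): Unit = {
    //    print(3395782400.00.toLong)
    // Sample row of the lookup table:
    // 1,3708713472,3708715007,"河南省","信阳市","联通","221.14.122.0","221.14.127.255"
    // id, lower bound, upper bound, province, city, ISP, first IP of range, last IP of range
    val conf = new SparkConf().setMaster("local[2]").setAppName("ip")
    val sc = new SparkContext(conf)

    // Load the lookup table and sort it ascending by lower bound.
    // The sort key is converted to Long so the order is numeric, not lexicographic.
    val rdd1 = sc.textFile("src/main/file/*.txt").map(x => {
      val s = x.split(",")
      // lower bound, upper bound, province, ISP
      (s(1), s(2), s(3), s(5))
    }).sortBy(_._1.toLong)

    // Broadcast the sorted table to the executors
    val bd = sc.broadcast(rdd1.collect)

    val rdd2 = sc.textFile("src/main/file/*.info").map(x => {
      val s = x.split(",")
      // (ip, 1)
      (s(1), 1)
    }).reduceByKey(_ + _)

    rdd2.map(x => {
      val ipLong = ip2Long(x._1)
      // Index of the matching range
      val index = binarySearch(bd.value, ipLong)
      // Return "unknown" when no range matches
      if (index == -1) {
        (ipLong, x._1, x._2, "unknown", "unknown")
      } else {
        // Province
        val p = bd.value(index)._3
        // ISP
        val y = bd.value(index)._4
        (ipLong, x._1, x._2, p, y)
      }
    })
      // Sort by visit count last, into a single partition: repartition(1) shuffles
      // the data and would not preserve an earlier sort order.
      .sortBy(_._3, numPartitions = 1)
      .saveAsTextFile("C:\\Users\\Administrator\\Desktop\\out")

    sc.stop()
  }
}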
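The two helper steps, converting the IP to a long and binary-searching the ranges, can be tried without Spark. The sketch below is a standalone illustration, not the project's code: `IpLookupSketch` is a hypothetical name, the bounds are stored as `Long` rather than `String`, and only three non-overlapping rows from the sample table are used, since binary search over ranges assumes they do not overlap (row 2 of the sample contains row 1).

```scala
object IpLookupSketch {
  // (lower bound, upper bound, province), sorted ascending by lower bound.
  val ranges: Array[(Long, Long, String)] = Array(
    (1038992128L, 1038992383L, "黑龙江省"),
    (3708713472L, 3708715007L, "河南省"),
    (3720390656L, 3720391679L, "河北省")
  )

  // Fold the four octets into one number: ((a * 256 + b) * 256 + c) * 256 + d.
  def ip2Long(ip: String): Long =
    ip.split("[.]").foldLeft(0L)((acc, part) => (acc << 8) | part.toLong)

  // Returns the index of the range containing ip, or -1 if none does.
  def find(ip: Long): Int = {
    var lo = 0
    var hi = ranges.length - 1
    while (lo <= hi) {
      val mid = lo + (hi - lo) / 2
      val (lower, upper, _) = ranges(mid)
      if (ip < lower) hi = mid - 1
      else if (ip > upper) lo = mid + 1
      else return mid
    }
    -1
  }

  def main(args: Array[String]): Unit = {
    val n = ip2Long("221.14.122.0")
    println(n)                          // 3708713472, the lower bound of sample row 1
    println(find(n))                    // 1
    println(find(ip2Long("8.8.8.8")))   // -1, not covered by these three rows
  }
}
```

The conversion can be checked against the table itself: 221 * 2^24 + 14 * 2^16 + 122 * 2^8 + 0 = 3708713472, which is the long value stored next to "221.14.122.0" in the first sample row.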

ip.txt is the lookup file that I uploaded (the same CSDN download mentioned at the top).

ip.info contains a few records that I mocked up:

14:45:17,202.98.248.242
15:45:17,219.220.199.250
16:45:17,219.220.199.250
18:45:17,202.98.248.242
18:45:17,202.98.248.242
18:45:17,202.98.248.242
18:45:17,202.98.248.242
18:45:17,202.98.248.242
16:45:17,114.139.223.13
15:45:17,219.220.199.250
16:45:17,219.220.199.250
15:45:17,219.220.199.250
16:45:17,219.220.199.250
15:45:17,219.220.199.250
16:45:17,219.220.199.250
13:45:17,114.139.223.13
10:45:17,114.139.223.13
13:45:17,114.139.223.13
10:45:17,114.139.223.13
10:45:17,114.10.123.13
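To see what the counting step of the job produces for these records, the same aggregation can be sketched with plain Scala collections. This is only an illustration of the `(ip, 1)` plus `reduceByKey(_ + _)` idea without Spark; `VisitCountSketch` is a hypothetical name.

```scala
object VisitCountSketch {
  // The mock ip.info records, one "time,ip" pair per line.
  val log: String =
    """14:45:17,202.98.248.242
      |15:45:17,219.220.199.250
      |16:45:17,219.220.199.250
      |18:45:17,202.98.248.242
      |18:45:17,202.98.248.242
      |18:45:17,202.98.248.242
      |18:45:17,202.98.248.242
      |18:45:17,202.98.248.242
      |16:45:17,114.139.223.13
      |15:45:17,219.220.199.250
      |16:45:17,219.220.199.250
      |15:45:17,219.220.199.250
      |16:45:17,219.220.199.250
      |15:45:17,219.220.199.250
      |16:45:17,219.220.199.250
      |13:45:17,114.139.223.13
      |10:45:17,114.139.223.13
      |13:45:17,114.139.223.13
      |10:45:17,114.139.223.13
      |10:45:17,114.10.123.13""".stripMargin

  // Count the visits per IP and sort ascending by count.
  def counts(text: String): Seq[(String, Int)] =
    text.split("\n").toSeq
      .map(_.trim).filter(_.nonEmpty)
      .map(_.split(",")(1))                 // keep only the IP column
      .groupBy(identity)
      .map { case (ip, hits) => (ip, hits.size) }
      .toSeq
      .sortBy(_._2)

  def main(args: Array[String]): Unit = counts(log).foreach(println)
}
```

For the twenty records above this gives 1 visit for 114.10.123.13, 5 for 114.139.223.13, 6 for 202.98.248.242 and 8 for 219.220.199.250.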

After the job finishes, the results are written to the output directory passed to saveAsTextFile.
