Spark Upgrade (1.6.1 to 2.0.2): Ghost Partitions


1. Problem Description

After upgrading the Spark version, partitions whose underlying data had already been deleted were still registered for the table. Spark checks the registered partitions at query time and fails on any partition that no longer has data, regardless of whether the query actually touches the deleted partitions.

2. Troubleshooting

1) Logging in with impala-shell and querying the existing data works normally:

[dmp175:21000] > select eid,ecid,sld,count(sld) from base_dmp_v1 where day=20171221 group by eid,ecid,sld limit 10;
Query: select eid,ecid,sld,count(sld) from base_dmp_v1 where day=20171221 group by eid,ecid,sld limit 10
+------+------+------------------------------------------------+------------+
| eid  | ecid | sld                                            | count(sld) |
+------+------+------------------------------------------------+------------+
| 7921 | 15   | only-250629-175-167-154-163.nstool.netease.com | 2          |
| 7921 | 15   | only-732014-175-167-146-171.nstool.netease.com | 1          |
| 143  | 21   | www.simcere.com                                | 2          |
| 49   | 15   | atf.flyert.com                                 | 18         |
| 9    | 1    | fw.hswzyj.com                                  | 2          |
| 143  | 21   | 1513857630728.081.sngdia.imtmp.net             | 1          |
| 7921 | 15   | only-822862-175-167-138-69.nstool.netease.com  | 2          |
| 7922 | 15   | 249989901-mmspdu.s3.lecloud.com                | 10         |
| 9    | 1    | wq22.82mo.cn                                   | 10         |
| 143  | 21   | 1513837643698.051.sngdia.imtmp.net             | 1          |
+------+------+------------------------------------------------+------------+
Fetched 10 row(s) in 342.76s


2) Running a query through Spark's beeline (the Spark Thrift Server) fails:

0: jdbc:hive2://192.168.200.175:10000/default> insert into table statistic_sld_group_eidecid select eid,ecid,sld,count(sld) from base_dmp_v1 where day=20171221 group by eid,ecid,sld;
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 45 in stage 1.0 failed 4 times, most recent failure: Lost task 45.3 in stage 1.0 (TID 172, 192.168.200.174): java.io.FileNotFoundException: File does not exist: hdfs://logSave/home/data/etl/export/parquet/base_dmp_v1/day=20170721/hour=09
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
        at org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
        at org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
        at scala.collection.AbstractIterator.to(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:912)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Driver stacktrace: (state=,code=0)
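Note that the missing path in the stack trace (day=20170721/hour=09) does not even match the query's filter (day=20171221): the HadoopFsRelation file listing in the trace is Spark 2.0 enumerating the files of every registered partition while planning the scan, so any registered partition whose directory has been removed from HDFS breaks the job. A quick way to confirm that a directory really is gone, as a minimal sketch assuming the hdfs command-line client is available where the check is run:

import subprocess

# Path copied from the FileNotFoundException above
path = "hdfs://logSave/home/data/etl/export/parquet/base_dmp_v1/day=20170721/hour=09"

# `hdfs dfs -test -d` exits with 0 only if the path exists and is a directory
exists = subprocess.call(["hdfs", "dfs", "-test", "-d", path]) == 0
print("partition directory exists:", exists)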

3) Check the table's partitions with show partitions tablename; the result is as follows:

0: jdbc:hive2://192.168.200.175:10000/default> show partitions base_dmp_v1;
+-----------------------+--+
|        result         |
+-----------------------+--+
| day=20171220/hour=10  |
| day=20171220/hour=11  |
| day=20171220/hour=12  |
| day=20171220/hour=13  |
| day=20171220/hour=14  |
| day=20171220/hour=15  |
| day=20171220/hour=16  |
| day=20171220/hour=17  |
| day=20171220/hour=18  |
| day=20171220/hour=19  |
| day=20171220/hour=20  |
| day=20171220/hour=21  |
| day=20171220/hour=22  |
| day=20171220/hour=23  |
| day=20171221/hour=00  |
| day=20171221/hour=01  |
| day=20171221/hour=02  |
| day=20171221/hour=03  |
| day=20171221/hour=04  |
| day=20171221/hour=05  |
| day=20171221/hour=06  |
| day=20171221/hour=07  |
| day=20171221/hour=08  |
| day=20171221/hour=09  |
| day=20171221/hour=10  |
| day=20171221/hour=11  |
| day=20171221/hour=12  |
| day=20171221/hour=13  |
| day=20171221/hour=14  |
| day=20171221/hour=15  |
| day=20171221/hour=16  |
| day=20171221/hour=17  |
| day=20171221/hour=18  |
| day=20171221/hour=19  |
| day=20171221/hour=20  |
| day=20171221/hour=21  |
| day=20171221/hour=22  |
| day=20171221/hour=23  |
| day=20171222/hour=00  |
| day=20171222/hour=01  |
| day=20171222/hour=02  |
| day=20171222/hour=03  |
| day=20171222/hour=04  |
| day=20171222/hour=05  |
| day=20171222/hour=06  |
| day=20171222/hour=07  |
| day=20171222/hour=08  |
| day=20171222/hour=09  |
| day=20171222/hour=10  |
| day=20171222/hour=11  |
| day=20171222/hour=12  |
| day=20171222/hour=13  |
| day=20171222/hour=14  |
+-----------------------+-

Analysis: many partitions whose data had already been deleted were still registered for the table.
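Extending the single-path check shown earlier to every registered partition makes it possible to enumerate exactly which partitions have become ghosts before dropping anything. The following is only a sketch: it reuses the JDBC URL from the beeline session above and the table location visible in the stack trace, and assumes the beeline and hdfs CLIs are on the PATH and that this beeline build supports --outputformat=csv2:

import subprocess

JDBC_URL = "jdbc:hive2://192.168.200.175:10000/default"
# Table location as it appears in the FileNotFoundException above
TABLE_LOCATION = "hdfs://logSave/home/data/etl/export/parquet/base_dmp_v1"

# Ask the metastore (through the Spark Thrift Server) for all registered partitions
out = subprocess.check_output(
    ["beeline", "-u", JDBC_URL, "-n", "root", "--silent=true",
     "--outputformat=csv2", "-e", "show partitions base_dmp_v1;"])
partitions = [line.strip() for line in out.decode().splitlines()
              if line.strip().startswith("day=")]

# A partition is a "ghost" if its directory no longer exists on HDFS
ghosts = [p for p in partitions
          if subprocess.call(["hdfs", "dfs", "-test", "-d",
                              TABLE_LOCATION + "/" + p]) != 0]

print("ghost partitions:")
for p in ghosts:
    print("  " + p)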

3. Solution

The fix is simple: drop the partitions whose data has already been deleted (via a Python script). The relevant statement is: ALTER TABLE base_dmp_v1 DROP IF EXISTS PARTITION(day=20171206);

Below is the Python script that drops all of the stale partitions:

# coding=utf-8
import os

year = "2017"
# Months with 31, 30 and 28 days respectively (2017 is not a leap year)
bigmonths = ["01", "03", "05", "07", "08", "10", "12"]
middlemonths = ["04", "06", "09", "11"]
smallmonths = ["02"]
daybig = ["%02d" % d for d in range(1, 32)]
daymiddle = ["%02d" % d for d in range(1, 31)]
daysmall = ["%02d" % d for d in range(1, 29)]

def drop_partition(partition):
    # Issue the DROP PARTITION through beeline against the Spark Thrift Server
    sql = ("beeline -u jdbc:hive2://192.168.200.175:10000/default -n root -e "
           "'ALTER TABLE base_dmp_v1 DROP IF EXISTS PARTITION(day=" + partition + ");'")
    print(sql)
    os.system(sql)

for month in bigmonths:
    for day in daybig:
        # Keep the partitions from 2017-12-20 onward; their data still exists
        if month == "12" and 20 <= int(day) <= 31:
            continue
        drop_partition(year + month + day)

for month in middlemonths:
    for day in daymiddle:
        drop_partition(year + month + day)

for month in smallmonths:
    for day in daysmall:
        drop_partition(year + month + day)
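If the ghost-partition list from the cross-check sketch in section 2 is available, the drop statements can also be generated from that list instead of walking every calendar day of 2017, so that only partitions known to be missing from HDFS are touched. This variant is a sketch under the same assumptions; ghosts is the list produced by that earlier snippet, and partitions are dropped per day (as in the script above), which also removes the hour-level subpartitions of each day:

import os

# `ghosts` comes from the cross-check sketch in section 2, e.g. ["day=20170721/hour=09", ...]
days = sorted({p.split("/")[0].split("=")[1] for p in ghosts})

for day in days:
    sql = ("beeline -u jdbc:hive2://192.168.200.175:10000/default -n root -e "
           "'ALTER TABLE base_dmp_v1 DROP IF EXISTS PARTITION(day=" + day + ");'")
    print(sql)
    os.system(sql)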


4. After Dropping the Partitions

1) Checking the partitions again, only the partitions that still have data remain:

Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://192.168.200.175:10000/default> show partitions base_dmp_v1;
+-----------------------+--+
|        result         |
+-----------------------+--+
| day=20171222/hour=00  |
| day=20171222/hour=01  |
| day=20171222/hour=02  |
| day=20171222/hour=03  |
| day=20171222/hour=04  |
| day=20171222/hour=05  |
| day=20171222/hour=06  |
| day=20171222/hour=07  |
| day=20171222/hour=08  |
| day=20171222/hour=09  |
| day=20171222/hour=10  |
| day=20171222/hour=11  |
| day=20171222/hour=12  |
| day=20171222/hour=13  |
| day=20171222/hour=14  |
+-----------------------+--+
53 rows selected (0.292 seconds)

2) Queries now return normally:

0: jdbc:hive2://192.168.200.175:10000/default> select eid,ecid,sld,count(sld) from base_dmp_v1 where day=20171221 group by eid,ecid,sld limit 10;
+-------+-------+--------------------------+-------------+--+
|  eid  | ecid  |           sld            | count(sld)  |
+-------+-------+--------------------------+-------------+--+
| 7342  | 15    | res.imtt.qq.com          | 11628217    |
| 83    | 5     | wup.huya.com             | 6543218     |
| 117   | 15    | screenshot.dwstatic.com  | 492138      |
| 21    | 6     | msg.mobile.kugou.com     | 1728706     |
| 304   | 15    | i8.mifile.cn             | 215481      |
| 7922  | 15    | gad.netease.com          | 31423       |
| 7115  | 21    | 185.38.13.130            | 114         |
| 143   | 21    | url.cn                   | 892582      |
| 6     | 5     | 42.56.78.17              | 491946      |
| 2     | 17    | mdws.openapi.360.cn      | 1148830     |
+-------+-------+--------------------------+-------------+--+
10 rows selected (212.995 seconds)

3) Comparing Impala and Spark query speed, and why Spark is faster here.

Premise: Impala and Spark share the same Hive metastore and HDFS.
From the official Impala documentation:
Factors Affecting Scalability:

A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with cluster size. However, for some workloads, the scalability might be bounded by the network, or even by memory.

If the workload is already network bound (on a 10 GB network), increasing the cluster size won’t reduce the network load; in fact, a larger cluster could increase network traffic because some queries involve "broadcast" operations to all DataNodes. Therefore, boosting the cluster size does not improve query throughput in a network-constrained environment.

Let’s look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional concurrent queries because all memory allocated has already been consumed, but neither CPU, disk, nor network is saturated yet. This can happen because currently Impala uses only a single core per node to process join and aggregation queries. For a node with 128 GB of RAM, if a join node takes 50 GB, the system cannot run more than 2 such queries at the same time.

Therefore, at most 2 cores are used. Throughput can still scale almost linearly even for a memory-bound workload. It’s just that the CPU will not be saturated. Per-node throughput will be lower than 1.6 GB/sec. Consider increasing the memory per node.

As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node as the throughput estimate.


From this we can see that Impala uses only a single core per node for join and aggregation processing, so at most two cores end up working on the data. Spark, by contrast, lets you configure how many cores and how much memory each node uses; in this project the executors were given 5 cores and 10 GB of memory each, so it is naturally faster.
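For reference, the per-executor resources mentioned above (5 cores, 10 GB of memory) can be set either as spark-submit / Thrift Server options (--executor-cores 5 --executor-memory 10g) or in code. A minimal PySpark sketch, with a made-up application name and the same aggregation query used earlier:

from pyspark.sql import SparkSession

# Mirrors the per-executor resources used in this project: 5 cores, 10 GB
spark = (SparkSession.builder
         .appName("base-dmp-aggregation")          # hypothetical application name
         .config("spark.executor.cores", "5")
         .config("spark.executor.memory", "10g")
         .enableHiveSupport()
         .getOrCreate())

# Same aggregation as in section 4
spark.sql("select eid, ecid, sld, count(sld) "
          "from base_dmp_v1 where day=20171221 "
          "group by eid, ecid, sld").show(10)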

