Bugs Encountered in HBase-backed Statistics After Upgrading from Hive 0.9 to Hive 0.12


1. NullPointerException


Problem analysis: at first I suspected that some value was missing from a configuration file, so I used vimdiff to compare the configuration differences between hive-0.9 and hive-0.12. After ruling out a configuration error, I downloaded the source code to dig further.

Problem resolution: the fix essentially follows this JIRA issue: https://issues.apache.org/jira/browse/HIVE-5515

2. Map tasks reading the same HBase data repeatedly

This problem has been around for a long time; as far as I can tell it already existed in hive-0.9, which is pretty painful!



Problem analysis: after modifying the source, building a new jar, and checking the tasklog, I found that every map had exactly the same startRow and endRow, so the same HBase data was being scanned repeatedly. The direct consequence is that the final reduce result is the true value multiplied by the number of maps. My guess was therefore that the bug lay in how the HBase table is split among the map tasks.
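The kind of logging this refers to looks roughly like the sketch below (a sketch, not the exact patch I used; in Hive the splits handed back by getSplits() are wrapped in Hive's own split class, so the standalone helper and its TableSplit[] parameter are illustrative only):

import org.apache.hadoop.hbase.mapred.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: dump each split's scan range so the task/client log shows whether
// the per-map startRow/endRow really are identical.
public class SplitRangeLogger {
  public static void logRanges(TableSplit[] splits) {
    for (TableSplit ts : splits) {
      System.out.println("split on " + ts.getRegionLocation()
          + " startRow=" + Bytes.toString(ts.getStartRow())
          + " endRow=" + Bytes.toString(ts.getEndRow()));
    }
  }
}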

Problem resolution: in HiveHBaseTableInputFormat.java, inside the getRecordReader method, comment out the call below (it appears to rewrite the split's scan range from the pushed-down row-key filter, which is what collapses every map onto the same range):

//      tableSplit = convertFilter(jobConf, scan, tableSplit, iKey,
//        getStorageFormatOfKey(columnsMapping.get(iKey).mappingSpec,
//        jobConf.get(HBaseSerDe.HBASE_TABLE_DEFAULT_STORAGE_TYPE, "string")));
Then add the following to build.xml:

  <target name="jar-hbase-handler" depends="init">
    <subant buildpath="hbase-handler/build.xml" target="jar">
      <property name="is-offline" value="${is-offline}"/>
      <property name="thrift.home" value="${thrift.home}"/>
      <property name="build.dir.hive" location="${build.dir.hive}"/>
    </subant>
  </target>
After that it is enough to rebuild just this one target (something like ant jar-hbase-handler).
Later, after reporting this bug upstream, I received a reply pointing to the fix for this issue: https://issues.apache.org/jira/browse/HIVE-3420

3. Concurrently submitted Hive jobs reading each other's data

Right after the project went live, several Hive-generated MapReduce jobs were triggered and submitted at the same time. When they finished, it turned out that all of these Hive jobs had read the HBase data that only one of them was supposed to scan.

Problem analysis: my guess was that the error came from some global variable or singleton, and since it already showed up at the map stage, I again focused on HiveHBaseTableInputFormat.java.

Add stack-trace and debug printing right before the return in getSplits:

      new Exception("test hive:" + System.identityHashCode(this)).printStackTrace();
      System.out.println("this:" + System.identityHashCode(this) + ", conf:" + System.identityHashCode(jobConf));
      System.out.println("TableSplits:" + Arrays.asList(splits));
Compile and package, replace the jar in production, restart hiveserver, and submit the jobs concurrently.

The result shows that HiveHBaseTableInputFormat is effectively a singleton (the same cached instance is reused across queries, via getInputFormatFromCache), while the jobConf objects are different:

java.lang.Exception: test hive
        at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.<init>(HiveHBaseTableInputFormat.java:86)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getInputFormatFromCache(HiveInputFormat.java:194)
        at org.apache.hadoop.hive.ql.exec.Utilities$3.run(Utilities.java:1940)
        at org.apache.hadoop.hive.ql.exec.Utilities.getInputSummary(Utilities.java:1962)
        at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.setNumberOfReducers(MapRedTask.java:409)
        at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:99)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:55)
java.lang.Exception: test hive:1051896348
        at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getSplits(HiveHBaseTableInputFormat.java:527)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:294)
        at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:303)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
        at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
        at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:425)
        at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:144)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:55)
job name:INSERT OVERWRITE TABLE hb_jt_s...subappad_id(Stage-0) startRow:subappad|79868893999999999 endRow:subappad|80308768999999999
job name:INSERT OVERWRITE TABLE hb_jt_stat...mobad_id(Stage-0) startRow:mobad|79868893999999999 endRow:mobad|80308768999999999
job name:INSERT OVERWRITE TABLE hb_jt_statistic...tag(Stage-0) startRow:sub2main|79868893999999999 endRow:sub2main|80308768999999999
this:1620951483, conf:828005119
TableSplits:[[storage2.test.lan:sub2main|79868893999999999\x00,sub2main|80308768999999999\x00]]
this:1620951483, conf:588002473
TableSplits:[[storage2.test.lan:sub2main|79868893999999999\x00,sub2main|80308768999999999\x00]]
this:1620951483, conf:407508952
TableSplits:[[storage2.test.lan:sub2main|79868893999999999\x00,sub2main|80308768999999999\x00]]
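To make the failure mode concrete, here is a small standalone toy (a sketch, not Hive code; all class and method names below are made up) showing why a cached, stateful input format breaks under concurrent getSplits() calls: the scan range lives in instance fields, so whichever query configures the shared instance last wins, and every query ends up with the same splits.

import java.util.concurrent.CountDownLatch;

// Toy illustration: a shared "input format" whose scan range is instance state.
public class SharedStateRace {

  static class StatefulInputFormat {
    private String startRow;   // scan range kept as instance state, shared by all callers
    private String endRow;

    void setScan(String start, String end) {
      startRow = start;
      endRow = end;
    }

    String getSplitRange() {
      return startRow + " -> " + endRow;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // like a cached instance: every query gets the same object
    final StatefulInputFormat cached = new StatefulInputFormat();
    final CountDownLatch bothConfigured = new CountDownLatch(2);

    Runnable queryA = () -> {
      cached.setScan("subappad|798...", "subappad|803...");
      bothConfigured.countDown();
      awaitQuietly(bothConfigured);
      System.out.println("query A splits: " + cached.getSplitRange());
    };
    Runnable queryB = () -> {
      cached.setScan("sub2main|798...", "sub2main|803...");
      bothConfigured.countDown();
      awaitQuietly(bothConfigured);
      System.out.println("query B splits: " + cached.getSplitRange());
    };

    Thread a = new Thread(queryA);
    Thread b = new Thread(queryB);
    a.start();
    b.start();
    a.join();
    b.join();
    // Typically both lines print the same range: whichever setScan() ran last
    // overwrote the other query's range before the splits were computed.
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}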

A simple way to fix this:

Create HiveHBaseTableInputFormatRealExecute.java, copy the entire contents of HiveHBaseTableInputFormat.java into it, and then reduce the original HiveHBaseTableInputFormat.java to an empty shell that only delegates:

public class HiveHBaseTableInputFormat implements InputFormat<ImmutableBytesWritable, Result> {

  @Override
  public RecordReader<ImmutableBytesWritable, Result> getRecordReader(InputSplit split,
      JobConf jobConf, final Reporter reporter) throws IOException {
    HiveHBaseTableInputFormatRealExecute exe = new HiveHBaseTableInputFormatRealExecute();
    return exe.getRecordReader(split, jobConf, reporter);
  }

  @Override
  public InputSplit[] getSplits(JobConf jobConf, int numSplits) throws IOException {
    HiveHBaseTableInputFormatRealExecute exe = new HiveHBaseTableInputFormatRealExecute();
    return exe.getSplits(jobConf, numSplits);
  }
}
Since one static method in HiveHBaseTableInputFormat.java is called from outside the class, a small additional change is needed to keep that working.
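For illustration, a sketch with hypothetical names (someStaticHelper stands in for the real static method, which is not listed here): keep a static method with the original signature on the shell class and forward it to the class that now holds the real code, so the external caller does not need to change.

// Sketch with made-up names only.
public class ShellDelegationSketch {

  // stands in for HiveHBaseTableInputFormat: keep the original static entry point
  public static String someStaticHelper(String mappingSpec) {
    return RealExecuteSketch.someStaticHelper(mappingSpec);
  }
}

class RealExecuteSketch {
  // stands in for HiveHBaseTableInputFormatRealExecute, which now carries the
  // implementation copied from the original class
  static String someStaticHelper(String mappingSpec) {
    return mappingSpec; // placeholder for the copied implementation
  }
}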

PS: I am a Hive beginner, so if anything here is wrong, feel free to contact me.