HBase "Invalid HFile block magic" on HDFS


Yesterday HBase suddenly started failing: the front-end application's HBase connections piled up and hung. The HBase logs showed the following error:

java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://master:9000/hbase/metadata/7d30805699ab98a11cf1f3f4945d9609/meta/5042f025c06c45819cc5c3821e6298cf, compression=none, cacheConf=CacheConfig:enabled [cacheDataOnRead=true] [cacheDataOnWrite=false] [cacheIndexesOnWrite=false] [cacheBloomsOnWrite=false] [cacheEvictOnClose=false] [cacheCompressed=false], firstKey=001100220736330/meta:refvalue/20140922/Put, lastKey=001100301928568/meta:refvalue/20140922/Put, avgKeyLen=39, avgValueLen=8, entries=483817, length=27616128, cur=001100220819798/meta:C001/20140922/Maximum/vlen=0/ts=0] to key 001100220819798/meta:C001/LATEST_TIMESTAMP/Maximum/vlen=0/ts=0
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:158)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:351)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:333)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:291)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:256)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:519)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:402)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:127)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3354)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3310)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3327)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4066)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4039)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1944)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3346)
        at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1376)
Caused by: java.io.IOException: Invalid HFile block magic: \x00\x00\x00\x00\x00\x00\x00\x00
        at org.apache.hadoop.hbase.io.hfile.BlockType.parse(BlockType.java:153)
        at org.apache.hadoop.hbase.io.hfile.BlockType.read(BlockType.java:164)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock.<init>(HFileBlock.java:254)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1779)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1637)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:327)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:213)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:455)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:475)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:226)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:145)
        ... 19 more

There is an HBase JIRA issue for this error, against the same HBase version I run: https://issues.apache.org/jira/i#browse/HBASE-5885. That report is about running on a local filesystem, though, and there is not much else written about this problem online. Comparing the code, the patch from that issue is already present in my production build, so it had already been applied; I set that lead aside.

First, some background on my case. Yesterday afternoon I was loading data into HBase when the front-end monitoring started reporting HBase connection errors, which eventually caused some of the front-end machines to hang. I spent a long time hunting for the cause without finding it, and even suspected the data I was loading contained dirty characters.

I first shut down the regionserver on the affected server, but the regionservers on the other machines kept reporting the same error.

Then it occurred to me to switch off HBase's own checksum verification first. The relevant setting lives in hbase-site.xml: hbase.regionserver.checksum.verify, where false disables verification and true enables it. The parameter is documented in org.apache.hadoop.hbase.HConstants as follows:


/**
 * If this parameter is set to true, then hbase will read
 * data and then verify checksums. Checksum verification
 * inside hdfs will be switched off.  However, if the hbase-checksum
 * verification fails, then it will switch back to using
 * hdfs checksums for verifiying data that is being read from storage.
 *
 * If this parameter is set to false, then hbase will not
 * verify any checksums, instead it will depend on checksum verification
 * being done in the hdfs client.
 */
public static final String HBASE_CHECKSUM_VERIFICATION =
    "hbase.regionserver.checksum.verify";
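To apply that, a minimal hbase-site.xml snippet might look like the following (a sketch only; it needs to go into hbase-site.xml on every regionserver, and the regionservers must be restarted to pick it up):

<!-- hbase-site.xml: turn off HBase-level checksum verification;
     reads then rely on the HDFS client's own checksum checks -->
<property>
  <name>hbase.regionserver.checksum.verify</name>
  <value>false</value>
</property>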

The class that loads this value is HRegionServer.

In 0.94.0, if this value is not set in hbase-site.xml it defaults to true; in 0.94.2 the default is false. I am not sure why the default changed.

I set it to false and restarted the HBase regionservers, but the errors kept coming, now like this:

2014-09-24 16:02:46,492 INFO org.apache.hadoop.fs.FSInputChecker: Found checksum error: b[5136, 5648]=00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_9091742659655150538:of:/hbase/log/d7ed8a849af9d171b8f76a766b8a5857/log/070f0e7ddbcc413a893397eb3b3141f3 at 0

So the HDFS-level checksum verification was failing as well, and the block contents read back as all zeros, which matches the \x00 block magic in the HBase stack trace above.

To keep data consistent, Hadoop generates a checksum file for every block it stores and verifies it on reads and writes, ensuring the data is intact.
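Concretely, on each datanode every block file sits next to its checksum file. A sketch of what that looks like on disk; the dfs.data.dir path and the generation stamp suffix are made up for illustration, only the block id comes from the log above:

# under dfs.data.dir on a datanode (path and generation stamp illustrative)
$ ls /data/dfs/data/current/ | grep 9091742659655150538
blk_9091742659655150538
blk_9091742659655150538_1042.meta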

So there is really only one situation that causes this kind of problem: a failing disk. The checksum file is generated from the data actually written to disk, so the checksum and the block are in one-to-one correspondence and consistent with each other; verification can only pass when what is read back matches what was written, regardless of whether the file itself was "complete". The block has three replicas, but when verification fails on the replica on this machine, HBase does not go off and read one of the other replicas. The fix is therefore not just to stop the regionserver on this node, but to shut down both the datanode and the regionserver; once the cluster detects that the node is gone, it will re-replicate the missing block copies on its own.
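Roughly, the steps on the bad node looked like the following (a sketch; the daemon script paths depend on how Hadoop and HBase are installed, and /hbase here is just the HBase root directory in HDFS):

# on the bad node: stop the regionserver first, then the datanode
$HBASE_HOME/bin/hbase-daemon.sh stop regionserver
$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode

# from any other node: confirm HDFS has noticed the lost replicas and is
# re-replicating them (the fsck summary reports "Under-replicated blocks")
$HADOOP_HOME/bin/hadoop fsck /hbase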

After shutting down the datanode and regionserver on that node, I started checking its disks.
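The disk check itself was along these lines (a sketch; /dev/sdb is a placeholder for whichever disk is under suspicion):

# SMART health status and reallocated/pending sector counts
smartctl -a /dev/sdb
# read-only surface scan for unreadable (bad) sectors
badblocks -sv /dev/sdb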

The check turned up one disk with bad sectors. I pulled that disk, restored the HBase checksum verification setting, and the service returned to normal.

A Hadoop cluster can tolerate an entire machine going down, but in practice it did not tolerate this single failing disk, so disk-level monitoring is clearly worth putting in place.
