Analysis of the memstore flush flow

Source: Internet. Editor: 程序博客网. Time: 2024/05/18 01:06

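A memstore flush is initiated mainly from the following places:

a. When HRegionServer calls multi to apply updates, it checks whether the global memstore size exceeds the configured upper or lower limit. If it does, it issues a WakeupFlushThread flush request; if the global upper limit is exceeded, the update must also wait for the flush to complete.

b. When HRegionServer updates data and calls HRegion.batchMutate to update the stores, it issues a FlushRegionEntry flush request if the region's memstore size exceeds the configured region memstore size.

c. The client explicitly calls HRegionServer.flushRegion.

d. The HRegionServer.PeriodicMemstoreFlusher timer thread flushes periodically, at the interval set by hbase.regionserver.optionalcacheflushinterval (default 3600000 ms).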
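The triggers above can be pictured as a small decision function. This is only a sketch under assumptions: the class FlushTriggerSketch, the Action enum and every parameter name are hypothetical and do not exist in HBase, and trigger (c), the explicit client call, is left out because it involves no threshold.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of flush triggers (a), (b) and (d); not HBase code. */
class FlushTriggerSketch {
    enum Action { WAKE_GLOBAL_FLUSH, BLOCK_UPDATES, REQUEST_REGION_FLUSH, PERIODIC_FLUSH }

    static Set<Action> decide(long globalSize, long lowMark, long highMark,
                              long regionSize, long regionFlushSize,
                              long msSinceLastFlush, long periodicIntervalMs) {
        Set<Action> actions = EnumSet.noneOf(Action.class);
        if (globalSize > lowMark) {
            actions.add(Action.WAKE_GLOBAL_FLUSH);    // (a) wake the flusher thread
        }
        if (globalSize > highMark) {
            actions.add(Action.BLOCK_UPDATES);        // (a) the writer waits for the flush
        }
        if (regionSize > regionFlushSize) {
            actions.add(Action.REQUEST_REGION_FLUSH); // (b) per-region flush request
        }
        if (msSinceLastFlush > periodicIntervalMs) {
            actions.add(Action.PERIODIC_FLUSH);       // (d) periodic flusher
        }
        return actions;
    }

    public static void main(String[] args) {
        // Global size above both marks: wake the flusher and block the writer.
        System.out.println(decide(50, 35, 40, 10, 128, 0, 3_600_000L));
    }
}
```

Several actions can apply to the same update, which is why the sketch returns a set rather than a single value.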

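Flush execution

The flush itself is performed by MemStoreFlusher. When a flush request is issued, the request is added to the flushQueue queue and is also recorded in regionsInQueue.

When the MemStoreFlusher instance is created, it starts the MemStoreFlusher.FlushHandler threads. The number of threads is set by hbase.hstore.flusher.count and defaults to 1.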
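The pairing of a queue with a record of pending regions can be sketched as below. This is a simplified model, not the HBase implementation: FlushQueueSketch, requestFlush and pollNext are hypothetical names, regions are plain strings, and a LinkedBlockingQueue stands in for the real queue.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: a flush queue plus a record of regions already queued. */
class FlushQueueSketch {
    private final BlockingQueue<String> flushQueue = new LinkedBlockingQueue<>();
    private final Set<String> regionsInQueue = new HashSet<>();

    /** Returns true if a new request was queued, false if the region was already pending. */
    boolean requestFlush(String regionName) {
        synchronized (regionsInQueue) {
            if (!regionsInQueue.add(regionName)) {
                return false; // avoid queueing the same region twice
            }
            flushQueue.add(regionName);
            return true;
        }
    }

    /** Handler side: waits up to timeoutMs; null means no request arrived in time. */
    String pollNext(long timeoutMs) {
        String region;
        try {
            region = flushQueue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        if (region != null) {
            synchronized (regionsInQueue) {
                regionsInQueue.remove(region);
            }
        }
        return region;
    }
}
```

Keeping the pending record next to the queue means repeated requests for one region collapse into a single queued entry.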

private class FlushHandler extends HasThread {
  @Override
  public void run() {
    while (!server.isStopped()) {
      FlushQueueEntry fqe = null;
      try {
        wakeupPending.set(false); // allow someone to wake us up again
        // Take one flush request from the queue. The queue is a blocking queue: if
        // flushQueue is empty, wait for the number of ms configured by
        // hbase.server.thread.wakefrequency (default 10*1000).
        fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
        if (fqe == null || fqe instanceof WakeupFlushThread) {
          // There is no flush request, or the request is a global flush request.
          // Check whether all memstores together exceed the value configured by
          // hbase.regionserver.global.memstore.lowerLimit (default 0.35).
          if (isAboveLowWaterMark()) {
            LOG.debug("Flush thread woke up because memory above low water="
                + StringUtils.humanReadableInt(globalMemStoreLimitLowMark));
            // Above the configured lower limit: flush the region with the largest memstore.
            // For this method, see the analysis of MemStoreFlusher.flushOneForGlobalPressure.
            if (!flushOneForGlobalPressure()) {
              // .................... (part of the code is not shown here)
              Thread.sleep(1000);
              // No region needed flushing, so wake up the waiting update threads.
              // The HRegionServer update methods make the update thread wait for a
              // flush when the total memstore size exceeds the configured upper limit.
              wakeUpIfBlocking();
            }
            // Enqueue another one of these tokens so we'll wake up again
            // Issue another global wake-up flush request (a WakeupFlushThread request).
            wakeupFlushThread();
          }
          continue;
        }

        // A normal flush request: the memstore of a single region exceeds
        // hbase.hregion.memstore.flush.size (default 1024*1024*128L).
        // For this method, see the analysis of MemStoreFlusher.flushRegion.
        FlushRegionEntry fre = (FlushRegionEntry) fqe;
        if (!flushRegion(fre)) {
          break;
        }
      } catch (InterruptedException ex) {
        continue;
      } catch (ConcurrentModificationException ex) {
        continue;
      } catch (Exception ex) {
        LOG.error("Cache flusher failed for entry " + fqe, ex);
        if (!server.checkFileSystem()) {
          break;
        }
      }
    }
    // The MemStoreFlusher thread is ending, usually because the region server is stopping.
    synchronized (regionsInQueue) {
      regionsInQueue.clear();
      flushQueue.clear();
    }
    // Signal anyone waiting, so they see the close flag
    wakeUpIfBlocking();
    LOG.info(getName() + " exiting");
  }
}

Analysis of the MemStoreFlusher.flushOneForGlobalPressure flow

This method picks, among all regions, the one with the largest memstore and flushes it.

private boolean flushOneForGlobalPressure() {
  SortedMap<Long, HRegion> regionsBySize =
      server.getCopyOfOnlineRegionsSortedBySize();
  Set<HRegion> excludedRegions = new HashSet<HRegion>();
  boolean flushedOne = false;
  while (!flushedOne) {
    // Find the biggest region that doesn't have too many storefiles
    // (might be null!)
    // Pick the region with the largest memstore that satisfies both conditions:
    // a. writestate.flushing == false and writestate.writesEnabled == true (not read-only)
    // b. every store in the region has fewer store files than
    //    hbase.hstore.blockingStoreFiles (default 7)
    // The regions are walked in descending order of memstore size, and the first
    // one that meets the conditions is returned. If none does, null is returned.
    HRegion bestFlushableRegion = getBiggestMemstoreRegion(
        regionsBySize, excludedRegions, true);
    // Find the biggest region, total, even if it might have too many flushes.
    // Pick the region with the largest memstore that satisfies this condition:
    // a. writestate.flushing == false and writestate.writesEnabled == true (not read-only)
    // Again the regions are walked in descending order of memstore size. If none
    // qualifies, null is returned. This call does not check whether any store in
    // the region exceeds the configured store file count.
    HRegion bestAnyRegion = getBiggestMemstoreRegion(
        regionsBySize, excludedRegions, false);
    // If the second lookup found nothing, there is no region to flush:
    // return without flushing.
    if (bestAnyRegion == null) {
      LOG.error("Above memory mark but there are no flushable regions!");
      return false;
    }

    // Choose the region that most needs flushing. If the region with the largest
    // memstore uses more than twice the memory of the largest region whose stores
    // are all under the store file limit, flush that largest region first.
    HRegion regionToFlush;
    if (bestFlushableRegion != null &&
        bestAnyRegion.memstoreSize.get() > 2 * bestFlushableRegion.memstoreSize.get()) {
      // .................... (part of the code is not shown here)
      if (LOG.isDebugEnabled()) {
        // .................... (part of the code is not shown here)
      }
      regionToFlush = bestAnyRegion;
    } else {
      // If no candidate region is under the store file limit (every region has a
      // store whose file count exceeds the configured maximum), flush the region
      // with the largest memstore.
      if (bestFlushableRegion == null) {
        regionToFlush = bestAnyRegion;
      } else {
        // If some candidate region has stores that are still under the store file
        // limit, flush that region first. This keeps a few write-hot regions from
        // compacting too often while regions with colder writes never get flushed.
        regionToFlush = bestFlushableRegion;
      }
    }
    Preconditions.checkState(regionToFlush.memstoreSize.get() > 0);
    LOG.info("Flush of region " + regionToFlush + " due to global heap pressure");
    // Run the flush with the global flush flag set to true; see the global
    // MemStoreFlusher.flushRegion flow. If the flush fails, the region is added
    // to excludedRegions so that this round skips it and tries the region with
    // the next largest memstore.
    flushedOne = flushRegion(regionToFlush, true);
    if (!flushedOne) {
      LOG.info("Excluding unflushable region " + regionToFlush +
          " - trying to find a different region to flush.");
      excludedRegions.add(regionToFlush);
    }
  }
  return true;
}
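The selection rule in flushOneForGlobalPressure can be restated as a small standalone function. This is an illustrative sketch only: GlobalPressureChoiceSketch and its Region class are hypothetical, regions are reduced to a name, a memstore size and a "too many store files" flag, and the flushing and read-only checks are assumed to have passed already.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical restatement of the "best any vs. best flushable" choice. */
class GlobalPressureChoiceSketch {
    static final class Region {
        final String name;
        final long memstoreSize;
        final boolean tooManyStoreFiles;

        Region(String name, long memstoreSize, boolean tooManyStoreFiles) {
            this.name = name;
            this.memstoreSize = memstoreSize;
            this.tooManyStoreFiles = tooManyStoreFiles;
        }
    }

    /** Returns the name of the region to flush, or null if there is no candidate. */
    static String choose(List<Region> regions) {
        Region bestAny = null;       // largest memstore overall
        Region bestFlushable = null; // largest memstore among regions under the file limit
        for (Region r : regions) {
            if (bestAny == null || r.memstoreSize > bestAny.memstoreSize) {
                bestAny = r;
            }
            if (!r.tooManyStoreFiles
                    && (bestFlushable == null || r.memstoreSize > bestFlushable.memstoreSize)) {
                bestFlushable = r;
            }
        }
        if (bestAny == null) {
            return null;
        }
        if (bestFlushable == null || bestAny.memstoreSize > 2 * bestFlushable.memstoreSize) {
            return bestAny.name;
        }
        return bestFlushable.name;
    }

    public static void main(String[] args) {
        // 100 is not more than 2 * 60, so the flushable region "B" is chosen.
        System.out.println(choose(Arrays.asList(
                new Region("A", 100, true), new Region("B", 60, false))));
    }
}
```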

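Analysis of the MemStoreFlusher.flushRegion flow (global)

When the second argument passed to this method is true, the flush is a global flush; otherwise the flush was requested because the region's memstore reached the configured size.

A return value of true means the flush succeeded; false means it failed.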

private boolean flushRegion(final HRegion region, final boolean emergencyFlush) {
  synchronized (this.regionsInQueue) {
    // Remove this region from regionsInQueue and obtain its flush request.
    FlushRegionEntry fqe = this.regionsInQueue.remove(region);
    // For a global flush, also remove the flush request from the flushQueue.
    if (fqe != null && emergencyFlush) {
      // Need to remove from region from delay queue. When NOT an
      // emergencyFlush, then item was removed via a flushQueue.poll.
      flushQueue.remove(fqe);
    }
  }
  lock.readLock().lock();
  try {
    // Run HRegion.flushcache. A return value of true means a compaction is needed;
    // false means no compaction request has to be issued.
    boolean shouldCompact = region.flushcache();
    // We just want to check the size
    // Check whether a split is needed. No split is done in the following cases:
    // a. The region belongs to the meta table.
    // b. The region is configured for distributedLogReplay and has not been
    //    replayed yet since it was opened (isRecovering == true).
    // c. splitRequest is false; true means a client has called the
    //    regionServer.splitRegion operation.
    // d. c is false and no store in the region is larger than the smaller of
    //    hbase.hregion.max.filesize (default 10 * 1024 * 1024 * 1024L, 10 GB) and
    //    hbase.hregion.memstore.flush.size (default 1024*1024*128L, 128 MB) *
    //    (number of regions of this table on the current RS *
    //     number of regions of this table on the current RS).
    // e. c is false, or a store has a store file of type reference, that is, a
    //    store file that refers to another store file.
    // f. The checks in c, d and e pass and the client requested a split, but the
    //    client specified a split row that does not exist in this region.
    // g. When all of the checks above give the opposite result, a split is needed.
    boolean shouldSplit = region.checkSplit() != null;

    if (shouldSplit) {
      // A region split is needed: issue a split request.
      this.server.compactSplitThread.requestSplit(region);
    } else if (shouldCompact) {
      // A compaction is needed: issue a system compaction request.
      server.compactSplitThread.requestSystemCompaction(
          region, Thread.currentThread().getName());
    }
  } catch (DroppedSnapshotException ex) {
    // .................... (part of the code is not shown here)
    server.abort("Replay of HLog required. Forcing server shutdown", ex);
    return false;
  } catch (IOException ex) {
    // .................... (part of the code is not shown here)
    if (!server.checkFileSystem()) {
      return false;
    }
  } finally {
    lock.readLock().unlock();
    // Wake up all threads waiting to update region data so that the updates can
    // proceed (a global flush makes updates wait).
    wakeUpIfBlocking();
  }
  return true;
}
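Condition d in the split check is easier to follow with numbers. The sketch below assumes, as a reading of the text rather than a confirmed HBase signature, that the size threshold is the smaller of the maximum file size and the flush size multiplied by the square of the table's region count on this region server. SplitSizeSketch and sizeToCheck are hypothetical names.

```java
/** Hypothetical arithmetic for the split size threshold described in condition d. */
class SplitSizeSketch {
    static final long MAX_FILE_SIZE = 10L * 1024 * 1024 * 1024; // 10 GB default from the text
    static final long FLUSH_SIZE = 1024L * 1024 * 128;          // 128 MB default from the text

    /** Size a store must exceed before a split is considered. */
    static long sizeToCheck(long maxFileSize, long flushSize, int regionCount) {
        if (regionCount == 0) {
            return maxFileSize; // assumption: fall back to the upper bound
        }
        return Math.min(maxFileSize, flushSize * regionCount * regionCount);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 8, 9}) {
            long mb = sizeToCheck(MAX_FILE_SIZE, FLUSH_SIZE, n) / (1024 * 1024);
            System.out.println(n + " region(s): " + mb + " MB");
        }
    }
}
```

With the defaults this gives 128 MB for one region, 512 MB for two and 8192 MB for eight. From nine regions on, the 10 GB cap applies, because 128 MB * 81 is already larger than 10 GB.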

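Analysis of the HRegion.flushcache flow

This method runs the flush, calling the coprocessor's preFlush method before the flush and its postFlush method afterwards.

Before the flush, writestate.flushing is set to true to mark the region as flushing; it is set back to false when the flush completes.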

public boolean flushcache() throws IOException {
  // fail-fast instead of waiting on the lock
  // Check whether the region is closing. Returning false means no compaction.
  if (this.closing.get()) {
    LOG.debug("Skipping flush on " + this + " because closing");
    return false;
  }
  MonitoredTask status = TaskMonitor.get().createStatus("Flushing " + this);
  status.setStatus("Acquiring readlock on region");
  // block waiting for the lock for flushing cache
  lock.readLock().lock();
  try {
    // If the region has already been closed, skip the flush.
    // Returning false means no compaction.
    if (this.closed.get()) {
      LOG.debug("Skipping flush on " + this + " because closed");
      status.abort("Skipped: closed");
      return false;
    }
    // Run the coprocessor's pre-flush hook.
    if (coprocessorHost != null) {
      status.setStatus("Running coprocessor pre-flush hooks");
      coprocessorHost.preFlush();
    }
    if (numMutationsWithoutWAL.get() > 0) {
      numMutationsWithoutWAL.set(0);
      dataInMemoryWithoutWAL.set(0);
    }
    synchronized (writestate) {
      // Mark the region as flushing.
      if (!writestate.flushing && writestate.writesEnabled) {
        this.writestate.flushing = true;
      } else {
        // .................... (part of the code is not shown here)
        // The region is already flushing or is read-only, so skip the flush.
        // Returning false means no compaction.
        return false;
      }
    }

    try {
      // Run the flush on the memstore of every store in the region.
      // The boolean result says whether a compaction is needed.
      boolean result = internalFlushcache(status);
      // Run the coprocessor's post-flush hook.
      if (coprocessorHost != null) {
        status.setStatus("Running post-flush coprocessor hooks");
        coprocessorHost.postFlush();
      }
      status.markComplete("Flush successful");
      return result;
    } finally {
      synchronized (writestate) {
        // Set flushing back to false to mark the end of the flush.
        writestate.flushing = false;
        // Clear the region's flush-requested flag.
        this.writestate.flushRequested = false;
        // Wake up all waiting update threads.
        writestate.notifyAll();
      }
    }
  } finally {
    lock.readLock().unlock();
    status.cleanup();
  }
}
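The writestate handling in flushcache follows a common guard pattern: claim a flag under a monitor, do the work outside it, and always clear the flag and notify waiters in a finally block. The sketch below shows only that pattern. WriteStateSketch is a hypothetical class, and unlike flushcache (which reports whether a compaction is needed) its flush method returns whether the work actually ran.

```java
/** Hypothetical sketch of the flushing-flag guard used around a flush. */
class WriteStateSketch {
    private final Object writestate = new Object();
    private boolean flushing = false;
    private boolean writesEnabled = true;

    /** Runs work unless a flush is already in progress or writes are disabled. */
    boolean flush(Runnable work) {
        synchronized (writestate) {
            if (flushing || !writesEnabled) {
                return false; // already flushing or read-only: skip
            }
            flushing = true;
        }
        try {
            work.run();
            return true;
        } finally {
            synchronized (writestate) {
                flushing = false;
                writestate.notifyAll(); // wake threads waiting for the flush to end
            }
        }
    }

    void setWritesEnabled(boolean enabled) {
        synchronized (writestate) {
            writesEnabled = enabled;
        }
    }
}
```

Because the flag is cleared in finally, a failed flush cannot leave the region permanently marked as flushing.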

The flushcache method calls this method, which in turn calls one of its overloads:

protected boolean internalFlushcache(MonitoredTask status)
    throws IOException {
  return internalFlushcache(this.log, -1, status);
}

This overload performs the flush. It is reached through flushcache and returns whether a compaction is needed.

protected boolean internalFlushcache(
    final HLog wal, final long myseqid, MonitoredTask status)
    throws IOException {
  if (this.rsServices != null && this.rsServices.isAborted()) {
    // Don't flush when server aborting, it's unsafe
    throw new IOException("Aborting flush because server is abortted...");
  }
  // Record the current system time as the flush start time, used to compute
  // how long the flush takes.
  final long startTime = EnvironmentEdgeManager.currentTimeMillis();
  // Clear flush flag.
  // If nothing to flush, return and avoid logging start/stop flush.
  // If the memstore is empty, return false without flushing.
  if (this.memstoreSize.get() <= 0) {
    return false;
  }
  if (LOG.isDebugEnabled()) {
    LOG.debug("Started memstore flush for " + this +
        ", current region memstore size " +
        StringUtils.humanReadableInt(this.memstoreSize.get()) +
        ((wal != null) ? "" : "; wal is null, using passed sequenceid=" + myseqid));
  }

  // Stop updates while we snapshot the memstore of all stores. We only have
  // to do this for a moment. Its quick. The subsequent sequence id that
  // goes into the HLog after we've flushed all these snapshots also goes
  // into the info file that sits beside the flushed files.
  // We also set the memstore size to zero here before we allow updates
  // again so its value will represent the size of the updates received
  // during the flush
  MultiVersionConsistencyControl.WriteEntry w = null;
  // We have to take a write lock during snapshot, or else a write could
  // end up in both snapshot and memstore (makes it difficult to do atomic
  // rows then)
  status.setStatus("Obtaining lock to block concurrent updates");
  // block waiting for the lock for internal flush
  this.updatesLock.writeLock().lock();
  long flushsize = this.memstoreSize.get();
  status.setStatus("Preparing to flush by snapshotting stores");
  List<StoreFlushContext> storeFlushCtxs = new ArrayList<StoreFlushContext>(stores.size());
  long flushSeqId = -1L;

  try {
    // Record the mvcc for all transactions in progress.
    // Create a MultiVersionConsistencyControl.WriteEntry whose write number is
    // mvcc's ++memstoreWrite, and add the WriteEntry to mvcc's writeQueue.
    w = mvcc.beginMemstoreInsert();
    // Take and remove WriteEntry instances from the writeQueue, read their write
    // numbers, assign the largest write number (the last one) to memstoreRead,
    // and wake up readWaiters (mvcc.waitForRead(w) waits for this wake-up).
    mvcc.advanceMemstore(w);
    if (wal != null) {
      // Remove this region's unflushed seqid (the largest seqid after appending
      // edits to the log) from the wal's oldestUnflushedSeqNums, and add that
      // seqid to oldestFlushingSeqNums.
      // The flush seqid is obtained by adding one to the logSeqNum of the wal
      // (FSHLog). logSeqNum is initialized from the region seqid returned by
      // openRegion, which is the largest seqid of all regions on the current RS.
      // Each append to the hlog also increments logSeqNum by one and uses the
      // new value as the hlog entry's seqid.
      Long startSeqId = wal.startCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
      if (startSeqId == null) {
        status.setStatus("Flush will not be started for [" + this.getRegionInfo().getEncodedName()
            + "] - WAL is going away");
        return false;
      }
      flushSeqId = startSeqId.longValue();
    } else {
      flushSeqId = myseqid;
    }
    for (Store s : stores.values()) {
      // For every store in the region, create an HStore.StoreFlusherImpl instance.
      storeFlushCtxs.add(s.createFlushContext(flushSeqId));
    }
    // prepare flush (take a snapshot)
    for (StoreFlushContext flush : storeFlushCtxs) {
      // For every store in the region, copy the memstore's kvset into the
      // memstore snapshot and clear the kvset, then copy the memstore snapshot
      // into the HStore's snapshot.
      flush.prepare();
    }
  } finally {
    this.updatesLock.writeLock().unlock();
  }

  String s = "Finished memstore snapshotting " + this +
      ", syncing WAL and waiting on mvcc, flushsize=" + flushsize;
  status.setStatus(s);
  if (LOG.isTraceEnabled()) LOG.trace(s);
  // sync unflushed WAL changes when deferred log sync is enabled
  // see HBASE-8208 for details
  if (wal != null && !shouldSyncLog()) {
    // Write the log entries held in the wal to HDFS.
    wal.sync();
  }
  // wait for all in-progress transactions to commit to HLog before
  // we can start the flush. This prevents
  // uncommitted transactions from being written into HFiles.
  // We have to block before we start the flush, otherwise keys that
  // were removed via a rollbackMemstore could be written to Hfiles.
  // Wait until mvcc's writeQueue has been processed and the largest memstoreRead
  // value is known. The thread waits until mvcc.advanceMemstore(w) finishes and
  // wakes it up.
  mvcc.waitForRead(w);
  s = "Flushing stores of " + this;
  status.setStatus(s);
  if (LOG.isTraceEnabled()) LOG.trace(s);
  // Any failure from here on out will be catastrophic requiring server
  // restart so hlog content can be replayed and put back into the memstore.
  // Otherwise, the snapshot content while backed up in the hlog, it will not
  // be part of the current running servers state.
  boolean compactionRequested = false;

  try {
    // A. Flush memstore to all the HStores.
    // Keep running vector of all store files that includes both old and the
    // just-made new flush store file. The new flushed file is still in the
    // tmp directory.
    for (StoreFlushContext flush : storeFlushCtxs) {
      // For every store in the region, call HStore.flushCache to flush the data
      // in the store's snapshot to an hfile, using the latest seqid obtained
      // from the wal.
      // hbase.hstore.flush.retries.number sets the number of retries when a
      // flush fails (default 10).
      // hbase.server.pause sets the interval between retries (default 1000 ms).
      // The flusher for each store is configured by
      // hbase.hstore.defaultengine.storeflusher.class (default DefaultStoreFlusher).
      // The StoreEngine of each HStore is configured by hbase.hstore.engine.class
      // (default DefaultStoreEngine).
      // A StoreFile.Writer is created whose path is a file with a UUID name in
      // the region's .tmp directory.
      // The store flusher's flushSnapshot method is called and returns the path
      // of the flushed hfile in the .tmp directory.
      // The file is validated (it is valid if StoreFile.createReader succeeds).
      // The kvs of the memstore are written to this file.
      // The largest seqid at flush time is written to the hfile's metadata (fileinfo).
      // The temporary hfile is added to the tempFiles list of the
      // HStore.StoreFlusherImpl instance, where it waits for
      // HStore.StoreFlusherImpl.commit to be called.
      flush.flushCache(status);
    }
    // Switch snapshot (in memstore) -> new hfile (thus causing
    // all the store scanners to reset/reseek).
    for (StoreFlushContext flush : storeFlushCtxs) {
      // HStore.StoreFlusherImpl.commit moves the newly flushed hfile from the
      // .tmp directory to the column family directory.
      // It creates a StoreFile and Reader for the hfile and adds the StoreFile
      // to the HStore's storefiles list.
      // It clears HStore.memstore.snapshot.
      // The compaction policy configured by
      // hbase.hstore.defaultengine.compactionpolicy.class (default
      // ExploringCompactionPolicy) checks whether a compaction is needed.
      // hbase.hstore.compaction.min sets the minimum number of files for a
      // compaction (default 3). Older versions use hbase.hstore.compactionThreshold;
      // the value cannot be less than 2.
      // A compaction is needed when the number of store files in the store,
      // minus the number currently being compacted, is greater than or equal to
      // that configured value.
      boolean needsCompaction = flush.commit(status);
      if (needsCompaction) {
        compactionRequested = true;
      }
    }
    storeFlushCtxs.clear();
    // Set down the memstore size by amount of flush.
    this.addAndGetGlobalMemstoreSize(-flushsize);
  } catch (Throwable t) {
    // An exception here means that the snapshot was not persisted.
    // The hlog needs to be replayed so its content is restored to memstore.
    // Currently, only a server restart will do this.
    // We used to only catch IOEs but its possible that we'd get other
    // exceptions -- e.g. HBASE-659 was about an NPE -- so now we catch
    // all and sundry.
    if (wal != null) {
      wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
    }
    DroppedSnapshotException dse = new DroppedSnapshotException("region: " +
        Bytes.toStringBinary(getRegionName()));
    dse.initCause(t);
    status.abort("Flush failed: " + StringUtils.stringifyException(t));
    throw dse;
  }

  // If we get to here, the HStores have been written.
  if (wal != null) {
    // Remove this region's previous flush seqid from FSHLog.oldestFlushingSeqNums.
    wal.completeCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
  }
  // Record latest flush time
  // Update the region's last flush time.
  this.lastFlushTime = EnvironmentEdgeManager.currentTimeMillis();
  // Update the last flushed sequence id for region
  if (this.rsServices != null) {
    // Set completeSequenceId to the wal seqid of the most recent flush.
    completeSequenceId = flushSeqId;
  }
  // C. Finally notify anyone waiting on memstore to clear:
  // e.g. checkResources().
  synchronized (this) {
    notifyAll(); // FindBugs NN_NAKED_NOTIFY
  }
  long time = EnvironmentEdgeManager.currentTimeMillis() - startTime;
  long memstoresize = this.memstoreSize.get();
  String msg = "Finished memstore flush of ~" +
      StringUtils.humanReadableInt(flushsize) + "/" + flushsize +
      ", currentsize=" +
      StringUtils.humanReadableInt(memstoresize) + "/" + memstoresize +
      " for region " + this + " in " + time + "ms, sequenceid=" + flushSeqId +
      ", compaction requested=" + compactionRequested +
      ((wal == null) ? "; wal=null" : "");
  LOG.info(msg);
  status.setStatus(msg);
  this.recentFlushes.add(new Pair<Long, Long>(time / 1000, flushsize));
  // Return whether a compaction is needed.
  return compactionRequested;
}
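The core of internalFlushcache is a two-phase pattern: take a snapshot while updates are briefly blocked by a write lock, then do the slow persistence outside the lock. The sketch below models only that pattern, plus the "file count reaches the minimum" compaction check. SnapshotFlushSketch and all of its members are hypothetical, store files are plain in-memory lists, and the MVCC, WAL and multi-store handling are left out. It also assumes a single flusher thread, which in HBase is what the writestate.flushing flag guarantees.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch: snapshot under a write lock, persist outside it. */
class SnapshotFlushSketch {
    private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
    private List<String> kvset = new ArrayList<>();                  // live memstore
    private final List<List<String>> storeFiles = new ArrayList<>(); // stand-in for hfiles

    /** Writers share the read lock, so they only block while a snapshot is taken. */
    void put(String kv) {
        updatesLock.readLock().lock();
        try {
            synchronized (this) {
                kvset.add(kv);
            }
        } finally {
            updatesLock.readLock().unlock();
        }
    }

    /** Returns true if the store file count has reached minFilesToCompact. */
    boolean flush(int minFilesToCompact) {
        List<String> snapshot;
        updatesLock.writeLock().lock(); // block updates only for the swap
        try {
            if (kvset.isEmpty()) {
                return false;           // nothing to flush
            }
            snapshot = kvset;
            kvset = new ArrayList<>();
        } finally {
            updatesLock.writeLock().unlock();
        }
        storeFiles.add(snapshot);       // slow part runs without blocking writers
        return storeFiles.size() >= minFilesToCompact;
    }

    int memstoreCount() {
        updatesLock.readLock().lock();
        try {
            synchronized (this) {
                return kvset.size();
            }
        } finally {
            updatesLock.readLock().unlock();
        }
    }

    int storeFileCount() {
        return storeFiles.size();
    }
}
```

Swapping the list reference instead of copying keeps the time spent under the write lock constant, which matches the source comment that updates are stopped only for a moment.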

Flush when a region's MemStore reaches the configured size

This kind of flush is requested when the region's memstoreSize reaches the configured upper limit.

It is initiated through MemStoreFlusher.FlushHandler.run --> flushRegion(final FlushRegionEntry fqe).

private boolean flushRegion(final FlushRegionEntry fqe) {
  HRegion region = fqe.region;
  // The region is not a meta region and one of its stores has reached the store
  // file count configured by hbase.hstore.blockingStoreFiles (default 7).
  if (!region.getRegionInfo().isMetaRegion() &&
      isTooManyStoreFiles(region)) {
    // Check whether the flush request has waited longer than the time configured
    // by hbase.hstore.blockingWaitTime (default 90000 ms). If so, log a message.
    if (fqe.isMaximumWait(this.blockingWaitTime)) {
      LOG.info("Waited " + (System.currentTimeMillis() - fqe.createTime) +
          "ms on a compaction to clean up 'too many store files'; waited " +
          "long enough... proceeding with flush of " +
          region.getRegionNameAsString());
    } else {
      // The flush request has not yet waited for the maximum acceptable time.
      // The check below tests whether it has been requeued before.
      // The flushQueue is ordered by the expiry time of each FlushRegionEntry.
      // By default this is first in, first out, unless FlushRegionEntry.requeue
      // was called to set an expiry time explicitly.

      // If this is first time we've been put off, then emit a log message.
      if (fqe.getRequeueCount() <= 0) {
        // Note: We don't impose blockingStoreFiles constraint on meta regions
        LOG.warn("Region " + region.getRegionNameAsString() + " has too many " +
            "store files; delaying flush up to " + this.blockingWaitTime + "ms");
        // Check whether a split request is needed. If so, issue it; otherwise
        // issue a compaction request.
        if (!this.server.compactSplitThread.requestSplit(region)) {
          try {
            // Issue a compaction request because the store has too many files.
            // Whether compaction is performed can be controlled when creating
            // the table through COMPACTION_ENABLED, which accepts TRUE or FALSE.
            this.server.compactSplitThread.requestSystemCompaction(
                region, Thread.currentThread().getName());
          } catch (IOException e) {
            LOG.error(
                "Cache flush failed for region " + Bytes.toStringBinary(region.getRegionName()),
                RemoteExceptionHandler.checkIOException(e));
          }
        }
      }
      // Put back on the queue. Have it come back out of the queue
      // after a delay of this.blockingWaitTime / 100 ms.
      // Requeue the current flush request in flushQueue so that it runs again
      // after 900 ms by default.
      this.flushQueue.add(fqe.requeue(this.blockingWaitTime / 100));
      // Tell a lie, it's not flushed but it's ok
      return true;
    }
  }
  // Run the flush with the global flush argument set to false, meaning the
  // memstoreSize reached the configured upper limit. The flow is not repeated
  // here; see the analysis of the MemStoreFlusher.flushRegion flow (global).
  return flushRegion(region, false);
}
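The deferral step can be sketched with java.util.concurrent.DelayQueue, which only hands out entries whose delay has expired. This is a minimal model under assumptions: RequeueSketch, Entry and flushOrDefer are hypothetical names, and the split and compaction requests made on the first deferral are left out.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of deferring a flush request with a delay queue. */
class RequeueSketch {
    static final class Entry implements Delayed {
        final String region;
        final long createTime = System.currentTimeMillis();
        private long whenToExpire = createTime; // new entries are due immediately
        private int requeueCount = 0;

        Entry(String region) {
            this.region = region;
        }

        /** True once the entry has waited longer than maxWaitMs since creation. */
        boolean isMaximumWait(long maxWaitMs) {
            return System.currentTimeMillis() - createTime > maxWaitMs;
        }

        /** Pushes the expiry delayMs into the future and counts the requeue. */
        Entry requeue(long delayMs) {
            whenToExpire = System.currentTimeMillis() + delayMs;
            requeueCount++;
            return this;
        }

        int getRequeueCount() {
            return requeueCount;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(whenToExpire - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    /** Returns true if the flush should run now, false if the entry was deferred. */
    static boolean flushOrDefer(DelayQueue<Entry> queue, Entry e,
                                boolean tooManyStoreFiles, long blockingWaitTimeMs) {
        if (tooManyStoreFiles && !e.isMaximumWait(blockingWaitTimeMs)) {
            queue.add(e.requeue(blockingWaitTimeMs / 100)); // 90000 / 100 = 900 ms
            return false;
        }
        return true;
    }
}
```

With the default blockingWaitTime of 90000 ms, a deferred entry becomes visible to the handler again after 900 ms, so a region with too many store files is rechecked many times before the maximum wait forces the flush.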

0 0