Understanding Flink's Lightweight Asynchronous Snapshot Mechanism -- Flink 1.2 Source Code

Source: Internet. Editor: 程序博客网. Time: 2024/05/01 23:59

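As the previous article showed, the ABS (Asynchronous Barrier Snapshotting) algorithm is implemented mainly by blocking and releasing input channels in response to checkpoint barriers.

This article focuses on how ABS is implemented in the Flink 1.2 source code.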

1. CheckpointBarrierHandler

This interface lives in the package org.apache.flink.streaming.runtime.io and manages the barriers received from the input channels. It declares the following methods:

public interface CheckpointBarrierHandler {
    BufferOrEvent getNextNonBlocked() throws Exception;
    void registerCheckpointEventHandler(StatefulTask task);
    void cleanup() throws IOException;
    boolean isEmpty();
    long getAlignmentDurationNanos();
}

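Of these, getNextNonBlocked() is where the blocking and releasing around barriers is implemented.

Depending on the CheckpointingMode, Flink offers two checkpointing modes:

1. Exactly once
2. At least once

The default mode is EXACTLY_ONCE.

Flink provides one implementation class for each of the two modes:

1. BarrierBuffer (for Exactly Once)
2. BarrierTracker (for At Least Once)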
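To make the contrast between the two handlers concrete, here is a minimal sketch of the at-least-once idea: barriers are only counted per checkpoint ID and no channel is ever held back. The class AtLeastOnceTrackerSketch and its methods are hypothetical names for illustration, not Flink's BarrierTracker API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: counts barriers per checkpoint ID and never holds back data. */
class AtLeastOnceTrackerSketch {
    private final int numChannels;
    private final Map<Long, Integer> barrierCounts = new HashMap<>();

    AtLeastOnceTrackerSketch(int numChannels) {
        this.numChannels = numChannels;
    }

    /** Returns true when the barrier with this ID has arrived from every channel. */
    boolean onBarrier(long checkpointId) {
        int count = barrierCounts.merge(checkpointId, 1, Integer::sum);
        if (count == numChannels) {
            barrierCounts.remove(checkpointId);
            return true; // the checkpoint can be triggered now
        }
        return false;
    }

    /** Records always pass through, even while barriers are still missing. */
    boolean mayProcessRecord(int channelIndex) {
        return true;
    }
}
```

Because records from a channel that has already delivered its barrier are processed before the snapshot is taken, they can be replayed after a recovery, which is why this mode only guarantees at-least-once.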

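Because the paper emphasizes the blocking of input channels, that is, the Exactly Once implementation, we also focus on the BarrierBuffer class here.

2. The BarrierBuffer class

Let us first review the pseudocode of this algorithm from the paper discussed in the previous article: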

[Figure: pseudocode of the ABS algorithm]

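The core idea: as soon as an input channel receives a barrier, it is blocked. The task then checks whether barriers have arrived from all input channels. If so, it broadcasts the barrier downstream, triggers the checkpoint of this task, and releases the blocked channels.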
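The alignment step described above can be sketched in a few lines of plain Java. This is a simplified illustration, not Flink code: AlignmentSketch is a hypothetical class, and "triggering the checkpoint" is reduced to incrementing a counter.

```java
import java.util.Arrays;

/** Hypothetical sketch of barrier alignment for one task with several input channels. */
class AlignmentSketch {
    private final boolean[] blocked;
    private int barriersReceived;
    private int checkpointsTriggered;

    AlignmentSketch(int numChannels) {
        this.blocked = new boolean[numChannels];
    }

    /** Called when a barrier arrives on the given channel. */
    void onBarrier(int channel) {
        if (blocked[channel]) {
            throw new IllegalStateException("channel " + channel + " already delivered its barrier");
        }
        blocked[channel] = true; // block the channel immediately
        barriersReceived++;
        if (barriersReceived == blocked.length) {
            checkpointsTriggered++;      // stand-in for: broadcast the barrier and snapshot state
            Arrays.fill(blocked, false); // release all channels
            barriersReceived = 0;
        }
    }

    boolean isBlocked(int channel) {
        return blocked[channel];
    }

    int checkpointsTriggered() {
        return checkpointsTriggered;
    }
}
```

With two channels, the first barrier only blocks its channel; the second one completes the alignment, triggers the checkpoint, and unblocks both.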

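In practice, to avoid back-pressuring the input streams, BarrierBuffer does not really block the stream. Instead, it uses a BufferSpiller to buffer the data that arrives on the channel after the barrier. Once the channel is released, the data is read back from the buffer and processing continues.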
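The following sketch illustrates "buffer instead of block" under simplifying assumptions: an in-memory ArrayDeque stands in for the BufferSpiller, records are plain strings, and buffered records are replayed in one go right after the checkpoint marker. BufferingSketch and its methods are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: records from blocked channels are set aside, then replayed. */
class BufferingSketch {
    private final boolean[] blocked;
    private final ArrayDeque<String> buffered = new ArrayDeque<>(); // stand-in for BufferSpiller
    private final List<String> processed = new ArrayList<>();
    private int barriersReceived;

    BufferingSketch(int numChannels) {
        this.blocked = new boolean[numChannels];
    }

    void onRecord(int channel, String record) {
        if (blocked[channel]) {
            buffered.add(record);  // channel is "blocked": keep the record for later
        } else {
            processed.add(record); // channel is open: process right away
        }
    }

    void onBarrier(int channel) {
        blocked[channel] = true;
        if (++barriersReceived == blocked.length) {
            processed.add("<checkpoint>"); // all barriers seen: the snapshot happens here
            Arrays.fill(blocked, false);
            barriersReceived = 0;
            while (!buffered.isEmpty()) {  // replay what was held back
                processed.add(buffered.poll());
            }
        }
    }

    List<String> processed() {
        return processed;
    }
}
```

The point of the design is that the network channel keeps draining while the task waits for the remaining barriers; only the order in which records reach the operator is affected.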

Now let us look at the concrete implementation of this class:

public class BarrierBuffer implements CheckpointBarrierHandler {

    private static final Logger LOG = LoggerFactory.getLogger(BarrierBuffer.class);

    /** The gate that the buffer draws its input from */
    private final InputGate inputGate; // one InputGate per task; it represents all input of the task (possibly from several input channels)

    /** Flags that indicate whether a channel is currently blocked/buffered */
    private final boolean[] blockedChannels; // marks for each input channel whether it is blocked (that is, buffered)

    /** The total number of channels that this buffer handles data from */
    private final int totalNumberOfInputChannels; // number of input channels, obtained from the InputGate

    /** The utility to write blocked data to a file channel */
    private final BufferSpiller bufferSpiller; // writes the data of blocked input channels to a buffer

    /** The pending blocked buffer/event sequences. Must be consumed before requesting
     * further data from the input gate. */
    private final ArrayDeque<BufferSpiller.SpilledBufferOrEventSequence> queuedBuffered; // previously buffered sequences that still have to be consumed

    /** The maximum number of bytes that may be buffered before an alignment is broken. -1 means unlimited */
    private final long maxBufferedBytes; // maximum number of bytes that may be buffered, -1 means no limit

    /** The sequence of buffers/events that has been unblocked and must now be consumed
     * before requesting further data from the input gate */
    private BufferSpiller.SpilledBufferOrEventSequence currentBuffered; // the buffered data currently being replayed

    /** Handler that receives the checkpoint notifications */
    private StatefulTask toNotifyOnCheckpoint; // notified when a checkpoint should be taken

    /** The ID of the checkpoint for which we expect barriers */
    private long currentCheckpointId = -1L; // ID of the current checkpoint

    /** The number of received barriers (= number of blocked/buffered channels)
     * IMPORTANT: A canceled checkpoint must always have 0 barriers */
    private int numBarriersReceived; // number of barriers received; equals the number of buffered channels and is 0 for a canceled checkpoint

    /** The number of already closed channels */
    private int numClosedChannels; // number of channels that are already closed

    /** The number of bytes in the queued spilled sequences */
    private long numQueuedBytes; // bytes held in the queued spilled sequences

    /** The timestamp as in {@link System#nanoTime()} at which the last alignment started */
    private long startOfAlignmentTimestamp; // timestamp at which the last alignment started

    /** The time (in nanoseconds) that the latest alignment took */
    private long latestAlignmentDurationNanos; // duration of the most recent alignment

    /** Flag to indicate whether we have drawn all available input */
    private boolean endOfStream; // true once all available input has been consumed

    /**
     * Creates a new checkpoint stream aligner.
     *
     * <p>There is no limit to how much data may be buffered during an alignment.
     *
     * @param inputGate The input gate to draw the buffers and events from.
     * @param ioManager The I/O manager that gives access to the temp directories.
     *
     * @throws IOException Thrown, when the spilling to temp files cannot be initialized.
     */
    public BarrierBuffer(InputGate inputGate, IOManager ioManager) throws IOException {
        this (inputGate, ioManager, -1);
    }

    /**
     * Creates a new checkpoint stream aligner.
     *
     * <p>The aligner will allow only alignments that buffer up to the given number of bytes.
     * When that number is exceeded, it will stop the alignment and notify the task that the
     * checkpoint has been cancelled.
     *
     * @param inputGate The input gate to draw the buffers and events from.
     * @param ioManager The I/O manager that gives access to the temp directories.
     * @param maxBufferedBytes The maximum bytes to be buffered before the checkpoint aborts.
     *
     * @throws IOException Thrown, when the spilling to temp files cannot be initialized.
     */
    public BarrierBuffer(InputGate inputGate, IOManager ioManager, long maxBufferedBytes) throws IOException {
        checkArgument(maxBufferedBytes == -1 || maxBufferedBytes > 0);

        this.inputGate = inputGate;
        this.maxBufferedBytes = maxBufferedBytes;
        this.totalNumberOfInputChannels = inputGate.getNumberOfInputChannels();
        this.blockedChannels = new boolean[this.totalNumberOfInputChannels];

        this.bufferSpiller = new BufferSpiller(ioManager, inputGate.getPageSize());
        this.queuedBuffered = new ArrayDeque<BufferSpiller.SpilledBufferOrEventSequence>();
    }

The constructor takes an InputGate. Every task has one InputGate, whose job is to handle all input flowing into that task; this input may come from multiple partitions.

Next, the most important method of BarrierBuffer: getNextNonBlocked.

getNextNonBlocked

// ------------------------------------------------------------------------
//  Buffer and barrier handling
// ------------------------------------------------------------------------

@Override
public BufferOrEvent getNextNonBlocked() throws Exception {
    while (true) {
        // process buffered BufferOrEvents before grabbing new ones
        BufferOrEvent next; // a buffer carries data, an event carries a signal; a barrier is an event
        if (currentBuffered == null) {
            // nothing buffered to replay: fetch the next BufferOrEvent from the inputGate
            next = inputGate.getNextBufferOrEvent();
        }
        else {
            // otherwise take the next BufferOrEvent from the currentBuffered sequence
            next = currentBuffered.getNext();
            if (next == null) { // the current buffered sequence has been fully processed
                // clean up currentBuffered, then continue with the data in queuedBuffered
                completeBufferedSequence();
                // recursive call: if currentBuffered is now null, queuedBuffered was empty as well;
                // if it is not null, the next queued sequence still has to be processed
                return getNextNonBlocked();
            }
        }

        if (next != null) {
            if (isBlocked(next.getChannelIndex())) { // the channel is still blocked: keep adding to the buffer
                // if the channel is blocked, we just store the BufferOrEvent
                bufferSpiller.add(next);
                checkSizeLimit();
            }
            else if (next.isBuffer()) { // the channel is not blocked and the element is data: return it
                return next;
            }
            else if (next.getEvent().getClass() == CheckpointBarrier.class) { // this channel has received a barrier
                if (!endOfStream) {
                    // process barriers only if there is a chance of the checkpoint completing
                    processBarrier((CheckpointBarrier) next.getEvent(), next.getChannelIndex());
                }
            }
            else if (next.getEvent().getClass() == CancelCheckpointMarker.class) { // a barrier carrying a cancel marker
                processCancellationBarrier((CancelCheckpointMarker) next.getEvent());
            }
            else {
                if (next.getEvent().getClass() == EndOfPartitionEvent.class) { // all data of this partition has been consumed
                    processEndOfPartition(); // increments numClosedChannels and releases the block
                }
                return next;
            }
        }
        else if (!endOfStream) { // next is null and the end was not yet recorded: mark it, release all channels, reset state
            // end of input stream. stream continues with the buffered data
            endOfStream = true;
            releaseBlocksAndResetBarriers();
            return getNextNonBlocked();
        }
        else {
            // final end of both input and buffered data
            return null;
        }
    }
}

In this method, as soon as a barrier is received, processBarrier() is called; this is the core of the logic.

processBarrier

private void processBarrier(CheckpointBarrier receivedBarrier, int channelIndex) throws Exception {
    final long barrierId = receivedBarrier.getId();

    // fast path for single channel cases
    if (totalNumberOfInputChannels == 1) { // only one channel in total, so this operator has a single input
        if (barrierId > currentCheckpointId) { // a barrier ID above the current checkpoint ID means a new barrier
            // new checkpoint
            currentCheckpointId = barrierId; // the barrier ID becomes the current checkpoint ID
            notifyCheckpoint(receivedBarrier); // trigger the checkpoint
        }
        return;
    }

    // -- general code path for multiple input channels --

    if (numBarriersReceived > 0) { // some barrier has already been received
        // this is only true if some alignment is already in progress and was not canceled

        if (barrierId == currentCheckpointId) { // does the barrier ID match the current checkpoint ID?
            // regular case
            onBarrier(channelIndex); // if it matches, block this channel
        }
        else if (barrierId > currentCheckpointId) { // the current checkpoint is outdated: skip it
            // we did not complete the current checkpoint, another started before
            LOG.warn("Received checkpoint barrier for checkpoint {} before completing current checkpoint {}. " +
                    "Skipping current checkpoint.", barrierId, currentCheckpointId);

            // let the task know we are not completing this
            notifyAbort(currentCheckpointId, new CheckpointDeclineSubsumedException(barrierId)); // tell the task to abort the current checkpoint

            // abort the current checkpoint
            releaseBlocksAndResetBarriers(); // release all blocked channels

            // begin the new checkpoint
            beginNewAlignment(barrierId, channelIndex); // start the new checkpoint for this barrier ID
        }
        else {
            // ignore trailing barrier from an earlier checkpoint (obsolete now)
            return;
        }
    }
    else if (barrierId > currentCheckpointId) { // first barrier seen and its ID is above the current checkpoint ID: a new barrier
        // first barrier of a new checkpoint
        beginNewAlignment(barrierId, channelIndex); // start the new checkpoint for this barrier ID
    }
    else {
        // either the current checkpoint was canceled (numBarriers == 0) or
        // this barrier is from an old subsumed checkpoint
        return;
    }

    // check if we have all barriers - since canceled checkpoints always have zero barriers
    // this can only happen on a non canceled checkpoint
    if (numBarriersReceived + numClosedChannels == totalNumberOfInputChannels) { // barriers have arrived from all channels
        // actually trigger checkpoint
        if (LOG.isDebugEnabled()) {
            LOG.debug("Received all barriers, triggering checkpoint {} at {}",
                    receivedBarrier.getId(), receivedBarrier.getTimestamp());
        }

        releaseBlocksAndResetBarriers(); // release all blocked channels
        notifyCheckpoint(receivedBarrier); // trigger the checkpoint
    }
}

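One change in Flink 1.2 is a check for whether the current operator has only a single input channel and has received a new barrier. If so, a fast path is taken and the checkpoint is triggered directly via notifyCheckpoint.

Otherwise, with multiple input channels (totalNumberOfInputChannels is obtained from the InputGate), the checkpoint is triggered via notifyCheckpoint only after the latest barrier has arrived from all input channels. Until then, each input channel that has delivered its barrier is blocked, which in practice means its subsequent data is buffered.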
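The branching in processBarrier can be condensed into a pure function that classifies an incoming barrier. This is only a reading aid: BarrierDecisionSketch, classify, and the Action names are hypothetical, and the final "all barriers received" check is left out.

```java
/** Hypothetical sketch of how a barrier ID is classified against the current alignment. */
class BarrierDecisionSketch {

    enum Action { TRIGGER_IMMEDIATELY, BLOCK_CHANNEL, ABORT_AND_RESTART, BEGIN_ALIGNMENT, IGNORE }

    static Action classify(long barrierId, long currentCheckpointId,
                           int numBarriersReceived, int totalChannels) {
        if (totalChannels == 1) {
            // fast path: a single input never needs alignment
            return barrierId > currentCheckpointId ? Action.TRIGGER_IMMEDIATELY : Action.IGNORE;
        }
        if (numBarriersReceived > 0) {
            // an alignment is in progress
            if (barrierId == currentCheckpointId) {
                return Action.BLOCK_CHANNEL;     // regular case
            }
            if (barrierId > currentCheckpointId) {
                return Action.ABORT_AND_RESTART; // current checkpoint is subsumed by a newer one
            }
            return Action.IGNORE;                // trailing barrier of an older checkpoint
        }
        // no alignment in progress
        return barrierId > currentCheckpointId ? Action.BEGIN_ALIGNMENT : Action.IGNORE;
    }
}
```

For example, classify(6, 5, 1, 3) yields ABORT_AND_RESTART: a barrier for checkpoint 6 arrives while checkpoint 5 is still being aligned, so checkpoint 5 is skipped.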

notifyCheckpoint

private void notifyCheckpoint(CheckpointBarrier checkpointBarrier) throws Exception {
    if (toNotifyOnCheckpoint != null) {
        CheckpointMetaData checkpointMetaData =
                new CheckpointMetaData(checkpointBarrier.getId(), checkpointBarrier.getTimestamp());

        long bytesBuffered = currentBuffered != null ? currentBuffered.size() : 0L;
        checkpointMetaData
                .setBytesBufferedInAlignment(bytesBuffered)
                .setAlignmentDurationNanos(latestAlignmentDurationNanos);

        toNotifyOnCheckpoint.triggerCheckpointOnBarrier(checkpointMetaData);
    }
}

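toNotifyOnCheckpoint is a StatefulTask, the interface through which a task is notified of checkpoints; its triggerCheckpointOnBarrier method, called above, does the actual work.

The web UI now shows much more metadata in its checkpoint section, including detailed information on each checkpoint: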

[Figures: screenshots of the checkpoint section of the web UI]

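This includes the state size of each checkpoint, the checkpoint status, the completion time, and the duration. For every checkpoint, the details of each subtask are also available. This helps with managing and monitoring checkpoints and with tuning state.

4. Summary

In Flink, ABS defaults to Exactly Once, which requires alignment; the alignment algorithm consists of blocking and unblocking. Blocking and unblocking each have their own conditions.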
