Abstract
For any concurrent multi-threaded system, distributed or otherwise, the inter-thread messaging component is a critical piece. In Java, the JDK provides ArrayBlockingQueue, LinkedBlockingQueue, and LinkedTransferQueue. Disruptor (http://lmaxexchange.github.io/disruptor/) is well known for its high-performance inter-thread messaging, but it does not expose itself as a BlockingQueue. This blog introduces a new blocking queue built on Disruptor's ring buffer, along with benchmark results.
Why a BlockingQueue interface?
The BlockingQueue interface is widely used by existing code, and switching to Disruptor directly requires significant changes because Disruptor wants to control the whole thread scheduling. Second, Disruptor only calls back when an event arrives; the application never gets a chance to control behavior when the queue builds up, for example to apply proactive throttling. This blog introduces a BlockingQueue implementation on top of RingBuffer, with one limitation: the queue can only be consumed by a single consumer thread, while the producer side can be single- or multi-threaded. This fits the Actor pattern, which uses one blocking queue drained by one thread. The restriction exists because the consumer-side offset is hard to maintain with multiple consumer threads; multi-threaded consumers should use Disruptor's WorkerPool instead of a JDK Executor.
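Because the queue keeps the standard BlockingQueue interface, an actor-style drain loop does not change at all when the backing queue is swapped. A minimal sketch of such a loop (using JDK's ArrayBlockingQueue here only as a stand-in, since SingleConsumerDisruptorQueue implements the same interface):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ActorLoop {
    // The single-consumer drain loop only depends on the BlockingQueue
    // interface, so any implementation can back it without code changes.
    static int drain(BlockingQueue<String> mailbox, int expected) throws InterruptedException {
        int handled = 0;
        while (handled < expected) {
            String msg = mailbox.take();   // blocks until a message is available
            handled++;                     // a real actor would dispatch msg here
        }
        return handled;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> mailbox = new ArrayBlockingQueue<>(1024);
        for (int i = 0; i < 10; i++) mailbox.offer("msg-" + i);
        System.out.println(drain(mailbox, 10)); // prints 10
    }
}
```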
Implementation
The source code is available on GitHub: https://github.com/xinglang/disruptorqueue/tree/master/disruptorqueue
Since this queue only supports one consumer, let's call it SingleConsumerDisruptorQueue. It holds a ring buffer and a sequence (consumedSeq) for the consumer; consumedSeq is the gating sequence of the ring buffer. There is also a knownPublishedSeq field that remembers the last known published sequence. Since this is a blocking queue, the wait strategy is BlockingWaitStrategy (the default).
private final RingBuffer<Event<T>> ringBuffer;
private final Sequence consumedSeq;
private final SequenceBarrier barrier;
private long knownPublishedSeq;

public SingleConsumerDisruptorQueue(int bufferSize, boolean singleProducer) {
    if (singleProducer) {
        ringBuffer = RingBuffer.createSingleProducer(new Factory<T>(),
                normalizeBufferSize(bufferSize));
    } else {
        ringBuffer = RingBuffer.createMultiProducer(new Factory<T>(),
                normalizeBufferSize(bufferSize));
    }
    consumedSeq = new Sequence();
    ringBuffer.addGatingSequences(consumedSeq);
    barrier = ringBuffer.newBarrier();
    long cursor = ringBuffer.getCursor();
    consumedSeq.set(cursor);
    knownPublishedSeq = cursor;
}
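The constructor passes the requested size through normalizeBufferSize, whose body is not shown above. Since RingBuffer requires a power-of-two size, it presumably rounds up; the actual implementation may differ, but a hypothetical version could look like this:

```java
public class BufferSizes {
    // Round the requested capacity up to the next power of two, since
    // RingBuffer rejects any other size. This is only a guess at what
    // normalizeBufferSize does; see the repository for the real code.
    static int normalize(int requested) {
        if (requested <= 1) return 1;
        // highestOneBit yields the largest power of two <= requested,
        // so shift left once when requested is not already a power of two.
        int highest = Integer.highestOneBit(requested);
        return (highest == requested) ? requested : highest << 1;
    }

    public static void main(String[] args) {
        System.out.println(normalize(262144)); // already a power of two -> 262144
        System.out.println(normalize(100000)); // rounds up -> 131072
    }
}
```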
Publishing just uses the ring buffer's publish mechanism. Inside the ring buffer, an event holder acts as a value holder for the item.
@Override
public boolean offer(T e) {
    long seq;
    try {
        seq = ringBuffer.tryNext();
    } catch (InsufficientCapacityException e1) {
        return false;
    }
    publish(e, seq);
    return true;
}

private void publish(T e, long seq) {
    Event<T> holder = ringBuffer.get(seq);
    holder.setValue(e);
    ringBuffer.publish(seq);
}
On the consume side there is an optimization that is only possible because there is a single consumer thread: each call to waitFor returns the last known published sequence, so as long as the consumer sequence is still below that cached value, the barrier's waitFor method does not need to be called at all.
@Override
public T take() throws InterruptedException {
    long l = consumedSeq.get() + 1;
    while (knownPublishedSeq < l) {
        try {
            knownPublishedSeq = barrier.waitFor(l);
        } catch (AlertException e) {
            throw new IllegalStateException(e);
        } catch (TimeoutException e) {
            throw new IllegalStateException(e);
        }
    }
    Event<T> eventHolder = ringBuffer.get(l);
    consumedSeq.incrementAndGet();
    return eventHolder.getValue();
}
Performance analysis
First of all, it inherits all the benefits of the ring buffer design:
- Avoids false sharing
- Pre-allocated ring buffer: no instances are created during publish/consume
- Fewer context switches: the consumer can process a batch of events without being interrupted
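The batching point above can be illustrated with the standard interface alone: after one blocking take, drainTo pulls everything already published in a single call, so the consumer wakes up once per batch instead of once per element. A sketch (ArrayBlockingQueue is used here only as a stand-in for any BlockingQueue):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BatchDrain {
    // Block for the first element, then grab whatever else is already
    // available without blocking again -- one wakeup handles a whole batch.
    static List<Integer> takeBatch(BlockingQueue<Integer> q) throws InterruptedException {
        List<Integer> batch = new ArrayList<>();
        batch.add(q.take());   // blocking wait for at least one element
        q.drainTo(batch);      // non-blocking drain of the rest
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> q = new ArrayBlockingQueue<>(64);
        for (int i = 0; i < 5; i++) q.offer(i);
        System.out.println(takeBatch(q).size()); // prints 5
    }
}
```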
Below is a benchmark comparing this queue against LinkedBlockingQueue, ArrayBlockingQueue, and LinkedTransferQueue. The benchmark ran on a bare-metal Ubuntu machine with 1 consumer thread and 1 to 4 producer threads; each round performs 32M put/take operations. The object being put is a constant string, so there is no GC overhead from object creation.
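The actual harness lives in the repository linked above; its shape can be approximated by a minimal single-producer throughput loop like the following sketch (class and method names here are illustrative, not the real harness). As in the post, a constant string payload avoids per-message allocation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MiniBench {
    static final String ITEM = "x"; // constant payload: no per-message allocation, no GC noise

    // Transfer `count` items from one producer thread to the calling
    // (consumer) thread and return the elapsed milliseconds.
    static long run(BlockingQueue<String> q, int count) throws InterruptedException {
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) q.put(ITEM);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        long start = System.nanoTime();
        producer.start();
        for (int i = 0; i < count; i++) q.take();
        producer.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        long ms = run(new ArrayBlockingQueue<>(262_144), 1_000_000);
        System.out.println("1M transfers in " + ms + " ms");
    }
}
```

Swapping the queue constructor (for SingleConsumerDisruptorQueue, LinkedBlockingQueue, etc.) is the only change needed per run, which is what makes the BlockingQueue interface convenient for this comparison.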
Single Producer benchmark
$ perf stat java -jar disruptortest.jar type=dbq
Producers :1, buffer size: 262144, batch:0
SingleConsumerDisruptorQueue transfer rate : 19890 per ms, Used 1687ms for 33554432

 Performance counter stats for 'java -jar disruptortest.jar type=dbq':

       3729.421847 task-clock                #    1.998 CPUs utilized
             1,891 context-switches          #    0.001 M/sec
                76 CPU-migrations            #    0.000 M/sec
             9,357 page-faults               #    0.003 M/sec
     9,434,280,791 cycles                    #    2.530 GHz                     [83.38%]
     5,489,619,603 stalled-cycles-frontend   #   58.19% frontend cycles idle    [83.35%]
     2,618,037,087 stalled-cycles-backend    #   27.75% backend cycles idle     [66.99%]
    10,797,968,145 instructions              #    1.14  insns per cycle
                                             #    0.51  stalled cycles per insn [83.55%]
     1,742,973,721 branches                  #  467.358 M/sec                   [83.28%]
        10,213,770 branch-misses             #    0.59% of all branches         [83.12%]

       1.866803438 seconds time elapsed

$ perf stat java -jar disruptortest.jar type=abq
Producers :1, buffer size: 262144, batch:0
ArrayBlockingQueue transfer rate : 2694 per ms, Used 12451ms for 33554432

 Performance counter stats for 'java -jar disruptortest.jar type=abq':

      22976.952946 task-clock                #    1.824 CPUs utilized
           232,766 context-switches          #    0.010 M/sec
                80 CPU-migrations            #    0.000 M/sec
            68,531 page-faults               #    0.003 M/sec
    58,643,663,103 cycles                    #    2.552 GHz                     [83.14%]
    51,767,105,241 stalled-cycles-frontend   #   88.27% frontend cycles idle    [83.32%]
    47,084,355,024 stalled-cycles-backend    #   80.29% backend cycles idle     [66.51%]
    12,035,035,540 instructions              #    0.21  insns per cycle
                                             #    4.30  stalled cycles per insn [83.44%]
     2,016,738,256 branches                  #   87.772 M/sec                   [83.56%]
        20,147,764 branch-misses             #    1.00% of all branches         [83.49%]

      12.596555382 seconds time elapsed

$ perf stat java -jar disruptortest.jar type=lbq
Producers :1, buffer size: 262144, batch:0
LinkedBlockingQueue transfer rate : 1132 per ms, Used 29632ms for 33554432

 Performance counter stats for 'java -jar disruptortest.jar type=lbq':

      58707.942294 task-clock                #    1.968 CPUs utilized
            82,377 context-switches          #    0.001 M/sec
                97 CPU-migrations            #    0.000 M/sec
           133,543 page-faults               #    0.002 M/sec
   151,825,969,348 cycles                    #    2.586 GHz                     [83.27%]
   139,833,905,165 stalled-cycles-frontend   #   92.10% frontend cycles idle    [83.40%]
   131,712,244,095 stalled-cycles-backend    #   86.75% backend cycles idle     [66.67%]
    10,997,843,405 instructions              #    0.07  insns per cycle
                                             #   12.71  stalled cycles per insn [83.26%]
     1,701,879,665 branches                  #   28.989 M/sec                   [83.31%]
        23,369,660 branch-misses             #    1.37% of all branches         [83.35%]

      29.830928757 seconds time elapsed

$ perf stat java -jar disruptortest.jar type=tq
Producers :1, buffer size: 262144, batch:0
LinkedTransferQueue transfer rate : 2139 per ms, Used 15685ms for 33554432

 Performance counter stats for 'java -jar disruptortest.jar type=tq':

     107428.492713 task-clock                #    6.737 CPUs utilized
            10,542 context-switches          #    0.000 M/sec
               100 CPU-migrations            #    0.000 M/sec
           245,909 page-faults               #    0.002 M/sec
   278,182,169,187 cycles                    #    2.589 GHz                     [83.33%]
   204,478,913,414 stalled-cycles-frontend   #   73.51% frontend cycles idle    [83.36%]
   164,497,727,638 stalled-cycles-backend    #   59.13% backend cycles idle     [66.73%]
    90,952,113,104 instructions              #    0.33  insns per cycle
                                             #    2.25  stalled cycles per insn [83.37%]
    32,522,385,525 branches                  #  302.735 M/sec                   [83.30%]
        57,227,684 branch-misses             #    0.18% of all branches         [83.28%]

      15.947024802 seconds time elapsed
Multiple Producer benchmark
$ perf stat java -jar disruptortest.jar type=dq producer=4
Producers :4, buffer size: 262144, batch:0
SingleConsumerDisruptorQueue transfer rate : 2859 per ms, Used 46941ms for 134217728

 Performance counter stats for 'java -jar disruptortest.jar type=dq producer=4':

     118905.839793 task-clock                #    2.523 CPUs utilized
         2,172,912 context-switches          #    0.018 M/sec
               280 CPU-migrations            #    0.000 M/sec
            28,697 page-faults               #    0.000 M/sec
   141,597,737,150 cycles                    #    1.191 GHz                     [83.18%]
   113,618,387,640 stalled-cycles-frontend   #   80.24% frontend cycles idle    [83.42%]
    96,562,209,060 stalled-cycles-backend    #   68.19% backend cycles idle     [66.86%]
    55,227,379,587 instructions              #    0.39  insns per cycle
                                             #    2.06  stalled cycles per insn [83.45%]
     9,312,400,407 branches                  #   78.317 M/sec                   [83.19%]
        64,375,263 branch-misses             #    0.69% of all branches         [83.35%]

      47.133747893 seconds time elapsed

$ perf stat java -jar disruptortest.jar type=abq producer=4
Producers :4, buffer size: 262144, batch:0
ArrayBlockingQueue transfer rate : 2047 per ms, Used 65546ms for 134217728

 Performance counter stats for 'java -jar disruptortest.jar type=abq producer=4':

      79345.046656 task-clock                #    1.208 CPUs utilized
         3,003,905 context-switches          #    0.038 M/sec
               594 CPU-migrations            #    0.000 M/sec
            77,227 page-faults               #    0.001 M/sec
   102,931,605,765 cycles                    #    1.297 GHz                     [83.10%]
    78,913,722,891 stalled-cycles-frontend   #   76.67% frontend cycles idle    [83.46%]
    65,701,179,927 stalled-cycles-backend    #   63.83% backend cycles idle     [66.99%]
    52,891,419,177 instructions              #    0.51  insns per cycle
                                             #    1.49  stalled cycles per insn [83.41%]
     9,307,141,741 branches                  #  117.300 M/sec                   [83.21%]
        79,855,221 branch-misses             #    0.86% of all branches         [83.23%]

      65.694123910 seconds time elapsed

$ perf stat java -jar disruptortest.jar type=lbq producer=4
Producers :4, buffer size: 262144, batch:0
LinkedBlockingQueue transfer rate : 2795 per ms, Used 48014ms for 134217728

 Performance counter stats for 'java -jar disruptortest.jar type=lbq producer=4':

     110080.375452 task-clock                #    2.284 CPUs utilized
         3,644,802 context-switches          #    0.033 M/sec
               597 CPU-migrations            #    0.000 M/sec
           136,440 page-faults               #    0.001 M/sec
   185,250,018,068 cycles                    #    1.683 GHz                     [83.46%]
   144,448,559,949 stalled-cycles-frontend   #   77.97% frontend cycles idle    [83.62%]
   118,250,468,418 stalled-cycles-backend    #   63.83% backend cycles idle     [66.28%]
    73,113,563,433 instructions              #    0.39  insns per cycle
                                             #    1.98  stalled cycles per insn [83.21%]
    12,028,209,235 branches                  #  109.268 M/sec                   [83.25%]
       129,234,077 branch-misses             #    1.07% of all branches         [83.40%]

      48.189813503 seconds time elapsed

$ perf stat java -jar disruptortest.jar type=tq producer=4
Producers :4, buffer size: 262144, batch:0
LinkedTransferQueue transfer rate : 1438 per ms, Used 93273ms for 134217728

 Performance counter stats for 'java -jar disruptortest.jar type=tq producer=4':

     761878.416668 task-clock                #    8.122 CPUs utilized
            71,371 context-switches          #    0.000 M/sec
               203 CPU-migrations            #    0.000 M/sec
           670,788 page-faults               #    0.001 M/sec
 1,976,200,012,808 cycles                    #    2.594 GHz                     [83.33%]
 1,584,264,715,610 stalled-cycles-frontend   #   80.17% frontend cycles idle    [83.34%]
 1,368,861,011,899 stalled-cycles-backend    #   69.27% backend cycles idle     [66.68%]
   487,816,405,509 instructions              #    0.25  insns per cycle
                                             #    3.25  stalled cycles per insn [83.34%]
   169,135,278,863 branches                  #  221.998 M/sec                   [83.33%]
       615,658,238 branch-misses             #    0.36% of all branches         [83.33%]

      93.798977802 seconds time elapsed
Conclusion
Building a blocking queue on Disruptor's RingBuffer is entirely feasible. In the single-producer/single-consumer case it can be 5x faster than the default JDK blocking queue implementations. With multiple producers it is much faster than ArrayBlockingQueue and LinkedTransferQueue; LinkedBlockingQueue achieves similar throughput, but the Disruptor-based queue has fewer context switches and a smaller memory footprint. The only limitation is that it supports just a single consumer thread. The benefit of a BlockingQueue implementation on top of RingBuffer is that it can be a drop-in replacement in existing code and gives the user more control through the BlockingQueue interface, whereas Disruptor's WorkerPool only lets the user register an event handler for callbacks.