An Introduction to Lock-Free Programming

Translator's note: As the document itself says at the end, this really is just a brief introduction, and it leans on many other linked resources. If you spot any translation mistakes, please point them out, thank you. Passages marked "Translator's note" are my own additions.
Original article: http://preshing.com/20120612/an-introduction-to-lock-free-programming

Lock-free programming is a challenge, not just because of the complexity of the task itself, but because of how difficult it can be to penetrate the subject in the first place.
I was fortunate in that my first introduction to lock-free programming was Bruce Dawson's excellent white paper, Lockless Programming Considerations (http://msdn.microsoft.com/en-us/library/windows/desktop/ee418650(v=vs.85).aspx). Like many others, I've had the occasion to follow Bruce's advice while writing and debugging lock-free code on the Xbox 360.
Since then, I've written a good deal of lock-free code, and I've collected a list of documents on the subject. At times, the information in one source may appear orthogonal to other sources: for instance, some material assumes sequential consistency, even though memory ordering is a central issue in lock-free programming. The new C++11 atomic library standard (http://en.cppreference.com/w/cpp/atomic) gives us a good tool for writing lock-free algorithms.
In this post, I'd like to re-introduce lock-free programming, starting with its definition, and then work through a few of the key concepts, using flowcharts to relate those concepts to each other, one step at a time. The minimum background assumed is that you already write multithreaded programs using mutexes, semaphores, events and other synchronization primitives.
What Is It?
People often describe lock-free programming as programming without mutexes, which are also referred to as locks (http://preshing.com/20111118/locks-arent-slow-lock-contention-is). That's true as far as it goes, but it's only part of the story. The generally accepted definition, based on academic literature, is a bit broader. At its essence, lock-free is a property used to describe some code, without saying too much about how that code was actually written.
Basically, if some part of your program satisfies the following conditions, then that part can rightfully be considered lock-free. Conversely, if a given part of the code doesn't satisfy these conditions, then that part is not lock-free.
[Flowchart: a decision tree for whether code qualifies as lock-free]


To walk through the flowchart:
1. Are you programming with multiple threads (or with interrupt handlers)? Yes.
2. Do those threads access shared memory? Yes.
3. Can the threads block each other, i.e., is there some way of scheduling the threads that would lock them up indefinitely? No.
If all three answers hold, the code is lock-free.
In this sense, the lock in lock-free does not refer directly to mutexes, but rather to the possibility of "locking up" the entire application in some way, whether it's deadlock, livelock, or even hypothetical thread scheduling decisions made by your worst enemy (think of the scheduler as an adversary). That last point sounds funny, but it's the key one. Shared mutexes are ruled out trivially, because as soon as one thread obtains a mutex, your adversarial scheduler could simply never schedule that thread again. Of course, real operating systems don't work that way; we're merely defining terms.
Here's a simple example which contains no mutex, but is still not lock-free. Initially, x = 0. As an exercise for the reader, consider how two threads could be scheduled in such a way that neither thread ever exits the loop.
while (x == 0)
{
    x = 1 - x;
}
Nobody expects a large application to be entirely lock-free. Typically, we identify a specific set of lock-free operations out of the whole codebase. For example, in a lock-free queue, there might be a handful of lock-free operations such as push, pop, and perhaps isEmpty.
Herlihy and Shavit, authors of The Art of Multiprocessor Programming, tend to express such operations as class methods, and offer a succinct definition of lock-free (see slide 150; the slides are viewable in Google Docs, and a copy is attached): "In an infinite execution, infinitely often some method call finishes." In other words, as long as the program is able to keep calling those lock-free operations, the number of completed calls keeps increasing, no matter what. It is algorithmically impossible for the system to lock up during those operations.
One important consequence of lock-free programming is that if you suspend a single thread, it will never prevent the other threads from making progress, as a group, through their own lock-free operations. This hints at the value of lock-free programming when writing interrupt handlers and real-time systems, where certain work must complete within a time limit, no matter what state the rest of the program is in.
A final precision: operations that are designed to block do not disqualify the algorithm. For example, a queue's pop operation may intentionally block when the queue is empty. The remaining codepaths can still be considered lock-free.

Lock-Free Programming Techniques

When you attempt to satisfy the non-blocking condition of lock-free programming, a whole family of techniques comes into play: atomic operations, memory barriers, avoiding the ABA problem, to name a few. This is where things quickly become diabolical.
So how do these techniques relate to each other? To illustrate, I've put together the following flowchart. I'll elaborate on each case below.
[Flowchart: which techniques apply when — atomic read-modify-write operations, compare-and-swap loops, sequential consistency, acquire/release semantics, memory barriers]

Atomic Read-Modify-Write Operations
Atomic operations are ones which manipulate memory in a way that appears indivisible: no thread can observe the operation half-complete. On modern processors, lots of operations are already atomic. For example, aligned reads and writes of simple native types (such as int, char, and bool) are usually atomic. (Translator's note: an aligned long should qualify as well; I'm not sure about long long.)
Read-modify-write (RMW) operations go a step further, letting you perform more complex transactions atomically. They're especially useful when a lock-free algorithm must support multiple writers, because when multiple threads attempt an RMW on the same address, they effectively line up and execute those operations one at a time. I've already touched on RMW operations on this blog, such as when implementing a lightweight mutex (http://preshing.com/20120226/roll-your-own-lightweight-mutex), a recursive mutex (http://preshing.com/20120305/implementing-a-recursive-mutex) and a lightweight logging system (http://preshing.com/20120522/lightweight-in-memory-logging).

Examples of RMW operations include _InterlockedIncrement on Win32, OSAtomicAdd32 on iOS, and std::atomic<int>::fetch_add in C++11. Be aware that the C++11 atomic standard does not guarantee that the implementation will be lock-free on every platform, so it's best to know the capabilities of your platform and toolchain. You can call std::atomic<>::is_lock_free to find out.
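As a quick illustration (my addition, not from the original article), here's a minimal C++11 sketch that performs an atomic increment and asks at runtime whether the type is actually lock-free on the current platform:

#include <atomic>
#include <cstdio>

int main()
{
    std::atomic<int> counter(0);

    // An atomic RMW: no thread can observe a half-finished increment.
    counter.fetch_add(1);

    // Not guaranteed lock-free on every platform; ask at runtime.
    std::printf("lock-free: %s\n", counter.is_lock_free() ? "yes" : "no");
    std::printf("counter:   %d\n", counter.load());
    return 0;
}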
Different CPU families support RMW in different ways. Processors such as PowerPC and ARM expose load-link/store-conditional instructions, which effectively allow you to implement your own RMW primitive at a low level, though this is not often done: the common RMW operations are usually sufficient.
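To make that concrete, here's a sketch (my addition) of rolling a custom RMW operation, atomic multiply, out of a CAS loop. C++11's compare_exchange_weak is allowed to fail spuriously precisely so that it can compile down to a load-link/store-conditional pair on PowerPC and ARM:

#include <atomic>

// A custom RMW built from a CAS loop. On failure, compare_exchange_weak
// reloads 'expected' with the current value and we simply retry.
int atomicMultiply(std::atomic<int>& value, int multiplier)
{
    int expected = value.load();
    while (!value.compare_exchange_weak(expected, expected * multiplier))
    {
    }
    return expected * multiplier;
}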
As the flowchart illustrates, atomic RMWs are a necessary part of lock-free programming even on single-processor systems. Without atomicity, a thread's transaction could be interrupted halfway through, leading to an inconsistent state.

Compare-And-Swap (CAS) Loops
Perhaps the most often-discussed RMW operation is compare-and-swap (CAS). On Win32, CAS is provided via a family of intrinsics such as _InterlockedCompareExchange. Often, programmers perform CAS in a loop to repeatedly attempt a transaction. The pattern typically involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the change back to the shared variable using CAS:
void LockFreeQueue::push(Node* newHead)
{
    for (;;)
    {
        // Copy a shared variable (m_Head) to a local.
        Node* oldHead = m_Head;

        // Do some speculative work, not yet visible to other threads.
        newHead->next = oldHead;

        // Next, attempt to publish our changes to the shared variable.
        // If the shared variable hasn't changed, the CAS succeeds and we return.
        // Otherwise, repeat.
        if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead)
            return;
    }
}
Such loops still qualify as lock-free, because if the test fails for one thread, it means the CAS must have succeeded for another (though some architectures offer a weaker variant of CAS where that's not necessarily true). Any time you implement a CAS loop, special care must be taken to avoid the ABA problem.
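For comparison, here's a sketch (my addition, not from the original article) of the same push written against the portable C++11 atomic library. Note how compare_exchange_weak refreshes oldHead automatically whenever the CAS fails:

#include <atomic>

struct Node { Node* next; };

class LockFreeQueue
{
    std::atomic<Node*> m_Head{nullptr};

public:
    void push(Node* newHead)
    {
        // Copy the shared variable (m_Head) to a local.
        Node* oldHead = m_Head.load();
        do
        {
            // Speculative work, not yet visible to other threads.
            newHead->next = oldHead;
        }
        while (!m_Head.compare_exchange_weak(oldHead, newHead));
    }
};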
Translator's note: How do we avoid the ABA problem here? Going by the definition of ABA, the push example above is not strictly safe, since a Node's memory address can be reused. In theory, even if ABA occurs it shouldn't change the result of push itself, but can ABA be ruled out logically, for instance by making the shared variable a monotonically increasing counter? Worth investigating.
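One common mitigation, sketched below under my own assumptions (my addition, not part of the original article), is to pair the pointer with a version counter, so that a node address which was freed and reused no longer matches a stale snapshot:

#include <atomic>
#include <cstdint>

struct Node { Node* next; };

// Pointer plus version counter. Every successful CAS bumps the counter,
// so an "A-B-A" sequence on the pointer alone no longer compares equal.
struct TaggedPtr
{
    Node*    ptr;
    uint64_t tag;
};

// Needs a double-width CAS (e.g. cmpxchg16b on x86-64) to stay lock-free;
// verify with g_head.is_lock_free() on your target.
std::atomic<TaggedPtr> g_head{TaggedPtr{nullptr, 0}};

void push(Node* newHead)
{
    TaggedPtr oldHead = g_head.load();
    TaggedPtr desired;
    do
    {
        newHead->next = oldHead.ptr;
        desired.ptr = newHead;
        desired.tag = oldHead.tag + 1;  // The version bump defeats ABA
    }
    while (!g_head.compare_exchange_weak(oldHead, desired));
}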

Sequential Consistency
Sequential consistency means that all threads agree on the order in which memory operations occurred, and that order is consistent with the order of operations in the program's source code. Under sequential consistency, the kind of memory reordering demonstrated in my previous post (http://preshing.com/20120515/memory-reordering-caught-in-the-act) cannot happen.
A simple (but obviously impractical) way to achieve sequential consistency is to disable compiler optimizations and force all your threads to run on a single processor. A single processor never sees its own memory operations out of order, no matter when the threads are scheduled.
Some programming languages offer sequential consistency even for optimized code running in a multiprocessor environment. In C++11, you can declare all shared variables as C++11 atomic types with the default memory ordering constraints. In Java, you can mark all shared variables as volatile. Here's the example from my previous post (http://preshing.com/20120515/memory-reordering-caught-in-the-act), rewritten in C++11 style:
std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1()
{
    X.store(1);
    r1 = Y.load();
}

void thread2()
{
    Y.store(1);
    r2 = X.load();
}

Because the C++11 atomic types guarantee sequential consistency, the outcome r1 = r2 = 0 is impossible. To achieve this, the compiler outputs additional instructions behind the scenes, typically memory fences and/or RMW operations. Those extra instructions may make the implementation less efficient than one where the programmer deals with memory ordering directly.
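To see what that guarantee costs, compare a hypothetical variant (my addition) that opts out of sequential consistency by requesting relaxed ordering explicitly. Now r1 = r2 = 0 becomes a possible outcome, but the compiler no longer needs to emit the extra fences:

#include <atomic>

std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1()
{
    // Relaxed ordering: still atomic, but free to be reordered by
    // the compiler and the CPU.
    X.store(1, std::memory_order_relaxed);
    r1 = Y.load(std::memory_order_relaxed);
}

void thread2()
{
    Y.store(1, std::memory_order_relaxed);
    r2 = X.load(std::memory_order_relaxed);
}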

Memory Ordering
As the flowchart suggests, any time you do lock-free programming for multicore (or for any symmetric multiprocessor) and your environment does not guarantee sequential consistency, you must consider how to prevent memory reordering (http://preshing.com/20120515/memory-reordering-caught-in-the-act).
On today's architectures, the tools to enforce correct memory ordering, preventing both compiler reordering and CPU reordering, fall into three categories:
1. A lightweight sync or fence instruction, which I'll talk about in future posts;
2. A full memory fence instruction, which I've demonstrated previously;
3. Memory operations which provide acquire or release semantics.

Acquire semantics prevent memory reordering of operations which follow the read-acquire in program order, and release semantics prevent memory reordering of operations which precede the write-release. These semantics are particularly suitable when there's a producer/consumer relationship, where one thread publishes some information and the other reads it, especially in the case of a single producer and a single consumer. I'll discuss this more in a future post.
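As a preview, here's a minimal sketch (my addition; the names g_payload and g_ready are illustrative) of that single-producer/single-consumer pattern using C++11 acquire and release semantics:

#include <atomic>

int g_payload = 0;                 // Plain, non-atomic data
std::atomic<bool> g_ready(false);  // Guard flag

void producer()
{
    g_payload = 42;                                  // 1. Write the data
    g_ready.store(true, std::memory_order_release);  // 2. Publish; release
}                                                    //    keeps step 1 before 2

void consumer()
{
    while (!g_ready.load(std::memory_order_acquire)) {}  // Spin until published
    // The acquire load guarantees the payload write is visible here.
    int value = g_payload;  // Reads 42
    (void) value;
}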

Different Processors Have Different Memory Models
Different CPU families have different habits when it comes to memory reordering; the exact rules are documented by each CPU vendor and the hardware itself. For example, PowerPC and ARM processors can change the order of memory stores relative to the instructions themselves, whereas the x86/64 family of processors from Intel and AMD normally does not. We say the former processors have a more relaxed memory model.

There's a temptation to abstract away such platform-specific details, especially with C++11 now offering a standard, portable way to write lock-free code. But for now, I think most lock-free programmers should have at least some appreciation of platform differences. One key difference to remember is that, at the x86/64 instruction level, every load from memory comes with acquire semantics and every store to memory provides release semantics, at least for non-SSE instructions and non-write-combined memory. As a result, it has been common in the past to write lock-free code which works on x86/64 but fails on other processors.
If you're interested in the hardware details of how and why processors reorder memory, I'd recommend Appendix C of Is Parallel Programming Hard. In any case, keep in mind that memory reordering can also occur due to compiler reordering of instructions.
In this post, I haven't said much about the practical side of lock-free programming, such as: when do we do it? How much do we really need it? I also haven't mentioned the importance of validating your lock-free algorithms. Nonetheless, I hope that for some readers, this introduction has lent a basic familiarity with lock-free concepts, so you can proceed into the additional reading without feeling too bewildered. As usual, if you spot any inaccuracies, let me know in the comments.

(This article was featured in Issue #29 of Hacker Monthly.)
Additional references:
  • Anthony Williams’ blog and his book, C++ Concurrency in Action
  • Dmitriy V’jukov’s website and various forum discussions
  • Bartosz Milewski’s blog
  • Charles Bloom’s Low-Level Threading series on his blog
  • Doug Lea’s JSR-133 Cookbook
  • Howells and McKenney’s memory-barriers.txt document
  • Hans Boehm’s collection of links about the C++11 memory model
  • Herb Sutter’s Effective Concurrency series
 
 
May 22, 2012

Lightweight In-Memory Logging

When debugging multithreaded code, it's not always easy to determine which codepath was taken. You can't always reproduce the bug while stepping through the debugger, nor can you always sprinkle printfs throughout the code, as you might in a single-threaded program. There might be millions of events before the bug occurs, and printf can easily slow the application to a crawl, mask the bug, or create a spam fest in the output log.

One way of attacking such problems is to instrument the code so that events are logged to a circular buffer in memory. This is similar to adding printfs, except that only the most recent events are kept in the log, and the performance overhead can be made very low using lock-free techniques.

Here’s one possible implementation. I’ve written it specifically for Windows in 32-bit C++, but you could easily adapt the idea to other platforms. The header file contains the following:

#include <windows.h>
#include <intrin.h>

namespace Logger
{
    struct Event
    {
        DWORD tid;        // Thread ID
        const char* msg;  // Message string
        DWORD param;      // A parameter which can mean anything you want
    };

    static const int BUFFER_SIZE = 65536;   // Must be a power of 2
    extern Event g_events[BUFFER_SIZE];
    extern LONG g_pos;

    inline void Log(const char* msg, DWORD param)
    {
        // Get next event index
        LONG index = _InterlockedIncrement(&g_pos);
        // Write an event at this index
        Event* e = g_events + (index & (BUFFER_SIZE - 1));  // Wrap to buffer size
        e->tid = ((DWORD*) __readfsdword(24))[9];           // Get thread ID
        e->msg = msg;
        e->param = param;
    }
}

#define LOG(m, p) Logger::Log(m, p)

And you must place the following in a .cpp file.

namespace Logger
{
    Event g_events[BUFFER_SIZE];
    LONG g_pos = -1;
}

This is perhaps one of the simplest examples of lock-free programming which actually does something useful. There's a single macro LOG, which writes to the log. It uses _InterlockedIncrement, an atomic operation which I've talked about in previous posts, for thread safety. There are no readers. You are meant to be the reader when you inspect the process in the debugger, such as when the program crashes, or when the bug is otherwise caught.
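For example, hypothetical call sites might look like the following (frameIndex and jobId are illustrative names, not part of the original code). Note that the message should be a string literal, since only the pointer is stored in the event:

DWORD frameIndex = 0, jobId = 0;  // Illustrative values
LOG("enter update", frameIndex);
LOG("job queued", jobId);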

Using It to Debug My Previous Post

My previous post, Memory Reordering Caught In the Act, contains a sample program which demonstrates a specific type of memory reordering. There are two semaphores, beginSema1 and beginSema2, which are used to repeatedly kick off two worker threads.

While I was preparing the post, there was only a single beginSema shared by both threads. To verify that the experiment was valid, I added a makeshift assert to the worker threads. Here’s the Win32 version:

DWORD WINAPI thread1Func(LPVOID param)
{
    MersenneTwister random(1);
    for (;;)
    {
        WaitForSingleObject(beginSema, INFINITE);  // Wait for signal
        while (random.integer() % 8 != 0) {}       // Random delay

        // ----- THE TRANSACTION! -----
        if (X != 0) DebugBreak();  // Makeshift assert
        X = 1;
        _ReadWriteBarrier();       // Prevent compiler reordering only
        r1 = Y;

        ReleaseSemaphore(endSema, 1, NULL);  // Notify transaction complete
    }
    return 0;  // Never returns
};

Surprisingly, this “assert” got hit, which means that X was not 0 at the start of the experiment, as expected. This puzzled me, since, as I explained in that post, the semaphores are supposed to guarantee that the initial values X = 0 and Y = 0 are completely propagated at this point.

I needed more visibility on what was going on, so I added the LOG macro in a few strategic places. Note that the integer parameter can be used to log any value you want. In the second LOG statement below, I use it to log the initial value of X. Similar changes were made in the other worker thread.

    for (;;)
    {
        LOG("wait", 0);
        WaitForSingleObject(beginSema, INFINITE);  // Wait for signal
        while (random.integer() % 8 != 0) {}       // Random delay

        // ----- THE TRANSACTION! -----
        LOG("X ==", X);
        if (X != 0) DebugBreak();  // Makeshift assert
        X = 1;
        _ReadWriteBarrier();       // Prevent compiler reordering only
        r1 = Y;

        ReleaseSemaphore(endSema, 1, NULL);  // Notify transaction complete
    }

And in the main thread:

    for (int iterations = 1; ; iterations++)
    {
        // Reset X and Y
        LOG("reset vars", 0);
        X = 0;
        Y = 0;

        // Signal both threads
        ReleaseSemaphore(beginSema, 1, NULL);
        ReleaseSemaphore(beginSema, 1, NULL);

        // Wait for both threads
        WaitForSingleObject(endSema, INFINITE);
        WaitForSingleObject(endSema, INFINITE);

        // Check if there was a simultaneous reorder
        LOG("check vars", 0);
        if (r1 == 0 && r2 == 0)
        {
            detected++;
            printf("%d reorders detected after %d iterations\n", detected, iterations);
        }
    }

The next time the “assert” was hit, I checked the contents of the log simply by watching the expressions Logger::g_pos and Logger::g_events in the Watch window.

In this case, the assert was hit fairly quickly. Only 17 events were logged in total (0 - 16). The final three events made the problem obvious: a single worker thread had managed to iterate twice before the other thread got a chance to run. In other words, thread1 had stolen the extra semaphore count which was intended to kick off thread2! Splitting this semaphore into two separate semaphores fixed the bug.

This example was relatively simple, involving a small number of events. In some games I've worked on, we've used this kind of technique to track down more complex problems. It's still possible for this technique to mask a bug; for example, when memory reordering is the issue. But even if so, that may tell you something about the problem.

Tips on Viewing the Log

The g_events array is only big enough to hold the latest 65536 events. You can adjust this number to your liking, but at some point, the index counter g_pos will have to wrap around. For example, if g_pos has reached a value of 3630838, you can find the last log entry by taking this value modulo 65536. Using interactive Python:

>>> 3630838 % 65536
26358

When breaking, you may also find that “CXX0017: Error: symbol not found” is sometimes shown in the Watch window. [Screenshot omitted.]

This usually means that the debugger’s current thread and stack frame context is inside an external DLL instead of your executable. You can often fix it by double-clicking a different stack frame in the Call Stack window and/or a different thread in the Threads window. If all else fails, you can always add the context operator to your Watch expression, explicitly telling the debugger which module to use to resolve these symbols. [Screenshot omitted.]

One convenient detail about this implementation is that the event log is stored in a global array. This allows the log to show up in crash dumps, via an automated crash reporting system for example, even when limited minidump flags are used.

What Makes This Lightweight?

In this implementation, I strived to make the LOG macro as non-intrusive as reasonably possible. Besides being lock-free, this is mainly achieved through copious use of compiler intrinsics, which avoid the overhead of DLL function calls for certain functions. For example, instead of calling InterlockedIncrement, which involves a call into kernel32.dll, I used the intrinsic function _InterlockedIncrement (with an underscore).

Similarly, instead of getting the current thread ID from GetCurrentThreadId, I used the compiler intrinsic __readfsdword to read the thread ID directly from the Thread Information Block (TIB), an undocumented but well-known data structure in Win32.
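For reference, here's what the documented route would look like (my addition); it trades a few cycles for portability across Windows versions, since GetCurrentThreadId is a call into kernel32.dll rather than a couple of inlined instructions:

#include <windows.h>

// Documented alternative to the __readfsdword trick above.
inline DWORD currentThreadId()
{
    return GetCurrentThreadId();
}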

You may question whether such micro-optimizations are justified. However, after building several makeshift logging systems, usually to handle millions of events in high-performance, multi-threaded code, I’ve come to believe that the less intrusive you can make it, the better. As a result of these micro-optimizations, the LOG macro compiles down to a few machine instructions, all inlined, with no function calls, no branching and no blocking. [Disassembly screenshot omitted.]

This technique is attractive because it is very easy to integrate. There are many ways you could adapt it, depending on your needs, especially if performance is less of a concern. You could add timestamps and stack traces. You could introduce a dedicated thread to spool the event log to disk, though this would require much more sophisticated synchronization than the single atomic operation used here.

After adding such features, the technique would begin to resemble Microsoft’s Event Tracing for Windows (ETW) framework, so if you’re willing to go that far, it might be interesting to look at ETW’s support for user-mode provider events instead.


