EPOLL_CTL_DISABLE and multithreaded applications


http://lwn.net/Articles/520012/
By Michael Kerrisk
October 17, 2012

Other than the merging of the server-side component of TCP Fast Open, one of the few user-space API changes that has gone into the just-closed 3.7 merge window is the addition of a new EPOLL_CTL_DISABLE operation for the epoll_ctl() system call. It's interesting to look at this operation as an illustration of the sometimes unforeseen complexities of dealing with multithreaded applications; that examination is the subject of this article. However, the addition of the EPOLL_CTL_DISABLE feature highlights some common problems in the design of the APIs that the kernel presents to user space. (To be clear: EPOLL_CTL_DISABLE is the fix to a past design problem, not a design problem itself.) These design problems will be the subject of a follow-on article next week.

Understanding the need for EPOLL_CTL_DISABLE requires an understanding of several features of the epoll API. For those who are unfamiliar with epoll, we begin with a high-level picture of how the API works. We then look at the problem that EPOLL_CTL_DISABLE is designed to solve, and how it solves that problem.

An overview of the epoll API

The (Linux-specific) epoll API allows an application to monitor multiple file descriptors in order to determine which of the descriptors are ready to perform I/O. The API was designed as a more efficient replacement for the traditional select() and poll() system calls. Roughly speaking, the performance of those older APIs scales linearly with the number of file descriptors being monitored. That behavior makes select() and poll() poorly suited for modern network applications that may handle thousands of file descriptors simultaneously.

The poor performance of select() and poll() is an inescapable consequence of their design. For each monitoring operation, both system calls require the application to give the kernel a complete list of all of the file descriptors that are of interest. And on each call, the kernel must re-examine the state of all of those descriptors and then pass a data structure back to the application that describes the readiness of the descriptors.

The underlying problem of the older APIs is that they don't allow an application to inform the kernel about its ongoing interest in a (typically unchanging) set of file descriptors. If the kernel had that information, then, as each file descriptor became ready, it could record the fact in preparation for the next request by the application for the set of ready file descriptors. The epoll API allows exactly that approach, by splitting the monitoring API up across three system calls:

  • epoll_create() creates an internal kernel data structure ("an epoll instance") that is used to record the set of file descriptors that the application is interested in monitoring. The call returns a file descriptor that is used in the remaining epoll APIs.

  • epoll_ctl() allows the application to inform the kernel about the set of file descriptors it would like to monitor by adding (EPOLL_CTL_ADD) and removing (EPOLL_CTL_DEL) file descriptors from the interest list of the epoll instance. epoll_ctl() can also modify (EPOLL_CTL_MOD) the set of events that are to be monitored for a file descriptor that is already in the interest list. Once a file descriptor has been recorded in the interest list, the kernel tracks I/O events for the file descriptor (e.g., the arrival of new input); if the event causes the file descriptor to become ready, the kernel places the descriptor on the ready list of the epoll instance, in preparation for the next call to epoll_wait().

  • epoll_wait() requests the kernel to return one or more ready file descriptors. The kernel satisfies this request by simply fetching items from the ready list (the call can block if there are no descriptors that are yet ready). The application uses epoll_wait() each time it wants to check for changes in the readiness of file descriptors. What is notable about epoll_wait() is that the application does not need to pass in a list of file descriptors on each call: the kernel already has that information via preceding calls to epoll_ctl(). In addition, there is no need to rescan the complete set of file descriptors to see which are ready; the kernel has already been recording that information on an ongoing basis because it knows which file descriptors the application is interested in.

Schematically, the epoll API operates as shown in the following diagram:

[Overview of the epoll API]

Because the kernel is able to maintain internal state about the set of file descriptors in which the application is interested, epoll_wait() is much more efficient than select() and poll(). Roughly speaking, its performance scales according to the number of ready file descriptors, rather than the total number of file descriptors being monitored.

Epoll and multithreaded applications: the problem

The author of the patch that implements EPOLL_CTL_DISABLE, Paton Lewis, is not a regular kernel hacker. Rather, he's a developer with a particular user-space itch, and it would seem that a kernel change is the only way of scratching that itch. In the description accompanying the first iteration of his patch, Paton began with the following observation:

It is not currently possible to reliably delete epoll items when using the same epoll set from multiple threads. After calling epoll_ctl with EPOLL_CTL_DEL, another thread might still be executing code related to an event for that epoll item (in response to epoll_wait). Therefore the deleting thread does not know when it is safe to delete resources pertaining to the associated epoll item because another thread might be using those resources.

The deleting thread could wait an arbitrary amount of time after calling epoll_ctl with EPOLL_CTL_DEL and before deleting the item, but this is inefficient and could result in the destruction of resources before another thread is done handling an event returned by epoll_wait.

The fact that the kernel records internal state is the source of a complication for multithreaded applications. The complication arises from the fact that applications may also want to maintain state information about file descriptors. One possible reason for doing this is to prevent file descriptor starvation, the phenomenon that can occur when, for example, an application determines that a file descriptor has data available for reading and then attempts to read all of the available data. It could happen that there is a very large amount of data available (for example, another application may be continuously writing data on the other end of a socket connection). Consequently, the reading application would be tied up for a long period; meanwhile, it does not service I/O events on the other file descriptors—those descriptors are starved of service by the application.

The solution to file descriptor starvation is for the application to maintain a user-space data structure that caches the readiness of each of the file descriptors that it is monitoring. Whenever epoll_wait() informs the application that a file descriptor is ready, then, instead of performing as much I/O as possible on the descriptor, the application makes a record in its cache that the file descriptor is ready. The application logic then takes the form of a loop that (a) periodically calls epoll_wait() and (b) performs a limited amount of I/O on the file descriptors that are marked as ready in the user-space cache. (When the application finds that I/O is no longer possible on one of the file descriptors, then it can mark that descriptor as not ready in the cache.)

Thus, we have a scenario where both the kernel and a user-space application are maintaining state information about the same resources. This can potentially lead to race conditions when competing threads in a multithreaded application want to update state information in both places. The most fundamental piece of state information maintained in both places is "existence".

For example, suppose that an application thread determines that it is no longer necessary to monitor a file descriptor. The thread would first check to see whether the file descriptor is marked as ready in the user-space cache (i.e., there may still be some outstanding I/O to perform), and then, if the file descriptor is not ready, the thread would delete the file descriptor from the user-space cache and from the kernel's epoll interest list using the epoll_ctl(EPOLL_CTL_DEL) operation. However, these steps could fall afoul of scenarios such as the following, involving two threads operating on file descriptor 9:

  1. Thread 1: Determine from the user-space cache that descriptor 9 is not ready.
  2. Thread 2: Call epoll_wait(); the call indicates descriptor 9 as ready.
  3. Thread 2: Record descriptor 9 as being ready inside the user-space cache so that I/O can later be performed.
  4. Thread 1: Delete descriptor 9 from the user-space cache.
  5. Thread 1: Delete descriptor 9 from the kernel's epoll interest list using epoll_ctl(EPOLL_CTL_DEL).

Following the above scenario, some data will be lost. Other scenarios could lead to a corrupted cache or an application crash.

No use of (per-file-descriptor) mutexes can eliminate the sorts of races described here, short of protecting the calls to epoll_wait() with a (global) mutex, which has the effect of destroying concurrency. (If one thread is blocked in an epoll_wait() call, then any other thread that tries to acquire the corresponding mutex will also block.)

Epoll and multithreaded applications: the solution

Paton's solution to this problem is to extend the epoll API with a new operation that atomically prevents other threads from receiving further indications that a file descriptor is ready, while at the same time informing the caller whether another thread has "recently" been told the file descriptor is ready. The new operation relies on some of the inner workings of the epoll API.

When adding (EPOLL_CTL_ADD) or modifying (EPOLL_CTL_MOD) a file descriptor in the interest list, the application specifies a mask of I/O events that are of interest for the descriptor. For example, the mask might include both EPOLLIN and EPOLLOUT, if the application wants to know when the file descriptor becomes either readable or writable. In addition, the kernel implicitly adds two further flags to the events mask in the interest list: EPOLLERR, which requests monitoring for error conditions, and EPOLLHUP, which requests monitoring for a "hangup" (e.g., we are monitoring the read end of a pipe, and the write end is closed). When a file descriptor becomes ready, epoll_wait() returns a mask that contains all of the requested events for which the file descriptor is ready. For example, if an application requests monitoring of the read end of a pipe using EPOLLIN and the write end of the pipe is closed, then epoll_wait() will return an events mask that includes both EPOLLIN and EPOLLHUP.

As well as the flags that can be used to monitor file descriptors for various I/O events, there are a few "operational flags"—flags that modify the semantics of the monitoring operation itself. One of these is EPOLLONESHOT. If this flag is specified in the events mask for a file descriptor, then, once the file descriptor becomes ready and is returned by a call to epoll_wait(), it is disabled from further monitoring (but remains in the interest list). If the application is interested in monitoring the file descriptor once more, then it must re-enable the file descriptor using the epoll_ctl(EPOLL_CTL_MOD) operation.

Per-descriptor events mask recorded in an epoll interest list:

  • Operational flags: EPOLLONESHOT, EPOLLET, ...
  • I/O event flags: EPOLLIN, EPOLLOUT, EPOLLHUP, EPOLLERR, ...

The implementation of EPOLLONESHOT relies on a trick. If this flag is set, then, when the file descriptor is returned as being ready via epoll_wait(), the kernel clears all of the "non-operational flags" (i.e., the I/O event flags) in the events mask for that file descriptor. This serves as a cue to the kernel that it should no longer track I/O events for the file descriptor.

By now, we finally have enough details to understand Paton's extension to the epoll API—the epoll_ctl(EPOLL_CTL_DISABLE) operation—that allows multithreaded applications to avoid the kind of races described above. To successfully use this extension requires the following:

  1. The user-space cache that describes file descriptors should also include a per-descriptor "delete-when-done" flag that defaults to false but can be set true when one thread wants to inform another thread that a particular file descriptor should be deleted.

  2. All epoll_ctl() calls that add or modify file descriptors in the interest list must specify the EPOLLONESHOT flag.

  3. The epoll_ctl(EPOLL_CTL_DISABLE) operation should be used as described in a moment.

In addition, calls to epoll_ctl(EPOLL_CTL_DISABLE) and accesses to the user-space cache must be suitably protected with per-file-descriptor mutexes. We won't go into details here, but the second version of Paton's patch adds a sample application to the kernel source tree (under tools/testing/selftests/epoll/test_epoll.c) that demonstrates the principles.

The new epoll operation is employed via the following call:

    epoll_ctl(epfd, EPOLL_CTL_DISABLE, fd, NULL);

epfd is a file descriptor referring to an epoll instance. fd is the file descriptor in the interest list that is to be disabled. The semantics of this operation handle two cases:

  • One or more of the I/O event flags is set in the interest list entry for fd. This means that, since the last epoll_ctl() operation that added or modified this interest list entry, no other thread has executed an epoll_wait() call that indicated this file descriptor as being ready. In this case, the kernel clears the I/O event flags in the interest list entry, which prevents subsequent epoll_wait() calls from returning the file descriptor as being ready. The epoll_ctl(EPOLL_CTL_DISABLE) call then returns zero to the caller. At this point, the caller knows that no other thread is operating on the file descriptor, and it can thus safely delete the descriptor from the user-space cache and from the kernel interest list.

  • No I/O event flag is set in the interest list entry for fd. This means that, since the last epoll_ctl() operation that added or modified this interest list entry, another thread has executed an epoll_wait() call that indicated this file descriptor as being ready. In this case, epoll_ctl(EPOLL_CTL_DISABLE) returns –1 with errno set to EBUSY. At this point, the caller knows that another thread is operating on the descriptor, so it sets the descriptor's "delete-when-done" flag in the user-space cache to indicate that the other thread should delete the file descriptor once it has finished using it.
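Taken together, the two cases suggest a deletion protocol along the following lines. This is pseudocode only: delete_item(), item->mutex, and item->delete_when_done are hypothetical names for the user-space cache structures, not part of any API.

```
/* Pseudocode: one thread deciding to delete descriptor fd. */
lock(item->mutex);
if (epoll_ctl(epfd, EPOLL_CTL_DISABLE, fd, NULL) == 0) {
    /* Case 1: no other thread has been told fd is ready;
       it is safe to delete everything now. */
    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
    delete_item(cache, fd);
} else if (errno == EBUSY) {
    /* Case 2: another thread is handling an event on fd;
       ask it to perform the deletion when it is done. */
    item->delete_when_done = true;
}
unlock(item->mutex);
```

The thread that handles the event then checks delete_when_done (under the same mutex) after it finishes its I/O, and performs the deletion itself if the flag is set.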

Thus, we see that with a moderate amount of effort, and a little help from a new kernel interface, a race can be avoided when deleting file descriptors in multithreaded applications that wish to avoid file descriptor starvation.

Concluding remarks

There was relatively little comment on the first iteration of Paton's patch. The only substantive comments came from Christof Meerwald; in response to these, Paton created the second version of his patch. That version received no comments, and was incorporated into 3.7-rc1. It would be nice to think that the relative paucity of comments reflects silent agreement that Paton's approach is correct. However, one is left with the nagging feeling that in fact few people have reviewed the patch, which leaves open the question: is this the best solution to the problem?

Although EPOLL_CTL_DISABLE solves the problem, the solution is neither intuitive nor easy to use. The main reason for this is that EPOLL_CTL_DISABLE is a bolt-on hack to the epoll API that satisfies the requirement (often repeated by Linus Torvalds) that existing user-space applications must not be broken by making a kernel ABI change. Within that constraint, EPOLL_CTL_DISABLE may be the best solution to the problem. However, it seems likely that a better solution would have been possible if it had been incorporated during the original design of the epoll API. Next week's follow-on article will consider whether a better initial solution could have been found and also consider why it might not be possible to find a better solution within the constraints of the current API.

Finally, it's worth noting that the EPOLL_CTL_DISABLE feature is not yet cast in stone, although it will become so in about two months, when Linux 3.7 is released. In the meantime, if someone comes up with a better idea to solve the problem, then the existing approach could be modified or replaced.



EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 18, 2012 16:00 UTC (Thu) by corbet (editor, #1) [Link]

Since this article was published, it has become clear that, thanks partly to Michael's questions, this API is likely to be changed before the final 3.7 release. Stay tuned.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 18, 2012 20:49 UTC (Thu) by mhelsley (subscriber, #11324) [Link]

Rather than a mutex guarding the epoll fd (and thus the interest set), and rather than EPOLL_CTL_DISABLE, could userspace RCU be used to protect the shared resources of the set until they are unused? I haven't fully thought it through but if it works then that might be another scalable solution which is useable "today".

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 19, 2012 12:41 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

That pretty much boils down to delaying the deletion of the items to a moment where all epoll_waits have been done (since epoll_wait is an RCU quiescence point).

An efficient solution for an arbitrary number of epoll_wait threads can be implemented even in userspace and without using a full-blown RCU.

Equip each thread with a) an id or something else that lets each thread refer to "the next" thread; b) a list of "items waiting to be deleted". Then the deleting thread adds the item being deleted to the first thread's list. Before executing epoll_wait, thread K empties its list and "passes the buck", appending the old contents of its list to that of thread K+1. This is an O(1) operation no matter how many items are being deleted; only thread N, being the last thread, actually has to go through the list and delete the items.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 19, 2012 2:25 UTC (Fri) by kjp (subscriber, #39639) [Link]

Better solution: since epoll lets you store a generic 64 bit cookie, just use a 64 bit sequence that increments for each new file descriptor. In a hash table, store the cookie -> fd mapping. The hash table should be thread safe but still scalable, and you could have a ref count too. So all wakeups from epoll need to check the hash table to see if the fd still exists, and check it out (bump refcount) if so.

So unless you need to process more than 2^63 sockets...

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 19, 2012 14:05 UTC (Fri) by mkerrisk (editor, #1978) [Link]

Better solution: since epoll lets you store a generic 64 bit cookie, just use a 64 bit sequence that increments for each new file descriptor.

I haven't thought through your solution very far, but it seems unfortunate to have to chew up the cookie to solve this problem. User space might well want to use the epoll_event.data field for other purposes.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 19, 2012 15:25 UTC (Fri) by kjp (subscriber, #39639) [Link]

To clarify, my solution is all user space, no kernel changes. It's just changing the kernel from holding a 'strong reference' to a 'weak reference'. (You know what they say about adding a layer of indirection...)

When I did epoll, I used it in edge_triggered mode (and also not 'oneshot') and had a single thread processing the epoll events and scheduling workers. I had worker threads that just pumped data to and from the kernel. I put direct pointers in the epoll data (i.e. strong references) to my internal structures since I had only one thread calling epoll. 

But if I had multiple threads calling epoll, I think the solution I outlined would work fine. I don't see what else I would need the cookie for... as long as it's still doing its job, pointing to a real data structure of mine.

And the nice thing is, it works with all epoll modes. What's very distasteful about the kernel patch is that it requires ONESHOT (yuck!). 

The epoll designer(s) had their thinking caps on with this api. Storing arbitrary cookies + edge triggered mode = Insanely good.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 19, 2012 15:35 UTC (Fri) by mkerrisk (editor, #1978) [Link]

But if I had multiple threads calling epoll, I think the solution I outlined would work fine. I don't see what else I would need the cookie for... as long as its still doing its job, pointing to a real data structure of mine.

Yes, but other people may want to use the cookie in a quite different way, and it seems a shame to limit the generality of the API by requiring it to be used for this task.

And the nice thing is, it works with all epoll modes. What's very distasteful about the kernel patch is that it requires ONESHOT (yuck!).

Yes, requiring the use of EPOLLONESHOT is rather unfortunate. I strongly suspect that there could be a solution quite similar to the EPOLL_CTL_DISABLE approach that doesn't require EPOLLONESHOT. I have something in mind, but I need to think about it a little more.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 19, 2012 21:39 UTC (Fri) by wahern (subscriber, #37304) [Link]

ONESHOT is the obvious and easiest way to handle lockless multithreaded processing of an epoll queue. In fact, on Solaris ONESHOT is the only option. There are no persistent events. The kernel, of course, is free to optimize for persistence, but userspace threads don't need to worry about a loaded gun lying around.

Edge-triggered signaling only provides a nominal percentage improvement in performance. If you're already going multithreaded and attempting lockless, then you're already massively multicore. Why bother adding all the complexity of edge-triggered events? Also, it's worth pointing out that with *BSD kqueue, ONESHOT automatically removes the descriptor from the queue. If epoll followed this excellent example then using ONESHOT would be end of story.

Seems to me the simplest solution to starvation is to ask the kernel to return events in FIFO order, i.e. the last one installed will also be the last in the next reported pending queue. That way you can use ONESHOT and still guarantee that there's only ever a single owner of the object, e.g. exactly one of the threads or the kernel.

For the early termination cases (e.g. a second thread walking a shared queue and destroying sockets), just call shutdown on the socket and let the kernel report it via the normal queue processing.

The root of the problem here is that people want to use both a message passing pattern via epoll messaging, as well as allow arbitrary threads to jump into the fray and manipulate shared contexts. That's just asking for trouble.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 30, 2012 0:23 UTC (Tue) by normalperson (subscriber, #47508) [Link]

I agree ONESHOT is awesome for MT and probably should've been the default.

epoll_wait() already returns ONESHOT events in FIFO order based on my reading of fs/eventpoll.c (and my own testing/experience). kqueue also seems to return in FIFO order with ONESHOT.

I wrote a server based on this behavior (along with concurrent calls to epoll_wait(...,maxevents=1,...)) for getting fair distribution between threads stuck in I/O wait a while back: http://bogomips.org/cmogstored.git

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 19, 2012 10:47 UTC (Fri) by nix (subscriber, #2304) [Link]

Now, how do I automatically add this excellent documentation to my copy of _The Linux Programming Interface_? The paper just won't update properly! :}

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 19, 2012 14:33 UTC (Fri) by mkerrisk (editor, #1978) [Link]

Thanks for the kind words. I *do* wish I'd thought of the diagram at the time I wrote TLPI, though. I love having good diagrams...

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 25, 2012 5:14 UTC (Thu) by cyanit (guest, #86671) [Link]

Other solution: pass a pointer to a userspace reference count in a new EPOLL_CTL_ADD_RC, which is incremented in the kernel under the epoll lock when an event concerning that fd is returned to userspace.

This way, after EPOLL_CTL_DEL either the fd will never be returned or the reference count has been raised already.

Userspace just needs to be changed to use EPOLL_CTL_ADD_RC and to decrement the reference count after it finished processing the event, and delete the fd data if it goes to zero either at that point or after EPOLL_CTL_DEL.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 28, 2012 14:39 UTC (Sun) by kjp (subscriber, #39639) [Link]

so both userspace and the kernel are modifying the reference count? I'm confused.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 25, 2012 16:32 UTC (Thu) by happynut (subscriber, #4117) [Link]

Perhaps I'm missing something, but it seems like the proposed solution is to add a synchronous call to control an asynchronous queue.

Couldn't this be solved with a flag (or an alternate version) of EPOLL_CTL_DEL to add an event to the queue reporting that the delete has been fully processed?

Then the caller of epoll_wait() could then clean up the remaining application's data structures, with no new locks required.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 28, 2012 14:38 UTC (Sun) by kjp (subscriber, #39639) [Link]

How do you know 'what fully processed' means? Another thread could have been 'woken up' by the kernel, but hasn't gotten around to looking in your internal structures. If another thread gets the 'deleted processed' event, it could delete the data structure prematurely.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 28, 2012 18:57 UTC (Sun) by happynut (subscriber, #4117) [Link]

I mean "fully processed" by the kernel, which is really the only issue; the application can (and indeed: must, even with the proposed EPOLL_CTL_DISABLE change) control its own concurrency issues with its own locks.

The issue is that the kernel and app are running asynchronously with an implicit race condition around removing file descriptors from epoll; sending a notification through the normal epoll mechanism that the kernel is done should be enough to allow both sides of the API to run asynchronously.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 27, 2012 14:16 UTC (Sat) by runciter (guest, #75370) [Link]

This is nonsense. The deleting thread should just mark the cache data for that fd as "ready for deletion" and interrupt the epoll_wait (using a write to a pipe monitored by epoll, for example). The thread doing epoll_wait() can then synchronously release the resources. You'll need a mutex for the "ready-for-deletion" flag, but you need it for the "exists" or "ready" flags anyway. It's just a matter of checking the flags: the deleting thread checks "ready" before deleting; the epoll_wait() thread checks "ready for deletion" before updating "ready". With a mutex in place there is no race.

I don't get the point about losing data at all. You've decided to destroy the userspace cache entry *first*, before epoll_ctl() returned. Data will be lost either way.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 28, 2012 14:49 UTC (Sun) by kjp (subscriber, #39639) [Link]

It sounds like the issue is a timeout case. The diagram shows one thread sees the file descriptor as not ready (no events) and decides to delete it. But then, suddenly, an event for it comes in and starts processing on another thread. I don't see how your solution addresses that. Your pipe wakeup could happen at the same time as a 'real' socket wakeup event.

EPOLL_CTL_DISABLE and multithreaded applications

Posted Oct 28, 2012 17:59 UTC (Sun) by kjp (subscriber, #39639) [Link]

My comment was imprecise at best. I'll clarify what I think you are doing:

Thread 1 decides the fd is no longer needed, due to no events
Thread 2 gets a wakeup for a real event, but is then scheduled out and does not progress
Thread 1 deletes the socket from epoll, marks fd as needing deletion, and signals via a pipe.
Another thread (3) then reads the pipe and deletes the fd

That does nothing to address the race with thread 2. There's still a race; all you've added is the essence of a sleep() which delays things (as the article mentioned, the solution of adding an arbitrary delay).
