QEMU Internals: Overall architecture and threading model
Stefan Hajnoczi
Open source and virtualization blog

Saturday, 5 March 2011
This is the first post in a series on QEMU Internals aimed at developers. It is designed to share knowledge of how QEMU works and make it easier for new contributors to learn about the QEMU codebase.


Running a guest involves executing guest code, handling timers, processing I/O, and responding to monitor commands. Doing all these things at once requires an architecture capable of mediating resources in a safe way without pausing guest execution if a disk I/O or monitor command takes a long time to complete. There are two popular architectures for programs that need to respond to events from multiple sources:



Parallel architecture splits work into processes or threads that can execute simultaneously. I will call this threaded architecture. 
Event-driven architecture reacts to events by running a main loop that dispatches to event handlers. This is commonly implemented using the select(2) or poll(2) family of system calls to wait on multiple file descriptors. 
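
For illustration, here is a minimal event loop of the select(2) kind just described. It is a generic sketch, not QEMU code: the handler table and callback names are hypothetical.

    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    /* Hypothetical handler table: one read callback per file descriptor. */
    typedef void (*fd_handler)(int fd, void *opaque);

    struct handler {
        int fd;
        fd_handler read_cb;
        void *opaque;
    };

    static void echo_handler(int fd, void *opaque)
    {
        char buf[256];
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            write(STDOUT_FILENO, buf, n);   /* echo stdin back to stdout */
        }
    }

    int main(void)
    {
        struct handler handlers[] = {
            { STDIN_FILENO, echo_handler, NULL },
        };
        int nhandlers = sizeof(handlers) / sizeof(handlers[0]);

        for (;;) {
            fd_set rfds;
            int i, maxfd = -1;

            FD_ZERO(&rfds);
            for (i = 0; i < nhandlers; i++) {
                FD_SET(handlers[i].fd, &rfds);
                if (handlers[i].fd > maxfd) {
                    maxfd = handlers[i].fd;
                }
            }

            /* Block until at least one fd is readable, then dispatch. */
            if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0) {
                perror("select");
                return 1;
            }
            for (i = 0; i < nhandlers; i++) {
                if (FD_ISSET(handlers[i].fd, &rfds)) {
                    handlers[i].read_cb(handlers[i].fd, handlers[i].opaque);
                }
            }
        }
    }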




QEMU actually uses a hybrid architecture that combines event-driven programming with threads. It makes sense to do this because an event loop cannot take advantage of multiple cores since it only has a single thread of execution. In addition, sometimes it is simpler to write a dedicated thread to offload one specific task rather than integrate it into an event-driven architecture. Nevertheless, the core of QEMU is event-driven and most code executes in that environment.




The event-driven core of QEMU


An event-driven architecture is centered around the event loop which dispatches events to handler functions. QEMU's main event loop is main_loop_wait() and it performs the following tasks:




Waits for file descriptors to become readable or writable. File descriptors play a critical role because files, sockets, pipes, and various other resources are all file descriptors. File descriptors can be added using qemu_set_fd_handler(). 
Runs expired timers. Timers can be added using qemu_mod_timer(). 
Runs bottom-halves (BHs), which are like timers that expire immediately. BHs are used to avoid reentrancy and overflowing the call stack. BHs can be added using qemu_bh_schedule(). 
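
As a rough sketch of how these three event sources are registered from inside the QEMU tree (the signatures shown are the 2011-era ones named above and have changed in later releases, so treat this as an outline rather than exact API):

    /* Sketch only: assumes 2011-era QEMU internals; not standalone code. */

    static void net_fd_read(void *opaque) { /* fd became readable */ }
    static void timer_cb(void *opaque)    { /* timer expired      */ }
    static void bh_cb(void *opaque)       { /* deferred work      */ }

    static void register_event_sources(int fd)
    {
        /* 1. Watch a file descriptor for readability. */
        qemu_set_fd_handler(fd, net_fd_read, NULL /* no write cb */, NULL);

        /* 2. Arm a timer ~100 ms from now on the real-time clock
         *    (rt_clock ticked in milliseconds in this era). */
        QEMUTimer *t = qemu_new_timer(rt_clock, timer_cb, NULL);
        qemu_mod_timer(t, qemu_get_clock(rt_clock) + 100);

        /* 3. Schedule a bottom-half: runs from the event loop right
         *    after the current callback returns, avoiding reentrancy. */
        QEMUBH *bh = qemu_bh_new(bh_cb, NULL);
        qemu_bh_schedule(bh);
    }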




When a file descriptor becomes ready, a timer expires, or a BH is scheduled, the event loop invokes a callback that responds to the event. Callbacks have two simple rules about their environment:




1.No other core code is executing at the same time so synchronization is not necessary. Callbacks execute sequentially and atomically with respect to other core code. There is only one thread of control executing core code at any given time. 
2.No blocking system calls or long-running computations should be performed. Since the event loop waits for the callback to return before continuing with other events, it is important to avoid spending an unbounded amount of time in a callback. Breaking this rule causes the guest to pause and the monitor to become unresponsive. 




This second rule is sometimes hard to honor and there is code in QEMU which blocks. In fact there is even a nested event loop in qemu_aio_wait() that waits on a subset of the events that the top-level event loop handles. Hopefully these violations will be removed in the future by restructuring the code. New code almost never has a legitimate reason to block and one solution is to use dedicated worker threads to offload long-running or blocking code.




Offloading specific tasks to worker threads


Although many I/O operations can be performed in a non-blocking fashion, there are system calls which have no non-blocking equivalent. Furthermore, sometimes long-running computations simply hog the CPU and are difficult to break up into callbacks. In these cases dedicated worker threads can be used to carefully move these tasks out of core QEMU.




One example user of worker threads is posix-aio-compat.c, an asynchronous file I/O implementation. When core QEMU issues an aio request it is placed on a queue. Worker threads take requests off the queue and execute them outside of core QEMU. They may perform blocking operations since they execute in their own threads and do not block the rest of QEMU. The implementation takes care to perform necessary synchronization and communication between worker threads and core QEMU.
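
The following is a minimal sketch of that pattern (a simplification, not the actual posix-aio-compat.c code): core code enqueues a request and returns immediately, while pool threads dequeue and may block freely.

    /* Minimal sketch of the worker-thread pattern (a simplification,
     * not posix-aio-compat.c itself). */
    #include <pthread.h>
    #include <stdlib.h>

    struct request {
        struct request *next;
        void (*fn)(void *arg);    /* the blocking work to perform */
        void *arg;
    };

    static struct request *queue_head;
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

    /* Core QEMU side: enqueue the request and return immediately. */
    static void submit_request(void (*fn)(void *), void *arg)
    {
        struct request *req = malloc(sizeof(*req));
        req->fn  = fn;
        req->arg = arg;
        pthread_mutex_lock(&queue_lock);
        req->next = queue_head;
        queue_head = req;
        pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_lock);
    }

    /* Worker side: take requests off the queue and execute them.
     * Blocking here only holds up this worker, not the event loop. */
    static void *worker(void *opaque)
    {
        for (;;) {
            pthread_mutex_lock(&queue_lock);
            while (!queue_head) {
                pthread_cond_wait(&queue_cond, &queue_lock);
            }
            struct request *req = queue_head;
            queue_head = req->next;
            pthread_mutex_unlock(&queue_lock);

            req->fn(req->arg);
            free(req);
        }
        return NULL;
    }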




Another example is ui/vnc-jobs-async.c which performs compute-intensive image compression and encoding in worker threads.




Since the majority of core QEMU code is not thread-safe, worker threads cannot call into core QEMU code. Simple utilities like qemu_malloc() are thread-safe but that is the exception rather than the rule. This poses a problem for communicating worker thread events back to core QEMU.




When a worker thread needs to notify core QEMU, a pipe or a qemu_eventfd() file descriptor is added to the event loop. The worker thread can write to the file descriptor and the callback will be invoked by the event loop when the file descriptor becomes readable. In addition, a signal must be used to ensure that the event loop is able to run under all circumstances. This approach is used by posix-aio-compat.c and makes more sense (especially the use of signals) after understanding how guest code is executed.
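
A sketch of this notification mechanism (the names are illustrative, not QEMU's):

    /* Sketch: completion notification from a worker thread back to the
     * event loop via a pipe. */
    #include <fcntl.h>
    #include <unistd.h>

    static int notify_pipe[2];        /* [0] = read end, [1] = write end */

    static void notify_init(void)
    {
        pipe(notify_pipe);
        fcntl(notify_pipe[0], F_SETFL, O_NONBLOCK); /* draining never blocks */
    }

    /* Worker thread: record the finished request, then kick the loop. */
    static void request_complete(void)
    {
        char byte = 0;
        write(notify_pipe[1], &byte, 1);    /* read end becomes readable */
    }

    /* Event-loop callback, registered once with something like
     * qemu_set_fd_handler(notify_pipe[0], completion_handler, NULL, NULL). */
    static void completion_handler(void *opaque)
    {
        char byte;
        while (read(notify_pipe[0], &byte, 1) == 1) {
            /* drain, then walk the list of completed requests */
        }
    }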




To summarize the post so far, the key points are:
1. QEMU uses a hybrid architecture that combines threads with an event loop.
2. QEMU has a main thread (core QEMU), vcpu threads, and worker threads (e.g. for VNC and AIO).
3. Core QEMU and the worker threads communicate through signals and event file descriptors.
4. Adding a file descriptor to the event loop lets changes on it be detected so that the corresponding callback is invoked.




Executing guest code


So far we have mainly looked at the event loop and its central role in QEMU. Equally as important is the ability to execute guest code, without which QEMU could respond to events but would not be very useful.


There are two mechanisms for executing guest code: Tiny Code Generator (TCG) and KVM. TCG emulates the guest using dynamic binary translation, also known as Just-in-Time (JIT) compilation. KVM takes advantage of hardware virtualization extensions present in modern Intel and AMD CPUs for safely executing guest code directly on the host CPU. For the purposes of this post the actual techniques do not matter but what matters is that both TCG and KVM allow us to jump into guest code and execute it.
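
To make the KVM half concrete, the kernel interface boils down to a handful of ioctls on /dev/kvm. The fragment below is a bare sketch: guest memory and register setup are omitted, so a real vcpu needs more than this.

    /* Bare sketch of the KVM path: create a VM and one vcpu through
     * /dev/kvm, then KVM_RUN executes guest code directly on the host
     * CPU until the next exit.  Guest memory/register setup omitted. */
    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    int run_vcpu_once(void)
    {
        int kvm  = open("/dev/kvm", O_RDWR);
        int vm   = ioctl(kvm, KVM_CREATE_VM, 0);
        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);

        /* Shared area telling us why KVM_RUN returned (MMIO, pio, ...). */
        int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu, 0);

        ioctl(vcpu, KVM_RUN, 0);      /* jump into guest code */
        return run->exit_reason;      /* e.g. KVM_EXIT_IO, KVM_EXIT_MMIO */
    }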


Jumping into guest code takes away our control of execution and gives control to the guest. While a thread is running guest code it cannot simultaneously be in the event loop because the guest has (safe) control of the CPU. Typically the amount of time spent in guest code is limited because reads and writes to emulated device registers and other exceptions cause us to leave the guest and give control back to QEMU. In extreme cases a guest can spend an unbounded amount of time without giving up control and this would make QEMU unresponsive.


In order to solve the problem of guest code hogging QEMU's thread of control, signals are used to break out of the guest. A UNIX signal yanks control away from the current flow of execution and invokes a signal handler function. This allows QEMU to take steps to leave guest code and return to its main loop where the event loop can get a chance to process pending events.
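
The mechanism can be sketched as follows. The signal choice and names here are illustrative (QEMU uses a similar SIG_IPI scheme internally); the relevant effect under KVM is that a pending signal makes ioctl(KVM_RUN) return early with EINTR:

    /* Sketch: kicking a vcpu thread out of guest code with a signal.
     * With KVM, a signal delivered to the vcpu thread makes the
     * blocked ioctl(vcpu_fd, KVM_RUN, 0) return with EINTR. */
    #include <pthread.h>
    #include <signal.h>

    #define SIG_IPI SIGUSR1               /* hypothetical "kick" signal */

    static void sig_ipi_handler(int sig)
    {
        /* Nothing to do: delivery alone interrupts guest execution. */
    }

    static void install_kick_handler(void)
    {
        struct sigaction sa = { .sa_handler = sig_ipi_handler };
        sigaction(SIG_IPI, &sa, NULL);
    }

    /* Called from another thread to yank a vcpu back into QEMU. */
    static void kick_vcpu(pthread_t vcpu_thread)
    {
        pthread_kill(vcpu_thread, SIG_IPI);
    }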


The upshot of this is that new events may not be detected immediately if QEMU is currently in guest code. Most of the time QEMU eventually gets around to processing events but this additional latency is a performance problem in itself. For this reason timers, I/O completion, and notifications from worker threads to core QEMU use signals to ensure that the event loop will be run immediately.


You might be wondering what the overall picture between the event loop and an SMP guest with multiple vcpus looks like. Now that the threading model and guest code has been covered we can discuss the overall architecture.


iothread and non-iothread architecture


The traditional architecture is a single QEMU thread that executes guest code and the event loop. This model is also known as non-iothread or !CONFIG_IOTHREAD and is the default when QEMU is built with ./configure && make. The QEMU thread executes guest code until an exception or signal yields back control. Then it runs one iteration of the event loop without blocking in select(2). Afterwards it dives back into guest code and repeats until QEMU is shut down.
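
In outline, assuming hypothetical helper names around the main_loop_wait() call described above:

    /* Outline of the traditional single-threaded loop: alternate between
     * guest execution and one non-blocking event-loop pass.  All names
     * except main_loop_wait() are placeholders. */
    static void non_iothread_loop(void)
    {
        while (!shutdown_requested()) {
            run_guest_until_exit();   /* TCG blocks or KVM_RUN; a signal
                                         or exception brings us back */
            main_loop_wait(1 /* nonblocking: do not sleep in select(2) */);
        }
    }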


If the guest is started with multiple vcpus using -smp 2, for example, no additional QEMU threads will be created. Instead the single QEMU thread multiplexes between two vcpus executing guest code and the event loop. Therefore non-iothread fails to exploit multicore hosts and can result in poor performance for SMP guests.


Note that despite there being only one QEMU thread there may be zero or more worker threads. These threads may be temporary or permanent. Remember that they perform specialized tasks and do not execute guest code or process events. I wanted to emphasise this because it is easy to be confused by worker threads when monitoring the host and interpret them as vcpu threads. Remember that non-iothread only ever has one QEMU thread.


The newer architecture is one QEMU thread per vcpu plus a dedicated event loop thread. This model is known as iothread or CONFIG_IOTHREAD and can be enabled with ./configure --enable-io-thread at build time. Each vcpu thread can execute guest code in parallel, offering true SMP support, while the iothread runs the event loop. The rule that core QEMU code never runs simultaneously is maintained through a global mutex that synchronizes core QEMU code across the vcpus and iothread. Most of the time vcpus will be executing guest code and do not need to hold the global mutex. Most of the time the iothread is blocked in select(2) and does not need to hold the global mutex.
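
In outline, a vcpu thread under this model looks roughly like the following; qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() are the real QEMU helpers, while vcpu_fd and handle_exit() are placeholders:

    /* Outline of a vcpu thread under CONFIG_IOTHREAD.  The global mutex
     * is dropped while the guest runs, so other vcpus and the iothread
     * can make progress in parallel. */
    static void *vcpu_thread_fn(void *opaque)
    {
        for (;;) {
            qemu_mutex_unlock_iothread();   /* guest code needs no lock */
            int ret = ioctl(vcpu_fd, KVM_RUN, 0);
            qemu_mutex_lock_iothread();     /* back into core QEMU code */

            handle_exit(ret);               /* MMIO, pio, signals, ... */
        }
        return NULL;
    }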


Note that TCG is not thread-safe so even under the iothread model it multiplexes vcpus across a single QEMU thread. Only KVM can take advantage of per-vcpu threads.


Conclusion and words about the future
Hopefully this helps communicate the overall architecture of QEMU (which KVM inherits). Feel free to leave questions in the comments below.


In the future the details are likely to change and I hope we will see a move to CONFIG_IOTHREAD by default and maybe even a removal of !CONFIG_IOTHREAD.
