QEMU Accelerator Technical Documentation

Table of Contents

  • 1. Introduction
  • 2. API definition
    • 2.1 RAM, Physical and Virtual addresses
    • 2.2 RAM page dirtiness
    • 2.3 `/dev/kqemu' device
    • 2.4 KQEMU_GET_VERSION ioctl
    • 2.5 KQEMU_INIT ioctl
    • 2.6 KQEMU_MODIFY_RAM_PAGE ioctl
    • 2.7 KQEMU_EXEC ioctl
  • 3. KQEMU inner working and limitations
    • 3.1 Inner working
    • 3.2 General limitations
    • 3.3 Security
    • 3.4 Future Developments

QEMU Accelerator Technical Documentation

1. Introduction

The QEMU Accelerator (KQEMU) is a driver that allows a user application to run x86 code in a Virtual Machine (VM). The code can be either user or kernel code, in 64, 32 or 16 bit protected mode. KQEMU is very similar in essence to the Linux vm86 syscall, but it adds some new concepts to improve memory handling.

KQEMU has been ported to many host OSes (currently Linux, Windows, FreeBSD and Solaris). It can execute code from many guest OSes (e.g. Linux, Windows 2000/XP) even if the host CPU does not support hardware virtualization.

In this document, we assume that the reader has a good knowledge of the x86 processor and of the problems associated with the virtualization of x86 code.

2. API definition

We describe version 1.3.0 of the Linux implementation. The implementations on other OSes use the same calls, so they can be understood by reading the Linux API specification.

2.1 RAM, Physical and Virtual addresses

KQEMU manipulates three kinds of addresses:

  • RAM addresses are between 0 and the available VM RAM size minus one. They are currently stored in 32-bit words.
  • Physical addresses are addresses after MMU translation. They are stored in longs (32 bits on x86, 64 bits on x86_64).
  • Virtual addresses are addresses before MMU translation. They are stored in longs.

KQEMU has a physical page table which is used to associate a RAM address or a device I/O address range to a given physical page. It also tells if a given RAM address is visible as read-only memory. The same RAM address can be mapped at several different physical addresses. Only 4 GB of physical address space is supported in the current KQEMU implementation. Hence the bits of order >= 32 of the physical addresses are ignored.

The physical page table has the following structure:

phys_to_ram_map is a pointer to an array of 1024 pointers. If phys_to_ram_map[a] is NULL, then the physical memory range (a << 22) to ((a + 1) << 22) is unassigned. Otherwise, it points to an array of 1024 32-bit RAM addresses: phys_to_ram_map[a][b] describes the mapping of the 4K physical page (a << 22) | (b << 12). The bits from 4 to 12 give the device type. The following devices are defined:

IO_MEM_RAM (0)
RAM memory. The 20 high order bits give the corresponding RAM address.
IO_MEM_ROM (1)
ROM memory. The 20 high order bits give the corresponding RAM address.
IO_MEM_UNASSIGNED (2)
Unassigned memory.

All other device types are handled by KQEMU as unassigned memory.

In the current implementation, KQEMU does not support dynamic modification of the physical page table by the client.
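As a concrete illustration, the two-level lookup described above can be sketched in C. The helper name and the exact bit positions (device type assumed in bits 4 to 11, RAM address in the 20 high order bits) are assumptions drawn from the text, not KQEMU's real code:

```c
#include <stddef.h>
#include <stdint.h>

#define IO_MEM_RAM        0
#define IO_MEM_ROM        1
#define IO_MEM_UNASSIGNED 2

/* Decode one physical address using the two-level table described
   above. Returns the device type and, for RAM/ROM pages, stores the
   corresponding RAM address (page address plus page offset) in
   *ram_addr. Bit layout is illustrative. */
static int phys_page_lookup(uint32_t **phys_to_ram_map,
                            uint32_t phys_addr, uint32_t *ram_addr)
{
    uint32_t *l2 = phys_to_ram_map[phys_addr >> 22];   /* top 10 bits */
    if (l2 == NULL)
        return IO_MEM_UNASSIGNED;                      /* 4 MB range unmapped */
    uint32_t desc = l2[(phys_addr >> 12) & 0x3ff];     /* next 10 bits */
    int dev_type = (desc >> 4) & 0xff;                 /* assumed bits 4-11 */
    if (dev_type == IO_MEM_RAM || dev_type == IO_MEM_ROM)
        *ram_addr = (desc & ~0xfffu) | (phys_addr & 0xfff);
    return dev_type;
}
```

Note how the same RAM page could be installed in several level-2 slots, which is how one RAM address can appear at several physical addresses.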

2.2 RAM page dirtiness

It is very important for the VM to be able to tell if a given RAM page has been modified. It can be used to optimize VGA refreshes, to flush a dynamic translator cache (when used with QEMU), to handle live migration or to optimize MMU emulation.

In KQEMU, each RAM page has an associated dirty byte in the array init_params.ram_dirty. The dirty byte is set to 0xff if the corresponding RAM page is modified. That way, at most 8 clients can manage a dirty bit in each page.

KQEMU reserves the dirty bit 0x04 for its internal use.

The client must notify KQEMU if some entries of the array init_params.ram_dirty were modified from 0xff to a different value. The addresses of the corresponding RAM pages are stored by the client in the array init_params.ram_pages_to_update.

The client must also notify KQEMU if a RAM page has been modified independently of the init_params.ram_dirty state. It is done with the init_params.modified_ram_pages array.

Symmetrically, KQEMU notifies the client if a RAM page has been modified with the init_params.modified_ram_pages array. The client can use this information for example to invalidate a dynamic translation cache.
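A minimal sketch of the client side of this protocol follows. The assignment of bit 0x01 to the client is an assumption (only 0x04 is specified above, as reserved by KQEMU), and the helpers are illustrative:

```c
#include <stdint.h>

#define CLIENT_DIRTY_BIT 0x01   /* assumed bit owned by this client */
#define KQEMU_DIRTY_BIT  0x04   /* reserved by KQEMU (see above) */

/* Has the page been modified since this client last cleared its bit? */
static int page_is_dirty(const uint8_t *ram_dirty, unsigned long page)
{
    return (ram_dirty[page] & CLIENT_DIRTY_BIT) != 0;
}

/* Clear the client's bit. If the dirty byte transitions from 0xff to
   another value, the RAM page address (assumed here to be the byte
   address, page << 12) must be queued in ram_pages_to_update so that
   KQEMU is notified at the next KQEMU_EXEC. */
static void page_clear_dirty(uint8_t *ram_dirty, unsigned long page,
                             unsigned long *ram_pages_to_update,
                             unsigned int *nb_pages)
{
    if (ram_dirty[page] == 0xff)
        ram_pages_to_update[(*nb_pages)++] = page << 12;
    ram_dirty[page] &= ~CLIENT_DIRTY_BIT;
}
```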

2.3 `/dev/kqemu' device

A user client wishing to create a new virtual machine must open the device `/dev/kqemu'. There is no hard limit on the number of virtual machines that can be created and run at the same time, except for the available memory.

2.4 KQEMU_GET_VERSION ioctl

It returns the KQEMU API version as an int. The client must use it to determine if it is compatible with the KQEMU driver.

2.5 KQEMU_INIT ioctl

Input parameter: struct kqemu_init init_params

It must be called once to initialize the VM. The following structure is used as input parameter:

struct kqemu_init {
    uint8_t *ram_base;
    unsigned long ram_size;
    uint8_t *ram_dirty;
    uint32_t **phys_to_ram_map;
    unsigned long *pages_to_flush;
    unsigned long *ram_pages_to_update;
    unsigned long *modified_ram_pages;
};

The pointers ram_base, ram_dirty, phys_to_ram_map, pages_to_flush, ram_pages_to_update and modified_ram_pages must be page aligned and must point to user allocated memory.

On Linux, due to a kernel bug related to memory swapping, the corresponding memory must be mapped with mmap() from a file. We plan to remove this restriction in a future implementation.

ram_size must be a multiple of 4K and is the quantity of RAM allocated to the VM.

ram_base is a pointer to the VM RAM. It must contain at least ram_size bytes.

ram_dirty is a pointer to a byte array of length ram_size / 4096. Each byte indicates if the corresponding VM RAM page has been modified (see section 2.2 RAM page dirtiness).

phys_to_ram_map is a pointer to an array of 1024 pointers. It defines a mapping from the VM physical addresses to the RAM addresses (see section 2.1 RAM, Physical and Virtual addresses)

pages_to_flush is a pointer to an array of KQEMU_MAX_PAGES_TO_FLUSH longs. It is used to indicate which TLB entries must be flushed before executing code in the VM.

ram_pages_to_update is a pointer to an array of KQEMU_MAX_RAM_PAGES_TO_UPDATE longs. It is used to notify the VM that some RAM pages have been dirtied.

modified_ram_pages is a pointer to an array of KQEMU_MAX_MODIFIED_RAM_PAGES longs. It is used to notify the VM or the client that RAM pages have been modified.

The value 0 is returned if the ioctl succeeded.
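The buffer requirements above (page-aligned, user-allocated and, on Linux, file-backed) can be sketched as follows. The helper names are illustrative; the KQEMU_MAX_* array sizes and the actual ioctl call come from kqemu.h and are only shown in comments:

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

struct kqemu_init {
    uint8_t *ram_base;
    unsigned long ram_size;
    uint8_t *ram_dirty;
    uint32_t **phys_to_ram_map;
    unsigned long *pages_to_flush;
    unsigned long *ram_pages_to_update;
    unsigned long *modified_ram_pages;
};

/* Allocate a page-aligned region backed by a (deleted) temporary
   file, as the Linux swapping bug mentioned above requires. */
static void *alloc_shared(size_t size)
{
    char path[] = "/tmp/kqemu-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);                       /* keep the mapping, drop the name */
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

static int vm_alloc(struct kqemu_init *ip, unsigned long ram_size)
{
    ip->ram_size = ram_size;            /* must be a multiple of 4K */
    ip->ram_base = alloc_shared(ram_size);
    ip->ram_dirty = alloc_shared(ram_size / 4096);
    ip->phys_to_ram_map = alloc_shared(1024 * sizeof(uint32_t *));
    /* pages_to_flush, ram_pages_to_update and modified_ram_pages are
       allocated the same way, sized with the KQEMU_MAX_* constants
       from kqemu.h (values not reproduced here). Then:
           fd = open("/dev/kqemu", O_RDWR);
           ioctl(fd, KQEMU_INIT, ip);   // returns 0 on success
    */
    return ip->ram_base && ip->ram_dirty && ip->phys_to_ram_map ? 0 : -1;
}
```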

2.6 KQEMU_MODIFY_RAM_PAGE ioctl

Input parameter: int nb_pages

Notifies the VM that nb_pages RAM pages were modified. The corresponding RAM page addresses are written by the client in the init_params.modified_ram_pages array given with the KQEMU_INIT ioctl.

Note: this ioctl currently does nothing, but clients must use it for compatibility with future versions.

2.7 KQEMU_EXEC ioctl

Input/Output parameter: struct kqemu_cpu_state cpu_state

Structure definitions:

struct kqemu_segment_cache {
    uint32_t selector;
    unsigned long base;
    uint32_t limit;
    uint32_t flags;
};

struct kqemu_cpu_state {
#ifdef __x86_64__
    unsigned long regs[16];
#else
    unsigned long regs[8];
#endif
    unsigned long eip;
    unsigned long eflags;
    uint32_t dummy0, dummy1, dumm2, dummy3, dummy4;
    struct kqemu_segment_cache segs[6]; /* selector values */
    struct kqemu_segment_cache ldt;
    struct kqemu_segment_cache tr;
    struct kqemu_segment_cache gdt; /* only base and limit are used */
    struct kqemu_segment_cache idt; /* only base and limit are used */
    unsigned long cr0;
    unsigned long dummy5;
    unsigned long cr2;
    unsigned long cr3;
    unsigned long cr4;
    uint32_t a20_mask;
    /* sysenter registers */
    uint32_t sysenter_cs;
    uint32_t sysenter_esp;
    uint32_t sysenter_eip;
    uint64_t efer;
    uint64_t star;
#ifdef __x86_64__
    unsigned long lstar;
    unsigned long cstar;
    unsigned long fmask;
    unsigned long kernelgsbase;
#endif
    uint64_t tsc_offset;
    unsigned long dr0;
    unsigned long dr1;
    unsigned long dr2;
    unsigned long dr3;
    unsigned long dr6;
    unsigned long dr7;
    uint8_t cpl;
    uint8_t user_only;
    uint32_t error_code;
    unsigned long next_eip;
    unsigned int nb_pages_to_flush;
    long retval;
    unsigned int nb_ram_pages_to_update;
    unsigned int nb_modified_ram_pages;
};

Executes x86 instructions in the VM context. The full x86 CPU state is defined in this structure. It contains in particular the values of the 8 (or 16 on x86_64) general purpose registers, the contents of the segment caches, the RIP and EFLAGS values, etc.

If cpu_state.user_only is 1, a user only emulation is done. cpu_state.cpl must be 3 in that case.

KQEMU_EXEC does the following:

  1. Update the internal dirty state of the cpu_state.nb_ram_pages_to_update RAM pages from the array init_params.ram_pages_to_update. If cpu_state.nb_ram_pages_to_update has the value KQEMU_RAM_PAGES_UPDATE_ALL, it means that all the RAM pages may have been dirtied. The array init_params.ram_pages_to_update is ignored in that case.
  2. Update the internal KQEMU state by taking into account that the cpu_state.nb_modified_ram_pages RAM pages from the array init_params.modified_ram_pages were modified by the client.
  3. Flush the virtual CPU TLB entries corresponding to the virtual addresses in the array init_params.pages_to_flush of length cpu_state.nb_pages_to_flush. If cpu_state.nb_pages_to_flush is KQEMU_FLUSH_ALL, all the TLBs are flushed. The array init_params.pages_to_flush is ignored in that case.
  4. Load the virtual CPU state from cpu_state.
  5. Execute some code in the VM context.
  6. Save the virtual CPU state into cpu_state.
  7. Indicate the reason for which the execution was stopped in cpu_state.retval.
  8. Update cpu_state.nb_pages_to_flush and init_params.pages_to_flush to notify the client that some virtual CPU TLBs were flushed. The client can use this notification to synchronize its own virtual TLBs with KQEMU.
  9. Set cpu_state.nb_ram_pages_to_update to 1 if some RAM dirty bytes transitioned from dirty (0xff) to a non-dirty value. Otherwise, cpu_state.nb_ram_pages_to_update is set to 0.
  10. Update cpu_state.nb_modified_ram_pages and init_params.modified_ram_pages to notify the client that some RAM pages were modified.
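The call sequence above can be sketched from the client's point of view. The structure is abridged to the bookkeeping fields, and the actual ioctl is abstracted behind a function pointer (here exercised with a stub) so the flow is visible without a /dev/kqemu device:

```c
/* Abridged stand-in: only the KQEMU_EXEC bookkeeping fields of
   struct kqemu_cpu_state are shown. */
struct kqemu_cpu_state_abridged {
    unsigned int nb_pages_to_flush;
    long retval;
    unsigned int nb_ram_pages_to_update;
    unsigned int nb_modified_ram_pages;
};

typedef long (*kqemu_exec_fn)(struct kqemu_cpu_state_abridged *s);

/* One execution slice: steps 1-3 fill in the client's notifications,
   the exec callback stands for ioctl(fd, KQEMU_EXEC, s) (steps 4-7),
   and on return the same fields carry KQEMU's notifications back to
   the client (steps 8-10), which a real client would use to sync its
   own TLBs and translation cache. */
static long run_slice(kqemu_exec_fn exec,
                      struct kqemu_cpu_state_abridged *s,
                      unsigned int nb_flush,
                      unsigned int nb_update,
                      unsigned int nb_modified)
{
    s->nb_pages_to_flush = nb_flush;        /* entries in pages_to_flush */
    s->nb_ram_pages_to_update = nb_update;  /* entries in ram_pages_to_update */
    s->nb_modified_ram_pages = nb_modified; /* entries in modified_ram_pages */
    return exec(s);                         /* s->retval gives the stop reason */
}
```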

cpu_state.retval indicates the reason why the execution was stopped:

KQEMU_RET_EXCEPTION | n
The virtual CPU raised an exception and KQEMU cannot handle it. The exception number n is stored in the 8 low order bits. The field cpu_state.error_code contains the exception error code if it is needed. It should be noted that in user only emulation, KQEMU handles no exceptions by itself.
KQEMU_RET_INT | n
(user only emulation) The virtual CPU generated a software interrupt (e.g. an INT instruction). The interrupt number n is stored in the 8 low order bits. The field cpu_state.next_eip contains the value of RIP after the instruction raising the interrupt; cpu_state.eip contains the value of RIP at the instruction raising the interrupt.
KQEMU_RET_SOFTMMU
The virtual CPU could not handle the current instruction. This is not a fatal error: usually the client just needs to interpret the instruction itself. It can happen for the following reasons:
  • memory access to an unassigned address or unknown device type;
  • an instruction cannot be accurately executed by KQEMU (e.g. SYSENTER, HLT, ...);
  • more than KQEMU_MAX_MODIFIED_RAM_PAGES pages were modified;
  • some unsupported bits were modified in CR0 or CR4;
  • GDT.base or LDT.base is not a multiple of 8;
  • the GDT or LDT tables were modified while CPL = 3;
  • EFLAGS.VM was set.
KQEMU_RET_INTR
A signal from the OS interrupted KQEMU.
KQEMU_RET_SYSCALL
(user only emulation) The SYSCALL instruction was executed. The field cpu_state.next_eip contains the value of RIP after the instruction; cpu_state.eip contains the RIP of the instruction.
KQEMU_RET_ABORT
An unrecoverable error was detected. This is usually due to a bug in KQEMU, so it should never happen!
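Since KQEMU_RET_EXCEPTION and KQEMU_RET_INT carry their argument n in the 8 low order bits, a client can split cpu_state.retval as follows (the helper names are illustrative; the KQEMU_RET_* values themselves come from kqemu.h):

```c
/* Split a retval into its class (high bits) and its argument n
   (low 8 bits), as described above for KQEMU_RET_EXCEPTION | n and
   KQEMU_RET_INT | n. */
static long kqemu_ret_class(long retval)
{
    return retval & ~0xffL;
}

static int kqemu_ret_arg(long retval)
{
    return (int)(retval & 0xff);
}
```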

3. KQEMU inner working and limitations

3.1 Inner working

The main priorities when implementing KQEMU were simplicity and security. Unlike some other virtualization systems, it does not do any dynamic translation or code patching.

  • KQEMU always executes the target code at CPL = 3 on the host processor. This means that KQEMU can use the page protections to ensure that the VM cannot modify the host OS or the KQEMU monitor. Moreover, it means that KQEMU does not need to modify the segment limits to ensure memory protection. Another advantage is that this method works with 64 bit code too.
  • KQEMU maintains a shadow page table simulating the TLBs of the virtual CPU. The shadow page table persists between calls to KQEMU_EXEC.
  • When the target CPL is 3, the target GDT and LDT are copied to the host GDT and LDT so that the LAR and LSL instructions return a meaningful value. This is important for 16 bit code.
  • When the target CPL is different from 3, the host GDT and LDT are cleared so that any segment loading causes a General Protection Fault. That way, KQEMU can intercept every segment loading.
  • All the code running with EFLAGS.IF = 0 is interpreted so that EFLAGS.IF can be accurately reset in the VM. Fortunately, modern OSes tend to execute very little code with interrupts disabled.
  • KQEMU maintains dirty bits for every RAM page so that modified RAM pages can be tracked. It is useful to know if the GDT and LDT are modified in user mode, and will be useful later to optimize shadow page table switching. It is also useful to maintain the coherency of the user space QEMU translation cache.
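The TLB-shadowing idea from the bullets above can be illustrated with a toy model: the shadow table fills lazily, keeps possibly stale entries exactly like a hardware TLB, and is invalidated by pages_to_flush-style requests. Everything here is a simplified stand-in, not KQEMU's real data structures:

```c
#include <stdint.h>

#define SHADOW_SIZE 256   /* toy direct-mapped shadow TLB */

struct shadow_entry { uint32_t vaddr; uint32_t paddr; int valid; };
static struct shadow_entry shadow[SHADOW_SIZE];

/* Toy guest "page table": flat array mapping virtual page number
   to physical page number. */
static uint32_t guest_pt[SHADOW_SIZE];

/* On a shadow miss, walk the guest table and install the entry;
   on a hit, reuse the cached (possibly stale) translation. */
static uint32_t translate(uint32_t vaddr)
{
    unsigned idx = (vaddr >> 12) % SHADOW_SIZE;
    struct shadow_entry *e = &shadow[idx];
    if (!e->valid || e->vaddr != (vaddr & ~0xfffu)) {
        e->vaddr = vaddr & ~0xfffu;
        e->paddr = guest_pt[(vaddr >> 12) % SHADOW_SIZE] << 12;
        e->valid = 1;
    }
    return e->paddr | (vaddr & 0xfff);
}

/* One entry of a pages_to_flush-style request. */
static void flush_page(uint32_t vaddr)
{
    shadow[(vaddr >> 12) % SHADOW_SIZE].valid = 0;
}
```

The test below shows why the flush notifications of section 2.7 are needed: after the guest edits its page table, the shadow entry stays stale until it is explicitly flushed.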

3.2 General limitations

Note 1: KQEMU does not currently use the hardware virtualization features of newer x86 CPUs. We expect that the limitations would be different in that case.

Note 2: KQEMU supports both x86 and x86_64 CPUs.

Before entering the VM, the following conditions must be satisfied :

  1. CR0.PE = 1 (protected mode must be enabled)
  2. CR0.MP = 1 (native math support)
  3. CR0.WP = 1 (write protection for user pages)
  4. EFLAGS.VM = 0 (no VM86 support)
  5. At least 8 consecutive GDT descriptors must be available (currently at a fixed location in the GDT).
  6. At least 32 MB of virtual address space must be free (currently at a fixed location).
  7. All the pages containing the LDT and GDT must be RAM pages.

If EFLAGS.IF is set, the following assumptions are made on the executing code:

  1. If EFLAGS.IOPL = 3, EFLAGS.IOPL = 0 is returned when EFLAGS is read.
  2. POPF cannot be used to clear EFLAGS.IF
  3. RDTSC returns host cycles (could be improved if needed).
  4. The values returned by SGDT, SIDT, SLDT are invalid.
  5. Reading CS.rpl and SS.rpl always returns 3, regardless of the CPL.
  6. In 64 bit mode with CPL != 3, reading SS.sel does not give 0 if the OS stored 0 in it.
  7. LAR, LSL, VERR and VERW return invalid results if CPL != 3.
  8. The CS and SS segment caches must be consistent with the descriptor tables.
  9. The DS, ES, FS and GS segment caches must be consistent with the descriptor tables for CPL = 3.
  10. Some rarely used instructions trap to the user space client (performance issue).

If EFLAGS.IF is reset, the code is interpreted, so the VM code can be accurately executed. Some instructions trap to the user space emulator because the interpreter does not handle them. A limitation of the interpreter is that segment limits are currently not always tested.

3.3 Security

The VM code always runs with CPL = 3 on the host, so the VM code has no more privilege than regular user code.

The MMU is used to protect the memory used by the KQEMU monitor. That way, no segment limit patching is necessary. Moreover, the guest OS is free to use any virtual address, in particular the ones near the start or the end of the virtual address space. The price to pay is that CR3 must be modified at every emulated system call because different page tables are needed for user and kernel modes.

3.4 Future Developments

  • Small API changes to support 32 bit ioctls on 64 bit hosts. Currently only 64 bit ioctls can be used with a 64 bit host OS. This is an issue if one wants to launch the 32 bit QEMU client on a 64 bit host.
  • Support for the Linux 2.6.20 paravirtualization interface. It would enable better performance at the expense of the use of patched kernels. The primary goal of the Linux paravirtualization interface would be to disable the code interpreter when EFLAGS.IF = 0. A simple way to do it is to maintain a KQEMU specific 4K page containing the current value of IF and IOPL that the paravirtualization interface can use.
  • Optimization of the page table shadowing. A shadow page table cache could be implemented by tracking the modification of the guest page tables. The exact performance gains are difficult to estimate because the tracking itself would introduce some performance loss.
  • Support for hardware virtualization. The performance gains, if any, will be small but there would be no limitations regarding what the guest OS can do.
  • Support for guest SMP. There is no particular problem except when a RAM page must be unlocked because the host does not have enough memory. This particular case needs specific Inter-Processor Interrupts (IPIs).
  • Dynamic relocation of the monitor code so that a 32 MB hole in the guest address space is found automatically without making assumptions on the guest OS.

This document was generated on 6 February 2007 using texi2html 1.56k.