Exceptions and Interrupts (System Calls), GDT, and LDT in Xen

Source: Internet. Editor: 程序博客网. Time: 2024/04/29 21:33

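Privilege levels are a key factor in implementing system virtualization: they allow the whole system to be divided into virtualization management, the system kernel, user space, and so on, providing layered protection and resource sharing. A close look at how the GDT, LDT, exceptions, and interrupts are implemented is therefore valuable, both in theory and in practice, for understanding the move from a system that contains only the Linux kernel to one with a hypervisor underneath it. This article draws on source code analysis and the Xen docs to explain how Xen handles these issues.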

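GDT: The GDT is initially set up by Xen and determines where the guest OS starts; a guest OS that does not want this layout can change it through a hypercall. At boot the guest OS uses a default GDT that Xen provides by convention, and this GDT does not reside in the guest's own memory. If the guest OS wants something other than the default "flat" ring-1 and ring-3 segments that this GDT provides, it must first allocate memory for a GDT from its own memory and then register it with Xen. Registration is done with int set_gdt(unsigned long *frame_list, int entries), where frame_list is an array of up to 14 machine page frames holding the new GDT and entries is the number of descriptor entries in those frames. Once registered, the frames may only be mapped read-only. The guest can use only the first 14 frames, because the 15th and 16th frames hold Xen's own GDT entries, so there are 16 frames in total. The entries that Xen reserves are defined in xen/include/public/arch-x86_32.h.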

/*
* SEGMENT DESCRIPTOR TABLES
*/
/*
* A number of GDT entries are reserved by Xen. These are not situated at the
* start of the GDT because some stupid OSes export hard-coded selector values
* in their ABI. These hard-coded values are always near the start of the GDT,
* so Xen places itself out of the way, at the far end of the GDT.
*/
#define FIRST_RESERVED_GDT_PAGE 14
#define FIRST_RESERVED_GDT_BYTE (FIRST_RESERVED_GDT_PAGE * 4096)
#define FIRST_RESERVED_GDT_ENTRY (FIRST_RESERVED_GDT_BYTE / 8)
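The arithmetic behind these constants can be checked with a small sketch. The three FIRST_RESERVED_* macros are copied from the excerpt above; the helper names and the SKETCH_* size constants are hypothetical and only illustrate how the 14-frame limit translates into a descriptor count. This is not Xen code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Constants copied from the header excerpt above. */
#define FIRST_RESERVED_GDT_PAGE  14
#define FIRST_RESERVED_GDT_BYTE  (FIRST_RESERVED_GDT_PAGE * 4096)
#define FIRST_RESERVED_GDT_ENTRY (FIRST_RESERVED_GDT_BYTE / 8)

/* Assumed sizes: 4 KiB pages and 8-byte descriptors, as on 32-bit x86. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_DESC_SIZE 8u

/* Hypothetical helper: page frames needed to hold `entries` descriptors. */
static size_t gdt_frames_needed(size_t entries)
{
    size_t per_frame = SKETCH_PAGE_SIZE / SKETCH_DESC_SIZE; /* 512 per frame */
    return (entries + per_frame - 1) / per_frame;           /* round up */
}

/* Hypothetical helper: true if a guest may place a descriptor at `index`. */
static bool gdt_entry_is_guest_usable(size_t index)
{
    return index < (size_t)FIRST_RESERVED_GDT_ENTRY;
}

/* Hypothetical validity check for a set_gdt-style request. */
static bool gdt_request_is_valid(size_t entries)
{
    assert(SKETCH_PAGE_SIZE % SKETCH_DESC_SIZE == 0);
    return entries > 0 && entries <= (size_t)FIRST_RESERVED_GDT_ENTRY;
}
```

With these definitions the first reserved entry is index 7168, so a guest GDT of 7168 entries fills exactly the 14 frames the guest is allowed to use.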

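Xen also lets the guest OS modify an individual segment descriptor, through the hypercall update_descriptor(uint64_t ma, uint64_t desc), where ma is the machine address of the descriptor and desc is its new value.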

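LDT: The guest OS can also update its LDT on its own. This is done through mmu_update, which takes the base address of the LDT and the number of entries as parameters. Individual LDT entries can be updated in the same way as individual GDT entries.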

For both the GDT and the LDT, the 32-bit check_descriptor uses fixup_guest_code_selector to set the DPL to 1 (see ./xen/arch/x86/x86_32/mm.c), whereas the 64-bit version sets it to 0 (see ./xen/arch/x86/x86_64/mm.c).
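To make the idea of forcing a descriptor's privilege level concrete, here is a minimal sketch of reading and raising the DPL field of an 8-byte x86 segment descriptor. The bit position (bits 45-46) is the architectural layout; the helper names and the min_dpl parameter are hypothetical, and the code does not reproduce Xen's check_descriptor.

```c
#include <assert.h>
#include <stdint.h>

/* In an 8-byte x86 segment descriptor the DPL occupies bits 45-46. */
#define DESC_DPL_SHIFT 45
#define DESC_DPL_MASK  ((uint64_t)3 << DESC_DPL_SHIFT)

/* Extract the descriptor privilege level (0..3). */
static unsigned desc_get_dpl(uint64_t desc)
{
    return (unsigned)((desc & DESC_DPL_MASK) >> DESC_DPL_SHIFT);
}

/* Hypothetical helper: raise the DPL to at least `min_dpl`, so that a
 * guest cannot install a descriptor more privileged than it is allowed. */
static uint64_t desc_clamp_dpl(uint64_t desc, unsigned min_dpl)
{
    assert(min_dpl <= 3);
    if (desc_get_dpl(desc) < min_dpl)
        desc = (desc & ~DESC_DPL_MASK) | ((uint64_t)min_dpl << DESC_DPL_SHIFT);
    return desc;
}
```

For example, a flat ring-0 code descriptor 0x00cf9a000000ffff clamped with min_dpl = 1 becomes 0x00cfba000000ffff, while a ring-3 descriptor passes through unchanged.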

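System calls (software interrupts) and exceptions originate from inside the running code, whereas hardware interrupts generally arrive from outside: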

Xen does not allow the guest kernel to set up the IDT for the processor, but
allows the guest kernel to pass on an IDT it desires to Xen by means of a
hypercall. Xen makes its own IDT on behalf of the guest kernel that the
processor accesses. However, when it does this the stack that the hardware
jumps to is at privilege level 0 and is a stack accessible only to Xen.

Xen emulates the hardware behavior to the OS by creating a bounce
frame on the Linux Kernel stack just as the x86 hardware would do.
However, Xen does not turn the control directly to the Interrupt Service
Routine as set by the Linux kernel. Instead, the Xen hypervisor jumps to
the “hypervisor callback” routine defined in the XenoLinux kernel after
creating a bounce frame on the kernel stack of the domain.

References

http://www.sprg.uniroma2.it/kernelhacking2008/lectures/lkhc08-06.pdf

xen Interface manual Xen v3.0 for x86

Very useful material:

Xen/IA64 interrupt virtualization

IDT

In vanilla Linux the IDT is initialized in trap_init() using the set_*_gate() family of functions (for example set_trap_gate() and set_intr_gate()). Because Xen handles the IDT, all calls to these functions must be replaced with a single call to the HYPERVISOR_set_trap_table() hypercall.

HYPERVISOR_set_trap_table() accepts as a parameter the virtual IDT of the guest, represented by the trap_table structure (of type struct trap_info) in traps-xen.c.

struct trap_info resembles a trap or interrupt gate, having fields for vector, handler segment selector and offset.

Xen maintains two IDTs: one global IDT (its own) and one per-domain IDT. Xen uses the global IDT to register all trap handlers except the system call handler (int 0x80).

Virtual IDT

• A virtual IDT is provided by guest OS for setting up interrupt vector table.

• The exception stack frame presented to a virtual trap handler is identical to its native equivalent.

The trap table of a Xen guest is in linux-2.6-xen-sparse/arch/i386/kernel/traps-xen.c:

static trap_info_t trap_table[] = {
    {  0, 0,   __KERNEL_CS, (unsigned long)divide_error },
    {  1, 0|4, __KERNEL_CS, (unsigned long)debug },
    {  3, 3|4, __KERNEL_CS, (unsigned long)int3 },
    {  4, 3,   __KERNEL_CS, (unsigned long)overflow },
    {  5, 0,   __KERNEL_CS, (unsigned long)bounds },
    {  6, 0,   __KERNEL_CS, (unsigned long)invalid_op },
    {  7, 0|4, __KERNEL_CS, (unsigned long)device_not_available },
    {  9, 0,   __KERNEL_CS, (unsigned long)coprocessor_segment_overrun },
    { 10, 0,   __KERNEL_CS, (unsigned long)invalid_TSS },
    { 11, 0,   __KERNEL_CS, (unsigned long)segment_not_present },
    { 12, 0,   __KERNEL_CS, (unsigned long)stack_segment },
    { 13, 0,   __KERNEL_CS, (unsigned long)general_protection },
    { 14, 0|4, __KERNEL_CS, (unsigned long)page_fault },
    { 15, 0,   __KERNEL_CS, (unsigned long)fixup_4gb_segment },
    { 16, 0,   __KERNEL_CS, (unsigned long)coprocessor_error },
    { 17, 0,   __KERNEL_CS, (unsigned long)alignment_check },
#ifdef CONFIG_X86_MCE
    { 18, 0,   __KERNEL_CS, (unsigned long)machine_check },
#endif
    { 19, 0,   __KERNEL_CS, (unsigned long)simd_coprocessor_error },
    { SYSCALL_VECTOR, 3, __KERNEL_CS, (unsigned long)system_call },
    {  0, 0,   0, 0 }
};

void trap_init() {
    HYPERVISOR_set_trap_table(trap_table);
}
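The second column of each entry packs two pieces of information. The sketch below shows one way to decode it, assuming the layout described in Xen's public headers: the low two bits give the privilege level allowed to raise the vector, and bit 2 asks for event delivery to be disabled on entry. The struct and helper names here are illustrative, not the real trap_info_t definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the four columns in the table above. */
struct trap_info_sketch {
    uint8_t       vector;   /* exception or interrupt vector */
    uint8_t       flags;    /* bits 0-1: privilege level, bit 2: mask events */
    uint16_t      cs;       /* handler code segment selector */
    unsigned long address;  /* handler entry point */
};

/* Privilege level that is allowed to invoke this vector in software. */
static unsigned ti_dpl(const struct trap_info_sketch *ti)
{
    return ti->flags & 3u;
}

/* Non-zero if event delivery should be disabled when the handler starts
 * (the paravirtual analogue of an interrupt gate clearing IF). */
static int ti_masks_events(const struct trap_info_sketch *ti)
{
    return (ti->flags & 4u) != 0;
}

/* Count entries up to the terminator. The terminator is recognised by a
 * zero handler address, since vector 0 is itself a valid entry. */
static unsigned trap_table_len(const struct trap_info_sketch *table)
{
    unsigned n = 0;
    assert(table != NULL);
    while (table[n].address != 0)
        n++;
    return n;
}
```

Read this way, the int3 entry (flags 3|4) can be raised from ring 3 and runs with events masked, while the system_call entry (flags 3) can be raised from ring 3 and leaves events enabled.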

Why hypercalls use an interrupt gate

For how hypercalls use an interrupt gate, see the "Implementation in Xen" section of the hypercall article.

trap/interrupt gate for hypercall

A curious question about the IDT descriptor type for hypercalls: what is the reason for using the interrupt-gate type (14) for the hypercall vector (0x82) on 32-bit Xen?

Answer:

Everything's an interrupt gate on 32-bit Xen, so that we can safely (atomically) save away guest segment register state. NMI is the only real pain, and I suppose MCE too.

Interrupt handlers save and restore segment registers. We could fault on a reload of a segment register and lose the original segment register value.
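The difference between the two gate types discussed here is a single bit in the descriptor. As a rough illustration, the following sketch packs a 32-bit IDT gate in the architectural x86 layout, where type 14 is an interrupt gate (the CPU clears EFLAGS.IF on entry) and type 15 is a trap gate (IF is left alone). The struct and function names are made up for this example and are not Xen's.

```c
#include <assert.h>
#include <stdint.h>

#define GATE_TYPE_INTERRUPT32 14u  /* CPU clears EFLAGS.IF on entry */
#define GATE_TYPE_TRAP32      15u  /* CPU leaves EFLAGS.IF unchanged */

/* The two 32-bit halves of an 8-byte IDT gate descriptor. */
struct idt_gate_sketch {
    uint32_t lo;
    uint32_t hi;
};

/* Pack selector, handler offset, gate type and DPL into a present gate. */
static struct idt_gate_sketch make_gate(uint16_t selector, uint32_t offset,
                                        unsigned type, unsigned dpl)
{
    struct idt_gate_sketch g;
    assert(dpl <= 3);
    assert(type == GATE_TYPE_INTERRUPT32 || type == GATE_TYPE_TRAP32);
    g.lo = ((uint32_t)selector << 16) | (offset & 0xFFFFu);
    g.hi = (offset & 0xFFFF0000u)   /* offset bits 31..16 */
         | 0x8000u                  /* present bit */
         | ((uint32_t)dpl << 13)    /* descriptor privilege level */
         | ((uint32_t)type << 8);   /* gate type */
    return g;
}

static unsigned gate_type(struct idt_gate_sketch g)
{
    return (g.hi >> 8) & 0xFu;
}

/* Non-zero if entering through this gate disables maskable interrupts. */
static int gate_masks_interrupts(struct idt_gate_sketch g)
{
    return gate_type(g) == GATE_TYPE_INTERRUPT32;
}
```

Because an interrupt gate guarantees that no maskable interrupt arrives between entry and the point where the handler has saved the guest's segment registers, it gives the atomicity the answer above refers to.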

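Trap handling flow

The entry and exit paths are in handle_exception in xen/arch/x86/x86_32/entry.S.

There are several cases:

1 A system call from a guest application switches directly to the guest kernel in ring 1; see the later section on guest system calls.

2 Everything else is handled by Xen's exception handlers: an exception occurs ==> trap into the VMM. When an exception occurs the processor transfers control to the Xen hypervisor, using the Xen exception handlers in entry.S.

2.1 The following exceptions, in xen/arch/x86/traps.c, all call do_trap: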

DO_ERROR_NOCODE(TRAP_divide_error,    divide_error)
DO_ERROR_NOCODE(TRAP_overflow,        overflow)
DO_ERROR_NOCODE(TRAP_bounds,          bounds)
DO_ERROR_NOCODE(TRAP_copro_seg,       coprocessor_segment_overrun)
DO_ERROR(       TRAP_invalid_tss,     invalid_TSS)
DO_ERROR(       TRAP_no_segment,      segment_not_present)
DO_ERROR(       TRAP_stack_error,     stack_segment)
DO_ERROR_NOCODE(TRAP_copro_error,     coprocessor_error)
DO_ERROR(       TRAP_alignment_check, alignment_check)
DO_ERROR_NOCODE(TRAP_simd_error,      simd_coprocessor_error)

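do_trap() ==> checks whether the trap came from the guest OS ==> if so, calls do_guest_trap(); otherwise Xen panics.
Guest OS App --> VMM --> Guest OS Kernel

2.2 General protection faults (#GP) and invalid opcodes have their own handlers, do_general_protection and do_invalid_op. do_general_protection deserves special mention: a guest kernel executing a sensitive instruction can raise #GP, in which case emulate_privileged_op is called to emulate the instruction. The rest of the handling resembles do_trap.

For an example, see http://wiki.xensource.com/xenwiki/XenMemoryManagement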

What do_guest_trap does

1 Gets from the guest context the gate for the exception

2 Creates the exception frame required by the guest OS to process the exception

Then iret is executed to return control to the guest OS exception handler
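The two steps above can be pictured with a small simulation. The sketch assumes a frame laid out the way x86 hardware lays it out on a privilege change (SS, ESP, EFLAGS, CS, EIP, then an optional error code); the stack model, struct names and function names are hypothetical and greatly simplified compared with Xen's create_bounce_frame.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64

/* Saved register state of the interrupted guest context (simplified). */
struct regs_sketch {
    uint32_t eip, cs, eflags, esp, ss;
};

/* A toy guest kernel stack that grows towards lower indices. */
struct guest_stack_sketch {
    uint32_t mem[STACK_WORDS];
    unsigned top;               /* index of the lowest used word */
};

static void push32(struct guest_stack_sketch *s, uint32_t value)
{
    assert(s->top > 0);         /* refuse to overflow the toy stack */
    s->mem[--s->top] = value;
}

/* Build the exception frame on the guest kernel stack, then redirect the
 * saved context to the guest's handler so that the final iret lands there. */
static void bounce_to_guest(struct guest_stack_sketch *ks,
                            struct regs_sketch *regs,
                            uint32_t handler_cs, uint32_t handler_eip,
                            int has_error_code, uint32_t error_code)
{
    push32(ks, regs->ss);
    push32(ks, regs->esp);
    push32(ks, regs->eflags);
    push32(ks, regs->cs);
    push32(ks, regs->eip);
    if (has_error_code)
        push32(ks, error_code);

    regs->cs  = handler_cs;     /* taken from the guest's trap table entry */
    regs->eip = handler_eip;
}
```

From the guest handler's point of view the stack then looks the same as it would after a native exception, which is the property the text relies on.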

Incidentally, the following statement in The Definitive Guide to the Xen Hypervisor, section 7.2, p. 120, is not accurate:

“The code path for delivering a trap is significantly simpler than that for events. When the guest is run on a particular (physical) CPU, the hypervisor installs an Interrupt Descriptor Table (IDT) on behalf of the guest domain. This means that the interrupt handling path does not involve the hypervisor at all, for all interrupts are handled by the guest.”

Guest system calls

The trap table contains this entry:

{ SYSCALL_VECTOR, 3, __KERNEL_CS, (unsigned long)system_call },

As mentioned earlier, int 80h is treated specially: If everything is 32-bit, "int 80" will be used, but it'll be directed directly to the guest kernel in ring 1 (i.e. the hypervisor isn't involved).

The implementation is as follows.

In xen/arch/x86/traps.c:do_set_trap_table():

    if ( cur.vector == 0x80 )
        init_int80_direct_trap(curr);

init_int80_direct_trap sets up int80_desc; later, on a context switch, paravirt_ctxt_switch_to => set_int80_direct_trap installs it.

When a VM gets scheduled, its system call handler (from the per-domain IDT table) is registered with the processor (within the VCPU). Hence when a domain/VM executes a system call, its own handler is executed.

==> This way x86_32 does not need to trap into the VMM for system calls, and each guest OS can have its own system call handler.

Implementation differs for x86_64: Xen registers its own system call handler with the processor and from that handler routes the request to VM/Domain specific handler.

==> Because the x86_64 guest kernel also runs in ring 3 (unlike the earlier ring 0), the old system call path no longer works and had to be rewritten.
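The x86_32 mechanism described above, where each VCPU carries its own int 0x80 gate and the gate is installed on every context switch, can be sketched as follows. All names here (the structs, the toy IDT array, and both functions) are hypothetical stand-ins for init_int80_direct_trap and set_int80_direct_trap, and the model ignores descriptor encoding entirely.

```c
#include <assert.h>
#include <stdint.h>

#define NR_VECTORS_SKETCH     256
#define SYSCALL_VECTOR_SKETCH 0x80

/* Simplified gate: just the handler's selector and entry point. */
struct gate_sketch {
    uint16_t cs;
    uint32_t eip;
};

/* Each VCPU remembers the int 0x80 gate its guest registered. */
struct vcpu_sketch {
    struct gate_sketch int80_desc;
};

/* Toy stand-in for the IDT that the physical CPU actually consults. */
static struct gate_sketch cpu_idt_sketch[NR_VECTORS_SKETCH];

/* Called while processing the guest's trap table: only vector 0x80 is
 * cached for direct delivery, everything else stays under Xen's control. */
static void set_trap_table_entry(struct vcpu_sketch *v, unsigned vector,
                                 uint16_t cs, uint32_t eip)
{
    assert(vector < NR_VECTORS_SKETCH);
    if (vector == SYSCALL_VECTOR_SKETCH) {
        v->int80_desc.cs  = cs;
        v->int80_desc.eip = eip;
    }
}

/* Called when `v` is scheduled onto the CPU: install its system call gate. */
static void context_switch_to(const struct vcpu_sketch *v)
{
    cpu_idt_sketch[SYSCALL_VECTOR_SKETCH] = v->int80_desc;
}
```

After a switch, an int 0x80 issued by the running guest's application reaches that guest's own handler without a detour through the hypervisor, and two guests can register different handlers.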

http://hal.archives-ouvertes.fr/docs/00/43/10/31/PDF/Technical_Report_Syscall_Interception.pdf

System Calls in x86_32

xen: add more Xen dom0 support

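Xen and the guest each have their own init_IRQ function, global irq_desc array, do_IRQ handler, and interrupt-return path. In short, Xen's interrupt handling borrows from the Linux implementation.

The big picture

The material from Xen Intro, version 1.0, is very much to the point:

Registration (or binding) of irqs in guest domains:

Part 1: guest initialization. A guest IRQ is in fact bound to an event channel (evtchn).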

The guest OS calls init_IRQ() when it boots (start_kernel() method calls init_IRQ() ; file init/main.c). (init_IRQ() is in file sparse/arch/xen/kernel/evtchn.c) There can be 256 physical irqs; so there is an array called irq_desc with 256 entries. (file sparse/include/linux/irq.h)

All elements in this array are initialized in init_IRQ() so that their status is disabled (IRQ_DISABLED).

Now, when a physical driver starts it usually calls request_irq(). This method eventually calls setup_irq() (both in sparse/kernel/irq/manage.c), which calls startup_pirq(). startup_pirq() sends a hypercall to the hypervisor (HYPERVISOR_event_channel_op) in order to bind the physical irq (pirq). The hypercall is of type EVTCHNOP_bind_pirq. See startup_pirq() (file sparse/arch/xen/kernel/evtchn.c).

Note 1: Xen 3.1 no longer ships sparse/kernel/irq/manage.c; the file lives in the Linux kernel.

Note 2: The physical driver corresponds to static struct hw_interrupt_type pirq_type:

static struct hw_interrupt_type pirq_type = {
    .typename = "Phys-irq",
    .startup  = startup_pirq,
};

setup_irq in turn makes the call desc->handler->startup(irq).

Part 2: Xen

On the hypervisor side, this hypercall is handled in the evtchn_bind_pirq() method (file /common/event_channel.c), which calls pirq_guest_bind() (file arch/x86/irq.c). pirq_guest_bind() changes the status of the corresponding irq_desc array element to enabled (~IRQ_DISABLED, see note [3]). It also calls the startup() method. Now, when an interrupt arrives from the controller (the APIC), we arrive at the do_IRQ() method, as in the usual Linux kernel (also in arch/x86/irq.c). The hypervisor handles only timer and serial interrupts. Other interrupts are passed to the domains by calling __do_IRQ_guest() (in fact, the IRQ_GUEST flag is set for all interrupts except timer and serial interrupts). __do_IRQ_guest() sends the interrupt by calling send_guest_pirq() for all guests that are registered on this IRQ. send_guest_pirq() creates an event channel (an instance of evtchn, see note [4]) and sets the pending flag of this event channel by calling evtchn_set_pending(). Then, asynchronously, Xen notifies the domain of this interrupt (unless it is masked).

Note [3]: The irq_desc here is Xen's irq_desc, whereas the one set to IRQ_DISABLED in Part 1 is the guest's irq_desc.

Note [4]: The statement "The send_guest_pirq() creates an event channel" is not accurate. The event channel was already allocated in evtchn_bind_pirq; send_guest_pirq merely looks up the evtchn for the given pirq.
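The division of labour pointed out in note [4], where the port is allocated at bind time and only looked up at delivery time, can be sketched like this. The data structure and function names are hypothetical simplifications, and the fixed-size arrays stand in for Xen's real per-domain event channel tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_PIRQS_SKETCH 16
#define NR_PORTS_SKETCH 32

/* Toy per-domain state. Port 0 is treated as "not bound". */
struct domain_sketch {
    int     pirq_to_port[NR_PIRQS_SKETCH];
    uint8_t pending[NR_PORTS_SKETCH];
    uint8_t masked[NR_PORTS_SKETCH];
    int     next_port;
};

static void domain_init(struct domain_sketch *d)
{
    memset(d, 0, sizeof(*d));
    d->next_port = 1;           /* never hand out port 0 */
}

/* Bind time: allocate an event channel port for the pirq.
 * Returns the port, or -1 if the pirq is already bound or ports ran out. */
static int bind_pirq(struct domain_sketch *d, int pirq)
{
    assert(pirq >= 0 && pirq < NR_PIRQS_SKETCH);
    if (d->pirq_to_port[pirq] != 0 || d->next_port >= NR_PORTS_SKETCH)
        return -1;
    d->pirq_to_port[pirq] = d->next_port;
    return d->next_port++;
}

/* Delivery time: only look up the existing port and mark it pending.
 * Returns true if the guest should be notified (bound and not masked). */
static bool send_pirq(struct domain_sketch *d, int pirq)
{
    int port;
    assert(pirq >= 0 && pirq < NR_PIRQS_SKETCH);
    port = d->pirq_to_port[pirq];
    if (port == 0)
        return false;           /* nobody bound this pirq */
    d->pending[port] = 1;
    return !d->masked[port];
}
```

Note that a masked port still records the pending bit; only the notification is suppressed, which matches the "unless it is masked" wording above.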

Interrupt handling in Xen

The initialization function init_IRQ is in xen/arch/x86/i8259.c.

When an interrupt occurs control passes to the Xen common_interrupt routine (see the BUILD_COMMON_IRQ macro in asm/asm_defns.h), which calls the Xen do_IRQ function (in xen/arch/x86/irq.c).

do_IRQ:

Checks who has the responsibility to handle the interrupt:

The VMM: the interrupt is handled internally by the VMM

One or more guest OSes: it calls the __do_IRQ_guest function

__do_IRQ_guest:

For each domain that has a binding to the IRQ, sets the pending flag of the event channel to 1 via send_guest_pirq

Xen itself only needs to handle two physical interrupts, the serial port interrupt (ns16550) and the timer interrupt; see ns16550_init_postirq and early_time_init respectively.

Guest interrupt handling

In Xen interrupts to be notified to the Linux guest OS are handled through the event channels notification mechanism.

During startup the guest OS installs two handlers (event and failsafe) via the HYPERVISOR_set_callbacks hypercall:

The event callback is the handler to be called to notify an event to the guest OS

The failsafe callback is used when a fault occurs when using the event callback

linux-2.6-xen-sparse/arch/i386/mach-xen/setup.c contains the following code:

void __init machine_specific_arch_setup(void)
{
    int ret;
    static struct callback_register __initdata event = {
        .type = CALLBACKTYPE_event,
        .address = { __KERNEL_CS, (unsigned long)hypervisor_callback },
    };
    static struct callback_register __initdata failsafe = {
        .type = CALLBACKTYPE_failsafe,
        .address = { __KERNEL_CS, (unsigned long)failsafe_callback },
    };

    /* Register the event callback first, then the failsafe callback. */
    ret = HYPERVISOR_callback_op(CALLBACKOP_register, &event);
    if (ret == 0)
        ret = HYPERVISOR_callback_op(CALLBACKOP_register, &failsafe);
    /* ... remainder of the function omitted in this excerpt ... */
}

hypervisor_callback is in linux-2.6-xen-sparse/arch/i386/kernel/entry-xen.S; its implementation and role are analyzed in the section on Xen's ret_from_intr below.

As can be seen, the event callback handler is the hypervisor_callback function (installed at startup), which calls evtchn_do_upcall. See the evtchn article for a detailed analysis.

evtchn_do_upcall:

1 Checks for pending events

2 Resets to zero the pending flag

3 Uses the evtchn_to_irq array to identify the IRQ binding for the event channel

4 Calls Linux do_IRQ interrupt handler function
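The four steps above can be sketched as a loop over a pending bitmap. This is a deliberately reduced model: it uses a single 32-bit word where the real code works with a two-level bitmap, and every identifier ending in _sketch is invented for the example rather than taken from the Linux or Xen sources.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PORTS_SKETCH 32
#define NR_IRQS_SKETCH  16

/* Port-to-IRQ binding table; -1 means "no IRQ bound to this port". */
static int      evtchn_to_irq_sketch[NR_PORTS_SKETCH];
/* Counts how often the stand-in IRQ handler ran for each IRQ. */
static unsigned irq_hits_sketch[NR_IRQS_SKETCH];

static void upcall_sketch_init(void)
{
    unsigned i;
    for (i = 0; i < NR_PORTS_SKETCH; i++)
        evtchn_to_irq_sketch[i] = -1;
    for (i = 0; i < NR_IRQS_SKETCH; i++)
        irq_hits_sketch[i] = 0;
}

/* Stand-in for the Linux do_IRQ handler. */
static void do_irq_sketch(int irq)
{
    assert(irq >= 0 && irq < NR_IRQS_SKETCH);
    irq_hits_sketch[irq]++;
}

/* Scan pending, unmasked ports; clear each one and dispatch its IRQ.
 * Returns the number of IRQs dispatched. */
static unsigned do_upcall_sketch(uint32_t *pending, uint32_t mask)
{
    unsigned handled = 0;
    unsigned port;

    for (port = 0; port < NR_PORTS_SKETCH; port++) {
        uint32_t bit = (uint32_t)1 << port;
        if (!(*pending & bit) || (mask & bit))
            continue;                        /* step 1: nothing to deliver */
        *pending &= ~bit;                    /* step 2: reset pending flag */
        if (evtchn_to_irq_sketch[port] >= 0) {   /* step 3: find binding */
            do_irq_sketch(evtchn_to_irq_sketch[port]);   /* step 4 */
            handled++;
        }
    }
    return handled;
}
```

Masked ports keep their pending bit so that they are delivered once the guest unmasks them.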


Physical interrupts in Dom 0 or a driver domain

http://blog.csdn.net/snailhit/article/details/6413399

“A guest in Domain 0, or in a driver domain, will want to set up physical IRQ to event channel mappings for the various devices under its control. Before doing this, of course, it will want to discover which devices are already bound to which IRQs. Typically, this is done via BIOS or APIC calls. This is not permitted in Xen, however, so they are forced to use the HYPERVISOR_physdev_op hypercall.”

Operations such as startup_pirq and enable_pirq all invoke the HYPERVISOR_physdev_op hypercall.

construct_dom0 contains the following code:

    /* DOM0 is permitted full I/O capabilities. */
    rc |= irqs_permit_access(dom0, 0, NR_IRQS-1);

Open question:

Does a driver domain enable its interrupts through XEN_DOMCTL_irq_permission?

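ret_from_intr in Xen

The code is in xen/arch/x86/x86_32/entry.S.

The saved CS is used to determine whether the interrupt occurred in ring 0. If it did, control jumps to restore_all_xen and returns; otherwise it jumps to test_all_events, where pending guest events are checked and handled.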

ENTRY(ret_from_intr)
        GET_CURRENT(%ebx)
        movl  UREGS_eflags(%esp),%eax
        movb  UREGS_cs(%esp),%al
        testl $(3|X86_EFLAGS_VM),%eax
        jnz   test_all_events
        jmp   restore_all_xen
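The assembly uses a compact trick: it loads the saved EFLAGS into %eax, overwrites the low byte with the low byte of the saved CS, and then tests the CS privilege bits and the EFLAGS.VM bit with a single instruction. A C rendering of the same test might look like the sketch below; the function name and the constant name are made up, while the VM flag value (bit 17) is the architectural one.

```c
#include <assert.h>
#include <stdint.h>

#define EFLAGS_VM_SKETCH 0x00020000u   /* virtual-8086 mode flag, bit 17 */

/* Returns non-zero if the interrupted context was NOT Xen itself, i.e.
 * the saved CS has a non-zero privilege level or the CPU was in vm86 mode. */
static int interrupted_guest(uint32_t saved_eflags, uint16_t saved_cs)
{
    uint32_t eax = saved_eflags;               /* movl UREGS_eflags, %eax */
    eax = (eax & ~0xFFu) | (saved_cs & 0xFFu); /* movb UREGS_cs, %al      */
    return (eax & (3u | EFLAGS_VM_SKETCH)) != 0;   /* testl; jnz          */
}
```

Overwriting the low byte is safe because the only EFLAGS bit being tested, VM, lives in the upper half of the register.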

test_guest_events first checks upcall_mask; only if it is clear does it go on to check upcall_pending.

test_all_events:
        …..
test_guest_events:
        movl  VCPU_vcpu_info(%ebx),%eax
        testb $0xFF,VCPUINFO_upcall_mask(%eax)
        jnz   restore_all_guest
        testb $0xFF,VCPUINFO_upcall_pending(%eax)
        jz    restore_all_guest
/*process_guest_events:*/
        sti
        leal  VCPU_trap_bounce(%ebx),%edx
        movl  VCPU_event_addr(%ebx),%eax
        movl  %eax,TRAPBOUNCE_eip(%edx)
        movl  VCPU_event_sel(%ebx),%eax
        movw  %ax,TRAPBOUNCE_cs(%edx)
        movb  $TBF_INTERRUPT,TRAPBOUNCE_flags(%edx)
        call  create_bounce_frame
        jmp   test_all_events

create_bounce_frame:
        testl $~3,%eax
        jz    domain_crash_synchronous
        movl  %eax,UREGS_cs+4(%esp)
        movl  TRAPBOUNCE_eip(%edx),%eax
        movl  %eax,UREGS_eip+4(%esp)
        ret
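The decision made at test_guest_events reduces to two byte tests. The following sketch restates it in C, assuming a per-VCPU info structure with one "pending" byte and one "mask" byte; the struct, enum and function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for the shared per-VCPU info the guest and Xen both see. */
struct vcpu_info_sketch {
    uint8_t evtchn_upcall_pending;  /* set by Xen when an event is waiting */
    uint8_t evtchn_upcall_mask;     /* set by the guest to block upcalls   */
};

enum exit_path_sketch {
    RESTORE_ALL_GUEST,      /* return to the guest where it left off   */
    PROCESS_GUEST_EVENTS    /* build a bounce frame for the callback   */
};

/* Mirror of the two testb/jump pairs in test_guest_events. */
static enum exit_path_sketch test_guest_events_sketch(
    const struct vcpu_info_sketch *vi)
{
    if (vi->evtchn_upcall_mask)         /* jnz restore_all_guest */
        return RESTORE_ALL_GUEST;
    if (!vi->evtchn_upcall_pending)     /* jz restore_all_guest  */
        return RESTORE_ALL_GUEST;
    return PROCESS_GUEST_EVENTS;
}
```

The mask is checked first, so a guest that has masked upcalls is resumed untouched even when events are pending; the mask plays the role that the interrupt flag plays on native hardware.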

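If there is an event, a frame is first built by create_bounce_frame. Where do the parameters of create_bounce_frame come from? That takes us back to HYPERVISOR_set_callbacks, mentioned earlier.

In Xen, HYPERVISOR_set_callbacks is implemented as

do_set_callbacks => register_guest_callback, which records the callback information passed in by the guest.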

static long register_guest_callback(struct callback_register *reg)
{
    long ret = 0;
    struct vcpu *v = current;

    switch ( reg->type )
    {
    case CALLBACKTYPE_event:
        /* Remember where the guest wants event upcalls delivered. */
        v->arch.guest_context.event_callback_cs  = reg->address.cs;
        v->arch.guest_context.event_callback_eip = reg->address.eip;
        break;
    /* ... other callback types omitted in this excerpt ... */
    }

    return ret;
}

xen/arch/x86/x86_32/asm-offsets.c contains the following code:

    OFFSET(VCPU_event_addr, struct vcpu,
           arch.guest_context.event_callback_eip);

This shows that hypervisor_callback is what gets set up as the target for create_bounce_frame. So when control returns to the guest through restore_all_guest, hypervisor_callback is invoked.

http://166.111.68.94/moin/projects/rtarmor/xen_related/xen_linux_interrupt

0 0