CPU Rings, Privilege, and Protection

Source: Internet. Editor: 程序博客网. Time: 2024/06/07 01:20

You probably know intuitively that applications have limited powers in Intel x86 computers and that only operating system code can perform certain tasks, but do you know how this really works? This post takes a look at x86 privilege levels, the mechanism whereby the OS and CPU conspire to restrict what user-mode programs can do. There are four privilege levels, numbered 0 (most privileged) to 3 (least privileged), and three main resources being protected: memory, I/O ports, and the ability to execute certain machine instructions. At any given time, an x86 CPU is running in a specific privilege level, which determines what code can and cannot do. These privilege levels are often described as protection rings, with the innermost ring corresponding to highest privilege. Most modern x86 kernels use only two privilege levels, 0 and 3:

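You probably sense intuitively that on x86 only the operating system can perform certain special tasks, but do you know why? This post explains x86 privilege levels, the mechanism through which the OS and CPU cooperate to limit what user-mode programs can do. There are four privilege levels, from 0 (highest) to 3 (lowest), and three kinds of protected resources: memory, I/O ports, and certain special machine instructions. At any given moment an x86 CPU is running at one specific privilege level, which determines what code may and may not do. The levels are usually described as protection rings, with privilege decreasing from the innermost ring to the outermost. Most x86 kernels use only two levels, 0 and 3 (the Xen virtualization implementation also uses level 1).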

[Figure: x86 Protection Rings]

About 15 machine instructions, out of dozens, are restricted by the CPU to ring zero. Many others have limitations on their operands. These instructions can subvert the protection mechanism or otherwise foment chaos if allowed in user mode, so they are reserved to the kernel. An attempt to run them outside of ring zero causes a general-protection exception, like when a program uses invalid memory addresses. Likewise, access to memory and I/O ports is restricted based on privilege level. But before we look at protection mechanisms, let’s see exactly how the CPU keeps track of the current privilege level, which involves the segment selectors from the previous post. Here they are:

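About 15 machine instructions can be executed only when the CPU is running in ring 0. Other instructions have restrictions on their operands. If user-level code could use these instructions it could break the whole protection mechanism, so they are reserved for the kernel. Any attempt to use them outside ring 0 raises a general-protection exception, just as when a program uses an illegal memory address. Likewise, access to memory and I/O ports is governed by privilege level. Before going deeper into the protection mechanisms, let's see how the CPU keeps track of the current privilege level, which involves the segment selectors covered in the previous post. They are shown below: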

[Figure: x86 Segment Selectors: Data and Code]

The full contents of data segment selectors are loaded directly by code into various segment registers such as ss (stack segment register) and ds (data segment register). This includes the contents of the Requested Privilege Level (RPL) field, whose meaning we tackle in a bit. The code segment register (cs) is, however, magical. First, its contents cannot be set directly by load instructions such as mov, but rather only by instructions that alter the flow of program execution, like call. Second, and importantly for us, instead of an RPL field that can be set by code, cs has a Current Privilege Level (CPL) field maintained by the CPU itself. This 2-bit CPL field in the code segment register is always equal to the CPU’s current privilege level. The Intel docs wobble a little on this fact, and sometimes online documents confuse the issue, but that’s the hard and fast rule. At any time, no matter what’s going on in the CPU, a look at the CPL in cs will tell you the privilege level code is running with.

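The contents of a data segment selector are loaded directly by code into registers such as ss and ds. Those contents include the Requested Privilege Level (RPL) field, explained shortly. Compared with ds, cs is special. First, it cannot be loaded directly by instructions such as mov, only by instructions that change the flow of execution, such as call. Second, and more importantly, whereas RPL can be set by code, cs holds a Current Privilege Level (CPL) field that the CPU itself maintains. This 2-bit CPL in the code segment register always equals the CPU's current privilege level. Some of Intel's official documentation describes this inconsistently, but the rule is fixed. At any time, whatever the CPU is doing, the CPL in cs tells you the CPU's current privilege level.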
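To make the selector layout concrete, here is a minimal C sketch that splits a 16-bit selector into its fields. The struct and function names are made up for illustration, and the sample value 0x23 used in the note below is just an arbitrary selector, not one taken from any particular kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 16-bit x86 segment selector (names are illustrative). */
struct selector_fields {
    unsigned index; /* bits 15..3: entry number in the GDT or LDT */
    unsigned ti;    /* bit 2: table indicator, 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 1..0: RPL, or the CPL when the value comes from cs */
};

/* Split a raw selector value into its three fields. */
struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = (unsigned)sel >> 3;
    f.ti    = ((unsigned)sel >> 2) & 1u;
    f.rpl   = (unsigned)sel & 3u;
    assert(f.rpl <= 3); /* a 2-bit field can only hold 0..3 */
    return f;
}
```

Decoding 0x23 this way gives index 4, table indicator 0 (GDT), and a privilege field of 3. If that value had been read from cs, the low two bits would be the CPL, so the code would be running in ring 3.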

Keep in mind that the CPU privilege level has nothing to do with operating system users. Whether you’re root, Administrator, guest, or a regular user, it does not matter. All user code runs in ring 3 and all kernel code runs in ring 0, regardless of the OS user on whose behalf the code operates. Sometimes certain kernel tasks can be pushed to user mode, for example user-mode device drivers in Windows Vista, but these are just special processes doing a job for the kernel and can usually be killed without major consequences.

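Keep in mind that CPU privilege levels have nothing to do with operating system users. Whether you are root, an administrator, or an ordinary user makes no difference. All user code runs in ring 3 and all kernel code runs in ring 0, regardless of which OS user the code is working for. Sometimes kernel tasks are carried out in user mode, for example user-mode device drivers in Vista, but these are just special processes serving the kernel, and they can be killed without affecting core operation.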

Due to restricted access to memory and I/O ports, user mode can do almost nothing to the outside world without calling on the kernel. It can’t open files, send network packets, print to the screen, or allocate memory. User processes run in a severely limited sandbox set up by the gods of ring zero. That’s why it’s impossible, by design, for a process to leak memory beyond its existence or leave open files after it exits. All of the data structures that control such things – memory, open files, etc – cannot be touched directly by user code; once a process finishes, the sandbox is torn down by the kernel. That’s why our servers can have 600 days of uptime – as long as the hardware and the kernel don’t crap out, stuff can run for ever. This is also why Windows 95 / 98 crashed so much: it’s not because “M$ sucks” but because important data structures were left accessible to user mode for compatibility reasons. It was probably a good trade-off at the time, albeit at high cost.

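Because user-mode access to memory and I/O ports is restricted, user mode can do almost nothing without the interfaces the kernel provides. It cannot open files, send network packets, draw to the screen, or allocate memory. User processes run in a restricted sandbox provided by ring 0. That is why, by design, a program cannot keep leaking memory or hold a file handle open after it has been destroyed. User-level code cannot manipulate any of these resources directly; once a process ends, the kernel reclaims its sandbox (unless the kernel has a bug :)). This is why our servers can run for 600 days without going down, barring hardware failure. It is also why Windows 95/98 crashed so often: not because Microsoft is awful, but because important data structures were not well protected by the sandbox, so user-mode programs could directly and illegally modify critical resources.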

The CPU protects memory at two crucial points: when a segment selector is loaded and when a page of memory is accessed with a linear address. Protection thus mirrors memory address translation where both segmentation and paging are involved. When a data segment selector is being loaded, the check below takes place:

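The CPU protects memory at two key points: when a segment selector is loaded, and when a memory page is accessed through a linear address. Protection thus mirrors memory address translation, which involves both segmentation and paging. When a data segment selector is loaded, the following check takes place (note: it passes as long as DPL is numerically greater than or equal to both CPL and RPL):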

[Figure: x86 Segment Protection]

Since a higher number means less privilege, MAX() above picks the least privileged of CPL and RPL, and compares it to the descriptor privilege level (DPL). If the DPL is higher or equal, then access is allowed. The idea behind RPL is to allow kernel code to load a segment using lowered privilege. For example, you could use an RPL of 3 to ensure that a given operation uses segments accessible to user-mode. The exception is for the stack segment register ss, for which the three of CPL, RPL, and DPL must match exactly.

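Because a larger number means less privilege, MAX in the figure above returns the less privileged of CPL and RPL, which is then compared with DPL. If DPL is numerically greater than or equal to it (that is, the segment demands no more privilege than the requester has), access is allowed. The point of RPL is to let the kernel load a segment with reduced privilege. For example, you can use an RPL of 3 to make sure a given operation only uses segments accessible to user mode. There is one exception: for ss, the CPL, RPL, and DPL must all match exactly.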
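The check described here is small enough to write out. The following C sketch models it for data segments and for ss; the function names are hypothetical and the model ignores everything else the CPU verifies when loading a selector (descriptor type, presence, limits).

```c
#include <assert.h>
#include <stdbool.h>

/* Higher number = less privilege, so the numeric maximum is the
 * least privileged of the two levels. */
static unsigned least_privileged(unsigned a, unsigned b)
{
    return a > b ? a : b;
}

/* Data segment load: allowed when MAX(CPL, RPL) <= DPL. */
bool data_segment_load_allowed(unsigned cpl, unsigned rpl, unsigned dpl)
{
    assert(cpl <= 3 && rpl <= 3 && dpl <= 3);
    return least_privileged(cpl, rpl) <= dpl;
}

/* Stack segment load: CPL, RPL and DPL must all be equal. */
bool stack_segment_load_allowed(unsigned cpl, unsigned rpl, unsigned dpl)
{
    assert(cpl <= 3 && rpl <= 3 && dpl <= 3);
    return cpl == rpl && rpl == dpl;
}
```

For example, kernel code (CPL 0) that deliberately uses an RPL of 3 is refused a DPL 0 segment but can still load a DPL 3 segment, which is exactly the lowered-privilege behaviour RPL exists to provide.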

In truth, segment protection scarcely matters because modern kernels use a flat address space where the user-mode segments can reach the entire linear address space. Useful memory protection is done in the paging unit when a linear address is converted into a physical address. Each memory page is a block of bytes described by a page table entry containing two fields related to protection: a supervisor flag and a read/write flag. The supervisor flag is the primary x86 memory protection mechanism used by kernels. When it is on, the page cannot be accessed from ring 3. While the read/write flag isn’t as important for enforcing privilege, it’s still useful. When a process is loaded, pages storing binary images (code) are marked as read only, thereby catching some pointer errors if a program attempts to write to these pages. This flag is also used to implement copy on write when a process is forked in Unix. Upon forking, the parent’s pages are marked read only and shared with the forked child. If either process attempts to write to the page, the processor triggers a fault and the kernel knows to duplicate the page and mark it read/write for the writing process.

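In fact, segment protection gets little attention now, because modern kernels use a flat memory layout in which user-mode segments can reach the entire linear address space. Useful memory protection therefore lives in the paging unit, where linear addresses are translated into physical addresses. Each memory page is a block of data described by a page table entry, which contains two protection-related fields: a supervisor flag and a read/write flag. The supervisor flag is the main x86 memory protection mechanism used by kernels. When it is set, the page cannot be accessed by ring 3 code. The read/write flag matters less for privilege but is still useful. When a process is loaded, the pages holding its binary image are marked read-only, so an attempt to write to them raises an exception. The flag is also used to implement copy on write. On fork, the parent's pages are marked read-only and shared with the child. If either process writes to such a page, the processor raises a fault, and the kernel knows to copy the page and give the writing process a read/write copy.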
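A simplified C model of these two page-level flags follows. It is a sketch under several assumptions: only ring 3 accesses are modelled, only a single page table entry is consulted (real hardware also combines the page directory entry), and the helper names are invented. Note that in the hardware encoding the bit is a user/supervisor bit where 0 means supervisor-only, so "supervisor flag on" in the text corresponds to PTE_USER being clear here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT  0x1u /* bit 0: page is mapped */
#define PTE_WRITABLE 0x2u /* bit 1: read/write flag, 0 = read-only */
#define PTE_USER     0x4u /* bit 2: user/supervisor, 0 = supervisor-only */

/* Would a ring 3 access to this page succeed, or page-fault? */
bool user_access_allowed(uint32_t pte, bool is_write)
{
    if (!(pte & PTE_PRESENT))
        return false;               /* not mapped at all */
    if (!(pte & PTE_USER))
        return false;               /* supervisor page: ring 3 is locked out */
    if (is_write && !(pte & PTE_WRITABLE))
        return false;               /* read-only page: writes fault */
    return true;
}

/* Copy-on-write setup as described for fork: keep the mapping,
 * clear the writable bit so the first write faults into the kernel. */
uint32_t mark_copy_on_write(uint32_t pte)
{
    assert(pte & PTE_PRESENT);
    return pte & ~PTE_WRITABLE;
}
```

After mark_copy_on_write() a user page can still be read by both parent and child, while any write triggers the fault that lets the kernel duplicate the page.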

Finally, we need a way for the CPU to switch between privilege levels. If ring 3 code could transfer control to arbitrary spots in the kernel, it would be easy to subvert the operating system by jumping into the wrong (right?) places. A controlled transfer is necessary. This is accomplished via gate descriptors and via the sysenter instruction. A gate descriptor is a segment descriptor of type system, and comes in four sub-types: call-gate descriptor, interrupt-gate descriptor, trap-gate descriptor, and task-gate descriptor. Call gates provide a kernel entry point that can be used with ordinary call and jmp instructions, but they aren’t used much so I’ll ignore them. Task gates aren’t so hot either (in Linux, they are only used in double faults, which are caused by either kernel or hardware problems).

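Finally, the CPU needs a mechanism for switching between privilege levels. If ring 3 code could transfer control to arbitrary places in the kernel, it would be easy to put the system in danger. A controlled transfer is required. This is done through gate descriptors and the sysenter instruction. A gate descriptor is a kind of system segment descriptor and comes in four forms: call-gate descriptor, interrupt-gate descriptor, trap-gate descriptor, and task-gate descriptor. Call gates provide a kernel entry point usable with call and jmp, but they are rarely used, so they are ignored here. Task gates are not used much either (Linux uses them only for double faults, which are caused by kernel or hardware problems).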

That leaves two juicier ones: interrupt and trap gates, which are used to handle hardware interrupts (e.g., keyboard, timer, disks) and exceptions (e.g., page faults, divide by zero). I’ll refer to both as an “interrupt”. These gate descriptors are stored in the Interrupt Descriptor Table (IDT). Each interrupt is assigned a number between 0 and 255 called a vector, which the processor uses as an index into the IDT when figuring out which gate descriptor to use when handling the interrupt. Interrupt and trap gates are nearly identical. Their format is shown below along with the privilege checks enforced when an interrupt happens. I filled in some values for the Linux kernel to make things concrete.

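That leaves interrupt gates and trap gates, which handle hardware interrupts (keyboard, timer, and so on) and exceptions (page faults, divide by zero, and so on). I refer to both as interrupts. These descriptors are stored in the Interrupt Descriptor Table (IDT). Each interrupt has a vector number from 0 to 255, which the processor uses as an index into the IDT. The two kinds of gate are very similar. Their format is shown below: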

[Figure: Interrupt Descriptor with Privilege Check]
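For readers without the figure, this C sketch packs the fields of a 32-bit interrupt or trap gate into its 8-byte form: offset bits 15..0, the segment selector, a reserved byte, an attribute byte (present bit, DPL, gate type), and offset bits 31..16. The function names are hypothetical and the values in the note below are arbitrary illustrations, not real Linux addresses or selectors.

```c
#include <assert.h>
#include <stdint.h>

#define GATE_INTERRUPT_32 0xEu /* 32-bit interrupt gate type */
#define GATE_TRAP_32      0xFu /* 32-bit trap gate type */

/* Build the 64-bit descriptor for a present interrupt or trap gate. */
uint64_t make_gate(uint32_t offset, uint16_t selector, unsigned type, unsigned dpl)
{
    assert(dpl <= 3);
    assert(type == GATE_INTERRUPT_32 || type == GATE_TRAP_32);

    uint32_t attr = 0x80u | ((uint32_t)dpl << 5) | type;  /* P=1, DPL, type */
    uint32_t low  = ((uint32_t)selector << 16) | (offset & 0xFFFFu);
    uint32_t high = (offset & 0xFFFF0000u) | (attr << 8);
    return ((uint64_t)high << 32) | low;
}

/* Extract the gate DPL (bits 46..45 of the descriptor). */
unsigned gate_dpl(uint64_t gate)
{
    return (unsigned)((gate >> 45) & 3u);
}

/* Reassemble the 32-bit handler offset from its two halves. */
uint32_t gate_offset(uint64_t gate)
{
    return (uint32_t)(((gate >> 32) & 0xFFFF0000u) | (gate & 0xFFFFu));
}
```

As an illustration, make_gate(0x12345678, 0x10, GATE_TRAP_32, 3) yields 0x1234EF0000105678: the attribute byte 0xEF encodes present, DPL 3, trap gate, which is the shape of a gate that user mode is allowed to invoke.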

Both the DPL and the segment selector in the gate regulate access, while segment selector plus offset together nail down an entry point for the interrupt handler code. Kernels normally use the segment selector for the kernel code segment in these gate descriptors. An interrupt can never transfer control from a more-privileged to a less-privileged ring. Privilege must either stay the same (when the kernel itself is interrupted) or be elevated (when user-mode code is interrupted). In either case, the resulting CPL will be equal to the DPL of the destination code segment; if the CPL changes, a stack switch also occurs. If an interrupt is triggered by code via an instruction like int n, one more check takes place: the gate DPL must be at the same or lower privilege as the CPL. This prevents user code from triggering random interrupts. If these checks fail, you guessed it, a general-protection exception happens. All Linux interrupt handlers end up running in ring zero.

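(Translator: I am not sure how to render the first sentence of this paragraph.) Kernels normally put the segment selector of the kernel code segment in these gate descriptors. An interrupt can never transfer control from a higher privilege level to a lower one. Privilege must either stay the same (when the kernel itself is interrupted, for example by a page fault taken in kernel code) or be raised (when user-mode code is interrupted). In either case the resulting CPL equals the DPL of the destination code segment; if the CPL changes, the stack is switched as well. If the interrupt is triggered by an instruction such as int n, one more check applies: the gate's DPL must be numerically greater than or equal to the CPL. This stops user code from triggering arbitrary interrupts. If these checks fail, a general-protection exception occurs. All Linux interrupt handlers end up running in ring 0.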

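Translator's note: for this passage I recommend the descriptions in ULK or in the Chinese book 《自己动手写操作系统》. I will also summarize later how these mechanisms look on the ARM architecture :) Corrections are welcome.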
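The privilege rules for interrupt delivery can be summarized in a few lines of C. This is a rough model, assuming the handler lives in an ordinary (nonconforming) code segment; the type and function names are invented for the sketch, and a failed check stands in for the general-protection exception.

```c
#include <assert.h>
#include <stdbool.h>

enum int_source {
    INT_HARDWARE_OR_EXCEPTION, /* raised by a device or by the CPU itself */
    INT_SOFTWARE               /* raised by code, e.g. an int n instruction */
};

struct int_result {
    bool ok;            /* false models a general-protection exception */
    unsigned new_cpl;   /* CPL while the handler runs */
    bool stack_switch;  /* true when the CPL changed */
};

struct int_result deliver_interrupt(unsigned cpl, unsigned gate_dpl,
                                    unsigned target_code_dpl,
                                    enum int_source src)
{
    struct int_result r = { false, cpl, false };
    assert(cpl <= 3 && gate_dpl <= 3 && target_code_dpl <= 3);

    /* Software interrupts only: the gate must be at the same or lower
     * privilege than the caller, i.e. gate DPL >= CPL numerically. */
    if (src == INT_SOFTWARE && gate_dpl < cpl)
        return r;

    /* Control may never move to a less privileged ring. */
    if (target_code_dpl > cpl)
        return r;

    r.ok = true;
    r.new_cpl = target_code_dpl;
    r.stack_switch = (target_code_dpl != cpl);
    return r;
}
```

With these rules a ring 3 program can invoke a DPL 3 gate and land in ring 0 with a stack switch, but the same int n against a DPL 0 gate fails, while a hardware interrupt through that DPL 0 gate is still delivered.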

During initialization, the Linux kernel first sets up an IDT in setup_idt() that ignores all interrupts. It then uses functions in include/asm-x86/desc.h to flesh out common IDT entries in arch/x86/kernel/traps_32.c. In Linux, a gate descriptor with “system” in its name is accessible from user mode and its set function uses a DPL of 3. A “system gate” is an Intel trap gate accessible to user mode. Otherwise, the terminology matches up. Hardware interrupt gates are not set here however, but instead in the appropriate drivers.

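During initialization, the Linux kernel first sets up the IDT in setup_idt() so that it ignores all interrupts. It then uses functions from include/asm-x86/desc.h to fill in the common IDT entries in arch/x86/kernel/traps_32.c. In Linux, a gate descriptor with "system" in its name is accessible from user mode, and its set function uses a DPL of 3. A "system gate" is an Intel trap gate made accessible to user mode. Hardware interrupt gates are not set up here but in the relevant drivers.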

Three gates are accessible to user mode: vectors 3 and 4 are used for debugging and checking for numeric overflows, respectively. Then a system gate is set up for the SYSCALL_VECTOR, which is 0x80 for the x86 architecture. This was the mechanism for a process to transfer control to the kernel, to make a system call, and back in the day I applied for an “int 0x80” vanity license plate :) . Starting with the Pentium Pro, the sysenter instruction was introduced as a faster way to make system calls. It relies on special-purpose CPU registers that store the code segment, entry point, and other tidbits for the kernel system call handler. When sysenter is executed the CPU does no privilege checking, going immediately into CPL 0 and loading new values into the registers for code and stack (cs, eip, ss, and esp). Only ring zero can load the sysenter setup registers, which is done in enable_sep_cpu().

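Three gates are accessible from user mode: IDT vectors 3 and 4 are used for debugging and for checking numeric overflow, respectively. A system gate is also set up for SYSCALL_VECTOR, which is 0x80 on x86; int 0x80 is the resulting system call entry (on ARM, system calls are entered through swi). Starting with the Pentium Pro, a faster instruction for entering the kernel was introduced: sysenter. It relies on special-purpose CPU registers that store the code segment, entry point, and other details for the kernel's system call handler. When sysenter executes, the CPU performs no privilege check; it goes straight to CPL 0 and loads new values into the code and stack registers (cs, eip, ss, and esp). Only ring 0 can load the sysenter setup registers, which is done in enable_sep_cpu().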
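The register loading done by sysenter can be modelled in plain C. This sketch assumes 32-bit protected mode and follows the documented behaviour that cs comes from the IA32_SYSENTER_CS register with its low two bits cleared, ss is that selector plus 8, and esp and eip come from IA32_SYSENTER_ESP and IA32_SYSENTER_EIP. The struct and function names are invented, and the values mentioned in the note below are placeholders rather than real kernel settings.

```c
#include <assert.h>
#include <stdint.h>

/* The three setup registers that only ring 0 can write. */
struct sysenter_msrs {
    uint32_t cs;  /* IA32_SYSENTER_CS  */
    uint32_t esp; /* IA32_SYSENTER_ESP */
    uint32_t eip; /* IA32_SYSENTER_EIP */
};

/* The part of the CPU state that sysenter replaces. */
struct cpu_regs {
    uint16_t cs, ss;
    uint32_t eip, esp;
    unsigned cpl;
};

/* Model of sysenter: no privilege check, straight to CPL 0. */
struct cpu_regs do_sysenter(const struct sysenter_msrs *m)
{
    struct cpu_regs r;
    assert(m != 0);
    /* A null selector in the setup register makes the real
     * instruction fault instead of entering the kernel. */
    assert((m->cs & 0xFFFCu) != 0);

    r.cs  = (uint16_t)(m->cs & 0xFFFCu); /* privilege bits forced to 0 */
    r.ss  = (uint16_t)(r.cs + 8);        /* stack segment follows code segment */
    r.eip = m->eip;
    r.esp = m->esp;
    r.cpl = 0;
    return r;
}
```

Given setup values of 0x60 for the code segment, 0xC0001000 for the stack pointer, and 0xC0100000 for the entry point, the model ends with cs 0x60, ss 0x68, and CPL 0, regardless of what ring the caller was in.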

Finally, when it’s time to return to ring 3, the kernel issues an iret or sysexit instruction to return from interrupts and system calls, respectively, thus leaving ring 0 and resuming execution of user code with a CPL of 3. Vim tells me I’m approaching 1,900 words, so I/O port protection is for another day. This concludes our tour of x86 rings and protection. Thanks for reading!

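Finally, when it is time to return to ring 3, the kernel issues an iret or sysexit instruction to return from interrupts and system calls respectively, leaving ring 0 and resuming user code with a CPL of 3.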
