处理器简介 - III

来源：互联网发布：淘宝搜索引擎编辑：程序博客网时间：2024/05/19 15:21

最近正在看《Virtual Machines: Versatile Platforms for Systems and Processes》，发现在虚拟机层面涉及了很多热点技术，当然，处理器、OS、Compiler等等，都是重点，这里要继续处理器的学习，很多精彩技术，我就再接再厉，及时看看有没有可以相互借鉴的地方。

概念7：Predication (论断)

Branch Instructions really hurt the performance. – The processor pipeline is flushed, and the instruction cache is potentially flushed. 为什么分支结构指令会导致性能受损？因为CPU的执行过程是从Memory里面取数据，然后即将执行的数据放在Cache里面，但是Branch Instruction的存在有可能会引导CPU去执行一个非常“突兀”的指令，这个指令要么不在Cache里面，要么在Cache里面很远的地方，都会使Cache里面的数据处于基本无效的状态，需要重新执行前面的行动，很多操作都相当于“无用功”。为什么要引入if-then结构的Instruction branch呢？——这是带入Standard Branch的好处，至少可以知道下一步会去哪里，而不是突然冒出一个指令，引入“突兀”操作。什么叫做Flush呢？为什么要Flush？——Flush就是放弃现在手上正在执行的一堆指令，重新Load新的指令，然后重新执行。而且Flush是无法避免的，我们所能够做的就是尽量限制不得不Flush的那堆数据/指令的大小，通过利用Standard branch，利用Predication技术，可以找出即将执行的指令和潜在执行的指令，然后这些就是经过限制后的数量最少/带来影响最少的那堆数据/指令，然后我们就把Flush的损失控制在了最小。这里再给出一个更加专业的术语，那就是Cache Flush和Pipeline Flush，前者是没有采用Predication技术的时候，CPU利用率不够的情况下，出现的症状，后者是采用了Predication技术，提高了CPU利用率的时候，仍然需要付出的代价。

Predicate logic is where you execute both branches of the code stream before you have figured out which of the branches you are going to take. You can execute instructions from both potential branch targets, while you are executing the instruction (often a load followed by a compare) to figure out which branch you are going to take. Then you simply throw away all of the results from the branch you *didn’t* take, and continue on with the results of the branch you did take. This is often a huge win, because it optimizes the use of the cache and it doesn’t force you to stall the entire job and all of its threads waiting for a memory fetch. ——后面仍有不足，但已经是经过优化的结果了。We still have are branches and the costly penalties associated with making the wrong choice – pipeline flushes。

Every if-then statement is potentially a branch. – But only potentially, because we might not take the branch. 前面已经讲过了采用“标准分支”的好处，那就是可以将Cache Flush控制在PipeLine Flush，同时，还有另外一项技术“推理技术 – Speculation”或者data Speculation技术，With the Speculation architecture we have support for setting predicates that will tell us what branch path is valid in our code. We simply execute the entire branch then proceed with only those instructions with the ‘branch True’ flag and discard the rest. Since we prepared to execute either path we have no stall.

最后，上面说的这两项技术 – Predication and Speculation，需要占用额外的Silicon空间（from the PHY perspective），那么我们的做法是什么呢？做在Compiler里面，而不是集成在CPU上面，就是专门开辟出PHY上的全新的空间，Compiler，然后提升CPU的利用率。看文章评价：Having the compiler make this choice (like Itanium with bundling) allows you to increase the performance of your application simply by re-compiling and then running it on the same processor.

附录：来自网络的关于Predication和Speculation的释义。

预测技术（Predication）

　　CPU使指令并行性达到最大的一个关键因素是对如何处理判定点以作出正确的判定。传统的结构不能在指令流上超前很远，因此，必须多次推测判定点后的哪个指令流是正确的指令流。为了避免错误地预测执行哪个分支／指令流，目前的CPU可以设法智能猜测(“分支预测”)，运行不同流水线上的两个指令流(推测执行)，或当CPU利用不足时，执行一个或两个分支(无序执行)。这些策略多半时间不能充分利用CPU的并行性，Intel估计，这些结构的“被错误预测的分支”(执行错误的指令流)可损失高达40%的CPU性能。

　　在IA-64中，预测采取了一种新的处理条件分支的方法。在CPU接收代码之前，解决编译器中“超前”这个问题，即编译器同时调度一些等待执行的分支。预测可以把分支移到任何可能的地方。IA-64还把有关每个分支的详细情况告诉CPU，这样，CPU就可以更好地去做推测正确的指令流的工作。

推理技术（Speculation）

　　处理器有时因等待从相对低速的存储器完成装入而空闲着，推理技术就是针对减少处理器的空闲时间而引入的。在CPU准备接收指令前，推理技术从主存或缓存尽可能早地装入指令，因而当某些指令需要从存储器输入数据时，处理器就不会空闲着。实际上，推理是把指令和数据缓存在CPU本身中。IA-64还能把预测和推测结合起来，通过预测消除许多分支，通过推理加快推理执行或提高对剩余分支的分支预测。推理对频繁访问缓存的应用程序特别有效，因为这些应用程序需要迅速地把指令和数据送到CPU上。例如，大型Internet和数据库应用程序必须在大量终端用户之间快速切换。推测可将每个新终端用户指令和数据“预发送”到CPU，通过这一技术，编码指令与其它指令一同进行处理，总体的处理时间大大缩短。

概念8：Pipelining

对于pipelining，一个最简单的Model有5个Stage：

Instruction Fetch, Instruction Decode, Execution, Memory Access, Writeback

详细的介绍，请参见：http://www.cs.swan.ac.uk/~csandy/cs-323/notes/

Pipelining is a natural method of attempting to speed up program execution and the most common form of parallelism around. Its use in computers as a performance enhancing mechanism can be traced back to the late 50s/early 60s, which is remarkably close to the start of the computing era. Most current reasonably state-of-the-art machines are quite heavily pipelined. A common aim in processor design is to increase instruction level parallelism (ILP), though a more useful measure is Cycles per Instruction (CLI) – the lower the better.

The principle of a pipelined processor is that of the assembly line — many components are in different stages of construction at once. The stages of the assembly line are arranged so they all take the same amount of time, and once every time cycle, a new product drops off the end — which may have taken hours to progress from start to finish.

This is a critical point: the length of time an individual instruction takes to execute (latency) is irrelevant. What is important is the total throughput; the length of time a whole block of instructions takes to execute. In practice, an individual instruction is likely to take longer to execute, because of the overheads imposed by implementing pipelining.

现实中的处理器通常有更多的执行Stage，对于执行Stage的数目，这里有一个Trade-off between “执行效率”和”设计复杂度”。

Long pipelines increase Instruction Level Parallelism (ILP) ///// Short pipelines reduce the chance of dependency stalls, and reduce the number of times the pipeline is completely empty because of a branch ///// Each processor designer makes difference choices

概念9：Virtualization Topics

每个Processor拥有不同的优先级，different privilege levels；OS需要在较高Privilege Level上执行；——但是，在同一台Server上运行多个OS的时候，因为OS可以操作的Processor level高级，很可能会造成Security方面的冲突。这时候，利用VM软件，可以很好地控制每个“The VM’s “fix up” the guest operating systems, so they never run at the highest privilege level”，还有可以保证Security“AMD and Intel implement a higher privilege level (VMX Root), allowing solid security between the guest instances”，似乎是类似于Physical Isolation的效果。