[转]The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining

来源：互联网发布：apache记录post日志编辑：程序博客网时间：2024/06/02 00:14

转三篇ARM一个牛人写的怎样优化OpenGL ES应用程序的文章。该系列三篇文章，深入浅出介绍了OpenGL ES API的背后实现和

ARM GPU硬件架构。其中第一篇是所有图形API都通用的概念；后面介绍了Tile-based rendering和ARM自己Shader Core。

这是第一篇，有中英文对照的。

1. 英文原文

Optimization of graphics workloads is often essential to many modern mobile applications, as almost all rendering is now handled directly or indirectly by an OpenGL ES based rendering back-end. One of my colleagues, Michael McGeagh , recently posted a work guide [http://community.arm.com/docs/DOC-8055] on getting the ARM® DS-5™ Streamline™ profiling tools working with the Google Nexus 10 for the purposes of profiling and optimizing graphical applications using the Mali™-T604 GPU. Streamline is a powerful tool giving high resolution visibility of the entire system’s behavior, but it requires the engineer driving it to interpret the data, identify the problem area, and subsequently propose a fix.

For developers who are new to graphics optimization it is fair to say that there is a little bit of a learning curve when first starting out, so this new series of blogs is all about giving content developers the essential knowledge they need to successfully optimize for Mali GPUs. Over the course of the series, I will explore the fundamental macro-scale architectural structures and behaviors developers have to worry about, how this translates into possible problems which can be triggered by content, and finally how to spot them in Streamline.

Abstract Rendering Machine

The most essential piece of knowledge which is needed to successfully analyze the graphics performance of an application is a mental model of how the system beneath the OpenGL ES API functions, enabling an engineer to reason about the behavior they observe.

To avoid swamping developers in implementation details of the driver software and hardware subsystem, which they have no control over and which is therefore of limited value, it is useful to define a simplified abstract machine which can be used as the basis for explanations of the behaviors observed. There are three useful parts to this machine, and they are mostly orthogonal so I will cover each in turn over the first few blogs in this series, but just so you know what to look forward to the three parts of the model are:

The CPU-GPU rendering pipeline
Tile-based rendering
Shader core architecture

In this blog we will look at the first of these, the CPU-GPU rendering pipeline.

Synchronous API, Asynchronous Execution

The most fundamental piece of knowledge which is important to understand is the temporal relationship between the application’s function calls at the OpenGL ES API and the execution of the rendering operations those API calls require. The OpenGL ES API is specified as a synchronous API from the application perspective. The application makes a series of function calls to set up the state needed by its next drawing task, and then calls a glDraw^[1] function — commonly called a draw call — to trigger the actual drawing operation. As the API is synchronous all subsequent API behavior after the draw call has been made is specified to behave as if that rendering operation has already happened, but on nearly all hardware-accelerated OpenGL ES implementations this is an elaborate illusion maintained by the driver stack.

In a similar fashion to the draw calls, the second illusion that is maintained by the driver is the end-of-frame buffer flip. Most developers first writing an OpenGL ES application will tell you that calling eglSwapBuffers swaps the front and back-buffer for their application. While this is logically true, the driver again maintains the illusion of synchronicity; on nearly all platforms the physical buffer swap may happen a long time later.

Pipelining

The reason for needing to create this illusion at all is, as you might expect, performance. If we forced the rendering operations to actually happen synchronously you would end up with the GPU idle when the CPU was busy creating the state for the next draw operation, and the CPU idle while the GPU was rendering. For a performance critical accelerator all of this idle time is obviously not an acceptable state of affairs.

To remove this idle time we use the OpenGL ES driver to maintain the illusion of synchronous rendering behavior, while actually processing rendering and frame swaps asynchronously under the hood. By running asynchronously we can build a small backlog of work, allowing a pipeline to be created where the GPU is processing older workloads from one end of the pipeline, while the CPU is busy pushing new work into the other. The advantage of this approach is that, provided we keep the pipeline full, there is always work available to run on the GPU giving the best performance.

The units of work in the Mali GPU pipeline are scheduled on a per render-target basis, where a render target may be a window surface or an off-screen render buffer. A single render target is processed in a two step process. First, the GPU processes the vertex shading^[2] for all draw calls in the render target, and second, the fragment shading^[3] for the entire render target is processed. The logical rendering pipeline for Mali is therefore a three-stage pipeline of: CPU processing, geometry processing, and fragment processing stages.

Pipeline Throttling

An observant reader may have noticed that the fragment work in the figure above is the slowest of the three operations, lagging further and further behind the CPU and geometry processing stages. This situation is not uncommon; most content will have far more fragments to shade than vertices, so fragment shading is usually the dominant processing operation.

In reality it is desirable to minimize the amount of latency from the CPU work completing to the frame being rendered – nothing is more frustrating to an end user than interacting with a touch screen device where their touch event input and the data on-screen are out of sync by a few 100 milliseconds – so we don’t want the backlog of work waiting for the fragment processing stage to grow too large. In short we need some mechanism to slow down the CPU thread periodically, stopping it queuing up work when the pipeline is already full-enough to keep the performance up.

This throttling mechanism is normally provided by the host windowing system, rather than by the graphics driver itself. On Android for example we cannot process any draw operations in a frame until we know the buffer orientation, because the user may have rotated their device, changing the frame size. SurfaceFlinger — the Android window surface manager – can control the pipeline depth simply by refusing to return a buffer to an application’s graphics stack if it already has more than N buffers queued for rendering.

If this situation occurs you would expect to see the CPU going idle once per frame as soon as “N” is reached, blocking inside an EGL or OpenGL ES API function until the display consumes a pending buffer, freeing up one for new rendering operations.

This same scheme also limits the pipeline buffering if the graphics stack is running faster than the display refresh rate; in this scenario content is "vsync limited" waiting for the vertical blank (vsync) signal which tells the display controller it can switch to the next front-buffer. If the GPU is producing frames faster than the display can show them then SurfaceFlinger will accumulate a number of buffers which have completed rendering but which still need showing on the screen; even though these buffers are no longer part of the Mali pipeline, they count towards the N frame limit for the application process.

As you can see in the pipeline diagram above, if content is vsync limited it is common to have periods where both the CPU and GPU are totally idle. Platform dynamic voltage and frequency scaling (DVFS) will typically try to reduce the current operating frequency in these scenarios, allowing reduced voltage and energy consumption, but as DVFS frequency choices are often relatively coarse some amount of idle time is to be expected.

Summary

In this blog we have looked at synchronous illusion provided by the OpenGL ES API, and the reasons for actually running an asynchronous rendering pipeline beneath the API. Tune in next time, and I’ll continue to develop the abstract machine further, looking at the Mali GPU’s tile-based rendering approach.

Comments and questions welcomed,

Pete

Footnotes

[1] There are many OpenGL ES draw functions which draw things, it doesn't really matter which one you pick for this example.
[2] Vertex (computer graphics) - Wikipedia, the free encyclopedia
[3] Fragment (computer graphics) - Wikipedia, the free encyclopedia

2. 中文翻译

图形工作负载的优化对于许多现代移动应用程序而言往往必不可少，因为几乎所有渲染现在都直接或间接地由基于 OpenGL
ES 的渲染后端负责处理。我的同事 Michael McGeagh 最近发表了一篇工作指南 [http://community.arm.com/docs/DOC-8055]，介绍如何将 ARM®DS-5™ Streamline™ 性能分析工具用于 Google Nexus 10，对利用Mali™-T604 GPU 的图形应用程序进行性能分析和优化。Streamline 是一款强大的工具，能够深入细致地洞悉整个系统的行为，但也需要驾驭它的工程师能够解读相关数据，识别问题区域，进而提出修复建议。

对于初涉图形优化的开发人员而言，起步阶段总会遇到一些困难，所以我写了新的系列博文，给开发人员提供必要的知识，以便他们能够成功地针对 Mali GPU进行优化。在整个系列博文中，我将阐述开发人员必须要考虑的基本宏观体系结构和行为、这些因素如何转化为能被内容触发的潜在问题，以及最终如何在
Streamline 中找出这些问题。

抽象渲染机器

要想成功分析应用程序的图形性能，必须先掌握一个最基本的知识，也就是对 OpenGL ES API 底下系统运作方式建立一个心智模型，让工程师能够推断他们观察到的行为。

为避免让开发人员陷于驱动程序软件和硬件子系统的实施细节的沼泽之中（这些他们无法控制，因而价值有限），有必要定义一个简化的抽象机器，用作解读所观察到的行为的基础。这一机器包含三个有用部分，它们大体上是独立不相干的，所以我将在本系列博文的开头几篇中逐一介绍。不过，为了让你对它们有个初步印象，下面列出该模型的三个部分：

CPU-GPU 渲染管线
基于区块的渲染
着色器核心架构

在本篇博文中，我们将探讨第一个部分，即 CPU-GPU 渲染管线。

同步API，异步执行

务必要了解的一个基本知识是，OpenGL ES API 上应用程序函数调用和这些 API 调用所需渲染运算的执行之间的临时关系。从应用程序的角度而言，OpenGL ES API 被指定为同步 API。应用程序进行一系列的函数调用来设置其下一绘制任务所需的状态，然后调用 glDraw[1] 函数（通常称为绘制调用）触发实际的绘制运算。由于 API 是同步的，执行绘制调用后的所有 API 行为都被指定为要像渲染运算已经发生一样进行，但在几乎所有硬件加速的 OpenGL ES 实现上，这只是一种由驱动程序堆栈维持的美妙假象。

与绘制调用相似，驱动程序维持的第二个假象是帧末缓冲翻转。大多数头一次编写 OpenGL ES 应用程序的开发人员会告诉你，调用 eglSwapBuffers将交换其应用程序的前缓冲和后缓冲。虽然这在逻辑上是对的，但驱动程序再一次维持了同步性的假象；在几乎所有平台上，实际的缓冲交换可能会在很久之后才会发生。

管线化

正如你所想到的，需要创造这一假象的原因在于性能。如果我们强制渲染运算真正同步发生，你就会面临这样的尴尬：CPU 忙于创建下一绘制运算的状态时，GPU 会闲置；GPU 执行渲染时，CPU 会闲置。对于以性能为重的加速器而言，所有这些闲置时间都是绝然不可接受的。

为了去除这一闲置时间，我们使用 OpenGL ES 驱动程序来维持同步渲染行为的假象，而在面纱之后实际是以异步执行的方式处理渲染和帧交换。通过异步运行，我们可以建立一个小小的工作储备以允许创建一个管线，GPU 从管线的一端处理较旧的工作负载，而 CPU 则负责将新的工作推入另一端。这一方式的优势在于，只要管线装满，就始终有工作在 GPU 上运行，提供最佳的性能。

Mali GPU 管线中的工作单元是以渲染目标为单位进行计划的，其中渲染目标可能是屏幕缓存或离屏缓存。单个渲染目标通过两步处理。首先，GPU 为渲染目标中的所有绘制调用处理顶点着色[2]。然后，为整个渲染目标处理片段着色[3]。因此，Mali 的逻辑渲染管线包含三个阶段：CPU 处理阶段、几何处理阶段，以及片段处理阶段。

管线节流

观察力敏锐的读者可能已注意到，上图中片段部分的工作是三个运算中最慢的，被 CPU 和几何处理阶段甩得越来越远。这种情形并不少见；大多数内容中要着色的片段远多于顶点，因此片段着色通常是占主导地位的处理运算。

在现实中，最好要尽可能缩短从 CPU 工作结束到帧被渲染之间的延时 – 对最终用户而言，最让人烦躁的莫过于在操作触控屏设备时，其触控事件输入和屏幕中数据显示之间出现数百毫秒的不同步 – 所以，我们不希望等待片段处理阶段的工作储备变得过大。简而言之，我们需要某种机制来定期减慢 CPU 线程，当管线足够满、能够维持良好性能时停止把工作放入队列。

这种节流机制通常由主机窗口系统提供，而不是图形驱动程序本身。例如，在 Android 上，我们只有在知道缓冲方向时才能处理任何绘制运算，因为用户可能会旋转其设备，造成帧大小出现变化。SurfaceFlinger— Android 窗口表面管理器 – 可以通过一个简单方式控制管线深度：当管线中排队等待渲染的缓冲数量超过 N 个时，拒绝将缓冲返回到应用程序的图形堆栈。

如果出现这种情形，你就会看到：一旦每一帧达到“N”时 CPU 就会进入闲置状态，在内部阻止 EGL 或 OpenGL
ES API 函数，直到显示屏消耗完一个待处理缓存，为新的渲染运算空出一个位置。

如果图形堆栈的运行快于显示刷新率，同样的方案也可限制管线缓冲；在这一情形下，内容受到VSYNC限制”并等待垂直空白（VSYNC同步）信号，该信号告诉显示控制器它可以切换到下一缓冲。如果 GPU 产生帧的速度快于显示屏显示帧的速度，那么
SurfaceFlinger 将积累一定数量已经完成渲染但依然需要显示在屏幕上的缓冲；即使这些缓冲不再是 Mali 管线的一个部分，它们依然算在应用程序进程的 N 帧限制内。

正如上面的管线示意图所示，如果内容受到VSYNC同步限制，那么会经常出现 CPU 和 GPU 都完全闲置的时段。平台动态电压和频率调节 (DVFS) 通常会在此类情形中尝试降低当前的工作频率，以降低电压和功耗，但由于 DVFS 频率选择通常相对粗糙，所以可能会出现一定数量的闲置时间。

小结

本篇博文中，我们探讨了 OpenGL ES API 提供的同步假象，以及 API 下实际运行异步渲染管线的原因。敬请期待下一篇博文，我将继续往下开发这一台抽象机器，探讨 Mali GPU基于区块的渲染做法。

欢迎大家提出意见和问题。

Pete

脚注

[1] 执行绘制的 OpenGL ES 绘制函数有许多，本例中选用哪一个其实没什么关系。

[2] Vertex (computer graphics) – （顶点（计算机图形）） - 维基百科，自由的百科全书

[3] Fragment (computer graphics) - （片段（计算机图形）） - 维基百科，自由的百科全书

0 0