SIGGRAPH '97

Course 24: OpenGL and Window System Integration

OpenGL Performance Optimization


Contents

  • 1. Hardware vs. Software
  • 2. Application Organization
    • 2.1 High Level Organization
    • 2.2 Low Level Organization
  • 3. OpenGL Optimization
    • 3.1 Traversal
    • 3.2 Transformation
    • 3.3 Rasterization
    • 3.4 Texturing
    • 3.5 Clearing
    • 3.6 Miscellaneous
    • 3.7 Window System Integration
    • 3.8 Mesa-specific
  • 4. Evaluation and tuning
    • 4.1 Pipeline tuning
    • 4.2 Double buffering
    • 4.3 Test on several implementations



1. Hardware vs. Software

OpenGL may be implemented by any combination of hardware and software. At the high end, hardware may implement virtually all of OpenGL, while at the low end OpenGL may be implemented entirely in software. In between are combination software/hardware implementations. More money buys more hardware and better performance.

Intro-level workstation hardware and the recent PC 3-D hardware typically implement point, line, and polygon rasterization in hardware but implement floating point transformations, lighting, and clipping in software. This is a good strategy since the bottleneck in 3-D rendering is usually rasterization and modern CPUs have sufficient floating point performance to handle the transformation stage.

OpenGL developers must remember that their application may be used on a wide variety of OpenGL implementations. Therefore one should consider using all possible optimizations, even those which have little return on the development system, since other systems may benefit greatly.

From this point of view it may seem wise to develop your application on a low-end system. There is a pitfall, however: some operations which are cheap in software may be expensive in hardware. The moral is: test your application on a variety of systems to be sure the performance is dependable.



2. Application Organization

At first glance it may seem that the performance of interactive OpenGL applications is dominated by the performance of OpenGL itself. This may be true in some circumstances but be aware that the organization of the application is also significant.

2.1 High Level Organization

Multiprocessing

Some graphical applications have a substantial computational component other than 3-D rendering. Virtual reality applications must compute object interactions and collisions. Scientific visualization programs must compute analysis functions and graphical representations of data.

One should consider multiprocessing in these situations. By assigning rendering and computation to different threads they may be executed in parallel on multiprocessor computers.

For many applications, supporting multiprocessing is just a matter of partitioning the render and compute operations into separate threads which share common data structures and coordinate with synchronization primitives.

SGI's Performer is an example of a high level toolkit designed for this purpose.

 

Image quality vs. performance

In general, one wants high-speed animation and high-quality images in an OpenGL application. If you can't have both at once a reasonable compromise may be to render at low complexity during animation and high complexity for static images.

Complexity may refer to the geometric or rendering attributes of a database. Here are a few examples.

  • During interactive rotation (i.e. mouse button held down) render a reduced-polygon model. When drawing a static image draw the full polygon model.
  • During animation, disable dithering, smooth shading, and/or texturing. Enable them for the static image.
  • If texturing is required, use GL_NEAREST sampling and glHint( GL_PERSPECTIVE_CORRECTION_HINT, GL_FASTEST ).
  • During animation, disable antialiasing. Enable antialiasing for the static image.
  • Use coarser NURBS/evaluator tessellation during animation. Use glPolygonMode( GL_FRONT_AND_BACK, GL_LINE ) to inspect tessellation granularity and reduce if possible.

Level of detail management and culling

Objects which are distant from the viewer may be rendered with a reduced complexity model. This strategy reduces the demands on all stages of the graphics pipeline. Toolkits such as Inventor and Performer support this feature automatically.
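
As a rough illustration of distance-based level-of-detail selection, the sketch below picks a model index from the squared eye-to-object distance. The function name select_lod and the switch distances are hypothetical and are not taken from Inventor or Performer, which have their own mechanisms.

```c
#include <assert.h>
#include <stddef.h>

/* Pick a level of detail from the squared eye-to-object distance.
 * switch_dist_sq[] holds ascending squared switch distances (hypothetical
 * tuning values). Level 0 is the most detailed model; each threshold
 * passed moves to the next coarser model. */
int select_lod(float dist_sq, const float *switch_dist_sq, int num_switches)
{
    int level = 0;

    assert(switch_dist_sq != NULL && num_switches >= 0);
    while (level < num_switches && dist_sq >= switch_dist_sq[level]) {
        level++;
    }
    return level;
}
```

Working with squared distances avoids a square root per object per frame.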

Objects which are entirely outside of the field of view may be culled. This type of high level cull testing can be done efficiently with bounding boxes or spheres and can have a major impact on performance. Again, toolkits such as Inventor and Performer have this feature.
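
A bounding-sphere cull test can be sketched as below. This is a minimal example under stated assumptions: the view volume is given as planes whose unit normals point into the volume, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Plane a*x + b*y + c*z + d = 0 with a unit-length normal (a,b,c)
 * pointing toward the inside of the view volume (an assumption). */
struct plane  { float a, b, c, d; };
struct sphere { float x, y, z, radius; };

/* Returns 1 if the sphere lies entirely outside at least one plane,
 * in which case the object it bounds need not be drawn. Returns 0 if
 * the sphere is inside or straddles the view volume. */
int sphere_culled(const struct sphere *s,
                  const struct plane *planes, int num_planes)
{
    int i;

    assert(s != NULL && planes != NULL);
    for (i = 0; i < num_planes; i++) {
        /* signed distance from the sphere center to the plane */
        float dist = planes[i].a * s->x + planes[i].b * s->y
                   + planes[i].c * s->z + planes[i].d;
        if (dist < -s->radius) {
            return 1;
        }
    }
    return 0;
}
```

The test is conservative: a sphere that straddles a plane is kept and left to OpenGL's own clipping.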

2.2 Low Level Organization

The objects which are rendered with OpenGL have to be stored in some sort of data structure. Some data structures are more efficient than others with respect to how quickly they can be rendered.

Basically, one wants data structures which can be traversed quickly and passed to the graphics library in an efficient manner. For example, suppose we need to render a triangle strip. The data structure which stores the list of vertices may be implemented with a linked list or an array. Clearly the array can be traversed more quickly than a linked list. The way in which a vertex is stored in the data structure is also significant. High performance hardware can process vertices specified by a pointer more quickly than those specified by three separate parameters.

 

An Example

Suppose we're writing an application which involves drawing a road map. One of the components of the database is a list of cities specified with a latitude, longitude and name. The data structure describing a city may be:

struct city {
    float latitude, longitude;  /* city location */
    char *name;                 /* city's name */
    int large_flag;             /* 0 = small, 1 = large */
};

A list of cities may be stored as an array of city structs.

Our first attempt at rendering this information may be:

void draw_cities( int n, struct city citylist[] )
{
    int i;
    for (i=0; i < n; i++) {
        if (citylist[i].large_flag) {
            glPointSize( 4.0 );
        }
        else {
            glPointSize( 2.0 );
        }
        glBegin( GL_POINTS );
        glVertex2f( citylist[i].longitude, citylist[i].latitude );
        glEnd();
        glRasterPos2f( citylist[i].longitude, citylist[i].latitude );
        glCallLists( strlen(citylist[i].name),
                     GL_BYTE,
                     citylist[i].name );
    }
}

This is a poor implementation for a number of reasons:

  • glPointSize is called for every loop iteration.
  • only one point is drawn between glBegin and glEnd
  • the vertices aren't being specified in the most efficient manner

Here's a better implementation:

void draw_cities( int n, struct city citylist[] )
{
    int i;
    /* draw small dots first */
    glPointSize( 2.0 );
    glBegin( GL_POINTS );
    for (i=0; i < n ;i++) {
        if (citylist[i].large_flag==0) {
            glVertex2f( citylist[i].longitude, citylist[i].latitude );
        }
    }
    glEnd();
    /* draw large dots second */
    glPointSize( 4.0 );
    glBegin( GL_POINTS );
    for (i=0; i < n ;i++) {
        if (citylist[i].large_flag==1) {
            glVertex2f( citylist[i].longitude, citylist[i].latitude );
        }
    }
    glEnd();
    /* draw city labels third */
    for (i=0; i < n ;i++) {
        glRasterPos2f( citylist[i].longitude, citylist[i].latitude );
        glCallLists( strlen(citylist[i].name),
                     GL_BYTE,
                     citylist[i].name );
    }
}

In this implementation we're only calling glPointSize twice and we're maximizing the number of vertices specified between glBegin and glEnd.

We can still do better, however. If we redesign the data structures used to represent the city information we can improve the efficiency of drawing the city points. For example:

struct city_list {
    int num_cities;   /* how many cities in the list */
    float *position;  /* pointer to lat/lon coordinates */
    char **name;      /* pointer to city names */
    float size;       /* size of city points */
};

Now cities of different sizes are stored in separate lists. Positions are stored sequentially in a dynamically allocated array. By reorganizing the data structures we've eliminated the need for a conditional inside the glBegin/glEnd loops. Also, we can render a list of cities using the GL_EXT_vertex_array extension if available, or at least use a more efficient version of glVertex and glRasterPos.

/* indicates if server can do GL_EXT_vertex_array: */
GLboolean varray_available;

void draw_cities( struct city_list *list )
{
    int i;
    GLboolean use_begin_end;

    /* draw the points */
    glPointSize( list->size );

#ifdef GL_EXT_vertex_array
    if (varray_available) {
        /* assumes GL_VERTEX_ARRAY_EXT has already been enabled */
        glVertexPointerEXT( 2, GL_FLOAT, 0, list->num_cities, list->position );
        glDrawArraysEXT( GL_POINTS, 0, list->num_cities );
        use_begin_end = GL_FALSE;
    }
    else
#endif
    {
        /* extension not compiled in or not supported by the server */
        use_begin_end = GL_TRUE;
    }

    if (use_begin_end) {
        glBegin(GL_POINTS);
        for (i=0; i < list->num_cities; i++) {
            glVertex2fv( &list->position[i*2] );
        }
        glEnd();
    }

    /* draw city labels */
    for (i=0; i < list->num_cities ;i++) {
        glRasterPos2fv( &list->position[i*2] );
        glCallLists( strlen(list->name[i]),
                     GL_BYTE, list->name[i] );
    }
}
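
The draw_cities routine above assumes the cities have already been split into one city_list per point size. A minimal sketch of that conversion follows; the helper build_city_list is hypothetical, it stores each position as a longitude/latitude pair to match the glVertex2f calls earlier, and it uses const name pointers for convenience.

```c
#include <assert.h>
#include <stdlib.h>

struct city {
    float latitude, longitude;  /* city location */
    const char *name;           /* city's name */
    int large_flag;             /* 0 = small, 1 = large */
};

struct city_list {
    int num_cities;       /* how many cities in the list */
    float *position;      /* interleaved longitude/latitude pairs */
    const char **name;    /* city names (pointers are shared, not copied) */
    float size;           /* size of city points */
};

/* Collect the cities whose large_flag matches into one city_list.
 * Returns 0 on success, -1 if memory allocation fails. */
int build_city_list(const struct city *src, int n, int large_flag,
                    float point_size, struct city_list *out)
{
    int i, k = 0, count = 0;

    assert(src != NULL && out != NULL && n >= 0);
    for (i = 0; i < n; i++) {
        if (src[i].large_flag == large_flag) count++;
    }

    out->num_cities = count;
    out->size = point_size;
    out->position = NULL;
    out->name = NULL;
    if (count == 0) return 0;

    out->position = malloc(sizeof(float) * 2 * (size_t)count);
    out->name = malloc(sizeof(const char *) * (size_t)count);
    if (out->position == NULL || out->name == NULL) {
        free(out->position);
        free(out->name);
        out->position = NULL;
        out->name = NULL;
        out->num_cities = 0;
        return -1;
    }

    for (i = 0; i < n; i++) {
        if (src[i].large_flag != large_flag) continue;
        out->position[2 * k]     = src[i].longitude;
        out->position[2 * k + 1] = src[i].latitude;
        out->name[k] = src[i].name;
        k++;
    }
    return 0;
}
```

The conversion runs once when the database is loaded, so its cost is not paid per frame.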

As this example shows, it's better to know something about efficient rendering techniques before designing the data structures. In many cases one has to find a compromise between data structures optimized for rendering and those optimized for clarity and convenience.

In the following sections the techniques for maximizing performance, as seen above, are explained.



3. OpenGL Optimization

There are many ways to improve OpenGL performance. The impact of any single optimization can vary a great deal depending on the OpenGL implementation. Interestingly, items which have a large impact on software renderers may have no effect on hardware renderers, and vice versa! For example, smooth shading can be expensive in software but free in hardware, while glGet* can be cheap in software but expensive in hardware.

After each of the following techniques look for a bracketed list of symbols which relates the significance of the optimization to your OpenGL system:

  • H - beneficial for high-end hardware
  • L - beneficial for low-end hardware
  • S - beneficial for software implementations
  • all - probably beneficial for all implementations

3.1 Traversal

Traversal is the sending of data to the graphics system. Specifically, we want to minimize the time taken to specify primitives to OpenGL.

Use connected primitives
Connected primitives such as GL_LINE_STRIP, GL_LINE_LOOP, GL_TRIANGLE_STRIP, GL_TRIANGLE_FAN, and GL_QUAD_STRIP require fewer vertices to describe an object than individual line, triangle, or polygon primitives. This reduces data transfer and transformation workload. [all]
Use the vertex array extension
On some architectures function calls are somewhat expensive so replacing many glVertex/glColor/glNormal calls with the vertex array mechanism may be very beneficial. [all]
Store vertex data in consecutive memory locations
When maximum performance is needed on high-end systems it's good to store vertex data in contiguous memory to maximize throughput of data from host memory to the graphics subsystem. [H,L]
Use the vector versions of glVertex, glColor, glNormal and glTexCoord
The glVertex, glColor, etc. functions which take a pointer to their arguments such as glVertex3fv(v) may be much faster than those which take individual arguments such as glVertex3f(x,y,z) on systems with DMA-driven graphics hardware. [H,L]
Reduce quantity of primitives
Be careful not to render primitives which are over-tessellated. Experiment with the GLU primitives, for example, to determine the best compromise of image quality vs. tessellation level. Textured objects in particular may still be rendered effectively with low geometric complexity. [all]
Display lists
Use display lists to encapsulate frequently drawn objects. Display list data may be stored in the graphics subsystem rather than host memory thereby eliminating host-to-graphics data movement. Display lists are also very beneficial when rendering remotely. [all]
Don't specify unneeded per-vertex information
If lighting is disabled don't call glNormal. If texturing is disabled don't call glTexCoord, etc.
Minimize code between glBegin/glEnd
For maximum performance on high-end systems it's extremely important to send vertex data to the graphics system as fast as possible. Avoid extraneous code between glBegin/glEnd.

Example:

glBegin( GL_TRIANGLE_STRIP );
for (i=0; i < n; i++) {
    if (lighting) {
        glNormal3fv( norm[i] );
    }
    glVertex3fv( vert[i] );
}
glEnd();

This is a very bad construct. The following is much better:

if (lighting) {
    glBegin( GL_TRIANGLE_STRIP );
    for (i=0; i < n ;i++) {
        glNormal3fv( norm[i] );
        glVertex3fv( vert[i] );
    }
    glEnd();
}
else {
    glBegin( GL_TRIANGLE_STRIP );
    for (i=0; i < n ;i++) {
        glVertex3fv( vert[i] );
    }
    glEnd();
}

Also consider manually unrolling important rendering loops to maximize the function call rate.
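
To put a number on the "Use connected primitives" advice, the sketch below compares the vertices that must be specified for the same triangle count as independent triangles and as a single strip. The function names are illustrative only.

```c
#include <assert.h>
#include <limits.h>

/* Vertices needed when every triangle is sent separately (GL_TRIANGLES). */
unsigned long independent_triangle_vertices(unsigned long triangles)
{
    assert(triangles <= ULONG_MAX / 3UL);   /* guard against overflow */
    return 3UL * triangles;
}

/* Vertices needed when the triangles form one GL_TRIANGLE_STRIP:
 * the first triangle costs three vertices, each later one costs one. */
unsigned long triangle_strip_vertices(unsigned long triangles)
{
    assert(triangles <= ULONG_MAX - 2UL);   /* guard against overflow */
    return triangles ? triangles + 2UL : 0UL;
}
```

For long strips the vertex count approaches one third of the independent-triangle count, which is where the savings in transfer and transformation come from.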

3.2 Transformation

Transformation includes the transformation of vertices from glVertex to window coordinates, clipping and lighting.

 

Lighting
  • Avoid using positional lights, i.e. light positions should be of the form (x,y,z,0) [L,S]
  • Avoid using spotlights. [all]
  • Avoid using two-sided lighting. [all]
  • Avoid using negative material and light color coefficients [S]
  • Avoid using the local viewer lighting model. [L,S]
  • Avoid frequent changes to the GL_SHININESS material parameter. [L,S]
  • Some OpenGL implementations are optimized for the case of a single light source.
  • Consider pre-lighting complex objects before rendering, à la radiosity. You can get the effect of lighting by specifying vertex colors instead of vertex normals. [S]
Two sided lighting
If you want both the front and back of polygons shaded the same try using two light sources instead of two-sided lighting. Position the two light sources on opposite sides of your object. That way, a polygon will always be lit correctly whether it's back or front facing. [L,S]
Disable normal vector normalization when not needed
glEnable/Disable(GL_NORMALIZE) controls whether normal vectors are scaled to unit length before lighting. If you do not use glScale you may be able to disable normalization without ill effects. Normalization is disabled by default. [L,S]
Use connected primitives
Connected primitives such as GL_LINE_STRIP, GL_LINE_LOOP, GL_TRIANGLE_STRIP, GL_TRIANGLE_FAN, and GL_QUAD_STRIP decrease traversal and transformation load.
glRect usage
If you have to draw many rectangles consider using glBegin(GL_QUADS) ... glEnd() instead. [all]
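
The pre-lighting suggestion in the Lighting list can be sketched as follows. This is a simplified illustration, not OpenGL's full lighting equation: it assumes a single directional light, a diffuse term only, and unit-length vectors. The function name prelight_vertex is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Compute a diffuse vertex color once, ahead of rendering.
 * normal and to_light must be unit length; to_light points from the
 * surface toward the (directional) light. The result can be passed to
 * glColor3fv with lighting disabled. */
void prelight_vertex(const float normal[3], const float to_light[3],
                     const float material[3], const float light[3],
                     float out_color[3])
{
    float n_dot_l;
    int i;

    assert(normal != NULL && to_light != NULL && material != NULL
           && light != NULL && out_color != NULL);

    n_dot_l = normal[0] * to_light[0]
            + normal[1] * to_light[1]
            + normal[2] * to_light[2];
    if (n_dot_l < 0.0f) {
        n_dot_l = 0.0f;          /* surface faces away from the light */
    }
    for (i = 0; i < 3; i++) {
        out_color[i] = material[i] * light[i] * n_dot_l;
    }
}
```

This only pays off when neither the lights nor the object's orientation relative to them change from frame to frame.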

3.3 Rasterization

Rasterization is the process of generating the pixels which represent points, lines, polygons, bitmaps and the writing of those pixels to the frame buffer. Rasterization is often the bottleneck in software implementations of OpenGL.

Disable smooth shading when not needed
Smooth shading is enabled by default. Flat shading doesn't require interpolation of the four color components and is usually faster than smooth shading in software implementations. Hardware may perform flat and smooth-shaded rendering at the same rate though there's at least one case in which smooth shading is faster than flat shading (E&S Freedom). [S]
Disable depth testing when not needed
Background objects, for example, can be drawn without depth testing if they're drawn first. Foreground objects can be drawn without depth testing if they're drawn last. [L,S]
Disable dithering when not needed
This is easy to forget when developing on a high-end machine. Disabling dithering can make a big difference in software implementations of OpenGL on lower-end machines with 8 or 12-bit color buffers. Dithering is enabled by default. [S]
Use back-face culling whenever possible.
If you're drawing closed polyhedra or other objects for which back facing polygons aren't visible there's probably no point in drawing those polygons. [all]
The GL_SGI_cull_vertex extension
SGI's Cosmo GL supports a new culling extension which looks at vertex normals to try to improve the speed of culling.
Avoid extra fragment operations
Stenciling, blending, stippling, alpha testing and logic ops can all take extra time during rasterization. Be sure to disable the operations which aren't needed. [all]
Reduce the window size or screen resolution
A simple way to reduce rasterization time is to reduce the number of pixels drawn. If a smaller window or reduced display resolution is acceptable it's an easy way to improve rasterization speed. [L,S]
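
To show what back-face culling decides, the sketch below classifies a triangle from the sign of its area in window coordinates, assuming the default convention that counter-clockwise polygons are front facing (glFrontFace(GL_CCW)). The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Twice the signed area of triangle (a, b, c) in window coordinates.
 * Positive means the vertices wind counter-clockwise. */
float signed_area2(const float a[2], const float b[2], const float c[2])
{
    assert(a != NULL && b != NULL && c != NULL);
    return (b[0] - a[0]) * (c[1] - a[1])
         - (c[0] - a[0]) * (b[1] - a[1]);
}

/* Returns 1 if the triangle would be rejected by back-face culling
 * under the counter-clockwise-is-front convention, otherwise 0. */
int is_back_facing(const float a[2], const float b[2], const float c[2])
{
    return signed_area2(a, b, c) < 0.0f;
}
```

A rejected triangle skips rasterization entirely, which is why culling helps most on closed objects where roughly half the polygons face away.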

3.4 Texturing

Texture mapping is usually an expensive operation in both hardware and software. Only high-end graphics hardware can offer free to low-cost texturing. In any case there are several ways to maximize texture mapping performance.

Use efficient image formats
The GL_UNSIGNED_BYTE component format is typically the fastest for specifying texture images. Experiment with the internal texture formats offered by the GL_EXT_texture extension. Some formats are faster than others on some systems (16-bit texels on the Reality Engine, for example). [all]
Encapsulate texture maps in texture objects or display lists
This is especially important if you use several texture maps. By putting textures into display lists or texture objects the graphics system can manage their storage and minimize data movement between the client and graphics subsystem. [all]
Use smaller texture maps
Smaller images can be moved from host to texture memory faster than large images. More small textures can be stored simultaneously in texture memory, reducing texture memory swapping. [all]
Use simpler sampling functions
Experiment with the minification and magnification texture filters to determine which performs best while giving acceptable results. Generally, GL_NEAREST is fastest and GL_LINEAR is second fastest. [all]
Use the same sampling function for minification and magnification
If both the minification and magnification filters are GL_NEAREST or GL_LINEAR then there's no reason OpenGL has to compute the lambda value which determines whether to use minification or magnification sampling for each fragment. Avoiding the lambda calculation can be a good performance improvement.
Use a simpler texture environment function
Some texture environment modes may be faster than others. For example, the GL_DECAL or GL_REPLACE_EXT functions for 3-component textures are a simple assignment of texel samples to fragments, while GL_MODULATE multiplies texel samples with the incoming fragment colors. [S,L]
Combine small textures
If you are using several small textures consider tiling them together as a larger texture and modify your texture coordinates to address the subtexture you want. This technique can eliminate texture bindings.
Use glHint(GL_PERSPECTIVE_CORRECTION_HINT, GL_FASTEST)
This hint can improve the speed of texturing when perspective-correct texture coordinate interpolation isn't needed, such as when using a glOrtho() projection.
Animated textures
If you want to use an animated texture, perhaps live video textures, don't use glTexImage2D to repeatedly change the texture. Use glTexSubImage2D or glCopyTexSubImage2D. These functions are standard in OpenGL 1.1 and available as extensions to 1.0.
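
For the "Combine small textures" item, the coordinate adjustment can be sketched as below. It assumes equally sized tiles laid out on a regular grid inside the larger texture; the function name atlas_texcoord and the grid layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Map a local texture coordinate (s, t) in [0,1] on tile (col, row)
 * of a cols x rows grid into the coordinate space of the combined
 * texture. Tile (0,0) is at the texture origin. */
void atlas_texcoord(int col, int row, int cols, int rows,
                    float s, float t, float *atlas_s, float *atlas_t)
{
    assert(cols > 0 && rows > 0);
    assert(col >= 0 && col < cols && row >= 0 && row < rows);
    assert(atlas_s != NULL && atlas_t != NULL);

    *atlas_s = ((float)col + s) / (float)cols;
    *atlas_t = ((float)row + t) / (float)rows;
}
```

Note that with GL_LINEAR filtering, samples taken near a tile edge can pick up texels from the neighboring tile, and texture repeat modes no longer work per tile.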

3.5 Clearing

Clearing the color, depth, stencil and accumulation buffers can be time consuming, especially when it has to be done in software. There are a few tricks which can help.

Use glClear carefully [all]
Clear all relevant buffers with one glClear.

Wrong:

glClear( GL_COLOR_BUFFER_BIT );
if (stenciling) {
    glClear( GL_STENCIL_BUFFER_BIT );
}

Right:

if (stenciling) {
    glClear( GL_COLOR_BUFFER_BIT | GL_STENCIL_BUFFER_BIT );
}
else {
    glClear( GL_COLOR_BUFFER_BIT );
}
Disable dithering
Disable dithering before clearing the color buffer. Visually, the difference between dithered and undithered clears is usually negligible.
Use scissoring to clear a smaller area
If you don't need to clear the whole buffer use glScissor() to restrict clearing to a smaller area. [L]
Don't clear the color buffer at all
If the scene you're drawing opaquely covers the entire window there is no reason to clear the color buffer.
Eliminate depth buffer clearing
If the scene you're drawing covers the entire window there is a trick which lets you omit the depth buffer clear. The idea is to only use half the depth buffer range for each frame and alternate between using GL_LESS and GL_GREATER as the depth test function.

Example:

int EvenFlag;

/* Call this once during initialization and whenever the window
 * is resized.
 */
void init_depth_buffer( void )
{
    glClearDepth( 1.0 );
    glClear( GL_DEPTH_BUFFER_BIT );
    glDepthRange( 0.0, 0.5 );
    glDepthFunc( GL_LESS );
    EvenFlag = 1;
}

/* Your drawing function */
void display_func( void )
{
    if (EvenFlag) {
        glDepthFunc( GL_LESS );
        glDepthRange( 0.0, 0.5 );
    }
    else {
        glDepthFunc( GL_GREATER );
        glDepthRange( 1.0, 0.5 );
    }
    EvenFlag = !EvenFlag;

    /* draw your scene */
}
Avoid glClearDepth( d ) where d!=1.0
Some software implementations may have optimized paths for clearing the depth buffer to 1.0. [S]
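
To see why the "Eliminate depth buffer clearing" trick works, the sketch below models a single depth-buffer pixel without any OpenGL calls. It is a simplified simulation: window_depth mirrors the two glDepthRange settings used in display_func, and depth_test mirrors GL_LESS and GL_GREATER. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Map a normalized depth z in [0,1] (0 = nearest) into the half of the
 * depth range used for this frame: [0, 0.5] on even frames, as with
 * glDepthRange(0.0, 0.5), and [1.0, 0.5] on odd frames, as with
 * glDepthRange(1.0, 0.5). */
double window_depth(int even_frame, double z)
{
    assert(z >= 0.0 && z <= 1.0);
    return even_frame ? 0.5 * z : 1.0 - 0.5 * z;
}

/* Depth-test one fragment against one stored pixel value, using
 * "less" on even frames and "greater" on odd frames. On success the
 * stored value is overwritten. Returns 1 if the fragment passed. */
int depth_test(int even_frame, double *stored, double z)
{
    double d;
    int pass;

    assert(stored != NULL);
    d = window_depth(even_frame, z);
    pass = even_frame ? (d < *stored) : (d > *stored);
    if (pass) {
        *stored = d;
    }
    return pass;
}
```

Values left over from the previous frame always lie in the other half of the range, so the first fragment of a new frame passes against them and nearer fragments then win as usual, provided every pixel is covered each frame.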

3.6 Miscellaneous

Avoid "round-trip" calls
Calls such as glGetFloatv, glGetIntegerv, glIsEnabled, glGetError, glGetString require a slow, round trip transaction between the application and renderer. Especially avoid them in your main rendering code.

Note that software implementations of OpenGL may actually perform these operations faster than hardware systems. If you're developing on a low-end system be aware of this fact. [H,L]

Avoid glPushAttrib
If only a few pieces of state need to be saved and restored it's often faster to maintain the information in the client program. glPushAttrib( GL_ALL_ATTRIB_BITS ) in particular can be very expensive on hardware systems. This call may be faster in software implementations than in hardware. [H,L]
Check for GL errors during development
During development call glGetError inside your rendering/event loop to catch errors. GL errors raised during rendering can slow down rendering speed. Remove the glGetError call for production code since it's a "round trip" command and can cause delays. [all]
Use glColorMaterial instead of glMaterial
If you need to change a material property on a per vertex basis, glColorMaterial may be faster than glMaterial. [all]
glDrawPixels
  • glDrawPixels often performs best with GL_UNSIGNED_BYTE color components [all]
  • Disable all unnecessary raster operations before calling glDrawPixels. [all]
  • Use the GL_EXT_abgr extension to specify color components in alpha, blue, green, red order on systems which were designed for IRIS GL. [H,L]
Avoid using viewports which are larger than the window
Software implementations may have to do additional clipping in this situation. [S]
Alpha planes
Don't allocate alpha planes in the color buffer if you don't need them. Specifically, they are not needed for transparency effects. Systems without hardware alpha planes may have to resort to a slow software implementation. [L,S]
Accumulation, stencil, overlay planes
Do not allocate accumulation, stencil or overlay planes if they are not needed. [all]
Be aware of the depth buffer's depth
Your OpenGL may support several different sizes of depth buffers, 16 and 24-bit for example. Shallower depth buffers may be faster than deep buffers both for software and hardware implementations. However, the precision of a 16-bit depth buffer may not be sufficient for some applications. [L,S]
Transparency may be implemented with stippling instead of blending
If you need simple transparent objects consider using polygon stippling instead of alpha blending. The former is typically faster and may actually look better in some situations. [L,S]
Group state changes together
Try to minimize the number of GL state changes in your code. When GL state is changed, internal state may have to be recomputed, introducing delays. [all]
Avoid using glPolygonMode
If you need to draw many polygon outlines or vertex points use glBegin with GL_POINTS, GL_LINES, GL_LINE_LOOP or GL_LINE_STRIP instead as it can be much faster. [all]
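
One way to follow the "Group state changes together" advice is to sort the objects to be drawn by the state they need. The sketch below is a minimal illustration; the draw_item struct and its state_key field (an application-defined id for a combination of texture, material and so on) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct draw_item {
    int state_key;   /* application-defined id of the GL state needed */
    int object_id;   /* which object to draw */
};

static int compare_by_state(const void *pa, const void *pb)
{
    const struct draw_item *a = pa;
    const struct draw_item *b = pb;
    return (a->state_key > b->state_key) - (a->state_key < b->state_key);
}

/* Reorder the items so that those sharing a state are adjacent. */
void sort_by_state(struct draw_item *items, size_t n)
{
    assert(items != NULL || n == 0);
    qsort(items, n, sizeof items[0], compare_by_state);
}

/* Count how often the state must be set while drawing the items in
 * order, including the initial set for the first item. */
size_t count_state_changes(const struct draw_item *items, size_t n)
{
    size_t i, changes = 0;

    assert(items != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (i == 0 || items[i].state_key != items[i - 1].state_key) {
            changes++;
        }
    }
    return changes;
}
```

Sorting is only valid when draw order does not affect the image, which rules out, for example, blended objects that must be drawn back to front.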

3.7 Window System Integration

Minimize calls to the make current call
The glXMakeCurrent call, for example, can be expensive on hardware systems because the context switch may involve moving a large amount of data in and out of the hardware.
Visual / pixel format performance
Some X visuals or pixel formats may be faster than others. On PCs for example, 24-bit color buffers may be slower to read/write than 12 or 8-bit buffers. There is often a tradeoff between performance and quality of frame buffer configurations. 12-bit color may not look as nice as 24-bit color. A 16-bit depth buffer won't have the precision of a 24-bit depth buffer.

The GLX_EXT_visual_rating extension can help you select visuals based on performance or quality. GLX 1.2's visual caveat attribute can tell you if a visual has a performance penalty associated with it.

It may be worthwhile to experiment with different visuals to determine if there's any advantage of one over another.

Avoid mixing OpenGL rendering with native rendering
OpenGL allows both itself and the native window system to render into the same window. For this to be done correctly synchronization is needed. The GLX glXWaitX and glXWaitGL functions serve this purpose.

Synchronization hurts performance. Therefore, if you need to render with both OpenGL and native window system calls try to group the rendering calls to minimize synchronization.

For example, if you're drawing a 3-D scene with OpenGL and displaying text with X, draw all the 3-D elements first, call glXWaitGL to synchronize, then call all the X drawing functions.

Don't redraw more than necessary
Be sure that you're not redrawing your scene unnecessarily. For example, expose/repaint events may come in batches describing separate regions of the window which must be redrawn. Since one usually redraws the whole window image with OpenGL you only need to respond to one expose/repaint event. In the case of X, look at the count field of the XExposeEvent structure. Only redraw when it is zero.

Also, when responding to mouse motion events you should skip extra motion events in the input queue. Otherwise, if you try to process every motion event and redraw your scene there will be a noticeable delay between mouse input and screen updates.

It can be a good idea to put a print statement in your redraw and event loop function so you know exactly what messages are causing your scene to be redrawn, and when.

SwapBuffer calls and graphics pipe blocking
On systems with 3-D graphics hardware the SwapBuffers call is synchronized to the monitor's vertical retrace. Input to the OpenGL command queue may be blocked until the buffer swap has completed. Therefore, don't put more OpenGL calls immediately after SwapBuffers. Instead, put application computation instructions which can overlap with the buffer swap delay.
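
The event filtering described under "Don't redraw more than necessary" can be modeled without any window system calls. In the sketch below the event struct is a hypothetical stand-in: its count field plays the role of the count field of XExposeEvent, and the queue is a plain array rather than a real X event queue.

```c
#include <assert.h>
#include <stddef.h>

enum event_type { EV_EXPOSE, EV_MOTION };

struct event {
    enum event_type type;
    int count;   /* for EV_EXPOSE: number of expose events still to come */
};

/* Count how many redraws are needed for a queue of pending events:
 * one per batch of expose events (only when count reaches zero) and
 * one per run of consecutive motion events (only for the newest). */
int redraws_needed(const struct event *queue, int n)
{
    int i, redraws = 0;

    assert(queue != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (queue[i].type == EV_EXPOSE) {
            if (queue[i].count == 0) {
                redraws++;
            }
        }
        else {
            /* skip this motion event if a newer one is already queued */
            if (i + 1 < n && queue[i + 1].type == EV_MOTION) {
                continue;
            }
            redraws++;
        }
    }
    return redraws;
}
```

In this model six pending events can collapse into two redraws, which is the saving the text describes.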

3.8 Mesa-specific

Mesa is a free library which implements most of the OpenGL API in a compatible manner. Since it is a software library, performance depends a great deal on the host computer. There are several Mesa-specific features to be aware of which can affect performance.

 

Double buffering
The X driver supports two back color buffer implementations: Pixmaps and XImages. The MESA_BACK_BUFFER environment variable controls which is used. Which of the two is faster depends on the nature of your rendering. Experiment.
X Visuals
As described above, some X visuals can be rendered into more quickly than others. The MESA_RGB_VISUAL environment variable can be used to determine the quickest visual by experimentation.
Depth buffers
Mesa may use a 16 or 32-bit depth buffer as specified in the src/config.h configuration file. 16-bit depth buffers are faster but may not offer the precision needed for all applications.
Flat-shaded primitives
If one is drawing a number of flat-shaded primitives all of the same color the glColor command should be put before the glBegin call.

Don't do this:

glBegin(...);
glColor(...);
glVertex(...);
...
glEnd();

Do this:

glColor(...);
glBegin(...);
glVertex(...);
...
glEnd();
glColor*() commands
The glColor[34]ub[v] functions are the fastest versions of the glColor command.
Avoid double precision valued functions
Mesa does all internal floating point computations in single precision floating point. API functions which take double precision floating point values must convert them to single precision. This can be expensive in the case of glVertex, glNormal, etc.



4. Evaluation and Tuning

To maximize the performance of an OpenGL application one must be able to evaluate the application to learn what is limiting its speed. Because of the hardware involved it's not sufficient to use ordinary profiling tools. Several different aspects of the graphics system must be evaluated.

Performance evaluation is a large subject and only the basics are covered here. For more information see "OpenGL on Silicon Graphics Systems".

4.1 Pipeline tuning

The graphics system can be divided into three subsystems for the purpose of performance evaluation:

  • CPU subsystem - application code which drives the graphics subsystem
  • Geometry subsystem - transformation of vertices, lighting, and clipping
  • Rasterization subsystem - drawing filled polygons, line segments and per-pixel processing

At any given time, one of these stages will be the bottleneck. The bottleneck must be reduced to improve performance. The strategy is to isolate each subsystem in turn and evaluate changes in performance. For example, by decreasing the workload of the CPU subsystem one can determine if the CPU or graphics system is limiting performance.

 

4.1.1 CPU subsystem

To isolate the CPU subsystem one must reduce the graphics workload while preserving the application's execution characteristics. A simple way to do this is to replace glVertex and glNormal calls with glColor calls. If performance does not improve then the CPU stage is the bottleneck.

 

4.1.2 Geometry subsystem

To isolate the geometry subsystem one wants to reduce the number of primitives processed, or reduce the transformation work per primitive while producing the same number of pixels during rasterization. This can be done by replacing many small polygons with fewer large ones or by simply disabling lighting or clipping. If performance increases then your application is bound by geometry/transformation speed.

 

4.1.3 Rasterization subsystem

A simple way to reduce the rasterization workload is to make your window smaller. Another way to reduce rasterization work is to disable per-pixel processing such as texturing, blending, or depth testing. If performance increases, your program is fill limited.

After bottlenecks have been identified the techniques outlined in section 3 can be applied. The process of identifying and reducing bottlenecks should be repeated until no further improvements can be made or your minimum performance threshold has been met.
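
The decision procedure of sections 4.1.1 to 4.1.3 can be summarized in a small helper. This is only a sketch: it assumes you have measured frame times for the unmodified program and for each of the three experiments, and the min_gain threshold that decides what counts as an improvement is an arbitrary tuning value, not something prescribed by these notes.

```c
#include <assert.h>

enum bottleneck { BOUND_CPU, BOUND_GEOMETRY, BOUND_FILL, BOUND_UNKNOWN };

/* A test run counts as improved if its frame time dropped by more
 * than the fraction min_gain relative to the baseline. */
static int improved(double baseline_ms, double test_ms, double min_gain)
{
    return test_ms < baseline_ms * (1.0 - min_gain);
}

/* baseline_ms:      frame time of the unmodified program
 * less_graphics_ms: glVertex/glNormal replaced by glColor (4.1.1)
 * less_geometry_ms: fewer primitives or lighting disabled (4.1.2)
 * less_fill_ms:     smaller window or less per-pixel work (4.1.3) */
enum bottleneck classify_bottleneck(double baseline_ms,
                                    double less_graphics_ms,
                                    double less_geometry_ms,
                                    double less_fill_ms,
                                    double min_gain)
{
    assert(baseline_ms > 0.0 && min_gain >= 0.0 && min_gain < 1.0);

    if (!improved(baseline_ms, less_graphics_ms, min_gain)) {
        return BOUND_CPU;        /* graphics load removed, no speedup */
    }
    if (improved(baseline_ms, less_geometry_ms, min_gain)) {
        return BOUND_GEOMETRY;
    }
    if (improved(baseline_ms, less_fill_ms, min_gain)) {
        return BOUND_FILL;
    }
    return BOUND_UNKNOWN;
}
```

Since removing one bottleneck exposes the next, the measurements have to be repeated after each round of optimization.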

4.2 Double buffering

For smooth animation one must maintain a high, constant frame rate. Double buffering has an important effect on this. Suppose your application needs to render at 60Hz but is only getting 30Hz. It's a mistake to think that you must reduce rendering time by 50% to achieve 60Hz. The reason is that the swap-buffers operation is synchronized to occur during the display's vertical retrace period (at 60Hz for example). A frame that misses one retrace has to wait for the next one, so it may be that your application is taking only a tiny bit too long to meet the 1/60 second rendering time limit for 60Hz.

Measure the performance of rendering in single buffer mode to determine how far you really are from your target frame rate.
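
The effect of retrace synchronization on frame rate can be worked out with a little integer arithmetic. The sketch below assumes a simplified model in which each frame occupies a whole number of retrace periods and rendering of the next frame does not start until the swap has happened; the function name synced_frame_rate is hypothetical.

```c
#include <assert.h>

/* Effective frame rate when buffer swaps wait for vertical retrace.
 * render_us:  time needed to render one frame, in microseconds
 * refresh_hz: display refresh rate in Hz */
double synced_frame_rate(long render_us, long refresh_hz)
{
    long period_us, periods;

    assert(render_us > 0 && refresh_hz > 0 && refresh_hz <= 1000000L);

    period_us = 1000000L / refresh_hz;                  /* one retrace period, truncated */
    periods = (render_us + period_us - 1) / period_us;  /* round up to whole periods */
    return (double)refresh_hz / (double)periods;
}
```

On a 60Hz display a frame that takes 16 ms is shown at 60Hz, while one that takes 17 ms drops to 30Hz, even though the rendering time differs by only about 6%.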

4.3 Test on several implementations

The performance of OpenGL implementations varies a lot. One should measure performance and test OpenGL applications on several different systems to be sure there are no unexpected problems.




Last edited on May 16, 1997 by Brian Paul.
