This article covers a number of tips that can be used to speed up virtually any 3D graphics system.
To understand why something makes your system faster, it is best to understand what makes it slow. An accurate table of which bottlenecks have the largest impact can’t really be made, because it depends not only on the API but also on other factors, such as how complex your lighting is. I present the major bottlenecks here in the most generalized order.
- Shader swapping: Swapping shaders has a huge impact on older APIs. Each time a shader is set, the pipeline must determine all over again how to transform the incoming vertices. DirectX 10 and DirectX 11 require a shader signature when a vertex buffer is created, so the transforming routine is determined only once and swapping shaders is not so rough. This brings the #2 bottleneck, swapping textures, roughly even in overhead, and perhaps even ahead.
- Texture swapping: Uploading textures is slow. Sometimes unused textures must first be evicted from video memory to make room, which is even slower.
- Transfers: Sending vertices and indices is also slow. Never send anything that does not need to be sent.
- Overdraw: Overdraw occurs when a pixel is fully rendered, only to have another pixel fully rendered on top of it later. The previous pixel was a complete waste of time in this case, and that wasted time can be significant if the pixel shader is complex.
- Redundant render states: This mainly applies to systems that are not entirely shader-based. If you are unfortunate enough not to be using shaders for every render call, you must implement wrappers around all of the render states you use, such as lighting on/off, so that the same state is never set twice. If you are using shaders, you only need a mechanism to avoid applying a shader that is already applied, and the same for textures (a minimal caching wrapper is sketched after this list). Lighting then becomes your own boolean switch on the CPU and is handled, along with every other render state, inside your shader.
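To make the caching idea concrete, here is a minimal C++ sketch. The `ApplyShader()`/`ApplyTexture()` calls referenced in the comments are hypothetical placeholders for whatever API calls your renderer actually makes; only the skip-if-already-set logic is the point.

```cpp
#include <cstdint>

// A minimal redundant-state cache. The wrapped API calls are placeholders.
class StateCache {
public:
    // Skip the API call entirely if this shader is already active.
    void SetShader(uint32_t shaderId) {
        if (shaderId == m_curShader) { return; }
        m_curShader = shaderId;
        // ApplyShader(shaderId);  // Hypothetical: actual D3D/GL call here.
    }

    // Same idea for each texture slot.
    void SetTexture(uint32_t slot, uint32_t textureId) {
        if (slot >= kMaxSlots) { return; }
        if (m_curTextures[slot] == textureId) { return; }
        m_curTextures[slot] = textureId;
        // ApplyTexture(slot, textureId);  // Hypothetical: actual API call.
    }

private:
    static constexpr uint32_t kMaxSlots = 8;
    static constexpr uint32_t kInvalid  = 0xFFFFFFFF;

    uint32_t m_curShader = kInvalid;
    uint32_t m_curTextures[kMaxSlots] = {
        kInvalid, kInvalid, kInvalid, kInvalid,
        kInvalid, kInvalid, kInvalid, kInvalid
    };
};
```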
The above bottlenecks can be mostly mitigated by sorting the objects that are about to be rendered: sort by shader, then by texture(s), and then loosely by depth. For opaque objects, you want to render near-to-far as best you can without incurring too many shader and texture swaps. For translucent objects, depth is really the only sorting criterion, because far-to-near rendering is necessary to avoid artifacts.
You may need to experiment; if overdraw is not a major bottleneck for you, you can ignore depth sorting for opaque objects and instead render all objects that use the same shader together, and within those, all objects that use the same textures together.
This is called a render queue, and it is extremely easy to implement while being very helpful to your performance. For each object in view (perform per-object and per-sub-object culling), for each mesh on that object, and for each part of each mesh, submit the data needed for sorting to the render queue. This data should be very compact: a pointer to the object that submitted it, the shader ID, the texture IDs, and the distance from the camera (which does not need to be true distance, and can be obtained simply via a dot product of the camera view direction against the position of the mesh’s center point). When sorting, instead of copying and moving these little blocks of render values around, build a list of indices into the data and sort those instead. This allows you to move only DWORDs at a time during the sort.
Run over the sorted indices and render each part in the order the indices define (in the pictured example, that would be render parts 3, 1, 0, and then 2). This allows you to heavily reduce the amount of data you copy and move. Rendering is not directly handled by the render queue; the final render is passed off to the actual object that submitted the render-part data. This allows the object to make any final preparations, activate its shader and textures, and select the index/vertex buffers to use for the render. In other words, the data passed to the render queue is just a reference the render queue uses to produce a good rendering order, and that order will be optimal as long as the objects actually use the shaders and textures they told the render queue they would, but they are not strictly required to do so.
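Here is a minimal C++ sketch of such a queue, following the layout described above. The exact `RenderQueueItem` fields and the `Object` type are assumptions for illustration (only one texture slot is shown); the point is that the sort shuffles 32-bit indices rather than the items themselves.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

class Object;  // The submitter; it performs the actual draw when called back.

// Compact per-render-part data, as described in the text.
struct RenderQueueItem {
    Object*  pObject;    // Who to call back for the final render.
    uint32_t shaderId;   // Primary sort key.
    uint32_t textureId;  // Secondary sort key (one slot shown for brevity).
    float    depth;      // dot(camForward, meshCenter - camPos); not true distance.
};

class RenderQueue {
public:
    void Submit(const RenderQueueItem& item) { m_items.push_back(item); }

    // Sort indices into m_items rather than the items themselves, so only
    // DWORD-sized values move during the sort. Opaque ordering: shader,
    // then texture, then near-to-far depth.
    const std::vector<uint32_t>& SortOpaque() {
        m_indices.resize(m_items.size());
        for (uint32_t i = 0; i < m_indices.size(); ++i) { m_indices[i] = i; }
        std::sort(m_indices.begin(), m_indices.end(),
            [this](uint32_t a, uint32_t b) {
                const RenderQueueItem& ia = m_items[a];
                const RenderQueueItem& ib = m_items[b];
                if (ia.shaderId  != ib.shaderId)  { return ia.shaderId  < ib.shaderId; }
                if (ia.textureId != ib.textureId) { return ia.textureId < ib.textureId; }
                return ia.depth < ib.depth;  // Near-to-far for opaque parts.
            });
        return m_indices;
    }

private:
    std::vector<RenderQueueItem> m_items;
    std::vector<uint32_t>        m_indices;
};
```

A translucent queue would use the same mechanism but sort by depth alone, in descending order, since far-to-near rendering trumps swap avoidance there.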
In general, a model has many model parts, or meshes. A robot is a model; its arm, foot, and head are model parts/meshes. Then, within each of those meshes, you may have multiple vertex buffers (which I call render parts, for lack of better terminology). If half of the head uses one material and the other half uses another, you need either to make two vertex buffers, or to make one vertex buffer but draw it in two passes, the first set of vertices for the left half and the second set for the right half, changing materials between draws. This is the typical situation.
In this image, there is only one model but two meshes. The blue ring mesh has only one render part, but the cone mesh has faces that do not all use the same material: we must render the red faces, change materials, and then render the green faces. So this mesh has two render parts/vertex buffers.
For optimization, these vertex buffers should be broken down even further, using multiple streams together to render each part. For example, store the vertices (positions) separately from the normals, texture coordinates, etc.: keep one vertex buffer for positions; one for normals, binormals, and tangents; and another for the rest, optionally breaking it into even more buffers.
The reason is that transfers to the GPU from the CPU are slow and you never want to send data that is not being used. In a naive approach, with only one vertex buffer for all attributes, normals will be submitted even if lighting is disabled, and sometimes binormals and tangents along with them. By keeping them in their own buffer, it is trivial to avoid submitting them when they are not needed.
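Here is a minimal OpenGL sketch of this stream separation in C++, assuming a GL 2.0+ context and a loader such as GLEW. The attribute locations (0 = position, 1 = normal, 2 = texcoord) and the `MeshStreams` layout are assumptions for illustration.

```cpp
#include <GL/glew.h>  // Assumes GLEW (or a similar loader) and a live GL context.

// One buffer object per attribute stream, so unused streams are never bound.
struct MeshStreams {
    GLuint posVbo;     // Positions only: always needed.
    GLuint normalVbo;  // Normals (plus tangents/binormals): only when lit.
    GLuint texVbo;     // Texture coordinates and the rest.
};

void BindStreams(const MeshStreams& m, bool lightingEnabled, bool textured) {
    // Positions are always submitted.
    glBindBuffer(GL_ARRAY_BUFFER, m.posVbo);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);
    glEnableVertexAttribArray(0);

    if (lightingEnabled) {
        // Normals ride along only when lighting actually needs them.
        glBindBuffer(GL_ARRAY_BUFFER, m.normalVbo);
        glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 0, nullptr);
        glEnableVertexAttribArray(1);
    } else {
        glDisableVertexAttribArray(1);  // Skip the whole stream when unlit.
    }

    if (textured) {
        glBindBuffer(GL_ARRAY_BUFFER, m.texVbo);
        glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 0, nullptr);
        glEnableVertexAttribArray(2);
    } else {
        glDisableVertexAttribArray(2);
    }
}
```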
Another example is during the creation of shadow maps. You are only creating a depth table, so you don’t need color information at all (except when handling colored glass). If you have the positions in their own vertex buffer, you can avoid submitting (in an average case) normals and one set of texture coordinates per vertex. That is 20 bytes per vertex. You also don’t have to submit any textures, your shaders can be as simple as humanly possible, and you don’t have to swap shaders between objects during the entire render. This makes the creation of shadow maps almost free.
Z-buffering is a way of making sure nearer pixels stay on top of pixels farther back. Unfortunately, this test happens after each pixel has been fully processed by the shaders, because shaders are able to change the depth value. Some graphics cards are optimized to peek into a shader, recognize that no depth modifications are being made, and cull pixels early rather than late. Unfortunately, this can’t be treated as expected behavior (though it is becoming increasingly common).
If you have followed the above advice, you will have positions in their own vertex buffer. Once again, without texture uploads and shader swapping, and with little data transfer and small shaders, making a depth-only pre-render of the scene is trivial. A depth pre-pass is extremely efficient in this case: render your scene while saving only the depth values, and once the depth buffer is finalized, set your depth comparison to equal and render the scene again.
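A minimal OpenGL sketch of the two passes, in C++. The `DrawScene()` calls are hypothetical placeholders for your own scene traversal; the GL state changes are the substance.

```cpp
#include <GL/gl.h>  // Core GL 1.x calls only.

void RenderWithDepthPrepass() {
    // Pass 1: depth only. Disable color writes and use the simplest
    // possible shader with only the position vertex buffer bound.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    // DrawScene(/*depthOnly=*/true);  // Hypothetical: positions only.

    // Pass 2: full shading. The depth buffer is final, so only the nearest
    // fragment at each pixel passes the GL_EQUAL test; on hardware with
    // early-Z, everything else is rejected before the pixel shader runs.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);  // No need to write depth again.
    glDepthFunc(GL_EQUAL);
    // DrawScene(/*depthOnly=*/false);  // Hypothetical: full materials.

    // Restore default depth state.
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
}
```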
This is only a win if the graphics card performs early-outs on depth, before the pixel shader is executed. If the card performs the depth check afterwards, as older cards did, you have only added extra work to your load. As far as I know, the only way to know which cards are optimized is to research which are known to perform early-Z and keep a table of that information.
Shader Permutations vs. Branching
The speed of branching in shaders depends on the hardware and the API. In DirectX 9, true branching can only be done on boolean registers, which can’t be modified during execution of the shader. OpenGL makes no guarantee that any true branching will take place. By “true” branching, I mean the opposite of the alternative, in which both sides of the branch are executed and the result is interpolated between them to effectively make one side or the other non-contributory to the result. This is extremely slow.
On the other hand, as pointed out earlier, switching shaders is expensive, particularly on older APIs. You have to weigh whether it is less costly to incur the wrath of a dynamic branch, with both sides of the if/else being executed, or the wrath of a shader swap that buys a more efficient execution. As a rule of thumb, if the same shader can be used for many objects, it is best to use permutation. Permutation is most easily accomplished by writing one shader per rendering class (generating shadow maps is one class of shaders, rendering ambient-only is another, multi-pass lighting is another, etc.) and using #ifdef to omit or activate chunks of code, as sketched below.
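Here is a minimal C++/GLSL sketch of preprocessor-based permutation: the same shader source is compiled once per variant, with a `#define` header selecting the active chunks. The macro names (`LIGHTING`, `FOG`) and the commented helper functions are assumptions for illustration, and error checking is omitted.

```cpp
#include <GL/glew.h>  // Assumes GLEW (or similar) and a GL 2.0+ context.

GLuint CompilePermutation(const char* source, bool lighting, bool fog) {
    // Build a #define header for this permutation.
    const char* header =
        lighting ? (fog ? "#define LIGHTING 1\n#define FOG 1\n"
                        : "#define LIGHTING 1\n")
                 : (fog ? "#define FOG 1\n" : "");

    // glShaderSource accepts multiple strings, so the header is simply
    // prepended to the shared source before compilation. #version must
    // come first in GLSL, hence the three-part layout.
    const char* strings[] = { "#version 120\n", header, source };

    GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(shader, 3, strings, nullptr);
    glCompileShader(shader);
    return shader;  // Error checking omitted for brevity.
}

// The shared source then uses #ifdef to omit or activate chunks:
//
// const char* frag =
//     "void main() {\n"
//     "    vec4 color = vec4(1.0);\n"
//     "#ifdef LIGHTING\n"
//     "    color *= ComputeLighting();\n"  // Hypothetical helper; only
//     "#endif\n"                           // compiled into lit variants.
//     "#ifdef FOG\n"
//     "    color = ApplyFog(color);\n"     // Hypothetical helper.
//     "#endif\n"
//     "    gl_FragColor = color;\n"
//     "}\n";
```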
Lighting is one case where it is better to use permutation rather than to branch. In a typical scene, lighting being on or off is something that will affect every object in the scene, so using a lighting permutation is unlikely to incur more shader swaps than branching would. Even if lighting is being switched on and off, during a single render all objects will use only the lighting shaders or only the no-lighting shaders.
Exit Early/Work Reduction
Exiting early is not guaranteed to actually help, since some implementations may encounter a discard/return pair but still execute the rest of the shader. We can’t get around this, but where early exits are correctly supported, they can prove a huge help. Here are a few things that can provide early exits or workload reductions:
- Black Lambertian terms: A black diffuse material with a 0 specular term has no need for lighting calculations. Once lighting is multiplied by the black diffuse term, the result is still 0, and no specular will be added to change that. There is no need to handle lighting at all in this case, and by using multiple vertex buffers as mentioned above, you also avoid sending normals, tangents, and binormals.
- Fully fogged: If a pixel is outside the fog’s far reach, simply return the fog color and exit early.
- Alpha testing: If the alpha is 0, or below the set threshold, exit early before any other computations are done. This implies that the pixel’s alpha should be one of the first things you compute, before lighting, etc. (see the shader sketch after this list).
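A minimal GLSL fragment-shader sketch of the alpha-test early exit, shown here as a C++ string constant for consistency with the earlier examples. The uniform names are assumptions, and the commented `ComputeLighting()` is a hypothetical placeholder; the point is that the discard happens before any other work.

```cpp
// Sample alpha first and discard before lighting, fog, or anything else.
// (As noted above, not every implementation actually skips the remaining
// work on discard, but where it does, this is nearly free rejection.)
const char* g_alphaTestFrag =
    "#version 120\n"
    "uniform sampler2D diffuseTex;\n"
    "uniform float alphaThreshold;\n"
    "varying vec2 texCoord;\n"
    "void main() {\n"
    "    vec4 diffuse = texture2D(diffuseTex, texCoord);\n"
    "    if (diffuse.a < alphaThreshold) { discard; }  // Exit early.\n"
    "    // vec3 lit = ComputeLighting(diffuse.rgb);   // Hypothetical.\n"
    "    gl_FragColor = diffuse;\n"
    "}\n";
```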
More to come as I have time to write.