One of the standard weapons in any graphics programmer’s arsenal is a procedure for mesh triangle order optimization. Although most meshes are intended to produce output that is independent of the ordering of their triangles, the order of triangles within a mesh can have a big impact on performance. This gives us great leeway in picking our triangle order and also a great incentive for doing so wisely.

With today’s hardware and game content, a reasonable consideration when picking triangle order is overdraw. Games tend to be limited by fragment processing and early-z rejection is the primary way of keeping that cost to a minimum. The same meshes that are intended to be triangle order independent are usually also intended to be mesh order independent. Consequently most games minimize overdraw by sorting meshes relative to the camera position and rendering them in nearest-to-farthest order. This maximizes early-z rejection from intermesh occlusion, but it doesn’t address the possibility of intra-mesh occlusion. Any concave mesh may, after all, have self-occluding elements that could benefit from a nearest-to-farthest ordering. Since it isn’t feasible to sort the triangles within a mesh based on camera position every frame, the most common overdraw minimizing mesh optimization is ordering triangles offline based on a camera-independent probability of being occluded. If you take a torus, for example, it is pretty clear that the probability of the inner wall of the torus being occluded from a random camera angle is much higher than the probability of the outer wall being occluded. An overdraw-conscious triangle order for a torus would therefore render triangles on the outer wall of the torus before the triangles on the inner wall. Since computing this camera-independent probability of occlusion can be quite expensive and since there are other performance advantages to rendering adjacent triangles together, a good approach to overdraw ordering is to first decompose the mesh into convex clusters of adjacent triangles and then compute the occlusion probability per cluster. This approach is detailed in “Triangle Order Optimization for Graphics Hardware Computation Culling” which was presented at SIGGRAPH in 2006 and is implemented in AMD’s Tootle mesh optimization library.

Although I said that overdraw is a reasonable consideration in picking a triangle order, I get the impression that overdraw optimization is still pretty uncommon in games. There is good cause for skepticism about the value of overdraw optimization, since it has yet to be demonstrated conclusively that it works. Hardware implementations of early-z cull vary widely and aren’t well documented, so it is difficult to construct an a priori argument in favor of overdraw optimization. One obvious concern is that hardware pipelines are getting deeper, and there is no guarantee that depth output from one triangle will even be available to the z-cull unit in time to affect other triangles in the same mesh. Rather than deal with these issues, many game developers choose instead to render an explicit z prepass. A z prepass ensures zero overdraw in fragment processing but adds significant additional CPU setup and GPU vertex processing cost.

There is still plenty of room for improvement in techniques for overdraw optimization of triangle order, but what this field really needs is a comprehensive evaluation of the advantages of overdraw optimization. Until someone can show that real-world games can reap real-world benefits, there isn’t going to be much interest in overdraw minimizing triangle orders.

A second possible consideration in picking a triangle order, and by far the most common one, is making good use of the post-transform vertex cache. Whether you’re sorting whole meshes at once or subsets from your overdraw optimization pass, there is usually plenty of room for minimizing vertex processing work. The concept is pretty simple; indexed triangle meshes allow the same vertex to be used in multiple triangles. Most graphics hardware provide a small cache of recently processed vertices so that when the same index is used more than once the “transformed” vertex can be pulled from the cache rather than being processed again. The effectiveness of the post-transform cache is measured by counting how many vertices must be transformed on average per triangle. This number is known as the “average cache miss ratio” or ACMR. For a mesh with no vertex reuse the ACMR is obviously 3.0. Most meshes however exhibit a lot of vertex reuse and for large meshes an ACMR under 1.0 is not uncommon. The ACMR for a mesh is dependent on the number of entries in the post-transform cache and, for meshes with more vertices than cache entries, it is dependent on the order in which the vertices are used. Luckily for hardware designers, it turns out that there are pretty steeply diminishing returns in ACMR for increasing cache sizes, assuming an optimal vertex order. As long as software developers do their jobs and provide meshes with optimized triangle orders, the hardware can achieve near-optimal ACMR values with a modest-sized cache. Consequently, the graphics chips I work with typically have between 10 and 24 entries in the post-transform cache.

Post-transform cache optimization has been well researched and many consider it to be a solved problem. The top optimization algorithms generally yield ACMRs within a few percent of one another and within a few percent of optimal across a wide range of cache sizes. Even algorithms that are cache size aware are generally tolerant of varying cache sizes and continue to provide competitive results. This is probably because at their hearts most of the modern algorithms are the same. To achieve good runtime performance they are greedy algorithms that employ a heuristic to select vertices that will maximize hits in a simulated cache. These days it doesn’t make much sense to implement your own solution for post-transform cache optimization. There are too many good libraries available for free and too small a gap between them and the theoretical ideal.

I recently had the opportunity to compare some of the front-runners in this field. I took four competing algorithms for post-transform cache optimization and ran them on the mesh libraries for two games. The complete test bed was over 19 million triangles in 25 thousand meshes, and it included plenty of the sort of low-poly, flat-shaded rocks and table legs that cache optimizers hate. I timed the results and computed ACMR values for a range of cache sizes. The algorithms I compared were Direct3D’s D3DXOptimizeFaces function which I believe is based on Hugues Hoppe’s work, my own implementation of Tom Forsyth’s “Linear-Speed Vertex Cache Optimization” algorithm, AMD’s implementation of the “Tipsy” algorithm in Tootle, and a third-party implementation of Gang Lin and Thomas P.-Y. Yu’s K-Cache algorithm. Since Forsyth’s algorithm takes as a parameter the size of the simulated cache and Tipsy takes as a parameter the size of the targeted cache, I tested each algorithm multiple times with appropriate values. Here are the results:

ACMR at various cache sizes Time (secs) 10 14 18 24 32 ----------------------------------------------------------------------------- None 0.00 1.613 1.554 1.507 1.458 1.416 Forsyth-32 131.27 1.235 1.216 1.208 1.203 1.200 Forsyth-64 139.18 1.240 1.215 1.206 1.200 1.196 D3DX 20.05 1.276 1.227 1.220 1.214 1.210 Tootle-14 7.87 1.281 1.232 1.224 1.216 1.209 Tootle-16 7.55 1.288 1.235 1.220 1.213 1.207 Tootle-32 7.50 1.302 1.260 1.236 1.214 1.198 K-Cache 413.36 1.256 1.232 1.219 1.208 1.204

As you can see, Forsyth’s algorithm produced the best results on all cache sizes and Tootle performed the fastest. I’ll be the first to admit that I didn’t put a lot of effort into optimizing my implementation of Forsyth’s algorithm. I tried not to be stupid, but hopefully someone will take a look at it (available here) and provide an implementation that out-performs Tootle. That said, I don’t really believe the implementation is the sole reason for the difference in running time. The Tipsy and Forsyth algorithms are quite similar, and they both claim complexity that is roughly linear in the number of faces. Forsyth’s algorithm, however, only achieves that in the average case. Its worst-case runtime is order n squared in the number of faces. This isn’t true of Tipsy though, which is less pessimistic in the worst case and never sacrifices its linear runtime. Then again, maybe my implementation of Forsyth’s algorithm and the K-Cache implementation I have are completely wrong and I’m doing them both a terrible disservice. Let this be a lesson to algorithm authors—provide a reference implementation! Anyway, regardless of small differences, all the algorithms I tested performed extremely well and left little room for improvement. My personal choice based on these results is D3DX, mainly because I don’t want to keep revisiting the subject and I think D3D has the biggest incentive to stay competitive. Also I think it offers a nice compromise between offline and online performance.

One last note because someone is bound to ask—I tried comparing the algorithms on various subsets of the mesh pool, like only high or low poly meshes or only character or environment meshes, and while there were minor variations, the general trend always held strong. The “Linear-Speed Vertex Cache Algorithm” consistently returned the lowest ACMRs and Tootle consistently ran the fastest.

[…] Triangle Order Optimization […]

did you check whether the game performance improves after re-ordering?

quote “Until someone can show that real-world games can reap real-world benefits, there isn’t going to be much interest in overdraw minimizing triangle orders.”

I have measured performance improvements after triangle order optimization. The amount varies by platform and by how vertex-bound the scenes are, of course. In terms of final frame rate with and without triangle order optimization, I’ve seen improvements between about 3 and 6 percent.

Hi,

I’ve recently make some experimentations about mesh optimization on PS3. I’ve compared only 2 stripper, the old NVStrip and Sony implementation of K-Cache, wich is supposed to take RSX mini-cache into account. In my test scene, i mesured an improvement of about 5% with K-Cache.

After that, i discover the game engine i’m working on was not doing any pre-transform cache optimisation ( i think it’s done inside DX MeshOptimize() ). So i’ve reorder the vertices inside the vertex buffer in the order they are fetched by the index buffer after calling K-Cache. I’ve also created a second index buffer per mesh, dedicated to the Z/Shadow pass. Since color, UV, normals and other additionnal attributes are not used during these rendering passes, you can greatly increase vertices re-use ratio.

Combining all these things, i’ve measure an improvment of about 28% for Z/shadowmap pass compared to only calling NVStrip. The improvement is greater for skinned meshes, arround 32%.

I had to piece this together from some very old email chains, but here are our results from similar optimizations. K-Cache gave us a 0.9% total improvement over our cross-platform cache optimizer (~3.6% improvement in shadow rendering) and breaking our geometry into two vertex streams gave us a 3% total speedup (~12% in shadow rendering).

Thanks for the great writeup and for sharing your implementation of Forsyth. I’m testing your implementation out now in my open source engine, Kraken (http://kearwood.com/projects/kraken)

– Kearwood Gilbert

Something to know if you try to use this code, iterator debugging will complain about the line:

uint* end = &activeFaceList[vertexData.activeFaceListStart + vertexData.activeFaceListSize];

The code is safe, but it can be modified like this to satisfy iterator debugging:

uint* end = &activeFaceList[vertexData.activeFaceListStart + vertexData.activeFaceListSize – 1] + 1;