{"id":9,"date":"2009-03-28T10:15:06","date_gmt":"2009-03-28T17:15:06","guid":{"rendered":"http:\/\/gameangst.com\/?p=9"},"modified":"2009-03-28T10:38:34","modified_gmt":"2009-03-28T17:38:34","slug":"triangle-order-optimization","status":"publish","type":"post","link":"http:\/\/gameangst.com\/?p=9","title":{"rendered":"Triangle Order Optimization"},"content":{"rendered":"<p class=\"MsoNormal\">One of the standard weapons in any graphics programmer&#8217;s arsenal is a procedure for mesh triangle order optimization. \u00a0Although most meshes are intended to produce output that is independent of the ordering of their triangles, the order of triangles within a mesh can have a big impact on performance. \u00a0This gives us great leeway in picking our triangle order and also a great incentive for doing so wisely.<\/p>\n<p class=\"MsoNormal\">With today&#8217;s hardware and game content, a reasonable consideration when picking triangle order is overdraw. \u00a0Games tend to be limited by fragment processing and early-z rejection is the primary way of keeping that cost to a minimum. \u00a0The same meshes that are intended to be triangle order independent are usually also intended to be mesh order independent. \u00a0Consequently most games minimize overdraw by sorting meshes relative to the camera position and rendering them in nearest-to-farthest order. \u00a0This maximizes early-z rejection from intermesh occlusion, but it doesn&#8217;t address the possibility of intra-mesh occlusion. \u00a0Any concave mesh may, after all, have self-occluding elements that could benefit from a nearest-to-farthest ordering. \u00a0Since it isn&#8217;t feasible to sort the triangles within a mesh based on camera position every frame, the most common overdraw minimizing mesh optimization is ordering triangles offline based on a camera-independent probability of being occluded. \u00a0If you take a torus, for example, it is pretty clear that the probability of the inner wall of the torus being occluded from a random camera angle is much higher than the probability of the outer wall being occluded. \u00a0An overdraw-conscious triangle order for a torus would therefore render triangles on the outer wall of the torus before the triangles on the inner wall. \u00a0Since computing this camera-independent probability of occlusion can be quite expensive and since there are other performance advantages to rendering adjacent triangles together, a good approach to overdraw ordering is to first decompose the mesh into convex clusters of adjacent triangles and then compute the occlusion probability per cluster. \u00a0This approach is detailed in\u00a0&#8220;<a href=\"http:\/\/developer.amd.com\/media\/gpu_assets\/Nehab-Triangle_Order_Optimization-SI3D06.pdf\">Triangle Order Optimization for Graphics Hardware Computation Culling<\/a>&#8221; which was presented at SIGGRAPH in 2006 and is implemented in AMD&#8217;s\u00a0<a href=\"http:\/\/developer.amd.com\/gpu\/tootle\/Pages\/default.aspx\">Tootle<\/a>\u00a0mesh optimization library.<\/p>\n<p class=\"MsoNormal\">Although I said that overdraw is a reasonable consideration in picking a triangle order, I get the impression that overdraw optimization is still pretty uncommon in games. \u00a0There is good cause for skepticism about the value of overdraw optimization, since it has yet to be demonstrated conclusively that it works. \u00a0Hardware implementations of early-z cull vary widely and aren&#8217;t well documented, so it is difficult to construct an a priori argument in favor of overdraw optimization. \u00a0One obvious concern is that hardware pipelines are getting deeper, and there is no guarantee that depth output from one triangle will even be available to the z-cull unit in time to affect other triangles in the same mesh. \u00a0Rather than deal with these issues, many game developers choose instead to render an explicit z prepass. \u00a0A z prepass ensures zero overdraw in fragment processing but adds significant additional CPU setup and GPU vertex processing cost.<\/p>\n<p class=\"MsoNormal\">There is still plenty of room for improvement in techniques for overdraw optimization of triangle order, but what this field really needs is a comprehensive evaluation of the advantages of overdraw optimization. \u00a0Until someone can show that real-world games can reap real-world benefits, there isn&#8217;t going to be much interest in overdraw minimizing triangle orders.<\/p>\n<p class=\"MsoNormal\">A second possible consideration in picking a triangle order, and by far the most common one, is making good use of the post-transform vertex cache. \u00a0Whether you&#8217;re sorting whole meshes at once or subsets from your overdraw optimization pass, there is usually plenty of room for minimizing vertex processing work. \u00a0The concept is pretty simple; indexed triangle meshes allow the same vertex to be used in multiple triangles. \u00a0Most graphics hardware provide a small cache of recently processed vertices so that when the same index is used more than once the &#8220;transformed&#8221; vertex can be pulled from the cache rather than being processed again. \u00a0The effectiveness of the post-transform cache is measured by counting how many vertices must be transformed on average per triangle. \u00a0This number is known as the &#8220;average cache miss ratio&#8221; or ACMR. \u00a0For a mesh with no vertex reuse the ACMR is obviously 3.0. \u00a0Most meshes however exhibit a lot of vertex reuse and for large meshes an ACMR under 1.0 is not uncommon. \u00a0The ACMR for a mesh is dependent on the number of entries in the post-transform cache and, for meshes with more vertices than cache entries, it is dependent on the order in which the vertices are used. \u00a0Luckily for hardware designers, it turns out that there are pretty steeply diminishing returns in ACMR for increasing cache sizes, assuming an optimal vertex order. \u00a0As long as software developers do their jobs and provide meshes with optimized triangle orders, the hardware can achieve near-optimal ACMR values with a modest-sized cache. \u00a0Consequently, the graphics chips I work with typically have between 10 and 24 entries in the post-transform cache.<\/p>\n<p class=\"MsoNormal\">Post-transform cache optimization has been well researched and many consider it to be a solved problem. \u00a0The top optimization algorithms generally yield ACMRs within a few percent of one another and within a few percent of optimal across a wide range of cache sizes. \u00a0Even algorithms that are cache size aware are generally tolerant of varying cache sizes and continue to provide competitive results. \u00a0This is probably because at their hearts most of the modern algorithms are the same. \u00a0To achieve good runtime performance they are greedy algorithms that employ a heuristic to select vertices that will maximize hits in a simulated cache. \u00a0These days it doesn&#8217;t make much sense to implement your own solution for post-transform cache optimization. \u00a0There are too many good libraries available for free and too small a gap between them and the theoretical ideal.<\/p>\n<p class=\"MsoNormal\">I recently had the opportunity to compare some of the front-runners in this field. \u00a0I took four competing algorithms for post-transform cache optimization and ran them on the mesh libraries for two games. \u00a0The complete test bed was over 19 million triangles in 25 thousand meshes, and it included plenty of the sort of low-poly, flat-shaded rocks and table legs that cache optimizers hate. \u00a0I timed the results and computed ACMR values for a range of cache sizes. \u00a0The algorithms I compared were Direct3D&#8217;s D3DXOptimizeFaces function which I believe is based on <a href=\"http:\/\/research.microsoft.com\/en-us\/um\/people\/hoppe\/proj\/tvc\/\">Hugues Hoppe&#8217;s work<\/a>, my own implementation\u00a0of Tom Forsyth&#8217;s &#8220;<a href=\"http:\/\/home.comcast.net\/~tom_forsyth\/papers\/fast_vert_cache_opt.html\">Linear-Speed Vertex Cache Optimization<\/a>&#8221; algorithm, AMD&#8217;s implementation of the &#8220;<a href=\"http:\/\/www.cs.princeton.edu\/gfx\/pubs\/Sander_2007_&gt;TR\/tipsy.pdf\">Tipsy<\/a>&#8221; algorithm in Tootle, and a third-party implementation of Gang Lin and Thomas P.-Y. Yu&#8217;s <a href=\"http:\/\/www.ecse.rpi.edu\/~lin\/K-Cache-Reorder\/Lin-Yu-TVCG.pdf\">K-Cache<\/a> algorithm. \u00a0Since Forsyth&#8217;s algorithm takes as a parameter the size of the simulated cache and Tipsy takes as a parameter the size of the targeted cache, I tested each algorithm multiple times with appropriate values. \u00a0Here are the results:<\/p>\n<pre>                               ACMR at various cache sizes\r\n               Time (secs)       10        14        18        24        32\r\n-----------------------------------------------------------------------------\r\nNone                 0.00     1.613     1.554     1.507     1.458     1.416\r\nForsyth-32         131.27     1.235     1.216     1.208     1.203     1.200\r\nForsyth-64         139.18     1.240     1.215     1.206     1.200     1.196\r\nD3DX                20.05     1.276     1.227     1.220     1.214     1.210\r\nTootle-14            7.87     1.281     1.232     1.224     1.216     1.209\r\nTootle-16            7.55     1.288     1.235     1.220     1.213     1.207\r\nTootle-32            7.50     1.302     1.260     1.236     1.214     1.198\r\nK-Cache            413.36     1.256     1.232     1.219     1.208     1.204<\/pre>\n<p class=\"MsoNormal\">As you can see, Forsyth&#8217;s algorithm produced the best results on all cache sizes and Tootle performed the fastest. \u00a0I&#8217;ll be the first to admit that I didn&#8217;t put a lot of effort into optimizing my implementation of Forsyth&#8217;s algorithm. \u00a0I tried not to be stupid, but hopefully someone will take a look at it (available <a href=\"http:\/\/gameangst.com\/wp-content\/uploads\/2009\/03\/forsythtriangleorderoptimizer.cpp\" target=\"_blank\">here<\/a>) and provide an implementation that out-performs Tootle. \u00a0That said, I don&#8217;t really believe the implementation is the sole reason for the difference in running time. \u00a0The Tipsy and Forsyth algorithms are quite similar, and they both claim complexity that is roughly linear in the number of faces. \u00a0Forsyth&#8217;s algorithm, however, only achieves that in the average case. \u00a0Its worst-case runtime is order n squared in the number of faces. \u00a0This isn&#8217;t true of Tipsy though, which is less pessimistic in the worst case and never sacrifices its linear runtime. \u00a0Then again, maybe my implementation of Forsyth&#8217;s algorithm and the K-Cache implementation I have are completely wrong and I&#8217;m doing them both a terrible disservice. \u00a0Let this be a lesson to algorithm authors\u2014provide a reference implementation! \u00a0Anyway, regardless of small differences, all the algorithms I tested performed extremely well and left little room for improvement. \u00a0My personal choice based on these results is D3DX, mainly because I don&#8217;t want to keep revisiting the subject and I think D3D has the biggest incentive to stay competitive. \u00a0Also I think it offers a nice compromise between offline and online performance.<\/p>\n<p class=\"MsoNormal\">One last note because someone is bound to ask\u2014I tried comparing the algorithms on various subsets of the mesh pool, like only high or low poly meshes or only character \u00a0or environment meshes, and while there were minor variations, the general trend always held strong. \u00a0The &#8220;Linear-Speed Vertex Cache Algorithm&#8221; consistently returned the lowest ACMRs and Tootle consistently ran the fastest.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the standard weapons in any graphics programmer&#8217;s arsenal is a procedure for mesh triangle order optimization. \u00a0Although most meshes are intended to produce output that is independent of the ordering of their triangles, the order of triangles within a mesh can have a big impact on performance. \u00a0This gives us great leeway in [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[7,6],"_links":{"self":[{"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/posts\/9"}],"collection":[{"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/gameangst.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=9"}],"version-history":[{"count":32,"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/posts\/9\/revisions"}],"predecessor-version":[{"id":11,"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/posts\/9\/revisions\/11"}],"wp:attachment":[{"href":"http:\/\/gameangst.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=9"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/gameangst.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=9"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/gameangst.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=9"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}