<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Game Angst</title>
	<atom:link href="http://gameangst.com/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://gameangst.com</link>
	<description>because there are many ways to skin a cat, but you only get to choose one</description>
	<lastBuildDate>Mon, 10 May 2010 22:41:07 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>CrossStitch 2.0: Dynamic Shader Linking in a Statically Linked World</title>
		<link>http://gameangst.com/?p=441</link>
		<comments>http://gameangst.com/?p=441#comments</comments>
		<pubDate>Mon, 10 May 2010 22:40:30 +0000</pubDate>
		<dc:creator>Adrian Stone</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[CrossStitch]]></category>
		<category><![CDATA[Despair Engine]]></category>
		<category><![CDATA[Fracture]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://gameangst.com/?p=441</guid>
		<description><![CDATA[In my last article, I described the first generation of CrossStitch, the shader assembly system used in MechAssault 2.  Today I&#8217;m going to write about the second generation of CrossStitch, the one used in Fracture.
Development on CrossStitch 2.0 began with the development of Day 1&#8217;s Despair Engine.  This was right at the end of MechAssault [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://gameangst.com/?p=402">last article</a>, I described the first generation of <em>CrossStitch</em>, the shader assembly system used in <em><a href="http://xbox.ign.com/objects/655/655496.html">MechAssault 2</a></em>.  Today I&#8217;m going to write about the second generation of <em>CrossStitch</em>, the one used in <em><a href="http://www.lucasarts.com/games/fracture/" target="_blank">Fracture</a></em>.</p>
<p>Development on <em>CrossStitch 2.0</em> began with the development of Day 1&#8217;s <em>Despair Engine</em>.  This was right at the end of <em>MechAssault 2</em>, when the Xbox 360 was still an Apple G5 and the Cell processor was going to be powering everything from super computers to kitchen appliances in a few years.  It is hard to believe looking back, but at that time we were still debating whether to adopt a high-level shading language for graphics.  There were respected voices on the platform side insisting that the performance advantage of writing shaders in assembly language would justify the additional effort.  Thankfully I sided with the HLSL proponents, but that left me with the difficult decision of what to do about <em>CrossStitch</em>.</p>
<p><em>CrossStitch</em> was a relatively simple system targeting a single, very constrained platform.  HLSL introduced multiple target profiles, generic shader outputs, and literal constants, not to mention a significantly more complex and powerful language syntax.  Adding to that, <em>Despair Engine</em> was intended to be cross-platform, and we didn&#8217;t even have specs on some of the platforms we were promising to support.  Because of this, we considered the possibility of dispensing with dynamic shader linking entirely and adopting a conventional HLSL pipeline, implementing our broad feature set with a mixture of compile-time, static, and dynamic branching.  In the end, however, I had enjoyed so much success with the dynamic shader linking architecture of <em>MechAssault 2</em>, I couldn&#8217;t bear to accept either the performance cost of runtime branching or the clunky limitations of precomputing all possible shader permutations.</p>
<p>The decision was made: <em>Despair Engine</em> would feature <em>CrossStitch 2.0</em>.  I don&#8217;t recall how long it took me to write the first version of <em>CrossStitch 2.0</em>.  The early days of Despair development are a blur because we were supporting the completion of <em>MechAssault 2</em> while bootstrapping an entirely new engine on a constantly shifting development platform and work was always proceeding on all fronts.  I know that by December of 2004, however, <em>Despair Engine</em> had a functional implementation of dynamic shader linking in HLSL.</p>
<p><em>CrossStitch 2.0</em> is similar in design to its predecessor.  It features a front-end compiler that transforms shader fragments into an intermediate binary, and a back-end linker that transforms a chain of fragments into a full shader program.  The difference, of course, is that now the front-end compiler parses HLSL syntax and the back-end linker generates HLSL programs.  Since <em>CrossStitch 1.0</em> was mostly limited to vertex shaders with fixed output registers, <em>CrossStitch 2.0</em> introduced a more flexible model for passing data between pipeline stages.  Variables can define and be mapped to named input and output channels; and each shader chain requires an input signature from the stage preceding it and generates an output signature for the stage following it.</p>
<div id="attachment_463" class="wp-caption aligncenter" style="width: 671px"><img class="size-full wp-image-463 " style="border: 1px solid black;" title="HLSL Fragments" src="http://gameangst.com/wp-content/uploads/2010/05/HLSLFragments1.jpg" alt="" width="661" height="710" /><p class="wp-caption-text">A sampling of early HLSL shader fragments.</p></div>
<p><em>CrossStitch&#8217;s</em> primary concern is GPU runtime efficiency, so it is nice that shaders are compiled with full knowledge of the data they&#8217;ll be receiving either from vertex buffers or interpolators.  If, for example, some meshes include per-vertex color and some don&#8217;t, the same series of shader fragments will generate separate programs optimized for each case.  It turns out that this explicit binding of shader programs to attributes and interpolators is a common requirement of graphics hardware, and making the binding explicit in <em>CrossStitch</em> allows for some handy optimizations on fixed consoles.</p>
<p>The early results from <em>CrossStitch 2.0</em> were extremely positive.  The HLSL syntax was a nice break from assembly, and the dynamic fragment system allowed me to quickly and easily experiment with a wide range of options as our rendering pipeline matured.  Just as had happened with <em>MechAssault 2</em>, the feature set of Despair expanded rapidly to become heavily reliant on the capabilities of <em>CrossStitch</em>.  The relationship proved circular too.  Just as <em>CrossStitch</em> facilitated a growth in Despair&#8217;s features, Despair&#8217;s features demanded a growth in <em>CrossStitch&#8217;s</em> capabilities.</p>
<p>The biggest example of this is Despair&#8217;s material editor, <em>Façade</em>.  <em>Façade</em> is a graph-based editor that allows content creators to design extremely complex and flexible materials for every asset.  The materials are presented as a single pipeline flow, taking generic mesh input attributes and transforming them through a series of operations into a common set of material output attributes.  To implement <em>Façade</em>, I both harnessed and extended the power of <em>CrossStitch</em>.  Every core node in a <em>Facade</em> material graph is a shader fragment.  I added reflection support to the <em>CrossStich</em> compiler, so adding a new type of node to <em>Façade</em> is as simple as creating a new shader fragment and annotating its public-facing variables.  Since <em>CrossStitch</em> abstracts away many of the differences between pipeline stages, <em>Façade</em> material graphs don&#8217;t differentiate between per-vertex and per-pixel operations.  The flow of data between pipeline stages is managed automatically depending on the requirements of the graph.</p>
<div id="attachment_458" class="wp-caption aligncenter" style="width: 410px"><img class="size-full wp-image-458 " title="facade" src="http://gameangst.com/wp-content/uploads/2010/05/facade.jpg" alt="" width="400" height="332" /><p class="wp-caption-text">Façade Material Editor</p></div>
<p>It was about 6 months after the introduction of <em>Façade</em> when the first cracks in <em>CrossStitch</em> began to appear.  The problem was shader compilation times.  On <em>MechAssault 2</em> we measured shader compilation times in microseconds.  Loading a brand new level with no cached shader programs in <em>MechAssault 2</em> might cause a half-second hitch as a hundred new shaders were compiled.  If a few new shaders were encountered during actual play, a couple of extra milliseconds in a frame didn&#8217;t impact the designers&#8217; ability to evaluate their work.  Our initial HLSL shaders were probably a hundred times slower to compile than that on a high-end branch-friendly PC.  By the end of 2005 we had moved to proper Xbox 360 development kits and our artists had mastered designing complex effects in <em>Façade</em>.  Single shaders were now taking as long as several seconds to compile, and virtually every asset represented a half-dozen unique shaders.</p>
<p>The unexpected 4-5 decimal order of magnitude increase in shader compilation times proved disastrous.  <em>CrossStitch</em> was supposed to allow the gameplay programmers, artists, and designers to remain blissfully ignorant of how the graphics feature set was implemented.  Now, all of a sudden, everyone on the team was aware of the cost of shader compilation.  The pause for shader compilation was long enough that it could easily be mistaken for a crash, and, since it was done entirely on the fly, on-screen notification of the event couldn&#8217;t be given until <em>after</em> it was complete.  Attempts to make shader compilation asynchronous weren&#8217;t very successful because at best objects would pop in seconds after they were supposed to be visible and at worst a subset of the passes in a multipass process would be skipped resulting in unpredictable graphical artifacts.  Making matters worse, the long delays at level load were followed by massive hitches as new shaders were encountered during play.  It seemed like no matter how many times a designer played a level, new combinations of lighting and effects would be encountered and repeated second-long frame rate hitches would make evaluating the gameplay impossible.</p>
<p>Something had to be done and fast.</p>
<p>Simple optimization was never an option, because almost the entire cost of compilation was in the HLSL compiler itself.  Instead I focused my efforts on the <em>CrossStitch</em> shader cache.  The local cache was made smarter and more efficient, and extended so that multiple caches could be processed simultaneously.  That allowed the QA staff to start checking in their shader caches, which meant tested assets came bundled with all their requisite shaders.  Of course content creators frequently work with untested assets, so there was still a lot of unnecessary redundant shader compilation going on.</p>
<p>To further improve things we introduced a network shader cache.  Shaders were still compiled on-target, but when a missing shader was encountered it would be fetched from a network server before being compiled locally.  Clients updated servers with newly compiled shaders, and since Day 1 has multiple offices and supports distributed development, multiple servers had to be smart enough to act as proxies for one another.</p>
<p>With improvements to the shader cache, life with dynamic, on-the-fly shader compilation was tolerable but not great.  The caching system has only had a few bugs in its lifetime, but it is far more complicated than you might expect and only really understood by a couple of people.  Consequently, a sort of mythology has developed around the shader cache.  Just as programmers will suggest a full rebuild to one another as a possible solution to an inexplicable code bug, content creators and testers can be heard asking each other, &#8220;have you tried deleting your shader cache?&#8221;</p>
<p>At the same time as I was making improvements to the shader cache, I was also working towards the goal of having all shaders needed for an asset compiled at the time the asset was loaded.  I figured compiling shaders at load time would solve the in-game hitching problem and it also seemed like a necessary step towards my eventual goal of moving shader compilation offline.  Unfortunately, doing that without fundamentally changing the nature and usage of <em>CrossStitch</em> was equivalent to solving the halting problem.  <em>CrossStitch</em> exposes literally billions of possible shader programs to the content, taking advantage of the fact that only a small fraction of those will actually be used.  Which fraction, however, is determined by a mind-bending, platform-specific tangle of artist content, lua script, and C++ code.</p>
<p>I remember feeling pretty pleased with myself at the end of <em>MechAssault 2</em> when I learned that <em>Far Cry</em> required a 430 megabyte shader cache compared to MA2&#8217;s svelte 500 kilobyte cache.  That satisfaction evaporated pretty quickly during the man-weeks I spent tracking down unpredicted shader combinations in <em>Fracture</em>.</p>
<p>Even so, by the time we entered full production on <em>Fracture</em>, shader compilation was about as good as it was ever going to get.  A nightly build process loaded every production level and generated a fresh cache.  The build process updated the network shader cache in addition to updating the shader cache distributed with resources, so the team had a nearly perfect cache to start each day with.</p>
<p>As if the time costs of shader compilation weren&#8217;t enough, <em>CrossStitch</em> suffered from an even worse problem on the Xbox 360.  <em>Fracture&#8217;s</em> terrain system implemented a splatting system that composited multiple <em>Facade</em> materials into a localized über material, and then decomposed the über material into multiple passes according the register, sampler, and instruction limits of the target profile.  The result was some truly insane shader programs.</p>
<div id="attachment_479" class="wp-caption aligncenter" style="width: 410px"><img class="size-full wp-image-479 " title="Fracture Terrain Material" src="http://gameangst.com/wp-content/uploads/2010/05/terrain_material.jpg" alt="" width="400" height="400" /><p class="wp-caption-text">The procedurally generated material for one pass of a terrain tile in Fracture.</p></div>
<p>A few <em>Fracture</em> terrain shaders took over 30 seconds to compile and consumed over 160 megabytes of memory in the process.  Since the Xbox 360 development kits have no spare memory, this posed a major problem.  There were times when the content creators would generate a shader that could not be compiled on target without running out of memory and crashing.  It has only happened three times in five years, but we&#8217;ve actually had to run the game in a special, minimal memory mode in order to free up enough memory to compile a necessary shader for a particularly complex piece of content.  Once the shader is present in the network cache, the offending content can be checked in and the rest of the team is none the wiser.</p>
<p><a href="http://www.gamasutra.com/view/feature/4111/dirty_coding_tricks.php">Such things are not unusual in game development</a>, but it still kills me to be responsible for such a god-awful hack of a process.</p>
<p>And yet, <em>CrossStitch</em> continues to earn its keep.  Having our own compiler bridging the gap between our shader code and the platform compiler has proved to be a very powerful thing.  When we added support for the Playstation 3, Chris modified the <em>CrossStitch</em> back-end to compensate for little differences in the Cg compiler.  When I began to worry that some of our shaders were interpolator bound on the Xbox 360, the <em>CrossStitch</em> back-end was modified to perform automatic interpolator packing.  When I added support for Direct3D 10 and several texture formats went missing, <em>CrossStitch</em> allowed me to emulate the missing texture formats in late-bound fragments.  There doesn&#8217;t seem to be a problem <em>CrossStitch</em> can&#8217;t solve, except, of course, for the staggering inefficiency of its on-target, on-the-fly compilation.</p>
<p>For our next project I&#8217;m going to remove <em>CrossStitch</em> from Despair.  I&#8217;m going to do it with a scalpel if possible, but I&#8217;ll do it with a chainsaw if necessary.  I&#8217;m nervous about it, because despite my angst and my disillusionment with dynamic shader compilation, Day 1&#8217;s artists are almost universally fans of the Despair renderer.  They see <em>Façade</em> and the other elements of Despair graphics as a powerful and flexible package that lets them flex their artistic muscles.  I can&#8217;t take that away from them, but I also can&#8217;t bear to write another line of code to work around the costs of on-the-fly shader compilation.</p>
<p>It is clear to me now what I didn&#8217;t want to accept five years ago.  Everyone who has a say in it sees the evolution of GPU programs paralleling the evolution of CPU programs: code is static and data is dynamic.  <em>CrossStitch</em> has had a good run, but fighting the prevailing trends is never a happy enterprise.  Frameworks like DirectX Effects and CgFx have become far more full-featured and production-ready than I expected, and I&#8217;m reasonably confident I can find a way to map the majority of Despair&#8217;s graphics features onto them.  Whatever I come up with, it will draw a clear line between the engine and its shaders and ensure that shaders can be compiled wherever and whenever future platforms demand.</p>
]]></content:encoded>
			<wfw:commentRss>http://gameangst.com/?feed=rss2&amp;p=441</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>CrossStitch: A Configurable Pipeline for Programmable Shading</title>
		<link>http://gameangst.com/?p=402</link>
		<comments>http://gameangst.com/?p=402#comments</comments>
		<pubDate>Mon, 03 May 2010 19:38:49 +0000</pubDate>
		<dc:creator>Adrian Stone</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[CrossStitch]]></category>
		<category><![CDATA[MechAssault]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://gameangst.com/?p=402</guid>
		<description><![CDATA[Some pieces of code are a source of great pride and some pieces of code are a source of terrible shame.  It is due to the nature of game development, I believe, that many of the most interesting pieces of code I write eventually become both.  That is certainly the case with the CrossStitch shader [...]]]></description>
			<content:encoded><![CDATA[<p>Some pieces of code are a source of great pride and some pieces of code are a source of terrible shame.  It is due to the nature of game development, I believe, that many of the most interesting pieces of code I write eventually become both.  That is certainly the case with the <em>CrossStitch</em> shader assembler, my first significant undertaking in the field of computer graphics.</p>
<p><em>CrossStitch</em> has been the foundation of Day 1&#8217;s graphics libraries for over seven years now, and for almost half of that time it has been a regular source of frustration and embarrassment for me.  What I&#8217;d like to do is vent my frustration and write an article explaining all the ways in which I went wrong with <em>CrossStitch</em> and why, despite my hatred of it, I continue to put up with it.  But before I can do that I need to write this article, a remembrance of where <em>CrossStitch</em> came from and how I once loved it so much.</p>
<p>My first job in the game industry had me implementing the in-game UI for a flight simulator.  When that project was canceled, I switched to an adventure game, for which I handled tools and 3dsmax exporter work and later gameplay and graphical scripting language design.  A couple years later I found myself in charge of the networking code for an action multiplayer title, and shortly after that I inherited the AI and audio code.  Next I came back to scripting languages for a bit and worked with Lua integration, then I spent six months writing a physics engine before becoming the lead on the Xbox port of a PS2 title.  I only had a few people reporting to me as platform lead, so my responsibilities ended up being core systems, memory management, file systems, optimization, and pipeline tools.</p>
<p>When I interviewed at Day 1, one of the senior engineers asked me what I thought my greatest weakness was as a game developer and I answered honestly that although I&#8217;d worked in almost every area of game development, I&#8217;d had little or no exposure to graphics.  Imagine my surprise when less than a month after he hired me the studio director came to me and asked if I&#8217;d be willing to step into the role of graphics engineer.  It turned out that both of Day 1&#8217;s previous graphics engineers had left shortly before my arrival, and the company was without anyone to fill the role.  I was excited for the challenge and the opportunity to try something new, so I quickly accepted the position and went to work.</p>
<p>This left me in a very strange position, however.  I was inheriting the graphics code from <a href="http://www.xbox.com/en-US/games/m/mechassault/"><em>MechAssault</em></a>, so I had a fully working, shippable code base to cut my teeth on, but I was also inheriting a code base with no owner in a discipline for which I had little training, little experience, and no available mentor.  The title we were developing was <a href="http://xbox.ign.com/objects/655/655496.html"><em>MechAssault 2</em></a>, and given that it was a second generation sequel to an unexpected hit, there was a lot of pressure to overhaul the code and attempt a major step forward in graphics quality.</p>
<p>The <em>MechAssault 1</em> graphics engine was first generation Xbox exclusive, which meant fixed-function hardware vertex T&amp;L and fixed-function fragment shading.  As I began to familiarize myself with the code base and the hardware capabilities, I realized that in order to implement the sort of features that I envisioned for <em>MechAssault 2</em>, I&#8217;d need to convert the engine over to programmable shading.  The Xbox had phenomenal vertex shading capability and extremely limited fragment shading capability, so I wanted to do a lot of work in the vertex shader and have the flexibility to pass data to the fragment shader in whatever clever arrangement would allow me to use just a couple instructions to apply whatever per-pixel transformations were necessary to get attractive results on screen.</p>
<p>The problem was that shaders are monolithic single-purpose programs and the entire feature set and design of the graphics engine were reliant on the dynamic, mix-and-match capabilities of the fixed-function architecture.  This is where my lack of experience in graphics and my lack of familiarity with the engine became a problem, because although I was feeling held back by it, I didn&#8217;t feel comfortable cutting existing features or otherwise compromising the existing code.  I wanted the power of programmable shading, but I needed it in the form of a configurable, fixed-function-like pipeline that would be compatible with the architecture I had inherited.</p>
<p>I pondered this dilemma for a few days and eventually a plan for a new programmable shader pipeline started to emerge.  I asked the studio director if I could experiment with converting the graphics engine from fixed-function to programmable shaders, and he found two weeks in the schedule for me to try.  Our agreement was that in order to claim success at the end of two weeks, I had to have our existing feature set working in programmable shaders and I had to be able to show no decrease in memory or performance.</p>
<p>My plan was to write a compiler and linker for a new type of shader program, called shader fragments.  Instead of an object in the game having to specify its entire shader program in a single step, the code path that led to rendering an object would be able to add any number of shader fragments to a list, called a shader chain, and finally the chain would be linked together into a full shader program in time for the object to draw.</p>
<p>My first task was to define the syntax for shader fragments.  I had very limited time so I started with an easily parsable syntax, XML.  Fragments consisted of any number of named variables, any number of constant parameters, and a single block of shader assembly code.  Named variables were either local or global in scope, and they could specify several options such as default initialization, input source (for reading data from vertex attributes), output name (for passing data to fragment shaders), and read-only or write-only flags.</p>
<p>I wrote a simple C preprocessor, an XML parser, and a compiler to transform my fragments into byte-code.  Compiled fragments were linked directly into the executable along with a header file defining the constant parameter names.  Rendering of every object began with clearing the current shader chain and adding a fragment that defined some common global variables with default values.  For example, <em>diffuse</em> was a global, one-initialized variable available to every shader chain and <em>screen_pos</em> was a global variable mapped to the vertex shader position output.  Once the default variables were defined, a mesh might add a new fragment that defined a global variable, <em>vertex_diffuse,</em> with initialization from a vertex attribute and additional global variables for world-space position and normals.  Depending on the type of mesh, world-space position might be a rigid transformation of the vertex position, a single or multi-bone skinned transformation of the vertex position, or a procedurally generated position from a height field or particle system.</p>
<table border="0" cellpadding="0">
<tbody>
<tr>
<td>
<p><div id="attachment_424" class="wp-caption alignnone" style="width: 455px"><img class="size-full wp-image-424 " style="border: 1px solid black;" src="http://gameangst.com/wp-content/uploads/2010/04/FragmentMapDefaults.png" alt="" width="445" height="260" /><p class="wp-caption-text">Common variables were defined for all shader chains.</p></div></td>
<td>
<p><div id="attachment_417" class="wp-caption alignnone" style="width: 367px"><img class="size-full wp-image-417   " style="border: 1px solid black;" title="FragmentGenerateEyeVector.vfrag" src="http://gameangst.com/wp-content/uploads/2010/04/FragmentGenerateEyeVector.png" alt="" width="357" height="230" /><p class="wp-caption-text">A shader fragment for generating the eye vector.</p></div></td>
</tr>
</tbody>
</table>
<p>With world-space position and normals defined, other systems could then add fragments to accumulate per-vertex lighting into the global <em>diffuse</em> variable and to modulate <em>diffuse</em> by <em>material_diffuse</em> or <em>vertex_diffuse</em>.  Systems could also continue to define new variables and outputs, for example per-vertex fog or, in the case of per-pixel lighting, tangent space.  Precedence rules also allowed fragments to remap the inputs or outputs or change the default initialization of variables defined in earlier fragments.</p>
<table border="0" cellpadding="0">
<tbody>
<tr>
<td>
<p><div class="wp-caption alignnone" style="width: 330px"><a href="http://xbox.ign.com/dor/objects/655496/mechassault-2-lone-wolf/images/mechassault-2-lone-wolf-20041102060758085.html" target="_blank"><img class=" " title="MA2 Height Fog" src="http://xboxmedia.ign.com/xbox/image/article/563/563057/mechassault-2-lone-wolf-20041102060758085.jpg" alt="" width="320" height="240" /></a><p class="wp-caption-text">Custom height fog was implemented in shader fragments.</p></div></td>
<td>
<p><div class="wp-caption alignnone" style="width: 330px"><a href="http://xbox.ign.com/dor/objects/655496/mechassault-2-lone-wolf/images/mechassault-2-lone-wolf-20041203113114593.html" target="_blank"><img class=" " title="MA2 Particle Effects" src="http://xboxmedia.ign.com/xbox/image/article/570/570670/mechassault-2-lone-wolf-20041203113114593.jpg" alt="" width="320" height="240" /></a><p class="wp-caption-text">Particles were generated in shader fragments.</p></div></td>
</tr>
</tbody>
</table>
<p>The final step in rendering an object was to transform the active shader chain into a shader program.  Shader programs were compiled on the fly whenever a new shader chain was encountered, resulting in a small but noticeable frame rate hitch.  Luckily, the shader programs were cached to disk between runs so the impact of new shader compilations was minimal.  I had promised to add no additional CPU or GPU cost to rendering, so I put a lot of care into the code for adding shader fragments to shader chains and for looking up the corresponding shader programs.</p>
<p>It was a nerve-racking few weeks for me, however, because it wasn&#8217;t until I&#8217;d implemented the entire system and replaced every fixed-function feature supported by the engine with equivalent shader fragments that I was able to do side-by-side performance comparisons.  Amazingly, the CPU costs ended up being nearly indistinguishable.  Setting state through the fixed-function calls or setting shader constants and shader fragments was so close in performance that victory by even a narrow margin couldn&#8217;t be determined.</p>
<p>The results on the GPU were equally interesting.  The vertex shader implementation outperformed the fixed-function equivalent for everything except lighting.  Simple materials with little or no lighting averaged 10 &#8211; 20% faster in vertex shader form.  As the number of lights affecting a material increased, however, the margin dropped until the fixed-function implementation took the lead.  Our most complex lighting arrangement, a material with ambient, directional, and 3 point lights, was 25% slower on the GPU when implemented in a vertex shader.  With the mix of materials in our MechAssault levels, the amortized results ended up being about a 5% drop in vertex throughput.  This wasn&#8217;t a big concern for me, however, because we weren&#8217;t even close to being vertex bound on the Xbox, and there were several changes I wanted to make to the lighting model that weren&#8217;t possible in fixed-function but would shave a few instructions off the light shaders.</p>
<p>The studio director was pleased with the results, so I checked in my code and overnight the game went from purely fixed-function graphics to purely programmable shaders.  The shader fragment system, which I named <em>CrossStitch</em>, opened up the engine to a huge variety of GPU-based effects, and the feature set of the engine grew rapidly to become completely dependent on shader fragments.  By the time we shipped MechAssault 2, our rendering was composed of over 100 unique shader fragments.  These fragments represented billions of possible shader program permutations, though thankfully only a little more than 500 were present in our final content.</p>
<p>Here are some examples of features in MechAssault 2 that were made possible by shader fragments:</p>
<ul>
<li>a new particle effects system which offloaded some of the simulation costs to the GPU</li>
<li>procedural vertex animation for waving flags and twisting foliage</li>
<li>a space-warp effect attached to certain explosions and projectiles that deformed the geometry of nearby objects</li>
<li>mesh bloating and stretching modifiers for overlay shields and motion blur effects</li>
<li>GPU-based shadow volume extrusion</li>
<li>per-vertex and per-pixel lighting with numerous light types and variable numbers of lights</li>
<li>GPU-based procedural UV generation for projected textures, gobo lights, and decals</li>
<li>content-dependent mesh compression with GPU-based decompression</li>
</ul>
<table border="0" cellpadding="0">
<tbody>
<tr>
<td>
<p><div class="wp-caption alignnone" style="width: 330px"><a href="http://xbox.ign.com/dor/objects/655496/mechassault-2-lone-wolf/images/mechassault-2-lone-wolf-20041020020532474.html" target="_blank"><img class=" " title="MA2 Space Warp" src="http://xboxmedia.ign.com/xbox/image/article/558/558475/mechassault-2-lone-wolf-20041020020532474.jpg" alt="A mech explosion deforms the nearby objects." width="320" height="240" /></a><p class="wp-caption-text">A mech explosion deforms nearby objects.</p></div></td>
<td>
<p><div id="attachment_420" class="wp-caption alignnone" style="width: 330px"><a href="http://gameangst.com/wp-content/uploads/2010/04/MA2_per_pixel.jpg"><img class="size-medium wp-image-420    " title="MA2 Per-pixel Lighting" src="http://gameangst.com/wp-content/uploads/2010/04/MA2_per_pixel-400x300.jpg" alt="" width="320" height="240" /></a><p class="wp-caption-text">MA2 mixed per-vertex and per-pixel lighting.</p></div></td>
</tr>
</tbody>
</table>
<p>By the end of <em>MechAssault 2</em> I was convinced <em>CrossStitch</em> was the only way to program graphics.  I absolutely loved the dynamic shader assembly model I&#8217;d implemented, and I never had a second&#8217;s regret over it.  It was fast, memory efficient, and allowed me to think about graphics in the way I found most natural, as a powerful configurable pipeline of discreet transformations.  If only the story could end there.</p>
<p>In the future I&#8217;ll have to write a follow up to this article describing how <em>CrossStitch</em> evolved after <em>MechAssault 2</em>.  I&#8217;ll have to explain how it grew in sophistication and capability to keep pace with the advances in 3-D hardware, to support high-level language syntax and to abstract away the increasing number of programmable hardware stages.  I&#8217;ll have to explain why the path it followed seemed like such a good one, offering the seduction of fabulous visuals at minimal runtime cost.  But most importantly, I&#8217;ll have to explain how far along that path I was before I realized it was headed in a direction I didn&#8217;t want to be going, a direction completely divergent with the rest of the industry.</p>
]]></content:encoded>
			<wfw:commentRss>http://gameangst.com/?feed=rss2&amp;p=402</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Cascaded Perspective Variance Shadow Mapping</title>
		<link>http://gameangst.com/?p=339</link>
		<comments>http://gameangst.com/?p=339#comments</comments>
		<pubDate>Tue, 26 Jan 2010 03:51:48 +0000</pubDate>
		<dc:creator>Adrian Stone</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Fracture]]></category>
		<category><![CDATA[rendering]]></category>

		<guid isPermaLink="false">http://gameangst.com/?p=339</guid>
		<description><![CDATA[Perhaps the most frustrating thing about shipping anything less than a blockbuster game is the lack of feedback you get on your work.  The very best games of any given year will be scrutinized and evaluated by the community of developers, but everything else is pretty much ignored.  I was surprised and pleased therefore, [...]]]></description>
			<content:encoded><![CDATA[<p>Perhaps the most frustrating thing about shipping anything less than a blockbuster game is the lack of feedback you get on your work.  The very best games of any given year will be scrutinized and evaluated by the community of developers, but everything else is pretty much ignored.  I was surprised and pleased therefore, when a friend pointed me to an old <a href="http://graphicrants.blogspot.com/2008/11/smooth-transitions.html">blog entry</a> by Brian Karis that mentioned the shadows in <a href="http://www.lucasarts.com/games/fracture/" target="_blank">Fracture</a>.  Brian&#8217;s entry is only the fourth piece of feedback I&#8217;ve received on the technology behind Fracture, and it is the third piece specifically complimenting the shadows.  I&#8217;m particularly pleased to see praise for Fracture&#8217;s shadows, because finding just the right technique for them was a long and difficult process.</p>
<p>The first implementation of global shadows in Fracture was a single, orthographic shadow map fit to the view frustum.  This was never intended to be the final solution, but it was quick and easy to implement and it provided a good baseline for measuring the value of other techniques.</p>
<p>Fracture was, at that time, spec&#8217;d to have draw distances of about a thousand meters, and we&#8217;d only budgeted enough memory and fill rate for a single 1024&#215;1024 shadow map.  Fracture was also first person initially, and we wanted convincing self-shadowing on the player&#8217;s 3-D hands and weapons.  To get the kind of shadow resolution we wanted, we&#8217;d have needed a 100k by 100k orthographic shadow map or a 10,000 times increase in our shadow budget.</p>
<div id="attachment_369" class="wp-caption alignright" style="width: 410px"><a href="http://gameangst.com/wp-content/uploads/2010/01/psm0.jpg"><img class="size-medium wp-image-369 " title="Perspective Shadow Map" src="http://gameangst.com/wp-content/uploads/2010/01/psm0-400x225.jpg" alt="" width="400" height="225" /></a><p class="wp-caption-text">The perspective shadow map at its best.</p></div>
<p>After the orthographic shadow map, I implemented a form of perspective shadow mapping.  Perspective shadow maps distribute resolution non-uniformly across the view frustum by utilizing a perspective projection to warp the shape of the shadow frustum to better fit the view frustum.  My implementation of perspective shadow maps was primarily inspired by <a href="http://www.cg.tuwien.ac.at/research/vr/lispsm/">light-space perspective shadow maps</a>.  On the whole I found the perspective shadow map implementation reasonably straightforward, perfectly stable, and very effective at increasing shadow resolution near the player&#8217;s camera.  With a single 1024&#215;1024 perspective shadow map, the shadows in the immediate vicinity of the player were, in ideal conditions, high enough resolution to provide tolerable coarse shadows on the player&#8217;s weapon.</p>
<p>Unfortunately, not all conditions were ideal and there were problems.</p>
<p>Single-pass perspective shadow maps have limited flexibility in how they can redistribute resolution across the view frustum.  When the shadow&#8217;s light direction is orthogonal to the view direction, the perspective shadow frustum is an almost perfect fit for the view frustum.  However, as the shadow direction becomes more aligned with the view direction, perspective shadow maps lose their effectiveness and end up distributing shadow resolution uniformly, just like orthographic shadow maps.  Luckily this isn&#8217;t as big a problem as it first sounds.</p>
<div id="attachment_375" class="wp-caption alignleft" style="width: 410px"><a href="http://gameangst.com/wp-content/uploads/2010/01/psm1.jpg"><img class="size-medium wp-image-375 " title="Perspective Shadow Map Failing" src="http://gameangst.com/wp-content/uploads/2010/01/psm1-400x225.jpg" alt="" width="400" height="225" /></a><p class="wp-caption-text">A barely discernable player shadow at a problematic view angle.</p></div>
<p>When the player camera is looking directly into the light, there aren&#8217;t really any visible shadow boundaries.  Every visible surface is completely self shadowed, and that effect is handled redundantly by shading and shadowing.  When the player is looking directly away from the light, however, he&#8217;s generally looking directly at his own shadow and shadow resolution is very important.  Luckily there are two common characteristics of game worlds that held true for the prototype levels of Fracture.  First, global shadows usually come from above, and second, the player camera is usually situated near the bottom of the world.  Given these two constraints, there was an obvious solution to the problem of poor shadow resolution when looking at the player&#8217;s feet.  Before fitting the shadow frustum to the view frustum, I first clipped the view frustum to the dynamic bounds of the world.  When the camera is aligned with the light direction, the shadow frustum dimensions are constrained by the size of the visible camera&#8217;s far plane.  With the view frustum starting at the player&#8217;s head and clipped by the ground just below his feet, the far plane of the view frustum was never more than fifty square meters and even a uniform distribution of resolution was sufficient to provide high quality shadows.</p>
<p>Despite its problems, the single perspective shadow map implementation persisted in Fracture for over a year of development.  Unfortunately for me, during that time shadows in competing games continued to improve, and the artists and designers felt more and more constrained by the  limitations of perspective shadow maps.  The artists wanted to feature dramatic early morning and late afternoon lighting in levels, and the designers wanted to incorporate more vertical gameplay in which the player rained death from above down on enemies fifty meters below.  We also had occasional issues in which an errant particle effect or physics object would penetrate the terrain and begin plummeting into the uncharted abyss below, dragging the world bounds and shadow quality along with it.  In all of those cases my perspective shadow map implementation was no better than an orthographic shadow map and the shadows deteriorated to little more than dark smudges under large objects.</p>
<div id="attachment_371" class="wp-caption alignright" style="width: 410px"><a href="http://gameangst.com/wp-content/uploads/2010/01/cascaded_psm0.jpg"><img class="size-medium wp-image-371 " title="Cascaded PSM" src="http://gameangst.com/wp-content/uploads/2010/01/cascaded_psm0-400x225.jpg" alt="" width="400" height="225" /></a><p class="wp-caption-text">Player shadows were much improved with cascaded PSMs</p></div>
<p>To address the complaints about the single perspective shadow map approach, my coworker <a href="http://gamearchitect.net/">Kyle</a> extended the system to use a cascade of three perspective shadow maps.  The cascades were constructed so that each map covered a parallel slice of the view frustum, and the depth of each slice was hand-tuned to optimal effect.  The first cascade covered only 4 meters, the next one 100 meters, and the final one fit the entire remaining portion of the view frustum which was still over a thousand meters.  Each of the three maps was the same dimensions as the original shadow map, so memory and rendering costs tripled.  However, the results were inarguably worth it.  In all the cases where the quality of the single perspective shadow map suffered, the cascaded approach maintained high-resolution shadows in the immediate vicinity of the player.  Shadows in the mid-distance (the second cascade) also received a moderate bump in resolution compared to the single map approach.  Unfortunately shadows in the last cascade weren&#8217;t noticeably improved.  The costs of this system were quite high, but nevertheless the artists and designers were appeased and development rolled on for another six months.</p>
<div id="attachment_372" class="wp-caption alignright" style="width: 410px"><a href="http://gameangst.com/wp-content/uploads/2010/01/cascaded_psm1.jpg"><img class="size-medium wp-image-372 " title="Cascaded PSM Failure" src="http://gameangst.com/wp-content/uploads/2010/01/cascaded_psm1-400x225.jpg" alt="" width="400" height="225" /></a><p class="wp-caption-text">From the wrong camera angle, the sharp drop in cascade resolution decapitated the player shadow.</p></div>
<p>After six months with cascaded perspective shadow maps, the development team once again began to have second thoughts.  The biggest complaint about the cascaded solution was the boundary between cascades.  In problem areas the drop in resolution between the first and second cascades was dramatic.  If the player managed to position himself so that the cascade boundary lay across his own shadow, the effect was really unsightly.  In some cases the drop in resolution between the shadow of the player&#8217;s body and the shadow of the player&#8217;s head was so great that the player looked decapitated in his shadow.</p>
<p>From my perspective, there were three additional issues with our shadows.  With the expansion to three cascaded shadow maps, shadows were now fantastically over budget.  Also, in the middle and far distance where shadow resolution still struggled, the constant shimmering and crawling of the shadows as the camera moved really bothered me.  Lastly, the irregularly shaped shadow frustum from the perspective shadow maps made it impossible to get good results from a constant shadow bias.  Again in areas where shadow resolution struggled, the shadow bias was unpredictable resulting in inconsistent, but almost always excessive, &#8220;peter-panning.&#8221;</p>
<p>In the fall of 2007 I embarked on a final rewrite of global shadows for Fracture.  To improve performance, I replaced the three independent shadow maps in the cascaded solution with three square regions allocated out of a single texture.  I then replaced the shader code for sampling the three shadow maps with a new shader that calculated the UVs appropriate to the desired cascade and only performed a single (filtered) sample from the map.  What I had learned from our experience with cascaded shadow maps was that fitting the view frustum with multiple shadow frusta is a far more effective way of increasing shadow map resolution than trying to warp a single shadow map.</p>
<p>Since the performance of my new shader scaled very gracefully with the number of shadow maps, I was able to simultaneously increase the effective shadow map resolution and decrease runtime performance and memory costs by increasing the number of cascades to five.  The three 1024&#215;1024 shadow maps were replaced with five 640&#215;640 regions in a single map.</p>
<div id="attachment_373" class="wp-caption alignright" style="width: 410px"><a href="http://gameangst.com/wp-content/uploads/2010/01/cascaded_usm0.jpg"><img class="size-medium wp-image-373 " title="Cascaded Uniform Shadow Map" src="http://gameangst.com/wp-content/uploads/2010/01/cascaded_usm0-400x225.jpg" alt="" width="400" height="225" /></a><p class="wp-caption-text">More cascades easily compensated for the switch to uniform shadow maps.</p></div>
<p>As the number of cascades increased, the value of the perspective transformation in the shadow frusta decreased.  Meanwhile I was still frustrated with the shimmering and inconsistent shadow bias associated with perspective shadow maps.  I finally made the difficult decision to abandon the perspective shadow map and use traditional orthographic projections for the cascades.  This created a few new optimization opportunities in the shader, but more importantly it allowed me to make the shadows from static objects completely stable despite a moving camera and dynamic world bounds.  I fit the orthographic shadow frusta to the cascading slices of the view frustum, but I required that the shadow frusta be world axis aligned and I quantized their positions so that as a shadow map moved through the world, its pixel centers always landed in the same place.  I also had to quantize the shadow frusta&#8217;s near and far planes, though in a much coarser fashion, to ensure consistent depth for an object in the shadow map.  This was necessary to guarantee 100% stability in the shadow map when using floating point depth formats.</p>
<div id="attachment_393" class="wp-caption alignleft" style="width: 260px"><a href="http://gameangst.com/wp-content/uploads/2010/01/cascade_boundaries1.jpg"><img class="size-full wp-image-393" title="Cascade Boundaries" src="http://gameangst.com/wp-content/uploads/2010/01/cascade_boundaries1.jpg" alt="" width="250" height="250" /></a><p class="wp-caption-text">An aligned cascade boundary, highlighted in red, cuts at odd angles across the view frustum.</p></div>
<p>Interestingly, clearing the shadows of shimmering and crawling created a new problem that I had to deal with.  The uniform distribution of resolution within a cascade and the perfectly stable static shadows made the transitions between cascades really obvious.  Whereas previously the general instability of the shadows helped hide the cascade boundaries sliding through the world, now there was a very clear resolution boundary.  The fact that the cascades were aligned to the world axes and not the view direction contributed to this problem as well.  I always sampled shadows from the highest resolution map available, which meant the cascade boundaries shifted depending on the orientation of the camera.  When the camera was looking diagonally between two world axes, the cascade boundaries were at corners of the shadow maps rather than straight edges.</p>
<p>I was 50% over our original budget for shadows, but the new approach was half the cost of what we&#8217;d been living with for six months so I felt justified in sacrificing a bit more performance.  I expanded each cascade slightly to create an overlap, and I extended the shader to blend between two cascades at the boundaries.  On platforms with good pixel shader dynamic branching this only increased costs by about 15%.  On less capable platforms the dynamic branching was omitted and the cost of applying shadows grew by 40% to twice our original budget.  I also introduced a maximum range for the lowest resolution cascade at this point and faded shadows away completely beyond that distance.  This fit conveniently within the blending code I&#8217;d already added to the shader, and it saved us from wasting shadow resolution on the rarely visible back half of a 1500 meter view frustum.  As I recall, Fracture shipped with between a 250 and 350 meter shadow range depending on the level.</p>
<p>I also at this time pursued an avenue that turned out not to be fruitful.  Much of the cost of applying the cascaded shadow maps came from the filtering I was doing in the shader.  I was using an 8 tap, Poisson disk filter, which I felt was the best compromise between cost and quality.  I decided to try out variance shadow maps instead.  I have seen benefits from variance shadow maps in some situations, but in the case of global cascaded shadows the technique didn&#8217;t pan out.  Variance shadow maps are more expensive to create because they generally can&#8217;t take advantage of depth-only rendering, and they either suffer from precision artifacts or they require more memory.  They also incur a cost in prefiltering.  All of these costs are expected to be made up for by significantly more efficient application of the shadow maps.  In my case this simply wasn&#8217;t true.  Even the best global shadow maps have pretty poor pixel coverage, so fully prefiltering the variance shadow maps is wasteful.  I tried several approaches to optimize the variance shadow map technique, but in the end the performance was only equal to my non-variance implementation and the artifacts from the variance approach outweighed the improved softness of the shadows.</p>
<p>The last change I made for Fracture&#8217;s shadows was to replace the manually specified transition and cascade sizes with ones automatically computed to preserve constant screen-space shadow resolution between cascades.  For any given scene I found that hand-tuned values could produce slightly more appealing results, but since we had many varying FOVs and shadow ranges, I didn&#8217;t feel justified in forcing the artists to manually specify every value.</p>
<p>One of the fun things about the work I put into the shadows in Fracture is that very few of the techniques I experimented with were mutually exclusive.  I was able to, at one point, run through our levels toggling the shadows from perspective to orthographic, variance to non-variance, PCF to Poisson disk, and simultaneously varying the number, size, and resolution of the cascades.  I also implemented debug visualization modes to show the outline of shadow texels in the world rather than the shadows themselves.  Doing this, the value and cost of the various techniques became rapidly apparent.  Perspective shadow maps were great, but I could always get better quality and lower costs by increasing the number of cascades.  Prefiltering with variance shadow maps was great, but a fairly expensive filter during sampling was better looking for the same cost.  The optimal number and size of cascades actually varied slightly depending on the target platform.  Although I said we used five cascades at 640&#215;640, that was only on the Xbox 360.  On the PS3 the sweet spot for performance turned out to be distributing slightly more memory across three higher-resolution cascades.</p>
<div id="attachment_374" class="wp-caption alignleft" style="width: 410px"><a href="http://gameangst.com/wp-content/uploads/2010/01/cascaded_usm1.jpg"><img class="size-medium wp-image-374" title="Cascaded Uniform Shadow Maps Worst Case" src="http://gameangst.com/wp-content/uploads/2010/01/cascaded_usm1-400x225.jpg" alt="" width="400" height="225" /></a><p class="wp-caption-text">The uniform shadow maps provided plenty of resolution even in worst-case scenarios.</p></div>
<p>That pretty much covers the implementation of global shadows that Fracture shipped with.  It survived about two years with few complaints and reasonable performance.  Since Fracture I&#8217;ve made a few more improvements, though mostly in the realm of optimizations.  The most valuable optimization, and the only one I&#8217;ll mention here, was culling the contents of each cascade more aggressively.  Although each cascade must be an orthogonal frustum from the perspective of the GPU, a much smaller volume can be defined on the CPU that covers all the shadow casters relevant to a particular map.  The volume I use for culling each map is constructed by finding the intersection of the view frustum extruded in the direction of the light and the original orthogonal shadow frustum and then subtracting the volume of all the higher resolution maps.  The result isn&#8217;t a convex shape, but it can be decomposed into an inclusive and an exclusive convex volume for fast object intersection queries.</p>
<p>In real-world data, more aggressive culling of the cascades reduced the number of objects rendered into the global shadow map by between 30 and 75 percent depending on the orientation of the camera relative to the light.  The nice thing about this reduction is that it starts at the object level and consequently translates directly into CPU and GPU savings.</p>
<p>I&#8217;m pretty confident now that I won&#8217;t be revisiting global shadows again in this hardware generation.  I&#8217;m completely satisfied with both the cost and the quality of our current solution, and I don&#8217;t feel there is much improvement to be had without introducing a new computation model.  With any luck, the next time I&#8217;m thinking about global shadows it will be in the context of compute shaders and perhaps even non-rasterization based techniques.</p>
]]></content:encoded>
			<wfw:commentRss>http://gameangst.com/?feed=rss2&amp;p=339</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Symbol Sort : A Utility for Measuring C++ Code Bloat</title>
		<link>http://gameangst.com/?p=320</link>
		<comments>http://gameangst.com/?p=320#comments</comments>
		<pubDate>Tue, 22 Sep 2009 22:02:10 +0000</pubDate>
		<dc:creator>Adrian Stone</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[code bloat]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">http://gameangst.com/?p=320</guid>
		<description><![CDATA[OVERVIEW
SymbolSort is a utility for analyzing code bloat in C++ applications.  It works by extracting the symbols from a dump generated by the Microsoft DumpBin utility or by reading a PDB file.  It processes the symbols it extracts and generates lists sorted by a number of different criteria.  You can read more about the motivation behind SymbolSort here.
The [...]]]></description>
			<content:encoded><![CDATA[<h3><span style="text-decoration: underline;">OVERVIEW</span></h3>
<p>SymbolSort is a utility for analyzing code bloat in C++ applications.  It works by extracting the symbols from a dump generated by the Microsoft DumpBin utility or by reading a PDB file.  It processes the symbols it extracts and generates lists sorted by a number of different criteria.  You can read more about the motivation behind SymbolSort <a href="http://gameangst.com/?p=46">here</a>.</p>
<p>The lists compiled by SymbolSort are:</p>
<h4>Raw Symbols, sorted by size</h4>
<p>This list is generated from the complete set of symbols.  No deduplication is performed so this list is intended to highlight individual large symbols.</p>
<h4>File contributions, sorted by size</h4>
<p>This list is generated by calculating the total size of symbols that contribute to a folder path.  If the input is a COMDAT dump, the source location for symbols is the .obj or .lib file that DumpBin was run on (see usage for details).  It is important to note that for COMDAT dumps individual symbols will appear multiple times coming from different .obj files.  If the input is a PDB file, the source location for symbols is the actual source file in which the symbol is defined.  The source file for data symbols is not always clearly defined within the PDB so in some cases it is a best guess.</p>
<h4>File contribution, sorted by path</h4>
<p>This is a complete, hierarchical list of the size of symbols in all contributing source files.</p>
<h4>Symbol Sections / Types, sorted by total size and by total count</h4>
<p>This shows a breakdown of symbols by section or type, depending on the kind of information that can be extracted from the input source.</p>
<h4>Merged Duplicate Symbols, sorted by total size and by total count</h4>
<p>This list is generated by merging symbols with identical names.  The symbols are not guaranteed to be the same symbol.  In the case of PDB input there will be very few duplicate symbols.  COMDAT input, however, should contain a large number of duplicate symbols.  This list is useful for measuring total compile and link time for a particular symbol.  A relatively small symbol that appears in a very large number of .obj files will have a large total size and appear near the top of this list.</p>
<h4>Merged Template Symbols, sorted by total size and by total count</h4>
<p>This list is generated by stripping template parameters from symbols and then merging duplicates.  Symbols <em>std::auto_ptr&lt;int&gt;</em> and <em>std::auto_ptr&lt;float&gt;</em> will be transformed into <em>std::auto_ptr&lt;T&gt;</em> in this list and be counted together.</p>
<h4>Merged Overloaded Symbols, sorted by total size and by total count</h4>
<p>This list is generated by stripping template parameters and function parameters from symbols and then merging duplicates.  Overloaded functions <em>sqrt(float)</em> and <em>sqrt(double)</em> will be transformed into <em>sqrt(&#8230;)</em> in this list and be counted together.</p>
<h4>Symbol Tags, sorted by total size and by total count</h4>
<p>This list represents a tag cloud generated from the symbol names.  The symbols are tokenized and the total size and count is tallied for each token.  I&#8217;m not sure what this list is good for, but I&#8217;m all about tag clouds so I couldn&#8217;t resist including it.</p>
<h3><span style="text-decoration: underline;">USAGE</span></h3>
<pre>SymbolSort [options]

Options:
  -in[:type] filename
      Specify an input file with optional type.  Exe and PDB files are
      identified automatically by extension.  Otherwise type may be:
          comdat - the format produced by DumpBin /headers
          sysv   - the format produced by nm --format=sysv
          bsd    - the format produced by nm --format=bsd --print-size

  -out filename
      Write output to specified file instead of stdout

  -count num_symbols
      Limit the number of symbols displayed to num_symbols

  -exclude substring
      Exclude symbols that contain the specified substring

  -diff:[type] filename
      Use this file as a basis for generating a differences report.
      See -in option for valid types.

  -searchpath path
      Specify the symbol search path when loading an exe</pre>
<p>SymbolSort supports three types of input files:</p>
<h4>COMDAT dump</h4>
<p>A COMDAT dump is generated using the DumpBin utility with the /headers option.  DumpBin is included with the Microsoft compiler toolchain. SymbolSort can accept the dump from a single .lib or .obj file, but the best way to use it is to create a complete dump of all the .obj files from an entire application.  The Windows command line utility FOR can be used for this:</p>
<pre>for /R "c:\obj_file_location" %n in (*.obj) do "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin\DumpBin.exe" /headers "%n" &gt;&gt; c:\comdat_dump.txt</pre>
<p>This will generate a concatenated dump of all the headers in all the .obj files in <em>c:\obj_file_location</em>.  Beware, for large applications this could produce a multi-gigabyte file.</p>
<h4>PDB or EXE</h4>
<p>SymbolSort supports reading debug symbol information from .exe files and .pdb files.  The .exe file will only be used to find the location of its matching .pdb file, and then the symbols will be extracted from the PDB.  SymbolSort uses msdia90.dll to extract data from the PDB file.  Msdia90.dll is included with the Microsoft compiler toolchain.  In order to use it you will probably have to register the dll.</p>
<pre>regsvr32 "C:\Program Files\Common Files\Microsoft Shared\VC\msdia90.dll"</pre>
<p>It is important that you register the 64-bit version of msdia90.dll on 64-bit Windows and the 32-bit version on 32-bit Windows.  If you don&#8217;t find msdia90.dll in the path listed above, try looking for it in the Visual Studio install directory under &#8220;\Microsoft Visual Studio 9.0\DIA SDK\bin\&#8221;</p>
<h4>NM dump</h4>
<p>Similar to the COMDAT dump, SymbolSort can accept symbol dumps from the unix utility nm.  The symbols can be extracted from .obj files or entire .elfs.  SymbolSort supports bsd and sysv format dumps.  Sysv is preferred because it contains more information.  The recommended nm command lines are:</p>
<pre>nm --format=sysv --demangle --line-numbers input_file.elf
nm --format=bsd --demangle --line-numbers --print-size input_file.elf</pre>
<h3><span style="text-decoration: underline;">DOWNLOAD</span></h3>
<p><span style="text-decoration: underline;"><a href="http://gameangst.com/wp-content/uploads/2009/09/SymbolSort-1.1.zip">SymbolSort-1.1.zip</a></span></p>
<h3><span style="text-decoration: underline;">BUILDING</span></h3>
<p>The source for SymbolSort is distributed as a single file, SymbolSort.cs.  It can be built as a simple C# command line utility.  In order to get the msdia90 interop to work you must add msdia90.dll as a reference to the C# project.  That is done either by dragging and dropping the dll onto the references folder in the C# project or by right clicking the references folder, selecting &#8220;Add Reference&#8221; and then browsing for the msdia90 dll.</p>
<h3><span style="text-decoration: underline;">REVISION HISTORY</span></h3>
<pre>    1.1    Added support for computing differences between multiple input sources
           Added support for nm output for PS3 / unix platforms
           Changed command line parameters.  See usage for details.
           Added section / type information to output.
    1.0    First release!</pre>
<h3><span style="text-decoration: underline;">FUTURE WORK</span> (to be done by someone else!)</h3>
<ul>
<li>Add a GUI frontend to allow interactive filtering and sorting.</li>
<li>Read both the PDB and the COMDAT dump simultaneously and cross-reference the two.  This would enable new kinds of analysis and richer dumps.</li>
<li>Produce additional merged symbol reports by merging all symbols from the same class or namespace or that match based on some more clever fuzzy comparison.</li>
<li>Improve relative -&gt; absolute path conversion for nm inputs</li>
<li>Figure out how to extract string literal information from PDB.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://gameangst.com/?feed=rss2&amp;p=320</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Minimizing Code Bloat: Redundant Template Instantiation</title>
		<link>http://gameangst.com/?p=246</link>
		<comments>http://gameangst.com/?p=246#comments</comments>
		<pubDate>Tue, 22 Sep 2009 22:01:54 +0000</pubDate>
		<dc:creator>Adrian Stone</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[code bloat]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">http://gameangst.com/?p=246</guid>
		<description><![CDATA[This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.
The last item on my list of code bloat causes is redundant template instantiation.  Template functions have a lot in common with inline functions.  Like inline functions, template functions are instantiated into the object file of every translation unit in which [...]]]></description>
			<content:encoded><![CDATA[<p><em>This article is a continuation of <a href="http://gameangst.com/?p=46">Minimizing Code Bloat for Faster Builds and Smaller Executables</a>.</em></p>
<p>The last item on my list of code bloat causes is redundant template instantiation.  Template functions have a lot in common with inline functions.  Like inline functions, template functions are instantiated into the object file of every translation unit in which they are used.  Also like inline functions, template functions generate weak symbols which are merged silently by the linker.  Redundant template instantiations won&#8217;t increase your executable size, but they can contribute significantly to your build times.</p>
<p>Consider a commonly used class like <em>std::string</em>.  The std::string class is actually a template class defined as:</p>
<p>
<div class="dean_ch" style="white-space: nowrap;"><span class="kw4">typedef</span> basic_string&lt;char, char_traits&lt;char&gt;, allocator&lt;char&gt; &gt; string;</div>
</p>
<p>In an engine that uses std::strings, some members like std::string::assign may be referenced in nearly every translation unit.  Even engines that eschew the use of std::string may unknowingly be instantiating a lot of std::string code because it is so pervasive in the standard library.  Looking at a symbol dump from one version of Despair Engine, I see over 1600 instances of dozens of functions from std::string.  That&#8217;s right, in every full build we compile most of std::string 1600 times, optimize the code 1600 times, write it into .obj files 1600 times, copy it into .lib files 1600 times, read it into the linker 1600 times, and then throw 1599 identical copies away!  All told I think it is less than 10 megabytes of code and accounts for only a couple percent of our build times, but still, it&#8217;s a lot to pay for a wrapper around a char*.</p>
<p>Don&#8217;t think that these sorts of problems are confined to the standard library.  Any template class that is widely used can produce a lot of code bloat from redundant instantiations.  The question is, what do you do about it?</p>
<p>One option is to convert template code into non template code.  If you never instantiate basic_string with anything other than a single set of parameters, is it really worth being a template class?  The same question could be asked of a templatized math library.  Does matrix&lt;T&gt; need to be a template class when all you ever use is matrix&lt;float&gt;?</p>
<p>Although removing template code is always an option, it isn&#8217;t always a good one.  Template code may be expensive, but it is also really handy.  If you have a templatized math library, chances are that while 99% of the instantiations are matrix&lt;float&gt;, somewhere in your engine or tools you have an instantiation of matrix&lt;double&gt;.  Similarly, amid all those instantiations of basic_string&lt;char&gt;, maybe there&#8217;s a basic_string&lt;wchar_t&gt; or two hanging out.</p>
<p>Luckily there is a way to prevent redundant template instantiations without changing the code.  That is through the use of explicit template instantiation declarations.  Explicit instantiation declarations aren&#8217;t technically part of the C++ standard, but they&#8217;re supported by <a href="http://msdn.microsoft.com/en-us/library/by56e477.aspx">MSVC</a> and <a href="http://gcc.gnu.org/onlinedocs/gcc/Template-Instantiation.html#Template-Instantiation">gcc</a> and they&#8217;re included in the C++09 draft standard, so they&#8217;re a safe bet to use.  What explicit instantiation declarations do is allow you to prevent the compiler from instantiating a particular template in the current translation unit.  They look just like explicit instantiation definitions, but they&#8217;re preceded by the extern keyword.  For example, the explicit instantiation declaration of basic_string&lt;char&gt; looks like this:</p>
<p>
<div class="dean_ch" style="white-space: nowrap;"><span class="kw4">extern</span> <span class="kw2">template</span> <span class="kw2">class</span> basic_string&lt;char&gt;;</div>
</p>
<p>You can add this declaration to a root header in your application and rebuild, and, unless you&#8217;ve missed a translation unit, you&#8217;ll encounter linker errors for missing basic_string&lt;char&gt; symbols.  Once you&#8217;ve removed all the basic_string&lt;char&gt; instantiations from your engine, the next thing to do is to add one back so the linker has all the code it needs to put together a complete executable.  That&#8217;s where explicit instantiation definitions come in.</p>
<p>Explicit instantiation definitions are the counterpart to explicit instantiation declarations.  They tell the compiler to fully instantiate a particular template even if it is otherwise unused in the current translation unit.  Here&#8217;s an example of how you can use the two concepts together to prevent redundant instantiations of basic_string.</p>
<blockquote>
<div class="dean_ch" style="white-space: nowrap;"><span class="co1">// in string_instantiation.h</span><br />
<span class="co2">#ifdef DS_EXTERN_TEMPLATE_INSTANTIATION</span><br />
<span class="kw2">namespace</span> std<br />
<span class="br0">&#123;</span><br />
&nbsp; &nbsp; <span class="kw4">extern</span> <span class="kw2">template</span> <span class="kw2">class</span> basic_string&lt;char&gt;;<br />
<span class="br0">&#125;</span><br />
<span class="co2">#endif</span></p>
<p><span class="co1">// in string_instantiation.cpp</span><br />
<span class="co2">#ifdef DS_EXTERN_TEMPLATE_INSTANTIATION</span><br />
<span class="co2">#undef DS_EXTERN_TEMPLATE_INSTANTIATION</span></p>
<p><span class="co2">#include &quot;string_instantiation.h&quot;</span></p>
<p><span class="kw2">namespace</span> std<br />
<span class="br0">&#123;</span><br />
&nbsp; &nbsp; <span class="kw2">template</span> <span class="kw2">class</span> basic_string&lt;char&gt;<br />
<span class="br0">&#125;</span><br />
<span class="co2">#endif // DS_EXTERN_TEMPLATE_INSTANTIATION</span></div>
</blockquote>
<p>The DS_EXTERN_TEMPLATE_INSTANTIATION <em>#define</em> in the above code serves two purposes.  First it allows us to disable explicit template instantiation for compilers that don&#8217;t support it, and second it allows us to work around a common bug in compilers that do support it.  Although the standard states that an explicit instantiation definition may follow an explicit instantiation declaration within a translation unit, many compilers don&#8217;t like it and won&#8217;t allow an explicit instantiation definition to override a preceding explicit instantiation declaration.</p>
<p>There is one dark fact about explicit template instantiation that you should know before applying it in your own codebase.  Although explicit template instantiation effectively turns template code into non template code from the perspective of code bloat, it doesn&#8217;t really help much with classes like std::string.  The syntax for out-of-line template class function definitions is so cumbersome that many programmers prefer to implement all their template class members entirely within the class declaration.  That&#8217;s what has happened with Microsoft&#8217;s STL implementation.  Unfortunately what that does is implicitly make every member function in a template class an inline function.  Of course most of the member functions are so large that they&#8217;ll never be inlined, but it doesn&#8217;t matter from the perspective of the compiler.</p>
<p>The explicit template declaration of basic_string&lt;char&gt; prevents the compiler from instantiating its members as template functions, but it doesn&#8217;t prevent it from instantiating them as inline functions!  The basic_string class goes from a classic example of code bloat due to redundant template instantiation to a classic example of code bloat due to <a href="http://gameangst.com/?p=224">incorrect inlining</a>!  You can&#8217;t win!</p>
<p>In your own code, at least, you can fix the incorrect inlining problem right after you fix the redundant instantiation problem.  Just move the template class member functions out of line and they&#8217;ll disappear from your symbol dumps.</p>
<p>And that&#8217;s everything I know about code bloat in a voluminous 6 part series.  Perhaps what I should be writing about is blog bloat.  All that&#8217;s left is to <a href="http://gameangst.com/?p=320">post some code</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gameangst.com/?feed=rss2&amp;p=246</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Minimizing Code Bloat: Incorrect Inlining</title>
		<link>http://gameangst.com/?p=224</link>
		<comments>http://gameangst.com/?p=224#comments</comments>
		<pubDate>Mon, 21 Sep 2009 15:19:35 +0000</pubDate>
		<dc:creator>Adrian Stone</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[code bloat]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">http://gameangst.com/?p=224</guid>
		<description><![CDATA[This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.
Earlier I talked about excessive inlining as a common cause of code bloat.  Excessive inlining is when code that shouldn&#8217;t be inlined is.  Today I&#8217;m going to look at a related problem, code that is declared as inlined but never will [...]]]></description>
			<content:encoded><![CDATA[<p><em>This article is a continuation of <a href="http://gameangst.com/?p=46">Minimizing Code Bloat for Faster Builds and Smaller Executables</a>.</em></p>
<p>Earlier I talked about <a href="http://gameangst.com/?p=212">excessive inlining</a> as a common cause of code bloat.  Excessive inlining is when code that shouldn&#8217;t be inlined is.  Today I&#8217;m going to look at a related problem, code that is declared as inlined but never will be.</p>
<p>Any function whose address is taken for storage in a function pointer or a virtual function table can never fully be inlined.  Even for a function that can legally be inlined, the compiler has ultimate authority over whether or not it will be.  If a function is large the compiler will generally forgo inlining it because the performance benefits of removing the function call overhead don&#8217;t justify the increase in executable size.  That&#8217;s all well and good, but it still poses a problem.</p>
<p>C++ compilers work by dividing programs into something called <a href="http://stackoverflow.com/questions/1106149/what-is-a-translation-unit-in-c">translation units</a>.  A translation unit is a single cpp file and all the headers it includes.  The compiler turns these translation units into object files and ultimately the linker combines the contents of all the object files into an executable.  The important thing to note is that every function visible to the compiler in a translation unit must be compiled as part of that unit.  In the case of a function that is declared inline, the compiler will generate a compiled instantiation of the function in every cpp file that uses it.  The various redundant instantiations are marked as <a href="http://en.wikipedia.org/wiki/Weak_symbol">weak symbols</a> so the linker knows that multiple copies are expected and that it can pick any one of them for use in the final executable.</p>
<p>Knowing this, you can see how wasteful functions that are incorrectly marked as inline can be.  Large functions are less likely to be inlined, are more likely to take a long time to compile, and consequently can be murder on your build times.</p>
<p>By this point you should have a pretty good idea of how to detect incorrect inlining.  The same technique I proposed for measuring excessive inlining and the compile-time cost of template overspecialization works equally well for detecting incorrect inlining.  Dump the symbols from all your object files with <em>DumpBin /headers</em> and look for large functions with lots of redundant instantiations.</p>
<p>One of the most interesting (and insidious!) sources of incorrect inlining is code that you don&#8217;t write at all.  Consider this apparently harmless class definition:</p>
<blockquote>
<div class="dean_ch" style="white-space: nowrap;"><span class="kw2">class</span> DataBox<br />
<span class="br0">&#123;</span><br />
&nbsp; &nbsp; PhysicsData &nbsp; &nbsp; m_physics;<br />
&nbsp; &nbsp; AIData &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;m_ai;<br />
&nbsp; &nbsp; GraphicsData &nbsp; &nbsp;m_graphics;<br />
&nbsp; &nbsp; AnimationData &nbsp; m_animation;<br />
&nbsp; &nbsp; UIData &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;m_ui;<br />
&nbsp; &nbsp; ScriptData &nbsp; &nbsp; &nbsp;m_script;<br />
&nbsp; &nbsp; NetworkData &nbsp; &nbsp; m_network;<br />
<span class="br0">&#125;</span>;</div>
</blockquote>
<p>This class is potentially a major contributor to code bloat from incorrect inlining.  If you&#8217;re scratching your head and wondering how a class with no functions can contribute to code bloat, take a look at the next listing to see this class the way the compiler sees it.</p>
<blockquote>
<div class="dean_ch" style="white-space: nowrap;"><span class="kw2">class</span> DataBox<br />
<span class="br0">&#123;</span><br />
&nbsp; &nbsp; PhysicsData &nbsp; &nbsp; m_physics;<br />
&nbsp; &nbsp; AIData &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;m_ai;<br />
&nbsp; &nbsp; GraphicsData &nbsp; &nbsp;m_graphics;<br />
&nbsp; &nbsp; AnimationData &nbsp; m_animation;<br />
&nbsp; &nbsp; UIData &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;m_ui;<br />
&nbsp; &nbsp; ScriptData &nbsp; &nbsp; &nbsp;m_script;<br />
&nbsp; &nbsp; NetworkData &nbsp; &nbsp; m_network;</p>
<p>&nbsp; &nbsp; <span class="co1">// Not legal C++. &nbsp;This is for illustration purposes only.</span><br />
&nbsp; &nbsp; DataBox<span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; <span class="br0">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; m_physics.<span class="me1">PhysicsData</span><span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_ai.<span class="me1">AIData</span><span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_graphics.<span class="me1">GraphicsData</span><span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_animation.<span class="me1">AnimationData</span><span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_ui.<span class="me1">UIData</span><span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_script.<span class="me1">ScriptData</span><span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_network.<span class="me1">NetworkData</span><span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; <span class="br0">&#125;</span></p>
<p>&nbsp; &nbsp; ~DataBox<span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; <span class="br0">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; m_network.~NetworkData<span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_script.~ScriptData<span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_ui.~UIData<span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_animation.~AnimationData<span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_graphics.~GraphicsData<span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_ai.~AIData<span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; m_physics.~PhysicsData<span class="br0">&#40;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; <span class="br0">&#125;</span><br />
<span class="br0">&#125;</span>;</div>
</blockquote>
<p>In C++ every class has a constructor, a destructor, a copy constructor, and an assignment operator.  If you don&#8217;t provide these functions the compiler will, and it must do so inline.  Of course good compilers will only instantiate inline functions when they are actually used, so most classes never have their default copy constructor or assignment operators generated, but unless you&#8217;re doing something dangerously clever in your codebase, every class has a destructor instantiated at least once.</p>
<p>If you notice default (also sometimes called automatic) methods showing up in the top slots of your redundant function hotlist, the fix is counterintuitive but also really simple.  Just provide an empty, non-inline implementation in place of the default one.  Providing empty, non-inline implementations of default methods for nontrivial classes can sometimes help your build times in other ways.  If you use a lot of smart pointers in your code, the default destructor of a class holding smart pointers may be the only thing preventing you from replacing the #includes of the headers for the classes stored in smart pointers with forward declarations of those classes.  In other words, inline functions create header dependencies, and inline default methods are no exception.</p>
<p>Speaking of <a href="http://en.wikipedia.org/wiki/Smart_pointer">smart pointers</a>, as great as they are, code bloat is one of their unfortunate costs.  As I mentioned above, good compilers only instantiate inline code when it is actually used.  Reference counting smart pointers allow for distributed ownership of objects, which can be a powerful and convenient design freedom.  However, if responsibility for destroying objects is widely distributed through your code, so is responsibility for instantiating object destructors.  An engine with rigorously defined ownership hierarchies will have few inline destructor instantiations whereas an engine with sloppily defined ownership hierarchies will have many.  So if you use smart pointers, use them responsibly!  Pass by reference and reference count only when necessary.  Unnecessary reference counting isn&#8217;t just inefficient at run time, it is inefficient at compile time too!</p>
<p>If all this sounds like a major pain in the neck, I suggest you give up now and implement <a href="http://buffered.io/2007/12/10/the-magic-of-unity-builds/">unity builds</a>.  They circumvent all these redundant instantiation issues so they&#8217;re fantastic at optimizing build times in codebases with poor structure.  On the other hand, if you&#8217;re still enjoying this trek through compile land, join me next time for a discussion of <a href="http://gameangst.com/?p=246">redundant template instantiation</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gameangst.com/?feed=rss2&amp;p=224</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Minimizing Code Bloat: Static Allocations</title>
		<link>http://gameangst.com/?p=222</link>
		<comments>http://gameangst.com/?p=222#comments</comments>
		<pubDate>Sun, 20 Sep 2009 16:23:36 +0000</pubDate>
		<dc:creator>Adrian Stone</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[code bloat]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">http://gameangst.com/?p=222</guid>
		<description><![CDATA[This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.
Static allocations are certainly the simplest and most predictable sources of code bloat.  Static allocations refer to data structures in a program for which storage is allocated for the entire duration of the application&#8217;s execution.   In C++ file, class, and [...]]]></description>
			<content:encoded><![CDATA[<p><em>This article is a continuation of <a href="http://gameangst.com/?p=46">Minimizing Code Bloat for Faster Builds and Smaller Executables</a>.</em></p>
<p>Static allocations are certainly the simplest and most predictable sources of code bloat.  Static allocations refer to data structures in a program for which storage is allocated for the entire duration of the application&#8217;s execution.   In C++ file, class, and local static variables all require static allocations.  Additional examples of static allocations in C++ are global variables, string literals, and virtual function tables.</p>
<p>Static allocations can be very useful in some circumstances, but they can also be very wasteful.  Most of the code in an engine is executed very infrequently, which means most of the static allocations supporting that code are rarely needed.  Unfortunately since static allocations exist for the lifetime of the program, they&#8217;re taking up memory whether you&#8217;re using them or not.</p>
<p>In addition to possibly wasting memory, static allocations can increase program link and load times.  The memory for most static allocations is reserved directly within the application binary.  Consequently, every megabyte of static allocation is a megabyte that must be written to disk every time you link, copied across the network every time you deploy, and read from disk every time you launch.</p>
<p>The one exception to this is a special kind of static allocation known as <a href="http://en.wikipedia.org/wiki/BSS">bss</a> data.  Static allocations that are known at compile time to be zero-initialized are placed in a special section of the executable called the bss section.  Memory used by objects in the bss section isn&#8217;t reserved in the executable, instead it is allocated by the program loader.  Bss data doesn&#8217;t significantly impact build times, but it has the same run time memory requirements as other static allocations.</p>
<p>In the list of code bloat causes, static allocations are usually a very distant third, but every engine I&#8217;ve investigated for static allocations has yielded a few unpleasant surprises.</p>
<p>Dumpbin isn&#8217;t of much help when it comes to tracking static allocations.  Data symbols aren&#8217;t packaged as COMDATs, so they don&#8217;t show up in the <em>/headers</em> report I&#8217;m so fond of.   To gather statistics on the static allocations within your application, you&#8217;ll have to crack open the PDB.  Every static and global variable is documented in the program database along with its size and, through some creative cross-referencing, its source file.</p>
<p>Large static objects are easy to find&#8211;just sort the data symbols by size and gawk at the top offenders.  Don&#8217;t stop there though.  If you generate a lot of code with templates or macros, you may have large numbers of relatively small static objects.  To quantify the cost of these objects you&#8217;ll need to do some pattern matching on symbol names.  You can strip template parameters from symbols so <em>MyClass&lt;int&gt;::s_buffer</em> and <em>MyClass&lt;float&gt;::s_buffer</em> get counted together in the stats, or you can strip away the scope entirely and tally the stats for all statics named <em>s_buffer</em> together.  I find it useful to analyze the results from collapsing symbol names in a variety of different ways.  Which option yields the best results depends on the particular codebase I&#8217;m dealing with and the type of code generation it uses.</p>
<p>That covers it for static allocations and for executable size optimizations in general.  I still have a couple topics to go, but from now on they&#8217;re relevant to build times alone.  Coming up, <a href="http://gameangst.com/?p=224">incorrect inlining</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gameangst.com/?feed=rss2&amp;p=222</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Minimizing Code Bloat: Excessive Inlining</title>
		<link>http://gameangst.com/?p=212</link>
		<comments>http://gameangst.com/?p=212#comments</comments>
		<pubDate>Sat, 19 Sep 2009 23:45:53 +0000</pubDate>
		<dc:creator>Adrian Stone</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[code bloat]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">http://gameangst.com/?p=212</guid>
		<description><![CDATA[This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.
When a function is inlined its code is, from the perspective of the compiler, replicated to every place where the function is called.  For very simple functions the inlined code can be smaller than the out-of-line function call, and the result [...]]]></description>
			<content:encoded><![CDATA[<p><em>This article is a continuation of <a href="http://gameangst.com/?p=46">Minimizing Code Bloat for Faster Builds and Smaller Executables</a>.</em></p>
<p>When a function is inlined its code is, from the perspective of the compiler, replicated to every place where the function is called.  For very simple functions the inlined code can be smaller than the out-of-line function call, and the result is a net reduction in code size.  For most functions, however, the function code is larger than the function call and inlining will increase executable size.</p>
<p>Code bloat from excessive inlining usually isn&#8217;t a problem if you&#8217;re smart enough to trust your compiler.  Compilers apply a set of heuristics to decide whether the performance benefits of inlining a function justify the costs.  If the cost isn&#8217;t justified, the compiler won&#8217;t inline.  Additionally, most compilers err on the side of not inlining.  Unless you&#8217;ve overridden the default compiler behavior with a <a href="http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx" target="_blank">__forceinline</a> directive, you can safely assume that any code bloat that comes from inlining is justified by improved performance.</p>
<p>Still, you can never be too careful, so it&#8217;s worth checking to see how much inlining is costing you.</p>
<p>One technique for measuring the cost of inlining is compiling your engine with inlining enabled and again with inlining disabled, and comparing the size of the resulting executables.  If the executable without inlining is a lot smaller, you&#8217;re probably inlining too much.  Even if your executable isn&#8217;t significantly smaller with inlining disabled, you may still be suffering from excessive inlining.  In some cases inlining a function can result in reduced code size because the inlined code opens up optimization opportunities that were otherwise invisible to the compiler.  Code size reductions from well inlined functions can offset code size increases from poorly inlined functions, making quick executable size comparisons unreliable.</p>
<p>A better way to identify excessive inlining is to use DumpBin on the .obj files of your application.  Even with inlining enabled, inline functions will be instantiated into every object file in which they are used.  All those duplicate instantiations are merged or removed at link time, but if you run DumpBin on the .lib or .obj files you can measure how many object files the inlined functions are being replicated into.  This is similar to the technique I proposed to measure the compile-time cost of <a href="http://gameangst.com/?p=226">template overspecialization</a>.  Sort the database of symbols extracted from DumpBin by <em>symbol_size</em> * <em>symbol_count</em>, where <em>symbol_count</em> is the number of object files in which a particular symbol appears, and skim through the top 100.   This isn&#8217;t a perfect measure of code bloat from function inlining, but it can usually catch the worst offenders.</p>
<p>You can improve the report I described above by discarding symbols with sizes below some threshold.  These will be functions whose inlined code is smaller than an out-of-line function call and are consequently not a factor in code bloat.  Likewise, if you haven&#8217;t been abusing the <em>__forceinline </em>directive, you can discard symbols with sizes above some threshold.  Very large functions will never automatically be inlined by the compiler, so they don&#8217;t increase the size of the executable.  They do, however, contribute unfavorably to compile and link times, but I&#8217;ll talk more about that later.</p>
<p>One last thing&#8211;be sure to review the source for any function you&#8217;re considering deinlining.  Consider carefully how the compiler sees the function in isolation and in the context in which it is inlined.  Are the function parameter values typically known at compile time at the call site?  Is the ratio of the number of times the function is executed in a frame to the number of times the function is called in the source very large?  If the answer to either of these questions is yes, the function may be a good candidate for inlining regardless of what the symbol analysis says.</p>
<p>Next I&#8217;ll be looking at <a href="http://gameangst.com/?p=222">static allocations</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gameangst.com/?feed=rss2&amp;p=212</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Minimizing Code Bloat: Template Overspecialization</title>
		<link>http://gameangst.com/?p=226</link>
		<comments>http://gameangst.com/?p=226#comments</comments>
		<pubDate>Fri, 18 Sep 2009 06:12:29 +0000</pubDate>
		<dc:creator>Adrian Stone</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[code bloat]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">http://gameangst.com/?p=226</guid>
		<description><![CDATA[This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.
Even in an engine that isn&#8217;t rife with template meta-programming, template overspecialization can be a major source of code bloat.  Very few programmers, in my experience, really think about what the compiler and linker are doing with the code they write. [...]]]></description>
			<content:encoded><![CDATA[<p><em>This article is a continuation of <a href="http://gameangst.com/?p=46">Minimizing Code Bloat for Faster Builds and Smaller Executables</a>.</em></p>
<p>Even in an engine that isn&#8217;t rife with template meta-programming, template overspecialization can be a major source of code bloat.  Very few programmers, in my experience, really think about what the compiler and linker are doing with the code they write.  They design their code in C++, they write their code in C++, and of course they debug their code in C++.  Their view of the programs they write begins and ends at that level, which is certainly understandable since there is more than enough to worry about in C++ alone.</p>
<p>Luckily for us, C++ is a language whose design is very representative of the underlying hardware.  Most of the time it is perfectly reasonable to assume that the code you write in C++ bears a close resemblance to the machine code generated by the compiler, at least in terms of quantity, structure, and performance.  The one place where that assumption breaks down, however, is in template code.  A very large amount of template code can result in little or no machine code and, conversely, a very small amount of template code can generate an enormous volume of machine code.</p>
<p>Consider this snippet from a hypothetical class implementing a Win32 file stream:</p>
<blockquote>
<div class="dean_ch" style="white-space: nowrap;"><span class="kw2">class</span> Stream<br />
<span class="br0">&#123;</span><br />
&nbsp; &nbsp; <span class="co1">// &#8230;</span><br />
&nbsp; &nbsp; <span class="kw2">template</span> &lt;typename T&gt;<br />
&nbsp; &nbsp; <span class="kw4">int</span> Read<span class="br0">&#40;</span>T* obj, uint numObjects<span class="br0">&#41;</span><br />
&nbsp; &nbsp; <span class="br0">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> <span class="br0">&#40;</span>!m_open<span class="br0">&#41;</span> <span class="kw1">return</span> <span class="nu0">0</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> <span class="br0">&#40;</span>m_accessMode == ACCESS_WO<span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="nu0">0</span>;</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; uint32 bytesRead = <span class="nu0">0</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; uint32 bytesRemaining = numObjects * <span class="kw3">sizeof</span><span class="br0">&#40;</span>T<span class="br0">&#41;</span>;</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw4">char</span>* readBuffer = <span class="kw2">reinterpret_cast</span><span class="br0">&#40;</span>obj<span class="br0">&#41;</span>;</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">while</span> <span class="br0">&#40;</span>bytesRemaining &gt; <span class="nu0">0</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw4">const</span> uint32 kMaxReadSize = <span class="nu0">32</span>*<span class="nu0">1024</span>*<span class="nu0">1024</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; uint32 bytesThisRequest = std::<span class="me2">min</span><span class="br0">&#40;</span>bytesRemaining, kMaxReadSize<span class="br0">&#41;</span>;</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; uint32 bytesReadThisRequest = <span class="nu0">0</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw4">bool</span> success = ::<span class="me2">ReadFile</span><span class="br0">&#40;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; m_hFile,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; readBuffer,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; bytesThisRequest,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &amp;bytesReadThisRequest,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">NULL</span><span class="br0">&#41;</span> != <span class="kw2">FALSE</span>;</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; bytesRemaining -= bytesReadThisRequest;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; readBuffer += bytesReadThisRequest;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; bytesRead += bytesReadThisRequest;</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> <span class="br0">&#40;</span>bytesReadThisRequest == <span class="nu0">0</span> ||<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; !success<span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">break</span>;<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; m_currentPosition += bytesRead;<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> bytesRead;<br />
&nbsp; &nbsp; <span class="br0">&#125;</span><br />
&nbsp; &nbsp; <span class="co1">// &#8230;</span><br />
<span class="br0">&#125;</span>;</div>
</blockquote>
<p>The author of this function is attempting to ease the burden on the calling code by removing the need to specify the size of the object being read into.  This simplifies the interface and removes a potential source of errors.  Unfortunately, the way the function is written is horribly inefficient from a compilation and code size perspective.  The entire contents of the function, some thirty lines of code, have been specialized for every type that is read from a stream.  That&#8217;s a lot of code for the compiler to process, particularly when you consider only one line of code in the entire function is actually dependent on the template parameter.</p>
<p>The story from the linker&#8217;s perspective is a little better.  For objects with the same size, say an <em>unsigned short</em> and a <em>short</em>, the compiler will generate identical code for both instantiations and any decent linker will happily merge those in the executable.  Nevertheless, code deduplication is a slow process so there is no reason to rely on it if you don&#8217;t have to.  Also, even with code deduplication most linkers won&#8217;t do partial function deduplication, so you&#8217;re still left with unique copies of this function in your executable for every size of serialized type in your application.</p>
<p>The compiler-friendly way to write a function like the one above is, of course:</p>
<blockquote>
<div class="dean_ch" style="white-space: nowrap;"><span class="kw2">class</span> Stream<br />
<span class="br0">&#123;</span><br />
&nbsp; &nbsp; <span class="co1">// &#8230;</span><br />
&nbsp; &nbsp; <span class="kw4">int</span> ReadRaw<span class="br0">&#40;</span><span class="kw4">void</span>* buffer, uint bufferSize<span class="br0">&#41;</span>;</p>
<p>&nbsp; &nbsp; <span class="kw2">template</span> &lt;typename T&gt;<br />
&nbsp; &nbsp; <span class="kw4">int</span> Read<span class="br0">&#40;</span>T* obj, uint numObjects<span class="br0">&#41;</span><br />
&nbsp; &nbsp; <span class="br0">&#123;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> ReadRaw<span class="br0">&#40;</span>obj, numObjects * <span class="kw3">sizeof</span><span class="br0">&#40;</span>T<span class="br0">&#41;</span><span class="br0">&#41;</span>;<br />
&nbsp; &nbsp; <span class="br0">&#125;</span><br />
&nbsp; &nbsp; <span class="co1">// &#8230;</span><br />
<span class="br0">&#125;</span>;</div>
</blockquote>
<p>Now only one line of code is specialized for each serialized type and the meat of Stream::Read will only be seen once by both the compiler and the linker.</p>
<p>The example above may seem contrived, and you&#8217;re probably thinking there is no way code like that could end up in your engine.  That&#8217;s certainly the way I felt before I started digging through our .obj files.  To find out if your optimism is justified, take a look at the output of DumpBin I mentioned <a href="http://gameangst.com/?p=46">earlier</a>.</p>
<p>Run DumpBin on the libraries linked by your application and concatenate the results.  Next process the output to build a database of unique symbols recording both their individual sizes and their repeat counts.  If you want to see where your compiler is spending its time, look at the sum of <em>symbol_size * symbol_count</em> for all template symbols regardless of what types they are templatized on.  If you want to see where your linker is spending its time or what&#8217;s contributing to your executable size, ignore the symbol counts and simply take the sum of the symbol sizes for every symbol from the same template.</p>
<p>Once you have a list of top offenders, the next thing to do is to inspect their code and determine whether their degree of specialization is really justified.  In most cases it is pretty easy to turn template functions into non-template functions or at least to factor out whatever portions of the code aren&#8217;t directly dependent on the template parameters.</p>
<p>When you&#8217;re tired of looking at template code, it&#8217;s time to turn your attention to <a href="http://gameangst.com/?p=212">excessive inlining</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gameangst.com/?feed=rss2&amp;p=226</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Minimizing Code Bloat for Faster Builds and Smaller Executables</title>
		<link>http://gameangst.com/?p=46</link>
		<comments>http://gameangst.com/?p=46#comments</comments>
		<pubDate>Fri, 18 Sep 2009 06:02:59 +0000</pubDate>
		<dc:creator>Adrian Stone</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[code bloat]]></category>
		<category><![CDATA[Despair Engine]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">http://gameangst.com/?p=46</guid>
		<description><![CDATA[No matter how much code you have, somebody always wants more.  That&#8217;s certainly been the case with every game and every engine I&#8217;ve worked on.  My good friend Kyle, who has for years been tracking all sorts of wonderful statistics about the development of Despair Engine, recently announced that we&#8217;d surpassed 1.5 million lines of [...]]]></description>
			<content:encoded><![CDATA[<p>No matter how much code you have, somebody always wants more.  That&#8217;s certainly been the case with every game and every engine I&#8217;ve worked on.  My good friend <a href="http://gamearchitect.net/about/">Kyle</a>, who has for years been tracking all sorts of wonderful statistics about the development of <a href="http://gamearchitect.net/category/despair-engine/">Despair Engine</a>, recently announced that we&#8217;d surpassed 1.5 million lines of principle engine code.  That doesn&#8217;t count build systems, tools, tests, middleware, or scripts; that&#8217;s 100% Despair-original C++ code the compiler has to wade through every time we build the game.</p>
<p>As if the code we write weren&#8217;t bad enough, our engine is particularly adept at <a href="http://gameangst.com/?p=107">generating code</a>.  Those 1.5 MLOC result in a hefty 37 megabyte executable on the Xbox 360.  Kyle estimates that we&#8217;ll exceed 2 million lines of code before the end of this console generation.  We already have an executable that can&#8217;t fit into main memory on a PS2 or a Gamecube.  By 2014 it won&#8217;t fit onto an Xbox or Wii either.  I don&#8217;t think we&#8217;re in danger of exceeding the capacity of a PS3 or Xbox 360, but the prospect is enough to keep me up at nights.</p>
<p>Looking back at the graphs of our executable size and build and link times, I&#8217;m proud to see a few dips punctuating the otherwise consistent upward march.  At least a few of those dips correspond to weeks when I turned my sleepless nights into successful assaults on code bloat.  What&#8217;s particularly interesting about those dips is that they don&#8217;t correspond to any drops in our code line count.  In my biggest success I reduced the size of our executable by 25% or over 7 megabytes.  At the same time I cut our link times by 19%, and yet I did it all without removing a single line of code!  How?</p>
<p>The first place to start when trimming a codebase is figuring out where to look.  In an engine of 1.5 million lines of code there are a lot of places to hide.  My strategy when dealing with problems like this is to gather as much data as possible into a format that is amenable to analysis.  For code this means extracting symbols from the compiled binaries along with their size, type, source file, and any other details you can get your hands on.  Once you have a dump of all the symbols in your compiled code, it is a pretty simple matter to write a script to collate it and generate whatever statistics you find useful.</p>
<p>There are two methods I use for extracting data from compiled code.  The first is a quick and handy utility called DumpBin.  <a href="http://msdn.microsoft.com/en-us/library/c1h23y6c(VS.71).aspx" target="_blank">DumpBin</a> is part of the Visual Studio toolchain and is available for all Microsoft targets.  There are similar tools available for every compiler I&#8217;ve encountered.  The gcc equivalents, for example, are <a href="http://linux.die.net/man/1/objdump">objdump</a> and <a href="http://linux.die.net/man/1/nm">nm</a>.  These tools are useful because they can work with just about any compiled binary.  They target standard binary formats like <a href="http://en.wikipedia.org/wiki/COFF">PE/COFF</a> and <a href="http://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF</a>, and consequently they work on object files, library files, executables and DLLs.  The ability to extract information from intermediate files in the build process is essential if you&#8217;re worried as much about build times as you are about final memory usage.</p>
<p>My preferred way to use DumpBin is to specify the <em>/headers</em> option to extract COMDATs from object files compiled with the <a href="http://msdn.microsoft.com/en-us/library/xsa71f43(VS.80).aspx" target="_blank">/Gy</a> option.  For example:</p>
<p style="padding-left: 30px; "><em>for /r . %n in (*.obj) do DumpBin /headers &#8220;%n&#8221; &gt;&gt; all_headers.txt</em></p>
<p><em></em>A portion of the output from such a dump looks like this:</p>
<pre>SECTION HEADER #14E
   .text name
       0 physical address
       0 virtual address
       7 size of raw data
   22D6B file pointer to raw data (00022D6B to 00022D71)
   22D72 file pointer to relocation table
       0 file pointer to line numbers
       1 number of relocations
       0 number of line numbers
60501020 flags
         Code
         COMDAT; sym= "public: virtual __thiscall Despair::Reflection::Object::~Object(void)" (??1Object@Reflection@Despair@@UAE@XZ)
         16 byte align
         Execute Read

SECTION HEADER #14F
   .text name
       0 physical address
       0 virtual address
       5 size of raw data
   230B1 file pointer to raw data (000230B1 to 000230B5)
       0 file pointer to relocation table
       0 file pointer to line numbers
       0 number of relocations
       0 number of line numbers
60501020 flags
         Code
         COMDAT; sym= "public: int __thiscall Despair::BroadcasterReceipt::AddRef(void)" (?AddRef@BroadcasterReceipt@Despair@@QAEHXZ)
         16 byte align
         Execute Read</pre>
<p>As you can see, the result is an easily parsable list of packaged functions with symbol names and sizes clearly specified.  As great as the <em>/headers </em>dump is, it isn&#8217;t perfect.  Data symbols aren&#8217;t packaged as COMDATs and although DumpBin can extract data from executables, executables don&#8217;t contain COMDATs so there isn&#8217;t much to see.</p>
<p>To analyze the contents of an executable, a second approach is needed.  My solution is to reference the debugging information provided by the linker.  With Microsoft&#8217;s tools, debugging information is stored in a <a href="http://en.wikipedia.org/wiki/Program_database">PDB</a> file that accompanies the executable.  Similar to DumpBin&#8217;s <em>/headers</em> dump, the PDB contains information about symbols in the compiled binary.  Unlike the <em>/headers</em> dump, however, the PDB file contains data as well as code symbols, and the information it contains is post-linker optimization which is important if you&#8217;re more worried about memory consumption than compile and link times.</p>
<p>There aren&#8217;t any standard tools for dumping the contents of a PDB, but there is a DLL provided with the Microsoft toolchain that makes it pretty easy to write your own.  I&#8217;ve used the <a href="http://msdn.microsoft.com/en-us/library/x93ctkx8(VS.80).aspx">Debug Interface Access SDK</a> included with Microsoft Visual Studio to plumb the contents of an executable and generate symbol tables that can be analyzed alongside a COMDAT dump.</p>
<p>In an upcoming series of articles I&#8217;m going to explore the most common causes of C++ code bloat and take a closer look at how you can use DumpBin and the PDB to identify and remove the worst offenders from your code.  When I&#8217;m done I&#8217;ll share some sample code which you can use to assess the level of code bloat in your own engines.</p>
<p>First up, <a href="http://gameangst.com/?p=226">template overspecialization</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gameangst.com/?feed=rss2&amp;p=46</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
