Minimizing Code Bloat: Excessive Inlining

This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.

When a function is inlined, its code is, from the perspective of the compiler, replicated to every place where the function is called.  For very simple functions the inlined code can be smaller than the out-of-line function call, and the result is a net reduction in code size.  For most functions, however, the function code is larger than the function call, and inlining will increase executable size.

Code bloat from excessive inlining usually isn’t a problem if you’re smart enough to trust your compiler.  Compilers apply a set of heuristics to decide whether the performance benefits of inlining a function justify the costs.  If the cost isn’t justified, the compiler won’t inline.  Additionally, most compilers err on the side of not inlining.  Unless you’ve overridden the default compiler behavior with a __forceinline directive, you can safely assume that any code bloat that comes from inlining is justified by improved performance.
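
To make that concrete, here is a hypothetical illustration (the class and its members are invented for this example): the first function is the kind the compiler will inline on its own because the body is no bigger than the call would be, while the __forceinline on the second overrides the compiler’s cost model and copies a sizable body into every call site.

class Entity
{
public:
    // Trivial accessor: the inlined body is no larger than a function call,
    // so the compiler's own heuristics will inline it without any prompting.
    float GetHealth() const { return m_health; }

    // A meatier routine.  __forceinline bypasses the compiler's cost model and
    // replicates this whole body into every call site; unless profiling shows
    // the call overhead actually matters, the keyword is pure code bloat.
    __forceinline void Integrate(float dt)
    {
        m_velocity[0] += m_acceleration[0] * dt;
        m_velocity[1] += m_acceleration[1] * dt;
        m_velocity[2] += m_acceleration[2] * dt;
        m_position[0] += m_velocity[0] * dt;
        m_position[1] += m_velocity[1] * dt;
        m_position[2] += m_velocity[2] * dt;
        // ...imagine the rest of a real integrator here...
    }

private:
    float m_health;
    float m_position[3];
    float m_velocity[3];
    float m_acceleration[3];
};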

Still, you can never be too careful, so it’s worth checking to see how much inlining is costing you.

One technique for measuring the cost of inlining is to compile your engine once with inlining enabled and again with inlining disabled, and compare the sizes of the resulting executables.  If the executable without inlining is a lot smaller, you’re probably inlining too much.  Even if your executable isn’t significantly smaller with inlining disabled, you may still be suffering from excessive inlining.  In some cases inlining a function can reduce code size because the inlined code opens up optimization opportunities that were otherwise invisible to the compiler.  Code size reductions from well-inlined functions can offset code size increases from poorly inlined functions, making quick executable size comparisons unreliable.
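
With Microsoft’s compiler the two builds are easy to produce: /Ob0 disables inline expansion entirely, while /Ob2 (implied by /O2) leaves the compiler free to inline as it sees fit.  Something along these lines, with the file and output names as placeholders:

rem Normal optimized build; inlining is left to the compiler (/O2 implies /Ob2).
cl /O2 engine.cpp /Feengine_inline.exe

rem Identical build with inline expansion disabled, for size comparison only.
cl /O2 /Ob0 engine.cpp /Feengine_noinline.exe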

A better way to identify excessive inlining is to use DumpBin on the .obj files of your application.  Even with inlining enabled, inline functions will be instantiated into every object file in which they are used.  All those duplicate instantiations are merged or removed at link time, but if you run DumpBin on the .lib or .obj files you can measure how many object files the inlined functions are being replicated into.  This is similar to the technique I proposed to measure the compile-time cost of template overspecialization.  Sort the database of symbols extracted from DumpBin by symbol_size * symbol_count, where symbol_count is the number of object files in which a particular symbol appears, and skim through the top 100.   This isn’t a perfect measure of code bloat from function inlining, but it can usually catch the worst offenders.
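
If you’re wondering what that processing looks like, here is a minimal sketch of the collation step (not the actual script I use; it assumes the raw /headers dump has already been boiled down to one "size symbol-name" pair per COMDAT per object file, which takes a little extra text munging):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct SymbolStats
{
    std::uint64_t size  = 0;   // size of one instantiation, in bytes
    std::uint64_t count = 0;   // number of object files containing a copy
};

int main()
{
    std::map<std::string, SymbolStats> symbols;

    // Expected input on stdin: "<size in bytes> <symbol name>" per line, one
    // line per COMDAT per object file.
    std::uint64_t size;
    std::string name;
    while (std::cin >> size && std::getline(std::cin, name))
    {
        if (!name.empty() && name[0] == ' ')
            name.erase(0, 1);   // strip the separator space

        SymbolStats& s = symbols[name];
        s.size = size;          // duplicate COMDATs are identical, so any copy's size will do
        s.count += 1;
    }

    // Rank by size * count and print the top 100 offenders.
    typedef std::pair<std::string, SymbolStats> Entry;
    std::vector<Entry> ranked(symbols.begin(), symbols.end());
    std::sort(ranked.begin(), ranked.end(), [](const Entry& a, const Entry& b) {
        return a.second.size * a.second.count > b.second.size * b.second.count;
    });

    for (std::size_t i = 0; i < ranked.size() && i < 100; ++i)
    {
        std::cout << ranked[i].second.size * ranked[i].second.count << '\t'
                  << ranked[i].second.size << '\t'
                  << ranked[i].second.count << '\t'
                  << ranked[i].first << '\n';
    }
    return 0;
}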

You can improve the report I described above by discarding symbols with sizes below some threshold.  These will be functions whose inlined code is smaller than an out-of-line function call and are consequently not a factor in code bloat.  Likewise, if you haven’t been abusing the __forceinline directive, you can discard symbols with sizes above some threshold.  Very large functions will never automatically be inlined by the compiler, so they don’t increase the size of the executable.  They do, however, contribute unfavorably to compile and link times, but I’ll talk more about that later.

One last thing: be sure to review the source for any function you’re considering deinlining.  Consider carefully how the compiler sees the function in isolation and in the context in which it is inlined.  Are the function parameter values typically known at compile time at the call site?  Is the function executed far more times per frame than it is called in the source?  If the answer to either of these questions is yes, the function may be a good candidate for inlining regardless of what the symbol analysis says.

Next I’ll be looking at static allocations.

Minimizing Code Bloat: Template Overspecialization

This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.

Even in an engine that isn’t rife with template meta-programming, template overspecialization can be a major source of code bloat.  Very few programmers, in my experience, really think about what the compiler and linker are doing with the code they write.  They design their code in C++, they write their code in C++, and of course they debug their code in C++.  Their view of the programs they write begins and ends at that level, which is certainly understandable since there is more than enough to worry about in C++ alone.

Luckily for us, C++ is a language whose design is very representative of the underlying hardware.  Most of the time it is perfectly reasonable to assume that the code you write in C++ bears a close resemblance to the machine code generated by the compiler, at least in terms of quantity, structure, and performance.  The one place where that assumption breaks down, however, is in template code.  A very large amount of template code can result in little or no machine code and, conversely, a very small amount of template code can generate an enormous volume of machine code.

Consider this snippet from a hypothetical class implementing a Win32 file stream:

class Stream
{
    // …
    template <typename T>
    int Read(T* obj, uint numObjects)
    {
        if (!m_open) return 0;
        if (m_accessMode == ACCESS_WO)
            return 0;

        uint32 bytesRead = 0;
        uint32 bytesRemaining = numObjects * sizeof(T);

        char* readBuffer = reinterpret_cast<char*>(obj);

        while (bytesRemaining > 0)
        {
            const uint32 kMaxReadSize = 32*1024*1024;
            uint32 bytesThisRequest = std::min(bytesRemaining, kMaxReadSize);

            uint32 bytesReadThisRequest = 0;
            bool success = ::ReadFile(
                m_hFile,
                readBuffer,
                bytesThisRequest,
                &bytesReadThisRequest,
                NULL) != FALSE;

            bytesRemaining -= bytesReadThisRequest;
            readBuffer += bytesReadThisRequest;
            bytesRead += bytesReadThisRequest;

            if (bytesReadThisRequest == 0 ||
                !success)
                break;
        }

        m_currentPosition += bytesRead;
        return bytesRead;
    }
    // …
};

The author of this function is attempting to ease the burden on the calling code by removing the need to specify the size of the object being read into.  This simplifies the interface and removes a potential source of errors.  Unfortunately, the way the function is written is horribly inefficient from a compilation and code size perspective.  The entire contents of the function, some thirty lines of code, have been specialized for every type that is read from a stream.  That’s a lot of code for the compiler to process, particularly when you consider that only one line of code in the entire function is actually dependent on the template parameter.

The story from the linker’s perspective is a little better.  For objects with the same size, say an unsigned short and a short, the compiler will generate identical code for both instantiations and any decent linker will happily merge those in the executable.  Nevertheless, code deduplication is a slow process so there is no reason to rely on it if you don’t have to.  Also, even with code deduplication most linkers won’t do partial function deduplication, so you’re still left with unique copies of this function in your executable for every size of serialized type in your application.

The compiler-friendly way to write a function like the one above is, of course:

class Stream
{
    // …
    int ReadRaw(void* buffer, uint bufferSize);

    template <typename T>
    int Read(T* obj, uint numObjects)
    {
        return ReadRaw(obj, numObjects * sizeof(T));
    }
    // …
};

Now only one line of code is specialized for each serialized type and the meat of Stream::Read will only be seen once by both the compiler and the linker.

The example above may seem contrived, and you’re probably thinking there is no way code like that could end up in your engine.  That’s certainly the way I felt before I started digging through our .obj files.  To find out if your optimism is justified, take a look at the output of DumpBin I mentioned earlier.

Run DumpBin on the libraries linked by your application and concatenate the results.  Next process the output to build a database of unique symbols recording both their individual sizes and their repeat counts.  If you want to see where your compiler is spending its time, look at the sum of symbol_size * symbol_count for all template symbols regardless of what types they are templatized on.  If you want to see where your linker is spending its time or what’s contributing to your executable size, ignore the symbol counts and simply take the sum of the symbol sizes for every symbol from the same template.
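
One wrinkle when grouping by template rather than by individual symbol: specializations need to be collapsed to a common key.  A naive but serviceable way to do that with the undecorated names DumpBin prints is to cut each name at the first angle bracket (a hypothetical helper; note it will mis-group things like operator< overloads):

#include <string>

// Reduce "Despair::Stream::Read<unsigned short>" and "Despair::Stream::Read<float>"
// to the shared key "Despair::Stream::Read" so every specialization of a template
// accumulates into one record.
std::string TemplateKey(const std::string& symbolName)
{
    const std::string::size_type angle = symbolName.find('<');
    return angle == std::string::npos ? symbolName
                                      : symbolName.substr(0, angle);
}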

Once you have a list of top offenders, the next thing to do is to inspect their code and determine whether their degree of specialization is really justified.  In most cases it is pretty easy to turn template functions into non-template functions or at least to factor out whatever portions of the code aren’t directly dependent on the template parameters.

When you’re tired of looking at template code, it’s time to turn your attention to excessive inlining.

Minimizing Code Bloat for Faster Builds and Smaller Executables

No matter how much code you have, somebody always wants more.  That’s certainly been the case with every game and every engine I’ve worked on.  My good friend Kyle, who has for years been tracking all sorts of wonderful statistics about the development of Despair Engine, recently announced that we’d surpassed 1.5 million lines of principal engine code.  That doesn’t count build systems, tools, tests, middleware, or scripts; that’s 100% Despair-original C++ code the compiler has to wade through every time we build the game.

As if the code we write weren’t bad enough, our engine is particularly adept at generating code.  Those 1.5 MLOC result in a hefty 37 megabyte executable on the Xbox 360.  Kyle estimates that we’ll exceed 2 million lines of code before the end of this console generation.  We already have an executable that can’t fit into main memory on a PS2 or a Gamecube.  By 2014 it won’t fit onto an Xbox or Wii either.  I don’t think we’re in danger of exceeding the capacity of a PS3 or Xbox 360, but the prospect is enough to keep me up at nights.

Looking back at the graphs of our executable size and build and link times, I’m proud to see a few dips punctuating the otherwise consistent upward march.  At least a few of those dips correspond to weeks when I turned my sleepless nights into successful assaults on code bloat.  What’s particularly interesting about those dips is that they don’t correspond to any drops in our code line count.  In my biggest success I reduced the size of our executable by 25% or over 7 megabytes.  At the same time I cut our link times by 19%, and yet I did it all without removing a single line of code!  How?

The first place to start when trimming a codebase is figuring out where to look.  In an engine of 1.5 million lines of code there are a lot of places to hide.  My strategy when dealing with problems like this is to gather as much data as possible into a format that is amenable to analysis.  For code this means extracting symbols from the compiled binaries along with their size, type, source file, and any other details you can get your hands on.  Once you have a dump of all the symbols in your compiled code, it is a pretty simple matter to write a script to collate it and generate whatever statistics you find useful.

There are two methods I use for extracting data from compiled code.  The first is a quick and handy utility called DumpBin.  DumpBin is part of the Visual Studio toolchain and is available for all Microsoft targets.  There are similar tools available for every compiler I’ve encountered.  The gcc equivalents, for example, are objdump and nm.  These tools are useful because they can work with just about any compiled binary.  They target standard binary formats like PE/COFF and ELF, and consequently they work on object files, library files, executables and DLLs.  The ability to extract information from intermediate files in the build process is essential if you’re worried as much about build times as you are about final memory usage.

My preferred way to use DumpBin is to specify the /headers option to extract COMDATs from object files compiled with the /Gy option.  For example:

for /r . %n in (*.obj) do DumpBin /headers "%n" >> all_headers.txt

A portion of the output from such a dump looks like this:

SECTION HEADER #14E
   .text name
       0 physical address
       0 virtual address
       7 size of raw data
   22D6B file pointer to raw data (00022D6B to 00022D71)
   22D72 file pointer to relocation table
       0 file pointer to line numbers
       1 number of relocations
       0 number of line numbers
60501020 flags
         Code
         COMDAT; sym= "public: virtual __thiscall Despair::Reflection::Object::~Object(void)" (??1Object@Reflection@Despair@@UAE@XZ)
         16 byte align
         Execute Read

SECTION HEADER #14F
   .text name
       0 physical address
       0 virtual address
       5 size of raw data
   230B1 file pointer to raw data (000230B1 to 000230B5)
       0 file pointer to relocation table
       0 file pointer to line numbers
       0 number of relocations
       0 number of line numbers
60501020 flags
         Code
         COMDAT; sym= "public: int __thiscall Despair::BroadcasterReceipt::AddRef(void)" (?AddRef@BroadcasterReceipt@Despair@@QAEHXZ)
         16 byte align
         Execute Read

As you can see, the result is an easily parsable list of packaged functions with symbol names and sizes clearly specified.  As great as the /headers dump is, it isn’t perfect.  Data symbols aren’t packaged as COMDATs and although DumpBin can extract data from executables, executables don’t contain COMDATs so there isn’t much to see.

To analyze the contents of an executable, a second approach is needed.  My solution is to reference the debugging information produced by the linker.  With Microsoft’s tools, debugging information is stored in a PDB file that accompanies the executable.  Similar to DumpBin’s /headers dump, the PDB contains information about symbols in the compiled binary.  Unlike the /headers dump, however, the PDB file contains data as well as code symbols, and the information it contains reflects post-link optimization, which is important if you’re more worried about memory consumption than compile and link times.

There aren’t any standard tools for dumping the contents of a PDB, but there is a DLL provided with the Microsoft toolchain that makes it pretty easy to write your own.  I’ve used the Debug Interface Access SDK included with Microsoft Visual Studio to plumb the contents of an executable and generate symbol tables that can be analyzed alongside a COMDAT dump.
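
If you want to roll your own, a bare-bones sketch of such a tool looks roughly like this.  Error handling is stripped, the PDB name is a placeholder, and depending on your SDK version you may also need cvconst.h for the SymTag enumeration.

#include <windows.h>
#include <dia2.h>    // Debug Interface Access SDK, ships with Visual Studio
#include <cstdio>

// Walk every function symbol in a PDB and print its size and name.
// Link against diaguids.lib and make sure the msdia DLL is registered.
int main()
{
    CoInitialize(NULL);

    IDiaDataSource* source = NULL;
    CoCreateInstance(CLSID_DiaSource, NULL, CLSCTX_INPROC_SERVER,
                     __uuidof(IDiaDataSource), (void**)&source);

    IDiaSession* session = NULL;
    IDiaSymbol* globalScope = NULL;
    source->loadDataFromPdb(L"game.pdb");   // placeholder path
    source->openSession(&session);
    session->get_globalScope(&globalScope);

    // SymTagFunction restricts the walk to code; data symbols can be pulled
    // the same way with SymTagData.
    IDiaEnumSymbols* functions = NULL;
    globalScope->findChildren(SymTagFunction, NULL, nsNone, &functions);

    IDiaSymbol* function = NULL;
    ULONG fetched = 0;
    while (SUCCEEDED(functions->Next(1, &function, &fetched)) && fetched == 1)
    {
        BSTR name = NULL;
        ULONGLONG size = 0;
        function->get_length(&size);
        function->get_name(&name);
        wprintf(L"%10llu  %s\n", size, name ? name : L"<unnamed>");
        SysFreeString(name);
        function->Release();
    }

    functions->Release();
    globalScope->Release();
    session->Release();
    source->Release();
    CoUninitialize();
    return 0;
}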

In an upcoming series of articles I’m going to explore the most common causes of C++ code bloat and take a closer look at how you can use DumpBin and the PDB to identify and remove the worst offenders from your code.  When I’m done I’ll share some sample code which you can use to assess the level of code bloat in your own engines.

First up, template overspecialization.

SIGGRAPH 2009 Highlights

Light Propagation Volumes in CryEngine 3
Anton Kaplanyan, Crytek

Anton Kaplanyan presented Crytek’s research into real-time single-bounce diffuse indirect illumination.  The current state of the art in real-time global illumination is derived from the work of Marc Stamminger and Carsten Dachsbacher.  They introduced the idea of rendering reflective shadow maps from light sources (shadow maps that store reflected light color and direction at each pixel) and then using that information to introduce virtual point lights (VPLs) that approximate bounced light.  This approach still requires a renderer capable of handling tens of thousands of point lights.  Deferred techniques can handle this kind of load only at barely interactive frame rates on the fastest hardware.  Various people, most notably Chris Wyman, have implemented solutions such as VPL clustering and multi-resolution rendering in order to boost the performance of this technique.  The best I’ve seen from this line of research is single-bounce, single-light global illumination consuming about 20 ms on a high-end GPU.  Crytek’s goal was to reduce that to 3.3 ms on current consoles!

Crytek’s research eventually led them to light propagation volumes, a.k.a. irradiance volumes.  Their motivation is clear: to apply large numbers of point lights at a fixed cost.  The light propagation volumes are represented as cascaded volume textures covering the view frustum at multiple resolutions.  They store a single 4-component spherical harmonic per texel in the volumes, and they render point lights into the volumes as single-texel points.  They then apply what amounts to a custom blur over the volume texture to simulate light propagation and give their point lights non-zero radius.  Finally they apply the light volume as you would any single light in a deferred renderer.

They’ve provided tons of documentation, so I won’t bother going into any more of the details.  This is the most exciting and promising direction of research into real-time lighting I’ve seen in years.  If you read one paper from SIGGRAPH, let this be the one!

Graphics Engine Postmortem from LittleBigPlanet
Alex Evans, Media Molecule

As you may know, I’m a gushing fanboy when it comes to the team at Media Molecule.  These guys are alien geniuses and everything they touch turns to gold.  I was really excited to see the talk by Alex Evans and it absolutely did not disappoint!

Perhaps the most impressive thing about Media Molecule’s work on LittleBigPlanet is the large set of problems they chose not to solve.  From early in development they established a number of key limitations on their game design.  The world would be viewed through a constrained camera angle and it would have very low depth-complexity from that perspective.  Also, the game would be Playstation exclusive and rely heavily on the Cell processor.

In return for these constraints, the engine would push the boundaries of geometric, lighting, and material complexity.  Everything would be dynamic.  Physics simulation would happen at the vertex level (everything is cloth!), lights would be plentiful and dynamic, and users would be able to build and texture objects in game virtually without limitations.

Throughout his talk, Alex repeatedly cautioned against implementing the techniques he was presenting.  He seemed to do it mostly out of modesty, but let’s face it: if you’re like most of us, trying to build an extensible, multipurpose, cross-platform engine to accommodate the current demand for multiplayer, story-driven, detail-rich, open-world action RPG driving games, these techniques ain’t for you!

Hopefully the course notes will come out soon with more details, but to tide you over here are some quick bullets on LittleBigPlanet rendering:

  • Everything is cloth.  The SPUs simulate and collide every vertex every frame, allowing an entire world of soft-body interactions.  The GPU just sees a big polygon soup separated by materials.
  • Alex implemented dynamic irradiance volumes not unlike Crytek’s stuff.  However, his approach differed in three key ways.  First, he was representing higher frequency direct lighting in his irradiance volumes rather than low frequency bounce lighting.  Second, rather than storing second order spherical harmonics, Alex stored first order spherical harmonics and inferred light direction by taking multiple samples from the volume.  Third, rather than using cascaded volumes, he mapped his volumes in projective space.  The combination of these differences led to distortion in the shape of point lights in the volume and caused unacceptable lighting artifacts. He ultimately abandoned irradiance volumes in favor of regular deferred lighting.
  • Since he fell back on deferred lighting, he needed a solution for lighting translucent objects.  Given the world constraints, he decided to allow only a single layer of translucency.  This was achieved by using MSAA sample masks to restrict translucent pixels to one sample of an MSAA buffer.  This is very similar to the stippled alpha approach used in Inferred Lighting (coming up).

Inferred Lighting: Fast dynamic lighting and shadows for opaque and translucent objects
Scott Kircher, Volition, Inc.
Alan Lawrance, Volition, Inc.

Scott Kircher and Alan Lawrance presented a technique they call Inferred Lighting.  I must admit, I was too quick to dismiss this paper when it first came out.  I’m glad I sat through the talk because there are quite a few clever ideas in there.  Inferred lighting is basically deferred lighting with two interesting and complementary extensions.

The first extension is doing the deferred lighting at quarter resolution.  This reduces the cost of the G-Buffer generation and deferred lighting passes, but it introduces artifacts at edges where there is a sharp discontinuity between lighting at adjacent pixels.   To solve this problem, inferred lighting introduces two new attributes to the deferred lighting G-Buffer, an object ID and a normal group ID.  These two 8-bit IDs are packed together into a single 16-bit ID and stored with linear depth and a two-component surface normal representation as the G-Buffer.  Scott and Alan refer to the linear depth and 16-bit ID as the discontinuity sensitive filter (DSF) buffer and use it during the shading pass to discard spurious samples from the irradiance buffer.

The second extension is using alpha stippling to allow a limited number of deferred-lit translucent layers.  Translucent objects are written into the G-Buffer with a 2×2 stipple pattern, allowing up to three layers of translucency.  Each layer of translucency is lit at one quarter the resolution of the already quarter-resolution irradiance buffer, and each overlapping layer of translucency reduces the lighting resolution of the opaque pixels in its quad by 25%.  Since the shading pass is already designed to accommodate a reduced-resolution irradiance buffer, it selects the best irradiance values for each fragment, and overlapping opaque and translucent objects are lit appropriately.  It is important to note that unlike other stippled alpha techniques, inferred lighting doesn’t limit the range of translucency levels in a scene; it merely limits the number of overlapping translucent layers at each pixel.

I’m pretty intrigued by inferred lighting and its future possibilities, but the issues that bothered me when I first read the paper remain.

  • I’ve implemented deferred lighting at low resolutions before, and the quality difference between full and quarter resolution lighting was very noticeable.  Imagine cutting the dimensions of every normal map in your game in half and you’ll know what I mean.
  • Not every visible pixel has a compatible representation in a low-resolution irradiance buffer.  Pixel-thin objects like grass may end up discarding every candidate sample from the irradiance buffer, creating unresolvable lighting artifacts.
  • Transparent objects lit with inferred rendering can’t smoothly fade into or out of a scene.  The transition from opaque to translucent or invisible to translucent will result in a pop as lighting resolution drops.  I can imagine lots of great uses for inferred-lit translucency, but fading effects and LODs in and out aren’t among them.

PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing
Connelly Barnes, Princeton University

This was a pretty interesting presentation of a new algorithm for finding approximate nearest-neighbor matches between image patches.  It sounds pretty dry, but nearest-neighbor matching is at the heart of a lot of highly-desired image editing operations like image retargeting, image shuffling, and image completion.  The cool thing about this paper was it showed how sometimes inferior techniques can achieve superior results if, unlike their competition, they can be applied at interactive frame rates.   The PatchMatch algorithm runs 20 to 100 times faster than previous nearest-neighbor matching algorithms and even though it is only an approximation, its speed has enabled interactive tools with more user guidance and truly stunning results.

id tech 5 Challenges
J.M.P. van Waveren, Id Software

This talk was divided into a review of the virtual texturing implementation in id tech 5 and an overview of the job system used to parallelize non-GPU processing.  Much of the virtual texturing content has been covered before, but this was the first live presentation I’ve been able to attend so it was still a treat.  I’m on the fence about whether virtual texturing is a neat trick suitable to a small set of applications or the foundation for all future development in real-time graphics.  Carmack’s involvement guarantees that virtual texturing will be an important and successful milestone in computer graphics, but it doesn’t guarantee that it will survive to the next generation.  Consider, for example, how many people are still using stencil shadows.

The most interesting thing about id’s job framework is that jobs must accept a full frame of latency.  This severely limits the kind of processing that can be pushed off into asynchronous jobs, but it greatly simplifies the problem of extracting maximum parallelism from the hardware.  It is hard for synchronization to be a bottleneck when it simply isn’t an option.  Anyway, despite this limitation id has managed to push non-player collision detection, animation blending, AI obstacle avoidance, virtual texturing, transparency processing, cloth simulation, water surface modeling, and detail model generation into jobs.  Most of those are pretty obviously latency tolerant, but I’m impressed they have a latency-tolerant implementation of most of their physics simulation.

How does id tech 5 handle camera cuts with virtual texturing and latent job completion?   I hope it is better than the Halo and Unreal 3 engines.

A Real-time Micropolygon Rendering Pipeline
Kayvon Fatahalian, Stanford University

In this talk Kayvon Fatahalian made a case that REYES-style micropolygon rendering is a possible, and in fact practical, future for real-time graphics.  This talk was nicely grounded in reality.  The pros and cons of every technique were well covered and realistic extensions to current GPU architectures were presented to tackle the problems introduced by micropolygons.

The first part of the talk tackled the problem of tessellation, namely implementing split-dice adaptive tessellation.   Kayvon presented a new algorithm, DiagSplit tessellation, that can be implemented in hardware with the addition of one new fixed-function unit to the DirectX 11 GPU pipeline.

In the second part of the talk, Kayvon discussed rasterization of micropolygons.   Current methods for rasterizing polygons are inefficient when dealing with pixel-sized polygons because GPUs must shade at least 4 samples per polygon to generate derivatives.   Kayvon looked at possible modifications to the rasterization unit and weighed the pros and cons of shading before or after rasterization.

With almost everyone else looking at ways of dismantling the structured graphics pipeline and moving to a purely programmable model, it was refreshing to hear a talk on how to improve the existing pipeline to accommodate faster, higher quality graphics.   I feel a big part of the success of computer graphics over the past decade has been the fact that we’ve standardized on a common pipeline.  Even with hardware capable of unlimited flexibility, most developers will still need a common language for expressing their requirements and sharing their solutions.

Visual Perception of 3D Shape
Roland Fleming, MPI for Biological Cybernetics
Manish Singh, Rutgers – New Brunswick

This course covered research into how the human brain transforms visual sensory input into a 3-D representation of the world.   The instructors presented a number of research studies which have sought to test hypotheses about what internal representations the human brain uses for modeling the world.   This line of research is actually very relevant to real-time computer graphics.   We employ a lot of approximations to convey a sense of 3-D to the human brain, so the more we know about what cues the brain cares about, the more effectively we can trick it into seeing what we want it to see.

The Dark Side of Deferred Shading: Light Groups

Much is made of the advantages of deferred shading, but to read most articles on the subject you’d think the only disadvantages are translucency and multisample antialiasing.  That does a real disservice to forward rendering, as there are many popular techniques in real-time computer graphics that are impractical in a deferred shading engine.  One of the best examples of these techniques is light include / exclude groups.

Although it may not have much basis in reality, being able to specify exactly which lights affect which objects is a powerful tool in lighting design.  It can be used stylistically to make certain objects stand out from their environments, for example making player avatars artificially brighter than their surroundings.  It can also be used to compensate for limitations of the real-time lighting model, for example skin can be lit with additional lights to simulate light transmission and subsurface scattering.  Light groups are also extremely useful in optimizing shadow computation.  A common problem that must be dealt with when lighting indoor environments is light bleeding through walls.  Although this can be solved by any number of real-time shadowing techniques, it simply isn’t affordable on current hardware to cast real-time shadows from every light.  A cheap and flexible alternative is to use light groups to manually specify that lights in a room only affect objects in the same room.

Implementing light groups is cheap and easy with forward rendering.  Each light can store a list of meshes that it either explicitly includes or explicitly excludes.  Since meshes are already being matched with lights on the CPU, it is an easy thing to query the include / exclude list for each light and discard lights that don’t affect the current mesh.  The implementation of light groups in a deferred environment, however, is much less obvious.
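
For comparison, the forward-rendering version really is about this small.  A minimal sketch, with the types and names invented for illustration (this is not Despair code):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Mesh;

struct Light
{
    enum GroupMode { GROUP_INCLUDE, GROUP_EXCLUDE };

    GroupMode                groupMode;
    std::vector<const Mesh*> groupMeshes;   // explicit include or exclude list

    bool Affects(const Mesh* mesh) const
    {
        const bool listed =
            std::find(groupMeshes.begin(), groupMeshes.end(), mesh) != groupMeshes.end();
        return (groupMode == GROUP_INCLUDE) ? listed : !listed;
    }
};

// During the existing CPU-side light/mesh matching, lights that fail the group
// test are simply dropped from the mesh's light list; the GPU never sees them.
void GatherLights(const Mesh* mesh,
                  const std::vector<Light*>& visibleLights,
                  std::vector<Light*>& lightsForMesh)
{
    for (std::size_t i = 0; i < visibleLights.size(); ++i)
    {
        if (visibleLights[i]->Affects(mesh))
            lightsForMesh.push_back(visibleLights[i]);
    }
}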

One option for supporting light groups in a deferred renderer is to add an additional attribute to the G-Buffer specifying a unique mesh ID.  The light shaders, during the deferred pass, can read back the mesh ID per pixel and compare it to an include / exclude list passed to the shader in a texture or in constant registers.  Mesh IDs can be allocated on demand and reused cleverly across the frame, so it is possible that as few as 8 bits are needed for them in the G-Buffer.  Likewise, with some careful engineering the light shader can perform the light group classification in as little as one indirect texture fetch per pixel.  Unfortunately, no amount of optimization in this technique can address its largest drawback.  The light group classification is performed in the fragment shader, which means that a light’s fragment shader must run even for pixels that aren’t in the light’s group.  That is a huge step backward from the forward implementation, which incurs absolutely zero GPU cost for lighting meshes that aren’t in a light’s group.  If one of the principal advantages of light groups is that they allow for optimized, statically-defined shadows, this deferred implementation isn’t going to cut it.

A second, and far more common, option for supporting light groups with deferred shading is to define only a fixed number of light groups.  With this approach meshes and lights can be associated with multiple light groups, and lights only affect meshes with which they share membership in a group.  The number of light groups is usually selected to be less than or equal to 8, and the stencil buffer is written during the G-Buffer pass as a per-pixel mask of the groups each mesh belongs to.  During the deferred pass lights use the stencil test hardware to discard any pixels that aren’t in any of the light’s groups.  This technique for supporting light groups has several advantages over the previous technique.  It requires no additional storage in the G-Buffer, it adds no additional cost to the light shaders, and it has the potential at least to discard pixels that aren’t in a particular light’s groups prior to executing the light’s fragment shader.
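
A sketch of the stencil plumbing for that second approach on a D3D9-class API, assuming a hypothetical layout in which the low four stencil bits hold the light group mask and the remaining bits are reserved for other uses:

#include <d3d9.h>

// Hypothetical layout: the low four stencil bits carry the light group mask.
static const DWORD kLightGroupBits = 0x0F;

// G-Buffer pass: write each mesh's group membership into the masked stencil bits.
void WriteMeshGroupMask(IDirect3DDevice9* device, DWORD meshGroupMask)
{
    device->SetRenderState(D3DRS_STENCILENABLE, TRUE);
    device->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_ALWAYS);
    device->SetRenderState(D3DRS_STENCILREF, meshGroupMask & kLightGroupBits);
    device->SetRenderState(D3DRS_STENCILWRITEMASK, kLightGroupBits);
    device->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_REPLACE);
}

// Deferred pass: a light's pixels pass only when the pixel's mask and the light's
// mask share at least one bit.  With ref = 0 and NOTEQUAL, the hardware test
// (ref & mask) != (stencil & mask) passes exactly when (stencil & lightMask) != 0.
void SetLightGroupTest(IDirect3DDevice9* device, DWORD lightGroupMask)
{
    device->SetRenderState(D3DRS_STENCILENABLE, TRUE);
    device->SetRenderState(D3DRS_STENCILREF, 0);
    device->SetRenderState(D3DRS_STENCILMASK, lightGroupMask & kLightGroupBits);
    device->SetRenderState(D3DRS_STENCILFUNC, D3DCMP_NOTEQUAL);
    device->SetRenderState(D3DRS_STENCILFAIL, D3DSTENCILOP_KEEP);
    device->SetRenderState(D3DRS_STENCILZFAIL, D3DSTENCILOP_KEEP);
    device->SetRenderState(D3DRS_STENCILPASS, D3DSTENCILOP_KEEP);
}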

There are some downsides, however.  The stencil buffer can be a precious commodity in a deferred renderer.  Engines using stencil shadows will need to reserve a piece of it, and most deferred renderers will also use a few bits of the stencil buffer to mask and quickly cull pixels that are unlit or outside a light’s depth bounds.  One consequence of this is that the number of bits available in the stencil buffer for a light group mask is usually less than 8 (4 to 6 in my experience).  Six unique light groups are better than none, but the limit does make it much more difficult to use light groups for statically defined shadows.  Assigning light groups to rooms in a level to prevent light bleeding through walls becomes a form of the computationally challenging graph coloring problem.

Another consequence of this technique’s reliance on the stencil buffer is that, because the stencil test logic is complicated and changes with every light, it may disable or dramatically reduce the effectiveness of the GPU’s pre-fragment-shader stencil cull.  On most hardware early stencil cull is accomplished by storing a low-resolution, low-bit-depth cache of the stencil buffer.  Blocks of between 16 and 64 pixels are reduced to a single bit that represents a conservative average of the stencil function across all pixels in the block.  This cache is then consulted prior to running the fragment shader to quickly discard pixels which are guaranteed to fail the stencil test.  Given this implementation of early stencil in the hardware, there are two options for utilizing it with light groups.

The first option is to sort all lights by their light group mask and refresh the early stencil cache between lights when the mask changes.  This provides pre-fragment shader culling for light groups (and anything else you might be using the stencil buffer for), but it requires a potentially expensive refresh of the early stencil cache.  It also adds a new criterion on which to sort lights, and there are plenty of other deferred shading techniques that have their own competing sort order requirements.

The second option is to give up on pre-fragment shader culling for light groups and instead use the early stencil cache solely for stencil shadows and light depth bound culling.  To do this one would simply mask out and ignore the stencil bits representing light groups when populating the early stencil cache.  With this option light groups no longer have any impact on performance–no cost, no benefit.  This option may sound like a defeat, but even it isn’t available on all platforms.  The Xbox 360, for example, can’t apply a mask during the construction of the early stencil cache, so there is no way to exclude light groups from early stencil cull.  The best option on that platform without refreshing early stencil frequently is to make the early stencil test much more conservative.  That means that the presence of light groups will actually allow some pixels that would have been culled due to stencil shadows or light depth bounds to be processed by the fragment shader, and light groups become an unpredictable performance cost.

On the Xbox 360 and Playstation 3, at least, the early stencil hardware is well documented and fully exposed to developers.  On the PC, however, the concept of early stencil cull is entirely hidden from developers and implemented transparently within the driver.  This makes sense since early stencil cull is an optimization very specific to individual hardware implementations and has no impact on the correctness of the output, but it makes relying on early stencil cull for performance extremely dangerous.  Even when you know a lot about your rendering pipeline and your game’s data, it isn’t clear what the optimal use of early stencil cull is for light groups.  Drivers don’t have enough information to make the best choice on your behalf, and when the stencil configuration gets too complicated they are likely to just disable early stencil cull entirely!  So a scary third option for implementing light groups with the stencil buffer is that you lose all early stencil cull during your deferred pass and stencil shadows and light depth bound culling get a lot more expensive.

I’m a huge fan of deferred techniques, but it is important to remember that they are not a one-for-one replacement for forward rendering.  Forward rendering creates a tight coupling between the geometry and the frame buffer; perhaps its greatest weakness, but also its greatest strength.  When evaluating a move from forward to deferred rendering, don’t overlook the fact that light groups and a host of similar mesh-centric features aren’t going to be available anymore.