Deferred Shading Shines. Deferred Lighting? Not So Much.

As I indicate in the subtitle of this blog, there is no single way to develop games.  The techniques used in game development are as many and as varied as the games themselves–what’s best for one game is not necessarily best for another.  The phrase YMMV (your mileage may vary) is pretty much a staple of game technology discussions.  On the other hand, few teams have the money or stamina to try every technology on every game, so I hope people won’t hold it against me when I choose to take sides.

I’ve noticed an increase recently in game developers promoting a technique called deferred lighting.  Unfortunately this technique is old enough that not everyone remembers it by that name.  Wolfgang Engel reintroduced it in ShaderX7 under the name light pre-pass rendering, and for many that name seems to be sticking.  The most recent advocate of deferred lighting is Crytek.  Martin Mittring divulged in a presentation at the Triangle Game Conference that Crytek will be utilizing deferred lighting in version 3 of CryENGINE.

Now I get to tell you why that’s a bad idea.

Deferred lighting is similar to a better-known technique called deferred shading.  In deferred shading all the attributes necessary to completely shade a 3-D scene are rendered into off-screen textures called a G-Buffer.  The G-Buffer for a scene contains, per pixel, things like the surface normal, material albedos, and Phong specular exponent.  Shading can then be done in screen-space per light by reading back the necessary data from the G-Buffer.  This has the distinct advantage of decoupling the geometry processing in the scene from the lighting and shading calculations.  It is generally assumed that one can construct the G-Buffer in a single pass over the scene’s geometry and that one can constrain the light rendering in such a way that no more pixels are processed for a given light than are actually affected by the light.  From an algorithmic complexity standpoint this sounds great.  Meshes are rendered only once and no extraneous lighting or shadowing calculations are performed.  There is a drawback, however.  The G-Buffer can be quite heavyweight, containing all those shading attributes, and consequently deferred shading consumes a lot of memory and bandwidth in constructing and reading back the G-Buffer.  Deferred lighting attempts to address that problem.

In deferred lighting only the lighting, not the shading, computations are deferred.  In the initial pass over the scene geometry only the attributes necessary to compute per-pixel lighting (irradiance) are written to the G-Buffer.  The screen-space, “deferred” pass then outputs only diffuse and specular lighting data, so a second pass must be made over the scene to read back the lighting data and output the final per-pixel shading (radiant exitance).  The apparent advantage of deferred lighting is a dramatic reduction in the size of the G-Buffer.  The obvious cost, of course, is the need to render the scene meshes twice instead of once.  An additional cost is that the deferred pass in deferred lighting must output diffuse and specular irradiance separately, whereas the deferred pass in deferred shading need only output a single combined radiance value.

Five years ago, when I was designing the renderer for the Despair Engine, I thought deferred lighting was the ideal choice.  Details on the Playstation 3 were sketchy at that time, but we already knew that render target memory on the Xbox 360 would be severely limited.  The G-Buffer for a deferred shading system wouldn’t fit in EDRAM and consequently it would have to be rendered in two tiles.  With deferred shading on the Xbox 360 requiring two passes over the scene meshes, the primary disadvantage of deferred lighting appeared nullified.

Despair Engine utilized deferred lighting for over two years, and we were generally very happy with the results.  It was implemented initially on the Xbox 360 and PC, but when the Playstation 3 was released it was extended to that platform as well.  Unfortunately our initial implementation on the Playstation 3 yielded significantly worse performance than we were seeing on the Xbox 360.  We had multiple projects well into development at that point, however, so scaling back our expectations on the content side wasn’t a viable option.  Instead the performance deficit on the Playstation 3 motivated our very talented PS3 programmer, Chris McCue, to look for alternate solutions.  From extensive profiling he identified two bottlenecks unique to the Playstation 3.  First, the PS3 struggled far more with vertex processing costs and consequently both the attributes and shading stages of deferred lighting were more frequently vertex bound on the PS3 than on the other platforms.  Second, the PS3 was sometimes ROP bound during the deferred lighting pass itself, a problem that is all but impossible on the Xbox 360 due to the massive bandwidth to EDRAM.

Based on this data, Chris proposed to switch to classical deferred shading on the Playstation 3.  Deferred shading would reduce the number of geometry passes from two to one and reduce the output bandwidth during the deferred pass.  I agreed, and sure enough the move to deferred shading was a success.  It helped narrow the gap between the Playstation 3 and the Xbox 360 to the point where we could ship the same content on both platforms and provide nearly identical play experiences on each.

The move to deferred shading on the PS3 prompted me to take a closer look at my decision to use deferred lighting on the other platforms.  If deferred shading was a win on the PS3, it seemed likely to have some advantages on the PC and maybe even the Xbox 360.  Although I’ve never been a proponent of settling for the least-common-denominator in cross-platform development, if we could move all platforms to the same deferred process without sacrificing performance, I knew it would save us some headaches in maintaining platform compatibility later on.

I implemented deferred shading on the Xbox 360 and PC a few months later and profiled the results.  On the Xbox 360, much to my surprise, deferred shading performed within a few percent of deferred lighting.  I could literally toggle back and forth between the two techniques and barely notice the difference in GPU utilization.  Deferred lighting was a few percent faster in that initial implementation, but considering that we’d been optimizing the deferred lighting pipeline for years, I wasn’t about to quibble over less than a millisecond of GPU time.  Doing head-to-head comparisons on the PC is a little more difficult because of the wide range of PC graphics hardware, but on the high-end DX9 cards and the low-end DX10 cards that I had access to at the time, the difference in rendering performance between the two techniques on the PC was similarly small.  More importantly, on the PC we suffered far more from CPU-side batch overhead and deferred shading handily cut that cost in half.

Having lived with deferred shading for a couple years now, I’ve come to appreciate the many ways in which it is superior to deferred lighting.  Although deferred lighting sounds great in theory, it can’t quite deliver in practice.  It does, in my experience, offer marginal GPU performance advantages on some hardware, but it does so at the expense of a lot of CPU performance and some noteworthy feature flexibility.  To understand this, consider the implementation of a traditional Phong lighting pipeline under deferred shading and deferred lighting.

Deferred shading consists of two stages, the “attributes stage” and the “deferred stage.”

  • The attributes stage:
    • Reads material color textures
    • Reads material normal maps
    • Writes depth to a D24S8 target
    • Writes surface normal and specular exponent to an A8R8G8B8 target
    • Writes diffuse albedo to an X8R8G8B8 target
    • Writes specular albedo to an X8R8G8B8 target
    • Writes emissive to an X8R8G8B8 target
  • The deferred stage:
    • Reads depth, surface normal, specular exponent, diffuse albedo, and specular albedo
    • Blends exit radiance additively into an X16R16G16B16 target

Deferred lighting, on the other hand, consists of three stages: the “attributes stage”, the “deferred stage,” and the “shading stage.”

  • The attributes stage:
    • Reads material normal maps
    • Writes depth to a D24S8 target
    • Writes surface normal and specular exponent to an A8R8G8B8 target
  • The deferred stage:
    • Reads depth, surface normal, and specular exponent
    • Blends specular irradiance additively into an X16R16G16B16 target
    • Blends diffuse irradiance additively into an X16R16G16B16 target
  • The shading stage:
    • Reads material color textures
    • Reads diffuse and specular irradiance
    • Writes exit radiance into an X16R16G16B16 target

First let’s consider the memory requirements of the two techniques.  Deferred shading uses a G-Buffer that is 20 bytes per pixel and a radiance target that is 8 bytes per pixel for a total of 28 bytes per pixel.  Deferred lighting requires only 8 bytes per pixel for the G-Buffer and 8 bytes per pixel for the radiance target, but it also requires 16 bytes per pixel for two irradiance targets.  So in this configuration deferred lighting actually requires 4 bytes more memory per pixel.  I am assuming that both approaches are using appropriate bit-depth targets for high dynamic range rendering with tone reproduction handled as a post-processing step.  If you assume LDR rendering instead, I would argue that deferred lighting still requires deeper than 8-bit targets for irradiance, because the range of values for irradiance in a scene is typically far greater than the range of values for exit radiance.  In any case, there are a few variations on the layout described above and a number of options for overlapping or reusing targets on the more flexible console architectures that reduce the per-pixel costs of each technique to an equivalent 20-24 bytes per pixel.

Now let’s take a look at bandwidth usage.  The bandwidth required for “material color textures” and “material normal maps” is content dependent, but it is also exactly the same between the two techniques so I can conveniently factor it out of my calculations.  Looking at the layout described above, bandwidth consumed during the attributes and shading stages is measured per pixel and bandwidth consumed during the deferred stages is measured per lit pixel.  Adding everything up except the material color textures and normal maps, we see deferred shading writes 20 bytes per pixel plus an additional 8 bytes per lit pixel and reads 24 bytes per lit pixel.  Deferred lighting, however, writes 16 bytes per pixel plus an additional 16 bytes per lit pixel and reads 16 bytes per pixel plus an additional 24 bytes per lit pixel.  What this means is that if the average number of lights affecting a pixel is greater than 0.5, deferred lighting consumes more write bandwidth than deferred shading.  Furthermore, no matter how many lights affect each pixel, deferred shading consumes 16 fewer bytes of read bandwidth per pixel.

The last thing to consider when comparing the two techniques is feature flexibility.  So far I’ve looked at how traditional Phong lighting might be implemented using the rival deferred techniques.  Proponents of deferred lighting will sometimes argue that handling only the irradiance calculation in screen-space affords more flexibility in the choice of lighting models.  Once the diffuse and specular irradiance buffers have been constructed, each material is free to use them however it sees fit.  Unfortunately there isn’t as much freedom in that as one would like.  Most of the interesting variations in lighting occur in the irradiance calculation, not in the exit radiance calculation.  Anisotropic lighting, light transmission, and subsurface scattering all require additional attributes in the G-Buffer.  They can’t simply be achieved by custom processing in the shading stage.  When you consider the cost of adding additional attributes to each technique, the advantages of deferred shading really come to light.  The 8-byte G-Buffer layout for deferred lighting is completely full.  There is no room for an ambient occlusion or transmissive term without adding an additional render target at the cost of at least 4 bytes per pixel.  The deferred shading layout I’m using for this comparison, however, has unused channels in both the diffuse and specular albedo targets that can be read and written without adding anything to the space and bandwidth calculations above.

To be fair, there is one important detail I should mention.  Most proponents of deferred lighting recognize the excessive cost in generating separate diffuse and specular irradiance buffers and consequently adopt a compromise to the Phong lighting model.  They assume that specular irradiance is either monochromatic or a scalar factor of diffuse irradiance, and consequently it can be stored in the alpha channel of the diffuse irradiance target instead of requiring a full target of its own.  This configuration dramatically improves the results calculated above.  Again in the interests of fairness, when evaluating this form of deferred lighting, a similar compromise should be made for deferred shading.  The specular albedo can be considered monochromatic or a scalar factor of diffuse albedo (or both with sufficient packing).  With these modifications to both techniques deferred lighting does, indeed, have an advantage.  Deferred lighting will now require as little as 16 bytes of memory per pixel on some platforms whereas deferred shading will require 20.  Deferred lighting also ends up having equal write bandwidth requirements to deferred shading and lower read bandwidth requirements as long as the average number of lights per pixel is greater than 2.

Nevertheless, the differences are never huge, and ultimately there are a number of subtleties regarding how the bandwidth is distributed across the various stages and whether the stages are typically bandwidth bound that further muddy the waters.  The most damning evidence against deferred lighting remains that in a direct comparison across the content of two games and three platforms it only provided at best a few percent GPU performance advantage over deferred shading at the cost of nearly doubling the CPU-side batch count.  If further evidence is needed, consider that Killzone 2 experimented with deferred lighting early on in its development and also ultimately settled on a classical deferred shading architecture.

So as I said at the start, YMMV, but I for one don’t expect to be returning to deferred lighting anytime soon.

Class Reflection in the Despair Engine

Reflection is, without a doubt, my favorite computer programming language feature.  It is unfortunate therefore that I’ve spent most of my professional life programming in C++, a language that has stubbornly refused to adopt a mechanism for class reflection.  There is a reason why I’ve stuck with C++ however; it is the highest-level language available today that still provides unobstructed access to the hardware platform on which it sits.  Whatever its faults, it never prevents me from doing what I have to do.

When Kyle Wilson and I were laying out the design for Day 1’s Despair Engine, low-level support for reflection was a must-have on the list of engine features.  We were designing the largest, most ambitious engine either of us had ever worked on, and we didn’t want it to be clogged with thousands of lines of boilerplate code.  We wanted a simple, easy-to-use mechanism by which common libraries for disk serialization, UI generation, network synchronization, script integration, and debugging could operate intelligently on a diverse field of heterogeneous objects.  We’d both been down the route of virtual “Save” and “Load” calls implemented on every object as well as the route of UIVehicle and NetVehicle proxy objects, and we were intent on finding a better way.  At the same time, however, we were keenly aware of the dangers of over-generalizing.

We knew that the engine we were designing would have to grow to support dozens of engineers and multiple simultaneous projects, and that it would have to be flexible enough to adapt to the unpredictable requirements of future games and hardware.  Taking all this into consideration, here is the list of goals we drafted for our reflection system:

  • Zero reliance on custom compiler tools or non-standard compiler extensions.  100% portability was essential because we expected to support platforms whose compilers hadn’t even been written yet.
  • Zero reliance on RTTI or exceptions.  Both of these features come with a cost and are regularly disabled or broken on consoles.
  • No header pollution.  Reflection definitions had to be confined to cpp files so that compile time costs could be minimized and so that modifying a reflection definition didn’t require a full rebuild.
  • Minimal reliance on macros.  Provide type safety and lexical scoping whenever possible.
  • Zero reliance on external data.  Reflection metadata had to be available at static initialization time, tightly bound to class definitions.
  • Efficient support for native types like __m128 and custom low-level types like struct Vector2 { };
  • Support for sophisticated but non-polymorphic types like standard library containers.
  • Support for polymorphic classes and multiple inheritance of abstract interfaces.  We use a COM-like architecture in our engine which clearly separates objects from interfaces.  We deliberately chose not to tackle virtual inheritance with this system.
  • External extensibility.  Reflection definitions had to support metadata specific to systems that the reflection library itself had no knowledge of.  Adding new types or attributes to the system shouldn’t require a massive rebuild.
  • Support for overriding or circumventing the system for special cases.

In addition to these general requirements for the system, we had several specific uses for reflection already in mind which added their own requirements:

  • Disk serialization using the reflection system had to support backward and, to some degree, forward compatibility.
  • Network synchronization using the reflection system had to be able to operate on a subset of the class data and to efficiently extract deltas from class state.
  • UI generation from reflection definitions had to provide context-specific as well as type-specific widgets.  A “radius” property for a volume trigger, for example, should be able to provide a 3-D visualization as well as a range-limited edit box, context-sensitive help, and notifications to other UI elements when it changes.
  • Scripting languages like Lua and Python had to be able to dynamically bind to objects using their reflection definitions, but once bound they should be able to operate on the objects efficiently without requiring table lookups or string comparisons.

It fell on my shoulders to provide the initial design and implementation of Despair’s reflection library, dsReflection.  Luckily reflection is nothing new to games, so I had a lot of inspiration to draw upon.  One of my favorite early implementations of reflection in a game engine is in the Nebula Device.  The Nebula Device’s approach to reflection was fairly straightforward–hard-coded companion functions for every object populated tables with metadata like command names and function pointers–but it was used to great effect to expose much of the engine to Tcl and, from there, to perform serialization and network synchronization.

Although the Nebula Device was a great proof-of-concept for me, it didn’t meet many of the requirements Kyle and I had put forth.  It was very limited in the kind of data it could reflect, and syntactically it was messy and required a lot of boilerplate code.  The inspiration for the syntax of Despair’s reflection definitions came instead from C# and luabind.  We hoped to be writing a lot of tool and support code for the engine in C#, so it made sense to try to match its syntax, and luabind is an ingenious example of how C++ compilers can be manipulated into parsing very un-C++-like code.

What I ended up with is an embedded, domain-specific language implemented through a combination of clever operator overloading and template metaprogramming.  Many of the core concepts in dsReflection are enabled by features of Boost’s metaprogramming tools.  The reflection definition language is used to construct metadata for classes.  The metadata holds references to fields and methods of a class.  Fields can be raw member pointers as well as combinations of getters and setters.  Methods can be member functions with variable numbers of parameters of mixed types.  Methods can expose concrete, virtual, and even pure virtual functions.  All reflected elements can be tagged with user-defined properties similar to Attributes in C#.  These properties are how external systems annotate classes, fields, and methods with information specific to their needs.  Reflection definitions can be applied to non-polymorphic classes, but to take advantage of the full feature set reflected classes must derive from Reflection::Object which introduces a single virtual function.  In its initial implementation dsReflection supported global and file static variables and functions through reflection namespace definitions.  This capability still exists, but we’ve found little use for it and I’ve often considered removing it entirely.

At run time the reflection system provides a wide range of services.  It allows for RTTI-like queries for comparing object types and inheritance hierarchies and it allows for type-safe dynamic casts between reflection objects.  Class definitions for objects can be inspected at run time, and their component elements can be operated on in a variety of ways.  Fields can be read from and written to, and they can also be bound to strongly-typed, templatized wrapper objects called Reflection::Variable<T>.  Similarly, methods can be called with parameters provided through an abstract interface or they can be bound directly to boost::function objects.  Property objects can be extracted from all reflected elements and the properties themselves are reflection objects so they support the same rich level of introspection.

The Despair Engine is about four years old now and it has the battle scars to prove it.  The reflection library is one of the foundational elements of the engine, and it has held up remarkably well under the strain.  As we hoped, dsReflection is leveraged to support all the systems described above, as well as several others that we never anticipated.  It is the kind of feature that once you have it, you can’t imagine life without it.

That said, there have been some struggles along the way.  The biggest struggle with the reflection library has been keeping compiler performance and executable code bloat under control.  These were both serious concerns going into the system which we attempted to address in the design.  The system does meet the requirement that class reflection definitions are specified in cpp files and not in headers, but nevertheless the reflection definitions for classes can be big (really big!) and there is a lot for the compiler to wade through.  A bigger problem for us than the compile time has been the amount of code generated by the compiler to instantiate the reflection definitions.  Reflection definitions are built automatically at static initialization time, so run time performance and memory consumption aren’t an issue, but I’ve twice had to revisit the system’s internals to optimize its code gen and bring our link times and final executable size back to within reasonable limits.  Both Sony and Microsoft have shared in this pain; few engines out there put pressure on a linker like ours.

Another struggle with the reflection system has been training and documentation.  It is a powerful and flexible language which looks only vaguely like C++, and learning all the ways in which one can and, perhaps more importantly, should use the system is a challenge.  Perhaps the best decision we made in this regard was to provide a comprehensive set of unit tests for the library.  Unit tests aren’t the best form of documentation, but they can be invaluable at enforcing undocumented and poorly specified behaviors in systems for which stability and backward compatibility are essential.  Our unit tests provide an answer of last resort to questions like, “can I reflect a std::map<enum, object> and have the serialization maintain backward compatibility as enum values are changed, added, and removed?”  The magic unit test says “yes!”

Detailing the exact workings of dsReflection is way beyond the scope of this article, but I can provide a sample reflection definition that gives a pretty clear picture of how the system plays out in practice.  This sample isn’t something that actually exists in any game, but it is a completely valid reflection definition that would compile with our code.  I haven’t tried to hide any of the messiness.  Most of the complexity comes from specifying a pretty sophisticated user interface with custom widgets and data-dependent views, but that’s hardly unusual.  When our components expose a UI, most of the time we’re pushing for more customized, more specialized behavior, even though it makes the reflection definitions tricky.

// DecalObject.h

class IDecalObject
{
    // abstract interfaces can carry reflection definitions
};

class DecalObject : public Reflection::Object, public IDecalObject
{
public:
    const string& GetTextureFilename() const;
    void SetTextureFilenameBlocking(const string& filename);

    enum DecalType
    {
        kDecalType_Static,
        kDecalType_Random,
        kDecalType_Special,
    };

    ResultCode StartLoggingDebugStats(const char* filename, bool echoToScreen);
    void StopLoggingDebugStats();

    string GetAspectRatio() const;  // returns aspect ratio of currently loaded texture
    bool IsRandomType() const;      // helper checks if m_decalType is random

    ResourceKey         m_textureResource;
    DecalType           m_decalType;
    float               m_maximumImpactAngle;
    std::vector<Color>  m_randomColorPalette;
    bool                m_applyToSkinnedObjects;
};



// DecalObject.cpp

#include "DecalObject.h"

// empty reflection definition for IDecalObject

.name("static",     DecalObject::kDecalType_Static)
.name("random",     DecalObject::kDecalType_Random)
    Prop::Description("A randomly tinted decal.") // custom help for UI / debug console
.name("special",    DecalObject::kDecalType_Special)
    Prop::EditEnumPrivate() // code only, not to be exposed to UI

    Prop::GroupProperty("Graphics") +
    Prop::FriendlyName("Decal Component") +
    Prop::Description("Creates a decal at the current location.")
.field("textureResource", &DecalObject::m_textureResource)
[   // direct serialization of a custom ResourceKey type
    Prop::Save() + Prop::Copy()
.field("textureFilename", &DecalObject::GetTextureFilename,
[   // redundant exposure of m_textureResource but through more UI-convenient,
    // blocking member functions
    Prop::FriendlyName("Texture") +
    "Specifies the texture that will be applied by the decal.  The aspect ratio of the "
    "texture will define the aspect ratio of the decal.") +
.field("aspectRatio", &DecalObject::GetAspectRatio)
[   // read-only exposure of a fake, UI only field
    Prop::FriendlyName("Aspect Ratio") +
    "The aspect ratio of the decal.  This is determined by the texture selected for "
    "the decal.") +
.field("decalType", &DecalObject::m_decalType)
    Prop::FriendlyName("Type") +
    Prop::EditEnumCombo<DecalObject::DecalType>() + // provide a combo box tailored to
                                                    // this enum
    Prop::Save() + Prop::Copy()
.field("maxImpactAngle", &DecalObject::m_maximumImpactAngle)
    Prop::FriendlyName("Maximum Impact Angle") +
    Prop::EditAngle() + // field holds radians, UI presents as degrees
    Prop::Save() + Prop::Copy()
[   // Helpers expose a std::vector as resizeable within the UI. Fancier definitions can
    // provide custom dialogs for modifying container types
    Prop::FriendlyName("Random Palette Size") +
    Prop::Filter<DecalObject>(&DecalObject::IsRandomType) + // only random decals expose
                                                            // a color palette
    Prop::EditNumeric(0, 100) +
    Prop::Save() + Prop::Copy()
.field("palette", &DecalObject::m_randomColorPalette)
[   // Type traits allow the reflection system to operate on templatized standard library
    // and custom container types, including associative containers like std::map
    Prop::FriendlyName("Random Palette") +
    Prop::Filter<DecalObject>(&DecalObject::IsRandomType) +
    Prop::EditContainer( Prop::EditColorPicker(), "Color Entry" ) +  // The container
    // holds a simple type so additional properties are specified for it.  Containers can
    // also hold full reflection objects with their own reflection definitions.
    Prop::Save() + Prop::Copy()
#ifdef _DEBUG
// a few fields and methods for debug console
.method("StartLoggingDebugStats", &DecalObject::StartLoggingDebugStats)
    "StartLoggingDebugStats log_file_name echo_to_screen\n"
    "  Logs memory and performance stats to a file and, optionally, the screen.")
.method("StopLoggingDebugStats", &DecalObject::StopLoggingDebugStats)
.field("applyToSkinnedObjects", &DecalObject::m_applyToSkinnedObjects) // No props, debug
                                                                       // console only

GDC: Where Smart People Come to Gloat

Discovering New Development Opportunities – Satoru Iwata

GDC kicked off with a keynote by Satoru Iwata, the president of Nintendo.  The talk included a variety of new software announcements which you can read about in the official coverage, but to me the most interesting content was Iwata’s comments on the business of game development.  He discussed how companies can fall into the “death spiral” of development where they are forced, due to financial pressure, to ship early and compromise the quality of their product.  This, in turn, leads to lower sales, less money, and increased financial pressure for the next project.  He then discussed how Shigeru Miyamoto’s approach to design creates an “upward spiral.”  

He described how Miyamoto always has multiple, simultaneous projects in a prototype stage and that only when Miyamoto is satisfied that all the necessary elements are perfect will he allow a project to progress to the next stage.  The prototype stage may take as long as two years and sometimes, when it isn’t showing progress, it is suspended and restarted later.  Although Iwata tried to make this sound like a repeatable recipe for success, his talk was intertwined with comments about how Miyamoto is a genius and sees fantastic new gameplay opportunities in every aspect of life.  What happens if I implement the process without a genius?  Do I still end up with Super Mario Sunshine?

Media Molecule: ‘Winging It’ – Ups, Downs, Mistakes, Successes in the Making of LittleBigPlanet – Alex Evans, Mark Healey

I consider this talk by Alex Evans and Mark Healey to be a perfect reflection of the team at Media Molecule and the development process for LittleBigPlanet.  The talk was completely unstructured and Alex and Mark admitted they hadn’t really prepared for it.  Additionally, Alex doesn’t like PowerPoint so the talk was delivered on a collaborative presentation application that he wrote on the plane.  Despite the two presenters frequently talking over each other and moving randomly from topic to topic, they had a certain chemistry and the talk never got boring.  My takeaway from this talk is that all the principals at Media Molecule are multitalented geniuses.  Everyone has an equal voice in design and they all seem to have some background in coding and some talent in art.  As far as I can tell everything they touch turns to gold.

Stop Wasting My Time and Your Money: Why Your Game Doesn’t Need a Story to be a Hit – Margaret Robertson

Margaret Robertson returned this year after her much-praised 2008 talk Treat Me Like a Lover.  Her focus this year was game stories, specifically the wasted time and money involved in pursuing them.  In theory her talk was about how difficult and expensive adding story to a game is and how games don’t need stories to be good, so why bother.  That isn’t really the message that came through in the content of her talk, however.  She did the obligatory highlighting of bad stories in games, but she spent most of her time praising the subtle and clever ways in which really great stories have been told in games.

She argued that very short, simple stories can be cheaper to produce than their long-winded brethren and that they can have more impact and wider appeal.  She used Hemingway’s famous six-word story as one example, “For sale: baby shoes, never worn,” but she also praised the lambda resistance decal in Half-Life 2, the church turned field hospital in Eternal Darkness, and what could be The New Zealand Story‘s own entry in the six-word story contest, “a walrus has stolen my friends!”  What she failed to do was convince me that coming up with good tiny stories is any easier than coming up with long ones.  I’m totally convinced that the best and most effective game stories are small, immutable, and highly evocative, but I think we end up with long, convoluted, drawn-out epics because they’re far, far easier.  After all, despite many attempts, in 80 years no one has come up with a six-word story to rival Hemingway’s.  It’s only six words, how hard can it be?

Rasterization on Larrabee: A First Look at the Larrabee New Instructions in Action – Michael Abrash

Mike Abrash proves that one of the most experienced and respected optimization specialists in the industry can, given enough time and pressure, invent a software rasterizer for Larrabee that is competitive with a traditional hardware solution.  Woot?

Sarcasm aside, this was a very interesting talk.  Larrabee rectifies enough of the mistakes of the Cell processor that I think it actually has a future in mainstream computing, particularly game development.  It isn’t so different from the Xbox 360 processor, in fact.  It offers lots of parallelism and a beefed-up vector unit at the cost of out-of-order execution.  The hardware can’t afford to extract parallelism and the compiler isn’t smart enough to, so guess what?  Assembly programming is back in fashion!  Luckily graphics has been dealing with the problem of high-level parallelism for about forty years and has established a framework that allows everyone–processors, compilers, and programmers–to achieve maximum parallelism with minimal effort.  We can wet our feet with Larrabee the graphics platform while they figure out how to make Larrabee the general-computing platform palatable.

Great Games Made Easy – Adrian Stone

I’ve come away from GDC with a new understanding of what it takes to make top-tier games.  First, blah blah blah; next, add a small team of passionate creative and technical geniuses; finally, blah blah blah.  Wait three years and out pops Super Mario Sunshine, LittleBigPlanet, or Half-Life 2 running at 60 Hz in perfectly vectorized assembly.  The only downside is you don’t get to pick the genre.

How not to C++

One of the things I love about C++ is the fact that it is a strongly-typed language.  An essential element of good programming is writing code that is self-documenting.  The more explicit you are in your code, the less chance that you, your fellow programmers, or even the compiler will misunderstand your purpose.  That said, strong typing can be taken too far.

One of the problems I noted in the first math library I worked with was excessive repetition of common operations.  Vector normalization in particular seemed to be done everywhere.  The library had numerous functions that took as an input parameter a 3-component vector and required that vector to be normalized.  The principal consideration in the design of the library must have been ease of use, because most library functions accepted unnormalized inputs and always performed a normalization step themselves.  The unfortunate downside of this approach was that vectors were frequently renormalized unnecessarily and performance suffered.

struct Vector3 { float x, y, z; };
struct Plane { /* ... */ };

// Construct a Plane object given a point on the plane and the plane's normal
Plane CreatePlane(Vector3 point, Vector3 normal)
{
    // always normalize, whether or not the caller already did
    normal = Normalize(normal);
    // ... build the plane from point and normal
}

Years later when I had an opportunity to design my own math library, I attempted to solve this problem with C++ types.  I introduced a new type in my math library called NormalizedVector3.  Functions like CreatePlane were modified to take NormalizedVector3 objects and consequently they no longer needed to perform possibly redundant normalizations.  My first implementation of NormalizedVector3 looked something like this:

struct NormalizedVector3
{
    // construction from Vector3 must normalize!
    NormalizedVector3(Vector3 v);

    // construction from scalars must normalize!
    NormalizedVector3(float x, float y, float z);

    // cast operator allows NormalizedVector3 to be compatible with Vector3
    operator Vector3() const;

    // can't allow non-const access to members
    float X() const;
    float Y() const;
    float Z() const;

private:
    float x, y, z;
};

This worked reasonably well but there were a few problems.  NormalizedVector3 allowed implicit construction from Vector3 and implicit casting to Vector3 so it could take advantage of all the built-in functionality of a Vector3.  Unfortunately this also meant it introduced a lot of opportunities for implicit normalization.

NormalizedVector3 nv0, nv1;
NormalizedVector3 nv2 = nv0 * nv1;     // implicit conversion to Vector3 and
                                       // back to NormalizedVector3

Vector3 point, normal;
Plane p = CreatePlane(point, normal); // implicit conversion to NormalizedVector3

Since operations like vector multiplication are not length-preserving, they are handled by implicit conversions to and from Vector3.  Not only is this confusing, it is exactly the sort of hidden (and likely unnecessary) cost I was trying to avoid.  My second implementation made the constructor of NormalizedVector3 explicit, which added a bit more verbosity at every call site but also helped highlight what kind of work was going on under the hood.

NormalizedVector3 nv0, nv1;
NormalizedVector3 nv2 = NormalizedVector3( nv0 * nv1 );

Vector3 point, normal;
Plane p = CreatePlane(point, NormalizedVector3(normal) );

Of course, as soon as all the hidden normalizations were brought to light, I discovered a host of undesirable ones.  The math library included functions to multiply vectors and matrices, and most of the time the matrices represented orthogonal transformations.  Now all of a sudden transforming a NormalizedVector3 by a matrix required renormalization!

NormalizedVector3 vOld;
Matrix3x3 orthogonalTransform;
NormalizedVector3 vNew =
    NormalizedVector3( orthogonalTransform * vOld ); // unnecessary normalization

Luckily I already had a solution to this problem: more C++ types!  I created an OrthonormalMatrix3x3 class with a relationship to Matrix3x3 much like NormalizedVector3’s relationship to Vector3.  I was then able to provide a function for transforming NormalizedVector3s by OrthonormalMatrix3x3s that returned NormalizedVector3s and, as if by magic, all concerns about unnecessary normalization disappeared!  Okay, not really.  What actually happened was I discovered lots of operations on orthonormal matrices that preserved orthonormality but were now resulting in unnecessary re-orthonormalizations.  I attempted to provide overloads of those operations to remove the unnecessary conversions between Matrix3x3s and OrthonormalMatrix3x3s, and hilarity ensued.
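
For concreteness, the orthonormal-matrix path looked roughly like this.  This is a reconstruction, not the original code: the trusted FromUnit factory, the member names, and the row-major layout are all assumptions of the sketch.

```cpp
#include <cmath>

struct Vector3 { float x, y, z; };
struct Matrix3x3 { float m[3][3]; };  // row-major

struct NormalizedVector3
{
    // public construction always normalizes
    explicit NormalizedVector3(Vector3 v)
    {
        const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
        x = v.x / len; y = v.y / len; z = v.z / len;
    }

    // trusted path for operations known to preserve length
    static NormalizedVector3 FromUnit(float ux, float uy, float uz)
    {
        NormalizedVector3 n;
        n.x = ux; n.y = uy; n.z = uz;
        return n;
    }

    float x, y, z;

private:
    NormalizedVector3() {}  // uninitialized; only reachable via FromUnit
};

// assumed to wrap a matrix that is already known to be orthonormal
struct OrthonormalMatrix3x3 { Matrix3x3 m; };

// An orthonormal transform preserves length, so the result is built
// through the trusted path with no renormalization step.
NormalizedVector3 Transform(const OrthonormalMatrix3x3& om,
                            const NormalizedVector3& v)
{
    const float (*m)[3] = om.m.m;
    return NormalizedVector3::FromUnit(
        m[0][0] * v.x + m[0][1] * v.y + m[0][2] * v.z,
        m[1][0] * v.x + m[1][1] * v.y + m[1][2] * v.z,
        m[2][0] * v.x + m[2][1] * v.y + m[2][2] * v.z);
}
```

Of course the FromUnit escape hatch is exactly the kind of hole that lets unnormalized vectors sneak into the type, which hints at where this design ends up.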

By this time it was pretty clear I was halfway to Wonderland and it was time to turn around and crawl back out of the rabbit hole.  I deleted all references to OrthonormalMatrix3x3 and NormalizedVector3, which by this point constituted an unhealthy percentage of my math library’s code, and made just two modifications to the original design:

// Construct a Plane object given a point on the plane and the plane's normal.
// The normal vector must be unit length.
Plane CreatePlane(Vector3 origin, Vector3 normal)
{
    assert( IsUnitLength(normal) );
    // ... build the plane from origin and normal
}
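
IsUnitLength itself can be as simple as a squared-length comparison against a small tolerance.  The name matches the assert above; the body and epsilon below are illustrative, not tuned values from the library.

```cpp
#include <cmath>

struct Vector3 { float x, y, z; };

// Returns true if v is unit length to within a small tolerance.
// Comparing the squared length against 1 avoids a sqrt.
bool IsUnitLength(const Vector3& v)
{
    const float lengthSq = v.x * v.x + v.y * v.y + v.z * v.z;
    return std::fabs(lengthSq - 1.0f) < 1e-4f;
}
```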

The lesson here is that simplicity is every bit as important in API design as correctness.  You can try to make your API so clever that even an idiot can’t misuse it, or you can try to make your API so simple that even an idiot can’t misunderstand it.  The result is usually the same–high performance, error-free code–but your clients will be happier and your codebase will be much, much leaner.

Triangle Order Optimization

One of the standard weapons in any graphics programmer’s arsenal is a procedure for mesh triangle order optimization.  Although most meshes are intended to produce output that is independent of the ordering of their triangles, the order of triangles within a mesh can have a big impact on performance.  This gives us great leeway in picking our triangle order and also a great incentive for doing so wisely.

With today’s hardware and game content, a reasonable consideration when picking triangle order is overdraw.  Games tend to be limited by fragment processing, and early-z rejection is the primary way of keeping that cost to a minimum.  The same meshes that are intended to be triangle order independent are usually also intended to be mesh order independent.  Consequently most games minimize overdraw by sorting meshes relative to the camera position and rendering them in nearest-to-farthest order.  This maximizes early-z rejection from inter-mesh occlusion, but it doesn’t address the possibility of intra-mesh occlusion.  Any concave mesh may, after all, have self-occluding elements that could benefit from a nearest-to-farthest ordering.

Since it isn’t feasible to sort the triangles within a mesh based on camera position every frame, the most common overdraw-minimizing mesh optimization is ordering triangles offline based on a camera-independent probability of being occluded.  Take a torus, for example: the probability of the inner wall of the torus being occluded from a random camera angle is much higher than the probability of the outer wall being occluded.  An overdraw-conscious triangle order for a torus would therefore render triangles on the outer wall of the torus before the triangles on the inner wall.  Since computing this camera-independent probability of occlusion can be quite expensive, and since there are other performance advantages to rendering adjacent triangles together, a good approach to overdraw ordering is to first decompose the mesh into convex clusters of adjacent triangles and then compute the occlusion probability per cluster.  This approach is detailed in “Triangle Order Optimization for Graphics Hardware Computation Culling,” which was presented at I3D in 2006 and is implemented in AMD’s Tootle mesh optimization library.
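
The per-mesh half of this is cheap to sketch.  Assuming each mesh instance carries a bounds center (the type and field names here are illustrative, not from any particular engine), nearest-to-farthest ordering is just a sort on squared distance to the camera:

```cpp
#include <algorithm>
#include <vector>

struct Vec3 { float x, y, z; };

struct MeshInstance
{
    Vec3 boundsCenter;  // plus whatever other rendering state a real engine carries
};

// Squared distance avoids a sqrt; ordering by it is equivalent to ordering by distance.
float DistanceSq(const Vec3& a, const Vec3& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Sort opaque meshes nearest-to-farthest from the camera so early-z
// can reject fragments occluded by previously drawn meshes.
void SortFrontToBack(std::vector<MeshInstance>& meshes, const Vec3& cameraPos)
{
    std::sort(meshes.begin(), meshes.end(),
        [&](const MeshInstance& a, const MeshInstance& b)
        {
            return DistanceSq(a.boundsCenter, cameraPos) <
                   DistanceSq(b.boundsCenter, cameraPos);
        });
}
```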

Although I said that overdraw is a reasonable consideration in picking a triangle order, I get the impression that overdraw optimization is still pretty uncommon in games.  There is good cause for skepticism about the value of overdraw optimization, since it has yet to be demonstrated conclusively that it works.  Hardware implementations of early-z cull vary widely and aren’t well documented, so it is difficult to construct an a priori argument in favor of overdraw optimization.  One obvious concern is that hardware pipelines are getting deeper, and there is no guarantee that depth output from one triangle will even be available to the z-cull unit in time to affect other triangles in the same mesh.  Rather than deal with these issues, many game developers choose instead to render an explicit z prepass.  A z prepass ensures zero overdraw in fragment processing but adds significant additional CPU setup and GPU vertex processing cost.

There is still plenty of room for improvement in techniques for overdraw optimization of triangle order, but what this field really needs is a comprehensive evaluation of the advantages of overdraw optimization.  Until someone can show that real-world games can reap real-world benefits, there isn’t going to be much interest in overdraw minimizing triangle orders.

A second possible consideration in picking a triangle order, and by far the most common one, is making good use of the post-transform vertex cache.  Whether you’re sorting whole meshes at once or subsets from your overdraw optimization pass, there is usually plenty of room for minimizing vertex processing work.  The concept is pretty simple: indexed triangle meshes allow the same vertex to be used in multiple triangles.  Most graphics hardware provides a small cache of recently processed vertices so that when the same index is used more than once the “transformed” vertex can be pulled from the cache rather than being processed again.

The effectiveness of the post-transform cache is measured by counting how many vertices must be transformed, on average, per triangle.  This number is known as the “average cache miss ratio,” or ACMR.  For a mesh with no vertex reuse the ACMR is obviously 3.0.  Most meshes, however, exhibit a lot of vertex reuse, and for large meshes an ACMR under 1.0 is not uncommon.  The ACMR for a mesh is dependent on the number of entries in the post-transform cache and, for meshes with more vertices than cache entries, on the order in which the vertices are used.  Luckily for hardware designers, it turns out that there are pretty steeply diminishing returns in ACMR for increasing cache sizes, assuming an optimal vertex order.  As long as software developers do their jobs and provide meshes with optimized triangle orders, the hardware can achieve near-optimal ACMR values with a modest-sized cache.  Consequently, the graphics chips I work with typically have between 10 and 24 entries in the post-transform cache.
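
ACMR is easy to measure offline with a simulated cache.  The sketch below assumes a FIFO replacement policy, which is the usual model for post-transform caches but not a guarantee about any particular chip:

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Computes the average cache miss ratio (ACMR) of an indexed triangle
// list against a FIFO post-transform cache of the given size: every
// index not found in the cache counts as one vertex transform.
float ComputeACMR(const std::vector<unsigned>& indices, std::size_t cacheSize)
{
    std::deque<unsigned> cache;  // front = most recently added entry
    std::size_t transforms = 0;
    for (unsigned index : indices)
    {
        if (std::find(cache.begin(), cache.end(), index) == cache.end())
        {
            ++transforms;             // cache miss: vertex must be transformed
            cache.push_front(index);
            if (cache.size() > cacheSize)
                cache.pop_back();     // evict the oldest entry
        }
    }
    const std::size_t triangleCount = indices.size() / 3;
    return triangleCount ? float(transforms) / float(triangleCount) : 0.0f;
}
```

Six unique indices give the worst-case ACMR of 3.0, while two triangles sharing an edge come in at 2.0.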

Post-transform cache optimization has been well researched and many consider it to be a solved problem.  The top optimization algorithms generally yield ACMRs within a few percent of one another and within a few percent of optimal across a wide range of cache sizes.  Even algorithms that are cache size aware are generally tolerant of varying cache sizes and continue to provide competitive results.  This is probably because at their hearts most of the modern algorithms are the same.  To achieve good runtime performance they are greedy algorithms that employ a heuristic to select vertices that will maximize hits in a simulated cache.  These days it doesn’t make much sense to implement your own solution for post-transform cache optimization.  There are too many good libraries available for free and too small a gap between them and the theoretical ideal.
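
To make the shape of these algorithms concrete, here is a toy greedy reorderer: at each step it emits the remaining triangle with the most vertices already in a simulated FIFO cache.  Published algorithms like Forsyth’s and Tipsy use much cleverer scoring and run in roughly linear time; this O(n²) sketch only illustrates the greedy structure and is not something you should ship.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

struct Tri { unsigned v[3]; };

// Returns a triangle emission order chosen greedily to maximize hits
// in a simulated FIFO post-transform cache of the given size.
std::vector<std::size_t> GreedyCacheOrder(const std::vector<Tri>& tris,
                                          std::size_t cacheSize)
{
    std::deque<unsigned> cache;
    std::vector<bool> used(tris.size(), false);
    std::vector<std::size_t> order;

    auto inCache = [&](unsigned v) {
        return std::find(cache.begin(), cache.end(), v) != cache.end();
    };

    for (std::size_t emitted = 0; emitted < tris.size(); ++emitted)
    {
        // pick the unused triangle with the most vertices already cached
        std::size_t best = tris.size();
        int bestHits = -1;
        for (std::size_t i = 0; i < tris.size(); ++i)
        {
            if (used[i]) continue;
            int hits = 0;
            for (unsigned v : tris[i].v)
                if (inCache(v)) ++hits;
            if (hits > bestHits) { bestHits = hits; best = i; }
        }

        used[best] = true;
        order.push_back(best);

        // update the simulated cache with this triangle's misses
        for (unsigned v : tris[best].v)
        {
            if (!inCache(v))
            {
                cache.push_front(v);
                if (cache.size() > cacheSize) cache.pop_back();
            }
        }
    }
    return order;
}
```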

I recently had the opportunity to compare some of the front-runners in this field.  I took four competing algorithms for post-transform cache optimization and ran them on the mesh libraries for two games.  The complete test bed was over 19 million triangles in 25 thousand meshes, and it included plenty of the sort of low-poly, flat-shaded rocks and table legs that cache optimizers hate.  I timed the results and computed ACMR values for a range of cache sizes.  The algorithms I compared were Direct3D’s D3DXOptimizeFaces function which I believe is based on Hugues Hoppe’s work, my own implementation of Tom Forsyth’s “Linear-Speed Vertex Cache Optimization” algorithm, AMD’s implementation of the “Tipsy” algorithm in Tootle, and a third-party implementation of Gang Lin and Thomas P.-Y. Yu’s K-Cache algorithm.  Since Forsyth’s algorithm takes as a parameter the size of the simulated cache and Tipsy takes as a parameter the size of the targeted cache, I tested each algorithm multiple times with appropriate values.  Here are the results:

                               ACMR at various cache sizes
               Time (secs)       10        14        18        24        32
None                 0.00     1.613     1.554     1.507     1.458     1.416
Forsyth-32         131.27     1.235     1.216     1.208     1.203     1.200
Forsyth-64         139.18     1.240     1.215     1.206     1.200     1.196
D3DX                20.05     1.276     1.227     1.220     1.214     1.210
Tootle-14            7.87     1.281     1.232     1.224     1.216     1.209
Tootle-16            7.55     1.288     1.235     1.220     1.213     1.207
Tootle-32            7.50     1.302     1.260     1.236     1.214     1.198
K-Cache            413.36     1.256     1.232     1.219     1.208     1.204

As you can see, Forsyth’s algorithm produced the best results at all cache sizes and Tootle performed the fastest.  I’ll be the first to admit that I didn’t put a lot of effort into optimizing my implementation of Forsyth’s algorithm.  I tried not to be stupid, but hopefully someone will take a look at it (available here) and provide an implementation that outperforms Tootle.  That said, I don’t really believe the implementation is the sole reason for the difference in running time.  The Tipsy and Forsyth algorithms are quite similar, and they both claim complexity that is roughly linear in the number of faces.  Forsyth’s algorithm, however, only achieves that in the average case; its worst-case runtime is order n squared in the number of faces.  Tipsy, on the other hand, never sacrifices its linear runtime.  Then again, maybe my implementation of Forsyth’s algorithm and the K-Cache implementation I have are completely wrong and I’m doing them both a terrible disservice.  Let this be a lesson to algorithm authors—provide a reference implementation!  Anyway, regardless of small differences, all the algorithms I tested performed extremely well and left little room for improvement.  My personal choice based on these results is D3DX, mainly because I don’t want to keep revisiting the subject and I think D3D has the biggest incentive to stay competitive.  Also, it offers a nice compromise between offline and online performance.

One last note, because someone is bound to ask—I tried comparing the algorithms on various subsets of the mesh pool, like only high- or low-poly meshes or only character or environment meshes, and while there were minor variations, the general trend always held strong.  The “Linear-Speed Vertex Cache Optimization” algorithm consistently returned the lowest ACMRs and Tootle consistently ran the fastest.