Minimizing Code Bloat: Incorrect Inlining

This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.

Earlier I talked about excessive inlining as a common cause of code bloat.  Excessive inlining is when code that shouldn’t be inlined is.  Today I’m going to look at a related problem, code that is declared as inlined but never will be.

Any function whose address is taken for storage in a function pointer or a virtual function table can never fully be inlined.  Even for a function that can legally be inlined, the compiler has ultimate authority over whether or not it will be.  If a function is large the compiler will generally forgo inlining it because the performance benefits of removing the function call overhead don’t justify the increase in executable size.  That’s all well and good, but it still poses a problem.

C++ compilers work by dividing programs into something called translation units.  A translation unit is a single cpp file and all the headers it includes.  The compiler turns these translation units into object files and ultimately the linker combines the contents of all the object files into an executable.  The important thing to note is that every function visible to the compiler in a translation unit must be compiled as part of that unit.  In the case of a function that is declared inline, the compiler will generate a compiled instantiation of the function in every cpp file that uses it.  The various redundant instantiations are marked as weak symbols so the linker knows that multiple copies are expected and that it can pick any one of them for use in the final executable.

Knowing this, you can see how wasteful functions that are incorrectly marked as inline can be.  Large functions are less likely to be inlined, are more likely to take a long time to compile, and consequently can be murder on your build times.

By this point you should have a pretty good idea of how to detect incorrect inlining.  The same technique I proposed for measuring excessive inlining and the compile-time cost of template overspecialization works equally well for detecting incorrect inlining.  Dump the symbols from all your object files with DumpBin /headers and look for large functions with lots of redundant instantiations.

One of the most interesting (and insidious!) sources of incorrect inlining is code that you don’t write at all.  Consider this apparently harmless class definition:

class DataBox
{
    PhysicsData     m_physics;
    AIData          m_ai;
    GraphicsData    m_graphics;
    AnimationData   m_animation;
    UIData          m_ui;
    ScriptData      m_script;
    NetworkData     m_network;
};

This class is potentially a major contributor to code bloat from incorrect inlining.  If you’re scratching your head and wondering how a class with no functions can contribute to code bloat, take a look at the next listing to see this class the way the compiler sees it.

class DataBox
{
    PhysicsData     m_physics;
    AIData          m_ai;
    GraphicsData    m_graphics;
    AnimationData   m_animation;
    UIData          m_ui;
    ScriptData      m_script;
    NetworkData     m_network;

    // Not legal C++.  This is for illustration purposes only.
    DataBox()
    {
        m_physics.PhysicsData();
        m_ai.AIData();
        m_graphics.GraphicsData();
        m_animation.AnimationData();
        m_ui.UIData();
        m_script.ScriptData();
        m_network.NetworkData();
    }

    ~DataBox()
    {
        m_network.~NetworkData();
        m_script.~ScriptData();
        m_ui.~UIData();
        m_animation.~AnimationData();
        m_graphics.~GraphicsData();
        m_ai.~AIData();
        m_physics.~PhysicsData();
    }
};

In C++ every class has a constructor, a destructor, a copy constructor, and an assignment operator.  If you don’t provide these functions the compiler will, and it must do so inline.  Of course good compilers will only instantiate inline functions when they are actually used, so most classes never have their default copy constructor or assignment operators generated, but unless you’re doing something dangerously clever in your codebase, every class has a destructor instantiated at least once.

If you notice default (also sometimes called automatic) methods showing up in the top slots of your redundant function hotlist, the fix is counterintuitive but also really simple.  Just provide an empty, non-inline implementation in place of the default one.  Providing empty, non-inline implementations of default methods for nontrivial classes can sometimes help your build times in other ways.  If you use a lot of smart pointers in your code, the default destructor of a class holding smart pointers may be the only thing preventing you from replacing the #includes of the headers for the classes stored in smart pointers with forward declarations of those classes.  In other words, inline functions create header dependencies, and inline default methods are no exception.

Speaking of smart pointers, as great as they are, code bloat is one of their unfortunate costs.  As I mentioned above, good compilers only instantiate inline code when it is actually used.  Reference counting smart pointers allow for distributed ownership of objects, which can be a powerful and convenient design freedom.  However, if responsibility for destroying objects is widely distributed through your code, so is responsibility for instantiating object destructors.  An engine with rigorously defined ownership hierarchies will have few inline destructor instantiations whereas an engine with sloppily defined ownership hierarchies will have many.  So if you use smart pointers, use them responsibly!  Pass by reference and reference count only when necessary.  Unnecessary reference counting isn’t just inefficient at run time, it is inefficient at compile time too!

If all this sounds like a major pain in the neck, I suggest you give up now and implement unity builds.  They circumvent all these redundant instantiation issues so they’re fantastic at optimizing build times in codebases with poor structure.  On the other hand, if you’re still enjoying this trek through compile land, join me next time for a discussion of redundant template instantiation.

Minimizing Code Bloat: Static Allocations

This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.

Static allocations are certainly the simplest and most predictable sources of code bloat.  Static allocations refer to data structures in a program for which storage is allocated for the entire duration of the application’s execution.   In C++ file, class, and local static variables all require static allocations.  Additional examples of static allocations in C++ are global variables, string literals, and virtual function tables.

Static allocations can be very useful in some circumstances, but they can also be very wasteful.  Most of the code in an engine is executed very infrequently, which means most of the static allocations supporting that code are rarely needed.  Unfortunately since static allocations exist for the lifetime of the program, they’re taking up memory whether you’re using them or not.

In addition to possibly wasting memory, static allocations can increase program link and load times.  The memory for most static allocations is reserved directly within the application binary.  Consequently, every megabyte of static allocation is a megabyte that must be written to disk every time you link, copied across the network every time you deploy, and read from disk every time you launch.

The one exception to this is a special kind of static allocation known as bss data.  Static allocations that are known at compile time to be zero-initialized are placed in a special section of the executable called the bss section.  Memory used by objects in the bss section isn’t reserved in the executable, instead it is allocated by the program loader.  Bss data doesn’t significantly impact build times, but it has the same run time memory requirements as other static allocations.

In the list of code bloat causes, static allocations are usually a very distant third, but every engine I’ve investigated for static allocations has yielded a few unpleasant surprises.

Dumpbin isn’t of much help when it comes to tracking static allocations.  Data symbols aren’t packaged as COMDATs, so they don’t show up in the /headers report I’m so fond of.   To gather statistics on the static allocations within your application, you’ll have to crack open the PDB.  Every static and global variable is documented in the program database along with its size and, through some creative cross-referencing, its source file.

Large static objects are easy to find–just sort the data symbols by size and gawk at the top offenders. Don’t stop there though.  If you generate a lot of code with templates or macros, you may have large numbers of relatively small static objects.  To quantify the cost of these objects you’ll need to do some pattern matching on symbol names. You can strip template parameters from symbols so MyClass<int>::s_buffer and MyClass<float>::s_buffer get counted together in the stats, or you can strip away the scope entirely and tally the stats for all statics named s_buffer together.  I find it useful to analyze the results from collapsing symbol names in a variety of different ways.  Which option yields the best results depends on the particular codebase I’m dealing with and the type of code generation it uses.

That covers it for static allocations and for executable size optimizations in general.  I still have a couple topics to go, but from now on they’re relevant to build times alone.  Coming up, incorrect inlining.

Minimizing Code Bloat: Excessive Inlining

This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.

When a function is inlined its code is, from the perspective of the compiler, replicated to every place where the function is called.  For very simple functions the inlined code can be smaller than the out-of-line function call, and the result is a net reduction in code size.  For most functions, however, the function code is larger than the function call and inlining will increase executable size.

Code bloat from excessive inlining usually isn’t a problem if you’re smart enough to trust your compiler.  Compilers apply a set of heuristics to decide whether the performance benefits of inlining a function justify the costs.  If the cost isn’t justified, the compiler won’t inline.  Additionally, most compilers err on the side of not inlining.  Unless you’ve overridden the default compiler behavior with a __forceinline directive, you can safely assume that any code bloat that comes from inlining is justified by improved performance.

Still, you can never be too careful, so it’s worth checking to see how much inlining is costing you.

One technique for measuring the cost of inlining is compiling your engine with inlining enabled and again with inlining disabled, and comparing the size of the resulting executables.  If the executable without inlining is a lot smaller, you’re probably inlining too much.  Even if your executable isn’t significantly smaller with inlining disabled, you may still be suffering from excessive inlining.  In some cases inlining a function can result in reduced code size because the inlined code opens up optimization opportunities that were otherwise invisible to the compiler.  Code size reductions from well inlined functions can offset code size increases from poorly inlined functions, making quick executable size comparisons unreliable.

A better way to identify excessive inlining is to use DumpBin on the .obj files of your application.  Even with inlining enabled, inline functions will be instantiated into every object file in which they are used.  All those duplicate instantiations are merged or removed at link time, but if you run DumpBin on the .lib or .obj files you can measure how many object files the inlined functions are being replicated into.  This is similar to the technique I proposed to measure the compile-time cost of template overspecialization.  Sort the database of symbols extracted from DumpBin by symbol_size * symbol_count, where symbol_count is the number of object files in which a particular symbol appears, and skim through the top 100.   This isn’t a perfect measure of code bloat from function inlining, but it can usually catch the worst offenders.

You can improve the report I described above by discarding symbols with sizes below some threshold.  These will be functions whose inlined code is smaller than an out-of-line function call and are consequently not a factor in code bloat.  Likewise, if you haven’t been abusing the __forceinline directive, you can discard symbols with sizes above some threshold.  Very large functions will never automatically be inlined by the compiler, so they don’t increase the size of the executable.  They do, however, contribute unfavorably to compile and link times, but I’ll talk more about that later.

One last thing–be sure to review the source for any function you’re considering deinlining.  Consider carefully how the compiler sees the function in isolation and in the context in which it is inlined.  Are the function parameter values typically known at compile time at the call site?  Is the ratio of the number of times the function is executed in a frame to the number of times the function is called in the source very large?  If the answer to either of these questions is yes, the function may be a good candidate for inlining regardless of what the symbol analysis says.

Next I’ll be looking at static allocations.

Minimizing Code Bloat: Template Overspecialization

This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.

Even in an engine that isn’t rife with template meta-programming, template overspecialization can be a major source of code bloat.  Very few programmers, in my experience, really think about what the compiler and linker are doing with the code they write.  They design their code in C++, they write their code in C++, and of course they debug their code in C++.  Their view of the programs they write begins and ends at that level, which is certainly understandable since there is more than enough to worry about in C++ alone.

Luckily for us, C++ is a language whose design is very representative of the underlying hardware.  Most of the time it is perfectly reasonable to assume that the code you write in C++ bears a close resemblance to the machine code generated by the compiler, at least in terms of quantity, structure, and performance.  The one place where that assumption breaks down, however, is in template code.  A very large amount of template code can result in little or no machine code and, conversely, a very small amount of template code can generate an enormous volume of machine code.

Consider this snippet from a hypothetical class implementing a Win32 file stream:

class Stream
{
    // …
    template <typename T>
    int Read(T* obj, uint numObjects)
    {
        if (!m_open) return 0;
        if (m_accessMode == ACCESS_WO)
            return 0;

        uint32 bytesRead = 0;
        uint32 bytesRemaining = numObjects * sizeof(T);

        char* readBuffer = reinterpret_cast(obj);

        while (bytesRemaining > 0)
        {
            const uint32 kMaxReadSize = 32*1024*1024;
            uint32 bytesThisRequest = std::min(bytesRemaining, kMaxReadSize);

            uint32 bytesReadThisRequest = 0;
            bool success = ::ReadFile(
                m_hFile,
                readBuffer,
                bytesThisRequest,
                &bytesReadThisRequest,
                NULL) != FALSE;

            bytesRemaining -= bytesReadThisRequest;
            readBuffer += bytesReadThisRequest;
            bytesRead += bytesReadThisRequest;

            if (bytesReadThisRequest == 0 ||
                !success)
                break;
        }

        m_currentPosition += bytesRead;
        return bytesRead;
    }
    // …
};

The author of this function is attempting to ease the burden on the calling code by removing the need to specify the size of the object being read into.  This simplifies the interface and removes a potential source of errors.  Unfortunately, the way the function is written is horribly inefficient from a compilation and code size perspective.  The entire contents of the function, some thirty lines of code, have been specialized for every type that is read from a stream.  That’s a lot of code for the compiler to process, particularly when you consider only one line of code in the entire function is actually dependent on the template parameter.

The story from the linker’s perspective is a little better.  For objects with the same size, say an unsigned short and a short, the compiler will generate identical code for both instantiations and any decent linker will happily merge those in the executable.  Nevertheless, code deduplication is a slow process so there is no reason to rely on it if you don’t have to.  Also, even with code deduplication most linkers won’t do partial function deduplication, so you’re still left with unique copies of this function in your executable for every size of serialized type in your application.

The compiler-friendly way to write a function like the one above is, of course:

class Stream
{
    // …
    int ReadRaw(void* buffer, uint bufferSize);

    template <typename T>
    int Read(T* obj, uint numObjects)
    {
        return ReadRaw(obj, numObjects * sizeof(T));
    }
    // …
};

Now only one line of code is specialized for each serialized type and the meat of Stream::Read will only be seen once by both the compiler and the linker.

The example above may seem contrived, and you’re probably thinking there is no way code like that could end up in your engine.  That’s certainly the way I felt before I started digging through our .obj files.  To find out if your optimism is justified, take a look at the output of DumpBin I mentioned earlier.

Run DumpBin on the libraries linked by your application and concatenate the results.  Next process the output to build a database of unique symbols recording both their individual sizes and their repeat counts.  If you want to see where your compiler is spending its time, look at the sum of symbol_size * symbol_count for all template symbols regardless of what types they are templatized on.  If you want to see where your linker is spending its time or what’s contributing to your executable size, ignore the symbol counts and simply take the sum of the symbol sizes for every symbol from the same template.

Once you have a list of top offenders, the next thing to do is to inspect their code and determine whether their degree of specialization is really justified.  In most cases it is pretty easy to turn template functions into non-template functions or at least to factor out whatever portions of the code aren’t directly dependent on the template parameters.

When you’re tired of looking at template code, it’s time to turn your attention to excessive inlining.

Minimizing Code Bloat for Faster Builds and Smaller Executables

No matter how much code you have, somebody always wants more.  That’s certainly been the case with every game and every engine I’ve worked on.  My good friend Kyle, who has for years been tracking all sorts of wonderful statistics about the development of Despair Engine, recently announced that we’d surpassed 1.5 million lines of principle engine code.  That doesn’t count build systems, tools, tests, middleware, or scripts; that’s 100% Despair-original C++ code the compiler has to wade through every time we build the game.

As if the code we write weren’t bad enough, our engine is particularly adept at generating code.  Those 1.5 MLOC result in a hefty 37 megabyte executable on the Xbox 360.  Kyle estimates that we’ll exceed 2 million lines of code before the end of this console generation.  We already have an executable that can’t fit into main memory on a PS2 or a Gamecube.  By 2014 it won’t fit onto an Xbox or Wii either.  I don’t think we’re in danger of exceeding the capacity of a PS3 or Xbox 360, but the prospect is enough to keep me up at nights.

Looking back at the graphs of our executable size and build and link times, I’m proud to see a few dips punctuating the otherwise consistent upward march.  At least a few of those dips correspond to weeks when I turned my sleepless nights into successful assaults on code bloat.  What’s particularly interesting about those dips is that they don’t correspond to any drops in our code line count.  In my biggest success I reduced the size of our executable by 25% or over 7 megabytes.  At the same time I cut our link times by 19%, and yet I did it all without removing a single line of code!  How?

The first place to start when trimming a codebase is figuring out where to look.  In an engine of 1.5 million lines of code there are a lot of places to hide.  My strategy when dealing with problems like this is to gather as much data as possible into a format that is amenable to analysis.  For code this means extracting symbols from the compiled binaries along with their size, type, source file, and any other details you can get your hands on.  Once you have a dump of all the symbols in your compiled code, it is a pretty simple matter to write a script to collate it and generate whatever statistics you find useful.

There are two methods I use for extracting data from compiled code.  The first is a quick and handy utility called DumpBin.  DumpBin is part of the Visual Studio toolchain and is available for all Microsoft targets.  There are similar tools available for every compiler I’ve encountered.  The gcc equivalents, for example, are objdump and nm.  These tools are useful because they can work with just about any compiled binary.  They target standard binary formats like PE/COFF and ELF, and consequently they work on object files, library files, executables and DLLs.  The ability to extract information from intermediate files in the build process is essential if you’re worried as much about build times as you are about final memory usage.

My preferred way to use DumpBin is to specify the /headers option to extract COMDATs from object files compiled with the /Gy option.  For example:

for /r . %n in (*.obj) do DumpBin /headers “%n” >> all_headers.txt

A portion of the output from such a dump looks like this:

SECTION HEADER #14E
   .text name
       0 physical address
       0 virtual address
       7 size of raw data
   22D6B file pointer to raw data (00022D6B to 00022D71)
   22D72 file pointer to relocation table
       0 file pointer to line numbers
       1 number of relocations
       0 number of line numbers
60501020 flags
         Code
         COMDAT; sym= "public: virtual __thiscall Despair::Reflection::Object::~Object(void)" (??1Object@Reflection@Despair@@UAE@XZ)
         16 byte align
         Execute Read

SECTION HEADER #14F
   .text name
       0 physical address
       0 virtual address
       5 size of raw data
   230B1 file pointer to raw data (000230B1 to 000230B5)
       0 file pointer to relocation table
       0 file pointer to line numbers
       0 number of relocations
       0 number of line numbers
60501020 flags
         Code
         COMDAT; sym= "public: int __thiscall Despair::BroadcasterReceipt::AddRef(void)" (?AddRef@BroadcasterReceipt@Despair@@QAEHXZ)
         16 byte align
         Execute Read

As you can see, the result is an easily parsable list of packaged functions with symbol names and sizes clearly specified.  As great as the /headers dump is, it isn’t perfect.  Data symbols aren’t packaged as COMDATs and although DumpBin can extract data from executables, executables don’t contain COMDATs so there isn’t much to see.

To analyze the contents of an executable, a second approach is needed.  My solution is to reference the debugging information provided by the linker.  With Microsoft’s tools, debugging information is stored in a PDB file that accompanies the executable.  Similar to DumpBin’s /headers dump, the PDB contains information about symbols in the compiled binary.  Unlike the /headers dump, however, the PDB file contains data as well as code symbols, and the information it contains is post-linker optimization which is important if you’re more worried about memory consumption than compile and link times.

There aren’t any standard tools for dumping the contents of a PDB, but there is a DLL provided with the Microsoft toolchain that makes it pretty easy to write your own.  I’ve used the Debug Interface Access SDK included with Microsoft Visual Studio to plumb the contents of an executable and generate symbol tables that can be analyzed alongside a COMDAT dump.

In an upcoming series of articles I’m going to explore the most common causes of C++ code bloat and take a closer look at how you can use DumpBin and the PDB to identify and remove the worst offenders from your code.  When I’m done I’ll share some sample code which you can use to assess the level of code bloat in your own engines.

First up, template overspecialization.