Minimizing Code Bloat: Template Overspecialization

This article is a continuation of Minimizing Code Bloat for Faster Builds and Smaller Executables.

Even in an engine that isn’t rife with template meta-programming, template overspecialization can be a major source of code bloat.  Very few programmers, in my experience, really think about what the compiler and linker are doing with the code they write.  They design their code in C++, they write their code in C++, and of course they debug their code in C++.  Their view of the programs they write begins and ends at that level, which is certainly understandable since there is more than enough to worry about in C++ alone.

Luckily for us, C++ is a language whose design is very representative of the underlying hardware.  Most of the time it is perfectly reasonable to assume that the code you write in C++ bears a close resemblance to the machine code generated by the compiler, at least in terms of quantity, structure, and performance.  The one place where that assumption breaks down, however, is in template code.  A very large amount of template code can result in little or no machine code and, conversely, a very small amount of template code can generate an enormous volume of machine code.

Consider this snippet from a hypothetical class implementing a Win32 file stream:

class Stream
{
    // …
    template <typename T>
    int Read(T* obj, uint numObjects)
    {
        if (!m_open) return 0;
        if (m_accessMode == ACCESS_WO)
            return 0;

        uint32 bytesRead = 0;
        uint32 bytesRemaining = numObjects * sizeof(T);

        char* readBuffer = reinterpret_cast<char*>(obj);

        while (bytesRemaining > 0)
        {
            const uint32 kMaxReadSize = 32*1024*1024;
            uint32 bytesThisRequest = std::min(bytesRemaining, kMaxReadSize);

            uint32 bytesReadThisRequest = 0;
            bool success = ::ReadFile(
                m_hFile,
                readBuffer,
                bytesThisRequest,
                &bytesReadThisRequest,
                NULL) != FALSE;

            bytesRemaining -= bytesReadThisRequest;
            readBuffer += bytesReadThisRequest;
            bytesRead += bytesReadThisRequest;

            if (bytesReadThisRequest == 0 ||
                !success)
                break;
        }

        m_currentPosition += bytesRead;
        return bytesRead;
    }
    // …
};

The author of this function is attempting to ease the burden on the calling code by removing the need to specify the size of the object being read into.  This simplifies the interface and removes a potential source of errors.  Unfortunately, the way the function is written is horribly inefficient from a compilation and code size perspective.  The entire contents of the function, some thirty lines of code, have been specialized for every type that is read from a stream.  That’s a lot of code for the compiler to process, particularly when you consider only one line of code in the entire function is actually dependent on the template parameter.

The story from the linker’s perspective is a little better.  For objects with the same size, say an unsigned short and a short, the compiler will generate identical code for both instantiations and any decent linker will happily merge those in the executable.  Nevertheless, code deduplication is a slow process so there is no reason to rely on it if you don’t have to.  Also, even with code deduplication most linkers won’t do partial function deduplication, so you’re still left with unique copies of this function in your executable for every size of serialized type in your application.

The compiler-friendly way to write a function like the one above is, of course:

class Stream
{
    // …
    int ReadRaw(void* buffer, uint bufferSize);

    template <typename T>
    int Read(T* obj, uint numObjects)
    {
        return ReadRaw(obj, numObjects * sizeof(T));
    }
    // …
};

Now only one line of code is specialized for each serialized type and the meat of Stream::Read will only be seen once by both the compiler and the linker.

The example above may seem contrived, and you’re probably thinking there is no way code like that could end up in your engine.  That’s certainly the way I felt before I started digging through our .obj files.  To find out whether your optimism is justified, take a look at the DumpBin output I mentioned earlier.

Run DumpBin on the libraries linked by your application and concatenate the results.  Next, process the output to build a database of unique symbols, recording both their individual sizes and their repeat counts.  If you want to see where your compiler is spending its time, look at the sum of symbol_size * symbol_count for all template symbols, regardless of what types they are templatized on.  If you want to see where your linker is spending its time, or what’s contributing to your executable size, ignore the symbol counts and simply take the sum of the symbol sizes for every symbol from the same template.

Once you have a list of top offenders, the next thing to do is to inspect their code and determine whether their degree of specialization is really justified.  In most cases it is pretty easy to turn template functions into non-template functions or at least to factor out whatever portions of the code aren’t directly dependent on the template parameters.

When you’re tired of looking at template code, it’s time to turn your attention to excessive inlining.

7 Comments

  1. Man, these articles are terrific. I struggle with both build times and code bloat, and these are some superb discussions of how to analyze this stuff (which has definitely been my weak point). I was going to write something to analyze the map file to visualize our dependency graph (it would contain an awful lot of lines). I’ll let you know what I turn up — Lakos mentions some tools in the back of his book that probably have a more modern analogue.

    1. Adrian Stone says:

      I’ve promised code at the end of these articles, so hopefully that will be of some help. Analyzing program structure, à la Lakos, is a really cool idea. The PDB might be a good place to start with that. It contains all the information you’d need, I think, and the DIA SDK is pretty convenient to work with once you get a feel for it.

  2. I don’t see where the symbol count is coming from. Call me obtuse.

    1. Adrian Stone says:

      What I mean by symbol count is the number of times the symbol shows up in the dump from all the object files. The symbol for a particular template function will show up in the output from each object file in which it is used. So if the example function above, template <typename T> int Read(…), is only called from one place, its symbol count will be 1. If it is called from 100 different .cpp files in the application, however, its symbol count will be 100 and, as far as the compiler is concerned, it is a 100 times bigger deal.

  3. peirz says:

    Good post, however I wanted to point out that the problem here is template overinstantiation, not overspecialization. The way it is set up first, the template gets instantiated for every different object type. This is different from somebody running off and specializing an existing template (-class, since doing this with functions is a bad idea) ’til kingdom come. Which is what overspecialization would be 🙂

    1. Adrian Stone says:

      I’ve never seen this problem discussed in literature before so I admit I made up the name. 🙂 I chose template overspecialization because a lot of code within the function is type dependent when it shouldn’t be. I was avoiding the term instantiation because it could easily be confused with another topic I intend to cover: redundant instantiation. In retrospect I think you’re right, however, that specialization implies different/overridden behavior per type which isn’t the issue here at all.

      Now I have to rename the article. Thanks a lot. 😛

  4. Ah, of course, that makes complete sense. I spent a little time spitting this stuff out today, very cool! I’ll take a look at the PDBs for sure.
