Symbol Sort : A Utility for Measuring C++ Code Bloat

OVERVIEW

SymbolSort is a utility for analyzing code bloat in C++ applications.  It works by extracting the symbols from a dump generated by the Microsoft DumpBin utility or by reading a PDB file.  It processes the symbols it extracts and generates lists sorted by a number of different criteria.  You can read more about the motivation behind SymbolSort here.

The lists compiled by SymbolSort are:

Raw Symbols, sorted by size

This list is generated from the complete set of symbols.  No deduplication is performed so this list is intended to highlight individual large symbols.

File contributions, sorted by size

This list is generated by calculating the total size of symbols that contribute to a folder path.  If the input is a COMDAT dump, the source location for symbols is the .obj or .lib file that DumpBin was run on (see usage for details).  It is important to note that for COMDAT dumps individual symbols will appear multiple times coming from different .obj files.  If the input is a PDB file, the source location for symbols is the actual source file in which the symbol is defined.  The source file for data symbols is not always clearly defined within the PDB so in some cases it is a best guess.

File contributions, sorted by path

This is a complete, hierarchical list of the size of symbols in all contributing source files.
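
As a rough illustration of how the two file contribution lists relate, the sketch below rolls symbol sizes up into every parent folder and then sorts the result both by size and by path.  It is written in Python rather than SymbolSort's C#, and the function name folder_totals and the sample paths are invented for this example; it is not taken from the tool's source.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def folder_totals(symbols):
    """Add each symbol's size to its source file and to every parent folder.

    symbols is an iterable of (source_path, size_in_bytes) pairs.
    """
    totals = defaultdict(int)
    for source_path, size in symbols:
        path = PureWindowsPath(source_path)
        totals[str(path)] += size
        for parent in path.parents:
            totals[str(parent)] += size
    return dict(totals)


# Hypothetical sample data: (source file, symbol size in bytes).
sample = [
    (r"c:\src\math\vec.cpp", 100),
    (r"c:\src\math\mat.cpp", 50),
    (r"c:\src\io\file.cpp", 30),
]
totals = folder_totals(sample)

# "Sorted by size" puts the heaviest folders first.
by_size = sorted(totals.items(), key=lambda item: item[1], reverse=True)
# "Sorted by path" gives the complete hierarchical listing.
by_path = sorted(totals.items())

for path, size in by_path:
    print(f"{size:8d}  {path}")
```

With COMDAT input the same roll-up would presumably be applied to .obj or .lib paths instead of source files.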

Symbol Sections / Types, sorted by total size and by total count

This shows a breakdown of symbols by section or type, depending on the kind of information that can be extracted from the input source.

Merged Duplicate Symbols, sorted by total size and by total count

This list is generated by merging symbols with identical names.  Symbols that share a name are not guaranteed to be the same symbol.  In the case of PDB input there will be very few duplicate symbols.  COMDAT input, however, should contain a large number of duplicate symbols.  This list is useful for estimating the total compile and link cost of a particular symbol.  A relatively small symbol that appears in a very large number of .obj files will have a large total size and appear near the top of this list.
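
The merge itself amounts to grouping by name and summing.  Here is a minimal Python sketch of that idea; merge_by_name and the sample symbols are hypothetical and only meant to show why a small, frequently duplicated symbol can outrank a larger unique one.

```python
from collections import defaultdict


def merge_by_name(symbols):
    """Group (name, size) pairs by name.

    Returns {name: (total_size, count)}.
    """
    merged = defaultdict(lambda: [0, 0])
    for name, size in symbols:
        entry = merged[name]
        entry[0] += size   # total size across all copies
        entry[1] += 1      # number of copies seen
    return {name: (total, count) for name, (total, count) in merged.items()}


# Hypothetical COMDAT-style input: the same inline function shows up
# once per .obj file that uses it.
sample = [
    ("MyVector<int>::clear", 24),
    ("MyVector<int>::clear", 24),
    ("MyVector<int>::clear", 24),
    ("g_lookup_table", 64),
]
merged = merge_by_name(sample)

by_total_size = sorted(merged.items(), key=lambda kv: kv[1][0], reverse=True)
by_total_count = sorted(merged.items(), key=lambda kv: kv[1][1], reverse=True)
```

In this sample the 24-byte function ends up with a total of 72 bytes and sorts ahead of the 64-byte table.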

Merged Template Symbols, sorted by total size and by total count

This list is generated by stripping template parameters from symbols and then merging duplicates.  Symbols std::auto_ptr<int> and std::auto_ptr<float> will be transformed into std::auto_ptr<T> in this list and be counted together.
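
One simple way to do this kind of stripping is a bracket-depth scan, sketched below in Python.  The function strip_template_params is an illustration only and makes no claim about how SymbolSort implements it; in particular this version does not handle names such as operator< or operator<<, where the angle bracket is not a template delimiter.

```python
def strip_template_params(name):
    """Replace every outermost <...> group in a symbol name with <T>."""
    out = []
    depth = 0
    for ch in name:
        if ch == "<":
            if depth == 0:
                out.append("<T>")   # emit the placeholder once per group
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)          # keep text outside template brackets
    return "".join(out)


for symbol in ("std::auto_ptr<int>",
               "std::auto_ptr<float>",
               "std::map<int, std::vector<float> >::find"):
    print(symbol, "->", strip_template_params(symbol))
```

After this transformation the names can be merged exactly like duplicate symbols.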

Merged Overloaded Symbols, sorted by total size and by total count

This list is generated by stripping template parameters and function parameters from symbols and then merging duplicates.  Overloaded functions sqrt(float) and sqrt(double) will be transformed into sqrt(…) in this list and be counted together.
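
A hedged Python sketch of the same idea: collapse template brackets first, then collapse the parameter list.  The helper names collapse and strip_overload_info are invented for this example, the ASCII placeholder (...) stands in for the ellipsis shown above, and operator names containing brackets or parentheses are not handled.

```python
def collapse(name, open_ch, close_ch, placeholder):
    """Replace every outermost open_ch...close_ch group with placeholder."""
    out = []
    depth = 0
    for ch in name:
        if ch == open_ch:
            if depth == 0:
                out.append(placeholder)
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def strip_overload_info(name):
    """Strip template parameters, then function parameters."""
    without_templates = collapse(name, "<", ">", "<T>")
    return collapse(without_templates, "(", ")", "(...)")


for symbol in ("sqrt(float)",
               "sqrt(double)",
               "Foo<int>::bar(std::vector<int> const&) const"):
    print(symbol, "->", strip_overload_info(symbol))
```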

Symbol Tags, sorted by total size and by total count

This list represents a tag cloud generated from the symbol names.  The symbols are tokenized and the total size and count are tallied for each token.  I’m not sure what this list is good for, but I’m all about tag clouds so I couldn’t resist including it.
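
For the curious, a tag tally along these lines can be sketched in a few lines of Python.  This is an illustration under two assumptions of mine: tokens are C-style identifiers, and a token is counted once per symbol even if it appears several times in the name.  SymbolSort's actual tokenizer may differ.

```python
import re
from collections import Counter


def tally_tags(symbols):
    """Tally total size and count per identifier token.

    symbols is an iterable of (name, size) pairs.
    Returns (sizes, counts) as two Counters keyed by token.
    """
    sizes, counts = Counter(), Counter()
    for name, size in symbols:
        # set() so a token repeated inside one name is only counted once
        for token in set(re.findall(r"[A-Za-z_]\w*", name)):
            sizes[token] += size
            counts[token] += 1
    return sizes, counts


# Hypothetical sample symbols.
sample = [
    ("std::vector<int>::push_back", 40),
    ("std::string::append", 60),
]
sizes, counts = tally_tags(sample)
print(sizes.most_common(3))
```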

USAGE

SymbolSort [options]

Options:
  -in[:type] filename
      Specify an input file with optional type.  Exe and PDB files are
      identified automatically by extension.  Otherwise type may be:
          comdat - the format produced by DumpBin /headers
          sysv   - the format produced by nm --format=sysv
          bsd    - the format produced by nm --format=bsd --print-size

  -out filename
      Write output to specified file instead of stdout

  -count num_symbols
      Limit the number of symbols displayed to num_symbols

  -exclude substring
      Exclude symbols that contain the specified substring

  -diff[:type] filename
      Use this file as a basis for generating a differences report.
      See -in option for valid types.

  -searchpath path
      Specify the symbol search path when loading an exe

  -path_replace regex_match regex_replace
      Specify a regular expression search/replace for symbol paths.
      Multiple path_replace sequences can be specified for a single
      run.  The match term is escaped but the replace term is not.
      For example: -path_replace d:\\SDK_v1 c:\SDK -path_replace
      d:\\SDK_v2 c:\SDK

  -complete
      Include a complete listing of all symbols sorted by address.

Options specific to Exe and PDB inputs:
  -include_public_symbols
      Include 'public symbols' from PDB inputs.  Many symbols in the
      PDB are listed redundantly as 'public symbols.'  These symbols
      provide a slightly different view of the PDB as they are named
      more descriptively and usually include padding for alignment
      in their sizes.

  -keep_redundant_symbols
      Normally symbols are processed to remove redundancies.  Partially
      overlapped symbols are adjusted so that their sizes aren't
      over-reported and completely overlapped symbols are discarded.
      This option preserves all symbols and their reported sizes.

  -include_sections_as_symbols
      Attempt to extract entire sections and treat them as individual
      symbols.  This can be useful when mapping sections of an
      executable that don't otherwise contain symbols (such as .pdata).

  -include_unmapped_addresses
      Insert fake symbols representing any unmapped addresses in the
      PDB.  This option can highlight sections of the executable that
      aren't directly attributable to symbols.  In the complete view
      this will also highlight space lost due to alignment padding.
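
The redundancy removal that -keep_redundant_symbols turns off can be pictured with the following Python sketch.  It is not SymbolSort's code.  The policy shown (walk symbols in address order, drop any symbol that is fully covered, and trim the front of a symbol that is partially covered) is an assumption chosen for the illustration, and remove_overlaps is a hypothetical name.

```python
def remove_overlaps(symbols):
    """Return symbols with double-counted bytes removed.

    symbols is an iterable of (name, address, size) tuples.
    Assumed policy: earlier symbols win; a later symbol only keeps the
    bytes that no earlier symbol has already claimed.
    """
    result = []
    covered_end = 0
    # Sort by address; for equal addresses consider the larger symbol first.
    for name, address, size in sorted(symbols, key=lambda s: (s[1], -s[2])):
        end = address + size
        if end <= covered_end:
            continue  # completely overlapped: discard
        start = max(address, covered_end)
        result.append((name, start, end - start))  # partially overlapped: trim
        covered_end = end
    return result


# Hypothetical sample: "c" lies entirely inside "a", "b" overlaps its tail.
sample = [
    ("a", 0x1000, 0x40),
    ("b", 0x1020, 0x40),
    ("c", 0x1010, 0x10),
]
cleaned = remove_overlaps(sample)
total = sum(size for _, _, size in cleaned)
```

After cleaning, the sizes add up to the 0x60 bytes actually spanned by the three symbols rather than the 0x90 bytes they report individually.
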

SymbolSort supports three types of input files:

COMDAT dump

A COMDAT dump is generated using the DumpBin utility with the /headers option.  DumpBin is included with the Microsoft compiler toolchain. SymbolSort can accept the dump from a single .lib or .obj file, but the best way to use it is to create a complete dump of all the .obj files from an entire application.  The Windows command line utility FOR can be used for this:

for /R "c:\obj_file_location" %n in (*.obj) do "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\DumpBin.exe" /headers "%n" >> c:\comdat_dump.txt

This will generate a concatenated dump of all the headers in all the .obj files in c:\obj_file_location.  Beware, for large applications this could produce a multi-gigabyte file.

PDB or EXE

SymbolSort supports reading debug symbol information from .exe files and .pdb files.  The .exe file will only be used to find the location of its matching .pdb file, and then the symbols will be extracted from the PDB.  SymbolSort uses msdia100.dll to extract data from the PDB file.  Msdia100.dll is included with the Microsoft compiler toolchain.  In order to use it you will probably have to register the dll.

regsvr32 "C:\Program Files\Common Files\Microsoft Shared\VC\msdia100.dll"

It is important that you register the 64-bit version of msdia100.dll on 64-bit Windows and the 32-bit version on 32-bit Windows.  If you don’t find msdia100.dll in the path listed above, try looking for it in the Visual Studio install directory under “\Microsoft Visual Studio 10.0\DIA SDK\bin\”

NM dump

Similar to the COMDAT dump, SymbolSort can accept symbol dumps from the Unix utility nm.  The symbols can be extracted from .obj files or entire .elf files.  SymbolSort supports bsd and sysv format dumps.  Sysv is preferred because it contains more information.  The recommended nm command lines are:

nm --format=sysv --demangle --line-numbers input_file.elf
nm --format=bsd --demangle --line-numbers --print-size input_file.elf
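
For readers who want to post-process the sysv output themselves, here is a small Python sketch of parsing one line.  It assumes the usual seven sysv columns (Name, Value, Class, Type, Size, Line, Section) separated by '|', with the file and line number from --line-numbers following the section name.  Splitting from the right is a choice made for this example so that demangled names containing '|' (operator| for instance) stay intact; parse_sysv_line is a hypothetical helper, not SymbolSort's parser.

```python
def parse_sysv_line(line):
    """Parse one line of `nm --format=sysv` output.

    Returns a dict, or None for headers, blank lines and symbols
    without a value or size (such as undefined symbols).
    """
    # Split from the right so '|' inside a demangled name is preserved.
    fields = line.rstrip("\n").rsplit("|", 6)
    if len(fields) != 7:
        return None
    name, value, _class, sym_type, size, _line, section = fields
    try:
        address = int(value.strip(), 16)
        byte_size = int(size.strip(), 16)
    except ValueError:
        return None
    # The section name may be followed by "file:line" from --line-numbers.
    section_parts = section.split(None, 1)
    return {
        "name": name.strip(),
        "address": address,
        "size": byte_size,
        "type": sym_type.strip(),
        "section": section_parts[0] if section_parts else "",
        "location": section_parts[1] if len(section_parts) > 1 else "",
    }


sample_line = (
    "std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)"
    "|00000000|   W  |              FUNC|00000058|     "
    "|.text._ZStorSt13_Ios_OpenmodeS_\tbits/ios_base.h:129"
)
symbol = parse_sysv_line(sample_line)
print(symbol)
```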

DOWNLOAD

SymbolSort-1.2.zip

BUILDING

The source for SymbolSort is distributed as a single file, SymbolSort.cs.  It can be built as a simple C# command line utility.  In order to get the msdia100 interop to work you must add msdia100.dll as a reference to the C# project.  That is done either by dragging and dropping the dll onto the references folder in the C# project or by right-clicking the references folder, selecting “Add Reference” and then browsing for the msdia100 dll.

REVISION HISTORY

1.2    + Upgraded to Visual Studio 2010 / msdia100.dll
       + Added -path_replace option to convert paths stored in PDBs.
       + Added -complete option to dump a full list of all symbols sorted by 
         address.
       + Added several options for controlling what symbols are included in PDB
         dumps since PDBs often list the same address redundantly under
         different labels.
1.1    + Added support for computing differences between multiple input sources
       + Added support for nm output for PS3 / unix platforms.
       + Changed command line parameters.  See usage for details.
       + Added section / type information to output.
1.0    + First release!

FUTURE WORK (to be done by someone else!)

  • Add a GUI frontend to allow interactive filtering and sorting.
  • Read both the PDB and the COMDAT dump simultaneously and cross-reference the two.  This would enable new kinds of analysis and richer dumps.
  • Produce additional merged symbol reports by merging all symbols from the same class or namespace or that match based on some more clever fuzzy comparison.
  • Improve relative -> absolute path conversion for nm inputs
  • Figure out how to extract string literal information from PDB.

19 Comments

  1. Nash says:

    Hi,

    I get the following crash:

    SymbolSort -in base.pdb -out "SymbolsBase.txt"
    Loading symbols from base.pdb

    Unhandled Exception: System.Runtime.InteropServices.COMException (0x80040154): Retrieving the COM class factory for component with CLSID {4C41678E-887B-4365-A09E-925D28DB33C2} failed due to the following error: 80040154.
    at SymbolSort.SymbolSort.ReadSymbolsFromPDB(List`1 symbols, String filename, String searchPath)
    at SymbolSort.SymbolSort.Main(String[] args)

    I use win64.
    The pdb is built for x64.
    I have only found a 32 bit version of msdia90.dll on my PC.

    Do you know why it crashes?

    Best regards,
    Nash

    1. Adrian Stone says:

      Make sure the class is registered using regsvr32 as described above. With a Visual Studio 2008 installation, the 64-bit version of msdia90 should be available. Alternately, you can switch to use the 32-bit version by recompiling the application specifically for the x86 target.

  2. Nash says:

    I’ve found only the 32 bit version. I registered that, but I got the same crash.
    Also when I use it to open my x86 PDB files.

    Where can I find a 64-bit version? Google doesn’t help.
    What can I do?
    I use VS2008

    1. Adrian Stone says:

      If you email me at ‘stone’ at this domain I’ll try to offer more targeted assistance.

  3. Nicolas says:

    Do you know if this compiles with Mono for Linux? It would be very nice to run this on Mac OS X or Linux.

    1. Adrian Stone says:

      The dependency on msdia90.dll will obviously have to be removed to run on linux, but the source code is pretty straightforward and it should port pretty easily.

  4. […] a thoroughly study with the help of tools like Sizer, Symbol Sort (I strongly recommend reading the articles associated to this tool: 1, 2, 3, 4, 5, 6 and passing […]

  5. phenix yu says:

    How to generate NM dump of PS3 for SymbolSort?

    I use ppu-lv2-nm to get one, but SymbolSort returns errors.

    Symbolsort -in:sysv dump.txt -out sym.txt
    Loading symbols from dump.txt
    Reading symbols… 26% complete
    System.FormatException:
    at System.Number.StringToNumber(String str, NumberStyles options, NumberBuffer& number, NumberFormatInfo info, Boolean parseDecimal)
    at System.Number.ParseInt32(String s, NumberStyles style, NumberFormatInfo info)
    at SymbolSort.SymbolSort.ParseSysvSymbol(String line, Symbol& symbol)
    at SymbolSort.SymbolSort.ReadSymbolsFromNM(List`1 symbols, String inFilename, InputType inType)
    at SymbolSort.SymbolSort.LoadSymbols(InputFile inputFile, List`1 symbols, String searchPath)
    at SymbolSort.SymbolSort.Main(String[] args)

    1. Adrian Stone says:

      I’ve used nm.exe on the PS3 with SymbolSort with success using the command line described above: “nm --format=sysv --demangle --line-numbers input_file.elf”
      If the parsing isn’t 100% robust, perhaps you can fix the code and submit a patch for me.

  6. philk says:

    Hello!

    Thanks for this series!
    I wanted to try the SymbolSort tool but it crashes:
    Loading symbols from comdat_dump.txt

    Unhandled Exception: System.Runtime.InteropServices.COMException (0x806D0012): Exception from HRESULT: 0x806D0012
    at Dia2Lib.DiaSourceClass.loadDataForExe(String executable, String searchPath, Object pCallback)
    at SymbolSort.SymbolSort.ReadSymbolsFromPDB(List`1 symbols, String filename, String searchPath)
    at SymbolSort.SymbolSort.LoadSymbols(InputFile inputFile, List`1 symbols, String searchPath)
    at SymbolSort.SymbolSort.Main(String[] args)

    1. Adrian Stone says:

      It sounds like you’re not specifying the correct type of input in the command line parameters. If comdat_dump.txt is a text comdat dump from dumpbin, use the command line -in:comdat comdat_dump.txt

  7. […] Here is a tool that looks quite interesting: SymbolSort […]

  8. Philip Bloom says:

    That’s a pretty neat tool. Thanks for making and sharing.

  9. Michał Cichoń says:

    Hi, I found a bug in your tool. There is a problem parsing the sysv format with C++ symbols. Fields are separated by the ‘|’ character, but C++ has operator|, so the line is split in the middle of the signature. This line causes an exception:

    std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)|00000000| W | FUNC|00000058| |.text._ZStorSt13_Ios_OpenmodeS_ D:/Programs/bbndk-2.1.0/target/qnx6/usr/include/c++/4.4.2/bits/ios_base.h:129

    I did a workaround:
    419c419,435
    > if (!Int32.TryParse(tokens[1], NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out rva))
    > {
    >     if (name == "std::operator" && tokens[1].StartsWith("("))
    >     {
    >         tokens[0] += '|' + tokens[1];
    >         for (int i = 1; i < tokens.Length - 1; ++i)
    >             tokens[i] = tokens[i + 1];
    >
    >         string[] extra = tokens[tokens.Length - 2].Split("|".ToCharArray(), 2);
    >         tokens[tokens.Length - 2] = extra[0];
    >         tokens[tokens.Length - 1] = extra[1];
    >
    >         name = tokens[0].Trim();
    >
    >         rva = Int32.Parse(tokens[1], NumberStyles.AllowHexSpecifier);
    >     }
    > }

    1. Adrian Stone says:

      Good catch, and thanks for posting your solution. I’ll roll this fix into the next version (>1.2).

    2. Michał Cichoń says:

      I’m afraid the code was crippled by this whole citation mechanism.
      Here you can find the actually working patch:
      http://pastebin.com/chKNsCfp

  10. Heather says:

    I really want to bookmark this specific article, “Symbol Sort : A Utility for Measuring C++ Code Bloat – Game Angst” on my web-site. Do you really mind in case I really do it? Thank you -Lakeisha

    1. Adrian Stone says:

      By all means, link to my articles as much as you want!

  11. stgatilov says:

    I once tried to implement a very basic code bloat analyzer by parsing MSVC assembly output with Python. The benefit of assembly files is that you can even draw a symbol dependency graph (if you compile without optimization). SymbolSort is definitely a much better tool: it is faster, more generic, and provides much better output and statistics.

    Right now I’m trying to find some easy opportunities to reduce build time of an excessively bloated C++ project (well, I suppose it is as bloated as any other large C++ project which does not care about build times).
    I think the main source of problem is basic template classes like e.g. MyVector defined in MyVector.h. Is there any simple way to see how much object code is generated for this specific header file in total?

    I know that I can analyze symbols dumped from all .obj files, but then header-resident methods like MyVector::clear are attributed to the .obj files they are used in, instead of the header file. Also, I can analyze the .pdb file, in which case symbols are properly attributed to the header files, but it does not take into account how many times the same symbol is compiled across different translation units.

    I think it should be possible to analyze .obj dumps, but then open .pdb files and look symbols up there (in order to set proper source file). I’ll probably try to fork SymbolSort and try to do this.
    Could you please comment if it sounds like a feasible idea?
