Low-Fragmentation, High-Performance Memory Allocation in Despair Engine

I recently wrote about dlmalloc and how it is a poor choice for a memory allocator for console games.  As I explained in my previous article, dlmalloc has two major limitations.  It manages a pool of address space composed of discrete regions called segments.  It can easily add segments to grow its pool of address space, but it can’t easily remove segments to return address space to the OS.  Additionally, it doesn’t distinguish between physical and virtual memory, which means that it can’t take advantage of virtual memory’s ability to combat fragmentation.

During the development of FEAR 3, to address these limitations, I implemented a new general-purpose allocator based on dlmalloc called dsmalloc.  Dsmalloc makes two important improvements to dlmalloc.

The first improvement is in how dsmalloc treats segments.  In dsmalloc, segments are always managed as completely distinct regions of memory.  Dsmalloc will never coalesce adjacent segments.  Because of this, dsmalloc tracks segments much more closely than dlmalloc.  By overlaying a second intrusive data structure in the free bits of the boundary tag structure used to track individual allocations, dsmalloc can, in constant time and with no memory overhead, determine with every free operation when a segment is no longer occupied and can be returned to the system.

The second improvement is that dsmalloc recognizes the difference between reserved address space and allocated memory.  Unlike dlmalloc, which uses two callbacks to allocate and free system memory, dsmalloc exposes four callbacks, two to reserve and release segment-sized regions of virtual address space, and two to commit and decommit page-sized subregions of segments.

Just as dsmalloc never leaves any unoccupied segments reserved, it also never leaves any unoccupied pages committed.

typedef void* (*ReserveSegmentFunc)(size_t size);
typedef void (*ReleaseSegmentFunc)(void* ptr, size_t size);
typedef int (*CommitPageFunc)(void* ptr, size_t size);
typedef void (*DecommitPageFunc)(void* ptr, size_t size);

MemorySpace* CreateMemorySpace(
    size_t initialCapacity,
    ReserveSegmentFunc reserveSegmentFunc,
    ReleaseSegmentFunc releaseSegmentFunc,
    CommitPageFunc commitPageFunc,
    DecommitPageFunc decommitPageFunc,
    size_t pageSize,
    size_t segmentGranularity,
    size_t segmentThreshold );

These two extensions are implemented without any additional memory overhead compared to dlmalloc and with very little computational overhead.  The added cost is a few percent in targeted tests, but in a real-world scenario the performance difference between dlmalloc and dsmalloc isn’t even measurable.

The great thing about these two changes to dlmalloc is that they enable a wide range of allocation strategies that otherwise wouldn’t be feasible.

Individual systems within the Despair Engine are free to create separate dsmalloc instances (or regions) for their own use.  Because dsmalloc instances are so aggressive about returning memory to the system, they incur minimal internal fragmentation costs and can coexist gracefully with other system allocators.  Thread-safety is left to the users of dsmalloc, so individual systems can bypass thread synchronization costs entirely or at least use local locks to avoid contention with other allocators.  Using separate dsmalloc instances also provides systems with easier tracking of memory allocations, tighter enforcement of budgets, and, if their allocation patterns exhibit temporal or spatial locality within their own systems, reduced external fragmentation.

For systems that don’t want to bother with using a custom allocator (which, frankly, most don’t), Despair Engine provide a common allocator which services request from the tradition global allocation functions like operator new, malloc, and XMemAlloc.  This common allocator also utilizes dsmalloc instances under the hood.

The common allocator creates one dsmalloc instance for allocations larger than 256 bytes and 20 dsmalloc instances covering allocations at every 8 byte size interval less than 64 bytes and every 16 byte interval between 64 and 256 bytes.  The instance for large allocations uses a 32 megabyte segment size and 64 kilobyte pages whereas the small allocation instances use 64 kilobyte segments.  Using page-sized segments for the small allocation regions is a minor optimization that removes the need for dsmalloc to track address reservation and memory commission separately for these regions.

Bucketing small allocations into so many discrete regions significantly reduces external fragmentation in our games, despite creating a modest increase in internal fragmentation.  Since only allocations within a single region need to be synchronized, it also has a side benefit of greatly reducing contention between allocations from multiple threads.

It is worth noting that since the various small allocation regions are essentially used for fixed-size allocations, they could be more efficiently implemented as dedicated fixed-block allocators.  We have such an implementation in the engine, but dsmalloc (like dlmalloc) already implements an internal small, fixed-block allocation optimization, so in practice it is more convenient to use dsmalloc instances for everything and almost as efficient.

One key benefit of using dsmalloc instances for small allocations instead of a pure fixed-block allocator is that it offers us more flexibility in how the small allocation regions are configured.  At the time I was implementing this in FEAR 3, minimizing fragmentation was our top concern, but in the future we might choose to prioritize thread synchronization efficiency over memory consumption.  Instead of routing allocations to regions based on their size, we could create just a few small allocation regions and cycle between them based on thread contention.  The idea is to allow a thread to try to lock a region for allocation, but rather than waiting if the region is already in use by another thread, simply move on to another region and try again.

Dsmalloc is flexible enough to support both of these strategies efficiently or, in fact, any hybrid combination of the two.

The dsmalloc callbacks are designed to be easily mappable to OS functions such as VirtualAlloc in Windows, but sometimes these functions are too expensive to be used directly.  To improve performance, on some platforms the Despair general allocator utilizes a commit cache.  The commit cache is a simple direct-mapped cache that sits between the dsmalloc commit callback and the OS.  Dsmalloc already optimizes allocation order to maximize cache efficiency, and this benefits the commit cache as well.  A 32 megabyte commit cache is probably overly generous, but it guarantees that OS-level calls don’t show up in our profiles even during content streaming transitions.

Having the cache implemented external to dsmalloc is also useful.  When memory is extremely tight, anything that doesn’t utilize the general allocator could fail even though memory is still available in the commit cache.  In those extreme cases the commit cache can be manually prompted to return pages to the system ensuring that OS-level allocations never fail because the memory they need is held in a cache.

There is one additional complication that plagues the consoles.  On both the Xbox 360 and the Playstation 3, the GPU does not share the CPU’s memory mapping unit.  On the Xbox 360 the GPU requires physically contiguous memory for all resources and on the PS3 the GPU has a separate MMU with only 1 megabyte page granularity.  Since we use dsmalloc with 64 kilobyte pages to minimize fragmentation, this means we’re susceptible to physical memory fragmentation when it comes to resources allocated for the GPU.  On the Xbox 360, unbeknownst to many developers, the OS copes with this automatically.  When a physical allocation request can’t be satisfied due to physical memory fragmentation, the Xbox 360 operating system locks the entire address space and defragments physical memory pages (by memcpy’ing physical pages and remapping virtual addresses) to accommodate the request.

On the PS3 the OS doesn’t provide this service automatically, but the same basic need exists.  64 kilobyte pages must be locked and relocated to generate contiguous 1 megabyte pages appropriate for the GPU.  Thankfully with a little bit of acrobatics it is possible to do just that without violating certification requirements.

Although the operation is every bit as expensive as it sounds, it proved necessary to keep FEAR 3 from crashing.  FEAR 3 ran dangerously close to the limits of memory on the Playstation 3 and survived only by allowing flexible budgets for every kind of memory.  GPU allocations in main memory varied by over 100% between streaming regions so not only did CPU addressable memory need to be defragmented for use by the GPU, GPU addressable memory had to be defragmented continually and returned to the CPU.  The really expensive CPU defragmentation provided a safety net against crashes at all times, but thankfully it was only needed at major chapter transitions where a load screen was expected and a 100 ms hitch was infinitely preferable to an out and out crash.

When I undertook to rewrite the Despair memory allocator near the end of FEAR 3, the overarching goal was to never fail to satisfy an allocation request due to prior memory usage patterns.  If a level fit just barely into memory after a fresh boot, we wanted it to fit equally well in memory after a hundred hours of continuous play.  While I can’t pretend that fragmentation doesn’t exist at all in the system I devised, I can say that it was so dramatically reduced that we were able to establish an upper bound on fragmentation and enforce budgets for our content that were reliable regardless of the order or duration in which the game was played.

The new Despair memory allocator based on dsmalloc ran the same content as the old memory allocator on all platforms with a slightly lower lower bound on memory, a much lower upper bound on memory, and it even managed a slightly lower run-time performance cost.

9 Comments

  1. João says:

    Thanks for this great quality post. 🙂

    Any chance these improvements could be open-sourced?

    1. Adrian Stone says:

      I would really love to give back to the community, so hopefully someday. There are also proponents in my company of licensing the package, so that’s a possibility too.

  2. Michael says:

    I like the idea, thanks for sharing.
    Could you clarify CreateMemorySpace arguments?
    Also MemorySpace interface, is it just like STL allocator (allocate/deallocate) or something more?

    1. Adrian Stone says:

      Sorry, I realize that code snippet isn’t very clear. It helps if you are very familiar with dlmalloc since dsmalloc is designed to match dlmalloc very closely. Here is the entire header for dsmalloc, which will hopefully clarify things.


      NAMESPACE_BEGIN(Despair)
      NAMESPACE_BEGIN(DSMalloc)

      // An opaque type representing an independent region of space that
      // supports Alloc, etc. with custom policies
      struct MemorySpace;

      // The minimum alignment guaranteed for all allocations
      const size_t kDefaultAlignment = 8;

      // User defined functions to Reserve and Release address space
      // as well as functions to Commit and Decommit memory.
      typedef void* (*ReserveSegmentFunc)(size_t size);
      typedef void (*ReleaseSegmentFunc)(void* ptr, size_t size);
      typedef int (*CommitPageFunc)(void* ptr, size_t size);
      typedef void (*DecommitPageFunc)(void* ptr, size_t size);

      // Creates and returns a new MemorySpace or NULL on failure
      // The initial capacity of the MemorySpace will be initialCapacity
      // or segmentGranularity if capacity is 0.
      // The Reserve and Release functions must be specified to allow
      // the MemorySpace to allocate address space. If the Commit and
      // Decommit functions are NULL, the Reserve and Release functions
      // are assumed to perform commit / decommit automatically.
      // The pageSize is both the minimum granularity at which memory
      // can be Reserved, Released, Committed, and Decommitted, and the
      // guaranteed alignment of all those operations.
      // The segmentThreshold is the size of an allocation request above
      // which an independent call to ReserveSegmentFunc will be used
      // to satisfy the request, ensuring that allocation will not be
      // pooled with other allocations.
      MemorySpace* CreateMemorySpace(size_t initialCapacity,
      ReserveSegmentFunc reserveSegmentFunc,
      ReleaseSegmentFunc releaseSegmentFunc,
      CommitPageFunc commitPageFunc,
      DecommitPageFunc decommitPageFunc,
      size_t pageSize,
      size_t segmentGranularity,
      size_t segmentThreshold);

      // Same as CreateMemorySpace but specifies a base segment. If no
      // Reserve function is specified, the MemorySpace will not grow
      // beyond this initial capacity. If Commit and Decommit functions
      // are specified, the base segment will be committed and decommitted
      // like any other segment.
      MemorySpace* CreateMemorySpaceWithBase(void* baseSegment,
      size_t baseSegmentCapacity,
      ReserveSegmentFunc reserveSegmentFunc,
      ReleaseSegmentFunc releaseSegmentFunc,
      CommitPageFunc commitPageFunc,
      DecommitPageFunc decommitPageFunc,
      size_t pageSize,
      size_t segmentGranularity,
      size_t segmentThreshold);

      // Destroys the given space, and attempts to return all of its memory
      // back to the system, returning the total number of bytes freed.
      // After destruction, the results of access to all memory used by the
      // space become undefined.
      size_t DestroyMemorySpace(MemorySpace* msp);

      void* Alloc(MemorySpace* msp, size_t bytes);
      void Free(MemorySpace* msp, void* mem);
      void* Realloc(MemorySpace* msp, void* mem, size_t newSize);
      void* AllocAligned(MemorySpace* msp, size_t alignment, size_t bytes);

      // Returns the current amount of memory requested from SegmentReserveFunc
      size_t GetTotalReservedMemory(MemorySpace* msp);

      // Returns the highwater mark for memory requested from SegmentReserveFunc
      size_t GetMaxReservedMemory(MemorySpace* msp);

      // Returns the amount of usable space in a memory chunk. Works with both
      // allocated and free chunks returned by iteration functions.
      size_t GetUsableSize(void*);

      // Iterates over user allocated chunks.
      void* FindFirstAllocation(MemorySpace* msp);
      void* FindNextAllocation(MemorySpace* msp, void* mem);

      // Iterates over free chunks. These chunks cannot be used. The only valid
      // operation on a free chunk is GetUsableSize and inspecting the address
      // to measure fragmentation.
      void* FindFirstFreeBlock(MemorySpace* msp);
      void* FindNextFreeBlock(MemorySpace* msp, void* mem);

      void ValidateMemorySpace(MemorySpace* msp);

      NAMESPACE_END(DSMalloc)
      NAMESPACE_END(Despair)

    2. Michael says:

      “Dump of code is worth a thousand words.” )
      Got two questions after reading it.

      I assume Despair engine is mostly C++ based, except for portable libraries like dsmalloc. So it’s strange seeing only Realloc(), which is not that friendly to C++ non-POD types. I prefer function similar to dlmalloc’s dlrealloc_in_place(), that actually just kind of “resize” and allows explicit reallocation and construction/destruction of objects. Well, the question is, do you use Realloc() in C++ codepath or is it here just for compatibility?

      And the second, less related to memory management. How do you manage creation of dsmalloc instances for common allocator and their use in stateless new/malloc/stl_allocator, especially in multithreaded environment?
      Is it static initialization, singleton, explicit strict initialization order or something else?

      Hope, I’m not annoying you with all that questions.

    3. Adrian Stone says:

      I totally agree about dlrealloc_in_place. It would be a worthwhile addition to dsmalloc, but as it happens, we don’t really use realloc very much. One reason why we haven’t had a strong need for dlrealloc_in_place is that we have a type trait for many of our common C++ objects that identifies them as trivially movable. Trivially movable objects can be handled efficiently by STL-like containers and algorithms without the need for C++11.

      Regarding creating and initialization of memory spaces, it depends on how they are used. The common allocator creates it memory spaces at static initialization time, using compiler-specific techniques to ensure they are created as early as possible (and destroyed as late as possible). We want the common allocator used for as many allocations as possible and our leak detection at shutdown to be as comprehensive as possible.

      Other memory spaces are created and destroyed by the systems that actually use them. If the audio system puts all audio in a custom memory space, the “AudioManager” is responsible for creating and destroying that memory space. Our top-level managers are created explicitly in main() rather than being singletons.

    4. Michael says:

      Thanks for detailed explanation. It looks like my memory management system will have a major redesign soon 🙂

  3. Rahul says:

    I was curious as to when an allocation which was re-sized using dlrealloc_in place is being freed, how does your common allocator map it to the correct dsmalloc instance?

    1. Adrian Stone says:

      All the DSMalloc functions take the memory space as a parameter, so it is up to the caller to remember the correct memory space. Some higher-level allocators we have use multiple memory spaces and store the memory space associated with each allocation in the allocation header or footer. This allows them to determine the correct memory space from an arbitrary pointer.

Leave a Reply