{"id":585,"date":"2012-05-13T23:14:48","date_gmt":"2012-05-14T03:14:48","guid":{"rendered":"http:\/\/gameangst.com\/?p=585"},"modified":"2012-05-13T23:14:48","modified_gmt":"2012-05-14T03:14:48","slug":"low-fragmentation-high-performance-memory-allocation-in-despair-engine","status":"publish","type":"post","link":"http:\/\/gameangst.com\/?p=585","title":{"rendered":"Low-Fragmentation, High-Performance Memory Allocation in Despair Engine"},"content":{"rendered":"<p><a href=\"http:\/\/gameangst.com\/?p=496\">I recently wrote about dlmalloc<\/a> and how it is a poor choice for a memory allocator for console games. \u00a0As I explained in my previous article, dlmalloc has two major limitations. \u00a0It\u00a0manages a pool of address space composed of discrete regions called <em>segments<\/em>. \u00a0It can easily add segments to grow its pool of address space, but it can&#8217;t easily remove segments to return address space to the OS. \u00a0Additionally, it doesn&#8217;t distinguish between physical and virtual memory, which means that it can&#8217;t take advantage of virtual memory&#8217;s ability to combat fragmentation.<\/p>\n<p>During the development of FEAR 3, to address these limitations, I implemented a new general-purpose allocator based on <a href=\"http:\/\/g.oswego.edu\/dl\/html\/malloc.html\">dlmalloc<\/a> called <em>dsmalloc<\/em>.\u00a0\u00a0<em>Dsmalloc<\/em> makes two important improvements to dlmalloc.<\/p>\n<p>The first improvement is in how <em>dsmalloc<\/em> treats segments. \u00a0In <em>dsmalloc<\/em>, segments are always managed as completely distinct regions of memory.\u00a0 <em>Dsmalloc<\/em> will never coalesce adjacent segments.\u00a0 Because of this, <em>dsmalloc<\/em> tracks segments much more closely than dlmalloc.\u00a0 By overlaying a second intrusive data structure in the free bits of the boundary tag structure used to track individual allocations, <em>dsmalloc<\/em> can, in constant time and with no memory overhead, determine with every free operation when a segment is no longer occupied and can be returned to the system.<\/p>\n<p>The second improvement is that <em>dsmalloc<\/em> recognizes the difference between reserved address space and allocated memory.\u00a0 Unlike dlmalloc, which uses two callbacks to allocate and free system memory, <em>dsmalloc<\/em> exposes four callbacks, two to reserve and release segment-sized regions of virtual address space, and two to commit and decommit page-sized subregions of segments.<\/p>\n<p>Just as <em>dsmalloc<\/em> never leaves any unoccupied segments reserved, it also never leaves any unoccupied pages committed.<\/p>\n<blockquote>\n<div class=\"dean_ch\" style=\"white-space: nowrap;\"><span class=\"kw4\">typedef<\/span> <span class=\"kw4\">void<\/span>* <span class=\"br0\">&#40;<\/span>*ReserveSegmentFunc<span class=\"br0\">&#41;<\/span><span class=\"br0\">&#40;<\/span><span class=\"kw4\">size_t<\/span> size<span class=\"br0\">&#41;<\/span>;<br \/>\n<span class=\"kw4\">typedef<\/span> <span class=\"kw4\">void<\/span> <span class=\"br0\">&#40;<\/span>*ReleaseSegmentFunc<span class=\"br0\">&#41;<\/span><span class=\"br0\">&#40;<\/span><span class=\"kw4\">void<\/span>* ptr, <span class=\"kw4\">size_t<\/span> size<span class=\"br0\">&#41;<\/span>;<br \/>\n<span class=\"kw4\">typedef<\/span> <span class=\"kw4\">int<\/span> <span class=\"br0\">&#40;<\/span>*CommitPageFunc<span class=\"br0\">&#41;<\/span><span class=\"br0\">&#40;<\/span><span class=\"kw4\">void<\/span>* ptr, <span class=\"kw4\">size_t<\/span> size<span class=\"br0\">&#41;<\/span>;<br \/>\n<span class=\"kw4\">typedef<\/span> <span class=\"kw4\">void<\/span> <span class=\"br0\">&#40;<\/span>*DecommitPageFunc<span class=\"br0\">&#41;<\/span><span class=\"br0\">&#40;<\/span><span class=\"kw4\">void<\/span>* ptr, <span class=\"kw4\">size_t<\/span> size<span class=\"br0\">&#41;<\/span>;<\/p>\n<p>MemorySpace* CreateMemorySpace<span class=\"br0\">&#40;<\/span><br \/>\n&nbsp; &nbsp; <span class=\"kw4\">size_t<\/span> initialCapacity,<br \/>\n&nbsp; &nbsp; ReserveSegmentFunc reserveSegmentFunc,<br \/>\n&nbsp; &nbsp; ReleaseSegmentFunc releaseSegmentFunc,<br \/>\n&nbsp; &nbsp; CommitPageFunc commitPageFunc,<br \/>\n&nbsp; &nbsp; DecommitPageFunc decommitPageFunc,<br \/>\n&nbsp; &nbsp; <span class=\"kw4\">size_t<\/span> pageSize,<br \/>\n&nbsp; &nbsp; <span class=\"kw4\">size_t<\/span> segmentGranularity,<br \/>\n&nbsp; &nbsp; <span class=\"kw4\">size_t<\/span> segmentThreshold <span class=\"br0\">&#41;<\/span>;<\/div>\n<\/blockquote>\n<p>These two extensions are implemented without any additional memory overhead compared to dlmalloc and with very little computational overhead.\u00a0 The added cost is a few percent in targeted tests, but in a real-world scenario the performance difference between dlmalloc and <em>dsmalloc<\/em> isn&#8217;t even measurable.<\/p>\n<p>The great thing about these two changes to dlmalloc is that they enable a wide range of allocation strategies that otherwise wouldn&#8217;t be feasible.<\/p>\n<p>Individual systems within the Despair Engine are free to create separate <em>dsmalloc<\/em> instances (or <em>regions<\/em>)\u00a0for their own use.\u00a0 Because <em>dsmalloc<\/em> instances are so aggressive about returning memory to the system, they incur minimal internal fragmentation costs and can coexist gracefully with other system allocators. \u00a0Thread-safety is left to the users of <em>dsmalloc<\/em>, so individual systems can bypass thread synchronization costs entirely or at least use local locks to avoid contention with other allocators. \u00a0Using separate <em>dsmalloc<\/em>\u00a0instances also provides systems with\u00a0easier tracking of memory allocations, tighter enforcement of budgets, and, if their allocation patterns exhibit temporal or spatial locality within their own systems, reduced external fragmentation.<\/p>\n<p>For systems that don&#8217;t want to bother with using a custom allocator (which, frankly, most don&#8217;t), Despair Engine provide a common allocator which services request from the tradition global allocation functions like operator new, malloc, and XMemAlloc.\u00a0 This common allocator also utilizes <em>dsmalloc<\/em> instances under the hood.<\/p>\n<p>The common allocator creates one <em>dsmalloc<\/em> instance for allocations larger than 256 bytes and 20 <em>dsmalloc<\/em> instances covering allocations at every 8 byte size interval less than 64 bytes and every 16 byte interval between 64 and 256 bytes.\u00a0 The instance for large allocations uses a 32 megabyte segment size and 64 kilobyte pages whereas the small allocation instances use 64 kilobyte segments.\u00a0 Using page-sized segments for the small allocation regions is a minor optimization that removes the need for <em>dsmalloc<\/em> to track address reservation and memory commission separately for these regions.<\/p>\n<p>Bucketing small allocations into so many discrete regions significantly reduces external fragmentation in our games, despite creating a modest increase in internal fragmentation.\u00a0 Since only allocations within a single region need to be synchronized, it also has a side benefit of greatly reducing contention between allocations from multiple threads.<\/p>\n<p>It is worth noting that since the various small allocation regions are essentially used for fixed-size allocations, they could be more efficiently implemented as dedicated fixed-block allocators.\u00a0 We have such an implementation in the engine, but <em>dsmalloc<\/em> (like dlmalloc) already implements an internal small, fixed-block allocation optimization, so in practice it is more convenient to use\u00a0<em>dsmalloc<\/em> instances for everything and almost as efficient.<\/p>\n<p>One key benefit of using <em>dsmalloc<\/em>\u00a0instances for small allocations instead of a pure fixed-block allocator is that it offers us more flexibility in how the small allocation regions are configured.\u00a0 At the time I was implementing this in FEAR 3, minimizing fragmentation was our top concern, but in the future we might choose to prioritize thread synchronization efficiency over memory consumption.\u00a0 Instead of routing allocations to regions based on their size, we could create just a few small allocation regions and cycle between them based on thread contention.\u00a0 The idea is to allow a thread to try to lock a region for allocation, but rather than waiting if the region is already in use by another thread, simply move on to another region and try again.<\/p>\n<p><em>Dsmalloc<\/em> is flexible enough to support both of these strategies efficiently or, in fact, any hybrid combination of the two.<\/p>\n<p>The <em>dsmalloc<\/em> callbacks are designed to be easily mappable to OS functions such as <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/windows\/desktop\/aa366887(v=vs.85).aspx\">VirtualAlloc<\/a> in Windows, but sometimes these functions are too expensive to be used directly.\u00a0 To improve performance, on some platforms the Despair general allocator utilizes a commit cache.\u00a0 The commit cache is a simple <a href=\"http:\/\/dictionary.reference.com\/browse\/direct+mapped+cache\">direct-mapped cache<\/a> that sits between the <em>dsmalloc<\/em> commit callback and the OS. \u00a0<em>Dsmalloc<\/em> already optimizes allocation order to maximize cache efficiency, and this benefits the commit cache as well.\u00a0 A 32 megabyte commit cache is probably overly generous, but it guarantees that OS-level calls don&#8217;t show up in our profiles even during content streaming transitions.<\/p>\n<p>Having the cache implemented external to <em>dsmalloc<\/em> is also useful.\u00a0 When memory is extremely tight, anything that doesn&#8217;t utilize the general allocator could fail even though memory is still available in the commit cache.\u00a0 In those extreme cases the commit cache can be manually prompted to return pages to the system ensuring that OS-level allocations never fail because the memory they need is held in a cache.<\/p>\n<p>There is one additional complication that plagues the consoles.\u00a0 On both the Xbox 360 and the Playstation 3, the GPU does not share the CPU&#8217;s memory mapping unit.\u00a0 On the Xbox 360 the GPU requires physically contiguous memory for all resources and on the PS3 the GPU has a separate MMU with only 1 megabyte page granularity.\u00a0 Since we use <em>dsmalloc<\/em> with 64 kilobyte pages to minimize fragmentation, this means we&#8217;re susceptible to physical memory fragmentation when it comes to resources allocated for the GPU.\u00a0 On the Xbox 360, unbeknownst to many developers, the OS copes with this automatically.\u00a0 When a physical allocation request can&#8217;t be satisfied due to physical memory fragmentation, the Xbox 360 operating system locks the entire address space and defragments physical memory pages (by memcpy&#8217;ing physical pages and remapping virtual addresses) to accommodate the request.<\/p>\n<p>On the PS3 the OS doesn&#8217;t provide this service automatically, but the same basic need exists.\u00a0 64 kilobyte pages must be locked and relocated to generate contiguous 1 megabyte pages appropriate for the GPU.\u00a0 Thankfully with a little bit of acrobatics it is possible to do just that without violating certification requirements.<\/p>\n<p>Although the operation is every bit as expensive as it sounds, it proved necessary to keep FEAR 3 from crashing.\u00a0 FEAR 3 ran dangerously close to the limits of memory on the Playstation 3 and survived only by allowing flexible budgets for every kind of memory.\u00a0 GPU allocations in main memory varied by over 100% between streaming regions so not only did CPU addressable memory need to be defragmented for use by the GPU, GPU addressable memory had to be defragmented continually and returned to the CPU. \u00a0The really expensive CPU defragmentation provided a safety net against crashes at all times, but thankfully it was only needed at major chapter transitions where a load screen was expected and a 100 ms hitch was infinitely preferable to an out and out crash.<\/p>\n<p>When I undertook to rewrite the Despair memory allocator near the end of FEAR 3, the overarching goal was to never fail to satisfy an allocation request due to prior memory usage patterns.\u00a0 If a level fit just barely into memory after a fresh boot, we wanted it to fit equally well in memory after a hundred hours of continuous play.\u00a0 While I can&#8217;t pretend that fragmentation doesn&#8217;t exist at all in the system I devised, I can say that it was so dramatically reduced that we were able to establish an upper bound on fragmentation and enforce budgets for our content that were reliable regardless of the order or duration in which the game was played.<\/p>\n<p>The new\u00a0Despair memory allocator based on <em>dsmalloc<\/em> ran the same content as the old memory allocator on all platforms with a slightly lower lower bound on memory, a much lower upper bound on memory, and it even managed a slightly lower run-time performance cost.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently wrote about dlmalloc and how it is a poor choice for a memory allocator for console games. \u00a0As I explained in my previous article, dlmalloc has two major limitations. \u00a0It\u00a0manages a pool of address space composed of discrete regions called segments. \u00a0It can easily add segments to grow its pool of address space, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[11,7],"_links":{"self":[{"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/posts\/585"}],"collection":[{"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/gameangst.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=585"}],"version-history":[{"count":45,"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/posts\/585\/revisions"}],"predecessor-version":[{"id":667,"href":"http:\/\/gameangst.com\/index.php?rest_route=\/wp\/v2\/posts\/585\/revisions\/667"}],"wp:attachment":[{"href":"http:\/\/gameangst.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=585"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/gameangst.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=585"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/gameangst.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=585"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}