# Scalable Concurrency in Despair Engine

*April 5, 2012*

In my last article I promised a follow-up description of the memory allocation system that I wrote for [Despair Engine](http://www.day1studios.com/company/despair-engine/) during the development of F.E.A.R. 3. Before I get to that, however, I've been inspired by [Charles Bloom](http://cbloomrants.blogspot.com/)'s recent posts on the [job queue system](http://cbloomrants.blogspot.com/2011/12/12-03-11-worker-thread-system-with.html) that is apparently part of [RAD Game Tools](http://www.radgametools.com/)' Oodle, and I thought it might be interesting to describe the current job system and thread layout in Despair.

The "concurrency manager" in Despair Engine consists of worker threads and job queues. By default there is one pool of worker threads that share a single job queue, referred to as public workers, and some number of additional worker threads, each with its own job queue, referred to as private workers.

The job queues and their worker threads are relatively simple constructs. Jobs can be added to a queue from any thread and, when jobs are added, they create [futures](http://en.wikipedia.org/wiki/Futures_and_promises) which can be used to wait on the job's completion.
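A minimal sketch of such a queue, using `std::packaged_task` to create the future at the moment a job is enqueued. The names and details here are illustrative, not the engine's actual API:

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

// A shared job queue serviced by a pool of worker threads.
class JobQueue {
public:
    explicit JobQueue(unsigned workerCount) {
        for (unsigned i = 0; i < workerCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    ~JobQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        wake_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Jobs can be added from any thread; the returned future can be
    // used to wait on the job's completion.
    std::future<void> add(std::function<void()> job) {
        std::packaged_task<void()> task(std::move(job));
        std::future<void> result = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push_back(std::move(task));
        }
        wake_.notify_one();
        return result;
    }

private:
    void workerLoop() {
        for (;;) {
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;   // shutting down, queue drained
                task = std::move(jobs_.front());
                jobs_.pop_front();           // strictly FIFO fetch
            }
            task();                          // run the job; fulfills its future
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::deque<std::packaged_task<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```

With multiple workers constructed against one `JobQueue`, this behaves like the public pool described below; with a single worker, like a private one.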
A job queue doesn't directly support any concept of dependencies between jobs or affinities to limit which worker threads a job can run on.

Jobs are fetched by worker threads in strictly FIFO order, but with multiple worker threads servicing a single queue, that doesn't provide many guarantees about the order in which jobs are processed. For example, jobs ***A*** and ***B***, enqueued in that order, may be fetched consecutively by worker threads **1** and **2**. Because these tasks are asynchronous, worker thread **2** might actually manage to start and finish processing job ***B*** before worker thread **1** has done any meaningful work on job ***A***.

What this means in practice is that any job added to the public worker queue must be able to run on any public worker thread, concurrently with any other job in the queue. If a job really needs to enforce a dependency on other jobs, it can do so either by waiting on the futures of the jobs it depends on or by creating the jobs that depend on it at the end of its own execution. Combinations of these techniques can be used to create almost arbitrarily complex job behavior, but such job interactions inhibit maximum parallelism, so we try to avoid them in our engine.

Public worker threads are mostly used by [data-parallel](http://en.wikipedia.org/wiki/Data_parallelism) systems that divide their work into some multiple of the number of public workers, enqueue jobs for each worker, and then, at some later point in the frame, wait on all the jobs' futures. Examples of systems like this are animation, cloth simulation, physics, effects, and rendering.
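That fork/join pattern — divide the data, enqueue one job per worker, then block on all the futures later in the frame — looks roughly like this, with `std::async` standing in for enqueueing on the public job queue (`parallelSum` is a made-up example task, not an engine system):

```cpp
#include <algorithm>
#include <future>
#include <numeric>
#include <vector>

// Data-parallel fan-out: split the input across the workers,
// enqueue one job per chunk, then join on all the futures.
int parallelSum(const std::vector<int>& data, unsigned workerCount) {
    if (workerCount == 0) workerCount = 1;
    const size_t chunk = (data.size() + workerCount - 1) / workerCount;

    std::vector<std::future<int>> futures;
    for (size_t begin = 0; begin < data.size(); begin += chunk) {
        const size_t end = std::min(begin + chunk, data.size());
        // Each chunk is independent: no dependencies, any worker may run it.
        futures.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + begin, data.begin() + end, 0);
        }));
    }

    // At some later point in the frame, wait on all the jobs' futures.
    int total = 0;
    for (auto& f : futures) total += f.get();
    return total;
}
```

The key property is that the jobs share no mutable state, so they satisfy the public queue's requirement of running concurrently in any order.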
Although all of the users of the public job queue try to provide as much time as possible between adding work to the queue and blocking on its completion, work in the public job queue is considered high priority and latency intolerant.

Systems whose work can be processed asynchronously but doesn't meet the requirements of the public job queue create private job queues. The primary reason for using a private job queue is that the work is long but latency tolerant, and we don't want it delaying the latency-intolerant work in the public job queue.

In Fracture, terrain deformation was processed in a private worker thread, consuming as much as 15 ms a frame but at low priority.

[Terrain deformation in Fracture (video)](http://www.youtube.com/embed/m_o2BFvhqt4)

Starting with F.E.A.R. 3, decals have used a private worker thread. The decal thread generates and clips geometry from a queue. Although the queue might hold several hundred milliseconds of work, it will happily take only what time slices it can get from a single thread until it finishes its backlog.

We also use private worker threads for systems that have restrictions on their degree of parallelism or that perform primarily blocking work. Save data serialization, background resource loading, audio processing, network transport, and, on the PC, communication with the primary Direct3D context all fall into this category.

I've had occasion over the years to use the Despair concurrency manager for a wide variety of tasks with a wide range of requirements, and I have very few complaints.
I find the system simple and intuitive, and it is usually immediately obvious, given a new task, whether a private or public job queue is appropriate. I've occasionally wished for a richer set of scheduling options within the job queues themselves, but ultimately I believe that complex scheduling requirements are a symptom of bad multithreaded design, and that if complex scheduling truly is justified, it is better handled within the jobs themselves.

The one area where scheduling has given me a lot of trouble, however, and where I wish we could offer some improvement, is the interaction of public and private worker threads. When the platform, code, and content are reasonably stable, it isn't too difficult to arrange the public and private workers such that they share the available processor resources efficiently.

On the Xbox 360, for example, where there are six hardware threads available, we have the main thread with affinity to hardware thread 0, four public worker threads with separate affinities to hardware threads 1, 2, 3, and 5, and most of our private worker threads sharing affinity with hardware thread 4. This arrangement ensures that the main thread and the public workers are never interrupted by the private worker threads, and it means that the private workers get a roughly equal share of an entire hardware thread. We know exactly how fast the hardware is and we can predict with a high degree of accuracy how much work each private worker will receive, so we don't need to worry about oversubscription or underutilization of hardware thread 4.

In cases where the private worker threads aren't sharing the load in a way we consider optimal, we can tweak either the hardware thread affinities or the software thread priorities to get the behavior we want. For example, in F.E.A.R. 3 we offloaded some marshaling of data for network transport to a private worker thread.
Jobs for that thread were generated near the end of each frame and had to be completed near the beginning of the following frame. If the private workers were left to the OS scheduler, the decal thread might preempt the network thread during that crucial window and cause a stall in the next frame. Since we knew the network thread never generated more than 5-6 ms of single-threaded work, we could safely boost its thread priority and ensure that it was never preempted by decals.

In another case where we weren't 100% satisfied with the default scheduling of a private worker, we moved it to share a hardware thread with one of the public workers but also lowered its thread priority. The luxury of a fixed, single-process platform is that we can hand-tune the thread layout and be confident that our results will match those of our customers.

![Xbox 360 PIX capture](http://gameangst.com/wp-content/uploads/2012/03/XboxPIXCapture3.png)
*Example hardware thread utilization in F.E.A.R. 3, as captured by PIX on the Xbox 360*

In the capture above you can see examples of both of the situations I described.
The thin colored lines represent software threads, and the thick blocks above them represent individual jobs. A high-priority audio thread, shown in yellow, interrupts a job on hardware thread 4, but that's okay because the job being interrupted is latency tolerant. Later in the frame another thread, shown in light blue, schedules nicely alongside the latency-intolerant jobs in the pink public worker on hardware thread 5.

The PC is where things get messy. On the PC we worry most about two configurations: the low end, which for us is a dual-core CPU, and the high end, which is a hyper-threaded quad-core CPU.

Currently, on the PC we allocate MAX(1, num_logical_processors - 2) public worker threads. On a hyper-threaded quad core that means 6 public worker threads, and on a dual core just 1. Unlike on the Xbox 360, however, we don't specify explicit processor affinities for our threads, nor do we adjust thread priorities (except in obvious cases like an audio mixer thread). We don't know what other processes might be running concurrently with our game and, with variations in drivers and platform configurations, we don't even know what other third-party threads might be running in our own process. Constraining the Windows scheduler with affinities and thread priorities would likely lead to poor processor utilization or even thread starvation.

That's the conventional wisdom, anyway, but it sure doesn't look pretty in profiles. From a bird's eye view the job system appears to work as expected on the PC. As the number of cores increases, the game gets faster. Success!
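For reference, the allocation rule above comes out as the following (the helper name is hypothetical; at startup it would be fed `std::thread::hardware_concurrency()`):

```cpp
#include <thread>

// MAX(1, num_logical_processors - 2), written to avoid unsigned
// underflow when there are fewer than three logical processors.
unsigned publicWorkerCount(unsigned logicalProcessors) {
    return logicalProcessors > 3 ? logicalProcessors - 2 : 1;
}
```

So a hyper-threaded quad core (8 logical processors) gets 6 public workers, and a dual core gets 1, as described above.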
If it weren't for our internal thread profiler and the [Concurrency Visualizer](http://msdn.microsoft.com/en-us/magazine/ee336027.aspx) in Visual Studio 2010, we'd probably have been happy with that and moved on.

On high-end PCs things aren't too bad. Both our job queue visualization and Visual Studio's thread visualization sometimes show disappointing utilization of our public workers, but that's not necessarily a problem. We know we're oversubscribed: the engine creates more software threads than there are hardware threads, at least three other third-party threads in our process are doing meaningful work, and that's not to mention the other processes in the system. One of the benefits of a thread pool is that the number of participating threads can scale with the environment. Thankfully the behavior we usually see in these cases is fewer public workers tackling more jobs, rather than all public workers starting jobs and some then being preempted, which would block the job queue until they could be rescheduled.

![Dual core preemption](http://gameangst.com/wp-content/uploads/2012/03/DualCorePreemption.png)
*Job visualization from the same portion of two adjacent frames on a dual-core PC*

The image above is an example from our internal
thread profiler. I had to zoom and crop it a bit to fit on the page, but what you're seeing is the same portion of two frames on a dual-core machine. The colored blocks represent jobs. In the first frame you can see a long stretch in which the main thread is processing jobs while the single worker thread sits idle. The next frame shows the intended behavior, with both the main thread and the worker thread processing jobs equally. We can't visualize third-party threads or other processes with our internal profiler, so we have to turn to Visual Studio's profiler to see what's preempting our worker in that first frame. In our case it is usually the video driver or audio processor, but really any thread could be at fault. The more active threads in the system, including our own private workers, the more likely this sort of interference becomes.

The other behavior that is a little disappointing on many-core PCs is the high percentage of cross-core context switches. The Windows scheduler prioritizes quite a few factors above keeping a thread on its current core, so it isn't too big a surprise for threads to jump cores in an oversubscribed system. The cost is some nebulous loss of CPU cache warmth that is all but impossible to measure. Short of setting explicit processor affinities for our threads, which hurts overall performance, I haven't had any luck improving this behavior.
I had hoped to combat this effect with [SetThreadIdealProcessor](http://msdn.microsoft.com/en-us/library/windows/desktop/ms686253%28v=vs.85%29.aspx), but I haven't actually been able to detect any change in scheduling when calling this function, so we don't use it.

On a high-end PC, as [Louis C.K.](http://youtu.be/8r1CZTLk-Gk) might say, these are [first world problems](http://first-world-problems.com/). With 8 logical processors we can afford to be less than perfect. From my profiles, even the best PC games are barely utilizing 30% of an 8-processor PC, and we're comfortably within that range, so I'm not complaining.

On dual-core machines these issues can't be ignored. With only two hardware threads, we're now massively oversubscribed. The particularly difficult situation we have to cope with is when all of our private workers are fully occupied at the same time. As I explained earlier, the decal thread is latency tolerant, but it can buffer far more than a single frame's worth of work. This means that, left unchecked, it alone can consume a full core for a full frame. Video drivers usually have their own threads which might, under heavy load, consume 25% of a core; audio might want another 20%, and Steam another 20%. All told we can have two thirds of a core's worth of work in miscellaneous secondary threads and another full core's worth of work in the decal thread. That's 1.7 cores' worth of work competing on a level playing field with the main job queue on a machine with only 2 cores!

For most of these threads we have a general optimization problem more than a concurrency problem. We don't have much flexibility in when or how they run; we just need to lower their cost.
The decal thread, on the other hand, is different. Its purpose is to periodically consume far more work than would normally be budgeted for a single frame and to amortize the cost of that work over multiple frames. If it is impacting the execution of other threads, then it isn't doing its job.

My first reaction to this problem was, as usual, to wish for a more sophisticated scheduler in the public job queue. It seemed as though an easy solution would be to put decal jobs in the public job queue and instruct the scheduler to budget some fraction of every second to decal processing, while trying to schedule decal jobs only when no other jobs are pending. After some consideration, however, I realized that this was asking too much of the scheduler and, perversely, would still require a lot of work in the decal system itself. Since the job scheduler isn't preemptive, even a powerful system of budgets and priorities would rely on the jobs themselves being of sufficiently small granularity. The decal system would have to break up large jobs to assist the scheduler or, similarly, implement a cooperative yielding strategy that returned control to the scheduler mid-execution.

In addition to assisting the scheduler, the decal system would also have to become aware of how its rate of production was being throttled. Since decal resources are recycled using a heuristic that is heavily LRU, the decal system must manage the rate of input requests to match the rate of production, in order to ensure that decals aren't recycled as soon as they are created.

It seems that any additional complexity added to the job scheduler would require equal complexity to be added to the decal system in order to take advantage of it.
That's always a red flag for me in systems design.

I'm still weighing some options for dealing with the decal thread, but my current favorite is to swallow my fears and seek help from the OS thread scheduler. If we reduce the thread priority of the decal thread under Windows, it will only be given time when the cores would otherwise be idle. However, since Windows implements [thread boosts](http://msdn.microsoft.com/en-us/library/windows/desktop/ms684828%28v=vs.85%29.aspx), even in a completely saturated environment the decal thread won't starve completely. Nevertheless, this is a risky strategy, because it creates the ability for the decal thread to block other threads through [priority inversion](http://en.wikipedia.org/wiki/Priority_inversion). This probably isn't a scalable long-term solution, but given our current thread layout and hardware targets, it achieves the desired result.

The difficulty of achieving maximum throughput on multicore processors is often talked about in games, but what is less often talked about is how much harder this is on the PC than on the consoles. Maximizing throughput on high-end PCs is great, but, as I've shown, it must be done without sacrificing response time on low-end PCs.
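The priority-based approach for the decal thread discussed above amounts to a single Win32 call. A sketch, assuming a `std::thread`-based worker; `makeLatencyTolerant` is my name for it, and the non-Windows branch is just a no-op placeholder:

```cpp
#include <thread>
#ifdef _WIN32
#include <windows.h>
#endif

// Lower a worker thread's priority so it mostly runs when a core would
// otherwise be idle. On Windows, the scheduler's priority boosts keep
// the thread from starving outright even under full saturation.
bool makeLatencyTolerant(std::thread& worker) {
#ifdef _WIN32
    // std::thread::native_handle() is a HANDLE on Windows.
    return SetThreadPriority(worker.native_handle(),
                             THREAD_PRIORITY_LOWEST) != 0;
#else
    (void)worker;   // no-op outside Windows in this sketch
    return false;
#endif
}
```

The priority-inversion risk noted above is why a call like this is applied only to threads, such as the decal worker, that never hold locks other threads block on for long.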
With our current approach I've been pretty pleased with our progress in this area, but I'm nevertheless having a hard time envisioning a day when we can fully utilize the resources of an 8-processor machine and still provide a compatible play experience on a lowly dual core.

Hopefully by that time we'll have broad support for GPGPU tasks, as well as the scheduling flexibility to rely on them for more of our latency-tolerant work.