In my last article I promised a follow-up description of the memory allocation system that I wrote for Despair Engine during the development of F.E.A.R. 3. Before I get to that, however, I’ve been inspired by Charles Bloom’s recent posts on the job queue system that is apparently part of RAD Game Tools’ Oodle, and I thought it might be interesting to describe the current job system and thread layout in Despair.
The “concurrency manager” in Despair Engine consists of worker threads and job queues. By default there is one pool of worker threads that share a single job queue, referred to as public workers, and some number of additional worker threads, each with its own job queue, referred to as private workers.
The job queues and their worker threads are relatively simple constructs. Jobs can be added to a queue from any thread and, when jobs are added, they create futures which can be used to wait on the job’s completion. A job queue doesn’t directly support any concept of dependencies between jobs or affinities to limit which worker threads a job can run on.
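To make that description concrete, here is a minimal sketch of a queue of this shape in standard C++. It is not Despair's code: the `JobQueue` class, `AddJob`, and every other name are hypothetical, and `std::packaged_task` and `std::future` simply stand in for whatever the engine actually uses.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: one queue serviced by a fixed set of worker threads.
class JobQueue {
public:
    explicit JobQueue(unsigned workerCount) {
        assert(workerCount > 0);
        for (unsigned i = 0; i < workerCount; ++i)
            workers_.emplace_back([this] { WorkerLoop(); });
    }

    // Drains any remaining jobs, then joins the workers.
    ~JobQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            shuttingDown_ = true;
        }
        wake_.notify_all();
        for (std::thread& worker : workers_)
            worker.join();
    }

    JobQueue(const JobQueue&) = delete;
    JobQueue& operator=(const JobQueue&) = delete;

    // Callable from any thread. The returned future becomes ready when the
    // job has finished running.
    std::future<void> AddJob(std::function<void()> work) {
        std::packaged_task<void()> task(std::move(work));
        std::future<void> done = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push_back(std::move(task));
        }
        wake_.notify_one();
        return done;
    }

private:
    void WorkerLoop() {
        for (;;) {
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return shuttingDown_ || !jobs_.empty(); });
                if (jobs_.empty())
                    return;  // shutting down and nothing left to run
                task = std::move(jobs_.front());  // jobs are fetched in FIFO order
                jobs_.pop_front();
            }
            task();  // runs outside the lock; no dependencies, no affinities
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::deque<std::packaged_task<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool shuttingDown_ = false;
};
```

In this sketch, a public pool would be one `JobQueue` constructed with several workers, and a private worker would be a `JobQueue` constructed with exactly one.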
Jobs are fetched by worker threads in strictly FIFO order, but with multiple worker threads servicing a single queue, that doesn’t provide many guarantees on the order in which jobs are processed. For example, jobs A and B, enqueued in that order, may be fetched consecutively by worker threads 1 and 2. Because these tasks are asynchronous, worker thread 2 might actually manage to start and finish processing job B before worker thread 1 has done any meaningful work on job A.
What this means in practice is that any job added to the public worker queue must be able to be processed on any public worker thread and concurrently with any other job in the queue. If a job really needs to enforce a dependency with other jobs, it can do so by either waiting on the futures of the jobs it is dependent on or by creating the jobs that are dependent on it at the end of its execution. Combinations of these techniques can be used to create almost arbitrarily complex job behavior, but such job interactions inhibit maximum parallelism so we try to avoid them in our engine.
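The two dependency techniques can be sketched as follows. This is only an illustration: `std::async` stands in for adding a job to a queue, and the function names are made up for the example.

```cpp
#include <cassert>
#include <future>

// Technique 1: the dependent job waits on the future of the job it needs.
// Job B cannot make progress until job A has produced its result.
inline int RunWithWaitDependency() {
    std::shared_future<int> jobA =
        std::async(std::launch::async, [] { return 21; }).share();
    std::future<int> jobB = std::async(std::launch::async, [jobA] {
        return jobA.get() * 2;  // blocks this job until A has finished
    });
    return jobB.get();
}

// Technique 2: job A creates its dependent job as the last thing it does,
// so B does not exist until A's work is complete.
inline int RunWithContinuation() {
    std::future<std::future<int>> jobA = std::async(std::launch::async, [] {
        const int produced = 21;
        // End of A: create the job that depends on A's output.
        return std::async(std::launch::async, [produced] { return produced * 2; });
    });
    std::future<int> jobB = jobA.get();  // A is done; B has been created
    return jobB.get();
}
```

Note that the first technique ties up a worker while it waits, which is one concrete way such interactions cut into parallelism on a fixed-size pool.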
Public worker threads are mostly used by data-parallel systems that divide their work into some multiple of the number of public workers, enqueue jobs for each worker, and then, at some later point in the frame, wait on all the jobs’ futures. Examples of systems like this are animation, cloth simulation, physics, effects, and rendering. Although all of the users of the public job queue try to provide as much time as possible between adding work to the queue and blocking on its completion, work in the public job queue is considered high-priority and latency intolerant.
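The data-parallel pattern looks roughly like the sketch below. It assumes a trivial stand-in workload (`ScaleRange`), uses `std::async` in place of the public job queue, and picks a hypothetical `jobsPerWorker` multiple of 2; none of these come from the engine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical per-job work: each job touches only its own slice, so any
// job can run on any worker, concurrently with all the others.
inline void ScaleRange(std::vector<float>& values, std::size_t begin,
                       std::size_t end, float factor) {
    for (std::size_t i = begin; i < end; ++i)
        values[i] *= factor;
}

inline void ScaleAllInParallel(std::vector<float>& values, float factor,
                               std::size_t workerCount,
                               std::size_t jobsPerWorker = 2) {
    assert(workerCount > 0 && jobsPerWorker > 0);
    // Divide the work into a multiple of the number of workers.
    const std::size_t jobCount = workerCount * jobsPerWorker;
    const std::size_t chunk = (values.size() + jobCount - 1) / jobCount;

    std::vector<std::future<void>> pending;
    for (std::size_t begin = 0; begin < values.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, values.size());
        pending.push_back(std::async(std::launch::async, [&values, begin, end, factor] {
            ScaleRange(values, begin, end, factor);
        }));
    }

    // A real system would do other frame work here, leaving as much time
    // as possible before blocking on completion.

    for (std::future<void>& job : pending)
        job.get();
}
```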
Systems that have work that can be processed asynchronously, but that doesn’t meet the requirements of the public job queue, create private job queues. The primary reason for using a private job queue is that the work is long but latency tolerant, and we don’t want it delaying the latency intolerant work in the public job queue.
In Fracture, terrain deformation was processed in a private worker thread, consuming as much as 15 ms a frame but at low priority.
Starting with F.E.A.R. 3, decals have used a private worker thread. The decal thread generates and clips geometry from a queue. Although the queue might hold several hundred milliseconds of work, it will happily take only what time slices it can get from a single thread until it finishes its backlog.
We also use private worker threads for systems that have restrictions on their degree of parallelism or that perform primarily blocking work. Save data serialization, background resource loading, audio processing, network transport, and, on the PC, communication with the primary Direct3D context fall into this category.
I’ve had occasion over the years to use the Despair concurrency manager for a wide variety of tasks with a wide range of requirements, and I have very few complaints. I find the system simple and intuitive, and it is usually immediately obvious, given a new task, whether a private or public job queue is appropriate. I’ve occasionally wished for a richer set of scheduling options within the job queues themselves, but I ultimately believe that complex scheduling requirements are a symptom of bad multithreaded design and that if complex scheduling truly is justified, it is better handled within the jobs themselves.
The one area where scheduling has given me a lot of trouble, however, and where I wish we could offer some improvement, is in the interaction of public and private worker threads. When the platform, code, and content are reasonably stable, it isn’t too difficult to arrange the public and private workers such that they share the available processor resources efficiently.
On the Xbox 360, for example, where there are 6 hardware threads available, we have the main thread with affinity to hardware thread 0, four public worker threads with separate affinity to hardware threads 1, 2, 3, and 5, and most of our private worker threads sharing affinity with hardware thread 4. This arrangement ensures that the main thread and the public workers are never interrupted by the private worker threads, and it means that the private workers get a roughly equal share of an entire hardware thread. We know exactly how fast the hardware is and we can predict with a high degree of accuracy how much work each private worker will receive, so we don’t need to worry about oversubscription or underutilization of hardware thread 4.
In cases where the private worker threads aren’t sharing the load in a way we consider optimal, we can tweak either the hardware thread affinity or the software thread priorities to get the behavior we want. For example, in F.E.A.R. 3 we offloaded some marshaling of data for network transport to a private worker thread. Jobs for that thread were generated near the end of each frame and they had to be completed near the beginning of the following frame. If the private workers were left to the OS scheduler, the decal thread might preempt the network thread during that crucial window and cause a stall in the next frame. Since we knew the network thread never generated more than 5-6 ms of single-threaded work, we could safely boost its thread priority and ensure that it was never preempted by decals.
In another case where we weren’t 100% satisfied with the default scheduling of a private worker, we moved the private worker to share a hardware thread with one of the public workers but also lowered its thread priority. The luxury of a fixed, single-process platform is that we can hand tune the thread layout and be confident that our results will match those of our customers.
In the capture above you can see examples of both the situations I described. The thin colored lines represent software threads and the thick blocks above them represent individual jobs. A high priority audio thread, shown in yellow, interrupts a job on hardware thread 4, but that’s okay because the job being interrupted is latency tolerant. Later in the frame another thread, shown in light blue, schedules nicely with the latency intolerant jobs in the pink public worker on hardware thread 5.
The PC is where things get messy. On the PC we worry most about two different configurations. One is the low end, which for us is a dual core CPU, and the other is the high end, which is a hyper-threaded quad core CPU.
Currently, on the PC we allocate MAX(1, num_logical_processors-2) public worker threads. On a hyper-threaded quad core that means 6 public worker threads and on a dual core that means just 1 public worker thread. Unlike on the Xbox 360, however, we don’t specify explicit processor affinities for our threads, nor do we adjust thread priorities (except for obvious cases like an audio mixer thread). We don’t know what other processes might be running concurrently with our game and, with variations in drivers and platform configurations, we don’t even know what other third-party threads might be running in our process. Constraining the Windows scheduler with affinities and thread priorities will likely lead to poor processor utilization or even thread starvation.
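As a quick check of that formula, here is a small sketch. The function names are hypothetical, and querying the processor count through `std::thread::hardware_concurrency` is an assumption made for the example, not necessarily how the engine does it.

```cpp
#include <cassert>
#include <thread>

// MAX(1, num_logical_processors - 2), written to avoid unsigned underflow
// when fewer than two processors are reported.
constexpr unsigned PublicWorkerCount(unsigned logicalProcessors) {
    return logicalProcessors > 2 ? logicalProcessors - 2 : 1u;
}

static_assert(PublicWorkerCount(8) == 6, "hyper-threaded quad core");
static_assert(PublicWorkerCount(2) == 1, "dual core");

// hardware_concurrency() may return 0 when the count is unknown; the
// formula above still yields one public worker in that case.
inline unsigned PublicWorkerCountForThisMachine() {
    return PublicWorkerCount(std::thread::hardware_concurrency());
}
```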
That’s the conventional wisdom anyway, but it sure doesn’t look pretty in profiles. From a bird’s eye view the job system appears to work as expected on the PC. As the number of cores increases, the game gets faster. Success! If it weren’t for our internal thread profiler and the Concurrency Visualizer in Visual Studio 2010, we’d probably have been happy with that and moved on.
On high-end PCs things aren’t too bad. Both our job queue visualization and Visual Studio’s thread visualization sometimes show disappointing utilization of our public workers, but that’s not necessarily a problem. We know we’re oversubscribed because we have more software threads created by the engine than there are hardware threads, and there are at least 3 other third-party threads in our process doing meaningful work, not to mention the other processes in the system. One of the benefits of a thread pool is that the number of participating threads can scale with the environment. Thankfully the behavior we usually see in these cases is fewer public workers tackling more jobs rather than all public workers starting jobs and then some being preempted, which would block the job queue until they can be rescheduled.
The image above is an example from our internal thread profiler. I had to zoom and crop it a bit to fit on the page, but what you’re seeing is the same portion of two frames on a dual core machine. The colored blocks represent jobs. You can see in the first frame a long stretch in which the main thread is processing jobs while the single worker thread is sitting idle. The next frame shows the intended behavior, with both the main thread and the worker thread processing jobs equally. We can’t visualize third-party threads or other processes with our internal profiler, so we have to turn to Visual Studio’s profiler to see what’s preempting our worker in that first frame. In our case it is usually the video driver or audio processor, but really any thread could be at fault. The more active threads in the system, including our own private workers, the more likely this sort of interference becomes.
The other behavior that is a little disappointing on many-core PCs is the high percentage of cross-core context switches. The Windows scheduler prioritizes quite a few factors above keeping a thread on its current core, so it isn’t too big a surprise for threads to jump cores in an oversubscribed system. The cost is some nebulous decrease in CPU cache locality that is all but impossible to measure. Short of setting explicit processor affinities for our threads, which hurts overall performance, I haven’t had any luck improving this behavior. I had hoped to combat this effect with SetThreadIdealProcessor, but I haven’t actually been able to detect any change in scheduling when calling this function so we don’t use it.
On a high-end PC, as Louis C.K. might say, these are first world problems. With 8 logical processors we can afford to be less than perfect. From my profiles, even the best PC games are barely utilizing 30% of an 8 processor PC, and we’re comfortably within that range so I’m not complaining.
On dual core machines these issues can’t be ignored. With only two hardware threads, we’re now massively oversubscribed. The particularly difficult situation that we have to cope with is when all of our private workers are fully occupied at the same time. As I explained earlier, the decal thread is latency tolerant, but it can buffer far more than a single frame’s worth of work. This means that, left unchallenged, it alone can consume a full core for a full frame. Video drivers usually have their own threads which might, under heavy load, consume 25% of a core, audio might want another 20%, and Steam another 20%. All told we can have two thirds of a core’s worth of work in miscellaneous secondary threads and another full core’s worth of work in the decal thread. That’s 1.7 cores’ worth of work competing on a level playing field with the main job queue on a machine with only 2 cores!
For most of these threads we have a general optimization problem more than a concurrency problem. We don’t have much flexibility in when or how they run, we just need to lower their cost. The decal thread, on the other hand, is different. Its purpose is to periodically consume far more work than would normally be budgeted for a single frame and to amortize the cost of that work over multiple frames. If it is impacting the execution of other threads then it isn’t doing its job.
My first reaction to this problem was, as usual, to wish for a more sophisticated scheduler in the public job queue. It seemed as though an easy solution would be to stick decal jobs in the public job queue and to instruct the scheduler to budget some fraction of every second to decal processing while trying to schedule decal jobs only at times when no other jobs are pending. After some consideration, however, I realized that this was asking too much of the scheduler and, perversely, would still require a lot of work in the decal system itself. Since the job scheduler isn’t preemptive, even a powerful system of budgets and priorities would rely on the jobs themselves being of sufficiently small granularity. The decal system would have to break up large jobs to assist the scheduler or, similarly, implement a cooperative yielding strategy that returned control to the scheduler mid-execution.
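To show what cooperative yielding would demand of the decal system, here is a rough sketch. Everything in it is hypothetical: `DecalRequest`, `DecalBacklog`, `RunSlice`, and `YieldAfter` are invented for illustration, and the real per-request work (generating and clipping geometry) is reduced to a counter.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>

// Hypothetical request; the real one would describe geometry to clip.
struct DecalRequest {
    int id;
};

class DecalBacklog {
public:
    void Push(DecalRequest request) { pending_.push_back(request); }
    std::size_t Pending() const { return pending_.size(); }
    std::size_t Completed() const { return completed_; }

    // Runs as one non-preemptible job. It processes requests one at a time
    // and asks shouldYield after each, so control returns to the scheduler
    // at request granularity. At least one request is processed per slice
    // so the backlog always makes progress.
    // Returns true if work remains and the job should be enqueued again.
    bool RunSlice(const std::function<bool()>& shouldYield) {
        while (!pending_.empty()) {
            Process(pending_.front());
            pending_.pop_front();
            if (shouldYield())
                break;
        }
        return !pending_.empty();
    }

private:
    void Process(const DecalRequest&) { ++completed_; }  // placeholder work

    std::deque<DecalRequest> pending_;
    std::size_t completed_ = 0;
};

// Builds a yield test from a wall-clock budget for the current slice.
inline std::function<bool()> YieldAfter(std::chrono::microseconds budget) {
    const auto deadline = std::chrono::steady_clock::now() + budget;
    return [deadline] { return std::chrono::steady_clock::now() >= deadline; };
}
```

Even in this toy form the burden is visible: the slicing only works if a single request is cheap enough to fit inside the budget, which is a property of the decal system rather than of the scheduler.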
In addition to assisting the scheduler, the decal system would also have to become aware of how its rate of production was being throttled. Since decal resources are recycled using a heuristic that is heavily LRU, the decal system must manage the rate of input requests to match the rate of production in order to ensure that decals aren’t being recycled as soon as they are created.
It seems that any additional complexity added to the job scheduler is going to require equal complexity to be added to the decal system in order to take advantage of it. That’s always a red flag for me in systems design.
I’m still weighing some options for dealing with the decal thread, but my current favorite is to swallow my fears and seek help from the OS thread scheduler. If we reduce the thread priority of the decal thread under Windows, it will only be given time when the cores would otherwise be idle. However, since Windows implements thread boosts, even in a completely saturated environment the decal thread won’t starve completely. Nevertheless, this is a risky strategy because it creates the ability for the decal thread to block other threads through priority inversion. This probably isn’t a scalable long-term solution, but given our current thread layout and hardware targets, it achieves the desired result.
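A minimal sketch of that approach is below. It is an assumption-laden illustration rather than our actual code: the helper names are invented, `THREAD_PRIORITY_LOWEST` is just one plausible choice of level, and on platforms other than Windows the helper deliberately does nothing.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <utility>

#ifdef _WIN32
#include <windows.h>
constexpr bool kCanLowerPriority = true;
#else
constexpr bool kCanLowerPriority = false;  // this sketch only covers Windows
#endif

// Lowers the priority of the calling thread. Returns true on success and
// false if the call failed or the platform is not handled by this sketch.
inline bool LowerCurrentThreadPriority() {
#ifdef _WIN32
    // Assumed level; THREAD_PRIORITY_BELOW_NORMAL is a milder alternative.
    return SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_LOWEST) != 0;
#else
    return false;
#endif
}

// Hypothetical helper: starts a private worker that drops its own priority
// before it begins draining its backlog.
inline std::thread StartLowPriorityWorker(std::function<void()> drainBacklog) {
    return std::thread([work = std::move(drainBacklog)] {
        LowerCurrentThreadPriority();
        work();
    });
}
```

The priority inversion risk is untouched by this code: if the low-priority worker holds a lock that a higher-priority thread needs, that thread still waits on it.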
The difficulty of achieving maximum throughput on multicore processors is something that is often talked about in games, but what is less often talked about is how much harder this is on the PC than on the consoles. Maximizing throughput on high-end PCs is great, but, as I’ve shown, it must be done without sacrificing response time on low-end PCs. With our current approach I’ve been pretty pleased with our progress in this area, but I’m nevertheless having a hard time envisioning a day when we can fully utilize the resources of an 8 processor machine and still continue to provide a compatible play experience on a lowly dual core.
Hopefully by that time we’ll have broad support for GPGPU tasks as well as the scheduling flexibility to rely on them for more of our latency tolerant work.