Finally had a chance to sit down and watch this.
Stuff I learned:
Fiber switching is really fast.
I had assumed they'd have a worker thread dedicated to each game system; an AI thread, physics thread, etc. Instead, they just have a generic worker thread running on each CPU core, and those threads all execute fibers as they become available. When a worker thread completes a fiber's current job, it just grabs another fiber to process. It doesn't matter if it contains an AI job or a physics job or what. It's a job that needs to get done, and this worker thread isn't doing anything, so it does it. All six of the worker threads just constantly process whatever ready fiber they can find. This is better than dedicating a thread/core to each subsystem, because it allows you to process six AI jobs simultaneously, or zero, instead of only one at a time.
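Here's a minimal, runnable sketch of what I imagine that worker loop looks like. I'm using std::function jobs as a stand-in for real fibers (actual fiber switching is platform-specific), and all the names here are mine, not ND's:

```cpp
// Minimal sketch of the "generic worker" idea, with std::function jobs
// standing in for real fibers. Every worker pulls from one shared queue,
// regardless of which subsystem the job belongs to.
#include <atomic>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<std::function<void()>> g_jobs;   // AI, physics, render -- all mixed
std::mutex g_jobsMutex;
std::atomic<bool> g_running{true};

void WorkerThreadMain()
{
    while (g_running.load())
    {
        std::function<void()> job;
        {
            std::lock_guard<std::mutex> lock(g_jobsMutex);
            if (!g_jobs.empty()) { job = std::move(g_jobs.front()); g_jobs.pop(); }
        }
        if (job) job();                     // just do whatever is ready
        else std::this_thread::yield();
    }
}

int main()
{
    std::vector<std::thread> workers;
    for (int i = 0; i < 6; ++i)             // one generic worker per core
        workers.emplace_back(WorkerThreadMain);
    // ... push AI/physics/render jobs into g_jobs here ...
    g_running = false;
    for (auto& w : workers) w.join();
}
```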
This also means that ND's engine should benefit fairly linearly from the unlocking of Core6. Because their worker threads aren't particular about what type of job they're executing, and because they don't have very many synchronization points in their engine, their job throughput should increase by nearly 17% (a seventh worker on top of six, assuming it is indeed a 100% unlock). He said the engine is more like an OS, and that's how I've kind of pictured a "modern" game engine: primarily a collection of daemons that perform very specific tasks very efficiently, and are capable of doing their jobs without worrying too much about what their fellow daemons are up to.
They use something called atomic counters for synchronization. It's basically a global unsigned int. If a job is producing some data that another job will depend on, it sets this counter to some non-zero value. The dependent job can then be dispatched immediately, and it tests whether the counter has reached 0 yet. If not, the job's fiber is put to sleep, and the worker thread simply switches context to a fiber that isn't blocked. The sleeping fiber is added to a list of fibers waiting on that particular counter, and when the counter reaches zero, all of the waiting jobs are flagged as ready to go for the next available worker thread.
This is especially useful because their job system is cooperative rather than preemptive. There's no way to interrupt a running job, so if it gets stuck waiting on a value from another job, it's completely stuck. To test whether an operation is safe to perform, the job calls WaitForCounter(counter, value), and if the counter doesn't contain the requested value, the fiber puts itself to sleep, relinquishing control of the thread so another fiber can execute.
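My best guess at how that counter machinery fits together is below. Real fiber switching is platform-specific, so CurrentFiber()/SuspendCurrentFiber()/MakeReady() are hypothetical stand-ins, not ND's actual API:

```cpp
// Sketch of the atomic-counter idea as I understood it from the talk.
// The fiber primitives declared here are stand-ins for illustration.
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

struct Fiber;                                   // opaque fiber handle
Fiber* CurrentFiber();                          // fiber this worker is running
void   SuspendCurrentFiber();                   // yield back to the worker loop
void   MakeReady(Fiber* f);                     // flag a fiber as runnable again

struct Counter
{
    std::atomic<int32_t> value{0};
    std::mutex waitersMutex;
    struct Waiter { Fiber* fiber; int32_t target; };
    std::vector<Waiter> waiters;                // fibers sleeping on this counter
};

// A job calls this before touching data another job is still producing.
void WaitForCounter(Counter& c, int32_t target)
{
    {
        std::lock_guard<std::mutex> lock(c.waitersMutex);
        if (c.value.load() == target)
            return;                             // dependency already satisfied
        c.waiters.push_back({CurrentFiber(), target});
    }
    SuspendCurrentFiber();                      // worker grabs an unblocked fiber
}

// A producing job calls this when it finishes its piece of the work.
void DecrementCounter(Counter& c)
{
    const int32_t newValue = c.value.fetch_sub(1) - 1;
    std::lock_guard<std::mutex> lock(c.waitersMutex);
    for (auto it = c.waiters.begin(); it != c.waiters.end(); )
    {
        if (it->target == newValue)             // e.g. the counter just hit 0
        {
            MakeReady(it->fiber);               // next free worker will run it
            it = c.waiters.erase(it);
        }
        else ++it;
    }
}
```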
While they don't lock particular subsystems to specific cores, they do lock specific worker threads to specific cores. Because they mostly have the CPU cores to themselves, their worker threads can run for many, many cycles without interruption. But it's a Unix system, and the kernel is king, so occasionally it will come by and boot one of the threads off of its preferred core for a moment. In that time when the worker thread has nowhere to execute, it complains, and all of the other worker threads offer up their cores, seeing as how they've been "hogging" them for the last however many cycles. The net result was that every kernel-forced context switch actually triggered six (or more!) switches, as the worker threads all played musical cores. So they use affinity to bind worker0 to Core0, and if it gets booted, it simply waits for the core to become available again rather than interrupting the other worker threads. This exposes another advantage of generic workers: if one gets blocked by the kernel, you simply process jobs 1/6th slower (1/7th, once the seventh core is unlocked) until the block clears, rather than having your entire physics engine hang, for example.
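For what it's worth, pinning a thread to a core looks something like this on a Unix-y system. This is the Linux/pthreads flavor; whatever the PS4 SDK exposes will differ, but the idea is the same:

```cpp
// Pinning the calling thread to a single core with pthread affinity
// (Linux flavor), e.g. worker0 -> Core0.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int PinCurrentThreadToCore(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);                 // clear the allowed-core mask
    CPU_SET(core, &set);            // allow exactly one core
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```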
To improve CPU utilization, they changed the way they thought of frames. They boiled the definition of a frame down to: "A piece of data that is processed and ultimately displayed on the screen." This data goes through three phases of life.
Game Logic: This is where all of your dice-rolling and bookkeeping happens. You run your AI and physics routines, process hit detection, update hit points; all of that good stuff. Basically, you're determining the current state of the world.
Render: After you've done your world simulation to determine the state of the environment, this is where you start describing the scene for the GPU to draw. Draw calls and such go here, and there are a bunch of them, especially if the scene is complicated.
GPU: This last phase doesn't have anything to do with the CPU at all. Once all of the draw calls have been dispatched to the GPU, the CPU's involvement with that frame pretty much ends, and it can get started on the next frame.
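So the naive frame loop, running the phases back to back, looks roughly like this (function names and bodies are stand-ins I made up for illustration):

```cpp
// The naive, fully sequential frame: each phase waits for the previous one
// to finish before the next can start.
struct FrameData { /* world state, draw calls, etc. */ };

void RunGameLogic(FrameData&)        { /* AI, physics, hit detection... */ }
void BuildRenderCommands(FrameData&) { /* describe the scene: draw calls */ }
void SubmitToGpu(const FrameData&)   { /* CPU involvement ends here */ }

void NaiveFrameLoop()
{
    FrameData frame;
    for (;;)
    {
        RunGameLogic(frame);          // phase 1: determine world state
        BuildRenderCommands(frame);   // phase 2: describe the scene
        SubmitToGpu(frame);           // phase 3: GPU takes over
    }
}
```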
So at 60 fps, the CPU basically has 16.67 ms to complete the game logic and rendering phases, or the frame won't be ready in time for the GPU to do its thing. Problem was, they had about 100 ms worth of code to execute, and even spread across six cores, it still took about 25 ms to get from the start of game logic to the end of rendering. They weren't getting very good utilization on the CPU; a lot of the silicon was sitting idle a lot of the time.
So the solution was to run the game logic and rendering phases simultaneously, but on different frames: each frame still moves through its phases sequentially, but while frame N is in its Render phase, frame N+1 is already in Game Logic. This eliminates all contention between the Game and Render phases, because they're always working on completely different frames. If a core is available, it can always be executing a rendering job, because all of the data the renderer requires was completed in the previous beat of the 16.67 ms "phase clock." By making the two phases completely asynchronous, they were able to vastly improve their utilization and get their overall execution time down to a svelte 15.5 ms.
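In loop form, the pipelining looks roughly like the sketch below; this is my picture of it, not their code, and it reuses the stand-in functions from the sketch above:

```cpp
// Sketch of the pipelined version: Game and Render run concurrently, but
// always on *different* frames, so they never touch the same data.
#include <thread>
#include <utility>

struct FrameData { /* world state, draw calls, etc. */ };

void RunGameLogic(FrameData&)        { /* AI, physics, hit detection... */ }
void BuildRenderCommands(FrameData&) { /* describe the scene: draw calls */ }
void SubmitToGpu(const FrameData&)   { /* CPU involvement ends here */ }

void PipelinedFrameLoop()
{
    FrameData frames[2];
    int gameIdx = 0;                  // frame N+1: being simulated
    int renderIdx = 1;                // frame N: being turned into draw calls
    for (;;)
    {
        // Both phases run at once. In the real engine these would be piles
        // of jobs on the shared queues, not two dedicated threads.
        std::thread game([&] { RunGameLogic(frames[gameIdx]); });
        std::thread render([&] {
            BuildRenderCommands(frames[renderIdx]);
            SubmitToGpu(frames[renderIdx]);
        });
        game.join();
        render.join();                // next beat of the 16.67 ms phase clock
        std::swap(gameIdx, renderIdx); // hand the freshly simulated frame over
    }
}
```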
Stuff I suspect:
Thanks to heterogeneous queueing, things should run fairly similarly from the GPU's perspective. Any job destined for the CPU gets added to one of the CPU job queues, where it's picked up by the next available fiber and executed by the next available worker. Any job tagged for the GPU, whether compute or traditional rendering, should be handled similarly, but by a system completely independent from the one that feeds the CPU.
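If that guess is right, the dispatch side might look as simple as this. Pure speculation on my part; every name here is made up:

```cpp
// Speculative sketch: jobs routed to independent CPU and GPU queues at
// dispatch time. All names are invented for illustration.
#include <functional>
#include <queue>
#include <utility>

enum class JobTarget { Cpu, GpuCompute, GpuGraphics };

struct Job { JobTarget target; std::function<void()> work; };

std::queue<Job> g_cpuQueue;        // drained by the fiber/worker system
std::queue<Job> g_gpuQueue;        // drained by a completely separate feeder

void Dispatch(Job job)
{
    if (job.target == JobTarget::Cpu)
        g_cpuQueue.push(std::move(job));
    else
        g_gpuQueue.push(std::move(job)); // compute and graphics alike
}
```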
So if the GPU is being fed similarly, would it also be a good idea to do your compute and render out-of-phase on the GPU, so it's doing all of its rendering for frame 1 while simultaneously doing the compute needed for frame 2? Or do we need to move things even further out of phase, so Game on the CPU never stalls waiting for Compute to be ready on the GPU?
Stuff that confused me:
Okay, tagged data. I understand how it works and why it's good. It's okay if new memory for a fiber is allocated by a different thread, because it all goes to the same backing store, since all of the threads have a Game block in the allocator, right? But what happens if the worker thread picks up a Render fiber, which it's equally likely to do at any given time, yes? Is a Render block automagically loaded into the allocator when a Render fiber is picked up? Is that any more expensive than a normal fiber switch?
Fibers provide context, but how, exactly? When he described the queueing system, it sounded like the top job from the highest-priority queue was simply popped by the next available fiber. Is that correct? If so, where does the "context" come from, or is it not that kind of context? It seems like a random sequence of jobs would be moving through any given fiber; not really grouped by dependency or anything like that. So is "context" simply making sure the job is working with the correct memory location, or is there some greater context like, "Do these things in this order" happening? Is a fiber a reusable, one-shot wrapper to "target" a job, or does it provide some sort of job-to-job context as well?
FakeEdit: Okay, I can imagine a sequence of commands that can be executed atomically, but are still only useful when performed in a specific order. So does a fiber tie that specific sequence of jobs together? How, if they're just pulling jobs from the queue, one at a time?
surfer's tl;dr:
ND's engine is pretty fucking sweet. It seems as scalable as it is flexible, and assuming a 100% unlock of the seventh core, I wouldn't be surprised if they saw a 15% boost in CPU performance with just a few lines of code to set up a seventh worker.
real tl;dr:
"Very helpful."