A few days late, but still warm and gooey…
Aquarium Rewrite Status
After wrestling with CAS (compare-and-swap) for a bit too long, and swearing at the lack of DCAS (or at my own laziness for not restructuring to allow fake DCAS with paired CAS), we’ve now switched over to using very tiny locks. The two areas with locks are the actor message queue and the actor work queue. Speed-wise, the locked actor message queue is nearly identical to the CAS version, since the locked regions are only a couple of instructions each. Later, as we go through optimization phases, I suspect there’s a good chance that the actor work queue will move back to CAS for performance reasons (attempting to steal from work queues shouldn’t incur any locking penalties, especially on manycore systems). Heck, I might try that out today on a working branch.
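To give a feel for what a "very tiny lock" around a message queue can look like, here is a minimal sketch. It is not Aquarium's actual code: `Message`, `SpinLock`, and `MessageQueue` are hypothetical names, and the intrusive linked list is an assumption made so that each locked region really is just a few pointer updates.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical types for illustration; not Aquarium's actual API.
struct Message {
    int payload;
    Message* next;  // intrusive link, so push/pop never allocate
    explicit Message(int p) : payload(p), next(nullptr) {}
};

// A tiny spinlock: reasonable only when the critical section is very short.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin until the current holder clears the flag
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};

// FIFO queue whose locked regions are a handful of pointer writes.
class MessageQueue {
    SpinLock lock_;
    Message* head_ = nullptr;
    Message* tail_ = nullptr;
public:
    void push(Message* m) {
        m->next = nullptr;  // done outside the lock
        std::lock_guard<SpinLock> guard(lock_);
        if (tail_) tail_->next = m; else head_ = m;
        tail_ = m;
    }

    // Returns nullptr when the queue is empty.
    Message* pop() {
        std::lock_guard<SpinLock> guard(lock_);
        Message* m = head_;
        if (m) {
            head_ = m->next;
            if (!head_) tail_ = nullptr;
        }
        return m;
    }
};
```

Pushing two messages and popping them returns them in the order they were sent, and popping an empty queue returns `nullptr`. With critical sections this small, it is plausible that the lock costs about the same as a CAS retry loop, which matches the observation above.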
As you can tell, we’re at the polish stage, but if you’re like me, you want to know what the point of the rewrite was, with some numbers to back it up.
First, the gist:
- No kernel. What it did — cross-scheduler messaging, load balancing, actor isolation — is now done using other techniques.
- Actors are messaged directly using safe message queues. Sending to the queue will tell you if it’s safe to enqueue/dequeue that actor for work. If you notice the actor is idle, and you are the first one messaging it, you become its owner. With this, actors have no origin scheduler and are effectively drawn from a shared idle pool.
- Once active and on a scheduler, any actor can be stolen by idle schedulers. This gives us our load balancing.
- Because actors can be stolen out from under a busy scheduler, it becomes less important to have explicitly isolated actors. If a scheduler is busy with a long-running actor, other schedulers can steal the starving actors queued behind the active task.
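The ownership and stealing rules in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the real implementation: `Actor`, `Scheduler`, `send`, `take_local`, `steal_from`, and `deliver` are all made-up names, messages are plain `int`s, and `std::mutex` stands in for whatever tiny lock the library actually uses.

```cpp
#include <deque>
#include <mutex>

// Hypothetical types for illustration; not Aquarium's actual API.
struct Actor {
    std::mutex lock;
    std::deque<int> mailbox;
    bool active = false;  // guarded by lock; false means the actor is idle

    // Enqueue a message. Returns true only to the sender that found the
    // actor idle, i.e. the sender that now owns it and must schedule it.
    bool send(int msg) {
        std::lock_guard<std::mutex> guard(lock);
        mailbox.push_back(msg);
        if (active) return false;
        active = true;
        return true;
    }
};

struct Scheduler {
    std::mutex lock;
    std::deque<Actor*> work;

    void enqueue(Actor* a) {
        std::lock_guard<std::mutex> guard(lock);
        work.push_back(a);
    }

    // The owning scheduler takes work from the front of its own queue.
    Actor* take_local() {
        std::lock_guard<std::mutex> guard(lock);
        if (work.empty()) return nullptr;
        Actor* a = work.front();
        work.pop_front();
        return a;
    }

    // An idle scheduler steals from the back of a victim's queue, picking up
    // actors that are waiting behind whatever the victim is running.
    Actor* steal_from(Scheduler& victim) {
        std::lock_guard<std::mutex> guard(victim.lock);
        if (victim.work.empty()) return nullptr;
        Actor* a = victim.work.back();
        victim.work.pop_back();
        return a;
    }
};

// Message an actor on behalf of a scheduler; the first sender to wake an
// idle actor puts it on its own work queue.
void deliver(Scheduler& sender, Actor& target, int msg) {
    if (target.send(msg)) sender.enqueue(&target);
}
```

In this sketch, if scheduler A messages an idle actor first, A owns it, and a later message from scheduler B only lands in the mailbox. If A then gets stuck on a long-running actor, B can call `steal_from(A)` to pick up whatever is waiting at the back of A's queue. No central kernel is involved at any point.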
Now, the numbers:
Threadring (8-way Ubuntu Linux)
10m:
Alpha 3: 0.38s
Alpha 4: 3.0s
50m:
Alpha 3: 1.8s
Alpha 4: 15.0s
Explanation: We knew going in that threadring would suffer the most in the rewrite, as it gained a lot of speed from assuming that idle actors were paired to the local scheduler. We take a hit for not being able to make this assumption anymore.
Big bang:
1500:
Alpha 3: 1.0s
Alpha 4: 0.6s
4000:
Alpha 3: 108.0s
Alpha 4: 4.5s
Explanation: On the flip side, big bang actually is doing “real work”, and gets a performance boost for doing so. The active load balancer in alpha 4 is far superior to the alpha 3 one when there is work that can be shared.
Like I said earlier, we might be able to shave some of the fat during optimization later on, but I’m pleased with where we are. These are some pretty solid results and come from a much simpler architecture. The Aquarium library could also reasonably be used as an add-on library for C/C++ applications at this point. The code wouldn’t be idiomatic C, since sending messages to actors won’t look like function calls, but once you’re over the bump you’ve got an active load-balancing system.