DIRDEPS_BUILD scale out

The DIRDEPS_BUILD lends itself quite neatly to scale-out on a per-directory basis, which keeps the distribution overhead low.

We need some tools and a filesystem shared by all the nodes. This is typically going to be NFS, which adds latency to the list of problems we need to deal with.

Since the DIRDEPS_BUILD is orchestrated in terms of directories, that is the unit of work we farm out to the worker bees.

We deal with the issue of latency by use of a token. When we ask a worker bee to build a directory, we give it a token to drop in its ${.OBJDIR} when done. Actually that's all handled by the makefiles. The distribution mechanism is totally ignorant of such details.

Other jobs can simply wait for the token to be visible in all dependent directories before starting. Again this is handled by the makefiles.

Because of that latency and our strategy for coping with it, things will generally work best if the build load is split up by TARGET_SPEC. That is, one worker deals with all jobs for, say, i386, while another deals with all jobs for amd64, yet another for arm64, and so on. A worker then sees no latency when looking for tokens that were created on the same host.

We generally leave the orchestration host to deal with host targets. It should be obvious that we depend on all the nodes being compatible.

Components Introduction

Below I introduce the moving parts. The goal is to distribute jobs with minimal overhead.

queen.py

The queen distributes jobs to workers via SSH.

At the start of the build it creates an AF_UNIX socket to collect requests and makes SSH connections to each worker.
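
In outline that start-up looks something like the sketch below; the socket path, host list and the exact way workon is invoked are illustrative, not necessarily what queen.py does:

# Sketch only: queen start-up, assuming each worker is reached via
# "ssh <host> workon worker.py"; socket path and host list are illustrative.
import socket
import subprocess

SOCKET_PATH = "/tmp/queen.sock"   # hypothetical path job.py connects to
WORKERS = ["node1", "node2"]      # hypothetical worker hosts

def start_queen():
    # AF_UNIX socket on which job.py instances submit their requests
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(SOCKET_PATH)
    sock.listen(128)
    # one ssh connection per worker; we talk XML over its stdin/stdout
    workers = {}
    for host in WORKERS:
        workers[host] = subprocess.Popen(
            ["ssh", host, "workon", "worker.py"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    return sock, workers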

We can likely support numerous methods of sharding the build work, but if enough TARGET_SPECs are involved, sharding by TARGET_SPEC is the most efficient method.

When a job request arrives, a worker is selected from the pool. Workers already handling jobs for the same TARGET_SPEC as the new request are preferred, but if those are at maximum load, the available worker with the lowest load will be selected.

The result stream from each worker is handled by a SAX parser running in another thread; as result elements complete, a callback sends them to the original caller.
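
A rough sketch of that result handling, assuming the worker wraps its results in a single enclosing element so the stream forms one well-formed document (class and function names here are made up):

# Sketch only: handle the <result> stream from one worker in its own thread.
# Assumes the stream is wrapped in one enclosing element; names illustrative.
import threading
import xml.sax

class ResultHandler(xml.sax.ContentHandler):
    def __init__(self, callback):
        self.callback = callback   # invoked with each completed result dict
        self.result = None
        self.field = None

    def startElement(self, name, attrs):
        if name == "result":
            self.result = {}
        elif self.result is not None:
            self.field = name
            self.result[name] = ""

    def characters(self, content):
        if self.result is not None and self.field:
            self.result[self.field] += content

    def endElement(self, name):
        if name == "result":
            self.callback(self.result)   # relay to the waiting job.py
            self.result = None
        elif name == self.field:
            self.field = None

def watch_worker(stream, callback):
    # one reader thread per worker ssh connection
    def reader():
        parser = xml.sax.make_parser()
        parser.setContentHandler(ResultHandler(callback))
        for line in stream:
            parser.feed(line)
    t = threading.Thread(target=reader, daemon=True)
    t.start()
    return t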

XML

We use XML to encode the job requests and results. While the job requests are typically small and could be communicated on a single line, the results can have an arbitrary amount of output so explicit markup is called for.

A job request looks like:

<job>
<id>1234</id>
<target_spec>$TARGET_SPEC</target_spec>
<env>TARGET_SPEC=$TARGET_SPEC MACHINE=$MACHINE...</env>
<cmd>${.MAKE} -C ${RELDIR}</cmd>
</job>

Where id is the pid of the requesting job.py and everything else is as provided by the _DIRDEPS_USE target.

The results look like:

<result>
<id>1234</id>
<status>0</status>
<output>whatever</output>
</result>
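
Using a real XML library also takes care of escaping; bmake output can contain characters like < and & that would otherwise break the stream. A tiny sketch of the point (not the actual worker.py code):

# Sketch only: the point of explicit markup - arbitrary output is
# escaped for us by the XML library.
import xml.etree.ElementTree as ET

out = ET.Element("output")
out.text = 'cc ... && echo "<done>"'   # raw bmake output, hypothetical
print(ET.tostring(out).decode())
# prints: <output>cc ... &amp;&amp; echo "&lt;done&gt;"</output>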

job.py

This is the interface from the build. The orchestrating bmake instance runs job.py rather than bmake directly for each subdir to be built on a remote host.

job.py connects to the queen's socket, communicates the request, then simply waits for the result.

Once the result is received, the output is printed and job.py exits with the given status.

The end result is that the orchestrating bmake cannot tell the difference from having run bmake directly.
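
The whole thing is thus quite small; a sketch of the idea (the socket path and argument handling are illustrative):

# Sketch only: relay one directory build to the queen and mimic bmake's
# exit status; socket path and field names are illustrative.
import os
import socket
import sys
import xml.etree.ElementTree as ET

SOCKET_PATH = "/tmp/queen.sock"   # hypothetical, must match the queen

def main(target_spec, env, cmd):
    job = ET.Element("job")
    ET.SubElement(job, "id").text = str(os.getpid())
    ET.SubElement(job, "target_spec").text = target_spec
    ET.SubElement(job, "env").text = env
    ET.SubElement(job, "cmd").text = cmd

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(SOCKET_PATH)
    sock.sendall(ET.tostring(job))
    sock.shutdown(socket.SHUT_WR)

    # wait for the single <result> element, then behave as bmake did
    data = b""
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        data += chunk
    result = ET.fromstring(data)
    sys.stdout.write(result.findtext("output", ""))
    sys.exit(int(result.findtext("status", "1")))

if __name__ == "__main__":
    main(*sys.argv[1:4])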

worker.py

Started by the queen via workon, which conditions the environment for the SB (sandbox).

It initially responds with some information about itself, including the number of CPUs and load averages. If its 1 minute load average is 80% of ncpu or more, the worker will be skipped.

For each job it receives it forks a child to run it. The exit status of bmake and any output are returned to the caller.

A SAX parser is run in a thread to simplify handling of the job stream from the queen. As each job element completes, a callback into the worker is made to actually run the job.

When a job is completed, the main thread sends the result back to the queen; this ensures there is no corruption of the result stream.
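
A sketch of that job handling (the real worker.py differs in detail, and the initial info exchange is only hinted at):

# Sketch only: the worker's job handling; names and framing illustrative.
import os
import queue
import subprocess
import threading
import xml.etree.ElementTree as ET

results = queue.Queue()   # completed jobs, drained by the main thread

def self_info():
    # the kind of information sent to the queen on start-up
    return {"ncpu": os.cpu_count(), "loadavg": os.getloadavg()}

def parse_env(s):
    # "VAR=value VAR2=value2 ..." as carried in the <env> element
    return dict(item.split("=", 1) for item in s.split())

def on_job(job):
    # callback from the sax parser thread as each <job> element completes;
    # hand the job off so the parser thread is not blocked
    threading.Thread(target=run_job, args=(job,), daemon=True).start()

def run_job(job):
    # a child process runs the bmake command; capture status and output
    proc = subprocess.run(job["cmd"], shell=True,
                          env=dict(os.environ, **parse_env(job["env"])),
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    results.put((job["id"], proc.returncode,
                 proc.stdout.decode(errors="replace")))

def main_loop(out):
    # only the main thread writes to the queen, so results never interleave
    while True:
        job_id, status, output = results.get()
        result = ET.Element("result")
        ET.SubElement(result, "id").text = str(job_id)
        ET.SubElement(result, "status").text = str(status)
        ET.SubElement(result, "output").text = output
        out.write(ET.tostring(result) + b"\n")
        out.flush()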

dirdeps-queen.mk

This is the magic to integrate all the above into the build.

This makefile is only used if ${.MAKE.LEVEL} == 0 and we are thus the orchestration bmake.

With the latest dirdeps.mk we simply need to tweak some of the variables used by the _DIRDEPS_USE target to send non-local jobs via job.py.

dirdeps-bee.mk

This is the makefile used when ${.MAKE.LEVEL} > 0, i.e. when we are actually building something.

We simply add a bee_ready target that waits for all the directories we depend on to contain the ${BEE_BUILD_TOKEN} to ensure results are visible.

For jobs done on the same host this has little impact, but for jobs done on another host we might have to wait a minute or two depending on the latency of the shared filesystem.

We also add a bee_done target to create the token on success.
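
Expressed in Python purely for illustration, the logic of the two targets amounts to the following (the real implementation is make targets in dirdeps-bee.mk; the timeout and poll interval shown are arbitrary):

# Illustration only: the equivalent of bee_ready/bee_done in Python.
import os
import time

def bee_ready(dep_objdirs, token, timeout=300, poll=5):
    # wait until every directory we depend on has dropped its token,
    # so its results are known to be visible over the shared filesystem
    deadline = time.time() + timeout
    pending = set(dep_objdirs)
    while pending:
        pending = {d for d in pending
                   if not os.path.exists(os.path.join(d, token))}
        if not pending:
            return
        if time.time() > deadline:
            raise RuntimeError("missing token in: " + " ".join(sorted(pending)))
        time.sleep(poll)

def bee_done(objdir, token):
    # drop the token in our own ${.OBJDIR} only once the build succeeded
    open(os.path.join(objdir, token), "w").close()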

This is why sharding the build based on TARGET_SPEC is more efficient. Most of the directories we depend on will be for the same TARGET_SPEC. If all such directories are built on the same host we expect zero latency caused by NFS.

Shared filesystem

As noted this will typically be NFS, and that imposes some limitations.

Since writing to NFS can be pretty slow, we need to achieve a considerable degree of parallelism in the build before this scale-out design will actually improve build time.

That is, if the build can only scale to a modest number of concurrent jobs, better results will be seen from a single host with a decent number of CPUs.

If the build can scale well beyond what a single host can provide, then scale-out can be of benefit.

Also since writing to NFS can be slow, it can help if the host the build is run from is also the NFS server - provided the NFS server implementation is decent.

Work scheduling

There are many ways to divide the work up among a cluster of nodes. If the shared filesystem is anything like NFS though, some models will work better than others.

For example, it is reasonable to assume that jobs for the same TARGET_SPEC might have dependencies on each other. They can also have dependencies on other TARGET_SPECs, especially those for a pseudo MACHINE like host or common, but other cross-TARGET_SPEC dependencies should be rare. With that in mind, we can minimize the effect of NFS latency by sending all jobs for the same TARGET_SPEC to the same node.

But that may result in a very unbalanced load across the available nodes.

Since we keep track of how many jobs each worker is currently handling, we can avoid overload.

We can feed a node jobs for the same TARGET_SPEC until we hit a specified limit, then just select the next node, which may be doing overflow for the same TARGET_SPEC or will simply be the least loaded node.
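
A sketch of that selection; the per-node limit and the bookkeeping shown are illustrative, not what queen.py necessarily keeps:

# Sketch only: pick a worker for a job, preferring one already building
# the same TARGET_SPEC; the limit and attributes are illustrative.
MAX_JOBS = 8   # hypothetical per-node limit

def pick_worker(workers, target_spec):
    # workers: objects with .njobs and .target_specs bookkeeping
    same = [w for w in workers
            if target_spec in w.target_specs and w.njobs < MAX_JOBS]
    if same:
        return min(same, key=lambda w: w.njobs)
    # overflow: fall back to the least loaded node overall
    return min(workers, key=lambda w: w.njobs)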

Split-fs

When building src from NFS we often use a local filesystem for obj dirs. With the scale-out model above, that would be counterproductive, since the tokens and build results in the object directories must be visible to the other nodes via the shared filesystem.


Author: sjg@crufty.net
Revision: $Id: README.txt,v 4ddf9df389f3 2019-11-13 22:56:11Z sjg $
Copyright: Crufty.NET
