A year ago I introduced Zuul, a program I developed to drive the OpenStack project's gating system. In short, each change to an OpenStack project must pass unit and integration tests before it is merged. For more details on the system, see Zuul: a Pipelining Trunk Gating System.
Over the past year, the OpenStack project has grown tremendously, with 62 git repositories related to OpenStack, 30 for the project infrastructure, and an additional 75 unofficial projects that share the same testing infrastructure. In all, the development infrastructure currently serves 167 repositories. We run up to 720 test jobs per hour, and our dynamic provisioning system has pushed our test node count up to 328 nodes online and running tests simultaneously.
Over the past year, we've made a large number of changes to prepare for this load (we saw the graphs of the OpenStack project growth just like everyone else). Here are some of the key innovations that help us test at scale.
Gearman-Plugin and Multi-Master Jenkins
It was becoming apparent that, as we kept adding more nodes to Jenkins, the Jenkins master was becoming a bottleneck for scaling as well as a single point of failure. We decided to solve this by creating a system in which we can run completely independent Jenkins masters.
We decided to use Gearman as a way to distribute jobs from Zuul to any number of systems that can run tests for it (Jenkins or otherwise). Instead of talking to Jenkins directly, Zuul now submits requests to run jobs to a Gearman server, which then distributes them to any worker that registers with it indicating that it can run a particular job.
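To make the routing idea concrete, here is a minimal in-memory sketch of that pattern. It is a toy model, not Gearman itself and not Zuul's code: the class names (`ToyBroker`, `ToyWorker`), the worker names, and the `build:unit-tests` function name are all hypothetical. It also simplifies one thing: a real Gearman worker asks the server for work, whereas this sketch has the broker hand jobs out directly.

```python
from collections import defaultdict, deque


class ToyWorker:
    """Stand-in for a system that can run jobs (a Jenkins master, for example)."""

    def __init__(self, name):
        self.name = name
        self.idle = True


class ToyBroker:
    """Stand-in for the job server: routes queued jobs to registered workers."""

    def __init__(self):
        self._workers = defaultdict(list)  # function name -> workers offering it
        self._queue = deque()              # (function name, payload) awaiting a worker

    def register(self, worker, function):
        # A worker announces that it can run the named function.
        self._workers[function].append(worker)

    def submit(self, function, payload):
        # A client (Zuul, in the post) asks for a job to be run.
        self._queue.append((function, payload))

    def pending(self):
        return len(self._queue)

    def dispatch(self):
        # Give each queued job to the first idle worker registered for it.
        # Jobs with no idle worker stay queued, in their original order.
        assigned = []
        waiting = deque()
        while self._queue:
            function, payload = self._queue.popleft()
            candidates = self._workers.get(function, [])
            worker = next((w for w in candidates if w.idle), None)
            if worker is None:
                waiting.append((function, payload))
                continue
            worker.idle = False
            assigned.append((worker.name, function, payload))
        self._queue = waiting
        return assigned


broker = ToyBroker()
for name in ("jenkins01", "jenkins02"):
    broker.register(ToyWorker(name), "build:unit-tests")

for change in (1, 2, 3):
    broker.submit("build:unit-tests", {"change": change})

assignments = broker.dispatch()
```

With two workers and three requests, two jobs are assigned and the third waits in the queue until a worker is free again. Neither side needs to know about the other; both only know the function name.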
So that we can continue to run our Jenkins tests, Khai Do and I wrote the gearman-plugin for Jenkins. It connects to a Gearman server and registers every job defined in Jenkins as something it can run. We currently have three Jenkins masters which register their jobs (1129 of them) with Gearman, which distributes build requests from Zuul to them as they have nodes available.
This system gives us quite a bit of flexibility -- we can have any program (not just Zuul) trigger jobs, as well as have any system (not just Jenkins) run those jobs. We also now have a highly-available Jenkins system, with redundant Jenkins masters across which we can do rolling upgrades of Jenkins with no downtime.
While implementing the Gearman interface for Zuul, we found that the existing Python Gearman libraries didn't facilitate the kind of asynchronous concurrency we wanted to use in Zuul. So I wrote gear, a very simple and lightweight interface that tries to expose all of the flexibility of the Gearman protocol. Using gear, it's very simple to write a Gearman worker or client that can handle having thousands of jobs in flight at a time. In the OpenStack project infrastructure, it is used by Zuul as well as by the log processors in our Logstash system.
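For readers who have not seen the protocol underneath, the sketch below frames two Gearman request packets by hand. It does not use gear and does not show gear's API; it only illustrates the wire format as I understand it from the public Gearman protocol description: a 4-byte magic, a big-endian packet type, a big-endian body size, and NUL-separated arguments. The helper names and the `build:demo` function name are made up for illustration.

```python
import struct

REQ_MAGIC = b"\x00REQ"  # magic bytes that open every request packet

# Packet type codes, as listed in the Gearman protocol description.
CAN_DO = 1      # worker -> server: "I can run this function"
SUBMIT_JOB = 7  # client -> server: "please run this function"


def encode_request(packet_type, *args):
    """Frame a request: magic, type, body size, then NUL-joined arguments."""
    body = b"\x00".join(args)
    return REQ_MAGIC + struct.pack(">II", packet_type, len(body)) + body


def decode_header(data):
    """Split a packet into (magic, type, body) without interpreting the body."""
    magic = data[:4]
    packet_type, size = struct.unpack(">II", data[4:12])
    return magic, packet_type, data[12:12 + size]


# A worker advertising a (hypothetical) function name.
can_do = encode_request(CAN_DO, b"build:demo")

# A client submitting a job: function name, unique id, opaque payload.
submit = encode_request(SUBMIT_JOB, b"build:demo", b"job-0001", b'{"change": 1}')
```

Because each packet carries its own length, one connection can interleave many requests and responses, which is what makes it practical to keep a large number of jobs in flight at once.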
With multiple Jenkins masters in place, we now have quite a bit of capacity to add more test nodes. To fill that capacity, I wrote Nodepool.
Some of our most important and most complex tests involve taking over an entire virtual machine (and perhaps, in the future, more than one) and installing and configuring software as root. This is clearly not an ideal environment for a long-running Jenkins node. Instead, we run these tests on single-use nodes.
Once a day, Nodepool spins up a new node (using novaclient, of course) and caches data on the host so that tests can later use it locally. It then creates a snapshot of the host, spins up a number of machines from that image, and adds them to Jenkins. Jenkins registers their availability with Gearman, and they wait to be assigned a job.
Clark Boylan wrote the ZeroMQ event publisher plugin for Jenkins. With that installed, Jenkins publishes start and end events for every build.
Nodepool subscribes to the ZeroMQ events published by each of our Jenkins masters. When it notes that a build has started on a node that it manages, it marks the node as in use in its internal database and immediately starts spinning up a replacement node. When the job completes, Nodepool removes the node from Jenkins and deletes it.
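The single-use node lifecycle described above can be sketched as a small event-driven state table. This is a toy model under stated assumptions, not Nodepool's implementation: the class, method, and node names are hypothetical, node launches complete instantly instead of taking minutes, and a plain dictionary stands in for the internal database.

```python
import itertools


class ToyNodePool:
    """Keeps a target number of ready nodes; every node runs at most one build."""

    def __init__(self, target_ready):
        self.target_ready = target_ready
        self.nodes = {}                  # node name -> "ready" or "in-use"
        self._ids = itertools.count(1)
        self._fill()

    def count(self, state):
        return sum(1 for s in self.nodes.values() if s == state)

    def _fill(self):
        # Launch nodes until the ready count is back at the target.
        # (A real launch is asynchronous; here it is immediate.)
        while self.count("ready") < self.target_ready:
            self.nodes["node-%d" % next(self._ids)] = "ready"

    def on_build_started(self, node):
        # Start event: mark the node in use and start a replacement right away.
        if self.nodes.get(node) == "ready":
            self.nodes[node] = "in-use"
            self._fill()

    def on_build_finished(self, node):
        # End event: the node is never reused, so drop it from the pool.
        if self.nodes.get(node) == "in-use":
            del self.nodes[node]


pool = ToyNodePool(target_ready=2)   # launches node-1 and node-2
pool.on_build_started("node-1")      # node-1 is in use; node-3 is launched
during_build = dict(pool.nodes)
pool.on_build_finished("node-1")     # node-1 is deleted
```

The point of replacing a node when the build starts, and not when it ends, is that the ready count is restored while the test is still running, so the next job does not have to wait for a launch.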
Nodepool is very fast and responsive. In our current configuration, developers and tests never have to wait for a Nodepool-managed node to become available unless we hit our test node quota (of several hundred machines, which Nodepool is capable of exhausting in a few minutes). We like it so much that we're looking into having it manage all of our Jenkins nodes.