3 Steps to Eradicate Flaky CI Builds

Dec 10, 2020

Unreliable builds hurt every developer.

First, there’s the obvious: time spent re-running and debugging flaky builds is time lost for other activities.

Second, time spent waiting has to be filled. And team members who switch contexts to progress another strand of work while waiting suffer reduced focus.

Third, delayed feedback on the effect of changes to an application inhibits learning and causes waste.

In this post, we follow the real-world experiences of some development teams as they overcome flaky builds and transform their productivity.

Note that flaky tests are a related issue that warrants a separate discussion.

The issue is more common than you think

It’s not uncommon for a build system to get into a state where, upon seeing a failed pipeline, the team’s first action is to check if it failed for reasons unrelated to the codebase. Perhaps it’s a problem with a build agent running one of the jobs in the pipeline. For example, the agent might have run out of disk space or gone offline. Or perhaps the failure is due to a mismatch in tool and library versions on build agents, developers’ machines, and the production environment.
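One way to make this first check cheaper is to run it automatically before the build starts. The following is a minimal sketch, not a feature of any particular CI product: the function name `preflight_problems`, the threshold, and the version dictionaries are all assumptions made for illustration. It reports low disk space and tool version mismatches so that an agent problem is not mistaken for a code problem.

```python
import shutil
from typing import Dict, List


def preflight_problems(
    workspace: str,
    min_free_bytes: int,
    expected_versions: Dict[str, str],
    actual_versions: Dict[str, str],
) -> List[str]:
    """Return a list of environment problems found before the build starts.

    All parameter names are hypothetical; how the actual versions are
    collected (for example by calling each tool) is left to the caller.
    """
    problems = []

    # shutil.disk_usage returns a (total, used, free) tuple in bytes.
    free = shutil.disk_usage(workspace).free
    if free < min_free_bytes:
        problems.append(f"low disk space: {free} bytes free, need {min_free_bytes}")

    # Compare the versions the project expects with what the agent reports.
    for tool, wanted in sorted(expected_versions.items()):
        found = actual_versions.get(tool)
        if found is None:
            problems.append(f"{tool}: not installed (expected {wanted})")
        elif found != wanted:
            problems.append(f"{tool}: found {found}, expected {wanted}")

    return problems


if __name__ == "__main__":
    # Example values only; they do not come from a real project.
    for issue in preflight_problems(".", 1_000_000_000, {"ruby": "2.7.2"}, {"ruby": "2.6.6"}):
        print("agent problem:", issue)
```

If the list comes back non-empty, the job can fail early with a message that points at the agent rather than at the commit.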

What if the failure is due to non-determinism and interactions between tests within the build, or worse, previous builds that have altered the build agent’s state?
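Order-dependent interactions are easier to track down when the order is randomized on purpose and the seed is recorded. The sketch below assumes a hypothetical helper, `shuffled_order`, that a test runner could call; it is an illustration of the idea, not the mechanism of any specific test framework.

```python
import random
from typing import List, Optional, Tuple


def shuffled_order(test_names: List[str], seed: Optional[int] = None) -> Tuple[int, List[str]]:
    """Return (seed, order) so that a failing order can be replayed later."""
    if seed is None:
        # Pick a fresh seed and hand it back so it can be written to the build log.
        seed = random.randrange(2**32)

    # Sort first so the result depends only on the seed, not on input order.
    order = sorted(test_names)
    random.Random(seed).shuffle(order)
    return seed, order


if __name__ == "__main__":
    seed, order = shuffled_order(["test_login", "test_signup", "test_checkout"])
    print(f"seed={seed} order={order}")
```

Re-running with the logged seed reproduces the same order, which turns a "random" failure into a repeatable one.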

Flaky builds hurt productivity and devalue your tests

Split Payments’ CI system was slowing them down: “Performance could sometimes be quite poor and unstable,” explains CTO Trevor Wistaff. “Wasting time debugging CI performance issues and simply having a slow feedback loop was hindering team performance.”

AppSignal were also suffering from spurious build and test failures. As a result, they saw the value of the team’s CI setup deteriorate. First, a lot of time was lost retrying builds to see if a failure was real. Second, Tom de Bruijn, an AppSignal developer, sank a lot of effort into debugging the brittle builds. Third, and most importantly, the random failures started to mask real issues. As Tom describes, “Failing tests snuck in. If the build always or randomly fails you’re not going to look at the build result anymore.”

Step 1: Commit to solving the problem

The first step in solving any problem is to acknowledge that you have a problem.

And since you’re most likely working in a team, that includes getting the whole team to acknowledge the problem — and commit to fixing it.

Every moderately complex code base eventually experiences flaky builds and/or tests. It’s how you deal with them that matters. If you act, you will solve the problem. If you don’t, problems will compound and working on the project will be a miserable experience.

Step 2: Use managed build environments to improve stability, isolation, and repeatability

To achieve the level of consistency needed to address the problem of flaky builds, current CI/CD best practice makes heavy use of virtualization.

Spinning up a clean virtual machine (VM) for every job means that each job’s runtime environment is well defined and repeatable, jobs are strongly isolated from each other, and configuration is consistent across all builds. This removes a major source of flaky, hard-to-reproduce errors.
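The effect of a clean environment can be shown at a much smaller scale than a VM. The sketch below is an analogy only: it uses a temporary directory in place of a fresh machine, and `build_job` is a hypothetical job whose outcome depends on files left behind by an earlier run.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_in_fresh_workspace(job: Callable[[Path], str]) -> str:
    """Run the job in a new, empty directory that is deleted afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        return job(Path(tmp))


def build_job(workspace: Path) -> str:
    """Hypothetical job whose result depends on leftover state."""
    marker = workspace / "build.cache"
    if marker.exists():
        return "stale"  # a previous run left files behind
    marker.write_text("cached output")
    return "clean"


if __name__ == "__main__":
    # Shared workspace: the second run sees what the first one left behind.
    with tempfile.TemporaryDirectory() as shared:
        print(build_job(Path(shared)), build_job(Path(shared)))

    # Fresh workspace per job: every run starts from the same state.
    print(run_in_fresh_workspace(build_job), run_in_fresh_workspace(build_job))
```

A per-job VM applies the same principle to the whole machine: installed packages, environment variables, and caches all start from a known state.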

Virtualization also brings another benefit. When you outsource running CI/CD to a cloud-based solution, the vendor takes care of VM images for you. Not only does this reduce the workload on your team, but the images are also kept stable and have up-to-date tools installed.

Even with rock-solid build machines, issues that need debugging will crop up. For example, configuration issues can appear during acceptance tests, which usually run in CI rather than on developers’ machines before a commit is pushed. Build systems typically capture logs, provide debug access to running and completed jobs, and store artifacts.

But there is another way to give developers even greater access to reproduce and debug issues: using Docker.

Step 3: Use containers to make your development, CI, and production environments bit-exact

When it comes to debugging build failures, Docker transforms the developer experience. With Docker containers, developers can reproduce troublesome build jobs and debug them within an identical environment on their own machine. There’s never any need to reduce the build system capacity by taking nodes offline to debug a job, and no need to log in to build machines and debug where you don’t have your preferred tools available.
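As a rough sketch of what reproducing a job locally can look like, the helper below assembles a `docker run` command that mounts the checked-out source into the container and runs the same command the CI job runs. The function name, the `/workspace` mount point, the image name, and the `make test` command are assumptions for illustration; substitute whatever your pipeline actually uses.

```python
import shlex
from typing import Dict, List, Optional


def docker_run_command(
    image: str,
    command: str,
    source_dir: str,
    env: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build the argument list for running one CI command in a container."""
    argv = [
        "docker", "run",
        "--rm",                            # remove the container when it exits
        "-v", f"{source_dir}:/workspace",  # mount the local checkout
        "-w", "/workspace",                # run the command from the checkout
    ]
    # Pass through any environment variables the job relies on.
    for name, value in sorted((env or {}).items()):
        argv += ["-e", f"{name}={value}"]
    argv += [image, "sh", "-c", command]
    return argv


if __name__ == "__main__":
    # Hypothetical image and command; neither comes from a real project.
    cmd = docker_run_command("example/ci-image:1.0", "make test", "/home/dev/app", {"CI": "true"})
    print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once Docker is installed, or simply printed and pasted into a terminal.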

At AppSignal, Tom’s number one piece of advice to anyone considering switching CI providers is to move the test environment to containers. Semaphore’s seamless support for Docker containers allowed the AppSignal workflow to be ported quickly. The team’s server test container, from AppSignal’s own, private Docker Hub collection, gave them total control of the runtime environment and insulated them from external changes.

In addition, running build jobs in Docker containers increases the granularity of control over the runtime environment. Another advantage is that spinning up Docker containers is quicker than starting a whole VM.

Further advantage can be gained from the huge array of public, optimized Docker images that are available. The library of images on Docker Hub allows anyone to get up and running quickly with an array of common use cases. And if a specialized environment is needed and not available, custom images can be derived from existing base images.

Semaphore’s free book on CI/CD with Docker covers this topic in great detail.

Final thoughts

Flaky CI builds can be a tough problem, and there are always project-specific details that we can’t cover in this post. However, the advice above should put you on the right track to achieving a 100% healthy continuous integration process.

Got questions or suggestions? Let us know in the comments.

Originally published at https://semaphoreci.com on December 10, 2020.