A test that intermittently fails for no apparent reason — or works on your local machine but fails in CI — is called a flaky test. Flaky tests slow down progress, hide design problems, and cost a lot of money in the long run.
Flaky tests are pretty common in the software industry, but they only become a serious problem when we’re not proactive about fixing them.
The problem with flaky tests
An essential property of an automated test is its determinism: if the code hasn’t changed, the results shouldn’t either. Flaky tests neutralize the benefits of CI/CD and shake the team’s trust in their test suite.
You may be thinking that if a test fails randomly, you can game the system by retrying it until it passes. If this sounds like a good plan, you’re not alone. Some test runners even do this for you automatically.
While re-running flaky tests is a popular “fix” — more so when a deadline approaches — is it a real solution? How confident are you in your test after having taken this road? How many times must the test fail before you declare it a “real” failure? There are no satisfying answers. The game is rigged; you can’t win with this strategy.
Step 1 — Commit to fixing the problem right away!
The first appearance of a flaky test is the best moment to fix it. Maybe the test is new, or a recent commit changed its stability. This means that the related history is still fresh in the developers’ memory and that they can act quickly.
Nothing is more frustrating than trying to push a hotfix down the pipeline and seeing a flaky test stand in your way. Flaky tests slow down development by affecting both the test suite and CI pipeline. Retrying tests may temporarily solve the issue, but by slowing down CI/CD, you’re wasting time and reducing your capacity to deliver software.
If you don’t have enough time to fix a test right away, you should document it, create a ticket, and start working on it as soon as possible. Keep in mind that reporting a problem does not equal fixing it. And by not fixing it, the technical debt in your project grows.
Step 2 — Find the flaky tests in your suite
Flaky tests are statistical by nature, so you’ll need to follow a test over a few days to understand its behavior. The more it runs, the more likely a pattern will emerge.
One efficient way to do this is to use a CI service such as Semaphore. By scheduling a run on the main or failing branch every hour, or even more often, you will gather enough data to document the problem within a couple of days.
Another benefit of scheduling is that builds execute at different times of the day. If you notice a pattern — for example, a test fails every time between 3 and 5 am — you are one step closer to fixing the test.
Saving debugging information
Save every scrap of information that can help you find the root cause of the flakiness. Event logs, memory maps, profiler outputs: the key to the problem can be anywhere. Set your logging level to debug, as I do here with Log4j, and print the messages to a file:
# log4j2.properties
appenders = file
appender.file.type = File
appender.file.name = FILE
appender.file.fileName = ${sys:log}/log.out
rootLogger.level = debug
rootLogger.appenderRefs = file
rootLogger.appenderRef.file.ref = FILE
Feel free to add debugging messages or any other instrumentation that helps you make sense of the logs.
import com.foo.Bar;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class MyApp {

    private static final Logger logger = LogManager.getLogger(MyApp.class);

    public static void main(final String... args) {
        logger.debug("Entering main loop.");
        Bar bar = new Bar();
        if (!bar.doIt()) {
            logger.error("Didn't do it.");
        }
        logger.debug("Exiting main loop.");
    }
}
SSH debugging
For quick diagnosis, you can run the job interactively with SSH debugging. Semaphore gives you the option to access all running jobs via SSH, restart your jobs in debug mode, or start on-demand virtual machines to explore the CI/CD environment.
You can reproduce the conditions that caused the test to fail and try out ways of fixing it. The changes will be lost when the session ends, though, so you’ll need to re-apply any modifications as a normal commit in your repository.
sem debug job 0265bd94-e2d1-4d5c-b89a-88f918dbf3a2
* Creating debug session for job '0265bd94-e2d1-4d5c-b89a-88f918dbf3a2'
* Setting duration to 60 minutes
* Waiting for debug session to boot up ..............
* Waiting for ssh daemon to become ready .......
Semaphore CI Debug Session.
- Checkout your code with `checkout`
- Run your CI commands with `source ~/commands.sh`
- Leave the session with `exit`
Documentation: https://docs.semaphoreci.com/essentials/debugging-with-ssh-access/.
semaphore@semaphore-vm:~$
A picture says more than a thousand words
The most challenging class of flaky errors to debug involves the UI. End-to-end and acceptance tests depend on graphical elements not represented in logs.
Configure your test framework to dump HTML or screenshots when a test fails. You’ll be happy to have something to look at when the error strikes. The following example shows how to save a rendered page as an image with Cucumber.
// Cucumber hooks run after each scenario; here we attach a screenshot when the scenario has failed
@After
public void afterAcceptanceTests(Scenario scenario) {
    try {
        if (scenario.isFailed()) {
            final byte[] screen = ((TakesScreenshot) driver).getScreenshotAs(OutputType.BYTES);
            scenario.embed(screen, "image/png");
        }
    } finally {
        driver.quit();
    }
}
And here’s the same idea in Ruby, using RSpec with Capybara, which also lets you take screenshots in tests.
after(:each) do |example|
  if example.exception
    # print the page DOM
    print page.html
    # save a screenshot of the browser
    page.save_screenshot('screenshot.png')
  end
end
Configure reports in the pipeline
Every job in Semaphore runs in an isolated environment, which means that all generated files (including any logs) are lost when the job is done. Without that information, it’s very difficult to find the root cause of the flakiness. Semaphore offers two mechanisms to preserve files: artifacts and test reports.
Artifacts are the easiest way of storing data from your pipelines:
- Save your files into a predetermined location.
- Add the following command to the CI job (see the pipeline sketch below):
artifact push job <my_file_or_dir>
- Download the files in the Artifacts tab in your job.
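For instance, here is a minimal sketch of how that push command could fit into a pipeline so that debug output is saved even when the job fails. The machine type, image, test command, and paths below are assumptions; adjust them to your project.

# .semaphore/semaphore.yml (sketch)
version: v1.0
name: Tests
agent:
  machine:
    type: e1-standard-2
    os_image: ubuntu2004
blocks:
  - name: Run tests
    task:
      jobs:
        - name: Unit tests
          commands:
            - checkout
            - mvn test
      epilogue:
        always:
          commands:
            # the epilogue runs whether the job passed or failed,
            # so logs and screenshots are never lost
            - artifact push job target/surefire-reports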
Test reports require more setup but are much more powerful. This feature generates a categorized, sortable dashboard of all the tests in your pipeline. It works with any test framework that can output results in the JUnit XML format. For more information on setting up test reports, check out the test reports documentation.
Step 3 — Document flaky tests
After driving the flaky tests out into the open:
- Document every flaky test in your ticketing system.
- As you acquire more information about the cause of a test’s flakiness, add it to the ticket.
- Feel free to immediately fix the tests whose reason for flakiness is already apparent.
Step 4 — Diagnose the cause and fix the test
In some cases, the cause of a test failure is obvious. This means that you can fix the test and close the case quickly. The problem arises when it’s not immediately clear why a test fails. In this case, you will need to analyze all the garnered data.
Let’s look at common causes for flakiness and their solutions.
Environmental differences
Differences between your local development machine and the CI fall into this category. Variances in operating systems, libraries, environment variables, number of CPUs, or network speed can produce flaky tests.
While having 100% identical systems is impossible, being strict about library versions and consistent in the build process helps avoid flakiness. Even a minor version change in a library can introduce unexpected behavior or new bugs. Keeping environments as similar as possible throughout the CI process reduces the chance of creating flaky tests.
Containers are great for controlling what goes into the application environment and isolating the code from OS-level influence.
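In Semaphore, for example, a job can run inside a Docker image that you control, so every run sees exactly the same libraries. The image name below is only an illustration; in practice you would pin an image you build and version yourself.

# sketch of a pipeline agent that runs jobs inside a pinned Docker image
agent:
  machine:
    type: e1-standard-2
  containers:
    - name: main
      image: 'registry.semaphoreci.com/ruby:2.7'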
Non-deterministic code
Code that relies on unpredictable inputs such as dates, random values, or remote services produces non-deterministic tests.
Preventing non-determinism means exerting tight control over your test environment: inject known data in place of otherwise uncertain inputs using fakes, stubs, and mocks.
In the following example we override now() with a fixed value, effectively removing the non-deterministic aspects from the test:
@Test
public void methodThatUsesNow() {
    String fixedTime = "2022-01-01T12:00:00Z";
    Clock clock = Clock.fixed(Instant.parse(fixedTime), ZoneId.of("UTC"));

    // now holds a known datetime value
    Instant now = Instant.now(clock);

    // the rest of the test...
}
Asynchronous wait
Flaky tests can appear when the test suite and the application run in separate processes. When a test performs an action, the application needs some time to complete the request. Only then can the test check whether the action has yielded the expected result.
A simple solution for the asynchrony is for the test to wait for a specified period before it checks if the action has been successful:
click_button "Send"
sleep 5
expect_email_to_be_sent
The problem here is that, from time to time, the application will need more than 5 seconds to complete the task. In that case, the test will fail. Also, if the application typically needs around 2 seconds to complete the task, the test will be wasting 3 seconds every time it executes.
There are two better solutions to this problem: polling and callbacks.
Polling is based on a series of repeated checks of whether an expectation has been satisfied.
click_button "Send"
wait_for_email_to_be_sent
The wait_for_email_to_be_sent method would check if the expectation has been satisfied. If it hasn't, it sleeps for a short time (say, 0.1 seconds) and checks again. The test fails after a predefined number of unproductive attempts.
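Here is a minimal sketch of that polling pattern in Java; the emailWasSent() check and the timing values are placeholders for your own condition and time budget. Libraries such as Awaitility package the same idea behind a fluent API.

// polls a condition until it passes or the attempts run out
static void waitForCondition(java.util.function.BooleanSupplier condition) throws InterruptedException {
    int attempts = 50;                 // e.g. 50 attempts x 100 ms = 5-second budget
    while (!condition.getAsBoolean()) {
        if (--attempts == 0) {
            throw new AssertionError("Condition not met within the allotted time");
        }
        Thread.sleep(100);             // short pause between checks
    }
}

// usage in a test (emailWasSent() is a hypothetical check in your test code):
// waitForCondition(() -> emailWasSent());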
The callback solution allows the code to signal back to the test when it can start executing again. The advantage is that the test doesn’t wait longer than necessary.
Imagine we have an async function that returns a value by reference:
function someAsyncFunction(myObject) {
  // function body ...
  // return value by reference
  myObject.return_value = "some string";
}
How do we test such a function? We can’t simply call it and compare the resulting value because, by the time the assertion is executed, the function may not have been completed yet.
// this introduces flakiness
let testObject = {};
someAsyncFunction(testObject);
assertEqual(testObject.return_value, "some string");
We could put a sleep or some kind of timer in place, but this pattern also introduces flakiness. Much better is to refactor the async function to accept a callback, which is executed when the body of the function is complete:
// run a callback when the function is done
function someAsyncFunction(myObject, callback) {
  // function body ...
  // execute the callback when done
  callback(myObject);
}
// move the test inside the callback function
function callback(testObject) {
  assertEqual(testObject.return_value, "some string");
}
Now we can chain the test to the async function, ensuring the assertion runs after the function is done:
let testObject = {};
someAsyncFunction(testObject, callback);
Concurrency
Concurrency can be responsible for flakiness due to deadlocks, race conditions, leaky implementations, or implementations with side effects. The problem stems from using shared resources.
Check out this test for a money transfer function:
function testAccountTransfer(fromAccount, toAccount) {
  lockFrom = fromAccount.lock()
  lockTo = toAccount.lock()

  beforeBalanceFrom = getBalance(fromAccount)
  beforeBalanceTo = getBalance(toAccount)

  transfer(fromAccount, toAccount, 100)

  assert(beforeBalanceFrom - getBalance(fromAccount) == 100)
  assert(getBalance(toAccount) - beforeBalanceTo == 100)

  lockTo.release()
  lockFrom.release()
}
If we run multiple instances of this test in parallel, we risk a deadlock in which each test holds a lock the other needs, so neither test ever finishes.
// both tests running in parallel can cause a deadlock
testAccountTransfer('John', 'Paula')
testAccountTransfer('Paula', 'John')
The failure can be prevented by replacing the shared resource (the account) with a mocked component.
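As a sketch, each test could build its own in-memory test-double accounts, so parallel tests never compete for the same locks. The FakeAccount class and the transfer() function below are hypothetical stand-ins for your own code.

// each test creates its own isolated fake accounts: no shared state, no locks
class FakeAccount {
    private long balance;

    FakeAccount(long initialBalance) { this.balance = initialBalance; }

    long getBalance() { return balance; }

    void withdraw(long amount) { balance -= amount; }

    void deposit(long amount) { balance += amount; }
}

// usage inside a test (transfer() is the hypothetical function under test):
// FakeAccount from = new FakeAccount(500);
// FakeAccount to = new FakeAccount(200);
// transfer(from, to, 100);
// assert from.getBalance() == 400 && to.getBalance() == 300;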
Order dependency
Order-dependency problems arise when tests are executed in a different order than the one intended. One way to solve this issue is to always run the tests in the same order. However, this is a poor solution, as it means accepting that the tests are brittle and that their execution depends on a carefully built environment.
The root of the problem is that tests depend on shared mutable data. When it’s not mutated in a predefined order, tests fail. This issue is resolved by breaking dependency on shared data. Every test should prepare the environment for its execution and clean it after it’s done.
Look at this Cypress test and think about what would happen if we reversed the order of the subscribe and unsubscribe tests.
describe('Newsletter test', () => {
  it('Subscribes to newsletter', () => {
    cy.visit('https://example.com/newsletter');
    cy.get('.action-email').type('fake@email.com');
    cy.get('.subscribe-button').click();
    cy.get('.message').should('have.value', 'Subscribed successfully');
  });

  it('Unsubscribes from newsletter', () => {
    cy.visit('https://example.com/newsletter');
    cy.get('.action-email').type('fake@email.com');
    cy.get('.unsubscribe-button').click();
    cy.get('.message').should('have.value', 'Unsubscribed successfully');
  });
});
In addition to issues with the data in a database, problems can also occur with other shared data, e.g., files on a disk or global variables. In these cases, a custom solution needs to be developed to clean up the environment and prepare it before every test.
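As a sketch of that idea in Java with JUnit 5, each test can prepare its own environment and clean up afterwards. The temporary directory and the seedTestData()/cleanDatabase() helpers below are placeholders for whatever shared state your tests touch.

import java.nio.file.Files;
import java.nio.file.Path;

import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

public class IsolatedEnvironmentTest {

    private Path workDir;

    @BeforeEach
    void setUp() throws Exception {
        // every test starts from a fresh working directory and freshly prepared data,
        // so no test depends on what a previous test left behind
        workDir = Files.createTempDirectory("test-");
        // seedTestData();    // hypothetical helper: load known fixtures
    }

    @AfterEach
    void tearDown() throws Exception {
        // cleanDatabase();   // hypothetical helper: undo whatever the test changed
        Files.deleteIfExists(workDir);
    }

    @Test
    void worksRegardlessOfExecutionOrder() {
        // the test relies only on the state created in setUp()
    }
}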
Improper assumptions
It’s unavoidable to make assumptions while writing tests. Maybe we expect some dataset to be already loaded in the database. Sometimes reality surprises us, say with a daylight-saving day that has less than 24 hours.
The best we can do is make tests completely self-contained, i.e., prepare the conditions and set up the scenario within the test. The next best thing is to check our assumptions before relying on them. For instance, JUnit has an Assumptions utility class that aborts the test (but does not fail it) if the initial conditions are not suitable.
@Test
void testOnDev() {
    // abort the test (without failing it) unless it runs in the DEV environment
    Assumptions.assumeTrue("DEV".equals(System.getProperty("ENV")));

    // the rest of the test executes only when the assumption holds...
}
Conclusion
Remember, it’s easier to write a test than to maintain it. Strategies for fixing a flaky test are highly dependent on the application, but the approach and ideas outlined in this article should be enough to get you started.
Originally published at https://semaphoreci.com on February 3, 2022.