Software Testing: Reproducible Tests
In science, we always seek reproducible tests and experiments: any scientific paper ought to give enough instructions and guidance to reproduce its outcome. While software development is typically not as hard a science as, say, chemistry or physics, we still want reproducible tests. In this blog post we explain why reproducible tests are desirable. We also give a formal definition for reproducible tests. Furthermore, we describe some common pitfalls that create non-reproducible tests and provide solutions or workarounds for them.
Why is test reproducibility important? As in science, reproducible tests provide a certain type of guarantee of correctness. When a reproducible test passes, we know that the software works at least to the extent of what the test tests. Likewise, when the test fails, we know that the software is faulty. Furthermore, in software testing, reproducible tests can speed up bug fixing substantially.
Typically, bug fixing becomes tedious and time consuming exactly when it is unclear what causes the bug. If the bug was obvious, you would’ve already fixed it. But when investigation is required, test reproducibility is invaluable. You can run the test many times over, make changes to what logs you are collecting and so forth.
Furthermore, it is easier to debug when you can fully trust your test results. Consider the opposite scenario: your test randomly gives an incorrect verdict. Now all your analysis needs to account for this. Your analysis, which was hard to begin with, becomes even harder.
Definition of reproducible tests
Let’s assume we have written a piece of software that is now being tested. Furthermore, let’s assume our test verdict is binary: either the software works or it doesn’t. Now, if we run the test once, the verdict will indeed be binary. However, if we run the test several times over, not necessarily every test run returns the same verdict.
In the above scenario, if the test always returns the same verdict for the same input, we call the test reproducible. On the other hand, if the test's verdict changes at random although neither the input nor the software was changed, we call the test non-reproducible or flaky.
A reproducible test always returns the same result for the same input and the same software. A flaky test can return a different result on various runs although nothing was changed.
Note that according to the definition, there is no requirement for the test to return a binary result for the test to be reproducible. However, for non-binary test verdicts, such as with load testing or characteristics testing, it is harder to distinguish between a reproducible and a non-reproducible test.
Alright, we now know why we want reproducible tests, and we have defined what reproducible tests are. Let's continue with examples of what makes tests non-reproducible.
First of all, it is worth noting that it does not really matter whether the source of problems is in the product itself or in the test code or test setup. Both can have the same root causes, and both cause similar problems. Let's go through some typical problems.
Random Number Generators are an obvious source of randomness in a product or test. Instead of using real randomness during testing, it is best to use a reproducible seed in all random number generators. If we use reproducible seeds, we can reproduce the test scenario.
Note that it does not matter if the random number generator is used in the product itself or in the test. Even if the product uses real randomness in live operation, the random number generator needs to be mocked or to use reproducible seeding during testing to enable reproducibility.
For example, let’s assume your product greets a user with a welcoming “message of the day” upon start-up or login. Furthermore, let’s assume this MOTD is randomly selected from several predefined messages. Instead of having your test check if the displayed message matches one of the predefined messages, it is best to give a reproducible seed in the test scenario. Now, the test knows exactly which message will appear.
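To make the idea concrete, here is a minimal sketch of this MOTD scenario. The function name `pick_motd` and the message list are hypothetical; the point is that the random number generator is injected, so a test can pass a seeded instance and know exactly which message will appear.

```python
import random

MESSAGES = ["Welcome back!", "Have a great day!", "Hello there!"]

def pick_motd(rng):
    # Select the message of the day using an injected RNG, so tests
    # can pass a seeded instance and predict the exact message.
    return rng.choice(MESSAGES)

# Production code can pass random.Random() for real randomness.
# A test passes random.Random(42) and always gets the same message.
```

The design choice here is dependency injection: because the RNG is a parameter rather than a hidden global, the test does not need to patch anything to make the selection deterministic.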
Clocks, Timers and Timestamps are another source of problems, because they can “overflow”. A clock or timestamp overflows regularly: when the seconds count fills up, it wraps around and increments the minute count; when the minutes fill up, they increment the hour count; and so forth.
As an example, let’s assume we have code that reads the system clock’s minute count and adds x minutes to it. Now, even if we test by always using the same value for x, the result varies depending on the system clock. For instance, with x = 5, the resulting minute count is m + 5 when the current minute count m is in the range [0, 54], and m + 5 − 60 when m is in the range [55, 59].
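The wrap-around above can be expressed in one line. This is a hypothetical helper, not code from any particular product; it just makes the modular arithmetic explicit.

```python
def add_minutes(current_minute, x):
    # Minute counts live in [0, 59]; taking the sum modulo 60
    # implements the wrap-around described above.
    return (current_minute + x) % 60

# With x = 5: minute 10 becomes 15, but minute 57 wraps to 2
# (that is, 57 + 5 - 60).
```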
It is worth noting that the same applies to timers as to clocks and timestamps. However, nowadays many hardware timers use such large counters that overflow is rarely a problem. For example, a 64 bit timer counting nanoseconds takes over 500 years to overflow. Then again, a 32 bit timer counting nanoseconds overflows in less than 5 seconds, or in about 72 minutes if counting microseconds.
The easiest solution is to mock whatever baseline time is used in the code. Now, all tests that are not supposed to overflow, do not overflow. Furthermore, the tests designed to test overflowing behavior test overflows.
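As a sketch of this mocking approach, the snippet below freezes the system clock in a test using Python's standard `unittest.mock`. The function `minutes_from_now` is a hypothetical stand-in for product code that reads the clock; one test pins the clock away from the wrap-around, and a second test deliberately exercises the overflow path.

```python
import datetime
from unittest import mock

def minutes_from_now(x):
    # Hypothetical product code: the wall clock's minute count,
    # x minutes ahead of now.
    return (datetime.datetime.now().minute + x) % 60

def test_without_overflow():
    # Freeze the clock at minute 10 so the result no longer
    # depends on when the test happens to run.
    fixed = datetime.datetime(2024, 1, 1, 12, 10, 0)
    with mock.patch("datetime.datetime") as fake_dt:
        fake_dt.now.return_value = fixed
        assert minutes_from_now(5) == 15

def test_with_overflow():
    # A separate test pins the clock at minute 57 to cover
    # the wrap-around behavior deterministically.
    fixed = datetime.datetime(2024, 1, 1, 12, 57, 0)
    with mock.patch("datetime.datetime") as fake_dt:
        fake_dt.now.return_value = fixed
        assert minutes_from_now(5) == 2
```

With the baseline time mocked like this, tests not meant to overflow never do, and the overflow test always does.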
Asynchronous Functionality is another typical pitfall when trying to achieve reproducible tests. When a server operates in asynchronous mode, it returns “OK” to the user immediately, before the operation has completed. This is a problem for testing, as we don’t know exactly when the request has been completed.
Consider a scenario where we make a request to a server which will change the state in the server. Furthermore, let’s assume it takes a non-trivial amount of time for the change to take effect. Since we don’t know how long the propagation takes, how do we know when to query the server in order to verify the update was propagated successfully?
In the scenario above, we can add a “sleep” to wait some time after the original request has completed. However, since we rarely know exactly how long the propagation takes, we have to choose between a sleep that is too short, causing intermittent failures, and one that is too long, wasting time on every run. We could also re-run the test several times in case the change had not been properly propagated. However, this too is suboptimal, as the retries might hide bugs that occur only rarely, such as race conditions.
Instead, one solution to work around problems with asynchronous functionality is to make the code synchronous. Rather than having the server return immediately and operate asynchronously, the server could complete the operation before returning. Now, during testing, we would have full control of the state of the server.
However, making an asynchronous server operate synchronously is often not feasible. That leaves us with the last option: polling. Between the initial request and the next operation, which assumes the previous request has been fully propagated, we add a polling step.
The high-level description of the relevant parts of the test is as follows:
- Make a request to the server that changes state
- Poll to verify that the change is fully propagated
- Repeat polling as long as is needed
- Do the operation, which relies on the change being propagated
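The steps above can be sketched as a small polling helper. The name `wait_until` and the commented usage are illustrative; the `server` object and its methods are hypothetical, not any specific API.

```python
import time

def wait_until(predicate, max_attempts=60, interval=0.5):
    # Poll until predicate() returns True. Give up after
    # max_attempts so a bug cannot make the test loop forever.
    for _ in range(max_attempts):
        if predicate():
            return True
        time.sleep(interval)
    return False

# Usage sketch (server, new_config and its methods are hypothetical):
# server.update_config(new_config)
# assert wait_until(lambda: server.get_config() == new_config)
# server.operation_that_relies_on(new_config)
```

Failing fast when the predicate never becomes true gives the test a clear verdict instead of an open-ended hang.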
As we do not want to loop over the polling indefinitely, we need to choose sensible values for the frequency of polling and maximum attempts. In practice, a big value for the maximum attempts is usually suitable, as the limit will be hit only when we encounter a bug.
Externalities such as a remote server, when required by our product or test, are also a potential source of problems with regard to test reproducibility. For example, the network connection might be down, the server might be overloaded, or the server might have an incorrect configuration.
There is no general fix for this, other than not having externalities in the first place. As externalities are often strictly required, it is best to mock them as much as possible. For example, instead of using a live remote server when testing, use a mocked server which always works exactly the same way.
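As a minimal sketch of mocking an externality, the snippet below uses Python's standard `unittest.mock`. The function `fetch_greeting` and the client's `get` method are hypothetical product code, not a real library API; the mock stands in for the live server and always answers the same way.

```python
from unittest import mock

def fetch_greeting(client):
    # Hypothetical product code: ask a remote service for a
    # greeting and format it for display.
    return client.get("/greeting").upper()

def test_fetch_greeting_with_mocked_server():
    # The mock replaces the live server, so the network cannot
    # make this test flaky: it always behaves exactly the same way.
    fake_client = mock.Mock()
    fake_client.get.return_value = "hello"
    assert fetch_greeting(fake_client) == "HELLO"
    fake_client.get.assert_called_once_with("/greeting")
```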
Timing is a typical source of problems with regard to test reproducibility. While an issue with timing could be a race condition and a real bug in the product, it could also be something unique to the test setup.
Let’s assume that as part of our test our product connects to a remote server (in this example it does not matter whether we are adhering to the guidance on externalities or not). Now, if the server is slow to answer, our product might time out waiting for the answer. Note that while this makes the test non-reproducible, our product does not have a bug here. The product works as intended (it times out correctly), but the test verdict is still subject to timing.
This can be a very hard problem to solve as, in theory, the server could be arbitrarily slow. This holds even if we were not using a remote server but a mocked server that is part of our test setup.
The foolproof solution is to make the test aware of what happened: if the connection timed out, report a different result than if everything worked correctly. However, even this is suboptimal, as different runs then test different things, even if the test verdict is reproducible.
Ultimately, there is no optimal solution to this. The heuristic approach is to minimize the amount of timing sensitive functionality in the product and test setup. This might be a good approach anyway, as issues related to timing are often difficult to tackle.
Dependencies between tests are another typical pitfall. Often we have a substantial number of tests to run. Now, if product start-up or reconfiguration takes a long time, we might want to save time by leaving the product in a known state after a test has run. However, if sequential tests expect a certain state, we may run into problems with test reproducibility.
In general, there are three different types of problems we may hit. They are as follows:
- A failing test causes cascading failures
- Tests cannot be run as stand-alone
- Tests cannot be run in parallel (without modification)
The reason we may end up with cascading failures is that we expect the previous test to leave the product in a known state. This expectation may not be fulfilled when the previous test fails. The discrepancy in product state then causes subsequent tests to fail as well.
Similarly, if tests assume a known product state, it can be difficult to run a single test on its own. For example, if a test modifies the state but does not revert the change before completion, re-runs of the same test may fail. This can be nasty when debugging a single failing test.
In addition, nowadays, due to cloud computing, in many cases it is possible to run tests in parallel because specialized hardware is no longer needed. However, the capability to run tests in parallel may be degraded if tests expect to be run in a known sequence, where a previous test leaves the product in a known state.
Let’s consider a scenario where we want to populate an SQL database with data in order to test functionality in our product. For example, we have one test case without user data. In the next test case we have the user data, but some fields are missing. In the third test case all necessary data is populated. If the third test case expects the data populated by the second test case to exist, we may hit the problems described above.
How to solve this? Note that we often have lots of test cases, and product initialization can take a considerable amount of time, so we want to avoid a full restart and reconfiguration when feasible. A preferable solution is to have each test case figure out what the product state is and make adjustments only when needed.
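As a sketch of this "inspect, then adjust" approach, the snippet below uses an in-memory SQLite database. The table layout and the helper `ensure_user` are hypothetical; the point is that each test case declares the state it needs, and the helper changes only what is missing, so re-runs and out-of-order runs stay safe.

```python
import sqlite3

def ensure_user(conn, name, email):
    # Inspect the current state and adjust only when needed:
    # insert the user if absent, leave the row alone otherwise.
    row = conn.execute(
        "SELECT 1 FROM users WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO users (name, email) VALUES (?, ?)",
            (name, email),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY, email TEXT)")
ensure_user(conn, "alice", None)  # test case needing a user with no email
ensure_user(conn, "alice", None)  # re-running is a no-op, not a failure
```

Because the helper is idempotent, a test no longer cares whether a previous test already created the row or crashed before doing so.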
In this blog post we discussed why reproducible tests are desirable. We gave a formal definition for reproducible tests and described six typical problems that can make tests non-reproducible. Furthermore, we discussed solutions for each of the problems.