The first is automatic retries of fast tests. If a test runs quickly and fails, it costs us little to try again just to make sure. Most of our unit tests are configured to run up to 3 times.
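For illustration, here is a minimal sketch of that retry policy in Python; the run_with_retries helper and its default of 3 attempts are assumptions for the example, not Chromium's actual runner.

    # Minimal sketch of "retry fast tests up to 3 times"; hypothetical helper,
    # not Chromium's real configuration.
    def run_with_retries(test_fn, max_attempts=3):
        """Run test_fn up to max_attempts times; pass if any attempt passes."""
        for attempt in range(1, max_attempts + 1):
            try:
                test_fn()
                return True, attempt       # passed on this attempt
            except AssertionError:
                if attempt == max_attempts:
                    return False, attempt  # exhausted retries: real failure
        return False, max_attempts

A runner that records both the final verdict and the attempt count keeps the flakiness signal around instead of hiding it.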
Another is keeping a database of individual test case passes/failures across all time. This lets us automatically mark tests as flaky if they fail often, and ignore their results programmatically rather than requiring a human to manually mark the test as ignorable.
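As a hypothetical sketch of such a history database (the schema and the 5% threshold are made up for the example, not the actual Chromium setup):

    # Track per-test-case results over time and flag tests whose historical
    # failure rate crosses a threshold. sqlite3 is from the standard library.
    import sqlite3

    conn = sqlite3.connect("test_history.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results "
                 "(test_name TEXT, run_id TEXT, passed INTEGER)")

    def record_result(test_name, run_id, passed):
        conn.execute("INSERT INTO results VALUES (?, ?, ?)",
                     (test_name, run_id, int(passed)))
        conn.commit()

    def is_flaky(test_name, threshold=0.05):
        total, passes = conn.execute(
            "SELECT COUNT(*), COALESCE(SUM(passed), 0) FROM results "
            "WHERE test_name = ?", (test_name,)).fetchone()
        return total > 0 and (1 - passes / total) > threshold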
A third thing is, obviously, automatically filing bugs against the owners/authors of tests which have been marked flaky. This is controversial -- often a test is just fine until one of its underlying libraries has a race condition introduced, and the real person to fix it should be the author of that change, not the author of the test. But it is still a step in the right direction much of the time.
Many people subscribe to the philosophy that "a flaky test is worse than no test", because you think it is giving you information when in fact it is giving you none. I subscribe to a slightly different philosophy: "A test with a known flake rate is hugely valuable". If you know how often a test flakes (statistically), then you can measure variances from that rate to detect changes. Of course, a flaky test with an unknown flake rate is still useless. Hence the second initiative above: measuring the rate of flake of everything.
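To make "measure variances from that rate" concrete, here is a sketch using an exact one-sided binomial tail; the 1% significance level and the window size are arbitrary choices for the example.

    # Given a test's known historical flake rate, flag a recent window whose
    # failure count would be very unlikely under that rate.
    from math import comb

    def flake_rate_regressed(historical_rate, recent_runs, recent_failures,
                             alpha=0.01):
        """True if >= recent_failures in recent_runs is improbable (< alpha)
        under the historical flake rate, i.e. the rate likely changed."""
        p = historical_rate
        tail = sum(comb(recent_runs, k) * p**k * (1 - p)**(recent_runs - k)
                   for k in range(recent_failures, recent_runs + 1))
        return tail < alpha

    # A test that historically flakes 1% of the time but failed 5 of its last
    # 100 runs is worth investigating:
    print(flake_rate_regressed(0.01, 100, 5))   # True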
Mozilla has a big problem with flaky tests (aka "intermittents") too. The SpiderMonkey VM can run tests in a deterministic mode to make tests less flaky by making things like GC more predictable.
Firefox has an experimental "chaos mode" that takes the opposite approach. It purposely randomizes behavior by adjusting thread priorities, changing hash table iteration order, and randomizing timer durations. Unfortunately, many flaky tests fail in chaos mode, so it is not enabled by default.
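Chaos mode itself lives inside Gecko, but the general idea of deliberately injecting timing and ordering noise so that hidden assumptions surface can be sketched at the test-harness level; everything below is an illustrative stand-in, not Firefox code.

    # Illustration of the chaos-mode idea in a generic harness: randomize the
    # things tests tend to silently depend on. Not Firefox's implementation.
    import random
    import time

    def chaos_sleep(max_jitter_ms=50):
        """Random delay, standing in for randomized timer durations."""
        time.sleep(random.uniform(0, max_jitter_ms) / 1000.0)

    def chaos_order(items):
        """Return items in a random order, standing in for randomized
        hash-table iteration order."""
        items = list(items)
        random.shuffle(items)
        return items

    # Example: process "events" in a shuffled order with random pauses.
    for event in chaos_order(["paint", "network", "timer"]):
        chaos_sleep()
        print(event)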
But we (Mozilla) are also doing many of the same things as the Chromium team here. In particular there is work in progress to automatically "ignore" the results of known-flaky tests until we detect that there has been a change in the rate of flakiness, at which point we will — assuming all goes to plan — trigger new test runs until we can determine the point at which the regression was introduced.
I think one of the lessons we've learnt is that with a browser-type project it's very hard to make test runs fully deterministic, for both technical and human reasons.
The technical reasons are touched on in the original article: these are complex codebases with lots of moving parts and lots of environmental dependencies. Of course there are various tactics to try and combat this; for example there is a wiki page dedicated to innocuous-looking code that leads to intermittent tests [1].
The human reasons centre around the difficulty of getting people to care about spending time fixing a test that fails one time in 1,000 (which is still very noticeable when you are running it hundreds of times a day). Unless the issue is something that fits a known pattern it's hard work, difficult to tell if your fix even worked, and not likely to be considered a top priority due to the diffuse, hard to quantify, nature of the benefits.
I think the fact that both Google and Mozilla still have significant problems with intermittents despite talented engineering staff and it having been a known problem for years implies that some of the standard thinking about making tests fully deterministic simply doesn't apply; for this kind of work you have to embrace — or at least accept — the randomness, and look for ways to get the data you need despite the noise.
That's a good point that making tests fully deterministic is not actually possible. But users aren't deterministic either, so we must accept noisy test environments because that's what users see. Tracking changes in the rate of flakiness is an interesting idea.
Could you identify "all" the flaky tests by running all the tests 100 times on the same stable build (like the latest ESR)? Is it even possible to write a test that could pass 100 times in a row? :)
Running a test N times will certainly detect some fraction of all the flaky tests. It's something we occasionally do manually to work out if e.g. a certain intermittent is (likely) fixed, and it's something that we'd like to do more of to quarantine new tests.
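To put numbers on how much an N-times run can tell you (assuming independent runs, which real intermittents often are not), a quick sketch:

    # Probability that a test with true flake rate p passes all N runs.
    # For the 1-in-1000 case mentioned above, even 100 clean runs prove little.
    def prob_all_pass(flake_rate, runs):
        return (1 - flake_rate) ** runs

    for p in (0.01, 0.001):
        for n in (100, 1000, 3000):
            print(f"flake rate {p}: P(all {n} runs pass) = "
                  f"{prob_all_pass(p, n):.3f}")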
Unfortunately there are various confounding factors that mean many intermittent tests that would look clean in such a run might nevertheless be problematic. For example, if you only run the tests that you think are intermittent, problems that are triggered by state left over from a previous test won't be found. This is one reason that we've been trying to run particularly problematic test types (e.g. firefox browser-chrome tests) in smaller groups, restarting the browser with a clean profile between groups to clear the state. A group size of 1 would obviously be ideal here, but when you have thousands of tests and limited resources it's not practical.
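A hypothetical sketch of that grouping strategy (run_group is a placeholder for whatever actually drives the browser, and the group size of 20 is arbitrary):

    # Split the test list into small groups and give each group a fresh
    # profile directory, so state left by earlier tests can't leak across
    # group boundaries. Placeholder harness, not Mozilla's.
    import tempfile

    def chunked(tests, group_size):
        for i in range(0, len(tests), group_size):
            yield tests[i:i + group_size]

    def run_suite(tests, run_group, group_size=20):
        results = {}
        for group in chunked(tests, group_size):
            with tempfile.TemporaryDirectory() as profile_dir:
                # new browser instance + clean profile per group
                results.update(run_group(group, profile_dir))
        return results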
The other problem is tests that have unexpected sensitivity to the environment. For example, the other day DNS was being slow on the test infrastructure. This isn't a problem for most tests since they use something like /etc/hosts. But some tests were intentionally trying to use a non-resolving domain, and those tests suddenly started to randomly time out.
I have dealt with debugging flaky tests before, on an embedded Linux app with multiple threads, a GUI, audio, and video streaming.
I debugged those bugs with custom gdb scripts. The test system would run the gdb scripts to start the app. The gdb scripts set up breakpoints and automatically logged stack traces and local variables for a particular bug.
The debugging environment was scripted and automated; on a test failure, the log file with all the gdb info was marked and tracked.
I scheduled the tests to run continuously overnight, and in the morning I analyzed the log files with the gdb info.
After that I could add more debug commands to the gdb script, or try to fix the bug if I had enough info. The cycle continued until all the bugs were fixed.
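As a rough sketch of what such a gdb script can look like using gdb's Python API (the function name and log path are placeholders for whatever you are chasing):

    # Run with:  gdb -x this_script.py ./app
    # On every hit of the breakpoint, append a backtrace and the local
    # variables to a log file, then let the program continue so the test
    # keeps running unattended overnight.
    import gdb  # only available inside gdb's embedded Python

    LOG_PATH = "/tmp/flaky_debug.log"          # placeholder

    class LoggingBreakpoint(gdb.Breakpoint):
        def stop(self):
            with open(LOG_PATH, "a") as log:
                log.write("=== hit {} ===\n".format(self.location))
                log.write(gdb.execute("bt", to_string=True))
                log.write(gdb.execute("info locals", to_string=True))
            return False  # False = do not actually stop, keep running

    LoggingBreakpoint("suspicious_function")   # placeholder symbol
    gdb.execute("run")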
I personally think it was a very good way to analyze and fix "flaky" test issues.
Even for tests that fail 1 in 1,000 times, the bugs can still be debugged and fixed.
It was actually more relaxing for me to debug this kind of problem: 99% of the time, the computer does all the work.
You guys might want to try it if it's not already done at Google.