We killed our end-to-end test suite (nubank.com.br)
274 points by jdminhbg on Sept 24, 2021 | 260 comments


Sounds like this specific e2e suite was poorly optimized and was killed instead of rewritten/optimized due to a perceived notion that inefficiencies are inherent in all e2e suites. If you maintain speed and strict curation of such a suite, most of the bullet points against are not an issue.

Also it sounds like the solution is just a bit higher than limited integration testing, which does have value of course. Sounds trite, but if you don't test end-to-end you aren't going to catch bugs that only appear end-to-end (which also happen to be the ones customers see, making the e2e suite a decent place for high-level regressions, assuming you maintain test performance of course). This is especially true in environment-specific scenarios.


Right, they talk about fighting for a queue. Firstly, a good test suite (or a configurable subset of it) can be run on the developer's workstation. Secondly, it needs to run on commits in a reasonable amount of time. This is just as true of E2E as of unit tests.

They also mention flaky tests. If there is a spectrum between unit tests that can run on a single function and e2e tests that need a complete system, the closer to e2e you get the more likely you are to have flaky tests.

Flaky tests are an indication of non-determinism either in your test or your system. If you have non-determinism in your system, then you can't confidently test it regardless of the flavor of tests you use. Non-determinism in your tests should be minimized; if you can take a random-seed as an explicit parameter, do so, so that you can reproduce the flaky failures. Test failures (flaky or not) are always indicative of a bug either in the test or in the system, and should be investigated as such. Flaky tests should be removed from the production testing system just like code that fails tests should be removed from production deployments.


Flaky tests are an indication of non-determinism either in your test or your system.

Yeah, my first thought upon reading the article was: If their E2E tests produced non-deterministic results due to asynchrony, how can they have any confidence that their production data ever becomes 'eventually consistent'?


All end to end tests are non-deterministic due to asynchrony. At some point you have to trust the discrete states of your software.


I mean the exact output given a certain set of inputs may be slightly different due to asynchrony, but given a set of inputs, there should be a finite set of correct outputs, and the test should check for those.

To use a stupid example: if listAnimals returns [cat, dog, mouse] some of the time and [cat, mouse, dog] other times, and your test passes on the former and not the latter, then your test is broken and you should fix it. If it sometimes returns [cat, dog, mouse, tree] then your system is broken and you should fix it.
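
Concretely, the test can accept either ordering instead of depending on it. A minimal Python sketch (list_animals here is just a stand-in that shuffles a fixed list):

    import random
    from collections import Counter

    def list_animals():
        # Stand-in for the real call; order legitimately varies between runs.
        animals = ["cat", "dog", "mouse"]
        random.shuffle(animals)
        return animals

    def test_list_animals_ignores_ordering():
        # Compare as multisets so reordering never flakes,
        # while an extra or missing element still fails.
        assert Counter(list_animals()) == Counter(["cat", "dog", "mouse"])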


A more accurate way to look at this based on your example is that, sometimes listAnimals returns [cat, dog, mouse], and sometimes it returns null.

It’s not that the result is nondeterministic, it’s that _whether or not the result is returned within the timeout of the polling mechanism_ is nondeterministic.


Presumably that happens in production as well, and the test can determine that the system does the proper thing when that happens?


I should be able to test that this usually works though, right?


You can test these things, sure. But if you're using other people's software (linux, vms, chromedriver, capybara) on other people's hardware (again, vms), you have to tolerate the fact that you can't control everything if you want to actually get work done. A little electrical, magnetic, or gravitational anomaly here, a little memory access blip there, some competition for cpu time elsewhere... I suspect there are probably only a handful of completely controlled environments on the planet and even those are suspect.

Test suites are sort of an eventual consistency problem themselves...


If you use other people's software and hardware, and those things don't perform the way your software assumes they perform, knowing that would be useful, right? There's always a limit to how much you want to handle, but if you are having a test fail even a large fraction of 1% of the time, then there's probably some underlying behavior that you should account for in production as well.


No, that test doesn’t give you any useful information, because all it told you was that your expected answer wasn’t found in the configured time interval. You have no way of knowing whether or not your expected behavior would be satisfied if you ran for t + 1 seconds.


After some time you have to consider the test failed and investigate, even if it would have succeeded had the timeout been 1 second larger. I cannot believe they do not have quality-of-service requirements. Testing those requirements is of course not easy. It may take too much time to run on every release, or it may be considered out of the scope of E2E tests, with compliance checked against telemetry results.

However, pick any response time mandated by the QoS requirements, multiply it by an appropriate x and use this as the pass/fail timeout for your test. Take a value large enough that exceeding it can easily be considered a bug (because e.g. the customer would think the operation failed and would hit refresh or back). You then have an issue that is definitely worth investigating. You may actually have reproduced a rare issue that is part of the long tail of your telemetry.
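
A sketch of that shape (the QoS number is made up, and order_status is a hypothetical check):

    import time

    QOS_P99_SECONDS = 2.0            # whatever the QoS requirement mandates
    TIMEOUT = 10 * QOS_P99_SECONDS   # generous multiple: missing this is a bug worth filing

    def wait_for(condition, timeout=TIMEOUT, interval=0.5):
        # Poll until the condition holds or the QoS-derived timeout expires.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if condition():
                return True
            time.sleep(interval)
        return False

    # In a test: assert wait_for(lambda: order_status(order_id) == "settled")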


Right, have a timeout measured in minutes. The timeouts have zero effect on a clean run, so large timeouts have no effect on time to deploy if you require a clean run of tests for deploying.


Right, but the key word here being “usually” - if I can’t just run the test three times and assume 2/3rds of the time it’s good, how can I know it usually works in production?

Is the right solution really to throw up your hands and not test end to end ever? I guess the argument is more convincing if it’s not that it’s impractical, it’s just too expensive relative to the returns.


You can if it's BSD-licensed.


This is what’s known as “counterintuition.” You would think that you could, but you are wrong.

I’m not saying you can’t write a passing end to end test. Of course you can get it to pass some times. But they are inherently non-deterministic.


> Flaky tests should be removed from the production testing system just like code that fails tests should be removed from production deployments.

...then how do you know when third-party upstream services are obeying their contracts to your service, if not by testing how your service interacts with those third-parties?

(I know my answer, but I'm curious to hear yours.)


That's monitoring, not testing.

Of course, both have the same form: you run the system and verify that the results match what's expected. But monitoring is done constantly during the lifetime of your infrastructure, and verifies the entire infrastructure; while tests are done episodically, and verify your program or a component. Tests also often block some procedures, while monitoring doesn't (but it certainly starts some).


You can try to monitor that an endpoint responds quickly, but how do you monitor that it responds correctly? At the end of the day both tests and monitoring are forms of verification.

Some people run subsets of their tests in production as a form of monitoring. Sometimes monitoring does not pass or fail and is instead qualitative, like a dashboard or raw logging, without alerts.

I’d say there is a grey area between monitoring and testing; it is more precise to ask if you’re verifying pre-production, post-production, or both.


Generally, I think tests are used to validate changes to your service code (often as a gate to release it to production). Whereas monitoring is used to detect issues external to your code (often operated in production).

Edit: That is to say, what distinguishes testing from monitoring isn’t content, but purpose.


Monitoring can catch issues in the code. For example if an event is dead lettered or the application crashes unexpectedly, it triggers an alert, which may make you aware of some edge case you forgot to test. Both tests and monitoring can encompass validating code is running correctly, some even run their tests against production at regular intervals as a form of monitoring, for example see “datadog synthetic tests”, which could be characterized as both a test and a monitor. Many companies opting not to do traditional e2e tests actually still have them, they’re just running them against production instead of blocking CI (with the rationale they will prioritize fast detection and mitigation rather than trying to prevent bugs from entering prod)


Our solution was two test suites.

End-to-end (which I will fight for being the highest-value test suite, and it's not close) had no external dependencies.

And a separate test suite that touched external services, split into two components: one that tested our integrations, typically against a remote testbed (if the 3rd party was competent enough to have such a thing), and a second chunk that attempted to see if remote api behavior had changed. Which it does with annoying regularity.


There's basically no value in having tests against third-party code anyway, because all the test is going to do is tell you that they broke their interface. And by then, production is already broken.


I agree this is often the case but disagree that it always is; “testing” against the API can be a canary for your new usage of their third-party API not working the way you think it does.


I don’t know, I think having your tests assume that the 3rd party data looks a certain way is helpful. If that ever breaks in prod, then something needs to change, and your tests can change if the interface changes.


There are two things that can be tested here, not one: whether the upstream service conforms to the contract / API promise, and whether your code behaves correctly with respect to what the API promises.

So that gives you a number of options for testing the second one of those. Recording sample traffic and replaying it in the test suite is one approach. Actually running an instance of the service (if it's open-source - there's still value in paying someone to competently run an OSS service) in your test suite is another, as is running some clone of the service (e.g., if you're talking to S3, there are probably a hundred S3 API-compatible clones that are good enough to run in your test suite, even if, again, you are happy to pay Amazon to competently run production).

You also want to pay attention to the first one of those, but that's not a job for your test suite. That's the job for some balance between their test suite, your monitoring or production logging, and your business relationship with them.
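
To make the record-and-replay option above concrete, here's a cheap sketch (the fixture data and port are made up): capture upstream responses once, then serve them from a local stub so the test exercises your code against what the API promised, without the real third party in the loop.

    import json
    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Previously captured upstream responses, keyed by path (made-up fixture).
    RECORDED = {
        "/v1/accounts/42": {"status": 200, "body": {"id": 42, "balance": 100}},
    }

    class ReplayHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            hit = RECORDED.get(self.path)
            if hit is None:
                self.send_response(404)
                self.end_headers()
                return
            payload = json.dumps(hit["body"]).encode()
            self.send_response(hit["status"])
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(payload)

    def start_stub(port=8099):
        server = HTTPServer(("127.0.0.1", port), ReplayHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        return server   # point the code under test at http://127.0.0.1:8099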


Third-party services should be mocked for integration and end-to-end testing. Error conditions with respect to these services should be something that is monitored and alerted on when appropriate.


I always mocked out 3rd party services in my tests. I've never actually had a problem with some third party changing their API. That's the whole point of a versioned API anyway. I think when people talk about e2e tests, it's more about testing only integration between contracts that you own.


It’s valid to say you’re e2e testing your system, just not e2e testing the “full system”.

This is why the classification of the test into e2e, integration, and unit can cause confusion. I like to try to encourage people to avoid bucketing and instead say “this test should be more integration style than it currently is”, “this test should be more isolated than it currently is”. At the end of the day, all testing mocks out the user, and things like old web browsers or other factors that are part of the real-world system you care about may not be simulated in your test. So the way to get ”real” e2e verification is probably monitoring real users, if you consider that the user is a part of your “system”.


I’ve run into this a few times with some upstream package breaking and showing up in tests. I try to avoid mocking as much as possible in tests these days.


One thing I've done is adding the ability to run tests both with a "mocked" and a "real" version. The mocked version is fast and can be run quickly, the real version is much slower, but tests the actual real service. It's not that much extra effort to make in most cases, and I've caught some bugs when my mocked version made assumptions that were false, didn't cover some edge case, or whatnot.

That said, I too avoid mocks unless there's a specific good reason to add one.
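
The wiring can be as small as one fixture that picks the backend from an environment variable. A pytest sketch (the flag name and the fake client are made up; the real client would be plugged in where the skip is):

    import os
    import pytest

    class FakePaymentsClient:
        # Fast in-memory double used by default.
        def __init__(self):
            self._refunded = set()

        def refund(self, charge_id):
            self._refunded.add(charge_id)

        def was_refunded(self, charge_id):
            return charge_id in self._refunded

    @pytest.fixture
    def payments_client():
        # Flip one env var to run the same tests against the real service.
        if os.environ.get("RUN_REAL_SERVICES") == "1":
            pytest.skip("real client not wired up in this sketch")
        return FakePaymentsClient()

    def test_refund_marks_charge(payments_client):
        payments_client.refund("charge-1")
        assert payments_client.was_refunded("charge-1")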


I like the idea of testpoints in code that can be switched on or off, an idea originally from the hardware side. Modifying the testpoints to allow switching between different test implementations is a useful generalization of the idea.


This system didn't rely on third party services, so "not applicable" I guess?


> Flaky tests are an indication of non-determinism either in your test or your system

Or in the system that runs your tests. That can itself be non-trivial.


> Sounds like this specific e2e suite was poorly optimized and was killed instead of rewritten/optimized due to a perceived notion that inefficiencies are inherent in all e2e suites. If you maintain speed and strict curation of such a suite, most of the bullet points against are not an issue.

At least for web applications, all end to end test suites are slow and flaky. This is not an exaggeration - all of them. There are no magical optimizations. This is something that every project runs into, over and over again.

I will never willingly write an end to end test ever again. Unit / module tests + targeted integration tests are the only hope that we have.


I never had any problems writing reliable end-to-end tests. They are super useful for catching serious subtle bugs before the system goes into production. Not having solid end-to-end tests is a massive red flag for me.


+1 on this comment. We had issues with poorly written Selenium tests, and after rewriting new tests with Cypress and better test practices, the e2e tests are reliable enough to be used as canary testing in new environments without false-positive flaking. It ultimately comes down to how much you're willing to invest in writing good tests.

If you're suffering from seriously flaky e2e test results, more often than not it's from outdated tech, poor testing practices, or not enabling a selective retry on failures.


You hit the nail on the head, using your own words.

> selective retry on failures

This is why your test suite passes, not because of an avoidance of outdated tech or poor practices. You rerun your flaky tests until they pass. That’s bad engineering, and the definition of non-determinism.


I flat out don’t believe you.


Well it is true. The fact that you don’t believe it tells me you have a lot to learn. Writing good end-to-end tests is a skill you need to learn. Don’t assume that software developers can do it without proper training/learning. It is hard to do well.


My disbelief of you is from my own experiences writing them, and talking to dozens of colleagues across different companies.

Every single company has to heavily parallelize their e2e tests, and pays a huge CI bill on top of effort maintaining an overly complex CI config.

Even after this, every company has to have a retry mechanism for their e2e tests because at least one fails at least every test run. It’s also the first thing I ask about in interviews. I have many data points on this one, across many teams and companies.

It’s an abomination.

Maybe you’re not talking about interactive web applications. Or maybe you’re not talking about the scale where you have thousands of e2e tests written over multi-year projects. If you’re talking about a toy project, sure you might not run into nondeterminism. These problems present themselves in aggregate.


I am talking about very large scale systems with web + mobile + desktop clients in Enterprise environments. Think airlines and airports (operations and pairing/rostering). However it sounds as if your experience has been pretty bad. So I understand where you are coming from. My experience is that yes, it is hard to do right, but also that it is worth doing. YMMV of course.


I've never had any issues with the E2E tests in my company either, and I'm unsure where the flakiness would even come from. Using Cypress with great success.


> At least for web applications, all end to end test suites are slow and flaky.

Then you should be asking why they are flaky in the test environment. It's probably because your services are running on very slow servers.


No, it’s because end to end tests are non-deterministic with respect to execution time. They are deterministic only in their discrete states.

This is why all end to end testing involves polling to wait for asynchronous operations to complete, which is by definition non-deterministic.


No not really, what you are saying is that you don't know how long a test will take to complete; it can take 1 min or it can take 1 hour. If it sometimes takes 1 hour then you have to put on your detective hat and go look in the logs to see which service is slowing the e2e flow.


> I will never willingly write an end to end test ever again. Unit / module tests + targeted integration tests are the only hope that we have.

What are your "integration" tests that are not "end to end" tests like, how do they differ from end to end tests?


Integration tests may integrate smaller-than-the-whole groups of subsystems. It definitely gets fuzzy. A lot of people treat end-to-end and integration tests as equivalent, but piecing together everything-but-the-frontend and testing it is also an integration test, but not an end-to-end test.

If we consider tests as existing at and covering different scales, unit tests are at the smallest scale and integration tests run the gamut from 2 units to the entire system.


Picture a test that doesn’t involve any clicking of UI elements. That’s a start. So right there you’re avoiding the complexities of UI rendering, you just call the commands that are invoked by clicking directly.

Also, you can test frontend components together, and backend components together, but not cross the client-server boundary. Faster, more reliable.

That leaves a very small amount of e2e tests that you even want to write, and by that point I’m totally fine with manual smoke testing or automating them. But they’re the vast minority of tests.


Not OP, but testing groups of the "units" you unit test can have a lot of value.


Sounds like you’ve been subject to some pretty poor test setups. I’ve experienced good ones. My cynical take is that well maintained e2e tests aren’t a product priority in environments where they’re flaky and slow so they come as an afterthought. Not that they can’t be good. Usually product wants to ship code yesterday and doesn't care if there are bugs… so good test hygiene is nowhere to be seen.


This argument does not account for the fact that all e2e tests are non-deterministic, so the quality of your “setup” is not relevant.


You must have worked on some incredibly bad software to have non-determinism dominate your life. If a request fails then retry it just like a user would. If it keeps failing there's a problem. If it works then move on. I doubt your bank just throws in the towel and says "whelp software systems are inherently non-deterministic so we'll just forget some transactions here, allow the wrong amount over there, forget tests they're hard we can handle a little chance in our payment flows". The closest thing I've heard to that is amazon very occasionally shipping multiples of the same item because it was allegedly more expensive to implement immediate consistency than to ship a few duplicate items.


The main problem I see over and over with E2E tests is that they keep people from getting good at unit tests. The E2E are a magical security blanket that covers over all of the mistakes you’ve made leading up to them.

It’s much easier to build a testing pyramid from the bottom up. The skills maturity comes from the bottom of the pyramid, not the top, and thinking about the end game stunts your growth.

Often E2E tests have such sunk costs involved that they materially affect the project roadmap.


While I agree that writing unit tests is a lot harder, and you develop good skills in attempting to write them, I must say that in the projects I worked on most bugs were caught by integration tests (technically not E2E tests), and not unit tests.

I've also had projects with only unit tests, and almost no bugs were found by them, and there were plenty of bugs.

Ideally, I would like both. But if I had to have only one, I'd go with tests at a coarser granularity than unit tests.


My experience is that most bugs are found by randomisers / fuzzers. I’m consistently surprised they aren’t used more often, because they’re insanely good value for the time spent writing them.

Eg, a b-tree has a bunch of invariants: Leaves have equal height, data is sorted, nodes have between N/2 and N values, they contain everything that was inserted and not deleted, etc. So write a test which makes random changes to a b-tree in a loop, and makes those same changes to a simple sorted list. Every iteration, verify the invariants hold and values match. Every 1000 iterations, throw out the object and start again with a new seed. If the test ever fails, print out the seed for easy reproducibility.

In your unit testing suite, run this fuzzer for about 100ms or something. This catches lots of bugs. And occasionally leave the randomiser running overnight looking for rare bugs.

This sort of thing is so humbling, for the sheer volume of “obvious” bugs you find in otherwise working code. It’s hands down the best value testing code I’ve ever written.
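
The pattern in miniature, as a runnable Python sketch (it fuzzes a trivial bisect-sorted list rather than a real b-tree, just to show the seeded loop and the model comparison):

    import bisect
    import random

    def fuzz_once(seed, iterations=1000):
        rng = random.Random(seed)
        sut = []     # "system under test": a list kept sorted via bisect
        model = []   # naive model we trust: unsorted, re-sorted when checking
        for _ in range(iterations):
            if model and rng.random() < 0.4:
                value = rng.choice(model)       # delete an existing value
                model.remove(value)
                sut.remove(value)
            else:
                value = rng.randrange(1000)     # insert a random value
                model.append(value)
                bisect.insort(sut, value)
            # Invariants: sorted order, and same contents as the model.
            assert sut == sorted(sut), f"not sorted (seed={seed})"
            assert sorted(model) == sut, f"contents diverged (seed={seed})"

    def test_fuzz_sorted_list():
        # Small budget for CI; leave it looping overnight to dig for rare bugs.
        for seed in range(20):
            fuzz_once(seed)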


> The main problem I see over and over with E2E tests is that they keep people from getting good at unit tests.

I'd view that as a win then. Unit tests are next to worthless for anything but tightly bound domains, like libraries, in most cases.

They're actively harmful in things like application-level service code, where they're used to turn perfectly well-written code into a chockablock mess to satisfy "testability".


> Often E2E tests have such sunk costs involved that they materially affect the project roadmap.

I've seen this first hand. When the e2e tests take hours to run, are flaky on a good day, and are only really understood by one or two people on the whole team, they can be a major roadblock to new features or even just moderate refactors.


That definitely happens. The E2E tests tend to make assumptions about how the app works (encode not just the requirements but also the architecture) and some features change the design. In order to add this feature we have to fix dozens of other tests. I’ve seen people on multiple projects team up to fix these, take over a day working together, and still not be done. They always try to tweak the tests but the test assumptions fight them.

Meanwhile if we add a feature that invalidates a unit test, you just delete the unit test and start over. Unit tests are cattle, E2E tests are pets.


I should add as well: after that day, day and a half working together on old tests, those engineers look beaten down. They are not having a good time. It’s miserable work.

It must be some sort of Stockholm syndrome that people in this state still defend the tests. Even after they’ve invested more time and energy into fixing them than we ever would just manually testing that part of the code in perpetuity.


> The main problem I see over and over with E2E tests is that they keep people from getting good at unit tests. The E2E are a magical security blanket that covers over all of the mistakes you’ve made leading up to them.

Either alone is insufficient. Both together aren't necessarily sufficient.


Unit tests are worthless. End-to-end tests are 100% needed. You need tests that cover all use cases end-to-end including testing error cases. That’s the absolute minimum I would expect from a well-engineered system.


You can’t test all error cases end to end. If you can you have shitty error handling.

Clock skew between servers? Drifting clock skew? Disk space exhaustion? Disk space exhaustion at each possible failure point? There are so many of these and you’re going to inject most of them in unit tests.

My original point was that if you can’t write good unit tests your e2e tests are also going to be lousy, and you will never get good at either, let alone both, if you fixate on more coverage with E2E tests.

They’re also just too damned expensive even if they were qualitatively as good. Which they are not. They are less numerous, sure, but that’s false economy because they are usually 3 orders of magnitude slower.
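
For instance, a disk-full failure is trivial to inject at the unit level (a sketch; save_report is a made-up function under test):

    import errno

    def save_report(path, data, open_fn=open):
        # Made-up function under test: must surface disk exhaustion as retryable.
        try:
            with open_fn(path, "w") as fh:
                fh.write(data)
            return "ok"
        except OSError as exc:
            if exc.errno == errno.ENOSPC:
                return "retry-later"
            raise

    def test_disk_full_is_reported_as_retryable():
        def exploding_open(path, mode):
            raise OSError(errno.ENOSPC, "No space left on device")
        assert save_report("/tmp/report.txt", "x", open_fn=exploding_open) == "retry-later"

Passing open_fn in is what makes the failure injectable without patching globals, and without arranging an actually full disk at exactly the right moment in an end-to-end run.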


Of course you can. Part of the end-to-end test is to setup the test scenarios you want to test (including limited HD space etc.)


You're going to spin up a vm with the wrong system time, and then advance it between two operations that take 200 ms on a live system?

Bullshit.


What a BS straw man answer. There are much easier ways to do that kind of testing.


They are only slow if you do it wrong.


You're both kind of right. You need to E2E your product, and what your product is is what matters.


I'm not sure that the idea of e2e being relatively inefficient is just "perceived".

E2E tests in all orgs I worked at have always been the slowest and flakiest part, especially when simulating UI work and when working with systems that go beyond a handful of services.


I have seen efficient e2e suites, often built by and having a BDFL who had the same experiences as you. They have enforced best practices like "no sleeps", "no time-based tests", "every test must be concurrent and isolated", "refactor liberally", "bootstrap/share expensively allocated resources", etc.

I don't know how to say it humbly, but the biggest problem I've witnessed in slow e2e suites is that they are considered second-class pieces of software and only get the attention of QA engineers or developers who are not applying the same level of effort as their runtime code.


> I don't know how to say it humbly, but the biggest problem I've witnessed in slow e2e suites is that they are considered second-class pieces of software and only get the attention of QA engineers or developers who are not applying the same level of effort as their runtime code.

I replied in two other places on this thread before seeing this comment. It's very true. Since tests don't get shipped to customers, tests don't get the same level of effort. But when your tests are known to be of poor quality, people stop trusting them, and when people don't trust the tests, they stop adding any value.


I have a friend that works on a team whose whole job is writing e2e tests. Before them the tests were slow, buggy, and couldn't be run in parallel. Now they can be run in parallel and there are few-to-no false positives.

There's still challenges with this model (such as tracking changes on other teams, helping ensure that UIs are testable), but it seems to have worked out much better for their company than expecting every developer to write and maintain them.


Yeah but what's the point then, in that case you can just take back the old QA team and delete the gazillion lines of e2e test code and save yourself the liability of all that complexity. If it's cheap and simple to make a manual test, why replace that with something that complex, expensive and hard?


> If it's cheap and simple to make a manual test, why replace that with something that complex, expensive and hard?

Because you don't want to make _a_ manual test, you want to make _hundreds_ of tests.


If you are a human you have judgement and can determine which tests are the most critical and most relevant, so you don't have to always execute all of them. "It's just one line of css change to fix the styling, ok deploy". Second, a team of QA can very well make hundreds of tests in a day. And more importantly, they can really easily make decisions and draw conclusions such as "it's a bit slow sometimes, but overall acceptable", or "the animation is displayed correctly", or "there was a glitch in the rendering, but it's fine now", or "it works but the styling has moved slightly off center", etc., which expert test programmers can spend forever trying, and failing, to turn into deterministic automated tests.


Your comment matches my experience very well. I had the same experience as GP and OP with low-quality e2e tests at my job. I got fed up four years ago, started something new from scratch, and now I'm the BDFL you mentioned, for a bunch of teams working in a common testing framework.

The main thing is indeed enforcing high quality standards even when individual engineers aren't very invested. You've identified some good practices right in your post, but it can take some time for people to learn these principles. And they can be reluctant if they see it as a waste of time. "These are just tests, I need to do my real work!"

For me, the crucial thing here is to avoid building things that are just for testing. If you tell someone that sleeping here is not good enough, and they need to build something more elaborate - then it's much more compelling if you can figure out how to build that so it's not just useful for a test, but also useful in production. This can be things like more flexible configurations, recovery tools for emergencies, new monitoring scripts and systems... all kinds of stuff.

If you stay focused on building things that are flexible enough to be used for both testing and production, then your life gets harder in some ways, but you can be much more strict about requiring high-quality work.

(btw, I'm hiring for the team building this infrastructure: http://catern.com/tsint_job.html )


I'm someone who had to build some mocked services to do end-to-end testing (well, as much as we can). The stuff I work on involves making two DNS requests (to different providers) and a possible HTTP request (for notification), and these three end-points are not under our control (as far as the department I work in is concerned). The two DNS requests are made concurrently [1] and management wanted to test the following scenarios:

* A returns, then B;

* B returns, then A;

* A returns, B returns late [2];

* B returns, A returns late;

* A returns, B never returns;

* B returns, A never returns.

I had to implement a side channel from the testing program to the mocked DNS servers (because a program like bind is just overkill for this---seriously) to implement artificial delays in the responses. Kind of hard to justify that for a production server (and yes, there is an active bug where B returns but A doesn't and the wrong information is returned, but it happens so rarely in production [3] that it was deemed acceptable for now).

The other component, the notification via HTTP, required ensuring that a notification that wasn't supposed to happen, didn't happen. [4] Again, I had to implement a mock with a side channel to the testing program to inform if it was to expect a request or not, and then report after all the tests were run how many requests were actually made. If the value between the testing program and the mock didn't match, it's an error. Oh, it's also useful to inform the mock what HTTP status code to return for the test. Such fun.

Management doesn't seem to think these mocks are a waste of time, but it seems like you might.

[1] At least for now. In the past, there were cases where we were to only contact A; some cases where we contact B, then maybe A; and some cases where we contacted both. This was done to save money at the time because all queries cost us money.

[2] we have some real time constraints on handling queries from our customers, the Oligarchic Cell Phone Companies.

[3] Excessive KPI logging for the win here.

[4] Proving a negative---lovely. Thanks, management!
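
For what it's worth, the notification mock's shape was roughly this (a simplified sketch, not our actual code; the "side channel" is just shared state the test sets before the run and inspects afterwards):

    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class MockState:
        def __init__(self):
            self.status_to_return = 200   # set by the test per scenario
            self.expected_requests = 0    # set by the test per scenario
            self.received_requests = 0    # incremented by the mock

    STATE = MockState()

    class NotificationHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            STATE.received_requests += 1
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)       # drain the body
            self.send_response(STATE.status_to_return)
            self.end_headers()

    def start_mock(port=8098):
        server = HTTPServer(("127.0.0.1", port), NotificationHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        return server

    # After the run: assert STATE.received_requests == STATE.expected_requests
    # catches both the missing notification and the one that wasn't supposed to happen.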


It's a fair point, which shows the underlying problem with e2e: if you don't have a BDFL who's willing to fight on this hill, the system will eventually break down. This implies a huge amount of constant friction that I don't believe is sustainable over the long term.

Most engineering organizations don't have "excellent" leadership, and so most orgs are well served by having team dynamics such that they don't depend on that. A bunch of additional integration tests and a bit of formalization of the difficult parts that e2e tests cover (the sort of async message-passing stuff that has unpredictable bounds) seems like a far better alternative for most orgs.


> is that they are considered second-class pieces of software and only get the attention of QA engineers or developers who are not applying the same level of effort as their runtime code.

Another way to say this is that efficient e2e tests require significant continuous investment in top-tier engineer time. The question then is how much engineering time is worth being spent in that way.

It may be that, yes, you can have fast e2e suites, but doing so is too expensive to justify the cost.


You have to compare that against what was done instead. Their solution was to employ a few engineers to create a new contract-based test framework, which will also have to be maintained. I believe that counts as "significant" investment too, but the calculus has to be whether that is less costly than improving their E2E tests.


> The question then is how much engineering time is worth being spent in that way.

Well, since it brings more value than testing at a lower level, I would say: more than any other kind of test (except, maybe, for monitoring).

Another good question is: is there any kind of test that gives you good results without investing good engineers' time? If you find any, I'd ask you to share (but I would understand if you consider the information a market differentiator and won't).


I have generally found that integration tests with well-mocked external dependencies achieve 80% of the things E2E tests do with a quarter of the effort.


How do you run tests in parallel if part of the logic you are testing is a sql statement?

Do you just test them separately? For example, mock out the db when testing the app and then sequentially test the db to make sure the sql statement works as expected. However, this explicitly doesn't test the integration.


As one example, Django handles test parallelism by creating N test databases (on the single test database server) and dividing tests into N runners. https://docs.djangoproject.com/en/3.2/ref/django-admin/#envv...

You could also have multiple Docker containers running DBs.


Thank you! That's so crazy!


Not so crazy, it's very feasible to roll it yourself! Postgres has a "copy database" feature that's very useful (`CREATE DATABASE xxx WITH TEMPLATE yyy`).

I saw a project on HN a while ago focused on "managing isolated PostgreSQL databases for your integration tests", never used it but looks like a good idea: https://github.com/allaboutapps/integresql
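
A sketch of the template approach with plain psycopg2 (connection details made up; CREATE DATABASE has to run outside a transaction, hence autocommit):

    import psycopg2

    def make_test_database(worker_id, template="app_template"):
        # One throwaway database per test worker, cloned from a pre-migrated template.
        admin = psycopg2.connect(dbname="postgres", user="test", host="localhost")
        admin.autocommit = True   # CREATE DATABASE cannot run inside a transaction
        name = f"test_{worker_id}"
        with admin.cursor() as cur:
            cur.execute(f'DROP DATABASE IF EXISTS "{name}"')
            cur.execute(f'CREATE DATABASE "{name}" TEMPLATE "{template}"')
        admin.close()
        return name   # point this worker's app config at the new database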


One option I’ve used only works if there’s some natural partition of the data like a customer ID. Every test starts by creating a new customer account. Since by design customers can’t see each other’s data, therefore tests can’t interfere with each other and can run in parallel on a single database. After all, in production all your customers are going to be using the database at the same time right? So it needs to work anyway.


Cool. I think I can use this strategy when I add communities.


Here our tests are written in BDD (behavior-driven development) style, mimicking user actions and data expectations. During development, these are run against mocks (either in-memory DB or a mock repository). Individual small scenarios are also combined into realistic long-running processes, for example cases from opened to closed taking various paths.

The suite runs in parallel, fast and frequently, alongside unit tests. Then occasionally, such as before PR merges, the same scenarios are run against a clone of the production environment to catch any mismatch with the run-time environment, also in parallel (connections) to simulate multi-user usage. Any technical issue prompts an improvement of the mocks and rarely resurfaces.

Running these as scripts also doubles as a fake data generator to play with for manual testing, reporting, etc. Lastly, we proceed with some manual testing to validate new changes and pick up UI-related issues - we don't do UI automation.


In addition to the database pool approach, you can also write tests so that they are inherently independent. Each test creates and (optionally) deletes its own data, without making assumptions about what else is in the database. That's not ideal, as it's hard to know you're not making a hidden assumption.


Given that E2E tests should run in an environment that is more controlled than production, if you can't get an e2e test to perform reliably then it's a strong indication that your system won't perform reliably in production.

If an e2e test is not performing reliably not because it can't, but because the test is half-assed, then that needs to be treated as a bug in the test, and the test should not be used to assess the quality of your software. Developers (including me!) have a natural tendency to treat bugs in tests as lesser than bugs in the product, but given that bugs in tests will mask bugs in the product, this is a problem.

True story: a new e2e test was failing randomly. For 6 months nobody looked at it because "it was just a flaky test." A manager found out and insisted that someone fix the test, and it turned out the test was fine; it had just found a (non-deterministic) bug that had been in the product for over a decade.


> Given that E2E tests should run in an environment that is more controlled than production, if you can't get an e2e test to perform reliably then it's a strong indication that your system won't perform reliably in production.

"Reliably" isn't a binary indicator, but a spectrum of how frequently certain classes of bugs may appear in a system.

In the example that you were mentioning, it would appear that the amount of effort needed to maintain the e2e test suite was simply not worth it. How many man-hours were spent by your manager and staff ignoring the test suite? How critical was the bug (it would appear not much)? How much effort would have to be dedicated to get the e2e suite working well that won't be spent doing other classes of tests or feature development?

I'm not saying a well-maintained e2e suite doesn't work well or help to catch a lot of interesting production bugs. But I am saying that I think that for the vast majority of systems it's just not a good use of your time. Save your efforts and put more thought into the system design to avoid certain theoretical classes of errors and devote the rest of your time to better integration tests and that will likely serve more orgs better.


> How many man-hours were spent by your manager and staff ignoring the test suite?

I'm not sure what you mean by that. Ignoring the test-suite isn't something that you spend time doing. It was 6ish weeks with a team of 8ish people, so you could say "48 man-weeks" were spent ignoring it, but they were doing other things in that time, not just sitting at their desk proclaiming "I'm ignoring this test."

Once the manager forced someone to fix the test it took less than one man-day to find the bug, and about 5 minutes to fix the bug once it was found.


I have a Rule of 8 for the testing pyramid that has been roughly stable across three programming languages.

Each level you crawl up the testing pyramid increases run time for good tests by a factor of 6-10. If your functional tests are taking more than 10 times as long as your unit tests there is something wrong that is worth investigating. Usually I set a default “slow” time equal to multiples of 8 over a good unit test and round off to a whole number to invite fewer questions.

But it also means that if your unit tests are running in 10ms apiece, your integration tests should run in about 640ms and your E2E tests in under 5 seconds. Getting most people to make 3 second end to end tests is at least as hard as getting them to push them down the stack.

You need more tests as you go down, but that generally takes about a 5:1 ratio, meaning you still get a 30% improvement in run time for every test you can push down, and sometimes we are using end to end tests to do unit test work, which is going to be 4 to 500 times faster depending on how many cases were really missing in the unit tests.


Yeah I agree with you, and it's because writing a test suite to simulate the users and verify (all) the use cases of your system is really complex. I think the test advocates make it way too easy for themselves when they always just say "the first rule of testing is that your tests should always be fast." See, you broke the first rule, that's your problem! Well... how do you execute a large number of complex operations and verifications, quickly? There's a whole lot of actual practical solutions missing here, and just a lot of obstinate, principle-belittling "rules", deflecting from providing an actual solution, which in practice is hard.


Of course they're the slowest and flakiest - they include the most sources of slowness and flakiness that you have. But, if you want to make your production fast and reliable, they're pretty good gauges if you're headed for success.


Same experience, I'm sure you can design a large e2e test which isn't slow and flaky but for that you need some very very strict set of rules & care. I've personally never experienced one like this though.


Yes, this is true in my experience as well.

They are, however, extremely useful when they aren't flaky. A well-built E2E test can be a huge timesaver when debugging interactions between components.


One of the things that I love about Bazel is it thinks of a binary that obeys a contract as a test. This means you can have things like `sh_test` which just runs a shell script in a sandbox and gives you all of the benefits Bazel has normally for test execution. You get automated caching, parallelization, and remote execution of tests for free.

A great talk about this: https://www.youtube.com/watch?v=muvU1DYrY0w

You can often get situations where integration tests (that cover large features) take less than 30 seconds, only ever execute your tests when it is possible for the outcome to change (a dep has changed), and you can run your tests on a fleet of machines rather than one laptop or CI runner.


> If you maintain speed and strict curation of such a suite

This seems to be the hard part. Any tips for maintaining speed and strict curation, especially at scale (in terms of developers)?

At the very least, it seems that E2E tests are a tool that's easy to misuse. Not sure of the best way to mistake-proof it.


Automatic linter completely banning sleep(). Present reasonable alternative functions instead, like untilServiceAvailable.

Sounds silly and obvious, but look into your test suite and I'm sure you will find sleep everywhere.

Majority of all flaky tests are due to miscalculated sleeps in my experience. Sleep is also the biggest contributor to slow runs, with longer and longer sleeps being used to combat the flakiness. Both factors being the biggest pain points of E2E tests.
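
A sketch of what the lint-approved alternative can look like (stdlib only; the health-check URL is whatever your service exposes):

    import time
    import urllib.request

    def until_service_available(url, timeout=60.0, interval=0.5):
        # Poll a health endpoint instead of sleeping a guessed number of seconds:
        # returns as soon as the service is up, fails loudly if it never comes up.
        deadline = time.monotonic() + timeout
        last_error = None
        while time.monotonic() < deadline:
            try:
                with urllib.request.urlopen(url, timeout=2) as resp:
                    if resp.status == 200:
                        return
            except OSError as exc:   # URLError and friends subclass OSError
                last_error = exc
            time.sleep(interval)
        raise TimeoutError(f"{url} not available after {timeout}s: {last_error}")

    # until_service_available("http://localhost:8080/health")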


I've been thinking about the benefits of only writing E2E smoke tests which cover a small number of critical paths quickly. Seems like most of their problems came about because they wrote more tests than they needed, with higher coverage than was necessary.


If you are only allowed to have a single test in your project, it would have to be the E2E smoke. I've seen systems where master doesn't even start up for weeks but the project is still proud to present unit tests that are green with 100% coverage. One single E2E smoke outweighs all those tests, catching everything from faulty configuration, infrastructure, interface assumptions, integrations, libraries and code bugs. After this it becomes more of a runtime vs coverage balance.

What you need to be aware of is that even one single E2E test can require a significant investment in bootstrapping your test environment. If you are clever you will use this to improve the quality of your production code as well. For example, if you only have a single DB for production and now need to spin up a new instance for test, don't do it manually; instead refactor it into infrastructure-as-code and you've now turned this component into cattle instead of a pet, allowing you to scale both prod and test and giving your ops team a much easier life.


This... I have been doing that in products from hardware to AI models. It is effective and gives reasonable feedback.

I even gave a baiting tech talk at one of the companies about how useless unit tests really are. Useless is understating it; they are often counterproductive. Especially "junior" people, who try to grind the testing mindset, will come and write a little test for everything. Good luck making small inconsequential changes in the future.

If you have well-written unit tests (high-level APIs) then they become worthwhile - but once you arrive there, you just have to go one little step further to move into integration tests and get good cover from failures/changes/bugs in your cloud/infra provider.


> if you don't test end-to-end you aren't going to catch bugs that only appear end-to-end

Points 2, 4 and 7 from the assessment expose why sometimes this is not achieved even in E2E tests.


Well of course e2e suites are not a panacea. That doesn't support killing them. With regards to flaky or hard-to-debug tests, those are implementation-specific issues that should not be used to dispel the entire concept of e2e testing (and can usually be solved by high code-quality tolerances and tracing, respectively).


> That doesn't support killing them.

If you have to wait hours, sometimes days for a queue to run tests that catch 1 bug in 1000 runs and you still end up with bugs in production, I believe this supports killing the current e2e process, if you find ways to guarantee system integrity between services.

I think you are assuming accidental complexity, but hard to debug tests could also be a symptom of inherent system complexity.

Nubank is a gigantic proponent of Clojure and is regarded as having high standards of code quality, so I think that we can give them the benefit of the doubt in this aspect.


The entire article is about the reasons they ditched the test suite and replaced it with a different practice. Does it need to be more specific about the tradeoffs between fixing/rewriting the e2e suite vs. doing something different?


Should a BMW test-driver take a car out on the test track when an engineer/designer is tweaking the glove compartment handle?


Yes. The latch might not be strong enough to handle the centrifugal force when driving hard, or vibrations, etc.

You don't need to go out to the track once per tweak of course. You could very well do a few laps to test out the whole system once in a while.


BMW still does crash testing on finished products (end-to-end tests)... which would cover glove compartment too and how it affects overall safety... (perhaps it breaks up into pointy objects on crash, maybe it opens randomly during driving causing safety issues...)

You wouldn't build a car without doing test drives at the end... or crash testing... or certifications.


I’ve personally seen way more e2e regressions than isolated regressions (mobile dev). It seems to make sense from a high level: it’s easy to test finite/internal behavior (unit test, UI test, or manually), but there are exponentially more cases when integrating any bit of code with any other bit of code.


Agree. Other organisations can effectively run very large end-to-end tests. So it is most definitely possible. If they don’t have the in-house skills to do it then they should hire somebody who can.


What’s incorrect about my comment? Clearly other organisations can write successful end-to-end tests. The fact that you can’t do it doesn’t mean that nobody can.


Buried ten feet deep in the article - they retired E2E tests and introduced "acceptance" tests, which are more efficient E2E tests that they still run on critical code. But I guess "We Renamed Our End-to-End Test Suite" isn't a very good blog post.


Thanks for the feedback, we didn't want to bury the "acceptance test" complement to our testing strategy in the article. If you're curious how it works we've recorded a webinar about it: https://www.youtube.com/watch?v=wKgDaD5Nie4&list=PLfqo9_UMdH...


And in part 2 about 12 minutes in we show the code to exemplify how our acceptance test suite works (and is very different from traditional E2E): https://www.youtube.com/watch?v=caxpxszueI0&list=PLfqo9_UMdH...


I cannot watch this entire video, so is it possible that you explain conceptually how your acceptance tests are different from end-to-end tests?

I would like to understand what the goals of your initial end-to-end tests were and what the goals of acceptance testing are, and how you define these acceptance tests from a test-objective perspective.

I assume end to end tests could be described by this definition: "test the functionality and performance of an application under product-like circumstances and data to replicate live settings. The goal is to simulate what a real user scenario looks like from start to finish" [0]

[0] https://smartbear.com/solutions/end-to-end-testing/


I think what they mean, basically, is "e2e testing absolutely everything was becoming a nightmare, so we've now switched to 'contract based testing' - effectively 'unit testing where the microservice is the unit granularity' - plus some e2e style testing for critical paths where it's still valuable enough to justify all the extra effort".
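
In spirit, the contract half looks something like this generic sketch (using the jsonschema library, not Nubank's actual framework; the schema and payloads are made up). Both sides check the same machine-readable contract in their own fast suites, so the cross-service guarantee doesn't require booting both ends together:

    import jsonschema

    # Shared contract, versioned alongside both services (made-up example).
    ACCOUNT_CONTRACT = {
        "type": "object",
        "required": ["id", "balance"],
        "properties": {
            "id": {"type": "string"},
            "balance": {"type": "number"},
        },
    }

    def test_provider_response_matches_contract():
        # Stand-in for the provider's real handler output.
        response = {"id": "acc-1", "balance": 10.5}
        jsonschema.validate(instance=response, schema=ACCOUNT_CONTRACT)

    def test_consumer_handles_contract_shaped_payload():
        payload = {"id": "acc-1", "balance": 10.5}
        # The fixture itself honours the contract; consumer-side parsing
        # logic would then be exercised with `payload`.
        jsonschema.validate(instance=payload, schema=ACCOUNT_CONTRACT)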


Ok. In this case it is still e2e. In my head e2e is never a strategy to test everything.

I also think they did good optimizations to offload some cases from e2e to some other form of testing, which for me should be integration testing.

If they don't do integration testing then a lot of possible bugs cannot be found. Just testing the input and output of each microservice is not enough.

But I will watch the video as maybe it is better explained there and there is something to learn from this experience.


Yes, I said it was still e2e.

What they meant - I think - was they moved away from "a primarily end-to-end integration test suite" to "service-as-black-box unit-ish testing" plus "focused end-to-end integration testing for critical paths".


Yeah and that's deeply confusing.

At work we have too many components that are tested in isolation, but which have grown to become tightly coupled, so we're trying to build an end to end testing framework.

So from my perspective I'm living in a world where our end-to-end test suite doesn't exist and therefore could be equivalent to "killing it" and it is bad. Each component tests its own contracts, but if there's no global testing that the contracts match in both codebases then you're still shipping broken software.

I thought this article would be some clever way to match client side and server side contracts to ensure that the contracts are identical on both sides and tested so that you could test in isolation then still come away with assurances that the whole would work together.

Instead it sounds like it is advice to only build as many end-to-end tests as needed so that you're reasonably confident that the more isolated unit/functional tests will work, but don't build too many because they're horribly slow, and never adopt a policy that literally everything should be end-to-end tested because that will result in infinitely long running test suites. If you have no end-to-end tests you have no confidence that the software you ship works; if you have only end-to-end tests you have no confidence in your ability to ship software in the future.

So, uh, "clickbait title" I guess is my point?


sorry for the bait. thanks for the click :-D


I don't quite agree. A more precise title would have been:

"We Disentangled our E2E Tests"

If I understand correctly - what seems to have happened is that they separated the lower-level data-coupling testing from their higher-level testing, with contracts and acceptance tests respectively. Those layers don't need to know about each other, so this was a separation of concerns, AKA simplification.


This is an endless debate. Each situation requires a different test setup, but ultimately you can't say end-to-end tests are not worth it. You can have perfectly functioning units of software that are all perfectly unit tested, but the units are not working together (insert a related meme GIF about working drawers colliding when opened). This can happen even with the strongest inter-unit communication protocols such as strong types and validation mechanisms.

E2E tests are very hard to maintain but in many situations they are required.


> you can't say end-to-end tests are not worth it

Or, from another perspective - you are doing end-to-end testing, the question is whether you're doing it before production or if your customers are doing it for you...


Tests are never 100%, customers will always find issues. They made the choice to allow slightly more bugs to production than before, not to go from never having bugs in production to having some.


We didn't see more bugs in production when we sunsetted E2E. In fact, complementing contract tests with acceptance tests, we saw fewer bugs in production and more productive test creation and maintenance. :-D


> you are doing end-to-end testing, the question is whether you're doing it before production or if your customers are doing it for you.

This reminds me of the point I've seen made elsewhere: E2E tests can be complemented by use of monitoring metrics, healthchecks, etc. for providing confidence that the system is working as intended (or for spotting cases where it's not working as intended).


> you can't say end-to-end tests are not worth it

You can, actually.

But here's the thing: I've never seen an honest debate on E2E within an org. When your manager comes to you and says your team is going to start doing E2E, ask him/her if they are prepared for their schedule to slip by 30% or more.

They will either slither back into their office, or (most likely) they will insist that developers write E2E in addition to their current workload of writing unit tests, writing the actual code, and all of the other overhead (pull requests, approvals, JIRA ticket maintenance, interviews, etc.)

Developers are expected to pay the costs of E2E with no impact to the business.

What managers do not understand and has been my experience for many years now is that E2E is at least 30% of the cost of development. And that's probably low. I recall certain features where E2E took probably 200% or more time to get working. Because, unlike most unit tests, writing E2E tests is nontrivial. You may have to invent entirely new techniques and apparatus just for a single test.

If the costs of writing and maintaining E2E tests outweigh the benefits, then obviously it's not worth it. Not every bug is critical. In fact, go back 15 years and no one had any tests whatsoever. The world didn't end.


In my book, E2E tests should be on a couple of basic, mission critical things and integration tests should pick up the rest. It's far, far better to have 10 E2E tests and 1000 integration tests than 0 E2E tests and 1500 integration tests because it picks up failures in your infrastructure or weird stuff like middleware that are probably system wide(ish).


Pretty much this. I found that having loads of E2E tests often doesn't add all that much; usually they're all doing the exact same thing test after test after test, and since these parts tend to be fairly isolated there isn't all that much that can go wrong in just that specific test. Either it works for everything, or it fails for everything.

The way I've always viewed E2E tests is as "testing everything at the top layers" such as middleware and whatnot, which you can usually do with just a few (or sometimes even one) test. Other less high-level integration tests can test all the rest, and they tend to run much faster as they avoid a lot of overhead, and are a lot easier to write and reason about, especially if tests fail.

I once rewrote an E2E test suite to use integration tests, which gave a massive speed-up, and because the tests were a lot easier to work with people actually started writing them. I added a few E2E tests (IIRC logging in, viewing the dashboard, logging out) and that was enough really.


Yes exactly. This mirrors my experience as well. Of course it depends on the individual setup, but making testing easier and faster has huge benefits, but a tiny amount of E2E coverage goes a long way.


I feel like this is also where dogfooding - or drinking your own champagne - comes in, if possible.

We can use our software internally and sure, there are hardware costs and manpower overheads to run an additional instance of our software, but those aren't too high. Hardware necessary to run E2E tests of all the systems at proper scale including maintenance manpower probably eclipses those efforts. And then you'd have to add dev-hours on top of the E2E costs to build and maintain mountains of E2E tests.

And this has exposed really nasty bugs in common paths already, just by employees using the system.


This is absolutely a good attitude, but it only works for a fairly limited class of software. I've worked on things like realtor agent software, child care agency software, etc., and you can't really "dogfood" those sorts of systems unless you want to become a landlord or start a childcare service.


> In fact, go back 15 years and no one had any tests whatsoever. The world didn't end.

You really need to stop repeating this, it's absurd and there was a great post here the other day explaining how in most companies, they had a large QA team that would need to approve any code, it's just that developers were not expected to write the tests themselves.

> If the costs of writing and maintaining E2E tests outweigh the benefits, then obviously it's not worth it.

Obviously, but the question is, what's the alternative? In the blog post, the alternative was to have contract-based acceptance tests... but that may not always be appropriate for every business. We have a huge E2E test suite where I work and I was one of the biggest contributors to creating it... as everywhere else, it's heavy, slow and hard to maintain, but replacing what we have with contract testing would be unfeasible because we're not a micro-service architecture; we are one big application, as we're a product company... I would love to find a better way of testing our product, but contract-based testing is definitely not the answer for us.


“In fact, go back 15 years and no one had any tests whatsoever. The world didn't end.”

That’s just silly. Of course there were tests 15 years ago. Unit testing has been around since the 1950s.

Very little would cause the world to end. But lives have been lost and billions of dollars wasted along the way due to improper testing.


Not in the same way as a modern CI/CD pipeline. 15 years ago you had release-approval meetings where a dedicated QA team would present all their manually observed findings in an Excel sheet to management.

E2E is a replacement for, or evolution of, the QA department, not a replacement for unit tests.


I can hardly think of a situation where I'd want no end to end test.

I think one misconception is that there has to be a single end to end test. Really what you want is a variety of end to end tests examining the functionality of different parts of the system. But the system under test is still the whole system, not the units. These partial end to end tests can still be quick to run, as long as you keep system startup time down.

For example, I work on a system that builds text indexes on an underlying database management system. We take an input mutation with logical changes and then use that to determine what additional index updates are required. This all happens automatically when our users write.

There are two ways to test this. The old way was that we instantiated the top level class that did the changes and manually constructed mutations that look like user mutations. Then we examined the mutations produced by our top level class.

I recently converted this test to use the public write and read APIs of the database to instead write data to a test instance and then check that the index contents were as expected. The public API is more stable than our private one and is resilient to internal refactorings. It's also more amenable to the ad hoc queries you generally do in tests. And it ends up not being much slower, since our test for various sad reasons still had to start the database engine even though it was mostly unused.

All in all, I was able to make the test faster (3 minutes -> 1.5 minutes) and less brittle, while using less code and getting more coverage of what we actually care about. I think wins like this are commonly available when moving from unit to end to end testing, as long as you keep system startup time down.
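For concreteness, the converted test has roughly this shape (a sketch only; `connect`, `write`, and `query_index` are made-up stand-ins for the public API, and `test_db` is an assumed fixture that starts a scratch instance):

  def test_text_index_tracks_user_writes(test_db):
      db = test_db.connect()

      # Write through the same public API users write through.
      db.write({"id": "doc-1", "body": "the quick brown fox"})
      db.write({"id": "doc-2", "body": "a lazy dog"})

      # Assert on index contents via a public query, not on the mutations
      # emitted by an internal top-level class.
      assert db.query_index(field="body", term="fox") == ["doc-1"]
      assert db.query_index(field="body", term="lazy") == ["doc-2"]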


There was a nice talk that used this as an example of how React's engineering team designed their tests to be future-proof: always test the public API, so your end-to-end tests survive a major refactoring/rewrite.


95% of our tests are going through the API, including all the auth. Spin up a server in the test, create some data (we have helpers), call the endpoint, assert on the response, another endpoint's response, or the database state.

That allowed us to carry out _enormous_ refactors (pretty much only the controllers stayed) without touching tests. It's not really harder to write, either: making an API request is no harder than making a function call.
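In pytest-ish pseudocode the pattern looks something like this (a sketch; `spin_up_server`, `create_account`, and `auth_headers` stand in for our helpers and aren't real names):

  import requests

  def test_transfer_debits_the_account():
      base_url = spin_up_server()                       # fresh server + database
      account = create_account(base_url, balance=100)   # helper-created data

      # Exercise the public API, auth included.
      resp = requests.post(
          f"{base_url}/accounts/{account['id']}/transfers",
          json={"amount": 40, "to": "savings"},
          headers=auth_headers(base_url, account),
      )
      assert resp.status_code == 201

      # Assert on another endpoint's response rather than on internals.
      resp = requests.get(f"{base_url}/accounts/{account['id']}",
                          headers=auth_headers(base_url, account))
      assert resp.json()["balance"] == 60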


Great stuff. Refactoring in my codebase is quite painful because earlier owners went in almost the opposite direction. The unit tests pretty much are never the ones that catch our bugs -- it's all caught by random or end to end tests. I'm slowly trying to crawl toward the light with respect to reorganizing tests I come across to use the external API.


Hard to overstate the benefit of not having to rewrite your tests when you refactor. That's a concrete and difficult to ignore benefit of using higher level APIs. I find that the cost of writing a test is often not less than the cost of writing the code. If that is correct, it means that the system lifetime cost of authorship is much higher with unit tests. If all tests are unit tests the cost is almost 100% higher.


E2E tests aren't worth it if they produce false positives and don't prevent defects from reaching production. By definition. Too many devs treat automated testing as a goal in and of itself.


So you are saying that X is not worth it if X is crap? Who wouldn’t agree with that? What if X was implemented well and did what it was supposed to do? We use end-to-end tests and they work extremely well. I have had zero production problems for years now. All issues were found by the end-to-end tests before deploying.


"So you are saying that X is not worth it if X is crap?"

I am because it needs to be heard. Like I said, too many devs treat writing tests as a goal unto itself.


well said!


They didn't say E2E aren't worth it (they actually said it worked for them early on)

Their point is that it doesn't scale the way they think contract testing does


Contract testing isn't E2E testing and doesn't provide the same guarantees. So they are replacing a high cost, high value system with a lower cost, lower value system. So they are literally saying that they don't think the extra value of E2E testing is worth the extra cost.

Personally, I think they are wrong, and the problems they have with their E2E tests are problems with their implementation. Fixing them would benefit the customer experience and the developer experience as well as reducing the costs of the E2E suite. They are absolutely going to have critical customer impact (they are fintech ffs) that their E2E tests would have caught. Of course, whether that actually tanks their business depends on other factors. So accepting more critical issues may be the right thing for the success of the company. Robinhood customers were greatly pissed off by its behavior, and yet it doesn't seem to have hurt them too much. But I wouldn't fucking boast about it!


Our product is an end-to-end testing tool, so it's always interesting to see what issues companies hit with E2E tests and how they solve them. What's interesting about Nubank's experience is that after deleting their E2E suite, they realized that replacing it with integration tests wasn't providing enough value. There's a lot of value in E2E tests, but so many orgs take the wrong approach and end up with a slow, flaky test suite.

We wrote a guide [1] for building automated test suites based on our experience working with and talking to software orgs. Teams who get value out of E2E tests generally do the following things right:

1. They keep tests as small as possible. This makes maintenance easier and forces a separation-of-concerns in the tests.

2. They factor the tests so they can run in parallel. This, plus shorter tests, is the best way to mitigate the slowness issue brought up in the article.

3. They have a good strategy for test data management. It looks like Nubank had test data represented as fixtures, but then somehow manual testing in that same environment was clobbering test data and causing false failures. A better strategy for managing test data could have solved this. Or maybe even just running the automated tests in an isolated environment.

[1: https://reflect.run/regression-testing-guide/]


FWIW I've been in both situations. One company had sketchy E2E coverage that resulted in a modicum of production bugs. I moved to a competitor of roughly the same size that had a huge E2E suite, and AFAICT it resulted in roughly the same modicum of production bugs. But feature development at the latter moves much more slowly because of all the wait queues, flaky tests, timeouts, test maintenance overhead, etc.

IMO these seem to be more of a CYA thing for managers. When a bug does get to production, you need to have something to point to. (And in my previous company of course they're scrambling now to make a big E2E testing platform). But I'm not convinced they're actually worth the effort.

Edit: actually maybe I think they're worth the effort, if for no other reason than when bugs do go to production, executives tend to start micromanaging if you don't have something to point at. That can be worse than dealing with flaky tests. But what I'm not convinced of is whether they actually reduce the number of bugs that go to production.


You have been in two of three situations.

  1. No e2e tests.
  2. Heavy, flaky, slow e2e tests.
  3. e2e as a driver of first class system integration
Compare with

  1. no unit tests
  2. tons of unit tests that regularly fail, nobody cares and people check in more bad tests, "unit" tests that thread-sleep and take minutes
  3. CI/CD with 0 tolerance for failures or >1s tests
The difference between "We have tons of tests" and "We drive development with tests" is night and day. Sure, if you slap on e2e with the mandate "They must exist", then you're going to have a shitty experience, e2e or unit.


>3. e2e as a driver of first class system integration

Could you say more about this approach? That's how I've tried to approach end-to-end tests, but I haven't found much written about this which is specifically about e2e tests.


I haven't written a book, but just take the TDD philosophy and apply it to e2e. For example, if you have an e2e test that often fails because something times out, or you've had to set a high timeout, then dig down into why and fix it. Turns out you have a JVM that often has a 2s GC but no alarms, and yes, it impacts customers too. Fix that problem, don't get rid of the test (or all the tests). You've got a slow third-party thing? Put an abstraction in front of it so that username: MyE2EUser's traffic goes to a shim. Then either negotiate with the third party, or make the interaction with the third-party system asynchronous. Or does resubmitting the page order the product twice?! lol. I wish those were the old days. All that being said, I haven't done this for a couple of years now, and I have the joys of vastly simpler systems at vastly huger scale, so I may be seeing the past through rose-tinted glasses.


If you have an E2E testing platform, and a bug is found in production, can you reproduce the issue with the E2E testing platform? How easy is it?


There’s the classic diagram showing tests as a pyramid[1]. At the bottom you have unit tests, white box stuff that mocks all dependencies and runs super fast. At the top you have E2E tests (or acceptance tests or whatever, names are fungible), but very few of them, to catch bugs at the boundaries between systems. In the middle you have stuff that maybe runs against a live database instead of a mock.

As you go further up the pyramid, tests get more expensive in every dimension: they are slower (and are run less frequently as a result), more expensive to debug and maintain, and maybe a bit flaky, though you should still de-flakify these tests as much as is practical.

The key is to push tests as far down the pyramid as possible. Never test something in an E2E test if it can be meaningfully tested in a unit test. For example, a regression test for a database date/time serialization bug should run against a real database. Meanwhile, a test verifying that an HTTP service returns the correct response code in a specific situation can run against mocks.

1 https://martinfowler.com/articles/practical-test-pyramid.htm... (holy crap that is a mountain of text, but the diagram is near the top)
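To make the last example concrete, here's a sketch of the "runs against mocks" version: a handler test that only cares about the response code, with the data layer mocked out (`make_app` and the injected repository are hypothetical, Flask-style names):

  from unittest.mock import Mock

  def test_returns_404_for_unknown_user():
      repo = Mock()
      repo.find_user.return_value = None          # simulate "no such record"

      app = make_app(user_repository=repo)        # no real database involved
      resp = app.test_client().get("/users/123")

      assert resp.status_code == 404              # the only thing under test here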


> Contract tests allow us to describe the interactions between our services through expectations to be met by their inputs and outputs.

If you say so, and I wish you luck but... I've seen that tried many, many times and never seen it actually work out in practice. It seems like it ought to be workable - there are only a finite number of ways that each service can be invoked after all - but if the goal of automated testing is to find problems before they become production problems, I've never seen "defined contracts" fulfill that goal.


> but if the goal of automated testing is to find problems before they become production problems, I've never seen "defined contracts" fulfill that goal

But, have you seen contract testing fail to catch problems that E2E tests did catch?

I think both of them end up tending to be regression tests a lot of the time.


A complicating factor is that teams usually start with E2E when it’s simple, and only move to other approaches when the systems have become way too complicated.

At that point I don’t know if there is any specific strategy that effectively catches a lot of bugs, short of sending to production and monitor the effect. At a company we just called those tests “sanity checks”, and the goal was just to make sure the most basic use cases would still work, and nothing more.


It says the majority of the problems caught in E2E were due to changed contracts, so it makes sense to have a dedicated test type for those errors. If the remaining caught bugs are few enough and their customers are willing to suffer occasional unavailability, just let them be caught in production and save that expense.


Agree. They are not testing the end-to-end contracts of how concurrent services interact. I predict it will end badly.


E2E tests are required because no matter how well-defined your other tests are or how completely they've tested everything . . . you can't prove that they'd absolutely catch all the bugs.

https://en.wikipedia.org/wiki/Argument_from_ignorance

Kudos to Nubank for whatever combination of logic and bravery led them to their decision.


E2E can't catch all bugs either. This team decided the number of bugs their test suite caught was not enough to be worth keeping it. With a robust canary deployment, they will quickly find and rollback breakages whether or not the e2e suite would have caught it.


Reading the article beyond the title reveals that this decision was engineering-driven and measured, and they ended up with a simpler, disentangled solution, separating contracts that verify compatible schemas on one side from acceptance tests on the other.


Catching all the bugs is generous. You won't even be able to prove basic functionality works when all the components are deployed in production.


E2E tests are still extremely limited and let bugs through, unless they also fuzz somehow. But that will make them even more flaky and difficult to debug, costing more time. It's a tradeoff.

Example: You run some number of operations and then batch them. If you run the exact same operations each time you test, you may not catch conflicts between them. Instead, you'd have to run a random number of, and type of, operations. But then test failures would become extraordinarily difficult to reproduce. You'd have to hope that your tests log exactly what the inputs were, and have a semi-efficient way to recreate those inputs locally.
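One way to keep that reproducible is to make the randomness explicit: log the seed and accept it back on re-runs. A sketch, not tied to any particular framework (`run_and_check` is a stand-in for the system under test):

  import os, random, time

  def test_random_batch_of_operations():
      # Reuse a seed from the environment if given, otherwise pick one and
      # print it so the failure output tells you how to reproduce the run.
      seed = int(os.environ.get("TEST_SEED", time.time_ns()))
      print(f"TEST_SEED={seed}")
      rng = random.Random(seed)

      ops = [rng.choice(["deposit", "withdraw", "transfer"])
             for _ in range(rng.randint(1, 50))]
      run_and_check(ops)        # hypothetical driver for the system under test

  # Reproduce a failure locally with the exact same inputs:
  #   TEST_SEED=1695570000000000000 pytest -k random_batch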


I don’t think it was bravery. I think it was lack of end-to-end test engineering skills. If your tests are flaky then do what you would do with any other software: fix it! If your end-to-end tests are slow then do what you would do with any other software: optimise it! Not having proper end-to-end tests is a massive red flag.


Which can't be proven by E2E either, and which you don't need to prove.


So they ditched e2e in favour of something an average monorepo checks statically, i.e. through TypeScript? Then mocked functional tests and called it a day?

The problem probably started when they put themselves in this microservice-plague setup where they can't spawn a simulated environment in CI anymore. As it turns out, a running system, a.k.a. a deployment in an environment, is a monolithic expression of the microservice spaghetti.

As a side note, flaky tests are such an idiotic concept. There are tests that pass and ones that don't. How good is a button which works 35% of the time? It's not a good button, period. Setting aside the fact that it inflates test runtime more than a decade of a McDonald's diet inflates a waistline - if you find yourself in a setup with flaky tests, you should ask why they are flaky and amend the setup so the test is expressed as a non-maybe-flaky, normal test. Forbid flakiness; there is no such thing as a passed flaky test - those are just shitty tests.


> As a side note, flaky tests are such an idiotic concept. There are tests that pass and ones that don't. How good is a button which works 35% of the time? It's not a good button, period

Or more like the test runner succumbs to non-deterministic flaky behaviour.

If something failed 65% of the time, it would be one of the easiest things in the world to fix.

If it fails .001% of the time, that's what the industry refers to as flaky.

> Forbid flakiness; there is no such thing as a passed flaky test - those are just shitty tests.

Have you ever written and monitored e2e tests over a year? It's industry wide.

Selenium/selenium grid always works great until it doesn't. Ditto with the new kids on the block. e2e outside of a browser is 100% fine unless there's an actual bug somewhere.


Let me clarify, I think I haven't expressed myself well. What I meant to bash is the culture where you wrap e2e tests with retries and consider flaky tests green. This arrangement is a pathology.

The industry did not set 1 out of 100,000 (.001%) or less as the threshold for calling something flaky. From my experience, I've seen tests with a 20% success rate passing due to retries, and teams living with it as normal.

> Have you ever written and monitored e2e tests over a year? It's industry wide.

Yes, on high-profile projects. I find myself repeating how important determinism is. Flaky tests, no matter how frequent, are indistinguishable from bugs, which implies they can't ever be considered green. Architecting the testing environment so that it is deterministic is fundamental. Sometimes it doesn't require big reshuffles; it just means the test has to be rephrased in the deterministic terms that matter, without asserting intermediate, timing-based, racing middle states. As an example, testing random failures by killing services doesn't have to assert intermediate client states; it has to assert that the final state is eventually correct. That implies reconnects did happen and the state eventually converges, regardless of intermediate client states (i.e. retry logic on the client auto-healing itself on idempotent actions, or erroring and notifying the user that the service is offline, in which case the user has to retry the action - both are OK depending on how long the service was offline, both can be progressed from the test's PoV, and asserting that one specific outcome happened is irrelevant).
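In practice, that "final state is eventually correct" assertion is just a polling helper with a deadline; something like this sketch (`get_balance` is a hypothetical read through the public API):

  import time

  def eventually(check, timeout=30.0, interval=0.5):
      """Poll `check()` until it returns True or the deadline passes."""
      deadline = time.monotonic() + timeout
      while time.monotonic() < deadline:
          if check():
              return
          time.sleep(interval)
      raise AssertionError(f"condition not met within {timeout}s")

  # e.g. kill the service mid-flow, then assert only on convergence.
  eventually(lambda: get_balance("acct-1") == 60)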


Ahhh, I've made this mistake before. You can't just test every input/output and give up the ability to accurately exercise stateful user flows. Even Fintech projects I've worked on at the >£10Bn daily volume mark, combining formal methods, mathematical proofs, property-based tests, fuzzing, model-based testing, etc., still caught issues pre-production via good old end-to-end tests.

Good luck nubank :)


thank you


I'll be interested to see how it goes, will you follow up with a blog post in a couple months?


In summary they noticed that their e2e suite mostly caught integration errors where clients and servers had incompatible schemas for the data exchange.

The novelty is that they found a much faster way to identify this kind of error by collecting and comparing the client-side and server-side schemas statically, without even running the code.

This is a great optimisation, but it did not remove all defects so they still need to define tests that validate actual application behaviour against the business rules.


well put! thanks for the great summary.


I am not the typical guy who is going to preach e2e testing when I am in big tech companies. But let's not confuse what e2e tests (or, as I'd like to call them, integration tests) can do.

I love property-based testing, especially with these new frameworks that now do coverage-guided fuzzing too. However, it only guarantees the "contract" (or "interface") at that level. Property-based testing (or contract testing, as this article calls it) today still only validates that the property holds; it does not exhaust all the edge cases. To give an example, a property-based test that validates the function "add(x, y) == add(y, x), given x in Int32 range" doesn't validate edge cases like what happens if you call "add" twice, 3 times, from different threads, etc.
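For example, the add property above as a Hypothesis test: it nails commutativity over Int32, but says nothing about repeated or concurrent calls (`add` is the function under test):

  from hypothesis import given, strategies as st

  int32 = st.integers(min_value=-2**31, max_value=2**31 - 1)

  @given(x=int32, y=int32)
  def test_add_is_commutative(x, y):
      # Checks the property over the Int32 range, nothing more.
      assert add(x, y) == add(y, x)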

At the end of the day, it would be hard for property-based testing to validate that your component satisfies the Liskov substitution principle.

Integration testing, on the other hand, makes sure your system works at the integrated level. It doesn't enforce the Liskov substitution principle either. However, if you have downstream components that depend on your implementation details (for example, an earlier callsite already called a function, so the second call must be cached), updating the upstream and running integration tests makes that implicit assumption apparent.

So, that's where I am arriving at. Without a powerful language that can encode every contract at the programming level, relying on property-based testing at the component level alone cannot maintain the substitution invariant. Integration testing is required.


Don't do comprehensive E2E tests. Have a small set of E2E tests to verify the most critical functionality is working. Don't verify every single business rule. There are other techniques that help you reduce risk:

- Good unit and integration tests.

- Effective Monitoring and alerting.

- Canary releases or Blue/Green releases.

- Continuous integration.

- Continuous delivery.

- Ability to safely rollback the more recent release.


I have experienced all the same problems that they outline. E2E tests require a huge number of human-hours to maintain, they're difficult to debug when they fail, almost always false positives, and bugs still get through anyway. But for many situations, there doesn't seem to be a better solution.

For most early stage startups, it seems that time would be better spent optimizing your deployments, rollbacks, and real time metrics so you can maintain a high velocity and roll back quickly when you make a mistake.

For more safety critical systems, the cost of maintaining E2E tests needs to be built into the total engineering cost for the project. It's a hidden cost that is often way bigger than you'd expect.


Not my experience at all. Having solid end-to-end tests means that I spend 99% of my time adding new features and 1% of my time fixing bugs before deploying. I haven’t had a bug in production for years because of solid end-to-end tests.


Contracts are basically unit tests for whatever size of unit you're testing. How do you capture all the dynamic behaviors of a system without some sort of end-to-end test? Delayed timers, queues filling up, missed interrupts, locks held too long, deadlock, livelock, priority inversion, dropped messages, out-of-order issues, etc. These things are not captured by contracts and are often exactly why the end-to-end tests were flaky in the first place.


System A listens to queue B and handles every kind of message b throws at it. But somewhere, at some point in time, some coder has made the innocuous assumption that B_id's are unique.....


I'm a big advocate for testing, to state up front.

E2E is problematic from the start because of the expectations set by the name. Any sufficiently interesting system is nigh impossible to test "end to end." And, you aren't testing ends, you are testing the "start" of the process, to one of many "ends."

What about only doing "end testing?" Meaning, don't test the beginning. Put unit tests there. Put integration tests between the important components.

It is important to make sure you have coverage with automated tests that prove your system can work at the end of at least some of the processes. Otherwise your QA costs are massive, that never scales, and no one will ever fix it other than by adding more QA. Your innovation will slow to a crawl, much worse than waiting on your test suite.

I'm not sure after reading this article that the authors added a new testing methodology by calling it "contract testing." I'm still confused about what that means. Having said that, I am still confused about a lot of the boundaries between e2e and integration. It always sounds simple, but rarely in practice.

The bottom line: the organization as a whole has to see the value of testing. That's harder work than writing the tests for sure.


The middle ground that not enough teams are exploring is following the so-called Functional Architecture. If all side-effects are effectively segregated and reified, then one should be able to swap them out for deterministic mocks that run instantly.

So you could E2E a distributed system realistically and instantly.

You can still simulate things like services being slow, unavailable, etc. if the code handling those cases is expressed as pure logic instead of coupling itself to IO.

Interestingly, in addition to fixed examples, you can perform generative testing over this setup e.g. what happens for various combinations of services being slow/down.
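A toy sketch of the shape (names are illustrative only): the decision logic returns reified effects as data, so tests can interpret them instantly and deterministically instead of doing IO:

  from dataclasses import dataclass

  @dataclass
  class CallService:                  # an effect, reified as plain data
      name: str
      payload: dict

  def handle_payment(payment):
      """Pure decision logic: returns the effects to run, performs no IO."""
      if payment["amount"] <= 0:
          return []
      return [CallService("ledger", {"debit": payment["amount"]}),
              CallService("notifier", {"msg": "payment accepted"})]

  def test_rejects_non_positive_amounts():
      assert handle_payment({"amount": -5}) == []

  def test_debits_ledger_then_notifies():
      effects = handle_payment({"amount": 10})
      assert [e.name for e in effects] == ["ledger", "notifier"]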


exactly! well said. it's what we're trying to achieve with our acceptance testing strategy. you can see more here about how we leveraged clojure to be able to simulate E2E in memory in the JVM by bypassing IO and just having one service's logic+data layer talk to the other's: https://www.youtube.com/playlist?list=PLfqo9_UMdHhah_gNPnawX...


Nice! I missed that nuance in the article.

Looking forward to eventually checking out Sachem, if there's a plan to share it?


Why not both?

Beyond unit tests, we have docker compose spin up our service(s) and their dependencies. If those dependencies have too big a web, we may point at a staging instance or a fake server, but we routinely spin up dependencies that run a local Kafka and ZooKeeper, are backed by MySQL and Redis, etc.

We then test our service at its incoming edges (feed its incoming queue or call its endpoints) and verify its output (via logs, metrics, and sinks).

We also have end-to-end tests that exercise our services from the customer's point of view, but take place in our staging environment. These do suffer from many of the points the article raises, but we run these tests concurrently, and, when not flaky, they can pass in 10 minutes.

We are addressing flaky tests by addressing their root cause: flaky services in staging. We are expecting teams to have mature monitoring of services in staging and tying improvements directly to flaky failed tests. We are also improving traceability so a failed test is easier to debug to understand if it was a failed service request somewhere in the stack.


End-to-end test suites do not excel in continuous integration workflows.

They excel as part of your metrics, monitoring and alerting system running continuously for as long as the service they exercise lives.

Perhaps it wouldn't be as painful if they took this approach instead.


that's a very good alternative way of thinking about E2E


If your system can't be tested end-to-end because of how slow and flaky it is to test that way, doesn't that say something about the quality of your system rather than the tests?

Of course once you have such a system, it's probably the result of years of work by many people and most likely it would be hard to make it faster and more reliable. That is probably why people shy away from doing that, and choose to blame the tests instead.


If your tolerance to deal with regressions and bugs in production is high and you have millions of users, then you can think of the user as the end to end tester. Maybe you ship some change and put it behind a feature flag and make it available to only 2% to 5% of the users.

If you get 1000 users to go through a particular flow and you have a way to collect failure signals from production accurately and in real time, then you can just dial down that flag to 0% if you see a lot of production errors.

I'm still not sure you can drop e2e testing completely, but maybe, if your business allows it, you can confidently rely on unit testing or contract testing without having to run the app through all the user flows for every change.


> Manual changes in our staging environment corrupted test data fixtures

there's a lot here.

Manual changes in your staging environment shouldn't affect your tests, because your tests should be isolated from other environments.

Also fixtures are generally bad. Given some fixture representing an initial state S, a test utilizing this fixture along with some acceptance criteria is essentially testing that given the state S, running the tests executes some transformation T such that the state of the system is now S2; acceptance criteria evaluate S2 against some known-good value to confirm that T is the desired transformation. This is meaningless if the initial state S is not actually reachable by the system. The fixture itself does not prove that S is reachable: that S is reachable is taken as an act of faith.

So how do you determine that the initial state S is reachable by the system? Well, you have some other test that starts with an initial state of nothing, performs some transition (generating and inserting random data instead of using a fixture, for example), and takes the system from nothing into the state S. By doing this, you've both created the state S _and_ verified that S is a valid, reachable state. Now you run your second test after the first test in sequence. To run N tests off of initial state S, you replay the initial test that produced state S N times, once for each dependent test. Sure, that's a lot of work, but each sequence of testing events can be run in isolation from the others, so they can be run in parallel.
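A sketch of the idea in pytest terms (`api`, `signup`, `deposit`, `withdraw`, and `balance` are hypothetical helpers): the fixture drives the system's own API from nothing to S, so S is reachable by construction:

  import pytest

  @pytest.fixture
  def funded_account(api):
      acct = api.signup(email="s@example.com")   # nothing -> empty account
      api.deposit(acct, amount=100)              # empty account -> state S
      return acct                                # S is reachable by construction

  def test_withdrawal(api, funded_account):
      api.withdraw(funded_account, amount=40)    # the transformation T
      assert api.balance(funded_account) == 60   # acceptance criteria on S2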


Sometimes you have "state S that we got from a coredump from a customer that happened once every 30 computer-years in their deployment so we know it is reachable, but haven't ever seen it happen in-house"


'E2E' is often impossible, as in most ecosystems things are constantly changing and you will have dependencies out of your control that you cannot simulate. The key is faking the right dependencies with accurate-enough versions to keep test fidelity and speed, and keeping the test svelte and fast enough that it can run before you merge code. That limits the size of the change being tested, so it's easier to understand the outputs, decide whether they represent a false positive or negative, and see where any problems may be. This all also requires building the infrastructure to spin up a simulated world quickly enough to see how the proposed change affects the simulation and then analyze the results, which is also pretty hard and can get expensive.

Luckily, spending time optimizing often helps test speed and can control cost, so there can be a good case to make for it, but orgs have to be willing to pour engineering hours into that, and engineers need to want to do it vs. building new things, which is typically more enticing.


Test against APIs, not implementation! Your APIs are supposed to be stable. If they are not, then you are doing it wrong.


Test suites will tend to fail the more your system has to work with "outside" data. I recently had a client where their own data was entirely dependent on data drawn from 23 different 3rd party APIs, which meant the bulk of their code was devoted to parsing APIs they had no control over. Those external APIs sometimes changed, and sometimes contained bugs (that is, violations of published contracts).

To talk about this, I use a broad definition of "outside data". If you're a small startup, "outside data" typically refers to data that belongs to another company. But if you're working in a Fortune 500 company, "outside data" can also refer to data coming from an API run by some other division, which is nominally part of "your" company but is effectively independent.

One rule I now offer to my clients: the more your system relies on outside data, the more it is helpful to have run-time checks rather than a test suite. Assuming you run your code on multiple machines or nodes or dynos or instances, you can choose to run the checks on just a percentage of your system, enough to detect problems, but without paying the performance price on 100% of your system.

When a problem in your system is because of a change in an external API, your test suite won't catch it, since your test suite works with dummy data. But run time checks will catch the problem and make debugging easy -- you'll see almost instantly which API call created the problem.

Code written on the JVM has the beautiful property that you can add pre- and post-assertions on every function, and you can pass a flag to the compiler asking that the assertions either be left in the code or stripped out. This makes it easy to build 2 copies of the code, one with the asserts and one without, and that in turn makes it easy to deploy the code in such a way that only a limited percentage of your instances needs to run those run-time checks.
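The same trick works outside the JVM. For example, a Python sketch where the checks are gated behind an env flag that you set on, say, 10% of instances (Python's built-in `assert` can likewise be stripped wholesale with `-O`):

  import os

  # Set RUNTIME_CHECKS=1 on a small fraction of instances; the rest skip
  # the extra work entirely.
  RUNTIME_CHECKS = os.environ.get("RUNTIME_CHECKS") == "1"

  def check(condition, message):
      if RUNTIME_CHECKS and not condition:
          raise AssertionError(message)

  def apply_discount(price, percent):
      check(0 <= percent <= 100, f"percent out of range: {percent}")
      result = price * (100 - percent) / 100
      check(result <= price, f"discount increased the price: {result}")
      return result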


At $PRIOR_JOB, it always felt like the full E2E tests were close to useless, since for every bug successfully caught, it felt like there were ~20 false positives. At which point, everyone (myself included) blamed the tests and just repeatedly reran them until they usually passed. Every single failure would halt the pipeline for anywhere from 5 minutes (in the case that rerunning the failed test showed it was just flaky) up to multiple hours, since everyone would rather try to diagnose/hotfix the issue than revert their code to unblock the pipeline.

With that being said, a full run of the E2E suite at $PRIOR_JOB took very, very low double digit minutes so it wasn't that expensive. Rerunning a handful of failed tests took single digit minutes so it wasn't too terrible.


Was in a similar situation, and the VP of engineering banned the practice of rerunning failed tests, so flaky tests caused everybody pain. In less than 8 weeks the false positive rate dropped by about 3 orders of magnitude. There's a strong tendency to treat tests as a hurdle to get over rather than to treat them as first-class part of the development process.


I imagine this would just turn into everyone inserting 10 second pauses on the tests that fail. Which works, but now your suite doubles the run time. Actually turning nondeterministic tests into deterministic ones is... hard. Really hard in some cases. Many devs don't even understand how to get there, even after years of E2E experience.

One place I worked, the E2E suite took a full hour to run. Everyone reran the tests. Merges took a full day in many cases. Management tried to force people to fix broken tests. But they also required new tests on new features. So it was a constant treadmill. There was basically a full mutiny by the end and the company killed off their entire E2E suite.


If people just started throwing random sleeps into tests, I think management would shit a brick. Do people throw random sleeps into production code to fix bugs where you work as well?


Not GP, and fortunately not often, but I have seen that done to overcome race conditions. I pushed for it to be corrected by using a proper design. That was a stupidly hard fight, though.


My pet peeve is people sprinkling C's "volatile" keyword in places. Since doing so inhibits many optimizations, it changes the timing and can make race conditions appear to go away.


Yep. Lots of effective ways to paper over issues without actually resolving them, and often disguising them so that resolution becomes nearly impossible later.

Worse, things like the introduced sleeps in some of the systems look legit. There are reasonable times to introduce a timed delay into your program (3rd party APIs have a rate limit, 1 request per second or 10 per 30 seconds or whatever). Depending on how you introduce these extra sleeps, then, it's possible that they'll look like they satisfy a valid requirement, when the reality is that they exist to cover up the absence of things like proper use of locks/mutexes or other elements.


At one place I consulted, the fte lead ignored flaky tests and attributed failures to the tests being wrong.

A few months later...

The code that was failing intermittently was found to be using floating point types for money. Yeah, I'm gonna wanna fix that.


Right if you have flaky tests there are 3 acceptable responses:

1. Fix the test

2. Fix the code that is being tested

3. Say "well we don't need this software to be reliable anyways so let just stop running tests"

But many places seem to adopt hidden option #4 "Run the tests and ignore failures"

A related issue is dialing the tunables for warnings up to 11 and then not reading any of the warnings. Once I saw a case where the build generated 1000s of warnings. I found a bug and said "this would be flagged as a warning even with relatively low warning settings", and sure enough it was.

Obviously fixing warnings is good, but if they had just lowered the warning setting to be something reasonable, they would have had maybe 10 warnings total, one of which was a bug, which is a lot more useful than 1000s of warnings, at least one of which was a bug.


Option #4 is just option #3 but keeping the costs of running tests you ignore.

You're right about excessive warnings, but then sometimes they're not excessive. Running `gcc -Wall` used to be considered madness, and if you did it now on a codebase that has been around a while and not been kept clean, you'd drown in messages. The key is to turn it on from the very start and fix things when there are 10 warnings instead of 1000.

This decay happens with test suites, too. One or two tests start to fail, and instead of fixing them, people ignore the failures. A bit later, it's five tests, then 10, and pretty soon the programmers see the tests as broken instead of looking at the failures that let things get to the point where there are so many failing tests.


The fix for both situations is similar though; dial down the {warning strictness|number of tests run} until you get a clean {warnings|test-run} then enable them one by one in order of how easy they are to fix.


Obviously the E2E tests were really badly implemented. Implementing solid E2E tests is a skill that needs to be learned like any other software development skill. Most developers don’t know how to do it well.


The sad fact of E2E is that the tests genuinely find broken stuff. The “false negative” test results usually just mean false as in “something was broken, just not what the CI claimed was broken.”

It could be anything, so you need automatic specificity as to what’s broken (hard) or buy-in from the entire organisation to be on standby for finding broken stuff (also hard.)

“Anything” as in if your external DNS provider has 1 of 10 resolvers with an out of date zonefile, or a dodgy switch port to that particular resolver.

It’s hard but if it’s broken then it’s likely it is a real issue one of your end users is also experiencing. A commitment to E2E is committing to a level of quality across your entire infrastructure that few people are prepared to own.


This is our goal with E2E tests and monitoring — write tests that fail if and only if customers are experiencing issues — and I don’t know how to achieve that level of I-can-sleep-well-at-night assurance of quality without it. We run our E2E tests continuously against prod from a different cloud region.

As you wrote, the challenge is that there are literally 100 things that could cause a test to fail (some of which are outside your control), and as your team scales you’ll have to get smart about how to efficiently dispatch people to fix problems.


This is a very bad list of complaints and it actually makes me angry to read it.

> Engineers had to wait more and more to get feedback from this long-running suite

So speed up your tests. Run them in parallel. Find better frameworks for running tests.

> Flaky tests meant that we had to re-run the suite frequently to see if something was really wrong or just a false negative;

Fix your flaky tests! Why anyone just accepts that "Oh, sometimes that test fails and we have to restart everything" is beyond me. Root cause the problem and FIX IT.

> Manual changes in our staging environment corrupted test data fixtures and maintaining the environment “clean” was a challenge;

Tests should not rely on pre-existing state. Have a setup phase for each test that creates new data in the state you want it to be in. As the test makes this data, also note down a reference to it with a Time To Live so that a follow-up process can clean up the unneeded data, as in the sketch below.
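For example (a sketch; `api.create_customer` and the sweeper job are hypothetical):

  import time, uuid

  def make_test_customer(api, ttl_seconds=3600):
      return api.create_customer(
          name=f"e2e-{uuid.uuid4()}",   # never reuse shared fixture data
          tags={"purpose": "e2e",
                "expires_at": int(time.time()) + ttl_seconds},
      )

  # A scheduled job deletes anything tagged purpose=e2e whose expires_at is
  # in the past, so even a crashed test run can't pollute the environment.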

> Test failures were very hard to debug

That's not the fault of the tests, that's the fault of a complex system that is hard to debug. Improve your tracing between services.

> Queueing of commits in the End-to-End suite resulted in less frequent deployments

There are well known solutions to this problem. Lots of companies have overcome this already.

> Few bugs caught in this stage. One experiment suggested that, for every 1000 runs, we had 42 failures, only 1 bug

If your tests have false-positives, you need to adjust your tests. Accepting that the test failed but there isn't a problem, and then not fixing the reason the test failed means that you don't have reliable tests.

> Bugs were still being found in production

Bugs will always make it to production. But after you fix a bug, you write a test so that this bug cannot happen again. Over time, the number of possible bugs that can make it to production shrinks.

And lastly:

> The main difference to the old E2E is that they encompass only a subset of services and don’t require spinning a production-like environment (the services run in memory on a single JVM and HTTP/Kafka communication is replaced by in-process communication). They are used in specific flows that we find too critical to only rely on Contract Tests.

Running tests against different infra than your customers have to deal with is asking for trouble. What bugs will exist in the real production infra that won't in your fake infra?


> Running tests against different infra than your customers have to deal with is asking for trouble. What bugs will exist in the real production infra that won't in your fake infra?

We actually do pretty well testing against fake infra.

We have a large test suite that enforces the contract on our REST server API. That is implemented both in one heavy server written in Erlang, which is the production code, and in one lightweight server written in Ruby, which would never scale but exposes the same API. When the test suite is updated, both implementations need to be fixed. When the client code runs integration tests we can test against the lightweight Ruby code, and when it passes we actually have pretty high confidence that it will also pass against the production code. We have hundreds of those tests and they can be run as fast as spinning up a Ruby process with an in-memory datastore which is trashed on every test. Compare that to end-to-end tests that might fire up a set of images, terraform them into production servers and clients, run a scripted interaction or set of interactions, and then throw all that away and do it again.

At some point there's a tradeoff between the realism of your tests and the cost of them and how many of them you can do. The right strategy is that you want to have enough of the most realistic tests to give you a high level of confidence that your faster, slightly less realistic tests are useful, and by having those faster tests you increase your amount of coverage, and on down the stack iteratively until you may get to unit tests of individual objects.
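In pytest terms, that dual-target setup can be as simple as a parametrized fixture that points one suite at both implementations (a sketch; `start_server` is a hypothetical helper, and the real suite here is presumably Ruby-side):

  import pytest, requests

  @pytest.fixture(params=["lightweight", "production"])
  def base_url(request):
      server = start_server(kind=request.param)   # fast fake or the real thing
      yield server.url
      server.stop()

  def test_unknown_resource_is_404(base_url):
      assert requests.get(f"{base_url}/widgets/nope").status_code == 404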


> At some point there's a tradeoff between the realism of your tests and the cost of them and how many of them you can do.

Yeah, you're not wrong. I'm just griping because of the previous list of complaints.

There are some benefits to running your tests against a mirror infra to reality. But they are limited. The bugs they catch that you won't catch running fake infra are very small in number, but terrifying in difficulty to solve.


The article did make its points fairly poorly.

I think in the end they more or less did what you suggested as well they just didn't call it end to end testing.


Agree. They clearly are lacking in test engineering skills. So the right solution would have been to hire somebody who knows how to do this well. Bad tests are simply bad code. So fix it!


> One of our Sr Staff Engineers, Rafael Ferreira, ran some numbers and applied queueing theory.

Is it just me, or is queueing theory seemingly misused all over the place in software engineering? I don't know what was applied here, as it's omitted, but I've certainly seen people suggest things like mandating that cycle time be decreased in order to increase throughput, and justify what they are saying with "queueing theory".

Mathematical constructs are great, but they aren't worth much if you can't ensure their constraints are met.


All models are wrong. Some are useful. The problem is knowing if a specific model is useful.


> One of our Sr Staff Engineers ran some numbers and applied queueing theory.

Queueing theory is an excellent way to look at E2E testing for the big-picture view and also for drilling down into each of the relevant services.

For a quick intro, this is a queueing theory primer that I've written and shared with HN previously: https://github.com/joelparkerhenderson/queueing-theory
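The punchline for CI queues falls out of the simplest M/M/1 model: average time in the system is W = 1 / (mu - lambda), so waits blow up non-linearly as the commit arrival rate approaches the rate the suite can absorb. A quick illustration with made-up numbers (not the numbers from the article):

  service_rate = 4.0                              # suite runs the queue can absorb per hour
  for arrival_rate in (1.0, 2.0, 3.0, 3.5, 3.9):  # merges per hour
      wait = 1.0 / (service_rate - arrival_rate)  # M/M/1: W = 1 / (mu - lambda)
      print(f"{arrival_rate:.1f}/h -> {wait:.2f} hours to clear CI")

  # 1.0/h -> 0.33 hours ... 3.9/h -> 10.00 hours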


I’m generally not fond of e2e or even integration testing. At least, I prefer to keep them to a minimum, and use other tools to ensure units interact as expected.

That said, where e2e tests may be valuable but costly as described in the article, it occurs to me that narrower integration tests which invert responsibility may be better. Which is to say:

- Given Service A

- Given Service B which depends on Service A

Integration tests of Service A may provide more value if implemented in Service B. It’s SB, after all, which understands the behavior it expects from SA. (If they’re mutual dependencies, of course the inverse applies as well.)

Of course, this highlights (at least for me) why integration tests should be limited in scope. If both services are well tested at the unit level, you will probably end up with a lot of redundancy between their reciprocal test suites. But at least at the idea level, this feels like a better compromise than expecting Team SA to anticipate all of the subtleties Team SB might have in mind.


Yeah, I think the easiest way to test that Service A is meeting its API obligations is to send some requests to Service B. Hyrum's Law means that that's the only way to really test the important aspects of Service A's API.

And I think this can be generalized into a general philosophy of using your users as a test suite for your API: http://catern.com/usertests.html


What are the tools you do use to ensure units interact as expected?


Mostly making types more specific. Within a service, ensuring semantics are part of the type so they can be checked statically, and designing function boundaries so those semantics are part of the interface.

Between services, using standardized machine-usable docs (like OpenAPI and JSONSchema) to share those types in a well-defined way. This is harder because network boundaries are less flexible, but it does help a lot.
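A small example of the first point in Python terms (mypy as the static checker; the names are illustrative): distinct id types that the checker refuses to mix up, so the semantics travel with the type.

  from typing import NewType

  UserId = NewType("UserId", str)
  AccountId = NewType("AccountId", str)

  def close_account(account_id: AccountId) -> None:
      ...

  close_account(AccountId("a-9"))   # fine
  close_account(UserId("u-123"))    # rejected by mypy: UserId is not AccountId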


We've seen this before. At first, the E2E tests automate a bunch of tedious manual tests. Then they become the team's "automated test suite." Then nobody (except true believers) tests their code anymore, because "our automated test suite" will surely catch any problems. Then everything starts breaking down the way the author cites.

The E2E test suite needs to be thought of differently from a "test suite." It's the last safety net to disprove that the build is worth [manual testing | dogfooding | beta release | prod release]. Any bug found there should be worth a postmortem -- even one as short as "oops, new guy forgot to update the unit tests! fixed, won't happen again!" Of course, bugs will get through to the safety net, even in a system that's working well. But they should always trigger the question whether and how the bug could have been caught earlier.


Almost all the bugs I’ve ever seen have been an integration or configuration issue. Race conditions are a common example of this. Even the small bugs usually involve 2-3 “units.” End to end tests and integration tests, manual or automated, are really the only way to catch these


This sounds like a garden variety example of an organization moving to event-based architecture and implementing contract testing.

What makes this interesting is that they had decided early on to be a Clojure/Lisp shop so there weren't great tools out there at enterprise scale.

I don't understand why they decided to build everything on Lisp; I assume it was best for them at the beginning. But given the known lack of tools for Lisp to scale into large enterprises - this is the kind of thing I would always consider once you get past the MVP phase. Otherwise you have to end up building all your own stuff.

Architecture matters.


What enterprise tools are you thinking of? Because Clojure runs on the JVM, I'd imagine what works for Java also to work for Clojure.

Personally I see Clojure as a perfect choice for an ever-increasing amount of complexity, since you'd keep side-effects on the edges of your app and everything else as pure functions. Every occurrence of enterprise OO I have ever seen has been an unmanageable mess that burns everyone out within a quick year, because you have no idea where something begins or ends, what goes on in between, and instead of doing what actually matters (business logic) you spend most of your time creating abstractions.


Erm, most people need all kinds of tests; it's not a story of one vs. the other... Unit, integration, end to end... The end-to-end tests are always the slowest and flakiest, so not everything should go there, but with any mature-ish product it's also impossible to catch issues related to system complexity without them... It's also weird how self-important this account reads. The field is not new, but they don't really describe how their homegrown stuff does better than other frameworks, and they present themselves as visionaries... Smelly


I’m not sure I follow the logic that their e2e test suite would take an “infinite” amount of time to run by 2021. It seems like an obviously faulty calculation, unless someone puts an infinite loop.


I think what they meant is that at the pace they were committing code to production, the e2e suite would never stop running.

This is the most charitable interpretation.


The way we solve that at Brex scales logarithmically: we use the bors merge bot, which batches everyone’s PRs together and will binary-search for the offending PR if any build fails. I’ve hardly ever waited longer than 2x the length of time our suite takes. They could also decouple CI and only run code tests for upstream changes, so it wasn’t obvious what they meant; thanks for clarifying one theory. I’m not sure I’d agree that it’s a problem with e2e tests per se; rather, the problem is slowness itself, and monolithic CI. Not doing e2e tests can be a valid tradeoff to avoid slowness, but I’d also point out these issues can be solved without deleting the e2e suite…
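The binary search itself is nothing exotic; under the simplifying assumptions of cumulative batches and exactly one culprit it's roughly this (not bors's actual code):

  def find_culprit(prs, build_passes):
      lo, hi = 0, len(prs)              # culprit index lives in [lo, hi)
      while hi - lo > 1:
          mid = (lo + hi) // 2
          if build_passes(prs[:mid]):   # first `mid` PRs are fine together
              lo = mid                  # so the culprit is in the later half
          else:
              hi = mid                  # the culprit is among the first `mid`
      return prs[lo]                    # O(log n) extra builds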


you’re right sir!


The real problem is that they don’t know how to write fast end-to-end tests. Not having solid end-to-end tests is a massive red flag. Also, each service can be simulation tested in isolation. So even if they don’t have the skills to do end-to-end testing, they could at least do per-service simulation testing, with each service team responsible for running those tests independently.


> The support for messaging tests was immature in the JVM implementation: most of the critical interactions between our microservices occurs through Kafka messages (we favor mutations in asynchronous flows while HTTP calls are mostly reserved for read-only operations).

Trying to wrap my head around what is meant by that. I mean I get the second half, but the first half not so much.


“Instead of investing in making an existing tool better, we built our own thing!”

Oye.


> In our analysis, we figured out that the most frequent category of bugs caught by End-to-End tests was schema violations.

Schema violations are pretty much just type errors.

Fortunately these can be prevented automatically and with 100% confidence without writing even a single test.


Types are compile-time checks; they have nothing to do with contracts. Having contracts for messages that will transit a queue is still useful even if you are using a typed language.


Not in all environments, e.g. clojure.spec, or with MyPy reflection etc.


clojure.spec is not a type system. Still, my point is, just using a typed language won't remove the need for contracts; you would still need to roll your own solution like Nubank did, even if it means using MyPy reflection features.

For example, imagine you have two services that communicate through a message queue. Service A produces X as a string, but Service B consumes X as an integer. You can type that, both services would compile, but it would break as soon as you tried to consume that message. And yes, you can build something using MyPy reflection or whatever, but you have to build it anyway.
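Which is why the usual fix is a shared, versioned schema that both services validate against in CI; a sketch with jsonschema (Avro or protobuf schema registries play the same role):

  import jsonschema

  TRANSFER_SCHEMA = {                           # the shared contract
      "type": "object",
      "properties": {"x": {"type": "integer"}},
      "required": ["x"],
  }

  def test_producer_output_matches_contract():
      message = {"x": "42"}                     # Service A emits X as a string
      jsonschema.validate(message, TRANSFER_SCHEMA)   # raises ValidationError
                                                      # in A's CI, long before Kafka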


Hm? Give the message a type. Service A or B would fail to compile, depending on whether the static type of X is a string or an integer.


AFAIK, Kafka messages don't have "types", and even if they did, you would be relying on an external system, not your type system. If you are not convinced, test it yourself: create two services and a Kafka topic and produce a message from one service to the other with different types on each side.


Hogwash. You did E2E testing badly, and now you’ve replaced it with something else. Which may or may not be as bad, but is certainly bad faster.


I work a lot on compilers and VMs and have written tens of thousands of tests at different scales over the years. Different kinds of tests serve different purposes.

Unit tests help you pinpoint errors in the code. They can exhaustively test (only) small components to make sure they are fully compliant. They are a refactoring and development aid to the extent that they are focused (don't involve too many components), quick (run in seconds or less), not too tightly coupled to the code under test (i.e. you can change the code under test without changing the tests), and explanatory (failure output is easy to understand and points exactly at the faulting component). Making good unit tests is an art form. Some people love their mocking frameworks. Personally, I hate them. Mocks are confusing, and they make refactoring hard because they check behavior rather than input/output results.

Integration tests are about putting one or more systems together to test their interactions. More than just a single unit, we can put services together and test their interface. They can be more exhaustive about testing a component's interface because the combinatorics haven't exploded yet. Because there is a lot more code under test, failures are less explanatory and thus there is more work to investigate these failures. Investments here that help are to make failure modes as helpful as possible. That, too, is an art form.

End-to-end tests are inherently going to be slow. We put the whole system together and run some canned interactions on it. It might be flaky (because large scale, because networks, because OOM, timeouts, etc). End to end tests are generally a bitch to debug, because essentially anything could be at fault...well, anything except the things that are clearly passing their unit tests and integration tests. Which is why you need to have good unit and integration tests, so that you don't need many end-to-end tests.

It sounds from the article like they reduced or eliminated their end-to-end tests and went for more integration tests. That does seem to have paid off. Sometimes tests are slow and bad, and other kinds of tests are better.

I would say though, working now on a system with many, many, distributed moving parts, you do want to at least have some end-to-end tests that make sure everything comes up properly. Nothing like committing a change that passes all the small scale tests and then a component fails to come up because some stupid command-line flag is set wrong. You gotta have tests for anything you could absent-mindedly break.

And all of that testing needs to be one button push away. You can't have tests that developers don't run, or don't know exist. Personally I like having shell scripts that are checked in, and at least one that does the whole enchilada, even if it is just a wrapper around the build system's or CI's test targets.


Agree. I wouldn’t be able to confidently deploy my code into production without solid tests. I haven’t had a production bug for years because those solid tests catch problems early.


It's not one vs the other. Both kinds of testing are 100% essential to a stable service.

If you have a flaky, laggy E2E test suite...fix it.


If you integrate with a third party service and their environment is slow, because it's not a production environment, your tests will fail due to timeouts. How would you fix this?

One way, I would think, is to not test against their service directly: create a similar fake service and run it according to your SLA. But then you have to make sure the contracts stay in sync, so you need to verify them from time to time.
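
A rough sketch of that periodic verification step (the URL and fields are invented): a scheduled test hits the real provider's sandbox and checks its responses still match the contract the in-house fake was built from.

    import json
    import urllib.request

    # The contract the fake was built from: the fields and types we rely on.
    CONTRACT = {"id": str, "status": str, "amount_cents": int}

    def fetch_sample(url: str) -> dict:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.loads(resp.read())

    def contract_violations(payload: dict):
        missing = [f for f in CONTRACT if f not in payload]
        wrong_type = [f for f, t in CONTRACT.items()
                      if f in payload and not isinstance(payload[f], t)]
        return missing + wrong_type

    def test_provider_still_matches_our_fake():
        payload = fetch_sample("https://sandbox.provider.example/v1/payments/123")
        assert contract_violations(payload) == []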


If that picture is from the actual company, it doesn't look like a good place to work. That environment is horrible.


"We can't develop proper end to end pipeline, so we killed it and patted ourselves on the backs".


For rewrites and refactors of integrations, there is no substitute for E2E testing.

I’m hopeful to be proven wrong, though.


As a rule of thumb: If you're not in charge of testing, your users will have to test it for you.

Looks like Nubank's e2e suite was flaky and poorly optimized, so now they will try to `silo` tests in the form of contracts.


Isn't this what protos are for?


So, they're not testing at all. Just inputs and outputs. mmm ok.


Wait, what? Do people actually work in conditions like that article's picture shows? Good Lord.


I don't know anything about what Nubank is up to or how things work there overall, but integration tests are absolutely worth doing. The argument against this to me reads like "coordination and testing of big systems is hard, so let's not do it."

> Waiting. Engineers had to wait more and more to get feedback from this long-running suite;

"Our tests are inefficient, not sufficiently parallelized, the setup / tear down of the test environment isn't optimized, and it isn't possible to run only the relevant subset of tests during feature development or bug triage for short feedback loops"

> Lack of confidence. Flaky tests meant that we had to re-run the suite frequently to see if something was really wrong or just a false negative;

"Our tests aren't well written (we have sleep-polling)", "we don't build-in testability into our system (we can't introspect or wait on the thing we care about in the test, so we have massive work arounds)", or possibly worst "our system is flaky and our tests reflect that".

> Expensive to maintain. Manual changes in our staging environment corrupted test data fixtures and maintaining the environment “clean” was a challenge;

"We haven't spent enough time developing our own tools for testing, so we have tests that are extremely fragile (think copy and paste of massive JSON blobs with comparisons just to check a handful of values)"

> Failures don’t point to obvious issues. Test failures were very hard to debug, specially due to our reliance on asynchronous communication that make it hard to connect the cause of failure (a message not published to a queue) with its effect (changes not made in another system);

"Our system is over-engineered and our service boundaries match our internal structure rather than clean separation in the functions of our APIs. We don't have good visibility because doing any one thing involves massive levels of coordination. We lack proper tracing and aggregation."

> Slower value delivery. Queueing of commits in the End-to-End suite resulted in less frequent deployments;

"Quality is hard and takes time. Let's not do it so we can move fast and break things."

> Not efficient. Few bugs caught in this stage. One experiment suggested that, for every 1000 runs, we had 42 failures, only 1 bug;

See above about flakiness and fragility. Also, the bugs that integration tests do catch tend to be really bad, obvious ones. I'd be happy about the one that was caught.

> Not effective. Bugs were still being found in production.

"We still found bugs. This means testing must be ineffective altogether?"


You mean unit tests?


I loled that you were downvoted, I was going to post the same thing - they discovered multiple layers of testing? Congrats?


right?! Gimme a f* break


The scale of Nubank is insane


The "flaky" argument against end-to-end tests does not fill me with confidence in your system.


With a photo from a pre-COVID world...


Good, kill testing. Testing is a worthless and tragic waste of human energy and creativity unless actual lives or fortunes are at stake



