We killed our end-to-end test suite (nubank.com.br)
274 points by jdminhbg on Sept 24, 2021 | 260 comments


Sounds like this specific e2e suite was poorly optimized and was killed instead of rewritten/optimized due to a perceived notion that inefficiencies are inherent in all e2e suites. If you maintain speed and strict curation of such a suite, most of the bullet points against are not an issue.

Also it sounds like the solution is just a bit higher than limited integration testing, which does have value of course. Sounds trite, but if you don't test end-to-end you aren't going to catch bugs that only appear end-to-end (which also happen to be the ones customers see, making the e2e suite a decent place for high-level regressions, assuming you maintain test performance of course). This is especially true in environment-specific scenarios.


Right, they talk about fighting for a queue. Firstly, a good test suite (or a configurable subset of it) can be run on the developer's workstation. Secondly, it needs to run on commits in a reasonable amount of time. This is just as true of E2E as of unit tests.

They also mention flaky tests. If there is a spectrum between unit tests that can run on a single function and e2e tests that need a complete system, the closer to e2e you get the more likely you are to have flaky tests.

Flaky tests are an indication of non-determinism either in your test or your system. If you have non-determinism in your system, then you can't confidently test it regardless of the flavor of tests you use. Non-determinism in your tests should be minimized; if you can take a random-seed as an explicit parameter, do so, so that you can reproduce the flaky failures. Test failures (flaky or not) are always indicative of a bug either in the test or in the system, and should be investigated as such. Flaky tests should be removed from the production testing system just like code that fails tests should be removed from production deployments.


Flaky tests are an indication of non-determinism either in your test or your system.

Yeah, my first thought upon reading the article was: If their E2E tests produced non-deterministic results due to asynchrony, how can they have any confidence that their production data ever becomes 'eventually consistent'?


All end to end tests are non-deterministic due to asynchrony. At some point you have to trust the discrete states of your software.


I mean the exact output given a certain set of inputs may be slightly different due to asynchrony, but given a set of inputs, there should be a finite set of correct outputs, and the test should check for those.

To use a stupid example: if listAnimals returns [cat, dog, mouse] some of the time and [cat, mouse, dog] other times, and your test passes on the former and not the latter, then your test is broken and you should fix it. If it sometimes returns [cat, dog, mouse, tree] then your system is broken and you should fix it.
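
Concretely, the test can accept either ordering instead of depending on it. A minimal Python sketch (list_animals here is just a stand-in that shuffles a fixed list):

    import random
    from collections import Counter

    def list_animals():
        # Stand-in for the real call; order legitimately varies between runs.
        animals = ["cat", "dog", "mouse"]
        random.shuffle(animals)
        return animals

    def test_list_animals_ignores_ordering():
        # Compare as multisets so reordering never flakes,
        # while an extra or missing element still fails.
        assert Counter(list_animals()) == Counter(["cat", "dog", "mouse"])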


A more accurate way to look at this based on your example is that, sometimes listAnimals returns [cat, dog, mouse], and sometimes it returns null.

It’s not that the result is nondeterministic, it’s that _whether or not the result is returned within the timeout of the polling mechanism_ is nondeterministic.


Presumably that happens in production as well, and the test can determine that the system does the proper thing when that happens?


I should be able to test that this usually works though, right?


You can test these things, sure. But if you're using other people's software (linux, vms, chromedriver, capybara) on other people's hardware (again, vms), you have to tolerate the fact that you can't control everything if you want to actually get work done. A little electrical, magnetic, or gravitational anomaly here, a little memory access blip there, some competition for cpu time elsewhere... I suspect there are probably only a handful of completely controlled environments on the planet and even those are suspect.

Test suites are sort of an eventual consistency problem themselves...


If you use other people's software and hardware, and those things don't perform the way your software assumes they perform, knowing that would be useful, right? There's always a limit to how much you want to handle, but if you are having a test fail even a large fraction of 1% of the time, then there's probably some underlying behavior that you should account for in production as well.


No, that test doesn’t give you any useful information, because all it told you was that your expected answer wasn’t found in the configured time interval. You have no way of knowing whether or not your expected behavior would be satisfied if you ran for t + 1 seconds.


After some time you have to consider the test failed and investigate, even if it would have succeeded had the timeout been 1 second larger. I cannot believe they do not have quality-of-service requirements. Testing those requirements is of course not easy. It may take too much time to run on every release, or it may be considered out of the scope of E2E tests, with compliance checked against telemetry results.

However, pick any response time mandated by the QoS requirements, multiply it by an appropriate x and use this as the pass/fail timeout for your test. Take a value large enough that exceeding it can easily be considered a bug (because e.g. the customer would think the operation failed and would hit refresh or back). You then have an issue that is definitely worth investigating. You may actually have reproduced a rare issue that is part of the long tail of your telemetry.
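
A sketch of that shape (the QoS number is made up, and order_status is a hypothetical check):

    import time

    QOS_P99_SECONDS = 2.0            # whatever the QoS requirement mandates
    TIMEOUT = 10 * QOS_P99_SECONDS   # generous multiple: missing this is a bug worth filing

    def wait_for(condition, timeout=TIMEOUT, interval=0.5):
        # Poll until the condition holds or the QoS-derived timeout expires.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if condition():
                return True
            time.sleep(interval)
        return False

    # In a test: assert wait_for(lambda: order_status(order_id) == "settled")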


Right, have a timeout measured in minutes. The timeouts have zero effect on a clean run, so large timeouts have no effect on time to deploy if you require a clean run of tests for deploying.


Right, but the key word here being “usually” - if I can’t just run the test three times and assume 2/3rds of the time it’s good, how can I know it usually works in production?

Is the right solution really to throw up your hands and not test end to end ever? I guess the argument is more convincing if it’s not that it’s impractical, it’s just too expensive relative to the returns.


You can if it's BSD-licensed.


This is what’s known as “counterintuition.” You would think that you could, but you are wrong.

I’m not saying you can’t write a passing end to end test. Of course you can get it to pass some times. But they are inherently non-deterministic.


> Flaky tests should be removed from the production testing system just like code that fails tests should be removed from production deployments.

...then how do you know when third-party upstream services are obeying their contracts to your service, if not by testing how your service interacts with those third-parties?

(I know my answer, but I'm curious to hear yours.)


That's monitoring, not testing.

Of course, both have the same form: you run the system and verify that the results match what's expected. But monitoring is done constantly during the lifetime of your infrastructure, and verifies the entire infrastructure; while tests are done episodically, and verify your program or a component. Tests also often block some procedures, while monitoring doesn't (but it certainly starts some).


You can try to monitor that an endpoint responds quickly, but how do you monitor that it responds correctly? At the end of the day both tests and monitoring are forms of verification.

Some people run subsets of their tests in production as a form of monitoring. Sometimes monitoring does not pass or fail and is instead qualitative, like a dashboard or raw logging, without alerts.

I’d say there is a grey area between monitoring and testing; it is more precise to ask if you’re verifying pre-production, post-production, or both.


Generally, I think tests are used to validate changes to your service code (often as a gate to release it to production). Whereas monitoring is used to detect issues external to your code (often operated in production).

Edit: That is to say, what distinguishes testing from monitoring isn’t content, but purpose.


Monitoring can catch issues in the code. For example if an event is dead lettered or the application crashes unexpectedly, it triggers an alert, which may make you aware of some edge case you forgot to test. Both tests and monitoring can encompass validating code is running correctly, some even run their tests against production at regular intervals as a form of monitoring, for example see “datadog synthetic tests”, which could be characterized as both a test and a monitor. Many companies opting not to do traditional e2e tests actually still have them, they’re just running them against production instead of blocking CI (with the rationale they will prioritize fast detection and mitigation rather than trying to prevent bugs from entering prod)


Our solution was two test suites.

End-to-end (which I will fight for being the highest-value test suite, and it's not close) had no external dependencies.

And a separate test suite that touched external services, split into two components: one that tested our integrations, typically against a remote testbed (if the 3rd party was competent enough to have such a thing), and a second chunk that attempted to see if remote api behavior had changed. Which it does with annoying regularity.


There's basically no value in having tests against third-party code anyway, because all the test is going to do is tell you that they broke their interface. And by then, production is already broken.


I agree this is often the case but disagree that it always is; “testing” against the API can be a canary for your new usage of their third-party API not working the way you think it does.


I don’t know, I think having your tests assume that the 3rd party data looks a certain way is helpful. If that ever breaks in prod, then something needs to change, and your tests can change if the interface changes.


There are two things that can be tested here, not one: whether the upstream service conforms to the contract / API promise, and whether your code behaves correctly with respect to what the API promises.

So that gives you a number of options for testing the second one of those. Recording sample traffic and replaying it in the test suite is one approach. Actually running an instance of the service (if it's open-source - there's still value in paying someone to competently run an OSS service) in your test suite is another, as is running some clone of the service (e.g., if you're talking to S3, there are probably a hundred S3 API-compatible clones that are good enough to run in your test suite, even if, again, you are happy to pay Amazon to competently run production).

You also want to pay attention to the first one of those, but that's not a job for your test suite. That's the job for some balance between their test suite, your monitoring or production logging, and your business relationship with them.
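
To make the record-and-replay option above concrete, here's a cheap sketch (the fixture data and port are made up): capture upstream responses once, then serve them from a local stub so the test exercises your code against what the API promised, without the real third party in the loop.

    import json
    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Previously captured upstream responses, keyed by path (made-up fixture).
    RECORDED = {
        "/v1/accounts/42": {"status": 200, "body": {"id": 42, "balance": 100}},
    }

    class ReplayHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            hit = RECORDED.get(self.path)
            if hit is None:
                self.send_response(404)
                self.end_headers()
                return
            payload = json.dumps(hit["body"]).encode()
            self.send_response(hit["status"])
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(payload)

    def start_stub(port=8099):
        server = HTTPServer(("127.0.0.1", port), ReplayHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        return server   # point the code under test at http://127.0.0.1:8099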


Third-party services should be mocked for integration and end-to-end testing. Error conditions with respect to these services should be something that is monitored and alerted on when appropriate.


I always mocked out 3rd party services in my tests. I've never actually had a problem with some third party changing their API. That's the whole point of a versioned API anyway. I think when people talk about e2e tests, it's more about testing only integration between contracts that you own.


It’s valid to say you’re e2e testing your system, just not e2e testing the “full system”.

This is why the classification of the test into e2e, integration, and unit can cause confusion. I like to try to encourage people to avoid bucketing and instead say “this test should be more integration style than it currently is”, “this test should be more isolated than it currently is”. At the end of the day, all testing mocks out the user, and things like old web browsers or other factors that are part of the real-world system you care about may not be simulated in your test. So the way to get ”real” e2e verification is probably monitoring real users, if you consider that the user is a part of your “system”.


I’ve run into this a few times with some upstream package breaking and showing up in tests. I try to avoid mocking as much as possible in tests these days.


One thing I've done is adding the ability to run tests both with a "mocked" and a "real" version. The mocked version is fast and can be run quickly, the real version is much slower, but tests the actual real service. It's not that much extra effort to make in most cases, and I've caught some bugs when my mocked version made assumptions that were false, didn't cover some edge case, or whatnot.

That said, I too avoid mocks unless there's a specific good reason to add one.
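
The wiring can be as small as one fixture that picks the backend from an environment variable. A pytest sketch (the flag name and the fake client are made up; the real client would be plugged in where the skip is):

    import os
    import pytest

    class FakePaymentsClient:
        # Fast in-memory double used by default.
        def __init__(self):
            self._refunded = set()

        def refund(self, charge_id):
            self._refunded.add(charge_id)

        def was_refunded(self, charge_id):
            return charge_id in self._refunded

    @pytest.fixture
    def payments_client():
        # Flip one env var to run the same tests against the real service.
        if os.environ.get("RUN_REAL_SERVICES") == "1":
            pytest.skip("real client not wired up in this sketch")
        return FakePaymentsClient()

    def test_refund_marks_charge(payments_client):
        payments_client.refund("charge-1")
        assert payments_client.was_refunded("charge-1")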


I like the idea of testpoints in code that can be switched on or off, an idea originally from the hardware side. Modifying the testpoints to allow switching between different test implementations is a useful generalization of the idea.


This system didn't rely on third party services, so "not applicable" I guess?


> Flaky tests are an indication of non-determinism either in your test or your system

Or in the system that runs your tests. That can itself be non-trivial.


> Sounds like this specific e2e suite was poorly optimized and was killed instead of rewritten/optimized due to a perceived notion that inefficiencies are inherent in all e2e suites. If you maintain speed and strict curation of such a suite, most of the bullet points against are not an issue.

At least for web applications, all end to end test suites are slow and flaky. This is not an exaggeration - all of them. There are no magical optimizations. This is something that every project runs into, over and over again.

I will never willingly write an end to end test ever again. Unit / module tests + targeted integration tests are the only hope that we have.


I never had any problems writing reliable end-to-end tests. They are super useful for catching serious subtle bugs before the system goes into production. Not having solid end-to-end tests is a massive red flag for me.


+1 on this comment. We had issues with poorly written Selenium tests, and after rewriting new tests with Cypress and better test practices, the e2e tests are reliable enough to be used as canary testing in new environments without false-positive flaking. It ultimately comes down to how much you're willing to invest in writing good tests.

If you're suffering from seriously flaky e2e test results, more often than not it's from outdated tech, poor testing practices, or not enabling a selective retry on failures.


You hit the nail on the head, using your own words.

> selective retry on failures

This is why your test suite passes, not because of an avoidance of outdated tech or poor practices. You rerun your flaky tests until they pass. That’s bad engineering, and the definition of non-determinism.


I flat out don’t believe you.


Well it is true. The fact that you don’t believe it tells me you have a lot to learn. Writing good end-to-end tests is a skill you need to learn. Don’t assume that software developers can do it without proper training/learning. It is hard to do well.


My disbelief of you is from my own experiences writing them, and talking to dozens of colleagues across different companies.

Every single company has to heavily parallelize their e2e tests, and pays a huge CI bill on top of effort maintaining an overly complex CI config.

Even after this, every company has to have a retry mechanism for their e2e tests because at least one fails at least every test run. It’s also the first thing I ask about in interviews. I have many data points on this one, across many teams and companies.

It’s an abomination.

Maybe you’re not talking about interactive web applications. Or maybe you’re not talking about the scale where you have thousands of e2e tests written over multi-year projects. If you’re talking about a toy project, sure you might not run into nondeterminism. These problems present themselves in aggregate.


I am talking about very large scale systems with web + mobile + desktop clients in Enterprise environments. Think airlines and airports (operations and pairing/rostering). However it sounds as if your experience has been pretty bad. So I understand where you are coming from. My experience is that yes, it is hard to do right, but also that it is worth doing. YMMV of course.


I've never had any issues with the E2E tests in my company either, and I'm unsure where the flakiness would even come from. Using Cypress with great success.


> At least for web applications, all end to end test suites are slow and flaky.

Then you should be asking why they are flaky in the test environment. It's probably because your services are running on very slow servers.


No, it’s because end to end tests are non-deterministic with respect to execution time. They are deterministic only in their discrete states.

This is why all end to end testing involves polling to wait for asynchronous operations to complete, which is by definition non-deterministic.


No not really, what you are saying is that you don't know how long a test will take to complete; it can take 1 min or it can take 1 hour. If it sometimes takes 1 hour then you have to put on your detective hat and go look in the logs to see which service is slowing the e2e flow.


> I will never willingly write an end to end test ever again. Unit / module tests + targeted integration tests are the only hope that we have.

What are your "integration" tests that are not "end to end" tests like, how do they differ from end to end tests?


Integration tests may integrate smaller-than-the-whole groups of subsystems. It definitely gets fuzzy. A lot of people treat end-to-end and integration tests as equivalent, but piecing together everything-but-the-frontend and testing it is also an integration test, but not an end-to-end test.

If we consider tests as existing at and covering different scales, unit tests are at the smallest scale and integration tests run the gamut from 2 units to the entire system.


Picture a test that doesn’t involve any clicking of UI elements. That’s a start. So right there you’re avoiding the complexities of UI rendering, you just call the commands that are invoked by clicking directly.

Also, you can test frontend components together, and backend components together, but not cross the client-server boundary. Faster, more reliable.

That leaves a very small amount of e2e tests that you even want to write, and by that point I’m totally fine with manual smoke testing or automating them. But they’re the vast minority of tests.


Not OP, but testing groups of the "units" you unit test can have a lot of value.


Sounds like you’ve been subject to some pretty poor test setups. I’ve experienced good ones. My cynical take is that well maintained e2e tests aren’t a product priority in environments where they’re flaky and slow so they come as an afterthought. Not that they can’t be good. Usually product wants to ship code yesterday and doesn't care if there are bugs… so good test hygiene is nowhere to be seen.


This argument does not account for the fact that all e2e tests are non-deterministic, so the quality of your “setup” is not relevant.


You must have worked on some incredibly bad software to have non-determinism dominate your life. If a request fails then retry it just like a user would. If it keeps failing there's a problem. If it works then move on. I doubt your bank just throws in the towel and says "whelp software systems are inherently non-deterministic so we'll just forget some transactions here, allow the wrong amount over there, forget tests they're hard we can handle a little chance in our payment flows". The closest thing I've heard to that is amazon very occasionally shipping multiples of the same item because it was allegedly more expensive to implement immediate consistency than to ship a few duplicate items.


The main problem I see over and over with E2E tests is that they keep people from getting good at unit tests. The E2E are a magical security blanket that covers over all of the mistakes you’ve made leading up to them.

It’s much easier to build a testing pyramid from the bottom up. The skills maturity comes from the bottom of the pyramid, not the top, and thinking about the end game stunts your growth.

Often E2E tests have such sunk costs involved that they materially affect the project roadmap.


While I agree that writing unit tests is a lot harder, and you develop good skills in attempting to write them, I must say that in the projects I worked on most bugs were caught by integration tests (technically not E2E tests), and not unit tests.

I've also had projects with only unit tests, and almost no bugs were found by them, and there were plenty of bugs.

Ideally, I would like both. But if I had to have only one, I'd go with tests at a coarser granularity than unit tests.


My experience is that most bugs are found by randomisers / fuzzers. I’m consistently surprised they aren’t used more often, because they’re insanely good value for the time spent writing them.

Eg, a b-tree has a bunch of invariants: Leaves have equal height, data is sorted, nodes have between N/2 and N values, they contain everything that was inserted and not deleted, etc. So write a test which makes random changes to a b-tree in a loop, and makes those same changes to a simple sorted list. Every iteration, verify the invariants hold and values match. Every 1000 iterations, throw out the object and start again with a new seed. If the test ever fails, print out the seed for easy reproducibility.

In your unit testing suite, run this fuzzer for about 100ms or something. This catches lots of bugs. And occasionally leave the randomiser running overnight looking for rare bugs.

This sort of thing is so humbling, for the sheer volume of “obvious” bugs you find in otherwise working code. It’s hands down the best value testing code I’ve ever written.
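
The pattern in miniature, as a runnable Python sketch (it fuzzes a trivial bisect-sorted list rather than a real b-tree, just to show the seeded loop and the model comparison):

    import bisect
    import random

    def fuzz_once(seed, iterations=1000):
        rng = random.Random(seed)
        sut = []     # "system under test": a list kept sorted via bisect
        model = []   # naive model we trust: unsorted, re-sorted when checking
        for _ in range(iterations):
            if model and rng.random() < 0.4:
                value = rng.choice(model)       # delete an existing value
                model.remove(value)
                sut.remove(value)
            else:
                value = rng.randrange(1000)     # insert a random value
                model.append(value)
                bisect.insort(sut, value)
            # Invariants: sorted order, and same contents as the model.
            assert sut == sorted(sut), f"not sorted (seed={seed})"
            assert sorted(model) == sut, f"contents diverged (seed={seed})"

    def test_fuzz_sorted_list():
        # Small budget for CI; leave it looping overnight to dig for rare bugs.
        for seed in range(20):
            fuzz_once(seed)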


> The main problem I see over and over with E2E tests is that they keep people from getting good at unit tests.

I'd view that as a win then. Unit tests are next to worthless for anything but tightly bound domains, like libraries, in most cases.

They're actively harmful in things like application-level service code, where they're used to turn perfectly well-written code into a chockablock mess to satisfy "testability".


> Often E2E tests have such sunk costs involved that they materially affect the project roadmap.

I've seen this first hand. When the e2e tests take hours to run, are flaky on a good day, and are only really understood by one or two people on the whole team, they can be a major roadblock to new features or even just moderate refactors.


That definitely happens. The E2E tests tend to make assumptions about how the app works (encode not just the requirements but also the architecture) and some features change the design. In order to add this feature we have to fix dozens of other tests. I’ve seen people on multiple projects team up to fix these, take over a day working together, and still not be done. They always try to tweak the tests but the test assumptions fight them.

Meanwhile if we add a feature that invalidates a unit test, you just delete the unit test and start over. Unit tests are cattle, E2E tests are pets.


I should add as well: after that day, day and a half working together on old tests, those engineers look beaten down. They are not having a good time. It’s miserable work.

It must be some sort of Stockholm syndrome that people in this state still defend the tests. Even after they’ve invested more time and energy into fixing them than we ever would just manually testing that part of the code in perpetuity.


> The main problem I see over and over with E2E tests is that they keep people from getting good at unit tests. The E2E are a magical security blanket that covers over all of the mistakes you’ve made leading up to them.

Either alone is insufficient. Both together aren't necessarily sufficient.


Unit tests are worthless. End-to-end tests are 100% needed. You need tests that cover all use cases end-to-end including testing error cases. That’s the absolute minimum I would expect from a well-engineered system.


You can’t test all error cases end to end. If you can you have shitty error handling.

Clock skew between servers? Drifting clock skew? Disk space exhaustion? Disk space exhaustion at each possible failure point? There are so many of these and you’re going to inject most of them in unit tests.

My original point was that if you can’t write good unit tests your e2e tests are also going to be lousy, and you will never get good at either, let alone both, if you fixate on more coverage with E2E tests.

They’re also just too damned expensive even if they were qualitatively as good. Which they are not. They are less numerous, sure, but that’s false economy because they are usually 3 orders of magnitude slower.
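
For instance, a disk-full failure is trivial to inject at the unit level (a sketch; save_report is a made-up function under test):

    import errno

    def save_report(path, data, open_fn=open):
        # Made-up function under test: must surface disk exhaustion as retryable.
        try:
            with open_fn(path, "w") as fh:
                fh.write(data)
            return "ok"
        except OSError as exc:
            if exc.errno == errno.ENOSPC:
                return "retry-later"
            raise

    def test_disk_full_is_reported_as_retryable():
        def exploding_open(path, mode):
            raise OSError(errno.ENOSPC, "No space left on device")
        assert save_report("/tmp/report.txt", "x", open_fn=exploding_open) == "retry-later"

Passing open_fn in is what makes the failure injectable without patching globals, and without arranging an actually full disk at exactly the right moment in an end-to-end run.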


Of course you can. Part of the end-to-end test is to setup the test scenarios you want to test (including limited HD space etc.)


You're going to spin up a vm with the wrong system time, and then advance it between two operations that take 200 ms on a live system?

Bullshit.


What a BS straw man answer. There are much easier ways to do that kind of testing.


They are only slow if you do it wrong.


You're both kind of right. You need to E2E your product, and what your product is is what matters.


I'm not sure that the idea of e2e being relatively inefficient is just "perceived".

E2E tests in all orgs I worked at have always been the slowest and flakiest part, especially when simulating UI work and when working with systems that go beyond a handful of services.


I have seen efficient e2e suites, often built by and having a BDFL who had the same experiences as you. They have enforced best practices like "no sleeps", "no time-based tests", "every test must be concurrent and isolated", "refactor liberally", "bootstrap/share expensively allocated resources", etc.

I don't know how to say it humbly, but the biggest problem I've witnessed in slow e2e suites is that they are considered second-class pieces of software and only get the attention of QA engineers or developers who are not applying the same level of effort as their runtime code.


> I don't know how to say it humbly, but the biggest problem I've witnessed in slow e2e suites is that they are considered second-class pieces of software and only get the attention of QA engineers or developers who are not applying the same level of effort as their runtime code.

I replied in two other places on this thread before seeing this comment. It's very true. Since tests don't get shipped to customers, tests don't get the same level of effort. But when your tests are known to be of poor quality, people stop trusting them, and when people don't trust the tests, they stop adding any value.


I have a friend that works on a team whose whole job is writing e2e tests. Before them the tests were slow, buggy, and couldn't be run in parallel. Now they can be run in parallel and there are few-to-no false positives.

There's still challenges with this model (such as tracking changes on other teams, helping ensure that UIs are testable), but it seems to have worked out much better for their company than expecting every developer to write and maintain them.


Yeah but what's the point then, in that case you can just take back the old QA team and delete the gazillion lines of e2e test code and save yourself the liability of all that complexity. If it's cheap and simple to make a manual test, why replace that with something that complex, expensive and hard?


> If it's cheap and simple to make a manual test, why replace that with something that complex, expensive and hard?

Because you don't want to make _a_ manual test, you want to make _hundreds_ of tests.


If you are a human you have judgement and can determine which tests are the most critical and most relevant, so you don't have to always execute all of them. "It's just one line of css change to fix the styling, ok deploy". Second, a team of QA can very well make hundreds of tests in a day. And more importantly, they can really easily make decisions and draw conclusions such as "it's a bit slow sometimes, but overall acceptable", or "the animation is displayed correctly", or "there was a glitch in the rendering, but it's fine now", or "it works but the styling has moved slightly off center", etc., which expert test programmers can spend forever trying, and failing, to turn into deterministic automated tests.


Your comment matches my experience very well. I had the same experience as GP and OP with low-quality e2e tests at my job. I got fed up four years ago, started something new from scratch, and now I'm the BDFL you mentioned, for a bunch of teams working in a common testing framework.

The main thing is indeed enforcing high quality standards even when individual engineers aren't very invested. You've identified some good practices right in your post, but it can take some time for people to learn these principles. And they can be reluctant if they see it as a waste of time. "These are just tests, I need to do my real work!"

For me, the crucial thing here is to avoid building things that are just for testing. If you tell someone that sleeping here is not good enough, and they need to build something more elaborate - then it's much more compelling if you can figure out how to build that so it's not just useful for a test, but also useful in production. This can be things like more flexible configurations, recovery tools for emergencies, new monitoring scripts and systems... all kinds of stuff.

If you stay focused on building things that are flexible enough to be used for both testing and production, then your life gets harder in some ways, but you can be much more strict about requiring high-quality work.

(btw, I'm hiring for the team building this infrastructure: http://catern.com/tsint_job.html )


I'm someone who had to build some mocked services to do end-to-end testing (well, as much as we can). The stuff I work on involves making two DNS requests (to different providers) and a possible HTTP request (for notification), and these three end-points are not under our control (as far as the department I work in is concerned). The two DNS requests are made concurrently [1] and management wanted to test the following scenarios:

* A returns, then B;

* B returns, then A;

* A returns, B returns late [2];

* B returns, A returns late;

* A returns, B never returns;

* B returns, A never returns.

I had to implement a side channel from the testing program to the mocked DNS servers (because a program like bind is just overkill for this---seriously) to implement artificial delays in the responses. Kind of hard to justify that for a production server (and yes, there is an active bug where B returns but A doesn't and the wrong information is returned, but it happens so rarely in production [3] that it was deemed acceptable for now).

The other component, the notification via HTTP, required ensuring that a notification that wasn't supposed to happen, didn't happen. [4] Again, I had to implement a mock with a side channel to the testing program to inform if it was to expect a request or not, and then report after all the tests were run how many requests were actually made. If the value between the testing program and the mock didn't match, it's an error. Oh, it's also useful to inform the mock what HTTP status code to return for the test. Such fun.

Management doesn't seem to think these mocks are a waste of time, but it seems like you might.

[1] At least for now. In the past, there were cases where we were to only contact A; some cases where we contact B, then maybe A; and some cases where we contacted both. This was done to save money at the time because all queries cost us money.

[2] we have some real time constraints on handling queries from our customers, the Oligarchic Cell Phone Companies.

[3] Excessive KPI logging for the win here.

[4] Proving a negative---lovely. Thanks, management!
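
For what it's worth, the notification mock's shape was roughly this (a simplified sketch, not our actual code; the "side channel" is just shared state the test sets before the run and inspects afterwards):

    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class MockState:
        def __init__(self):
            self.status_to_return = 200   # set by the test per scenario
            self.expected_requests = 0    # set by the test per scenario
            self.received_requests = 0    # incremented by the mock

    STATE = MockState()

    class NotificationHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            STATE.received_requests += 1
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)       # drain the body
            self.send_response(STATE.status_to_return)
            self.end_headers()

    def start_mock(port=8098):
        server = HTTPServer(("127.0.0.1", port), NotificationHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        return server

    # After the run: assert STATE.received_requests == STATE.expected_requests
    # catches both the missing notification and the one that wasn't supposed to happen.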


It's a fair point, which shows the underlying problem with e2e: if you don't have a BDFL who's willing to fight on this hill, the system will eventually break down. This implies a huge amount of constant friction that I don't believe is sustainable over the long term.

Most engineering organizations don't have "excellent" leadership, and so most orgs are well served by having team dynamics such that they don't depend on that. A bunch of additional integration tests and a bit of formalization of the difficult parts that e2e tests cover (the sort of async message-passing stuff that has unpredictable bounds) seems like a far better alternative for most orgs.


> is that they are considered second-class pieces of software and only get the attention of QA engineers or developers who are not applying the same level of effort as their runtime code.

Another way to say this is that efficient e2e tests require significant continuous investment in top-tier engineer time. The question then is how much engineering time is worth being spent in that way.

It may be that, yes, you can have fast e2e suites, but doing so is too expensive to justify the cost.


You have to compare that against what was done instead. Their solution was to employ a few engineers to create a new contract-based test framework, which will also have to be maintained. I believe that counts as "significant" investment too, but the calculus has to be whether that is less costly than improving their E2E tests.


> The question then is how much engineering time is worth being spent in that way.

Well, since it brings more value than testing at a lower level, I would say: more than any other kind of test (except, maybe, for monitoring).

Another good question is: is there any kind of test that gives you good results without investing good engineers' time? If you find any, I'd ask you to share (but I would understand if you consider the information a market differentiator and won't).


I have generally found that integration tests with well-mocked external dependencies achieve 80% of the things E2E tests do with a quarter of the effort.


How do you run tests in parallel if part of the logic you are testing is a sql statement?

Do you just test them separately? For example, mock out the db when testing the app and then sequentially test the db to make sure the sql statement works as expected. However, this explicitly doesn't test the integration.


As one example, Django handles test parallelism by creating N test databases (on the single test database server) and dividing tests into N runners. https://docs.djangoproject.com/en/3.2/ref/django-admin/#envv...

You could also have multiple Docker containers running DBs.


Thank you! That's so crazy!


Not so crazy, it's very feasible to roll it yourself! Postgres has a "copy database" feature that's very useful (`CREATE DATABASE xxx WITH TEMPLATE yyy`).

I saw a project on HN a while ago focused on "managing isolated PostgreSQL databases for your integration tests", never used it but looks like a good idea: https://github.com/allaboutapps/integresql
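
A sketch of the template approach with plain psycopg2 (connection details made up; CREATE DATABASE has to run outside a transaction, hence autocommit):

    import psycopg2

    def make_test_database(worker_id, template="app_template"):
        # One throwaway database per test worker, cloned from a pre-migrated template.
        admin = psycopg2.connect(dbname="postgres", user="test", host="localhost")
        admin.autocommit = True   # CREATE DATABASE cannot run inside a transaction
        name = f"test_{worker_id}"
        with admin.cursor() as cur:
            cur.execute(f'DROP DATABASE IF EXISTS "{name}"')
            cur.execute(f'CREATE DATABASE "{name}" TEMPLATE "{template}"')
        admin.close()
        return name   # point this worker's app config at the new database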


One option I’ve used only works if there’s some natural partition of the data like a customer ID. Every test starts by creating a new customer account. Since by design customers can’t see each other’s data, therefore tests can’t interfere with each other and can run in parallel on a single database. After all, in production all your customers are going to be using the database at the same time right? So it needs to work anyway.


Cool. I think I can use this strategy when I add communities.


Here our tests are written in BDD (behavior-driven development) style, mimicking user actions and data expectations. During development, these are run against mocks (either in-memory DB or a mock repository). Individual small scenarios are also combined into realistic long-running processes, for example cases from opened to closed taking various paths.

The suite runs in parallel, fast and frequently, alongside unit tests. Then occasionally, such as before PR merges, the same scenarios are run against a clone of the production environment to catch any mismatch with the run-time environment, also in parallel (connections) to simulate multi-user usage. Any technical issue prompts an improvement of the mocks and rarely resurfaces.

Running these as scripts also doubles as a fake data generator to play with for manual testing, reporting, etc. Lastly, we proceed with some manual testing to validate new changes and pick up UI-related issues - we don't do UI automation.


In addition to the database pool approach, you can also write tests so that they are inherently independent. Each test creates and (optionally) deletes its own data, without making assumptions about what else is in the database. That's not ideal, as it's hard to know you're not making a hidden assumption.


Given that E2E tests should run in an environment that is more controlled than production, if you can't get an e2e test to perform reliably then it's a strong indication that your system won't perform reliably in production.

If an e2e test is not performing reliably not because it can't, but because the test is half-assed, then that needs to be treated as a bug in the test, and the test should not be used to assess the quality of your software. Developers (including me!) have a natural tendency to treat bugs in tests as lesser than bugs in the product, but given that bugs in tests will mask bugs in the product, this is a problem.

True story: a new e2e test was failing randomly. For 6 months nobody looked at it because "it was just a flaky test." A manager found out and insisted that someone fix the test, and it turned out the test was fine; it had just found a (non-deterministic) bug that had been in the product for over a decade.


> Given that E2E tests should run in an environment that is more controlled than production, if you can't get an e2e test to perform reliably then it's a strong indication that your system won't perform reliably in production.

"Reliably" isn't a binary indicator, but a spectrum of how frequently certain classes of bugs may appear in a system.

In the example that you were mentioning, it would appear that the amount of effort needed to maintain the e2e test suite was simply not worth it. How many man-hours were spent by your manager and staff ignoring the test suite? How critical was the bug (it would appear not much)? How much effort would have to be dedicated to get the e2e suite working well that won't be spent doing other classes of tests or feature development?

I'm not saying a well-maintained e2e suite doesn't work well or help to catch a lot of interesting production bugs. But I am saying that I think that for the vast majority of systems it's just not a good use of your time. Save your efforts and put more thought into the system design to avoid certain theoretical classes of errors and devote the rest of your time to better integration tests and that will likely serve more orgs better.


> How many man-hours were spent by your manager and staff ignoring the test suite?

I'm not sure what you mean by that. Ignoring the test-suite isn't something that you spend time doing. It was 6ish weeks with a team of 8ish people, so you could say "48 man-weeks" were spent ignoring it, but they were doing other things in that time, not just sitting at their desk proclaiming "I'm ignoring this test."

Once the manager forced someone to fix the test it took less than one man-day to find the bug, and about 5 minutes to fix the bug once it was found.


I have a Rule of 8 for the testing pyramid that has been roughly stable across three programming languages.

Each level you crawl up the testing pyramid increases run time for good tests by a factor of 6-10. If your functional tests are taking more than 10 times as long as your unit tests there is something wrong that is worth investigating. Usually I set a default “slow” time equal to multiples of 8 over a good unit test and round off to a whole number to invite fewer questions.

But it also means that if your unit tests are running in 10ms apiece, your integration tests should run in about 640ms and your E2E tests in under 5 seconds. Getting most people to make 3 second end to end tests is at least as hard as getting them to push them down the stack.

You need more tests as you go down, but that generally takes about a 5:1 ratio, meaning you still get a 30% improvement in run time for every test you can push down, and sometimes we are using end to end tests to do unit test work, which is going to be 4 to 500 times faster depending on how many cases were really missing in the unit tests.


Yeah I agree with you, and it's because writing a test suite to simulate the users and verify (all) the use cases of your system is really complex. I think the test advocates make it way too easy for themselves when they always just say "the first rule of testing is that your tests should always be fast." See, you broke the first rule, that's your problem! Well... how do you execute a large number of complex operations and verifications, quickly? There's a whole lot of actual practical solutions missing here, and just a lot of obstinate, principle-belittling "rules", deflecting from providing an actual solution, which in practice is hard.


Of course they're the slowest and flakiest - they include the most sources of slowness and flakiness that you have. But, if you want to make your production fast and reliable, they're pretty good gauges if you're headed for success.


Same experience, I'm sure you can design a large e2e test which isn't slow and flaky but for that you need some very very strict set of rules & care. I've personally never experienced one like this though.


Yes, this is true in my experience as well.

They are, however, extremely useful when they aren't flaky. A well-built E2E test can be a huge timesaver when debugging interactions between components.


One of the things that I love about Bazel is it thinks of a binary that obeys a contract as a test. This means you can have things like `sh_test` which just runs a shell script in a sandbox and gives you all of the benefits Bazel has normally for test execution. You get automated caching, parallelization, and remote execution of tests for free.

A great talk about this: https://www.youtube.com/watch?v=muvU1DYrY0w

You can often get situations where integration tests (that cover large features) take less than 30 seconds, only ever execute your tests when it is possible for the outcome to change (a dep has changed), and you can run your tests on a fleet of machines rather than one laptop or CI runner.


> If you maintain speed and strict curation of such a suite

This seems to be the hard part. Any tips for maintaining speed and strict curation, especially at scale (in terms of developers)?

At the very least, it seems that E2E tests are a tool that's easy to misuse. Not sure of the best way to mistake-proof it.


Automatic linter completely banning sleep(). Present reasonable alternative functions instead, like untilServiceAvailable.

Sounds silly and obvious, but look into your test suite and I'm sure you will find sleep everywhere.

Majority of all flaky tests are due to miscalculated sleeps in my experience. Sleep is also the biggest contributor to slow runs, with longer and longer sleeps being used to combat the flakiness. Both factors being the biggest pain points of E2E tests.
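
A sketch of what the lint-approved alternative can look like (stdlib only; the health-check URL is whatever your service exposes):

    import time
    import urllib.request

    def until_service_available(url, timeout=60.0, interval=0.5):
        # Poll a health endpoint instead of sleeping a guessed number of seconds:
        # returns as soon as the service is up, fails loudly if it never comes up.
        deadline = time.monotonic() + timeout
        last_error = None
        while time.monotonic() < deadline:
            try:
                with urllib.request.urlopen(url, timeout=2) as resp:
                    if resp.status == 200:
                        return
            except OSError as exc:   # URLError and friends subclass OSError
                last_error = exc
            time.sleep(interval)
        raise TimeoutError(f"{url} not available after {timeout}s: {last_error}")

    # until_service_available("http://localhost:8080/health")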


I've been thinking about the benefits of only writing E2E smoke tests which cover a small number of critical paths quickly. Seems like most of their problems came about because they wrote more tests than they needed, with higher coverage than was necessary.


If you are only allowed to have a single test in your project, it would have to be the E2E smoke. I've seen systems where master doesn't even start up for weeks but the project is still proud to present unit tests that are green with 100% coverage. One single E2E smoke outweighs all those tests, catching everything from faulty configuration, infrastructure, interface assumptions, integrations, libraries and code bugs. After this it becomes more of a runtime vs coverage balance.

What you need to be aware of is that even one single E2E test can require a significant investment in bootstrapping your test environment. If you are clever you will use this to improve the quality of your production code as well. For example, if you only have a single DB for production and now need to spin up a new instance for test, don't do it manually; instead refactor it into infrastructure-as-code and you've now turned this component into cattle instead of a pet, allowing you to scale both prod and test and giving your ops team a much easier life.


This... I have been doing that in products from hardware to AI models. It is effective and gives reasonable feedback.

I even gave a baiting tech talk at one of the companies about how useless unit tests really are. Useless is understating it; they are often counterproductive. Especially "junior" people, who try to grind the testing mindset, will come and write a little test for everything. Good luck making small inconsequential changes in the future.

If you have well-written unit tests (high-level APIs) then they become worthwhile - but once you arrive there, you just have to go one little step further to move into integration tests and get good cover from failures/changes/bugs in your cloud/infra provider.


> if you don't test end-to-end you aren't going to catch bugs that only appear end-to-end

Points 2, 4 and 7 from the assessment expose why sometimes this is not achieved even in E2E tests.


Well of course e2e suites are not a panacea. That doesn't support killing them. With regards to flaky or hard-to-debug tests, those are implementation-specific issues that should not be used to dispel the entire concept of e2e testing (and can usually be solved by high code-quality tolerances and tracing, respectively).


> That doesn't support killing them.

If you have to wait hours, sometimes days for a queue to run tests that catch 1 bug in 1000 runs and you still end up with bugs in production, I believe this supports killing the current e2e process, if you find ways to guarantee system integrity between services.

I think you are assuming accidental complexity, but hard to debug tests could also be a symptom of inherent system complexity.

Nubank is a gigantic proponent of Clojure and is regarded as having high standards of code quality, so I think that we can give them the benefit of the doubt in this aspect.


The entire article is about the reasons they ditched the test suite and replaced it with a different practice. Does it need to be more specific about the tradeoffs between fixing/rewriting the e2e suite vs. doing something different?


Should a BMW test-driver take a car out on the test track when an engineer/designer is tweaking the glove compartment handle?


Yes. The latch might not be strong enough to handle the centrifugal force when driving hard, or vibrations, etc.

You don't need to go out to the track once per tweak of course. You could very well do a few laps to test out the whole system once in a while.


BMW still does crash testing on finished products (end-to-end tests)... which would cover glove compartment too and how it affects overall safety... (perhaps it breaks up into pointy objects on crash, maybe it opens randomly during driving causing safety issues...)

You wouldn't build a car without doing test drives at the end... or crash testing... or certifications.


I’ve personally seen way more e2e regressions than isolated regressions (mobile dev). It seems to make sense from a high level: it’s easy to test finite/internal behavior (unit test, UI test, or manually), but there are exponentially more cases when integrating any bit of code with any other bit of code.


Agree. Other organisations can effectively run very large end-to-end tests. So it is most definitely possible. If they don’t have the in-house skills to do it then they should hire somebody who can.


What’s incorrect about my comment? Clearly other organisations can write successful end-to-end tests. The fact that you can’t do it doesn’t mean that nobody can.


Buried ten feet deep in the article - they retired E2E tests and introduced "acceptance" tests, which are more efficient E2E tests that they still run on critical code. But I guess "We Renamed Our End-to-End Test Suite" isn't a very good blog post.


Thanks for the feedback, we didn't want to bury the "acceptance test" complement to our testing strategy in the article. If you're curious how it works we've recorded a webinar about it: https://www.youtube.com/watch?v=wKgDaD5Nie4&list=PLfqo9_UMdH...


And in part 2 about 12 minutes in we show the code to exemplify how our acceptance test suite works (and is very different from traditional E2E): https://www.youtube.com/watch?v=caxpxszueI0&list=PLfqo9_UMdH...


I cannot watch this entire video, so is it possible that you explain conceptually how your acceptance tests are different from end-to-end tests?

I would like to understand what the goals of your initial end-to-end tests were and what the goals of acceptance testing are, and how you define these acceptance tests from a test-objective perspective.

I assume end to end tests could be described by this definition: "test the functionality and performance of an application under product-like circumstances and data to replicate live settings. The goal is to simulate what a real user scenario looks like from start to finish" [0]

[0] https://smartbear.com/solutions/end-to-end-testing/


I think what they mean, basically, is "e2e testing absolutely everything was becoming a nightmare, so we've now switched to 'contract based testing' - effectively 'unit testing where the microservice is the unit granularity' - plus some e2e style testing for critical paths where it's still valuable enough to justify all the extra effort".
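
In spirit, the contract half looks something like this generic sketch (using the jsonschema library, not Nubank's actual framework; the schema and payloads are made up). Both sides check the same machine-readable contract in their own fast suites, so the cross-service guarantee doesn't require booting both ends together:

    import jsonschema

    # Shared contract, versioned alongside both services (made-up example).
    ACCOUNT_CONTRACT = {
        "type": "object",
        "required": ["id", "balance"],
        "properties": {
            "id": {"type": "string"},
            "balance": {"type": "number"},
        },
    }

    def test_provider_response_matches_contract():
        # Stand-in for the provider's real handler output.
        response = {"id": "acc-1", "balance": 10.5}
        jsonschema.validate(instance=response, schema=ACCOUNT_CONTRACT)

    def test_consumer_handles_contract_shaped_payload():
        payload = {"id": "acc-1", "balance": 10.5}
        # The fixture itself honours the contract; consumer-side parsing
        # logic would then be exercised with `payload`.
        jsonschema.validate(instance=payload, schema=ACCOUNT_CONTRACT)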


Ok. In this case it is still e2e. In my head e2e is never a strategy to test everything.

I also think they did good optimizations to offload some cases from e2e to some other form of testing, which for me should be integration testing.

If they don't do integration testing then a lot of possible bugs cannot be found. Just testing the input and output of each microservice is not enough.

But I will watch the video as maybe it is better explained there and there is something to learn from this experience.


Yes, I said it was still e2e.

What they meant - I think - was they moved away from "a primarily end-to-end integration test suite" to "service-as-black-box unit-ish testing" plus "focused end-to-end integration testing for critical paths".


Yeah and that's deeply confusing.

At work we have too many components that are tested in isolation, but which have grown to become tightly coupled, so we're trying to build an end to end testing framework.

So from my perspective I'm living in a world where our end-to-end test suite doesn't exist and therefore could be equivalent to "killing it" and it is bad. Each component tests its own contracts, but if there's no global testing that the contracts match in both codebases then you're still shipping broken software.

I thought this article would be some clever way to match client side and server side contracts to ensure that the contracts are identical on both sides and tested so that you could test in isolation then still come away with assurances that the whole would work together.

Instead it sounds like it is advice to only build as many end-to-end tests as needed so that you're reasonably confident that the more isolated unit/functional tests will work, but don't build too many because they're horribly slow, and never adopt a policy that literally everything should be end-to-end tested because that will result in infinitely long running test suites. If you have no end-to-end tests you have no confidence that the software you ship works; if you have only end-to-end tests you have no confidence in your ability to ship software in the future.

So, uh, "clickbait title" I guess is my point?


sorry for the bait. thanks for the click :-D


I don't quite agree. A more precise title would have been:

"We Disentangled our E2E Tests"

If I understand correctly - what seems to have happened is that they separated the lower-level data-coupling testing from their higher-level testing, with contracts and acceptance tests respectively. Those layers don't need to know about each other, so this was a separation of concerns, AKA simplification.


This is an endless debate. Each situation requires a different test setup, but ultimately you can't say end-to-end tests are not worth it. You can have perfectly functioning units of software that are all perfectly unit tested, but the units are not working together (insert a related meme GIF about working drawers colliding when opened). This can happen even with the strongest inter-unit communication protocols such as strong types and validation mechanisms.

E2E tests are very hard to maintain but in many situations they are required.


> you can't say end-to-end tests are not worth it

Or, from another perspective - you are doing end-to-end testing, the question is whether you're doing it before production or if your customers are doing it for you...


Tests are never 100%, customers will always find issues. They made the choice to allow slightly more bugs to production than before, not to go from never having bugs in production to having some.


We didn't see more bugs in production when we sunsetted E2E. In fact, complementing contract tests with acceptance tests, we saw fewer bugs in production and more productive test creation and maintenance. :-D


> you are doing end-to-end testing, the question is whether you're doing it before production or if your customers are doing it for you.

This reminds me of the point I've seen made elsewhere: E2E tests can be complemented by use of monitoring metrics, healthchecks, etc. for providing confidence that the system is working as intended (or for spotting cases where it's not working as intended).


> you can't say end-to-end tests are not worth it

You can, actually.

But here's the thing: I've never seen an honest debate on E2E within an org. When your manager comes to you and says your team is going to start doing E2E, ask him/her if they are prepared for their schedule to slip by 30% or more.

They will either slither back into their office, or (most likely) they will insist that developers write E2E in addition to their current workload of writing unit tests, writing the actual code, and all of the other overhead (pull requests, approvals, JIRA ticket maintenance, interviews, etc.)

Developers are expected to pay the costs of E2E with no impact to the business.

What managers do not understand and has been my experience for many years now is that E2E is at least 30% of the cost of development. And that's probably low. I recall certain features where E2E took probably 200% or more time to get working. Because, unlike most unit tests, writing E2E tests is nontrivial. You may have to invent entirely new techniques and apparatus just for a single test.

If the costs of writing and maintaining E2E tests outweigh the benefits, then obviously it's not worth it. Not every bug is critical. In fact, go back 15 years and no one had any tests whatsoever. The world didn't end.


In my book, E2E tests should be on a couple of basic, mission critical things and integration tests should pick up the rest. It's far, far better to have 10 E2E tests and 1000 integration tests than 0 E2E tests and 1500 integration tests because it picks up failures in your infrastructure or weird stuff like middleware that are probably system wide(ish).


Pretty much this. I found that having loads of E2E tests often doesn't add all that much; usually they're all doing the exact same thing test after test after test, and since these parts tend to be fairly isolated there isn't all that much that can go wrong in just that specific test. Either it works for everything, or it fails for everything.

The way I've always viewed E2E tests is as "testing everything at the top layers" such as middleware and whatnot, which you can usually do with just a few (or sometimes even one) test. Other less high-level integration tests can test all the rest, and they tend to run much faster as they avoid a lot of overhead, and are a lot easier to write and reason about, especially if tests fail.

I once rewrote an E2E test suite to use integration tests, which gave a massive speed-up, and because the tests were a lot easier to work with people actually started writing them. I added a few E2E tests (IIRC logging in, viewing the dashboard, logging out) and that was enough really.


Yes exactly. This mirrors my experience as well. Of course it depends on the individual setup, but making testing easier and faster has huge benefits, but a tiny amount of E2E coverage goes a long way.


I feel like this is also where dogfooding - or drinking your own champagne - comes in, if possible.

We can use our software internally and sure, there are hardware costs and manpower overheads to run an additional instance of our software, but those aren't too high. Hardware necessary to run E2E tests of all the systems at proper scale including maintenance manpower probably eclipses those efforts. And then you'd have to add dev-hours on top of the E2E costs to build and maintain mountains of E2E tests.

And this has exposed really nasty bugs in common paths already, just by employees using the system.


This is absolutely a good attitude, but it only works for a fairly limited class of software. I've worked on things like realtor agent software, child care agency software, etc., and you can't really "dogfood" those sorts of systems unless you want to become a landlord or start a childcare service.


> In fact, go back 15 years and no one had any tests whatsoever. The world didn't end.

You really need to stop repeating this, it's absurd and there was a great post here the other day explaining how in most companies, they had a large QA team that would need to approve any code, it's just that developers were not expected to write the tests themselves.

> If the costs of writing and maintaining E2E tests outweigh the benefits, then obviously it's not worth it.

Obviously, but the question is, what's the alternative? In the blog post, the alternative was to have contract-based acceptance tests... but that may not always be appropriate for every business. We have a huge E2E test suite where I work and I was one of the biggest contributors to creating it... as everywhere else, it's heavy, slow and hard to maintain, but replacing what we have with contract testing would be unfeasible because we're not a micro-service architecture; we are one big application, as we're a product company... I would love to find a better way of testing our product, but contract-based testing is definitely not the answer for us.


“In fact, go back 15 years and no one had any tests whatsoever. The world didn't end.”

That’s just silly. Of course there were tests 15 years ago. Unit testing has been around since the 1950s.

Very little would cause the world to end. But lives have been lost and billions of dollars wasted along the way due to improper testing.


Not in the same way as a modern CI/CD pipeline. 15 years ago you had release-approval meetings where a dedicated QA team would present all their manually observed findings in an Excel sheet to management.

E2E is a replacement for, or evolution of, the QA department, not a replacement for unit tests.


I can hardly think of a situation where I'd want no end to end test.

I think one misconception is that there has to be a single end to end test. Really what you want is a variety of end to end tests examining the functionality of different parts of the system. But the system under test is still the whole system, not the units. These partial end to end tests can still be quick to run, as long as you keep system startup time down.

For example, I work on a system that builds text indexes on an underlying database management system. We take an input mutation with logical changes and then use that to determine what additional index updates are required. This all happens automatically when our users write.

There are two ways to test this. The old way was that we instantiated the top level class that did the changes and manually constructed mutations that look like user mutations. Then we examined the mutations produced by our top level class.

I recently converted this test to use the public write and read APIs of the database to instead write data to a test instance and then check that the index contents were as expected. The public API is more stable than our private one and is resilient to internal refactorings. It's also more amenable to the ad hoc queries you generally do in tests. And it ends up not being much slower, since our test for various sad reasons still had to start the database engine even though it was mostly unused.

All in all, I was able to make the test faster (3 minutes -> 1.5 minutes) and less brittle, while using less code and getting more coverage of what we actually care about. I think wins like this are commonly available when moving from unit to end to end testing, as long as you keep system startup time down.
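For concreteness, the converted test has roughly this shape (a sketch only; `connect`, `write`, and `query_index` are made-up stand-ins for the public API, and `test_db` is an assumed fixture that starts a scratch instance):

  def test_text_index_tracks_user_writes(test_db):
      db = test_db.connect()

      # Write through the same public API users write through.
      db.write({"id": "doc-1", "body": "the quick brown fox"})
      db.write({"id": "doc-2", "body": "a lazy dog"})

      # Assert on index contents via a public query, not on the mutations
      # emitted by an internal top-level class.
      assert db.query_index(field="body", term="fox") == ["doc-1"]
      assert db.query_index(field="body", term="lazy") == ["doc-2"]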


There was a nice talk that used this as an example of how React's engineering team designed their tests to be future-proof: always test the public API, so your end-to-end tests survive a major refactoring/rewrite.


95% of our tests are going through the API, including all the auth. Spin up a server in the test, create some data (we have helpers), call the endpoint, assert on the response, another endpoint's response, or the database state.

That allowed us to carry out _enormous_ refactors (pretty much only the controllers stayed) without touching tests. It's not really harder to write, either: making an API request is no harder than making a function call.
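In pytest-ish pseudocode the pattern looks something like this (a sketch; `spin_up_server`, `create_account`, and `auth_headers` stand in for our helpers and aren't real names):

  import requests

  def test_transfer_debits_the_account():
      base_url = spin_up_server()                       # fresh server + database
      account = create_account(base_url, balance=100)   # helper-created data

      # Exercise the public API, auth included.
      resp = requests.post(
          f"{base_url}/accounts/{account['id']}/transfers",
          json={"amount": 40, "to": "savings"},
          headers=auth_headers(base_url, account),
      )
      assert resp.status_code == 201

      # Assert on another endpoint's response rather than on internals.
      resp = requests.get(f"{base_url}/accounts/{account['id']}",
                          headers=auth_headers(base_url, account))
      assert resp.json()["balance"] == 60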


Great stuff. Refactoring in my codebase is quite painful because earlier owners went in almost the opposite direction. The unit tests pretty much are never the ones that catch our bugs -- it's all caught by random or end to end tests. I'm slowly trying to crawl toward the light with respect to reorganizing tests I come across to use the external API.


Hard to overstate the benefit of not having to rewrite your tests when you refactor. That's a concrete and difficult to ignore benefit of using higher level APIs. I find that the cost of writing a test is often not less than the cost of writing the code. If that is correct, it means that the system lifetime cost of authorship is much higher with unit tests. If all tests are unit tests the cost is almost 100% higher.


E2E tests aren't worth it if they produce false positives and don't prevent defects from reaching production. By definition. Too many devs treat automated testing as a goal in and of itself.


So you are saying that X is not worth it if X is crap? Who wouldn’t agree with that? What if X was implemented well and did what it was supposed to do? We use end-to-end tests and they work extremely well. I have had zero production problems for years now. All issues were found by the end-to-end tests before deploying.


"So you are saying that X is not worth it if X is crap?"

I am because it needs to be heard. Like I said, too many devs treat writing tests as a goal unto itself.


well said!


They didn't say E2E aren't worth it (they actually said it worked for them early on)

Their point is that it doesn't scale the way they think contract testing does


Contract testing isn't E2E testing and doesn't provide the same guarantees. So they are replacing a high cost, high value system with a lower cost, lower value system. So they are literally saying that they don't think the extra value of E2E testing is worth the extra cost.

Personally, I think they are wrong, and the problems they have with their E2E tests are problems with their implementation. Fixing them would benefit the customer experience and the developer experience as well as reducing the costs of the E2E suite. They are absolutely going to have critical customer impact (they are fintech ffs) that their E2E tests would have caught. Of course, whether that actually tanks their business depends on other factors. So accepting more critical issues may be the right thing for the success of the company. Robinhood customers were greatly pissed off by its behavior, and yet it doesn't seem to have hurt them too much. But I wouldn't fucking boast about it!


Our product is an end-to-end testing tool, so it's always interesting to see what issues companies hit with E2E tests and how they solve them. What's interesting about Nubank's experience is that after deleting their E2E suite, they realized that replacing it with integration tests wasn't providing enough value. There's a lot of value in E2E tests, but so many orgs take the wrong approach and end up with a slow, flaky test suite.

We wrote a guide [1] for building automated test suites based on our experience working with and talking to software orgs. Teams who get value out of E2E tests generally do the following things right:

1. They keep tests as small as possible. This makes maintenance easier and forces a separation-of-concerns in the tests.

2. They factor the tests so they can run in parallel. This, plus shorter tests, is the best way to mitigate the slowness issue brought up in the article.

3. They have a good strategy for test data management. It looks like Nubank had test data represented as fixtures, but then somehow manual testing in that same environment was clobbering test data and causing false failures. A better strategy for managing test data could have solved this. Or maybe even just running the automated tests in an isolated environment.

[1: https://reflect.run/regression-testing-guide/]


FWIW I've been in both situations. One company had sketchy E2E coverage that resulted in a modicum of production bugs. I moved to a competitor of roughly the same size that had a huge E2E suite, and AFAICT it resulted in roughly the same modicum of production bugs. But feature development at the latter moves much more slowly because of all the wait queues, flaky tests, timeouts, test maintenance overhead, etc.

IMO these seem to be more of a CYA thing for managers. When a bug does get to production, you need to have something to point to. (And in my previous company of course they're scrambling now to make a big E2E testing platform). But I'm not convinced they're actually worth the effort.

Edit: actually maybe I think they're worth the effort, if for no other reason than when bugs do go to production, executives tend to start micromanaging if you don't have something to point at. That can be worse than dealing with flaky tests. But what I'm not convinced of is whether they actually reduce the number of bugs that go to production.


You have been in two of three situations.

  1. No e2e tests.
  2. Heavy, flaky, slow e2e tests.
  3. e2e as a driver of first class system integration
Compare with

  1. no unit tests
  2. tons of unit tests that regularly fail, nobody cares and people check in more bad tests, "unit" tests that thread-sleep and take minutes
  3. CI/CD with 0 tolerance for failures or >1s tests
The difference between "We have tons of tests" and "We drive development with tests" is night and day. Sure, if you slap on e2e with the mandate "They must exist", then you're going to have a shitty experience, e2e or unit.


>3. e2e as a driver of first class system integration

Could you say more about this approach? That's how I've tried to approach end-to-end tests, but I haven't found much written about this which is specifically about e2e tests.


I haven't written a book, but just take the TDD philosophy and apply it to e2e. For example, if you have an e2e test that often fails because something times out, or you've had to set a high timeout, then dig down into why and fix it. Turns out you have a JVM that often has a 2s GC but no alarms, and yes, it impacts customers too. Fix that problem, don't get rid of the test (or all the tests). You've got a slow third-party thing? Put an abstraction in front of it so that username: MyE2EUser's traffic goes to a shim. Then either negotiate with the third party, or make the interaction with the third-party system asynchronous. Or does resubmitting the page order the product twice?! lol. I wish those were the old days. All that being said, I haven't done this for a couple of years now, and I have the joys of vastly simpler systems at vastly huger scale, so I may be seeing the past through rose-tinted glasses.


If you have an E2E testing platform, and a bug is found in production, can you reproduce the issue with the E2E testing platform? How easy is it?


There’s the classic diagram showing tests as a pyramid[1]. At the bottom you have unit tests, white box stuff that mocks all dependencies and runs super fast. At the top you have E2E tests (or acceptance tests or whatever, names are fungible), but very few of them, to catch bugs at the boundaries between systems. In the middle you have stuff that maybe runs against a live database instead of a mock.

As you go further up the pyramid, tests get more expensive in every dimension: they are slower (and are run less frequently as a result), more expensive to debug and maintain, and maybe a bit flaky, though you should still de-flakify these tests as much as is practical.

The key is to push tests as far down the pyramid as possible. Never test something in an E2E test if it can be meaningfully tested in a unit test. For example, a regression test for a database date/time serialization bug should run against a real database. Meanwhile, a test verifying that an HTTP service returns the correct response code in a specific situation can run against mocks.

1 https://martinfowler.com/articles/practical-test-pyramid.htm... (holy crap that is a mountain of text, but the diagram is near the top)
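To make the last example concrete, here's a sketch of the "runs against mocks" version: a handler test that only cares about the response code, with the data layer mocked out (`make_app` and the injected repository are hypothetical, Flask-style names):

  from unittest.mock import Mock

  def test_returns_404_for_unknown_user():
      repo = Mock()
      repo.find_user.return_value = None          # simulate "no such record"

      app = make_app(user_repository=repo)        # no real database involved
      resp = app.test_client().get("/users/123")

      assert resp.status_code == 404              # the only thing under test here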


> Contract tests allow us to describe the interactions between our services through expectations to be met by their inputs and outputs.

If you say so, and I wish you luck but... I've seen that tried many, many times and never seen it actually work out in practice. It seems like it ought to be workable - there are only a finite number of ways that each service can be invoked after all - but if the goal of automated testing is to find problems before they become production problems, I've never seen "defined contracts" fulfill that goal.


> but if the goal of automated testing is to find problems before they become production problems, I've never seen "defined contracts" fulfill that goal

But, have you seen contract testing fail to catch problems that E2E tests did catch?

I think both of them end up tending to be regression tests a lot of the time.


A complicating factor is that teams usually start with E2E when it’s simple, and only move to other approaches when the systems have become way too complicated.

At that point I don’t know if there is any specific strategy that effectively catches a lot of bugs, short of sending to production and monitor the effect. At a company we just called those tests “sanity checks”, and the goal was just to make sure the most basic use cases would still work, and nothing more.


It says the majority of the problems caught in E2E were due to changed contracts, so it makes sense to have a dedicated test type for those errors. If the remaining caught bugs are few enough and their customers are willing to suffer occasional unavailability, just let them be caught in production and save that expense.


Agree. They are not testing the end-to-end contracts of how concurrent services interact. I predict it will end badly.


E2E tests are required because no matter how well-defined your other tests are or how completely they've tested everything . . . you can't prove that they'd absolutely catch all the bugs.

https://en.wikipedia.org/wiki/Argument_from_ignorance

Kudos to Nubank for whatever combination of logic and bravery led them to their decision.


E2E can't catch all bugs either. This team decided the number of bugs their test suite caught was not enough to be worth keeping it. With a robust canary deployment, they will quickly find and rollback breakages whether or not the e2e suite would have caught it.


Reading the article beyond the title reveals that this decision was engineering-driven and measured, and they ended up with a simpler, disentangled solution, separating contracts that verify compatible schemas on one side from acceptance tests on the other.


Catching all the bugs is generous. You won't even be able to prove basic functionality works when all the components are deployed in production.


E2E tests are still extremely limited and let bugs through, unless they also fuzz somehow. But that will make them even more flaky and difficult to debug, costing more time. It's a tradeoff.

Example: You run some number of operations and then batch them. If you run the exact same operations each time you test, you may not catch conflicts between them. Instead, you'd have to run a random number of, and type of, operations. But then test failures would become extraordinarily difficult to reproduce. You'd have to hope that your tests log exactly what the inputs were, and have a semi-efficient way to recreate those inputs locally.
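One way to keep that reproducible is to make the randomness explicit: log the seed and accept it back on re-runs. A sketch, not tied to any particular framework (`run_and_check` is a stand-in for the system under test):

  import os, random, time

  def test_random_batch_of_operations():
      # Reuse a seed from the environment if given, otherwise pick one and
      # print it so the failure output tells you how to reproduce the run.
      seed = int(os.environ.get("TEST_SEED", time.time_ns()))
      print(f"TEST_SEED={seed}")
      rng = random.Random(seed)

      ops = [rng.choice(["deposit", "withdraw", "transfer"])
             for _ in range(rng.randint(1, 50))]
      run_and_check(ops)        # hypothetical driver for the system under test

  # Reproduce a failure locally with the exact same inputs:
  #   TEST_SEED=1695570000000000000 pytest -k random_batch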


I don’t think it was bravery. I think it was lack of end-to-end test engineering skills. If your tests are flaky then do what you would do with any other software: fix it! If your end-to-end tests are slow then do what you would do with any other software: optimise it! Not having proper end-to-end tests is a massive red flag.


Which can't be proven by E2E either, and which you don't need to prove.


So they ditched e2e in favour of something an average monorepo checks statically, i.e. through TypeScript? Then mocked functional tests and called it a day?

The problem probably started when they put themselves in this microservice-plague setup where they can't spawn a simulated environment in CI anymore. As it turns out, a running system, a.k.a. a deployment in an environment, is a monolithic expression of the microservice spaghetti.

As a side note, flaky tests are such an idiotic concept. There are tests that pass and ones that don't. How good is a button which works 35% of the time? It's not a good button, period. Setting aside the fact that it inflates test runtime more than a decade of a McDonald's diet inflates a waistline - if you find yourself in a setup with flaky tests, you should ask why they are flaky and amend the setup so the test is expressed as a non-maybe-flaky, normal test. Forbid flakiness; there is no such thing as a passed flaky test - those are just shitty tests.


> As a side note, flaky tests are such an idiotic concept. There are tests that pass and ones that don't. How good is a button which works 35% of the time? It's not a good button, period

Or more like the test runner succumbs to non-deterministic flaky behaviour.

If something failed 65% of the time, it would be one of the easiest things in the world to fix.

If it fails .001% of the time, that's what the industry refers to as flaky.

> Forbid flakiness; there is no such thing as a passed flaky test - those are just shitty tests.

Have you ever written and monitored e2e tests over a year? It's industry wide.

Selenium/selenium grid always works great until it doesn't. Ditto with the new kids on the block. e2e outside of a browser is 100% fine unless there's an actual bug somewhere.


Let me clarify, I think I haven't expressed myself well. What I meant to bash is the culture where you wrap e2e tests with retries and consider flaky tests green. This arrangement is a pathology.

The industry did not set 1 out of 100,000 (.001%) or less as the threshold for calling something flaky. From my experience, I've seen tests with a 20% success rate passing due to retries, and teams living with it as normal.

> Have you ever written and monitored e2e tests over a year? It's industry wide.

Yes, on high-profile projects. I find myself repeating how important determinism is. Flaky tests, no matter how frequent, are indistinguishable from bugs, which implies they can't ever be considered green. Architecting the testing environment so that it is deterministic is fundamental. Sometimes it doesn't require big reshuffles; it just means the test has to be rephrased in the deterministic terms that matter, without asserting intermediate, timing-based, racing middle states. As an example, testing random failures by killing services doesn't have to assert intermediate client states; it has to assert that the final state is eventually correct. That implies reconnects did happen and the state eventually converges, regardless of intermediate client states (i.e. retry logic on the client auto-healing itself on idempotent actions, or erroring and notifying the user that the service is offline, in which case the user has to retry the action - both are OK depending on how long the service was offline, both can be progressed from the test's PoV, and asserting that one specific outcome happened is irrelevant).
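In practice, that "final state is eventually correct" assertion is just a polling helper with a deadline; something like this sketch (`get_balance` is a hypothetical read through the public API):

  import time

  def eventually(check, timeout=30.0, interval=0.5):
      """Poll `check()` until it returns True or the deadline passes."""
      deadline = time.monotonic() + timeout
      while time.monotonic() < deadline:
          if check():
              return
          time.sleep(interval)
      raise AssertionError(f"condition not met within {timeout}s")

  # e.g. kill the service mid-flow, then assert only on convergence.
  eventually(lambda: get_balance("acct-1") == 60)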


Ahhh, I've made this mistake before. You can't just test every input/output and give up the ability to accurately exercise stateful user flows. Even Fintech projects I've worked on at the >£10Bn daily volume mark, combining formal methods, mathematical proofs, property-based tests, fuzzing, model-based testing, etc., still caught issues pre-production via good old end-to-end tests.

Good luck nubank :)


thank you


I'll be interested to see how it goes, will you follow up with a blog post in a couple months?


In summary they noticed that their e2e suite mostly caught integration errors where clients and servers had incompatible schemas for the data exchange.

The novelty is that they found a much faster way to identify this kind of error by collecting and comparing the client-side and server-side schemas statically, without even running the code.

This is a great optimisation, but it did not remove all defects so they still need to define tests that validate actual application behaviour against the business rules.


well put! thanks for the great summary.


I am not the typical guy who is going to preach e2e testing when I am in big tech companies. But let's not confuse what e2e tests (or, as I'd like to call them, integration tests) can do.

I love property-based testing, especially with these new frameworks that now do coverage-guided fuzzing too. However, it only guarantees the "contract" (or "interface") at that level. Property-based testing (or contract testing, as this article calls it) today still only validates that the property holds; it does not exhaust all the edge cases. To give an example, a property-based test that validates the function "add(x, y) == add(y, x), given x in Int32 range" doesn't validate edge cases like what happens if you call "add" twice, 3 times, from different threads, etc.
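For example, the add property above as a Hypothesis test: it nails commutativity over Int32, but says nothing about repeated or concurrent calls (`add` is the function under test):

  from hypothesis import given, strategies as st

  int32 = st.integers(min_value=-2**31, max_value=2**31 - 1)

  @given(x=int32, y=int32)
  def test_add_is_commutative(x, y):
      # Checks the property over the Int32 range, nothing more.
      assert add(x, y) == add(y, x)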

At the end of the day, it would be hard for property-based testing to validate that your component satisfies the Liskov substitution principle.

Integration testing, on the other hand, makes sure your system works at the integrated level. It doesn't enforce the Liskov substitution principle either. However, if you have downstream components that depend on your implementation details (for example, an earlier callsite already called a function, so the second call must be cached), updating the upstream and running integration tests makes that implicit assumption apparent.

So, that's where I am arriving at. Without a powerful language that can encode every contract at the programming level, relying on property-based testing at the component level alone cannot maintain the substitution invariant. Integration testing is required.


Don't do comprehensive E2E tests. Have a small set of E2E tests to verify the most critical functionality is working. Don't verify every single business rule. There are other techniques that help you reduce risk:

- Good unit and integration tests.

- Effective Monitoring and alerting.

- Canary releases or Blue/Green releases.

- Continuous integration.

- Continuous delivery.

- Ability to safely rollback the more recent release.


I have experienced all the same problems that they outline. E2E tests require a huge number of human-hours to maintain, they're difficult to debug when they fail, almost always false positives, and bugs still get through anyway. But for many situations, there doesn't seem to be a better solution.

For most early stage startups, it seems that time would be better spent optimizing your deployments, rollbacks, and real time metrics so you can maintain a high velocity and roll back quickly when you make a mistake.

For more safety critical systems, the cost of maintaining E2E tests needs to be built into the total engineering cost for the project. It's a hidden cost that is often way bigger than you'd expect.


Not my experience at all. Having solid end-to-end tests means that I spend 99% of my time adding new features and 1% of my time fixing bugs before deploying. I haven’t had a bug in production for years because of solid end-to-end tests.


Contracts are basically unit tests for whatever size of unit you're testing. How do you capture all the dynamic behaviors of a system without some sort of end-to-end test? Delayed timers, queues filling up, missed interrupts, locks held too long, deadlock, livelock, priority inversion, dropped messages, out-of-order issues, etc. These things are not captured by contracts and are often exactly why the end-to-end tests were flaky in the first place.


System A listens to queue B and handles every kind of message b throws at it. But somewhere, at some point in time, some coder has made the innocuous assumption that B_id's are unique.....


I'm a big advocate for testing, to state up front.

E2E is problematic from the start because of the expectations set by the name. Any sufficiently interesting system is nigh impossible to test "end to end." And, you aren't testing ends, you are testing the "start" of the process, to one of many "ends."

What about only doing "end testing?" Meaning, don't test the beginning. Put unit tests there. Put integration tests between the important components.

It is important to make sure you have coverage with automated tests that prove your system can work at the end of at least some of the processes. Otherwise your QA costs are massive, that never scales, and no one will ever fix it other than by adding more QA. Your innovation will slow to a crawl, much worse than waiting on your test suite.

I'm not sure after reading this article that the authors added a new testing methodology by calling it "contract testing." I'm still confused about what that means. Having said that, I am still confused about a lot of the boundaries between e2e and integration. It always sounds simple, but rarely in practice.

The bottom line: the organization as a whole has to see the value of testing. That's harder work than writing the tests for sure.


The middle ground that not enough teams are exploring is following the so-called Functional Architecture. If all side-effects are effectively segregated and reified, then one should be able to swap them out for deterministic mocks that run instantly.

So you could E2E a distributed system realistically and instantly.

You can still simulate things like services being slow, unavailable, etc. if the code handling those cases is expressed as pure logic instead of coupling itself to IO.

Interestingly, in addition to fixed examples, you can perform generative testing over this setup e.g. what happens for various combinations of services being slow/down.
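A toy sketch of the shape (names are illustrative only): the decision logic returns reified effects as data, so tests can interpret them instantly and deterministically instead of doing IO:

  from dataclasses import dataclass

  @dataclass
  class CallService:                  # an effect, reified as plain data
      name: str
      payload: dict

  def handle_payment(payment):
      """Pure decision logic: returns the effects to run, performs no IO."""
      if payment["amount"] <= 0:
          return []
      return [CallService("ledger", {"debit": payment["amount"]}),
              CallService("notifier", {"msg": "payment accepted"})]

  def test_rejects_non_positive_amounts():
      assert handle_payment({"amount": -5}) == []

  def test_debits_ledger_then_notifies():
      effects = handle_payment({"amount": 10})
      assert [e.name for e in effects] == ["ledger", "notifier"]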


exactly! well said. it's what we're trying to achieve with our acceptance testing strategy. you can see more here about how we leveraged clojure to be able to simulate E2E in memory in the JVM by bypassing IO and just having one service's logic+data layer talk to the other's: https://www.youtube.com/playlist?list=PLfqo9_UMdHhah_gNPnawX...


Nice! I missed that nuance in the article.

Looking forward to eventually checking out Sachem, if there's a plan to share it?


Why not both?

Beyond unit tests, we have docker compose spin up our service(s) and their dependencies. If those dependencies have too big a web, we may point at a staging instance or a fake server, but we routinely spin up dependencies that run a local Kafka and ZooKeeper, are backed by MySQL and Redis, etc.

We then test our service at its incoming edges (feed its incoming queue or call its endpoints) and verify its output (via logs, metrics, and sinks).

We also have end-to-end tests that exercise our services from the customer's point of view, but take place in our staging environment. These do suffer from many of the points the article raises, but we run these tests concurrently, and, when not flaky, they can pass in 10 minutes.

We are addressing flaky tests by addressing their root cause: flaky services in staging. We are expecting teams to have mature monitoring of services in staging and tying improvements directly to flaky failed tests. We are also improving traceability so a failed test is easier to debug to understand if it was a failed service request somewhere in the stack.


End-to-end test suites do not excel in continuous integration workflows.

They excel as part of your metrics, monitoring and alerting system running continuously for as long as the service they exercise lives.

Perhaps it wouldn't be as painful if they took this approach instead.


that's a very good alternative way of thinking about E2E


If your system can't be tested end-to-end because of how slow and flaky it is to test that way, doesn't that say something about the quality of your system rather than the tests?

Of course once you have such a system, it's probably the result of years of work by many people and most likely it would be hard to make it faster and more reliable. That is probably why people shy away from doing that, and choose to blame the tests instead.


If your tolerance to deal with regressions and bugs in production is high and you have millions of users, then you can think of the user as the end to end tester. Maybe you ship some change and put it behind a feature flag and make it available to only 2% to 5% of the users.

If you get 1000 users to go through a particular flow and you have a way to collect failure signals from production accurately and in real time, then you can just dial down that flag to 0% if you see a lot of production errors.

I'm still not sure you can drop e2e testing completely, but maybe, if your business allows it, you can confidently rely on unit testing or contract testing without having to run the app through all the user flows for every change.


> Manual changes in our staging environment corrupted test data fixtures

there's a lot here.

Manual changes in your staging environment shouldn't affect your tests, because your tests should be isolated from other environments.

Also fixtures are generally bad. Given some fixture representing an initial state S, a test utilizing this fixture along with some acceptance criteria is essentially testing that given the state S, running the tests executes some transformation T such that the state of the system is now S2; acceptance criteria evaluate S2 against some known-good value to confirm that T is the desired transformation. This is meaningless if the initial state S is not actually reachable by the system. The fixture itself does not prove that S is reachable: that S is reachable is taken as an act of faith.

So how do you determine that the initial state S is reachable by the system? Well, you have some other test that starts with an initial state of nothing, performs some transition (generating and inserting random data instead of using a fixture, for example), and takes the system from nothing into the state S. By doing this, you've both created the state S _and_ verified that S is a valid, reachable state. Now you run your second test after the first test in sequence. To run N tests off of initial state S, you replay the initial test that produced state S N times, once for each dependent test. Sure, that's a lot of work, but each sequence of testing events can be run in isolation from the others, so they can be run in parallel.
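A sketch of the idea in pytest terms (`api`, `signup`, `deposit`, `withdraw`, and `balance` are hypothetical helpers): the fixture drives the system's own API from nothing to S, so S is reachable by construction:

  import pytest

  @pytest.fixture
  def funded_account(api):
      acct = api.signup(email="s@example.com")   # nothing -> empty account
      api.deposit(acct, amount=100)              # empty account -> state S
      return acct                                # S is reachable by construction

  def test_withdrawal(api, funded_account):
      api.withdraw(funded_account, amount=40)    # the transformation T
      assert api.balance(funded_account) == 60   # acceptance criteria on S2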


Sometimes you have "state S that we got from a coredump from a customer that happened once every 30 computer-years in their deployment so we know it is reachable, but haven't ever seen it happen in-house"


'E2E' is often impossible, as in most ecosystems things are constantly changing and you will have dependencies out of your control that you cannot simulate. The key is faking the right dependencies with accurate-enough versions to keep test fidelity and speed, and keeping the test svelte and fast enough that it can run before you merge code. That limits the size of the change being tested, so it's easier to understand the outputs, decide whether they represent a false positive or negative, and see where any problems may be. This all also requires building the infrastructure to spin up a simulated world quickly enough to see how the proposed change affects the simulation and then analyze the results, which is also pretty hard and can get expensive.

Luckily, spending time optimizing often helps test speed and can control cost, so there can be a good case to make for it, but orgs have to be willing to pour engineering hours into that, and engineers need to want to do it vs. building new things, which is typically more enticing.


Test against APIs, not implementation! Your APIs are supposed to be stable. If they are not, then you are doing it wrong.


Test suites will tend to fail the more your system has to work with "outside" data. I recently had a client where their own data was entirely dependent on data drawn from 23 different 3rd party APIs, which meant the bulk of their code was devoted to parsing APIs they had no control over. Those external APIs sometimes changed, and sometimes contained bugs (that is, violations of published contracts).

To talk about this, I use a broad definition of "outside data". If you're a small startup, "outside data" typically refers to data that belongs to another company. But if you're working in a Fortune 500 company, "outside data" can also refer to data coming from an API run by some other division, which is nominally part of "your" company but is effectively independent.

One rule I now offer to my clients: the more your system relies on outside data, the more it is helpful to have run-time checks rather than a test suite. Assuming you run your code on multiple machines or nodes or dynos or instances, you can choose to run the checks on just a percentage of your system, enough to detect problems, but without paying the performance price on 100% of your system.

When a problem in your system is because of a change in an external API, your test suite won't catch it, since your test suite works with dummy data. But run time checks will catch the problem and make debugging easy -- you'll see almost instantly which API call created the problem.

Code written on the JVM has the beautiful property that you can add pre- and post-assertions on every function, and you can pass a flag to the compiler asking that the assertions either be left in the code or stripped out. This makes it easy to build 2 copies of the code, one with the asserts and one without, and that in turn makes it easy to deploy the code in such a way that only a limited percentage of your instances needs to run those run-time checks.
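The same trick works outside the JVM. For example, a Python sketch where the checks are gated behind an env flag that you set on, say, 10% of instances (Python's built-in `assert` can likewise be stripped wholesale with `-O`):

  import os

  # Set RUNTIME_CHECKS=1 on a small fraction of instances; the rest skip
  # the extra work entirely.
  RUNTIME_CHECKS = os.environ.get("RUNTIME_CHECKS") == "1"

  def check(condition, message):
      if RUNTIME_CHECKS and not condition:
          raise AssertionError(message)

  def apply_discount(price, percent):
      check(0 <= percent <= 100, f"percent out of range: {percent}")
      result = price * (100 - percent) / 100
      check(result <= price, f"discount increased the price: {result}")
      return result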


At $PRIOR_JOB, it always felt like the full E2E tests were close to useless, since for every bug successfully caught, it felt like there were ~20 false positives. At which point, everyone (myself included) blamed the tests and just repeatedly reran them until they usually passed. Every single failure would halt the pipeline for anywhere from 5 minutes (in the case that rerunning the failed test showed it was just flaky) up to multiple hours, since everyone would rather try to diagnose/hotfix the issue than revert their code to unblock the pipeline.

With that being said, a full run of the E2E suite at $PRIOR_JOB took very, very low double digit minutes so it wasn't that expensive. Rerunning a handful of failed tests took single digit minutes so it wasn't too terrible.


Was in a similar situation, and the VP of engineering banned the practice of rerunning failed tests, so flaky tests caused everybody pain. In less than 8 weeks the false positive rate dropped by about 3 orders of magnitude. There's a strong tendency to treat tests as a hurdle to get over rather than to treat them as first-class part of the development process.


I imagine this would just turn into everyone inserting 10 second pauses on the tests that fail. Which works, but now your suite doubles the run time. Actually turning nondeterministic tests into deterministic ones is... hard. Really hard in some cases. Many devs don't even understand how to get there, even after years of E2E experience.

One place I worked, the E2E suite took a full hour to run. Everyone reran the tests. Merges took a full day in many cases. Management tried to force people to fix broken tests. But they also required new tests on new features. So it was a constant treadmill. There was basically a full mutiny by the end and the company killed off their entire E2E suite.


If people just started throwing random sleeps into tests, I think management would shit a brick. Do people throw random sleeps into production code to fix bugs where you work as well?


Not GP, and fortunately not often, but I have seen that done to overcome race conditions. I pushed for it to be corrected by using a proper design. That was a stupidly hard fight, though.


My pet peeve is people sprinkling C's "volatile" keyword in places. Since doing so inhibits many optimizations, it changes the timing and can make race conditions appear to go away.


Yep. Lots of effective ways to paper over issues without actually resolving them, and often disguising them so that resolution becomes nearly impossible later.

Worse, things like the introduced sleeps in some of the systems look legit. There are reasonable times to introduce a timed delay into your program (3rd party APIs have a rate limit, 1 request per second or 10 per 30 seconds or whatever). Depending on how you introduce these extra sleeps, then, it's possible that they'll look like they satisfy a valid requirement, when the reality is that they exist to cover up the absence of things like proper use of locks/mutexes or other elements.


At one place I consulted, the fte lead ignored flaky tests and attributed failures to the tests being wrong.

A few months later...

The code that was failing intermittently was found to be using floating point types for money. Yeah, I'm gonna wanna fix that.


Right if you have flaky tests there are 3 acceptable responses:

1. Fix the test

2. Fix the code that is being tested

3. Say "well we don't need this software to be reliable anyways so let just stop running tests"

But many places seem to adopt hidden option #4 "Run the tests and ignore failures"

A related issue is dialing the tunables for warnings up to 11 and then not reading any of the warnings. Once I saw a case where the build generated 1000s of warnings. I found a bug and said "this would be flagged as a warning even with relatively low warning settings", and sure enough it was.

Obviously fixing warnings is good, but if they had just lowered the warning setting to be something reasonable, they would have had maybe 10 warnings total, one of which was a bug, which is a lot more useful than 1000s of warnings, at least one of which was a bug.


Option #4 is just option #3 but keeping the costs of running tests you ignore.

You're right about excessive warnings, but then sometimes they're not excessive. Running `gcc -Wall` used to be considered madness, and if you did it now on a codebase that has been around a while and not been kept clean, you'd drown in messages. The key is to turn it on from the very start and fix things when there are 10 warnings instead of 1000.

This decay happens with test suites, too. One or two tests start to fail, and instead of fixing them, people ignore the failures. A bit later, it's five tests, then 10, and pretty soon the programmers see the tests as broken instead of looking at the failures that let things get to the point where there are so many failing tests.


The fix for both situations is similar though; dial down the {warning strictness|number of tests run} until you get a clean {warnings|test-run} then enable them one by one in order of how easy they are to fix.


Obviously the E2E tests were really badly implemented. Implementing solid E2E tests is a skill that needs to be learned like any other software development skill. Most developers don’t know how to do it well.


The sad fact of E2E is that the tests genuinely find broken stuff. The “false negative” test results usually just mean false as in “something was broken, just not what the CI claimed was broken.”

It could be anything, so you need automatic specificity as to what’s broken (hard) or buy-in from the entire organisation to be on standby for finding broken stuff (also hard.)

“Anything” as in if your external DNS provider has 1 of 10 resolvers with an out of date zonefile, or a dodgy switch port to that particular resolver.

It’s hard but if it’s broken then it’s likely it is a real issue one of your end users is also experiencing. A commitment to E2E is committing to a level of quality across your entire infrastructure that few people are prepared to own.


This is our goal with E2E tests and monitoring — write tests that fail if and only if customers are experiencing issues — and I don’t know how to achieve that level of I-can-sleep-well-at-night assurance of quality without it. We run our E2E tests continuously against prod from a different cloud region.

As you wrote, the challenge is that there are literally 100 things that could cause a test to fail (some of which are outside your control), and as your team scales you’ll have to get smart about how to efficiently dispatch people to fix problems.


This is a very bad list of complaints and it actually makes me angry to read it.

> Engineers had to wait more and more to get feedback from this long-running suite

So speed up your tests. Run them in parallel. Find better frameworks for running tests.

> Flaky tests meant that we had to re-run the suite frequently to see if something was really wrong or just a false negative;

Fix your flaky tests! Why anyone just accepts that "Oh, sometimes that test fails and we have to restart everything" is beyond me. Root cause the problem and FIX IT.

> Manual changes in our staging environment corrupted test data fixtures and maintaining the environment “clean” was a challenge;

Tests should not rely on pre-existing state. Have a setup phase for each test that creates new data in the state you want it to be in. As the test makes this data, also note down a reference to it with a Time To Live so that a follow-up process can clean up the unneeded data, as in the sketch below.
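For example (a sketch; `api.create_customer` and the sweeper job are hypothetical):

  import time, uuid

  def make_test_customer(api, ttl_seconds=3600):
      return api.create_customer(
          name=f"e2e-{uuid.uuid4()}",   # never reuse shared fixture data
          tags={"purpose": "e2e",
                "expires_at": int(time.time()) + ttl_seconds},
      )

  # A scheduled job deletes anything tagged purpose=e2e whose expires_at is
  # in the past, so even a crashed test run can't pollute the environment.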

> Test failures were very hard to debug

That's not the fault of the tests, that's the fault of a complex system that is hard to debug. Improve your tracing between services.

> Queueing of commits in the End-to-End suite resulted in less frequent deployments

There are well known solutions to this problem. Lots of companies have overcome this already.

> Few bugs caught in this stage. One experiment suggested that, for every 1000 runs, we had 42 failures, only 1 bug

If your tests have false-positives, you need to adjust your tests. Accepting that the test failed but there isn't a problem, and then not fixing the reason the test failed means that you don't have reliable tests.

> Bugs were still being found in production

Bugs will always make it to production. But after you fix a bug, you write a test so that this bug cannot happen again. Over time, the number of possible bugs that can make it to production shrinks.

And lastly:

> The main difference to the old E2E is that they encompass only a subset of services and don’t require spinning a production-like environment (the services run in memory on a single JVM and HTTP/Kafka communication is replaced by in-process communication). They are used in specific flows that we find too critical to only rely on Contract Tests.

Running tests against different infra than your customers have to deal with is asking for trouble. What bugs will exist in the real production infra that won't in your fake infra?


> Running tests against different infra than your customers have to deal with is asking for trouble. What bugs will exist in the real production infra that won't in your fake infra?

We actually do pretty well testing against fake infra.

We have a large test suite that enforces the contract on our REST server API. That is implemented both in one heavy server written in Erlang, which is the production code, and in one lightweight server written in Ruby, which would never scale but exposes the same API. When the test suite is updated, both implementations need to be fixed. When the client code runs integration tests we can test against the lightweight Ruby code, and when it passes we actually have pretty high confidence that it will also pass against the production code. We have hundreds of those tests and they can be run as fast as spinning up a Ruby process with an in-memory datastore which is trashed on every test. Compare that to end-to-end tests that might fire up a set of images, terraform them into production servers and clients, run a scripted interaction or set of interactions, and then throw all that away and do it again.

At some point there's a tradeoff between the realism of your tests and the cost of them and how many of them you can do. The right strategy is that you want to have enough of the most realistic tests to give you a high level of confidence that your faster, slightly less realistic tests are useful, and by having those faster tests you increase your amount of coverage, and on down the stack iteratively until you may get to unit tests of individual objects.
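In pytest terms, that dual-target setup can be as simple as a parametrized fixture that points one suite at both implementations (a sketch; `start_server` is a hypothetical helper, and the real suite here is presumably Ruby-side):

  import pytest, requests

  @pytest.fixture(params=["lightweight", "production"])
  def base_url(request):
      server = start_server(kind=request.param)   # fast fake or the real thing
      yield server.url
      server.stop()

  def test_unknown_resource_is_404(base_url):
      assert requests.get(f"{base_url}/widgets/nope").status_code == 404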


> At some point there's a tradeoff between the realism of your tests and the cost of them and how many of them you can do.

Yeah, you're not wrong. I'm just griping because of the previous list of complaints.

There are some benefits to running your tests against a mirror infra to reality. But they are limited. The bugs they catch that you won't catch running fake infra are very small in number, but terrifying in difficulty to solve.


The article did make its points fairly poorly.

I think in the end they more or less did what you suggested as well they just didn't call it end to end testing.


Agree. They clearly are lacking in test engineering skills. So the right solution would have been to hire somebody who knows how to do this well. Bad tests are simply bad code. So fix it!


> One of our Sr Staff Engineers, Rafael Ferreira, ran some numbers and applied queueing theory.

Is it just me, or is queueing theory seemingly misused all over the place in software engineering? I don't know what was applied here, as it's omitted, but I've certainly seen people suggest things like mandating that cycle time be decreased in order to increase throughput, and justify what they are saying with "queueing theory".

Mathematical constructs are great, but they aren't worth much if you can't ensure their constraints are met.


All models are wrong. Some are useful. The problem is knowing if a specific model is useful.


> One of our Sr Staff Engineers ran some numbers and applied queueing theory.

Queueing theory is an excellent way to look at E2E testing for the big-picture view and also for drilling down into each of the relevant services.

For a quick intro, this is a queueing theory primer that I've written and shared with HN previously: https://github.com/joelparkerhenderson/queueing-theory
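The punchline for CI queues falls out of the simplest M/M/1 model: average time in the system is W = 1 / (mu - lambda), so waits blow up non-linearly as the commit arrival rate approaches the rate the suite can absorb. A quick illustration with made-up numbers (not the numbers from the article):

  service_rate = 4.0                              # suite runs the queue can absorb per hour
  for arrival_rate in (1.0, 2.0, 3.0, 3.5, 3.9):  # merges per hour
      wait = 1.0 / (service_rate - arrival_rate)  # M/M/1: W = 1 / (mu - lambda)
      print(f"{arrival_rate:.1f}/h -> {wait:.2f} hours to clear CI")

  # 1.0/h -> 0.33 hours ... 3.9/h -> 10.00 hours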


I’m generally not fond of e2e or even integration testing. At least, I prefer to keep them to a minimum, and use other tools to ensure units interact as expected.

That said, where e2e tests may be valuable but costly as described in the article, it occurs to me that narrower integration tests which invert responsibility may be better. Which is to say:

- Given Service A

- Given Service B which depends on Service A

Integration tests of Service A may provide more value if implemented in Service B. It’s SB, after all, which understands the behavior it expects from SA. (If they’re mutual dependencies, of course the inverse applies as well.)

Of course, this highlights (at least for me) why integration tests should be limited in scope. If both services are well tested at the unit level, you will probably end up with a lot of redundancy between their reciprocal test suites. But at least at the idea level, this feels like a better compromise than expecting Team SA to anticipate all of the subtleties Team SB might have in mind.


Yeah, I think the easiest way to test that Service A is meeting its API obligations is to send some requests to Service B. Hyrum's Law means that that's the only way to really test the important aspects of Service A's API.

And I think this can be generalized into a general philosophy of using your users as a test suite for your API: http://catern.com/usertests.html


What are the tools you do use to ensure units interact as expected?


Mostly making types more specific. Within a service, ensuring semantics are part of the type so they can be checked statically, and designing function boundaries so those semantics are part of the interface.

Between services, using standardized machine-usable docs (like OpenAPI and JSONSchema) to share those types in a well-defined way. This is harder because network boundaries are less flexible, but it does help a lot.
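A small example of the first point in Python terms (mypy as the static checker; the names are illustrative): distinct id types that the checker refuses to mix up, so the semantics travel with the type.

  from typing import NewType

  UserId = NewType("UserId", str)
  AccountId = NewType("AccountId", str)

  def close_account(account_id: AccountId) -> None:
      ...

  close_account(AccountId("a-9"))   # fine
  close_account(UserId("u-123"))    # rejected by mypy: UserId is not AccountId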


We've seen this before. At first, the E2E tests automate a bunch of tedious manual tests. Then they become the team's "automated test suite." Then nobody (except true believers) tests their code anymore, because "our automated test suite" will surely catch any problems. Then everything starts breaking down the way the author cites.

The E2E test suite needs to be thought of differently from a "test suite." It's the last safety net to disprove that the build is worth [manual testing | dogfooding | beta release | prod release]. Any bug found there should be worth a postmortem -- even one as short as "oops, new guy forgot to update the unit tests! fixed, won't happen again!" Of course, bugs will get through to the safety net, even in a system that's working well. But they should always trigger the question whether and how the bug could have been caught earlier.


Almost all the bugs I’ve ever seen have been an integration or configuration issue. Race conditions are a common example of this. Even the small bugs usually involve 2-3 “units.” End to end tests and integration tests, manual or automated, are really the only way to catch these


This sounds like a garden variety example of an organization moving to event-based architecture and implementing contract testing.

What makes this interesting is that they had decided early on to be a Clojure/Lisp shop so there weren't great tools out there at enterprise scale.

I don't understand why they decided to build everything on Lisp; I assume it was best for them at the beginning. But given the known lack of tools for Lisp to scale into large enterprises - this is the kind of thing I would always consider once you get past the MVP phase. Otherwise you have to end up building all your own stuff.

Architecture matters.


What enterprise tools are you thinking of? Because Clojure runs on the JVM, I'd imagine what works for Java also to work for Clojure.

Personally I see Clojure as a perfect choice for an ever-increasing amount of complexity, since you'd keep side-effects on the edges of your app and everything else as pure functions. Every occurrence of enterprise OO I have ever seen has been an unmanageable mess that burns everyone out within a quick year, because you have no idea where something begins or ends, what goes on in between, and instead of doing what actually matters (business logic) you spend most of your time creating abstractions.


Erm, most people need all kinds of tests; it's not a story of one vs. the other... Unit, integration, end to end... The end-to-end tests are always the slowest and flakiest, so not everything should go there, but with any mature-ish product it's also impossible to catch issues related to system complexity without them... It's also weird how self-important this account reads. The field is not new, but they don't really describe how their homegrown stuff does better than other frameworks, and they present themselves as visionaries... Smelly


I’m not sure I follow the logic that their e2e test suite would take an “infinite” amount of time to run by 2021. It seems like an obviously faulty calculation, unless someone puts an infinite loop.


I think what they meant is that at the pace they were committing code to production, the e2e suite would never stop running.

This is the most charitable interpretation.


The way we solve that at Brex scales logarithmically: we use the bors merge bot, which batches everyone’s PRs together and will binary-search for the offending PR if any build fails. I’ve hardly ever waited longer than 2x the length of time our suite takes. They could also decouple CI and only run code tests for upstream changes, so it wasn’t obvious what they meant; thanks for clarifying one theory. I’m not sure I’d agree that it’s a problem with e2e tests per se; rather, the problem is slowness itself, and monolithic CI. Not doing e2e tests can be a valid tradeoff to avoid slowness, but I’d also point out these issues can be solved without deleting the e2e suite…
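The binary search itself is nothing exotic; under the simplifying assumptions of cumulative batches and exactly one culprit it's roughly this (not bors's actual code):

  def find_culprit(prs, build_passes):
      lo, hi = 0, len(prs)              # culprit index lives in [lo, hi)
      while hi - lo > 1:
          mid = (lo + hi) // 2
          if build_passes(prs[:mid]):   # first `mid` PRs are fine together
              lo = mid                  # so the culprit is in the later half
          else:
              hi = mid                  # the culprit is among the first `mid`
      return prs[lo]                    # O(log n) extra builds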


you’re right sir!


The real problem is that they don’t know how to write fast end-to-end tests. Not having solid end-to-end tests is a massive red flag. Also, each service can be simulation tested in isolation. So even if they don’t have the skills to do end-to-end testing, they could at least do per-service simulation testing, with each service team responsible for running those tests independently.


> The support for messaging tests was immature in the JVM implementation: most of the critical interactions between our microservices occurs through Kafka messages (we favor mutations in asynchronous flows while HTTP calls are mostly reserved for read-only operations).

Trying to wrap my head around what is meant by that. I mean I get the second half, but the first half not so much.


“Instead of investing in making an existing tool better, we built our own thing!”

Oye.


> In our analysis, we figured out that the most frequent category of bugs caught by End-to-End tests was schema violations.

Schema violations are pretty much just type errors.

Fortunately these can be prevented automatically and with 100% confidence without writing even a single test.


Types are compile-time checks; they have nothing to do with contracts. Having contracts for messages that will transit a queue is still useful even if you are using a typed language.


Not in all environments, e.g. clojure.spec, or with MyPy reflection etc.


clojure.spec is not a type system. Still, my point is, just using a typed language won't remove the need for contracts; you would still need to roll your own solution like Nubank did, even if it means using MyPy reflection features.

For example, imagine you have two services that communicate through a message queue. Service A produces X as a string, but Service B consumes X as an integer. You can type that, both services would compile, but it would break as soon as you tried to consume that message. And yes, you can build something using MyPy reflection or whatever, but you have to build it anyway.
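Which is why the usual fix is a shared, versioned schema that both services validate against in CI; a sketch with jsonschema (Avro or protobuf schema registries play the same role):

  import jsonschema

  TRANSFER_SCHEMA = {                           # the shared contract
      "type": "object",
      "properties": {"x": {"type": "integer"}},
      "required": ["x"],
  }

  def test_producer_output_matches_contract():
      message = {"x": "42"}                     # Service A emits X as a string
      jsonschema.validate(message, TRANSFER_SCHEMA)   # raises ValidationError
                                                      # in A's CI, long before Kafka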


Hm? Give the message a type. Service A or B would fail to compile, depending on whether the static type of X is a string or an integer.


AFAIK, Kafka messages don't have "types", and even if they did, you would be relying on an external system, not your type system. If you are not convinced, test it yourself: create two services and a Kafka topic and produce a message from one service to the other with different types on each side.


Hogwash. You did E2E testing badly, and now you’ve replaced it with something else. Which may or may not be as bad, but is certainly bad faster.


I work a lot on compilers and VMs and have written tens of thousands of tests at different scales over the years. Different kinds of tests serve different purposes.

Unit tests help you pinpoint errors in the code. They can exhaustively test (only) small components to make sure they are fully compliant. They are a refactoring and development aid to the extent that they are focused (don't involve too many components), quick (run in seconds or less), not too tightly coupled to the code under test (i.e. you can change the code under test without changing the tests), and explanatory (failure output is easy to understand and points exactly at the faulting component). Making good unit tests is an art form. Some people love their mocking frameworks. Personally, I hate them. Mocks are confusing, and they make refactoring hard because they check behavior rather than input/output results.

Integration tests are about putting one or more systems together to test their interactions. More than just a single unit, we can put services together and test their interface. They can be more exhaustive about testing a component's interface because the combinatorics haven't exploded yet. Because there is a lot more code under test, failures are less explanatory and thus there is more work to investigate these failures. Investments here that help are to make failure modes as helpful as possible. That, too, is an art form.

End-to-end tests are inherently going to be slow. We put the whole system together and run some canned interactions on it. It might be flaky (because large scale, because networks, because OOM, timeouts, etc). End to end tests are generally a bitch to debug, because essentially anything could be at fault...well, anything except the things that are clearly passing their unit tests and integration tests. Which is why you need to have good unit and integration tests, so that you don't need many end-to-end tests.

It sounds from the article like they reduced or eliminated their end-to-end tests and went for more integration tests. That does seem to have paid off. Sometimes tests are slow and bad, and other kinds of tests are better.

I would say though, working now on a system with many, many, distributed moving parts, you do want to at least have some end-to-end tests that make sure everything comes up properly. Nothing like committing a change that passes all the small scale tests and then a component fails to come up because some stupid command-line flag is set wrong. You gotta have tests for anything you could absent-mindedly break.

And all of that testing needs to be one button push away. You can't have tests that developers don't run, or don't know exist. Personally I like having shell scripts that are checked in, and at least one that does the whole enchilada, even if it is just a wrapper around the build system's or CI's test targets.


Agree. I wouldn’t be able to confidently deploy my code into production without solid tests. I haven’t had a production bug for years because those solid tests catch problems early.


It's not one vs the other. Both kinds of testing are 100% essential to a stable service.

If you have a flaky, laggy E2E test suite...fix it.


If you integrate with a third party service and their environment is slow, because it's not a production environment, your tests will fail due to timeouts. How would you fix this?

One way, I would think, is to not test against their service directly: create a similar fake service and run it according to your SLA. But then you have to make sure the contracts stay in sync, so you need to verify them from time to time.
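
A rough sketch of that periodic verification step (the URL and fields are invented): a scheduled test hits the real provider's sandbox and checks its responses still match the contract the in-house fake was built from.

    import json
    import urllib.request

    # The contract the fake was built from: the fields and types we rely on.
    CONTRACT = {"id": str, "status": str, "amount_cents": int}

    def fetch_sample(url: str) -> dict:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.loads(resp.read())

    def contract_violations(payload: dict):
        missing = [f for f in CONTRACT if f not in payload]
        wrong_type = [f for f, t in CONTRACT.items()
                      if f in payload and not isinstance(payload[f], t)]
        return missing + wrong_type

    def test_provider_still_matches_our_fake():
        payload = fetch_sample("https://sandbox.provider.example/v1/payments/123")
        assert contract_violations(payload) == []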


If that picture is from the actual company, it doesn't look like a good place to work. That environment is horrible.


"We can't develop proper end to end pipeline, so we killed it and patted ourselves on the backs".


For rewrites and refactors of integrations, there is no substitute for E2E testing.

I’m hopeful to be proven wrong, though.


As a rule of thumb: If you're not in charge of testing, your users will have to test it for you.

Looks like Nubank's e2e suite was flaky and poorly optimized, so now they will try to `silo` tests in the form of contracts.


Isn't this what protos are for?


So, they're not testing at all. Just inputs and outputs. mmm ok.


Wait, what? Do people actually work in conditions like that article's picture shows? Good Lord.


I don't know anything about what Nubank is up to or how things work there overall, but integration tests are absolutely worth doing. The argument against this to me reads like "coordination and testing of big systems is hard, so let's not do it."

> Waiting. Engineers had to wait more and more to get feedback from this long-running suite;

"Our tests are inefficient, not sufficiently parallelized, the setup / tear down of the test environment isn't optimized, and it isn't possible to run only the relevant subset of tests during feature development or bug triage for short feedback loops"

> Lack of confidence. Flaky tests meant that we had to re-run the suite frequently to see if something was really wrong or just a false negative;

"Our tests aren't well written (we have sleep-polling)", "we don't build-in testability into our system (we can't introspect or wait on the thing we care about in the test, so we have massive work arounds)", or possibly worst "our system is flaky and our tests reflect that".

> Expensive to maintain. Manual changes in our staging environment corrupted test data fixtures and maintaining the environment “clean” was a challenge;

"We haven't spent enough time developing our own tools for testing, so we have tests that are extremely fragile (think copy and paste of massive JSON blobs with comparisons just to check a handful of values)"

> Failures don’t point to obvious issues. Test failures were very hard to debug, specially due to our reliance on asynchronous communication that make it hard to connect the cause of failure (a message not published to a queue) with its effect (changes not made in another system);

"Our system is over-engineered and our service boundaries match our internal structure rather than clean separation in the functions of our APIs. We don't have good visibility because doing any one thing involves massive levels of coordination. We lack proper tracing and aggregation."

> Slower value delivery. Queueing of commits in the End-to-End suite resulted in less frequent deployments;

"Quality is hard and takes time. Let's not do it so we can move fast and break things."

> Not efficient. Few bugs caught in this stage. One experiment suggested that, for every 1000 runs, we had 42 failures, only 1 bug;

See above about flakiness and fragility. Also, the bugs that integration tests do catch tend to be really bad, obvious ones. I'd be happy about the one that was caught.

> Not effective. Bugs were still being found in production.

"We still found bugs. This means testing must be ineffective altogether?"


You mean unit tests?


I loled that you were downvoted, I was going to post the same thing - they discovered multiple layers of testing? Congrats?


right?! Gimme a f* break


The scale of Nubank is insane


The "flaky" argument against end-to-end tests does not fill me with confidence in your system.


With a photo from a pre-COVID world...


Good, kill testing. Testing is a worthless and tragic waste of human energy and creativity unless actual lives or fortunes are at stake



