
Our dev and prod environments are literally the same code other than some operational stuff. When you need to test something before it’s continuously integrated on merge, you use the ephemeral feature environment that’s automatically created for you when you open a PR. This forces features to be done done when they’re merged unless they’re gated in some way. And it makes us less spooked about deploying to prod because it happens all the time. It raises the bar for PR reviews because if you approve something broken, unless an automated test catches it, it’s going straight to prod. So most reviewers take the time to actually verify changes work like we’re all supposed to do but usually laze out on. Since dev and prod are always the same, and ephemeral envs use dev resources (DBs), you know exactly what to expect and don’t have the cognitive overhead of keeping track of which versions of things are deployed where. If someone experiences an issue in prod it has always been instantly reproducible in dev. In those ways, we test in prod.
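For illustration only, a minimal sketch of what the per-PR automation could look like (assuming Kubernetes and Helm; the chart path, namespace scheme, and dev DB host below are placeholders, not our actual setup):

    # Sketch: create/refresh or tear down an ephemeral environment for one PR.
    import subprocess
    import sys

    def deploy_ephemeral_env(pr_number: str, image_tag: str) -> None:
        """Create or refresh an isolated per-PR environment backed by dev resources."""
        namespace = f"pr-{pr_number}"
        subprocess.run(
            ["helm", "upgrade", "--install", f"app-pr-{pr_number}", "./chart",
             "--namespace", namespace, "--create-namespace",
             # Same chart and values as prod; only the image tag and DB host differ.
             "--set", f"image.tag={image_tag}",
             "--set", "database.host=dev-db.internal"],
            check=True,
        )

    def teardown_ephemeral_env(pr_number: str) -> None:
        """Delete the whole namespace once the PR is merged or closed."""
        subprocess.run(["kubectl", "delete", "namespace", f"pr-{pr_number}"], check=True)

    if __name__ == "__main__":
        action, pr = sys.argv[1], sys.argv[2]
        tag = sys.argv[3] if len(sys.argv) > 3 else "latest"
        deploy_ephemeral_env(pr, tag) if action == "deploy" else teardown_ephemeral_env(pr)

CI would call the deploy path when a PR opens or updates and the teardown path when it closes, so merged code is the only thing that ever reaches prod.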


A major point of the article is that it's impossible to replicate a prod environment. A staging environment won't capture all the issues.


That’s essentially why we don't have a staging environment and why “dev” is a direct copy of prod but with different backing data. I’d argue that more often than not it’s overwhelmingly difficult to maintain the discipline required to keep staging from degrading into something that doesn't resemble prod at all, so I’m sympathetic. We deal with that by taking away the notion that you get to stage your changes anywhere persistent before they land in prod. Of course dev is not 1000% identical to the very last bit, I’m not going to argue that. But it is a hell of a lot better than the type of staging environments I imagine drove the author to take such a stance. Like I said, we’ve yet to hit a bug in prod that didn't reproduce in our dev env. So in that sense, anecdotally, the point does not hold.

Just to be a little more clear: I agree with the author that issues happen in prod that are unique to prod and that you simply won’t catch pre-prod. And I agree with the trendy hot-take mantra that “testing in prod” is okay and shouldn’t be frowned upon as much as people seem to think. But I’m also suggesting that instead of viewing the ability to test in prod as a badge of honor, it’s possible to apply this mantra to traditional notions of a staging environment. You can cut out many of the issues and frustrations surrounding testing in staging by actually practicing continuous integration. Build mechanisms and policy that severely limit how often and how far staging diverges from prod, and I wager you’d get much, much further than the comfortable status quo of merging to staging, manual and automated integration testing, and then a cadence-based release to prod. So yeah: test in prod! Just don't use your real prod unless you have to.


> That’s essentially why we don't have a staging environment and why “dev” is a direct copy of prod but with different backing data.

That's fair, and a decent way to make a staging environment, though as echoed elsewhere, the data itself can exercise your code in ways that uncover bugs. I also think this is more feasible on, say, a monolith setup vs. a sharded multi-cluster service that's integrated with manifold 3rd-party systems. But yes, if you can, you probably should have this kind of prod-replica staging as an adjunct to incremental canary rollouts, prod-safe testing suites, etc. And the article was explicitly suggesting that in-prod testing should be an adjunct to non-prod testing.


You and others also have a point. I’m now thinking of ways we could seed our dev data to be maximally similar to prod. It’s all encrypted blobs though so it would mostly be about scale in our case. But your point is still taken.


> “dev” is a direct copy of prod but with different backing data

Then it's not a direct copy of prod. Many times it's the data that makes bugs appear.


Let me quote myself:

> Of course dev is not 1000% identical to the very last bit, I’m not going to argue that. … Like I said, we’ve yet to hit a bug in prod that didn't reproduce in our dev env. …

> You can cut out many of the issues and frustrations surrounding testing in staging by actually practicing continuous integration. Build mechanisms and policy that severely limit how often and how far staging diverges from prod, and I wager you’d get much, much further than the comfortable status quo of merging to staging, manual and automated integration testing, and then a cadence-based release to prod. So yeah: test in prod! Just don't use your real prod unless you have to.

I’ve seen plenty of staging envs that look nothing like prod, and that’s what I’m calling the real sham.


That’s true, but it depends on what you’re building. We don’t have a million users or anything yet, but I clone the prod db every month or so, change the passwords, and use that for testing. Before, we had a staging db and a prod db, but they’d diverge: staging would have almost no data while prod would be full of it.
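Roughly like this, as a sketch (assuming Postgres; the connection strings and the users table/columns below are placeholders, not my actual schema):

    # Sketch: copy prod into the test db, then invalidate credentials.
    import subprocess

    PROD_URL = "postgresql://readonly@prod-db.internal/app"   # placeholder
    TEST_URL = "postgresql://admin@test-db.internal/app"      # placeholder

    def clone_prod_to_test() -> None:
        # Dump prod in custom format, then restore over the test db, dropping old objects first.
        subprocess.run(["pg_dump", "--format=custom", "--file=/tmp/prod.dump", PROD_URL], check=True)
        subprocess.run(["pg_restore", "--clean", "--if-exists", "--no-owner",
                        "--dbname", TEST_URL, "/tmp/prod.dump"], check=True)
        # Invalidate every credential so cloned accounts can't be used against real users.
        subprocess.run(["psql", TEST_URL, "-c",
                        "UPDATE users SET password_hash = 'invalidated', api_token = NULL;"],
                       check=True)

    if __name__ == "__main__":
        clone_prod_to_test()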


How do you manage sensitive data with this workflow (i.e., do you do it manually every time, do you automate it, what scripts do you use, etc.)?

I get changing passwords, but say the data leaks (whether by a vulnerability in the clone environment or by a dev gone rogue): how do you mitigate the possible damage done to real users (since you did clone from prod)?

I ask not because I question your actions, but because I've been wanting to do something similar in a staging env to allow practical testing, and I haven't had the chance to research how to do it "properly".


Not the parent poster, but working in finance, multiple products had a "scrambling" feature which replaced many fields (names, addresses, etc.) with random text, and that was applied whenever a non-production environment was restored. It's not proper anonymization, since there are all kinds of IDs (account numbers, reference numbers) that are still linkable to identities but can't be changed without breaking the processes that are needed even in testing, but it's a simple step that does reduce some risks.
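A toy version of that kind of scrambling pass, just to make the idea concrete (assuming Postgres with psycopg2; the table and column names are made up, and keys like account numbers are deliberately left alone):

    # Sketch: overwrite personal fields with random text after restoring a non-prod copy.
    import psycopg2

    # Made-up tables/columns holding personal data; adjust to the real schema.
    SCRAMBLE = {
        "customers": ["full_name", "street_address", "email"],
        "contacts": ["full_name", "phone_number"],
    }

    def scramble(dsn: str) -> None:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            for table, columns in SCRAMBLE.items():
                # Replace each sensitive field with random hex; account numbers,
                # reference numbers, and other keys stay intact so test processes still run.
                assignments = ", ".join(f"{col} = md5(random()::text)" for col in columns)
                cur.execute(f"UPDATE {table} SET {assignments};")

    if __name__ == "__main__":
        scramble("postgresql://admin@test-db.internal/app")  # placeholder DSN

As noted, this reduces risk rather than eliminating it: anything still keyed to real identifiers remains re-linkable if the copy leaks.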


> literally the same code

There’s a ton of stuff on this list that you bloody well can test in preproduction and you’re a damn fool (or you work for them) if you don’t/can’t.

    - A specific network stack with specific tunables, firmware, and NICs
    - Services loosely coupled over networks
    - Specific CPUs and their bugs; multiprocessors
    - Specific hardware RAM and memory bugs
    - Specific distro, kernel, and OS versions
    - Specific library versions for all dependencies
    - Build environment
    - Deployment code and process
    - Specific containers or VMs and their bugs
    - Specific schedulers and their quirks
That’s 40% of that list, or 5/8ths of the surface area of two-problem interactions. CI/CD, Twelve Factor… you can fill an entire bookcase with books on this topic. Some of those books are almost old enough to drink. Someone whose byline is “been on call half of their life” has had time to read some of them.
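One way to keep “same everywhere” checkable rather than aspirational is an automated parity check in CI. A sketch, assuming each environment exposes a hypothetical /version endpoint that reports kernel, distro, and dependency versions as flat JSON (the URLs are placeholders):

    # Sketch: fail the pipeline if dev and prod report different versions of anything.
    import json
    import sys
    import urllib.request

    DEV_URL = "https://dev.example.internal/version"    # placeholder
    PROD_URL = "https://prod.example.internal/version"  # placeholder

    def fetch(url: str) -> dict:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)

    def diff(dev: dict, prod: dict) -> list:
        keys = sorted(set(dev) | set(prod))
        return [f"{k}: dev={dev.get(k)!r} prod={prod.get(k)!r}"
                for k in keys if dev.get(k) != prod.get(k)]

    if __name__ == "__main__":
        mismatches = diff(fetch(DEV_URL), fetch(PROD_URL))
        if mismatches:
            print("environment drift detected:")
            print("\n".join(mismatches))
            sys.exit(1)  # make drift a build failure, not a surprise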


All of those are the same for us. The number of CPUs and the amount of RAM are the only differences.


To be fair, I've had to argue with a lot of managers prior to The Cloud about how the QA team was given shit hardware instead of identical hardware. The IT manager even had a concrete use case for identical hardware that I thought was for sure going to win me that argument but it didn't.

If you don't have enough identical hardware for pre-prod, then you probably don't have spare servers for production either. If you get a flash of traffic due to a news article, or one of your machines develops a hardware fault, then you have to order replacements. At best you might be able to pull off an overnight FedEx, but only if the problem happens in the morning.

If, however, you have identical QA hardware, you can order the new hardware and cannibalize QA. Re-image the machine and plop it into production. QA will be degraded for a couple of days but that's better than prod having an issue.

With the Cloud, the hardware is somewhat fungible, so you can generally pick identical hardware for preprod and prepare an apology if anyone even notices you've done it. If the nascent private cloud computing vendors manage to take off, they'll have to address that phenomenon or lose a lot of potential supporters at customer sites.


I'm sure there are clueless companies/managers in infra land that don't quite get it (and that are still great places/people to work for and products to work on), and if you find yourself in one of those situations, it's pretty rational to need prod, if it's the only place your problem shows up, because of large divergences in the things you and the article mention. You're not wrong. But something I've been a stickler about since our company's beginnings is that dev is really, as much as is feasible and useful, an exact copy of prod. And it's working so far. We have yet to scale to massive heights, I'll admit that. But it's a principle that I've seen more than a few companies simply neglect.



