
Our dev and prod environments are literally the same code other than some operational stuff. When you need to test something before it’s continuously integrated on merge, you use the ephemeral feature environment that’s automatically created for you when you open a PR. This forces features to be done done when they’re merged unless they’re gated in some way. And it makes us less spooked about deploying to prod because it happens all the time. It raises the bar for PR reviews because if you approve something broken, unless an automated test catches it, it’s going straight to prod. So most reviewers take the time to actually verify changes work like we’re all supposed to do but usually laze out on. Since dev and prod are always the same, and ephemeral envs use dev resources (DBs), you know exactly what to expect and don’t have the cognitive overhead of keeping track of which versions of things are deployed where. If someone experiences an issue in prod it has always been instantly reproducible in dev. In those ways, we test in prod.
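For illustration only, a minimal sketch of what the per-PR automation could look like (assuming Kubernetes and Helm; the chart path, namespace scheme, and dev DB host below are placeholders, not our actual setup):

    # Sketch: create/refresh or tear down an ephemeral environment for one PR.
    import subprocess
    import sys

    def deploy_ephemeral_env(pr_number: str, image_tag: str) -> None:
        """Create or refresh an isolated per-PR environment backed by dev resources."""
        namespace = f"pr-{pr_number}"
        subprocess.run(
            ["helm", "upgrade", "--install", f"app-pr-{pr_number}", "./chart",
             "--namespace", namespace, "--create-namespace",
             # Same chart and values as prod; only the image tag and DB host differ.
             "--set", f"image.tag={image_tag}",
             "--set", "database.host=dev-db.internal"],
            check=True,
        )

    def teardown_ephemeral_env(pr_number: str) -> None:
        """Delete the whole namespace once the PR is merged or closed."""
        subprocess.run(["kubectl", "delete", "namespace", f"pr-{pr_number}"], check=True)

    if __name__ == "__main__":
        action, pr = sys.argv[1], sys.argv[2]
        tag = sys.argv[3] if len(sys.argv) > 3 else "latest"
        deploy_ephemeral_env(pr, tag) if action == "deploy" else teardown_ephemeral_env(pr)

CI would call the deploy path when a PR opens or updates and the teardown path when it closes, so merged code is the only thing that ever reaches prod.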


A major point of the article is that it's impossible to replicate a prod environment. A staging environment won't capture all the issues.


That’s essentially why we don't have a staging environment and why “dev” is a direct copy of prod but with different backing data. I’d argue that more often than not it’s overwhelmingly difficult to maintain the discipline required to keep staging from degrading into something that doesn't resemble prod at all, so I’m sympathetic. We deal with that by taking away the notion that you get to stage your changes anywhere persistent before they land in prod. Of course dev is not 1000% identical to the very last bit, I’m not going to argue that. But it is a hell of a lot better than the type of staging environments I imagine drove the author to take such a stance. Like I said, we’ve yet to hit a bug in prod that didn't reproduce in our dev env. So in that sense, anecdotally, the point does not hold.

Just to be a little more clear: I agree with the author that issues happen in prod that are unique to prod and that you simply won’t catch pre-prod. And I agree with the trendy hot-take mantra that “testing in prod” is okay and shouldn’t be frowned upon as much as people seem to think. But I’m also suggesting that instead of viewing the ability to test in prod as a badge of honor, it’s possible to apply this mantra to traditional notions of a staging environment. You can cut out many of the issues and frustrations surrounding testing in staging by actually practicing continuous integration. Build mechanisms and policy that severely limit how often and how far staging diverges from prod, and I wager you’d get much, much further than the comfortable status quo of merging to staging, manual and automated integration testing, and then a cadence-based release to prod. So yeah: test in prod! Just don't use your real prod unless you have to.


> That’s essentially why we don't have a staging environment and why “dev” is a direct copy of prod but with different backing data.

That's fair, and a decent way to make a staging environment, though as echoed elsewhere, the data itself can exercise your code in ways that uncover bugs. I also think this is more feasible on, say, a monolith setup vs. a sharded multi-cluster service that's integrated with manifold 3rd-party systems. But yes, if you can, you probably should have this kind of prod-replica staging as an adjunct to incremental canary rollouts, prod-safe testing suites, etc. And the article was explicitly suggesting that in-prod testing should be an adjunct to non-prod testing.


You and others also have a point. I’m now thinking of ways we could seed our dev data to be maximally similar to prod. It’s all encrypted blobs though so it would mostly be about scale in our case. But your point is still taken.


> “dev” is a direct copy of prod but with different backing data

Then it's not a direct copy of prod. Many times it's the data that makes bugs appear.


Let me quote myself:

> Of course dev is not 1000% identical to the very last bit, I’m not going to argue that. … Like I said, we’ve yet to hit a bug in prod that didn't reproduce in our dev env. …

> You can cut out many of the issues and frustrations surrounding testing in staging by actually practicing continuous integration. Build mechanisms and policy that severely limit how often and how far staging diverges from prod, and I wager you’d get much, much further than the comfortable status quo of merging to staging, manual and automated integration testing, and then a cadence-based release to prod. So yeah: test in prod! Just don't use your real prod unless you have to.

I’ve seen plenty of staging envs that look nothing like prod, and that’s what I’m calling the real sham.


That’s true, but it depends on what you’re building. We don’t have a million users or anything yet, but I clone the prod db every month or so, change the passwords, and use that for testing. Before, we had a staging db and a prod db, but they’d diverge: staging would have almost no data while prod would be full of it.
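Roughly like this, as a sketch (assuming Postgres; the connection strings and the users table/columns below are placeholders, not my actual schema):

    # Sketch: copy prod into the test db, then invalidate credentials.
    import subprocess

    PROD_URL = "postgresql://readonly@prod-db.internal/app"   # placeholder
    TEST_URL = "postgresql://admin@test-db.internal/app"      # placeholder

    def clone_prod_to_test() -> None:
        # Dump prod in custom format, then restore over the test db, dropping old objects first.
        subprocess.run(["pg_dump", "--format=custom", "--file=/tmp/prod.dump", PROD_URL], check=True)
        subprocess.run(["pg_restore", "--clean", "--if-exists", "--no-owner",
                        "--dbname", TEST_URL, "/tmp/prod.dump"], check=True)
        # Invalidate every credential so cloned accounts can't be used against real users.
        subprocess.run(["psql", TEST_URL, "-c",
                        "UPDATE users SET password_hash = 'invalidated', api_token = NULL;"],
                       check=True)

    if __name__ == "__main__":
        clone_prod_to_test()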


How do you manage sensitive data with this workflow (i.e., do you do it manually every time, do you automate it, what scripts do you use, etc.)?

I get changing passwords, but say the data leaks (whether by a vulnerability in the clone environment or by a dev gone rogue): how do you mitigate the possible damage done to real users (since you did clone from prod)?

I ask not because I question your actions, but because I've been wanting to do something similar in a staging env to allow practical testing, and I haven't had the chance to research how to do it "properly".


Not the parent poster, but working in finance, multiple products had a "scrambling" feature which replaced many fields (names, addresses, etc.) with random text, and that was applied whenever a non-production environment was restored. It's not proper anonymization, since there are all kinds of IDs (account numbers, reference numbers) that are still linkable to identities but can't be changed without breaking the processes that are needed even in testing, but it's a simple step that does reduce some risks.
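A toy version of that kind of scrambling pass, just to make the idea concrete (assuming Postgres with psycopg2; the table and column names are made up, and keys like account numbers are deliberately left alone):

    # Sketch: overwrite personal fields with random text after restoring a non-prod copy.
    import psycopg2

    # Made-up tables/columns holding personal data; adjust to the real schema.
    SCRAMBLE = {
        "customers": ["full_name", "street_address", "email"],
        "contacts": ["full_name", "phone_number"],
    }

    def scramble(dsn: str) -> None:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            for table, columns in SCRAMBLE.items():
                # Replace each sensitive field with random hex; account numbers,
                # reference numbers, and other keys stay intact so test processes still run.
                assignments = ", ".join(f"{col} = md5(random()::text)" for col in columns)
                cur.execute(f"UPDATE {table} SET {assignments};")

    if __name__ == "__main__":
        scramble("postgresql://admin@test-db.internal/app")  # placeholder DSN

As noted, this reduces risk rather than eliminating it: anything still keyed to real identifiers remains re-linkable if the copy leaks.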


> literally the same code

There’s a ton of stuff on this list that you bloody well can test in preproduction and you’re a damn fool (or you work for them) if you don’t/can’t.

    - A specific network stack with specific tunables, firmware, and NICs
    - Services loosely coupled over networks
    - Specific CPUs and their bugs; multiprocessors
    - Specific hardware RAM and memory bugs
    - Specific distro, kernel, and OS versions
    - Specific library versions for all dependencies
    - Build environment
    - Deployment code and process
    - Specific containers or VMs and their bugs
    - Specific schedulers and their quirks
That’s 40% of that list, or 5/8ths of the surface area of two-problem interactions. CI/CD, Twelve Factor… you can fill an entire bookcase with books on this topic. Some of those books are almost old enough to drink. Someone whose byline is “been on call half of their life” has had time to read some of them.
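One way to keep “same everywhere” checkable rather than aspirational is an automated parity check in CI. A sketch, assuming each environment exposes a hypothetical /version endpoint that reports kernel, distro, and dependency versions as flat JSON (the URLs are placeholders):

    # Sketch: fail the pipeline if dev and prod report different versions of anything.
    import json
    import sys
    import urllib.request

    DEV_URL = "https://dev.example.internal/version"    # placeholder
    PROD_URL = "https://prod.example.internal/version"  # placeholder

    def fetch(url: str) -> dict:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)

    def diff(dev: dict, prod: dict) -> list:
        keys = sorted(set(dev) | set(prod))
        return [f"{k}: dev={dev.get(k)!r} prod={prod.get(k)!r}"
                for k in keys if dev.get(k) != prod.get(k)]

    if __name__ == "__main__":
        mismatches = diff(fetch(DEV_URL), fetch(PROD_URL))
        if mismatches:
            print("environment drift detected:")
            print("\n".join(mismatches))
            sys.exit(1)  # make drift a build failure, not a surprise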


All of those are the same for us. The number of CPUs and the amount of RAM are the only differences.


To be fair, I've had to argue with a lot of managers prior to The Cloud about how the QA team was given shit hardware instead of identical hardware. The IT manager even had a concrete use case for identical hardware that I thought was for sure going to win me that argument but it didn't.

If you don't have enough identical hardware for pre-prod, then you probably don't have spare servers for production either. If you get a flash of traffic due to a news article, or one of your machines develops a hardware fault, then you have to order replacements. At best you might be able to pull off an overnight FedEx, but only if the problem happens in the morning.

If, however, you have identical QA hardware, you can order the new hardware and cannibalize QA. Re-image the machine and plop it into production. QA will be degraded for a couple of days but that's better than prod having an issue.

With the Cloud, the hardware is somewhat fungible, so you can generally pick identical hardware for preprod and prepare an apology if anyone even notices you've done it. If the nascent private cloud computing vendors manage to take off, they'll have to address that phenomenon or lose a lot of potential supporters at customer sites.


I'm sure there are clueless companies/managers in infra land that don't quite get it (and that are still great places/people to work for and products to work on), and if you find yourself in one of those situations, it's pretty rational to need prod, if it's the only place your problem shows up, because of large divergences in the things you and the article mention. You're not wrong. But something I've been a stickler about since our company's beginnings is that dev is really, as much as is feasible and useful, an exact copy of prod. And it's working so far. We have yet to scale to massive heights, I'll admit that. But it's a principle that I've seen more than a few companies simply neglect.



