I've worked on a lot of large projects. Large means tens of developers, millions of lines of code, and hundreds of thousands to millions of dollars per hour if something goes terribly wrong.
I just can't imagine doing this kind of work without staging. I get that staging loosens discipline -- you'll definitely put in more effort if you know that nobody else is going to see your change before it hits prod.
But at the same time, the changes we were making were extremely complex, and if we had to break them up into a constant stream of small updates that each had to work perfectly, we would end up in a state of analysis paralysis.
As expensive as staging is, it makes development cheaper by letting people work in parallel and lessening the cost of their mistakes.
And before somebody points out that companies do successfully deploy things without staging -- I will say that those are somewhat different systems. Large-scale systems tend not to be very complex; they are just large in scale, and each component tends to be reasonably understandable by a single person.
The systems I worked on tended to be internal systems with less scale, but with domain models consisting of up to tens of thousands of different domain objects with very complex interactions.
We had a similar scenario, where we had to deploy a new search feature in our product. This required a large and complex DB migration and backfill. We resorted to cloning our prod DB to our staging env, and were thus able to test our migration scripts, review the search feature's performance on real data, verify the release plan and the downtime needed, and even run a few load tests. I cannot see how we could have done this without a staging env.
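For what it's worth, that clone-then-rehearse loop can be scripted. Here's a minimal sketch that assumes Postgres with `pg_dump`/`pg_restore` on the PATH; the connection strings are placeholders, and the `alembic` call stands in for whatever migration tooling you actually use:

```python
import subprocess
import time

# Placeholder connection strings -- substitute your own.
PROD_URL = "postgres://readonly@prod-db/app"
STAGING_URL = "postgres://admin@staging-db/app"

def clone_prod_to_staging() -> None:
    """Dump prod (custom format) and restore it over the staging database."""
    subprocess.run(
        ["pg_dump", "--format=custom", "--file=prod.dump", PROD_URL],
        check=True,
    )
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--no-owner",
         "--dbname", STAGING_URL, "prod.dump"],
        check=True,
    )

def rehearse_migration() -> None:
    """Clone prod, then run and time the migration to inform the downtime estimate."""
    clone_prod_to_staging()
    start = time.monotonic()
    # Hypothetical migration entry point -- replace with your own tooling.
    subprocess.run(["alembic", "upgrade", "head"], check=True)
    print(f"migration took {time.monotonic() - start:.1f}s")

if __name__ == "__main__":
    rehearse_migration()
```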
I agree. I find canary or blue/green deployments paired with good rollback plans to be more valuable than a staging environment, yet I see them utilized less. I'm not sure why, though.
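As an illustration, a canary split can be as simple as deterministic bucketing so that a given user consistently hits the old or the new version; the backend names and rollout percentage below are made up:

```python
import hashlib

CANARY_PERCENT = 5  # start small, widen as confidence grows

def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100

def backend_for(user_id: str) -> str:
    """Route a small, stable slice of users to the canary build."""
    return "app-canary" if bucket(user_id) < CANARY_PERCENT else "app-stable"

# Rollback is just setting CANARY_PERCENT back to 0.
```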
I like deploying proof-of-concept branches to a staging environment and giving non-tech product owners early access, so that we can confirm that we're on the right track even before a PR is ready.
What would be the best way of approaching that without staging?
That would require deploying the code to production. But at this stage the code is not ready for a PR/merge yet; e.g. it introduces a problematic dependency, requires a complex schema change, or similar.
Edit: The parent comment has been edited; originally it was just "Feature flags tied to user accounts." To address the rest: screen recording is sadly not always enough; screen sharing, pairing, and mobbing are not ideal either, since the stakeholders want to spend more time with the feature than I'm willing to spend on a call.
If your question is, "I really like the status quo and don't want to change anything, how could eliminating staging environments help me," the answer is obviously, "it can't."
Your original question was, paraphrased, "how do we give non-tech product owners early access without a staging environment?"
The canonical answer, from organizations that do this, is to use feature flags that are tied to user accounts. But this requires a certain level of engineering maturity—specifically, something that looks a lot like continuous deployment.
There's a bunch of things that go along with continuous deployment, but one of them is continuous integration (the practice, not the poorly-named build servers), which is a combination of trunk-based development and frequent merges to the trunk. This requires programmers to hide incomplete work behind feature flags (or keystones¹). There's an associated set of practices for dealing with schema changes that I'd be happy to describe.
People who are using feature flags (and associated practices) don't have "code that's not ready to merge." That's the whole point of the feature flags—to allow them to merge and deploy unfinished code.
So when you put the constraint of "our code isn't ready to merge" on my "use feature flags" answer, I thought you weren't engaging in good faith, because it's kind of nonsensical—like saying risk of sunburn prevents you from using sunscreen lotion. I apologize for misconstruing. I assume what you really meant was, "We aren't able to use feature flags to merge and deploy incomplete code."
That's a perfectly fine answer! But you asked how people share incomplete work when they don't have a staging environment, and the answer is, "feature flags."² Specifically, they deploy incomplete work to production and use feature flags to selectively hide and show it. If you can't do that, then it's probably best to keep your staging environment.
²For teams that use keystones, but not feature flags, pairing with stakeholders (screen sharing) is another approach I've seen used. It has the advantage of being simpler. But I agree that it's more constraining.
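To make the per-account flag idea concrete, here's a minimal sketch; the flag name, the allow-list, and the `is_enabled` helper are hypothetical, not taken from any particular feature-flag product:

```python
# Incomplete work ships to production behind a flag; only named
# accounts (e.g. the product owners) ever see it.
FLAG_ALLOWLIST: dict[str, set[str]] = {
    "new-search": {"po.alice@example.com", "po.bob@example.com"},
}

def is_enabled(flag: str, account: str) -> bool:
    """True if this account has been granted early access to the flag."""
    return account in FLAG_ALLOWLIST.get(flag, set())

def search(query: str, account: str) -> list[str]:
    if is_enabled("new-search", account):
        return new_search(query)   # unfinished code path, deployed but hidden
    return legacy_search(query)    # everyone else keeps the old behaviour

def new_search(query: str) -> list[str]:     # placeholder for the in-progress feature
    return []

def legacy_search(query: str) -> list[str]:  # placeholder for the existing feature
    return []
```

In practice the allow-list would live in a database or a flag service so it can change without a deploy, but the shape of the check is the same.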
Staging in the cloud, with every resource defined via infrastructure as code, is actually not too bad. Staging is hard when your software is poorly written or your infrastructure is janky. A lot of the value in beta, staging, and then prod is the fact that the code ran successfully twice (even if with no real data) before getting to prod. This roots out a surprising number of bugs. Throw in some light integration testing and you're a professional.
Host prod and staging in the cloud. Both environments are identical. Boom! Problem solved.
Then after patting yourself on the back and buying yourself a drink, have a very painful conversation with your CFO explaining why your AWS bill doubled....
Continuing the article's theme that with more resources you can more closely approximate production, with more resources you can generate more realistic fake user data and you can add on a realistic load simulator.
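As a sketch of that idea, fake users and a crude load generator fit in a few lines; `faker` and `requests` are real libraries, but the endpoint and payload shape here are assumptions:

```python
import time
import requests
from faker import Faker

fake = Faker()
STAGING_URL = "https://staging.example.com/api/users"  # hypothetical endpoint

def fake_user() -> dict:
    """One realistic-looking user record."""
    return {"name": fake.name(), "email": fake.email(), "company": fake.company()}

def simulate_load(requests_per_second: float, duration_s: int) -> None:
    """Fire synthetic signups at staging at a fixed, modest rate."""
    interval = 1.0 / requests_per_second
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        requests.post(STAGING_URL, json=fake_user(), timeout=5)
        time.sleep(interval)

if __name__ == "__main__":
    simulate_load(requests_per_second=2, duration_s=60)
```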
Please, God, I hope our customers don't see this. (I'm not upvoting it for that reason.)
We support data warehouses, and not trying things out in staging is the root cause of many P0 cases -- especially on the weekend, because a lot of companies do upgrades Saturday night. I had to deal with this exact scenario last weekend in Istanbul Airport at 9am Sunday morning over a janky LTE connection. It was not a pretty sight.
If you are running databases, the null hypothesis for any mission-critical system is that you must have staging to test upgrades as well as out-of-band activities like restoring from backup.
Staging hides bugs. Staging tends to be manually altered in ways that CD doesn't reproduce. Using staging for testing means teams are likely to collide with each other. Not to mention the cleanup needed so that the environment can continue to be used.
Favor ephemeral environments instead. Need to demo? Spin one up. Need to do some end-to-end exploring? Spin one up.
Spot a problem in an ephemeral environment? Keep it up so someone can investigate.
All done with it? Spin it down.
Can't do that because of all your special snowflake infra code? Well, that's a problem waiting to happen anyway. Fix it so that you can spin up ephemeral environments.
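A minimal sketch of that spin-up/spin-down flow, assuming the whole stack is described in a `docker-compose.yml` (the project-name convention here is made up):

```python
import subprocess
import uuid

def spin_up() -> str:
    """Start an isolated copy of the stack under a unique compose project name."""
    env_name = f"ephemeral-{uuid.uuid4().hex[:8]}"
    subprocess.run(["docker", "compose", "-p", env_name, "up", "-d"], check=True)
    return env_name  # hand this to whoever needs to demo or investigate

def spin_down(env_name: str) -> None:
    """Tear the environment down, including its volumes."""
    subprocess.run(["docker", "compose", "-p", env_name, "down", "-v"], check=True)

if __name__ == "__main__":
    name = spin_up()
    print(f"ephemeral environment '{name}' is up; spin it down when you're done")
```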
If you can set up a canary, you can set up a staging environment. Once it exists, it's just a matter of routing the traffic you want to staging, which is trivially done by vending out cookies to the users you wish to send there.
I don't see much point in staging anything but business logic. I don't see the point in staging in any way that isn't "live."
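A sketch of what the cookie check mentioned above might look like at the routing layer (the cookie name and upstream hosts are placeholders):

```python
from http.cookies import SimpleCookie

STAGING_UPSTREAM = "https://staging.internal.example.com"   # placeholder
PRODUCTION_UPSTREAM = "https://prod.internal.example.com"   # placeholder

def pick_upstream(cookie_header: str) -> str:
    """Send requests carrying the opt-in cookie to staging; everyone else to prod."""
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    morsel = cookies.get("x-env")
    if morsel is not None and morsel.value == "staging":
        return STAGING_UPSTREAM
    return PRODUCTION_UPSTREAM

# e.g. pick_upstream("x-env=staging; session=abc") -> STAGING_UPSTREAM
```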
Staging environments solve transitivity problems which canaries don't.
If your services are very weakly coupled, you can probably get by with canaries.
If you have tightly coupled microservices, staging has a lot of opportunity to catch bugs before they get to prod and cause outages.
The canary / break prod / rollback attitude is why developers are constantly putting out fires instead of getting things done.
> The canary / break prod / rollback attitude is why developers are constantly putting out fires instead of getting things done.
Having a team solely responsible for rollbacks and failed builds is immensely helpful. If you frequently have code that passes tests but fails in prod, then a team must "regulate" the commons (the shared code base).
Why multiple databases and microservice architectures are bad: reason #7237.
If you have one database, you only need one staging database, and it's easy to keep them in sync. I've worked at a company where this worked awesomely. I also worked at a company trying to follow Netflix, and staging was a nightmare.
This is dumb. How else am I going to have a conversation about the color of the buttons, and whether the developer picked the right fields to query in the free-text search?