It's money, of course. No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008; clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice, and pray.
> No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008; clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice, and pray.
Well, fly-by-night outfits will do that. Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.
Look at a big bank or a big corporation's accounting systems: they'll pay millions just for the hot standby mainframes or minicomputers that, for most of them, would never be required.
> Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.
Used to, but it feels like there is no corporate responsibility in this country anymore. These monopolies have gotten so large that they don't feel any impact from these issues. Microsoft is huge and doesn't really have large competitors. Google and Apple aren't really competing in the source code hosting space in the same way GitHub is.
> Take the number of vehicles in the field, A, multiply it by the probable rate of failure, B, then multiply the result by the average out-of-court settlement, C. A times B times C equals X. If X is less than the cost of a recall, we don't do one.
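The same expected-value math applies to outages. A minimal sketch with entirely made-up numbers (none of these figures come from the thread):

```python
# Hypothetical back-of-envelope figures, purely for illustration.
outages_per_year = 4            # A: expected incidents per year
avg_outage_hours = 2.0          # B: average duration of each incident
cost_per_hour = 250_000         # C: revenue/productivity lost per hour of downtime

expected_annual_loss = outages_per_year * avg_outage_hours * cost_per_hour  # "X"
redundancy_cost_per_year = 1_500_000  # hot standby infra, extra staff, etc. (assumed)

# If X is less than the cost of redundancy, the spreadsheet says don't buy it.
print(f"expected loss: ${expected_annual_loss:,.0f}")
print(f"redundancy:    ${redundancy_cost_per_year:,.0f}")
print("worth it" if expected_annual_loss > redundancy_cost_per_year else "not worth it")
```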
> Look at a big bank or a big corporation's accounting systems
Not my experience. Every bank I've used, in multiple countries, has had multiple significant outages, including some where their cards failed to function. Do a search for "U.S. Bank outage" to see how many outages have happened so far this year.
Modern internet company backends are very complex; even on a good day they're at the outer limits of their designers' and operators' understanding, and every day they're growing and changing (because of all the money and effort being spent on them!). It's often a short leap to a state that nobody thought of as a possibility or fully grasped the consequences of. It's not clear that it would be practical, with any amount of money, to test or rule out every such state in advance. Some exciting techniques are being developed in that area (Antithesis, formal verification, etc.), but that stuff isn't standard of care for a working SWE yet. Unit tests and design reviews only get you so far.
I've worked at many big banks and corporations. They are all held together with the proverbial sticky tape, bubblegum, and hope.
They do have multiple layers of redundancy, and thus the big budgets, but those layers won't be kept hot, or there will be some critical flaw that all of the engineers know about but haven't been given permission/funding to fix. The engineers are so badly managed by the firm that they dgaf either and secretly want the thing to burn.
There will be sustained periods of downtime if their primary system blips.
They will all still be dependent on some hyper-critical system that nobody really understands, whose last change was introduced in 1988, and which (probably) requires a terminal emulator to operate.
I've worked on software used by these companies and have been called in to help with support from time to time. One customer, a top-single-digit public company by market cap (they may have been #1 at the time, a few years ago), had their SAP systems go down once every few days. This wasn't causing a real monetary problem for them because their hot standby took over.
They weren't using mainframes, just "big iron" servers, but each one would have been north of $5 million for the box alone, I guess on a 5-ish year replacement schedule. Then there's all the networking, storage, licensing, support, and internal administration for it, which would easily cost that much again.
Now people will say SAP systems are made entirely of duct tape and bubblegum. But it all worked. This system ran all their sales/purchasing sites and portals and was doing a million dollars every couple of minutes, so that all paid for itself many times over during the course of that bug. Cold standby would not have cut it, especially since these big systems take many minutes to boot and HANA takes even longer to load from storage.
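Back-of-envelope, using the figures above plus a couple of assumptions (the failover time saved and the incident frequency are my guesses, not numbers from the comment):

```python
# Rough figures; minutes saved and incident count are assumptions.
revenue_per_minute = 1_000_000 / 2      # "a million dollars every couple of minutes"
standby_cost_per_year = 10_000_000      # box + networking/storage/licensing/admin, amortized (assumed)
minutes_saved_per_incident = 15         # assumed: cold boot + HANA load vs near-instant failover
incidents_per_year = 52                 # "down once every few days" ~ roughly weekly

avoided_loss = revenue_per_minute * minutes_saved_per_incident * incidents_per_year
print(f"avoided loss: ${avoided_loss:,.0f}/yr vs standby cost ${standby_cost_per_year:,.0f}/yr")
# ~$390M in avoided downtime against ~$10M of standby: it pays for itself many times over.
```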
These companies do take it seriously, on the software side, but when it comes to configurations, what are you going to do:
Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk. It looks like even the most critical and prestigious companies in the world are doing the former.
> Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk.
There's also the problem that doubling your cloud footprint to reduce the risk of a single point of failure introduces new risks: more configuration to break, new modes of failure when both infrastructures are accidentally live and processing traffic, etc.
Back when companies typically ran their own datacenters (or otherwise heavily relied on physical devices), I was very skeptical about redundant switches, fearing the redundant hardware would cause more problems than it solved.
I'm not sure it's only money. People could have a lot of simpler, cheaper software by relying on core (OS) features instead of rolling their own or relying on bloated third parties, but a lot don't, due to cargo culting.
And tech hype. The infrastructure to mitigate this isn't expensive; in many cases quite the opposite. The expensive thing is that you made yourself dependent on these services. Sometimes that's inevitable, but hosting on GitHub is a choice.
…can I make the case that this might be reasonable? If you're not running a hospital†, how much is too much to spend to avoid a few hours of downtime around once a year?
† Hopefully there aren't any hospitals that depend on GitHub being continuously available?
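For scale, here's a rough mapping from availability targets to allowed downtime per year; "a few hours once a year" lands somewhere around three nines:

```python
# Allowed downtime per year for a few common availability targets.
HOURS_PER_YEAR = 365.25 * 24

for availability in (0.99, 0.999, 0.9995, 0.9999):
    downtime_hours = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability:.2%}: {downtime_hours:6.1f} hours/year")

# 99.90% already allows ~8.8 hours a year, so a few hours of downtime once a
# year is roughly a three-to-three-and-a-half-nines budget.
```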
This is true. But unfortunately the exact same process is used even for critical stuff (the CrowdStrike incident, for example). Maybe there needs to be a separate SWE process for those things as well, just like there is for aviation. That means not using the same dev tooling, which is a lot of effort.