
If you want to reason about a world that has random software and hardware failures, then you cannot really have any kind of pure results. A backhoe could cut your network cable at exactly the wrong point, or a malfunctioning network switch could insert the right extra few bytes in exactly the wrong place, changing the meaning of your message without altering any of the checksums. As the scale of your application increases, the chance of this sort of chaos increases as well. The question then becomes how to reason in the face of chaos, what sorts of error rates are acceptable, and how to build systems that can recover from supposedly impossible states. If your bank's software makes an error, the bank has established processes to detect it and correct the account balances.
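The "without altering any of the checksums" scenario is more plausible than it sounds for weak checksums: the Internet checksum (RFC 1071) used by TCP and UDP is a ones'-complement sum of 16-bit words, so any reordering of those words leaves it unchanged. A minimal sketch (the messages are made up for illustration):

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

msg = b"pay Alice $1000;pay Bob $2 today"
# Swap two 16-bit words: different bytes on the wire, same checksum.
words = [msg[i:i + 2] for i in range(0, len(msg), 2)]
tampered = b"".join([words[1], words[0]] + words[2:])

assert msg != tampered
assert internet_checksum(msg) == internet_checksum(tampered)
```

This is why end-to-end integrity usually needs a stronger check (CRC32 at minimum, or a cryptographic hash) layered above the transport.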


This line of "what if an asteroid hits the primary & DR data centers in the same microsecond" thinking is why we settled on running our product on 1 VM with SQLite in-proc.

After taking our customers through this same kind of apocalyptic rabbit hole conversation, they tend to agree with this architecture decision.

The cost of anticipating the 0.00001% event that might never come is completely drowned out by the certain, daily headache of managing a convoluted, multi-cloud cluster.

Many times the business owners will get the message and finally reveal that they have always had access to a completely ridiculous workaround involving literal paper & pen that is just as feasible in 2023 as it was in the 18th century.


In my experience the customers, and even the POs, are the easy ones to convince. “We get 99% of the uptime for 30% of the price? Great!”

It’s the resume-driven mid dev in the next office you’ve got to watch out for.


Or the dev who spent a few nights and weekends rescuing the system after one of those 1% failures that the customer, as it turns out, has no patience for at all.


Disaster recovery is just one of many things that is much simpler in non-distributed systems.

You seem to be confusing a system that produces bad results 1% of the time with a system that's down 1% of the time. If you can only write the first kind of non-distributed system, you're in for a bad trip if you try to write a distributed equivalent.


Last month I had a deployment go wrong on one box, and the part of deployments outside of my control is all or nothing. No partial credit for 96% success. Some random consul call consumed the port we listen on, a shutdown timeout expired and the process was killed, and so that socket was left in CLOSE_WAIT (like seriously, Hashicorp, SO_LINGER has been around for at least 30 years).

This led to an existential crisis because given the number of ports we open and the number of machines we run and the number of processes per machine, there must be over a 0.1% chance of any deployment blowing up this way. We do hundreds a year in prod and probably hundreds a month in preprod. We've been winning the lottery this whole time.

Throw enough events around and a one in a million corner case will happen every week, every day, twice a day, three times in a row. That gets old really really quickly.
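The arithmetic here checks out. A back-of-the-envelope sketch, using illustrative deployment counts (300/year prod, 300/month preprod; the thread only says "hundreds"):

```python
def p_at_least_one(p_single: float, n_trials: int) -> float:
    """Chance an event with per-trial probability p_single happens
    at least once across n_trials independent trials."""
    return 1.0 - (1.0 - p_single) ** n_trials

# A 0.1% per-deployment failure chance compounds fast:
print(f"prod, 1 year:    {p_at_least_one(0.001, 300):.1%}")   # ~25.9%
print(f"preprod, 1 year: {p_at_least_one(0.001, 3600):.1%}")  # ~97.3%
```

In other words, at those volumes a "one in a thousand" deployment failure is close to a coin flip per quarter in preprod, which is exactly the "winning the lottery this whole time" feeling.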


> SO_LINGER. Lingers on close if data is present. If this option is enabled and there is unsent data present when close() is called, the calling application program is blocked during the close() call, until the data is transmitted or the connection has timed out.

I had to look this one up for a refresher, but I 100% violently agree: such behavior certainly warrants a bug report.
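For reference, here is what flipping that option looks like with Python's stdlib socket module. This is a sketch of the general technique, not Hashicorp's code: with `l_onoff=1, l_linger=0`, `close()` aborts the connection (typically with an RST) instead of walking through the FIN handshake, so nothing is left tying up the port; a nonzero `l_linger` instead makes `close()` block up to that many seconds while unsent data drains.

```python
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# struct linger { int l_onoff; int l_linger; } -- here: on, zero timeout.
hard_close = struct.pack("ii", 1, 0)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, hard_close)

sock.close()  # aborts the connection immediately; no lingering socket state
```

The trade-off is that an RST close can discard unsent data and surprise the peer, which is why it is usually reserved for shutdown paths where you would rather free the port than be polite.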


The obnoxious thing about CLOSE_WAIT is that, unlike TIME_WAIT, it never times out on its own; the socket sits there until the owning process finally closes it. I gave up and kicked the box out of the cluster after a half hour of trying to ask it nicely. Which is probably what everyone else does.



