
If you want to reason about a world that has random software and hardware failures, then you cannot really have any kind of pure results. A backhoe could cut your network cable at exactly the wrong point, or a malfunctioning network switch could insert the right extra few bytes in exactly the wrong place, changing the meaning of your message without altering any of the checksums. As the scale of your application increases, the chance of this sort of chaos increases as well. The question then becomes how to reason in the face of chaos, what sorts of error rates are acceptable, and how to build systems that can recover from supposedly impossible states. If your bank's software makes an error, the bank has established processes to detect it and correct the account balances.
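The "without altering any of the checksums" scenario is more plausible than it sounds for weak checksums: the Internet checksum (RFC 1071) used by TCP and UDP is a ones'-complement sum of 16-bit words, so any reordering of those words leaves it unchanged. A minimal sketch (the messages are made up for illustration):

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

msg = b"pay Alice $1000;pay Bob $2 today"
# Swap two 16-bit words: different bytes on the wire, same checksum.
words = [msg[i:i + 2] for i in range(0, len(msg), 2)]
tampered = b"".join([words[1], words[0]] + words[2:])

assert msg != tampered
assert internet_checksum(msg) == internet_checksum(tampered)
```

This is why end-to-end integrity usually needs a stronger check (CRC32 at minimum, or a cryptographic hash) layered above the transport.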


This line of "what if an asteroid hits the primary & DR data centers in the same microsecond" thinking is why we settled on running our product on 1 VM with SQLite in-proc.

After taking our customers through this same kind of apocalyptic rabbit hole conversation, they tend to agree with this architecture decision.

The cost of anticipating the 0.00001% event that might never come is completely drowned out by the certain, daily headache of managing a convoluted, multi-cloud cluster.

Many times the business owners will get the message and finally reveal that they have always had access to a completely ridiculous workaround involving literal paper & pen that is just as feasible in 2023 as it was in the 18th century.


In my experience the customers, and even the POs, are the easy ones to convince. “We get 99% of the uptime for 30% of the price? Great!”

It’s the resume-driven mid dev in the next office you’ve got to watch out for.


Or the dev who spent a few nights and weekends rescuing the system after one of those 1% failures that the customer, as it turns out, has no patience for at all.


Disaster recovery is just one of many things that is much simpler in non-distributed systems.

You seem to be confusing a system that produces bad results 1% of the time with a system that's down 1% of the time. If you can only write the first kind of non-distributed system, you're in for a bad trip if you try to write a distributed equivalent.


Last month I had a deployment go wrong on one box, and the part of deployments outside of my control is all or nothing. No partial credit for 96% success. Some random consul call consumed the port we listen on, a shutdown timeout expired and the process was killed, and so that socket was left in CLOSE_WAIT (like seriously, Hashicorp, SO_LINGER has been around for at least 30 years).

This led to an existential crisis because given the number of ports we open and the number of machines we run and the number of processes per machine, there must be over a 0.1% chance of any deployment blowing up this way. We do hundreds a year in prod and probably hundreds a month in preprod. We've been winning the lottery this whole time.

Throw enough events around and a one in a million corner case will happen every week, every day, twice a day, three times in a row. That gets old really really quickly.
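The arithmetic here checks out. A back-of-the-envelope sketch, using illustrative deployment counts (300/year prod, 300/month preprod; the thread only says "hundreds"):

```python
def p_at_least_one(p_single: float, n_trials: int) -> float:
    """Chance an event with per-trial probability p_single happens
    at least once across n_trials independent trials."""
    return 1.0 - (1.0 - p_single) ** n_trials

# A 0.1% per-deployment failure chance compounds fast:
print(f"prod, 1 year:    {p_at_least_one(0.001, 300):.1%}")   # ~25.9%
print(f"preprod, 1 year: {p_at_least_one(0.001, 3600):.1%}")  # ~97.3%
```

In other words, at those volumes a "one in a thousand" deployment failure is close to a coin flip per quarter in preprod, which is exactly the "winning the lottery this whole time" feeling.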


> SO_LINGER. Lingers on close if data is present. If this option is enabled and there is unsent data present when close() is called, the calling application program is blocked during the close() call, until the data is transmitted or the connection has timed out.

I had to look this one up for a refresher, but I 100% violently agree: such behavior certainly warrants a bug report.
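For reference, here is what flipping that option looks like with Python's stdlib socket module. This is a sketch of the general technique, not Hashicorp's code: with `l_onoff=1, l_linger=0`, `close()` aborts the connection (typically with an RST) instead of walking through the FIN handshake, so nothing is left tying up the port; a nonzero `l_linger` instead makes `close()` block up to that many seconds while unsent data drains.

```python
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# struct linger { int l_onoff; int l_linger; } -- here: on, zero timeout.
hard_close = struct.pack("ii", 1, 0)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, hard_close)

sock.close()  # aborts the connection immediately; no lingering socket state
```

The trade-off is that an RST close can discard unsent data and surprise the peer, which is why it is usually reserved for shutdown paths where you would rather free the port than be polite.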


The obnoxious thing about CLOSE_WAIT is that, unlike TIME_WAIT, it never times out on its own; the socket sits there until the owning process finally closes it. I gave up and kicked the box out of the cluster after a half hour of trying to ask it nicely. Which is probably what everyone else does.



