I don’t agree with much in this writing other than that eventual consistency is a bad choice. Distributed systems are hard but in 2024 there are enough known patterns and techniques to make them less icky. Systems built on total ordering are much more tractable than weaker protocols. Mahesh Balakrishnan’s recent paper[0] on the Shared Log abstraction is a great recipe, for example.
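To make the "total ordering is tractable" point concrete, here's a toy sketch in the spirit of the shared-log idea (this is my own illustration, not the API from the paper): every replica replays one totally ordered, append-only sequence of entries, so all replicas converge to the same state deterministically, with no conflict resolution.

```python
class SharedLog:
    """A single totally ordered, append-only sequence of entries."""
    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1  # position in the total order


class Replica:
    """A state machine that replays the log in order."""
    def __init__(self, log):
        self.log = log
        self.applied = 0
        self.state = {}

    def catch_up(self):
        # Apply every entry we haven't seen yet, in log order.
        while self.applied < len(self.log.entries):
            key, value = self.log.entries[self.applied]
            self.state[key] = value
            self.applied += 1


log = SharedLog()
a, b = Replica(log), Replica(log)
log.append(("x", 1))
log.append(("x", 2))
a.catch_up()
b.catch_up()
assert a.state == b.state == {"x": 2}  # every replica agrees on the final value
```

Because the log imposes the order once, up front, the replicas never have to reconcile divergent histories — that's the tractability win.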
As an aside, I’ve never enjoyed the defeatist culture that permeates operations and distributed systems pop culture, which this post seems to reinforce.

I think the defeatism comes from practicality. I'm at a 30-person IT org trying to do distributed systems and eventual consistency. Don't take on complexity unless you have to. And eventual consistency requires a LOT of scale before it becomes a have-to.
I don't think it's necessarily a question of scale. Where I work, we have a lot of strategic partnerships, and all those partners have their own IT systems with their own master data. It's intractable to enforce strong consistency across all of these disparate systems that don't speak to one another, and you expressly don't want to take the whole substrate offline when a single partner has a network issue. The best you can really do is eventual consistency.
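A hypothetical sketch of why eventual consistency fits this partner scenario: each partner updates its own copy while disconnected, and the records merge later. The merge rule here is last-writer-wins by timestamp, which is just one common choice (real systems may use version vectors or domain-specific reconciliation); all names and numbers below are made up for illustration.

```python
def merge(local, remote):
    """Merge two replicas mapping key -> (timestamp, value); newest timestamp wins."""
    merged = dict(local)
    for key, (ts, value) in remote.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged


# Two partners update the same customer record while disconnected.
ours   = {"customer:42": (100, "old address")}
theirs = {"customer:42": (150, "new address")}

# Merging in either order yields the same result: the replicas converge
# without ever requiring both systems to be online at the same time.
assert merge(ours, theirs) == merge(theirs, ours) == {"customer:42": (150, "new address")}
```

The key property is that the merge is order-independent, so no partner's network outage can block anyone else from making progress.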
> I don’t agree with much in this writing other than that eventual consistency is a bad choice
Does it really matter whether it's bad or not? As far as I know, every database that scales beyond a single node (for performance) is eventually consistent. Otherwise you've got to wait for a sync between nodes before a response can be given, which would effectively force your cluster to have worse performance than running on a single node again.
> As far as I know, every database that scales beyond a single node (for performance) is eventually consistent
That's SO not true. Spanner and Amazon's S3 are some of the biggest databases on the planet, and they are strongly consistent.
> Otherwise you've got to wait for a sync between nodes before a response can be given, which would effectively force your cluster to have worse performance than running on a single node again.
Yes, you are trading latency for fault tolerance, but so what? What if the resulting latency is still more than good enough? There is no shortage of real large-scale applications where this is the case.
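A rough back-of-the-envelope sketch of that trade (my own illustration, with made-up latency numbers): a strongly consistent write that needs acks from a majority of replicas completes when the median-ish replica responds, not when the slowest one does, so a single straggler doesn't wreck the latency budget.

```python
def quorum_write_latency(replica_latencies_ms):
    """Latency of a write that needs acks from a majority of replicas."""
    acks_needed = len(replica_latencies_ms) // 2 + 1
    # The write completes when the acks_needed-th fastest ack arrives.
    return sorted(replica_latencies_ms)[acks_needed - 1]


# Five replicas; one is pathologically slow. The majority write never waits for it.
latencies = [2, 3, 4, 5, 200]
assert quorum_write_latency(latencies) == 4  # third-fastest ack; the 200 ms replica is ignored
```

A few extra milliseconds per write buys you strong consistency plus tolerance of a minority of failed or slow nodes, and for plenty of applications that is a perfectly fine price.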
0 - https://maheshba.bitbucket.io/papers/osr2024.pdf