Yep, every cluster approaching 10k that I know of has either pared back etcd's durability guarantees or rewritten and replaced it in some manner. The post actually goes into detail about doing exactly this, and the Alibaba paper they reference says about the same.

> Sharding total compute capacity into multiple isolated k8s clusters reduces the likelihood that a software bug is going to take down everything as you can carefully upgrade only a single cell at a time with bake periods between each cell.

Yeah, I've been meaning to try out something like Armada to simplify things on the cluster-user side. Cluster-providers have lots of tools to make managing multiple clusters easier, but if it means having to rewrite every batch job...