You didn’t explain how node failures can be handled. In k8s a new pod is started...

cookiecaper · on June 19, 2019

"Node failures" have been handled much more elegantly than k8s's simple "kill and rebuild" approach via live migration for at least a dozen years. [0] Wikipedia lists 15 hypervisors that support it. [1]

If you're interested, Red Hat has a very thorough guide on how to achieve this with free and open-source software. While you have to run VMs rather than containers, it's much more robust. [2] There are proprietary options too.

And then, there's also the good old fallback method that k8s uses: just divert traffic to healthy nodes and fire up a replacement. There are many frameworks for that simple model, and there's no reason to pretend that it's done exclusively with handcrafted "buggy shell scripts", nor to pretend that "buggy shell scripts" are inherently worse than "incorrect YAML configs that confused k8s and killed everything in our cluster" (see OP for a compendium of such incidents).
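To make the "divert traffic and fire up a replacement" model concrete, here is a minimal sketch of one pass of such a loop. Everything in it is hypothetical and for illustration only: `Pool` stands in for whatever load balancer you use, and `is_healthy` and `start_replacement` are callbacks you would supply yourself. It is not how k8s or any particular framework implements this.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Pool:
    """Hypothetical load-balancer pool: only nodes listed here receive traffic."""
    nodes: List[str] = field(default_factory=list)


def reconcile(
    pool: Pool,
    is_healthy: Callable[[str], bool],
    start_replacement: Callable[[], str],
) -> List[str]:
    """Run one pass: drop unhealthy nodes, then start one replacement per failure.

    Returns the names of the nodes that were removed.
    """
    healthy = [n for n in pool.nodes if is_healthy(n)]
    failed = [n for n in pool.nodes if n not in healthy]

    # Divert traffic: only healthy nodes stay in the pool.
    pool.nodes = healthy

    # Fire up a replacement for each failed node (assumed to be ready on return).
    for _ in failed:
        pool.nodes.append(start_replacement())

    return failed
```

In practice you would call `reconcile` on a timer and only add a replacement to the pool once it passes its own health check; the sketch skips that wait to stay short.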

[0; PDF] https://www.usenix.org/legacy/event/nsdi05/tech/full_papers/...

[1] https://en.wikipedia.org/wiki/Live_migration

[2] https://access.redhat.com/documentation/en-us/red_hat_enterp...

zaphirplane · on June 19, 2019

You can’t do live migration from a dead host.

You say things are possible and provide links to building blocks, but you haven’t said how you do it end to end.