It's hard when it's a cobbled-together set of systems sending signals to each ot...

pmlnr · on June 20, 2018

No, it's not. Make every step check for a kill-switch entry in a shared resource (say, ZooKeeper, MySQL, etc.) before it acts. If the entry is there, halt. A human can then do the manual cleanup.
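A rough sketch of that check-before-act idea, in Python. Everything named here is an assumption for illustration: `InMemoryStore` is just a stand-in for the shared resource (a real setup would query ZooKeeper, MySQL, or similar), and `KILL_SWITCH_KEY`, `run_step`, and `run_pipeline` are hypothetical names.

```python
class KillSwitchEngaged(Exception):
    """Raised when a step finds the kill-switch entry and refuses to act."""


class InMemoryStore:
    """Stand-in for the shared resource (ZooKeeper, MySQL, ...). Hypothetical."""

    def __init__(self):
        self._keys = set()

    def exists(self, key):
        return key in self._keys

    def put(self, key):
        self._keys.add(key)


KILL_SWITCH_KEY = "/ops/kill-switch"  # hypothetical key name


def run_step(store, name, action):
    """Check for the kill-switch entry first; act only if it is absent."""
    if store.exists(KILL_SWITCH_KEY):
        raise KillSwitchEngaged(f"step {name!r} halted: kill switch is set")
    return action()


def run_pipeline(store, steps):
    """Run steps in order; return the names of the steps that completed."""
    completed = []
    for name, action in steps:
        try:
            run_step(store, name, action)
        except KillSwitchEngaged:
            break  # stop here and leave the cleanup to a human
        completed.append(name)
    return completed
```

Note that the check only happens before each step starts, so a step that is already running when the entry appears will still finish. Whether that is acceptable depends on the system.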

mirceal · on June 20, 2018

Yes and no. You’re assuming all systems have a sane way of integrating. You’re also assuming that whoever built this has knowledge of all moving parts and has thought through all edge cases.

When you put in the killswitch, is this for a step, for a workflow, for a subsystem, for the whole thing? It’s alluring to think that you have the option to stop everything, but do you really want to stop everything if there are thousands of things happening in the system? Can a human think through what happens to what in this situation? If the builders of the system missed it, what are the odds that an operator will catch it?

For example: gas stations. They have a big red button that stops everything. If a pump is on fire, it makes sense. But if the trash can near the pump is full and you cannot throw your garbage away, would anyone stop everything so the can could be emptied?
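To make the scope question concrete, here is a minimal Python sketch of one possible design: a switch can be engaged at any level (whole system, subsystem, workflow, step), and a step halts if any of its ancestors is switched off. The tuple-path scheme and the name `is_halted` are assumptions for illustration, not something any particular system does.

```python
def is_halted(engaged, path):
    """Return True if the path itself or any enclosing scope is switched off.

    `engaged` is a set of tuples naming switched-off scopes; the empty
    tuple () stands for the whole system. `path` is a tuple such as
    (subsystem, workflow, step).
    """
    return any(tuple(path[:i]) in engaged for i in range(len(path) + 1))


# Only pump 3 is switched off; the rest of the station keeps running.
engaged = {("pumps", "pump-3")}
print(is_halted(engaged, ("pumps", "pump-3", "dispense")))
print(is_halted(engaged, ("pumps", "pump-4", "dispense")))
```

This only gives the operator finer-grained buttons. It does not answer which button is safe to press, which is the harder part.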