Sounds like the machine worked perfectly. It did what it was supposed to do and followed the rules laid out by us humans.
The problem is in the way the workflow is described and implemented, and even then we would need to see some numbers on how many times the workflow ran versus how many issues it hit.
It's hard when it's a cobbled-together set of systems sending signals to each other, some of which are automated by human robots (e.g. main system sees exit, files a dozen tickets, some of these get picked up by different automation, some of them cause humans to trigger other processes, each of which can again file half a dozen tickets in three different systems...)
No, it's not. Make every step check for a kill-switch entry in a shared resource (say ZooKeeper, MySQL, etc.) before it acts. If the entry is there, halt. A human can then do the manual cleanup.
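To make that concrete, here is a rough sketch of the idea, not anyone's actual implementation. `InMemoryStore` is a made-up stand-in for the shared resource (ZooKeeper, MySQL, whatever you have), and the key name and `run_step` helper are hypothetical too.

```python
class KillSwitchEngaged(Exception):
    """Raised when a step refuses to run because the kill switch is set."""


class InMemoryStore:
    """Hypothetical stand-in for the shared resource (ZooKeeper, MySQL, ...)."""

    def __init__(self):
        self._keys = set()

    def set(self, key):
        self._keys.add(key)

    def exists(self, key):
        return key in self._keys


KILL_SWITCH_KEY = "workflow/kill-switch"  # hypothetical key name


def run_step(store, name, action):
    """Run one step only if the kill-switch entry is absent."""
    if store.exists(KILL_SWITCH_KEY):
        # Halt and leave the cleanup to a human.
        raise KillSwitchEngaged(f"step {name!r} halted; manual cleanup needed")
    return action()


store = InMemoryStore()
log = []

# Normal operation: no kill-switch entry, so the step runs.
run_step(store, "file-ticket", lambda: log.append("ticket filed"))

# An operator sets the entry; every later step halts before acting.
store.set(KILL_SWITCH_KEY)
try:
    run_step(store, "wipe-host", lambda: log.append("host wiped"))
except KillSwitchEngaged as exc:
    log.append(str(exc))
```

Note this only works if every step, in every system, goes through the same check, and it only stops steps that have not started yet.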
Yes and no. You’re assuming all systems have a sane way of integrating. You’re also assuming that whoever built this has knowledge of all moving parts and has thought through all edge cases.
When you put in the kill switch, is it for a step, for a workflow, for a subsystem, or for the whole thing? It’s alluring to think that you have the option to stop everything, but do you really want to stop everything if there are thousands of things happening in the system?
Can a human think through what happens to what in this situation? If the builders of the system missed it, what are the odds that an operator will catch it?
For example: gas stations. They have a big red button that stops everything. If a pump is on fire, that makes sense. If the trash can near the pump is full and you cannot throw your garbage away, would anyone stop everything so the can can be emptied?
Sometimes there is a disconnect between the people who decide the "what" and the people who implement it. Also, sometimes, what starts as a reasonable system evolves over time to a point where it's stupid hard to think of all the edge cases.