Interesting reading. The key points I pulled out on how the incident actually happened were:
- They had two links from their System Flight Server, one for redundancy if one goes down. Both went down at the same time, apparently unprecedented for them.
- There was a limit on a shared system resource (Atomic Functions) that was defined twice, in different systems, with different magic numbers. The problem wasn't spotted earlier because a recent system change brought one of the systems close to the limit for the first time. (More military controller functions had been amalgamated into NATS the previous month.)
- There was a UX problem where the "Select sectors" button, which is used often, is placed directly beside the "soft Sign off" button, which isn't often used, and in fact was well known to be pressed by accident relatively often. It was pressed at the time of the incident, putting the system into an illegal state and hence triggering an automatic shutdown.
These are problems that, on their own, you could argue aren't showstoppers, but when triggered together they cause things like this.
Exactly my understanding of the report, except that the system (actually the workstation) wasn't in an illegal state; it went into a special mode called the "watching" state, which is a perfectly legal state. However, this mode accessed the System Flight Server through a different code path, which had a different (hardcoded?) original magic number of 151, not the new capacity of 193.
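To make the mismatch concrete, here is a minimal Python sketch of two code paths checking the same resource against different hardcoded limits. Only the values 151 and 193 come from the report as discussed here; the function names, the exception, and the example count of 160 are my own assumptions, and the real system is certainly not structured like this.

```python
# Hypothetical sketch: one shared resource, two independently hardcoded limits.
# Only the numbers 151 and 193 come from the discussion; all names are invented.

NEW_CAPACITY = 193     # limit used by the normal code path
WATCHING_LIMIT = 151   # stale limit left behind on the "watching" code path


class LimitExceeded(Exception):
    """Raised when a path believes the Atomic Function count is illegal."""


def check_normal_path(atomic_functions: int) -> None:
    # The normal path was updated to the new capacity.
    if atomic_functions > NEW_CAPACITY:
        raise LimitExceeded(f"{atomic_functions} > {NEW_CAPACITY}")


def check_watching_path(atomic_functions: int) -> None:
    # The rarely used path still compares against the original number.
    if atomic_functions > WATCHING_LIMIT:
        raise LimitExceeded(f"{atomic_functions} > {WATCHING_LIMIT}")


if __name__ == "__main__":
    count = 160  # assumed example value between the two limits
    check_normal_path(count)  # passes silently
    try:
        check_watching_path(count)
    except LimitExceeded as exc:
        print("watching path rejects the same count:", exc)
```

Any count between the two limits is fine on the common path and fatal on the rare one, which would explain why nothing showed up until the load grew and someone hit the rare path. Defining the limit once and importing it in both places would remove this class of bug.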
Yes, the system going into "watching mode" with more than 151 Atomic Functions caused the illegal state.
It's interesting where in the flow the number of Atomic Functions was being checked against the theoretical maximum - surely that would be checked if you pressed "Select sectors" too, though perhaps with the correct higher limit rather than the incorrect limit.
> They had two links from their System Flight Server, one for redundancy if one goes down. Both went down at the same time, apparently unprecedented for them.
What I understood here is that they were redundant systems, but running the same software with the same bug present, so both went down. Redundant hardware can only protect you against hardware failures.
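A toy Python sketch of that point, assuming (my assumption, not something the report spells out in this thread) that failover simply retries the same request on the standby. The class, the function, and the example counts are all hypothetical.

```python
# Hypothetical sketch: failover between two replicas running identical software.
# Names and structure are invented for illustration only.

class FlightServer:
    LIMIT = 151  # the same stale limit is compiled into every replica

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True

    def handle(self, atomic_functions: int) -> bool:
        # Identical code means an identical reaction to the same input.
        if atomic_functions > self.LIMIT:
            self.up = False  # this replica shuts itself down
            return False
        return True


def send_with_failover(servers, atomic_functions):
    """Try each live server in order; return the name of the one that coped."""
    for server in servers:
        if server.up and server.handle(atomic_functions):
            return server.name
    return None  # every replica failed on the same request


if __name__ == "__main__":
    pair = [FlightServer("primary"), FlightServer("standby")]
    print(send_with_failover(pair, 100))
    print(send_with_failover(pair, 160))
    print([s.up for s in pair])
```

The standby helps if the primary loses power, but a request that triggers a software fault is just replayed against the same fault.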