> In one particular case at Google, a software controller–acting on bad feedback from another software system–determined that it should issue an unsafe control action. It scheduled this action to happen after 30 days. Even though there were indicators that this unsafe action was going to occur, no software engineers–humans–were actually monitoring the indicators. So, after 30 days, the unsafe control action occurred, resulting in an outage.
Isn't this the time they accidentally deleted governmental databases? I love the attempt at blameless generalization, but wow.
If you're referring to the time they nuked an Australian retirement fund's VMware setup, no, that was basically a billing screwup. An operator left a field blank, the system assumed that meant a 1-year expiry, and dutifully deleted the whole thing once the year was up.
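In pseudocode, the failure mode was something like this (a rough sketch of what was reported; the field names and the 1-year fallback are illustrative, not Google's actual code):

```python
from datetime import datetime, timedelta

DEFAULT_TERM = timedelta(days=365)  # implicit fallback nobody asked for

def provision_private_cloud(order_form: dict) -> dict:
    """Hypothetical provisioning call mirroring the reported failure mode."""
    term = order_form.get("subscription_term")  # operator left this blank
    if term is None:
        # The dangerous part: a missing value silently becomes a real
        # expiry date instead of raising a validation error.
        term = DEFAULT_TERM
    return {
        "customer": order_form["customer"],
        "expires_at": datetime.utcnow() + term,
    }

cloud = provision_private_cloud({"customer": "some-pension-fund"})
# A year later, a scheduler dutifully deletes anything past expires_at,
# even though no human ever chose that date.
```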
All mega-deletes should require explicit authorisation. A human should have to type the word "delete" before the action takes place. Otherwise the decision is effectively taken by a void created by complex interacting systems.
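Concretely, I mean a gate like this in front of any bulk destructive path (a sketch; the threshold and prompt wording are made up):

```python
import sys

MEGA_DELETE_THRESHOLD = 1_000  # hypothetical cutoff for "mega"

def confirm_mega_delete(resource_count: int) -> bool:
    """Require a human to literally type 'delete' before proceeding."""
    if resource_count < MEGA_DELETE_THRESHOLD:
        return True  # small deletes can stay automated
    print(f"About to delete {resource_count} resources.")
    answer = input("Type 'delete' to confirm: ")
    return answer.strip() == "delete"

if not confirm_mega_delete(resource_count=50_000):
    sys.exit("Aborted: no human confirmation.")
```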
Honestly, unless it's RTBF (right to be forgotten), no deletion should happen at all as long as you meet your reserve capacity of freshly silvered disks. Every defunct account should probably go to cold storage first.
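i.e. the account lifecycle would look roughly like this (a sketch; the names and states are invented):

```python
from enum import Enum

class State(Enum):
    ACTIVE = "active"
    COLD_STORAGE = "cold_storage"   # recoverable, cheap disks
    DELETED = "deleted"             # RTBF or capacity exhaustion only

def retire_account(account: dict, rtbf_request: bool, spare_disks: int) -> dict:
    """Defunct accounts go cold; only RTBF or disk exhaustion deletes."""
    if rtbf_request:
        account["state"] = State.DELETED        # legal obligation wins
    elif spare_disks > 0:
        account["state"] = State.COLD_STORAGE   # park it, don't destroy it
    else:
        account["state"] = State.DELETED        # capacity forces our hand
    return account
```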
There are sensible reasons for this in both cases, simple and complex.
If GCP is composed of 10-30 services (hypothetically), then keeping 5-10 employees whose job is to ensure mega-deletes are safe is not too much of a cost.
If GCP is composed of 500 services, then it is all the more important to have humans in the loop to ensure correct behaviour, so that complex interacting services don't take a wrong action.