
It's great to see a postmortem with this level of detail, fairly quickly. It's also great to see Joyent hang the blame on the system that allowed rebooting every server, and the poor recovery from that failure, rather than continuing to throw the operator under the bus:

  "...we will be rethinking what tools are necessary over
  the coming days and weeks so that "full power" tools
  are not the only means by which to accomplish routine tasks."


More importantly, I hope that the person that issued this command still keeps his job. He learned an important lesson and unless there is sheer incompetence here, this individual will have a medal on his chest indicating he has been in the worst "combat" that a sysadmin can endure.

Everyone makes mistakes, and judging by the language of the postmortem it appears it was just that.

Would love to have that sysadmin on my team, because he will never do that again....


We debated whether or not to make this explicit in the postmortem, but yes, the operator in question still has their job, and for exactly the reasons that you outlined: it was an honest mistake, they were deeply apologetic (as one might imagine) -- and we know that they (certainly!) won't be making that mistake again. Mistakes like this are their own punishment; additional punitive action serves only to instill fear rather than effect the changes necessary to not repeat the failure.


"Five whys" might be appropriate to suss out how the ultimate mistake was even possible.


It is said, perhaps apocryphally, that the head of the trading desk at Mizuho, when asked whether he fired the woman who physically keyed in an order for 600k shares at 1 yen rather than an order for 1 share at 600k yen ($300 million or so in losses), said "Why would I do that, after spending $300 million to make her the least likely person in Japan to typo trade instructions."


That quote is interesting because the conclusion doesn't follow. It's basically the Monte Carlo fallacy.

She's probably less likely to repeat the mistake than before, due to fear or guilt, but I don't know that this makes her less error-prone than every other person in the country, including those who double-check every time, for instance.


>continuing to throw the operator under the bus

There has been no evidence of them doing this at all. I don't know why people keep saying it.


That was poorly phrased... In my mind, I meant it in a "sadly it's too often that the human making the mistake gets blamed, rather than the systems that didn't have appropriate safeguards to make said mistake non-trivial"


They threw an open-source contributor under the bus for rejecting a pull request that changed a code comment. Knee-jerk responses are definitely part of their modus operandi.


Agreed. They seem to be good at threatening to fire employees of other companies, but not their own. I suppose that's a net positive. Still.



