It's great to see a postmortem with this level of detail, and so quickly. It's also great to see Joyent hang the blame on the system that allowed every server to be rebooted, and on the poor recovery from that failure, rather than continuing to throw the operator under the bus:
"...we will be rethinking what tools are necessary over
the coming days and weeks so that "full power" tools
are not the only means by which to accomplish routine tasks."
More importantly, I hope that the person who issued this command keeps his job. He learned an important lesson, and unless there is sheer incompetence here, this individual will wear a medal on his chest indicating he has been through the worst "combat" a sysadmin can endure.
Everyone makes mistakes, and judging by the language of the postmortem it appears it was just that.
Would love to have that sysadmin on my team, because he will never do that again....
We debated whether or not to make this explicit in the postmortem, but yes, the operator in question still has their job, and for exactly the reasons that you outlined: it was an honest mistake, they were deeply apologetic (as one might imagine) -- and we know that they (certainly!) won't be making that mistake again. Mistakes like this are their own punishment; additional punitive action serves only to instill fear rather than effect the changes necessary to not repeat the failure.
It is said, perhaps apocryphally, that the head of the trading desk at Mizuho, when asked whether he fired the woman who physically keyed in an order for 600k shares at 1 yen rather than an order for 1 share at 600k yen ($300 million or so in losses), said "Why would I do that, after spending $300 million to make her the least likely person in Japan to typo trade instructions."
That quote is interesting because the conclusion doesn't follow. It's basically the Monte Carlo fallacy.
She's probably less likely than before due to fear or guilt, but I don't know that this makes her less error-prone than every other person in the country, including those who double-check every time, for instance.
That was poorly phrased... I meant it in a "sadly, too often the human making the mistake gets blamed, rather than the systems that lacked the safeguards to make that mistake non-trivial" sense.
They threw an open-source contributor under the bus for rejecting a pull request that changed a code comment. Knee-jerk responses are definitely part of their modus operandi.
>'In our experience, platforms with this network device will encounter this boot-time issue about 10% of the time. The mitigation for this is for an operator to simply initiate another reboot, which we performed on those afflicted nodes as soon as we identified them.'
This bit bothers me more than anything else. It's not just a configuration that rolls the dice; it's a known one that requires an operator to do the re-rolling.
Everything is a value judgement I guess, but knowingly leaving that one 'mitigated' would drive me insane.
Sorry if we didn't adequately convey our frustration with this particular issue. It's one that's been with us for a long time and it absolutely sucks -- and after trying (and failing) to work with the vendor to get the issue understood (it's essentially a firmware-level issue), we ultimately decided to move away from that particular NIC vendor entirely. If we could wave a wand and be rid of these particular parts, we gladly would -- but until then, this transient boot-time issue needs to be manually mitigated with an additional reboot.
Why the shyness about saying the brand's name? If their product and support was subpar to the point of blacklisting the entire vendor, it could be useful to spread the information to A) warn others of potential problems and B) put pressure on the vendor to improve their products.
If I had to guess, it's Broadcom. While their merchant switching ASICs (Trident+, Trident2) have become good enough to displace most custom spun ASICs for 10 Gbps and 40 Gbps switching, their NIC hardware has long been somewhat of a disaster. Interesting to note is that Broadcom has basically sold the NIC business to QLogic: http://www.broadcom.com/press/release.php?id=s832628
All vendors have bad products from time to time, but they deal with it differently. I can think of one vendor who covered up bugs in their silicon with undocumented driver hacks in Windows, and stonewalled Linux kernel devs on the nature of the faults, for example.
Since I don't have deep pockets I shan't name the vendor, but it started out with us taking delivery of a bunch of blades where our blade supplier had changed the NIC from one vendor to another. We discovered we had repeatable problems with J2EE cluster traffic over UDP - we'd see packet loss rates go to absurd levels as we ramped up load on the cluster, leading to a situation where a node dropping out and then rejoining would cause the cluster to lock up trying to bring the node back into the fold. We could reproduce the massive packet loss using UDP test tools. Rather ugly.
Coincidentally we had a visit from the CTO of a high-performance storage vendor who happened to have a bunch of kernel hackers on their staff. We mentioned our problem in passing and he explained how they'd nearly lost a major contract because a deployment had moved their customer from being storage-bound to suffering initially untraceable data loss. Digging around by their kernel hackers showed the NIC was losing data. After a certain amount of to-ing and fro-ing the vendor moved from denying the problem to admitting that their silicon had a defect that would throw away data under load, and that their Windows drivers tried to spackle over the problem. They were relying on no one being able to drive enough load through the card to cause a problem.
This dovetailed with our experience, and we found that installing a different manufacturer's card in the blade let us work around the problem. Our blade supplier moved to a new NIC vendor subsequently.
My condolences, we had a very similar if not identical issue with some NICs, and also ended up switching vendors. The smoke coming from our team's ears was enough to harvest a small beehive for honey.
>The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter. Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay.
"To make error is human. To propagate error to all server in automatic way is #devops." -@DEVOPS_BORAT [1]
I don't have a lot of experience with datacenters, and am trying to understand this sentence:
"Because there was a simultaneous reboot of every system in the datacenter, there was extremely high contention on the TFTP boot infrastructure, which like all of our infrastructure, normally has throttles in place to ensure that it cannot run away with a machine."
What does "cannot run away with a machine" mean? Why would you want to by default restrict the speed at which that system runs?
The throttle that we're referring to there is a CPU throttle. When we provision an OS-virtualized instance, there is a default throttle to prevent it from consuming a disproportionate amount of CPU on the box. The instance that runs TFTP on the headnode was provisioned as a relatively small instance (it needs very little DRAM), which also gave it (by default) a CPU throttle that restricted its CPU utilization. Normally, of course, this isn't an issue -- but normally we don't try to TFTP boot the entire datacenter at once. This issue was obvious immediately (thanks, DTrace!), and we resolved it on the headnode by manually raising the throttle, and will be making the fix in SmartDataCenter itself as well. Does that answer your question?
Exactly; it uses all of the same infrastructure, actually -- it's just provisioned on a different network (namely, the admin network). It's also worth noting that OS-based virtualization helped us here not only because of the global visibility we get with DTrace (which immediately indicated that TFTP was waiting for CPU), but also because we could dynamically adjust the throttle and simply give it more CPU without having to bounce the box and interrupt all of the TFTP booting in progress. It was a small amount of solace on what was easily the worst day we've had in a while, if not ever...
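For anyone curious what that looks like in practice, here is a rough sketch on an illumos/SmartOS box. The zone name "tftp0" and the <uuid> are placeholders, and these are illustrative commands rather than the exact procedure that was used:

# The symptom shows up directly with DTrace: count threads being put to
# sleep by their CPU cap, per zone (the sched provider's cpucaps-sleep
# probe on illumos).
dtrace -n 'sched:::cpucaps-sleep { @[zonename] = count(); }'

# Inspect the current CPU cap on the zone running the TFTP service.
prctl -n zone.cpu-cap -i zone tftp0

# Raise the cap on the fly, without restarting the zone; the value is in
# percent of a single CPU, so 800 means "up to eight CPUs' worth".
prctl -n zone.cpu-cap -t privileged -v 800 -r -i zone tftp0

# On SmartOS, vmadm can make the new cap persistent across zone reboots.
vmadm update <uuid> cpu_cap=800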
Not an accident -- I'm a well-known awk lover, and it was very much an inspiration for DTrace. (Even down to the documentation: if you haven't read it recently, re-read "The AWK Programming Language"[1] -- it's beautifully concise.)
Oh man Bryan I am so wary of sharing this with you on account of it being so wrong, gross, and buggy... But WTH: https://github.com/msliczniak/onetenth Sorry about the rough day you all had - cheers Mike
I think they're trying to say that the machine has duties besides TFTP, and that the process runs at a priority low enough that it can be pre-empted by other processes. So if there is a lot of contention on the TFTP boot infrastructure, the throttles limit the amount of work a single machine can do, which in turn can cause starvation.
The reason why is probably related to what it does besides TFTP, which under normal circumstances is probably more important than new nodes joining the network.
During a mass reboot that will bite you, because the chances of starvation go up quite a bit as the throttles cause one machine after another to time out and retry its boot sequence.
I also have no experience, but my guess is that when you are running normally, you want to share the bandwidth of a server fairly between all clients. When you've just accidentally rebooted all servers, you want to do it completely unfairly - so that instead of 10 computers taking an hour to download something, the first takes 6 minutes, the second takes 12 minutes etc. This allows at least some of the servers to get started and your infrastructure to start getting back online.
I.e. you want to change from round robin scheduling to shortest job first scheduling (which gives the minimum average waiting time).
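To put rough numbers on that: with ten identical six-minute downloads, perfectly fair sharing gives each one a tenth of the bandwidth, so they all finish at the 60-minute mark and the average completion time is 60 minutes. Serving them one at a time finishes them at 6, 12, ..., 60 minutes, for an average of (6 + 12 + ... + 60) / 10 = 33 minutes, which is why shortest-job-first minimizes average waiting time.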
They wanted to keep all of the servers from contacting the PXE server at the same time and overloading it. It's basically a way to prevent an accidental self-DDoS.
>The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter.
Substitute 'reboot' with 'upgrade to Win 7' and 'datacenter' with 'university' and you get a story from a month(?) ago.
rm ./*.* - delete all files in current dir
rm /*.* - never do this
Can you spot the difference? A colleague did the latter on our testing server. No clients were disturbed, but we devs were left working on things that can run locally (much more pleasant stuff for sure, yet the schedule suffered a lot).
Oh shit, that is production, not staging. I got a surge of adrenaline and hit Ctrl+C within a couple of seconds.
Thankfully it only completed the very first part of the process that prevents new users from logging in and puts up the maintenance page. No playing users were kicked off the game servers. All I had to do was run
./realmctl.py allow
to get new logins to start working again.
This is what taught us that we need to have an extra confirmation for actions on the production realm.
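Something as simple as a wrapper that forces you to retype the realm name goes a long way. A minimal sketch (the wrapper and its name are made up for illustration, not our actual tooling):

#!/bin/sh
# confirm-realm.sh: refuse to run a command against production unless
# the operator retypes the realm name. Purely illustrative.
realm="$1"; shift
if [ "$realm" = "production" ]; then
    printf 'About to run "%s" on PRODUCTION. Retype the realm name to confirm: ' "$*"
    read -r answer
    [ "$answer" = "production" ] || { echo "Aborted." >&2; exit 1; }
fi
"$@"

Day-to-day invocations then go through the wrapper, e.g. ./confirm-realm.sh staging <command...>, and anything aimed at production stops and waits for a human.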
Your 'never do this' line would erase initrd.img on most machines, which you'd only find out about after the next reboot if nobody told you about it (and good luck getting that fixed).
There are a large number of varieties of this particular error. Some with terrible results.
rm -rf * .bak
for instance (especially when executed in the root directory).
That '#' prompt is there for a reason.
The way to solve this sort of issue is to first list the files using 'find' until you're totally happy with the result, and then to use 'rm' as the command passed to find.
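For the stray-space *.bak example above, that workflow looks roughly like this (assuming a find that supports -maxdepth, which the GNU and BSD versions both do):

# Step 1: list exactly what would be removed, and stare at it.
find . -maxdepth 1 -name '*.bak' -print

# Step 2: only once the list looks right, hand the same expression to rm.
find . -maxdepth 1 -name '*.bak' -exec rm -- {} +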
It would also erase everything else on the box, so you might not know for a few seconds, but then the database errors start, the lib and pid files start disappearing, and now you're the king of a mountain of shit. It won't take until the next reboot to notice.
If you ever get a chance, rm -rf / a box before you throw it out and just play around with it for a couple minutes while it eats itself.
You can spare yourself most errors like that with tab completion, the built-in sanity checker. I tab-complete everything since I'm dyslexic, and it saves me a shit ton of time because I'm never far from the error when I notice it.
> It would also erase everything else on the box, so you might not know for a few seconds, but then the database errors start, the lib and pid files start disappearing, and now you're the king of a mountain of shit. It won't take until the next reboot to notice.
Yes, it will, because the original post didn't include -r. So it's only deleting things that match the glob in the root directory. On many systems, that is nothing.
My point was that the lack of -r doesn't really change anything; even with -r, it would still only delete initrd.img and similar, which would take until the next reboot to notice.
A colleague of mine once called me over because they were having trouble with their computer. Apparently while doing "a little tidying up" they decided to move the windows folder to somewhere else. I can't remember the details but for some reason it was impossible to get a command prompt (Windows NT maybe?) without a disk, which I didn't have. I think I had to rip the drive out and stick it in another machine to fix it.
You know what's also fun? Try deleting all files that start with a . in your current directory:
rm -rf .*
Would you expect this to go UPWARD? I never would have. That is, even if you're in /x/y/z it will traverse up the tree and start deleting /x and eventually /, because the .* glob also matches '..'.
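If the goal is just the dotfiles in the current directory, a safer pair of globs is the classic:

rm -rf .[!.]* ..?*

The first pattern matches names starting with a single dot (.bashrc, .git, ...), the second catches names starting with two dots, and neither can ever expand to "." or "..", so nothing walks up the tree.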
I'm not sure which design issue you're referring to; we absolutely did consider this scenario and the system behaved essentially as designed in this regard (in particular, the contingency that we can determine a compute node's platform image while the control plane was impaired was critical for our relatively quick recovery). The only real issue with the recovery of compute nodes was the CPU throttling that we encountered as the entire fleet rebooted simultaneously, but that was quickly discovered and remediated (and would have only resulted in a longer recovery time, not an inability to recover).
Edit: the comment that I was replying to was apparently deleted. I won't attribute it, but here is the text I was replying to:
I'm generally the last person to finger point but wow! You guys didn't think about and test what happens when a lot of your servers reboot simultaneously? That happens any time you lose power! This exact issue has been raised and discussed every single time I've deployed any substantial number of PXE booting machines. There are actually a number of issues with relying on PXE booting and this is just one of them. This design make is far worse than the mistyped command.
One of the worst days of my life was when one of my techs called me as I hopped on a plane...
Tech: "hey phil, I just broke $bigcustomer's database"
Phil: "..."
Tech: "I typed drop database ImportantDB;2"
Tech: "instead of drop database Importantdb2;"
Even worse, because this was, let's say, a very undersized install due to customer cheapness. And all we had for backups (since who wants to PAY for backups?!?!) was a day-old mysqldump off a slave. It took multiple days to re-import all that data. The customer was not pleased.
There should be zero ways for a single operator to reboot the whole DC. A simultaneous DC-wide reboot should only be possible when separate commands from separate operators are jointly submitted. TPI and all that.
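A back-of-the-napkin sketch of what that could look like (the script, paths, and names are invented for illustration, not any real tool):

#!/bin/sh
# dc-reboot.sh: illustrative two-person rule. A datacenter-wide reboot
# only proceeds if a *different* operator approved it within the last
# ten minutes.
APPROVAL=/var/run/dc-reboot.approval

case "$1" in
approve)
    # The first operator records a standing approval.
    id -un > "$APPROVAL"
    echo "Approval recorded for $(id -un)."
    ;;
execute)
    approver=$(cat "$APPROVAL" 2>/dev/null)
    if [ -z "$approver" ] || [ "$approver" = "$(id -un)" ]; then
        echo "Refusing: need an approval from a second operator." >&2
        exit 1
    fi
    if [ -n "$(find "$APPROVAL" -mmin +10 2>/dev/null)" ]; then
        echo "Refusing: the approval is older than ten minutes." >&2
        exit 1
    fi
    rm -f "$APPROVAL"
    echo "Two operators have signed off; issuing the reboot."
    # ... the actual fleet-wide reboot command would go here ...
    ;;
*)
    echo "usage: $0 approve|execute" >&2
    exit 2
    ;;
esac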
As a company that isn't security-focused, they probably don't see a problem with giving so much power to a single command that one person can issue.
Doesn't mean everyone is affected by it. If I was hosting some unimportant website on their platform, I probably wouldn't even notice, and it wouldn't be an inconvenience to me. Silly thing to feel so strongly about.
This is the same company that chose to publicly shame one of the best contributors to node.js into quitting because he closed a pull request changing a gendered pronoun in a code comment. They place very little value on engineering excellence and more on 'being cool'. This leads to half-baked tools using <buzzword technology> like the one described in the article that allowed every server to be simultaneously rebooted without confirmation.
That's the Joyent side explaining that they would have fired him over it. Here is a news outlet report of it saying that it was mainly a language barrier issue.
Either way, it was just shocking to see how quickly they were willing to throw under the bus a contributor who put more into libuv than all of Joyent combined, without even trying to understand his reasoning.
It's funny how Cantrill is so willing to throw someone else under a bus, yet he's the guy who once asked whether another developer had "ever kissed a girl".