It's great to see a postmortem with this level of detail, and so quickly. It's also great to see Joyent hang the blame on the system that allowed every server to be rebooted, and on the poor recovery from that failure, rather than continuing to throw the operator under the bus:
"...we will be rethinking what tools are necessary over
the coming days and weeks so that "full power" tools
are not the only means by which to accomplish routine tasks."
More importantly, I hope that the person who issued this command keeps his job. He learned an important lesson, and unless there is sheer incompetence here, this individual will wear a medal on his chest indicating he has been through the worst "combat" a sysadmin can endure.
Everyone makes mistakes, and judging by the language of the postmortem it appears it was just that.
Would love to have that sysadmin on my team, because he will never do that again....
We debated whether or not to make this explicit in the postmortem, but yes, the operator in question still has their job, and for exactly the reasons that you outlined: it was an honest mistake, they were deeply apologetic (as one might imagine) -- and we know that they (certainly!) won't be making that mistake again. Mistakes like this are their own punishment; additional punitive action serves only to instill fear rather than effect the changes necessary to not repeat the failure.
It is said, perhaps apocryphally, that the head of the trading desk at Mizuho, when asked whether he fired the woman who physically keyed in an order for 600k shares at 1 yen rather than an order for 1 share at 600k yen ($300 million or so in losses), said "Why would I do that, after spending $300 million to make her the least likely person in Japan to typo trade instructions."
That quote is interesting because the conclusion doesn't follow. It's basically the Monte Carlo fallacy.
She's probably less likely than before due to fear or guilt, but I don't know that this makes her less error-prone than every other person in the country, including those who double-check every time, for instance.
That was poorly phrased... I meant it in a "sadly, too often the human making the mistake gets blamed, rather than the systems that lacked the safeguards to make that mistake non-trivial" sense.
They threw an open-source contributor under the bus for rejecting a pull request that changed a code comment. Knee-jerk responses are definitely part of their modus operandi.
>'In our experience, platforms with this network device will encounter this boot-time issue about 10% of the time. The mitigation for this is for an operator to simply initiate another reboot, which we performed on those afflicted nodes as soon as we identified them.'
This bit bothers me more than anything else. It's not just a configuration that rolls the dice; it's a known one that requires an operator to do the re-rolling.
Everything is a value judgement I guess, but knowingly leaving that one 'mitigated' would drive me insane.
Sorry if we didn't adequately convey our frustration with this particular issue. It's one that's been with us for a long time and it absolutely sucks -- and after trying (and failing) to work with the vendor to get the issue understood (it's essentially a firmware-level issue), we ultimately decided to move away from that particular NIC vendor entirely. If we could wave a wand and be rid of these particular parts, we gladly would -- but until then, this transient boot-time issue needs to be manually mitigated with an additional reboot.
Why the shyness about saying the brand's name? If their product and support was subpar to the point of blacklisting the entire vendor, it could be useful to spread the information to A) warn others of potential problems and B) put pressure on the vendor to improve their products.
If I had to guess, it's Broadcom. While their merchant switching ASICs (Trident+, Trident2) have become good enough to displace most custom spun ASICs for 10 Gbps and 40 Gbps switching, their NIC hardware has long been somewhat of a disaster. Interesting to note is that Broadcom has basically sold the NIC business to QLogic: http://www.broadcom.com/press/release.php?id=s832628
All vendors have bad products from time to time, but they deal with it differently. I can think of one vendor who covered up bugs in their silicon with undocumented driver hacks in Windows, and stonewalled Linux kernel devs on the nature of the faults, for example.
Since I don't have deep pockets I shan't name the vendor, but it started out with us taking delivery of a bunch of blades where our blade supplier had changed the NIC from one vendor to another. We discovered we had repeatable problems with J2EE cluster traffic over UDP - we'd see packet loss rates go to absurd levels as we ramped up load on the cluster, leading to a situation where a node dropping out and then rejoining would cause the cluster to lock up trying to bring the node back into the fold. We could reproduce the massive packet loss using UDP test tools. Rather ugly.
Coincidentally we had a visit from the CTO of a high-performance storage vendor who happened to have a bunch of kernel hackers on their staff. We mentioned our problem in passing and he explained how they'd nearly lost a major contract because a deployment had moved their customer from being storage-bound to suffering initially untraceable data loss. Digging around by their kernel hackers showed the NIC was losing data. After a certain amount of to-ing and fro-ing the vendor moved from denying the problem to admitting that their silicon had a defect that would throw away data under load, and that their Windows drivers tried to spackle over the problem. They were relying on no one being able to drive enough load through the card to cause a problem.
This dovetailed with our experience, and we found that installing a different manufacturer's card in the blade let us work around the problem. Our blade supplier moved to a new NIC vendor subsequently.
My condolences, we had a very similar if not identical issue with some NICs, and also ended up switching vendors. The smoke coming from our team's ears was enough to harvest a small beehive for honey.
>The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter. Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay.
"To make error is human. To propagate error to all server in automatic way is #devops." -@DEVOPS_BORAT [1]
I don't have a lot of experience with datacenters, and am trying to understand this sentence:
"Because there was a simultaneous reboot of every system in the datacenter, there was extremely high contention on the TFTP boot infrastructure, which like all of our infrastructure, normally has throttles in place to ensure that it cannot run away with a machine."
What does "cannot run away with a machine" mean? Why would you want to by default restrict the speed at which that system runs?
The throttle that we're referring to there is a CPU throttle. When we provision an OS-virtualized instance, there is a default throttle to prevent it from consuming a disproportionate amount of CPU on the box. The instance that runs TFTP on the headnode was provisioned as a relatively small instance (it needs very little DRAM), which also gave it (by default) a CPU throttle that restricted its CPU utilization. Normally, of course, this isn't an issue -- but normally we don't try to TFTP boot the entire datacenter at once. This issue was obvious immediately (thanks, DTrace!), and we resolved it on the headnode by manually raising the throttle, and will be making the fix in SmartDataCenter itself as well. Does that answer your question?
Exactly; it uses all of the same infrastructure, actually -- it's just provisioned on a different network (namely, the admin network). It's also worth noting that OS-based virtualization helped us here not only because of the global visibility we get with DTrace (which immediately indicated that TFTP was waiting for CPU), but also because we could dynamically adjust the throttle and simply give it more CPU without having to bounce the box and interrupt all of the TFTP booting in progress. It was a small amount of solace on what was easily the worst day we've had in a while, if not ever...
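For anyone curious what that looks like in practice, here is a rough sketch on an illumos/SmartOS box. The zone name "tftp0" and the <uuid> are placeholders, and these are illustrative commands rather than the exact procedure that was used:

# The symptom shows up directly with DTrace: count threads being put to
# sleep by their CPU cap, per zone (the sched provider's cpucaps-sleep
# probe on illumos).
dtrace -n 'sched:::cpucaps-sleep { @[zonename] = count(); }'

# Inspect the current CPU cap on the zone running the TFTP service.
prctl -n zone.cpu-cap -i zone tftp0

# Raise the cap on the fly, without restarting the zone; the value is in
# percent of a single CPU, so 800 means "up to eight CPUs' worth".
prctl -n zone.cpu-cap -t privileged -v 800 -r -i zone tftp0

# On SmartOS, vmadm can make the new cap persistent across zone reboots.
vmadm update <uuid> cpu_cap=800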
Not an accident -- I'm a well-known awk lover, and it was very much an inspiration for DTrace. (Even down to the documentation: if you haven't read it recently, re-read "The AWK Programming Language"[1] -- it's beautifully concise.)
Oh man Bryan I am so wary of sharing this with you on account of it being so wrong, gross, and buggy... But WTH: https://github.com/msliczniak/onetenth Sorry about the rough day you all had - cheers Mike
I think they're trying to say that the machine has duties besides TFTP, and that the process runs at a priority low enough that it can be pre-empted by other processes. So if there is a lot of contention on the TFTP boot infrastructure, the throttles limit the amount of work a single machine can do, which in turn can cause starvation.
The reason why is probably related to what it does besides TFTP, which under normal circumstances is probably more important than new nodes joining the network.
During a mass reboot that will bite you, because the chances of starvation go up quite a bit as the throttles cause one machine after another to time out and retry its boot sequence.
I also have no experience, but my guess is that when you are running normally, you want to share the bandwidth of a server fairly between all clients. When you've just accidentally rebooted all servers, you want to do it completely unfairly - so that instead of 10 computers taking an hour to download something, the first takes 6 minutes, the second takes 12 minutes etc. This allows at least some of the servers to get started and your infrastructure to start getting back online.
I.e. you want to change from round robin scheduling to shortest job first scheduling (which gives the minimum average waiting time).
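To put rough numbers on that: with ten identical six-minute downloads, perfectly fair sharing gives each one a tenth of the bandwidth, so they all finish at the 60-minute mark and the average completion time is 60 minutes. Serving them one at a time finishes them at 6, 12, ..., 60 minutes, for an average of (6 + 12 + ... + 60) / 10 = 33 minutes, which is why shortest-job-first minimizes average waiting time.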
They wanted to keep all of the servers from contacting the PXE server at the same time and overloading it. It's basically a way to prevent an accidental self-DDoS.
>The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter.
Substitute 'reboot' with 'upgrade to Win 7' and 'datacenter' with 'university' and you get a story from a month(?) ago.
rm ./*.* - delete all files in current dir
rm /*.* - never do this
Can you spot the difference? A colleague did the latter on our testing server. No clients were disturbed, but we devs were left working on things that can run locally (much more pleasant stuff for sure, yet the schedule suffered a lot).
Oh shit, that is production, not staging. I got a surge of adrenaline and hit Ctrl+C within a couple of seconds.
Thankfully it only completed the very first part of the process that prevents new users from logging in and puts up the maintenance page. No playing users were kicked off the game servers. All I had to do was run
./realmctl.py allow
to get new logins to start working again.
This is what taught us that we need to have an extra confirmation for actions on the production realm.
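Something as simple as a wrapper that forces you to retype the realm name goes a long way. A minimal sketch (the wrapper and its name are made up for illustration, not our actual tooling):

#!/bin/sh
# confirm-realm.sh: refuse to run a command against production unless
# the operator retypes the realm name. Purely illustrative.
realm="$1"; shift
if [ "$realm" = "production" ]; then
    printf 'About to run "%s" on PRODUCTION. Retype the realm name to confirm: ' "$*"
    read -r answer
    [ "$answer" = "production" ] || { echo "Aborted." >&2; exit 1; }
fi
"$@"

Day-to-day invocations then go through the wrapper, e.g. ./confirm-realm.sh staging <command...>, and anything aimed at production stops and waits for a human.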
Your 'never do this' line would erase initrd.img on most machines, which you'd only find out about after the next reboot if nobody told you about it (and good luck getting that fixed).
There are a large number of varieties of this particular error. Some with terrible results.
rm -rf * .bak
for instance (especially when executed in the root directory).
That '#' prompt is there for a reason.
The way to solve this sort of issue is to first list the files using 'find' until you're totally happy with the result, and then to use 'rm' as the command passed to find.
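For the stray-space *.bak example above, that workflow looks roughly like this (assuming a find that supports -maxdepth, which the GNU and BSD versions both do):

# Step 1: list exactly what would be removed, and stare at it.
find . -maxdepth 1 -name '*.bak' -print

# Step 2: only once the list looks right, hand the same expression to rm.
find . -maxdepth 1 -name '*.bak' -exec rm -- {} +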
It would also erase everything else on the box, so you might not know for a few seconds, but then the database errors start, the lib and pid files start disappearing, and now you're the king of a mountain of shit. It won't take until the next reboot to notice.
If you ever get a chance, rm -rf / a box before you throw it out and just play around with it for a couple minutes while it eats itself.
You can spare yourself most errors like that with tab completion, the built-in sanity checker. I tab-complete everything since I'm dyslexic, and it saves me a shit ton of time because I'm never far from the error when I notice it.
> It would also erase everything else on the box, so you might not know for a few seconds, but then the database errors start, the lib and pid files start disappearing, and now you're the king of a mountain of shit. It won't take until the next reboot to notice.
Yes, it will, because the original post didn't include -r. So it's only deleting things that match the glob in the root directory. On many systems, that is nothing.
My point was that the lack of -r doesn't really change anything; even with -r, it would still only delete initrd.img and similar, which would take until the next reboot to notice.
A colleague of mine once called me over because they were having trouble with their computer. Apparently while doing "a little tidying up" they decided to move the windows folder to somewhere else. I can't remember the details but for some reason it was impossible to get a command prompt (Windows NT maybe?) without a disk, which I didn't have. I think I had to rip the drive out and stick it in another machine to fix it.
You know what's also fun? Try deleting all files that start with a . in your current directory:
rm -rf .*
Would you expect this to go UPWARD? I never would have. That is, even if you're in /x/y/z it will traverse up the tree and start deleting /x and eventually /, because the .* glob also matches '..'.
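If the goal is just the dotfiles in the current directory, a safer pair of globs is the classic:

rm -rf .[!.]* ..?*

The first pattern matches names starting with a single dot (.bashrc, .git, ...), the second catches names starting with two dots, and neither can ever expand to "." or "..", so nothing walks up the tree.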
I'm not sure which design issue you're referring to; we absolutely did consider this scenario and the system behaved essentially as designed in this regard (in particular, the contingency that we can determine a compute node's platform image while the control plane was impaired was critical for our relatively quick recovery). The only real issue with the recovery of compute nodes was the CPU throttling that we encountered as the entire fleet rebooted simultaneously, but that was quickly discovered and remediated (and would have only resulted in a longer recovery time, not an inability to recover).
Edit: the comment that I was replying to was apparently deleted. I won't attribute it, but here is the text I was replying to:
I'm generally the last person to finger point but wow! You guys didn't think about and test what happens when a lot of your servers reboot simultaneously? That happens any time you lose power! This exact issue has been raised and discussed every single time I've deployed any substantial number of PXE booting machines. There are actually a number of issues with relying on PXE booting and this is just one of them. This design make is far worse than the mistyped command.
One of the worst days of my life was when one of my techs called me as I hopped on a plane...
Tech: "hey phil, I just broke $bigcustomer's database"
Phil: "..."
Tech: "I typed drop database ImportantDB;2"
Tech: "instead of drop database Importantdb2;"
Even worse, because this was, let's say, a very undersized install due to customer cheapness. And all we had for backups (since who wants to PAY for backups?!?!) was a day-old mysqldump off a slave. It took multiple days to re-import all that data. The customer was not pleased.
There should be zero ways for a single operator to reboot the whole DC. A simultaneous DC-wide reboot should only be possible when separate commands from separate operators are jointly submitted. TPI and all that.
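A back-of-the-napkin sketch of what that could look like (the script, paths, and names are invented for illustration, not any real tool):

#!/bin/sh
# dc-reboot.sh: illustrative two-person rule. A datacenter-wide reboot
# only proceeds if a *different* operator approved it within the last
# ten minutes.
APPROVAL=/var/run/dc-reboot.approval

case "$1" in
approve)
    # The first operator records a standing approval.
    id -un > "$APPROVAL"
    echo "Approval recorded for $(id -un)."
    ;;
execute)
    approver=$(cat "$APPROVAL" 2>/dev/null)
    if [ -z "$approver" ] || [ "$approver" = "$(id -un)" ]; then
        echo "Refusing: need an approval from a second operator." >&2
        exit 1
    fi
    if [ -n "$(find "$APPROVAL" -mmin +10 2>/dev/null)" ]; then
        echo "Refusing: the approval is older than ten minutes." >&2
        exit 1
    fi
    rm -f "$APPROVAL"
    echo "Two operators have signed off; issuing the reboot."
    # ... the actual fleet-wide reboot command would go here ...
    ;;
*)
    echo "usage: $0 approve|execute" >&2
    exit 2
    ;;
esac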
As a company that isn't security-focused, they probably don't see a problem with giving so much power to a single command that one person can issue.
Doesn't mean everyone is affected by it. If I was hosting some unimportant website on their platform, I probably wouldn't even notice, and it wouldn't be an inconvenience to me. Silly thing to feel so strongly about.
This is the same company that chose to publicly shame one of the best contributors to node.js into quitting because he closed a pull request changing a gendered pronoun in a code comment. They place very little value on engineering excellence and more on 'being cool'. This leads to half-baked tools using <buzzword technology> like the one described in the article that allowed every server to be simultaneously rebooted without confirmation.
That's the Joyent side explaining that they would have fired him over it. Here is a news outlet report of it saying that it was mainly a language barrier issue.
Either way, it was just shocking to see how quickly they were willing to throw under the bus a contributor who put more into libuv than all of Joyent combined, without even trying to understand his reasoning.
It's funny how Cantrill is so willing to throw someone else under a bus, yet he's the guy who once asked whether another developer had "ever kissed a girl".