Amazon AWS had a power failure, their backup generators failed (twitter.com/pragmaticandy)
227 points by mcenedella on Sept 4, 2019 | 102 comments


This seems to be getting slightly overblown in that thread. To be clear, this impacted one datacenter out of ten that make up one availability zone out of six in AWS’s us-east-1 region. So at most 2-3% of that region’s capacity was impacted.

I haven’t seen a report yet on exactly why their generator failed, but from what I’ve heard, utility power failed, the backup generator kicked in and ran fine for over an hour, and then it failed too. This sort of thing sucks to deal with, but it’s also inevitable. Across their hundreds of datacenters, a mechanical failure is going to happen occasionally no matter how good their maintenance plans are.

So the key when using cloud services like AWS is to plan for the possibility of failure. EBS expects an annual failure rate of 0.1%. So one out of a thousand EBS volumes will fail in a given year. If you operate at the scale of thousands of servers in AWS, you see this sort of thing all the time. Luckily, EBS also makes it trivial to take volume snapshots which are stored in S3, which has much much higher reliability and durability. So if you have data in EBS that needs to be kept safe, take regular snapshots. Here’s a doc that explains how you can set up scheduled, auto-rotated snapshots: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/snapshot...
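
For example, here’s a rough sketch of that snapshot-and-rotate idea using boto3 (the volume ID, tag key, and retention count are made up for illustration; in practice you’d hang this off a schedule like cron, a CloudWatch Events rule, or Data Lifecycle Manager):

    # Rough sketch: snapshot a set of EBS volumes and prune old snapshots.
    # Assumes this runs on a schedule (cron, Lambda, etc.); the volume ID,
    # tag key and retention count are illustrative only.
    import boto3
    from operator import itemgetter

    ec2 = boto3.client("ec2", region_name="us-east-1")

    VOLUME_IDS = ["vol-0123456789abcdef0"]   # hypothetical volume
    RETAIN = 7                               # keep the 7 most recent snapshots

    for vol_id in VOLUME_IDS:
        snap = ec2.create_snapshot(
            VolumeId=vol_id,
            Description=f"scheduled backup of {vol_id}",
            TagSpecifications=[{
                "ResourceType": "snapshot",
                "Tags": [{"Key": "auto-backup", "Value": "true"}],
            }],
        )
        print("created", snap["SnapshotId"])

        # List this volume's auto-backup snapshots, newest first, delete extras.
        snaps = ec2.describe_snapshots(
            OwnerIds=["self"],
            Filters=[
                {"Name": "volume-id", "Values": [vol_id]},
                {"Name": "tag:auto-backup", "Values": ["true"]},
            ],
        )["Snapshots"]
        snaps.sort(key=itemgetter("StartTime"), reverse=True)
        for old in snaps[RETAIN:]:
            ec2.delete_snapshot(SnapshotId=old["SnapshotId"])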


We were hit by it over the weekend, and because of our failsafes, our customers didn't notice a thing. Stuff happens. It's our job to prepare for these things as well.


Exactly. The cloud gives you access to almost unlimited computing resources spread across enormous swaths of geography, power infrastructure, and internet connectivity.

Therefore, if you were impacted, it's your fault, not AWS's. Sorry.


* If it's within your SLA


I don’t understand this statement. If your application is developed with fault tolerance, this is independent of any SLA.


Curious to know what kind of failsafes you have in place so your customers were not impacted?


I'm not the original commenter but for an EC2 hosted application, any architecture that uses a VPC with proper cross-AZ subnetting and a load balancer would not have been impacted. EBS snapshots would minimize any data loss.
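
To make that concrete, here's a rough boto3 sketch of the multi-AZ part (all the IDs and names below are placeholders, not anything from this outage):

    # Rough sketch: an application load balancer plus an auto scaling group
    # spread across two subnets in different AZs, so losing one AZ (or one
    # datacenter inside it) doesn't take the app down. IDs are placeholders.
    import boto3

    elb = boto3.client("elbv2", region_name="us-east-1")
    asg = boto3.client("autoscaling", region_name="us-east-1")

    SUBNETS = ["subnet-aaa111", "subnet-bbb222"]   # one per AZ (hypothetical)

    lb = elb.create_load_balancer(
        Name="web-alb",
        Subnets=SUBNETS,
        Scheme="internet-facing",
        Type="application",
    )["LoadBalancers"][0]

    tg = elb.create_target_group(
        Name="web-targets",
        Protocol="HTTP",
        Port=80,
        VpcId="vpc-0123456789abcdef0",             # hypothetical VPC
        HealthCheckPath="/health",
    )["TargetGroups"][0]

    elb.create_listener(
        LoadBalancerArn=lb["LoadBalancerArn"],
        Protocol="HTTP",
        Port=80,
        DefaultActions=[{"Type": "forward", "TargetGroupArn": tg["TargetGroupArn"]}],
    )

    # The ASG keeps instances spread across both subnets/AZs and replaces
    # anything the ELB health check marks unhealthy.
    asg.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
        MinSize=2,
        MaxSize=4,
        VPCZoneIdentifier=",".join(SUBNETS),
        TargetGroupARNs=[tg["TargetGroupArn"]],
        HealthCheckType="ELB",
        HealthCheckGracePeriod=120,
    )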


probably multi-az, multi-tier, stateless apps

https://12factor.net/ can help guide


Don't disagree with your point that this is overblown, but here's an important related point

"Your nines are not my nines" - https://rachelbythebay.com/w/2019/07/15/giant/


If I recall correctly, this is nicely covered in the book "Release It!", which BTW I recommend.


But for a large number of 9s of people, AWS's 9s are their 9s.


Reminds me of the hover text on this xkcd: https://xkcd.com/325/


More info: Sounds like the backup generators operated fine, but they only have about an hour of fuel in their individual storage tanks. There's an automated fuel delivery system to keep them topped up, and that's the piece that failed, meaning the individual generators didn't get their tanks refilled and eventually ran out of fuel. This was compounded by the operations staff being busy trying to restore the main power feed; somehow (it's not explained how) they didn't get a timely notification that the fuel delivery system itself had failed.

AWS says they plan to install a backup fuel delivery system at this particular datacenter (it wasn't detailed whether this SPOF is common at other datacenters or if this was an outlier) and that they have already updated their notifications to be more aggressive.


EBS snapshots don't just give you something to recover to in case of a failure; they actually reduce the chance of failure[1]. Assuming that the S3 storage has no data loss[2], you limit the possible data loss for your EBS volume to only those bits that have changed since the last snapshot.

[1]: https://stackoverflow.com/questions/13576363/does-taking-a-s...

[2]: Yes, it's a big if, but S3 for durability against data loss is like the US Treasury for risk-free returns on T-Bills.


I actually posted that Stack Overflow question and I don't think any of the answers support my assumption and your statement that it reduces the chance of failure.

The EBS volume will still fail but you can restore it from the higher reliability S3-backed EBS snapshot.
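
The restore path is pretty mechanical, something like this with boto3 (the snapshot tag, AZ, instance ID, and device name are all made up):

    # Rough sketch: rebuild an EBS volume from its most recent snapshot and
    # attach it to an instance. Snapshot tag, AZ, instance ID and device
    # name are placeholders.
    import boto3
    from operator import itemgetter

    ec2 = boto3.client("ec2", region_name="us-east-1")

    snaps = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "tag:auto-backup", "Values": ["true"]}],
    )["Snapshots"]
    latest = max(snaps, key=itemgetter("StartTime"))

    vol = ec2.create_volume(
        SnapshotId=latest["SnapshotId"],
        AvailabilityZone="us-east-1b",       # pick a healthy AZ
        VolumeType="gp2",
    )
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

    ec2.attach_volume(
        VolumeId=vol["VolumeId"],
        InstanceId="i-0123456789abcdef0",    # hypothetical replacement instance
        Device="/dev/sdf",
    )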


> This seems to be getting slightly overblown in that thread. To be clear, this impacted one datacenter out of ten that make up one availability zone out of six in AWS’s us-east-1 region. So we are talking 2-3% at most of that region’s capacity was impacted.

For how many users was this 100% of their business?

Single-digit outage percentages for cloud services like AWS look like no big deal in aggregate, which means that they often aren't a high priority. But when you're one of the customers in the 2-3%, it is a big deal, and you want it to be a high priority.

Small hosting providers may have only 3 9s of availability, but when your site goes down their world stops until it's fixed. I've seen reports of 5 9s of availability from Amazon, but when your site goes down, their alerts are still all green and they'll call you back at their leisure.


> For how many users was this 100% of their business?

… but they still chose to ignore all of the prominent warnings and architectural guidance, not to mention avoiding use of the services which have HA built-in. I mean, I'm sympathetic to anyone who had a bad day with a forced learning experience but it's not like this is some dark secret.


> but they still chose to ignore all of the prominent warnings and architectural guidance

So that's it, blame the user and caveat emptor?

> not to mention avoiding use of the services which have HA built-in.

Many (all?) of these services tightly couple you to Amazon, so avoiding them is a very reasonable decision.


> So that's it, blame the user and caveat emptor?

If I sell you a loaf of bread and you complain that it's not a sandwich, is it anything else?

> Many (all?) of these services tightly couple you to Amazon, so avoiding them is a very reasonable decision.

That's just a cop-out: checking the “multi-AZ” box in RDS completely avoided this problem with zero lock-in. If you're deploying containers, you have multiple options which are portable and avoid this completely. If you're deploying EC2 instances, again you have options with very limited lock-in (e.g. auto-scaling with multiple AZs).
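
For the record, that "multi-AZ" checkbox is literally one flag; a rough boto3 sketch (the instance identifier here is made up):

    # Rough sketch: the RDS multi-AZ option is a single flag at create or
    # modify time. The instance identifier is a placeholder.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # Flip an existing single-AZ instance to multi-AZ; RDS provisions a
    # synchronous standby in another AZ and fails over automatically.
    rds.modify_db_instance(
        DBInstanceIdentifier="my-app-db",
        MultiAZ=True,
        ApplyImmediately=True,
    )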

More importantly, that's also a business decision: if you're that worried about lock-in, you are accepting responsibility for operating the alternatives. For example, industry-standard practice might suggest that you run everything in Kubernetes across multiple AZs, regions, or providers, but it would never support running everything in a single AZ.


Honestly, if you are running in a single AZ in AWS, you are setting yourself up for failure.


None, if they had any multi-AZ capability at all.


Agree. Like you imply, too many people rely heavily on their providers for business-critical things like backups and redundancy. While big providers generally do a good job at this (and AWS certainly does), that doesn't mean there is any kind of guarantee that failures won't ever occur. Thus the need to invest heavily in failure-resistant designs on top of this borrowed infrastructure is arguably more important than picking 'the best' provider for the job.

I would also note that the response time suggested by the tweet is a bit bad; 4 days to figure something out and send a response / alerts to customers is slow even for Amazon. But then again, everything goes slower for bigger things, and Amazon is quite big, I'd say. Not sure what the SLA response time for such an incident is, so it might be within agreed times...


I suspect the four days was how long it took to confirm that their particular EBS volume was definitely not recoverable. Soon after the outage ended on Saturday morning it was clear that some EBS volumes were not recovering quickly, and support's advice at that time was to rebuild/recover if you needed to still be online.

When a datacenter loses power like this, a few of the storage arrays will just not come back online. But another few will take time to run through their corruption recovery process, and it may take a long time and some service by a human (eg parts replacement, etc) before they can be certain a particular volume is not recoverable. Given their scale and the timing, at the beginning of a holiday weekend, four days is annoying, but not bad.


But the reason I pay AWS is so that I don't have to hire a team to take care of backups and redundancy on my side. If they can't be relied on, a lot of the justification for their cost markup goes out the window.


If you don't want to think about things like redundancy then use higher-abstraction services. Lambda for example takes care of multi-AZ redundancy so you don't have to think about it. The lower level building blocks like EC2 don't. They expose the fault boundaries so that you can build HA applications on top of them, but it's still your responsibility to do so.


It sounds like you might misunderstand the product you are buying from them. They are very clear about the reliability of EBS (1 in 1000 volumes will fail during a year of uptime), and they provide a really easy way to back things up, and there are tools available to schedule automated backup rotation. So I'm not sure what more you expect. AWS can't possibly know what your needs are for backup and restore for a particular EBS volume. If you want data durability, use S3.


AWS is only selling you infrastructure as a service, not a turnkey solution. It's up to you to combine and coordinate these services into a solution that delivers the capabilities (including backup, recovery and fault tolerance) appropriate to your needs. So while you don't need to hire a team to take care of backups and redundancy, you do need to provision and configure what is required so that their team can.


Respectfully, that's not a good reason to use public cloud providers like AWS. They provide features and tooling that make building redundancy into your services easier but for many of these redundancy features you must integrate them into your infrastructure design to take advantage of them.


If you replace AWS with Heroku in your statement, I agree. Heroku abstracts the redundant AWS resources for you so you can just “run your app”. However, Heroku also had a huge outage. That is way more problematic as far as I am concerned.


AWS gives you access to redundant resources inexpensively. If you have your application in a single AZ, you’ve elected to bypass the redundancy.


It can be relied on, but it's up to you to configure it properly. AWS has no way of knowing how critical your application is and what level of redundancy it needs, and this has cost implications so they can't do it automatically.


I'm pretty sure that RDS is EBS-backed.


… and RDS has a multi-AZ checkbox which does exactly what it claims. Anyone who used it did not have a problem with this outage.


When people talk about data backups, they always say you should rehearse restoring your backups, just so you're confident that the backups are complete. If you don't do that, you lose your data when your recovery fails, because you never tested it.

Likewise, I guess it's easy to speculate from the peanut gallery, but I wouldn't be surprised if the backup generators just hadn't been sufficiently tested and maintained because, well, they're backup generators.


Do you have a source detailing this issue in us-east-1? I don't see a recent post-event summary from AWS mentioning us-east-1 https://aws.amazon.com/premiumsupport/technology/pes/


Here is what they said in my support panel. This event made for a fairly tense Saturday morning for my development team and me.

[01:30 PM PDT] At 4:33 AM PDT one of ten data centers in one of the six Availability Zones in the US-EAST-1 Region saw a failure of utility power. Our backup generators came online immediately but began failing at around 6:00 AM PDT. This impacted 7.5% of EC2 instances and EBS volumes in the Availability Zone. Power was fully restored to the impacted data center at 7:45 AM PDT. By 10:45 AM PDT, all but 1% of instances had been recovered, and by 12:30 PM PDT only 0.5% of instances remained impaired. Since the beginning of the impact, we have been working to recover the remaining instances and volumes. A small number of remaining instances and volumes are hosted on hardware which was adversely affected by the loss of power. We continue to work to recover all affected instances and volumes and will be communicating to the remaining impacted customers via the Personal Health Dashboard. For immediate recovery, we recommend replacing any remaining affected instances or volumes if possible.


Serious question, for market strategy understanding:

HA would normally use two AZs, 21C architectures would use three AZs, and other patterns such as pilot light let you use additional AZs without a significant cost hit. Further, when you can make workload-handling instances so much smaller (even down into the T sizes, and with cattle patterns you can start to leverage spot pricing), each additional AZ you add to the 21C mix represents a smaller percentage of your capacity lost in an AZ outage.

What factors lead you to choose an automation-powered CSP such as AWS but then use only a single AZ out of a half dozen?

If using a single AZ and tense during outages, why not use Hetzner, Softlayer (now IBM Cloud), etc.?


A lot of organisations are very very conservative and slow moving, and AWS seemed the most likely thing to still be around after/during the 15 year project to shift things out of private data centers.


Can you detail/link resources about "21C architectures" and "pilot light pattern"? Thanks!


My guess on “21C architectures” is “21st century architectures”. I can’t seem to find anything on “pilot light pattern” in quick searches.

Either way, GP is using obscure terminology at best.


https://d1.awsstatic.com/whitepapers/aws-disaster-recovery.p...

> The term pilot light is often used to describe a DR scenario in which a minimal version of an environment is always running in the cloud. The idea of the pilot light is an analogy that comes from the gas heater. In a gas heater, a small flame that’s always on can quickly ignite the entire furnace to heat up a house


> In a gas heater, a small flame that’s always on can quickly ignite the entire furnace to heat up a house

Sure, I’m being nitpicky, but that’s a terrible explanation of the purpose of a pilot light in a legacy furnace.


If you are using anything other than VMs, the APIs are too different.


Some AWS services, like EMR, can't span multiple availability zones.


Even in these cases, however, there are usually architectural methods for mitigating this kind of failure. In the case of EMR, using transient clusters with state stored in durable file stores such as EMRFS and vanilla S3 is one such option.
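
A rough sketch of that transient-cluster pattern with boto3, for the curious (the bucket names, instance types, and Spark step are illustrative only, not anything AWS prescribes):

    # Rough sketch: a transient EMR cluster that reads from and writes to S3,
    # runs its steps, and terminates itself. Bucket names, instance types and
    # the Spark step are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.run_job_flow(
        Name="nightly-batch",
        ReleaseLabel="emr-5.26.0",
        LogUri="s3://my-logs-bucket/emr/",
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,   # cluster goes away when done
        },
        Steps=[{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-code-bucket/job.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )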

It is inevitable that systems will fail. The best the industry can do is work to reduce the number of failures and understand failure modes well so that they can be planned for. AWS does a very good job of this in my experience.

Regardless of whether applications are hosted in the cloud, on premises, co-located or in some hybrid configuration, it's important to design for that inevitable failure and keep business decision makers in the loop while doing so. Understanding requirements around RPO and RTO (recovery point and recovery time objectives) is extremely important in developing an architecture which meets the needs of the business, yet is still cost effective.


Yes, and also for many managed services (like Kinesis or Lambda) you don't have any control over which AZs are in use, and so you have no choice but to wait for the AWS teams to address any problems caused on their own backends.

Regardless, it's important to be aware of the risks of whatever tool you're using. It's unrealistic to expect any provider to be able to avoid failure entirely. You have to be aware of possible failure scenarios and have your own plan to address them.


EMR seems kind of legacy from AWS's perspective these days. Probably very profitable, but they seem to be pushing you towards things like Glue and Athena.


This conclusion, "The cloud is just a .. blah blah blah", is weak: Amazon offers isolated Availability Zones within each region to mitigate this risk at the system level, lets you take EBS snapshots that you can back up to S3 (with 11 9s of durability), and provides scaling features you just will not find on "just another computer." And you are meant to architect for this with multi-AZ designs.

It's managed infrastructure, not some miraculous alternative universe where probabilities do not apply to you.

From the docs:

"Amazon EBS volumes are designed for an annual failure rate (AFR) of between 0.1% - 0.2% ..."


The conclusion is weak but true. People do tend to forget the cloud is also a bunch of computers, and they might fail. However, it's not an argument to avoid the cloud entirely, and in that sense it's weak as an argument against the cloud.


Agreed. If the cloud is somebody else's computer, I'm glad to let it be a computer that belongs to an extremely specialized and demanding company that focuses on providing that computer to me as a service.


exactly. remember, everything fails, all the time. https://thenextweb.com/2008/04/04/werner-vogels-everything-f...


We had a similar problem at our hosting facility last winter during that "Ice Vortex" storm that was all over the news. The facilities guys had been very proud of never having had a power outage; I've had servers with them for 15 years now.

The morning of the worst of the storm, we completely lose access to all services at that facility. Super unusual, everything we have there is redundant. So I do some minor investigation, and get on the horn to them.

They were being super cagey. "Hey, we lost all access to our systems." "Ok, I'll open a ticket and we will investigate." "Uhhh. It feels like it's a big problem with the data center, are you guys having problems or is it just us?" "I can't say anything more until we've completed an investigation." "I'm trying to decide if we need to start failing over to our DR site, or if I need to put chains on the truck to drive 50 miles to the data center in this storm. Is anyone else having problems? Are fire alarms going off?" "We have received multiple reports of problems."

Power was back on in less than half an hour, but they still weren't saying anything for a few hours. Spent that time trying to figure out if we should shut everything down and ride it out, or if they were back in business. We had one system that suffered disk corruption, despite having a (according to the weekly testing) correctly operating BBU on the RAID.

So what happened? It shouldn't have been possible: our cabinet was being fed by two lines from independent PDUs. Each PDU is fed by 2 independent UPSes (one shared between the two PDUs), each UPS fed by a dedicated generator. It should have required 3 failures to bring us down.

They eventually reveal that they had had one of their UPSes down for weeks, waiting for replacement parts. The other two UPSes had independent failures (one was a controller board, one was battery related). They said they still did quarterly full load tests of the power systems, but reading between the lines I think they weren't testing these two UPSes because the other one was not there to back them up.

Still, one power event in 15 years isn't too shabby.


I had a similar situation in the early 2000's at NTT/Verio in Sterling, VA. They lost power because someone dug through a utility line and introduced a ground fault into their system. They switched to generators and then, against protocol (basically human error), tried to force-override a transfer switch onto their other utility feed, which ended up killing the generators. Eventually the UPSes were all drained, but we had to shut down servers long ahead of that because, with no AC, they were overheating. People were taking servers out by pickup truck to stand them up at their other offices and data centers. 48 hours of pain.


Power seems like the big wildcard in data center management. So tough to properly test your failover preparations, and so many different ways things can go wrong.

I know of a large company that had the data center emergency cutoff button next to the automatic doors on the way out. Sure enough, a contractor hit it one day thinking it was the way to open the doors.


Oh, I forgot to mention that their solution was to run off generators for the impacted UPSes until repairs could be completed. I think one of them ended up running for 2 weeks. It seems strange that replacement parts for this UPS were so hard to source. That's why it was down in the first place. Someone making them by hand?


So dude is mad because he didn't have a redundancy plan? You can take snapshots of EBS volumes, which back everything up to S3. They even tell you in the documentation that EBS volumes can fail. But blaming someone else is easier, I guess...


How do you do restores in those scenarios? You have a whole mess of snapshots of data that's in various stages of being wrong.

I'd rather just put up with the agony of RDS to get a point-in-time restore and treat my instances' data as volatile.


In followup, turns out they didn't even lose data as claimed.

It was a case of either lying or incompetence. Hunt didn't like that and blocked me for calling it out.


> In followup, turns out they didn't even lose data as claimed.

Link?


AWS needs to add a button/option to EBS to have volumes be automatically backed up by AWS itself. Without this, few will do it or even be aware that it's possible.

It doesn't help that EBS backups take forever, especially the initial one.



Ah, and then the storm of "my AWS bill has skyrocketed, why is AWS doing this..." complaints incoming a week after such a thing. No, thanks. RTFM.


Not only should you be architecting your app to survive an AZ going down, you should be planning for an entire region going down, and maybe even an entire cloud provider.

It's annoying when Amazon has outages, but they have local outages all the time, and they give you all the tools you need to handle them.


money money money though - all that time and effort (and fees) doesn't help you to lowball the next contract, cross your fingers and hope not to get caught out!


This just sounds like a badly architected solution, nothing else. The same problems that can happen in your own datacenter can happen in the cloud; it's just not your responsibility to fix them. If you lack that knowledge as an architect, rethink your title.


I think the real complaint is:

>>Then it took them four days to figure this out and tell us about it.


I must have missed something entirely. Did this all happen before the weekend? Where is the 4 days coming from? Amazon's RSS feed hit our Slack on Saturday morning with an explanation that the power had gone out and the backup generators had failed.

Is there some post mortem that just came out?

>>We want to give you more information on progress at this point, and what we know about the event. At 4:33 AM PDT one of 10 datacenters in one of the 6 Availability Zones in the US-EAST-1 Region saw a failure of utility power. Backup generators came online immediately, but for reasons we are still investigating, began quickly failing at around 6:00 AM PDT. This resulted in 7.5% of all instances in that Availability Zone failing by 6:10 AM PDT. Over the last few hours we have recovered most...


I mean, what did he expect? A 5-minute resolution? 4 days sounds like a very reasonable investigation time to figure out why something complex failed.


I remember somebody on here writing about how, when they worked at AWS, they wrote custom firmware for their generators to get max performance.


I couldn't find the HN post but I found an article that talks about firmware mods they do.

https://www.datacenterknowledge.com/archives/2017/04/07/how-...

> The piece of technology Amazon designed to avoid this type of outage is the firmware that decides what electrical switchgear should do when a data center loses utility power. Typical vendor firmware prioritizes preventing damage to expensive backup generators over preventing a full data center outage, according to Hamilton. Amazon (and probably most other large-scale data center operators) prefers risking the loss of a sub-$1 million piece of equipment rather than risking widespread application downtime.

> When everything happens as expected during a utility outage (which is the case most of the time), the switchgear waits a few seconds in case utility power comes back (also the most common scenario) and if it doesn’t, the switchgear fires up generators, while the data center runs on energy stored by UPS systems. Once the generators are stabilized, the switchgear makes them the primary source of power to the IT systems.

> Last year’s Delta data center outage was attributed to switchgear “locking out” the generators at the airline’s facility in Atlanta. That’s what most switchgear is designed to do when it senses a major voltage anomaly either in the data center or on the incoming utility feed. Plugging a live generator into a shorted circuit will usually fry the generator, and switchgear locks generators out to avoid that.


Diesel generators at hospitals, and diesel motors running pumps for fire suppression systems, are normally set up to keep running even when that gets closer to the line of risking damage to the generator or engine.


That’s it, thanks!


People tend to make a habit of treating AWS like a VPS provider, in my experience. And you can skate by on this for a while, but it really isn't designed for that. And that will, eventually, lead to pain and suffering.

Sometimes instances just go out to lunch. Sometimes an AZ goes down. Chaos Monkey isn't just a good idea, it's required for reliable operation.

But please, please, if you are going to treat AWS like a VPS, at least don't do it in us-east-1! It seems to have more outages.

Our setup related to instances and EBS includes: At least 2 instances in different AZs, an ELB in front of them, a backup running at our hosting facility (though this could just be a different AWS zone, or different provider), and DNS with full-paper-path health checks that switch DNS over to the colo servers if any component of the primary fails.
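
The DNS failover piece looks roughly like this in Route 53 terms (the hosted zone ID, names, and IPs below are placeholders; our real checks are deeper than a single HTTPS probe):

    # Rough sketch: a Route 53 health check on the primary endpoint plus
    # PRIMARY/SECONDARY failover records, so DNS flips to the colo address
    # if the primary stack stops answering. All IDs/names are placeholders.
    import boto3
    import uuid

    r53 = boto3.client("route53")

    hc = r53.create_health_check(
        CallerReference=str(uuid.uuid4()),
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "primary.example.com",
            "ResourcePath": "/health",
            "Port": 443,
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )["HealthCheck"]

    def record(name, value, role, health_check_id=None):
        # Build one failover record set; role is "PRIMARY" or "SECONDARY".
        rr = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": role.lower(),
            "Failover": role,
            "TTL": 60,
            "ResourceRecords": [{"Value": value}],
        }
        if health_check_id:
            rr["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rr}

    r53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",   # hypothetical hosted zone
        ChangeBatch={"Changes": [
            record("www.example.com.", "203.0.113.10", "PRIMARY", hc["Id"]),
            record("www.example.com.", "198.51.100.20", "SECONDARY"),   # colo
        ]},
    )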


Although this was not reported on the status dashboard, this also affected ElastiCache. It was acknowledged by the rep on the phone, and the issue got resolved via email. We weren't able to start/reboot any Redis instances in us-east-1a, so we had to launch in us-east-1c.


Is there a reason it often seems like backup generators fail? Is it that we don't hear the success stories of all the times they don't fail? Just due to not being tested often?


We generally only hear about the failures, but it's also a tricky thing to test. Simple setups won't put load on the generator during the periodic tests, which can result in outages if the generator will start, but can't run the load for whatever reason (ex: mechanical problems, or load size grew beyond capacity). More complicated setups may be able to switch the load to the generator, but not switch back to utility fast enough in case the generator under test fails during the test. The transfer switches themselves are prone to failures and hard to make redundant.

It seems that it's pretty hard to get this right from the beginning too, every large datacenter ends up learning this again after they have a sequence of power incidents. That said, successful switch to generator for 1 hour and then generator failure is not a terrible outcome; if there was a notification, that's enough time to evacuate critical systems (assuming you have a plan).


Yes, you don’t hear the hundred or two times they start correctly, as they typically are tested quarterly to monthly. A good test is where you island the datacenter (run it only off the generators) so you can also make sure all the switchgear works.


If you were buying datacenter space instead of cloud resources, you would expect each datacenter to report to you quarterly that they had run their backup generator tests and fixed any problems that occurred. You would also expect to be told in advance about network maintenance and other foreseeable service issues.

Cloud vendors, of course, get to hide all that from you until something goes wrong.


Is there a source for this other than an angry Twitter user?


Here is what Amazon shared via my support panel after the outage.

[01:30 PM PDT] At 4:33 AM PDT one of ten data centers in one of the six Availability Zones in the US-EAST-1 Region saw a failure of utility power. Our backup generators came online immediately but began failing at around 6:00 AM PDT. This impacted 7.5% of EC2 instances and EBS volumes in the Availability Zone. Power was fully restored to the impacted data center at 7:45 AM PDT. By 10:45 AM PDT, all but 1% of instances had been recovered, and by 12:30 PM PDT only 0.5% of instances remained impaired. Since the beginning of the impact, we have been working to recover the remaining instances and volumes. A small number of remaining instances and volumes are hosted on hardware which was adversely affected by the loss of power. We continue to work to recover all affected instances and volumes and will be communicating to the remaining impacted customers via the Personal Health Dashboard. For immediate recovery, we recommend replacing any remaining affected instances or volumes if possible.


I came across this news article: https://www.theregister.co.uk/2019/09/04/aws_power_outage_da...

The headline sounded so clickbaity that I ignored it before seeing this thread.


The Register often uses snarky or tongue-in-cheek headlines, I like it tbh


Forgive me for what may be a stupid question.

As far as I can tell, the number of mechanical/hardware failures is far higher than software failures. And it is always power, UPS, generator, BBU, RAID card failures, etc.

Why is it that we keep hearing about failures in this segment? It doesn't seem like anything has been done about it. Is there any innovation happening in this space?


This tweet might be in response to the AWS Post event summary from August 23, 2019:

> We’d like to give you some additional information about the service disruption that occurred in the Tokyo (AP-NORTHEAST-1) Region on August 23, 2019. Beginning at 12:36 PM JST, a small percentage of EC2 servers in a single Availability Zone in the Tokyo (AP-NORTHEAST-1) Region shut down due to overheating.

https://aws.amazon.com/message/56489/


But the guy's tweet specifically mentions Reston (Virginia). This is in the vicinity of us-east-1.

By the way, I'm pretty sure none of the actual AWS datacenters are in Reston proper. They are in Ashburn and other more sparse suburbs.

Source: I live in the DC area and regularly visit Reston and Herndon. There are large AWS offices in Herndon but not so many datacenters. Real estate in Reston is pretty expensive.


> By the way, I'm pretty sure none of the actual AWS datacenters are in Reston proper.

Pretty sure you are right. The physical AWS data centers I know of in the Reston area are:

4 DCs on Smith Switch Rd in Ashburn, VA

2 DCs (IAD54 and IAD67) elsewhere in Ashburn, VA

3 DCs on West Severn Way in Sterling, VA

3 DCs on Dulles Summit Ct in Sterling, VA

2 DCs on Prologis Dr in Sterling, VA

2 DCs on Relocation Dr in Sterling, VA

2 DCs (IAD69 and IAD76) elsewhere in Sterling, VA

1 DC in South Riding, VA

4 DCs on Westfax Drive in Chantilly, VA

3 DCs on Mason King Ct in Manassas, VA

2 DCs elsewhere in Manassas, VA


mLab was affected by this, and based on their own status page it looks like some volumes are permanently unrecoverable: https://status.mlab.com/


I can confirm that. We had a single-AZ RDS instance whose underlying storage (a "magnetic" EBS volume) unrecoverably failed, according to AWS. It had to be restored from a backup. Fortunately, "point-in-time recovery" meant there was very little data loss, just some downtime.

(Not that data loss or downtime mattered for this instance, which was just used for internal testing.)

Power-up is very stressful for hard drives, so it's not too surprising that some failed when the power turned back on. EBS does offer spinning rust storage options, so maybe mLab was using those for some of those failed volumes. I don't know if the same is true for SSDs or not.


EBS is persistent, right?

I mean, backup failures are not a surprise. Entropy is just everywhere.


Is this Northern Virginia AGAIN???? How ironic that a company named "Amazon" cannot keep its servers up whenever there is rain. This has happened practically every hurricane season. Obviously I'm being a bit hard on them (for humor), but come on guys, get a giant umbrella or something.


I wonder if they perform routine tests on their support infra: power, cooling, et al


No, they just let it ride and hope nothing breaks.


I know you're being sarcastic, but you're probably more right than you realize. Many, many "redundant" systems turn out to not be so redundant when it counts.

Talk to some datacenter admins and you'll learn there's a lot more baling wire and hope-for-the-best out there than you would think.


Wonder no more, the answer is: yes, obviously


Someone is always signing off on routine tests.

Same deal as those 50 point inspections your mechanic does: some things are easier to inspect than others, some people do a more thorough job than others, etc.


Yes and frequently.


Scary that AWS can’t restore EBS volumes properly after a power failure. Snapshots are not a solution to this in a live system.


If you lose power mid-write to an HDD, of course you can lose data.

This guy sounds like, if he'd self-hosted, he'd be complaining about an HDD failure. It happens; you need to design around it. Luckily, EBS volumes, snapshots, and AZs make all of this pretty straightforward.


Data, sure. Lose the volume, no.


In my 15 years of experience as a sysadmin and architect, hard drives are far and above the most frequent hardware casualty of power failures.

The EBS documentation states that there's an expected AFR of 1 to 2 per thousand volumes, so you should plan accordingly. Replicate to other sources any data whose loss would harm your business. Keep backups.


He is just having a temper tantrum. Everything can fail, and everything does fail. There was an outage a few years ago at a well-known colo/IP/managed services provider where the feeder line from the power company failed and the ATS flipped to the backup power, which had a limited run time. And, due to one of those 1/1000 events that should never happen (because that ATS should flip maybe 5-6 times a year), it fused to the new position. And it happened in a place where the DC operator would cut off the service on the second line to ensure they could safely work on removing the affected ATS. So the redundant power lines + backup power did not work. If you happened to be in that specific area of the building and happened to know the building engineers, data center engineers, and power company engineers, you would have heard what actually happened. Otherwise you just got an "Imminent power failure" notification. Hopefully you knew that it meant you should shut down all your workloads remotely and send someone on call who could reach the data center in 10-15 minutes to physically disconnect your PDUs from the incoming lines, just in case someone messed up while they were fixing the power, so you didn't blow 10-30% of your PDUs.

That's the reality of life in a data center. So yeah, either accept that stuff like this happens or build for stuff like this happening. Engineering around physical problems in the cloud environment is far easier than in the data center environment.


I’ve seen at least 3 variations of the problem you mention, where power failover caused protracted downtime requiring rush delivery of niche replacement hardware. (That last part is big: I’ve seen hardware from 8-figure enterprise spends down for a week because fixing it required flying someone in, whereas AWS/Google have 24x7 staffing along with redundancy.)

Anyone thinking this doesn’t happen with private data centers is either very green or selectively excusing problems.



