Amazon packages pile up after AWS outage spawns delivery havoc (detroitnews.com)
419 points by HiroProtagonist on Dec 8, 2021 | 198 comments


Check out the Amazon Flex Drivers subreddit[0]; there are tons of people saying they were paid for the day of deliveries without delivering any packages and told to go home.

[0] https://www.reddit.com/r/AmazonFlexDrivers/comments/rb3ggn/i...


Wouldn't surprise me. I mean, DSPs wouldn't be able to scan packages to pick them up without being able to access the flex app, so unless there's a procedure in place to allow packages to be picked up and manually marked as having been picked up... they wouldn't be able to do anything. Moreover, even if they could pick anything up, they wouldn't have any way to navigate, drop packages off, snap photos, and otherwise record all that offline for later upload... because drivers probably can't even sign in.

What's worse, though, is that this also implies that even if an outage is localized to a particular region, data for deliveries to that region isn't multihomed, so you can't just fail over the Flex app service to another region and keep delivering. For a service that lives and dies on being able to deliver everything faster than everyone else... that seems like a massive oversight... but also completely in character for Amazon.


Failing over to another AWS region is actually pretty difficult for stateful services, especially if you can't even access data in the primary region at all. Most teams probably don't have the bandwidth to solve this problem given the number of outages you see in a year (one or two). Also, this is a problem many teams would be solving at once, so most teams probably just wait and see what leaders have to say about it, and, well, nothing ends up getting done.


A one-day outage in December can be crippling for retail.

I don't doubt that many functions are difficult to failover, but a bare-bones minimum seems straightforward. For example, evidence of delivery is append-only and only needs to be globally consistent later, after a dispute.
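
A rough sketch of what I mean, assuming a hypothetical local buffer on the driver's device (the file path and field names are made up): evidence records are append-only and idempotent, so they can be written offline and replayed to the backend whenever it comes back.

    import json, time, uuid
    from pathlib import Path

    LOG = Path("/var/flex/delivery-evidence.jsonl")  # hypothetical on-device buffer

    def record_delivery(package_id: str, photo_sha256: str, lat: float, lon: float) -> None:
        """Append one immutable evidence record; never read-modify-write."""
        event = {
            "event_id": str(uuid.uuid4()),  # idempotency key for later reconciliation
            "package_id": package_id,
            "photo_sha256": photo_sha256,
            "lat": lat,
            "lon": lon,
            "recorded_at": time.time(),
        }
        with LOG.open("a") as f:  # append-only, so safe to buffer while offline
            f.write(json.dumps(event) + "\n")

    def drain(upload) -> None:
        """Once the backend is reachable again, replay buffered events in order."""
        for line in LOG.read_text().splitlines():
            upload(json.loads(line))  # backend can dedupe on event_id

Global consistency only has to be reached after the fact, e.g. when a delivery is disputed.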


I totally understand why Amazon halted everything. Sure, one could deliver shipments off-line and sort them out afterwards manually. And at a lower scale Amazon might have tried it (at least it would have been on the table when I worked there a couple of years ago).

But what then? You'd have a complete loss of traceability of shipments and operations, and once you regain it, a chunk of shipments isn't where it's supposed to be anymore. Now you have not one potential root cause (the outage), which could be resolved by retriggering those shipments (not a loss, as you didn't deliver anything to customers), but two: the outage and some offline shipments. If it was just one FC, sure, that would be doable. If the whole network in a complete region goes down, there's no way to handle that. It is much easier and safer to just stop operations until the outage is resolved, re-route orders to other regions in the meantime, and then work through the backlog. Amazon's ops are good at that, specifically because they have almost complete transparency on their material flows. Going offline would have jeopardized that transparency, making a quick recovery after the outage all the harder.


Any idea if Amazon has insurance to cover this type of event?


Can't speak for Amazon. Generally speaking, though, I don't see how one could insure against it. I know that e.g. Allianz offers policies against IT outages. But in that case, what is the actual damage? Probably the delivery drivers paid without delivering, salaries, plus potential overtime to work through the backlog. Depending on the conditions a company the size of Amazon would get, maybe it's not worth it.


Databases, the stateful part of services, have matured to have multi-region support. This isn't new either. Nobody's saying that it's easy to have multi-region redundancy for stateful services. It's just something you need to have to prevent nasty single-region outages from affecting your service. This is an excellent example where it would have been better to have degraded performance (higher latency) instead of complete unavailability and interruption of business.
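
As a minimal sketch of "degraded instead of down", assuming a DynamoDB global table replicated to a second region (table and key names are hypothetical): reads fall back to the replica, accepting higher latency and possible staleness rather than total unavailability.

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    TABLE = "shipments"  # hypothetical global table, replicated to both regions
    primary = boto3.resource("dynamodb", region_name="us-east-1").Table(TABLE)
    replica = boto3.resource("dynamodb", region_name="us-west-2").Table(TABLE)

    def get_shipment(shipment_id: str):
        """Read from the home region; fall back to the replica on failure."""
        for table in (primary, replica):
            try:
                return table.get_item(Key={"shipment_id": shipment_id}).get("Item")
            except (BotoCoreError, ClientError):
                continue  # this region unreachable, try the next one
        return None  # both regions unreachable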


Tell that to the management of a medium-sized company when you show them the bill for something that has a 0.01% chance of happening, according to AWS...

Not every workload is of the micro size.

Making our databases multi-region would cost around $15M per year, for a company that makes $50M...


I don't think it's correct to call Amazon a "medium company", and last time I checked they make more than $50M.


I don't think the parent's "our company" is Amazon, because Amazon does indeed make more than $50M. But I can attest to cloud-provider multi-region replication being extremely expensive for us (also not Amazon), if only due to data transfer costs.


Can you elaborate? Postgres replication, for example, is pretty straightforward and not a new technology. I'm outside the AWS ecosystem, and with just dedicated boxes, having some DC burn down is manageable. How do magic clouds make that hard?
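
For context, "straightforward" here means something like the sketch below (hostnames and credentials are placeholders, and it assumes streaming replicas are already configured): the primary reports replica state and lag in pg_stat_replication, and promoting a replica when a DC burns down is a well-documented path.

    import psycopg2

    conn = psycopg2.connect("host=db-primary.internal dbname=postgres user=monitor")
    with conn, conn.cursor() as cur:
        # One row per connected standby, with its current replay lag.
        cur.execute("SELECT client_addr, state, sync_state, replay_lag"
                    "  FROM pg_stat_replication;")
        for addr, state, sync_state, lag in cur.fetchall():
            print(f"replica {addr}: state={state} sync={sync_state} lag={lag}")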


"postgres replication" would probably be the least of their worries. It's not about "magic clouds". These are services that are handling millions of requests per second and there's a lot going on where they have to maintain consistency and fail predictably. Having some services go down in one region but being back up in another still serving requests and committing transactions is unpredictable and could create a lot of inconsistencies that would be very difficult to resolve later especially when you have customer facing services like this where someone's package could be lost resulting in bad reviews and other things you don't want to deal with. People here mostly have never even imagined the level of workloads they're handling and are throwing around "easy" solutions like replication or multi-region availability. For something of this scale, it's just not that simple. It would also be incredibly expensive to do this when you could simply shut operations down for a brief period of time. Not like something of this scale has happened that often


OK, but this is a programmer-to-manager explanation. I get how computers work. I know simple things can get very complex at scale. I just thought that the huge extra you pay for cloud services is mostly for battle-tested solutions to these scale problems.

As far as I understand it, if your design is sane, the database/storage handles fallback and recovery.

Or maybe in other words: you need to make your service handle a single machine going down without any problem, cloud or not. And there seem to be two options: it's your machine, or it's part of a service which AWS provides to you. In the second case it's on AWS to handle that, and in the first case shouldn't AWS make it such that, for you, the DC is just a parameter and they handle all the virtual network and other magic?

To be super clear, I'm not arguing, just trying to learn. I would love some specific examples of what makes the problem hard, because all these stories make me stay away from the cloud, which in theory is a solution well worth paying extra for in a bigger-company context.


State ("the database") is hard in distributed systems. Was the package picked up or is it still sitting on a shelf waiting? If your distributed system is partitioned, different queries may give different answers and your warehouse workers are going to be running around looking for boxes that aren't there.

Even if you create a system that is eventually consistent when availability is restored (a difficult problem all by itself, and probably needs a lot of application layer logic), it may not be worth the trouble. Warehouse workers interact with the "perfect" state of the real world, and if the computers don't have access to that, they aren't very useful.


The proper fallback is costly both hardware-wise and in development effort. It can be cheaper to just skip it and tolerate occasional service unavailability.


A correctly designed infra and app will have zero issues with hard failover.

It has been my experience however, that as more people use the cloud, all that "ease of use" both adds layers of complexity, and further, abstracts the backend away.

Thus, by outsourcing sysadmin tasks to AWS, no in-house expertise exists. People don't know how to handle correct failover unless the platform 100% does it all for them.


Whether it's manageable depends entirely on the scenario and needs. Maybe you can afford a 5-minute data loss, but can Amazon? Also, it's quite possible that they have an enormous volume of data, complicating everything.

And there's probably more than just a database. Maybe a message queue for asynchronous treatment, object storage for photos, etc.

It's certainly not an insurmountable problem, but maybe they consider the failure rate so low (it is, us-east-1 going down is like a once in a few years event) that the complexity of multi-region isn't worth it.


It happened once. My expectation of Amazon is that it would be fixed before it became a trend.



It's official[0]: Amazon is sending out emails to Flex drivers informing them Flex is down and they'll be paid for scheduled blocks without deliveries.

Apparently even some warehouses were ghost towns today without anyone sorting packages[1]

Here in Phoenix, I've had two packages that were supposed to be delivered today now delayed. Also, the quickest delivery option on items just went from overnight to 3+ days.

[0] https://www.reddit.com/r/AmazonFlexDrivers/comments/rbf8ti/j...

[1] https://www.reddit.com/r/AmazonFlexDrivers/comments/rbbnwa/y...


That's actually pretty nice though; even though people are missing out on packages at least Amazon won't leave them out in the cold for today.


My inner cynic says Amazon paid Flex drivers because A) they still need drivers available during the Christmas run-up and B) screwing drivers would put Amazon in an Ebenezer Scrooge-like PR disaster.


Amazon in a PR disaster involving corporate greed and avarice? Oh no. Anyway. No really, I would not be surprised if Amazon has several playbooks upon playbooks involving commercials of how Amazon helped some warehouse worker get through college or some medical bill as their means of a public mea culpa.


Paying them only to avoid the PR disaster is fine. Kinda what PR disasters are “for”


As long as their actions are kept in line.

Corporate words are worthless anyway.


Wow, I would have expected Amazon to have stiffed them. But I guess Amazon really needs them for the last mile.


High turnover right before the Christmas boom would not be good for Amazon.


I read many comments and didn't see one saying that, but you might be right, as I was too lazy to read everything.


<ctrl-f>home<enter> finds them pretty quick for me.


There were a couple of comments that matched this search filter... not that many, but thanks.


Whenever us-east-1 goes down you just get a really good feel for how many other companies also have pretty fragile setups. The apps I work on can deal with a few hours of downtime, so as long as I'm sure I can recover from getting totally leveled, it's ok. And I think that's the way it is for the majority of companies. Most don't want the extra effort and cost of failover.


Us-East-1 took out commuter rail travel in Chicago for like a good 30 minutes, so not just companies


Why is a commuter rail travel system dependent on AWS? O___O


From what I’ve seen about Government IT it’s probably a good thing it’s on AWS.

Why shouldn’t Government IT be using the same tools regular companies use for IT?


Government IT using AWS is fine. Using it for safety critical things like railways is a lot more of an issue. What if there's any kind of latency etc?


They decided to run their infra in AWS instead of another cloud provider or on-prem?

why is McDonalds dependent on AWS for their app to work? (/s)


> Us-East-1 took out commuter rail travel in Chicago for like a good 30 minutes, so not just companies

Was it Metra or the CTA out of curiosity?


In Chicago I believe “commuter rail” almost always refers to Metra.


I'm almost tempted to think that having an explicit policy of forcibly shutting down each region once per month for a few hours (at times that are not publicly announced in advance) would be a worthwhile value-add.

A service that is unable to handle such a failure does not qualify as being ready for deployment. And I'm not just saying that to be a self-righteous pedant, I'm saying it because this kind of failure is statistically likely to happen at some point, so ignoring it is setting yourself up for major (and potentially very expensive) problems if you don't test for it regularly.
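
Obviously only AWS could black out an entire region on purpose, but a team can approximate the discipline with a chaos-monkey-style game day, something like this sketch (the opt-in tag name is made up): randomly terminate a few instances that have opted in and see whether anything user-visible breaks.

    import random
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:chaos-eligible", "Values": ["true"]},     # hypothetical opt-in tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    ids = [i["InstanceId"] for r in resp["Reservations"] for i in r["Instances"]]
    victims = random.sample(ids, min(3, len(ids)))  # keep the blast radius small
    if victims:
        ec2.terminate_instances(InstanceIds=victims)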



Chaos Engineering was more or less created for a service mesh of microservices.

The companies we're talking about here are most likely running a monolith app on CentOS 6...


* raises hand :)


Therefore I didn't expect Netflix to go down. It would be interesting to learn in a postmortem why Netflix had issues despite Chaos Gorilla.


It seems like the fact that Route53 went down for changes was a big part of it.


I wonder how the simian army has evolved in the past decade.


Smallish non-tech companies typically have single points of failure that are much more severe than a server/service being down for a couple of hours. Accountant on holiday? Guess we're not writing any invoices this week. Two of our drivers are ill? Guess we'll have to let all of this stuff pile up here for a bit. By comparison, a public-facing service or intranet being down for a few hours hardly merits attention.


Shutting down availability zones I could almost see, since it's not that much more difficult or expensive to architect to handle that gracefully.

But an entire region? I've never worked anywhere that decided being multi-region was a good tradeoff. At best we've replicated data to another region and had some of our management services there, so in the absolute worst case (which would need to be much worse than yesterday) we could rebuild our product's infrastructure there.

Do I agree with this approach? Not in all possible cases of course, but for my employers? Overall, yes. It mitigates the highest-impact risks. Going further would have significant complexity and costs. Those companies' success or failure hasn't been impacted by their multi-region strategy, AFAICT.


Most services and applications can occasionally have unplanned downtime. Many also have planned downtimes, sometimes measured in half a day or more.

Services that truly need 100% uptime (and by that I don't mean what management says they WANT, but what they are prepared to PAY for) are a tiny minority, to the point that I imagine most software developers never work on one.

Even the setups I have seen that handle such cases usually had some single point of failure somewhere.

And let's face it: even if you do multi-region on AWS, that won't protect you the next time someone screws up BGP/switches/DNS or whatever and every region goes down for a while. In that case you'd better have failover to some other cloud vendor. Even planes and other safety-critical equipment are not 100% failure-proof.


At the project level maybe. At the AWS level, that's ridiculous. A lot of services can live with one major outage a year, but not once a month. Completely failing when their region goes down is reasonable.


Region failures can be non-recoverable (e.g. natural disaster).

One major outage per year doesn't necessarily mean just a few hours of downtime, it could mean having to redeploy your entire service somewhere else, which could take several days or more if you haven't prepared for it. How many of those services can live with that?


> How many of those services can live with that?

Apparently, many can! Look at how many companies choose not to pay ransom when hit with a ransomware attack, or prefer to negotiate for days instead of buckling straight away, even if it means operations are crippled for weeks. They don't typically go bankrupt afterwards, everyone coped, life moves on.


It's finals season, so all the students taking online exams on Canvas had a fun day.


The frightening thing is how fragile it seems.

If you're going to concentrate risk on AWS, it better be essentially flawless, stable, and highly-redundant.


It is, most of the time. That's why these outages are such a big deal. Nobody would care if it went out multiple times a month.


I'm still surprised how many people are still using us-east-1.


I'm just surprised Amazon was not using something sectioned off from the rest of AWS.


My understanding is there's a pretty strong culture of running amazon.com directly in AWS as much as possible. I don't think there's much reason to believe a sectioned off alternative would be much more reliable, it would just break at different times maybe.


Possibly stems from one of the tenets of the apocryphal Bezos decree that started AWS[1]

> All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.

1: https://nordicapis.com/the-bezos-api-mandate-amazons-manifes...


That attitude, and the ability to enforce it at the scale Amazon is at, are probably Bezos' biggest strength. It does cover all aspects, though, including work culture...


Those Increasing error rates with some customers affected... /s

All green https://news.ycombinator.com/item?id=29473630


Yeah "increased error rates" is an understated way to describe a massive outage that took several services offline for a full business day.


'Patient is experiencing a decrease in heart-rate'


I thought this story was about this house x-)

"Neighbors in Tennessee city worry as Amazon packages pile up outside home"

https://www.youtube.com/watch?v=qVQjEB2sxBw


I don't even understand what's going on with this one. Transcribing the blurb from the clip, from interviewing the residents of the house:

"[The people who live in the house] have a friend who has a contract with an Amazon warehouse in China. They say whenever that friend's contract expires, she will send the packages to their house for the family to sort and then send back to Amazon for the company to sell."


From what I understand, it's a play around the Amazon Inventory Storage fees ([1], [2]), which normally are $0.75 per cubic foot monthly, but grow to $6.90 per cubic foot, if stored for more than a year and not sold.

It's possible to ask Amazon to remove selected items from their warehouse and send to any US address for a somewhat reasonable fee of around $0.32/item ([3]).

What happens here is the Chinese sellers send packages to the residents of the house, and then relist items under a different account, so that the counter is reset and the Amazon storage fees are low again.

That said, the residents of the house run a subpar operation.

1. https://sellercentral.amazon.com/gp/help/external/G3EDYEF6KU...

2. https://sellercentral.amazon.com/gp/help/external/help.html?...

3. https://sellercentral.amazon.com/gp/help/external/help.html?...


That would pretty much explain it. Thanks for taking the time to link it together. Suppose this is a situation of "Don't hate the player, hate the game."


Right? I seriously cannot come up with any scenario where this might make sense, not even anything remotely plausible. And they're "storing" the packages, all sorts of different shapes/sizes, out in the elements?! What the heck? My curiosity is running wild & unsatisfied. =D


For some replacement computer parts and other such items shipped via FedEx to a certain remote location, all packages came with a coating of dust. Like, wash your hands after receiving the hard drive box, and then wash again after cutting the box open to keep the anti-static bag clean. I guess if a shipping company can get away with it, they will? Roofs and floors cost money.


Yeah, I noticed that too, someone's getting scammed.


When a residential house becomes a review farm.


When the house is becoming an Amazon warehouse. This is crazy; I can't imagine what the homeowners thought of so many boxes coming there.


There was a similar story on my local news as well. Someone's front yard filled with hundreds of packages.


More like dropshipping gone wrong.


As bad as this article makes AWS sound, it's actually the reason you should go with AWS over, say, Azure or GCP; when AWS goes down, its owners actually feel the pain with you. Microsoft and Google run their own stuff elsewhere...


> Microsoft and Google run their own stuff elsewhere...

That is simply not true.

https://www.zdnet.com/article/microsoft-moves-closer-to-runn...


True for Google though, with their Borg thingy being separate from GCP, last time I asked them.


GCP runs on top of Borg. (GCP failure does not mean Borg failure, though.)


This is simply not true. Microsoft dog foods tons of its own software and all internal cloud services I know of run on azure. Also, having consumed both cloud service offerings, customer support within Azure was far more responsive.

Source: used to work as a SWE on a flagship Azure service


> Microsoft dog foods tons of its own software and all internal cloud services I know of run on azure

so it's even more embarrassing how terrible the software is.

Trolling aside: if this is true, the state of e.g. MS Teams is a travesty. Replies to messages only implemented in 2021!? So many bugs, etc.; it seriously damages productivity. And don't get me started on SharePoint.


Since it seems I need to back up my statement: https://news.ycombinator.com/item?id=29492884 is one example... Is that teams' fault? No, but they'd be able to call 911 if it wasn't installed.

I have a running list of teams bugs/flaws/inferiorities, it's currently about 30 items, will probably publish it sometime soon


Well you definitely don't understand Azure or GCP then.

Office 365, Teams, Dynamics, xcloud, xbox live - they all run on Azure.


No, not true at all. What do you think Google and Microsoft run their cloud services on?


Google does not run the bulk of their services on GCP. Unlike AWS, GCP was not a productization of their existing infrastructure, but rather a separate cloud product developed fairly independently.

I'm sure that's changing with time.

YouTube also has its own infrastructure independent of GCP and the rest of Google


GCP is not separate from Google's core infrastructure; rather, it's built on top of it. That means that while you can certainly have GCP-specific outages, this kind of core-infra "everything is down" situation is almost guaranteed to hit everything, GCP and non-GCP alike. A lot of GCP sub-products are productionizations of existing Google tech; e.g. BigQuery is a public version of Dremel, an internal database/query engine they'd been using internally for a while.

I'm pretty sure YouTube hasn't had their own infra in quite a while. When I was there ~8 years ago I think it was all integrated already. Certainly database, video processing, storage, CDN were all on core Google infra, and I'm sure the frontends were too though I don't remember looking into that explicitly.


There aren't 100k Googlers developing on top of GCP services to get their job done on a daily basis. That's the big difference between the two clouds' level of dogfooding.


> There aren't 100k Googlers developing on top of GCP services to get their job done on a daily basis.

Doing something on top of GCP rather than the normal way at Google was a huge pain. Borg tutorials and documentation were just far superior: I could get a thing running on Borg in an hour from not knowing anything about Borg, while I spent a week trying to get something running internally on GCP and still couldn't get it right (our team wanted to see if we could run things on GCP, so I was tasked with testing it; I couldn't find anyone who knew how to do it, so we just gave up after I didn't make any real progress). That was the worst-documented thing I've ever worked with. And even worse, the internal GCP pages were probably running in California and probably weren't tested from Europe, so the page took like 2 seconds between a mouse click and any response.

That was years ago though, and I no longer work there, but at least back then the work to make using GCP internally seamless wasn't done. Maybe it is simpler if you run everything in it and don't need it to play well with Borg, but there is a reason why it isn't popular internally. And you likely won't find many engineers who left Google who recommend you use it, since they probably didn't test it, and if they did it probably was a bad experience (unless they worked on GCP).


>GCP was not a productization of their existing infrastructure

Neither was AWS.


AWS, of course


> Microsoft and Google run their own stuff elsewhere...

Where on Earth did you learn that?


Source? I'd be surprised if Google does not dogfood. Though I guess, given that GCP had their recent global load balancer outage and neither Gmail nor Google Search went down, maybe not.


Most of Google's stuff runs on Borg, which predates GCP (and is the fabric GCP runs on top of).


Borg is the predecessor to Kubernetes (sort of) right? They could have switched to Kubernetes and run on Google Kubernetes Engine (GKE).


I'm a bit out of the loop, but the vibe for years was "It isn't broke, so we're not going to rewrite it to run on something else."

Actually, it's a bit more than that. Some half-decade ago, Google Cloud got stung by a coordinated attack over the holidays where attackers used stolen credit cards to build a net of GCE instances and launch an attack on Tor via endpoint control. Cited by SRE in the postmortem was the relative immaturity of the cloud monitoring, logging, and "break glass" tools compared to what SREs were accustomed to in Borg... Essentially, Cloud didn't have the maturity of framework that Borg did, and they felt the extra layer of abstraction complicated understanding and stopping the abuse of the service.

This report had a chilling effect internally. Whereas management had previously been encouraging people to migrate to Cloud as quickly as possible, after this incident software engineering teams and the SREs that supported them were able to push back with "Can we trust it to be as maintainable as what we already have?" and put cloud on the defensive to prove that hypothesis.


HN is a very pro-Amazon place. Regardless of what Amazon does, at least one of the top 3 comments is always justifying Amazon's actions.


That’s not the impression I got from the last 24 hours, nor the last few years.

1. Scale is hard and downtime is hard; HNers either recognize the struggle or acknowledge their own lack of experience. When AWS fails, many armchair architects come out to suggest solutions, but many more techies just sympathize with the Amazonians.

2. Technically, Amazon has built something impressive. It might not be what you want, or what others have, but AWS is impressive in scale and scope and even reliability. Many people share credit where due.

3. One can criticize the treatment of warehouse and delivery workers that Amazon is known for, but this has little bearing on the tech workers there nor on AWS generally. So AWS stories tend to be free from the social critique the company as a whole receives.


Here in Chicago every item on Amazon is showing an earliest delivery date of Saturday the 11th. I checked an item today that I had bought yesterday with overnight delivery; it is now showing Saturday as the earliest available delivery. What a terrible time of the year for this to happen to them.


Exact same here in Phoenix.


I'm in Chicago and just purchased a Prime-eligible book (but sold by a third party) that's listed as a Jan 6 delivery.


I've seen mention of a major cable tv operator/last-mile residential ISP that had their entire field-tech dispatch system go down (based on a web gui https app that people can use on their phones) because it's hosted in AWS.


My friend lives in very cold Alaska. She has an app on her car that lets her remotely warm her car, which is a requirement before driving because it's so cold. The AWS outage meant her phone couldn't talk to her car. She was 20 mins late to work because of that.

This is the future. AWS has an outage and your car won't work.


There was a time when we didn't have remote start cars. We managed to get to work on time.

This is why I buy older cars with minimal electronics, BTW.


Yeah... go out and start it, for whatever value of "start" you need. I frequently walk out to our Volt and start it to let it warm up the cabin on shore power, though I can also do it through a window with the key fob. Or I'll even go start a motorcycle to let it run for five minutes before I head out in the winter.

I'm really not sure what I'm going to do with cars in the future. The Volt is the last wave of not-always-connected-software-OTA-big-data-analysis cars, and even it had OnStar; it just tries to talk to towers that no longer exist. But there's a difference between that sort of system and what newer cars have. I suppose you can always disable the cell modem and let them have occasional wifi access to update, but... ugh.


Couldn’t remote start the car with the fob? Viper/Directed has a paid cloud/app service for convenience, but the primary/default method to activate the remote start is RF.


Couldn't you also just go outside and turn on the car, like people did before remote starts?


We apparently don't teach advanced problem-solving like this in school anymore.


That's a funny comment, but in reality the heating was probably timed, and she only realized the car was cold when it was time to leave.


When I was consulting in Nome AK - it was -40F outside and there were locals standing outside behind the hospitals smoking cigs...

My brother lives in Anchorage and he went hiking with his wife about a week ago, and he said it was -20F (my brother was a colonel in the USAF, so not just like a weak person).

But the point is, she should have been able to just run out to the car like other Alaskans do, and not be late to work.


Yeah, at those temperatures it really depends on how much wind there is and how long you will be outside. -20F did freeze all my car doors completely shut once though, to the point where they could not be wrestled open.


It's unclear to me that the app being discussed started the car?

In sufficiently cold conditions, you need to warm the engine up before starting the car, with an electric heating element. If this was controlling one of those heating elements, the problem was presumably that the car wasn't able to start at all, not that the car couldn't be started remotely.

https://en.wikipedia.org/wiki/Block_heater


It's probably an electric vehicle that has to warm up the battery before going anywhere in extreme cold, though 20 minutes seems short.

When I use a block heater, I let it run for a couple hours before needing the vehicle in the cold. I've got an outdoor outlet timer that I can use to start my truck's block heater around 2-3AM if I need it for something in the morning. It'll start at 0F without it, but it's exceedingly clear that it's not happy about the arrangement, so I preheat.

Whatever it is, I'm entirely unsurprised that some app or another, talking to some cloud service or another, talking to some car or another, fails silently when "impossible" things have happened. Nobody seems to consider that the cloud can fail. Even though it does, quite regularly, and reliably breaks all sorts of stuff every single time it does.


The app failed quietly. So yes, bad app UI design but also bad back-end design with a single point of AWS failure.


5 years ago you couldn't even start your car remotely, and many people lived in Alaska; this is such a first-world problem lmao.


> 5 years ago you couldn't even start your car remotely

Of course you could. There have been remote car starters for 20+ years. They don't rely on internet nonsense to work either.


Also block heaters are a thing


Did AWS hide her keys, too? This story makes no sense even in block heater Alaska.


Apparently based on another comment the app fails silently. Which is wonderful design.


It's amazing to me that people think they're adding something to the conversation by posting the most banal response.

"Yes, the dumb broad didn't think to walk outside and turn the car on! I shall right this wrong with my clever internet post! Behold my intelligence!"


To be fair if someone was using the “app didn’t work” excuse multiple times, suspicion could be raised.

But for a one off occurrence? Why would you assume a car company knows what they’re doing over the person telling the story? It’s silly.


There is a certain personality type common among midwits who are smart enough to think of the most obvious workaround/solution/exception but not smart enough to realize that when people communicate, especially online, they usually choose brevity over exactness.

Like if I say "Humans have two feet" some midwit will come along with an article or anecdote about a person who was born without two feet.

And if I say the multi-hour outage of AWS made my friend minutes late because of her car warmer, what is going through the mind of a person who offers the solution "Did she try walking to the car and turning it on manually?" These are the kinds of people that if you met them in real life you'd quickly distance yourself from them.


>Like if I say "Humans have two feet" some midwit will come along with an article or anecdote about a person who was born without two feet.

It's autism.

One of the symptoms of autism is [the inability to recognize sarcasm](http://www.healthcentral.com/autism/c/1443/162610/autism-sar...) without the help of idiotic, illiterate signals like "/s". Your example has the same cause; people who are unable to understand nuance and social cues, whether in real life or in written form.


> My friend lives in very cold Alaska.

There’s another Alaska, less cold?

I’ve read reports that even Ring doorbells don’t eh… ring because of the outage.


I thought the whole point of AWS that it was a distributed system with no one point of failure? In that case, how did it have an outage?


Region outages happen, which is why the guidance is to build in multiple regions, but even Amazon sometimes doesn't take their own advice. Sometimes the reason is good, sometimes it isn't.
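
One common building block for that guidance is DNS failover, roughly like the sketch below (the zone ID, domain names, and health check ID are placeholders; it assumes the app is already deployed behind region-specific endpoints): Route53 health checks flip traffic from the primary region's endpoint to the secondary one. As far as I understand, health-check-driven failover is data-plane and kept resolving during this incident; it was making record changes (the control plane) that was impaired.

    import boto3

    r53 = boto3.client("route53")
    r53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",  # placeholder hosted zone
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",  # placeholder
                "ResourceRecords": [{"Value": "api-use1.example.com"}]}},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
                "SetIdentifier": "secondary", "Failover": "SECONDARY",
                "ResourceRecords": [{"Value": "api-usw2.example.com"}]}},
        ]},
    )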


Our EC2 instances in us-east-1 didn't actually go down. But IT engineering was completely disrupted because our SSH login mechanism relies on the API to show you the list of instances for you to select which one to start an IAM handshake with. Our support phone line was also down. Even SQS kept chugging along just fine. I'm actually glad we're not on Lambda because of this.
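
Something like the sketch below would have helped us as a stopgap (the cache path is arbitrary): prefer the live describe_instances call, but fall back to the last known-good list when the control plane is unavailable, since the instances themselves were still up.

    import json
    from pathlib import Path

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    CACHE = Path.home() / ".cache" / "instance-list.json"  # arbitrary cache location

    def list_instances():
        """Live API first; stale-but-usable cache when the API is down."""
        try:
            ec2 = boto3.client("ec2", region_name="us-east-1")
            resp = ec2.describe_instances(
                Filters=[{"Name": "instance-state-name", "Values": ["running"]}])
            hosts = [{"id": i["InstanceId"], "ip": i.get("PrivateIpAddress")}
                     for r in resp["Reservations"] for i in r["Instances"]]
            CACHE.parent.mkdir(parents=True, exist_ok=True)
            CACHE.write_text(json.dumps(hosts))  # refresh the known-good list
            return hosts
        except (BotoCoreError, ClientError):
            return json.loads(CACHE.read_text()) if CACHE.exists() else []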


I love GCP so much because of their ssh login infrastructure. It’s just beautiful.

The cli is gorgeous, the web gui is terrible.


I thought the guidance was to be multi-AZ, as an AZ is the failure boundary?


AZs are physically located near each other, usually within a small enough radius that they could all be impacted by the same natural disaster. In some cloud providers and regions, AZs are simply different parts of the same building (IIRC one of the Japan regions of Azure was essentially this, but don't quote me). And evidently, they share some infrastructure.

At a previous job where we needed to always be up, our disaster recovery plan assumed that the us-east-1 site had been hit by a meteor (not literally, but that's how we explained it to each other to put ourselves in the mindset.)


A meteor did not hit us-east-1 yesterday.


You misunderstand- our planning was to prepare for a scenario of that extremity.


AZs are physical boundaries, but the networking and software is interconnected. Regions are (mostly) isolated, though global services like IAM and CloudFront often have their main control plane in us-east-1


Software bugs recognize no boundary


Everyone wants to be multi-AZ, multi-cell, but it's a multi-year project, especially for services that have been around for a while. My last team had been working on it for a couple of years when I left.


I think it'd be correct to call the AZ a failure boundary, not the failure boundary. This is hardly the first time a failure has exceeded an AZ.


You can always do better because failures can always be bigger/wider. multi-instance < multi-AZ < multi-region < multiple vendors


Multi-AZ deployment and multi-region failover is considered best practice.


AZs are in one physical location. If that physical location has a problem, all AZs within it will go down.


>An Availability Zone (AZ) is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region... AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.

https://aws.amazon.com/about-aws/global-infrastructure/regio...


do they not teach people what a failure domain is anymore?

https://en.wikipedia.org/wiki/Failure_domain

Clouds aren't magic. They require a certain amount of operational confidence in order to understand that, yes, an entire region can fall out from under you at any time and it's your responsibility to detect and deploy into an unaffected region if possible.

edit: Generally, one entire region will not fail. However, core services like STS rely on us-east-1 so it's particularly susceptible to disruption.


People get slagged on for not having inter-region redundancy. But unless your business model can accommodate that, you're introducing another failure domain and a lot of money for duplicative infrastructure, network fees, etc.

For many use cases, it's acceptable to shrug and blame AWS for a failure. It's harder when your high-availability solution fails independently, which it almost always does more often than us-east-1 does.


If your site goes down while everyone else's goes down, it's weirdly forgivable.

And being the only one up doesn't win as many market cred points as you'd think.


Exactly. Often people want the perception of reliability.

For many services, it makes more sense to make it reliable than not. For other services, it makes more sense to think about the engineering of the solution in the field.

Example: McDonald’s product images on kiosks are apparently in S3 and not cached locally. Seems like a dumb idea to me, but I wouldn’t try to build a more reliable cloud storage backend to control that risk.


> do they not teach people what a failure domain is anymore?

Empirically? No.

But seriously, instead of making every dev team consuming an AWS product hire the extra engineers required to build a system that spans multiple failure domains (which, let's be honest, companies won't), why hasn't Amazon just hired the engineers required to do it for me?

> Generally, one entire region will not fail.

Laughs in global failures.


Honestly, no. Not frequently enough.

There are also way too many successful, public cloud-native businesses running without any semblance of a Business Continuity or DR Plan among any of their teams. I wish I could name and shame some of the more egregious cases I have seen.


> However, core services like STS rely on us-east-1 so it's particularly susceptible to disruption.

FYI this example hasn’t been true for a while. STS regional endpoints are generally what you should be using these days. The “global” us-east-1 endpoint still works, and may be the default for some clients, but isn’t a requirement.
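
For anyone still on the global endpoint, switching is usually a one-liner; a boto3 sketch (the region choice is arbitrary):

    import boto3

    # The legacy global endpoint (sts.amazonaws.com) is served out of us-east-1;
    # a regional endpoint keeps credential calls working during a us-east-1 incident.
    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        endpoint_url="https://sts.us-west-2.amazonaws.com",
    )
    print(sts.get_caller_identity()["Account"])

Newer SDKs can also be flipped fleet-wide with the AWS_STS_REGIONAL_ENDPOINTS=regional setting instead of hard-coding the endpoint URL.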


In my experience, once an item is done done done, it’s out of your hands.


https://status.aws.amazon.com/ says that multiple network devices failed.


They very carefully did not actually say that. The weasel wording used could be a routing configuration error, among other things.


A lot of AWS is very old and it's very hard to go back and make everything nice and reliable without breaking things when you have a shitload of customers 24/7 using your services.

So yes, in theory much of AWS's services are probably very reliable and distributed across AZs and regions, but in practice there's likely a whole bunch of debt where one thing gets fucked up and it cascades.


As an AWS employee, you’re under too much artificial deadline pressure to address technical debt like this… let alone engineer a service properly in the first place.


As a corporate employee, you’re under too much artificial deadline pressure to address technical debt like this… let alone engineer a service properly in the first place.


A profitable "accident".


I guess on a load balancer or DNS level? A request has to hit a single domain name / IP before it's load balanced to the distributed system right?


Multiple devices in very different parts of the world can respond to the same IP address. A great example are DNS servers like 1.1.1.1 and 8.8.8.8- you get routed to the nearest site that can handle your request. A common example of how this is implemented is via anycast: https://www.cloudflare.com/learning/cdn/glossary/anycast-net...


Does Amazon follow their own multi-zone best practices?

To be fair, we only run services in us-east-1, and we had zero downtime from this issue. The only issues we encountered were API calls that were failing for us, and those were for CloudWatch and CodeDeploy. We are heavily reliant on EC2, OpenSearch, and S3.


Something I'm not getting here. OK, one zone went down in AWS. But there are supposed to be multiple ones? And if anybody knows how to use AWS so that one zone going down doesn't knock over the whole thing, it should be Amazon. I thought the whole point of this multi-zone cloud setup was that even if one zone goes down, it could survive and perform as usual; maybe a little slower for a bit while backups kick in and such, but after an initial short period everything should be fine. Here we're talking about the whole system going down hard. Was the problem much wider than reported, i.e. affecting all (US) zones at least, or is it that Amazon doesn't know or doesn't care to use its own infrastructure to build a robust system?


This affected one region, us-east-1, which is the oldest and largest AWS region and also hosts many core AWS services. Each region has several availability zones (AZs) that are basically whole datacenters. In theory AZs should be mostly isolated, but as we saw, bugs happen. Reading between the lines of the status updates, it sounds like this either affected some core infra shared between AZs, or was from a change that rolled out cleanly earlier (maybe last night) and failed later (e.g. due to increased load in the morning).

At the end of the day, building redundant systems is expensive (and may introduce whole new bugs), so they probably did the math and figured the risk of a whole region outage was less than the cost of building redundancy in some systems that are hard to make redundant.


The whole concept that there are "core AWS services" living in one single region sounds like the antithesis of everything AWS should be about. What's the use of having all this nice distributed setup, and paying for it, if a single failure in a single region takes everything down anyway? I mean sure, maybe they built it on a shoestring budget years ago, but since then they've had years and billions, and still didn't bother to fix it?


There are a lot of pitfalls. You have to be diligent about stuff like making sure you're using region-specific endpoints like sts.us-west-2.amazonaws.com instead of sts.amazonaws.com. And some services, like Route53 and CloudFront, have built-in dependencies on us-east-1. There are also internal dependencies that are sometimes not obvious until an outage occurs.

Though, as you say, none of that's a great excuse for Amazon itself. I know Alexa devices, Ring Devices, Prime Video, imdb.com, etc, also all had issues in the first hours of this outage.


This zone hosts the infra to manage the rest of the zones. DNS management is also here and can easily cause a global failure.


It’s beginning to sound like a cyberattack to me.


Does Amazon host its core infrastructure in us-east-1?


It would appear so. Also it’s good Amazon has skin in the game here.


Yes, and also:

[9:37 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates.

https://status.aws.amazon.com/


East was the first region. Salty programmers avoid it when possible. Because it's the oldest and cruftiest region.


Some of it. Lots of stuff is distributed, but us-east-1 hosts some global internal services


why wouldn't they?


Because it's terrible and has always been terrible?

After 2017 I'll never use US-east-1 again. Hell... I should have learned that particular lesson in 2011 but it took two catastrophic failures for me to figure it out.

There are numerous threads here on HN covering the topic "why does US-east-1 suck so hard."

https://news.ycombinator.com/item?id=13756082 is just one example.


Maybe it’s terrible because they host their own stuff there.


I wonder if there’s too much moral hazard for them to run a data center just for Amazon but run exactly the same as the other data centers? Or would Amazon scope creep it into its own snowflake.

I kind of doubt that they’re getting efficient use of hardware Amazon could be using but isn’t, since if someone allocates it it’s not available for Amazon anymore.


That's the story of AWS as a whole.


If their stuff was all super resilient and multi AZ, it would be easier to place blame on the customer


Multi-AZ != multi-region. AWS has many services that run in multiple regions, but some of their services do not.


Correct, I mean region. Thank you


correlation and causation, anyone?

You'd think if it's anything mission critical, Amazon would be following its very own Well-Architected Framework. The Reliability pillar would speak to this.


I don't think the Well-Architected framework specifies that you should use multi-region availability, but it's certainly a mentioned option for mission-critical applications. Usually the go-to doctrine for high availability in AWS documentation is multi-AZ, not multi-region.

https://docs.aws.amazon.com/wellarchitected/latest/reliabili...
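
In practice the multi-AZ guidance mostly means letting the orchestration layer spread capacity across zones; a rough boto3 sketch, with placeholder launch template and subnet IDs (one subnet per AZ is what does the spreading):

    import boto3

    asg = boto3.client("autoscaling", region_name="us-east-1")
    asg.create_auto_scaling_group(
        AutoScalingGroupName="web-multi-az",
        LaunchTemplate={"LaunchTemplateId": "lt-0123456789abcdef0",  # placeholder
                        "Version": "$Latest"},
        MinSize=3,
        MaxSize=9,
        DesiredCapacity=3,
        # Placeholder subnets, one in each of us-east-1a/b/c.
        VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
    )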


If this outage isn’t the catalyst to get the Amazon side of the house to finally move out of US-east, I don’t know what will be. Or at least be multi-region.

Although the cost to make all of Amazon commerce, logistics, and digital truly multi-region is probably an order of magnitude more than the impact of this outage.


True. But let's be honest... this is neither the first such outage in US-east-1 nor the last. So I'd argue it's long past time for Amazon to pay the bill and go multi-region.


I wonder if their system would allow them to move availability zones between regions. They could create new stealth AZs, move their stuff into them, then use those to start building a new region only they use.

But I suspect there are third-party integrations that benefit from being in the same AZ as Amazon APIs. I've been having little convos all day about how I think the inter-region pricing creates a perverse incentive that's exacerbating the us-east-1 situation.


Last big e-commerce outage I can think of was Prime day years ago when Sable browned out. But that was non-AWS infrastructure. There was also the infamous s3 fat-finger; but I can’t think of a holiday shopping and delivery day with a massive outage.


It’s kind of ironic to go through AWS well architected framework and add all the complexity associated with multi-region setups when Amazon themselves couldn’t get it right.

To be clear, I am not advocating for ignoring high availability setups. Just highlighting the complexity cost of it.


I placed a few orders this morning. They appeared to go through, but didn't show up in my order list until this evening. Good thing I made the connection to the outage before I decided to try re-ordering! Sounds like others weren't so lucky.


I feel like everything being centralized completely misses the point of the internet.


Shit happens. My last employer is now mostly on AWS, and mostly in that region, so things broke all over the place yesterday (oddly enough the only service that is multi region was auth). In the past when we were mostly in our own datacenters we suffered as well, so that's not better either. One time a bad DNS update killed every single service worldwide and it took most of a day to properly restart thousands of services. Build complex systems with thousands or millions of things and it's bound to blow up sometime.


Yeah my packages didn't get delivered today. I assume there will be cascading delays for the rest of the week. Getting close to xmas travel times, trying to finish up my shopping this week.


Will Amazon compensate all of their clients?


If there's an SLA or reputation to uphold, probably, and otherwise they probably won't. So big AWS customers with SLAs will probably get compensation. And Amazon dot com customers that complain might get some compensation.


I think being "multi-cloud" is going to be a feature more prominently displayed in days ahead.


us-east-1 has an uptime of 99.9%; that's low enough to get most sysadmins fired, but being able to point to headline news of how the same outage affected Amazon placates management.

The key benefit of the cloud is blameshifting. It’s someone else’s problem, you just get the day off.


The kinesis outage in 2020 and the S3 outage in 2017 didn't change much. Both as bad or worse than today.


This really tarnishes any trust you might have had in Amazon as a professionally run business.

(of course outages can happen and are to be expected especially at Amazon's scale, it's just the bad communication and the amateurish non-redundant setup of their own core services that is shocking)


FYI: My local Amazon Hub+ caught-up in under an hour.


"If it bleeds we can kill it"

Imagine ransomware targeting Amazon and spilling into the real world logistics, how much ransom could they charge.


Chaos.


Weird, I ordered a package about 3-4 weeks ago and it didn't show up, so I finally went to track it on Amazon last week and got a message saying that they had lost the package and to request a refund. It kind of sucks because that was my mom's Christmas gift; now I have to figure out a plan B.


I have this happen constantly through Amazon. Now all the hassle is on you to get the refund. And they won't process the refund immediately, so you have to pay out of your own money to get a replacement. And then often, as I've found, the price of the item has now gone up too, so you have to pay extra.

p.s. Your username?! I think this must be the only site where you've managed to get that handle?


It is the only site. Someone once had an even lower number, and someone made mention of it, and that person replied saying it was easy and he had just gotten it, so I decided to try and see what was available. I had been a long-term lurker and never posted, but once I got this username I try to respectfully post when I can.


What's fascinating is that it was still available in 2018.

And the username I use here isn't available on pretty much any other site, but was still available here in 2021.



