Check out the Amazon Flex Drivers subreddit[0], there are tons of people saying they were paid for the day of deliveries without delivering any packages and told to go home.
Wouldn't surprise me. I mean, DSPs wouldn't be able to scan packages to pick them up without being able to access the flex app, so unless there's a procedure in place to allow packages to be picked up and manually marked as having been picked up... they wouldn't be able to do anything. Moreover, even if they could pick anything up, they wouldn't have any way to navigate, drop packages off, snap photos, and otherwise record all that offline for later upload... because drivers probably can't even sign in.
What's worse, though, is that this also implies that even if an outage is localized to a particular region, data for deliveries to that region isn't multihomed, so you can't just fail the Flex app over to another region and keep delivering. For a service that lives and dies on being able to deliver everything faster than everyone else... that seems like a massive oversight... but also completely in-character for Amazon.
Failing over to another AWS region is actually pretty difficult for stateful services, especially if you can't access data in the primary region at all. Most teams probably don't have the bandwidth to solve this problem given the number of outages you see in a year (1 or 2). Also, this would be a problem many teams would be solving, so most teams probably just wait and see what leaders have to say about it, and, well, nothing ends up getting done.
A one-day outage in December can be crippling for retail.
I don't doubt that many functions are difficult to failover, but a bare-bones minimum seems straightforward. For example, evidence of delivery is append-only and only needs to be globally consistent later, after a dispute.
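A minimal sketch of the kind of thing I mean, assuming drivers could buffer evidence locally and replay it once the region is back (all names and fields here are hypothetical, not Amazon's actual schema):

    import json, time, uuid
    from pathlib import Path

    EVIDENCE_LOG = Path("delivery_evidence.jsonl")  # local, append-only

    def record_delivery(package_id: str, photo_path: str, lat: float, lon: float) -> None:
        """Append one delivery record locally; nothing here needs the cloud."""
        event = {
            "event_id": str(uuid.uuid4()),
            "package_id": package_id,
            "photo": photo_path,
            "location": [lat, lon],
            "recorded_at": time.time(),
        }
        with EVIDENCE_LOG.open("a") as f:
            f.write(json.dumps(event) + "\n")

    def sync_when_available(upload) -> None:
        """Replay the log against the backend once it's reachable again.
        Records carry their own event_id, so retries can be made idempotent."""
        for line in EVIDENCE_LOG.read_text().splitlines():
            upload(json.loads(line))

The hard part is everything around it (fraud, disputes, routing), not the data structure itself.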
I totally understand why Amazon halted everything. Sure, one could deliver shipments off-line and sort them out afterwards manually. And at a lower scale Amazon might have tried it (at least it would have been on the table when I worked there a couple of years ago).
But what then? You'd have a complete loss of traceability of shipments and operations, and once you regain it, a chunk of shipments is no longer where it's supposed to be. Now you have not one potential root cause (the outage), which could be resolved by retriggering those shipments (not a loss, as you didn't deliver anything to customers), but two: the outage and a pile of offline deliveries. If it were just one FC, sure, that would be doable. If the whole network in an entire region goes down, there's no way to handle that. It is much easier and safer to just stop operations until the outage is resolved, re-route orders to other regions in the meantime, and then work through the backlog. Amazon's ops are good at that, specifically because they have almost complete transparency on their material flows. Going offline would have jeopardized that transparency, making a quick recovery after the outage all the harder.
Can't speak for Amazon. Generally, though, I don't see how one would insure against this. I know that e.g. Allianz offers policies against IT outages, but then, what is the actual damage? Probably the delivery drivers paid without delivering, plus salaries and potential overtime to work through the backlog. Depending on the conditions a company the size of Amazon would get, maybe it's not worth it.
Databases, the stateful part of services, have matured to the point of having multi-region support. This isn't new either. Nobody's saying that it's easy to have multi-region redundancy for stateful services. It's just something you need to have to prevent nasty single-region outages from affecting your service. This is an excellent example where it would have been better to have degraded performance (higher latency) instead of complete unavailability and an interruption in business.
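A sketch of what that degraded mode could look like for reads. The table name and regions are mine, and it assumes the data is already replicated across regions (e.g. something like a DynamoDB global table):

    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    # Assumes "orders" is replicated to both regions (e.g. a DynamoDB global table).
    REGIONS = ["us-east-1", "us-west-2"]  # primary first, replica second

    def get_order(order_id: str) -> dict:
        last_err = None
        for region in REGIONS:
            table = boto3.resource("dynamodb", region_name=region).Table("orders")
            try:
                # Cross-region reads are slower, but slower beats unavailable.
                return table.get_item(Key={"order_id": order_id}).get("Item", {})
            except (ClientError, EndpointConnectionError) as err:
                last_err = err  # this region is unhealthy; try the next one
        raise last_err

Writes are where it gets genuinely hard, which is where the "degraded but up" argument usually gets made.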
I don't think the parent's "our company" is Amazon, because Amazon does indeed make more than $50m, but I can attest to cloud provider multi region replication being extremely expensive for us (also not Amazon), if only due to data transfer costs
Can you elaborate? E.g. postgres replication is pretty straightforward and not a new technology. I'm outside AWS ecosystem and with just dedicated boxes having some DC burn down is manageable. How do magic clouds make that hard?
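To illustrate what I mean by straightforward: with stock streaming replication (wal_level=replica on the primary, a standby created with pg_basebackup -R), checking that the replica is keeping up is just a query against pg_stat_replication. Sketch with psycopg2; host and credentials are placeholders:

    import psycopg2

    # Connect to the primary and see how far behind each standby is.
    conn = psycopg2.connect(host="primary.example.com", dbname="postgres",
                            user="monitor", password="...")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT application_name, state, sync_state, replay_lag
            FROM pg_stat_replication
        """)
        for name, state, sync_state, lag in cur.fetchall():
            print(f"{name}: {state}/{sync_state}, replay lag {lag}")
    conn.close()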
"postgres replication" would probably be the least of their worries. It's not about "magic clouds". These are services that are handling millions of requests per second and there's a lot going on where they have to maintain consistency and fail predictably. Having some services go down in one region but being back up in another still serving requests and committing transactions is unpredictable and could create a lot of inconsistencies that would be very difficult to resolve later especially when you have customer facing services like this where someone's package could be lost resulting in bad reviews and other things you don't want to deal with. People here mostly have never even imagined the level of workloads they're handling and are throwing around "easy" solutions like replication or multi-region availability. For something of this scale, it's just not that simple. It would also be incredibly expensive to do this when you could simply shut operations down for a brief period of time. Not like something of this scale has happened that often
OK, but this is a programmer-to-manager explanation. I get how computers work. I know simple things can get very complex at scale. I just thought that the huge extra you pay for cloud services is mostly for battle-tested solutions to these scale problems.
As far as I understand it, if your design is sane, the database/storage handles fallback and recovery.
Or maybe in other words - you need to make your service handle a single machine going down without any problem, cloud or not. And there seem to be two options: it's your machine, or it's part of a service which AWS provides to you. In the second case it's on AWS to handle that, and in the first case shouldn't AWS make it so that, for you, the DC is just a parameter and they handle all the virtual network and other magic?
To be super clear - I'm not arguing, just trying to learn. I would love some specific examples of what makes the problem hard, because all these stories make me stay away from the cloud, which in theory is a solution well worth paying extra for in a bigger-company context.
State ("the database") is hard in distributed systems. Was the package picked up or is it still sitting on a shelf waiting? If your distributed system is partitioned, different queries may give different answers and your warehouse workers are going to be running around looking for boxes that aren't there.
Even if you create a system that is eventually consistent when availability is restored (a difficult problem all by itself, and probably needs a lot of application layer logic), it may not be worth the trouble. Warehouse workers interact with the "perfect" state of the real world, and if the computers don't have access to that, they aren't very useful.
The proper fallback is costly both hardware-wise and in development effort. It can be cheaper to just skip it and tolerate occasional service unavailability.
A correctly designed infra and app will have zero issues with a hard failover.
It has been my experience, however, that as more people use the cloud, all that "ease of use" both adds layers of complexity and further abstracts the backend away.
Thus, by outsourcing sysadmin tasks to AWS, no in-house expertise exists. People don't know how to handle correct failover unless the platform 100% does it all for them.
Whether it's manageable depends entirely on the scenario and needs. Maybe you can afford 5 minutes of data loss, but can Amazon? Also, it's quite possible that they have an enormous volume of data, which complicates everything.
And there's probably more than just a database: maybe a message queue for asynchronous processing, object storage for photos, etc.
It's certainly not an insurmountable problem, but maybe they consider the failure rate so low (it is, us-east-1 going down is like a once in a few years event) that the complexity of multi-region isn't worth it.
It's official[0]: Amazon is sending out emails to Flex drivers informing them Flex is down and they'll be paid for scheduled blocks without making deliveries.
Apparently even some warehouses were ghost towns today without anyone sorting packages[1]
Here in Phoenix, I've had two packages supposed to be delivered today now delayed. Also, delivery on items just went from overnight delivery to 3+ day delivery as the quickest option.
My inner cynic says Amazon paid Flex drivers because A) they still need drivers available during the Christmas run-up and B) screwing drivers would put Amazon in an Ebenezer Scrooge-like PR disaster.
Amazon in a PR disaster involving corporate greed and avarice? Oh no. Anyway. No really, I would not be surprised if Amazon has several playbooks upon playbooks involving commercials of how Amazon helped some warehouse worker get through college or some medical bill as their means of a public mea culpa.
Whenever us-east-1 goes down you just get a really good feel for how many other companies also have pretty fragile setups. The apps I work on can deal with a few hours of downtime, so as long as I'm sure I can recover from getting totally leveled, it's OK. And I think that's the way it is for the majority of companies. Most don't want the extra effort and cost of failover.
I'm almost tempted to think that having an explicit policy of forcibly shutting down each region once per month for a few hours (at times that are not publicly announced in advance) would be a worthwhile value-add.
A service that is unable to handle such a failure does not qualify as being ready for deployment. And I'm not just saying that to be a self-righteous pedant, I'm saying it because this kind of failure is statistically likely to happen at some point, so ignoring it is setting yourself up for major (and potentially very expensive) problems if you don't test for it regularly.
Smallish non-tech companies typically have single points of failure that are much more severe than a server/service being down for a couple of hours. Accountant on holiday? Guess we're not writing any invoices this week. Two of our drivers are ill? Guess we'll have to let all of this stuff pile up here for a bit. By comparison, a public-facing service or intranet being down for a few hours hardly merits attention.
Shutting down availability zones I could almost see, since it's not that much more difficult or expensive to architect to handle that gracefully.
But an entire region? I've never worked anywhere that decided being multi-region was a good tradeoff. At best we've replicated data to another region and had some of our management services there, so in the absolute worst case (which would need to be much worse than yesterday) we could rebuild our product's infrastructure there.
Do I agree with this approach? Not in all possible cases of course, but for my employers? Overall, yes. It mitigates the highest-impact risks. Going further would have significant complexity and costs. Those companies' success or failure hasn't been impacted by their multi-region strategy, AFAICT.
Most services and applications can occasionally have unplanned downtime.
Many also have planned downtimes, sometimes measured in half a day or more.
Services that truly need 100% uptime (and by that I don't mean what management say they WANT, but they are prepared to PAY) are a tiny minority, to the point that I imagine most software developers never work on one.
Even setups that I have seen to handle such cases usually had some single point of failure somewhere.
And let's face it: even if you do multi-region on AWS, that won't protect you next time someone screws up BGP/switches/DNS or whatever and every region goes down for a while. In that case you'd better have failover to some other cloud vendor. Even planes and other safety-critical equipment are not 100% failure proof.
At the project level, maybe. At the AWS level, that's ridiculous. A lot of services can live with one major outage a year, but not once a month. Completely failing when their region goes down is reasonable.
Region failures can be non-recoverable (e.g. natural disaster).
One major outage per year doesn't necessarily mean just a few hours of downtime, it could mean having to redeploy your entire service somewhere else, which could take several days or more if you haven't prepared for it. How many of those services can live with that?
Apparently, many can! Look at how many companies choose not to pay ransom when hit with a ransomware attack, or prefer to negotiate for days instead of buckling straight away, even if it means operations are crippled for weeks. They don't typically go bankrupt afterwards, everyone coped, life moves on.
My understanding is there's a pretty strong culture of running amazon.com directly in AWS as much as possible. I don't think there's much reason to believe a sectioned off alternative would be much more reliable, it would just break at different times maybe.
Possibly stems from one of the tenets of the apocryphal Bezos decree that started AWS[1]
> All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
That attitude, and the ability to enforce it at the scale Amazon is at, are probably the biggest strengths of Bezos. It does cover all aspects though, including work culture...
I don't even understand what's going on with this one. Transcribing the blurb from the clip, from interviewing the residents of the house:
"[The people who live in the house] have a friend who has a contract with an Amazon warehouse in China. They say whenever that friend's contract expires, she will send the packages to their house for the family to sort and then send back to Amazon for the company to sell."
From what I understand, it's a play around the Amazon Inventory Storage fees ([1], [2]), which are normally $0.75 per cubic foot monthly but grow to $6.90 per cubic foot if items are stored for more than a year without selling.
It's possible to ask Amazon to remove selected items from their warehouse and send to any US address for a somewhat reasonable fee of around $0.32/item ([3]).
What happens here is the Chinese sellers send packages to the residents of the house, and then relist items under a different account, so that the counter is reset and the Amazon storage fees are low again.
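Back-of-the-envelope with the fees above (the inventory volume and item count are made up, just to show why the removal route wins):

    # Hypothetical slow-moving inventory: 500 items taking up 50 cubic feet.
    items, cubic_feet = 500, 50

    regular_storage   = 0.75 * cubic_feet   # $/month while under a year old
    long_term_storage = 6.90 * cubic_feet   # $/month once it ages past a year
    removal_order     = 0.32 * items        # one-time fee to ship it all out

    print(f"regular storage:   ${regular_storage:.2f}/mo")    # $37.50/mo
    print(f"long-term storage: ${long_term_storage:.2f}/mo")  # $345.00/mo
    print(f"removal + relist:  ${removal_order:.2f} once")    # $160.00 one-time
    # One removal order costs less than a single month of long-term fees.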
That said, the residents of the house run a subpar operation.
That would pretty much explain it. Thanks for taking the time to link it together. Suppose this is a situation of "Don't hate the player, hate the game."
Right? I seriously cannot come up with any scenario where this might make sense, not even anything remotely plausible. And they're "storing" the packages, all sorts of different shapes/sizes, out in the elements?! What the heck? My curiosity is running wild & unsatisfied. =D
For some replacement computer parts and other such items shipped via FedEx to a certain remote location, all packages came with a coating of dust. Like, wash your hands after receiving the hard drive box, and then wash again after cutting the box open, to keep the anti-static bag clean. I guess if a shipping company can get away with it, they will? Roofs and floors cost money.
As bad as this article makes AWS sound, it's actually the reason you should go with AWS over say Azure or GCP; when AWS goes down, its owners actually feel the pain with you, Microsoft and Google run their own stuff elsewhere...
This is simply not true. Microsoft dog foods tons of its own software and all internal cloud services I know of run on azure. Also, having consumed both cloud service offerings, customer support within Azure was far more responsive.
Source: used to work as a SWE on a flagship Azure service
> Microsoft dog foods tons of its own software and all internal cloud services I know of run on azure
so it's even more embarrassing how terrible the software is.
trolling aside - if this is true, the state of e.g. MS Teams is a travesty. Replies to messages only implemented in 2021!? So many bugs, etc. It seriously damages productivity. And don't get me started on SharePoint.
Since it seems I need to back up my statement: https://news.ycombinator.com/item?id=29492884 is one example... Is that teams' fault? No, but they'd be able to call 911 if it wasn't installed.
I have a running list of teams bugs/flaws/inferiorities, it's currently about 30 items, will probably publish it sometime soon
Google does not run the bulk of their services on GCP. Unlike AWS, GCP was not a productization of their existing infrastructure, but rather a separate cloud product developed fairly independently.
I'm sure that's changing with time.
YouTube also has its own infrastructure independent of GCP and the rest of Google
GCP is not separate from Google's core infrastructure; rather, it's built on top of it. That means that while you can certainly have GCP-specific outages, this kind of core infra "everything is down" situation is almost guaranteed to hit everything, GCP and non-GCP alike. A lot of GCP sub-products are productionizations of existing Google tech; e.g. BigQuery is a public version of Dremel, an internal database/query engine they'd been using internally for a while.
I'm pretty sure YouTube hasn't had their own infra in quite a while. When I was there ~8 years ago I think it was all integrated already. Certainly database, video processing, storage, CDN were all on core Google infra, and I'm sure the frontends were too though I don't remember looking into that explicitly.
There aren't 100k Googlers developing on top of GCP services to get their job done on a daily basis. That's the big difference between the two clouds' level of dogfooding.
> There aren't 100k Googlers developing on top of GCP services to get their job done on a daily basis.
Doing something on top of GCP rather than the normal way at Google was a huge pain. Borg tutorials and documentation were just far superior: I could get a thing running on Borg in an hour from not knowing anything about Borg, while I spent a week trying to get something running internally on GCP and still couldn't get it right (our team wanted to see if we could run things on GCP, so I was tasked with testing it; I couldn't find anyone who knew how to do it, so we just gave up after I didn't make any real progress). That was the worst-documented thing I've ever worked with. And even worse, the internal GCP pages were probably running in California and probably weren't tested from Europe, so the page took like 2 seconds between a mouse click and any response.
That was years ago though, and I no longer work there, but at least back then the work to make using GCP internally seamless hadn't been done. Maybe it is simpler if you run everything in it and don't need it to play well with Borg, but there is a reason why it isn't popular internally. And you likely won't find many engineers who left Google who recommend you use it, since they probably didn't test it, and if they did, it probably was a bad experience (unless they worked on GCP).
Source? I'd be surprised if Google does not dogfood. Though I guess, given that GCP had their recent global load balancer outage and neither Gmail nor Google Search went down, maybe not.
I'm a bit out of the loop, but the vibe for years was "It isn't broke, so we're not going to rewrite it to run on something else."
Actually, it's a bit more than that. Some half-decade ago, Google Cloud got stung by a coordinated attack over the holidays where attackers used stolen credit cards to build a net of GCE instances and attack Tor via endpoint control. Cited by SRE in the postmortem was the relative immaturity of the cloud monitoring, logging, and "break glass" tools that SREs were accustomed to in Borg... Essentially, Cloud didn't have the maturity of framework that Borg did, and they felt the extra layer of abstraction complicated understanding and stopping the abuse of the service.
This report had a chilling effect internally. Whereas management had previously been encouraging people to migrate to Cloud as quickly as possible, after this incident software engineering teams and the SREs that supported them were able to push back with "Can we trust it to be as maintainable as what we already have?" and put cloud on the defensive to prove that hypothesis.
That’s not the impression I got from the last 24 hours, nor the last few years.
1. Scale is hard and downtime is hard, HNers either recognize the struggle or appreciate their lack of experience. When AWS fails many armchair architects come out to suggest solutions but many more techies just sympathize with the Amazonians.
2. Technically, Amazon has built something impressive. It might not be what you want, or what others have, but AWS is impressive in scale and scope and even reliability. Many people share credit where due.
3. One can criticize the treatment of warehouse and delivery workers that Amazon is known for, but this has little bearing on the tech workers there, nor on AWS generally. So AWS stories tend to be free of the social critique the company as a whole receives.
Here in Chicago every item on Amazon is showing an earliest delivery date of Saturday the 11th. Checked an item today I had bought yesterday and delivered overnight, the item is now showing available delivery as Saturday. What a terrible time of the year for this to happen to them.
I've seen mention of a major cable TV operator/last-mile residential ISP that had their entire field-tech dispatch system go down (based on a web GUI HTTPS app that people can use on their phones) because it's hosted in AWS.
My friend lives in very cold Alaska. She has an app on her phone that lets her remotely warm her car, which is a requirement before driving because it's so cold. The AWS outage meant her phone couldn't talk to her car. She was 20 mins late to work because of that.
This is the future. AWS has an outage and your car won't work.
Yeah... go out and start it, for whatever value of "start" you need. I frequently walk out to our Volt and start it to let it warm up the cabin on shore power, though I can also do it through a window with the key fob. Or I'll even go start a motorcycle to let it run for five minutes before I head out in the winter.
I'm really not sure what I'm going to do with cars in the future. The Volt is the last wave of not-always-connected-software-OTA-big-data-analysis cars, and even it has OnStar; it just tries to talk to towers that no longer exist. But there's a difference between that sort of system and what newer cars have. I suppose you can always disable the cell modem and let them have occasional wifi access to update, but... ugh.
Couldn’t remote start the car with the fob? Viper/Directed has a paid cloud/app service for convenience, but the primary/default method to activate the remote start is RF.
When I was consulting in Nome AK - it was -40F outside and there were locals standing outside behind the hospitals smoking cigs...
My brother lives in Anchorage and he went hiking with his wife about a week ago, and he said it was -20F. (My brother was a colonel in the USAF, so not exactly a weak person.)
But the point is, she should have been able to just run out to the car like other Alaskans do, and not be late to work.
Yeah at those temperatures it really depends on how much wind there is and how long you will be outside. -20F did freeze all my car doors completely closed to the point where they could not be wrestled open once though.
It's unclear to me that the app being discussed started the car?
In sufficiently cold conditions, you need to warm the engine up before starting the car, with an electric heating element. If this was controlling one of those heating elements, the problem was presumably that the car wasn't able to start at all, not that the car couldn't be started remotely.
It's probably an electric that has to warm up the battery before going anywhere in extreme cold, though 20 minutes seems short.
When I use a block heater, I let it run for a couple hours before needing the vehicle in the cold. I've got an outdoor outlet timer that I can use to start my truck's block heater around 2-3AM if I need it for something in the morning. It'll start at 0F without it, but it's exceedingly clear that it's not happy about the arrangement, so I preheat.
Whatever it is, I'm entirely unsurprised that some app or another, talking to some cloud service or another, talking to some car or another, fails silently when "impossible" things have happened. Nobody seems to consider that the cloud can fail. Even though it does, quite regularly, and reliably breaks all sorts of stuff every single time it does.
There is a certain personality type common among midwits who are smart enough to think of the most obvious workaround/solution/exception but not smart enough to realize that when people communicate, especially online, they usually choose brevity over exactness.
Like if I say "Humans have two feet" some midwit will come along with an article or anecdote about a person who was born without two feet.
And if I say the multi-hour outage of AWS made my friend minutes late because of her car warmer, what is going through the mind of a person who offers the solution "Did she try walking to the car and turning it on manually?" These are the kinds of people that if you met them in real life you'd quickly distance yourself from them.
>Like if I say "Humans have two feet" some midwit will come along with an article or anecdote about a person who was born without two feet.
It's autism.
One of the symptoms of autism is the inability to recognize sarcasm (http://www.healthcentral.com/autism/c/1443/162610/autism-sar...) without the help of idiotic, illiterate signals like "/s". Your example has the same cause: people who are unable to understand nuance and social cues, whether in real life or in written form.
Region outages happen, which is why the guidance is to build in multiple regions, but even Amazon sometimes doesn't take their own advice. Sometimes the reason is good, sometimes it isn't.
Our EC2 instances in us-east-1 didn't actually go down. But IT engineering was completely disrupted because our SSH login mechanism relies on the API to show you the list of instances for you to select which one to start an IAM handshake with. Our support phone line was also down. Even SQS kept chugging along just fine. I'm actually glad we're not on Lambda because of this.
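Roughly the shape of the call that kind of tooling depends on (a sketch, not our actual code). Listing instances is a control-plane operation, so it needs the regional EC2 API to be healthy even when every instance is running fine:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for reservation in resp["Reservations"]:
        for inst in reservation["Instances"]:
            name = next((t["Value"] for t in inst.get("Tags", []) if t["Key"] == "Name"), "-")
            print(inst["InstanceId"], name, inst.get("PrivateIpAddress", "-"))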
AZs are physically located near each other, usually within a small enough radius that they could all be impacted by the same natural disaster. In some cloud providers and regions, AZs are simply different parts of the same building (IIRC one of the Japan regions of Azure was essentially this, but don't quote me). And evidently, they share some infrastructure.
At a previous job where we needed to always be up, our disaster recovery plan assumed that the us-east-1 site had been hit by a meteor (not literally, but that's how we explained it to each other to put ourselves in the mindset.)
AZs are physical boundaries, but the networking and software is interconnected. Regions are (mostly) isolated, though global services like IAM and CloudFront often have their main control plane in us-east-1
Everyone wants to be multi-AZ, multi-cell, but it's a multi-year project, especially for services that have been around for a while. My last team had been working on it for a couple of years when I left.
>An Availability Zone (AZ) is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region... AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.
Clouds aren't magic. They require a certain amount of operational confidence in order to understand that, yes, an entire region can fall out from under you at any time and it's your responsibility to detect and deploy into an unaffected region if possible.
edit: Generally, one entire region will not fail. However, core services like STS rely on us-east-1 so it's particularly susceptible to disruption.
People get slagged on for not having inter-region redundancy. But unless your business model can accommodate that, you're introducing another failure domain and a lot of money for duplicative infrastructure, network fees, etc.
For many use cases, it's acceptable to shrug and blame AWS for a failure. It's harder when your high-availability solution fails on its own, which such solutions almost always do more often than us-east-1 does.
Exactly. Often people want the perception of reliability.
For many services, it makes more sense to make it reliable than not. For other services, it makes more sense to think about the engineering of the solution in the field.
Example: McDonald’s product images on kiosks are apparently in S3 and not cached locally. Seems like a dumb idea to me, but I wouldn’t try to build a more reliable cloud storage backend to control that risk.
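For that specific risk, a local-cache fallback is about all you'd need, which is what makes it feel dumb (a sketch; the bucket and key names are made up):

    import boto3
    from pathlib import Path
    from botocore.exceptions import BotoCoreError, ClientError

    CACHE_DIR = Path("/var/cache/kiosk-images")

    def get_image(bucket: str, key: str) -> Path:
        """Try S3 for the freshest asset, but fall back to the last good copy."""
        local = CACHE_DIR / key.replace("/", "_")
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        try:
            boto3.client("s3").download_file(bucket, key, str(local))
        except (BotoCoreError, ClientError):
            if not local.exists():
                raise  # never seen this asset and S3 is down: nothing we can do
        return local

    # e.g. get_image("menu-assets", "us/burgers/cheeseburger.png")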
> do they not teach people what a failure domain is anymore?
Empirically? No.
But seriously, instead of making every dev team consuming an AWS product hire the extra engineers required to build a system that spans multiple failure domains — which, let's be honest, companies won't — why hasn't Amazon just hired the engineers required to do it for me?
There are also way too many successful, public cloud-native businesses running without any semblance of a Business Continuity or DR Plan among any of their teams. I wish I could name and shame some of the more egregious cases I have seen.
> However, core services like STS rely on us-east-1 so it's particularly susceptible to disruption.
FYI this example hasn’t been true for a while. STS regional endpoints are generally what you should be using these days. The “global” us-east-1 endpoint still works, and may be the default for some clients, but isn’t a requirement.
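If you want to be explicit about it, both of these work in my experience (a sketch; the region choice is arbitrary, and newer SDK versions may already default to regional):

    import os
    import boto3

    # Option 1: tell the SDK to prefer regional STS endpoints.
    os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"  # or set it in ~/.aws/config

    # Option 2: pin a client to a regional endpoint explicitly.
    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        endpoint_url="https://sts.us-west-2.amazonaws.com",
    )
    print(sts.get_caller_identity()["Arn"])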
A lot of AWS is very old and it's very hard to go back and make everything nice and reliable without breaking things when you have a shitload of customers 24/7 using your services.
So yes, in theory much of AWS's services are probably very reliable and distributed across AZs and regions, but in practice there's likely a whole bunch of debt where one thing gets fucked up and it cascades.
As an AWS employee, you’re under too much artificial deadline pressure to address technical debt like this… let alone engineer a service properly in the first place.
As a corporate employee, you’re under too much artificial deadline pressure to address technical debt like this… let alone engineer a service properly in the first place.
Multiple devices in very different parts of the world can respond to the same IP address. A great example are DNS servers like 1.1.1.1 and 8.8.8.8- you get routed to the nearest site that can handle your request. A common example of how this is implemented is via anycast: https://www.cloudflare.com/learning/cdn/glossary/anycast-net...
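You can even see which anycast site answered you: Cloudflare's resolvers respond to the conventional id.server CHAOS query with a code for the PoP that served the request (a sketch using the dnspython package; as far as I know 1.1.1.1 answers this, while 8.8.8.8 may refuse CHAOS queries):

    # requires: pip install dnspython
    import dns.message
    import dns.query
    import dns.rdataclass

    # Ask the anycast address 1.1.1.1 which physical site actually answered.
    q = dns.message.make_query("id.server", "TXT", rdclass=dns.rdataclass.CH)
    resp = dns.query.udp(q, "1.1.1.1", timeout=2)
    for rrset in resp.answer:
        print(rrset.to_text())  # e.g. a short code identifying the serving PoP
    # Run the same script from another continent and you'll likely get a different site.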
Does Amazon follow their own multi-zone best practices?
To be fair, we only run services in us-east-1, and we had zero downtime from this issue. The only issues we encountered were API calls that were failing for us, and those were for CloudWatch and CodeDeploy. We are heavily reliant on EC2, OpenSearch, and S3.
Something I'm not getting here. OK, one zone went down in AWS. But there are supposed to be multiple ones? And if anybody knows how to use AWS so that one zone going down wouldn't knock over the whole thing, it should be Amazon. I thought the whole point of this multi-zone cloud setup was that even if one zone goes down, everything could survive and perform as usual - maybe a little slower for a bit while backups kick in and such, but after an initial short period everything should be fine? Here we're talking about the whole system going down hard. Was the problem much wider than reported - i.e. affecting all (US) zones at least - or is it that Amazon doesn't know or doesn't care to use its own infrastructure to build a robust system?
This affected one region, us-east-1, which is the oldest and largest AWS region and also hosts many core AWS services. Each region has several availability zones (AZs) that are basically whole datacenters. In theory AZs should be mostly isolated, but as we saw, bugs happen. Reading between the lines of the status updates, it sounds like this either affected some core infra shared between AZs, or was from a change that rolled out cleanly earlier (maybe last night) and failed later (e.g. due to increased load in the morning).
At the end of the day, building redundant systems is expensive (and may introduce whole new bugs), so they probably did the math and figured the risk of a whole region outage was less than the cost of building redundancy in some systems that are hard to make redundant.
The whole concept that there are "core AWS services" living in one single region sounds like the antithesis of everything AWS should be about. What's the use of having all this nice distributed setup - and paying for it! - if a single failure in a single region takes everything down anyway? I mean sure, maybe they built it on a shoestring budget years ago - but since then they've had years and billions, and still didn't bother to fix it?
There are a lot of pitfalls. You have to be diligent about stuff like making sure you're using region-specific endpoints like sts.us-west-2.amazonaws.com instead of sts.amazonaws.com. And some services, like Route 53 and CloudFront, have built-in dependencies on us-east-1. There are also internal dependencies that are sometimes not obvious until an outage occurs.
Though, as you say, none of that's a great excuse for Amazon itself. I know Alexa devices, Ring Devices, Prime Video, imdb.com, etc, also all had issues in the first hours of this outage.
[9:37 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates.
Because it's terrible and has always been terrible?
After 2017 I'll never use US-east-1 again.
Hell... I should have learned that particular lesson in 2011 but it took two catastrophic failures for me to figure it out.
There are numerous threads here on HN covering the topic "why does US-east-1 suck so hard."
I wonder if there’s too much moral hazard for them to run a data center just for Amazon but run exactly the same as the other data centers? Or would Amazon scope creep it into its own snowflake.
I kind of doubt that they’re getting efficient use of hardware Amazon could be using but isn’t, since if someone allocates it it’s not available for Amazon anymore.
You'd think if it's anything mission critical, Amazon would be following its very own Well-Architected Framework. The Reliability pillar would speak to this.
I don't think the Well-Architected Framework specifies that you should use multi-region availability, but it's certainly a mentioned option for mission-critical applications. Usually the go-to doctrine for high availability in AWS documentation is multi-AZ, not multi-region.
If this outage isn’t the catalyst to get the Amazon side of the house to finally move out of US-east, I don’t know what will be. Or at least be multi-region.
Although the cost to make all of Amazon commerce, logistics, and digital truly multi-region is probably an order of magnitude more than the impact of this outage.
True. But let's be honest... this is neither the first such outage in US-east-1 nor the last. So I'd argue it's long past time for Amazon to pay the bill and go multi-region.
I wonder if their system would allow them to move availability zones between regions. They could create new stealth AZs, move their stuff into them, then use those to start building a new region only they use.
But I suspect there are third-party integrations that benefit from being in the same AZ as Amazon APIs. I've been having little convos all day about how I think the inter-region pricing creates a perverse incentive that's exacerbating the us-east-1 situation.
Last big e-commerce outage I can think of was Prime day years ago when Sable browned out. But that was non-AWS infrastructure. There was also the infamous s3 fat-finger; but I can’t think of a holiday shopping and delivery day with a massive outage.
It’s kind of ironic to go through AWS well architected framework and add all the complexity associated with multi-region setups when Amazon themselves couldn’t get it right.
To be clear, I am not advocating for ignoring high availability setups. Just highlighting the complexity cost of it.
I placed a few orders this morning. They appeared to go through, but didn't show up in my order list until this evening. Good thing I made the connection to the outage before I decided to try re-ordering! Sounds like others weren't so lucky.
Shit happens. My last employer is now mostly on AWS, and mostly in that region, so things broke all over the place yesterday (oddly enough the only service that is multi region was auth). In the past when we were mostly in our own datacenters we suffered as well, so that's not better either. One time a bad DNS update killed every single service worldwide and it took most of a day to properly restart thousands of services. Build complex systems with thousands or millions of things and it's bound to blow up sometime.
Yeah my packages didn't get delivered today. I assume there will be cascading delays for the rest of the week. Getting close to xmas travel times, trying to finish up my shopping this week.
If there's an SLA or reputation to uphold, probably, and otherwise they probably won't. So big AWS customers with SLAs will probably get compensation. And Amazon dot com customers that complain might get some compensation.
us-east-1 has an uptime of 99.9%; that's low enough to get most sysadmins fired, but being able to point to headline news of how the same outage affects Amazon placates management.
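For scale, 99.9% works out like this (quick arithmetic, not an official SLA number):

    hours_per_year = 365 * 24            # 8760
    for nines in (0.999, 0.9999, 0.99999):
        downtime_h = hours_per_year * (1 - nines)
        print(f"{nines:.5f} uptime -> {downtime_h:.2f} hours of downtime/year")
    # 0.99900 uptime -> 8.76 hours of downtime/year
    # 0.99990 uptime -> 0.88 hours of downtime/year
    # 0.99999 uptime -> 0.09 hours of downtime/year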
The key benefit of the cloud is blameshifting. It’s someone else’s problem, you just get the day off.
This really tarnishes any trust you might have had in Amazon as a professionally run business.
(of course outages can happen and are to be expected especially at Amazon's scale, it's just the bad communication and the amateurish non-redundant setup of their own core services that is shocking)
Weird. I ordered a package about 3-4 weeks ago and it didn't show up, so I finally went to track it on Amazon last week and got a message saying that they had lost the package and to request a refund. It kind of sucks because that was my mom's Christmas gift; now I have to figure out a plan B.
I have this happen constantly through Amazon. Now all the hassle is on you to get the refund. And they won't process the refund immediately, so you have to pay out of your own money to get a replacement. And then often, as I've found, the price of the item has now gone up too, so you have to pay extra.
p.s. Your username?! I can think this must be the only site you've managed to get that handle?
It is the only site. Someone once had an even lower number, and someone made mention of it, and the person replied saying it was easy and he had just gotten it, so I decided to try and see what was available. I had been a long-term lurker and never posted, but once I got this username I try to respectfully post when I can.
[0] https://www.reddit.com/r/AmazonFlexDrivers/comments/rb3ggn/i...