The N. Virginia datacenter has been historically unreliable. I moved my personal projects to the West Coast (Oregon and N. California) and I have seen no significant issues in the past year.
N. Virginia is both cheaper and closer to the center of mass of the developed world. I'm surprised Amazon hasn't managed to make it more reliable.
One thing we discovered this morning: it appears the AWS console itself is hosted in N. Virginia.
This means that if you were trying to make changes to your EC2 instances in the West using the GUI, you couldn't, even though the instances themselves were unaffected.
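The EC2 API endpoints are regional, though, so even with the console down you can usually still talk to your region directly. Here's a minimal sketch with boto; it assumes credentials are configured the usual way, and the instance ID is just a placeholder.

```python
# Talk to the regional EC2 endpoint directly, bypassing the console web app.
import boto.ec2

# Connect to the region itself, not the (us-east-1-hosted) console
conn = boto.ec2.connect_to_region("us-west-2")

# List instances and their states in the region
for reservation in conn.get_all_instances():
    for instance in reservation.instances:
        print(instance.id, instance.state)

# Reboot a specific instance (placeholder ID)
conn.reboot_instances(instance_ids=["i-12345678"])
```

Anything that hits the regional endpoint directly (API tools, your own scripts) sidesteps the console entirely.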
Shouldn't Amazon themselves have architected their own app to be able to move around?
I get tired of the snipes from people saying "well, you're doing it wrong," as if this is trivial stuff. But if Amazon themselves aren't even making their AWS console redundant between locations, how easy or straightforward is it for anyone else?
To what extent is this just "the cobbler's kids have no shoes?"
You're close. Put another way, "inherent complexity is the problem."
What I mean by that is: the more coupled your system is, the more brittle it is.
Frankly, this is AWS's issue. It is too coupled: RDS relies on EBS, the console relies on both, etc. Any connection between two systems is a point of failure and must be architected so those systems can operate without that connection. This is why SMTP works the way it does. Real-time service delivery isn't the problem, but counting on it is.
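Here's a rough sketch of what I mean, in the SMTP spirit: accept the work locally and retry delivery later, so the producer never depends on the downstream link being up right now. `send_remote` is just a placeholder for the real downstream call, not any real API.

```python
# Store-and-forward decoupling: local durable queue, deferred delivery.
import json
import time
from pathlib import Path

SPOOL = Path("spool")          # local durable queue, like an MTA's mail spool
SPOOL.mkdir(exist_ok=True)

def enqueue(message: dict) -> None:
    """Accept work immediately; durability is local, delivery is deferred."""
    path = SPOOL / f"{time.time_ns()}.json"
    path.write_text(json.dumps(message))

def send_remote(message: dict) -> None:
    """Placeholder for the real call to the downstream system."""
    raise ConnectionError("downstream unavailable")  # simulate an outage

def flush(retries: int = 3, backoff: float = 0.5) -> None:
    """Attempt delivery; failures simply leave the message spooled."""
    for path in sorted(SPOOL.glob("*.json")):
        message = json.loads(path.read_text())
        for attempt in range(retries):
            try:
                send_remote(message)
                path.unlink()       # delivered, drop it from the spool
                break
            except ConnectionError:
                time.sleep(backoff * (attempt + 1))
        # still spooled after all retries: try again on the next flush pass

if __name__ == "__main__":
    enqueue({"action": "resize", "instance": "i-12345"})
    flush()   # downstream is "down", but the producer was never blocked
    print(f"{len(list(SPOOL.glob('*.json')))} message(s) waiting for retry")
```

The point isn't the code, it's the shape: delivery can fail all day and the only symptom is latency, not an outage.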
Depends. Generic interfaces and non-reliance have costs too. In general I agree that things should be decoupled, but it's not always easy or practical.
Surely true, but that's the purpose of a system in the first place: to manage complexity and make it predictable. You could argue that we already have such a system, given how well the Internet works overall. The fact that this system has problems doesn't change what I see as clear evidence that such a system can, in fact, work even better.
We're not talking about an order-of-magnitude leap in complexity here, just simple management of common human behavioral tendencies to promote more reliability. "The problem is inherently complex" is always true and always will be, but it's no excuse for not designing a system to gracefully handle that complexity.
The internet works because it provides very weak consistency guarantees compared to what businesses might require out of an EC2 management console. (IMO.)
Their CoLo space is the same space shared by AOL and a few other big-name tech companies. It's right next to the Greenway, just before you reach IAD going northeast. That CoLo facility seems pretty unreliable in the scheme of things; Verizon and Amazon both took major downtime this summer when a pretty hefty storm rolled through VA[1], but AOL's dedicated datacenters within the same 10-mile radius experienced no downtime whatsoever.
To be fair, the entire region was decimated by that storm. I didn't have power for 5 days. Much of the area was out. There was a ton of physical damage. That's not excusing them; they should do better. But that storm was like nothing I've experienced in 20 years of living in the area.
Yeah, it's got to be much larger than the other regions, so it makes sense that we see more errors, since error_rate = machines * error_rate_per_machine.
No, I calculated the error rate for the region. If us-east-1 has 5 times the machines (or availability zones, or routers, or EBS backplanes, or other thing-that-can-fail) as us-west-1, we would expect to see us-east-1 have each type of error occur about 5 times as often as us-west-1.
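Back-of-the-envelope version of that claim; the counts and the per-unit failure rate below are made-up numbers just to show the scaling, not real figures.

```python
# If per-unit reliability is identical, absolute failure counts scale with size.
failure_rate_per_unit = 0.02                      # assumed failures per unit per month
units = {"us-east-1": 5000, "us-west-1": 1000}    # assumed relative sizes (5:1)

for region, n in units.items():
    expected = n * failure_rate_per_unit
    print(f"{region}: ~{expected:.0f} expected failures/month")
# us-east-1: ~100, us-west-1: ~20 -- five times the units, five times the
# absolute failures, even though the per-unit rate is identical.
```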
I'm surprised Amazon hasn't built another region in the east. If you're in the west, you get us-west-1 and us-west-2 and can fail over and distribute between the two; why don't they have that kind of duplication in the east?
us-east-1 was 11 different datacenters last time I bothered to check.
us-west-2 by comparison is two datacenters. The reason us-west-1 and us-west-2 exist is that they are geographically diverse enough to rule out low-latency interconnections (and they also have dramatically different power costs, so they bill differently).
Then how come, when east goes down, it always seems to take down all the AZs in the region, never just one AZ? As long as the region fails like a single datacenter, I'll think of it like a single datacenter.
They already expanded into a DC in Chantilly, one more in Ashburn, and I believe one in Manassas. But they lean on Ashburn for everything they do, and a small problem there turns into a chain of failures (which, because everyone uses Amazon for every service imaginable, means even the smallest problem takes down whole websites).
I don't understand why anyone's site is only in one datacenter. I thought the point of AWS was that it was distributed with fault tolerance? Why don't they distribute all the sites/apps across all their centers?
It takes development/engineering resources, plus additional hardware, to make your architecture more fault-tolerant and to maintain that fault tolerance over long periods of time.
Weigh this against the estimated costs of your application going down occasionally. It's really only economical for the largest applications (Netflix, etc.) to build these systems.
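A rough way to run that comparison; every number below is an assumption for illustration, not real data.

```python
# Expected-cost comparison: building redundancy vs. eating the downtime.
engineering_cost_per_year = 150_000   # assumed cost to build/maintain multi-region
extra_infra_per_year = 60_000         # assumed cost of duplicate hardware/traffic

outage_hours_per_year = 8             # assumed downtime the work would avoid
revenue_per_hour = 2_000              # assumed revenue lost per hour of downtime

cost_of_redundancy = engineering_cost_per_year + extra_infra_per_year
cost_of_downtime = outage_hours_per_year * revenue_per_hour

print(f"redundancy: ${cost_of_redundancy:,}/yr vs downtime: ${cost_of_downtime:,}/yr")
# With these numbers redundancy costs ~13x what the outages do, which is why
# it only pencils out for very large applications (or very expensive outages).
```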
Disagree. The only area where it really hurts the wallet is multi-AZ on your RDS, because it doubles your cost no matter what, and RDS is the toughest thing to scale horizontally. The upside is that if you scale your data layer horizontally, you don't need to use RDS anymore.
Two c1.medium instances, which are very nice for webservers, are enough to host >1M pageviews a month (WordPress, not much caching) and cost around $120/mo each, effectively $97/mo if you prepay for 12 months at a time via reserved instances.
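Running the numbers above (the $120 and $97 figures are the ones quoted in this comment; nothing else is assumed):

```python
# On-demand vs. reserved for the two c1.medium webservers described above.
on_demand_monthly = 120            # quoted on-demand cost per instance
reserved_effective_monthly = 97    # quoted effective cost with 12-month reserved
servers = 2

on_demand_yearly = on_demand_monthly * servers * 12
reserved_yearly = reserved_effective_monthly * servers * 12
print(f"on-demand: ${on_demand_yearly}/yr, reserved: ${reserved_yearly}/yr, "
      f"saving ${on_demand_yearly - reserved_yearly} "
      f"(~{100 * (1 - reserved_yearly / on_demand_yearly):.0f}%)")
# on-demand: $2880/yr, reserved: $2328/yr, saving $552 (~19%)
```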
The other issue is that you can have redundant services, but when the control plane goes down, you are screwed.
Every day that I have to build basic redundancy into my applications, I wish we could just go with a service provider (like Rackspace / Contegix) that offered more redundancy at the hardware level.
I know the cloud is awesome and all, but having to assume your disks will disappear, fail, or go slow at random, uncontrollable times is expensive to design around.
If you don't have an elastic load, then cloud elasticity is pointless, and ultimately an anchor around your infrastructure.
us-west-2 is about the same cost as us-east these days, and latency is only ~10ms more than us-west-1. I'm puzzled that people aren't flocking to us-west-2. I can't remember the last time there was an outage there, either.