The N. Virginia datacenter has been historically unreliable. I moved my personal projects to the West Coast (Oregon and N. California) and I have seen no significant issues in the past year.
N. Virginia is both cheaper and closer to the center of mass of the developed world. I'm surprised Amazon hasn't managed to make it more reliable.
One thing we discovered this morning: it appears the AWS console itself is hosted in N. Virginia.
This means that if you were trying to make changes to your EC2 instances in the West using the GUI, you couldn't, even though the instances themselves were unaffected.
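The EC2 API endpoints are regional, though, so even with the console down you can usually still talk to your region directly. Here's a minimal sketch with boto; it assumes credentials are configured the usual way, and the instance ID is just a placeholder.

```python
# Talk to the regional EC2 endpoint directly, bypassing the console web app.
import boto.ec2

# Connect to the region itself, not the (us-east-1-hosted) console
conn = boto.ec2.connect_to_region("us-west-2")

# List instances and their states in the region
for reservation in conn.get_all_instances():
    for instance in reservation.instances:
        print(instance.id, instance.state)

# Reboot a specific instance (placeholder ID)
conn.reboot_instances(instance_ids=["i-12345678"])
```

Anything that hits the regional endpoint directly (API tools, your own scripts) sidesteps the console entirely.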
Shouldn't Amazon themselves have architected their own app to be able to move around?
I get tired of the snipes from people saying "well, you're doing it wrong," as if this is trivial stuff. But if Amazon themselves aren't even making their AWS console redundant between locations, how easy or straightforward is it for anyone else?
To what extent is this just "the cobbler's kids have no shoes?"
You're close. Put another way, "inherent complexity is the problem."
What I mean by that is: the more coupled your system is, the more brittle it is.
Frankly, this is AWS's issue. It is too coupled: RDS relies on EBS, the console relies on both, etc. Any connection between two systems is a point of failure and must be architected so those systems can operate without that connection. This is why SMTP works the way it does. Real-time service delivery isn't the problem, but counting on it is.
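Here's a rough sketch of what I mean, in the SMTP spirit: accept the work locally and retry delivery later, so the producer never depends on the downstream link being up right now. `send_remote` is just a placeholder for the real downstream call, not any real API.

```python
# Store-and-forward decoupling: local durable queue, deferred delivery.
import json
import time
from pathlib import Path

SPOOL = Path("spool")          # local durable queue, like an MTA's mail spool
SPOOL.mkdir(exist_ok=True)

def enqueue(message: dict) -> None:
    """Accept work immediately; durability is local, delivery is deferred."""
    path = SPOOL / f"{time.time_ns()}.json"
    path.write_text(json.dumps(message))

def send_remote(message: dict) -> None:
    """Placeholder for the real call to the downstream system."""
    raise ConnectionError("downstream unavailable")  # simulate an outage

def flush(retries: int = 3, backoff: float = 0.5) -> None:
    """Attempt delivery; failures simply leave the message spooled."""
    for path in sorted(SPOOL.glob("*.json")):
        message = json.loads(path.read_text())
        for attempt in range(retries):
            try:
                send_remote(message)
                path.unlink()       # delivered, drop it from the spool
                break
            except ConnectionError:
                time.sleep(backoff * (attempt + 1))
        # still spooled after all retries: try again on the next flush pass

if __name__ == "__main__":
    enqueue({"action": "resize", "instance": "i-12345"})
    flush()   # downstream is "down", but the producer was never blocked
    print(f"{len(list(SPOOL.glob('*.json')))} message(s) waiting for retry")
```

The point isn't the code, it's the shape: delivery can fail all day and the only symptom is latency, not an outage.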
Depends. Generic interfaces and non-reliance have costs too. In general I agree that things should be decoupled, but it's not always easy or practical.
Surely true, but that's the purpose of a system in the first place: to manage complexity and make it predictable. You could argue that we already have such a system, given how well the Internet works overall. The fact that this system has problems doesn't change what I see as clear evidence that such a system can, in fact, work even better.
We're not talking about an order-of-magnitude leap in complexity here, just simple management of common human behavioral tendencies to promote more reliability. "The problem is inherently complex" is always true and always will be, but it's no excuse for not designing a system to gracefully handle that complexity.
The internet works because it provides very weak consistency guarantees compared to what businesses might require out of an EC2 management console. (IMO.)
Their CoLo space is the same space shared by AOL and a few other big-name tech companies. It's right next to the Greenway, just before you reach IAD going northeast. That CoLo facility seems pretty unreliable in the scheme of things; Verizon and Amazon both took major downtime this summer when a pretty hefty storm rolled through VA[1], but AOL's dedicated datacenters within the same 10-mile radius experienced no downtime whatsoever.
To be fair, the entire region was decimated by that storm. I didn't have power for 5 days. Much of the area was out. There was a ton of physical damage. That's not excusing them; they should do better. But that storm was like nothing I've experienced in 20 years of living in the area.
Yeah, it's got to be much larger than the other regions, so it makes sense that we see more errors, since error_rate = machines * error_rate_per_machine.
No, I calculated the error rate for the region. If us-east-1 has 5 times the machines (or availability zones, or routers, or EBS backplanes, or other thing-that-can-fail) as us-west-1, we would expect to see us-east-1 have each type of error occur about 5 times as often as us-west-1.
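Back-of-the-envelope version of that claim; the counts and the per-unit failure rate below are made-up numbers just to show the scaling, not real figures.

```python
# If per-unit reliability is identical, absolute failure counts scale with size.
failure_rate_per_unit = 0.02                      # assumed failures per unit per month
units = {"us-east-1": 5000, "us-west-1": 1000}    # assumed relative sizes (5:1)

for region, n in units.items():
    expected = n * failure_rate_per_unit
    print(f"{region}: ~{expected:.0f} expected failures/month")
# us-east-1: ~100, us-west-1: ~20 -- five times the units, five times the
# absolute failures, even though the per-unit rate is identical.
```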
I'm surprised Amazon hasn't built another region in the east. If you're in the west, you get us-west-1 and us-west-2 and can fail over and distribute between the two; why don't they have that kind of duplication in the east?
us-east-1 was 11 different datacenters last time I bothered to check.
us-west-2 by comparison is two datacenters. The reason us-west-1 and us-west-2 exist is that they are geographically diverse enough to rule out low-latency interconnections (and they also have dramatically different power costs, so they bill differently).
Then how come, when east goes down, it always seems to take down all the AZs in the region, never just one AZ? As long as the region fails like a single datacenter, I'll think of it like a single datacenter.
They already expanded into a DC in Chantilly, one more in Ashburn, and I believe one in Manassas. But they lean on Ashburn for everything they do, and a small problem there turns into a chain of failures (which, because everyone uses Amazon for every service imaginable, means even the smallest problem takes down whole websites).
I don't understand why anyone's site is only in one datacenter. I thought the point of AWS was that it was distributed with fault tolerance? Why don't they distribute all the sites/apps across all their centers?
It takes development/engineering resources, plus additional hardware, to make your architecture more fault-tolerant and to maintain that fault tolerance over long periods of time.
Weigh this against the estimated costs of your application going down occasionally. It's really only economical for the largest applications (Netflix, etc.) to build these systems.
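A rough way to run that comparison; every number below is an assumption for illustration, not real data.

```python
# Expected-cost comparison: building redundancy vs. eating the downtime.
engineering_cost_per_year = 150_000   # assumed cost to build/maintain multi-region
extra_infra_per_year = 60_000         # assumed cost of duplicate hardware/traffic

outage_hours_per_year = 8             # assumed downtime the work would avoid
revenue_per_hour = 2_000              # assumed revenue lost per hour of downtime

cost_of_redundancy = engineering_cost_per_year + extra_infra_per_year
cost_of_downtime = outage_hours_per_year * revenue_per_hour

print(f"redundancy: ${cost_of_redundancy:,}/yr vs downtime: ${cost_of_downtime:,}/yr")
# With these numbers redundancy costs ~13x what the outages do, which is why
# it only pencils out for very large applications (or very expensive outages).
```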
Disagree. The only area where it really hurts the wallet is multi-AZ on your RDS, because it doubles your cost no matter what, and RDS is the toughest thing to scale horizontally. The upside is that if you scale your data layer horizontally, you don't need to use RDS anymore.
Two c1.medium instances, which are very nice for webservers, are enough to host >1M pageviews a month (WordPress, not much caching) and cost around $120/mo each, effectively $97/mo if you prepay for 12 months at a time via reserved instances.
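Running the numbers above (the $120 and $97 figures are the ones quoted in this comment; nothing else is assumed):

```python
# On-demand vs. reserved for the two c1.medium webservers described above.
on_demand_monthly = 120            # quoted on-demand cost per instance
reserved_effective_monthly = 97    # quoted effective cost with 12-month reserved
servers = 2

on_demand_yearly = on_demand_monthly * servers * 12
reserved_yearly = reserved_effective_monthly * servers * 12
print(f"on-demand: ${on_demand_yearly}/yr, reserved: ${reserved_yearly}/yr, "
      f"saving ${on_demand_yearly - reserved_yearly} "
      f"(~{100 * (1 - reserved_yearly / on_demand_yearly):.0f}%)")
# on-demand: $2880/yr, reserved: $2328/yr, saving $552 (~19%)
```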
The other issue is that you can have redundant services, but when the control plane goes down, you are screwed.
Every day that I have to build basic redundancy into my applications, I wish we could just go with a service provider (like Rackspace / Contegix) that offered more redundancy at the hardware level.
I know the cloud is awesome and all, but having to assume your disks will disappear, fail, or go slow at random, uncontrollable times is expensive to design around.
If you don't have an elastic load, then cloud elasticity is pointless, and ultimately an anchor around your infrastructure.
us-west-2 is about the same cost as us-east these days, and latency is only ~10ms more than us-west-1. I'm puzzled that people aren't flocking to us-west-2. I can't remember the last time there was an outage there, either.