
The N. Virginia datacenter has been historically unreliable. I moved my personal projects to the West Coast (Oregon and N. California) and I have seen no significant issues in the past year.

N. Virginia is both cheaper and closer to the center of mass of the developed world. I'm surprised Amazon hasn't managed to make it more reliable.



One thing we discovered this morning: it appears the AWS console itself is hosted in N Virginia.

This means that if you were trying to make changes to your EC2 instances in the West using the GUI, you couldn't, even though the instances themselves were unaffected.
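
For what it's worth, the regional API endpoints are separate from the console, so scripting against them can still work when the console is unreachable (assuming the region's own API is healthy). A minimal sketch, assuming a modern SDK (boto3) and already-configured credentials:

    # Each region exposes its own EC2 API endpoint, so this call does not
    # depend on anything hosted in us-east-1.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")

    resp = ec2.describe_instances()
    for reservation in resp["Reservations"]:
        for inst in reservation["Instances"]:
            print(inst["InstanceId"], inst["State"]["Name"])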


Shouldn't Amazon themselves have architected their own app to be able to move around?

I get tired of the snipes from people saying "well, you're doing it wrong," as if this is trivial stuff. But if Amazon themselves aren't even making their AWS console redundant across locations, how easy or straightforward is it for anyone else?

To what extent is this just "the cobbler's kids have no shoes?"


If it's systematically difficult to do it correctly, then the system is wrong.


. . . or the problem is inherently complex.


> . . . or the problem is inherently complex.

You're close. Put another way, "inherent complexity is the problem."

What I mean by that is, the more your system is coupled, the more it is brittle.

Frankly, this is AWS's issue. It is too coupled: RDS relies on EBS, the console relies on both, etc. Any connection between two systems is a point of failure and must be architected so that those systems can operate without that connection. This is why SMTP works the way it does. Real-time service delivery isn't the problem; counting on it is.

Uncouple all the things!
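
To loosely illustrate the SMTP comparison: store-and-forward means the sending side keeps accepting work even while the link to the downstream system is broken. A toy sketch of the pattern (every name here is hypothetical, not a real AWS or SMTP API):

    import time
    from collections import deque

    OUTBOX = deque()  # a real system would make this durable (disk, SQS, etc.)

    def enqueue(message):
        """Accept work immediately; delivery happens later, independently."""
        OUTBOX.append(message)

    def deliver(message):
        """Stand-in for the downstream call; here it is always down."""
        raise ConnectionError("downstream unavailable")

    def drain(max_attempts=3, backoff=0.1):
        """Try to deliver queued work; anything undeliverable stays queued."""
        for _ in range(len(OUTBOX)):
            msg = OUTBOX.popleft()
            for attempt in range(max_attempts):
                try:
                    deliver(msg)
                    break
                except ConnectionError:
                    time.sleep(backoff * (2 ** attempt))
            else:
                OUTBOX.append(msg)  # give up for now, retry on the next drain

    enqueue("provision instance i-1234")
    drain()
    print(len(OUTBOX), "message(s) still queued, nothing lost")  # -> 1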


Depends. Generic interfaces and non-reliance have costs too. In general I agree that things should be decoupled, but it's not always easy or practical.


Surely true, but that's the purpose of a system in the first place: to manage complexity and make it predictable. You could argue that we already have such a system in place, given how well the Internet works overall. The fact that this system has problems doesn't negate what I consider ample evidence that such a system can, in fact, work even better.

We're not talking about an order-of-magnitude leap in complexity here, just straightforward management of common human behavioral tendencies to promote more reliability. "The problem is inherently complex" is always true and always will be, but it's no excuse for not designing a system that gracefully handles that complexity.


The internet works because it provides very weak consistency guarantees compared to what businesses might require out of an EC2 management console. (IMO.)


That's what Twilio + Heroku are for: abstract up another layer. There's even a site where you just give it a GitHub location and it does the rest.


Well the Heroku abstraction was leaking like a sieve today.


Hardly


Their colo space is shared with AOL and a few other big-name tech companies. It's right next to the Greenway, just before you reach IAD going northeast. That colo facility seems pretty unreliable in the scheme of things; Verizon and Amazon both took major downtime this summer when a pretty hefty storm rolled through VA[1], but AOL's dedicated datacenters within the same 10-mile radius experienced no downtime whatsoever.

Edited: [1] http://www.datacenterknowledge.com/archives/2012/06/30/amazo...


To be fair, the entire region was decimated by that storm. I didn't have power for 5 days. Much of the area was out. There was a ton of physical damage. That's not excusing them, they should do better, but that storm was like nothing I've experienced living in the area for 20 years.


Realistically, it's at least in part because everyone defaults to the East region. So it's the most crowded and demanding on the system.


Yep, according to the most recent estimate I saw[1], us-east was more than twice the size of all other regions combined.

[1] http://huanliu.wordpress.com/2012/03/13/amazon-data-center-s...


It's not just because it's crowded. Everyone I know who's worked in that DC hated it. Aside from that, storms regularly knock out the grid in NoVa.


Yeah, it's got to be much larger than the other regions, so it makes sense that we see more errors there, since error_rate = machines * error_rate_per_machine.


The whole region is down; you just calculated the chance of at least one machine having an error.


No, I calculated the error rate for the region. If us-east-1 has 5 times the machines (or availability zones, or routers, or EBS backplanes, or other thing-that-can-fail) as us-west-1, we would expect to see us-east-1 have each type of error occur about 5 times as often as us-west-1.
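
As a back-of-the-envelope check on that expected-value argument (the counts and per-unit rate below are made up, and the 5x ratio is the hypothetical from the comment above):

    # Expected failure events scale linearly with the number of
    # independent things that can fail.
    per_unit_rate = 0.01             # assumed failures per unit per month
    us_west_units = 1_000            # hypothetical count
    us_east_units = 5 * us_west_units

    print("us-west expected failures:", per_unit_rate * us_west_units)  # 10.0
    print("us-east expected failures:", per_unit_rate * us_east_units)  # 50.0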


I believe this is because N. Virginia was also, historically, their first facility.


And largest, and busiest.


I'm surprised Amazon hasn't built another region in the east. In the west you get us-west-1 and us-west-2 and can fail over and distribute between the two, so why isn't there that kind of duplication in the east?


Stop thinking about regions as datacenters.

us-east-1 was 11 different datacenters last time I bothered to check.

us-west-2, by comparison, is two datacenters. The reason us-west-1 and us-west-2 are separate regions is that they're far enough apart geographically to rule out low-latency interconnects (and they also have dramatically different power costs, so they're billed differently).
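
The region/AZ distinction is visible directly from the API. A quick sketch (assumes boto3 and credentials; note that AZ counts change over time and an AZ is not necessarily a single physical datacenter):

    import boto3

    # Each region is a container for multiple availability zones.
    for region in ("us-east-1", "us-west-1", "us-west-2"):
        ec2 = boto3.client("ec2", region_name=region)
        zones = ec2.describe_availability_zones()["AvailabilityZones"]
        print(region, "->", [z["ZoneName"] for z in zones])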


Then how come when east goes down, it always seems to take down all the AZs in the region, never just one? As long as the region fails like a single datacenter, I'll think of it as a single datacenter.


They already expanded into a DC in Chantilly, one more in Ashburn, and I believe one in Manassas. But they lean on Ashburn for everything they do, and a small problem results in a daisy-chain failure (and because everyone uses Amazon for every service imaginable, even the smallest problem takes down whole websites).


I don't understand why anyone's site is only in one datacenter. I thought the point of AWS was that it was distributed with fault tolerance? Why don't they distribute all the sites/apps across all their datacenters?


It takes development/engineering resources, and additional hardware resources to make your architecture more fault-tolerant and to maintain this fault-tolerance over long periods of time.

Weigh this against the estimated costs of your application going down occasionally. It's really only economical for the largest applications (Netflix, etc.) to build these systems.
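
That weigh-it-up argument reduces to a simple expected-value comparison. A sketch with entirely made-up numbers, just to show the shape of the calculation:

    # Redundancy pays for itself when its added cost is less than the
    # expected cost of the downtime it prevents. All figures hypothetical.
    extra_infra_cost = 500.0        # $/month for the redundant setup
    engineering_cost = 2_000.0      # $/month of amortized engineering effort

    outage_hours_per_month = 1.0    # expected downtime without redundancy
    cost_per_outage_hour = 300.0    # lost revenue, support load, etc.

    expected_downtime_cost = outage_hours_per_month * cost_per_outage_hour
    redundancy_cost = extra_infra_cost + engineering_cost

    print("expected downtime cost:", expected_downtime_cost)        # 300.0
    print("cost of redundancy:", redundancy_cost)                   # 2500.0
    print("worth it?", expected_downtime_cost > redundancy_cost)    # False here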


Disagree. The only place it really hurts the wallet is multi-AZ on your RDS, because it doubles your cost no matter what, and RDS is the toughest piece to scale horizontally. The upside is that if you scale your data layer horizontally, you don't need RDS anymore.

Two c1.medium instances, which are very nice for webservers, are enough to host >1M pageviews a month (WordPress, not much caching) and cost around $120/mo each, effectively $97/mo if you prepay 12 months at a time via reserved instances.
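
Plugging in the numbers from that comment (2012-era pricing as stated there; the $97/mo figure already folds in the reserved-instance prepayment):

    on_demand_monthly = 120.0          # $/mo per c1.medium, on demand
    reserved_effective_monthly = 97.0  # $/mo per c1.medium, 12-month reserved
    instances = 2
    months = 12

    on_demand_year = on_demand_monthly * instances * months          # 2880.0
    reserved_year = reserved_effective_monthly * instances * months  # 2328.0
    print("yearly saving for two c1.medium:", on_demand_year - reserved_year)  # 552.0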


The other issue is that you can have redundant services, but when the control plane goes down, you are screwed.

Every day that I have to build basic redundancy into my applications, I wish we could just go with a service provider (like Rackspace or Contegix) that offered more redundancy at the hardware level.

I know the cloud is awesome and all, but having to assume your disks will disappear, fail, or go slow at random, uncontrollable times is expensive to design around.

If you don't have an elastic load, then the cloud's elasticity is pointless, and it ultimately becomes an anchor around your infrastructure.
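
In practice, "designing around it" often means wrapping every storage touch in a retry budget with backoff, so a flaky volume degrades into a handled error instead of a hang. A toy sketch of that pattern (read_block is a stand-in, not a real EBS call):

    import random
    import time

    random.seed(0)  # deterministic for the sketch

    def read_block(block_id):
        """Stand-in for a storage read that sometimes fails."""
        if random.random() < 0.3:
            raise IOError("volume unavailable")
        return f"data for block {block_id}"

    def read_with_retries(block_id, attempts=3, backoff=0.1):
        """Retry with exponential backoff; surface a clear error if all attempts fail."""
        last_err = None
        for attempt in range(attempts):
            try:
                return read_block(block_id)
            except IOError as err:
                last_err = err
                time.sleep(backoff * (2 ** attempt))
        raise IOError(f"giving up on block {block_id}: {last_err}")

    print(read_with_retries(42))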


Heroku only uses one AZ, apparently, which is completely awful for a PaaS...


They sell it as a feature.


us-west-2 is about the same cost as us-east these days, and latency is only ~10ms more than us-west-1. I'm puzzled that people aren't flocking to us-west-2. I can't recall the last time there was an outage there, either.


You can move your projects on demand?



