Why do people host static websites on S3 at all? It really isn't designed for that: it is an object store. Yes: it has a URL structure accessible in a way that makes it look like static hosting, and Amazon caved pretty early to people wanting to use it that way by adding features to make it more reasonable, but it doesn't fix the underlying problem.
Specifically, it is both allowed to--and often does--return 50x errors to requests. The documentation for the S3 API states you should immediately retry; that's fine for an API where I can code that logic into my client library, but is simply an unacceptable solution on the web. Maybe there are one or two exceptions to this, but I have simply never seen a web browser retry these requests: the result is you just get broken images, broken stylesheets, or even entire broken pages. Back when Twitter used to serve user avatar pictures directly from S3 the issue was downright endemic (as you'd often load a page with 30 small images to request from S3, so every few pages you'd come across a dud).
Sure, it only happens to some small percentage of requests, but for a popular website that can be a lot of people (and even for an unpopular one, every user counts), and the random error rate is orders of magnitude higher than anything I've experienced with my own hosting on EC2. It is also irritating because it is random: when my own hosting fails, it fails in its entirety; I don't have some tiny fraction of requests from users all over the world failing.
Regardless, I actually have an administrative need to stop hosting a specific x.com to www.x.com redirect on some non-AWS hosting I have (the DNS is hosted by Route53, etc., but I was left with a dinky HTTP server in Kentucky somewhere handling the 301), and I figured "well, if it doesn't have to actually request through to an underlying storage system, maybe I won't run into problems; I mean, how hard is it to take a URL and just immediately return a 301?", but after just a few minutes of playing with it I managed to get a test request that was supposed to return a 301 returning a 500 error instead. :(
HTTP/1.1 500 Internal Server Error
x-amz-request-id: 1A631406498520D6
x-amz-id-2: hXQ1YXyu0gxaiGITKvcB+P8+tgPsP3UITX/Or4emyjZtaL16ULAyHFx2ROT4QPXY
Content-Type: text/html; charset=utf-8
Content-Length: 354
Date: Fri, 28 Dec 2012 07:19:24 GMT
Connection: close
Server: AmazonS3
This wasn't just a one-time problem either: I set up a loop requesting random files (named using the timestamp of the test run and a sequence number) off this pure-redirect bucket and left it running for a few minutes, and some of the S3 nodes I'm talking to (72.21.194.13 being a great example) are just downright unreliable, often returning 500 errors in small clumps (that one node is giving me a 2% failure rate!!). S3 is simply not an appropriate mechanism for site hosting, and it is a shame that Amazon is encouraging people to misuse it in this fashion.
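For anyone who wants to reproduce this kind of measurement, here is a rough sketch of such a loop in Python; the stub fetcher (failing 2% of the time) stands in for a real HTTP client pointed at the redirect-only bucket, and the key scheme simply mirrors the test described above:

```python
import random
import time

def measure_error_rate(fetch, num_requests):
    """Issue num_requests fetches for unique (cache-busting) keys and
    count how many come back as 50x server errors."""
    failures = 0
    for seq in range(num_requests):
        # Unique key per request: timestamp of the run plus a sequence
        # number, as in the test described above.
        key = "probe-%d-%d" % (int(time.time()), seq)
        status = fetch(key)
        if 500 <= status < 600:
            failures += 1
    return failures / float(num_requests)

# Stub standing in for a real HTTP GET against the bucket: returns a
# 301 (the expected redirect) 98% of the time and a 500 otherwise.
random.seed(0)
stub = lambda key: 500 if random.random() < 0.02 else 301
rate = measure_error_rate(stub, 10000)
```

With a real client you would swap the stub for an HTTP GET against the bucket's website endpoint and log the failing request ids.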
(edit: Great, and now someone downvoted me: would you like more evidence that this is a problem?)
I've been messing with S3 for a new project involving the HTML5 canvas, so there are lots of CORS and canvas security concerns, PUTting objects from the browser, and a desire for low-latency changes.
S3 has not been delivering. Here's a few reasons:
* S3 only provides read-after-write consistency for non-standard regions: http://aws.amazon.com/s3/faqs/#What_data_consistency_model_d... Since moving to US-West-1, we've had noticeably more latency. Working without read-after-write just isn't an option, users get old data for the first few seconds after data is pushed.
* Oh, and the editor for CORS data introduces newlines into your config around the AllowedHost that BREAK the configuration. So you need to manually delete them when you make a change. Don't forget!
* I swear, I get 403s and other errors at a higher rate than I have from any custom store in the past. But this is purely subjective.
Based on all this- I really need to agree with saurik that the folks at S3 aren't taking their role as an HTTP API seriously enough. They built an API on HTTP, but not an API that browsers can successfully work with. Things are broken in very tricky ways, and I'd caution anybody working with S3 on the front-end of their application to consider the alternatives.
I'm moving some things to Google Cloud Storage right now, and it is blazing fast, supports CORS properly, and has read-after-write consistency for the whole service. Rackspace is going to get back to me, but I expect they could do the same (and they have real support).
While you are fixing things, can you please make cloudfront send HTTP 1.1 instead of HTTP 1.0 for 206 (Partial Content) responses to range get requests. It is invalid since 206 is not part of HTTP 1.0, and Chrome refuses to cache the responses, which makes cloudfront terrible for delivering HTML5 media.
I host very small-time sites for a few family members on S3 because it is practically free (pennies per month) and there's more or less zero chance some script kiddie will break in and deface it, as was the case when they were going the traditional "php/wordpress on godaddy" route. EC2 is great, but for hosting tiny non-money-making sites it's way more expensive and maintenance-consuming: a micro comes out to $14 a month and a small to $46 a month. For a site that gets hit a few hundred times a week tops, you're just paying for tons of idle time. A very rare 500 error (I've never seen that before) is not an issue in this case.
We investigated your report of issues with requests, and found that one S3 host was behaving incorrectly. We identified the root cause and deployed a fix. Can you verify that we have fixed your issue?
(I also have this specific high-500 rate as case 81302771, which I humorously did not get an answer to yet; I got a response asking for more information which I provided before going to sleep this morning, but no resolution... yet I switch back to HN and you have responded here? ;P)
I cannot replicate the really-high 500 rate anymore on 72.21.194.13 (the node that was particularly bad). However, I'm still concerned about what caused that: Is it likely to happen again? Why did it only happen to that one node? (In essence: help me trust this system ;P.)
However, what I'm most interested in is whether the "static website hosting" endpoint of S3 (the *.s3-website-us-east-1.amazonaws.com URLs) has different semantics than S3 normally does, so that under "normal interaction" scenarios[1] I can rely on "this will do its best to not return a 500 error, retrying if required to get the underlying S3 blob".
Have you tried setting a redirection rule on your bucket so that when the 500 error occurs, S3 will automatically retry the request? You can set a redirection rule in the S3 console, and I think the following rule might work:
This will redirect all 500s back to the same location, effectively retrying the request. This should cover the random 500 case, though I'm not sure it will work 100% of the time.
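I haven't verified this, but based on the RoutingRules schema used by S3's static website configuration, the rule would presumably look something along these lines (the HostName is a placeholder for your own website endpoint, and whether a Condition can actually match a 500 is exactly what needs testing):

```xml
<RoutingRules>
  <RoutingRule>
    <Condition>
      <HttpErrorCodeReturnedEquals>500</HttpErrorCodeReturnedEquals>
    </Condition>
    <Redirect>
      <HostName>example-bucket.s3-website-us-east-1.amazonaws.com</HostName>
    </Redirect>
  </RoutingRule>
</RoutingRules>
```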
I'm interested by this solution too. I will try to take a look and test it over the weekend.
I think that this solution can only be an improvement though.
I believe not: I'm pretty certain putting CloudFront in front of your bucket is fine (it handles the HTTP layering correctly); this problem is one of attempting to directly host static content from S3 only.
(That said, I have very little personal experience with CloudFront, as in my experience it is more expensive for fewer features with less POPs than using a "real CDN" like CDNetworks, or even Akamai.)
(edit:)
For this specific circumstance, I'm not certain at all what CloudFront's behavior will be; it seems like the "redirect" concept is a property of the "static website hosting" feature of S3, not part of the underlying bucket, and CloudFront "normally" (in quotes, as I just mean the default origin options it provides) directly accesses the bucket.
I thereby imagine that if I simply set a custom origin to the ___.s3-website-us-east-1.amazonaws.com URL provided by the S3 static hosting feature I will get the right behavior (where CloudFront forwards and caches the 301 responses), but then I have no clue if it will correctly retry the 500 error responses.
That said, I will point out that I am not even certain if CloudFront retries the 500 requests anyway: it occurred to me that with a small error rate combined with a cache, if you (as I somewhat did at least) expect the potential fix to be S3-specific, you might simply never really "catch" an actual failing request in a test scenario.
It could then be that CloudFront retries all 50x failures (in which case if I set it up with a custom origin to the S3 static hosting URL you'd still get the retry behavior), but I somehow doubt that it does that (and just earlier I saw two requests in a row to S3 fail for these 301 redirects, so it might not even help).
CloudFront doesn't retry on 500s. Besides, you can't alias to a CloudFront distribution from x.com apex, so I don't think it would work for you even if that were the case.
CloudFront, when used with Route 53 as your DNS provider, can be used for zone apex hosting, as you can place an "ALIAS" record (as opposed to a real DNS CNAME) to the other hostname; this is the same procedure you use to get S3's static hosting feature working with a zone apex and these new instructions today. (That said, I have never done this personally with CloudFront, as again: I do not use CloudFront.)
Nope. Route 53 has to ALIAS to another RR in your own hosted zone. There's no way for Route 53 to return the A records/IP address RDATA that CloudFront uses to direct clients to the fastest site.
Other "ALIAS" providers can't do real CloudFront apex support either. Their intermediate resolvers end up caching the CloudFront records without varying per client subnet.
Interesting! I was reading something in one of the FAQs earlier that seemed to indicate that that works, but now, digging further and reading through the forums, I see that you are totally right: you simply couldn't build this use case with pure-AWS (non-EC2) tools without this new S3 static hosting redirection feature.
(As for non-Amazon DNS with server-side aliasing support, it wouldn't be that bad, for this kind of use case: you are already taking a latency hit by returning the 301, and direct links will never target these URLs as they will have the canonical www. hostname, so if you just end up with an edge node near a geo-ip DNS server near the original user, it will be approximately good enough.)
I spent some time over the past weekend migrating my static-generated blog over to S3+CF and the only problems that I ran into were invalidation and permissions on the bucket. It is likely a result of my lack of knowledge of S3 bucket/CF utilities, but I've been using s3cmd for sync.
Definitely impressed with how quickly it went. I muddled through setting up AWS DNS, S3 and CF through a bunch of blog articles. But it was well worth the time investment.
I'll likely just write a post on my experiences as well once everything is said and done. I haven't had time over the holiday to figure out the AWS bucket policy, but my overall plan is to have a node-webkit shim with a markdown editor for editing posts.
It should be relatively easy and would be a complete win for me, blog-post-wise, especially since my "drafts" would be in S3 themselves.
I'm in the middle of creating a web application, and my plan was to serve static files from S3. Based on your post, it seems like a really bad idea. If the problems are so apparent, I wonder why this is such a generally accepted and recommended approach. One example: the Heroku guide that praises putting static files on S3: https://devcenter.heroku.com/articles/s3
I trust S3 a lot (in fact, there was a time I had >1% of all objects in S3; I have since deleted a very large percentage, but I believe I still have well over a billion objects stored).
I would definitely agree: it doesn't fail under load; for a while I was seriously using S3 as a NoSQL database with a custom row-mapper and query system (not well generalized at all) that I built.
However, this particular aspect is a known part of S3 that has been around since the beginning: that it is allowed to fail your request with a 500 error, and that you need to retry.
This is something that if you read through the S3 forums you can find people often commenting on, you will find code in every major client library to handle it, and it is explicitly documented by Amazon.
"Best Practices for Using Amazon S3", emphasis mine:
> 500-series errors indicate that a request didn't succeed, but may be retried. Though infrequent, these errors are to be expected as part of normal interaction with the service and should be explicitly handled with an exponential backoff algorithm (ideally one that utilizes jitter). One such algorithm can be found at...
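For an API client (as opposed to a browser, which is the whole problem here), the documented handling is simple enough. A minimal sketch of retry with exponential backoff and "full jitter", with made-up function and parameter names, and a stub origin standing in for S3:

```python
import random
import time

def get_with_retry(fetch, key, max_attempts=5, base_delay=0.1, cap=5.0):
    """Retry 50x responses with exponential backoff plus full jitter:
    sleep a uniform random amount up to an exponentially growing cap."""
    for attempt in range(max_attempts):
        status, body = fetch(key)
        if status < 500:
            return status, body
        # Back off before the next attempt (skip the sleep after the last).
        if attempt + 1 < max_attempts:
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
    return status, body

# Stub origin that fails twice with a 500, then succeeds:
calls = {"n": 0}
def flaky(key):
    calls["n"] += 1
    return (500, None) if calls["n"] < 3 else (200, "ok")

status, body = get_with_retry(flaky, "some/object")
```

This is exactly the logic every major S3 client library bakes in, and exactly the logic a browser fetching a static page will never run.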
Regardless, the 2% failure rate on that one S3 IP endpoint is definitely a little high, so I filed a support ticket (I pay for the AWS Business support level) with a list of request ids and correlation codes that returned a 500 error during my "static hosting + redirect" test today. I'll respond back here if I hear anything useful from them.
2% failure rates are excessive, agreed--but why is the requirement to retry on 500 so off-putting? Virtually all APIs have this occur on some level, and you do the exponential backoff song and dance.
What am I missing that makes this such a show stopper with browsers? You can still do the backing off clientside with a line or two of javascript.
Seems like the pros outweigh the cons but I'm probably missing something.
You are still thinking about this as an API, with for example JavaScript and some AJAX. The use case here is zone apex static website hosting: if you go to http://mycompany.com/ and get a 500 error, the user is just going to be staring at an error screen... there will be no JavaScript, and the browser will not retry. As I actually explicitly said multiple times: for an API that is a perfectly reasonable thing to have, but for static website hosting it just doesn't fly.
Oh, I see what you mean: you're concerned about the first bytes to the browser being faulty. Well, that 2% error rate is spread out across the total of requests; the likelihood of a user getting a 500 on his first hit should be significantly less than 2% (but it does seem like it will still be way too high).
Very valid point saurik, thanks for pointing out the extent of the problem. It is a dilemma. Seems kind of silly to have an instance just for the first hit to go through reliably for visitors, goddamit Amazon.
Edit: Wait a minute, maybe this could be solved with custom error pages which I think they support. :P
You're going to need to explain how the requests would differ. If anything I'd expect image files to be more cache-friendly and have fewer visible failures than the critical html files. An image might have a 2% failure rate once or twice plus fifty error-free cache loads, while an html page might have 2% failure every single click.
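To put numbers on this subthread's back-of-the-envelope argument (purely illustrative, assuming independent failures at a fixed per-request error rate):

```python
# Chance that at least one of k independent requests fails,
# given a per-request error rate p.
def page_break_chance(p, k):
    return 1 - (1 - p) ** k

# One HTML page per click at a 2% per-request rate:
per_click = page_break_chance(0.02, 1)

# A page pulling 30 small images (the old Twitter-avatar scenario):
with_assets = page_break_chance(0.02, 30)
```

At a 2% rate, a single-document fetch breaks 2% of the time, but a page loading 30 assets has nearly even odds of showing at least one broken element, which is why the avatar problem felt endemic.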
Interesting. I guess in the case of static web hosting you could use onerror to deal with failed frontend requests to smooth out the broken images from the user perspective. Though as I say, not been a problem for me.
Yeah, for images you can probably deal with that; but what if your JavaScript doesn't load because the script itself was a 500 error, or the entire website doesn't load because of a 500 error... well, you're screwed. The use case here is for zone-apex whole-site static website hosting (either of just canonicalizing redirects or of the final webpage: same issue).
Because it's easy? I have 100 static websites on S3. After initial setup of buckets, it's trivial to update/sync all of these sites using command line tool (I use S3 Sync) with one click on a batch file. And hosting on S3 is cheap.
This is pretty interesting, I'd like to hear more about it. I'd also like Amazon to hear more about it because maybe they could treat web buckets differently or something.
As I've stated elsewhere in this thread, this is documented behavior from S3. I also have billions of objects in S3, and I definitely get back 500 errors. I'm sorry, but even the CTO of Twitpic is not in a position to say "we push infinitely more data than you, so we know better", at least not for S3 ;P.
Honestly, I have to ask: would you know if some tiny percentage of your requests failed with a 500 error? I bet the answer is "no", as the chance that you wrote some JavaScript to look for a condition you probably didn't realize could happen is almost zero. I'd love to be surprised, however ;P.
(That said, as you are hosting "images", at least you could detect it with JavaScript and fix it, so one could thereby imagine a realistic reason why this would not be a serious problem for you; however, I'd argue that you are then treating S3 as an API, not as a static web hosting system.)
I have one bucket that has 3,148,859,832 objects in it <- I got that number from the AWS Account Activity for S3, StandardStorage / StorageObjectCount metric. I apparently make 1-2 million GET requests off of it per hour. Yesterday, Amazon returned a 500 error to me 35 times, or 1-2 per hour.
That's about a 1 in a million chance of failure, but if you are serving 4 billion images out of S3 (assuming you mean # requests and not # objects), then that means that 4,000 of your requests failed with a 500 error. That's 4,000 people out there who didn't get to see their image today.
So, seriously: are you certain that didn't happen? That out of the billions of people you are serving images to off of Twitpic, that you don't have some small percentage of unhappy people getting 500 errors? Again: it is a small chance of failure, but when it happens the browser won't retry.
As I said: "it only happens to some small percentage of requests, but for a popular website that can be a lot of people (and even for an unpopular one, every user counts)" <- websites like ours serve tens to hundreds of millions of users billions of requests... one-in-a-million actually happens.
(edit: Also, I will note that you seem to be using CloudFront to serve the images from S3, which might be a very different ballgame than serving directly out of S3; for all we know, CloudFront's special knowledge of S3 might cause it to automatically retry 500 errors; for that matter, the "website" feature of S3 could be doing this as well, but I have yet to get word from Amazon on whether that's the case... just pulling directly from the bucket using the normal REST API endpoint does, however, return 500 errors in the way they document.)
12:54:47 * saurik ('s [third] sentence still managed to feel a little more confrontational than he wanted, even with the ;P at the end; he was going for more of a funny feel)
A) Do you define "marginal" as "one in a million"? ;P
B) The only reason I opted for "# requests" instead of "# objects" is because it let me put a hard figure on "number of people dissatisfied if you have a one in a million error rate". Let's say you are doing 4 billion image requests per hour (the time scale is actually irrelevant): then at a 0.0001% error rate (which is what I get from S3) then 4,000 users per hour are getting an error.
C) ... you aren't doing S3 static web hosting if you are keeping access logs, as the only people who know about the request are the user's web browser and the server. You can attempt to detect the error in JavaScript on the client, but you can't keep an access log. If you are logging requests made by your server, then the error rate is irrelevant as you can just retry the operation.