Regarding how the ELB works, I think the AWS engineers simply have a different idea of how to implement things. AWS is very much a dynamic platform, and the engineers seem to have embraced that when they came up with this solution.
It doesn't necessarily mean you can't use websockets through an ELB, it just means that you need to be able to handle reconnects, and that shouldn't be a new challenge for any system that relies on connections staying open for a long time. Also, the load balancer servers don't switch every 60 seconds; you can have connections running for a lot longer than that. I would also assume the load balancers keep handling connections for a while after they are taken out of the DNS rotation, to make sure DNS caches are updated before the IP addresses stop working.
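To make that concrete, here's a minimal reconnect-with-backoff sketch using the standard browser WebSocket API. The endpoint URL is just a placeholder, not a real service, and the backoff numbers are arbitrary:

```typescript
// Hypothetical endpoint; swap in your own ELB-fronted URL.
const ENDPOINT = "wss://example.com/stream";

function connect(attempt = 0): void {
  const ws = new WebSocket(ENDPOINT);

  ws.onopen = () => {
    attempt = 0; // reset the backoff once we're connected again
  };

  ws.onmessage = (event) => {
    console.log("received:", event.data);
  };

  ws.onclose = () => {
    // The ELB (or any NAT/middlebox on the path) may drop the connection
    // at any time; treat that as routine and reconnect with backoff.
    const delay = Math.min(30_000, 1_000 * 2 ** attempt);
    setTimeout(() => connect(attempt + 1), delay);
  };
}

connect();
```

The point isn't the specific backoff policy, just that reconnecting is a normal code path rather than an error case.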
AWS support has said that ELBs will continue to accept connections (and use the correct backend) for at least an hour after the CNAME stops resolving to a particular IP.
> you would need to be able to handle reconnects, but that shouldn't be a new challenge for any system relying on connections being open for long
That's true, but it kind of misses the point. Normally, if a connection is dropped it means something has gone wrong. In this case, connections are dropped by design, and there is no way (AFAICT) to work around it. Designing a system so that behavior which would otherwise be the result of things going wrong becomes the normal, designed-for behavior is, IMHO, the very definition of Bad Design.
I'd argue that this design promotes building reliable applications. A system that cannot reconnect is fragile, and the best way to know if the system can handle that failure is to occasionally induce the event. Assuming that you are running across a lossless, ideal network is, IMHO, the very definition of Bad Design.
Certainly systems should be designed to be robust against failures. But encouraging this by deliberately producing failures in production seems like a bad idea to me. It's kind of like saying, "Let's see if the new hull design works by deliberately steering the boat into an iceberg!"
A TCP socket teardown followed by a reconnect is hardly the equivalent of ramming a floating chunk of ice. There are a bunch of reasons you will see that teardown in practice, like NAT timeouts in a home router, or carrier-grade 6to4 NAT, or mobile devices rehoming to a new tower, or anywhere else that state is tied to the path.
Sure this is a deliberately produced failure, but only in the sense that this is a "normal" failure. This is a condition that is to be expected on the internet, and this is simply an additional place it occurs.
Bad analogy. It's like saying "let's see if the new hull design works by deliberately running it into things in a test laboratory setting". Because, y'know, if you deploy an application to production using a particular network configuration (that is, using an ELB) without testing it in a development/staging environment first, you're doing a poor job.
This disconnect behavior is just a property of the system. Either you design your application to handle it, or you use a different system. (Not that you can get away with not handling disconnects even without ELBs.)
My analysis shows an AWS ELB changes IP addresses on us roughly every two weeks. Often enough to cause problems if you aren't prepared, but infrequent enough to give you false confidence that things are working as designed.
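For reference, something like this Node.js sketch is enough to watch for those rotations yourself; the hostname and polling interval here are placeholders:

```typescript
// Periodically resolve an ELB's hostname and log when the A records change.
import { promises as dns } from "node:dns";

const ELB_HOSTNAME = "my-elb-123456.us-east-1.elb.amazonaws.com"; // hypothetical
let lastSeen = "";

async function check(): Promise<void> {
  const addresses = (await dns.resolve4(ELB_HOSTNAME)).sort().join(",");
  if (addresses !== lastSeen) {
    console.log(new Date().toISOString(), "ELB now resolves to:", addresses);
    lastSeen = addresses;
  }
}

// Poll every 5 minutes; over a few weeks this shows how often the set rotates.
setInterval(check, 5 * 60 * 1000);
check();
```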