Regarding how the ELB works, I think the AWS engineers simply have a different idea of how to implement things. AWS is very much a dynamic platform, and the engineers seem to have embraced that when they came up with this solution.
It doesn't necessarily mean you can't use websockets through an ELB, it just means that you need to be able to handle reconnects, and that shouldn't be a new challenge for any system that relies on connections staying open for a long time. Also, the load balancer servers don't switch every 60 seconds; you can have connections running for a lot longer than that. I would also assume the load balancers keep handling connections for a while after they are taken out of the DNS rotation, to make sure DNS caches are updated before the IP addresses stop working.
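To make that concrete, here's a minimal reconnect-with-backoff sketch using the standard browser WebSocket API. The endpoint URL is just a placeholder, not a real service, and the backoff numbers are arbitrary:

```typescript
// Hypothetical endpoint; swap in your own ELB-fronted URL.
const ENDPOINT = "wss://example.com/stream";

function connect(attempt = 0): void {
  const ws = new WebSocket(ENDPOINT);

  ws.onopen = () => {
    attempt = 0; // reset the backoff once we're connected again
  };

  ws.onmessage = (event) => {
    console.log("received:", event.data);
  };

  ws.onclose = () => {
    // The ELB (or any NAT/middlebox on the path) may drop the connection
    // at any time; treat that as routine and reconnect with backoff.
    const delay = Math.min(30_000, 1_000 * 2 ** attempt);
    setTimeout(() => connect(attempt + 1), delay);
  };
}

connect();
```

The point isn't the specific backoff policy, just that reconnecting is a normal code path rather than an error case.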
AWS support has said that ELBs will continue to accept connections (and use the correct backend) for at least an hour after the CNAME stops resolving to a particular IP.
> you would need to be able to handle reconnects, but that shouldn't be a new challenge for any system relying on connections being open for long
That's true, but it kind of misses the point. Normally, if a connection is dropped it means something has gone wrong. In this case, connections are dropped by design, and there is no way (AFAICT) to work around it. Designing a system so that behavior which would otherwise be the result of things going wrong becomes the normal, designed-for behavior is, IMHO, the very definition of Bad Design.
I'd argue that this design promotes building reliable applications. A system that cannot reconnect is fragile, and the best way to know if the system can handle that failure is to occasionally induce the event. Assuming that you are running across a lossless, ideal network is, IMHO, the very definition of Bad Design.
Certainly systems should be designed to be robust against failures. But encouraging this by deliberately producing failures in production seems like a bad idea to me. It's kind of like saying, "Let's see if the new hull design works by deliberately steering the boat into an iceberg!"
A TCP socket teardown followed by a reconnect is hardly the equivalent of ramming a floating chunk of ice. There are a bunch of reasons you will see that teardown in practice, like NAT timeouts in a home router, or carrier-grade 6to4 NAT, or mobile devices rehoming to a new tower, or anywhere else that state is tied to the path.
Sure this is a deliberately produced failure, but only in the sense that this is a "normal" failure. This is a condition that is to be expected on the internet, and this is simply an additional place it occurs.
Bad analogy. It's like saying "let's see if the new hull design works by deliberately running it into things in a test laboratory setting". Because, y'know, if you deploy an application to production using a particular network configuration (that is, using an ELB) without testing it in a development/staging environment first, you're doing a poor job.
This disconnect behavior is just a property of the system. Either you design your application to handle it, or you use a different system. (Not that you can get away with not handling disconnects even without ELBs.)
My analysis shows an AWS ELB changes IP addresses on us roughly every two weeks. Often enough to cause problems if you aren't prepared, but infrequent enough to give you false confidence that things are working as designed.
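For reference, something like this Node.js sketch is enough to watch for those rotations yourself; the hostname and polling interval here are placeholders:

```typescript
// Periodically resolve an ELB's hostname and log when the A records change.
import { promises as dns } from "node:dns";

const ELB_HOSTNAME = "my-elb-123456.us-east-1.elb.amazonaws.com"; // hypothetical
let lastSeen = "";

async function check(): Promise<void> {
  const addresses = (await dns.resolve4(ELB_HOSTNAME)).sort().join(",");
  if (addresses !== lastSeen) {
    console.log(new Date().toISOString(), "ELB now resolves to:", addresses);
    lastSeen = addresses;
  }
}

// Poll every 5 minutes; over a few weeks this shows how often the set rotates.
setInterval(check, 5 * 60 * 1000);
check();
```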