Cool stuff! I think, though, that much of this can be handled in other ways (although obviously there is never one right way of doing these kinds of things). This application kit is one way of orchestrating service/server discovery. Another way, which I have implemented personally, is to use a combination of mcollective and puppet (with puppet facts enabled). This lets you define roles for specific systems, run tasks against servers of that role type, keep track of which servers hold that role, connect them to a 'central' load-balancer, and many other things.
This solves most of the issues that this toolkit addresses, but it likely wouldn't be the best option for everyone. Just some info on at least one other way to deal with this stuff!
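To make the role-based lookup a bit more concrete, here's a rough Python sketch of pulling the list of servers that carry a given role by querying PuppetDB's v4 query API (the PuppetDB URL and the 'role' fact name are placeholders, and mcollective does its own fact filtering over its middleware; this is just an approximation of the idea):

    # Sketch only: list every node whose "role" fact matches, roughly what an
    # mcollective fact filter gives you. Assumes PuppetDB's v4 query API is
    # reachable; the host below and the "role" fact are hypothetical.
    import json
    import requests

    PUPPETDB = "http://puppetdb.example.com:8080"

    def nodes_with_role(role):
        query = ["and", ["=", "name", "role"], ["=", "value", role]]
        resp = requests.get(
            f"{PUPPETDB}/pdb/query/v4/facts",
            params={"query": json.dumps(query)},
            timeout=10,
        )
        resp.raise_for_status()
        return [fact["certname"] for fact in resp.json()]

    # e.g. feed this list into your load-balancer configuration
    for node in nodes_with_role("webserver"):
        print(node)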
Having a central load balancer is going to turn into a nightmare once you start managing a reasonable number of servers. Hardware goes bad (especially in the cloud), and having a single point of failure leaves you at its mercy.
A load balancer should never be a single point of failure. You should always have multiples.
Also, if the response to this is 'but it's still a central point of failure', then they haven't really removed that in this solution either: if the zookeeper cluster dies you lose everything.
Generally, if a clustered load-balancer dies and another takes over, there's half a second to a couple of seconds of transition, but then you're back up and running with a very simple architecture. If all of your load-balancers die, you have something much bigger to worry about.
Really, all this does is move the failure mode into a higher-level service with a much greater potential for failure. Zookeeper even makes you specify the size of your cluster up front, and in my experience it's difficult to update that live. I've read they're working on that, but still.
Clustered load-balancers (using Pacemaker/Corosync, keepalived or similar) are very well understood these days. Pacemaker/Corosync can even run within EC2 now; a couple of years ago they added unicast support, which obviates the multicast restrictions in EC2.
Additionally, if we want to talk about load, a well-configured haproxy/Nginx load-balancer can handle hundreds of thousands of connections a second. If your installation needs more than that, I'm certain you could add a layer to distribute the load-balancing across a set of them. Obviously that's another problem to introduce, but still not one you'll reach until you probably have even more traffic than Airbnb gets.
For accuracy, it's worth pointing out that if Zookeeper dies, only registration/deregistration goes down. The local HAProxy processes will continue to run, and applications will continue to be able to communicate with services. Zookeeper isn't a single point of failure for communication; it's just a registration service.
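To illustrate what that registration looks like, here's a rough Python sketch using the kazoo ZooKeeper client: each instance registers an ephemeral node, and anything watching the service path is notified as instances come and go. The hosts and paths are made up, and this isn't the toolkit's actual code, just the general ephemeral-node pattern it builds on:

    # Sketch only: register one service instance and watch the backend list.
    # If Zookeeper is unreachable, this registration/watch traffic stops, but an
    # already-configured HAProxy keeps serving with its last known backends.
    import time
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
    zk.start()

    # Ephemeral node: it disappears automatically if this process/session dies.
    zk.create(
        "/services/myapp/10.0.0.5:8080",
        b'{"host": "10.0.0.5", "port": 8080}',
        ephemeral=True,
        makepath=True,
    )

    # A Synapse-like agent would rewrite the local HAProxy config and reload it
    # whenever this watch fires.
    @zk.ChildrenWatch("/services/myapp")
    def on_change(children):
        print("current backends:", children)

    time.sleep(60)  # keep the session (and the ephemeral node) alive for the demo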
Indeed it does! I use it for less complicated fail-over situations myself, but if you start to need more complicated topologies (or something which has good interaction with IPVS/LVS) then Corosync!