Cool stuff! I think, though, that much of this can be handled in other ways (although obviously there is never one right way of doing these kinds of things). This application kit is one way of orchestrating service/server discovery. Another way, which I have implemented personally, is to use a combination of mcollective and puppet (with puppet facts enabled). This lets you define roles for specific systems, run tasks against servers of that role type, keep track of which servers hold that role, connect them to a 'central' load-balancer, and many other things.
This solves most of the issues that this toolkit addresses, but it likely wouldn't be the best option for everyone. Just some info on at least one other way to deal with this stuff!
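To make the role-based lookup a bit more concrete, here's a rough Python sketch of pulling the list of servers that carry a given role by querying PuppetDB's v4 query API (the PuppetDB URL and the 'role' fact name are placeholders, and mcollective does its own fact filtering over its middleware; this is just an approximation of the idea):

    # Sketch only: list every node whose "role" fact matches, roughly what an
    # mcollective fact filter gives you. Assumes PuppetDB's v4 query API is
    # reachable; the host below and the "role" fact are hypothetical.
    import json
    import requests

    PUPPETDB = "http://puppetdb.example.com:8080"

    def nodes_with_role(role):
        query = ["and", ["=", "name", "role"], ["=", "value", role]]
        resp = requests.get(
            f"{PUPPETDB}/pdb/query/v4/facts",
            params={"query": json.dumps(query)},
            timeout=10,
        )
        resp.raise_for_status()
        return [fact["certname"] for fact in resp.json()]

    # e.g. feed this list into your load-balancer configuration
    for node in nodes_with_role("webserver"):
        print(node)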
Having a central load balancer is going to turn into a nightmare once you start managing a reasonable number of servers. Hardware goes bad (especially in the cloud), and having a single point of failure leaves you at its mercy.
A load balancer should never be a single point of failure. You should always have multiples.
Also, if the response to this is 'but it's still a central point of failure', then they haven't really removed that in this solution either: if the zookeeper cluster dies you lose everything.
Generally, if a clustered load-balancer dies and another takes over, there's half a second to a couple of seconds of transition, but then you're back up and running with a very simple architecture. If all of your load-balancers die, you have something much bigger to worry about.
Really, all this does is move the failure mode into a higher-level service with a much greater potential for failure. Zookeeper even makes you specify the size of your cluster up front, and in my experience it's difficult to update that live. I've read they're working on that, but still.
Clustered load-balancers (using Pacemaker/Corosync, keepalived or similar) are very well understood these days. Pacemaker/Corosync can even run within EC2 now; a couple of years ago they added unicast support, which obviates the multicast restrictions in EC2.
Additionally, if we want to talk about load, a well-configured haproxy/Nginx load-balancer can handle hundreds of thousands of connections a second. If your installation needs more than that, I'm certain you could add a layer to distribute the load-balancing across a set of them. Obviously that's another problem to introduce, but still not one you'll reach until you probably have even more traffic than Airbnb gets.
For accuracy, it's worth pointing out that if Zookeeper dies, only registration/deregistration goes down. The local HAProxy processes will continue to run, and applications will continue to be able to communicate with services. Zookeeper isn't a single point of failure for communication; it's just a registration service.
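To illustrate what that registration looks like, here's a rough Python sketch using the kazoo ZooKeeper client: each instance registers an ephemeral node, and anything watching the service path is notified as instances come and go. The hosts and paths are made up, and this isn't the toolkit's actual code, just the general ephemeral-node pattern it builds on:

    # Sketch only: register one service instance and watch the backend list.
    # If Zookeeper is unreachable, this registration/watch traffic stops, but an
    # already-configured HAProxy keeps serving with its last known backends.
    import time
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
    zk.start()

    # Ephemeral node: it disappears automatically if this process/session dies.
    zk.create(
        "/services/myapp/10.0.0.5:8080",
        b'{"host": "10.0.0.5", "port": 8080}',
        ephemeral=True,
        makepath=True,
    )

    # A Synapse-like agent would rewrite the local HAProxy config and reload it
    # whenever this watch fires.
    @zk.ChildrenWatch("/services/myapp")
    def on_change(children):
        print("current backends:", children)

    time.sleep(60)  # keep the session (and the ephemeral node) alive for the demo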
Indeed it does! I use it for less complicated fail-over situations myself, but if you start to need more complicated topologies (or something which has good interaction with IPVS/LVS) then Corosync!