Also, remember that even building a "simple" web app (comprising a browser client, a web server, and a database) counts as building a distributed system. Concerns over event ordering, the lack of global time, and data consistency models are all alive and well even within boring line-of-business (LOB) systems.
To put it another way, distributed system does not necessarily imply a massive research cluster connected via MPI.
A nice survey of the topic is "Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey" by Xavier Défago, André Schiper, and Péter Urbán, 2004[1]. Interestingly, the unexpected fame of the CAP theorem, and especially its various misinterpretations, has done some harm to this area of distributed systems: people started believing that eventual consistency is the only way to meet the scalability requirements of their applications.
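To get a feel for the simplest family in that survey's taxonomy, here's a minimal, hypothetical sketch of a fixed-sequencer total order broadcast. All names are mine, and a real protocol would also have to handle sequencer failure; the point is just that one stamping process plus in-order delivery gives every receiver the same total order:

    import threading

    class Sequencer:
        # single process that stamps every broadcast with a global sequence number
        def __init__(self):
            self._next = 0
            self._lock = threading.Lock()

        def stamp(self, msg):
            with self._lock:
                seq, self._next = self._next, self._next + 1
            return seq, msg

    class Receiver:
        # delivers messages strictly in stamp order, buffering any gaps
        def __init__(self, deliver):
            self._deliver = deliver   # application callback
            self._expected = 0
            self._pending = {}        # out-of-order messages, keyed by seq

        def on_message(self, seq, msg):
            self._pending[seq] = msg
            while self._expected in self._pending:   # flush the consecutive run
                self._deliver(self._pending.pop(self._expected))
                self._expected += 1

    seqr = Sequencer()
    r = Receiver(print)
    a = seqr.stamp("a")   # seq 0
    b = seqr.stamp("b")   # seq 1
    r.on_message(*b)      # arrives out of order: buffered
    r.on_message(*a)      # gap filled: prints "a" then "b"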
On the JVM, the best group communication infrastructure is the terrific JGroups library[2]. On .NET (and soon the JVM, too) there's Vsync (Isis 2)[3]. For C/C++ (with Java and Python bindings), there's Spread[4]. All three are heavily inspired by the work of (or, in the case of Vsync/Isis, directly created by) Ken Birman of Cornell, and implement group membership and leader election, atomic multicast, failure detection, and more.
I can't speak for JGroups (definitely going to check that out tonight), but Spread and Vsync/ISIS both struck me as perfect examples of the meme of poor-quality academic code.
Spread is an #ifdef soup containing scary stuff like "#ifndef _REENTRANT" in its hello-world examples. Additionally, it has a weird, not-quite-BSD license that doesn't look compatible with the GPL. It also makes you either send them an SSH key to check the code out from Subversion, or send your email/name/company info just to get the download link working.
Vsync/ISIS is a single 50KLOC C# file that spawns upwards of a dozen threads on init.
IMHO there's room for someone with more of a software-engineering background to come in and clean them up (assuming that person or group knows enough about distributed systems to do so without introducing new problems).
> Spread and Vsync/ISIS both struck me as perfect examples of the meme of poor-quality academic code.
I was about to say that academic code is almost always poor quality, but that should never, ever apply to distributed systems code. If there's one thing we should see from academic distributed systems research, it's principled, well-constructed, solidly tested libraries, rather than wannabe commercial databases written by graduate students.
Let's say I wanted to prove to a future employer that I had learned how to build a distributed system, with a practical example, but my current employer doesn't require one. How could I go about doing this in a side project?
Implementing the Raft protocol[1] has been a popular activity over the last couple of years. Amazon's Dynamo paper[2] is also easy enough to grasp to build a prototype of a distributed key/value store.
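If you go the Dynamo route, the core partitioning idea (consistent hashing with virtual nodes) fits in a few lines. Here's a hedged sketch with names of my own invention; the actual paper layers preference lists, vector clocks, hinted handoff, and more on top of this:

    import hashlib
    from bisect import bisect_right

    def _hash(s):
        # map any string onto the ring's keyspace
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes=8):
            # each physical node gets several virtual points for better balance
            self.points = sorted(
                (_hash("%s#%d" % (n, i)), n)
                for n in nodes for i in range(vnodes)
            )

        def replicas_for(self, key, n=3):
            # a key lives on the first n distinct nodes clockwise from its hash
            idx = bisect_right(self.points, (_hash(key),))
            owners = []
            for _, node in self.points[idx:] + self.points[:idx]:
                if node not in owners:
                    owners.append(node)
                if len(owners) == n:
                    break
            return owners

    ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
    print(ring.replicas_for("user:42"))  # e.g. ['node-c', 'node-a', 'node-d']

The property that makes this worth building: adding or removing a node only moves the keys adjacent to its points on the ring, instead of reshuffling everything.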
You should dive into Mesosphere. The resources available are quite robust. With only a tenuous grasp of the larger concepts, I was able to architect a system with Mesos plus Chronos for job queueing and Marathon for container deployment, and moved some of my hobby projects to it. For less than $150 a month or so you could provision three primary and three worker boxes and get a start.
And once you have a real system, you can start looking into the principles of the underlying pieces such as Zookeeper etc.
Just out of curiosity... have you used JGroups before? I've heard negative things about it but wasn't sure if those were unfounded or maybe just early versions.
Any thoughts on how these compare to something like the Data Distribution Service (DDS) for group communication?
Yes, I have, and found it great (i.e. no more problems than with any other library I've used). It has been under (very) active development by the same developer for 17 years now, and he's also very responsive on the mailing list. JGroups serves as the foundation for JBoss Infinispan.
I don't know much about DDS (except that a company I used to work for bought the RTI implementation, and it was quite expensive).
Much of the theory of distributed systems is published at the Symposium on Principles of Distributed Computing (PODC). However, a lot of the emphasis there is on the asynchronous model. Results that hold for the asynchronous model generally hold for other models as well. The main point a layman needs to know about the asynchronous model is that a slow node cannot be distinguished from a dead node.
Implementers of distributed systems will generally ignore most results on the asynchronous model. A slow node might as well be a dead node, so we can use heartbeats and timeouts to exclude such nodes. Typically, robust distributed systems in industry follow some form of partially synchronous model, adapted to the environment they run in: a bare-metal LAN is reliable and almost synchronous when not overloaded; a WAN (the Internet) should be treated as close to the asynchronous model as is practical.
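To make that concrete, here's a minimal heartbeat-and-timeout sketch (class and method names are mine). Note that a "suspected" node is only suspected, never provably dead; the timeout constant is exactly where the partial-synchrony assumption sneaks in:

    import time

    class FailureDetector:
        def __init__(self, peers, timeout=5.0):
            self.timeout = timeout
            # assume we've heard from everyone at startup
            self.last_seen = {p: time.monotonic() for p in peers}

        def on_heartbeat(self, peer):
            # called whenever a heartbeat message arrives from `peer`
            self.last_seen[peer] = time.monotonic()

        def suspected(self):
            # peers silent past the timeout: slow OR dead, we can't tell,
            # so we simply treat them as dead and exclude them
            now = time.monotonic()
            return [p for p, t in self.last_seen.items()
                    if now - t > self.timeout]

    fd = FailureDetector(["node-1", "node-2"], timeout=2.0)
    fd.on_heartbeat("node-1")
    time.sleep(2.5)
    print(fd.suspected())  # both nodes: neither has sent a beat within 2s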
I enjoy reading on distributed systems, but does anyone have suggestions for practical blogs/books/articles on the subject with directly applicable information?
Nice to see this, and completely agree with the sentiment. There is a lot of academic stuff, which is great but not always easy to evaluate or apply, and there is also a lot of harrumphing about how distributed systems are really hard without going into much in the way of details about how to actually evaluate and approach the problem.
> [...] there is also a lot of harrumphing about how distributed systems are really hard without going into much in the way of details about how to actually evaluate and approach the problem.
So true. Also – I've been involved in a few contexts, tangentially or otherwise, where it seemed to me that there was a root problem that no one quite understood. For reasons, that meant it had to be solved with a distributed system. Now there are more problems, because "distributed systems are really hard."
(N.B. I'm not deriding that phrase, or doubting its accuracy; I'm just saying that it's also used as a way to shy away from not having a clue what problem you're trying to solve in the first place.)
Perhaps that's one of the consequences of getting information about programming topics mostly through Twitter or other quick, one-take sources ;)
More seriously, when I started digging into distributed systems I just started picking up papers, reading, and asking questions. It was definitely a meandering path until, maybe after a year, I synthesized what it was all about. Reading Paxos one week and Dynamo the next made it tough to see the bigger picture. I think this survey post would have been really helpful to me back then.
Which is fine, but not everyone has the time or inclination for that. Sometimes academic papers leave out lots of important practical details, or don't make a good connection with how a system might look in the real world.
I think there's some room for a few good books that fill that space, although perhaps they exist and I'm not aware of them.
I have a question, hopefully it doesn't sound too stupid.
Basically my question is: do you have any recommendations for reading about "concurrency" but at a higher level than "concurrent programming" and less formal than "distributed systems" algorithms? Here's an attempt at explaining what I mean:
Suppose I worked for a company which mostly writes web apps in python. They don't write much multi-threaded code. They don't write native desktop graphical apps or do "systems programming". So books about "concurrent programming" with threads, or in go/scala/clojure/erlang etc seem, superficially, slightly irrelevant. Also, books about formal distributed systems algorithms seem, superficially, slightly irrelevant. But "concurrency" is still highly relevant! It's just at a higher level than is meant by "concurrency" in a programming language and it's not as formal or regular as is usually meant by "distributed systems".
What I mean is: the company handles concurrent HTTP requests resulting in concurrent database transactions; celery tasks might also be running at the same time; message-queue type problems come up frequently; they are increasingly tending towards a "service-oriented" architecture which is inherently concurrent; and, at the end of the day, all the higher-level "business logic" is some sort of emergent property of a complex, and concurrent, underlying system of apps communicating via HTTP, with asynchronous background tasks, etc. Wise senior programmers say: "Oh, you're going to need to apply some row-locking there.". Wide-eyed junior programmers say: "Oh yes, of course" and wonder, "What should I be reading to know this?".
What should they be reading?
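(To make that row-locking advice concrete, here's a hedged sketch of the classic lost-update case, assuming Postgres via psycopg2 and a hypothetical items(id, stock) table. Without the FOR UPDATE, two concurrent requests can both read stock = 1 and both "succeed"; with it, the second transaction blocks until the first commits:)

    import psycopg2

    def buy_item(conn, item_id):
        with conn:                        # commit on success, rollback on error
            with conn.cursor() as cur:
                # lock this item's row; a concurrent buy_item() waits here
                cur.execute(
                    "SELECT stock FROM items WHERE id = %s FOR UPDATE",
                    (item_id,),
                )
                (stock,) = cur.fetchone()  # assumes the item exists
                if stock <= 0:
                    raise RuntimeError("sold out")
                cur.execute(
                    "UPDATE items SET stock = %s WHERE id = %s",
                    (stock - 1, item_id),
                )

    conn = psycopg2.connect("dbname=shop")  # hypothetical connection string
    buy_item(conn, 42)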
Also, how the hell do people diagram this stuff in order to talk about it? Google searches tend to end up either in UML languages that seem to be unfashionable (but might in fact be just the right thing?) or in academic CS papers that seem to be uninfluential. But there must be something better than those pessimistic paragraphs from No Silver Bullet:
> Software is invisible and unvisualizable... The reality of software is not inherently embedded in space... As soon as we attempt to diagram software structure, we find it to constitute not one, but several, general directed graphs superimposed one upon another. The several graphs may represent the flow of control, the flow of data, patterns of dependency, time sequence, name-space relationships... In spite of progress in restricting and simplifying the structures of software, they remain inherently unvisualizable, and thus do not permit the mind to use some of its most powerful conceptual tools. This lack not only impedes the process of design within one mind, it severely hinders communication among minds.
Sometimes, in order to understand the higher-level concepts to a satisfying degree, there is no choice but to learn and understand the low-level stuff. I think Java Concurrency in Practice has everything you need (even if you're not using Java). It is aimed at the practitioner, but covers all the necessary background.