Also, remember that even building a "simple" web app (comprising a browser client, a web server, and a database) counts as building a distributed system. Concerns over event ordering, the lack of global time, and data consistency models are all alive and well even within boring line-of-business (LOB) systems.
To put it another way, distributed system does not necessarily imply a massive research cluster connected via MPI.
A nice survey of the topic is "Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey" by Xavier Défago, André Schiper, and Péter Urbán, 2004[1]. Interestingly, the unexpected fame of the CAP theorem, and especially its various misinterpretations, has done some harm to this area of distributed systems: people started believing that eventual consistency is the only way to meet the scalability requirements of their applications.
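To get a feel for the simplest family in that survey's taxonomy, here's a minimal, hypothetical sketch of a fixed-sequencer total order broadcast. All names are mine, and a real protocol would also have to handle sequencer failure; the point is just that one stamping process plus in-order delivery gives every receiver the same total order:

    import threading

    class Sequencer:
        # single process that stamps every broadcast with a global sequence number
        def __init__(self):
            self._next = 0
            self._lock = threading.Lock()

        def stamp(self, msg):
            with self._lock:
                seq, self._next = self._next, self._next + 1
            return seq, msg

    class Receiver:
        # delivers messages strictly in stamp order, buffering any gaps
        def __init__(self, deliver):
            self._deliver = deliver   # application callback
            self._expected = 0
            self._pending = {}        # out-of-order messages, keyed by seq

        def on_message(self, seq, msg):
            self._pending[seq] = msg
            while self._expected in self._pending:   # flush the consecutive run
                self._deliver(self._pending.pop(self._expected))
                self._expected += 1

    seqr = Sequencer()
    r = Receiver(print)
    a = seqr.stamp("a")   # seq 0
    b = seqr.stamp("b")   # seq 1
    r.on_message(*b)      # arrives out of order: buffered
    r.on_message(*a)      # gap filled: prints "a" then "b"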
On the JVM, the best group communication infrastructure is the terrific JGroups library[2]. On .NET (and soon the JVM, too) there's Vsync (Isis 2)[3]. For C/C++ (with Java and Python bindings), there's Spread[4]. All three are heavily inspired by the work of (or, in the case of Vsync/Isis, directly created by) Ken Birman of Cornell, and implement group membership and leader election, atomic multicast, failure detection, and more.
I can't speak for JGroups (definitely going to check that out tonight), but Spread and Vsync/ISIS both struck me as perfect examples of the meme of poor-quality academic code.
Spread is an #ifdef soup containing scary stuff like "#ifndef _REENTRANT" in its hello-world examples. Additionally, it has a weird, not-quite-BSD license that doesn't look compatible with the GPL. It also makes you either send them an SSH key to check the code out from Subversion, or send your email/name/company info just to get the download link working.
Vsync/ISIS is a single 50KLOC C# file that spawns upwards of a dozen threads on init.
IMHO there's room for someone with more of a software-engineering background to come in and clean them up (assuming that person or group knows enough about distributed systems to do so without introducing new problems).
> Spread and Vsync/ISIS both struck me as perfect examples of the meme of poor-quality academic code.
I was about to say that academic code is almost always poor quality, but that should never, ever apply to distributed systems code. If there's one thing we should see from academic distributed systems research, it's principled, well-constructed, solidly tested libraries, rather than wannabe commercial databases written by graduate students.
Let's say I wanted to prove to a future employer that I had learned how to build a distributed system, with a practical example, but my current employer doesn't require one. How could I go about doing this in a side project?
Implementing the Raft protocol[1] has been a popular activity over the last couple of years. Amazon's Dynamo paper[2] is also easy enough to grasp to build a prototype of a distributed key/value store.
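If you go the Dynamo route, the core partitioning idea (consistent hashing with virtual nodes) fits in a few lines. Here's a hedged sketch with names of my own invention; the actual paper layers preference lists, vector clocks, hinted handoff, and more on top of this:

    import hashlib
    from bisect import bisect_right

    def _hash(s):
        # map any string onto the ring's keyspace
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes=8):
            # each physical node gets several virtual points for better balance
            self.points = sorted(
                (_hash("%s#%d" % (n, i)), n)
                for n in nodes for i in range(vnodes)
            )

        def replicas_for(self, key, n=3):
            # a key lives on the first n distinct nodes clockwise from its hash
            idx = bisect_right(self.points, (_hash(key),))
            owners = []
            for _, node in self.points[idx:] + self.points[:idx]:
                if node not in owners:
                    owners.append(node)
                if len(owners) == n:
                    break
            return owners

    ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
    print(ring.replicas_for("user:42"))  # e.g. ['node-c', 'node-a', 'node-d']

The property that makes this worth building: adding or removing a node only moves the keys adjacent to its points on the ring, instead of reshuffling everything.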
You should dive into Mesosphere. The resources available are quite robust. With only a tenuous grasp of the larger concepts, I was able to architect a system with Mesos plus Chronos for job queueing and Marathon for container deployment, and moved some of my hobby projects to it. For less than $150 a month or so you could provision three primary and three worker boxes and get a start.
And once you have a real system, you can start looking into the principles of the underlying pieces such as Zookeeper etc.
Just out of curiosity... have you used JGroups before? I've heard negative things about it but wasn't sure if those were unfounded or maybe just early versions.
Any thoughts on how these compare to something like the Data Distribution Service (DDS) for group communication?
Yes, I have, and found it great (i.e. no more problems than with any other library I've used). It has been under (very) active development by the same developer for 17 years now, and he's also very responsive on the mailing list. JGroups serves as the foundation for JBoss Infinispan.
I don't know much about DDS (except that a company I used to work for bought the RTI implementation, and it was quite expensive).
Much of the theory of distributed systems is published at the Symposium on Principles of Distributed Computing (PODC). However, a lot of the emphasis there is on the asynchronous model. Results that hold for the asynchronous model generally hold for other models as well. The main point a layman needs to know about the asynchronous model is that a slow node cannot be distinguished from a dead node.
Implementers of distributed systems will generally ignore most results on the asynchronous model. A slow node might as well be a dead node, so we can use heartbeats and timeouts to exclude such nodes. Typically, robust distributed systems in industry follow some form of partially synchronous model, adapted to the environment they run in: a bare-metal LAN is reliable and almost synchronous when not overloaded; a WAN (the Internet) should be treated as close to the asynchronous model as is practical.
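To make that concrete, here's a minimal heartbeat-and-timeout sketch (class and method names are mine). Note that a "suspected" node is only suspected, never provably dead; the timeout constant is exactly where the partial-synchrony assumption sneaks in:

    import time

    class FailureDetector:
        def __init__(self, peers, timeout=5.0):
            self.timeout = timeout
            # assume we've heard from everyone at startup
            self.last_seen = {p: time.monotonic() for p in peers}

        def on_heartbeat(self, peer):
            # called whenever a heartbeat message arrives from `peer`
            self.last_seen[peer] = time.monotonic()

        def suspected(self):
            # peers silent past the timeout: slow OR dead, we can't tell,
            # so we simply treat them as dead and exclude them
            now = time.monotonic()
            return [p for p, t in self.last_seen.items()
                    if now - t > self.timeout]

    fd = FailureDetector(["node-1", "node-2"], timeout=2.0)
    fd.on_heartbeat("node-1")
    time.sleep(2.5)
    print(fd.suspected())  # both nodes: neither has sent a beat within 2s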
I enjoy reading on distributed systems, but does anyone have suggestions for practical blogs/books/articles on the subject with directly applicable information?
Nice to see this, and completely agree with the sentiment. There is a lot of academic stuff, which is great but not always easy to evaluate or apply, and there is also a lot of harrumphing about how distributed systems are really hard without going into much in the way of details about how to actually evaluate and approach the problem.
> [...] there is also a lot of harrumphing about how distributed systems are really hard without going into much in the way of details about how to actually evaluate and approach the problem.
So true. Also – I've been involved in a few contexts, tangentially or otherwise, where it seemed to me that there was a root problem that no one quite understood. For reasons, that meant it had to be solved with a distributed system. Now there are more problems, because "distributed systems are really hard."
(N.B. I'm not deriding that phrase, or doubting its accuracy; I'm just saying that it's also used as a way to shy away from not having a clue what problem you're trying to solve in the first place.)
Perhaps that's one of the consequences of getting information about programming topics mostly through Twitter or other quick, one-take sources ;)
More seriously, when I started digging into distributed systems I just started picking up papers, reading, and asking questions. It was definitely a meandering path until, maybe after a year, I synthesized what it was all about. Reading Paxos one week and Dynamo the next made it tough to see the bigger picture. I think this survey post would have been really helpful to me back then.
Which is fine, but not everyone has the time or inclination for that. Sometimes academic papers leave out lots of important practical details, or don't make a good connection with how a system might look in the real world.
I think there's some room for a few good books that fill that space, although perhaps they exist and I'm not aware of them.
I have a question, hopefully it doesn't sound too stupid.
Basically my question is: do you have any recommendations for reading about "concurrency" but at a higher level than "concurrent programming" and less formal than "distributed systems" algorithms? Here's an attempt at explaining what I mean:
Suppose I worked for a company which mostly writes web apps in python. They don't write much multi-threaded code. They don't write native desktop graphical apps or do "systems programming". So books about "concurrent programming" with threads, or in go/scala/clojure/erlang etc seem, superficially, slightly irrelevant. Also, books about formal distributed systems algorithms seem, superficially, slightly irrelevant. But "concurrency" is still highly relevant! It's just at a higher level than is meant by "concurrency" in a programming language and it's not as formal or regular as is usually meant by "distributed systems".
What I mean is: the company handles concurrent HTTP requests resulting in concurrent database transactions; celery tasks might also be running at the same time; message-queue type problems come up frequently; they are increasingly tending towards a "service-oriented" architecture which is inherently concurrent; and, at the end of the day, all the higher-level "business logic" is some sort of emergent property of a complex, and concurrent, underlying system of apps communicating via HTTP, with asynchronous background tasks, etc. Wise senior programmers say: "Oh, you're going to need to apply some row-locking there.". Wide-eyed junior programmers say: "Oh yes, of course" and wonder, "What should I be reading to know this?".
What should they be reading?
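(To make that row-locking advice concrete, here's a hedged sketch of the classic lost-update case, assuming Postgres via psycopg2 and a hypothetical items(id, stock) table. Without the FOR UPDATE, two concurrent requests can both read stock = 1 and both "succeed"; with it, the second transaction blocks until the first commits:)

    import psycopg2

    def buy_item(conn, item_id):
        with conn:                        # commit on success, rollback on error
            with conn.cursor() as cur:
                # lock this item's row; a concurrent buy_item() waits here
                cur.execute(
                    "SELECT stock FROM items WHERE id = %s FOR UPDATE",
                    (item_id,),
                )
                (stock,) = cur.fetchone()  # assumes the item exists
                if stock <= 0:
                    raise RuntimeError("sold out")
                cur.execute(
                    "UPDATE items SET stock = %s WHERE id = %s",
                    (stock - 1, item_id),
                )

    conn = psycopg2.connect("dbname=shop")  # hypothetical connection string
    buy_item(conn, 42)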
Also, how the hell do people diagram this stuff in order to talk about it? Google searches tend to end up either in UML languages that seem to be unfashionable (but might in fact be just the right thing?) or in academic CS papers that seem to be uninfluential. But there must be something better than those pessimistic paragraphs from No Silver Bullet:
> Software is invisible and unvisualizable... The reality of software is not inherently embedded in space... As soon as we attempt to diagram software structure, we find it to constitute not one, but several, general directed graphs superimposed one upon another. The several graphs may represent the flow of control, the flow of data, patterns of dependency, time sequence, name-space relationships... In spite of progress in restricting and simplifying the structures of software, they remain inherently unvisualizable, and thus do not permit the mind to use some of its most powerful conceptual tools. This lack not only impedes the process of design within one mind, it severely hinders communication among minds.
Sometimes, in order to understand the higher-level concepts to a satisfying degree, there is no choice but to learn and understand the low-level stuff. I think Java Concurrency in Practice has everything you need (even if you're not using Java). It is aimed at the practitioner, but covers all the necessary background.