Call me maybe: etcd and Consul

philips · on June 12, 2014

Thanks to aphyr for doing this sort of testing. It is important that we not only verify etcd with our own internal testing but also have dedicated third party feedback. From the beginning we have wanted something that was simple and worked correctly.

It is also great to see etcd showing up in lots of interesting projects like skydns and kubernetes. I think we have built something that is a great building block not just for CoreOS but for the OSS community at large.

Thanks to everyone who has helped get the project to where it is today; there is a bright future ahead.

dinedal · on June 12, 2014

I think the most surprising fact, despite all the amazing work here, is that Comcast helps pay for this kind of research!

Other than that, it's great that we are getting some good results after serious vetting for more modern replacements to ZK. I may not like ZK's crufty-ness, but I could always trust it. Now with these results, I am seriously going to consider Consul as a replacement.

hallmark · on June 20, 2014

In fact, Comcast has a formal program to fund research grants and open source work. There is a lot more "real" technology in place here at Comcast than one may expect for a telecommunications and entertainment provider.

http://techfund.comcast.com/

The ideal project would be something used within Comcast that doesn't otherwise have a corporate sponsor.

antirez · on June 12, 2014

This sounds like a report of a bug, but I believe this is not the actual story. It is more a report of a design tradeoff: the authors of those CP systems completely understand what happens, but were not happy to pay this performance price for reads. It is one thing to have a data store with very limited write performance that is very fast when you need to read, and another to have a data store where both writes and reads are very slow. However, once you read potentially stale data from nodes, many of the advantages of having a CP system are gone. IMHO reverting those systems to a default where reads are applied to the state machine like writes is the sanest thing to do, even if options to allow potentially stale reads are also useful in some contexts.

lomnakkus · on June 13, 2014

> This sounds like a report of a bug, but I believe this is not the actual story. It is more a report of a design tradeoff: the authors of those CP systems completely understand what happens, but were not happy to pay this performance price for reads

If the authors were aware of these issues then the documentation was dangerously misleading[1] and they should be docked points for that.

[1] As reported by aphyr, haven't read through it all myself. I'm thinking primarily of the labeling of "read from leader without going through log" as "consistent" bit.

antirez · on June 13, 2014

That's why I think this is a design decision in both cases:

In one of these products (etcd, if I remember correctly) there was a clear statement in the documentation about these semantics, and anyway, anyone who implements Raft knows that for reads to be consistent they need to go through the same path as writes. In the Raft paper you can find a whole section about this.

If you check the paper, the following is clearly stated:

Leaders can't reply to read queries without doing additional checks otherwise the reads are not linearizable.

For the reads to be linearizable, the following two things must be performed by leaders.

1) Commit a NOP at the start of its term, which is not a problem from a performance point of view. The problem is "2".

2) A leader needs to check if it is still the leader before every read, and this requires contacting a majority. That's the performance problem of linearizable reads, because you need to pay a latency equal to the latency of the slowest reply of the N/2+1 acks you need.

However note that even linearizable reads don't require fsync() to be called, so they are still better than writes.
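A minimal sketch of the two steps above, as a toy in-memory simulation. The names (`Peer`, `Leader`, `linearizable_read`, and so on) are hypothetical and are not etcd's or Consul's API, and the heartbeat round is simplified: a real Raft leader would also step down as soon as it sees a higher term.

```python
from dataclasses import dataclass, field


class NotLeaderError(Exception):
    """Raised when a node cannot prove it is still the leader."""


@dataclass
class Peer:
    current_term: int        # highest term this follower has seen
    reachable: bool = True   # False simulates a network partition

    def ack_heartbeat(self, leader_term: int) -> bool:
        # A follower only acks a leader whose term is not older than its own.
        return self.reachable and leader_term >= self.current_term


@dataclass
class Leader:
    term: int
    peers: list
    state: dict = field(default_factory=dict)
    noop_committed: bool = False

    def has_majority(self) -> bool:
        cluster_size = len(self.peers) + 1
        acks = 1 + sum(p.ack_heartbeat(self.term) for p in self.peers)  # 1 = self
        return acks >= cluster_size // 2 + 1

    def commit_noop(self) -> None:
        # Step 1: commit a no-op at the start of the term.
        if not self.has_majority():
            raise NotLeaderError("cannot commit no-op without a majority")
        self.noop_committed = True

    def stale_read(self, key):
        # Fast path: answer from local state with no leadership check.
        return self.state.get(key)

    def linearizable_read(self, key):
        # Step 2: confirm leadership with a majority before every read.
        if not self.noop_committed or not self.has_majority():
            raise NotLeaderError("leadership not confirmed")
        return self.state.get(key)


# Five-node cluster: one leader plus four peers, all in term 1.
peers = [Peer(current_term=1) for _ in range(4)]
old_leader = Leader(term=1, peers=peers)
old_leader.commit_noop()
old_leader.state["x"] = "old"
before = old_leader.linearizable_read("x")

# Partition: three peers move on to term 2 under a new leader.
for p in peers[:3]:
    p.current_term = 2

stale = old_leader.stale_read("x")  # still answers from local state
try:
    old_leader.linearizable_read("x")
    refused = False
except NotLeaderError:
    refused = True
```

The deposed leader keeps answering `stale_read`, while `linearizable_read` fails because it can no longer collect a majority of acks. That extra round trip is the latency cost described in "2".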

brunov · on June 12, 2014

What's your opinion on exposing the option of stale vs. consistent read in the API? I can see cases where I'd be ok with a stale read while for others I'd like the most up-to-date value.

antirez · on June 12, 2014

That makes a lot of sense, there are definitely use cases where reading a past value is viable, especially considering the big difference in performance between the two kinds of reads.

Dave_Rosenthal · on June 12, 2014

Wow. I thought of etcd as taking consistency very seriously. Aphyr's discoveries are indeed quite surprising: "The very first test I ran with reported a linearizability failure. I was so surprised I spent another week double-checking Knossos and Jepsen, then writing my own etcd client, to make sure I hadn’t made a mistake. Sure enough, etcd’s registers are not linearizable."

leorocky · on June 12, 2014

I'm not sure what you've added here other than to exclaim surprise. Bugs can be surprising, bugs are usually not intentional. Given the fast response by the etcd team, they seem to be taking consistency very seriously.

aaronblohowiak · on June 12, 2014

This was not a bug, it was a design decision (source: https://github.com/coreos/etcd/issues/741). You can get stale data in some circumstances because they wanted to avoid the performance penalty of ensuring the latest really is the latest value from a quorum of followers.

teraflop · on June 12, 2014

There's the design decision, and then there's the incorrect documentation that says reads with "consistent=true" are guaranteed to return the latest value.

https://github.com/coreos/etcd/blob/master/Documentation/api...

evidencepi · on June 12, 2014

etcd does provide linearizability with regard to the logical clock index.

What you found is not a bug, but a design decision.

Maybe the coreos people just should not assume that linearizability means the same thing to all people. And they should document it clearly.

sagichmal · on June 12, 2014

    > etcd does provide linearizability with regard to the 
    > logical clock index.

https://twitter.com/aphyr/status/477210387796865024

    > No, it really doesn't. Reads are not monotonic w.r.t 
    > indexes.

evidencepi · on June 12, 2014

Quote from the wiki: A history is linearizable if:

    its invocations and responses can be reordered to yield a sequential history
    that sequential history is correct according to the sequential definition of the object
    if a response preceded an invocation in the original history, it must still precede it in the sequential reordering.
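For a tiny history on a single register, these three conditions can be checked by brute force. What follows is only a sketch under stated assumptions: `Op`, `is_linearizable`, and the timestamp fields are hypothetical names, every operation is assumed to have completed, and trying all permutations is only feasible for a handful of operations (this is not how Knossos works).

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    kind: str         # "write" or "read"
    value: object     # value written, or value the read returned
    invoked: float    # wall-clock time of the invocation
    completed: float  # wall-clock time of the response


def respects_real_time(order) -> bool:
    # Condition 3: if a response preceded an invocation in the original
    # history, that operation must still come first in the reordering.
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.completed < earlier.invoked:
                return False
    return True


def is_valid_register_history(order, initial=None) -> bool:
    # Condition 2: sequentially, a read returns the last written value.
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(history, initial=None) -> bool:
    # Condition 1: look for any sequential reordering that satisfies 2 and 3.
    return any(
        respects_real_time(order) and is_valid_register_history(order, initial)
        for order in permutations(history)
    )


# The read starts after write(2) has completed, yet returns 1.
stale_history = [
    Op("write", 1, invoked=0, completed=1),
    Op("write", 2, invoked=2, completed=3),
    Op("read", 1, invoked=4, completed=5),
]

# The read overlaps write(2), so returning 1 is still allowed.
concurrent_history = [
    Op("write", 1, invoked=0, completed=1),
    Op("write", 2, invoked=2, completed=5),
    Op("read", 1, invoked=3, completed=4),
]

stale_ok = is_linearizable(stale_history)
concurrent_ok = is_linearizable(concurrent_history)
```

In `stale_history` the only ordering allowed by the third condition is write(1), write(2), read, and that ordering fails the second condition because the read returns 1. In `concurrent_history` the read may be placed before write(2), so a valid ordering exists.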

Please forgive my stupidness. But could you tell me which one it violates?

mononcqc · on June 13, 2014

It is possible that the transaction log's last entry contains a value that is not yet considered 'committed' at large, as per the paper. This transaction needs to be confirmed later on when an additional transaction comes along and supersedes it, confirming it was committed by all.

It is also possible for a leader to be demoted during a split, where the log of that partial transaction will not be counted as final. The new leader at this point can then force a truncation of a follower's log, or ignore it entirely.

The entry you have read from a node that was thought to still be a master without first consulting a majority is therefore possibly a bad write that won't be part of history as far as consensus goes.

This is explained later in the original Raft paper, and this is why you need to read from the quorum to be able to guarantee consistency under all circumstances, among other problem cases.
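A rough sketch of that scenario with a three-node toy cluster. All names here (`Node`, `replicate`, `last_appended_value`) are hypothetical, and `replicate` is a deliberate simplification: real Raft truncates a follower's log only from the first conflicting index, whereas this sketch copies the whole log.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.log = []          # list of (term, value) entries
        self.commit_index = 0  # number of leading entries known to be committed

    def append(self, term, value):
        self.log.append((term, value))

    def last_appended_value(self):
        # Unsafe read: returns the newest local entry, committed or not.
        return self.log[-1][1] if self.log else None

    def committed_value(self):
        # Safe with respect to the log: only looks at committed entries.
        return self.log[self.commit_index - 1][1] if self.commit_index else None


def replicate(leader, followers, cluster_size):
    """Copy the leader's log to the reachable followers; commit on a majority."""
    for f in followers:
        f.log = list(leader.log)  # conflicting uncommitted entries are discarded
    if 1 + len(followers) >= cluster_size // 2 + 1:
        leader.commit_index = len(leader.log)
        for f in followers:
            f.commit_index = leader.commit_index


a, b, c = Node("a"), Node("b"), Node("c")

# Term 1: a is leader and v1 is committed everywhere.
a.append(1, "v1")
replicate(a, [b, c], cluster_size=3)

# a accepts v2 but is partitioned before it can replicate it.
a.append(1, "v2")
dirty = a.last_appended_value()

# Term 2: b is elected by b and c and commits v3.
b.append(2, "v3")
replicate(b, [c], cluster_size=3)

# The partition heals and the new leader overwrites a's log.
replicate(b, [a, c], cluster_size=3)
after = a.committed_value()
surviving_values = {value for node in (a, b, c) for _, value in node.log}
```

A client that read `dirty` from node a saw `v2`, a value that no node holds once the partition heals.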

rubyn00bie · on June 12, 2014

Small pet peeve, please for the love of god and people stop with the "call me maybe" titles... they aren't cute, funny, or anything but a waste of characters. By the title alone, I assume the content to be juvenile (despite already knowing the content of _that_ blog is anything but).

syntern · on June 12, 2014

It is not about being funny, "call me maybe" is very relevant in the eventual consistency world. And if you know any other article by aphyr, you jump immediately to read it.

owyn · on June 12, 2014

All of his posts about Jepsen related stuff are prefixed with that phrase. It's a reference to the song "Call Me Maybe" by Carly Rae Jepsen... because his testing tool is CALLED Jepsen. Some of his other posts also have prefixes, indicating it's part of a series, like "Clojure from the ground up:"

ownagefool · on June 13, 2014

The series is also about distributed systems where he tests how they fail. Call me maybe is surprisingly apt for what he's doing.

rubyn00bie · on June 12, 2014

"It is not about being funny, 'call me maybe' is very relevant in the eventual consistency world."

That's not fact, please don't state it as such. That's your opinion on modern day vernacular influencing an unrelated field/topic.

Perhaps you didn't actually read what I wrote, but I said I'm aware of how good their content is/can be. My issue is that a blog of that caliber is using something in our modern vernacular that has been beat to death and adds no actual value.

It'd probably do you best to read what people write instead of putting words in their text or assuming things outside of the scope of what they said...

kisielk · on June 12, 2014

Isn't someone going by the handle "rubyn00bie" criticizing an article with "call me maybe" in the title as being juvenile something akin to the pot calling the kettle black?

syntern · on June 12, 2014

You may criticize the author for his choice of words. You may disagree with what I state or don't state. But telling people that they should add value, while all you do is trash them, is a bit controversial and it has no place here.

On the actual critique: if you have worked with an eventually consistent database, a 'Call' record's presence is really a great pun on the 'call me maybe' phrase. I'm really sorry if you don't appreciate that part either.

keypusher · on June 12, 2014

That's the name of his series investigating distributed systems.