Call me maybe: etcd and Consul (aphyr.com)
175 points by nwjsmith on June 12, 2014 | 24 comments


Thanks to aphyr for doing this sort of testing. It is important that we not only verify etcd with our own internal testing but also have dedicated third party feedback. From the beginning we have wanted something that was simple and worked correctly.

It is also great to see etcd showing up in lots of interesting projects like skydns and kubernetes. I think we have built something that is not just a great building block for CoreOS but for the OSS community at large.

Thanks to everyone who has helped get the project to where it is today; there is a bright future ahead.


I think the most surprising fact, despite all the amazing work here, is that Comcast helps pay for this kind of research!

Other than that, it's great that we are getting some good results after serious vetting for more modern replacements to ZK. I may not like ZK's crufty-ness, but I could always trust it. Now with these results, I am seriously going to consider Consul as a replacement.


In fact, Comcast has a formal program to fund research grants and open source work. There is a lot more "real" technology in place here at Comcast than one may expect for a telecommunications and entertainment provider.

http://techfund.comcast.com/

The ideal project would be something used within Comcast that doesn't otherwise have a corporate sponsor.


This sounds like a report of a bug, but I believe this is not the actual story. It is more a report of a design tradeoff: the authors of those CP systems completely understand what happens, but were not happy to pay this performance price for reads. It is one thing to have a data store with very limited write performance that is very fast for reads; it is another to have a data store where both writes and reads are very slow. However, once you read potentially stale data from nodes, many of the advantages of having a CP system are gone. IMHO, reverting those systems to a default where reads are applied to the state machine like writes is the sanest thing to do, even if options that allow potentially stale reads are also useful in some contexts.


> This sounds like a report of a bug, but I believe this is not the actual story. It is more a report of a design tradeoff: the authors of those CP systems completely understand what happens, but were not happy to pay this performance price for reads

If the authors were aware of these issues then the documentation was dangerously misleading[1] and they should be docked points for that.

[1] As reported by aphyr; I haven't read through it all myself. I'm thinking primarily of the bit where "read from leader without going through log" is labeled "consistent".


That's why I think this is a design decision in both cases:

In one of these products (etcd if I remember correctly) there was a clear statement in the documentation about these semantics, and anyway, anyone who implements Raft knows that for reads to be consistent they need to go through the same path as writes. The Raft paper has a whole section about this.

If you check the paper, the following is clearly stated:

Leaders can't reply to read queries without doing additional checks otherwise the reads are not linearizable.

For the reads to be linearizable, the following two things must be performed by leaders.

1) Commit a NOP at the start of its term, which is not a problem from a performance point of view. The problem is "2".

2) A leader needs to check that it is still the leader before every read, and this requires contacting a majority. That's the performance problem with linearizable reads: each read pays a latency equal to that of the slowest reply among the N/2+1 acks you need.

However, note that even linearizable reads don't require fsync() to be called, so they are still cheaper than writes.
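To make point 2 concrete, here is a minimal Python sketch of the two read paths. It is not etcd's or Consul's actual code: `Node`, `ack_heartbeat`, `linearizable_read`, and `stale_prone_read` are hypothetical names, and the sequential loop stands in for what would be parallel RPCs in a real implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy Raft-like node; only the fields needed for the read path."""
    node_id: int
    term: int = 1
    reachable: bool = True          # False simulates a network partition
    state: dict = field(default_factory=dict)

    def ack_heartbeat(self, leader_term: int) -> bool:
        # A follower acknowledges only if it is reachable and has not
        # already seen a newer term (that is, a newer leader).
        return self.reachable and leader_term >= self.term


def quorum_size(cluster_size: int) -> int:
    """The N/2+1 acks mentioned above."""
    return cluster_size // 2 + 1


def linearizable_read(leader: Node, followers: list, key: str):
    """Serve a read only after a majority confirms the leader's term."""
    acks = 1  # the leader counts itself
    for follower in followers:
        if follower.ack_heartbeat(leader.term):
            acks += 1
    if acks < quorum_size(len(followers) + 1):
        raise RuntimeError("cannot confirm leadership; refusing to read")
    return leader.state.get(key)


def stale_prone_read(leader: Node, key: str):
    """Answer from local state with no check: fast, but possibly stale."""
    return leader.state.get(key)
```

If the followers have already moved on to term 2, `linearizable_read` raises on the old term-1 leader, while `stale_prone_read` happily returns whatever that leader last saw.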


What's your opinion on exposing the option of stale vs. consistent read in the API? I can see cases where I'd be ok with a stale read while for others I'd like the most up-to-date value.


That makes a lot of sense. There are definitely use cases where reading a past value is viable, especially considering the big performance difference between the two kinds of reads.
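One way such an option could look, as a hedged sketch only: `ReadMode`, `ToyStore`, and the `confirm_leadership` callback are invented for illustration and are not the real etcd or Consul API. The default is the safe mode, so a caller has to opt in to stale reads explicitly.

```python
from enum import Enum


class ReadMode(Enum):
    STALE = "stale"                # answer from local state, no quorum round trip
    LINEARIZABLE = "linearizable"  # confirm leadership with a majority first


class ToyStore:
    """Hypothetical client-facing store; not a real etcd or Consul client."""

    def __init__(self, local_state, confirm_leadership):
        self._state = local_state
        # Callable returning True if a majority still accepts this node as leader.
        self._confirm_leadership = confirm_leadership

    def get(self, key, mode=ReadMode.LINEARIZABLE):
        # Only the linearizable mode pays for the leadership check.
        if mode is ReadMode.LINEARIZABLE and not self._confirm_leadership():
            raise RuntimeError("lost quorum; cannot serve a linearizable read")
        return self._state.get(key)
```

A caller that can tolerate an old value would write `store.get("k", ReadMode.STALE)`; everyone else gets the slower, checked path by default.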


Wow. I thought of etcd as taking consistency very seriously. Aphyr's discoveries are indeed quite surprising: "The very first test I ran with reported a linearizability failure. I was so surprised I spent another week double-checking Knossos and Jepsen, then writing my own etcd client, to make sure I hadn’t made a mistake. Sure enough, etcd’s registers are not linearizable."


I'm not sure what you've added here other than to exclaim surprise. Bugs can be surprising, bugs are usually not intentional. Given the fast response by the etcd team, they seem to be taking consistency very seriously.


This was not a bug, it was a design decision (source: https://github.com/coreos/etcd/issues/741). You can get stale data in some circumstances because they wanted to avoid the performance penalty of confirming with a quorum of followers that the latest value really is the latest.


There's the design decision, and then there's the incorrect documentation that says reads with "consistent=true" are guaranteed to return the latest value.

https://github.com/coreos/etcd/blob/master/Documentation/api...


etcd does provide linearizability with regard to the logical clock index.

What you found is not a bug, but a design decision.

Maybe the CoreOS people should just not assume that linearizability means the same thing to everyone. And they should document it clearly.


    > etcd does provide linearizability with regard to the 
    > logical clock index.
https://twitter.com/aphyr/status/477210387796865024

    > No, it really doesn't. Reads are not monotonic w.r.t 
    > indexes.


Quote from the wiki: A history is linearizable if:

    its invocations and responses can be reordered to yield a sequential history
    that sequential history is correct according to the sequential definition of the object
    if a response preceded an invocation in the original history, it must still precede it in the sequential reordering.

Please forgive my stupidity, but could you tell me which one it violates?
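One way to make the question concrete is a brute-force check of the three quoted conditions on a tiny single-register history. This is only a sketch under assumptions: `Op`, the integer timestamps, and the example history are made up for illustration, and trying every permutation is only feasible for a handful of operations (it is not how Knossos works).

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Op(NamedTuple):
    invoke: int            # time the client sent the request
    respond: int           # time the client received the response
    kind: str              # "write" or "read"
    value: Optional[str]   # value written, or value the read returned


def respects_real_time(order) -> bool:
    # Condition 3: if one op responded before another was invoked,
    # it must also come first in the sequential order.
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.respond < earlier.invoke:
                return False
    return True


def register_semantics_ok(order, initial=None) -> bool:
    # Condition 2: every read returns the most recent write in the order.
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(history) -> bool:
    # Condition 1: look for any sequential reordering that satisfies 2 and 3.
    return any(
        respects_real_time(order) and register_semantics_ok(order)
        for order in permutations(history)
    )


# A stale read: "b" was fully written before the read started, yet it saw "a".
stale_history = [
    Op(0, 1, "write", "a"),
    Op(2, 3, "write", "b"),
    Op(4, 5, "read", "a"),
]
```

For `stale_history` no ordering works: the only order that respects real time is write a, write b, read a, and that breaks the register's sequential definition, while every order that makes the read correct moves it ahead of a write that had already completed. So a stale read fails the second and third conditions taken together.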


It is possible that the transaction log's last entry contains a value that is not yet considered 'committed' at large, as per the paper. This transaction needs to be confirmed later on, when an additional transaction comes and supersedes it, confirming it was committed by all.

It is also possible for a leader to be demoted during a split, where the log of that partial transaction will not be counted as final. The new leader at this point can then force a truncation of a follower's log, or ignore it entirely.

The entry you have read from a node that was thought to still be a master without first consulting a majority is therefore possibly a bad write that won't be part of history as far as consensus goes.

This is explained later in the original Raft paper, and this is why you need to read from the quorum to be able to guarantee consistency under all circumstances, among other problem cases.
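The scenario above can be sketched in a few lines of Python. This is a simplification under stated assumptions: log entries are `(term, command)` tuples, `commit_index` and `reconcile` are hypothetical helper names, and Raft's extra rule about only committing entries from the current term is left out.

```python
def commit_index(match_counts, cluster_size):
    """Number of leading log entries stored on a majority of nodes.

    match_counts holds, for every node including the leader, how many
    of the leader's entries that node is known to have.
    """
    ranked = sorted(match_counts, reverse=True)
    return ranked[cluster_size // 2]


def reconcile(follower_log, leader_log):
    """Return (new_log, discarded).

    The node keeps the prefix it shares with the new leader, drops its own
    entries past the first disagreement, and adopts the leader's suffix.
    """
    keep = 0
    while (keep < len(follower_log) and keep < len(leader_log)
           and follower_log[keep] == leader_log[keep]):
        keep += 1
    discarded = follower_log[keep:]
    return follower_log[:keep] + leader_log[keep:], discarded


# Old leader appended "x=2" in term 1 but was partitioned before replicating it.
old_leader_log = [(1, "x=1"), (1, "x=2")]
# Only the first entry reached a majority of the 3 nodes.
committed = commit_index([2, 1, 1], cluster_size=3)

# The new leader, elected in term 2, never saw "x=2".
new_leader_log = [(1, "x=1"), (2, "x=3")]
new_log, discarded = reconcile(old_leader_log, new_leader_log)
```

Here `committed` is 1, so "x=2" was never part of the agreed history, and `discarded` ends up holding exactly that entry once the old leader rejoins. A client that was served "x=2" without a quorum check read a value that consensus later threw away.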


Small pet peeve, please for the love of god and people stop with the "call me maybe" titles... they aren't cute, funny, or anything but a waste of characters. By the title alone, I assume the content to be juvenile (despite already knowing the content of _that_ blog is anything but).


It is not about being funny, "call me maybe" is very relevant in the eventual consistency world. And if you know any other article by aphyr, you jump immediately to read it.


All of his posts about Jepsen-related stuff are prefixed with that phrase. It's a reference to the song "Call Me Maybe" by Carly Rae Jepsen... because his testing tool is CALLED Jepsen. Some of his other posts also have prefixes, indicating they're part of a series, like "Clojure from the ground up:"


The series is also about distributed systems where he tests how they fail. Call me maybe is surprisingly apt for what he's doing.


"It is not about being funny, 'call me maybe' is very relevant in the eventual consistency world."

That's not fact, please don't state it as such. That's your opinion on modern day vernacular influencing an unrelated field/topic.

Perhaps you didn't actually read what I wrote, but I said I'm aware of how good their content is/can be. My issue is that a blog of that caliber is using something in our modern vernacular that has been beaten to death and adds no actual value.

It'd probably do you best to read what people write instead of putting words in their text or assuming things outside of the scope of what they said...


Isn't someone going by the handle "rubyn00bie" criticizing an article with "call me maybe" in the title as being juvenile something akin to the pot calling the kettle black?


You may criticize the author for his choice of words. You may disagree with what I state or don't state. But telling people that they should add value while all you do is trash them is a bit contradictory, and it has no place here.

On the actual critique: if you have worked with an eventually consistent database, the presence of a 'Call' record really is a great pun on the 'call me maybe' phrase. I'm really sorry if you don't appreciate that part either.


That's the name of his series investigating distributed systems.



