Hmm. Thank you for the datapoint. It’s why I scaled the bet down to $100 for 200ms.
I think it’s worth uncovering whether a 100ms delay could result in an outage. If I were on call, it’d be hard to sleep knowing that was true.
The root claim is of course that disabling NDELAY can result in an outage. It still seems $200-unlikely that this could be true. Certainly it might cause performance problems, but the claim was reliability. Outages would put it firmly in the “unreliable” section of the Venn diagram.
My claim about 1min leader reelections is admittedly more suspicious. It’s surprising the reelections caused outages. But I suppose if there were a lot of long-running operations that needed a total order, frequent reelections would hose that.
In fairness, I don't know if we kept the default. I'm responding to two independent things at this point: first, there are definitely systems where 200ms delays have rippling impacts, and second, leader elections aren't always benign.
(Consul would, I'm sure, converge eventually regardless of the election frequency, but that doesn't mean everything that relies on Consul will tolerate those delays).
I don't have much of a take here, beyond that I don't think you can extrapolate as much from what's on the 6.824 pages as you might have done here. Certainly, in a system where 200ms is the difference between "healthy" and "not healthy" status on a peer relationship, I'd think you'd want Nagle disabled. But I haven't thought carefully about this, or looked that closely at the typical packet flow between Consul nodes. I could be wrong about all of this; more reason not to give me any money.
Later
Per the comment upthread, I haven't even bothered to check which parts of this packet flow are even TCP to begin with.
I think it’s worth uncovering whether a 100ms delay could result in an outage. If I were on call, it’d be hard to sleep knowing that was true.
The root claim is of course that disabling NDELAY can result in an outage. It still seems $200-unlikely that this could be true. Certainly it might cause performance problems, but the claim was reliability. Outages would put it firmly in the “unreliable” section of the Venn diagram.
My claim about 1min leader reelections is admittedly more suspicious. It’s surprising the reelections caused outages. But I suppose if there were a lot of long-running operations that needed a total order, frequent reelections would hose that.