
Nagle's algorithm also really screws with distributed systems - you are going to be sending quite a few packets with time bounds, and you REALLY don't want them getting Nagled.

In fact, Nagle's algorithm is a big part of why a lot of programmers writing distributed systems think that datacenter networks are unreliable.
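
The classic failure mode is a write-write-read pattern on a connection with Nagle left on: the second small write sits in the kernel until the peer's (possibly delayed) ACK comes back. Here's a minimal Go sketch of the opt-out; the peer address is a placeholder, not a real endpoint.

    package main

    import (
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "10.0.0.2:9000") // placeholder peer address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // Disable Nagle's algorithm so the second small write below goes out
        // immediately instead of waiting for the ACK of the first. (Go already
        // defaults TCP connections to no-delay; many other stacks don't.)
        if err := conn.(*net.TCPConn).SetNoDelay(true); err != nil {
            log.Fatal(err)
        }

        conn.Write([]byte("request header")) // small write #1
        conn.Write([]byte("request body"))   // small write #2: the one Nagle would hold back

        buf := make([]byte, 1024)
        n, _ := conn.Read(buf) // the time-bounded reply we don't want Nagled
        log.Printf("got %d bytes", n)
    }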



I don't think this is correct. 6.824 emphasizes reliability over latency. They mention it in several places: https://pdos.csail.mit.edu/6.824/labs/guidance.html

> It should be noted that tweaking timeouts rarely fixes bugs, and that doing so should be a last resort. We frequently see students willing to keep making arbitrary tweaks to their code (especially timeouts) rather than following a careful debugging process. Doing this is a great way to obscure underlying bugs by masking them instead of fixing them; they will often still show up in rare cases, even if they appear fixed in the common case.

> In particular, in Raft, there are wide ranges of timeouts that will let your code work. While you CAN pick bad timeout values, it won't take much time to find timeouts that are functional.

Their unit tests are quite worthwhile to read, if only to absorb how many ways latency assumptions can bite you.

It's true that in the normal case, it's good to have low latency. But correctly engineered distributed systems won't reorganize themselves due to a ~200ms delay.

To put it another way, if a random 200ms fluctuation causes service disruptions, your system probably wasn't going to work very well to begin with. Blaming it on Nagle's algorithm is a punt.


In my decades of experience in telco, capital markets, and core banking, unexplained latency spikes of hundreds of ms are usually analyzed to death because they can have ripple effects. I’ve had 36-hour severity-1 incidents, with multiple VPs taking notes on 24/7 conference calls, when a distributed system started showing latency spikes in the 400ms range.

No, the system isn’t going haywire, but 200-400ms is concerning inside a datacenter for core apps.

But let’s forget IT apps, let’s talk about the network. In a network 200ms is catastrophic.

Presumably you know BGP is the very popular distributed system that converges Internet routes?

Inside a datacenter, Bidirectional Forwarding Detection (BFD) is used to drop BGP convergence times to sub-second if you’re using BGP as an IGP. BFD is also useful with other protocols, but anyway. It has heartbeats of 100-300ms. If the network hiccups for 3x that interval, BFD will declare the session down and trigger a round of convergence. This is essential in core networks and telco 4G/5G transport networks.

Of course, flapping can be the consequence of setting too low an interval. Tradeoffs.
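
To make the detection-time arithmetic concrete, a small Go sketch (the timer values are illustrative, not from any particular deployment):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // Illustrative BFD-style timers: hello interval and detect multiplier.
        interval := 300 * time.Millisecond
        multiplier := 3

        // If no hello arrives within interval * multiplier, the session is declared
        // down and the routing protocol above it (e.g. BGP) starts reconverging.
        detectTime := time.Duration(multiplier) * interval
        fmt.Printf("session declared down after %v of silence\n", detectTime) // 900ms

        // A single ~200ms stall eats a large fraction of that budget, which is why
        // latency spikes of that size get analyzed to death on core networks.
    }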

Back to the original point, I’ve contributed to the code of equity and bond trading apps, telco apps, core banking systems. And cloud/Kubernetes systems. All RPC distributed systems. Every. Single. One. That performed well… For 30 years! Has enabled TCP_NODELAY. Except when serving up large amounts of streaming data. And the reason fundamentally is that most of the time you have less control over client settings (delayed TCP acks), so it’s easier to control the server.
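
Concretely, the server-side fix is just a per-connection socket option. A rough Go sketch (port and handler are placeholders): set the option on each accepted connection, since that’s the end you actually control.

    package main

    import (
        "log"
        "net"
    )

    func handle(conn net.Conn) {
        defer conn.Close()
        // ... serve small RPC frames ...
    }

    func main() {
        ln, err := net.Listen("tcp", ":9000") // placeholder port
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Print(err)
                continue
            }
            // Enable TCP_NODELAY on the side you control; you usually can't do
            // anything about the client's delayed-ACK behavior.
            if tc, ok := conn.(*net.TCPConn); ok {
                tc.SetNoDelay(true)
            }
            go handle(conn)
        }
    }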


That is all well and good in an academic setting. Many distributed systems in the real world like having time bounds under 200 ms for certain things like Paxos consensus within a datacenter. It turns out that latency, at some level, is equivalent to reliability, and 200 milliseconds is almost always well beyond that level.


I’m not sure what else to say than “this isn’t true.” 6.824’s labs have been paxos-based for at least the better part of a decade, and at no point did they emphasize latency as a key factor in reliability of distributed systems. If anything, it’s the opposite.

Dismissing rtm as “academic” seems like a bad bet. He’s rarely mistaken. If something were so fundamental to real-world performance, it certainly wouldn’t be missing from his course.


I'll be sure to tell my former colleagues (who build distributed storage systems at Google) that they are wrong about network latency being an important factor in the reliability of their distributed systems because an MIT course said so.

I'm not insinuating that your professor doesn't know the whole picture - I'm sure he does research in the area, which would mean that he is very familiar with the properties of datacenter networks, and he likely does research into how to make distributed systems fast. I'm suggesting that he may not be telling it to you because it would complicate his course beyond the point where it is useful for your learning.


Tell you what. If you ask your colleague “Do you feel that a 100ms delay will cause our distributed storage system to become less reliable?” and they answer yes, I’ll venmo you $200. If you increase it to 200ms and they say yes, I’ll venmo you $100. No strings attached, and I’ll take you at your word. But you have to actually ask them, and the phrasing should be as close as possible.

If we were talking >1s delays, I might agree. But from what I know about distributed systems, it seems $200-unlikely that a Googler whose primary role is distributed systems would claim such a thing.

The other possibility is that we’re talking past each other, so maybe framing it as a bet will highlight any diffs.

Note that the emphasis here is “reliability,” not performance. That’s why it’s worth it to me to learn a $200 lesson if I’m mistaken. I would certainly agree as a former gamedev that a 100ms delay degrades performance.


Raft can easily be tuned to handle any amount of latency; the paper even discusses this. The issue is “how long are you willing to be in a leaderless state,” and for some applications that window is very tight. If your application needs a leader to make a distributed decision but none is currently available, the application might not know how to handle that, or it might block until a leader is elected, causing throughput issues.
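
A minimal sketch of that tradeoff, generic and not tied to any particular Raft implementation: the election timeout has to sit well above the expected network delay, and it also bounds how long you can be leaderless.

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // electionTimeout picks a randomized timeout well above the expected round trip,
    // per the usual Raft guidance (broadcast time << election timeout << MTBF).
    // The 10x multiplier is illustrative.
    func electionTimeout(expectedRTT time.Duration) time.Duration {
        base := 10 * expectedRTT
        jitter := time.Duration(rand.Int63n(int64(base)))
        return base + jitter // somewhere in [10*RTT, 20*RTT)
    }

    func main() {
        // Tolerating 200ms delays is easy: pick a bigger timeout...
        fmt.Println(electionTimeout(200 * time.Millisecond))
        // ...but now a dead leader goes undetected for seconds, and that's the
        // leaderless window during which decisions needing a leader stall.
    }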

However, you shouldn’t be using TCP for latency-sensitive applications IMHO. Firstly, TCP requires a handshake on any new connection, which takes time. Secondly, if 3 packets are sent and the first one is lost, you won’t get the remaining packets for a couple hundred ms anyway (default retransmission timeouts). So you’re better off using something like UDP. If you need the properties of TCP, you aren’t doing anything latency-sensitive.


TCP within datacenters is tuned to have shorter retransmit times, and tends to use long-lived connections.

See this RFC for more info on retransmit times: https://www.rfc-editor.org/rfc/rfc6298


I'll give you a few examples, and maybe I'll run a casual poll next time we get a beer. No venmo.

I will point out that leader election generally has very long timeouts (seconds). But a common theme here is that you do lots of things that are not leader election but do have short timeouts, and those can cause systems to reconfigure, because the system generally wants to run in a useful configuration, not just a safe one.

In a modern datacenter, 100 milliseconds is ample time to determine whether a server is "available" and retry on a different server - servers and network switches can get slow, and when they get THAT slow, something is clearly wrong, so it is better to drain them and move that data somewhere else. When the control plane hears about these timeout failures from clients, it dutifully assumes that a server is offline and drains it.

Usually, this works well: machine-to-machine latency within a datacenter is well under 100 microseconds, and if you include the OS stack under heavy load, it might get all the way to 1 millisecond. Something is almost always wrong if a very simple server can't respond within 10-100 milliseconds. This results in 10-100 millisecond response times meaning "not available" at the lower layers of the stack. As I mentioned before, enough reports of "unavailable" results in a machine being drained, and a critical number of drains results in an outage.
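
A rough Go sketch of the pattern being described, with every name hypothetical: a short per-request deadline, a retry against another replica, and a report that feeds the control plane's "unavailable" accounting.

    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    // Replica is a hypothetical stand-in for a storage server client.
    type Replica interface {
        Get(ctx context.Context, key string) ([]byte, error)
    }

    // read gives each replica a short deadline; one that can't answer a simple
    // request within 50ms is treated as unavailable, because a healthy
    // intra-datacenter round trip is tens of microseconds, not tens of milliseconds.
    func read(ctx context.Context, replicas []Replica, key string) ([]byte, error) {
        for _, r := range replicas {
            rctx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
            data, err := r.Get(rctx, key)
            cancel()
            if err == nil {
                return data, nil
            }
            reportUnavailable(r) // enough of these and the control plane drains the server
        }
        return nil, errors.New("all replicas timed out")
    }

    // reportUnavailable is a placeholder for the control-plane feedback path.
    func reportUnavailable(r Replica) {}

    // slowReplica simulates a perfectly healthy server stuck behind a ~200ms stall.
    type slowReplica struct{ delay time.Duration }

    func (s slowReplica) Get(ctx context.Context, key string) ([]byte, error) {
        select {
        case <-time.After(s.delay):
            return []byte("value"), nil
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }

    func main() {
        _, err := read(context.Background(), []Replica{slowReplica{200 * time.Millisecond}}, "k")
        fmt.Println(err) // the 200ms stall blows the budget and the replica is written off
    }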

"Attack of the Killer Microseconds" is a good paper that addresses the issue here (albeit obliquely): https://dl.acm.org/doi/10.1145/3015146

Here are a couple of examples:

* There is a very important 10 ms timeout in Colossus (distributed filesystem) that determines the liveness of a disk server. I have seen one instance where, due to a software change, enough of a cell blew this timeout and the entire cell went read-only. In another instance, a small test cell went down because of this timeout during an experiment.

* Another cell went down under load due to a different 10 ms liveness timeout thanks to the misapplication of Nagle's algorithm (although not to networking) - I forget if it was a test cell or something customer-facing.

* Bigtable (NoSQL database) has a similar timeout under 100 ms (but greater than 10 ms) for its tablet servers. I'm sure Spanner (NewSQL database) has the same.


At 200 ms you start to assume the other end is dead and retry. You don’t get sub-second consumer response times with random 200 ms delays on datacenter-to-datacenter calls.


The default last_server_contact timeout for Consul is 200ms. Can I have $200?


Maybe. The sticking point for me is that I’ve implemented enough distributed system protocols to know that even if a server occasionally drops out, the overall reliability of the service isn’t affected. I would be very curious to hear from someone in the field if they feel differently.

It’s easy to assume that a server dropout = less reliable network. But even if a leader election were happening every minute, it seems unlikely to drastically affect any ops in flight.

But sure, if they agree I’ll venmo you $200 too.


We operate a large Consul cluster (Consul is great, but we abuse the hell out of it). Frequent leader elections have been responsible for outages. Don't worry about the $200, I'm just fucking with you, but I don't think you're on very firm ground with this line of argument. It's fun to watch though, so I do hope you keep going with it. :)


Hmm. Thank you for the datapoint. It’s why I scaled the bet down to $100 for 200ms.

I think it’s worth uncovering whether a 100ms delay could result in an outage. If I were on call, it’d be hard to sleep knowing that was true.

The root claim is of course that leaving TCP_NODELAY off (i.e., leaving Nagle’s algorithm enabled) can result in an outage. It still seems $200-unlikely that this could be true. Certainly it might cause performance problems, but the claim was about reliability. Outages would put it firmly in the “unreliable” section of the Venn diagram.

My claim about one-minute leader re-elections is admittedly more suspect. It’s surprising the re-elections caused outages. But I suppose if there were a lot of long-running operations that needed a total order, frequent re-elections would hose that.


In fairness, I don't know if we kept the default. I'm responding to two independent things at this point: first, there are definitely systems where 200ms delays have rippling impacts, and second, leader elections aren't always benign.

(Consul would, I'm sure, converge eventually regardless of the election frequency, but that doesn't mean everything that relies on Consul will tolerate those delays).

I don't have much of a take here, beyond that I don't think you can extrapolate as much from what's on the 6.824 pages as you might have done here. Certainly, in a system where 200ms is the difference between "healthy" and "not healthy" status on a peer relationship, I'd think you'd want Nagle disabled. But I haven't thought carefully about this, or looked that closely at the typical packet flow between Consul nodes. I could be wrong about all of this; more reason not to give me any money.

Later

Per the comment upthread, I haven't even bothered to check which parts of this packet flow are even TCP to begin with.


I've never directly used Consul's internals, but I'm guessing it uses Stubby, which is built on top of TCP.


It does Serf over UDP, but I get fuzzy on the integration of Serf and Consul.


Raft and the Consul RPC API use TCP, Serf uses both TCP and UDP.

While the Consul RPC API may have grown options to use gRPC (I forget now), Raft uses length-prefixed msgpack PDUs.


Whoops, I thought this was a Google product, given the discussion. Stubby is basically gRPC internal to Google.


It seems to me like systems like these are the exception rather than the rule. You can always turn off Nagle's algorithm if you have something really latency-sensitive, but it should not be off by default.

200 ms is not the end of the world in most cases, and it's far better than relying on everything doing its own buffering correctly and suffering a massive performance penalty when something inevitably doesn't.


I have to disagree: 200 ms is usually most of your latency budget, in my experience. Random 200 ms delays kill your p99 numbers and harm customers. Most internet traffic is inside the datacenter, not to the edge. And I assume Fastly, Akamai, and Cloudflare are all aware of how to tune for slow last miles.
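
A quick illustration of how rare stalls end up owning the tail (the numbers are made up):

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    func p99(samples []time.Duration) time.Duration {
        sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
        return samples[len(samples)*99/100]
    }

    func main() {
        // 1000 requests at a healthy 2ms, except 2% hit a ~200ms Nagle/delayed-ACK stall.
        samples := make([]time.Duration, 0, 1000)
        for i := 0; i < 1000; i++ {
            if i%50 == 0 {
                samples = append(samples, 200*time.Millisecond)
            } else {
                samples = append(samples, 2*time.Millisecond)
            }
        }
        fmt.Println("p99:", p99(samples)) // the rare stalls ARE the p99
    }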



