> Note there is no intrinsic reason running multiple streams should be faster th...

nh2 · 2026-02-02T17:57:09 1770055029

Is it certain that this is the reason?

rsync's man page says "pipelining of file transfers to minimize latency costs" and https://rsync.samba.org/how-rsync-works.html says "Rsync is heavily pipelined".

If pipelining is really in rsync, there should be no "dead time between transfers".

dekhn · 2026-02-02T19:20:58 1770060058

The simple model for scp and rsync (it's likely more complex in rsync): for loop over all files. for each file, determine its metadata with fstat, then fopen and copy bytes in chunks until done. Proceed to next iteration.

I don't know what rsync does on top of that (pipelining could mean many different things), but my empirical experience is that copying 1 1 TB file is far faster than copying 1 billion 1k files (both sum to ~1 TB), and that load balancing/partitioning/parallelizing the tool when copying large numbers of small files leads to significant speedups, likely because the per-file overhead is hidden by the parallelism (in addition to dealing with individual copies stalling due to TCP or whatever else).

I guess the question is whether rsync is using multiple threads or otherwise accessing the filesystem in parallel, which I do not think it does, while tools like rclone, kopia, and aws sync all take advantage of parallelism (multiple ongoing file lookups and copies).

mschuster91 · 2026-02-03T01:40:15 1770082815

> I don't know what rsync does on top of that (pipelining could mean many different things), but my empirical experience is that copying 1 1 TB file is far faster than copying 1 billion 1k files (both sum to ~1 TB), and that load balancing/partitioning/parallelizing the tool when copying large numbers of small files leads to significant speedups, likely because the per-file overhead is hidden by the parallelism (in addition to dealing with individual copies stalling due to TCP or whatever else).

That's because of fast paths:

- For a large file, assuming the disk isn't fragmented to hell and beyond, there isn't much to do for rsync / the kernel: the source reads data and copies it to the network socket, the receiver copies data from the incoming network socket to the disk, the kernel just dumps it in sequence directly to the disk, that's it.

- The slightly less performant path is on a fragmented disk. Source and network still doesn't have much to do, but the kernel has a bit more work every now and then to find a contiguous block on the disk to write the data to. For spinning rust HDDs, the disk also has to do some seeking.

- Many small files? Now that's more nasty. First, the source side has to do a lot of stat(2) calls to get basic attributes of the file. For HDDs, that seeking can incur a sometimes significant latency penalty as well. Then, this information needs to be transferred to the destination, the destination has to do the same stat call again, and then the source needs to transfer the data, involving more seeking, and the destination has to write it.

- The utter worst case is when the files are plenty and small, but large enough to not fit into an inode as inline data [1]. That means two writes and thus seeks per small file. Utterly disastrous for performance.

And that's before stepping into stuff such as systems disabling write caches, soft-RAID (or the impact of RAID in general), journaling filesystems, filesystems with additional metadata...

[1] https://archive.kernel.org/oldwiki/ext4.wiki.kernel.org/inde...

nh2 · 2026-02-02T19:38:07 1770061087

> I guess the question is whether rsync is using multiple threads or otherwise accessing the filesystem in parallel

No, that is not the question. Even Wikipedia explains that rsync is single-threaded. And even if it was multithreaded "or otherwise" used concurent file IO:

The question is whether rsync _transmission_ is pipelined or not, meaning: Does it wait for 1 file to be transferred and acknowledged before sending the data of the next?

Somebody has to go check that.

If yes: Then parallel filesystem access won't matter, because a network roundtrip has brutally higher latency than reading data sequentially of an SSD.

dekhn · 2026-02-02T19:58:13 1770062293

Note that rsync on many small files is slow even within the same machine (across two physical devices), suggesting that the network roundtrip latency is not the major contributor.

nh2 · 2026-02-03T01:23:25 1770081805

The original post only mentions 3564 files and rsync spending 8 minutes on that. This just doesn't check out.

Dylan16807 · 2026-02-03T00:41:55 1770079315

The filesystem access and general threading is the question because transmission is pipelined and not a thing "somebody has to go check". You just quoted the documentation for it.

The dead time isn't waiting for network trips between files, it's parts of the program that sometimes can't keep up with the network.

nh2 · 2026-02-03T01:20:25 1770081625

I quoted the documentation that claims _something_ is pipelined.

That is extremely vague on what that is and I also didn't check that it's true.

Both the original claim "the issue is the serialization of operations" and the counter-claim all sound like extreme guesswork or me. If you know for certain, please link the relevant code.

Otherwise somebody needs to go check what it actually does; everything else is just speculating "oh surely it's the files" and then people remember stuff that might just be plain wrong.

Dylan16807 · 2026-02-03T01:39:39 1770082779

Speculation isn't the most useful thing, but saying "that is not the question" to valid speculation is even less useful.

nh2 · 2026-02-04T19:42:28 1770234148

I was about to say:

The question was what exactly rsync pipelines, and whether it serialises its network sends. If true, that would be a plausible cause of parallelism speeding it up.

Serial local reads are not a plausible cause, because the autor describes working on NVMe SSDs which have so low latency that they cannot explain that reading 59 GB across 3000 files take 8 minutes.

However:

You might actually be half-right because in the main output shown in the blog post, the author is NOT using local SSDs. The invocation is `rsync ... /Volumes/mercury/* /Volumes/...` where `mercury` is a network share mount (and it is unspecified what kind of share that is). So in that case, every read that looks "local" to rsync is actually a network access. It is totally possible that rsync treats local reads as fast and thus they are not pipelined.

In fact, it is even highly likely that rsync will not / cannot pipeline reading files that appear local to it, because normal POSIX file IO does not really offer any ways to non-blocking read regular files, so the only way to do that is with threads, which rsync doesn't use.

(Extra evidence about rsync using normal blocking writes, and not supporting threads, beyond the fact that no threading code exists in rsync's repo: https://github.com/RsyncProject/rsync/blob/236417cf354220669...)

So while "the dead time isn't waiting for network trips between files" would be wrong -- it absolutely would wait for network trips between files -- your "filesystem access and general threading is the question" would be spot-on.

So in that case rclone is just faster because it reads from his network mount in parallel. This would also explain why he reports `tar` as not being faster, because that, too, reads files serially from the network mount. Supposedly this situation could be avoided by running rsync "normally" via SSH, so that file reads are actually fast on the remote side.

The situation is extra confused by the author writing below his run output:

    even experimenting with running the rsync daemon instead of SSH

when in fact the output above didn't rsync over SSH at all.

Another weird thing I spotted is that the rsync output shown in the post

    Unmatched data: 62947785101 B

seems impossible: The string "Unmatched data" doesn't seem to exist in the rsync source code, and hasn't since 1996. So it is unclear to me what version of rsync used.

I commented that on https://www.jeffgeerling.com/blog/2025/4x-faster-network-fil...

Dylan16807 · 2026-02-05T00:34:02 1770251642

> Serial local reads are not a plausible cause, because the autor describes working on NVMe SSDs which have so low latency that they cannot explain that reading 59 GB across 3000 files take 8 minutes.

But the people you responded to were talking about slowdowns that exist in general, not just ones that apply directly to the post.

For the post, my personal guess is that per-file overhead isn't a huge factor here, and it's mostly rsync having trouble doing >1Gbps over the network.

> In fact, it is even highly likely that rsync will not / cannot pipeline reading files that appear local to it, because normal POSIX file IO does not really offer any ways to non-blocking read regular files, so the only way to do that is with threads, which rsync doesn't use.

Makes sense.

> it absolutely would wait for network trips between files

I don't see why you're saying this. I expect it to serially read files and then put that data into a buffer that can have data from multiple files at the same time. In other words, pipelined networking. As long as the transfer queue doesn't bottom out it shouldn't have to wait for any network round trips. What leads you to think otherwise?

nh2 · 2026-02-05T01:30:52 1770255052

> But the people you responded to were talking about slowdowns that exist in general, not just ones that apply directly to the post.

I think that's incorrect though. These slowdowns do not exist in general (see my next reply where I run rsync and it immadiately maxes out my 10 Gbit/s).

I think original poster digiown is right with "Note there is no intrinsic reason running multiple streams should be faster than one [EDIT: 'at this scale']. It almost always indicates some bottleneck in the application". In this case it's the user running rsync as a serially-reading program reading from a network mount.

> rsync having trouble doing >1Gbps over the network

rsync copies at 10 Gbit/s without problem between my machines.

Though I have to give `-e 'ssh -c aes256-gcm@openssh.com'` or aes128-gcm, otherwise encryption bottlenecks at 5 Gbit/s with the default `chacha20-poly1305@openssh.com`.

> I don't see why you're saying this.

Because of the part you agreed making sense: It read each file with the sequence `open()/read()/.../read()/close()`, but those files are on the network mount ("/Volumes/mercury"), so each `read()` of size `#define IO_BUFFER_SIZE (32*1024)` is a network roundtrip.

Dylan16807 · 2026-02-05T03:05:31 1770260731

I see, so you're saying the file end of rsync is forced to wait for the network because the filesystem itself waits, not the network end of rsync. That makes sense.

Though I wonder what the actual delay is. The numbers in the post implied several milliseconds, enough to maybe account for 30 seconds of the 8 minutes. But maybe changing files resets the transfer speed a bunch.

nh2 · 2026-02-05T06:00:43 1770271243

From which part do you take the "several milliseconds"?

If I assume 0.2ms ping and each rsync read() is a roundtrip, I arrive at 6.4 minutes = 62955918871 B / (321024 B) 0.0002 s / (60 s/min).

Dylan16807 · 2026-02-05T07:04:50 1770275090

To be clear, when I said "delay" in my last post I meant the per-file penalty, not the round-trip latency.

Radxa Orion O6/.DS_Store, 6KB at 4.8MBps, that means it took 1.3ms to transfer

Radxa Orion O6/Micro Center Visit Details.pages, 141KB at 9.8MBps, 14ms

Radxa Orion O6/Radxa Orion O6.md, 19KB at 1.9MBps, 10ms

We know the link can do well over 100MBps, so that time is almost all overhead. But it doesn't seem to be a simple delay. Perhaps a fresh TCP window scaling up on a per-file basis? That would be an unfortunate filesytem design.

Since the two 4MB files both get up to ~100MBps, the same speed as the 250MB file, it seems like the maximum impact from switching files isn't much more than 15ms. If the average is below 10ms then we're looking at half a minute wasted over 3564 files. If the average is 20ms then switching files is responsible for 71 seconds wasted.

By that estimate the file-level serialization is a real issue, but the bigger issue is whatever's preventing that 250MB file from ramping up all the way to 10Gbps.

spockz · 2026-02-02T19:08:02 1770059282

I’m not sure why, but just like with scp, I’ve achieved significant speeds ups by tarring the directory first (optionally compressing it), transferring and then decompressing. Maybe because it makes the tar and submit, and the receive, untar/uncompress, happen on different threads?

poke646 · 2026-02-02T23:20:33 1770074433

One of my "goto" tools is copying files over a "tar pipe". This avoids the temporary tar file. Something like:

  tar cf - *.txt | ssh user@host tar xf - -C /some/dir/

lelandbatey · 2026-02-02T19:23:00 1770060180

It's typically a disk-latency thing, as just stat-ing the many files in a directory can have significant latency implications (especially on spinning HDDs) vs opening a single file (the tar) and read-()ing that one file in memory before writing to the network.

If copying a folder with many files is slower than tarring that folder and the moving the tar (but not counting the untar) then disk latency is your bottleneck.

ahartmetz · 2026-02-02T21:07:15 1770066435

Not useful very often, but fast and kind of cool: You can also just netcat the whole block device if you wanted a full filesystem copy anyway. Optionally zero all empty space before using a tool like zerofree and use on-the-fly compression / decompression with lz4 or lzo. Of course, none of the block devices should be mounted, though you could probably get away with a source that's mounted read-only.

dd is not a magic tool that can deal with block devices while others can't. You can just cp myLinuxInstallDisk.iso to /dev/myUsbDrive, too.

spockz · 2026-02-02T22:08:04 1770070084

Okay. In this case the whole operation is faster end to end. That includes the time it takes to tar and untar. Maybe those programs do something more efficient in disk access than scp and rsync?

dnmc · 2026-02-03T04:20:55 1770092455

I've never verified this, but it feels like scp starts a new TCP connection per file. If that's the case, then scp-ing a tarred directory would be faster because you only hit the slow start once. https://www.rfc-editor.org/rfc/rfc5681#section-3.1

ndsipa_pomu · 2026-02-03T08:25:16 1770107116

Also handy to note that tar can handle sparse files, whereas scp doesn't.

wmf · 2026-02-02T17:51:21 1770054681

The ideal solution to that is pipelining but it can be complex to implement.