I feel like this product is optimizing for an anti-pattern.
The blog argues that AI workloads are bottlenecked by latency because of 'millions of small files.' But if you are training on millions of loose 4KB objects directly from network storage, your data pipeline is the problem, not the storage layer.
Data Formats: Standard practice is to use formats like WebDataset, Parquet, or TFRecord to chunk small files into large, sequential blobs. This negates the need for high-IOPS metadata operations and makes standard S3 throughput the only metric that matters (which is already plentiful).
Caching: Most high-performance training jobs hydrate local NVMe scratch space on the GPU nodes. S3 is just the cold source of truth. We don't need sub-millisecond access to the source of truth; we need it at the edge (local disk/RAM), which is handled by the data loader pre-fetching.
It seems like they are building a complex distributed system to solve a problem that is better solved by tar -cvf.
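For concreteness, here's a rough sketch of that sharding step in Python (the directory layout and ~1 GiB shard target are made up; any streaming reader such as WebDataset can then consume the shards sequentially):

    import tarfile
    from pathlib import Path

    SHARD_SIZE = 1 << 30                    # ~1 GiB per shard; tune for your pipeline
    src = Path("dataset/raw")               # millions of small loose files (hypothetical layout)
    dst = Path("dataset/shards")
    dst.mkdir(parents=True, exist_ok=True)

    tar, shard_idx, shard_bytes = None, 0, 0
    for f in sorted(p for p in src.rglob("*") if p.is_file()):
        # Start a new shard when the current one is full (or on the first file).
        if tar is None or shard_bytes >= SHARD_SIZE:
            if tar is not None:
                tar.close()
            tar = tarfile.open(dst / f"shard-{shard_idx:06d}.tar", "w")
            shard_idx, shard_bytes = shard_idx + 1, 0
        tar.add(f, arcname=f.relative_to(src).as_posix())
        shard_bytes += f.stat().st_size
    if tar is not None:
        tar.close()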
In AI training, you want to sample the dataset in an arbitrary fashion. You may want to arbitrarily subset your dataset for specific jobs. These demands are fundamentally opposed to linear access: to make your tar-file approach work, the data has to be ordered to match the sample order of your training workload, coupling data storage and sampler design.
There are solutions for this, but the added complexity is significant, and in any case your training code and data storage become tightly coupled. If a faster storage solution lets you avoid that coupling, I for one would appreciate it.
- Modern DL frameworks (PyTorch DataLoader, WebDataset, NVIDIA DALI) do not require random access to disk. They stream large sequential shards into a RAM buffer and shuffle within that buffer (a minimal sketch follows this list). As long as the buffer size is significantly larger than the batch size, the statistical convergence of the model is identical to perfect random sampling.
- AI training is a bandwidth problem, not a latency problem. GPUs need to be fed at 10GB/s+. Making millions of small HTTP requests introduces massive overhead (headers, SSL handshakes, TTFB) that kills bandwidth. Even if the storage engine has 0ms latency, the network stack does not.
- If you truly need "arbitrary subsetting" without downloading a whole tarball, formats like Parquet or indexed TFRecords allow HTTP Range Requests. You can fetch specific byte ranges from a large blob without "coupling" the storage layout significantly.
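A minimal sketch of the shuffle-buffer idea from the first point above (plain Python, no framework; the buffer size is arbitrary): shards are read sequentially, but samples leave the buffer in a randomized order.

    import random
    from typing import Iterable, Iterator, TypeVar

    T = TypeVar("T")

    def shuffle_buffer(stream: Iterable[T], buffer_size: int = 10_000, seed: int = 0) -> Iterator[T]:
        """Yield items from a sequential stream in randomized order using a bounded RAM buffer."""
        rng = random.Random(seed)
        buf: list[T] = []
        for item in stream:                  # items arrive in storage order (sequential shard reads)
            buf.append(item)
            if len(buf) >= buffer_size:
                i = rng.randrange(len(buf))
                buf[i], buf[-1] = buf[-1], buf[i]
                yield buf.pop()              # emit a random element, keep the buffer full
        rng.shuffle(buf)                     # drain the remainder at the end of the stream
        yield from buf

    # Usage (read_shards is a hypothetical sequential shard reader):
    # for sample in shuffle_buffer(read_shards("dataset/shards"), buffer_size=50_000):
    #     train_step(sample)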
Highly dependent on what you are training. "Shuffling within a buffer" still makes your sampling dependent on the data storage order. PyTorch DataLoader does not handle this for you. Higher-level libraries like DALI do, but that is exactly the coupling I was saying to avoid. These libraries have specific use cases in mind, and therefore come with restrictions that may or may not suit your needs.
> AI training is a bandwidth problem, not a latency problem. GPUs need to be fed at 10GB/s+. Making millions of small HTTP requests introduces massive overhead (headers, SSL handshakes, TTFB) that kills bandwidth. Even if the storage engine has 0ms latency, the network stack does not.
Agree that throughput is more of an issue than latency, as you can queue data to CPU memory. Small-object throughput is definitely an issue though, which is what I was talking about. Also, there's no need to use HTTP for your requests, so HTTP or TLS overhead is more of a self-induced problem of the storage system itself.
> You can fetch specific byte ranges from a large blob without "coupling" the storage layout significantly.
This has the exact same throughput problems as small objects, though.
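For reference, the byte-range mechanism being debated is just a ranged GET (the bucket, key, and offsets below are made up); each fetched row group is still its own round trip on the wire, which is the throughput concern:

    import boto3   # assumes AWS credentials are already configured

    s3 = boto3.client("s3")

    # Pull one ~1 MB slice out of a large Parquet blob without downloading the rest.
    resp = s3.get_object(
        Bucket="my-training-data",
        Key="shards/data-00042.parquet",
        Range="bytes=1000000-1999999",
    )
    chunk = resp["Body"].read()
    print(len(chunk))                        # 1,000,000 bytes, one request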
I agree that this is an anti-pattern for training.
In training, you are often I/O bound over S3 - high b/w networking doesn't fix it (.safetensors files are typically 4GB in size). You need NVMe and high b/w networking along with a distributed file system.
We do this with tiered storage over S3 using HopsFS that has a HDFS API with a FUSE client, so training can just read data (from HopsFS datanode's NVMe cache) as if it is local, but it is pulled from NVMe disks over the network.
In contrast, writes go straight to S3 via HopsFS's write-through NVMe cache.
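To illustrate the "read as if local" point: under that kind of setup, training code just uses ordinary file I/O against the mount (the mount point and file name below are made up):

    from safetensors.torch import load_file   # pip install safetensors torch

    # The FUSE mount makes the distributed file system look like a local path;
    # the datanode's NVMe cache serves the bytes, S3 stays the cold tier.
    state_dict = load_file("/mnt/hopsfs/checkpoints/model-00001.safetensors")
    print(sum(t.numel() for t in state_dict.values()), "parameters loaded")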
> It seems like they are building a complex distributed system to solve a problem that is better solved by tar -cvf
That doesn't work on Parquet or anything compressed. In real-time analytics you want to load small files quickly into a central location where they can be both queried and compacted (different workloads) at the same time. This is hard to do in existing table formats like Iceberg. Granted not everyone shares this requirement but it's increasingly important for a wide range of use cases like log management.
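For what it's worth, the batch "compact later" version of this is straightforward with something like pyarrow (paths below are made up); the hard part is doing it continuously while the same data is being queried:

    import pyarrow.dataset as ds

    # Read many small Parquet files and rewrite them as a few large ones.
    small_files = ds.dataset("ingest/", format="parquet")
    ds.write_dataset(
        small_files,
        "compacted/",
        format="parquet",
        max_rows_per_file=5_000_000,                  # target large output files
        existing_data_behavior="overwrite_or_ignore",
    )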
You can do app-level optimizations to work with object databases that are slow for small objects, or you can have a fast object database - it doesn't seem that black and white. If you can build a fast object database that is robust and solves that problem well, it's (hopefully) a non-leaky abstraction that can warrant some complexity inside.
The tar -cvf is a good analogy though: are you working with a virtual tape drive or a virtual SSD?
Expecting the storage layer to fix an inefficient I/O pattern (millions of tiny network requests) is optimizing the wrong part of the stack.
> are you working with a virtual tape drive or a virtual SSD?
Treating a networked object store like a local SSD ignores the Fallacies of Distributed Computing. You cannot engineer away the speed of light or the TCP stack.
SSD (over NVMe) and TCP (over 100GbE) both exhibit low tens of microseconds of latency as a lower bound. This is ignoring redundancy for both, of course, but the cost of that should also be similar between the two.
If the storage is farther away, then you'll go slower, of course. But since the article is comparing to EFS and S3 Express, it's fitting to talk about nearby scenarios, I think. And the point of the article was that S3 Express was more problematic for cost than for small-object performance reasons.
Yeah, I was a bit lost from the introduction. High-performance object stores are "too expensive"? We live in an era where I can store everything forever and query it in human-scale time-frames at costs far lower than what we paid for much worse technologies a decade ago. But I was thinking of data lakes, not vector stores or whatever they are trying to solve for AI.