
> "in 2x time you can access 8x as much memory"

is NOT what the article says.

The article says (in three ways!):

> if your memory is 8x bigger, it will take 2x longer to do a read or write to it.

> In a three-dimensional world, you can fit 8x as much memory within 2x the distance from you.

> Double the distance, eight times the memory.

The key words there are "a", which means a single access, and "distance", which is a measure of time.

N is the amount of memory, and the O(·) bound is on the time to access a single element of that memory.
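A toy sketch of that scaling (Python; the uniform 3D packing and unit constants are my assumptions, not from the article):

  # Assume memory is packed uniformly in 3D around the processor,
  # so capacity N ~ distance^3 and single-access latency ~ distance.
  def access_latency(n, unit_latency=1.0):
      return unit_latency * n ** (1 / 3)

  # Double the distance, eight times the memory:
  print(access_latency(8) / access_latency(1))  # -> 2.0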



The operation GP is thinking of is a full scan, and that will always take n * n^(1/3) time as a lower bound. Though if done right, all of that latency will be occupied with computation and allow people to delude themselves into thinking it doesn't matter.

But when something is constrained two or three ways, it drastically reduces the incentive to prioritize tackling any one of the problems with anything but small incremental improvements.


> The operation GP is thinking of is a full scan, and that will always take n * n^(1/3) time as a lower bound.

It doesn't. Full scans are faster than accessing each memory address in an unordered way.

Let's look at a Ryzen 2600X. You can sustain 32 bytes per cycle from L1, 32 bytes per cycle from L2, and 20 bytes per cycle from L3. That's 64KB, 512KB, and 16MB caches all having almost the same bandwidth despite very different latencies.
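Converting that to sustained bandwidth (a back-of-the-envelope sketch; the ~4.2 GHz boost clock is my assumption, not a measured figure):

  clock_hz = 4.2e9  # assumed boost clock
  for level, bytes_per_cycle in [("L1", 32), ("L2", 32), ("L3", 20)]:
      print(level, bytes_per_cycle * clock_hz / 1e9, "GB/s")
  # L1 ~134 GB/s, L2 ~134 GB/s, L3 ~84 GB/s: the same order of magnitude
  # despite very different latencies.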

You can also imagine an infiniband network that fills 2 racks, and another one that fills 50000 racks. The bandwidth of a single node is the same in both situations, so even though latency gets worse as you add more nodes and hops, it's still going to take O(n) time for a single thread to scan the entire memory.

You can find correlations between memory size and bandwidth, but they're significantly weaker and less consistent than the correlations between memory size and latency.


I don’t think I can agree to ignore the latency problem. Inter-machine latency is the source of the cube-root-of-N term.


Once you start receiving data in bulk, the time it takes is quantity of data divided by your connection speed. Latency doesn't factor in.

Technically you need to consider the time it takes to start receiving data. Which would mean your total time is O(n + ∛n). Not O(n * ∛n). But not all nodes are ∛n away. The closest nodes are O(1) latency. So if you start your scan on the close nodes, you will keep your data link saturated from the start, and your total time will be O(n). (And O(n + ∛n) simplifies to O(n) anyway.)
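A minimal simulation of that argument (my own sketch; it assumes n unit-size chunks where chunk i sits ~i^(1/3) time units away and the link delivers one chunk per time unit once data arrives):

  def scan_time(n):
      finish = 0.0
      for i in range(n):               # request chunks nearest-first
          arrival = i ** (1 / 3)       # earliest chunk i can start arriving
          finish = max(finish, arrival) + 1.0  # one time unit to transfer it
      return finish

  for n in (10, 1000, 100000):
      print(n, scan_time(n))  # grows like n, not n * n^(1/3)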


I don't think we have such a lower bound: from a "theoretical" point of view (in the sense of the post), your processor could walk over the cube of memory and collect each bit one by one. Each move+read costs O(1) (if you move correctly), so you get O(n) to read the whole cube of n bits.
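One concrete way to do that walk (my sketch, not from the post) is a serpentine order through an s x s x s cube, so every step moves to an adjacent cell:

  def snake_order(s):
      for z in range(s):
          ys = range(s) if z % 2 == 0 else range(s - 1, -1, -1)
          for y in ys:
              xs = range(s) if (y + z) % 2 == 0 else range(s - 1, -1, -1)
              for x in xs:
                  yield (x, y, z)

  # every consecutive pair of cells is exactly one unit step apart,
  # so each move+read is O(1) and the full scan of s^3 cells is O(n)
  cells = list(snake_order(4))
  assert all(sum(abs(a - b) for a, b in zip(p, q)) == 1
             for p, q in zip(cells, cells[1:]))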

If you require the full scan to be done in a specific order, however, then indeed in the worst case you have to go from one end of the cube to the other between consecutive reads, which incurs an O(n^(1/3)) multiplicative cost. Note that this does not constitute a theoretical lower bound: it might be possible to detect those jumps and use the time spent in a corner to store a few values that will become useful later. This looks like a fun computational problem; I don't know its exact worst-case complexity.



