
ARC, used by Swift, has its own cost.


True, but it's generally better than most full GC solutions (for processes running for relatively short times without the benefit of profile-guided optimization), and worse than languages with fully statically analyzable memory usage.

Note: that parenthetical is a very big caveat, because properly profile-optimized JVM executables can often achieve exceptional performance/development cost tradeoffs.

Moreover, ARC admits a substantial amount of memory-usage optimization given bytecode, which is now what developers provide to Apple on iOS. It also opens the door to Apple serving last-minute-compiled, microarchitecture-optimized binaries for each device (family).

To satiate the pedants... ARC is more or less GC where the calls into the GC mechanism are compiled in statically and where there are, at worst, deterministic bounds on potential "stop the world" conditions.
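A minimal sketch of that framing in C (the names are illustrative, not Swift's actual runtime API): the retain/release calls are emitted statically by the compiler around assignments and scope exits, and reclamation is deterministic the moment the count hits zero.

```c
#include <stdlib.h>

/* Hypothetical sketch of ARC as "statically compiled-in GC".
   Not Swift's real runtime calls -- just the shape of the mechanism. */
typedef struct Object {
    int refcount;
    int payload;
} Object;

Object *obj_new(int payload) {
    Object *o = malloc(sizeof *o);
    if (!o) return NULL;
    o->refcount = 1;               /* the creating reference */
    o->payload = payload;
    return o;
}

/* These two are what the compiler inserts statically, so there is no
   separate collector thread and no unbounded pause. */
void obj_retain(Object *o)  { o->refcount++; }
void obj_release(Object *o) {
    if (--o->refcount == 0)
        free(o);                   /* deterministic, bounded "collection" */
}
```

The worst-case "stop the world" here is a cascade of frees when a large object graph's last reference goes away, which is bounded by the size of that graph.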

While this may not be presently optimal, because profile-guided approaches can deliver better performance by tuning allocation-pool and collection-time parameters, it's arguably a more consistent and statically analyzable approach that, as compilers improve, may yield better overall performance. It also provides tight bounds on "stop the world" situations, which in any case arise far less frequently on mobile platforms than in long-running server applications.

Beyond those theoretical bounds, it's certainly much easier to handle when you have an OS that is loading and unloading applications according to some policy. This is extremely relevant as most sensible apps are not actually long running.


> but it's generally better than most full GC solutions

I doubt that. It implies huge costs without any of the benefits of GC.

A typical GC has compaction, nearly stack-like fast allocation [1], and the ability to allocate a bunch of objects at once (just bump the heap pointer once for the whole batch).
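For context, a bump allocator (the TLAB-style fast path being described) can be sketched in a few lines of C -- this is a toy with a fixed arena, not any VM's actual implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy TLAB-style bump allocator: allocation is an alignment round-up,
   a limit check, and a pointer increment. A real VM would refill the
   TLAB from the shared heap (or trigger a minor GC) on overflow. */
enum { ARENA_SIZE = 1 << 20 };      /* 1 MiB toy arena */
static uint8_t arena[ARENA_SIZE];
static size_t top;                  /* the "bump pointer" */

void *bump_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;       /* keep 8-byte alignment */
    if (top + n > ARENA_SIZE)
        return NULL;                /* real VM: refill TLAB / minor GC */
    void *p = &arena[top];
    top += n;
    return p;
}
```

Allocating a batch of objects is a single bump of `top` by their combined size; there is no per-object bookkeeping, which is what makes it "nearly stack-like".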

And both Perl and Swift do indeed perform abysmally, usually worse than both GC and manual languages [2].

> ARC is more or less GC

Well, no. A typical contemporary GC is generational, often concurrent, allowing fast allocation. ARC is just a primitive allocator with ref/deref attached.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.49....

[2] https://github.com/ixy-languages/ixy-languages


It is nowhere near stack-like. Stack is hot in cache. Heap memory in tlab is cold. Bringing the lines into cache is the major cost, not bumping the pointer.


> Stack is hot in cache. Heap memory in tlab is cold.

What? This doesn't make any sense. From the cache's POV the stack and the bump-allocated heap are the same thing. Both are contiguous chunks of memory where the next value is allocated right after the previous one.

The only difference between the stack and the bump-allocated heap is that the former has hardware support for pointer bumping and the latter has not. That's all.


You're missing the fact that the tlab pointer only ever moves forward, so it always points at memory that hasn't been touched recently. By the time a reset happens and it points back to the same memory again, the application has allocated several megabytes, sometimes hundreds of megabytes, and most of that new-gen memory doesn't fit even in L3 cache.

The stack pointer moves both directions and the total range of that back-and-forth movement is typically in kilobytes, so it may fit fully in L1.

Just check with perf what happens when you iterate over an array of 100 MB several times and compare that to iterating over 10 kB several times. Both are contiguous but the performance difference is pretty dramatic.

Besides that, there is also the effect that the faster you allocate, the faster you run out of new-gen space, and the faster you trigger minor collections. These are not free. The more often you do minor collections, the more likely objects are to survive, and the cost is proportional to the survival rate. That's why many Java apps tend to use a pretty big new generation, hoping that most young objects die before a collection happens.

This is not just theory - I've seen it too many times: reducing the allocation rate to nearly zero caused significant speedups, by an order of magnitude or more. Reducing memory traffic is also essential to get good multicore scaling. It doesn't matter that each core has a separate tlab when their total allocation rate is so high that they saturate the LLC-to-main-memory link. It is easy to miss this problem with classic method profiling, because a program with such a problem manifests as everything being mysteriously slow, with no obvious bottleneck.


> You're missing the fact that the tlab pointer only ever moves forward, so it always points at memory that hasn't been touched recently. By the time a reset happens and it points back to the same memory again, the application has allocated several megabytes, sometimes hundreds of megabytes, and most of that new-gen memory doesn't fit even in L3 cache.

Yes, you are right about stack locality. It indeed moves back and forth, keeping the effectively used memory region quite small.

> These are not free. The faster you do minor collections, the more likely it is for the objects to survive. And the cost is proportional to survival rate.

Yes, that's true. Immutable languages do much better here, having small minor heaps (OCaml's is 2MB on amd64) and very low survival rates (with many objects allocated directly on the older heap if they are known in advance to be long-lived).

Now I understand your point better and I agree.



