Stack's just the youngest generation. I can see the difference being that you ha...

coder543 · on March 31, 2022

From a conceptual point of view, I agree, but... in practice, stacks are incredibly cheap.

The entire set of local variables can be allocated with a single bump of the stack pointer upon entry into a function, and they can all be freed with another bump of the stack pointer upon exit. With heap allocations, even with the simplest bump allocator.. you still have to allocate once per object, which can easily be an order of magnitude more work than what you have to do with an equivalent number of stack allocated objects. Your program doesn't magically know where those objects are, so it still also has to pay to stack allocate the pointers to keep track of the heap objects. Then you have the additional pointer-chasing slowing things down and the decreased effectiveness of the local CPU caches due to the additional level of indirection. A contiguous stack frame is a lot more likely to stay entirely within the CPU cache than a bunch of values scattered around the heap.

Beyond that, and beyond the additional scanning you already mentioned, in the real world the heap is shared between all threads, which means there will be some amount of contention whenever you have to interact with the heap, although this is amortized some with TLABs (thread-local allocation buffers). You also have to consider the pointer rewriting that a generational GC will have to perform for the survivors of each generation, and that will not be tied strictly to function frames. The GC will run whenever it feels like it, so you may pay the cost of pointer rewriting for objects that are only used until the end of this function, just due to the coincidence of when the GC started working. I think (but could be wrong/outdated) that generational GCs almost all require both read and write barriers that perform some synchronization any time you interact with a heap allocated value, and this slows the program down even more compared to stack objects. (I believe that non-copying GCs don't need as many barriers, and that the barriers are only active during certain GC phases, which is beneficial to Go programs, but again, stack objects don't need any barriers at all, ever.)

GCs are really cool, but stack allocated values are always better when they can be used. There's a reason that C# makes developer-defined value types available; they learned from the easily visible problems that Java has wrestled with caused by not allowing developers to define their own value types. Go took it a step further and got rid of developer-defined reference types altogether, so everything is a value type (arguably with the exception of the syntax sugar for the built-in map, slice, channel, and function pointer types), and even values allocated behind a pointer will still be stack allocated if escape analysis proves it won't cause problems.

chrisseaton · on March 31, 2022

> you still have to allocate once per object

Why can’t you bump once to allocate enough space for multiple objects?

coder543 · on March 31, 2022

You can’t do it that way because each object has to have its own lifetime.

If you allocate them all as a single allocation, then the entire allocation would be required to live as long as the longest lived object. This would be horribly inefficient because you couldn’t collect garbage effectively at all. Memory usage would grow by a lot, as all local variables are continuously leaked for arbitrarily long periods of time whenever you return a single one, or store a single one in an array, or anything that could extend the life of any local variable beyond the current function.

If you return a stack variable, it gets copied into the stack frame of the caller, which is what allows the stack frame to be deallocated as a whole. That’s not how heap allocations work, and adding a ton of complexity to heap allocations to avoid using the stack just seems like an idea fraught with problems.

If you know at compile time that they all should be deallocated at the end of the function… the compiler should just use the stack. That’s what it is for. (The one exception is objects that are too large to comfortably fit on the stack without causing a stack overflow.)

chrisseaton · on March 31, 2022

> If you allocate them all as a single allocation, then the entire allocation would be required to live as long as the longest lived object.

No, you can copy objects out if they live longer than others. That's how a generational GC works.

coder543 · on March 31, 2022

I edited my comment before you posted yours. Making heap allocations super complicated just to avoid the stack is a confusing idea.

Generational GCs surely do not do a single bump allocate for all local variables. How could the GC possibly know where each object starts and ends if it did it as a single allocation? Instead, it treats them all as individual allocations within an allocation buffer, which means bumping for each one separately. Yes, they will then get copied out if they survive long enough, but that’s not the same thing as avoiding the 10+ instructions per allocation.

It’s entirely possible I’m wrong when it comes to Truffle, but at a minimum it seems like you would need arenas for each size class, and then you’d have to bump each arena by the number of local variables of that size class. The stack can do better than that.

chrisseaton · on March 31, 2022

When you allocate objects individually it looks like this:

    object_a = tlab
    tlab += 8
    check tlab limit
    object_b = tlab
    tlab += 8
    check tlab limit

When you allocate as a single allocation it looks like this:

    object_a = tlab
    object_b = tlab + 8
    tlab += 16
    check tlab limit

What about that do you see as impossible?

> How could the GC possibly know where each object starts and ends if it did it as a single allocation?

As above.

coder543 · on March 31, 2022

In your example, the objects are all the same size. That would certainly be easy.

If you have three local objects that are 8 bytes, 16 bytes, and 32 bytes… if you do a single 48 byte allocation on the TLAB, how can the GC possibly know that there are three distinct objects, when it comes time to collect the garbage? I can think of a few ways to kind of make it work in a single buffer, but they would all require more than the 48 bytes that the objects themselves need. Separate TLAB arenas per size class seem like the best approach, but it would still require three allocations because each object is a different size.

I understand you’re some researcher related to Truffle… this is just the first I’m hearing of multiple object allocation being done in a single block with GC expected to do something useful.

chrisseaton · on March 31, 2022

> If you have three local objects that are 8 bytes, 16 bytes, and 32 bytes… if you do a single 48 byte allocation on the TLAB, how can the GC possibly know that there are three distinct objects, when it comes time to collect the garbage?

Because the objects are self-describing - they have a class which tells you their size.

    object_a = tlab
    object_a.class = ClassA
    object_b = tlab + 8
    object_b.class = ClassB
    object_c = tlab + 16
    object_c.class = ClassC
    tlab += 48
    check tlab limit

coder543 · on March 31, 2022

Ok, so it’s not as simple as bumping the TLAB pointer by 48. Which was my point. You see how that’s multiple times as expensive as stack allocating that many variables? Even something as simple as assigning the class to each object still costs something per object. The stack doesn’t need self describing values because the compiler knows ahead of time exactly what every chunk means. Then the garbage collector has to scan each object’s self description… which way more expensive than stack deallocation, by definition.

You’re extremely knowledgeable on all this, so I’m sure that nothing I’m saying is surprising to you. I don’t understand why you seem to be arguing that heap allocating everything is a good thing. It is certainly more expensive than stack allocation, even if it is impressively optimized. Heap allocating as little as necessary is still beneficial.

chrisseaton · on March 31, 2022

Every stack allocation scheme I've seen creates a fully reified object though, with a class pointer as normal.

You may be confusing with scalar replacement of aggregates, which is a separate concept.

https://chrisseaton.com/truffleruby/seeing-escape-analysis/

coder543 · on March 31, 2022

Go does not put class pointers on stack variables. Neither does Rust or C. The objects are the objects on the stack. No additional metadata is needed.

The only time Go has anything like a class pointer for any object on the stack is in the case of something cast to an interface, because interface objects carry metadata around with them.

These days, Go doesn’t even stack allocate all non-escaping local variables… sometimes they will exist only within registers! Even better than the stack.

chrisseaton · on March 31, 2022

> These days, Go doesn’t even stack allocate all non-escaping local variables… sometimes they will exist only within registers!

Did you read the article I linked? That's what it says - and this isn't stack allocation, it's SRA. Even 'registers' is overly constrained - they're dataflow edges.

coder543 · on March 31, 2022

Go isn't just doing SRA, as far as I understood it from your article, though it is certainly doing that too. Go will happily allocate objects on the stack with their full in-memory representation, which does not include any kind of class pointer.

Here is a contrived example that creates a 424 byte struct: https://godbolt.org/z/5cKh7xTzq

As can be seen in the disassembly, the object created in "Process" does not leave the stack until it is copied to the heap in "ProcessOuter" because ProcessOuter is sending the value to a global variable. The on-stack representation is the full representation of that object, as you can also see by the disassembly in ProcessOuter simply copying that directly to the heap. (The way the escape analysis plays into it, the copying to the heap happens on the first line of ProcessOuter, which can be confusing, but it is only being done there because the value is known to escape to the heap later in the function on the second line. It would happily remain on the stack indefinitely if not for that.)

It's cool that Graal does SRA, but Go will actually let you do a lot of work entirely on the stack (SRA'd into registers when possible), even crossing function boundaries. In your SRA blog post example, when the Vector is returned from the function, it has to be heap allocated into a "real" object at that point. Go doesn't have to do that heap allocation, and won’t have to collect that garbage later.

Most of the time, objects are much smaller than this contrived example, so they will often SRA into registers and avoid the stack entirely… and this applies across function boundaries too, from what I’ve seen, but I haven’t put as much effort into verifying this.

coder543 · on March 31, 2022

I was addressing your statement that seemed to say class metadata is included in every language when dealing with stack variables. I trusted your earlier statements that this was definitely the case with Java/Truffle. Misunderstandings on my part are entirely possible.

Sorry I haven’t had time to read your article. It’s on my todo list for later.

chrisseaton · on March 31, 2022

> at a minimum it seems like you would need arenas for each size class

But TLABs are heterogenous by size. Objects of all sizes go in one linear allocation space. So allocating two objects next to each other is the same as allocating them both at the same time.

coder543 · on March 31, 2022

That doesn’t answer how the GC knows where the boundaries of each object are.

afiori · on March 31, 2022

No, outside special code it is impossible to know at compile time how many heap allocation a function will have.

The stack has the requirement that its size must be known at compile time for each function. In oversimplified terms its size is going to be the sum of size_of of all the syntactically local variables.

So for example you cannot grow the stack with a long loop, because the same variable is reused over and over in all the iterations.

You can instead grow the heap as much as you want with a simple `while(true){malloc(...)}`.

chrisseaton · on March 31, 2022

> No, outside special code it is impossible to know at compile time how many heap allocation a function will have.

Huh?

    def foo:
      return [new Object, new Object]

That'll always bump the TLAB three times. Why not bump it one time for all objects?

> The stack has the requirement that its size must be known at compile time for each function.

Also huh?

For example a local array that has a dynamic size.

morelisp · on March 31, 2022

> The stack has the requirement that its size must be known at compile time for each function.

Not really. You can bump your stack pointer at any time. Even C has alloca and VLAs. In lots of languages it's dangerous and not done because you can stack overflow with horrible results, and some (but not all) performance is lost because you need to offset some loads relative to another variable value, but you can do it.

What the stack really has a requirement that any values on it will go full nasal demon after the function returns, so you'd better be absolutely certain the value can't escape - and detecting that is hard.

native_samples · on March 31, 2022

Not quite - stack is constantly being reused and thus always hot in the cache. Young gen is never hot because it's much larger and the frontier is constantly moving forwards, instead of forwards and backwards.