A very good writeup, but one thing always confuses me.
What is meant specifically by the "heap" and "stack"? I know what a stack is, but "heap" gets thrown around in many different contexts and I've yet to find any explanation that made it clear.
If anyone has a good explanation or good links for those two terms in this context, I'd be very grateful. Thanks!
A stack is a data structure that grows and shrinks, and can only be accessed from one end.
The stack usually means the call stack, or C stack, where one of the CPU's registers always points to the top of a stack structure in memory. That way it can be used to quickly allocate and free memory for function calls and function-scoped temporaries, and it has no trouble handling (limited) recursion.
Heap refers to the part of memory that is fundamentally just a flat address space requested from the OS. On top of that lives the heap allocator, whose job is to portion it into usable chunks and try to reuse returned space. In C that is malloc/free; in C++, new and delete.
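For example, in C (a minimal sketch; error handling kept to the bare minimum):

    #include <stdlib.h>

    int main(void) {
        /* Ask the heap allocator for room for 100 ints. */
        int *data = malloc(100 * sizeof *data);
        if (data == NULL)
            return 1;       /* heap allocation can fail */

        data[0] = 42;       /* use the chunk like any other memory */

        free(data);         /* hand the chunk back for reuse */
        return 0;
    }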
In the olden days heap and stack actually referred to the same area of memory, the heap growing upwards from the bottom and the stack growing down from the top. Luckily, today there is enough address space that we don't have to do this anymore.
AFAIK, heap and stack are still in the same "area of memory"; the discipline of separating the two is still enforced (or not) by the operating system. The heap-grows-up, stack-grows-down convention is still common on (32-bit, maybe 64-bit?) x86 too.
This looks pretty well-answered already, but I'll throw in my $0.02.
The stack and heap are the two main regions of memory where data can live in a program. They are (somewhat unhelpfully) named after the data structures originally used to represent them. Pretty much all modern languages have a concept of a stack and a heap, although there are differences both in how explicit they are, and in implementation details.
The stack is the function call stack, used to store data local to a function call. It behaves like a traditional stack, with push and pop operations. When you call a function, a new stack frame is pushed onto the stack, which contains all of the local variables that are only accessible inside that function call. When that function returns, it pops its stack frame (and sends its return value, somehow). The language scoping rules determine how one function may access the contents of an earlier stack frame that called it, but stack frames are basically deleted and inaccessible after they are popped (a source of errors in C).
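To illustrate that last point, here's the classic C mistake (a contrived sketch): returning a pointer into a stack frame that has already been popped.

    #include <stdio.h>

    /* BUG: n lives in this function's stack frame, which is popped
       (and free for reuse) the moment the function returns. */
    int *dangling(void) {
        int n = 42;
        return &n;          /* most compilers will warn here */
    }

    int main(void) {
        int *p = dangling();
        printf("%d\n", *p); /* undefined behavior: the frame is gone */
        return 0;
    }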
The heap is the region of memory where you put data that lives longer than the function that created it. Heap behavior is far less specified, and more variable, between languages than stack behavior is (and involves the OS more), but if you're passing around pointers or references to an object you're probably using the heap. In languages that have them, the malloc and new operators usually mean "put this on the heap." In languages with a notion of reference types and value types, reference types are usually heap-allocated by default.
The distinction between the two types of storage is more important in some languages than others. It's really important in C, where you have to manually manage all your memory, but some of the more dynamic languages (especially the everything-is-an-object ones) use the heap for all user data, and the call stack only for implementation details.
Languages also vary greatly in how much control you have over where things get put. On one extreme, C lets you put any type in any location, with malloc, while on the other I am not aware of any way in JavaScript to control object lifetime. Java and C# both distinguish between value-types (stack-allocated by default) and reference-types (heap-allocated by default), but C# allows user-defined value-types while Java does not. Common Lisp has a standardized way to tell the compiler that it is safe to stack-allocate a value (of any type). And so on.
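To make the C end of that spectrum concrete, here's a small sketch of the same type living in three different places:

    #include <stdlib.h>

    struct point { double x, y; };

    static struct point c;   /* static storage: lives for the whole program */

    void demo(void) {
        struct point a = { 1.0, 2.0 };        /* stack: gone when demo() returns */

        struct point *b = malloc(sizeof *b);  /* heap: lives until free() */
        if (b != NULL) {
            *b = a;
            free(b);
        }
        c = a;
    }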
There are several things I've never had a good understanding of regarding the stack. How big is the stack? Is it a fixed size dictated by the architecture of the processor? I know I've written recursions that overflow it. Other than gut feel and catching an exception after the fact, how do programmers know when their program might overflow it?
> How big is the stack? Is it a fixed size dictated by the architecture of the processor?
It's more to do with the OS or language runtime; I think the language runtime has a bigger impact for most programs.
> Other than gut feel and catching an exception after the fact, how do programmers know when their program might overflow it?
In general, you don't. You might be able to do some system-specific thing on specific systems or just happen to know what it is on some given system, but there's no way to find out in the general case. Worse, there's generally no way to recover from blowing the stack; this has always made programs written using alloca() or, more recently, variable-length arrays potentially unstable.
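For instance, nothing in C stops a variable-length array from exceeding whatever stack limit the OS happens to impose (a contrived sketch):

    #include <stdio.h>

    /* If n comes from untrusted input, this VLA can exceed the stack
       limit. There is no error to catch; the process typically just
       dies (e.g. with SIGSEGV on Unix-like systems). */
    void risky(size_t n) {
        char buf[n];        /* C99 variable-length array, on the stack */
        buf[0] = 0;
        printf("allocated %zu bytes on the stack\n", n);
    }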
Generally, the "heap" is the addressable memory assigned to the process. E.g. you malloc a byte array and get back a reference to the (byte[]) memory object on the heap. The life-cycle of a heap object is explicitly controlled by the program in languages like C/Obj-C/etc. (by freeing the heap memory object); languages with garbage collectors do this implicitly.
The "stack" is the dynamic structure -- its a stack :) -- that maintains the nested call/method/function frames from the initial entry point (e.g. main(..) in C family) so it maintains the path traversing the invocation/call graph to the current executing call/method/function. The 'objects' in a stack are call/function/method parameters, various metadata (e.g. obj ptr in some oo langauges). The life cycle of objects in stack is scoped by the call/function/method itself. Further, typically attempts are made to store (current) stack objects in CPU registers so as to avoid incurring memory access latencies.
Usually when people say "the heap" (especially in the context of garbage collection) they mean the memory where new (non-stack) data/objects are allocated from. Usually this is the bulk of memory an application uses.
There are other meanings of the word "heap" in the realm of data structures, but I haven't personally seen "heap" the data structure talked about (or used) much in the wild.
Traditionally, the stack is a FILO (first in, last out) allocation area. Generally it starts at the top of memory and works its way down as space is allocated, then back up as it is freed.
The heap might start at the low end of unused memory (above the program and libraries) and is used for allocating space with arbitrary release order. This means it has to keep track of what is used and what is available as well as when to "grow the heap" by allocating more pages of memory.
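A toy sketch of the bookkeeping involved (nothing like a production allocator -- no splitting, no coalescing, and the static arena is just a stand-in for sbrk()/mmap()):

    #include <stddef.h>

    /* Stand-in for asking the OS for more pages (sbrk()/mmap()). */
    static char arena[1 << 20];
    static size_t arena_used = 0;

    static void *grow_heap(size_t bytes) {
        if (arena_used + bytes > sizeof arena) return NULL;
        void *p = arena + arena_used;
        arena_used += bytes;
        return p;
    }

    /* Each free chunk records its size and links to the next one.
       (Assume requests are at least sizeof(struct free_chunk).) */
    struct free_chunk {
        size_t size;
        struct free_chunk *next;
    };

    static struct free_chunk *free_list = NULL;

    /* First fit: walk the free list for a big-enough chunk,
       otherwise grow the heap. */
    void *toy_malloc(size_t size) {
        struct free_chunk **p = &free_list;
        for (; *p != NULL; p = &(*p)->next) {
            if ((*p)->size >= size) {
                void *mem = *p;
                *p = (*p)->next;   /* unlink: chunk is now in use */
                return mem;
            }
        }
        return grow_heap(size);
    }

    /* Unlike the real free(), this toy needs the size passed back in. */
    void toy_free(void *mem, size_t size) {
        struct free_chunk *c = mem;
        c->size = size;
        c->next = free_list;       /* available for reuse */
        free_list = c;
    }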
The complexity of generational garbage collection vs. the speed of manual collection makes me feel like the happy medium of speed and simplicity is reference counting, like that found in Objective-C. iPhone apps are fast, but take a bit longer to design, develop, and debug due to memory management issues. Though with experience these can be minimized.
It probably isn't possible without a ton of modification, but I wish the JVM/CLR had an option to garbage collect through reference counting.
Unlikely to happen. Reference counting has poor interaction with the CPU memory hierarchy: It makes all objects larger, making the working set correspondingly larger, meaning your data cache is less effective. It makes code larger because of the extra ref counting twiddling code. Larger code means a less effective instruction cache, as well as being slower because of all the extra operations performed. Reference counts must be frequently modified, and they are scattered all over memory (next to each object), so you get poorer locality and more atomic writes which has all sorts of negative implications for cache-coherent SMP architectures.
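Concretely, the per-object cost looks something like this (a hypothetical C sketch, not how any particular runtime lays things out):

    #include <stdatomic.h>
    #include <stdlib.h>

    /* Every object carries an extra count, enlarging the working set. */
    struct object {
        atomic_int refcount;
        /* ... actual payload ... */
    };

    void retain(struct object *o) {
        /* An atomic read-modify-write at every ownership change:
           extra code, extra coherency traffic on SMP machines. */
        atomic_fetch_add(&o->refcount, 1);
    }

    void release(struct object *o) {
        if (atomic_fetch_sub(&o->refcount, 1) == 1) {
            /* A real implementation would also release every object
               this one points to before freeing it. */
            free(o);
        }
    }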
So while you may not notice the penalty of reference counting since it is spread out over every operation, it's unlikely that it is better than proper GC for most applications. Maybe just for ones which are extremely sensitive to latency, yet don't mind running much slower overall (hard real time? aeronautics and space?).
GC pauses suck the "magic" out of interactive programs. The great thing about Obj-C is that it is trivial to have a C array of vanilla C structs, so your performance-sensitive code is still slim and hot (size and cache locality).
My iOS audio programming mixes C for audio processing and Obj-C for UI and control quite effectively.
I didn't say that GC was free ... malloc has overhead too.
For interactive programs, I would recommend two things to avoid (or hide) the pauses. Firstly, schedule calls which run the GC when it won't be noticed: between frames in an arcade-style game, or just before accepting user input in a turn-based game. Secondly, size the minor heap large enough that all computation between the scheduled GC runs can happen without invoking a minor sweep (but not so large that the minor heap is wasting memory -- some experimentation and tuning required).
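In pseudo-C, with a hypothetical gc_minor() hook standing in for whatever your runtime exposes (OCaml's Gc module has real equivalents, e.g. Gc.minor):

    /* Hypothetical stubs for the parts of a frame. */
    void read_input(void)   { /* ... */ }
    void update_world(void) { /* ... */ }
    void render_frame(void) { /* ... */ }
    void gc_minor(void)     { /* run a minor collection */ }

    void game_loop(void) {
        for (;;) {
            read_input();
            update_world();
            render_frame();
            /* Collect in the slack between frames, where a short
               pause won't be noticed, rather than mid-frame. */
            gc_minor();
        }
    }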
I've written a couple of interactive games in OCaml that used this strategy and avoided visible GC pauses.
Last I looked, it was a bit hard to follow, but iOS seemed to be storing reference counts separately from the objects. I think you could do (and maybe iOS does?) this on a per-thread basis, to reduce the need for locks.
While GC requires tuning for the best performance, (pure) reference counting imposes a different cost: because it can't collect circular structures, the programmer has to break cycles herself. This clutters the application code. So you can't simply take a program written to run on a GC platform and move it to a reference counting platform unmodified.
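The classic failure case, as a contrived C sketch: two objects that retain each other, so pure reference counting can never free either one.

    #include <stdlib.h>

    struct node {
        int refcount;
        struct node *other;     /* strong reference to the peer */
    };

    void leak_a_cycle(void) {
        struct node *a = malloc(sizeof *a);
        struct node *b = malloc(sizeof *b);
        a->refcount = 1;               /* our reference to a */
        b->refcount = 1;               /* our reference to b */
        a->other = b; b->refcount++;   /* a retains b */
        b->other = a; a->refcount++;   /* b retains a */

        /* Drop our references: both counts fall to 1, not 0.
           Neither node is reachable, yet neither is freed. */
        a->refcount--;
        b->refcount--;
    }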
Alternatively, you could use both: do most of your reclamation with reference counting, but throw in an occasional GC to collect circular structures. CPython, the most commonly used Python implementation, does this. I doubt it is the best thing for performance, but CPython isn't very fast anyway.
It really depends on the task. For an embedded device with a UI, where memory is at a premium, reference counting is probably the way to go.
But for pure throughput, especially on a multiprocessor machine, you really want to go with a tracing garbage collector. In a multiprocessor system you have to keep caches coherent, and writes to the reference count have the effect of invalidating remote cache lines, increasing coherency traffic, etc.
Reference counting is slower than proper garbage collection, since it adds an overhead for each access, instead of just during the collection. (There are ways to make reference counting faster, but they are no longer quite so simple.)
It's not quite true to say reference counting adds overhead "for each access" -- for example, you can easily have a loop which accesses an object but doesn't modify the reference count. Reference counting adds overhead each time someone "expresses an ownership interest" in an object (wording from http://developer.apple.com/library/ios/#documentation/genera...).
(That's not to say your main point is false. I don't actually know which is faster in general.)
This is true, but has to be thought through very carefully in multithreaded environments where another thread might remove an object from a collection, causing it to be deleted while you're still looking at it. Safe looping then requires locks or cloning the collection (which is suddenly more expensive because it will hit all the contents' ref counts).
In summary, safe and efficient multithreaded code is never easy.
Reference counting as in Objective C may be a happy medium for now, but that won't last. As the underlying hardware and VMs in mobile devices continue to improve, the need for developers to spend extra time on memory management will be seen as more and more of an unnecessary burden.
What are the long term implications of the increasing amount of parallel computing resources available to programmers? Are we getting to a point where there is enough excess CPU available, that the extra instruction on every assignment for reference counting is no big deal? (In certain contexts. In some contexts extra instructions are always a big deal, but these don't span all of computing.) Combine that with incremental algorithms for cycle reclamation, and you'd have great low-latency GC.
It's still not trivial even with massive parallelism. Performance is easily killed if you:
1) Completely saturate the memory interconnect between the processors and RAM chips
2) Have your parallel collector threads touching the same memory that your worker threads are -- you're going to get some cache contention
3) Share written-to cache lines between processors, which is one of the big challenges. For example, create two ref cells and have two parallel threads set them -- if the compiler did not happen to pad them out to cache line size, they'll be fighting over the same 64 bytes (8 64-bit words/pointers) of memory
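For example, in C11 you can force each counter onto its own cache line (assuming 64-byte lines, which is typical for x86):

    #include <stdalign.h>

    /* Without the alignas, these two counters would likely share a
       64-byte cache line, and two threads updating them in parallel
       would ping-pong that line between cores (false sharing).
       alignas(64) gives each counter its own line. */
    struct padded_counter {
        alignas(64) long value;
    };

    struct padded_counter counters[2];  /* one per thread */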
We (Manticore) have done a lot of work to isolate the per-CPU work, heap pages, etc. (http://arxiv.org/abs/1105.2554 , MSPC 2011 proceedings should be out on acm.org soon-ish). But it's not easy and requires rearchitecting your whole compiler and runtime system. GHC is also in the middle of some similar work, and they have a much harder problem both due to the challenges of implementing laziness and they have a full-featured system and real user base, so they can't just change things willy-nilly.
Further, once you get the parallelism right, you end up in a different situation: Amdahl's Law bites. Even if 90% of your program is amenable to parallel work, if you make that portion go infinitely fast, you still have a sequential portion that takes 10% of the time and therefore limits you to a maximum of 10x speedup.
We are hitting that limitation right now in our research and going back to do more sequential performance work. I've got a small army of undergraduates implementing compiler optimizations and analyses :-)
If the object can be accessed (not modified, just accessed) from multiple threads, the extra instruction needs to be a thread-safe atomic increment, which is hugely expensive, and becomes more expensive the more cores you have, especially on a NUMA architecture.
Also, when an object is freed, you need to decrement the reference count of every single object it points to, even objects that are referenced from many other places and clearly won't be reclaimed for a long time.
I enjoyed this basic introduction to GC; upvoted. What I'd look forward to is further discussion of incremental and concurrent GC algorithms.
Until then, does anyone know what makes concurrent GC non-trivial? It seems like it shouldn't be too hard to trace concurrently with program execution. And if you don't compact, collection seems to just involve updating some structure tracking free blocks. I'd imagine it's possible to write a thread-safe version of that structure where every "free" request doesn't need to block every "malloc" request. But I must have missed something.
I'd also be interested to read how compaction works; how are references remapped from the old address to the new one? Is it possible that a reference value is a pointer to a reference "object" which contains the pointer to data, which needs to be updated? Then you only need to update a single pointer when moving data, but every dereference incurs an extra layer of indirection.
Concurrent GC is somewhat non-trivial because the mutator (i.e. the program you write) can delete references to objects in an area of the heap that has not yet been examined by the GC and introduce references to those same objects in an area that has already been examined. This means those objects would be wrongly reclaimed, since the GC never saw them.
To cope with this the mutator is modified at compile/run time to inform the GC of object updates that could lead to this situation.
During compaction of any sort, the first time an object that is to be moved is encountered, it is copied somewhere else and the old copy's header is overwritten with its forwarding address. Every time the GC encounters a reference to the old object, it rewrites the reference with the object's new location (conveniently stored in the old copy's header).
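Roughly, in C (details vary wildly between collectors; this is just a sketch of the forwarding trick):

    #include <stdbool.h>
    #include <string.h>

    /* Per-object header for this sketch. */
    struct header {
        bool forwarded;     /* already moved? */
        void *forwarding;   /* if so, its new address */
        size_t size;        /* total object size, header included */
    };

    /* Bump pointer into the destination ("to") space. */
    static char to_space[1 << 16];
    static char *to_free = to_space;

    /* First visit: copy the object and stash its new address in the
       old copy's header. Every later visit just reads that address,
       so all references get rewritten to the new location. */
    void *evacuate(struct header *old) {
        if (!old->forwarded) {
            void *copy = to_free;
            to_free += old->size;
            memcpy(copy, old, old->size);
            old->forwarded = true;
            old->forwarding = copy;
        }
        return old->forwarding;
    }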
> Is it possible that a reference value is a pointer to a reference "object" which contains the pointer to data, which needs to be updated?
Look up something called a 'Brooks-style forwarding pointer'; it is essentially what you've described.
Thanks; upvoted! I think I'm beginning to appreciate that concurrent marking is tougher than I thought; specifically, I can see how it can be hard to prove an object is unreachable. So I imagine the marking side is where the difficulties are.
Your explanation for compaction makes perfect sense. Of course, this won't work trivially concurrently. Only if you stop the world can you complete examination of the entire live heap and know you've updated all references to the moved object and can collect the original space.
> I can see how it can be hard to prove an object is unreachable.
The difficulty with concurrent GC is not to prove an object is unreachable but to prove it is likely to be reachable. For instance, Yuasa-type (or snapshot-at-the-beginning) concurrent collectors have a tendency to keep dead objects alive. This 'floating garbage' is a small price to pay for a fairly simple correctness proof of never dropping a reachable object.
It's no harm for a GC to keep some dead objects alive but dropping something reachable is completely unacceptable.
> Of course, this won't work trivially concurrently. Only if you stop the world [..] can collect the original space.
Yes, moving objects concurrently requires some serious thinking to get right. Sun's G1 collector does this by recording the location of pointer updates that point into regions which are to be compacted. This pushes the pause times down to the length of time to move objects and update references to them.
If you're real serious about moving objects concurrently, see "A Study of Concurrent Real-Time Garbage Collectors", Pizlo et al, 2008.
While the article is interesting, it skips important things that have a big influence on practical GC speed: write barriers and finalizers. The following ancient article from Microsoft has better coverage of GC internals (somewhat biased toward .NET :) ): http://msdn.microsoft.com/en-us/library/ms973837.aspx
> The default choice of garbage collector in Hotspot is the throughput collector, which is ... entirely optimized for throughput
I just want to confirm this is true? Say you're doing a long running simulation. You don't care about pauses at all. You just want it to finish fast. The default GC with no particular options is the way to go?
Well, to not answer your question: Hotspot performance tuning is a fine art, and benchmarking is a somewhat separate fine art. People will sample their data carefully, do a lot of runs, warm up properly, use different GCs, tune the heap, GC generations, MaxInlineSize, maybe 32-bit (-server). I remember reading that Linux package managers often don't include -client in the 64-bit install. JDK7 may be faster; many people have said there's no difference between Sun/Oracle and OpenJDK (except the launcher and fonts -- for IDEA, you must have the Sun JDK), and in some cases JRockit performs more predictably.
I'd say odds are that if you're just interested in throughput, the standard Hotspot GC without any options isn't going to give you optimal performance. I've found that, like gtani mentions, you'll have to spend time tuning to get the performance you want.
In a similar situation, I found that Parallel Old (-XX:+UseParallelOldGC) was, by far, the fastest (throughput) collector.
I'd say the low fruit are -server, -Xmx, -Xms, and MaxInlineSize, and looking for boxing/unboxing. At the other extreme, is it worth it to your company to spend a few weeks reading the Hunt/John book and benchmarking thoroughly?