Hi,
This is long, and mostly technical. Please correct me if I'm wrong anywhere, and let me know if I should share writings like this in the future.
I didn't understand why Valgrind (a tool for detecting memory anomalies) reported that my simple Nim binary used uninitialised values. I'm not the first to notice this, see: https://github.com/nim-lang/Nim/issues/3063
The answer from the core developers to this issue is that Valgrind doesn't know about Nim's GC memory management. I see, but how does that lead to the use of uninitialised values? Valgrind is a great tool, and it would be great to use it without any false warnings; false warnings might also send the wrong message about Nim. There were questions in my head: what's happening? Could we fix this? Can we zero that memory? So I started to read code.
I have a 64-bit Linux operating system, so all numbers will reflect that (I'll read amd64 as true, and focus only on the POSIX parts). Nim runs the garbage collector only if I try to allocate new memory or trigger it manually. That sounds reasonable, I like it.
Now let's look at that uninitialised value. There is an if condition in gc.nim, in proc gcMark(...):
if c >% PageSize:
...
This can be unsafe according to Valgrind. Since PageSize is a constant, we need to find out what c is. markStackAndRegisters(...) invokes the forEachStackSlot template from gc_common.nim, and that template calls gcMark(...). There is some pointer magic, but effectively the variable c walks over what c_setjmp wrote into a variable called registers. And beyond that: it checks everything until we reach the bottom of the stack.
It turns out that we can take a snapshot of the registers in C with a function called setjmp. The state is restorable with longjmp. These functions can be used to implement exceptions: if anything goes wrong, we can quickly forget about the newer calls on the stack (we may be much deeper by that time) and restore the previous state.
However, here we are in the middle of a GC run; there is no plan to restore anything. We need to check the registers for possible pointers. If we have a pointer which points to memory allocated by Nim, then we don't want to free that memory. And the registers matter: imagine that a C compiler wants to do a good job and, instead of storing a value on the stack (as usual), figures out that a processor register could hold it (much faster). Therefore we need to check the registers too.
In the middle of checking the registers there is an uninitialised value. We are talking about 200 bytes which consist of 3 groups: 64+8+128 bytes (from setjmp.h, bits/setjmp.h and bits/sigset.h). The first 64 bytes contain the register environment; the rest of the bytes are about signals. Valgrind is fine with the first 64 bytes (yeah!), but the next 8 bytes are uninitialised. So I modified the code to check only the first 64 bytes (without any knowledge about signals or the damage I might have done). But there were more uninitialised values later...
The code doesn't only check the registers; it checks the whole stack down to its bottom. On my CPU the stack's bottom is a fixed memory address, and everything on the stack is placed before (below) that address (stackIncreases = false). So at each function call the parameters, the local variables, everything is stored before the stack's bottom. Therefore we need two pointers to check everything on the stack: the two ends. At the start of the program we store the stack's bottom (setStackBottom(...) is called pretty early in the generated C file). Obtaining the top of the stack is easy: we just need any recent variable's address, so let's use the address of this registers variable. That's why in forEachStackSlot(...) the code keeps going while sp <% max, where sp initially points to registers and max is gch.stackBottom.
So we still have uninitialised values, but how? My guess is that each time we use something smaller than 8 bytes, for example a char, padding might be needed: everything starts at a position that is a multiple of 8 bytes. If you have an array of char with length 10, it will probably take 16 bytes. Nim doesn't touch those 6 padding bytes, the GC doesn't know/care about the structure, and so it checks those 6 padding bytes as if they could be part of a valid pointer.
I have so many thoughts here. Let me store my favorite number in a static variable and leave it there. If Nim, by chance, uses that same numeric value as the address of a dynamically allocated object, then that memory won't be freed until the end of the program, because my favorite number blocks it. Moreover, if I store only one random byte (a char) and a previous function call had put a valid pointer into the 8-byte slot where this byte now lives, then with a 1/256 chance my byte equals the one it overwrote, so exactly the same pointer value survives and prevents Nim from freeing that memory block (I'm assuming no zeroing, based on Valgrind's warnings).
On the other hand: if you have a memory address that you will need in the future, you should keep a pointer to it at all times. If you add 100 to that pointer and only later subtract 100 (a very unlikely scenario), the GC may see no reference to the block in the meantime, so the pointer can end up pointing to an already freed chunk.
I can see two different (not good) approaches to fix this: we could give the GC more knowledge about the stack (that would be slow: extra work at each function call, or the GC would need to understand the stack layout), or we could zero the unused bytes generated by padding (how much padding C inserts is implementation-defined, so this sounds like endless developer work). Neither of these ideas seems promising.
Because of these cases, it is not possible to guarantee that a finalizer gets called in the current system; see this forum topic: http://forum.nim-lang.org/t/1652
People talk about reference-counting solutions: every time a pointer goes out of scope, we would decrement a counter (and increment it when the pointer is copied). I guess we could have something similar to Rc in Rust; maybe some extra compiler support is needed. As long as the pointer handles are stored on the stack and the counter lives on the heap, I think it is doable.
However, I really like the current approach. It is amazingly fast. If you do anything with your data in a for loop (for simplicity, please don't allocate...), then imagine the pure C code which is generated from your code: there is no reference counting at all! It is pure speed. Amazing! Unfortunately, the multithreading-related problems are real.
By the way, I also found that not malloc but mmap is used to allocate large memory chunks. This was new to me; I only knew the "read one file with mmap" usage. Google told me that malloc is essentially an intelligent wrapper around mmap which tries to minimise the number of mmap calls. So it's good that Nim does this by itself and uses mmap directly instead of going through the extra layer provided by malloc.
I still don't know everything about ref and ptr, and I barely touched the code which checks pointers on the heap.
Thanks for reading, Peter
Thanks for your excellent analysis, it helps my understanding a lot.
I went back and re-read the description of Nim's GC, and picked up what I had missed before: references to the heap that are on the stack do not contribute to the reference count. Hence the stack must be scanned before any heap object is discarded. Which, in turn, means that a scan must be forced if one needs to trigger calls to the finalizers (and, as you explain above, the probable presence of uninitialized portions of the stack neutralizes any guarantee even then). Now it all makes sense. The issue of how to manage shared memory without a stop-the-world GC remains, but at least it makes sense.
It's actually not that hard: Make the thread that owns the data (!) call GC_ref/GC_unref at strategic places via some form of message passing. Biggest problem: The GC doesn't have any support for this message passing yet. It needs some "foreign thread requests GC_ref/GC_unref" queues. Patches welcome. ;-)
You can then put these GC requests in a defer statement (or later in a destructor once these are stable...) and you're good to go.
One additional headache is safety: You need to wrap the foreign refs properly and ensure, either statically or dynamically, that you don't create data structures that mix the different heaps in a wild fashion.
@Araq Thanks very much. If I understand your comment correctly, the key is to ensure that the foreign reference is only on the client thread's stack, so that it can detect when the foreign reference goes out of scope and message the owner/manager thread accordingly. Sounds do-able, even for someone with my limited knowledge :).
Time for a bit of hacking.