Hi folks,
I have a multi-threaded application that deadlocks during processing. The deadlock involves Nim's own allocator, through allocShared0.
One weird thing is that it only seems to happen when a signal handler is called during allocation, see stack entry #6 below:
[Switching to thread 4 (Thread 0x7f5840aaa700 (LWP 792241))]
#0 0x00007f5841138110 in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f58411300a3 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x000055cdd5144edd in allocSharedImpl ()
#3 0x000055cdd5144f2a in allocShared0Impl__KzdpcuLT9aef9bsiSHlIu9aFg ()
#4 0x000055cdd5145044 in rawNewString ()
#5 0x000055cdd51450db in signalHandler ()
#6 <signal handler called>
#7 0x000055cdd5144b37 in rawAlloc__mE4QEVyMvGRVliDWDngZCQ ()
#8 0x000055cdd5144e96 in alloc__UxtcZ3QOXKsB7mMchxUf9cg ()
#9 0x000055cdd5144ef0 in allocSharedImpl ()
#10 0x000055cdd5144f2a in allocShared0Impl__KzdpcuLT9aef9bsiSHlIu9aFg ()
#11 0x000055cdd5145044 in rawNewString ()
#12 0x000055cdd5141c48 in nimIntToStr ()
#13 0x000055cdd515c657 in push__MmYw8GiDmwrka3Kbq2YHNA ()
#14 0x000055cdd516cbab in colonanonymous___KhQtN3VqLqUwB4xEnAK61g_4 ()
#15 0x000055cdd5168db4 in stageExecutor__FbCoqNtjNd1Idpx8m3eyog ()
#16 0x000055cdd5169320 in stageExecutorWrapper__rUZUAkVnfg9cVz2WTXW1WHA_2 ()
#17 0x000055cdd5165f61 in slave__2x7X7LtvNxZKjHZ8WiavVQ ()
#18 0x000055cdd5146238 in threadProcWrapDispatch__cEgox17hukUW9cP9aBHnoXeA_2 ()
#19 0x000055cdd5146393 in threadProcWrapStackFrame__cEgox17hukUW9cP9aBHnoXeA ()
#20 0x000055cdd51463cd in threadProcWrapper__KwtUyNVh00QDWGRZcngjGA ()
#21 0x00007f584112d609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#22 0x00007f5841054293 in clone () from /lib/x86_64-linux-gnu/libc.so.6
Other threads have backtraces that look like this:
#0 0x00007f5841138110 in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f58411300a3 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x000055cdd5144edd in allocSharedImpl ()
#3 0x000055cdd5144f2a in allocShared0Impl__KzdpcuLT9aef9bsiSHlIu9aFg ()
#4 0x000055cdd5145044 in rawNewString ()
#5 0x000055cdd5141c48 in nimIntToStr ()
#6 0x000055cdd5168cc6 in stageExecutor__FbCoqNtjNd1Idpx8m3eyog ()
#7 0x000055cdd5169320 in stageExecutorWrapper__rUZUAkVnfg9cVz2WTXW1WHA_2 ()
#8 0x000055cdd5165f61 in slave__2x7X7LtvNxZKjHZ8WiavVQ ()
#9 0x000055cdd5146238 in threadProcWrapDispatch__cEgox17hukUW9cP9aBHnoXeA_2 ()
#10 0x000055cdd5146393 in threadProcWrapStackFrame__cEgox17hukUW9cP9aBHnoXeA ()
#11 0x000055cdd51463cd in threadProcWrapper__KwtUyNVh00QDWGRZcngjGA ()
#12 0x00007f584112d609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f5841054293 in clone () from /lib/x86_64-linux-gnu/libc.so.6
Now it's entirely possible that this is due to some of my own doing, although I'm trying to avoid sharing refs across threads as much as possible.
Eyeballing the backtraces above, where would your mind go in terms of trying to diagnose and pinpoint the issue further? What do you think might trigger the signal handler? Is there a way to identify which locks are owned by which threads? Is it somehow possible that the lock owned by the signal-handling thread isn't reentrant and therefore the thread blocks on a lock that it already owns?
Thanks in advance for any help!
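On the reentrancy question above, a small experiment can show how a plain lock behaves when the thread that owns it tries to take it again. This is only a sketch: it assumes Linux, where a Lock from the std locks module wraps a default (non-recursive) pthread mutex, and older compilers may need --threads:on. The Lock here is a stand-in for the allocator's internal lock, which the sketch does not touch, and secondAcquireSucceeds is a made-up name.

```nim
import locks

proc secondAcquireSucceeds(): bool =
  ## Takes a lock, then tries to take it again from the same thread.
  ## tryAcquire is used for the second attempt so the experiment reports
  ## the outcome instead of hanging the way a blocking acquire would.
  var lock: Lock
  initLock(lock)
  acquire(lock)                # normal code enters the critical section
  result = tryAcquire(lock)    # what a handler on this thread would attempt
  if result:
    release(lock)              # only reached if the lock is reentrant
  release(lock)
  deinitLock(lock)

when isMainModule:
  echo "second acquire by the owning thread succeeded: ", secondAcquireSucceeds()
```

If the second attempt fails, a blocking acquire in the same position (as in frames #1 and #2 of the first backtrace) would wait forever on a lock its own thread holds.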
When I tried Weave with allocShared I had deadlocks as well and I didn't use signals.
For multithreaded applications, I suggest you create your own allocator wrapper that allows you to toggle between the Nim allocator and malloc. https://github.com/mratsim/weave/blob/5034793/weave/memory/allocs.nim#L47-L88
Alternatively with --gc:arc you can try with -d:useMalloc to bypass Nim allocators/locks.
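A minimal sketch of what such a toggle could look like follows. It is not Weave's actual code: the names libcMalloc, libcFree, rawAllocBytes, rawFreeBytes and the useLibcAlloc define are all hypothetical, and the only assumption is that libc's malloc/free are available through <stdlib.h>.

```nim
# Direct bindings to libc, bypassing Nim's allocator and its locks.
proc libcMalloc(size: csize_t): pointer {.importc: "malloc", header: "<stdlib.h>".}
proc libcFree(p: pointer) {.importc: "free", header: "<stdlib.h>".}

# Compile-time switch (hypothetical name); override with -d:useLibcAlloc=false.
const useLibcAlloc {.booldefine.} = true

proc rawAllocBytes*(size: Natural): pointer =
  ## Allocates `size` bytes from whichever backend was selected at compile time.
  when useLibcAlloc:
    libcMalloc(csize_t(size))
  else:
    allocShared(size)

proc rawFreeBytes*(p: pointer) =
  ## Frees memory obtained from rawAllocBytes using the matching backend.
  when useLibcAlloc:
    libcFree(p)
  else:
    deallocShared(p)

when isMainModule:
  let p = rawAllocBytes(128)
  echo "allocation succeeded: ", p != nil
  rawFreeBytes(p)
```

Routing every allocation in the threaded code through one pair of procs makes it possible to rebuild with the other backend and check whether the deadlock follows the allocator. Memory must always be freed by the backend that allocated it, which is why both procs branch on the same constant.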
The bug is the attempt to allocate memory in an asynchronous signal handler...
Signal handlers are like threads, except that they are spawned at arbitrary points and pause the thread they run on. Most code is not written to handle this, and cannot be, which is totally understandable. POSIX has a strict definition of which functions are "async-signal-safe", and malloc is not one of them.
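One common way to stay within those rules is to have the handler do nothing but record that a signal arrived, and to do the real work (anything that allocates, such as building strings) later in normal code. The sketch below assumes a POSIX system and the std posix module; recordSignal, takePendingSignal and pendingSignal are illustrative names, and a plain volatile cint is used here as the counterpart of C's sig_atomic_t flag idiom.

```nim
import posix

# Written by the handler, read by normal code.
var pendingSignal {.volatile.}: cint

proc recordSignal(sig: cint) {.noconv.} =
  # No strings, no echo, no allocation: just remember which signal arrived.
  pendingSignal = sig

proc takePendingSignal(): cint =
  ## Returns the recorded signal number (0 if none) and clears the flag.
  ## A signal arriving between the read and the clear would be lost;
  ## that is acceptable for this sketch.
  result = pendingSignal
  pendingSignal = 0

when isMainModule:
  discard signal(SIGUSR1, recordSignal)
  discard kill(getpid(), SIGUSR1)      # send the signal to ourselves
  let sig = takePendingSignal()
  if sig != 0:
    # Allocation is fine here: we are back in ordinary program flow.
    echo "handling signal ", sig, " outside the handler"
```

A worker loop would call takePendingSignal at a point where it holds no locks, so an interrupted allocation has always finished before any new one starts.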
So signalHandler shouldn't be doing this?
- var buf = newStringOfCap(2000)
+ var buf{.global.} = newStringOfCap(2000)
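If a handler really has to print something, one allocation-free option is to format into a fixed-size array and hand it to write(2), which is on the POSIX async-signal-safe list. This is a sketch under stated assumptions, not the stdlib's approach: it swaps the string buffer from the diff for a stack array, and formatSignalMsg and reportSignal are hypothetical names.

```nim
import posix

proc formatSignalMsg(buf: var array[64, char], sig: cint): int =
  ## Fills buf with "signal <n>\n" and returns the number of bytes used.
  ## Only fixed-size arrays and integer arithmetic: nothing touches the heap.
  const prefix = "signal "
  var n = 0
  for c in prefix:
    buf[n] = c
    inc n
  var digits: array[12, char]
  var count = 0
  var v = int(sig)
  if v <= 0:                   # signal numbers are positive; guard anyway
    digits[0] = '0'
    count = 1
  while v > 0:
    digits[count] = chr(ord('0') + v mod 10)
    inc count
    v = v div 10
  while count > 0:             # digits were produced least-significant first
    dec count
    buf[n] = digits[count]
    inc n
  buf[n] = '\n'
  result = n + 1

proc reportSignal(sig: cint) {.noconv.} =
  ## Handler that prints the signal number without allocating.
  var buf: array[64, char]     # lives on the handler's stack
  let used = formatSignalMsg(buf, sig)
  discard write(STDERR_FILENO, addr buf[0], used)

when isMainModule:
  discard signal(SIGUSR1, reportSignal)
  discard kill(getpid(), SIGUSR1)
```

The digits are written by hand because an integer-to-string conversion such as nimIntToStr allocates, as frames #11 and #12 of the first backtrace show.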
Thanks for all the replies! This helps a lot. I had a suspicion the signal handler was trying to print a stack trace. Per @leorize's and @shirleyquirk's comments, it seems I stumbled on a stdlib bug. Should I file a bug report?
I'll use c_malloc to try to work around the issue and better understand what my program is doing wrong.