I've got a multi-threaded program, with threads communicating via channels. Unfortunately, over time, the thread count decreases. I know the threads die because sometimes another thread will raise an exception with a "thread died" message when it tries to send a message over a channel. I also monitor the process's thread count and can see it gradually decreasing. The threads that do die seem to have random problems under ARC/ORC, but don't show any errors at all under refc.
When I run the program with Valgrind there are no problems reported, and threads don't seem to die, but it's very slow and doesn't use all cores. I read that Valgrind is a VM where there are fewer potential problems such as alignment issues.
Has anyone seen this before?
Thanks, I've made sure I'm using --mm:orc -d:useMalloc but the issue persists without any information on why threads are dying, even with Valgrind.
On Windows I don't see the same issue. It seems this is Linux specific.
I'm using channels 1:1 right now. Channels shared between threads sound good though, unless there's a performance hit (I'll have to benchmark).
If new channels don't fix the issue I'll work on a minimum program to reproduce the issue.
I've updated my code to make sure globals aren't used. I now get exceptions. I've actually seen similar exceptions before, just not while I was preparing to write the initial post in this thread.
I didn't yet have stack traces turned on for this one:
double free or corruption (fasttop)
Traceback (most recent call last)
/.choosenim/toolchains/nim-1.6.6/lib/system/seqs_v2.nim(114) myThread
/.choosenim/toolchains/nim-1.6.6/lib/system/arc.nim(164) nimRawDispose
SIGABRT: Abnormal termination.
This was one where I was iterating through an array of an object type:
/.choosenim/toolchains/nim-1.6.6/lib/system/orc.nim(494) nimDecRefIsLastCyclicStatic
/.choosenim/toolchains/nim-1.6.6/lib/system/orc.nim(466) rememberCycle
/.choosenim/toolchains/nim-1.6.6/lib/system/orc.nim(146) unregisterCycle
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Segmentation fault
I think I've fixed the problem. The first/biggest problem was no errors from Valgrind. I had some code using a DB connection with a global, but this wasn't used by any threads I created so I didn't think it was a big deal. I fixed that and some other issues.
Then Valgrind started showing errors and I've fixed those. Now my program seems stable again, but I continue to test.
I'm actually not yet using the new channels, I'd rather only do that if I have to. But thanks for the advice.
I think I've solved the original problem now. My user on the Linux instance had a very limited number of open file descriptors available as shown by ulimit -n. Increasing this limit to a higher number seems to fix it.
This also explains why the problem wasn't seen running the same program under Windows or on Linux with Valgrind, since both are different environments which presumably don't have the limitation my Linux user had.
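For reference, a minimal sketch of checking and raising that limit from a shell, assuming bash or a similar POSIX shell. The variable names are just for illustration. The change only applies to the current shell session and to programs started from it afterwards; it is not a permanent setting.

```shell
# Show the soft and hard limits on open file descriptors for this shell.
soft_limit=$(ulimit -S -n)
hard_limit=$(ulimit -H -n)
echo "soft: $soft_limit, hard: $hard_limit"

# Raise the soft limit up to the hard limit for this session only.
# Going above the hard limit normally requires root or a system config change.
ulimit -S -n "$hard_limit"
echo "new soft limit: $(ulimit -S -n)"
```

Start the program from the same shell after running this so it inherits the raised limit.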
It looks like the issue never went away. However, using valgrind --tool=helgrind I found a data race related to Chronicles logging. Initially I thought Chronicles might be able to handle logging from multiple files internally, but to be safe I modified my code to log each thread's output to a separate log file. However, the data race still occurred.
If anyone's interested, I've attached a minimal reproducible test case in the issue I logged.