I am trying to test memory fragmentation after thread destruction, but I don't know how to destroy threads properly. What is wrong with this program?
import os

var threads: seq[Thread[int]]

proc wait(s: int) =
  for i in 0..10:
    var x = newStringOfCap(100000*(11-i))
    os.sleep(s*1000)

proc cycle() =
  for i in 0..threads.high:
    assert(not running(threads[i]))
    createThread(threads[i], wait, 1)
  echo "Joining", threads.len
  joinThreads(threads)
  echo "Joined", threads.len

proc main() =
  newSeq(threads, 4)
  for i in 0..30:
    cycle()

main()
Joining4
Joined4
Joining4
.... (hangs here)
Interesting. I got it working by moving newSeq(threads, 4) into cycle(). I guess once a thread has finished running, simply releasing the Thread[] object to the GC is the way to destroy it. They are apparently not re-usable.
I haven't been able to cause fragmentation after thread destruction, which is good. But I don't yet know the performance penalty of continually creating new threads, and whether it's justified by the savings in reduced fragmentation...
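For reference, here is a minimal sketch of the working pattern described above (fresh Thread objects allocated inside cycle() each time, never re-used):

```nim
import os

proc wait(s: int) {.thread.} =
  os.sleep(s * 1000)

proc cycle() =
  # Allocate fresh Thread objects each cycle; the previous batch is
  # simply dropped for the GC to reclaim once joined.
  var threads: seq[Thread[int]]
  newSeq(threads, 4)
  for i in 0 .. threads.high:
    createThread(threads[i], wait, 1)
  joinThreads(threads)

for i in 0 .. 30:
  cycle()
```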
I haven't been able to cause fragmentation
Are you sure that
var x = newStringOfCap(100000*(11-i))
is really contained in the executable? Well, maybe with -O0, but perhaps the Nim compiler already removes it?
@cheatfate, Interesting, and good timing.
What about this bit of test/threads/treusetvar.nim for your new code:
+for i in 0..(ThreadsCount - 1):
+  var thread: Thread[Marker]
+  createThread(thread, worker, p)
+  joinThread(thread)
+echo p.counter
Shouldn't var thread: be outside the for-loop?
I tried it with my month-old Nim, and it does indeed hang either way. But why? Isn't var a brand new thing on the stack each time through the for-loop, fully zeroed?
My guess: The ThreadId was simply the address, and in this case the address does not change. And that's why it was so important to make Thread objects re-usable. Yes?
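One way to check the address part of this guess (a sketch; stack layout is implementation-dependent, but a loop-local var typically lands at the same offset every iteration):

```nim
proc demo() =
  for i in 0 .. 2:
    var x: int
    # Usually prints the same address on each pass through the loop,
    # since the loop body re-uses the same stack slot.
    echo cast[int](addr x)

demo()
```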
@cheatfate, Wonderful! Not just a new feature, but also a fix for a dangerous bug.
Independent of that, I have found that create/destroy of threads is very fast (on OSX). So Araq's idea of relying on thread destruction for fast memory clean-up works beautifully.
(Large memory allocation is still expensive, but that's a separate issue, possibly just the cost of bzero.)
Another question: When a thread ends (for later re-spawning) in the threadpool, will it now have its heap quickly-cleaned, as for thread destruction?
Alternatively, are threadvars persistent across spawns of the same thread (by chance) in a threadpool?
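A quick probe of the threadvar question, assuming the stock threadpool module: if a spawned proc ever observes a per-thread counter above 1, the pool re-used a thread without clearing its threadvars.

```nim
import threadpool

var callCount {.threadvar.}: int   # per-thread counter

proc probe(): int =
  inc callCount    # persists across spawns landing on the same pool thread
  result = callCount

var results: seq[int] = @[]
for i in 0 ..< 8:
  results.add(^spawn probe())
sync()
# Any value > 1 means a pool thread was re-used with its threadvars intact.
echo results
```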
Maybe there is still a bug in Thread? I now use threads in a very simple way:
for q in get_seq_data(config, min_n_read, min_len_aln):
  var (seqs, seed_id) = q
  log("len(seqs)=", $len(seqs), ", seed_id=", seed_id)
  var cargs: ConsensusArgs = (inseqs: seqs, seed_id: seed_id, config: config)
  if n_core == 0:
    process_consensus(cargs)
  else:
    var rthread: ref Thread[ConsensusArgs]
    new(rthread)
    createThread(rthread[], process_consensus, cargs)
    joinThread(rthread[])
... (threadpool first creates 48 threads, even though I do not use threadpool.)
[New Thread 0x7ffff015a700 (LWP 202052)]
[New Thread 0x7fffefedb700 (LWP 202053)]
[New Thread 0x7fffefbdc700 (LWP 202054)]
main(n_core=1)
len(seqs)=25, seed_id=2
[New Thread 0x7fffef52b700 (LWP 202055)]
[Thread 0x7fffef52b700 (LWP 202055) exited]
len(seqs)=98, seed_id=14
[New Thread 0x7fffef52b700 (LWP 202056)]
[Thread 0x7fffef52b700 (LWP 202056) exited]
len(seqs)=58, seed_id=15
[New Thread 0x7fffef52b700 (LWP 202057)]
[Thread 0x7fffef52b700 (LWP 202057) exited]
len(seqs)=43, seed_id=22
[New Thread 0x7fffef52b700 (LWP 202058)]
[Thread 0x7fffef52b700 (LWP 202058) exited]
len(seqs)=55, seed_id=25
[New Thread 0x7fffef52b700 (LWP 202059)]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffef52b700 (LWP 202059)]
deallocOsPages_e5IRqVbks39a9bBzvLjGxw2g (a=0x7ffff7f3d0c8) at /home/UNIXHOME/cdunn/repo/gh/Nim/lib/system/alloc.nim:740
740 osDeallocPages(it, it.origSize and not 1)
(gdb) bt
#0 deallocOsPages_e5IRqVbks39a9bBzvLjGxw2g (a=0x7ffff7f3d0c8) at /home/UNIXHOME/cdunn/repo/gh/Nim/lib/system/alloc.nim:740
#1 0x00000000004143f3 in deallocOsPages_njssp69aa7hvxte9bJ8uuDcg_3 () at /home/UNIXHOME/cdunn/repo/gh/Nim/lib/system/gc.nim:107
#2 threadProcWrapStackFrame_dXJaXMz804k05DGz7X4RkA (thrd=0x7ffff7f79328) at /home/UNIXHOME/cdunn/repo/gh/Nim/lib/system/threads.nim:427
#3 threadProcWrapper_2AvjU29bJvs3FXJIcnmn4Kg_2 (closure=0x7ffff7f79328) at /home/UNIXHOME/cdunn/repo/gh/Nim/lib/system/threads.nim:437
#4 0x00007ffff76ba182 in start_thread (arg=0x7fffef52b700) at pthread_create.c:312
#5 0x00007ffff73e700d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) l
735 when defined(debugHeapLinks):
736 cprintf("owner %p; dealloc A: %p size: %ld; next: %p\n", addr(a),
737 it, it.origSize and not 1, next)
738 sysAssert it.origSize >= PageSize, "origSize too small"
739 # note:
740 osDeallocPages(it, it.origSize and not 1)
741 it = next
742 when false:
743 for p in elements(a.chunkStarts):
744 var page = cast[PChunk](p shl PageShift)
(gdb) p it
$1 = (BigChunk_Rv9c70Uhp2TytkX7eH78qEg *) 0x101010101010101
That is with Nim origin/devel up-to-date, at
commit 172a9c8e97694846c3348983a9b2b7c2931c939d
Author: Dominik Picheta <[email protected]>
Date: Mon Mar 27 12:14:06 2017
My program works fine without threads (n_core=0). It worked fine when I used threadpool.
Another problem with this approach is that it goes 3x slower (despite using GC_disable within the thread) than my single-threaded version, which was 3x faster than C+Python/multiprocessing. Very disappointing. The single-threaded version also suffers an explosion in memory fragmentation, though not as bad as before I started re-using strings and seqs within each task.
So at this point, I've lost my runtime advantage; I have to jump through hoops to avoid memory fragmentation (compared with Python multiprocessing); and now I have this seg-fault.
If anyone wants to debug this, let me know. I can put together a full test-case (via my corporate cloud server). I have 3 test-cases: 75k, 1.4M, and 800M. This crash happens only on the largest, but at least it happens pretty quickly.
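For context, the GC_disable-within-the-thread pattern mentioned above looks roughly like this (a sketch with a hypothetical worker body, not my actual process_consensus):

```nim
proc worker(n: int) {.thread.} =
  GC_disable()           # suppress GC cycles during the hot loop
  var total = 0
  for i in 0 ..< n:
    var s = newStringOfCap(1024)
    s.add('x')
    total += s.len
  GC_enable()            # re-enable before the thread exits and its heap is torn down
  echo "processed ", total

var t: Thread[int]
createThread(t, worker, 1000)
joinThread(t)
```

The idea is that per-task garbage is reclaimed wholesale when the thread's heap is destroyed, so running the collector mid-task buys nothing.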
Sorry, I couldn't test this program
Could you tell me the problem? @bpr got it working (and failing). Try a fresh clone.
I'm guessing that @Araq meant he didn't have time to download and try out the program at all, not that he downloaded and couldn't get it to work. After the initial problem was fixed it was quite simple to get the results you described (thanks!) so I can't imagine it was a problem there.
I also had little time to experiment so I have no new info. Also, I'm hoping that @Araq or someone else gets there first and solves it :-)
With that fix on devel, it no longer hangs on OSX, but it still seg-faults on Ubuntu.
Also, on OSX it runs fine, seems fast, but sucks up lots of virtual memory -- about 10GB/min (yes, ten). I have only 8GB of real RAM on my Mac, so I was surprised to see a process taking 40GB. New threads are clearly not re-using the released memory of discarded threads.
Excellent!
Yes, it is working now, on both Ubuntu (and Centos6.6, built on Ubuntu) and OSX. With one worker thread, memory consumption is very low and stable, around 300MB. Beautiful!
With the earlier fix, there was actually a disturbing diff between expected and new output, indicating a really subtle memory bug, but that is fixed on origin/devel now too.
I will concentrate on runtime next, and experiment with multiple threads.
The profiler seems to hang when using more than 1 worker thread. Is that expected? Unsupported?
@bpr, could you verify? I have pushed an update that supports N threads, stored in a seq. Until we are sure, let's discuss this via email, or in:
Now the problem is a sudden jump in freemem on the main thread, and huge use of virtual memory. E.g.
+ log("tot=$1 occ=$2, free=$3 b4" % [$getTotalMem(), $getOccupiedMem(), $getFreeMem()])
+ GC_fullCollect()
+ log("tot=$1 occ=$2, free=$3 now" % [$getTotalMem(), $getOccupiedMem(), $getFreeMem()])
$ time N=4 SIZE=huge make
../main.exe --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 500 --n_core 4 > out.nim.fasta < data/la4.huge/huge.la4falcon
main(n_core=4)
len(seqs)=25, seed_id=2
tot=4206592 occ=3895296, free=311296 b4
tot=4206592 occ=1511424, free=2695168 now
len(seqs)=98, seed_id=14
tot=12738560 occ=5505024, free=7233536 b4
tot=12738560 occ=5517312, free=7221248 now
len(seqs)=58, seed_id=15
tot=8822784 occ=8105984, free=716800 b4
tot=8822784 occ=5414912, free=3407872 now
...
len(seqs)=37, seed_id=42
tot=9052160 occ=6340608, free=2711552 b4
tot=9052160 occ=5013504, free=4038656 now
len(seqs)=49, seed_id=43
tot=9052160 occ=6770688, free=2281472 b4
tot=9052160 occ=6270976, free=2781184 now
len(seqs)=55, seed_id=53
tot=2156535808 occ=7073792, free=2149462016 b4 !!!!!!!!!!!
tot=2156535808 occ=6787072, free=2149748736 now !!!!!!!!!!!
len(seqs)=50, seed_id=57
tot=9445376 occ=7110656, free=2334720 b4 ???
tot=9445376 occ=6815744, free=2629632 now
len(seqs)=29, seed_id=58
tot=9445376 occ=7041024, free=2404352 b4
tot=9445376 occ=6066176, free=3379200 now
...
See the sudden jump? Something weird is going on. (That is with GC_fullCollect() between "b4" and "now" on the main thread.)
Virtual memory jumps around, as low as 1TB and as high as 60TB. So I think there is still a problem, though not as bad as before: no crash, no increase in system RAM, decent runtime.
@bpr never duplicated that problem, even on Ubuntu14. Things are fine on my MacAir, and on a Centos7 machine at work. But the Ubuntu14 machines at work have this strange behavior -- except they seem fine when my Nim program is run in serial mode.
Part of the problem was me. I had work-arounds in the threadpool library which had a strange effect with Araq's new code. Without those work-arounds, virtual memory is stable, but runtime is still horrible on the Ubuntu14 virtual machines. So I can't explain it, but I guess it's something with the set-up at work.