nimforum mirror - Thread-local persistence

cdunn2001 (orginal) [2017-03-21T01:39:17+01:00] view original

I need to use some per-thread, pre-allocated data-structures. The point is to avoid re-creating large arrays except just once, the first time each thread uses one.

With Python+C multiprocessing, I simply use a global. Each process has its own global, and Python manages a pool of processes which each persists its only global data. Very convenient.

My Nim version, with multi-threading, is 3x faster than the PyC version on my Mac. But it crashes on Linux, or runs very slowly and consumes quickly exploding amounts of memory. I think this is because I am using {.threadvar.} for the thread-local variable, and on Mac it persists while on Linux ... well I'm really not sure what's going on. But it runs fine on Linux single-threaded iff I drop threadvar.

So I'm wondering if threadvar works differently on Mac than on Linux. And beyond that pure curiosity, I'm wondering how best to handle per-thread persistence with a threadpool.

This is what works on Mac but not reliably (apparently) on Linux:

var top {.threadvar.}: ref Foo

proc go() {.thread.} =
  if top.isNil:
    top = makeNewFoo() # pre-allocate large mutable sequences etc.
  ...

hcorion (orginal) [2017-03-21T01:58:52+01:00] view original

What compiler are you using on both Mac and Linux. If your using clang on Mac and GCC on Linux, try compiling it on Linux using clang.

cdunn2001 (orginal) [2017-03-21T04:04:09+01:00] view original

Good idea. But clang didn't help.

In order to end the crashes, I had to stop relying on persistence. I simply release (set to nil) the top variable at the end of each spawned call.

The performance on Linux is terrible, possibly related to allocations. I'll post if I discover anything interesting. I was just hoping to learn something about the proper way to use threadvar.

euant (orginal) [2017-03-21T11:49:35+01:00] view original

Maybe something like this? I haven't played much with threads in Nim yet: https://github.com/nim-lang/Nim/blob/4f062c3be08fa2bc3e167e1a6b9842c92bc8c8f7/tests/threads/tonthreadcreation.nim

import locks

var
  thr: array[0..3, Thread[int]]
  L: Lock
  top  {.threadvar.}: seq[string]

proc threadDied() {.gcsafe.} =
  echo "Dying: ", top

proc foo(i: int) {.thread.} =
  onThreadDestruction threadDied
  
  {.gcsafe.}:
    top = @["hello", "world"]
  
  acquire(L)
  echo "Top in thread ", i, ": ", top
  release(L)

proc main =
  initLock(L)
  for i in 0..high(thr):
    createThread(thr[i], foo, i)
  
  joinThreads(thr)

main()

Seems to work on my machine (Windows) for that very basic example.

In the stdlib, `onThreadCreation` is used, but it seems to have been removed from `devel`

cdunn2001 (orginal) [2017-03-22T16:26:19+01:00] view original

First, yes, that's one of the problems with spawn: I don't know what thread I'm on. That limits the value of threadpool.

But my crashes are gone when I stop using thread-local nre.Regex. Now I create it locally always. A bit expensive, but safe. I've also dropped my other thread-local var. So thread-local is not the main problem for me.

My problems are runtime on Linux, and memory in general. Maybe memory allocation harms runtime, so I'd like to solve memory first. The problem is, when I print all, occupied, and free memory, I see all and free grow almost continuously. I've tried capping available memory (including virtual memory) with ulimit. Nothing I've tried has helped.

This happens even single-threaded. It's not a "leak", since "occupied memory" is stable. I have no runtime errors in debug-mode, and I've gotten rid of calloc everywhere in my code.

So far, I've been unable to produce a small test-case which reproduces the problem. I'll keep trying to repro with a small example this weekend. But here is my main question: When freelists are re-used in the default memory manager?

I could use a way to dump freelist sizes. Would that be a fair request on GitHub?

cdunn2001 (orginal) [2017-03-22T16:48:41+01:00] view original

Ok. I have a sort-of repro:

https://gist.github.com/pb-cdunn/9ac52d96d1791cbb2412e76865748a13

When I run that in https://glot.io/new/nim, things seem fine. And they are fine on my Mac. (I'll double-check soon.) But on my Linux VM, it does this:


n=1
...
n=1073741824
tot=4296118272 occ=3221426176, free=1074692096 b4
tot=4296118272 occ=3221430272, free=1074688000 now
n=1
...
n=1073741824
tot=5370155008 occ=3758309376, free=1611845632 b4
tot=5370155008 occ=3221438464, free=2148716544 now
n=1
...
n=1073741824
tot=6712696832 occ=3758309376, free=2954387456 b4
tot=6712696832 occ=3221438464, free=3491258368 now
n=1
...
n=1073741824
tot=6712700928 occ=3758313472, free=2954387456 b4
tot=6712700928 occ=3221442560, free=3491258368 now
n=1
...

It remains stable from there on, so maybe the manager prevents any given freelist from growing forever. But even this much is unacceptable. I have 48 cpus on a 96GB machine, so I have only 2 GB available to each thread. Plus, when I run single-threaded (without spawn), my own program's free-space grows without bound, until I kill it. When I use ulimit -m/-v, it runs out of memory, which is also sub-optimal behavior.

If I can be confident that the freelists are contiguous, then maybe I could rely on the Linux swapper to swap them out of memory, but my co-workers would still not approve.

This is a serious problem.

Jehan (orginal) [2017-03-22T23:08:02+01:00] view original

First, some notes:

Linux defaults to --tlsemulation:off, macOS defaults to --tlsemulation:on. I'm not sure how that would affect the behavior you mention, though.

Any computer with 48 cores is going to be a NUMA architecture, so if your code runs very slow, that might be the reason. Tweaking allocation-heavy multithreaded code to run properly on a NUMA architecture can be tricky. It's often best to limit the number of threads to the number of cores on a processor and use numactl to make sure everything happens locally. Use multiprocessing if you want higher scalability.

If you're allocating lots of very large blocks of memory, fragmentation is going to hurt you sooner or later. The only solution for that would be a compacting garbage collector. I'm not sure what you're allocating in your actual code. If you're reading files into memory, the memfiles module might help (reading large files in a memory-constrained situation is generally problematic, as they occupy OS cache space and space where they finally end up).

As for single-threaded free memory growing indefinitely, that is very puzzling. Have you tried running with --gc:markandsweep and --gc:boehm as alternatives? The Boehm GC is a bit more wasteful on memory, but it may help narrowing down the problem.

cdunn2001 (orginal) [2017-03-23T15:03:25+01:00] view original

Thanks for the ideas. I did not get much info from running boehm. When I have more time, I'll try markandsweep, and also tlsemulation.

It's often best to limit the number of threads to the number of cores on a processor...

I don't know how to do that. Nim's threadpool does not offer a way. The previous solution used Python/multiprocessing + Cython. You can argue that it avoids NUMA problems via multiprocessing, but it did not suffer memory growth. Possibly calloc/free uses mmap/munmap for large blocks.

Interestingly, my little Nim example on OSX sped up after total memory usage quickly stabilized. So maybe it's better optimized for large, repeated alloc/free.

But I don't know whether the Linux slow-down is from the large allocs, the many small allocs, or basic processing.

If you're allocating lots of very large blocks of memory, fragmentation is going to hurt you sooner or later. The only solution for that would be a compacting garbage collector. I'm not sure what you're allocating in your actual code.

Yeah, that's what I think is going on. I would love to be able to view the freelist lengths when some flag is set, so I wouldn't have to wonder. Would such a change likely be accepted in GitHub?

Jehan (orginal) [2017-03-24T17:09:23+01:00] view original

I mentioned the NUMA problems because you said you were experiencing poor performance running on 48 cores, which might be caused by memory locality issues. There's little you can do within Nim (other than through OS-specific system calls). You probably want to limit execution to just the cores of a single processor with numactl and check if you still experience the slowdown.

What is puzzling about the memory situation in general is that your free space seems to explode. It wouldn't be surprising if total heap space (live objects + free space) went up as a result of fragmentation, but free space explosion would mean that you've encountered a pathological case. Conservative scanning is unlikely to be the cause, because the amount of wasted memory is generally bounded. This is why I was suggesting to look at --gc:boehm and --gc:markandsweep: not because they might perform better, but to narrow down the diagnosis.

cdunn2001 (orginal) [2017-03-25T17:52:21+01:00] view original

I have some time this weekend to experiment with boehm and markandsweep.

It wouldn't be surprising if total heap space (live objects + free space) went up as a result of fragmentation, but free space explosion would mean that you've encountered a pathological case.

"Pathological"? If dealloc() fragments a large block, then repeated allocation of large blocks would definitely cause memory use to explode. One interesting note: When I briefly tried boehm, it complained one or twice early in the program about my large allocations. Perhaps the Nim mem-managers are simply not intended for repeated alloc/dealloc of large blocks. In that case, I could try making large allocations "threadvar", assuming I can count on data-persistence for a thread in the threadpool. I will experiment with that this weekend.

The performance hit happens even single-threaded, not using spawn. I guess it's possible that hit comes mainly from interactions with the memory manager. Even on Linux, with the simple example I pasted above, you can verify that the already speedy program becomes even speedier after total allocated memory stabilizes, which indicates to me that allocations from existing freelists are quicker than acquiring memory from the system.

I still would like to see freelist sizes. That would immediately answer questions on memory fragmentation. GNU malloc provides far better memory debugging tools than we have available:: https://www.gnu.org/software/libc/manual/html_node/Using-the-Memory-Debugger.html#Using-the-Memory-Debugger

Do you think that a pull-request for showing freelist sizes would be accepted?

Jehan (orginal) [2017-03-26T17:05:00+02:00] view original

cdunn2001: "Pathological"? If dealloc() fragments a large block, then repeated allocation of large blocks would definitely cause memory use to explode.

I was talking about something that sounded like unused space going up while allocated memory remained constant. That would be a pathological case. If I misunderstood you and the ratio unused/used doesn't go up indefinitely, then it wouldn't be.

cdunn2001: One interesting note: When I briefly tried boehm, it complained one or twice early in the program about my large allocations.

That has to do with Boehm being a conservative garbage collector (the Nim GCs scan only the stack conservatively, all other roots and the heap are being scanned precisely) and thus more vulnerable to integers and other non-pointer data being misidentified as pointers.

cdunn2001: I still would like to see freelist sizes.

I'm not sure what you mean by that. Large blocks in the allocator (type BigChunk) are kept in a single list and are allocated using first-fit. Unless you are also allocating a lot of small objects, the amount of free and used memory should be a reasonable approximation in your use case.

cdunn2001: Even on Linux, with the simple example I pasted above, you can verify that the already speedy program becomes even speedier after total allocated memory stabilizes, which indicates to me that allocations from existing freelists are quicker than acquiring memory from the system.

That shouldn't be too surprising. The first time, you pay the overhead for mmap and a lot of page faults as memory is being initialized. Try allocating a large seq, for example, then do the same and write zeros into it. It should take considerably less than twice the time.

In any event, this is all a bit beside the point (the above is about diagnosing the root causes and doesn't give you a solution). Your problem seems to be that large memory allocations lead to more wasted memory that you can justify. The primary problem appears to be fragmentation, and that's tricky to solve, barring a compacting GC; nothing we are talking about here can automatically fix it.

The exact solution will depend on what you need the large memory blocks for. If the allocation patterns are relatively simple to manage, then a possible solution would be to use mmap() with explicit allocation and deallocation (virtual memory fragmentation is less of an issue on 64-bit machines, except for the 64k limit on mmap()-ed blocks on Linux).

To then avoid manual memory management, these large blocks can be wrapped in GCed handlers with finalizers so that deallocation is handled automatically; however, last I checked, Nim didn't have something akin to .NET's GC.AddMemoryPressure to make sure that the GC runs frequently enough. One could track the entire mmap()-ed memory, however, and trigger a full GC when it exceeds a certain amount (relative to total memory occupied).

cdunn2001 (orginal) [2017-03-26T17:38:48+02:00] view original

markandsweep wasted even more memory, and it provided no memprofile info.

With boehm, I saw a crash on a nil "threadvar", which brings me back to my need for clarification on threadvar persistence. I fixed that crash by using --tlsemulation:on, as suggested. boehm then wasted about the same as markandsweep.

Back to refc. I have found one pathological case, and a way to avoid it. I have large allocations which drop and drop and drop, until the next iteration.

echo "Big seq:", $ssize
  newSeq(d_path, ssize)


Big seq:717585
Big seq:707953
Big seq:704943
...
Big seq:57793
tot=79265792 occ=10633216, free=68632576 b4 GC
tot=79265792 occ=11751424, free=67514368 now
Big seq:717585
Big seq:710361
Big seq:681465
...
tot=119255040 occ=12337152, free=106917888 b4
tot=119255040 occ=12070912, free=107184128 now

I mitigated that (fragmentation?) problem by making d_path a "var" parameter, and using setLen instead of newSeq:


tot=37920768 occ=11132928, free=26787840 b4
tot=37920768 occ=12132352, free=25788416 now

That's a big improvement. (Later, I'll apply that to other parts of the code.)

Jehan (orginal) [2017-03-26T17:48:19+02:00] view original

For what it's worth, I threw together a basic implementation of what I wrote above:

# Free Public License 1.0.0
#
# Permission to use, copy, modify, and/or distribute this software for any
# purpose with or without fee is hereby granted.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
# ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
# ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
# OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

import posix, os

type
  RawArray {.unchecked.} [T] = array[0..0, T]
  MemBlock*[T] = ref object
    data: ptr RawArray[T]
    count: int

const
  triggerGCfactor = 3

var
  prevAllocated {.threadvar.}: int
  newAllocated {.threadvar.}: int

# Some missing POSIX parts:

{.emit: """
#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif
""".}

var
  MAP_ANONYMOUS {.importc, header: "<sys/mman.h>".}: cint

proc getpagesize(): cint {.importc, header: "<unistd.h>".}

let pagemask = getpagesize().int - 1

proc blocksize(n: int, elemsize: int): int =
  let size = n * elemsize
  if (size and pagemask) == 0:
    size
  else:
    (size + pagemask) and not pagemask

proc finalizeMemBlock[T](p: MemBlock[T]) =
  discard munmap(p.data, blocksize(p.count, sizeof(T)))

proc newMemBlock*[T](count: int): MemBlock[T] =
  let occupied = getOccupiedMem() + prevAllocated
  let factor = occupied / newAllocated
  if factor <= triggerGCfactor:
    GC_fullCollect()
    prevAllocated += newAllocated
    newAllocated = 0
  let mem = mmap(nil, blocksize(count, sizeof(T)),
    PROT_READ or PROT_WRITE,
    MAP_PRIVATE or MAP_ANONYMOUS,
    -1, 0)
  if mem == cast[pointer](MAP_FAILED):
    raiseOSError(osLastError())
  new(result, finalizeMemBlock[T])
  result.data = cast[ptr RawArray[T]](mem)
  result.count = count

proc `[]`*[T](m: MemBlock[T], index: int): T {.inline.} =
  assert index >= 0 and index < m.count
  m.data[index]

proc `[]=`*[T](m: MemBlock[T], index: int, value: T) {.inline.} =
  assert index >= 0 and index < m.count
  m.data[index] = value

when isMainModule:
  proc main =
    for j in 1..1024:
      echo j
      let a = newMemBlock[int](1_000_000)
      for i in 0..1_000_000-1:
        a[i] = 0
  main()

This is still missing functionality (items/mitems iterators, resizing, etc.) and has only been tested on macOS, but could be a starting point if needed.

Note also that the data cannot contain references to GCed memory (or has to manage them explicitly with GC_ref and GC_unref).

cdunn2001 (orginal) [2017-03-26T17:52:41+02:00] view original

cdunn2001: I still would like to see freelist sizes.
I'm not sure what you mean by that. Large blocks in the allocator (type BigChunk) are kept in a single list and are allocated using first-fit. Unless you are also allocating a lot of small objects, the amount of free and used memory should be a reasonable approximation in your use case.

Yes, I also have lots of small allocations.

At Amazon, in C++, I used a vastly different scheme, which would work beautifully here. The main thread managed sufficiently large contiguous regions of memory. Each thread would take the regions it needed, allocate blocks sequentially within them, and leak everything. At the end of a job, the thread would return all its regions to the main region manager, which would zero them out (for security concerns), and that thread would be returned to the threadpool.

That's simple, secure, and extremely efficient. It's difficult to do in Nim (or in Rust, last I checked) because I cannot provide my own memory manager for portions of the code.

cdunn2001 (orginal) [2017-03-26T18:02:47+02:00] view original

You MemBlock idea is interesting. I guess it has 2 main drawbacks:

I don't know how to cast a seq[] or string into a portion of a RawArray. I guess I could map ptr Object, for some unsafe operations, but a big reason why I'm using Nim instead of C++ is the strong bounds-checking.

mmap/munmap are expensive, so I'd rather not re-run them for each task. If it's still faster than my Python+C alternative, I could live with that though.

mashingan (orginal) [2017-03-26T18:05:19+02:00] view original

It's difficult to do in Nim (or in Rust, last I checked) because I cannot provide my own memory manager for portions of the code.

Are you aware that in Nim you can turn off GC with --gc:none and for manual allocation you can use many alloc families such as allocShared0 ?

Even using ptr would not traced by GC and need somehow manual management for that.

Jehan (orginal) [2017-03-26T18:45:34+02:00] view original

cdunn2001: I guess it has 2 main drawbacks:

You cannot cast from/to something in GCed memory. You'd have to copy the contents. My problem here is that I have only a pretty vague idea of what you're doing, so I'm flying blind and am trying to guess what may help you.

It's done this way for simplicity; it's something that I hacked together in about half an hour. Obviously, a more refined implementation would attempt to reuse memory.

Araq (orginal) [2017-03-26T19:01:01+02:00] view original

That's simple, secure, and extremely efficient. It's difficult to do in Nim (or in Rust, last I checked) because I cannot provide my own memory manager for portions of the code.

No, it's simple to do in Nim, the allocator is thread local anyway, so in your worker thread call GC_disable and after the thread dies, everything is freed (and yes, more efficiently than running over the free lists).

cdunn2001 (orginal) [2017-03-27T13:52:47+02:00] view original

@araq:

and after the thread dies, everything is freed.

That's really the problem. I am using threadpool, which creates 48 threads when the program starts. Limiting the number of threads seems to cause some to die after usage, but I don't understand what that code is doing. Are you suggesting that I create a new thread for each task as needed, and destroy each thread after its task finishes? My question then -- which I keep trying to get an answer to -- is how do I diagnose and avoid memory fragmentation? When "everything is freed", is everything coalesced? If not, then my problem is not solved.

@Jehan:

My problem here is that I have only a pretty vague idea of what you're doing, so I'm flying blind and am trying to guess what may help you.

After I upload my Gigabyte of test-data to the cloud, I'll try to make my code available. It's not huge, and it's a re-write of a mostly C version (plus Python multiprocessing).

I scan the input to generate sets of DNA strings.

I reference those sets in a CStringArray so that they do not need to be copied for each thread.

For each set, I call a thread-proc.

The result of each thread proc is a DNA sequence which is immediately appended to the output file.

In (3), I allocate some 2 or 3-level nested arrays/seqs of objects. I have no idea whether each object in an array gets its own memory fragment. I also allocate 3 or 4 relatively large arrays repeatedly (in gradually descreasing sizes, which was the pathological case.) The allocation/freeing of many small objects seems not to interact well with the large ones. Memory grows steadily, and with 48 threads I can exhaust 96GB of RAM pretty quickly, which then leads to swapping.

I've already solved the worst offenders by using and resizing the same big string for the inner loop of each thread. I still create a new string for the task, but just once for the task. The thread persists in the threadpool, but the memory currently does not, since I'm not confident in threadvar persistence.

For (1), I was using Regex.findall(), which the memprofiler seemed to claim was a significant use of memory. I've re-written that to a simple string iterator with the underlying string re-used.

Even though the memory now "leaks" slowly, it still leaks. But I got sick this weekend, so I cannot show you an example right away.

Mirror of forum.nim-lang.org

2873 :: Thread-local persistence