I want to split a file into multiple pieces and give each piece to its own thread. I figured memfiles would be the best option, but I'm struggling to see how to use them to divide my file into multiple slices.
Each line of the file contains a single datum.
Does anyone have any hints on how to solve this problem?
You need to do two passes.
The first pass scans the file and stores, either in memory or on disk, an array holding the starting byte offset of each line.
Then you divide those lines by the number of cores and process each group of lines in the second pass.
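To make the two passes a bit more concrete, here is a rough sketch that works on an in-memory string rather than a real file. The names `lineStarts` and `chunkRanges` are made up for this illustration, and the worker count of 2 is just a placeholder for your core count.

```nim
# Pass 1: record the byte offset at which each line starts.
proc lineStarts(data: string): seq[int] =
  if data.len > 0:
    result.add(0)
  for i, c in data:
    # A new line starts right after every '\n', unless that newline ends the data.
    if c == '\n' and i + 1 < data.len:
      result.add(i + 1)

# Split `n` lines into at most `parts` contiguous, near-equal ranges.
# Each range is a half-open interval [first, last) of line indices.
proc chunkRanges(n, parts: int): seq[tuple[first, last: int]] =
  let base = n div parts
  let extra = n mod parts
  var start = 0
  for p in 0 ..< parts:
    # The first `extra` ranges get one additional line each.
    let size = base + (if p < extra: 1 else: 0)
    if size > 0:
      result.add((first: start, last: start + size))
    start += size

when isMainModule:
  let data = "a\nbb\nccc\ndddd\neeeee\n"
  let starts = lineStarts(data)
  # 2 is an assumed worker count; substitute your number of cores.
  for r in chunkRanges(starts.len, 2):
    echo "lines ", r.first, " ..< ", r.last, " begin at byte ", starts[r.first]
```

Each worker then only needs the byte offset of its first line and the number of lines in its range.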
Hm, so you mean something along these lines?
import memfiles

# Collect a pointer to the start of every line in the mapped file.
var linePtr: seq[pointer]
var input = memfiles.open("file.txt", mode = fmRead)
for line in memSlices(input):
  linePtr.add(line.data)
And then I divide the length of linePtr by my core count. This gives me the size of each block.
Now I don't really understand how I can use that information to get actual strings out of my memory-mapped file. According to the docs, mapping portions of the file only works in multiples of my OS's PAGE_SIZE. Does this mean I need to floor my chunk size to the nearest multiple of that? And if so, how can I work on the text the way I would on strings?
Thanks for your help :)
Good point,
Once it's indexed, you can use Nim streams and setPosition: https://nim-lang.org/docs/streams.html#setPosition%2CStream%2Cint.
So you would:
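A minimal sketch of the setPosition idea, assuming you already know the byte offset of the first line of a chunk and how many lines it contains. `readLinesFrom` is a hypothetical helper name and the temp file exists only for the demo, so treat this as one possible shape rather than the definitive way to do it.

```nim
import std/[streams, os]

# Hypothetical helper: read `count` lines starting at byte offset `startByte`.
proc readLinesFrom(path: string, startByte, count: int): seq[string] =
  let s = newFileStream(path, fmRead)
  if s.isNil:
    raise newException(IOError, "cannot open " & path)
  defer: s.close()
  # Jump straight to the first line of this chunk.
  s.setPosition(startByte)
  var line = ""
  var n = 0
  # Stop after `count` lines or at end of file, whichever comes first.
  while n < count and s.readLine(line):
    result.add(line)
    inc n

when isMainModule:
  let path = getTempDir() / "setposition_demo.txt"
  writeFile(path, "a\nbb\nccc\ndddd\n")
  # Byte 5 is where the third line ("ccc") starts in this demo file.
  echo readLinesFrom(path, 5, 2)
  removeFile(path)
```

Since every call opens its own stream, each worker can read its chunk without sharing a file position with the others.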
One thing that will be a bit painful is the thread-local GC. Strings are currently garbage-collected per thread unless you compile with --gc:destructors, so you can't use a global shared sequence with each thread updating its own string fragment.
Either you compile with --gc:destructors, which would allow that, or you use channels to pass the final strings between the workers and the main thread. The channel approach creates additional copy overhead.
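For the channel route, something like the following could work. It is only a sketch: `WorkItem`, `worker` and `runWorkers` are names invented for this example, the "processing" is a stand-in that just builds a summary string, and on Nim versions where threads are not enabled by default you would need to compile with --threads:on.

```nim
# Hypothetical unit of work: a worker id plus a half-open range of line indices.
type WorkItem = tuple[id, first, last: int]

# Global channel that carries each worker's final string back to the main thread.
var results: Channel[string]

proc worker(item: WorkItem) {.thread.} =
  # Stand-in for real processing of lines [first, last).
  let summary = "worker " & $item.id & ": lines " & $item.first & " ..< " & $item.last
  # The string is copied into the channel; this is the copy overhead mentioned above.
  results.send(summary)

proc runWorkers(items: seq[WorkItem]): seq[string] =
  var threads = newSeq[Thread[WorkItem]](items.len)
  results.open()
  for i in 0 ..< items.len:
    createThread(threads[i], worker, items[i])
  # recv blocks until a message arrives, so this collects one result per worker.
  for i in 0 ..< items.len:
    result.add(results.recv())
  for i in 0 ..< threads.len:
    joinThread(threads[i])
  results.close()

when isMainModule:
  let items: seq[WorkItem] = @[(id: 0, first: 0, last: 3), (id: 1, first: 3, last: 5)]
  for line in runWorkers(items):
    echo line
```

Results arrive in whatever order the workers finish, so include the worker id or range in the message if you need to reassemble them in file order.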