Hi, I was wondering if there is a way to optimize this kind of data processing. I'm on Windows 10, and I need to parse many relatively small .txt files and rearrange their content into a table, where the table index is a pair of 2D grid coordinates and the table value is a sequence of strings. The order of magnitude is 18_000 .txt files of variable size, around 3000 lines each on average. The code is below. I assume that accessing many small files on the SSD is the main performance bottleneck, and I guess that a multi-threaded version, besides requiring some re-engineering (likely above my skills), wouldn't help here, since performance is not CPU bound. As a (much simpler) exercise I tried counting lines across all these .txt files with a thread pool to see whether that would speed things up... and it didn't. Is there some not-too-complex optimization for cases like this, i.e. accessing many small .txt files? Thank you in advance.
import std/[tables, strutils, os]

var
  emptyseq = @[""]
  line, objMeas: string
  gridkey: tuple[x: int, y: int]
  lineSeq: seq[string]
  gridTable = initTable[typeof(gridkey), typeof(emptyseq)]()

for file in walkFiles("*.txt"):
  let f = open(file)
  # each obj file has several tab-separated lines like these: grid coords (x, y), obj_id, measure
  # 526100  5043600  MX890M1E  -110.58
  # 526150  5043600  MX890M1E  -110.3
  # 526200  5043600  MX890M1E  -110.19
  # 526250  5043600  MX890M1E  -110.13
  # (...)
  while f.readLine(line):
    lineSeq = line.split('\t')
    gridkey = (x: parseInt(lineSeq[0]), y: parseInt(lineSeq[1]))
    objMeas = lineSeq[2] & "@" & lineSeq[3]
    if gridTable.hasKeyOrPut(gridkey, @[objMeas]):
      gridTable[gridkey].add(objMeas)
  close(f)

let exportcsv = open("gridtable.csv", fmWrite)
for k in gridTable.keys:
  exportcsv.writeLine($k.x & ";" & $k.y & ";" & gridTable[k].join(","))
exportcsv.close()
# a gridtable entry looks like this; different obj measurements falling in the same grid tile
# are appended to the seq associated with that grid index:
# 526100;5043600;MX890M1E@-110.58,MX890M1E@-110.3,MX890M1E@-110.19,MX890M1E@-110.13
I suspect there may be system settings to optimize small-file IO on Windows 10, but I am not the person to ask, and that is not very Nim-specific anyway. I will observe that 3,000 lines of 40-ish bytes each is about 120 KiB, or 30 virtual-memory pages, which may not be what everyone considers "small". All together, 18_000*3_000*40 = 2.16e9 bytes, which on a modern NVMe SSD should only take about 1 second of actual device IO (I have one that can do it in about 250 milliseconds).
You almost surely have much more than 2GB RAM on your computer. You may be able to use a RAM disk <https://github.com/nim-lang/RFCs/issues/503#issuecomment-1367542495> and just copy all the files into that (R: or T: or whatever). If you run the Nim code against files there then the time should be more CPU bound.
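If you want to script that staging step rather than copy by hand, a rough sketch (the RAM-disk path here is just a placeholder for wherever your RAM disk is mounted):

import std/os

# Sketch only: stage all inputs onto the RAM disk, then run the parser there.
let ramDir = r"R:\gridwork"            # placeholder RAM-disk path
createDir(ramDir)
for f in walkFiles("*.txt"):
  copyFile(f, ramDir / extractFilename(f))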
One way to use less CPU time within a stdlib setting would be to do less string creation/destruction in both your parsing and printing phases, as in:
import std/[tables, os, strutils]

var
  emptySeq = @[""]
  objMeas: string
  gridKey: tuple[x: int, y: int]
  gridTable = initTable[typeof(gridKey), typeof(emptySeq)]()
  x, y: int

for file in walkFiles("*.txt"):
  for line in file.lines:
    var i = 0
    for field in line.split('\t'):
      if i == 0: x = parseInt(field)
      elif i == 1: y = parseInt(field)
      elif i == 2: objMeas.setLen 0; objMeas.add field
      elif i == 3: objMeas.add "@"; objMeas.add field
      inc i
    gridTable.mgetOrPut((x, y), emptySeq).add objMeas

let exportcsv = open("gridtable.csv", fmWrite)
for k, v in gridTable:
  exportcsv.write k.x, ";", k.y
  for i, objMeas in v:
    exportcsv.write if i == 0: ';' else: ','
    exportcsv.write objMeas
exportcsv.close()
Another way is to use std/memfiles. There is an example of that in this thread.
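For reference, a minimal sketch of what the std/memfiles route can look like (this is not the exact example linked in the thread, and the field splitting below still makes one string copy per line, so it is less thorough than the cligen version further down):

import std/[tables, os, strutils, memfiles]

# Sketch only: memory-map each file and walk raw line slices; a line is
# copied into a string only at the point where it is split into fields.
var gridTable = initTable[tuple[x, y: int], seq[string]]()
for file in walkFiles("*.txt"):
  var mf = memfiles.open(file)
  for ms in memSlices(mf):                  # one MemSlice per line, no copy yet
    let fields = ($ms).split('\t')          # the copy happens here
    let key = (x: parseInt(fields[0]), y: parseInt(fields[1]))
    gridTable.mgetOrPut(key, newSeq[string]()).add(fields[2] & "@" & fields[3])
  mf.close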
If you can go out to Nimble packages, then with cligen utility code you may get better performance out of something like this:
import std/[tables, os], cligen/[mfile, mslice]

var
  emptySeq = @[""]
  objMeas: string
  gridKey: tuple[x: int, y: int]
  lineSeq: seq[MSlice]
  gridTable = initTable[typeof(gridKey), typeof(emptySeq)]()

for file in walkFiles("*.txt"):
  for ms in mSlices(file):
    discard ms.msplit(lineSeq, '\t', 0)
    gridKey = (x: parseInt(lineSeq[0]),
               y: parseInt(lineSeq[1]))
    objMeas.setLen 0; objMeas.add lineSeq[2]
    objMeas.add "@"; objMeas.add lineSeq[3]
    gridTable.mgetOrPut(gridKey, emptySeq).add objMeas

let exportcsv = open("gridtable.csv", fmWrite)
for k, v in gridTable:
  exportcsv.write k.x, ";", k.y
  for i, objMeas in v:
    exportcsv.write if i == 0: ';' else: ','
    exportcsv.write objMeas
exportcsv.close()
Seems like a good use-case for my LimDB. It's a table-like interface to a mature key-value database based on memory-mapped files. For smallish strings, this is usually a lot faster than file access because once the data is loaded the first time round, your code won't be doing calls to the kernel. It's similar to @cblake's memfiles answer above, just a bit slower and arguably harder to screw up.
import os, limdb

let db = initDatabase("some/directory/somewhere", "filescache")

# load files into database, comment out when complete
for file in walkFiles("*.txt"):
  db[file] = file.readFile

for name, contents in db:
  # do parsing
  discard
Huh. Well, if you got only a 5% speed-up then that is consistent with your prior IO-bound claims, but also consistent with said IO being very slow... maybe from anti-malware, as you propose.
The Defender stuff could be intercepting system calls, too. That might be another reason to try the std/memfiles approach since there are many fewer syscalls there -- though I only gave links to examples for that. (I could imagine that cligen utility stuff not working on old Nims.)
@cblake Well, I tried a bit more with your cligen-powered memory-map code. Work laptop with Windows 10 (Nim 1.4.4): initially the compiler complained about a missing C header, but I was running an old version of cligen, so I uninstalled it and installed 1.6.1 again with nimble. The missing-header issue was gone, but I got a new compiler error. Then I decided to test cligen on my home laptop with Windows 10 (I'm more keen to experiment on that one): there I first updated Nim to the latest stable (1.6.12) and then installed the latest cligen version, 1.6.1. However, I get the same compiler error as on the work laptop. Here it is:
$ nim c -d:release --mm:orc cligen_test.nim
Hint: used config file 'C:\Users\Andrea\.choosenim\toolchains\nim-1.6.12\config\nim.cfg' [Conf]
Hint: used config file 'C:\Users\Andrea\.choosenim\toolchains\nim-1.6.12\config\config.nims' [Conf]
..........................................................................................................
C:\Users\Andrea\.nimble\pkgs\cligen-1.6.1\cligen\mfile.nim(131, 17) Error: type mismatch: got <Handle, cint, cint, int, int, bool, bool>
but expected one of:
proc mopen(fd, fh: cint; fi: FileInfo; prot = PROT_READ; flags = MAP_SHARED;
a = 0.Off; b = Off(-1); allowRemap = false; noShrink = false;
err = stderr): MFile
first type mismatch at position: 1
required type for fd: cint
but expression 'fh' is of type: Handle
proc mopen(fh: cint; prot = PROT_READ; flags = MAP_SHARED; a = 0; b = Off(-1);
allowRemap = false; noShrink = false; err = stderr): MFile
first type mismatch at position: 1
required type for fh: cint
but expression 'fh' is of type: Handle
proc mopen(path: string; prot = PROT_READ; flags = MAP_SHARED; a = 0; b = -1;
allowRemap = false; noShrink = false; perMask = 0o000000000666;
err = stderr): MFile
first type mismatch at position: 1
required type for path: string
but expression 'fh' is of type: Handle
expression: mopen(fh, prot, flags, a, b, allowRemap, noShrink)
I wonder if it is Windows-specific, or maybe I have a not-so-clean installation, so it's specific to my configuration (likely so).

I believe you may be running into this bug: https://github.com/c-blake/cligen/commit/d66b7715edada5473b5f47a82735cfe9e706a95d
Let me punch a new cligen-1.6.2 release for you. Give me a few minutes.
Played with it for a bit. What can I say, ImDisk is really slow. Maybe WinFsp's RAM drive is faster; I haven't had time to check yet. In regular circumstances I'm pretty content with the former. Adding threads brought only a negligible improvement, but it would be interesting to check against the real NVMe drive too.
Here's the code I used to generate the test files (deterministic):
import std/[random, tasks, strformat, streams]
import cozytaskpool

const
  NLines = 3000
  NFiles = 18_000
  LineBytesEst = 32

var
  rnd = initRand(0xDEADBEEF'i64)
  pool = newTaskPool(createConsumer = false)

proc genInputLine(rnd: var Rand): string {.noinit.} =
  let a = rnd.rand(1000_000)
  let b = rnd.rand(10_000_00)
  let c = (rnd.rand(40000) - 20000).float / 100.0
  fmt("{a}\t{b}\tMX890M1E\t{c:.2f}\n")

proc work(fNum: int; seed: int64) =
  var strm = newFileStream(fmt"{fNum:05}.csv", fmWrite)
  var rnd = initRand(seed)
  if not isNil(strm):
    for _ in 1..NLines:
      strm.write(genInputLine(rnd))
    strm.close()

for n in 1..NFiles:
  pool.sendTask(work(n, cast[int64](rnd.next())))
pool.stopPool()
First, always best to have reproducible test data! Great initiative, @Zoom!
@tcheran did not specify if the grid was fixed over samples or varying. Either could make sense (e.g. with wandering sensors that use the GPS satellite network to self-locate), but very different perf numbers & optimization ideas arise in the two situations (small hash table, but long lists vs. Zoom's giant hash table, short lists). For example, this generator program is similar, but uses a fixed grid:
import std/[os, random, strutils, strformat], cligen/osUt

const NLines = 3000
var rng = initRand(0xDEADBEEF'i64)

if paramCount() != 2: quit "Usage: start end", 1
let a = parseInt(paramStr(1))
let b = parseInt(paramStr(2))

var grid: seq[(uint64, uint64)]
for _ in 1..NLines:
  let x = rng.next mod 1_000_000
  let y = rng.next mod 10_000_00
  grid.add (x, y)

for fNum in a..b:
  var f = open(&"{fNum:05}.txt", fmWrite)
  if not f.isNil:
    for (x, y) in grid:
      let c = rng.rand(40000) - 20000
      let d = c div 100
      let e = abs(c) mod 100
      f.urite x, '\t', y, "\tMX890M1E\t", d, '.', &"{e:02}\n"
    f.close
I ran the above with "coarse grained parallelism" (usually fine), i.e.:
zoomDat 1 4500& zoomDat 4501 9000& zoomDat 9001 13500& zoomDat 13501 18000&
My prior programs have 2 bugs. First, to match results, emptySeq should be declared simply as emptySeq: seq[string]. Second, there needs to be a write "\n" after the inner loop in the CSV output part. Oops.
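Spelled out against the listings above (the single dummy table entry is mine, just so the snippet runs standalone):

import std/tables

var
  emptySeq: seq[string]                     # fix 1: no spurious "" element
  gridTable = initTable[tuple[x, y: int], seq[string]]()
gridTable.mgetOrPut((x: 526100, y: 5043600), emptySeq).add "MX890M1E@-110.58"

let exportcsv = open("gridtable.csv", fmWrite)
for k, v in gridTable:
  exportcsv.write k.x, ";", k.y
  for i, objMeas in v:
    exportcsv.write if i == 0: ';' else: ','
    exportcsv.write objMeas
  exportcsv.write "\n"                      # fix 2: terminate each CSV record
exportcsv.close()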
I haven't compared RAM disks on Windows (someone should post more details on that), but on Linux, using /dev/shm on a box with an i7-6700K at 4.8 GHz and 65 ns latency / 40 GB/s DIMMs, I get these runtimes (in seconds; big enough and well enough separated not to worry about measurement error):
| Program | RanGrid | FixedGrid | TinyGrid |
|---|---|---|---|
| Orig | 48 | 40 | 27 |
| cb1 | 36 | 30 | 19 |
| cb2 | 25 | 20 | 8 |
That last TinyGrid column is using only 4 distinct grid points (by changing x & y in the output line to a & b - an early accidental bug). So, across columns we mostly see the effect of seq being faster than Table.
One can maybe get a decent speed-up by going parallel and merging preliminary gridtable.csv's; how well that works depends on which grid-value-diversity mode obtains. A sketch of that merge step follows.
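This assumes each worker writes its own part file in the same x;y;measurement-list format as above; the gridtable_*.csv naming is made up:

import std/[tables, strutils, os]

# Hypothetical merge of per-worker outputs into one gridtable.csv.
var merged = initTable[string, seq[string]]()       # key kept as the "x;y" text
for part in walkFiles("gridtable_*.csv"):
  for line in part.lines:
    let sep = line.rfind(';')                       # last ';' separates key from values
    let vals = line[sep + 1 .. ^1].split(',')
    merged.mgetOrPut(line[0 ..< sep], newSeq[string]()).add vals

let outF = open("gridtable.csv", fmWrite)
for k, v in merged:
  outF.writeLine(k & ";" & v.join(","))
outF.close()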
Unless/until such parallel scale-up, this should not be an IO bound problem on an SSD. Even a SATA SSD can probably do 750 MB/s and this problem is only 1365 MB or maybe 2 seconds, but processing takes much more. With the above generated data, for example, cat *.txt >/dev/null takes only 0.22 seconds. So, at a minimum one would need like 20/.22=90 cores without contention for IO time to == CPU time. @tcheran's "2nd run times" almost surely have data cached in DIMMs anyway.
All in all, adding threading with what's readily available turned out pretty meh. Maybe it's just my code.
It's a reasonable question, but when the issue is this specific, it's safe to assume the format is what one has been given and can hardly be changed. There are a bunch of inefficiencies in the format, but when you already have the data it's faster to process it as it is than to convert it to a better form before processing.
If you could change how you get the data, then there are many things you could do. For example, why not bucket the samples on write (pre-sorting)? It looks like there's no metadata tied to the order of the lines, so you're not obliged to store them sequentially. A rough sketch of that idea is below.
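This assumes you control the producer; the proc name and bucket layout are invented, and one file per grid tile only makes sense if the set of tiles is not huge:

import std/os

# Hypothetical producer-side bucketing: append each sample to a file named
# after its grid tile, so later aggregation is just a per-bucket read.
proc recordSample(x, y: int; objId, measure: string) =
  createDir("buckets")                              # idempotent; cheap enough for a sketch
  let f = open("buckets" / ($x & "_" & $y & ".txt"), fmAppend)
  f.writeLine(objId, "@", measure)
  f.close()

recordSample(526100, 5043600, "MX890M1E", "-110.58")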