Hi, I was wondering if there is a way to optimize this kind of data processing. I'm on Windows 10, and I need to parse many relatively small .txt files and rearrange their content into a table, where the table index is a pair of 2D grid coordinates and the table value is a sequence of strings. The order of magnitude is 18_000 .txt files of variable size, around 3000 lines each on average. The code is below. I assume that accessing many small files on the SSD is the main performance bottleneck, and I guess that a multi-threaded version, besides requiring some re-engineering (likely above my skills), wouldn't help here, since performance is not CPU bound. As a (much simpler) exercise I tried counting lines across all these .txt files with a thread pool to see whether that would speed things up... and it didn't. Is there some not-too-complex optimization for cases like this, i.e. accessing many small .txt files? Thank you in advance.
import std/[tables, strutils, os]

var
  emptyseq = @[""]
  line, objMeas: string
  gridkey: tuple[x: int, y: int]
  lineSeq: seq[string]
  gridTable = initTable[typeof(gridkey), typeof(emptyseq)]()

for file in walkFiles("*.txt"):
  let f = open(file)
  # each obj file has several tab-separated lines like these: grid coords (x, y), obj_id, measure
  # 526100  5043600  MX890M1E  -110.58
  # 526150  5043600  MX890M1E  -110.3
  # 526200  5043600  MX890M1E  -110.19
  # 526250  5043600  MX890M1E  -110.13
  # (...)
  while f.readLine(line):
    lineSeq = line.split('\t')
    gridkey = (x: parseInt(lineSeq[0]), y: parseInt(lineSeq[1]))
    objMeas = lineSeq[2] & "@" & lineSeq[3]
    if gridTable.hasKeyOrPut(gridkey, @[objMeas]):
      gridTable[gridkey].add(objMeas)
  close(f)

let exportcsv = open("gridtable.csv", fmWrite)
for k in gridTable.keys:
  exportcsv.writeLine($k.x & ";" & $k.y & ";" & gridTable[k].join(","))
exportcsv.close()
# a gridtable entry looks like this; different obj measurements falling in the same grid tile
# are appended to the seq associated with that grid index:
# 526100;5043600;MX890M1E@-110.58,MX890M1E@-110.3,MX890M1E@-110.19,MX890M1E@-110.13
I suspect there may be system settings to optimize small-file IO on Windows 10, but I am not the person to ask, and that is not very Nim-specific anyway. I will observe that 3,000 lines of 40-ish bytes each is about 120 KiB, or 30 virtual-memory pages, which may not be what everyone considers "small". All together, 18_000*3_000*40 = 2.16e9 bytes, which on a modern NVMe SSD should only take about 1 second of actual device IO (I have one that can do it in about 250 milliseconds).
You almost surely have much more than 2GB RAM on your computer. You may be able to use a RAM disk <https://github.com/nim-lang/RFCs/issues/503#issuecomment-1367542495> and just copy all the files into that (R: or T: or whatever). If you run the Nim code against files there then the time should be more CPU bound.
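If you want to script that staging step rather than copy by hand, a rough sketch (the RAM-disk path here is just a placeholder for wherever your RAM disk is mounted):

import std/os

# Sketch only: stage all inputs onto the RAM disk, then run the parser there.
let ramDir = r"R:\gridwork"            # placeholder RAM-disk path
createDir(ramDir)
for f in walkFiles("*.txt"):
  copyFile(f, ramDir / extractFilename(f))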
One way to use less CPU time within a stdlib setting would be to do less string creation/destruction in both your parsing and printing phases, as in:
import std/[tables, os, strutils]

var
  emptySeq = @[""]
  objMeas: string
  gridKey: tuple[x: int, y: int]
  gridTable = initTable[typeof(gridKey), typeof(emptySeq)]()
  x, y: int

for file in walkFiles("*.txt"):
  for line in file.lines:
    var i = 0
    for field in line.split('\t'):
      if i == 0: x = parseInt(field)
      elif i == 1: y = parseInt(field)
      elif i == 2: objMeas.setLen 0; objMeas.add field
      elif i == 3: objMeas.add "@"; objMeas.add field
      inc i
    gridTable.mgetOrPut((x, y), emptySeq).add objMeas

let exportcsv = open("gridtable.csv", fmWrite)
for k, v in gridTable:
  exportcsv.write k.x, ";", k.y
  for i, objMeas in v:
    exportcsv.write if i == 0: ';' else: ','
    exportcsv.write objMeas
exportcsv.close()
Another way is to use std/memfiles. There is an example of that in this thread.
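For reference, a minimal sketch of what the std/memfiles route can look like (this is not the exact example linked in the thread, and the field splitting below still makes one string copy per line, so it is less thorough than the cligen version further down):

import std/[tables, os, strutils, memfiles]

# Sketch only: memory-map each file and walk raw line slices; a line is
# copied into a string only at the point where it is split into fields.
var gridTable = initTable[tuple[x, y: int], seq[string]]()
for file in walkFiles("*.txt"):
  var mf = memfiles.open(file)
  for ms in memSlices(mf):                  # one MemSlice per line, no copy yet
    let fields = ($ms).split('\t')          # the copy happens here
    let key = (x: parseInt(fields[0]), y: parseInt(fields[1]))
    gridTable.mgetOrPut(key, newSeq[string]()).add(fields[2] & "@" & fields[3])
  mf.close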
If you can go out to Nimble packages, then with cligen utility code you may get better performance out of something like this:
import std/[tables, os], cligen/[mfile, mslice]

var
  emptySeq = @[""]
  objMeas: string
  gridKey: tuple[x: int, y: int]
  lineSeq: seq[MSlice]
  gridTable = initTable[typeof(gridKey), typeof(emptySeq)]()

for file in walkFiles("*.txt"):
  for ms in mSlices(file):
    discard ms.msplit(lineSeq, '\t', 0)
    gridKey = (x: parseInt(lineSeq[0]),
               y: parseInt(lineSeq[1]))
    objMeas.setLen 0; objMeas.add lineSeq[2]
    objMeas.add "@"; objMeas.add lineSeq[3]
    gridTable.mgetOrPut(gridKey, emptySeq).add objMeas

let exportcsv = open("gridtable.csv", fmWrite)
for k, v in gridTable:
  exportcsv.write k.x, ";", k.y
  for i, objMeas in v:
    exportcsv.write if i == 0: ';' else: ','
    exportcsv.write objMeas
exportcsv.close()
Seems like a good use-case for my LimDB. It's a table-like interface to a mature key-value database based on memory-mapped files. For smallish strings, this is usually a lot faster than file access because once the data is loaded the first time round, your code won't be doing calls to the kernel. It's similar to @cblake's memfiles answer above, just a bit slower and arguably harder to screw up.
import os, limdb

let db = initDatabase("some/directory/somewhere", "filescache")

# load files into database, comment out when complete
for file in walkFiles("*.txt"):
  db[file] = file.readFile

for name, contents in db:
  # do parsing
  discard
Huh. Well, if you got only a 5% speed-up then that is consistent with your prior IO-bound claims, but also consistent with said IO being very slow... maybe from anti-malware, as you propose.
The Defender stuff could be intercepting system calls, too. That might be another reason to try the std/memfiles approach since there are many fewer syscalls there -- though I only gave links to examples for that. (I could imagine that cligen utility stuff not working on old Nims.)
@cblake Well, I tried a bit more with your cligen-powered memory-map code. Work laptop with Windows 10 (Nim 1.4.4): initially the compiler complained about a missing C header, but I was running an old version of cligen, so I uninstalled it and installed 1.6.1 again with nimble. The missing-header issue was gone, but I got a new compiler error. Then I decided to test cligen on my home laptop with Windows 10 (I'm more keen to experiment on that one): there I first updated Nim to the latest stable (1.6.12) and then installed the latest cligen version, 1.6.1. However, I get the same compiler error as on the work laptop. Here it is:
$ nim c -d:release --mm:orc cligen_test.nim
Hint: used config file 'C:\Users\Andrea\.choosenim\toolchains\nim-1.6.12\config\nim.cfg' [Conf]
Hint: used config file 'C:\Users\Andrea\.choosenim\toolchains\nim-1.6.12\config\config.nims' [Conf]
..........................................................................................................
C:\Users\Andrea\.nimble\pkgs\cligen-1.6.1\cligen\mfile.nim(131, 17) Error: type mismatch: got <Handle, cint, cint, int, int, bool, bool>
but expected one of:
proc mopen(fd, fh: cint; fi: FileInfo; prot = PROT_READ; flags = MAP_SHARED;
a = 0.Off; b = Off(-1); allowRemap = false; noShrink = false;
err = stderr): MFile
first type mismatch at position: 1
required type for fd: cint
but expression 'fh' is of type: Handle
proc mopen(fh: cint; prot = PROT_READ; flags = MAP_SHARED; a = 0; b = Off(-1);
allowRemap = false; noShrink = false; err = stderr): MFile
first type mismatch at position: 1
required type for fh: cint
but expression 'fh' is of type: Handle
proc mopen(path: string; prot = PROT_READ; flags = MAP_SHARED; a = 0; b = -1;
allowRemap = false; noShrink = false; perMask = 0o000000000666;
err = stderr): MFile
first type mismatch at position: 1
required type for path: string
but expression 'fh' is of type: Handle
expression: mopen(fh, prot, flags, a, b, allowRemap, noShrink)
I wonder if it is Windows-specific, or maybe I have a not-so-clean installation, so it's specific to my configuration (likely so).

I believe you may be running into this bug: https://github.com/c-blake/cligen/commit/d66b7715edada5473b5f47a82735cfe9e706a95d
Let me punch a new cligen-1.6.2 release for you. Give me a few minutes.
Played with it for a bit. What can I say, ImDisk is really slow. Maybe WinFsp's RAM drive is faster; I haven't had time to check yet. In regular circumstances I'm pretty content with the former. Adding threads brought only a negligible improvement, but it would be interesting to check against the real NVMe drive too.
Here's the code I used to generate the test files (deterministic):
import std/[random, tasks, strformat, streams]
import cozytaskpool

const
  NLines = 3000
  NFiles = 18_000
  LineBytesEst = 32

var
  rnd = initRand(0xDEADBEEF'i64)
  pool = newTaskPool(createConsumer = false)

proc genInputLine(rnd: var Rand): string {.noinit.} =
  let a = rnd.rand(1000_000)
  let b = rnd.rand(10_000_00)
  let c = (rnd.rand(40000) - 20000).float / 100.0
  fmt("{a}\t{b}\tMX890M1E\t{c:.2f}\n")

proc work(fNum: int; seed: int64) =
  var strm = newFileStream(fmt"{fNum:05}.csv", fmWrite)
  var rnd = initRand(seed)
  if not isNil(strm):
    for _ in 1..NLines:
      strm.write(genInputLine(rnd))
    strm.close()

for n in 1..NFiles:
  pool.sendTask(work(n, cast[int64](rnd.next())))
pool.stopPool()
First, always best to have reproducible test data! Great initiative, @Zoom!
@tcheran did not specify if the grid was fixed over samples or varying. Either could make sense (e.g. with wandering sensors that use the GPS satellite network to self-locate), but very different perf numbers & optimization ideas arise in the two situations (small hash table, but long lists vs. Zoom's giant hash table, short lists). For example, this generator program is similar, but uses a fixed grid:
import std/[os, random, strutils, strformat], cligen/osUt

const NLines = 3000
var rng = initRand(0xDEADBEEF'i64)

if paramCount() != 2: quit "Usage: start end", 1
let a = parseInt(paramStr(1))
let b = parseInt(paramStr(2))

var grid: seq[(uint64, uint64)]
for _ in 1..NLines:
  let x = rng.next mod 1_000_000
  let y = rng.next mod 10_000_00
  grid.add (x, y)

for fNum in a..b:
  var f = open(&"{fNum:05}.txt", fmWrite)
  if not f.isNil:
    for (x, y) in grid:
      let c = rng.rand(40000) - 20000
      let d = c div 100
      let e = abs(c) mod 100
      f.urite x, '\t', y, "\tMX890M1E\t", d, '.', &"{e:02}\n"
    f.close
I ran the above with "coarse grained parallelism" (usually fine), i.e.:
zoomDat 1 4500& zoomDat 4501 9000& zoomDat 9001 13500& zoomDat 13501 18000&
My prior programs have 2 bugs. First, to match results, emptySeq should be declared simply as emptySeq: seq[string]. Second, there needs to be a write "\n" after the inner loop in the CSV output part. Oops.
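Spelled out against the listings above (the single dummy table entry is mine, just so the snippet runs standalone):

import std/tables

var
  emptySeq: seq[string]                     # fix 1: no spurious "" element
  gridTable = initTable[tuple[x, y: int], seq[string]]()
gridTable.mgetOrPut((x: 526100, y: 5043600), emptySeq).add "MX890M1E@-110.58"

let exportcsv = open("gridtable.csv", fmWrite)
for k, v in gridTable:
  exportcsv.write k.x, ";", k.y
  for i, objMeas in v:
    exportcsv.write if i == 0: ';' else: ','
    exportcsv.write objMeas
  exportcsv.write "\n"                      # fix 2: terminate each CSV record
exportcsv.close()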
I haven't compared RAM disks on Windows (someone should post more details on that), but on Linux, using /dev/shm on a box with an i7-6700K at 4.8 GHz and 65 ns latency / 40 GB/s DIMMs, I get these runtimes (in seconds; big enough and well enough separated not to worry about measurement error):
| Program | RanGrid | FixedGrid | TinyGrid |
|---|---|---|---|
| Orig | 48 | 40 | 27 |
| cb1 | 36 | 30 | 19 |
| cb2 | 25 | 20 | 8 |
That last TinyGrid column is using only 4 distinct grid points (by changing x & y in the output line to a & b - an early accidental bug). So, across columns we mostly see the effect of seq being faster than Table.
One can maybe get a decent speed-up by going parallel and merging preliminary gridtable.csv's; how well that works depends on which grid-value-diversity mode obtains. A sketch of that merge step follows.
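This assumes each worker writes its own part file in the same x;y;measurement-list format as above; the gridtable_*.csv naming is made up:

import std/[tables, strutils, os]

# Hypothetical merge of per-worker outputs into one gridtable.csv.
var merged = initTable[string, seq[string]]()       # key kept as the "x;y" text
for part in walkFiles("gridtable_*.csv"):
  for line in part.lines:
    let sep = line.rfind(';')                       # last ';' separates key from values
    let vals = line[sep + 1 .. ^1].split(',')
    merged.mgetOrPut(line[0 ..< sep], newSeq[string]()).add vals

let outF = open("gridtable.csv", fmWrite)
for k, v in merged:
  outF.writeLine(k & ";" & v.join(","))
outF.close()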
Unless/until such parallel scale-up, this should not be an IO bound problem on an SSD. Even a SATA SSD can probably do 750 MB/s and this problem is only 1365 MB or maybe 2 seconds, but processing takes much more. With the above generated data, for example, cat *.txt >/dev/null takes only 0.22 seconds. So, at a minimum one would need like 20/.22=90 cores without contention for IO time to == CPU time. @tcheran's "2nd run times" almost surely have data cached in DIMMs anyway.
All in all, adding threading with what's readily available turned out pretty meh. Maybe it's just my code.
It's a reasonable question, but when the issue is this specific, it's safe to assume the format is what one has been given and can hardly be changed. There are a bunch of inefficiencies in the format, but when you already have the data it's faster to process it as it is than to convert it to a better form before processing.
If you could change how you get the data, then there are many things you could do. For example, why not bucket the samples on write (pre-sorting)? It looks like there's no metadata tied to the order of the lines, so you're not obliged to store them sequentially. A rough sketch of that idea is below.
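This assumes you control the producer; the proc name and bucket layout are invented, and one file per grid tile only makes sense if the set of tiles is not huge:

import std/os

# Hypothetical producer-side bucketing: append each sample to a file named
# after its grid tile, so later aggregation is just a per-bucket read.
proc recordSample(x, y: int; objId, measure: string) =
  createDir("buckets")                              # idempotent; cheap enough for a sketch
  let f = open("buckets" / ($x & "_" & $y & ".txt"), fmAppend)
  f.writeLine(objId, "@", measure)
  f.close()

recordSample(526100, 5043600, "MX890M1E", "-110.58")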