Hi, I was looking for something simple (no macros), relying just on Nim's standard library, to dynamically infer a column's type and assign it to the proper seq type. I know there are pretty advanced packages like Datamancer out there, but I wish to build my own small toy, just to better understand the proper way to do it (I'm more interested in the concept than in performance). I know that Nim is strictly typed and encourages the programmer to model their own types. I devised something like this:
import strutils, parsecsv, sequtils, streams

type
  nimSeriesKind = enum
    seriesInt,
    seriesFloat,
    seriesString
  nimSeries = ref object
    case kind: nimSeriesKind
    of seriesInt: intValues: seq[int]
    of seriesFloat: floatValues: seq[float]
    of seriesString: stringValues: seq[string]

# CSV file parsing and matrix inversion omitted;
# after that, I have a seq of string columns that
# I wish to check if they can be evaluated as int or float

proc setSeries(x: seq[string]): nimSeries =
  try:
    result = nimSeries(kind: seriesInt, intValues: x.map(parseInt))
  except:
    try:
      result = nimSeries(kind: seriesFloat, floatValues: x.map(parseFloat))
    except:
      result = nimSeries(kind: seriesString, stringValues: x)
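Downstream, the idea is to dispatch on the kind field, roughly like this (describe is just an illustrative name, not part of the actual parser):

proc describe(s: nimSeries): string =
  ## Reach the right seq by branching on the variant's discriminator.
  case s.kind
  of seriesInt: result = "int column with " & $s.intValues.len & " values"
  of seriesFloat: result = "float column with " & $s.floatValues.len & " values"
  of seriesString: result = "string column with " & $s.stringValues.len & " values"

echo describe(setSeries(@["1", "2", "3"]))     # int column with 3 values
echo describe(setSeries(@["1", "2.5", "3"]))   # float column with 3 values
echo describe(setSeries(@["1", "two", "3"]))   # string column with 3 values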
Are object variants the right way to do it? The try/except approach is possibly a bit inelegant (any relevant drawbacks?); maybe there is a more idiomatic / efficient way to go? Thank you.
P.S. It's evident that if a late row forces a type change (e.g. all integers, but the last row is NaN), the whole sequence has to be re-mapped... but hopefully typical datasets don't conspire toward the worst case.
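One exception-free variation I was toying with is to scan the column once with std/parseutils (its parseInt / parseFloat overloads return the number of characters consumed instead of raising) and only build the seq once the kind is known, which would also avoid the re-mapping above. Just a rough sketch (detectKind and setSeries2 are names I made up; it reuses the nimSeries type from above):

from std/parseutils import nil   # qualified import, so it can't clash with strutils.parseInt

proc detectKind(x: seq[string]): nimSeriesKind =
  ## Single pass over the column: start optimistic (int) and widen
  ## to float or string as soon as a value no longer fits.
  result = seriesInt
  for s in x:
    var i: int
    var f: float
    if result == seriesInt and s.len > 0 and parseutils.parseInt(s, i) == s.len:
      continue                        # still representable as int
    if s.len > 0 and parseutils.parseFloat(s, f) == s.len:
      result = seriesFloat            # widen; plain ints parse fine as floats too
    else:
      return seriesString             # anything else: fall back to strings

proc setSeries2(x: seq[string]): nimSeries =
  ## Build each seq exactly once, for the kind detected up front,
  ## so nothing is parsed into a seq that then gets thrown away.
  case detectKind(x)
  of seriesInt:
    result = nimSeries(kind: seriesInt)
    for s in x:
      var v: int
      discard parseutils.parseInt(s, v)
      result.intValues.add v
  of seriesFloat:
    result = nimSeries(kind: seriesFloat)
    for s in x:
      var v: float
      discard parseutils.parseFloat(s, v)
      result.floatValues.add v
  of seriesString:
    result = nimSeries(kind: seriesString, stringValues: x)

It still walks the strings twice, but the seq is allocated only once and there is no exception machinery in the hot path.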
Given that you already mention Datamancer: at a high level, this is pretty much what I do in my CSV parser there. Before I begin parsing I check the first N rows to deduce the most likely type (e.g. the first row may look like an int by chance, but the second is an explicit float; this avoids reallocating and copying for such common cases). Then I begin parsing and hope the deduced type holds. If parsing fails, I indeed just fall back to the next "higher up" type that can encompass the data seen so far. It's ugly, but CSV is also an ugly data format. Ugly problems sometimes require ugly solutions... :)
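The "fall back" step is essentially a widening conversion of whatever has already been parsed. Roughly (a simplified sketch of the idea only, not the actual Datamancer code; widenToFloat is a made-up name, and it reuses your nimSeries type):

proc widenToFloat(s: nimSeries): nimSeries =
  ## Promote an int column to a float column once a value turns up that
  ## parses as float but not as int (e.g. "3.14" or "NaN" after a run of
  ## integers); the already-parsed ints are just converted, not re-parsed.
  assert s.kind == seriesInt
  result = nimSeries(kind: seriesFloat)
  for v in s.intValues:
    result.floatValues.add v.float

The float-to-string step works the same way, except there you want to keep the original cell text around instead of stringifying the parsed floats back.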
Maybe someone smarter than me can tell both of us that there are better ways though. ;)