nimforum mirror - Read file at compile time

mora (orginal) [2015-05-31T16:40:15+02:00] view original

Hi,

Is it possible to read a file at compile time and create a data structure with a macro based on the file's content? I can write the macro, but I get an error that "cannot 'importc' variable at compile time".

Thanks, Peter

Jehan (orginal) [2015-05-31T17:13:01+02:00] view original

You can use staticRead(filename).

mora (orginal) [2015-05-31T20:14:28+02:00] view original

Thanks, Jehan. Is it possible to read only the first line from a large file at compile time? I can use staticExec("head filename") in linux, but it's not very portable.

Sorry for the strange questions, I'm experimenting whether Nim could be used as a very efficient data analysis framework.

Jehan (orginal) [2015-06-01T00:38:56+02:00] view original

You can write your own head implementation in Nim and use that? I use Nim to do code generation tasks (generating Nim through Nim code) in some projects.

You may have to write a simple makefile to get your dependencies in order, though (you can also do nim c -r --verbosity:0 --hints:off --warnings:off something.nim with staticExec, but then you get additional overhead each time you compile).

Araq (orginal) [2015-06-01T01:18:30+02:00] view original

import strutils

# should work at compile-time:
proc head(s: string): string = s.splitLines()[0]

mora (orginal) [2015-06-01T09:09:55+02:00] view original

Hi All,

Thanks for the answers. Let me give a little background while I have strange questions like this.

I'm working on a proof of concept, where we could work with large (several gigabytes) csv files very effectively in Nim. I usually use R for data manipulation, after loading the data, I can do two things fast in R:

working with the columns, like A+B (where A and B is a sequence/array with the same length), types are checked only once, and the calculation is done in C/C++;

calling functions like min, max, weighted.mean, which are implemented for sequences/arrays in C/C++

This works pretty fast in R, all the slow things (like searching for the memory segment which contains the data for the column "A") are calculated only once. Time series are stored in column format nowadays anyway. We have no type check in R at compile time.

There are two problems with R's approach: if you want to do something nasty which is not prewritten in C (for example your own mean function), then it is slow as hell; and if the intermediate calculation is not fit into the memory because the data is several gigabytes. Python solves these problems somehow: it is faster than R and it has iterators. With iterators you can have the next values, therefore A+B+C will not store A+B first, only the final result. This is very memory efficient, but not fast, python resolves of the variables and checks the types for each row. Of course, PyPy makes this faster. No type checks in python at compile time.

So I'm dreaming to do something in Nim. Let's imagine that we have a huge (gigabytes) file in CSV with a couple of columns and millions of rows. We would need the csv file to compile the code. At compile time we could parse the first few lines to find out the columns' name and their type, and generate the right object type. At running time we could use the columns by name:


var data = loadTable("someFile.csv")
echo data.A # this would work because someFile.csv has a column called "A"

From this point (with some iterator magic like in my library Lazy) we could have everything: compile time type checked operations, no name resolving (Nim is a compiled, it is not a script language, no need for a hashmap to find out where "A" is stored), and memory efficiency.

Any ideas?

Cheers, Peter

Varriount (orginal) [2015-06-01T09:29:35+02:00] view original

For handling large files, it would probably be a good idea to either read in chunks of the file at a time, or, barring that, memory mapping the file (Although, I'm not an expert in such practices, so take the advice with a grain of salt). F̶o̶r̶ ̶b̶e̶t̶t̶e̶r̶ ̶o̶r̶ ̶w̶o̶r̶s̶e̶,̶ ̶a̶t̶ ̶t̶h̶i̶s̶ ̶p̶o̶i̶n̶t̶ ̶i̶n̶ ̶t̶i̶m̶e̶,̶ ̶s̶o̶m̶e̶ ̶o̶f̶ ̶N̶i̶m̶'̶s̶ ̶h̶i̶g̶h̶-̶l̶e̶v̶e̶l̶ ̶I̶O̶ ̶f̶u̶n̶c̶t̶i̶o̶n̶s̶ ̶a̶r̶e̶n̶'̶t̶ ̶a̶l̶w̶a̶y̶s̶ ̶t̶h̶e̶ ̶f̶a̶s̶t̶e̶s̶t̶ ̶t̶o̶ ̶u̶s̶e̶,̶ ̶s̶o̶ ̶y̶o̶u̶ ̶m̶i̶g̶h̶t̶ ̶a̶l̶s̶o̶ ̶b̶e̶n̶e̶f̶i̶t̶ ̶f̶r̶o̶m̶ ̶u̶s̶i̶n̶g̶ ̶p̶l̶a̶t̶f̶o̶r̶m̶ ̶s̶p̶e̶c̶i̶f̶i̶c̶ ̶o̶r̶ ̶b̶u̶i̶l̶t̶-̶i̶n̶ ̶C̶ ̶f̶u̶n̶c̶t̶i̶o̶n̶s̶ ̶(̶a̶f̶t̶e̶r̶ ̶p̶r̶o̶f̶i̶l̶i̶n̶g̶,̶ ̶o̶f̶ ̶c̶o̶u̶r̶s̶e̶)̶.̶

EDIT: Ok, so apparently it's only the readLine and lines procedures that are slow, for inherent reasons

Araq (orginal) [2015-06-01T10:27:39+02:00] view original

@Varriount On the contrary, Nim's CSV module already uses streaming with very efficient buffer handling. It's only the readline, lines stuff that people blindly use that's slow.

Jehan (orginal) [2015-06-01T12:53:08+02:00] view original

Injecting multi-GB files into the compiler tool chain is almost always the wrong approach. Compilers and linkers aren't designed to handle multi-GB data blobs efficiently; performance may not be linear and on some architectures, compilers or linkers may even break.

My recommendation – if it's possible – would be to either pass the CSV file directly. If you only need to process it once, you won't save any time doing it differently. If you need to process the data multiple times, I'd recommend converting it into a binary layout first (dumping a seq to a file with writeBuffer(), then using mmap() to access the binary file as a ptr to an unchecked array. This can only become a problem if your data isn't fixed size (usually, if you have strings mixed up). If you are fine with picking a maximum length for strings, you can represent them as arrays of char, otherwise you'll need to handle them differently (e.g. by dumping them to a separate file and encoding them via their offset).

dom96 (orginal) [2015-06-01T13:24:30+02:00] view original

Perhaps you could write a quick Nim script to read the first line of your file and then use staticExec on it. For example:

getLine.nim

import os
var file: File
doAssert open(file, getCurrentDir() / "file.csv")
echo readLine(file)
close(file)

Your module:

import os
const line = staticExec("getline".addFileExt(ExeExt))

echo line.repr

mora (orginal) [2015-06-01T18:46:44+02:00] view original

Thanks for all the answers. The possibility to not read all the file to the memory is awesome, I'll check that.

@Jehan I only want to parse the first couple of lines of the csv at compile time to learn the columns' names and types. The rest of the data would be read and processed at runtime.

I'll come back once I have a working version.

Mirror of forum.nim-lang.org

1277 :: Read file at compile time