Hi everybody,
I have put on GitHub a Nim library to parse TOML files (https://github.com/toml-lang/toml). It is MIT-licensed and available at https://github.com/ziotom78/parsetoml. (I submitted a PR to have it added to Nimble; if there are no problems, it should appear soon.)
Currently it is able to correctly parse the test files provided in the latest version of the TOML repository (including the "hard" test, https://github.com/toml-lang/toml/blob/master/tests/hard_example.toml). It has fairly good test coverage, although not every case is covered yet. Things I would like to do in the next releases, in order of importance:
Comments and suggestions are welcome!
Edit: links fixed after fadg44a3w4fe's comment
Really nice! The logic is very easy to follow, and the design looks well thought out. I just want to point out, in case you weren't aware, that Nim does have a style guide, with naming suggestions for enums, spacing, etc.
What concerns me is the naming of TomlValueKind's members. Since it's a public enum, the members can be used without full qualification, and using a somewhat vague naming scheme such as 'kind<X>' could cause confusion. I would personally go for something along the lines of 'tvk<X>', 'tvKind<X>', or similar.
Stylistic concerns aside, this looks to be a very useful module, and definitely something I would consider using.
Lots of your links seem broken, I'm not sure why.
I've been working on a TOML parser myself, but I got caught up in some yak shaving and haven't finished yet. If you'd like to look at it, see https://gist.github.com/7382d036b5cfc612cfb0
https://github.com/ziotom78/parsetoml/blob/master/parsetoml.nim#L235-L270 isn't really necessary, the unicode module does the same thing: http://nim-lang.org/unicode.html
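For reference, a minimal sketch of what relying on the stdlib looks like (the snowman code point is just an arbitrary example):

```nim
import unicode

# unicode.toUTF8 converts a Rune (a Unicode code point) into its UTF-8
# byte sequence, so no hand-written encoder is needed.
let snowman = Rune(0x2603)   # U+2603 SNOWMAN
doAssert snowman.toUTF8 == "\xE2\x98\x83"
echo snowman.toUTF8
```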
https://github.com/ziotom78/parsetoml/blob/master/parsetoml.nim#L279 is actually a bug: `\n` in Nim is platform-dependent. `\l` (a linefeed) would be correct.
For datetime, I use option for optional segments so that I can keep perfect back-and-forth. See the gist.
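To illustrate the idea, here is a minimal sketch (the type and field names are hypothetical, not the actual ones from the gist) of a datetime object that keeps optional segments in Option fields, so that unparsing can reproduce the input exactly:

```nim
import options

# Hypothetical sketch: optional segments of a TOML datetime are stored
# as Option values, so "absent" and "present" survive a round trip.
type
  TomlDateTime = object
    year, month, day: int
    hour, minute, second: int
    fraction: Option[int]        # fractional seconds, if present
    offsetMinutes: Option[int]   # timezone offset, if present

let stamp = TomlDateTime(year: 2015, month: 1, day: 18,
                         hour: 3, minute: 0, second: 3,
                         fraction: none(int),
                         offsetMinutes: some(0))  # a "Z" offset
doAssert stamp.fraction.isNone
doAssert stamp.offsetMinutes.isSome
```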
I'd also like to point out that TOML's test suite is woefully inadequate, it doesn't test that [ foo ] => "foo". I'd use the proposed ABNF at https://github.com/toml-lang/toml/pull/236 instead.
Thanks for your nice comments, Varriount, fadg44a3w4fe, def, and Nikki!
Varriount: You're right. I decided to write this library as a way to better understand Nim, and I discovered the awesomeness of pure enums while I was in the middle of coding it. In fact, you can tell whether an enum in the library was designed early or late by its purity. Since at the moment nobody else is using the library, I worked up the courage to change the API and make TomlValueKind a pure enum. This change is currently available in the devel branch.
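For readers unfamiliar with pure enums, a minimal sketch (with hypothetical member names, not the library's actual ones) of why purity removes the need for a `tvk`-style prefix:

```nim
type
  TomlValueKind {.pure.} = enum   # {.pure.} forces qualified access
    None, Int, Float, String

# Members must be written as TomlValueKind.Int, so short, readable
# names cannot clash with identifiers in the importing module.
let kind = TomlValueKind.Int
doAssert $kind == "Int"
```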
fadg44a3w4fe: Sorry for the links, I have fixed them. Many thanks for sharing your code; I read through it and found your implementation of the Datetime object very interesting, and I think I'll copy it (I added an issue here: https://github.com/ziotom78/parsetoml/issues/4). I have also fixed my implementation of parseUnicode by relying on the unicode.toUTF8 proc (how could I have missed that? I had looked at the procedures in that module, but didn't notice it…). I see that your code reads the whole file into memory and then uses string utilities (and the re module). I initially considered this approach, but discarded the idea because of two things:
def: I hadn't thought about implementing procedures for writing TOML files. However, this would fit perfectly with my old idea of translating some JSON configuration files I have (for an old legacy C++ program I use for my job) into TOML. It's true that a similar tool probably already exists, but it would be an interesting exercise for a novice like me to write it in Nim.
I was thinking that streaming parsing is unlikely to be necessary, so I thought it better to optimize for simplicity rather than functionality. The parser doesn't return until it's done anyway, and config files tend to be small enough, and PCRE fast enough, that I don't believe (no concrete numbers) it really matters whether everything happens all at once or incrementally.
PCRE is installed on pretty much every Linux distro, and it's in Homebrew for macOS. Things are harder on Windows; I'm not sure how to get PCRE to work there.
fadg44a3w4fe: My idea was to use TOML for providing a summary of the calculations of the numerical code I write in my job (I am an astrophysicist). These are quite huge MPI codes that run on hundreds of processes and take hours or days to complete. Usually, such programs write a large number of log files (typically one per job, and because of the difficulty in debugging MPI programs you usually put a lot of messages in each of them). When one of these jobs runs, I am always digging into the partially written log files to check that everything is OK so far, and to figure out how much has already been done and how much is left to compute. I've always dreamed of patching such programs to make them write a summary of the computations they have completed so far on stderr. I would then pipe stderr to another program that shows the progress and other useful information for each process. I think that TOML would be the perfect format for the information being piped between the two programs. (It's true that so far the functions I wrote don't return until the parsing is complete, but it's easy to add a callback argument to parseStream that is called whenever it adds a new node to the tree.)
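A minimal sketch of what such a callback hook might look like. This is a hypothetical API, not what parsetoml currently offers, and the toy parser below handles only bare `[table]` headers and `key = value` lines:

```nim
import strutils

type
  TomlEvent = object
    table, key, rawValue: string

# Hypothetical streaming interface: `onValue` fires as soon as each
# key/value pair is read, instead of waiting for the full document.
proc parseStreamDemo(lines: openArray[string],
                     onValue: proc (ev: TomlEvent)) =
  var currentTable = ""
  for rawLine in lines:
    let line = rawLine.strip()
    if line.len == 0 or line[0] == '#':
      continue                              # skip blanks and comments
    if line[0] == '[':
      currentTable = line[1 .. ^2].strip()  # a [table] header
    else:
      let eq = line.find('=')
      onValue(TomlEvent(table: currentTable,
                        key: line[0 ..< eq].strip(),
                        rawValue: line[eq + 1 .. ^1].strip()))

var seen: seq[string] = @[]
parseStreamDemo(["[model-fitting]", "norm-chi-sq = 0.86"],
                proc (ev: TomlEvent) = seen.add(ev.table & "." & ev.key))
doAssert seen == @["model-fitting.norm-chi-sq"]
```

A monitoring tool could react inside the callback before the whole stream has been parsed, which is all the progress-reporting use case needs.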
Regarding PCRE, is this the only available option in Nim, apart from PEGs, or is there some other module providing a small, standalone regexp engine? A few days ago I found this link on HN and discovered that, if you don't aim for advanced features, it's not very difficult to implement one. Perhaps at some point I might try to implement it in Nim; it would be useful for people wanting to port their code to Nim but not willing to convert their regexps to PEGs.
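As an illustration of how small the basic idea is, here is a sketch in Nim of the classic tiny matcher in the Pike/Kernighan style (supporting only `.`, `*`, `^` and `$`); this is a toy, not a proposed library:

```nim
proc matchHere(re, text: string): bool   # forward declaration

# Match zero or more occurrences of `c`, then the rest of `re`.
proc matchStar(c: char; re, text: string): bool =
  var t = text
  while true:
    if matchHere(re, t): return true
    if t.len == 0 or (c != '.' and t[0] != c): return false
    t = t[1 .. ^1]

# Match `re` at the beginning of `text`.
proc matchHere(re, text: string): bool =
  if re.len == 0: return true
  if re.len >= 2 and re[1] == '*':
    return matchStar(re[0], re[2 .. ^1], text)
  if re == "$": return text.len == 0
  if text.len > 0 and (re[0] == '.' or re[0] == text[0]):
    return matchHere(re[1 .. ^1], text[1 .. ^1])

# Match `re` anywhere in `text` (anchored if it starts with '^').
proc matches(re, text: string): bool =
  if re.len > 0 and re[0] == '^':
    return matchHere(re[1 .. ^1], text)
  var t = text
  while true:
    if matchHere(re, t): return true
    if t.len == 0: return false
    t = t[1 .. ^1]

doAssert matches("a*b", "aab")
doAssert matches("^ab", "abc")
doAssert not matches("^ab", "cab")
doAssert matches("c$", "abc")
```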
gradha: Thanks for the link. I have read some of your documentation in the past and wondered how you managed to keep the GitHub page in sync with the nimdoc documentation. After several projects documented with Doxygen and similar tools, I must confess I am no longer convinced that having documentation intertwined with code is a good idea: it makes the code longer and harder to scan. Now I prefer to write the documentation from scratch, as if I were writing a novel, so that I can present the functions and data structures in the order best suited for a pedagogical presentation. It takes more effort to write, but it's easier to make the text flow naturally for the reader. Moreover, a few important procs in my TOML library (getString, getInt, getFloat …) are defined by means of a template (https://github.com/ziotom78/parsetoml/blob/master/parsetoml.nim#L1040-L1075): how should I use docstrings in this case?
Also, from the existing documentation I have the impression that nim doc produces one HTML page per module. Is that really so? I usually use Sphinx, which lets me split the documentation into as many pages as I like: I think this makes the document easier to read and navigate. (An example of what I mean is the documentation of a C library I wrote a few years ago: http://hpixlib.readthedocs.org/en/latest/. I find the subdivision into sections particularly useful in this case.)
I think you are conflating the reference pages nim generates with plain documents explaining how to use them. People don't go to Nim's system module and read it from beginning to end; they go to the tutorials or other documents, which link to the reference instead.
The only difference with regard to Sphinx seems to me that you can embed the docstrings directly in the manually crafted rst. This could be done through Nim's jsondoc command, which dumps the individual docstrings; then a special include directive could read the generated JSON files and embed them.
With regard to the templates, IIRC the doc2 command processes them, so they could in theory contain their own docstring. Of course in this case you could only have a generic "This is a generic proc doing foo with bar", since I guess the user figures out the rest by looking at the parameters in the signature.
Sphinx is in any case much better.
@zio_tom78 TOML doesn't really seem like a good choice for this, but it might be easier to have a document separator instead. ex, something like the following YAML-inspired example:
[some_table]
val1 = "123"
[other_table]
val2 = 123
---
[some_table]
val1 = "321"
[other_table]
val2 = 321
---
That way you can still do most the streaming stuff while also keeping the API and implementation simple.
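A sketch of the pre-processing step this implies (a hypothetical helper, not an existing API: it splits the stream on bare `---` lines before handing each chunk to an ordinary TOML parser):

```nim
import strutils

proc splitDocuments(stream: string): seq[string] =
  ## Splits a multi-document stream on lines containing only "---".
  result = @[]
  var current: seq[string] = @[]
  for line in stream.splitLines():
    if line.strip() == "---":
      result.add(current.join("\n"))
      current = @[]
    else:
      current.add(line)
  if current.len > 0:
    result.add(current.join("\n"))

let docs = splitDocuments("a = 1\n---\na = 2\n")
doAssert docs.len == 2
```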
re. PCRE, yes, I believe it's the only regex library in Nim. It isn't hard to build PCRE, though; I see no reason it can't be seamlessly used with {.compile.}.
start-time = 2015-01-18T03:00:03Z

[input-parameters]
user = "foo"
data-directory = "/datastorage/foo/planck"
output-directory = "/datastorage/foo/my_analysis"
num-of-mpi-processes = 126
num-of-data-files = 1463

-----

[model-fitting]
start-time = 2015-01-18T03:00:05Z
end-time = 2015-01-18T05:00:03Z
norm-chi-sq = 0.86
failed-convergence = ["datafile.0005.fits", "datafile.0008.fits", "datafile.0016.fits"]

-----

[CG-inversion]
max-step-bound = 1000
steps-required = 170
final-rz/rzinit = 2.4e-13
estimated-error = 1.3e-9
start-time = 2015-01-18T05:00:03Z
end-time = 2015-01-18T06:00:03Z

-----

# The computation is still running, so more stuff is going to be appended here
@zio_tom78 What I mean is that TOML is not explicitly designed for this sort of usage; it's one file, one document. JSON and YAML have the idea of multiple documents built in.
The data format you posted is not TOML but an extension of TOML; you can't pass that data directly to a parser without pre-processing it. There's nothing wrong with that, but it can't strictly be called TOML.