First of all, I'd like to express my gratitude and appreciation for Nim. I've been following the project since Nim v0.10 (more or less), although I've never managed to find enough time to study the language and experiment with it as I should have (I even bought a very early MEAP of the excellent "Nim in Action" book).
Finally, I'm now able to dedicate myself to learning Nim on an almost daily basis, and I'm starting to port to Nim some prototype projects which I had created with quick-to-use languages, but which now deserve to be reimplemented in a solid language. I'm really impressed by the philosophy behind Nim and the clean syntax it offers. I hope I'll be able to contribute to the project over time.
Right now, I'm porting a project which relies on regexes (the original code uses PCRE). After looking at the impure re library and the pcre wrapper, as well as the nre library, and comparing them to the strscans library, I'm tempted to use the latter for my project, since it seems to cover all the pattern needs of my code (not complex patterns, but quite a lot of them). I like strscans because it's extensible and simpler to use than any of the current PCRE solutions, and most of all because it's written in pure Nim and doesn't require third-party dependencies.
I have a few questions though...
Performance-wise, how does strscans compare to PCRE-based libraries? My project has to perform a huge quantity of pattern matching, so performance might be a concern if the difference is huge. Overall, I'd like to prioritize code usability over performance right now. Since this is a CLI tool used to process many input text files, I can't afford a significant degradation in performance, but it's OK if the difference is marginal.
Also, I wanted to ask about the status of the nre library. I've looked at its documentation and the GitHub issues linked therein, and learned that it's now part of the Nim Standard Library, but I couldn't really work out its current status. What I don't understand is: why isn't it linked in the documentation page for the StdLib? And why does it reside in a subfolder of its own (lib\impure\nre\private\)?
Thanks
Well in the end you have to benchmark but strscans doesn't do anything that would be "obviously slow". It should be just fine.
nre is a controversial library and was scheduled for deprecation, but it hasn't happened yet and we're in "freeze-mode" for 0.20. Hmmm.
There is also https://github.com/nitely/nim-regex which is a pure Nim implementation without the PCRE dependency.
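A minimal sketch of how such a benchmark could start, assuming a simple `key=value` input line. The `timeIt` template, the sample line, and the iteration count are all hypothetical and only illustrate the idea. It times strscans alone; a PCRE-based matcher would need its own `timeIt` call on the same input to give a comparison, and is left out here because it requires the PCRE shared library at run time.

```nim
import std/[strscans, times]

# Hypothetical helper: runs `body` a number of times and prints the CPU time used.
template timeIt(label: string, iterations: int, body: untyped) =
  let started = cpuTime()
  for i in 1 .. iterations:
    body
  echo label, ": ", cpuTime() - started, " s"

let line = "width=42"   # assumed sample input, not taken from the thread
var key: string
var value: int
var hits = 0

timeIt("strscans scanf", 100_000):
  # `$w` binds an identifier, `=` is matched literally, `$i` binds an integer.
  if scanf(line, "$w=$i", key, value):
    inc hits

echo "successful matches: ", hits
```

Numbers from a loop like this are only indicative; running it on real input files with a release build (`-d:release`) would give a more trustworthy picture.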
Thanks for the valuable info @Araq.
Well in the end you have to benchmark but strscans doesn't do anything that would be "obviously slow". It should be just fine.
Then I'll go for it, since I like its syntax. As my app is currently an alpha prototype, it should be fine to start with this library. Once it's stable I might try some benchmark comparisons with PCRE and other regex libraries (including Oniguruma, which would be nice to create a wrapper for) and see if there is room for speed improvement. But right now I just want to switch language and keep the code simple. Also, I like the strscans approach.
nre is a controversial library and was scheduled for deprecation
I see. Although I read about some of the bugs (which I thought were solved), it looked interesting, and its approach seemed simple and flexible.
There is also https://github.com/nitely/nim-regex which is a pure Nim implementation without the PCRE dependency.
Ah, I didn't realize that; I thought it relied on the pcre wrapper. Interesting, I'll have a look at it then.
@Araq, a further clarification (might be a bug in the strscans documentation).
The strscans docs state that $w "Matches an ASCII identifier: [A-Z-a-z_][A-Za-z_0-9]*.", but looking at the source code of parseutils it seems that IdentChars does not include the hyphen character:
IdentChars = {'a'..'z', 'A'..'Z', '0'..'9', '_'}
IdentStartChars = {'a'..'z', 'A'..'Z', '_'}
(same definition found in strutils too). The strscans documentation seems to imply that the hyphen is a valid identifier character, but this might be inaccurate. Personally, I think the hyphen shouldn't be a valid identifier character, since it isn't one in most syntaxes. I actually discovered this discrepancy because I was looking for a way to reimplement it without the hyphen.
If you can confirm that it's a documentation error, I can fix it and open a PR for it.
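A quick way to check the behaviour, as a sketch: it assumes `$w` relies on `parseutils.parseIdent` (as the source reading above suggests) and uses a made-up input, `foo-bar`. The `IdentChars` and `IdentStartChars` sets come from strutils, where they are exported.

```nim
import std/[parseutils, strscans, strutils]

# The hyphen is not a member of either identifier character set.
doAssert '-' notin IdentChars
doAssert '-' notin IdentStartChars

let input = "foo-bar"   # hypothetical test input

# parseIdent returns the number of characters it consumed as an identifier.
var ident: string
let consumed = parseIdent(input, ident, 0)
echo "parseIdent read '", ident, "' (", consumed, " chars)"

# `$w$.` only succeeds if a single identifier spans the whole input,
# so a hyphenated word is not expected to match here.
var word: string
let whole = scanf(input, "$w$.", word)
echo "matches as one identifier: ", whole
```

If the hyphen were accepted by `$w`, the last check would report a full match for `foo-bar`.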