nimforum mirror - Announcement: Another PCRE library, NRE

fadg44a3w4fe [2015-01-19T18:02:11+01:00]

I've created NRE, a regular expression library built on top of PCRE, with a heavily Python-inspired API. It can be found on GitHub at https://github.com/flaviut/nre, with the documentation in the readme. I've reproduced a section from the readme here:


== Why?

The http://nim-lang.org/re.html[re.nim] module that http://nim-lang.org/[Nim]
provides in its standard library is inadequate:
 
 - It provides only a limited number of captures, while the underlying library
   (PCRE) allows an unlimited number.
 - Instead of having one proc that returns both the bounds and substring, it
   has one for the bounds and another for the substring.
 - If the splitting regex is empty (`""`), then it returns the input string
   instead of following https://ideone.com/dDMjmz[Perl],
   http://jsfiddle.net/xtcbxurg/[JavaScript], and
   https://ideone.com/hYJuJ5[Java]'s precedent of returning a list of each
   character (`"123".split(re"") == @["1", "2", "3"]`).

Suggestions and bug reports are welcome!

BlaXpirit [2015-01-19T21:35:16+01:00]

You showed only the smallest flaws of the library. I'll try to remember what I had found.

import re
import nre

let s1 = "Test,test; test. Test."
assert s1.split( re.re"([,;.])\ ?") == @["Test", "test", "test", "Test"]
assert s1.split(nre.re"[,;.] ?")   == @["Test", "test", "test", "Test", ""]
assert s1.split(nre.re"([,;.]) ?") == @["Test", ",", "test", ";", "test", ".", "Test", ".", ""]
# You can't get the delimiter itself, and when I actually needed this in my project,
# I couldn't even write a proper replacement for this function based only on `re` library

# 0-length matches are just ignored.
assert "word word".split( re.re"\b") == @["word word"]
assert "word word".split(nre.re"\b") == @["word", " ", "word"]

assert "a123b".split( re.re"[0-9]") == @["a", "2", "b"] # Just completely wrong
assert "a123b".split(nre.re"[0-9]") == @["a", "", "", "b"]

Araq [2015-01-19T21:42:13+01:00]

"It is buggy to the core."

And yet all of your examples are about split, which works as I intended it to work. Apparently this is not at all how it "obviously should" work (where "obviously" here means "as in other languages/runtimes"), but maybe even you can see how

assert "a123b".split( re.re"[0-9]") == @["a", "2", "b"] # Just completely wrong
assert "a123b".split(nre.re"[0-9]") == @["a", "", "", "b"]

is completely based on personal feelings and opinions.

BlaXpirit [2015-01-19T21:47:43+01:00]

But yes, I can confirm this new library is just amazing. Thank you for your work.

The API is very well thought out: instead of a lot of strangely named functions that do the same thing but differently, it has a small, intuitive set of functions with uniform behavior, and you can already achieve more with them.

Everything is thoroughly tested and matches Perl's and JavaScript's behavior.

fadg44a3w4fe [2015-01-19T21:55:29+01:00]

@Araq BlaXpirit might have been too harsh, but "a123b".split(nre.re"[0-9]") == @["a", "", "", "b"] makes logical sense:


[]                   a123b # initial state
["a"]                23b   # matched a digit, take the substring before it
["a", ""]            3b    # matched another digit, take the (empty) substring
["a", "", ""]        b     # matched another digit, take the (empty) substring
["a", "", "", "b"]         # no digits left, take the rest as the last substring

There are n matches of the delimiter, and n + 1 substrings in the result.
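
The same rule can be sketched in plain Nim without any regex library. This is only an illustration of the "n matches, n + 1 substrings" idea, not how nre is implemented; `splitOnDigits` is a hypothetical helper that treats every ASCII digit as one delimiter match.

```nim
import strutils

# Hypothetical helper (not part of nre or re): split `s` at every ASCII digit.
# Each digit counts as one delimiter match, so n digits give n + 1 substrings.
proc splitOnDigits(s: string): seq[string] =
  result = @[]
  var start = 0                    # index where the current substring begins
  for i, c in s:
    if c in Digits:                # delimiter match at position i
      result.add(s[start ..< i])   # text since the previous match (may be empty)
      start = i + 1                # next substring starts after the digit
  result.add(s[start .. ^1])       # whatever is left after the last match

assert splitOnDigits("a123b") == @["a", "", "", "b"]
```

Three digits are matched, so four substrings come out, two of them empty because the digits are adjacent.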

BlaXpirit [2015-01-19T22:05:13+01:00]

"completely based on personal feelings and opinions"

No, it's based on the behavior of some of the most popular regex implementations. And there is probably some kind of standard for them.

I'll continue...

import re
import nre

# Can't replace using a function.
# There is nothing like "Test test TEST".replace(re.re"\b[A-Z]+\b", toLower)
# So basically, it is impossible to do anything to the captures except rearrange them.

#"abc".replace(re.re"", "x") # infinite loop!

#      "word word".replace( re.re"\b", "x") # infinite loop!
assert "word word".replace(nre.re"\b", "x") == "xwordx xwordx"

# echo "word word".findAll(re.re"b|\b") # infinite loop!

#      "abracadabra".replace( re.re"(?!a)", "|") # infinite loop!
assert "abracadabra".replace(nre.re"(?!a)", "|") == "a|b|ra|ca|da|b|ra|"

Araq [2015-01-19T23:30:05+01:00]

Ok, ok, I take it back, the re module is buggy indeed. Now if only nre would work with all the code out there that uses re... ;-)

Nikki [2015-01-19T23:42:28+01:00]

"Because Perl does it that way" isn't a good argument, as far as I'm concerned. There's a lot that Perl does wrong, including its regexp handling in some cases.

fadg44a3w4fe [2015-01-20T00:09:24+01:00]

@Araq No, and that isn't an objective of NRE at all. I want a clean break, and I won't even promise backwards-compatibility for a month or so.

@Nikki Can you elaborate on a case where Perl does things poorly? I'd like to know just in case I need to avoid it.

BlaXpirit [2015-01-20T02:18:32+01:00]

Awesomeness.

import future, strutils
import nre


assert "particles are particularly interesting".replace(re"part", toUpper) ==
  "PARTicles are PARTicularly interesting"

assert "EIFFEL Gustave, Auguste Perret, LESCOT Pierre".replace(
  re"\b([A-Z]{3,}) ([A-Z][a-z]{2,})\b",
  (m: RegexMatch) => "$2 $1" % m.captures.toSeq.map(toLower).map(capitalize)
) == "Gustave Eiffel, Auguste Perret, Pierre Lescot"

I may start a wiki of examples.

imback [2015-02-22T03:24:51+01:00]

Hi,

I have a text file with nearly 4000 email addresses inside random HTML tags, and I want to extract all of them using nre. So far I have this piece of code:

import strutils, nre

let file = readFile("list.txt")

echo file

How can I extract the emails using regular expressions?

imback [2015-02-22T05:42:55+01:00]

OK, it's working now. I took a regex pattern from here: http://www.regular-expressions.info/email.html

And this is my code now:

import strutils, nre

let file = readFile("list.txt")

let reEmail = re"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b"

for mail in findIter(file, reEmail, 0, -1):
  echo mail

#echo file

But it seems slow. Maybe I'm doing something wrong?

fadg44a3w4fe [2015-02-22T22:49:20+01:00]

First off, you can use file.findIter(reEmail), which is exactly the same as findIter(file, reEmail, 0, -1).

For the slowness: Are you compiling with -d:release? If not, you should. Things will become a lot faster.
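
A minimal sketch of the shorter call, under some assumptions: the sample string and the simplified pattern below are made up for illustration (they stand in for the contents of list.txt and the full pattern), and `match` is assumed to be the accessor that returns the matched substring in the nre version you have installed.

```nim
import nre

# Hypothetical sample input standing in for the contents of list.txt.
let text = "<b>alice@example.com</b> <i>bob@example.org</i>"

# Simplified pattern for illustration only; not the full pattern from the post above.
let reEmail = re"[a-z0-9._-]+@[a-z0-9.-]+\.[a-z]{2,}"

var found: seq[string] = @[]
for m in text.findIter(reEmail):   # method-call form, no explicit start/end arguments
  found.add(m.match)               # assumption: `match` yields the matched text

for mail in found:
  echo mail
```

Collecting the matches in a seq first keeps the matching loop free of I/O, which makes it easier to tell whether the regex or the printing is the slow part.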

BlaXpirit [2015-04-10T20:45:21+02:00]

Join the discussion: Deprecate and replace 're' with 'nre'

Mirror of forum.nim-lang.org