I've created NRE, a regular expression library built on top of PCRE, with a heavily Python-inspired API. It can be found on GitHub at https://github.com/flaviut/nre, with the documentation in the README. I've reproduced a section of the README here:
== Why?
The http://nim-lang.org/re.html[re.nim] module that http://nim-lang.org/[Nim]
provides in its standard library is inadequate:
- It provides only a limited number of captures, while the underlying library
(PCRE) allows an unlimited number.
- Instead of having one proc that returns both the bounds and substring, it
has one for the bounds and another for the substring.
- If the splitting regex is empty (`""`), then it returns the input string
instead of following https://ideone.com/dDMjmz[Perl],
http://jsfiddle.net/xtcbxurg/[JavaScript], and
https://ideone.com/hYJuJ5[Java]'s precedent of returning a list of each
character (`"123".split(re"") == @["1", "2", "3"]`).
Suggestions and bug reports are welcome!
import re
import nre
let s1 = "Test,test; test. Test."
assert s1.split( re.re"([,;.])\ ?") == @["Test", "test", "test", "Test"]
assert s1.split(nre.re"[,;.] ?") == @["Test", "test", "test", "Test", ""]
assert s1.split(nre.re"([,;.]) ?") == @["Test", ",", "test", ";", "test", ".", "Test", ".", ""]
# With `re` you can't get the delimiter itself, and when I actually needed this in my project,
# I couldn't even write a proper replacement for this function using only the `re` library.
# 0-length matches are just ignored.
assert "word word".split( re.re"\b") == @["word word"]
assert "word word".split(nre.re"\b") == @["word", " ", "word"]
assert "a123b".split( re.re"[0-9]") == @["a", "2", "b"] # Just completely wrong
assert "a123b".split(nre.re"[0-9]") == @["a", "", "", "b"]
It is buggy to the core.
And yet all of your examples are about split which works as I intended it to work. Apparently this is not at all how it "obviously should" work (where "obviously" here means "as in other languages/runtimes"), but maybe even you can see how
assert "a123b".split( re.re"[0-9]") == @["a", "2", "b"] # Just completely wrong
assert "a123b".split(nre.re"[0-9]") == @["a", "", "", "b"]
is completely based on personal feelings and opinions.
But yes, I can confirm this new library is just amazing. Thank you for your work.
The API is very well thought out: instead of a lot of strangely named functions that do the same thing but differently, it has a small, intuitive set of functions with uniform behavior, and you can already achieve more with them.
Everything is thoroughly tested and matches Perl's and JavaScript's behavior.
@Araq BlaXpirit might have been too harsh, but "a123b".split(nre.re"[0-9]") == @["a", "", "", "b"] makes logical sense:
[] a123b # initial state
["a"] 23b # hit across a number, substring
["a", ""] 3b # another number, substring again
["a", "", ""] b # another number, substring again
["a", "", "", "b"] # no numbers left, substring the rest
There are n matches of the delimiter, and n + 1 substrings in the result.
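To illustrate the counting argument, here is a minimal sketch in plain Nim that uses neither `re` nor `nre`. `splitOnDigits` is a hypothetical helper written only for this post: it treats every single digit as one delimiter match and emits one substring per match, plus the remainder at the end.

```nim
# Hypothetical helper, not part of re or nre: split on every single digit.
proc splitOnDigits(s: string): seq[string] =
  var current = ""
  for ch in s:
    if ch in {'0'..'9'}:
      # Each delimiter match closes the current (possibly empty) substring.
      result.add current
      current = ""
    else:
      current.add ch
  # Whatever follows the last delimiter is the final substring.
  result.add current

# 3 delimiter matches, so 3 + 1 substrings.
assert splitOnDigits("a123b") == @["a", "", "", "b"]
```

Adjacent delimiters produce empty strings between them, which is why the two empty entries show up.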
completely based on personal feelings and opinions
No, it's based on the behavior of some of the most popular regex implementations. And there is probably some kind of standard for them.
I'll continue...
import re
import nre
# Can't replace using a function.
# There is nothing like "Test test TEST".replace(re.re"\b[A-Z]+\b", toLower)
# So basically, it is impossible to do anything to the captures except rearrange them.
#"abc".replace(re.re"", "x") # infinite loop!
# "word word".replace( re.re"\b", "x") # infinite loop!
assert "word word".replace(nre.re"\b", "x") == "xwordx xwordx"
# echo "word word".findAll(re.re"b|\b") # infinite loop!
# "abracadabra".replace( re.re"(?!a)", "|") # infinite loop!
assert "abracadabra".replace(nre.re"(?!a)", "|") == "a|b|ra|ca|da|b|ra|"
@Araq No, and that isn't an objective of NRE at all. I want a clean break, and I won't even promise backwards-compatibility for a month or so.
@Nikki Can you elaborate on a case where Perl does things poorly? I'd like to know just in case I need to avoid it.
Awesomeness.
import future, strutils
import nre
assert "particles are particularly interesting".replace(re"part", toUpper) ==
  "PARTicles are PARTicularly interesting"
assert "EIFFEL Gustave, Auguste Perret, LESCOT Pierre".replace(
  re"\b([A-Z]{3,}) ([A-Z][a-z]{2,})\b",
  (m: RegexMatch) => "$2 $1" % m.captures.toSeq.map(toLower).map(capitalize)
) == "Gustave Eiffel, Auguste Perret, Pierre Lescot"
I may start a wiki of examples.
Hi,
I have a txt file with nearly 4000 email addresses inside random HTML tags, and I want to extract all of them using nre, so I have this piece of code:
import strutils, nre
let file = readFile("list.txt")
echo file
How do I extract the emails using regular expressions?
OK, it's working now. I took a regex pattern from here: http://www.regular-expressions.info/email.html
And this is my code now:
import strutils, nre
let file = readFile("list.txt")
let reEmail = re"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b"
for mail in findIter(file, reEmail, 0, -1):
  echo mail
#echo file
But it seems slow. Maybe I'm doing something wrong?
First off, you can use file.findIter(reEmail), which is exactly the same as findIter(file, reEmail, 0, -1).
For the slowness: Are you compiling with -d:release? If not, you should. Things will become a lot faster.
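A minimal sketch of the method-call form, assuming a current Nim where the library is importable as `std/nre`, where `RegexMatch` exposes the matched text as `.match`, and where the PCRE shared library is installed. `extractEmails` and the deliberately simplified pattern are illustrative only; they are not the pattern from the post above and will accept some strings that are not valid addresses.

```nim
import std/nre

# Hypothetical helper: collect every email-like substring in `text`.
proc extractEmails(text: string): seq[string] =
  # Simplified, case-insensitive pattern (an assumption made for brevity).
  let emailPattern = re"(?i)[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
  # Method-call form: no explicit start/end arguments needed.
  for m in text.findIter(emailPattern):
    result.add m.match

let sample = "<p>alice@example.com</p><a href='mailto:bob@test.org'>x</a>"
assert extractEmails(sample) == @["alice@example.com", "bob@test.org"]
```

Building it with `nim c -d:release` turns on optimizations and disables the debug-mode runtime checks.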