I'm trying to work through https://github.com/kanaka/mal/blob/master/process/guide.md
but I'm having problems with the regular expression for tokenization. First I tried it like this:
import re
var tokenRE = re"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:[\\].|[^\\"])*"|;.*|[^\s\[\]{}()'"`@,;]+)"""
echo "(123 456)".findAll(tokenRE)
This goes into an endless loop. Then I tried nre: https://github.com/flaviut/nre
import nre, optional_t
var tokenRE = re"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:[\\].|[^\\"])*"|;.*|[^\s\[\]{}()'"`@,;]+)"""
echo "(123 456)".findAll(tokenRE)
This returns @["(", "123", " 456", ")"], which is wrong: the third token still contains its leading space. A simple Python version prints the correct ['(', '123', '456', ')']:
import re
tre = re.compile(r"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:[\\].|[^\\"])*"|;.*|[^\s\[\]{}()'"`@,;]+)""")
print(re.findall(tre, "(123 456)"))
Any suggestions how to make this work? This regular expression should be PCRE compatible, so I'm wondering what's going wrong.
import re, nre
let r = r"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:[\\].|[^\\"])*"|;.*|[^\s\[\]{}()'"`@,;]+)"""
assert "(123 456)".findAll( re.re(r)) == @["(", "123", " 456", ")"]
assert "(123 456)".findAll(nre.re(r)) == @["(", "123", " 456", ")"]
// JavaScript
"(123 456)".match(
/[\s,]*(~@|[\[\]{}()'`~^@]|"(?:[\\].|[^\\"])*"|;.*|[^\s\[\]{}()'"`@,;]+)/g
)
Array [ "(", "123", " 456", ")" ]
http://regexr.com/3ah24 gives "(" "123" " 456" ")"
All 4 of these give the exact same result for me.
Python's behavior is indeed wrong.
Whoops, looks like something got messed up: the regex here differs from the one at your link. I copied that one and indeed got an infinite loop with re.
import re, nre
let r = r"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"|;.*|[^\s\[\]{}('"`,;)]*)"""
# infinite loop with re: "(123 456)".findAll( re.re(r))
assert "(123 456)".findAll(nre.re(r)) == @["(", "123", " 456", ")"]
// JavaScript
"(123 456)".match(
/[\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"|;.*|[^\s\[\]{}('"`,;)]*)/g
)
Array [ "(", "123", " 456", ")", "" ]
What's going on here...?
Perhaps this regexp is broken? http://regexr.com/3ah27
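One thing I do see is that it matches the empty string: [\s,]*(...|[^\s\[\]{}('"`,;)]*) (I replaced part of it with ... for clarity). Both [\s,]* and the last alternative end in *, so the whole pattern can succeed without consuming any input.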
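The empty match is easy to confirm outside Nim. Here is a minimal Python sketch, assuming the second regex above; `tokenize` is a hypothetical helper name, not something from the guide or a library. It shows the trailing empty match and one way to drop it:

```python
import re

# Second regex from the thread: the last alternative ends in `*`,
# so the capture group can match the empty string.
TOKEN_RE = re.compile(
    r"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:\\.|[^\\"])*"|;.*|[^\s\[\]{}('"`,;)]*)"""
)

def tokenize(text):
    # Hypothetical helper: keep capture group 1 and skip empty matches.
    return [m.group(1) for m in TOKEN_RE.finditer(text) if m.group(1)]

# Whole matches, including the final empty one at the end of the input.
raw = [m.group(0) for m in TOKEN_RE.finditer("(123 456)")]
# Capture groups with the empty matches removed.
tokens = tokenize("(123 456)")

print(raw)
print(tokens)
```

A scanning loop that restarts at the end of each match has to advance by at least one character after an empty match, otherwise it never terminates. That may be what happens in re's findAll here, but I have not checked its source.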
I understand now that Python's behavior is correct: findall returns only the first capture (bracket) group of each match, whereas Nim's libraries return the whole match.
findAll is just a small shortcut function: https://github.com/flaviut/nre/blob/a1110ebb14e7ca1bd99ceb05da986ba902662bbc/src/nre.nim#L366
You can make your own version that returns the first capture group by replacing match.match with match.captures[0].
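The same distinction can be reproduced in Python. `re.findall` returns the capture group when the pattern has exactly one, while `group(0)` on a match object is the whole match. A small sketch using the first regex from the thread (the variable names are mine):

```python
import re

# First regex from the thread (last alternative ends in `+`).
TOKEN_RE = re.compile(
    r"""[\s,]*(~@|[\[\]{}()'`~^@]|"(?:[\\].|[^\\"])*"|;.*|[^\s\[\]{}()'"`@,;]+)"""
)

text = "(123 456)"

# Whole match: includes whitespace consumed by the leading [\s,]*.
whole_matches = [m.group(0) for m in TOKEN_RE.finditer(text)]
# First capture group: just the token, which is what findall returns.
first_groups = [m.group(1) for m in TOKEN_RE.finditer(text)]

print(whole_matches)
print(first_groups)
print(TOKEN_RE.findall(text) == first_groups)
```

So the regex itself behaves the same everywhere; only the choice between whole match and capture group differs.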