In short, I believe we should replace the current std/htmlparser with a standards-compliant HTML parser that follows the WHATWG HTML Standard: https://html.spec.whatwg.org/multipage/
If this is at all possible, I'm more than happy to sponsor getting this implemented! Let me know what I need to do to help.
I'm open to this being a third-party library as well, btw. I just feel it belongs in our standard library because we already have an HTML parser there, but it's... not that good.
The community benefits either way, so I'm all for it. I've been wanting to write this parser for a while now, but I simply don't (and won't) have the time.
Thoughts?
@nrk linked their parser, which is HTML compliant: https://git.sr.ht/~bptato/chawan/tree/master/item/src/html/htmlparser.nim
If it could be broken out into its own library, it would be really handy.
I am indeed planning to isolate Chawan's html5 parser into a separate library. Right now I'm evaluating the best way to write an API that doesn't involve bringing in half of Chawan as a dependency; preferably it would work similarly to html5ever, so you could supply your own DOM implementation. (Eventually the library could provide a basic DOM skeleton for ease of use.)
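To illustrate the "supply your own DOM" idea, here is a minimal Nim sketch of a tree builder that is generic over a caller-provided sink, loosely in the spirit of html5ever's `TreeSink` trait. All names here (`Token`, `buildTree`, `OutlineSink`, `openElement`, `closeElement`, `appendText`) are hypothetical and are not Chawan's or html5ever's actual API. A real HTML5 tree builder also has to handle insertion modes, implied tags, and error recovery, none of which is shown.

```nim
import std/strutils

# --- "Library" side (hypothetical names, not Chawan's API) ---
type
  TokenKind = enum
    tkStartTag, tkEndTag, tkText

  Token = object
    kind: TokenKind
    data: string          # tag name, or text content for tkText

# Generic over the sink type S: the builder never touches a concrete DOM.
# It only assumes that S provides openElement, closeElement and appendText.
proc buildTree[S](sink: var S; tokens: openArray[Token]) =
  mixin openElement, closeElement, appendText
  for tok in tokens:
    case tok.kind
    of tkStartTag: sink.openElement(tok.data)
    of tkEndTag: sink.closeElement(tok.data)
    of tkText: sink.appendText(tok.data)

# --- "User" side: a toy DOM replacement that records an indented outline ---
type
  OutlineSink = object
    depth: int
    lines: seq[string]

proc openElement(s: var OutlineSink; name: string) =
  let pad = repeat("  ", s.depth)
  s.lines.add(pad & "<" & name & ">")
  inc s.depth

proc closeElement(s: var OutlineSink; name: string) =
  # This toy sink trusts the builder and does not check `name`.
  if s.depth > 0:
    dec s.depth

proc appendText(s: var OutlineSink; text: string) =
  let pad = repeat("  ", s.depth)
  s.lines.add(pad & "\"" & text & "\"")

var sink = OutlineSink()
buildTree(sink, [
  Token(kind: tkStartTag, data: "p"),
  Token(kind: tkText, data: "hi"),
  Token(kind: tkEndTag, data: "p")
])
for line in sink.lines:
  echo line
```

With this shape, the parser library would not depend on any particular DOM; a basic DOM skeleton could later ship as just one more sink implementation.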
Not sure if putting it in the stdlib is the best idea: with the tokenizer, it's around 4k lines of code. That's quite the liability for maintainers, especially when they are trying to slim down the stdlib. (Not to mention it depends on Chawan's decoderstream, which is again hell to integrate.) In short, I would rather make it a separate library.
This is great. Please do!
Has this parser been stress tested?
To some degree, but not extensively. I use it heavily for interpreting HTML content in some private projects (using DOMParser.parseFromString from cha -r), so I have probably caught the most obvious errors by now. But it has no proper testing yet.
An integration of html5lib-tests would be a good first step in this direction. Also, a fuzzer for decoderstream has been on my to-do list for a long time.
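As a possible starting point for such an integration, here is a rough Nim sketch that loads test cases in the html5lib-tests tree-construction `.dat` layout (`#data`, `#errors`, `#document` sections). The names `TreeTest` and `parseDat` are made up for this example, and the embedded sample is hand-written in the style of those files rather than copied from the suite. It is deliberately simplified: it treats any line starting with `#` as a section header (so input lines that begin with `#` would be misread) and ignores sections such as `#document-fragment`.

```nim
import std/strutils

type
  TreeTest = object
    input: seq[string]     # lines of HTML to feed the parser
    errors: seq[string]    # expected parse errors, one per line
    document: seq[string]  # expected tree dump, lines starting with "| "

proc parseDat(src: string): seq[TreeTest] =
  var section = ""
  for line in src.splitLines():
    if line == "#data":
      result.add TreeTest()          # a new test case starts here
      section = "#data"
    elif result.len == 0:
      continue                       # skip anything before the first test
    elif line.len > 0 and line[0] == '#':
      section = line                 # "#errors", "#document", ...
    else:
      case section
      of "#data":
        result[^1].input.add line
      of "#errors":
        if line.len > 0: result[^1].errors.add line
      of "#document":
        if line.len > 0: result[^1].document.add line
      else:
        discard                      # sections this sketch does not handle

# Hand-written sample in the style of the suite (illustrative only).
const sample = """
#data
<p>Hello
#errors
(1,3): expected-doctype-but-got-start-tag
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "Hello"
"""

let tests = parseDat(sample)
for t in tests:
  echo "input: ", t.input.join("\n")
  echo "expected nodes: ", t.document.len
```

A test runner would then parse each `input`, serialize the resulting tree in the same `| ` format, and compare it with `document`.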