nimforum mirror - W3C Compliant HTML Parser to replace current std/htmlparser

Niminem [2023-07-08T05:26:11+02:00]

In short, I believe we should replace the current std/htmlparser with a standards-compliant HTML parser that follows the HTML Living Standard: https://html.spec.whatwg.org/multipage/

If this is at all possible, I'm more than happy to sponsor getting this implemented! Let me know what I need to do to help.

I'm open to this being a third-party library as well, btw. I just feel it should be in our standard library, because we already have an HTML parser there but it's... not that good.

The community will benefit either way, and I'm all for it. I've been wanting to write this parser for a while now, but I simply don't (and won't) have the time.

Thoughts?

amadan [2023-07-08T08:08:23+02:00]

@nrk linked their parser, which is standards compliant: https://git.sr.ht/~bptato/chawan/tree/master/item/src/html/htmlparser.nim

If it could be broken out into its own library, it would be really handy.

nrk [2023-07-08T10:02:12+02:00]

I am indeed planning to isolate Chawan's html5 parser into a separate library. Right now I'm evaluating the best way to write an API that doesn't involve bringing in half of Chawan as a dependency; preferably it would work similarly to html5ever, so you could supply your own DOM implementation. (Eventually the library could provide a basic DOM skeleton for ease of use.)
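To make the "supply your own DOM" idea concrete, here is a minimal Nim sketch of what such an interface could look like. It is only an illustration under stated assumptions: TreeSink, Node, newNodeSink, buildDemo and render are hypothetical names invented for this example, not Chawan's or html5ever's actual API, and buildDemo merely stands in for a real tree builder.

```nim
# Hypothetical sketch: every name below is invented for illustration and
# is not taken from Chawan or html5ever.
type
  # Callbacks the parser calls to build a tree. Handle is the caller's own
  # node type, so the parser never depends on a concrete DOM.
  TreeSink[Handle] = object
    createElement: proc (tag: string): Handle
    createText: proc (data: string): Handle
    appendChild: proc (parent, child: Handle)

  # A bare-bones DOM that a caller might plug in.
  Node = ref object
    tag: string          # empty for text nodes
    text: string         # only used by text nodes
    children: seq[Node]

proc newNodeSink(): TreeSink[Node] =
  ## Wires the three callbacks to the bare-bones Node type.
  result.createElement = proc (tag: string): Node = Node(tag: tag)
  result.createText = proc (data: string): Node = Node(text: data)
  result.appendChild = proc (parent, child: Node) =
    parent.children.add(child)

proc buildDemo[Handle](sink: TreeSink[Handle]): Handle =
  ## Stand-in for the real tree builder: issues the calls a parser
  ## would make after tokenizing "<p>hi</p>".
  result = sink.createElement("p")
  sink.appendChild(result, sink.createText("hi"))

proc render(n: Node): string =
  ## Serializes the demo DOM back to markup (no escaping, demo only).
  if n.tag.len == 0:
    return n.text
  result = "<" & n.tag & ">"
  for child in n.children:
    result.add(render(child))
  result.add("</" & n.tag & ">")

when isMainModule:
  let root = buildDemo(newNodeSink())
  doAssert render(root) == "<p>hi</p>"
```

Because buildDemo only talks to the sink, a caller could swap in a different Handle type (for example, an index into a flat node array) without touching the parser side.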

Not sure if putting it in the stdlib is the best idea: with the tokenizer, it's around 4k lines of code. That's quite the liability for maintainers, especially when they are trying to slim down the stdlib. (Not to mention it depends on Chawan's decoderstream, which is, again, hell to integrate.) In short, I would rather make it a separate library.

xTrayambak [2023-07-08T17:12:31+02:00]

Chawan's HTML parser is really nice. It would be great if a few maintainers could be allocated to work on it in the stdlib itself; this could even be a really good selling point for Nim. But alas, HTML parsers are not easy to keep on par with the standards and the quirks of the past.

Niminem [2023-07-08T22:17:51+02:00]

This is great. Please do!

Has this parser been stress tested?

nrk [2023-07-08T23:46:35+02:00]

To some degree, but not extensively. I use it heavily for interpreting HTML content in some private projects (using DOMParser.parseFromString from cha -r), so I have probably caught the most obvious errors by now. But it has no proper testing yet.

An integration of html5lib-tests would be a good first step in this direction. Also, a fuzzer for decoderstream has been on my to-do list for a long time.
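As a rough idea of what the first step of such an integration might involve: the html5lib-tests tree-construction cases live in .dat files made of sections such as #data (the input), #errors, and #document (the expected tree, one node per line prefixed with "| "). The Nim sketch below only splits a file into cases; TestCase and parseDat are hypothetical names, the error line in the sample is illustrative, and a real harness would also need to handle fragment cases, scripting flags, and blank lines inside expected text nodes.

```nim
import std/strutils

# Section headers used by the tree-construction .dat format.
const headers = ["#data", "#errors", "#new-errors", "#document-fragment",
                 "#script-off", "#script-on", "#document"]

type
  TestCase = object      # hypothetical container for one test
    data: seq[string]     # input markup, one entry per line
    document: seq[string] # expected tree dump, lines start with "| "

proc parseDat(src: string): seq[TestCase] =
  ## Splits the contents of a .dat file into test cases.
  ## Simplification: blank lines in #document are dropped.
  var section = ""
  for line in src.splitLines:
    if line in headers:
      section = line
      if line == "#data":
        result.add(TestCase())   # every #data header starts a new case
    elif result.len > 0:
      case section
      of "#data":
        result[^1].data.add(line)
      of "#document":
        if line.len > 0:
          result[^1].document.add(line)
      else:
        discard                  # other sections are ignored in this sketch

# A small sample in the same layout (error text is illustrative).
const sample = """
#data
<p>hi
#errors
(1,3): expected-doctype-but-got-start-tag
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "hi"
"""

when isMainModule:
  let cases = parseDat(sample)
  doAssert cases.len == 1
  doAssert cases[0].data == @["<p>hi"]
  doAssert cases[0].document.len == 5
```

A test runner would then feed each case's data to the parser, dump the resulting tree in the same "| " layout, and compare it line by line with document.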

Niminem [2023-07-15T05:57:19+02:00]

That makes sense. I think this code base is the best starting point out of everything else I've seen in the wild, for Nim at least.

Mirror of forum.nim-lang.org