nimforum mirror - htmlparser innerText

itanoss [2019-05-21T03:06:14+02:00]

Hey guys, the result of innerText after parsing an HTML document seems strange.

Source code:

from strformat import `&`
import htmlparser
import xmlparser
import xmltree

let doc = """
<html>
    <body>
        <h1>Test Title :
            <strong>Hello, world!</strong>
        </h1>
    </body>
</html>"""

let html = parseHtml(doc)
echo &">>{html.innerText}<<"

let html2 = parseXml(doc)
echo &">>>{html2.innerText}<<<"

Output:


>>
        
        Test Title :
            Hello, world!


<<
>>>Test Title :
            Hello, world!<<<

Desired output:


>>> Test Title : Hello, world! <<<

When HTML is rendered, runs of whitespace are collapsed, as you know. Is the current output valid, or is it a bug?

Araq [2019-05-21T16:01:03+02:00]

I don't know the HTML spec well enough to tell you.

johnconway [2019-05-22T12:26:36+02:00]

Is the HTML spec relevant? The nodes are being converted to text, not HTML. Whitespace is part of that text.

The inconsistency between htmlparser and xmlparser is odd, though.

itanoss [2019-05-27T04:47:23+02:00]

For a W3C HTML5 document, a user agent literally skips the extra whitespace. I think htmlparser, acting as a user agent, could do the same thing. xmlparser is enough for the case where the repeated whitespace is meant to be part of the text.
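
Until something like that exists in the parser, the collapsing can be done by the caller. The following is only a minimal sketch of that idea: `collapseWhitespace` is a hypothetical helper written for this example, not part of htmlparser or xmltree, and it only approximates what a browser does (it ignores elements such as `<pre>`, for instance).

```nim
import htmlparser, strutils, xmltree

proc collapseWhitespace(s: string): string =
  ## Hypothetical helper: replaces every run of whitespace characters
  ## (spaces, tabs, newlines) with a single space.
  var inSpace = false
  for c in s:
    if c in Whitespace:
      # Emit one space for the whole run, then skip the rest of it.
      if not inSpace:
        result.add ' '
        inSpace = true
    else:
      result.add c
      inSpace = false

when isMainModule:
  let html = parseHtml("<h1>Test Title :\n    <strong>Hello, world!</strong>\n</h1>")
  echo ">>>", collapseWhitespace(html.innerText), "<<<"
```

A run at the very start or end of the text is reduced to one space rather than removed, which matches the desired output shown in the first post. If those edge spaces are not wanted, `strip` from strutils can be applied to the result.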

Mirror of forum.nim-lang.org

4867 :: htmlparser innerText