nimforum mirror - How to parse html wild?

Garry_Galler [2016-10-05T16:59:17+02:00]

I cannot parse HTML from the wild. **findall** doesn't return all the elements inside the div element with the class document.

import httpclient
import htmlparser
import tables
import strutils
import streams
import xmltree
import strtabs

var main_page = "http://old.minjust.gov.ua/19612"
var headers_dict = {
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36 OPR/40.0.2308.62",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "DNT": "1",
    "Referer": main_page,
    "Accept-Encoding": "gzip, deflate, lzma, sdch",
    "Accept-Language": "ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4"
}.toTable

var headers = ""
for k,v in headers_dict:
    headers&= k & ":" & v & "\c\L"

var resp   = httpclient.get(main_page,headers)
var stream = newStringStream(resp.body)
var html   = htmlparser.parseHtml(stream)


var cnt=0
for elem in html.findall("div"):
  if elem.attr("class") == "document":
    for a in elem.findall("a"):
      cnt+=1
      echo cnt,"|", a.attrs["href"], "|", a.innerText
    break

It should find 809 links instead of 327. Parsing stops at this element:

<li><a href="/file/1495">Крівова проти України - <b>19.12.2012</b></li></a>

The markup is wrong. But what am I supposed to do about it? PS: Nim version 0.14.

OderWat [2016-10-05T17:17:55+02:00]

Probably because there is an error in the HTML structure on exactly this line. The closing tags are in the wrong order: <li><a></li></a>
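To locate this kind of error before handing a page to a strict parser, a small stack-based check can help. The following is a minimal sketch in Python using only the standard library's html.parser; the names NestingChecker and find_misnested are made up for this illustration, and the sample is a shortened version of the line quoted above. It also reports legitimately omitted optional end tags (for example a missing </li>), so treat it as a rough diagnostic rather than a validator.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Records closing tags that do not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.problems = []   # (line, closing tag found, tag that was expected)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> open and close at once: nothing to track.
        pass

    def handle_endtag(self, tag):
        expected = self.open_tags[-1] if self.open_tags else None
        if tag == expected:
            self.open_tags.pop()
            return
        line, _column = self.getpos()
        self.problems.append((line, tag, expected))
        # Recover by dropping the most recent opening tag with this name, if any.
        if tag in self.open_tags:
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]


def find_misnested(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


sample = '<ul><li><a href="/file/1495">text - <b>19.12.2012</b></li></a></ul>'
for line, found, expected in find_misnested(sample):
    print("line {}: found </{}> while <{}> was still open".format(line, found, expected))
```

Run against the full page, the reported line numbers would point at the spots where a strict tree builder is likely to go wrong.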

Garry_Galler [2016-10-05T17:28:27+02:00]

That much is clear. What is unclear is why parseHtml does not correct the error. A bit later I will try rewriting this code in Python; perhaps there is a more advanced parser there.

OderWat [2016-10-05T17:54:23+02:00]

Well, it is not easy to tell what is right when the input is simply wrong. But instead of "patching" the parser, I guess it would be better if Nim got a Tidy HTML module, which would help solve such problems. Meanwhile you could use an HTML Tidy executable to fix the broken HTML.
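As a rough illustration of the "tidy first, parse afterwards" idea, here is a sketch in Python that pipes the markup through an external HTML Tidy binary. It assumes a tidy executable is installed and on the PATH; the function name tidy_html is made up for this example. How Tidy repairs a particular error may differ from what a browser would do.

```python
import shutil
import subprocess


def tidy_html(html, exe="tidy"):
    """Pipe broken HTML through an external HTML Tidy binary and return the repaired markup."""
    path = shutil.which(exe)
    if path is None:
        raise FileNotFoundError("HTML Tidy executable not found: " + exe)
    result = subprocess.run(
        # -asxhtml asks for well-formed XHTML; force-output writes markup even after errors.
        [path, "-quiet", "-asxhtml", "-utf8", "--force-output", "yes"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Tidy exits with 0 on success, 1 when it issued warnings and 2 when it found errors.
    if result.returncode not in (0, 1, 2):
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")
```

The returned string could then be fed to a strict parser, for example by saving it to a file and loading that file from Nim.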

Garry_Galler [2016-10-05T18:52:15+02:00]

Yes, the Python parser picked up all the links. Nim needs a wrapper over lxml, which is one of the best XML/HTML parsers.

import requests
import lxml.html

main_page = "http://old.minjust.gov.ua/19612"
session = requests.session()
session.headers ={
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36 OPR/40.0.2308.62",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "DNT": "1",
    "Referer": main_page,
    "Accept-Encoding": "gzip, deflate, lzma, sdch",
    "Accept-Language": "ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4"
}

page = session.get(main_page)
parser = lxml.html.fromstring(page.text)
anchors = parser.cssselect('div.document a')

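# enumerate replaces the manual counter; text_content() also includes
# the text of nested tags such as <b>, like innerText in the Nim version
for cnt, a in enumerate(anchors, start=1):
    print("{}|{}|{}".format(cnt, a.get("href"), a.text_content()))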

runvnc [2016-10-05T23:54:11+02:00]

Over the years, I believe Firefox and WebKit/Chromium have become more and more tolerant of fairly random stuff in HTML.

To the point that Google now publicly recommends leaving off optional closing tags, omitting <head> and <body>, etc.

This is not good for parser developers, but I'm not sure you can pretend HTML is XML, or even regular.

If you want something as robust as Firefox or Chromium you may have to leverage or copy their code/algorithms.
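Short of reusing a browser's tree builder, one tolerant alternative is to skip building a tree at all and react to tag events, so that a misplaced closing tag cannot cut the traversal short. The sketch below uses Python's standard html.parser, which is not the algorithm browsers use; the class name LinkCollector and the sample hrefs are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) for each <a> inside <div class="document">, ignoring nesting errors."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # nesting depth of <div> inside the target div, 0 = outside
        self.current_href = None  # href of the link being read, None when outside a link
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1
            elif "document" in (attrs.get("class") or "").split():
                self.div_depth = 1
        elif tag == "a" and self.div_depth > 0 and attrs.get("href") is not None:
            self.current_href = attrs["href"]
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            self.links.append((self.current_href, "".join(self.current_text).strip()))
            self.current_href = None
        elif tag == "div" and self.div_depth > 0:
            self.div_depth -= 1


page = ('<div class="document"><ul>'
        '<li><a href="/file/1">first - <b>01.01.2012</b></li></a>'
        '<li><a href="/file/2">second</a></li>'
        '</ul></div><a href="/outside">ignored</a>')
collector = LinkCollector()
collector.feed(page)
collector.close()
for number, (href, text) in enumerate(collector.links, start=1):
    print("{}|{}|{}".format(number, href, text))
```

Because only <div> and <a> events are tracked, the swapped </li></a> pair in the first item has no effect on the result.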

timothee [2019-09-17T04:24:10+02:00]

treeform [2019-09-17T18:01:27+02:00]

I have written a ton of HTML scrapers in my life. The best technique I found is to strip out all HTML tags first and regex on the text only.

So you get HTML like this:

<div class="_7sjd"><i class="_7sjb img sp_BqX-Srs8YcK sx_b851c2"><u>messages</u></i></div><div class="_1d6j _7sjd fsm fwn fcg"><div class="_6a">Messages</div></div><div class="_1d6k _7sjd"><span class="_51lp _5ugf _5ugh" id="u_fetchstream_1_6">1</span></div></a></div><div class="_1d6i"><a href="/ads/growth/aymt/homepage/panel/redirect/?data=%7B%22selected_object_id%22%3A1550952275230845%2C%22is_collapsed%22%3A0%2C%22object_ids%22%3A%5B1550952275230845%5D%2C%22section%22%3A%22Header+Section%22%2C%22clicked_target%22%3A%22Selected+Page+Notifications+Count%22%2C%22event%22%3A%22click%22%7D&amp;redirect_url=%2Fistrolid%2Fmanager%2F"><div class="_7sjd"><i class="_7sjb img sp_MHk6M-gfm5c sx_d38e37"><u>globe-americas</u></i></div><div class="_1d6j _7sjd fsm fwn fcg"><div class="_6a">Notifications</div></div><div class="_1d6k _7sjd"><span class="_51lp _51lr _5ugf _5ugh" id="u_fetchstream_1_7">20+</span></div></a></div>

It's almost always best to just strip out all HTML tags and get this:


   messages     Messages    1         globe-americas     Notifications    20+

Then it becomes trivial to regex for message count and notification count.

See example code:

import re

var s = """<div class="_7sjd"><i class="_7sjb img sp_BqX-Srs8YcK sx_b851c2"><u>messages</u></i></div><div class="_1d6j _7sjd fsm fwn fcg"><div class="_6a">Messages</div></div><div class="_1d6k _7sjd"><span class="_51lp _5ugf _5ugh" id="u_fetchstream_1_6">1</span></div></a></div><div class="_1d6i"><a href="/ads/growth/aymt/homepage/panel/redirect/?data=%7B%22selected_object_id%22%3A1550952275230845%2C%22is_collapsed%22%3A0%2C%22object_ids%22%3A%5B1550952275230845%5D%2C%22section%22%3A%22Header+Section%22%2C%22clicked_target%22%3A%22Selected+Page+Notifications+Count%22%2C%22event%22%3A%22click%22%7D&amp;redirect_url=%2Fistrolid%2Fmanager%2F"><div class="_7sjd"><i class="_7sjb img sp_MHk6M-gfm5c sx_d38e37"><u>globe-americas</u></i></div><div class="_1d6j _7sjd fsm fwn fcg"><div class="_6a">Notifications</div></div><div class="_1d6k _7sjd"><span class="_51lp _51lr _5ugf _5ugh" id="u_fetchstream_1_7">20+</span></div></a></div>"""

s = re.replace(s, re"<[^>]*>", " ")

echo s

echo findAll(s, re"Messages\s*\d*")
echo findAll(s, re"Notifications\s*\d*")
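For comparison, the same strip-then-regex idea can be sketched in Python with capture groups, so that only the counter is returned. The helper names strip_tags and count_after are made up for this example, and the snippet is a shortened version of the markup above. Unlike the \d* pattern, this variant also keeps a trailing "+" as in "20+".

```python
import html
import re

TAG = re.compile(r"<[^>]*>")


def strip_tags(markup):
    """Replace every tag with a space, then decode entities such as &amp;."""
    return html.unescape(TAG.sub(" ", markup))


def count_after(label, text):
    """Return the counter that follows label (for example '20+'), or None if absent."""
    match = re.search(re.escape(label) + r"\s+(\d+\+?)", text)
    return match.group(1) if match else None


snippet = ('<div class="_6a">Messages</div></div><div class="_1d6k">'
           '<span id="u_fetchstream_1_6">1</span></div>'
           '<div class="_6a">Notifications</div></div><div class="_1d6k">'
           '<span id="u_fetchstream_1_7">20+</span></div>')
text = strip_tags(snippet)
print(count_after("Messages", text))
print(count_after("Notifications", text))
```

Replacing tags with a space rather than an empty string keeps neighbouring words from running together, which is what lets the \s+ in the pattern match.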

bung [2020-05-13T00:35:57+02:00]

https://play.nim-lang.org/#ix=2lQ8

https://play.nim-lang.org/#ix=2lQa: in fusion htmlparser this would not be a problem

snej [2020-05-13T17:52:48+02:00]

Browsers have always supported “tag soup” HTML, back to Mosaic and Netscape. Unless the content type is XHTML, you cannot expect any sort of valid structure. For parsing “wild” HTML, preprocessing through some widely-used tidier is probably the best bet, since its interpretation of bad markup is hopefully similar to a browser’s.

JohnCarter [2020-05-15T10:54:04+02:00]

Nim used files

Mirror of forum.nim-lang.org

2566 :: How to parse html wild?