I can't parse HTML from the wild. **findall** doesn't return all the elements inside the div element with the class document.
import httpclient
import htmlparser
import tables
import strutils
import streams
import xmltree
import strtabs
var main_page = "http://old.minjust.gov.ua/19612"
var headers_dict = {
  "Connection": "keep-alive",
  "Upgrade-Insecure-Requests": "1",
  "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36 OPR/40.0.2308.62",
  "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
  "DNT": "1",
  "Referer": main_page,
  "Accept-Encoding": "gzip, deflate, lzma, sdch",
  "Accept-Language": "ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4"
}.toTable
# Join the table into a raw header string, one "Name:value" pair per line
var headers = ""
for k, v in headers_dict:
  headers &= k & ":" & v & "\c\L"
var resp = httpclient.get(main_page, headers)
var stream = newStringStream(resp.body)
var html = htmlparser.parseHtml(stream)
var cnt = 0
for elem in html.findall("div"):
  if elem.attr("class") == "document":
    # Print every link found inside the first div with class "document"
    for a in elem.findall("a"):
      cnt += 1
      echo cnt, "|", a.attrs["href"], "|", a.innerText
    break
There should be 809 links, but I only get 327. The output ends at this element:
<li><a href="/file/1495">Крівова проти України - <b>19.12.2012</b></li></a>
The markup is malformed: the closing </li> and </a> tags are in the wrong order. But what can I do about it? PS: Nim version 14.
Yes, the Python parser picked up all the links. Nim needs a wrapper over lxml, which is one of the best XML/HTML parsers.
import requests
import lxml.html
main_page = "http://old.minjust.gov.ua/19612"
session = requests.session()
session.headers ={
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.101 Safari/537.36 OPR/40.0.2308.62",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "DNT": "1",
    "Referer": main_page,
    "Accept-Encoding": "gzip, deflate, lzma, sdch",
    "Accept-Language": "ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4"
}
page = session.get(main_page)
parser = lxml.html.fromstring(page.text)
anchors = parser.cssselect('div.document a')
cnt = 0
for a in anchors:
    cnt += 1
    # a.text is only the text before the first child tag (e.g. before <b>)
    print("{}|{}|{}".format(cnt, a.attrib["href"], a.text))
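If installing lxml is not an option, a rough alternative is Python's standard `html.parser`. It only reports tags as events and never builds a tree, so the swapped `</li></a>` from the first post does not stop it. This is a minimal sketch, not a replacement for the lxml version: the class name `DocumentLinks` and the second link `/file/1496` are made up for illustration, and only a small inline sample is parsed instead of the real page.

```python
from html.parser import HTMLParser


class DocumentLinks(HTMLParser):
    """Collect (href, text) for every <a> inside <div class="document">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while we are inside the target div
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> that is currently open
        self._text = []      # text pieces seen since that <a> opened

    def _flush(self):
        # Store the currently open link, if there is one.
        if self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.div_depth or "document" in classes:
                self.div_depth += 1
        elif tag == "a":
            self._flush()  # an unclosed previous <a> ends here
            if self.div_depth and "href" in attrs:
                self._href, self._text = attrs["href"], []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "a":
            self._flush()

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


# Inline sample; the first <li> is the malformed element from the question,
# the second link and the link outside the div are hypothetical.
sample = (
    '<div class="document"><ul>'
    '<li><a href="/file/1495">Крівова проти України - <b>19.12.2012</b></li></a>'
    '<li><a href="/file/1496">Next</a></li>'
    '</ul></div><a href="/outside">x</a>'
)

parser = DocumentLinks()
parser.feed(sample)
parser.close()
for cnt, (href, text) in enumerate(parser.links, 1):
    print("{}|{}|{}".format(cnt, href, text))
```

Unlike `a.text` in lxml, this collects the text inside nested tags such as `<b>` as well, so the date is kept with the link text.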
Over the years, I believe Firefox and WebKit/Chromium have become more and more tolerant of fairly random stuff in HTML.
It has reached the point where Google now publicly recommends leaving off optional tags: closing tags, <head>, <body>, etc.
This is not good for parser developers, but I'm not sure you can pretend HTML is XML, or even regular.
If you want something as robust as Firefox or Chromium you may have to leverage or copy their code/algorithms.
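To make the "you can't pretend it is XML" point concrete, here is a small sketch using only the Python standard library. It feeds the malformed element from the first post to a strict XML parser and to the event-based `html.parser`. The helper `parses_as_xml` and the class `TagLog` are names invented for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The malformed element quoted in the first post.
snippet = '<li><a href="/file/1495">Крівова проти України - <b>19.12.2012</b></li></a>'


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagLog(HTMLParser):
    """Record tags in the order they appear, without checking nesting."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.events.append("</" + tag + ">")


log = TagLog()
log.feed(snippet)
log.close()

print(parses_as_xml(snippet))  # the XML parser rejects the mismatched tags
print(log.events)              # the HTML tokenizer just reports them in order
```

The strict parser gives up at the mismatched `</li>`, while the tokenizer leaves it to the caller to decide what the nesting should mean. A browser-grade parser goes further and repairs the tree according to fixed error-recovery rules.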
I have written a ton of HTML scrapers in my life. The best technique I found is to strip out all HTML tags first and regex on the text only.
So you get HTML like this:
<div class="_7sjd"><i class="_7sjb img sp_BqX-Srs8YcK sx_b851c2"><u>messages</u></i></div><div class="_1d6j _7sjd fsm fwn fcg"><div class="_6a">Messages</div></div><div class="_1d6k _7sjd"><span class="_51lp _5ugf _5ugh" id="u_fetchstream_1_6">1</span></div></a></div><div class="_1d6i"><a href="/ads/growth/aymt/homepage/panel/redirect/?data=%7B%22selected_object_id%22%3A1550952275230845%2C%22is_collapsed%22%3A0%2C%22object_ids%22%3A%5B1550952275230845%5D%2C%22section%22%3A%22Header+Section%22%2C%22clicked_target%22%3A%22Selected+Page+Notifications+Count%22%2C%22event%22%3A%22click%22%7D&redirect_url=%2Fistrolid%2Fmanager%2F"><div class="_7sjd"><i class="_7sjb img sp_MHk6M-gfm5c sx_d38e37"><u>globe-americas</u></i></div><div class="_1d6j _7sjd fsm fwn fcg"><div class="_6a">Notifications</div></div><div class="_1d6k _7sjd"><span class="_51lp _51lr _5ugf _5ugh" id="u_fetchstream_1_7">20+</span></div></a></div>
It's almost always best to just strip out all HTML tags and get this:
messages Messages 1 globe-americas Notifications 20+
Then it becomes trivial to regex for message count and notification count.
See example code:
import re
var s = """<div class="_7sjd"><i class="_7sjb img sp_BqX-Srs8YcK sx_b851c2"><u>messages</u></i></div><div class="_1d6j _7sjd fsm fwn fcg"><div class="_6a">Messages</div></div><div class="_1d6k _7sjd"><span class="_51lp _5ugf _5ugh" id="u_fetchstream_1_6">1</span></div></a></div><div class="_1d6i"><a href="/ads/growth/aymt/homepage/panel/redirect/?data=%7B%22selected_object_id%22%3A1550952275230845%2C%22is_collapsed%22%3A0%2C%22object_ids%22%3A%5B1550952275230845%5D%2C%22section%22%3A%22Header+Section%22%2C%22clicked_target%22%3A%22Selected+Page+Notifications+Count%22%2C%22event%22%3A%22click%22%7D&redirect_url=%2Fistrolid%2Fmanager%2F"><div class="_7sjd"><i class="_7sjb img sp_MHk6M-gfm5c sx_d38e37"><u>globe-americas</u></i></div><div class="_1d6j _7sjd fsm fwn fcg"><div class="_6a">Notifications</div></div><div class="_1d6k _7sjd"><span class="_51lp _51lr _5ugf _5ugh" id="u_fetchstream_1_7">20+</span></div></a></div>"""
s = re.replace(s, re"<[^>]*>", " ")
echo s
echo findAll(s, re"Messages\s*\d*")
echo findAll(s, re"Notifications\s*\d*")
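For comparison with the Python scraper earlier in the thread, the same strip-then-regex idea might look like this in Python. It is only a sketch under a few assumptions: the input is a shortened version of the markup above, whitespace is collapsed after stripping, and the pattern `\d+\+?` (my own variation) also keeps the trailing `+` of `20+`, which `\d*` would drop.

```python
import re

# Shortened version of the markup shown above.
html = (
    '<div class="_6a">Messages</div></div><div class="_1d6k _7sjd">'
    '<span class="_51lp _5ugf _5ugh" id="u_fetchstream_1_6">1</span></div>'
    '<div class="_6a">Notifications</div></div><div class="_1d6k _7sjd">'
    '<span class="_51lp _51lr _5ugf _5ugh" id="u_fetchstream_1_7">20+</span></div>'
)

# Replace every tag with a space, then collapse runs of whitespace.
text = re.sub(r"<[^>]*>", " ", html)
text = " ".join(text.split())
print(text)

# Capture the counter that follows each label.
messages = re.search(r"Messages\s*(\d+\+?)", text).group(1)
notifications = re.search(r"Notifications\s*(\d+\+?)", text).group(1)
print(messages, notifications)
```

Note that the `<[^>]*>` pattern assumes no attribute value contains a literal `>`, which holds for this sample but not for every page.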