nimforum mirror - webscrape

tubbs [2022-02-08T17:51:29+01:00]

I would like to do some web scraping, but it seems parseHtml only deals with clean HTML. What can I use to:

Find a string inside an HTML tag.

Find all links.

Find a string matched by a pattern.

What Nim tools can do this? I would like something like Beautiful Soup for Python.

Thanks

enthus1ast [2022-02-08T18:03:29+01:00]

Lately I have used the HTML Tidy executable to clean up the HTML in my web scrapers.
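For readers who want to try this, here is a minimal sketch of calling Tidy from Nim. It assumes a `tidy` binary is on the PATH and that `-q`, `-asxml` and `--show-warnings no` behave as in current HTML Tidy releases; `tidyOrRaw` is a hypothetical helper name, not something from this thread or from the stdlib.

```nim
import std/[os, osproc]

proc tidyOrRaw(raw: string, exe = "tidy"): string =
  ## Returns `raw` cleaned up by HTML Tidy, or `raw` unchanged when the
  ## executable cannot be found or reports errors.
  ## `exe` is an assumed executable name; adjust it for your system.
  let path = findExe(exe)
  if path.len == 0:
    return raw
  # -q: quiet, -asxml: emit well-formed XHTML, warnings suppressed.
  let cmd = quoteShell(path) & " -q -asxml --show-warnings no"
  let (output, code) = execCmdEx(cmd, input = raw)
  # Tidy exits with 0 (ok) or 1 (warnings) when it produced usable output.
  if (code == 0 or code == 1) and output.len > 0:
    result = output
  else:
    result = raw
```

The fallback to the raw input is a design choice for this sketch only; a real scraper may prefer to raise an error instead.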

alexeypetrushin [2022-02-08T18:33:35+01:00]

Just a side comment: for any serious web scraping you need a real browser with JS (headless emulation). The main task is not so much HTML parsing as overcoming all the tricks modern sites use to avoid being scraped.

enthus1ast [2022-02-08T19:31:42+01:00]

Some sites might do some shenanigans, but by far not all. I've built massive databases recently with just HTML parsing, cookies, and JSON parsing.

reversem3 [2022-02-08T19:45:13+01:00]

Same here. I used to scrape multiple search engines and index the results to look for specific keywords, using Python and Scrapy. Nothing like that in Nim yet, though.

enthus1ast [2022-02-08T20:02:00+01:00]

Nim web scraping could be better, but it could also be far worse.

Niminem [2022-02-08T20:17:45+01:00]

+1 for HTML Tidy

Using the httpclient lib, you request the webpage and get the body content back. From there you run the Tidy executable to clean the raw HTML from that response, making the HTML "standards compliant". Convert that into an XML tree with the htmlparser lib and get the data you need with:

https://nim-lang.org/docs/xmltree.html (helps find your patterns, printing things)

https://nim-lang.org/docs/parseutils.html (helps find your patterns)

https://nim-lang.org/docs/strtabs.html (helps with accessing some XML tree attributes)
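As a rough illustration of the parseutils part, here is a small sketch that pulls every number following a `$` sign out of a piece of text. The helper name `pricesIn` and the "price" pattern are made up for this example; only `skipUntil` and `parseWhile` come from parseutils.

```nim
import std/parseutils

proc pricesIn(text: string): seq[string] =
  ## Collects every run of digits and dots that directly follows a '$'.
  var i = 0
  while i < text.len:
    # Jump to the next '$' (or to the end of the text if there is none).
    i += text.skipUntil('$', i)
    if i >= text.len: break
    inc i  # step over the '$' itself
    var tok = ""
    let n = text.parseWhile(tok, {'0'..'9', '.'}, i)
    if n > 0: result.add tok
    i += n

echo pricesIn("Apples $1.50, pears $2, free $ sample")
```

The same skip-then-parse loop should work for other simple patterns by changing the marker character and the set of valid characters.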

The htmlparser lib has a nice example of scraping links and uses some of the libraries above.

I used to use Python because it had a "standards compliant" parser and Beautiful Soup, but once I learned about Tidy I didn't need Python anymore. All scraping can be accomplished with Nim's stdlib.
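A minimal sketch of the parsing half of that workflow, assuming the HTML has already been fetched (and, if needed, run through Tidy) and is available as a string. `extractLinks` and `firstText` are hypothetical helper names; `parseHtml`, `findAll`, `attr` and `innerText` are from the stdlib htmlparser and xmltree modules.

```nim
import std/[htmlparser, xmltree]

proc extractLinks(html: string): seq[string] =
  ## Collects the href of every <a> element that has one, in document order.
  let doc = parseHtml(html)
  for a in doc.findAll("a"):
    let href = a.attr("href")  # "" when the attribute is missing
    if href.len > 0:
      result.add href

proc firstText(html, tag: string): string =
  ## Text content of the first <tag> element, or "" if there is none.
  let doc = parseHtml(html)
  for node in doc.findAll(tag):
    return node.innerText

const page = """<html><body><h1>Title</h1><a href="/a">A</a>
<p>see <a href="/b">B</a></p><a>no href</a></body></html>"""

echo extractLinks(page)
echo firstText(page, "h1")
```

Note that `findAll` only looks at descendants of the node it is called on, so the sketch assumes the links sit inside an enclosing element such as `<html>` or `<body>`.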

enthus1ast [2022-02-08T21:07:36+01:00]

I've also built a small request library with cookie and compression support: https://github.com/enthus1ast/nimNimiBrowser

Araq [2022-02-09T09:53:39+01:00]

FWIW, I've recently become a fan of using tokenizers for this sort of thing. You don't need to build a tree in memory (which is usually quite slow) and then recursively traverse it to get "all" the URLs; a list of tokens is good enough and much more flexible. It also handles "wrong HTML" in the most natural manner: there are simply tokens, there is no AST to "repair".

enthus1ast [2022-02-09T13:50:24+01:00]

Interesting, never thought about doing it this way. Do you have some code online that shows this kind of parsing?

tubbs [2022-02-09T16:58:54+01:00]

Thanks a lot.

Araq [2022-02-10T08:03:34+01:00]

Like so:

import os, streams, parsexml, strutils

if paramCount() < 1:
  quit("Usage: htmlrefs filename[.html]")

# Append ".html" if the argument has no extension.
var filename = addFileExt(paramStr(1), "html")
var s = newFileStream(filename, fmRead)
if s == nil: quit("cannot open the file " & filename)
var x: XmlParser
open(x, s, filename)
# Walk the token stream; no tree is built.
while true:
  next(x)
  if x.kind == xmlEof: break
  # Any attribute named "href" (case-insensitive) counts as a link.
  if x.kind == xmlAttribute and cmpIgnoreCase(x.attrKey, "href") == 0:
    echo "found a link: ", x.attrValue
x.close()
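The same token-based approach can probably also cover the "find a string inside a tag" part of the original question. A minimal sketch, assuming the HTML is already in memory as a string; `textInside` is a hypothetical helper name, not part of parsexml, and entities such as `&amp;` are simply skipped here.

```nim
import std/[parsexml, streams, strutils]

proc textInside(html, tag: string): seq[string] =
  ## Returns the text found inside each <tag>...</tag> pair, using only
  ## the parsexml token stream (no tree is built).
  var x: XmlParser
  open(x, newStringStream(html), "input")
  defer: x.close()
  var depth = 0      # > 0 while we are inside the wanted tag
  var current = ""   # text collected for the element being read
  while true:
    x.next()
    case x.kind
    of xmlEof: break
    of xmlElementStart, xmlElementOpen:
      # xmlElementStart: <tag>, xmlElementOpen: <tag attr=...
      if cmpIgnoreCase(x.elementName, tag) == 0:
        inc depth
    of xmlElementEnd:
      if depth > 0 and cmpIgnoreCase(x.elementName, tag) == 0:
        dec depth
        if depth == 0:
          result.add current.strip()
          current = ""
    of xmlCharData, xmlWhitespace:
      if depth > 0: current.add x.charData
    else: discard

const doc = "<html><head><title>My page</title></head>" &
            "<body><p>one</p><p class=\"x\">two</p></body></html>"

echo textInside(doc, "p")
echo textInside(doc, "title")
```

Since nothing is repaired or matched up beyond the one tag being tracked, unclosed tags elsewhere in the page should not matter to this loop.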

cblake [2022-02-10T11:56:41+01:00]

This example iterator with this example usage may also be of interest.

planetis [2022-02-10T18:29:26+01:00]

So would it make sense for a certain convert-to-Karax tool to do the same with tokens? Lol, I should be the one answering that question.

Araq [2022-02-10T19:34:49+01:00]

I cannot imagine how that would work. But if you can imagine, give it a try. :-)

planetis [2022-02-11T18:27:35+01:00]

Btw there is also https://github.com/OpenSystemsLab/q.nim

Mirror of forum.nim-lang.org
