I need to download big files and I have this code:
import std/[httpclient, streams, strutils]

proc pipeTo(input: Stream, output: Stream, chunkSize = 512) =
  var buff = repeat("\0", chunkSize)
  while true:
    let len = input.readDataStr(buff, 0..<chunkSize)
    if len == 0:
      break
    output.write(buff[0..<len])
    output.flush()

proc downloadFile(link: string, output: string) =
  let client = newHttpClient()
  let response = client.get(link)
  let file = newFileStream(output, fmWrite)
  response.bodyStream.pipeTo(file)
  file.close()
In theory, it should download a chunk and write it to disk, so even for a big file the memory usage would stay low. But what is happening is the opposite: for a 300 MB file it's using more than 1 GB of RAM. What's wrong with this code?
I have the same tool written in TypeScript using Deno and it works fine there, as it should.
I also noticed that pipeTo has no effect on MaxRSS.
httpclient.nim:
when client is HttpClient:
  result.bodyStream = newStringStream()
else:
  result.bodyStream = newFutureStream[string]("parseResponse")
...
when client is AsyncHttpClient:
  client.bodyStream.complete()
else:
  client.bodyStream.setPosition(0)
Conclusion: by the time you ask for a body stream, the entire file is already sitting in a string inside a StringStream, and reading from that string in chunks does nothing for you.
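To make that concrete, here is a minimal sketch (using a hypothetical URL) showing that the synchronous client's bodyStream is just a StringStream whose backing string already holds the whole body before you read the first chunk:

import std/[httpclient, streams]

let client = newHttpClient()
let response = client.get("http://example.com/big.txt")  # hypothetical URL
# For the sync HttpClient, bodyStream is a StringStream; its backing `data`
# string already contains the complete body, so chunked reads only copy
# bytes that are already in memory.
let ss = StringStream(response.bodyStream)
echo "bytes already buffered in memory: ", ss.data.len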
This is something related that Patitotective and I have been talking about on Discord.
I think there could be a better way to receive data from httpclient. Welcoming anyone's brain power to help with this.
Currently httpClient.bodyStream is Stream | FutureStream[string], which means you can access the body as a stream in the async case, but there isn't a way to access it synchronously without waiting for the whole body to finish (AFAICS). The body is read by parseBody, which is neither exported nor replaceable in requestAux().
If the body response were changed to something callable as both a proc and an iterator, this could make httpClient cleaner to use when you need to access the stream from it.
Something like this (pseudocode):
proc body(client: HttpClient | AsyncHttpClient): Future[string] {.multisync.} =
  while true:
    let data = await client.recvLine()
    if data == "":
      break
    result.add(data)
iterator body(client: HttpClient | AsyncHttpClient): Future[string] {.multisync.} =
  while true:
    let data = await client.socket.recv()
    if data == "":
      break
    yield data
An async iterator isn't a thing in Nim, but would something like it be helpful?
Related: the final part that writes the data from the network socket to the body stream: https://github.com/nim-lang/Nim/blame/685bf944aac0b2ea7ce1360b56080f3d07037fc2/lib/pure/httpclient.nim#L704
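For what it's worth, something close to the iterator above is already possible today with the async client, since its bodyStream is a FutureStream[string] that is filled as data arrives. A rough sketch (hypothetical URL, minimal error handling):

import std/[asyncdispatch, httpclient]

proc consume(url: string) {.async.} =
  let client = newAsyncHttpClient()
  defer: client.close()
  let resp = await client.request(url)
  while true:
    # read() completes with (false, "") once the stream has finished
    let (hasData, chunk) = await resp.bodyStream.read()
    if not hasData:
      break
    echo "got ", chunk.len, " bytes"  # process each chunk instead of buffering

waitFor consume("http://example.com/big.txt")  # hypothetical URL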
If anyone has the same problem in the future (downloading big files), here is my solution: https://gist.github.com/carabalonepaulo/9268a44865daf6abf58bf019ddb1c109
It's a simple, single-purpose script to download files (no proxy or redirect support).
import std/[asyncdispatch, asyncfile, httpclient]

proc download(url, filepath: string): Future[HttpCode] {.async.} =
  let client = newAsyncHttpClient()
  var file = openAsync(filepath, fmWrite)
  try:
    let resp = await client.request(url)
    await file.writeFromStream(resp.bodyStream)
    result = resp.code
  except Exception as e:
    echo e.msg
  finally:
    client.close()
    file.close()
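Hypothetical usage of the proc above, assuming the same example file as before:

when isMainModule:
  let code = waitFor download("http://example.com/big.txt", "/tmp/big.txt")
  echo "finished with status: ", code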
Fool-proof method ;)
import std/osproc

proc runShell(command: string): string =
  let (output, _) = execCmdEx(command)
  return output

proc download(url, target: string): string {.discardable.} =
  return runShell("wget -q -O- \"" & url & "\" > " & target)

download("http://example.com/big.txt", "/tmp/big.txt")
You probably want execProcess or startProcess instead of execCmdEx, just because it won't invoke a shell and so is safer if there's any user input (or spaces) in the URL.
Also, if you only need to support macOS, Linux, and fairly modern Windows, then curl may be a better choice than wget. Windows ships a copy of curl by default in C:\Windows\system32, so it's always there.
I actually think that, especially for big files, spawning off a process is better than doing it in-process. curl can download multiple files in one invocation as well.
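A rough sketch of that suggestion, assuming curl is on the PATH (the proc name is just illustrative):

import std/osproc

proc downloadWithCurl(url, target: string): int =
  # startProcess avoids a shell, so spaces or user input in the URL are safe;
  # -f fails on HTTP errors, -sS is silent but still reports errors,
  # -L follows redirects, -o writes to the target file.
  let p = startProcess("curl",
    args = ["-fsSL", "-o", target, url],
    options = {poUsePath})
  defer: p.close()
  result = p.waitForExit()

when isMainModule:
  if downloadWithCurl("http://example.com/big.txt", "/tmp/big.txt") != 0:
    echo "download failed"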