I need to download big files and I have this code:
import std/[httpclient, streams, strutils]

proc pipeTo(input: Stream, output: Stream, chunkSize = 512) =
  var buff = repeat("\0", chunkSize)
  while true:
    let len = input.readDataStr(buff, 0..<chunkSize)
    if len == 0:
      break
    output.write(buff[0..<len])
    output.flush()

proc downloadFile(link: string, output: string) =
  let client = newHttpClient()
  let response = client.get(link)
  let file = newFileStream(output, fmWrite)
  response.bodyStream.pipeTo(file)
  file.close()
In theory, it should download a chunk and write it to disk, so even for a big file the memory usage would stay low. But what is happening is the opposite: for a 300 MB file it's using more than 1 GB of RAM. What's wrong with this code?
I have the same tool written in TypeScript using Deno and it works fine there, as it should.
I also noticed that pipeTo has no effect on MaxRSS.
httpclient.nim:
when client is HttpClient:
  result.bodyStream = newStringStream()
else:
  result.bodyStream = newFutureStream[string]("parseResponse")
...
when client is AsyncHttpClient:
  client.bodyStream.complete()
else:
  client.bodyStream.setPosition(0)
Conclusion: by the time you ask for a body stream, the entire file is already sitting in a string inside a StringStream, and reading from that string in chunks does nothing for you.
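To make that concrete, here is a minimal sketch (using a hypothetical URL) showing that the synchronous client's bodyStream is just a StringStream whose backing string already holds the whole body before you read the first chunk:

import std/[httpclient, streams]

let client = newHttpClient()
let response = client.get("http://example.com/big.txt")  # hypothetical URL
# For the sync HttpClient, bodyStream is a StringStream; its backing `data`
# string already contains the complete body, so chunked reads only copy
# bytes that are already in memory.
let ss = StringStream(response.bodyStream)
echo "bytes already buffered in memory: ", ss.data.len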
This is something related that Patitotective and I have been talking about on Discord.
I think there could be a better way to receive data from httpclient. Welcoming anyone's brain power to help with this.
Currently httpClient.bodyStream is Stream | FutureStream[string], which means you can access the body as a stream in the async case, but there isn't a way to access it synchronously without waiting for the whole body to finish (AFAICS). The body is read by parseBody, which is neither exported nor replaceable in requestAux().
If the body response were changed to something callable as both a proc and an iterator, this could make httpClient cleaner to use when you need to access the stream from it.
Something like this (pseudocode):
proc body(client: HttpClient | AsyncHttpClient): Future[string] {.multisync.} =
  while true:
    let data = await client.recvLine()
    if data == "":
      break
    result.add(data)
iterator body(client: HttpClient | AsyncHttpClient): Future[string] {.multisync.} =
  while true:
    let data = await client.socket.recv()
    if data == "":
      break
    yield data
An async iterator isn't a thing in Nim, but would something like it be helpful?
Related: the final part that writes the data from the network socket to the body stream: https://github.com/nim-lang/Nim/blame/685bf944aac0b2ea7ce1360b56080f3d07037fc2/lib/pure/httpclient.nim#L704
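For what it's worth, something close to the iterator above is already possible today with the async client, since its bodyStream is a FutureStream[string] that is filled as data arrives. A rough sketch (hypothetical URL, minimal error handling):

import std/[asyncdispatch, httpclient]

proc consume(url: string) {.async.} =
  let client = newAsyncHttpClient()
  defer: client.close()
  let resp = await client.request(url)
  while true:
    # read() completes with (false, "") once the stream has finished
    let (hasData, chunk) = await resp.bodyStream.read()
    if not hasData:
      break
    echo "got ", chunk.len, " bytes"  # process each chunk instead of buffering

waitFor consume("http://example.com/big.txt")  # hypothetical URL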
If anyone has the same problem in the future (downloading big files), here is my solution: https://gist.github.com/carabalonepaulo/9268a44865daf6abf58bf019ddb1c109
It's a simple, single-purpose script to download files (no proxy or redirect support).
import std/[asyncdispatch, asyncfile, httpclient]

proc download(url, filepath: string): Future[HttpCode] {.async.} =
  let client = newAsyncHttpClient()
  var file = openAsync(filepath, fmWrite)
  try:
    let resp = await client.request(url)
    await file.writeFromStream(resp.bodyStream)
    result = resp.code
  except Exception as e:
    echo e.msg
  finally:
    client.close()
    file.close()
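Hypothetical usage of the proc above, assuming the same example file as before:

when isMainModule:
  let code = waitFor download("http://example.com/big.txt", "/tmp/big.txt")
  echo "finished with status: ", code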
Fool-proof method ;)
import std/osproc

proc runShell(command: string): string =
  let (output, _) = execCmdEx(command)
  return output

proc download(url, target: string): string {.discardable.} =
  return runShell("wget -q -O- \"" & url & "\" > " & target)

download("http://example.com/big.txt", "/tmp/big.txt")
You probably want execProcess or startProcess instead of execCmdEx, just because it won't invoke a shell and so is safer if there's any user input (or spaces) in the URL.
Also, if you only need to support macOS, Linux, and fairly modern Windows, then curl may be a better choice than wget. Windows ships a copy of curl by default in C:\Windows\system32, so it's always there.
I actually think that, especially for big files, spawning off a process is better than doing it in-process. curl can download multiple files in one invocation as well.
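A rough sketch of that suggestion, assuming curl is on the PATH (the proc name is just illustrative):

import std/osproc

proc downloadWithCurl(url, target: string): int =
  # startProcess avoids a shell, so spaces or user input in the URL are safe;
  # -f fails on HTTP errors, -sS is silent but still reports errors,
  # -L follows redirects, -o writes to the target file.
  let p = startProcess("curl",
    args = ["-fsSL", "-o", target, url],
    options = {poUsePath})
  defer: p.close()
  result = p.waitForExit()

when isMainModule:
  if downloadWithCurl("http://example.com/big.txt", "/tmp/big.txt") != 0:
    echo "download failed"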