Hello there. I've been working on a tiny project for the past few months and I felt like showcasing it today: nim-url (or just url in the Nimble index)
Nim doesn't really have any WHATWG-compliant URL parsers as a library (besides Chawan's, but that is not available as a library), so I took some time out to write one. The main reason to use this library over something like urlly or std/uri would be that it handles a lot of things that those parsers do not, like opaque paths.
The API is inspired by Servo's rust-url, while the core parsing logic is heavily based on the ada-url project.
As for security, it's been fuzzed using the drchaos library. That, in and of itself, caught quite a few bugs in the parser.
The library uses CPU-agnostic SIMD acceleration via Overdrive for larger URLs (I might make a post on Overdrive sometime later!). This means that the same SIMD codepath, once written, lets nim-url use AVX2/SSE4.1/SSE3/SSE2/NEON depending on compile-time flags. TL;DR: It can perform SIMD acceleration on both AArch64 and AMD64.
It also uses its own internal StringView type to avoid unnecessary allocations and copying (and it achieves that pretty well!)
Here's a basic example of how to use the library:
```nim
import std/options
import pkg/[results, url]

# Exceptions-based API
let url = parseURL("https://0.0.0.0:8089/index.html#test")
assert(url.hostname.get() == "0.0.0.0")
assert(url.pathname == "/index.html")
assert(url.port.get() == 8089)
assert(url.fragment.get() == "test")

# Result[T, E] based API (the above function just wraps this)
let url2 = tryParseURL("https://nim-lang.org/blog.html")
assert(url2.isOk())
```
The parser is 100% pure, meaning it can be used in `func`(s). It uses no external native dependencies and simply relies on a few Nim libraries (shakar, results, overdrive, nimsimd, benchy).
It can be installed via Nimble or Neo to your project:
Nimble: nimble add url
Neo: neo add url
The source code can be found here. Enjoy! :^)
> (besides Chawan's, but that is not available as a library)
Yeah last time I tried to use that in another project I deleted half of it and it still wouldn't compile, so in the end I settled for std/uri. Good to see there is a better solution now.
If you don't mind, some suggestions for reducing sizeof(URL):
That should save you 5 64-bit words.
Also in the getters you may want to return lent string to avoid copies.
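To illustrate the `lent string` suggestion, here is a minimal sketch. `Url` and `host` are made-up stand-ins for illustration, not nim-url's actual type or getter; the point is only that a `lent` return borrows the field rather than copying it.

```nim
type
  Url = object            # hypothetical stand-in, not nim-url's real type
    hostname: string

# `lent string` hands back a borrowed view of the field.
# A plain `string` return type would copy the field on every call.
func host(u: Url): lent string =
  u.hostname

let u = Url(hostname: "nim-lang.org")
doAssert u.host == "nim-lang.org"
doAssert u.host.len == 12
```

The borrow is only valid as long as `u` is alive and unmodified, which is usually fine for read-only getters.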
---
Aside, I've been considering this design lately:
```nim
type URLObj = object
  buf: string # serialized URL
  schemeLast: uint32 # scheme is buf[0..schemeLast]
  usernameLast: uint32 # username is buf[schemeLast+2+numSlashes..usernameLast]
  passwordLast: uint32 # password is buf[usernameLast+2..passwordLast]
  hasOpaquePath: bool
  hasHost: bool
  hasPort: bool
  numSlashes: uint8
  port: uint16
  # there's still some padding left here
  hostnameLast: uint32
  pathnameLast: uint32
  searchLast: uint32
  # hash is searchLast+1..buf.high
```
Fits into 48 bytes on x64 (40 with refc), only needs one string buffer, and serialization is free (this seems to be its most useful property). To skip copies you can still read individual parts with toOpenArray.
Drawbacks are a) modifying the URL is slow (usually needs a memmove) b) std's support for openArray[char] is poor. I think the benefits outweigh these but I'm not sure (may depend on the use case).
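A rough sketch of the "read parts with toOpenArray" idea, cut down to just the scheme. `MiniUrl`, `sameChars` and `schemeIs` are hypothetical names, and I'm assuming `schemeLast` indexes the last character of the scheme (so the `:` is not included). The hand-written comparison is there because of drawback b): std has little that works directly on `openArray[char]`.

```nim
type
  MiniUrl = object        # hypothetical, simplified version of the layout
    buf: string           # serialized URL
    schemeLast: uint32    # assumption: scheme is buf[0..schemeLast], no ':'

# Compare a character slice against a string without allocating a substring.
func sameChars(a: openArray[char], b: string): bool =
  if a.len != b.len:
    return false
  for i in 0 ..< a.len:
    if a[i] != b[i]:
      return false
  true

# Reads the scheme in place through toOpenArray; nothing is copied.
func schemeIs(u: MiniUrl, s: string): bool =
  sameChars(u.buf.toOpenArray(0, int(u.schemeLast)), s)

let u = MiniUrl(buf: "https://nim-lang.org/", schemeLast: 4)
doAssert u.schemeIs("https")
doAssert not u.schemeIs("http")
doAssert u.buf == "https://nim-lang.org/"  # serializing is just reading buf
```

Note that the `toOpenArray` result has to be passed straight into a call, since an `openArray` cannot be stored in a variable.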
--> use set[MyEnum], 1 byte. Saves memory.
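For reference, a small sketch of what the `set[MyEnum]` variant could look like. The enum and its member names are made up here to mirror the three `bool` fields; a set over an enum with up to 8 values is stored in a single byte.

```nim
type
  UrlFlag = enum          # hypothetical names mirroring the bool fields
    HasOpaquePath, HasHost, HasPort
  UrlFlags = set[UrlFlag]

doAssert sizeof(UrlFlags) == 1        # all three flags share one byte
doAssert sizeof(array[3, bool]) == 3  # three separate bools take three bytes

var flags: UrlFlags = {HasHost}
flags.incl HasPort
doAssert HasHost in flags
doAssert HasPort in flags
doAssert HasOpaquePath notin flags
```

Whether this actually shrinks the object depends on how the remaining fields pad out, so it is worth checking `sizeof` on the full type.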
I hadn't really thought about the URL type's size yet because I assumed it wouldn't account for much bandwidth.
Actually my original motivation when looking into this was to remove the Option overhead, but these days memory use is also becoming topical :)
Also, the URLObj struct you created is very similar to what ada-url does for its second parsing mode where it just tells you where each component starts and ends, but I didn't implement that because it'd be too much effort.
Oh, I didn't know that. I guess it can't be such a bad idea then if others have done it before.
> I might make a post on Overdrive sometime later!
Please do!