Hello. This is my first post on the forum and I am only a couple of months into my Nim journey.
I am doing optimization tasks on road-network data in an application that is fully written in Nim and distributed to users as a command-line tool for Windows. The road network is from OpenStreetMap and is represented as a graph with nodes containing coordinates and edges containing driving-time weights.
The full graph could be Europe or Scandinavia. I want to find local regions of the full graph and do calculations on them. Most calculations only require a very small subset of the full graph, but some will require larger ones. I want to be able to do quick nearest-neighbor lookups and to quickly get all nodes and edges within a bounding box.
I have been looking at alternative solutions and am currently thinking that my best bet is some form of custom serialization to binary files. I am however way out of my depth on this and it might be that I am missing good options that are obvious to more experienced nim programmers.
In my quest to understand how to do this I encountered c-blake's library nio, which looks very promising for fast reading and writing of large data objects. I am struggling to understand how to use it properly, and also to understand which things are features of nio and which are inherent functionality in Nim. Another thing I am unsure about is how to achieve logarithmic search when reading data, so that I do not have to read the full file to access a local region.
Are there any good resources for learning how to do fast serialization in Nim in general? Are there any projects I can look at that have implemented something related, that I can take inspiration from? What does nio add compared to writing a serializer from scratch?
Am I barking up the wrong tree? How would you approach the problem?
Sorry, I can only give you some very general advice: Partition your data into multiple different files somehow. Ideally so that every file is roughly of equal size. During queries download the required files and keep them in a recently used cache. Write the files via Nim's write proc in bulk. Read from the files via Nim's memfiles module.
Space based partitioning works best on trees, an easy-to-adapt BTree implementation can be found here:
Thanks for your reply, all help is appreciated.
I think my current best idea is to chunk the large graph into (lat, lon) bounding boxes, save them as separate files, and have an index file that points to the correct files to load for a given bounding box.
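To make the chunking idea above concrete, here is a minimal sketch of my own (not from any library): mapping coordinates to tile files, assuming fixed 1x1 degree tiles and a made-up "chunk_<row>_<col>.bin" naming scheme.

```nim
import std/[math, strformat]

const tileDeg = 1.0  # assumed tile size in degrees

proc chunkFile(lat, lon: float): string =
  ## Name of the (hypothetical) file holding the tile containing (lat, lon).
  let row = int(floor(lat / tileDeg))
  let col = int(floor(lon / tileDeg))
  &"chunk_{row}_{col}.bin"

proc chunkFilesFor(minLat, minLon, maxLat, maxLon: float): seq[string] =
  ## All tile files overlapping a bounding box.
  for row in int(floor(minLat / tileDeg)) .. int(floor(maxLat / tileDeg)):
    for col in int(floor(minLon / tileDeg)) .. int(floor(maxLon / tileDeg)):
      result.add &"chunk_{row}_{col}.bin"

echo chunkFile(59.33, 18.07)  # chunk_59_18.bin (roughly Stockholm)
```

An index file could then map each tile name to metadata, but even plain per-tile files already make a bounding-box query touch only the overlapping tiles.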
Will have some learning to do with memfiles but hopefully it will work out.
Still very open to suggestions and examples if someone else has any.
Thank you for the recommendation Clonk.
HDF5 looks promising but also a bit intimidating. It might be a good solution, but @Vindaar describes the Nim implementation as slightly buggy in its current state. Do you have experience using it?
I will do some more research into the format, it might be a good fit. However, I want to avoid introducing extra dependencies if not necessary. This is both because of the extra time needed to learn and understand new frameworks and to reduce complexity of the final codebase.
Are you aware of https://www.gaia-gis.it/fossil/libspatialite/index ?
Can be solved with a single DB, a spatial index, and a few queries.
General idea: flatten your data into an array, dump it to a file, and use memory mapping afterwards. For lookups, create additional indices. If your data is read-only and can be represented by a single file, this works like a charm. Otherwise you need to balance the pros and cons of a hand-rolled solution vs something like HDF5.
Here is a runnable toy example for you:
import std/[memfiles, strformat, syncio]

type
  NodeId = int
  Node = object
    x: float
    y: float
    linked: seq[NodeId]
  Graph = seq[Node]

type
  LinkId = int
  PNode = object
    x: float
    y: float
    linkedFirst: LinkId
    linkedLast: LinkId
  PGraph = object
    nodes: ptr UncheckedArray[PNode]
    numNodes: int
    links: ptr UncheckedArray[NodeId]
    numLinks: int

proc initPGraph(mem: pointer): PGraph =
  # Interpret the mapped file: two int headers, then nodes, then links.
  let buf = cast[ptr UncheckedArray[byte]](mem)
  var cursor = 0
  result.numNodes = cast[ptr int](addr buf[cursor])[]
  cursor += sizeof(int)
  result.numLinks = cast[ptr int](addr buf[cursor])[]
  cursor += sizeof(int)
  result.nodes = cast[ptr UncheckedArray[PNode]](addr buf[cursor])
  cursor += result.numNodes * sizeof(PNode)
  result.links = cast[ptr UncheckedArray[NodeId]](addr buf[cursor])

proc dump(gr: Graph, fname: string) =
  let f = syncio.open(fname, fmWrite)
  defer: f.close()
  var numNodes: int = gr.len
  if writeBuffer(f, addr numNodes, sizeof(int)) != sizeof(int):
    raise newException(IOError, "short write")
  var numLinks: int = 0
  for node in gr:
    numLinks += node.linked.len
  if writeBuffer(f, addr numLinks, sizeof(int)) != sizeof(int):
    raise newException(IOError, "short write")
  var linkCounter = 0
  for node in gr:
    var pnode = PNode(
      x: node.x,
      y: node.y,
      linkedFirst: linkCounter,
      linkedLast: linkCounter + node.linked.len - 1)
    linkCounter += node.linked.len
    if writeBuffer(f, addr pnode, sizeof(PNode)) != sizeof(PNode):
      raise newException(IOError, "short write")
  for node in gr:
    let len = node.linked.len * sizeof(NodeId)
    if len == 0:
      continue
    if writeBuffer(f, addr node.linked[0], len) != len:
      raise newException(IOError, "short write")

proc main() =
  var gr = newSeq[Node](4)
  gr[0] = Node(x: 1.0, y: 2.0, linked: @[1])
  gr[1] = Node(x: 3.0, y: 4.0, linked: @[3, 0])
  gr[2] = Node(x: 7.0, y: 8.0, linked: @[])
  gr[3] = Node(x: 5.0, y: 6.0, linked: @[0])
  gr.dump("test.graph")
  var memfile = memfiles.open("test.graph")
  defer: memfile.close()
  let pgraph = initPGraph(memfile.mem)
  for i in 0 ..< pgraph.numNodes:
    let node = pgraph.nodes[i]
    var linked: seq[NodeId]
    for linkId in node.linkedFirst .. node.linkedLast:
      linked.add(pgraph.links[linkId])
    echo fmt"(x: {node.x}, y: {node.y}, linked: {linked})"

main()
Fun sounding project! I might be working on something similar in the future.
First you might look into DuckDB. It's an embedded DB similar to SQLite from what I've read but is tailored for data science and has geospatial extensions along with floating point compression. Probably your best bet.
For binary data serializations I'd recommend CBOR and of course my cborious is awesomely fast. ;)
The reasons I prefer CBOR are that it's interpretable like JSON, supports some scanning/skipping, and has an extension for typed float arrays in both little/big endian and fp32/fp64 (see RFC 8746). Formats like HDF5 are powerful but also fragile, prone to corruption, and complicated.
Generally the best approach in Nim is parsing directly into Nim objects. Avoid ref types and use as few seqs or other allocated items as possible. One or two allocations per object is fine, but not a seq of seqs; that would be an O(N) allocation cost.
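As a hedged illustration of that advice (the names here are mine, not from cborious or any library), a seq-of-seqs adjacency list can be flattened into exactly two allocations: one data seq plus one offsets seq, the classic CSR layout.

```nim
type
  FlatAdj = object
    offsets: seq[int]  # offsets.len == numRows + 1
    data: seq[int]     # all elements, back to back

proc flatten(nested: seq[seq[int]]): FlatAdj =
  ## Two allocations total instead of one per inner seq.
  result.offsets.add 0
  for row in nested:
    for v in row:
      result.data.add v
    result.offsets.add result.data.len

proc neighbors(a: FlatAdj, node: int): seq[int] =
  ## Slice out one row by its offset range.
  a.data[a.offsets[node] ..< a.offsets[node + 1]]

let adj = flatten(@[@[1], @[3, 0], @[], @[0]])
echo adj.neighbors(1)  # @[3, 0]
```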
HDF5 looks promising but also a bit intimidating. It might be a good solution, but @Vindaar describes the Nim implementation as slightly buggy in its current state. Do you have experience using it?
PS in my experience all HDF5 implementations are buggy. :P
Sorry to be so late in the reply chain, but I just saw this. nio is mostly a set of convenience procs & CLI commands for helping manipulate a "standard" binary serialization format. It's standard in the sense that packed C structs or packed Nim objects are. You can (and I have) just do as Araq mentioned and blast out records to a file, and then if you just name the file according to nio conventions you can do a variety of manipulations (or add your own) or use nio.FileArray. E.g., many of my in-file hash tables were debugged by just naming them foo.N... and nio print-ing them. Your spatial data set sounds "static", not "dynamic". So that really pushes toward the kind of "set everything up and then memory map it" IO approach.
You can also actually just use a file as a kind of allocation arena which is not hard for the very simple arenas and grow the files up as memory maps themselves. This has some simplifying properties in code, but can occasionally have hazards (e.g., if you write faster than the OS can flush pages to disk then all your dirty pages can hog system RAM - maybe a non-issue if you have enough RAM, but maybe a severe issue if you do not). The nio.py in the Nim nio repo has some example access code to do this sort of thing with NumPy arrays.
A pretty good way to think of nio is "HDF5, but just accepting that files & folders are a done deal". I think HDF5 bites off more than it should be chewing doing all this internal tar file/zip file kind of stuff and often that kind of thing leads to bugs & such. (It does this because HDF5 was a replacement for 1980s NetCDF which came from an era where scientists needed to use VAX VMS and MS DOS filesystems and other stuff like that.)
It sounds like you will probably have to write or adapt your own spatial index. It seems you already know about R trees, and others have mentioned similar. I like to point out https://en.wikipedia.org/wiki/Grid_file as they are not often mentioned but have nice properties and are pretty easy to implement. They are more like a hash or a trie than a comparison tree. @treeform also does a lot in this vague area { if you bin together things like 2D graphics and lat-longs :-) } and is very nice & helpful.
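A rough sketch of the grid-file flavor of spatial indexing (the cell size, types, and proc names are all my own assumptions for illustration): hash points into fixed-size cells so that a bounding-box query only scans the overlapping cells instead of the whole data set.

```nim
import std/[tables, math]

type
  Cell = tuple[r, c: int]
  Point = tuple[x, y: float]
  Grid = object
    cellSize: float
    cells: Table[Cell, seq[Point]]

proc cellOf(g: Grid, p: Point): Cell =
  ## Cell containing point p.
  (r: int(floor(p.x / g.cellSize)), c: int(floor(p.y / g.cellSize)))

proc add(g: var Grid, p: Point) =
  g.cells.mgetOrPut(g.cellOf(p), @[]).add p

proc inBox(g: Grid, x0, y0, x1, y1: float): seq[Point] =
  ## Scan only the cells overlapping the box, then filter exactly.
  for r in int(floor(x0 / g.cellSize)) .. int(floor(x1 / g.cellSize)):
    for c in int(floor(y0 / g.cellSize)) .. int(floor(y1 / g.cellSize)):
      for p in g.cells.getOrDefault((r: r, c: c)):
        if p.x >= x0 and p.x <= x1 and p.y >= y0 and p.y <= y1:
          result.add p

var g = Grid(cellSize: 10.0)
g.add((x: 3.0, y: 4.0))
g.add((x: 12.0, y: 4.0))
g.add((x: 55.0, y: 60.0))
echo g.inBox(0.0, 0.0, 20.0, 20.0).len  # 2 points fall in the box
```

A real grid file adapts the cell boundaries to the data distribution; this fixed-size version is just the simplest variant of the idea.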
Anyway, it's pretty easy to raise an issue on any of my repos or find an email to contact me directly. I don't check the Nim forum as often as I once used to.
I guess I don't really show it in the thes README, but it's pretty fast. On some ancient 2016 computer I just did:
$ thes -cm spatial
5.7220458984375e-06 seconds
0 syns 0 alsos 0 missing
(apologies for the meaningless digits..) That only counts the time of the query, not opening files & setting up the memory maps, though those are typically under/around 10 microseconds. I was mostly inspired to write it by some node/javascript program working on Moby that parsed the whole input data every time and was something like literally a million times slower. :-) I included it mostly to give @emiljanQ3 a sense of scales. Online interpretable memory maps basically make things as if your program never exited in the first place, but sometimes complex data relationships with pointers can be a lot more work.
Space-wise, space-speed trade-offs are absolutely a real thing. It's doubtful you can assume your clients have ZFS/btrfs/bcachefs transparent file decompression (which does block-at-a-time things for you). Deployment diversity presents many real problems, and how much storage space/is it NVMe/SSD/Winchester/etc. are all part of that. As with anything, assuming away problems is the easiest approach. ;-) However, it is likely that float32 is accurate enough, as 40e3 km/16e6 = 2.5 meters. Anyway, compressed formats are a late-stage game, and I usually like to see how big things get first. Uncompressed text of Moby is 25 MiB, but parsed into the thes dictionary it's only 11.9, for example, and then Zstd can cram that down to 4.5 MB. Maybe @emiljanQ3 can just compress for shipment/net transfer, decompress locally, and tell users they need N GiB of local NVMe storage?
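The float32 precision estimate above can be checked directly: a float32 has 24 mantissa bits, so Earth's roughly 40,000 km circumference divided over 2^24 (about 16e6) distinguishable steps gives on the order of 2.4 meters.

```nim
# Worst-case resolution of a float32 coordinate spanning Earth's
# ~40,000 km circumference: circumference in meters / 2^24 mantissa steps.
echo 40_000_000.0 / float(1 shl 24)  # roughly 2.4 meters
```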
The reason I decided not to go with SQL is the need for string parsing, as opposed to reading bytes into memory, as well as the database file taking too much space. I will probably go back to SQL if I don't have success with memory-mapped files.
For the record, sqlite supports both memory mapping and storing binary data in the database - we use it successfully for databases up to a few hundred GB - creating lots of small files is often inefficient for other reasons, such as the file system being slow at indexing them and adding its own overhead.
https://github.com/arnetheduck/nim-sqlite3-abi is an easy way to consume sqlite as it gets compiled together with the application.
Thanks again for the info @cblake. I am a bit overwhelmed by all the possibilities, so I will have to work on my own for a while to wrap my head around things. I have a lot of good info here in the thread that I can go back to and dig deeper into as I progress.
It is good to know that I might get past the read-speed constraints of the default Nim mapping to SQLite, @arnetheduck. I will give custom serialization a go first, but if I go back to SQL I will keep these resources in mind.
Will come back in this thread when I have made some progress to let you know how things are going!
You can design your own binary file format, and in this binary file define a metadata structure so that you can query a "subset" of the geo data easily. Dumping an object to bytes in a file is very easy if your object is static. I'm not sure splitting into multiple files is a good idea; the filesystem can sometimes be very slow.
# section meta data
* count
* item1 meta, which contains offset to item1
* item2 meta
...
# actual data section
* item 1
* item 2

After the geo data is saved this way, you can use memfile to map the file without fully loading it into memory. In each research project, you load the full meta data, use your own algo to get the offsets to the "subset" you need, and either read the subset into memory or move a pointer to do the calculations.

Here is an update after some experiments with memory-mapped files and doing some research on indexing data structures.
It has been a great help to use your example code, @darkestpigeon, and @cblake has been very approachable over email for some ideation.
Insights:
A few questions on memory mapping:
Feel free to critique the direction, give some pointers or just a thumbs up. I'll be back with further updates soon.
If I memory map a file, when (if ever) does it get read and transferred to memory? Is it when I cast a Nim object from a pointer?
When you access the mapped memory and hit a page fault, the corresponding chunk of the file (typically 4KB) is read. If you have enough RAM, the page will stay cached (on linux at least) even after you exit the application, so if your next run hits the same data, it won't even trigger IO. But you should read up on memory mapping.
What if I cast a pointer to a seq? Will the full seq be transferred to memory? If not, can I access seq[i] directly from disk?
That wouldn't work, as internally seq is more complex than just a pointer. Also, even if you arrange things properly, a triggering of the =copy hook will cause an allocation with the size of your whole file and will copy the data, and the =destroy hook will corrupt the heap if your casted seq goes out of scope. Anyway, this isn't the way to go. Use ptr UncheckedArray and write convenience procs like [] and []= on top to do bounds checking and stuff.
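A minimal sketch of that suggestion (the View type and its `[]` are illustrative names of mine, not from any library): wrap the raw ptr UncheckedArray in a small view type with bounds-checked indexing. The demo is backed by an ordinary seq; in practice `data` would point into a memory-mapped file (e.g. memfiles.MemFile.mem plus an offset).

```nim
type
  View[T] = object
    data: ptr UncheckedArray[T]
    len: int

proc `[]`[T](v: View[T], i: int): T =
  ## Bounds-checked read; a `[]=` could be written the same way.
  if i < 0 or i >= v.len:
    raise newException(IndexDefect, "index " & $i & " out of bounds")
  v.data[i]

var backing = @[10, 20, 30]  # stands in for a memory-mapped region
let v = View[int](data: cast[ptr UncheckedArray[int]](addr backing[0]),
                  len: backing.len)
echo v[2]  # 30
```

Unlike the seq cast, this carries no hidden header and triggers no copy/destroy hooks, so it is safe to lay directly over file-backed memory.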
I should also have mentioned adix/oats.nim, which shows one approach to allow EITHER seq-backed (e.g. a unique word counter) OR file-backed (e.g. a weight dictionary random sampler) hash tables, with client code kind of "doing more assembly" (and not in the machine-code sense). The differences in the two client codes may also help in understanding how to work with memory-mapped files in Nim. The kd-tree in Arraymancer might also be helpful. You just need to make the in-memory Tensor an on-disk NIO-/oats-like entity, and it looks like that is set up to need only about 4 memory areas/4 files.
To slightly expand upon what @darkestpigeon correctly outlined, what he is describing, and what you seem to need to learn about, is "virtual memory", which took the world by storm in the late 70s/early 80s. DEC's "VAX" was named for virtual address extension. Sometimes people call the hardware "MMUs" (memory management units) or the mechanism "on-demand paging" for reasons that may become clear shortly. The core idea is that the CPU aids code in creating an abstract memory space different from the physical RAM modules on a machine by implementing a "translation layer" at some granularity (e.g. 12 bits of offset for 4096-byte pages, or 16 KiB pages on modern Apple M1..M5).
While CPUs may do some internal optimizations, notionally for every single memory load/store the CPU does, it first consults a cache for "hot areas" called the "TLB" (translation lookaside buffer, or just virtual-to-physical translation cache) with on the order of dozens to thousands of entries (like an L1 CPU cache). If the upper bits are not in the translation cache, as per that Wikipedia page, a few things can happen. The most common is that the CPU refills that part of the cache from page tables that the OS manages for your process. If the page tables themselves don't have a mapping, a page fault CPU exception is raised, which OSes handle by either killing the process (segmentation violation) or updating the page mappings on demand (hence "demand paging"). There are certainly a zillion slide shows and graphics that explain this, but I hoped to give you some keywords and starting points to search for.
Once you understand it, you will understand "page" or "VM page" as the "unit"/"quantum" of memory management by CPUs & OSes, and the many (many!) options & knobs OS kernels may or may not provide. For this particular project and the trajectory @emiljanQ3 is on, it is foundational knowledge. The original interest was more tilted toward "we only have 8 KiB of physical RAM on our 1969 TENEX system at BBN, long before the BSD Unix mmap, and want to run big programs with swap files", but it rapidly became "we want more flexibility in our programming model". Virtual memory has been taken for granted for almost 50 years with only a few exceptions (some embedded systems where providing or omitting MMUs is used by manufacturers to segment the buying market, MS-DOS during a formative time in personal computer history, ..).