Hello. This is my first post on the forum and I am only a couple of months into my Nim journey.
I am doing optimization tasks on road-network data in an application that is fully written in Nim and distributed to users as a command-line tool for Windows. The road network is from OpenStreetMap and is represented as a graph with nodes containing coordinates and edges containing driving-time weights.
The full graph could be Europe or Scandinavia. I want to find local regions of the full graph and do calculations on them. Most calculations only require a very small subset of the full graph, but some will require larger ones. I want to be able to do quick nearest-neighbor lookups and to quickly get all nodes and edges within a bounding box.
I have been looking at alternative solutions and am currently thinking that my best bet is some form of custom serialization to binary files. I am however way out of my depth on this and it might be that I am missing good options that are obvious to more experienced nim programmers.
In my quest to understand how to do this I encountered c-blake's library nio, which looks very promising for fast reading and writing of large data objects. I am struggling to understand how to use it properly, and also to understand which things are features of nio and which are inherent functionality in Nim. Another thing I am unsure about is how to achieve logarithmic search when reading data, so that I do not have to read the full file to access a local region.
Are there any good resources for learning how to do fast serialization in Nim in general? Are there any projects I can look at that have implemented something related, that I can take inspiration from? What does nio add compared to writing a serializer from scratch?
Am I barking up the wrong tree? How would you approach the problem?
Sorry, I can only give you some very general advice: Partition your data into multiple different files somehow. Ideally so that every file is roughly of equal size. During queries download the required files and keep them in a recently used cache. Write the files via Nim's write proc in bulk. Read from the files via Nim's memfiles module.
Space based partitioning works best on trees, an easy-to-adapt BTree implementation can be found here:
Thanks for your reply, all help is appreciated.
I think my current best idea is to chunk the large graph into (lat, lon) bounding boxes, save them as separate files, and have an index file that points to the correct files to load for a given bounding box.
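To make the chunking idea above concrete, here is a minimal sketch of my own (not from any library): mapping coordinates to tile files, assuming fixed 1x1 degree tiles and a made-up "chunk_<row>_<col>.bin" naming scheme.

```nim
import std/[math, strformat]

const tileDeg = 1.0  # assumed tile size in degrees

proc chunkFile(lat, lon: float): string =
  ## Name of the (hypothetical) file holding the tile containing (lat, lon).
  let row = int(floor(lat / tileDeg))
  let col = int(floor(lon / tileDeg))
  &"chunk_{row}_{col}.bin"

proc chunkFilesFor(minLat, minLon, maxLat, maxLon: float): seq[string] =
  ## All tile files overlapping a bounding box.
  for row in int(floor(minLat / tileDeg)) .. int(floor(maxLat / tileDeg)):
    for col in int(floor(minLon / tileDeg)) .. int(floor(maxLon / tileDeg)):
      result.add &"chunk_{row}_{col}.bin"

echo chunkFile(59.33, 18.07)  # chunk_59_18.bin (roughly Stockholm)
```

An index file could then map each tile name to metadata, but even plain per-tile files already make a bounding-box query touch only the overlapping tiles.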
Will have some learning to do with memfiles but hopefully it will work out.
Still very open to suggestions and examples if someone else has any.
Thank you for the recommendation Clonk.
HDF5 looks promising but also a bit intimidating. It might be a good solution, but @Vindaar describes the Nim implementation as slightly buggy in its current state. Do you have experience using it?
I will do some more research into the format, it might be a good fit. However, I want to avoid introducing extra dependencies if not necessary. This is both because of the extra time needed to learn and understand new frameworks and to reduce complexity of the final codebase.
Are you aware of https://www.gaia-gis.it/fossil/libspatialite/index ?
Can be solved with a single DB, a spatial index, and a few queries.
General idea: flatten your data into an array, dump it to a file, and use memory mapping afterwards. For lookups, create additional indices. If your data is read-only and can be represented by a single file, this works like a charm. Otherwise you need to balance the pros and cons of a hand-rolled solution vs something like HDF5.
Here is a runnable toy example for you:
import std/[memfiles, strformat, syncio]

type
  NodeId = int
  Node = object
    x: float
    y: float
    linked: seq[NodeId]
  Graph = seq[Node]

type
  LinkId = int
  PNode = object
    x: float
    y: float
    linkedFirst: LinkId
    linkedLast: LinkId
  PGraph = object
    nodes: ptr UncheckedArray[PNode]
    numNodes: int
    links: ptr UncheckedArray[NodeId]
    numLinks: int

proc initPGraph(mem: pointer): PGraph =
  # Interpret the mapped file: two int headers, then nodes, then links.
  let buf = cast[ptr UncheckedArray[byte]](mem)
  var cursor = 0
  result.numNodes = cast[ptr int](addr buf[cursor])[]
  cursor += sizeof(int)
  result.numLinks = cast[ptr int](addr buf[cursor])[]
  cursor += sizeof(int)
  result.nodes = cast[ptr UncheckedArray[PNode]](addr buf[cursor])
  cursor += result.numNodes * sizeof(PNode)
  result.links = cast[ptr UncheckedArray[NodeId]](addr buf[cursor])

proc dump(gr: Graph, fname: string) =
  let f = syncio.open(fname, fmWrite)
  defer: f.close()
  var numNodes: int = gr.len
  if writeBuffer(f, addr numNodes, sizeof(int)) != sizeof(int):
    raise newException(IOError, "short write")
  var numLinks: int = 0
  for node in gr:
    numLinks += node.linked.len
  if writeBuffer(f, addr numLinks, sizeof(int)) != sizeof(int):
    raise newException(IOError, "short write")
  var linkCounter = 0
  for node in gr:
    var pnode = PNode(
      x: node.x,
      y: node.y,
      linkedFirst: linkCounter,
      linkedLast: linkCounter + node.linked.len - 1)
    linkCounter += node.linked.len
    if writeBuffer(f, addr pnode, sizeof(PNode)) != sizeof(PNode):
      raise newException(IOError, "short write")
  for node in gr:
    let len = node.linked.len * sizeof(NodeId)
    if len == 0:
      continue
    if writeBuffer(f, addr node.linked[0], len) != len:
      raise newException(IOError, "short write")

proc main() =
  var gr = newSeq[Node](4)
  gr[0] = Node(x: 1.0, y: 2.0, linked: @[1])
  gr[1] = Node(x: 3.0, y: 4.0, linked: @[3, 0])
  gr[2] = Node(x: 7.0, y: 8.0, linked: @[])
  gr[3] = Node(x: 5.0, y: 6.0, linked: @[0])
  gr.dump("test.graph")
  var memfile = memfiles.open("test.graph")
  defer: memfile.close()
  let pgraph = initPGraph(memfile.mem)
  for i in 0 ..< pgraph.numNodes:
    let node = pgraph.nodes[i]
    var linked: seq[NodeId]
    for linkId in node.linkedFirst .. node.linkedLast:
      linked.add(pgraph.links[linkId])
    echo fmt"(x: {node.x}, y: {node.y}, linked: {linked})"

main()
Fun sounding project! I might be working on something similar in the future.
First you might look into DuckDB. It's an embedded DB similar to SQLite from what I've read but is tailored for data science and has geospatial extensions along with floating point compression. Probably your best bet.
For binary data serializations I'd recommend CBOR and of course my cborious is awesomely fast. ;)
The reasons I prefer CBOR are that it's interpretable like JSON, supports some scanning/skipping, and has an extension for typed float arrays in both little/big endian and fp32/fp64 (see RFC 8746). Formats like HDF5 are powerful but also fragile, prone to corruption, and complicated.
Generally the best approach in Nim is parsing directly into Nim objects. Avoid ref types and use as few seqs or other allocated items as possible. One or two allocations per object is fine, but not a seq of seqs; that would be an O(N) allocation cost.
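As a hedged illustration of that advice (the names here are mine, not from cborious or any library), a seq-of-seqs adjacency list can be flattened into exactly two allocations: one data seq plus one offsets seq, the classic CSR layout.

```nim
type
  FlatAdj = object
    offsets: seq[int]  # offsets.len == numRows + 1
    data: seq[int]     # all elements, back to back

proc flatten(nested: seq[seq[int]]): FlatAdj =
  ## Two allocations total instead of one per inner seq.
  result.offsets.add 0
  for row in nested:
    for v in row:
      result.data.add v
    result.offsets.add result.data.len

proc neighbors(a: FlatAdj, node: int): seq[int] =
  ## Slice out one row by its offset range.
  a.data[a.offsets[node] ..< a.offsets[node + 1]]

let adj = flatten(@[@[1], @[3, 0], @[], @[0]])
echo adj.neighbors(1)  # @[3, 0]
```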
HDF5 looks promising but also a bit intimidating. It might be a good solution, but @Vindaar describes the Nim implementation as slightly buggy in its current state. Do you have experience using it?
PS in my experience all HDF5 implementations are buggy. :P
Sorry to be so late in the reply chain, but I just saw this. nio is mostly a set of convenience procs & CLI commands for helping manipulate a "standard" binary serialization format. It's standard in the sense that packed C structs or packed Nim objects are. You can (and I have) just do as Araq mentioned and blast out records to a file, and then if you just name the file according to nio conventions you can do a variety of manipulations (or add your own) or use nio.FileArray. E.g., many of my in-file hash tables were debugged by just naming them foo.N... and nio print-ing them. Your spatial data set sounds "static", not "dynamic". So that really pushes toward the kind of "set everything up and then memory map it" IO approach.
You can also actually just use a file as a kind of allocation arena which is not hard for the very simple arenas and grow the files up as memory maps themselves. This has some simplifying properties in code, but can occasionally have hazards (e.g., if you write faster than the OS can flush pages to disk then all your dirty pages can hog system RAM - maybe a non-issue if you have enough RAM, but maybe a severe issue if you do not). The nio.py in the Nim nio repo has some example access code to do this sort of thing with NumPy arrays.
A pretty good way to think of nio is "HDF5, but just accepting that files & folders are a done deal". I think HDF5 bites off more than it should be chewing doing all this internal tar file/zip file kind of stuff and often that kind of thing leads to bugs & such. (It does this because HDF5 was a replacement for 1980s NetCDF which came from an era where scientists needed to use VAX VMS and MS DOS filesystems and other stuff like that.)
It sounds like you will probably have to write or adapt your own spatial index. It seems you already know about R trees, and others have mentioned similar. I like to point out https://en.wikipedia.org/wiki/Grid_file as they are not often mentioned but have nice properties and are pretty easy to implement. They are more like a hash or a trie than a comparison tree. @treeform also does a lot in this vague area { if you bin together things like 2D graphics and lat-longs :-) } and is very nice & helpful.
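A rough sketch of the grid-file flavor of spatial indexing (the cell size, types, and proc names are all my own assumptions for illustration): hash points into fixed-size cells so that a bounding-box query only scans the overlapping cells instead of the whole data set.

```nim
import std/[tables, math]

type
  Cell = tuple[r, c: int]
  Point = tuple[x, y: float]
  Grid = object
    cellSize: float
    cells: Table[Cell, seq[Point]]

proc cellOf(g: Grid, p: Point): Cell =
  ## Cell containing point p.
  (r: int(floor(p.x / g.cellSize)), c: int(floor(p.y / g.cellSize)))

proc add(g: var Grid, p: Point) =
  g.cells.mgetOrPut(g.cellOf(p), @[]).add p

proc inBox(g: Grid, x0, y0, x1, y1: float): seq[Point] =
  ## Scan only the cells overlapping the box, then filter exactly.
  for r in int(floor(x0 / g.cellSize)) .. int(floor(x1 / g.cellSize)):
    for c in int(floor(y0 / g.cellSize)) .. int(floor(y1 / g.cellSize)):
      for p in g.cells.getOrDefault((r: r, c: c)):
        if p.x >= x0 and p.x <= x1 and p.y >= y0 and p.y <= y1:
          result.add p

var g = Grid(cellSize: 10.0)
g.add((x: 3.0, y: 4.0))
g.add((x: 12.0, y: 4.0))
g.add((x: 55.0, y: 60.0))
echo g.inBox(0.0, 0.0, 20.0, 20.0).len  # 2 points fall in the box
```

A real grid file adapts the cell boundaries to the data distribution; this fixed-size version is just the simplest variant of the idea.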
Anyway, it's pretty easy to raise an issue on any of my repos or find an email to contact me directly. I don't check the Nim forum as often as I once used to.
I guess I don't really show it in the thes README, but it's pretty fast. On some ancient 2016 computer I just did:
$ thes -cm spatial
5.7220458984375e-06 seconds
0 syns 0 alsos 0 missing
(apologies for the meaningless digits..) That only counts the time of the query, not opening files & setting up the memory maps, though those are typically under/around 10 microseconds. I was mostly inspired to write it by some node/javascript program working on Moby that parsed the whole input data every time and was something like literally a million times slower. :-) I included it mostly to give @emiljanQ3 a sense of scales. Online interpretable memory maps basically make things as if your program never exited in the first place, but sometimes complex data relationships with pointers can be a lot more work.
Space-wise, space-speed trade-offs are absolutely a real thing. It's doubtful you can assume your clients have ZFS/btrfs/bcachefs transparent file decompression (which does block-at-a-time things for you). Deployment diversity presents many real problems, and how much storage space/is it NVMe/SSD/Winchester/etc. are all part of that. As with anything, assuming away problems is the easiest approach. ;-) However, it is likely that float32 is accurate enough, as 40e3 km/16e6 = 2.5 meters. Anyway, compressed formats are a late-stage game, and I usually like to see how big things get first. Uncompressed text of Moby is 25 MiB, but parsed into the thes dictionary it's only 11.9, for example, and then Zstd can cram that down to 4.5 MB. Maybe @emiljanQ3 can just compress for shipment/net transfer, decompress locally, and tell users they need N GiB of local NVMe storage?
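The float32 precision estimate above can be checked directly: a float32 has 24 mantissa bits, so Earth's roughly 40,000 km circumference divided over 2^24 (about 16e6) distinguishable steps gives on the order of 2.4 meters.

```nim
# Worst-case resolution of a float32 coordinate spanning Earth's
# ~40,000 km circumference: circumference in meters / 2^24 mantissa steps.
echo 40_000_000.0 / float(1 shl 24)  # roughly 2.4 meters
```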
The reason I decided not to go with SQL is the need for string parsing, as opposed to reading bytes into memory, as well as the database file taking too much space. I will probably go back to SQL if I don't have success with memory-mapped files.
For the record, sqlite supports both memory mapping and storing binary data in the database - we use it successfully for databases up to a few hundred GB - creating lots of small files is often inefficient for other reasons, such as the file system being slow at indexing them and adding its own overhead.
https://github.com/arnetheduck/nim-sqlite3-abi is an easy way to consume sqlite as it gets compiled together with the application.
Thanks again for the info @cblake. I am a bit overwhelmed by all the possibilities, so I will have to work on my own for a while to wrap my head around things. I have a lot of good info here in the thread that I can go back to and dig deeper into as I progress.
It is good to know that I might get past the read-speed constraints of the default Nim mapping to SQLite, @arnetheduck. I will give custom serialization a go first, but if I go back to SQL I will keep these resources in mind.
Will come back in this thread when I have made some progress to let you know how things are going!
You can design your own binary file format, and in this binary file define a metadata structure so that you can query a "subset" of the geo data easily. Dumping an object to bytes in a file is very easy if your object is static. I'm not sure splitting into multiple files is a good idea; the filesystem can sometimes be very slow.
# section meta data
* count
* item1 meta, which contains offset to item1
* item2 meta
...
# actual data section
* item 1
* item 2

After the geo data is saved this way, you can use memfile to map the file without fully loading it into memory. In each research project, you load the full meta data, use your own algo to get the offsets to the "subset" you need, and either read the subset into memory or move a pointer to do the calculations.

Here is an update after some experiments with memory-mapped files and doing some research on indexing data structures.
It has been a great help to use your example code, @darkestpigeon, and @cblake has been very approachable over email for some ideation.
Insights:
A few questions on memory mapping:
Feel free to critique the direction, give some pointers or just a thumbs up. I'll be back with further updates soon.
If I memory map a file, when (if ever) does it get read and transferred to memory? Is it when I cast a Nim object from a pointer?
When you access the mapped memory and hit a page fault, the corresponding chunk of the file (typically 4KB) is read. If you have enough RAM, the page will stay cached (on linux at least) even after you exit the application, so if your next run hits the same data, it won't even trigger IO. But you should read up on memory mapping.
What if I cast a pointer to a seq? Will the full seq be transferred to memory? If not, can I access seq[i] directly from disk?
That wouldn't work, as internally seq is more complex than just a pointer. Also, even if you arrange things properly, a triggering of the =copy hook will cause an allocation with the size of your whole file and will copy the data, and the =destroy hook will corrupt the heap if your casted seq goes out of scope. Anyway, this isn't the way to go. Use ptr UncheckedArray and write convenience procs like [] and []= on top to do bounds checking and stuff.
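A minimal sketch of that suggestion (the View type and its `[]` are illustrative names of mine, not from any library): wrap the raw ptr UncheckedArray in a small view type with bounds-checked indexing. The demo is backed by an ordinary seq; in practice `data` would point into a memory-mapped file (e.g. memfiles.MemFile.mem plus an offset).

```nim
type
  View[T] = object
    data: ptr UncheckedArray[T]
    len: int

proc `[]`[T](v: View[T], i: int): T =
  ## Bounds-checked read; a `[]=` could be written the same way.
  if i < 0 or i >= v.len:
    raise newException(IndexDefect, "index " & $i & " out of bounds")
  v.data[i]

var backing = @[10, 20, 30]  # stands in for a memory-mapped region
let v = View[int](data: cast[ptr UncheckedArray[int]](addr backing[0]),
                  len: backing.len)
echo v[2]  # 30
```

Unlike the seq cast, this carries no hidden header and triggers no copy/destroy hooks, so it is safe to lay directly over file-backed memory.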
I should also have mentioned adix/oats.nim, which shows one approach to allow EITHER seq-backed (e.g. a unique word counter) OR file-backed (e.g. a weight dictionary random sampler) hash tables, with client code kind of "doing more assembly" (and not in the machine-code sense). The differences in the two client codes may also help in understanding how to work with memory-mapped files in Nim. The kd-tree in Arraymancer might also be helpful. You just need to make the in-memory Tensor an on-disk NIO-/oats-like entity, and it looks like that is set up to need only about 4 memory areas/4 files.
To slightly expand upon what @darkestpigeon correctly outlined, what he is describing, and what you seem to need to learn about, is "virtual memory", which took the world by storm in the late 70s/early 80s. DEC's "VAX" was named for virtual address extension. Sometimes people call the hardware "MMUs" (memory management units) or the mechanism "on-demand paging" for reasons that may become clear shortly. The core idea is that the CPU aids code in creating an abstract memory space different from the physical RAM modules on a machine by implementing a "translation layer" at some granularity (e.g. 12 bits of offset for 4096-byte pages, or 16 KiB pages on modern Apple M1..M5).
While CPUs may do some internal optimizations, notionally for every single memory load/store the CPU does, it first consults a cache for "hot areas" called the "TLB" (translation lookaside buffer, or just virtual-to-physical translation cache) with on the order of dozens to thousands of entries (like an L1 CPU cache). If the upper bits are not in the translation cache, as per that Wikipedia page, a few things can happen. The most common is that the CPU refills that part of the cache from page tables that the OS manages for your process. If the page tables themselves don't have a mapping, a page fault CPU exception is raised, which OSes handle by either killing the process (segmentation violation) or updating the page mappings on demand (hence "demand paging"). There are certainly a zillion slide shows and graphics that explain this, but I hoped to give you some keywords and starting points to search for.
Once you understand it, you will understand "page" or "VM page" as the "unit"/"quantum" of memory management by CPUs & OSes, and the many (many!) options & knobs OS kernels may or may not provide. For this particular project and the trajectory @emiljanQ3 is on, it is foundational knowledge. The original interest was more tilted toward "we only have 8 KiB of physical RAM on our 1969 TENEX system at BBN, long before the BSD Unix mmap, and want to run big programs with swap files", but it rapidly became "we want more flexibility in our programming model". Virtual memory has been taken for granted for almost 50 years with only a few exceptions (some embedded systems where providing or omitting MMUs is used by manufacturers to segment the buying market, MS-DOS during a formative time in personal computer history, ..).