Hi folks
Newbie question here. I've figured out everything I need for my project except for one showstopping issue, and would very much appreciate a hand.
I need a fast way to serialize and deserialize a seq of a relatively simple object to a binary file. The typical sequence will contain a few million objects, hence the need for speed.
There are a couple of Nimble packages out there, but their docs don't cover this use-case in a way I can understand and I've failed to get anything working.
The object is a price/volume bar for trading. Ideally it would be:
type Bar = object
  d: DateTime
  o: float
  h: float
  l: float
  c: float
  v: float
If the DateTime field were an obstacle I could live with an int64 timestamp instead, but converting that back to a DateTime is quite slow, so this would be suboptimal.
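For concreteness, the round trip I mean is something like this (just a sketch using std/times; the names are illustrative):

import std/times

let ts: int64 = getTime().toUnix      # storing an int64 is cheap
let dt: DateTime = ts.fromUnix.local  # converting back is the slow part
echo dt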
I may need to include an enum as well, though I could simply use an int key if that was a problem.
I've spent more time on this than is entirely sensible, so a working code example would be a godsend!
The main problem you have is DateTime. Because a DateTime contains functions (its timezone holds closures), it's very hard to serialize. What are you going to do, serialize the machine code of those functions? Huge security risk!
If you instead use a float64 timestamp or my chrono library you would not have this problem.
type Bar = object
  d: Timestamp
  o: float
  h: float
  l: float
  c: float
  v: float
Next, what do you want to serialize to? JSON is a common format because it's human-readable. I have one of the fastest JSON serialization libraries: jsony
let data = bars.toJson()
echo data
echo data.fromJson(seq[Bar])
But really, JSON is kind of slow. Even the fastest JSON is slow compared to binary! I have a library for that too: flatty. It's faster than jsony but uses its own binary format:
let data = bars.toFlatty()
echo hexPrint(data)
echo data.fromFlatty(seq[Bar])
But you can go even faster with compression. Because reading and writing even to an SSD is kind of slow, compressing your data (with a fast compressor) before writing is usually both faster and takes up less space. I recommend the excellent supersnappy library for this.
let data = bars.toFlatty().compress()
echo hexPrint(data)
echo data.uncompress().fromFlatty(seq[Bar])
Try each and see which method is fastest/easiest for you... don't forget to compile with -d:release.
Full code:
import times, flatty, jsony, supersnappy, chrono, print, flatty/hexprint

type Bar = object
  d: Timestamp
  o: float
  h: float
  l: float
  c: float
  v: float

var bars: seq[Bar]
bars.add(Bar(
  d: parseTs("{year}-{month}-{day}", "2000-01-01"),
  o: 123.123,
  h: 12.23,
  l: 3.23,
  c: 3.344,
  v: 34.4,
))

block:
  echo "using json"
  let data = bars.toJson()
  echo data
  echo data.fromJson(seq[Bar])

block:
  echo "using flatty"
  let data = bars.toFlatty()
  echo hexPrint(data)
  echo data.fromFlatty(seq[Bar])

block:
  echo "using flatty+snappy"
  let data = bars.toFlatty().compress()
  echo hexPrint(data)
  echo data.uncompress().fromFlatty(seq[Bar])
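To compile and run it (bars.nim is just my hypothetical filename):

$ nim c -r -d:release bars.nim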
You know what's more important than speed, right?
Well, it's error checking! Try out https://github.com/planetis-m/bingo when you have the time.
Treeform - brilliant advice - I'd never have gotten to such an effective solution on my own.
For 1.5 million records to and from HDD on my HP Z600 Win 10 Pro workstation, the average benchmarks were:
File sizes: 67 megs (json) / 48 megs (flatty) / 2 megs (flatty+snappy)
Standard deviation is low.
Switching to SSD made surprisingly little difference.
So - a 45x speedup over naive JSON serialisation! That's got to be a result. Good enough to be getting on with - any additional work would be premature optimisation, I think.
And thanks for your libs - a lot of work, I know, and much appreciated!
Planetis - I did look at bingo, but being a bear of little brain I struggled to understand the chaining in the "next" value and how that would work with a long seq. If you were kind enough to provide a more complete example I'd certainly give it a try, though Treeform's solution is looking pretty good right now.
type
  Foo = ref object
    value: int
    next: Foo

let d = Foo(value: 1, next: Foo(value: 2, next: nil))
It's simple:
import bingo, chrono, std/streams

type Bar = object
  d: Timestamp
  o: float
  h: float
  l: float
  c: float
  v: float

var bars: seq[Bar]
bars.add(Bar(
  d: parseTs("{year}-{month}-{day}", "2000-01-01"),
  o: 123.123,
  h: 12.23,
  l: 3.23,
  c: 3.344,
  v: 34.4,
))

let s = newStringStream()
s.storeBin(bars)  # serialize the whole seq to the stream
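Reading it back should just be a matter of rewinding the stream and deserializing; I'm assuming bingo's binTo here, so double-check against its docs:

s.setPosition(0)          # rewind before reading
echo s.binTo(seq[Bar])    # assumption: binTo is bingo's deserializer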
@Scotpip - while all the @treeform/@planetis advice may be relevant if you need to serialize/deserialize to strings, it sounds like you are doing so to a binary file. Given that as a hard constraint, you can save yourself much unnecessary work - the string approaches may do dozens to hundreds of times more work than you need. (Well, formally it is a different big-O scaling, depending upon what you are doing, since files are random access...)
You can even just run right off of the binary file directly via memory-mapped files, as supported by Nim's std/memfiles. suggest has an example of how to do this in a complex spell-check database context (maybe start at proc open(string): Suggestor - really, this code is easier to "run" than "read"), but your data is much simpler.
You could just A) make your timestamp an int64 as you suggest, or B) more generally, an integer offset into a file of fixed-length records. Either way your object becomes a simple, fixed-size thing with very unspecial CPU-like field types. So, you can then just cast the memory address to a ptr UncheckedArray[Bar]. If you want to grow/write the file "outside the memory map" then you can either append records with writeBuffer or expand the map and write via array indexing. Reading is just myFile[i], etc.
Ideas along these lines are developed into a more general-purpose thing over at nio (though there is much functionality to add there for it to be fully capable), which may be simpler example code to read than suggest (or not!). I just wrote that up recently and it hasn't been thoroughly tested.
For a case as simple as yours, though, with just one datatype, you can just do your own little Bar API on top of memfiles. This should be do-able even for a little-brain bear, in my estimation. :-) The idea is extremely simple, but often overlooked.
Arguably, nothing can be much faster. You just use the Nim/C compiler's "field access" as your "binary data interpreter", just as you would always do with an in-memory object. The file just becomes a new "persistent allocation region". If you access your records "by record index" you hardly notice any difference - except the speed from not having to re-do whatever work was required to create the data. If you want just the most recent 10000 records, you can figure that out from the file size & be done. You do not need to iterate over any earlier data, page it off of persistent devices, or anything.
If, as @planetis mentions, you care about memory safety - which can indeed be critical - then you probably want to implement a bounds-checked dynamic array on top of the UncheckedArray, with your own [] and []= operators in your little Bar API, as sketched below. While there is a std/rtarray, it was always very preliminary and almost no one uses it.
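The pattern is tiny. Something like this standalone sketch (hypothetical names, not from any of the libraries above) is all it takes:

type
  Checked[T] = object           # a raw pointer + length, checked on access
    data: ptr UncheckedArray[T]
    len: int

proc `[]`[T](c: Checked[T], i: int): T =
  if i < 0 or i >= c.len:
    raise newException(IndexDefect, "index " & $i & " not in 0.." & $(c.len - 1))
  c.data[i]

proc `[]=`[T](c: var Checked[T], i: int, v: T) =
  if i < 0 or i >= c.len:
    raise newException(IndexDefect, "index " & $i & " not in 0.." & $(c.len - 1))
  c.data[i] = v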
The only catch is binary format portability. Using the {.packed.} pragma on your object can usually limit that to CPUs of actually differing byte order. This is often not a problem because little endian has become so dominant with Intel/AMD/ARM. It would certainly be easy to write a little byte-swapping converter program for occasional translation needs, but I've yet to come across that need personally in over 20 years of this kind of binary data management.
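If the need ever arose, the core of such a converter could be as small as this (an untested sketch using std/endians; it assumes the packed Bar layout used in the code below):

import std/endians

type Bar {.packed.} = object    # same layout as the barIO.nim code below
  t: int64
  o, h, l, c, v: float

proc swapFields(b: var Bar) =
  var src = b                   # swap each 8-byte field in place
  swapEndian64(b.t.addr, src.t.addr)
  swapEndian64(b.o.addr, src.o.addr)
  swapEndian64(b.h.addr, src.h.addr)
  swapEndian64(b.l.addr, src.l.addr)
  swapEndian64(b.c.addr, src.c.addr)
  swapEndian64(b.v.addr, src.v.addr)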
Since a lot of people get kind of stuck in their IO ways and for whatever reasons seem not to just take my word for it, and since @Scotpip seems new to Nim and might want some hand-holding, here is some code to get him/others started:
import times, os, memfiles, strutils  # This is barIO.nim

type
  Bar* {.packed.} = object
    t*: int64                  # ns since 1970-01-01 gives +-292 year range!
    o*, h*, l*, c*, v*: float  # open,hi,lo,close,vlm
  Bars* = UncheckedArray[Bar]
  BarFile* = object
    barf: MemFile
    bars: ptr Bars
    nBar: int                  #XXX should define `[]` bounds checking this

proc bopen*(path: string): BarFile =
  result.barf = memfiles.open path
  result.bars = cast[ptr Bars](result.barf.mem)
  result.nBar = result.barf.size div Bar.sizeof

proc close*(bf: var BarFile) = bf.barf.close

when isMainModule:
  if paramCount() != 3:
    stderr.write "Use: barIO make|vlmTot path.bars num|idx\n"
    quit(1)
  let path = paramStr(2)
  let num = parseInt paramStr(3)
  let t0 = epochTime()
  case paramStr(1)
  of "make":
    let f = system.open(path, fmWrite)
    var bar: Bar               # real data have more than `.v` ;)
    for i in 1..num:
      bar.v = float(i)         #XXX handle disk full below
      discard f.writeBuffer(bar.addr, bar.sizeof)
    f.close
  of "vlmTot":
    var bf = bopen(path)
    var total = 0.0
    for i in bf.nBar - num ..< bf.nBar:
      total += bf.bars[i].v    #XXX bf[i].v once `[]` exists
    echo "volume total of last ", num, " records is ", total
    bf.close
  else:
    stderr.write "unknown command\n"; quit(1)
  echo epochTime() - t0, " seconds"
That's it - define, create, read, benchmark, all in under 40 lines. Now for some benchmark numbers that might inspire you to care. Below are times for a Unix RAM-filesystem test; 72 MB is likely VERY cacheable, and data like this is typically write-once read-many, so round-trip time is much less relevant than separately measured write/read times.
$ ./barIO make /dev/shm/test.bars 1_500_000
0.039844 seconds
$ ./barIO vlmTot /dev/shm/test.bars 1_500_000
volume total of last 1500000 records is 1125000750000.0
0.005187 seconds
$ ./barIO vlmTot /dev/shm/test.bars 100_000
volume total of last 100000 records is 145000050000.0
0.000386 seconds # 337X faster than 260ms/2 (for RT)
# Demo some NIO stuff while we're at it:
$ ln -s /dev/shm/test.bars /dev/shm/tohlcv.Nlddddd
$ nio print /dev/shm/tohlcv.Nlddddd | tail -n100000 |
awk '{print sum += $6}' | tail -n1
145000050000 # Checks out; Just 2.65s=7000X slower.. ;)
# Ok, ok..
$ nio cut -d :5 /dev/shm/tohlcv.Nlddddd |
nio tails -t 100000 .Nd | nio pr .Nd |
awk '{print sum+=$1}' | tail -n1
145000050000 # Also verifies; 122 millisec; 360X slower. ;)
Now, 72 MB / 0.005187 s =~ 13.9 GB/s. Not terrible, but also not great. That machine can do something like 45 GB/s, but getting 3 cores hot sucking data out of DIMMs is out of scope. You are on your own for the last factor of 3..12X. :-D { That will also obviously depend upon your exact calculations anyway, but VWAP or EWMA or other things should be pretty similar to volume totals. And 5 ms may be fast enough, and who knows what your mem bw situation is anyway. }

@cblake - Many thanks for your intriguing post. It certainly makes sense, and opens up the prospect of very low-cost and rapid caching to disk, which might change my design somewhat.
I will cogitate and digest and try to understand the code you have linked.
But more broadly, I'd ask you to reflect on the yawning gap between a confident systems programmer like yourself and someone like myself who has mainly worked on high-level line-of-business development. Ask me to design you a nicely abstracted database or a validated data-flow and I'm your man, but all this talk of bits and bytes leaves me a bit at sea. I'm working through a book on C to try to widen my education, but systems work is still very unfamiliar.
So if you can spare the time to offer a few more kindergarten-level pointers, or even some example code, it would be very helpful. Judging by the readership the thread is attracting, there are others who are benefiting too.
For my project, memory safety isn't critical - the serialisation is mainly for backtesting on my own machine, so if it blows up I'm on hand to fix it. The work is very CPU intensive, so speed is the priority.
As for endianness: for production I'll be running a conservative subset of the functionality on a remote VPS - probably Windows, as that's more familiar to my partners in this little venture. I'm not very familiar with the underlying hardware they use - is there a significant risk that this might clash with the Intel Xeons I'm running locally? I've never heard of this being an issue. But then again, most people aren't using the low-level tricks you are proposing...
Been there - tried that. Pretty slow for big data sets. Though for read-only use, if you set up its in-memory table you can slice and dice the data pretty effectively once you do heave it off the disk. For my use-case, querying the data is not a priority.
For what I'm doing (trading backtesting) the data sets are just too big for conventional DBs. If exotically expensive corporate solutions aren't on the menu, the consensus seems to be that serialising to file is the answer. There's also a lot to be said for keeping it simple and failsafe.
I will answer your question with a question: does SQLite have built-in fancy time-series models it can launch directly against full table scans? Doubtful. Whatever it does have will be commonplace, and you are unlikely to beat the market with it { you aren't so likely to in the first place.. but, hey, someone has to keep the market efficient :-) }. And if it's not built into the table scan, then there will at least be some extraction-pipeline mumbo-jumbo in the way - to what end?
This data is read-only, or at least append-only. So, 99% of the complexity of a full DB is just not needed. I don't know SQLite well enough to say whether it can do memory-mapped IO, but I know DBs that can do exist. Even then, you wind up still wanting to just cast a record reference to a native pointer type. So, if you don't need transactions and indexing and all that, and you wind up doing this simple thing inside the DB instead of a file anyway, then you aren't getting any value out of your dependency. IMO, dead-weight dependencies are usually bad.
More briefly: custom, even trade-secret, analytics or ML models are usually not well aided by SQL engines. Meanwhile, rolling your own flat file is just as trivial as I showed above.. more so with practice, even for little bear brains. :-)
So, you can always do whatever you want, but as I mentioned, people get fixed in their IO ways.. not quite human nature, but SW-dev nature. :-) And maybe I'm even guilty of that in pushing this stuff. You may find answers (or more questions/complaints) in the NIO FAQ, which is really more like "every question someone's asked me about this in 20 years".
I make no claim that it's perfect for everything, but this kind of IO is especially well suited for this kind of back testing use case where you might do millions of passes over some of the data.
No idea if Windows has this ability.
NTFS does support compression but I found Windows Server's block deduplication feature much more useful.
I am also unaware of any port of Windows to a big endian CPU, but I really don't follow it much. Araq might know.
Well, Windows runs on ARM, and ARM can be configured to use big endian. As far as I know, anyway - I never used it myself.
IBM POWER CPUs have been the last CPUs doing it that way for a long time, AFAIK. Apple's M1 is ARM-based, which probably means little endian, but I don't know for sure.
The M1 is little endian. (At least in the default setting.)
It's become so rare that I don't even know if anyone's run the Nim test suite on a big endian CPU at all in the past N years. Araq might know that, too.
Depends on your definition of N; I once fixed compiler bugs for PowerPC, which is big endian.
@ingo - this just happened to be written yesterday: https://scattered-thoughts.net/writing/against-sql/ . While I don't agree 100% with everything written, this excoriating condemnation of SQL as a programming language expresses many problems in a way you (and Nimmers in general) may find interesting. Jamie Brandon has forgotten more about this stuff than I ever knew... Articles like these pop up regularly.
@Araq - thanks for the response! I was aware of ARM "bi-endianness", but didn't/don't think anyone really uses it. I guess MS "discourages" (search for "endian") anyone from using this ARM32 feature, and ARM64 made it harder to do without kernel support. I guess always forcing the CPU into the endian mode expected in, say, interrupt/exception handling at every crossing is viewed as too costly for the value added, or something. Sounds like a real landmine!
@Scotpip - I continue to think byte format concerns are a 100% "theoretical" concern for you (and a 99% theoretical concern for others..) Good luck!
Articles like these pop up regularly
Hah, yes, I know. In Dutch we'd say "Het is niet goed of het deugt niet" (roughly: whatever you do, it's never right). So many have tried, even more have failed. SQL is far from perfect, yet what we have works. Use/learn it as it is, instead of creating yet another failure, or writing yet another ORM that raises the question "How do I do this SQL in your ORM...?" Another thing is the way the DBs are used: SQLite is totally different from PostgreSQL; they need a different mindset to work with. For SQLite you write "one-liners", like many small procs. ... at least it's not OOP ;)
UPDATED BENCHMARK
Task: read 1.5 million price bars off disk and total the volume values
Platform: Win 10 Pro Workstation, 2x Xeon X5675 CPUs, 24 gigs memory, SSD
@treeform approach - marshalling with flatty and supersnappy libs
Average: 0.14 secs
@cblake approach - using the memfiles lib
Average: 0.06 secs
As someone with a very sketchy understanding of OS internals I'm not clear why the memfiles approach is benchmarking 10x slower on Windoze than cblake was reporting for Unix. I'd appreciate any explanation.
@ingo - this is on the same machine with SQLite:
import db_sqlite, times  # inMem.nim

const n = 1_500_000
let t0 = epochTime()
let db = open(":memory:", "", "", "")  # In RAM!!
db.exec sql"DROP TABLE IF EXISTS bars"
db.exec sql"""CREATE TABLE bars (t INTEGER PRIMARY KEY,
                o DECIMAL(15), h DECIMAL(15), l DECIMAL(15),
                c DECIMAL(15), v DECIMAL(15))"""
let q = "INSERT INTO bars(t,o,h,l,c,v) VALUES (?,?,?,?,?,?)"
db.exec sql"BEGIN"
for i in 1..n: db.exec q.sql, i, 0.0, 0.0, 0.0, 0.0, i.float
db.exec sql"COMMIT"  # 4.31 s (/0.0398 =~ 108X slower)
let t1 = epochTime()
for x in db.fastRows(sql"SELECT SUM(v) FROM bars"): echo x
echo t1 - t0, " seconds to create; "
echo epochTime() - t1, " seconds to read"
db.close  # 52 ms (/5.2 =~ 10X worse)
That's 10..100x slower than my memfiles version, which, as noted, is itself not optimized to within 3x of what the hardware can do. If you are satisfied with 30x slower than what is possible, well, that's ok. Maybe you don't have big data.
@Scotpip - I never use Windows and have no experience performance-tuning it. Others may be able to help you there; I will say what I can. I would start with a RAM filesystem, if one is available, to factor out device IO. On Linux I usually get mmap IO only about 1.2x to 1.5x better than buffered IO (for reasons too in the weeds to explain here). Typically, because of how DLLs/.so's work, mmap is well attended to by OS writers.
As a same-OS experiment, I can suggest that you try readBuffer on a File (with setFilePos, aka fseek, if you need random access - i.e. if separating input files into 2019, 2020, etc. is not enough time-window granularity). You will not have as much "built-in" syntax support like [], but Nim lets you add that back easily. You could even add an optional parameter to bopen saying what style of IO you would like. For this simple experiment, you would just declare a var bar: Bar and do a while loop over myFile.readBuffer(bar.addr, bar.sizeof).
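Concretely, that loop might look like this (an untested sketch; "test.bars" is whatever file the barIO make command above produced):

type Bar {.packed.} = object  # same record layout as barIO.nim
  t: int64
  o, h, l, c, v: float

var
  f = open("test.bars")       # fmRead is the default mode
  bar: Bar
  total = 0.0
while f.readBuffer(bar.addr, bar.sizeof) == bar.sizeof:
  total += bar.v              # same volume total as the mmap version
f.close
echo "volume total: ", total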
One thing of note: on Linux there is an fread_unlocked in the C library which reverses the MT-safe default and can (sometimes) be much faster when you are close to hardware limits. (The default allows multiple threads to fread coherently from the same File, which has a buffer they would share.)
I have no idea if this is still true, but 1990s-era advice was that raw system-call reads on Windows were faster than libc buffered IO. To use that you might have to learn how to use a special Nim importc declaration.
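To give a flavor of importc, binding the fread_unlocked mentioned above might look roughly like this (an untested, glibc-specific sketch; check the exact header/availability on your system):

# Untested sketch: bind glibc's fread_unlocked via importc.
proc fread_unlocked(buf: pointer; size, n: csize_t; f: File): csize_t
  {.importc: "fread_unlocked", header: "<stdio.h>".}

# Usage would mirror readBuffer, e.g.:
#   let got = fread_unlocked(bar.addr, 1, csize_t(bar.sizeof), f)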
Another experiment is to time two passes of the totalling - a first pass, which will demand-page data into your address space, and a second pass, which should be faster (for CPU-cache reasons, too, though your cache is likely << 72 MB).