What's up guys,
Found this today:
https://blog.logrocket.com/nim-vs-python-which-should-you-choose/
It was published yesterday, so it's fresh. Good read too!
I replied with the following, but it's pending moderation so it may not appear yet:
* the first benchmark you're referring to isn't using the most optimized compiler flags for Nim, see https://github.com/lh3/biofast/issues/21#issuecomment-707979684 where you can easily get a 1.5x speedup by using `-d:danger --gc:arc`
* worth mentioning: you can embed a Python session using Nim (or vice versa) via https://github.com/yglukhov/nimpy, so you can integrate either way (see the sketch after this list)
* see also https://github.com/timotheecour/D_vs_nim for a D vs Nim comparison
* not mentioned in this article is that Nim compile times are far better than those of typical equivalent idiomatic code in C or, in particular, C++
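A minimal sketch of the nimpy route, assuming `nimble install nimpy` and a Python runtime available (the module and proc names below are just illustrative):

import nimpy

# Calling Python from Nim: import a Python module and call into it.
let pymath = pyImport("math")
echo pymath.sqrt(2.0).to(float)   # convert the Python result back to a Nim float

# The other direction: export a Nim proc so Python can import the compiled module
# (for that direction the file is built as a shared library, e.g. `nim c --app:lib`).
proc greet(name: string): string {.exportpy.} =
  "Hello, " & name & " from Nim!"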
can someone please also check what the speedup with --gc:arc would be in https://github.com/kostya/benchmarks ?
right now it doesn't use this flag, see: https://github.com/kostya/benchmarks/blob/7c13e9221c4e58f95151864b5db87363a38734b2/common/commands.mk#L4
such benchmarks matter for attracting people, so we might as well make use of them in a timely manner
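For reference, this is the kind of invocation being suggested; the file name is a placeholder and the real commands.mk layout may differ:

nim c -d:danger --gc:arc test.nim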
The official documentation is pretty good and very complete.
Great achievement by all the contributors; years back the documentation used to be poor, now everything has at least one example or documentation fragment. :)
About kostya/benchmarks - we will only need to change to arc/orc when 1.4 hits :) We should also use -d:lto, since Rust and GCC in kostya/benchmarks use it.
I did some quick testing with refc/arc with GCC - https://gist.github.com/Yardanico/8d4decc22127bb0e39b6cb0c01ca48a3 (except the last one, havlak with orc - it crashes for some reason, will try to minimise)
I think that we should make a website to track Nim's "performance" based on different benchmarks, be it benchmarks which test pure computations or stdlib, so we can clearly see how commits affect the performance.
we still don't have real performance regression tests in the stdlib (only a very few tests with very loose timing constraints, some of which actually got a lot worse, e.g. due to tables.nim going back to what it was before pseudorandom probing, see https://github.com/timotheecour/Nim/issues/82#issuecomment-625613116 and https://github.com/nim-lang/Nim/pull/13440)
real 0m7,568s # -d:danger -d:lto
real 0m7,592s # -d:danger --verbosity:0 --opt:speed --hints:off
real 0m7,598s # -d:danger
real 0m8,285s # -d:danger --gc:orc
real 0m8,381s # -d:danger --gc:arc
You might wanna have a look at the crazy SIMD stuff Lemire is doing for parsing JSON: https://github.com/simdjson/simdjson
Here is also a video where he outlines some of the design decisions: https://www.infoq.com/presentations/simdjson-parser/
Automatic detection of performance regressions might be nice, but has non-trivial implementation complexity. Besides naive or bad benchmarks or performance tests sometimes being worse than none at all, many CIs run on heavily shared computers. Such sharing makes real time passage a metric with very hostile noise distributions.
There are techniques to mitigate this, such as taking the average & sdev of the minimum of many runs, but they are both very expensive (e.g., 10*10 runs) and ultimately still not entirely reliable on heavily shared systems, especially within Cloud virtual machines. For example, every one of the 100 runs may be getting 5% of a server CPU's L3 cache at one epoch and a full 100% at another, and a 20x difference in available L3 can have a bigger performance impact than whatever is being measured/trapped as a regression.
Beyond this any such results are at high risk of being idiosyncratic to specific hardware (CPU, DIMMs, NVMe/SSD/Winchester, etc.). To do this right, you really need a diverse suite of dedicated computers which may be beyond the resources of Nim core.
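To make the "mean & sdev of the minimum of many runs" mitigation above concrete, here is a rough Nim sketch (a toy harness under those assumptions, not any existing CI tooling):

import std/[monotimes, times, math, stats]

proc benchMinOfRuns(work: proc (); runs = 10, batches = 10): tuple[mean, sdev: float] =
  ## Repeat `work` `runs` times per batch, keep each batch's minimum wall time,
  ## then summarize the minima -- the mitigation technique described above.
  var rs: RunningStat
  for b in 0 ..< batches:
    var best = Inf
    for r in 0 ..< runs:
      let t0 = getMonoTime()
      work()
      let dt = (getMonoTime() - t0).inNanoseconds.float / 1e9
      best = min(best, dt)
    rs.push best
  result = (mean: rs.mean, sdev: rs.standardDeviation)

# Example: time a toy workload (still vulnerable to noisy neighbours, as noted).
proc toy() =
  var s = 0
  for i in 0 ..< 1_000_000: s += i
  doAssert s > 0

echo benchMinOfRuns(toy)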
Expat has callbacks for the beginnings and ends of any XML element. This allows more general caller-side allocation & object creation (or lack thereof!) strategies. In this example, the code could simply update x,y,z totals at the end of every complete element with a name in [xyz] -- the way my hacked parseJsonFragments does, but much less hacky.
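Not Expat, but the same shape of solution can be sketched with Nim's std/parsexml pull parser; the element names and accumulation below just mirror the x/y/z example above:

import std/[parsexml, streams, strutils]

proc sumCoords(path: string): tuple[x, y, z: float] =
  ## Accumulate totals whenever an element named "x", "y" or "z" closes,
  ## i.e. the "act at end-of-element" strategy described above.
  var p: XmlParser
  p.open(newFileStream(path), path)
  var text = ""                 # character data of the current element
  while true:
    p.next()
    case p.kind
    of xmlElementStart: text = ""
    of xmlCharData: text.add p.charData
    of xmlElementEnd:
      case p.elementName
      of "x": result.x += parseFloat(text)
      of "y": result.y += parseFloat(text)
      of "z": result.z += parseFloat(text)
      else: discard
    of xmlEof: break
    else: discard
  p.close()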
One thing that xsv (https://github.com/BurntSushi/xsv) does to allow multithreaded CSV processing is storing the byte offsets of line starts in a preprocessing pass and then doing a multithreaded second pass.
Having start/end of section hooks would also be beneficial for multithreading later down the line.
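A rough Nim sketch of that xsv-style two-pass idea (memory-mapped here for brevity; real xsv keeps an on-disk index, and the threaded second pass is only hinted at in a comment):

import std/memfiles

proc lineStarts(path: string): seq[int] =
  ## Pass 1: record the byte offset where every line begins.
  var mf = memfiles.open(path)
  defer: mf.close()
  let data = cast[ptr UncheckedArray[char]](mf.mem)
  result.add 0
  for i in 0 ..< mf.size:
    if data[i] == '\n' and i + 1 < mf.size:
      result.add i + 1

# Pass 2 (sketch): split `lineStarts(path)` into one contiguous slice of offsets
# per worker thread; each worker maps the file and parses only its own byte range,
# so no thread ever has to scan for record boundaries.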
One sad feature of modern 3-level cache CPU architectures (at least AMD and Intel that I have dealt with, non-NUMA) is that memory bandwidth from one CPU core to DIMMs is substantially less (around 2.5..5x in my usual measurements) than aggregate memory/DIMM bandwidth. Even to copy data untranslated benefits from multi-threading by about 3x in my measurements even though one CPU core is more than capable of instruction throughput to saturate the DIMMs.
I am unaware of the circuit cost/heat dissipation/etc. trade offs CPU engineers make to have too few "lanes" from each core to the DIMMs (I would love to see slides/papers on this if anyone knows them). Usually "wider" data paths are not that much of a burden. In my view this property induces a great deal of otherwise unnecessary thread-level parallelism and its attendant software complexities (hence "sad").
It's even worse than that: if your processes happen to run on 2 sibling cores (2 different logical cores but the same physical core, via HyperThreading), they compete for memory bandwidth, which is a plague for High-Performance Computing, which is bandwidth-bound. If you try to maximize L1 and L2 cache use, they will also evict each other from cache...
I didn't read the article.
Ryu is f2s only. It's for rendering only, no parsing. I will pick it up again when I can afford to, but I've been hoping that @Clyybber or @alaviss would finish their implementations instead. Mine is a naive port from C, because I felt it was more important to make it work first. @alaviss's version is based on actually internalizing the algorithm and then spitting out idiomatic Nim, whereas mine is a mere translation.
I'm looking forward to the days where progress in computing is based on computer science and not on unbelievable hacks that have survived for way too long.
What would help is not giving people half-baked solutions in the first place. We wouldn't have to deal with supporting N versions of Nim JSON de/serializers today if we'd provided a better API in the beginning, because that API could be used by 3rd party JSON accelerators if they were even necessary.
cue @mratsim
Incrementalism is easily done by using the parsejson.nim library and its token stream.
It could be an interesting test application of CPS. I can't see making it a priority, but I will plan to add a parser to my jason hack. If there are any other strong opinions on how best to write it, please share your thoughts.
I took another look at this with a month having flown by. :)
Taking @Araq's excellent advice and just using parsejson directly, this program is 2.13x faster (gcc-10 PGO with --gc:orc) than the Kostya version, and needs no weird patches to the Nim stdlib:
when defined(cbFParse):
  import parsejson, streams, mystrut   # 1.63x speed up
else:
  import parsejson, streams, strutils

iterator jsonTokens(path: string): JsonParser =
  var p: JsonParser
  p.open(newFileStream(path), path)
  try:
    var tok = p.getTok
    while tok notin {tkEof, tkError}:
      yield p
      tok = p.getTok
  finally:
    p.close()
type Coordinate = tuple[x: float, y: float, z: float]

proc calc(path = "/tmp/1.json"): Coordinate =
  var s = ""
  var x = 0.0
  var y = 0.0
  var z = 0.0
  var n = 0
  for token in jsonTokens(path):      # works like "on demand" C++
    if token.tok == tkString: s = token.a
    elif token.tok == tkFloat:
      case s
      of "x": x += parseFloat(token.a)
      of "y": y += parseFloat(token.a)
      of "z": z += parseFloat(token.a); n.inc
      else: discard
  result = (x: x/float(n), y: y/float(n), z: z/float(n))

echo calc()
I believe almost all the remaining time to close the ~4.5x gap with D/the fastest entries comes from parsejson.parseNumber, but not in the way prior discussion in this thread has suggested. If you look at that code, it does almost all the work to convert from ASCII to int or float, and all that work gets done again later. Beyond this duplication, our parseFloat currently calls out (on Linux/glibc) to a slow, long double (IIRC) implementation of ASCII to fp. (That 1.63x ratio in my code comment is just from doing a slightly smarter parseFloat where you first parse integers and later combine them into a float.)
So, I think the best way to speed this up is to have a fast, flexible int/float parser/validator that could be used just once in parsejson.getTok to both identify an integer/float and produce the binary value. Since @Araq mentions that NAN/INF are not part of JSON, this new in-Nim parseFloat would need to be configurable if we want to reject INF/NAN values. I have not looked at the D impl of this, but doing all that work just once is my best guess as to why D is about as fast as the SIMD/C++ on this benchmark without all the hand-rolled assembly. All other number parsing code would likely get a bit faster. So, this is motivated a few ways, but likely to claw back 3+x out of that 4.5x on this particular benchmark without running into compile-time vs. run-time differences.
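For illustration, a rough sketch of the "parse the integer pieces first, combine into a float later" idea mentioned above (not the actual mystrut or parsejson code; it ignores precision corner cases, overflow, and INF/NAN handling):

import std/math

proc fastParseFloat(s: string): float =
  ## Integer-based decimal float parse: sign, integer part, fraction, exponent.
  var i = 0
  var sign = 1.0
  if i < s.len and s[i] == '-': sign = -1.0; inc i
  var intPart = 0'i64
  while i < s.len and s[i] in {'0'..'9'}:
    intPart = intPart * 10 + int64(ord(s[i]) - ord('0')); inc i
  var frac = 0'i64
  var fracDigits = 0
  if i < s.len and s[i] == '.':
    inc i
    while i < s.len and s[i] in {'0'..'9'}:
      frac = frac * 10 + int64(ord(s[i]) - ord('0')); inc fracDigits; inc i
  var expo = 0
  if i < s.len and s[i] in {'e', 'E'}:
    inc i
    var esign = 1
    if i < s.len and s[i] in {'+', '-'}:
      if s[i] == '-': esign = -1
      inc i
    while i < s.len and s[i] in {'0'..'9'}:
      expo = expo * 10 + (ord(s[i]) - ord('0')); inc i
    expo *= esign
  result = sign * (intPart.float + frac.float / pow(10.0, fracDigits.float)) *
           pow(10.0, expo.float)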
I'm not sure the rest of parsejson.getTok can really be sped up much. parsejson.parseName and parsejson.skip are the only other non-trivial work done. parseName is already re-using the token string buffer. Most processing is just a char at a time. There are SIMD-optimized strspn implementations out there that might be able to speed up skip a little. That's likely the next low-hanging fruit, but then you really do start to collide with run-time vs. compile-time issues.
Sure enough, I did this here (yeah, I know tests are failing at the moment..) and now (with a gcc-10.2 PGO build on Skylake) this simple program runs that benchmark over 3x faster (to within 1.4x of the SIMD-optimized C++ "on demand", within 10% of "D fast", and faster than all the Rust entries):
import parsejson

type Coordinate = tuple[x: float, y: float, z: float]

proc calc(path = "/tmp/1.json"): Coordinate =
  var x = 0.0
  var y = 0.0
  var z = 0.0
  var n = 0
  for t in jsonTokens(path, false, false, false):
    if t.tok == tkFloat:          # valid float
      case t.a                    # last string (strFloats,Ints)
      of "x": x = x + t.f         # a[0] is actually slower!
      of "y": y = y + t.f
      of "z": z = z + t.f; n.inc
      else: discard
  result = (x: x/float(n), y: y/float(n), z: z/float(n))

echo calc()
To be honest, all that SIMD effort is nice and all, but 1.4x isn't that huge a boost. Not duplicating work gets you almost all the way there.
Parallelization, as @mratsim mentioned, could gain much more. Unlike xsv/CSV, however, JSON has a fluid hierarchical structure, so in the general case a lot of the effort goes into figuring out how to divide the work. Maybe not worth it. Truly big data usually has a regular format anyway.