The service I migrated to Nim manages around 100k concurrent mostly long-lived WebSocket connections. These WebSocket connections are opened by client apps (Android, Windows desktop, etc) and are used to receive realtime events.
I wrote the new WebSocket server using Mummy, Ready and JSONy. I wasn't quite sure what to expect for performance but have been pleasantly surprised.
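As a rough sketch of how one of those pieces fits in: JSONy turns event objects into the JSON payloads pushed over the WebSockets. The Event type and its fields below are made up for illustration, not the actual production schema.

    import jsony

    type Event = object
      # Hypothetical event shape, purely for illustration.
      kind: string
      payload: string

    # Serialize an event before sending it down a WebSocket.
    let outgoing = Event(kind: "heartbeat", payload: "").toJson()
    echo outgoing # {"kind":"heartbeat","payload":""}

    # Parse an incoming JSON message back into a typed object.
    let incoming = """{"kind":"hello","payload":"hi"}""".fromJson(Event)
    echo incoming.kind # hello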
The new Nim server has now been running for a week on a single 2 vCPU + 16GB RAM VM. This VM is behind a load balancer for HTTPS termination, the same kind of setup as described in my previous post.
With 100k open WebSocket connections, the new server's CPU and RAM usage are both under 10% (roughly 7.5%). This is pretty great.
While I don't have 1 million production WebSocket connections to prove it, this provides a solid indication that Nim + Mummy can handle 1M+ WebSocket connections on a single small VM, which is quite cool.
The server is not just sitting there idle either: every minute over 200k messages are sent to clients (most of these are simple heartbeat messages).
Considering this new Nim server has been humming along without any issues for the past week I am ready to say the migration has been a success.
I have now put Nim and Mummy into production serving both a lot of HTTP request traffic and a lot of WebSocket connections. Since everything has performed really well in both cases it is pretty easy to say I am happy with the results I have achieved with Nim.
Why does this matter to you?
Thanks for giving this a read and good luck with your own Nim adventures.
Congrats! Could you give any insights into the previous system and how it compares to the new Nim codebase? It'd be cool to know what factors led to the re-write as well, like was it performance or maintenance costs.
It sounds very interesting. Can I ask a technical question: did you compare this with an async solution?
Because even if these are long-running WebSockets, that would still be 100k+ threads with context switching, and most of the time a WebSocket is probably just waiting.
I like Mummy's idea of a simple interface without async, but it would still be interesting to compare it with async.
There are not 100k+ threads.
There is 1 Mummy socket thread and 2 worker threads for the HTTP Upgrade requests. Then there is 1 thread receiving on a Redis pubsub connection and 1 thread handling heartbeat intervals.
In total there are 5 threads, and this number does not scale with the number of connections.
I'll clean up and open source a minimal example to show how to set this up.
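In the meantime, here is a rough sketch of that thread layout, following Mummy's README-style API. The Redis/Ready calls are left out, and the workerThreads parameter name is my reading of Mummy's API rather than something confirmed in this post, so check the docs.

    import mummy, mummy/routers
    import std/[os, typedthreads]

    proc upgradeHandler(request: Request) =
      # HTTP Upgrade requests are handled on one of Mummy's worker threads.
      let websocket = request.upgradeToWebSocket()
      websocket.send("connected")

    proc websocketHandler(
      websocket: WebSocket, event: WebSocketEvent, message: Message
    ) =
      discard # WebSocket events are also dispatched to the worker threads.

    proc redisPubsubThread() {.thread.} =
      # 1 thread blocked receiving on a Redis pubsub connection (via Ready),
      # forwarding published events to the relevant WebSockets. The Ready
      # calls are omitted since they are not the point of this sketch.
      discard

    proc heartbeatThread() {.thread.} =
      # 1 thread waking up on an interval to push heartbeats to the open
      # connections (connection tracking is shown in a later sketch).
      while true:
        sleep(60_000)

    var background: array[2, Thread[void]]
    createThread(background[0], redisPubsubThread)
    createThread(background[1], heartbeatThread)

    var router: Router
    router.get("/ws", upgradeHandler)

    # newServer starts 1 socket thread plus the worker threads when served.
    # (workerThreads is my reading of the parameter name; check Mummy's docs.)
    let server = newServer(router, websocketHandler, workerThreads = 2)
    server.serve(Port(8080))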
Nice!
The documentation notes that "handlers" are free to take over the thread and do lengthy processing, citing blocking PostgreSQL queries as an example of one of the main advantages over async. What happens once the "few" worker threads are all occupied, i.e. if each of those blocking queries ties up a thread?
I have put together a simplified example WebSocket server that works in the same way as the production server this post is about.
The example can be seen at https://github.com/guzba/mummy/blob/master/examples/advanced_websockets.nim
This is not exactly simple but at ~200 lines it is hopefully something those interested can get through and get some value out of. I can add more comments or answer more questions from here now that we have some code I can reference.
Note that this does use threads and global memory, so some locks etc. are required.
There is a lot to this and I'm sure you're aware of it all too. Happy to answer more questions if you're interested. My main thesis is more "pro-thread" than "anti-async", to be clear. I just like blocking code, maybe I'm crazy.
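For anyone who doesn't want to read the full ~200 lines, here is a trimmed-down sketch of the locks + global memory pattern, assuming a lock-protected global seq of connections. Removing closed connections and the actual event routing are elided; the linked example is the real reference.

    import mummy, mummy/routers
    import std/[locks, os, typedthreads]

    var
      connectionsLock: Lock
      connections: seq[WebSocket] # Shared state, guarded by connectionsLock.

    initLock(connectionsLock)

    proc upgradeHandler(request: Request) =
      let websocket = request.upgradeToWebSocket()
      {.gcsafe.}: # Global GC'd state shared across threads, hence the lock.
        withLock connectionsLock:
          connections.add(websocket)

    proc websocketHandler(
      websocket: WebSocket, event: WebSocketEvent, message: Message
    ) =
      if event == CloseEvent:
        # The real example also drops the closed connection from
        # `connections`; that bookkeeping is elided in this sketch.
        discard

    proc heartbeatThread() {.thread.} =
      # Background thread broadcasting to every tracked connection.
      while true:
        sleep(60_000)
        {.gcsafe.}:
          withLock connectionsLock:
            for websocket in connections:
              websocket.send("heartbeat")

    var heartbeat: Thread[void]
    createThread(heartbeat, heartbeatThread)

    var router: Router
    router.get("/ws", upgradeHandler)

    let server = newServer(router, websocketHandler)
    server.serve(Port(8080))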
This is super encouraging to hear. I was really tempted recently to see if I could get away with putting some Nim into production at work, but I'm holding off for the moment.
One thing I didn't see last time I searched was Nim libraries for OpenTelemetry. If I can find the time I might try to work on an implementation.
Eventually the server will OOM. This is totally the same as async though, IMO, with caveats as always.
Unmanaged queues are usually a pattern to be avoided due to the bad failure mode (OOM), but also because clients can time out and generate unnecessary load. Often it's better to apply (incremental) backpressure, e.g. returning HTTP 429 early.
Unfortunately unmanaged queues are present in many async implementations.
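To make that concrete, here is a rough sketch of early backpressure in a Mummy handler. The in-flight counter, limit, and route are made up for illustration, and the handler API follows Mummy's README.

    import mummy, mummy/routers
    import std/atomics

    const maxInFlight = 1_000 # Hypothetical limit, purely for illustration.

    var inFlight: Atomic[int]

    proc workHandler(request: Request) =
      # Shed load early instead of letting an unbounded queue build up.
      if inFlight.load(moRelaxed) >= maxInFlight:
        var headers: HttpHeaders
        headers["Retry-After"] = "1"
        request.respond(429, headers, "Too Many Requests")
        return
      discard inFlight.fetchAdd(1, moRelaxed)
      try:
        # ... the actual (possibly blocking) work would happen here ...
        var headers: HttpHeaders
        headers["Content-Type"] = "text/plain"
        request.respond(200, headers, "OK")
      finally:
        discard inFlight.fetchSub(1, moRelaxed)

    var router: Router
    router.get("/work", workHandler)

    let server = newServer(router)
    server.serve(Port(8080))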
I want to highlight that this is based on the mummy webserver that does not use Nim's asyncdispatch, but instead uses OS threads, ORC memory management, and a custom HTTP parser. It's all forward-looking to Nim 2.0. It has a really cool model where all IO is handled asynchronously in one thread, but all the real CPU work is done in a fixed thread pool. This means that there is no "what color is your function" problem. Operations that are not async, such as DNS lookup, file reading, or any OS call, just work. This also means that this server can't really succumb to many of the performance issues that can happen to a single thread-per-connection server or an async-but-single-threaded server. It's a cool combination of both.
I also want to highlight that this type of WebSocket support is pretty rare. Most web servers treat WebSockets as an escape hatch, where they just hand you a TCP socket after they do the initial HTTP handshake. But not with mummy. WebSocket support is first class. Each message gets put into the thread work pool. New messages don't run before old messages are finished. There is no async reading loop. Everything is very fast.
The upcoming Nim 2.0, with --threads:on and --mm:orc as the new defaults, is uniquely suited to these performance gains. It's very cool to see real-world benchmarks. It's easy to have synthetic benchmarks win, but nothing beats real-world experience. This also shows just how many connections can be served, even with a modest VM. Maybe as an industry we should spend less time on cloud, Kubernetes, and microservices, and more time in the profiler optimizing things. Maybe then Twitter could actually run on one machine if it were written in Nim.
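For anyone who wants to try this before 2.0 lands, these flags opt in to the same setup on current Nim (assuming a main module named server.nim):

    nim c -d:release --threads:on --mm:orc server.nim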