Hello!
I have been playing with a new project, back to my ML roots. This time I'm writing an LLM inference server.
The goal is an inference server that is standalone, ideally a single binary, though in the meantime it will depend on PyTorch (it can link against the system PyTorch, a venv PyTorch, or even download and link the pure C++ libtorch; see how it is done here: https://github.com/mratsim/tattletale/blob/c477050/workspace/libtorch/vendor/libtorch.nim)
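For the venv case, one way to locate the libtorch bundled with a pip-installed PyTorch is simply to ask Python where the package lives. This is only a sketch of the idea; the actual detection and vendoring logic is in the libtorch.nim linked above:

```python
# Sketch only: locate the libtorch shared libraries bundled with a pip/venv PyTorch.
# PyTorch ships its C++ libraries under torch/lib inside the installed package.
import os
import torch

libtorch_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
print(libtorch_dir)  # e.g. .../site-packages/torch/lib -> pass as -L / rpath when linking
```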
Currently I have bidirectional Tattletale<->PyTorch tensor interop; the implementation turned out to be very simple:
https://github.com/mratsim/tattletale/blob/c477050/workspace/libtorch/src/tensors_py.nim
Just 150 LOC, license and comments included.
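For a rough idea of what zero-copy tensor interop looks like from the Python side, here is a minimal sketch of the general technique, not tattletale's actual mechanism (that lives in tensors_py.nim above); numpy stands in for the Nim-side tensor:

```python
# Sketch, not the actual tattletale implementation: zero-copy sharing between
# an "external" buffer and PyTorch. numpy plays the role of the Nim-side tensor.
import numpy as np
import torch

nim_side = np.arange(12, dtype=np.float32).reshape(3, 4)
as_torch = torch.from_numpy(nim_side)   # PyTorch view over the same memory, no copy

as_torch[0, 0] = 42.0
assert nim_side[0, 0] == 42.0           # both sides observe the write

# Going the other way, a raw pointer plus shape/strides/dtype is enough for the
# other runtime to build its own view of the torch tensor's storage.
ptr = as_torch.data_ptr()
shape, strides = tuple(as_torch.shape), as_torch.stride()
```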

I have also implemented almost end-to-end inference (only the conversion of token_ids back to strings is missing), fuzzed against huggingface/transformers, which also showcases calling Nim from Python.
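From the Python side, that kind of differential fuzzing looks roughly like this; the `tattletale.forward` call is a hypothetical placeholder, only the transformers/torch reference side is real API, and the model id is just an assumption:

```python
# Rough shape of a differential/fuzz test against huggingface/transformers.
# Only the transformers side is real API; the Nim-side call is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # assumption: any small causal LM works for the comparison

tok = AutoTokenizer.from_pretrained(MODEL_ID)
ref = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

prompt = "The quick brown fox"
input_ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    ref_logits = ref(input_ids).logits

# nim_logits = tattletale.forward(input_ids)   # placeholder: Nim engine exposed to Python
# torch.testing.assert_close(nim_logits, ref_logits, rtol=1e-4, atol=1e-4)
```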
Next steps? Too many to list here; making an inference server is quite complex, so I have a whole plan written up:
**https://github.com/mratsim/tattletale/issues/1**
I see the killer applications as:
Concurrency: llama.cpp and ik_llama.cpp don't cut it for concurrency. They process parallel requests in a very naive way: a fixed number of pre-reserved slots, and the LLM context each request gets is context_len / slots. So if you want to serve, say, 10 NPCs on a 131K-context model, each request only gets 13.1K of context, forever, even if in the current portion of your game there is only the one NPC boss and no other (see the toy sketch below).
And it's slow: ollama uses an outdated llama.cpp inside, but the architecture is the reason for the multiple-orders-of-magnitude difference: https://developers.redhat.com/articles/2025/08/08/ollama-vs-vllm-deep-dive-performance-benchmarking
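To make the slot arithmetic concrete, here is a toy sketch; the on-demand variant stands for the general continuous-batching / paged-KV idea, not any particular server's code:

```python
# Toy illustration of static slots vs on-demand context allocation.
CONTEXT_LEN = 131_072   # 131K-context model
SLOTS = 10              # pre-reserved parallel slots

# llama.cpp-style static slots: every request gets the same fixed slice,
# even when only one request is active.
per_slot = CONTEXT_LEN // SLOTS
print(per_slot)         # 13107 tokens, forever

# On-demand allocation: active requests share the window as needed,
# so a single NPC boss can use (almost) the full 131K.
def budget(active_requests: int) -> int:
    return CONTEXT_LEN // max(active_requests, 1)

print(budget(1))        # 131072
print(budget(10))       # 13107, but only while all 10 are actually active
```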
Sorry for the formatting; bullet points and images are going all over the place.
Just in case somebody missed it: this is my vibe-coded, fully native Nim project: