Hello!
I have been playing with a new project, back to my ML roots. This time I'm writing an LLM inference server.
The goal is to have an inference server that is standalone, ideally a single binary, though in the meantime it will depend on PyTorch (at the moment it can link to the system PyTorch, a venv PyTorch, or even download and link to the pure C++ libtorch). See how it is done: https://github.com/mratsim/tattletale/blob/c477050/workspace/libtorch/vendor/libtorch.nim
Currently I have a bidirectional Tattletale<->PyTorch interop for tensors; the implementation turned out to be very simple:
https://github.com/mratsim/tattletale/blob/c477050/workspace/libtorch/src/tensors_py.nim
Just 150 LOC, license and comments included.

And I have implemented almost end-to-end inference (only the conversion of token_ids to strings is missing), fuzzed against huggingface/transformers, which also showcases using Nim from Python.
Next steps? Too long to list here. I have a whole plan, since making an inference server is quite complex:
**https://github.com/mratsim/tattletale/issues/1**
I see the killer applications as:
Concurrency: llama.cpp and ik_llama.cpp don't cut it for concurrency. They process parallel requests in a very naive way: slots are pre-reserved and the LLM context each request gets is context_len / slots. So if you want to serve, say, 10 NPCs on a 131K-context model, you only have 13.1K per request, forever, even if a specific portion of your game only has the one boss NPC and no other.
And it's slow (ollama uses an outdated llama.cpp inside, but the architecture is the reason for the multiple orders of magnitude difference: https://developers.redhat.com/articles/2025/08/08/ollama-vs-vllm-deep-dive-performance-benchmarking)
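To make the slot arithmetic concrete, here is a minimal sketch. Both function names are hypothetical, and the "shared pool" model is a simplifying assumption (each request only holds the tokens it actually uses), not a description of how any particular server allocates its KV cache.

```python
def static_slot_context(context_len: int, slots: int) -> int:
    """Context available to each request when the window is pre-split into fixed slots."""
    return context_len // slots


def shared_pool_context(context_len: int, active_tokens: list[int]) -> int:
    """Tokens still free when all requests draw from one shared pool.

    Hypothetical model: every active request only holds the tokens it uses.
    """
    return context_len - sum(active_tokens)


if __name__ == "__main__":
    context_len = 131_000  # "131K", rounded as in the post
    print(static_slot_context(context_len, 10))       # 13100 per NPC, always
    print(shared_pool_context(context_len, [2_000]))  # 129000 left for the lone boss NPC
```

With fixed slots the per-request budget never changes; with a shared pool it depends on what is actually active at that moment.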
Sorry for the formatting, bullet points and images are going all over the place
Just in case somebody missed it; this is my vibe-coded full native Nim project:
Sorry if this is obvious, but is your objective to make a modern ollama / llama.cpp alternative? And under the hood it is based on PyTorch, which you eventually plan to link statically into a single-file program?
I have detailed my motivations here https://github.com/mratsim/tattletale/issues/1.
I'm actually not trying to replace ollama / llama.cpp as I don't use them; I use vLLM and SGLang because the performance is significantly better. In the past, llama.cpp in particular was missing tensor parallelism to shard computation across multiple GPUs, which is very significant since matrix multiplications are O(n³): if you shard the weights across 2 GPUs, each GPU does about half of the multiply-adds (n³/2) and reads half of the weights, instead of one GPU doing everything.
My issues with vLLM / SGLang are that they use "ancient" quantization techniques, have no builtin tools to assess quantization, and introduce a Python dependency hell, which was particularly annoying during the switch from Transformers 4 -> Transformers 5, with even models breaking, see: https://huggingface.co/MiniMaxAI/MiniMax-M2.5/discussions/48
So I want the concurrency, speed and parallelism of vLLM / SGLang, the SOTA quantization of Exllama v3, and the embeddability and portability of llama.cpp.
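As a sanity check on the tensor-parallel arithmetic, here is a small sketch that counts multiply-adds. It assumes one common layout (splitting the second matrix by columns, one block per device); real engines mix several layouts and add communication costs that are not modelled here. All names are made up for the example.

```python
def matmul(a, b):
    """Naive dense matmul on lists of lists; returns (result, multiply_add_count)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    ops = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]
                ops += 1
    return out, ops


def column_shards(b, parts):
    """Split b into `parts` contiguous column blocks (assumes cols % parts == 0)."""
    width = len(b[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in b] for p in range(parts)]


def sharded_matmul(a, b, parts):
    """Each simulated device multiplies a by its column block; rows are then concatenated."""
    partials = [matmul(a, shard) for shard in column_shards(b, parts)]
    out = [sum((res[i] for res, _ in partials), []) for i in range(len(a))]
    return out, [ops for _, ops in partials]


if __name__ == "__main__":
    n = 4
    a = [[i + j for j in range(n)] for i in range(n)]
    b = [[i * j + 1 for j in range(n)] for i in range(n)]
    full, full_ops = matmul(a, b)
    split, per_device_ops = sharded_matmul(a, b, parts=2)
    print(split == full, full_ops, per_device_ops)
```

The total work stays at n³ multiply-adds; sharding only divides it between devices.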
Ah, also there are things that are/were real red flags on the engineering side of vLLM and SGLang. vLLM was busy-looping while polling for input, pegging the CPU at 100% unless VLLM_SLEEP_WHEN_IDLE was set (and SGLang has a --sleep-when-idle=1 CLI flag). This was reported many times. The core devs justified busy-looping for "latency". vLLM finally removed it when someone complained that a modern IBM supercomputer was too expensive to busy-loop on, despite plenty of earlier bug reports (https://github.com/search?q=repo%3Avllm-project%2Fvllm+VLLM_SLEEP_WHEN_IDLE&type=issues)
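For readers unfamiliar with the difference, here is a generic sketch of the two styles using only the Python standard library. This is not vLLM's or SGLang's code; the function names and the `None` shutdown sentinel are assumptions made for the example.

```python
import queue
import threading


def blocking_worker(inbox: queue.Queue, results: list) -> None:
    """Sleeps inside inbox.get() until work arrives, so an idle server burns no CPU."""
    while True:
        item = inbox.get()  # blocks until another thread calls put()
        if item is None:    # hypothetical shutdown sentinel
            return
        results.append(item.upper())


def busy_poll_once(inbox: queue.Queue):
    """One iteration of the busy-loop style: returns immediately, even when idle."""
    try:
        return inbox.get_nowait()
    except queue.Empty:
        return None  # the caller spins and asks again, keeping a core busy


def run_demo(requests):
    """Feed requests to a blocking worker thread and collect its results."""
    inbox, results = queue.Queue(), []
    worker = threading.Thread(target=blocking_worker, args=(inbox, results))
    worker.start()
    for request in requests:
        inbox.put(request)
    inbox.put(None)
    worker.join()
    return results


if __name__ == "__main__":
    print(run_demo(["hello", "world"]))
```

The blocking version trades a wake-up latency (usually small) for an idle process that costs nothing.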
Ultimately I don't want to depend on PyTorch because while it has been quite easy to ship, see my download script: https://github.com/mratsim/tattletale/blob/master/workspace/libtorch/vendor/libtorch_installer.nim, it still is a huge dependency and it might lead to the same linking friction I had with Arraymancer and BLAS: https://github.com/mratsim/Arraymancer/issues/422
@Araq yes I'm aware of tinylama. I will likely add GGUF support at one point, I'm not too sure when.
@Clonk
Sounds like a cool project. I'd love to help where I can, if you have a link to a github with some issues (either here or on discord).
https://github.com/mratsim/tattletale/issues/1 has all the sub-issues. There are a lot of moving parts in the core, but some pieces are independent and will be needed later:
I have also added 2 models that are small and SOTA. However, they use tricky attention layers (Gated Delta Net and Sliding Window Attention) that will have implications for the KV cache. Currently that's fine: the KV cache is super naive, and having them now would help direct the refactoring:
@ingo
I have looked into SPIR-V, it's the IR behind Vulkan and OpenCL (and there is a SPIR-V OpenGL).
Unfortunately, generating SPIR-V directly looked painful last time I checked and needed a custom toolchain. Copying my investigation (2024, might be outdated): https://github.com/mratsim/constantine/issues/92
OpenCL
Generating OpenCL code through LLVM requires going through SPIR-V and loading the resulting kernel through clCreateProgramWithIL
SPIR-V is an experimental backend starting from LLVM 15 and likely needs to be configured through LLVM_EXPERIMENTAL_TARGETS_TO_BUILD (see https://stackoverflow.com/questions/46905464/how-to-enable-a-llvm-backend, https://reviews.llvm.org/D115009 )
Alternatively there is https://github.com/KhronosGroup/SPIRV-LLVM-Translator but it would require compiling Nim in C++ mode.
In Rust land there is this very interesting project: https://github.com/charles-r-earp/krnl. It uses SPIR-V but depended on https://github.com/EmbarkStudios/rust-gpu/tree/main/crates/spirv-builder, which besides being archived needed a very specific Rust nightly build from 2022 or 2023.
And from JuliaGPU projects it seems they need 3 dependencies: https://github.com/JuliaGPU/GPUCompiler.jl/blob/v1.11.1/src/spirv.jl#L1-L5
# https://github.com/llvm/llvm-project/blob/master/clang/lib/Basic/Targets/SPIR.h
# https://github.com/KhronosGroup/LLVM-SPIRV-Backend/blob/master/llvm/docs/SPIR-V-Backend.rst
# https://github.com/KhronosGroup/SPIRV-LLVM-Translator/blob/master/docs/SPIRVRepresentationInLLVM.rst
On a similar annoyance, LLVM can in theory generate Apple Metal IR directly, but it's undocumented and later LLVM IR versions are incompatible with the LLVM IR version Apple uses.
And it's the same thing for Nvidia: NVVM IR is based on an old LLVM IR version, and in theory you could generate NVVM IR from LLVM and optimize it with Nvidia NVVM.
See https://github.com/JuliaLLVM/llvm-downgrade and https://github.com/JuliaGPU/GPUCompiler.jl
Ultimately, given that Nvidia Cuda, AMD HIP and Apple Metal are really C-inspired, it should be possible to have Nim generate them directly, possibly with an indication like {.backend: "cuda".} at the top of the file to mark that special rules are in place.
Some progress in the past 2 weeks.
I have added support for EXL3 quantization, a state-of-the-art quantization scheme that uses trellis coding, lattice codebooks and random Hadamard rotations (it predates TurboQuant's polar rotations) and that currently has higher quality per bit than any other, at the cost of needing more compute (but token generation is memory-bound, so the cost is absorbed on GPU).
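To give an idea of what a random Hadamard rotation does, here is a generic sketch of just that one ingredient: random sign flips followed by an orthonormal Walsh-Hadamard transform. This is not EXL3's actual code, the trellis and codebook parts are left out entirely, and all function names are made up for the example.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; the length must be a power of two."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = out[i], out[i + h]
                out[i], out[i + h] = x + y, x - y
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def random_signs(n, seed):
    """Reproducible vector of +1/-1 entries."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rotate(weights, signs):
    """Random Hadamard rotation: flip signs, then apply the Hadamard transform."""
    return fwht([w * s for w, s in zip(weights, signs)])


def unrotate(rotated, signs):
    """Inverse rotation: the orthonormal Hadamard matrix is its own inverse."""
    return [v * s for v, s in zip(fwht(rotated), signs)]


if __name__ == "__main__":
    weights = [0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # one large outlier
    signs = random_signs(len(weights), seed=0)
    rotated = rotate(weights, signs)
    print(max(abs(v) for v in rotated))  # the outlier is spread over all entries
    print(unrotate(rotated, signs))      # recovers the input up to rounding
```

The point of the rotation is that a single large outlier gets spread evenly across the vector, which makes the values friendlier to quantize, and the rotation can be undone exactly.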
I have implemented "IntrusiveAttention", which could very well be the fastest KV cache out there. It enables continuous batching and arbitrary interleaving of concurrent queries. It is based on WAVL trees (an alternative to AVL trees and Red-Black trees), to which I added the capability to do longest prefix match. It comes with a Lean4 formalization and a certain number of correctness proofs.
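To illustrate why an ordered structure is enough for longest prefix match, here is a toy sketch that swaps the WAVL tree for a plain sorted list with `bisect`. It is not the IntrusiveAttention implementation: the class and method names are hypothetical, and it assumes "longest prefix match" means finding the cached token sequence that shares the most leading tokens with the query.

```python
import bisect


def common_prefix_len(a, b):
    """Number of leading tokens shared by sequences a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixIndex:
    """Ordered set of token sequences; a sorted list stands in for a balanced tree."""

    def __init__(self):
        self._keys = []  # sorted list of tuples of token ids

    def insert(self, tokens):
        key = tuple(tokens)
        pos = bisect.bisect_left(self._keys, key)
        if pos == len(self._keys) or self._keys[pos] != key:
            self._keys.insert(pos, key)

    def longest_prefix_match(self, tokens):
        """Return (cached_key, shared_len) maximizing the shared prefix with tokens.

        In lexicographic order the best match is always a neighbour of the
        query's insertion point, so only two candidates need checking.
        """
        key = tuple(tokens)
        pos = bisect.bisect_left(self._keys, key)
        best = (None, 0)
        for candidate in self._keys[max(pos - 1, 0):pos + 1]:
            shared = common_prefix_len(candidate, key)
            if shared > best[1]:
                best = (candidate, shared)
        return best


if __name__ == "__main__":
    index = PrefixIndex()
    for cached in ([1, 2, 3, 4], [1, 2, 9], [7, 7]):
        index.insert(cached)
    print(index.longest_prefix_match([1, 2, 3, 5]))  # ((1, 2, 3, 4), 3)
```

A balanced tree gives the same neighbour lookup in O(log n) while also keeping inserts and deletes at O(log n), which a sorted list does not.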
I added a compiler from a Nim subset (i.e. writing Nim like C but with generics) to Cuda, OpenCL, Vulkan and WebGPU. The compiler works at compile-time via macros. The result can be used either via an AOT compiler like nvcc to build an object file or a library, or directly at runtime. The runtimes for all backends (NVRTC, OpenCL runtime, Vulkan runtime, wgpu) are all tested with a simple vector addition. (Vulkan descriptors are so so annoying ...)
And finally, Nvidia released TileIR in Cuda 13.1 (and more officially last week), an MLIR dialect that treats tiles as first-class citizens. Tiles are the best representation to date for heavy array/matrix/tensor computing: https://github.com/mratsim/laser/blob/d310294/laser/primitives/matrix_multiplication/gemm_tiling.nim#L61-L146. Unfortunately, according to the docs the Cuda driver is supposed to be able to load generated TileIR directly, but that doesn't seem to work, so I have to go through an external assembler that may or may not be on devs' and users' machines. And given that it's also Nvidia-specific (unless I implement a TileIR compiler), I stashed that stream of work for now.
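For those who have not met tiling before, here is a bare-bones sketch of a tiled matrix multiplication. It only shows the loop structure (work on small blocks that fit in fast memory); it is neither TileIR nor the laser implementation linked above, and the function name and default tile size are arbitrary choices for the example.

```python
def tiled_matmul(a, b, tile=2):
    """C = A @ B computed tile by tile on lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            for k0 in range(0, inner, tile):
                # Update one (tile x tile) block of C from one block of A and one of B.
                for i in range(i0, min(i0 + tile, rows)):
                    for j in range(j0, min(j0 + tile, cols)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, inner)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(tiled_matmul(a, b))  # [[58, 64], [139, 154]]
```

The result is identical to the naive triple loop; only the order in which memory is touched changes, which is what a tile-level IR lets a compiler reason about directly.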
(Vulkan descriptors are so so annoying ...)
Use Buffer Device Address