nimforum mirror - Tattletale - bidirection PyTorch<->Nim + Sketch of LLM Inference

mratsim (original) [2026-05-15T18:23:30+02:00]

Hello !

I have been playing with a new project, back to my ML roots. This time I'm writing an LLM inference server.

The goal is a standalone inference server, ideally a single binary. In the meantime it depends on PyTorch, but it can link to the system PyTorch, a venv PyTorch, or even download and link to the pure C++ libtorch. See how it is done: https://github.com/mratsim/tattletale/blob/c477050/workspace/libtorch/vendor/libtorch.nim
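To illustrate the "venv PyTorch" case, here is a minimal Python sketch of how a build script could locate the shared libraries shipped inside an installed PyTorch wheel. It is not taken from `libtorch.nim`; the function name `find_torch_lib_dir` is hypothetical, and the sketch assumes the usual wheel layout where the native libraries live in a `lib` directory next to `torch/__init__.py`.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_torch_lib_dir() -> Optional[Path]:
    """Return the `torch/lib` directory of the active interpreter's PyTorch, if any.

    Hypothetical helper: it only inspects the import metadata and never
    imports torch itself, so it is cheap to call from a build step.
    """
    try:
        spec = importlib.util.find_spec("torch")
    except (ImportError, ValueError):
        return None
    if spec is None or spec.origin is None:
        return None
    # Assumed layout: <site-packages>/torch/__init__.py and <site-packages>/torch/lib/
    lib_dir = Path(spec.origin).parent / "lib"
    return lib_dir if lib_dir.is_dir() else None


if __name__ == "__main__":
    print(find_torch_lib_dir())
```

Run inside a venv, this prints that venv's `torch/lib` path, or `None` when PyTorch is not installed, in which case a fallback such as downloading libtorch would apply.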

Currently I have bidirectional Tattletale<->PyTorch interop for tensors, and the implementation turned out to be very simple:

https://github.com/mratsim/tattletale/blob/c477050/workspace/libtorch/src/tensors_py.nim

Just 150 LOC, license and comments included.

Tests from the Nim side, calling Python: https://github.com/mratsim/tattletale/blob/c477050/workspace/libtorch/tests/python_integration/test_tensor_bridge.nim

And I have implemented almost end-to-end inference (only the conversion of token_ids to strings is missing). It is fuzzed against huggingface/transformers, which also showcases calling Nim from Python:

https://github.com/mratsim/tattletale/blob/c477050/workspace/transformers/tests/test_vs_hf_transformers.py
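The general idea of fuzzing one implementation against a reference can be sketched in a few lines of Python. This is not the linked test: the two softmax functions below are stand-ins for "reference" (huggingface/transformers in the real test) and "implementation under test" (the Nim code), and `fuzz_compare` is a hypothetical helper name.

```python
import math
import random


def reference_softmax(logits):
    """Straightforward softmax, standing in for the reference implementation."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_softmax(logits):
    """Max-subtracted softmax, standing in for the implementation under test."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuzz_compare(reference, candidate, trials=200, seed=0, tol=1e-9):
    """Feed identical random inputs to both functions.

    Raises AssertionError on the first mismatch above `tol`,
    otherwise returns the worst absolute difference observed.
    """
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    worst = 0.0
    for _ in range(trials):
        n = rng.randint(1, 32)
        logits = [rng.uniform(-10.0, 10.0) for _ in range(n)]
        expected = reference(logits)
        actual = candidate(logits)
        assert len(expected) == len(actual)
        for e, a in zip(expected, actual):
            diff = abs(e - a)
            worst = max(worst, diff)
            assert diff <= tol, f"mismatch on input {logits}"
    return worst


if __name__ == "__main__":
    print(fuzz_compare(reference_softmax, candidate_softmax))
```

Seeding the generator matters more than the specific functions: a failing random input can then be replayed exactly when debugging.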

Next steps? Too long to list here. I have a whole plan, since making an inference server is quite complex:

**https://github.com/mratsim/tattletale/issues/1**

I see the killer applications as:

No Python dependency hell: even with Docker it's a pain to manage certain dependencies (looking at you, FlashAttention, causal_conv1d, xformers and transformers, which pin incompatible versions and need an alignment of the planets).

Embeddability: I want the inference engine to be able to run on phones or even be integrated into a game engine, so you could have autogenerated NPC dialogue (I heard Gemma4-E2B is very smart for its size).

Portability: obviously I'll target Nvidia GPUs, but I want the code to run via Vulkan, Metal, WebGPU and OpenCL as well.

Concurrency: llama.cpp and ik_llama.cpp don't cut it for concurrency. They process parallel requests in a very naive way, with pre-reserved slots, so the LLM context each request gets is context_len / slots. If you want to serve, say, 10 NPCs on a 131K context model, you only have 13.1K per request, forever, even if a specific portion of your game has only the one NPC boss and no other.

And it's slow (ollama uses an outdated llama.cpp inside, but the architecture is the reason for the multiple orders of magnitude difference: https://developers.redhat.com/articles/2025/08/08/ollama-vs-vllm-deep-dive-performance-benchmarking)
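The slot arithmetic from the concurrency point, and one possible alternative, can be sketched in Python. The first function is the context_len / slots split described above. `BlockPool` is a hypothetical on-demand allocator in the spirit of block-based KV caches; it is an assumption for illustration, not something the plan commits to, and "131K" is taken as 2**17 tokens.

```python
def static_slot_context(context_len: int, slots: int) -> int:
    """Context available to each request when the window is split into fixed slots."""
    return context_len // slots


class BlockPool:
    """Hypothetical on-demand allocator: requests take fixed-size blocks only when needed."""

    def __init__(self, context_len: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = context_len // block_size
        self.owned = {}  # request id -> number of blocks held

    def grow(self, request_id: str, tokens: int) -> bool:
        """Reserve blocks for `tokens` more tokens; return False if the pool is exhausted.

        Simplification: every call rounds up to whole blocks and ignores
        spare room left in a block reserved by an earlier call.
        """
        needed = -(-tokens // self.block_size)  # ceiling division
        if needed > self.free_blocks:
            return False
        self.free_blocks -= needed
        self.owned[request_id] = self.owned.get(request_id, 0) + needed
        return True

    def release(self, request_id: str) -> None:
        """Return all blocks held by a finished request to the pool."""
        self.free_blocks += self.owned.pop(request_id, 0)


if __name__ == "__main__":
    CONTEXT = 131_072  # assumed: "131K" taken as 2**17
    print(static_slot_context(CONTEXT, 10))  # 13107 tokens per NPC, always
    pool = BlockPool(CONTEXT)
    print(pool.grow("boss", 100_000))  # True: a lone boss can use most of the window
```

With on-demand blocks, the single boss NPC can use most of the window while it is alone, and the space returns to the pool once `release` is called.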

Excellent quantizations: unfortunately the heavyweight inference engines, vLLM and SGLang, are still using quants that are ancient by AI standards (GPTQ, AWQ) and very inflexible (4-bit, 8-bit, 16-bit), while SOTA quants use trellis quantization (ik_llama and Exllamav3), random rotations (which predate turboquant by 2 years) and lattice codebooks.
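To give a feel for why rotating weights before quantizing can help, here is a toy Python sketch. It is an assumption-laden illustration, not any engine's actual scheme: a fixed, normalized 4x4 Hadamard matrix stands in for a random rotation, the quantizer is plain symmetric absmax at 4 bits, and the weight vector is hand-picked to contain one outlier. Trellis quantization and lattice codebooks are not shown.

```python
def quantize_absmax(values, bits=4):
    """Symmetric absmax quantization: signed integer codes sharing one scale.

    Assumes at least one value is nonzero.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]


# Normalized 4x4 Hadamard matrix: orthogonal and symmetric, so it is its own inverse.
H4 = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]


def rotate(values):
    """Multiply a length-4 vector by H4 (applying it twice restores the input)."""
    return [sum(h * v for h, v in zip(row, values)) for row in H4]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def roundtrip_errors(weights):
    """Return (error without rotation, error with rotation) after 4-bit quantization."""
    direct = dequantize(*quantize_absmax(weights))
    rotated_back = rotate(dequantize(*quantize_absmax(rotate(weights))))
    return squared_error(weights, direct), squared_error(weights, rotated_back)


if __name__ == "__main__":
    # One large outlier forces a coarse scale that rounds the small weights to zero.
    print(roundtrip_errors([8.0, 0.3, -0.2, 0.1]))
```

On this particular vector the squared error drops from roughly 0.14 to roughly 0.06, because the rotation spreads the outlier across all four coordinates. That is a property of this example, not a general guarantee.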

Ultimately: a single binary that can be copied. For now we link to PyTorch though; maybe that changes when I sort out my compiler stuff from Constantine, as I have there:
- Nvidia, AMDGPU, x86 and ARM JIT codegen via direct LLVM IR, AMD example: https://github.com/mratsim/constantine/blob/e6bee85/tests/gpu/hello_world_amdgpu.nim#L105-L211
- Nvidia and WebGPU codegen via Nim -> compile-time GPU AST -> CUDA or WebGPU code, which can be checked into a repo and compiled alongside Nim, or compiled at runtime via NVRTC: https://github.com/mratsim/constantine/blob/e6bee85/tests/gpu/t_nvrtc_bigint_example.nim#L16

Sorry for the formatting, bullet points and images are going all over the place

Mirror of forum.nim-lang.org

13917 :: Tattletale - bidirection PyTorch<->Nim + Sketch of LLM Inference