Since I spent so much time on it (2 full hours) I want to see it get some recognition:
https://github.com/araq/tinylama
PRs accepted, bug reports will be ignored.
Possible prompts:
@Araq There is no SIMD support in this version; I think you didn't add it for simplicity's sake.
Since I want llama3 inference that actually runs much faster, I wrote an inference engine independent of yours. Its code is not AI-made, apart from porting logic from other examples. It still has some warnings, and since I haven't fully learned how to use Malebolgia, there is a lot of thread swapping, which may be hurting its performance.
Overall I tested it with Q4 and Q8 models (1B and 3B for now); it works well, but is still 3-4x slower than llama.cpp on the same hardware, mostly due to memory bandwidth and some incomplete work.
For SIMD I used the nimsimd library.
The inference also supports several other useful options, such as a max-tokens limit, a system prompt, and more.
Overall this is heavily influenced by my earlier work in C#, where I made a similar SIMD inference engine.
It can be used like this: !nim c -d:danger -r --passC:"-march=native -mfma -mavx2 -ffast-math" --threads:on llama3.nim --model:Llama-3.2-1B-Instruct-Q8_0.gguf --prompt:"why is sky blue?" --max-tokens:128
I haven't compared it with your (AI-written) one, but I think it might be faster since this one uses SIMD. SIMD did make it much faster for the 1B model, though less so for the 3B one. I tried to focus on support for the Llama 3 architecture.
And since I don't have access to a good CPU, I tested it in Colab, which has a very weak CPU (2 cores); I was getting around 2-3 tokens/s there.
Still, the major drawbacks of mine are the threading and bandwidth issues, whereas llama.cpp loads the model fully into RAM.