Greetings!
I recently built pdfocr, a tool designed to perform highly accurate OCR on PDFs. My primary use case has been parsing lecture slides, which are notorious for complex layouts packed with pictures and diagrams.
You can use it as a standalone external tool, or call it via an LLM in an agentic environment (which is exactly how I use it for my own exam preparation).
**Under the Hood**

The tool is written entirely in Nim and is built on top of a few of my own custom libraries:
For the actual OCR model, I’m using olmOCR 2. Based on my testing, it reads complex lecture slides much better than alternatives like Chandra OCR, and it is significantly cheaper to run.
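For anyone curious what a call to the model looks like: olmOCR 2 can be served behind an OpenAI-compatible endpoint (e.g. via vLLM), and each page is just a chat completion with the rendered page image inlined as a data URL. A minimal Python sketch of building such a request payload — the prompt wording and endpoint details here are my assumptions, not pdfocr's actual code:

```python
import base64

def build_ocr_request(png_bytes: bytes,
                      model: str = "allenai/olmOCR-2-7B-1025") -> dict:
    """Build an OpenAI-compatible chat payload with a page image inlined.

    The prompt text is illustrative; the real prompt pdfocr uses may differ.
    """
    data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page to Markdown."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }
```

You would POST this to the server's `/v1/chat/completions` route, one request per rendered page.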
**Performance & Availability**

I’ve done extensive comparisons with other approaches, and pdfocr consistently runs faster with a smaller memory footprint.
I currently offer pre-compiled binary builds for three platforms. You can check it out here: https://github.com/planetis-m/pdfocr
**Looking Forward**

I believe this has the potential to be integrated into a larger app or turned into a viable commercial product. However, I am a developer first and don’t currently have the bandwidth or business expertise to take it to market. If you are interested in partnering up to turn this into a product, please reach out; I’d love to chat!
How about the comparison with glm-ocr?
| Model | Coverage | CER | WER | ReadingOrderF1 | MathF1 | TokenRecall | CharLCSRecall | Cost (USD) | RobustBalanced |
|---|---|---|---|---|---|---|---|---|---|
| allenai/olmOCR-2-7B-1025 | 1.000 | 0.4682 | 0.4893 | 0.3252 | 0.6743 | 0.8146 | 0.8496 | 0.0214 | 0.5232 |
| PaddlePaddle/PaddleOCR-VL-0.9B | 1.000 | 0.4634 | 0.4732 | 0.3751 | 0.7248 | 0.7678 | 0.7924 | 0.0420 | 0.4945 |
| deepseek-ai/DeepSeek-OCR | 1.000 | 0.5862 | 0.6512 | 0.1452 | 0.4684 | 0.6857 | 0.7235 | 0.0041 | 0.6540 |
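For reference, the CER/WER columns in the table above are edit-distance ratios: the number of Levenshtein edits between the model output and the reference transcription, divided by the reference length (characters for CER, word tokens for WER), so lower is better. A quick sketch of the character-level version:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)
```

The benchmark harness may normalize whitespace or punctuation before scoring; the sketch above is the unnormalized core metric.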
I tried to include GLM-OCR, but I was out of credits; I’ll see if I renew my subscription.
Well, I’ll probably be the only one who ends up using this, but here it is: <https://github.com/planetis-m/study-assistant>
Based on the pdfocr utility, this is an agent skill designed for exam preparation.
Unlike NotebookLM, it offers the following:
Other conveniences:
Main features:
Have you tried minerU? https://huggingface.co/spaces/opendatalab/MinerU
I used it locally and I was impressed.
If you’ve used NotebookLM, you know how terrible the “podcast” mode is. It’s basically 10 minutes of repeated AI nonsense and filler words with very little grounding in the source material.
So I built an agent skill: https://github.com/planetis-m/tts-assistant . It preprocesses the input so it’s easier to listen to: URLs aren’t spelled out letter by letter, math formulas are spoken naturally, and the result is written out as an Opus audio file.
It’s written in Nim and uses a cheap TTS model from DeepInfra by default. This is part of my final-year project, so I’d appreciate any feedback if you try it.
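The “easier to listen to” part boils down to text normalization before synthesis. A toy Python sketch of the idea — the actual tts-assistant rules are in Nim and more thorough; these particular regexes and replacements are just illustrative:

```python
import re

def normalize_for_tts(text: str) -> str:
    """Rewrite text so a TTS voice reads it naturally instead of spelling it out."""
    # Replace URLs with a spoken placeholder instead of reading them char by char.
    text = re.sub(r"https?://\S+", "a link", text)
    # Speak simple math notation as words.
    replacements = {"+": " plus ", "=": " equals ", "^2": " squared "}
    for sym, spoken in replacements.items():
        text = text.replace(sym, spoken)
    # Collapse the extra whitespace introduced above.
    return re.sub(r"\s+", " ", text).strip()
```

A real normalizer would also handle abbreviations, numbers, and units, but the pattern is the same: rewrite before synthesizing.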
LLM-based optical decoding doesn't look very low memory at all to me. The agent harness making calls might be, but the pipeline still requires a chunky GPU.
Thanks for the question, @icedquinn, this gives me an opportunity to expand on it.
Based on my benchmarks, it is lower memory: about 17.1% less compared to the implementation in this tmp repo (which uses async dispatch): https://github.com/planetis-m/pdfocr-tmp
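Peak-memory numbers like this are typically taken from the process’s high-water resident set size. My harness is in Nim, but in Python terms the measurement looks roughly like this (note `ru_maxrss` is in KiB on Linux and bytes on macOS):

```python
import resource

def peak_rss_kib() -> int:
    """Peak resident set size of this process so far (KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Grow memory a bit, then read the high-water mark again.
before = peak_rss_kib()
buf = bytearray(50 * 1024 * 1024)  # ~50 MiB, zero-filled so pages are touched
after = peak_rss_kib()
```

Because it’s a high-water mark, the counter never decreases, which makes it a stable basis for comparing two implementations of the same workload.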
Originally, I was using Tesseract (which, as far as I know, is LSTM-based) running locally on the CPU.
I even added instructions for the model to selectively perform OCR only on certain pages:
> After extraction, identify pages with minimal educational content (only titles, captions, or page numbers). Use `read_by_ocr` only on those specific pages...
In practice, this wasn’t reliable. Sometimes it would OCR the entire PDF anyway, which ended up hogging system resources.
In terms of accuracy, Tesseract is quite dated. I remember cases (e.g., in AI coursework with search trees) where it performed poorly, and the downstream LLM would hallucinate incorrect information as a result.
So I’ve gone through several iterations, and my current setup is split as follows:
- Agent: `study-assistant`
- Tools: `ocr-tool`, `tts-tool`, `rag-tool`
This structure helps keep things separated and means you could instruct the agent to use other tools, like pdftotext instead of OCR. It also includes a lightweight memory component for storing entire books, which I’m currently experimenting with. That said, it could definitely benefit from a GUI.
I like how it works because it allows for better explanations of the material while incorporating multiple sources. For example, here’s a recent usage:
Study assistant lecture mode on @Introduction to Network Security.pdf using ocr-tool
(assistant provides output)
essay mode
Explain how the Sony attack was performed according to the slides, and include page references.
ELI5 mode on the CIA triad. Additionally, do not use analogies.
Use rag-tool to search 'Network Security Essentials.pdf' for additional information on the Pegasus exploit.
Not to mention, there are already a few skills that do similar things, such as Scientific Skills (MarkItDown) and Ollama DeepSeek OCR Tool.
They’re all written in Python, but I consider Nim a better alternative: you don’t need to set up Docker to manage dependencies, and you get a single executable with everything bundled. It doesn’t pull in gigabytes of dependencies, it’s easy to modify, and the language itself is clear and expressive.