Greetings!
I recently built pdfocr, a tool designed to perform highly accurate OCR on PDFs. My primary use case has been parsing lecture slides, which are notorious for complex layouts packed with pictures and diagrams.
You can use it as a standalone external tool, or call it via an LLM in an agentic environment (which is exactly how I use it for my own exam preparation).
**Under the Hood**

The tool is written entirely in Nim and is built on top of a few of my own custom libraries:
For the actual OCR model, I’m using olmOCR 2. Based on my testing, it reads complex lecture slides much better than alternatives like Chandra OCR, and it is significantly cheaper to run.
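For anyone curious what a call to the model looks like: olmOCR 2 can be served behind an OpenAI-compatible endpoint (e.g. via vLLM), and each page is just a chat completion with the rendered page image inlined as a data URL. A minimal Python sketch of building such a request payload — the prompt wording and endpoint details here are my assumptions, not pdfocr's actual code:

```python
import base64

def build_ocr_request(png_bytes: bytes,
                      model: str = "allenai/olmOCR-2-7B-1025") -> dict:
    """Build an OpenAI-compatible chat payload with a page image inlined.

    The prompt text is illustrative; the real prompt pdfocr uses may differ.
    """
    data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page to Markdown."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }
```

You would POST this to the server's `/v1/chat/completions` route, one request per rendered page.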
**Performance & Availability**

I’ve done extensive comparisons with other approaches, and pdfocr consistently runs faster with a smaller memory footprint.
I currently offer pre-compiled binary builds for three platforms. You can check it out here: https://github.com/planetis-m/pdfocr
**Looking Forward**

I believe this has the potential to be integrated into a larger app or turned into a viable commercial product. However, I am a developer first and don’t currently have the bandwidth or business expertise to take it to market. If you are interested in partnering up to turn this into a product, please reach out; I’d love to chat!
How about the comparison with glm-ocr?
| Model | Coverage | CER | WER | ReadingOrderF1 | MathF1 | TokenRecall | CharLCSRecall | Cost (USD) | RobustBalanced |
|---|---|---|---|---|---|---|---|---|---|
| allenai/olmOCR-2-7B-1025 | 1.000 | 0.4682 | 0.4893 | 0.3252 | 0.6743 | 0.8146 | 0.8496 | 0.0214 | 0.5232 |
| PaddlePaddle/PaddleOCR-VL-0.9B | 1.000 | 0.4634 | 0.4732 | 0.3751 | 0.7248 | 0.7678 | 0.7924 | 0.0420 | 0.4945 |
| deepseek-ai/DeepSeek-OCR | 1.000 | 0.5862 | 0.6512 | 0.1452 | 0.4684 | 0.6857 | 0.7235 | 0.0041 | 0.6540 |
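For reference, the CER/WER columns in the table above are edit-distance ratios: the number of Levenshtein edits between the model output and the reference transcription, divided by the reference length (characters for CER, word tokens for WER), so lower is better. A quick sketch of the character-level version:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)
```

The benchmark harness may normalize whitespace or punctuation before scoring; the sketch above is the unnormalized core metric.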
I tried to include GLM-OCR, but I was out of credits; I’ll see if I renew my subscription.
Well, I’ll probably be the only one who ends up using this, but here it is: <https://github.com/planetis-m/study-assistant>
Based on the pdfocr utility, this is an agent skill designed for exam preparation.
Unlike NotebookLM, it offers the following:
Other conveniences:
Main features:
Have you tried minerU? https://huggingface.co/spaces/opendatalab/MinerU
I used it locally and I was impressed.
If you’ve used NotebookLM, you know how terrible the “podcast” mode is. It’s basically 10 minutes of repeated AI nonsense and filler words with very little grounding in the source material.
So I built an agent skill: https://github.com/planetis-m/tts-assistant . It preprocesses the input so it’s easier to listen to: URLs aren’t spelled out letter by letter, math formulas are spoken naturally, and the result is written out as an Opus audio file.
It’s written in Nim and uses a cheap TTS model from DeepInfra by default. This is part of my final-year project, so I’d appreciate any feedback if you try it.
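The “easier to listen to” part boils down to text normalization before synthesis. A toy Python sketch of the idea — the actual tts-assistant rules are in Nim and more thorough; these particular regexes and replacements are just illustrative:

```python
import re

def normalize_for_tts(text: str) -> str:
    """Rewrite text so a TTS voice reads it naturally instead of spelling it out."""
    # Replace URLs with a spoken placeholder instead of reading them char by char.
    text = re.sub(r"https?://\S+", "a link", text)
    # Speak simple math notation as words.
    replacements = {"+": " plus ", "=": " equals ", "^2": " squared "}
    for sym, spoken in replacements.items():
        text = text.replace(sym, spoken)
    # Collapse the extra whitespace introduced above.
    return re.sub(r"\s+", " ", text).strip()
```

A real normalizer would also handle abbreviations, numbers, and units, but the pattern is the same: rewrite before synthesizing.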
LLM-based optical decoding doesn't look very low memory at all to me. The agent harness making calls might be, but the pipeline still requires a chunky GPU.
Thanks for the question, @icedquinn, this gives me an opportunity to expand on it.
Based on my benchmarks, it is lower memory: about 17.1% less compared to the implementation in this tmp repo (which uses async dispatch): https://github.com/planetis-m/pdfocr-tmp
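Peak-memory numbers like this are typically taken from the process’s high-water resident set size. My harness is in Nim, but in Python terms the measurement looks roughly like this (note `ru_maxrss` is in KiB on Linux and bytes on macOS):

```python
import resource

def peak_rss_kib() -> int:
    """Peak resident set size of this process so far (KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Grow memory a bit, then read the high-water mark again.
before = peak_rss_kib()
buf = bytearray(50 * 1024 * 1024)  # ~50 MiB, zero-filled so pages are touched
after = peak_rss_kib()
```

Because it’s a high-water mark, the counter never decreases, which makes it a stable basis for comparing two implementations of the same workload.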
Originally, I was using Tesseract (which, as far as I know, is LSTM-based) running locally on the CPU.
I even added instructions for the model to selectively perform OCR only on certain pages:
> After extraction, identify pages with minimal educational content (only titles, captions, or page numbers). Use `read_by_ocr` only on those specific pages...
In practice, this wasn’t reliable. Sometimes it would OCR the entire PDF anyway, which ended up hogging system resources.
In terms of accuracy, Tesseract is quite dated. I remember cases (e.g., in AI coursework with search trees) where it performed poorly, and the downstream LLM would hallucinate incorrect information as a result.
So I’ve gone through several iterations, and my current setup is split as follows:
- Agent: `study-assistant`
- Tools: `ocr-tool`, `tts-tool`, `rag-tool`
This structure helps keep things separated and means you could instruct the agent to use other tools, like pdftotext instead of OCR. It also includes a lightweight memory component for storing entire books, which I’m currently experimenting with. That said, it could definitely benefit from a GUI.
I like how it works because it allows for better explanations of the material while incorporating multiple sources. For example, here’s a recent usage:
Study assistant lecture mode on @Introduction to Network Security.pdf using ocr-tool
(assistant provides output)
essay mode
Explain how the Sony attack was performed according to the slides, and include page references.
ELI5 mode on the CIA triad. Additionally, do not use analogies.
Use rag-tool to search 'Network Security Essentials.pdf' for additional information on the Pegasus exploit.
Not to mention, there are already a few skills that do similar things, such as Scientific Skills (MarkItDown) and Ollama DeepSeek OCR Tool.
They’re all written in Python, but I consider Nim a better alternative: you don’t need to set up Docker to manage dependencies, and you get a single executable with everything bundled. It doesn’t pull in gigabytes of dependencies, it’s easy to modify, and the language itself is clear and expressive.