The API doesn't look too complex. I would recommend doing it yourself; you will be grateful afterwards.
You can look at c2nim, nimterop, futhark, cinterop. The only one that I haven't tried so far is Futhark (but I will).
I tried with c2nim but it didn't work.
And when this happens, the last thing you should do is to read its manual, tinker with the C code and gain deep understanding. :P
With c2nim, my experience goes as follows. You normally run it like this:
$ c2nim --header capi.h
It worked! Well, not really. For this header you get everything commented out. You need to know where it fails, and for that you should use --strict:
$ c2nim --header --strict capi.h
capi.h(150, 22) Error: token expected: ; but got: *
The parser is not very helpful here. The problem is that it found something it was not expecting: in this case, the TESS_API macro, which the C preprocessor would normally expand to something else. If you simply remove it everywhere in the file, you will see that c2nim creates the bindings for you.
It is annoying having to modify the header manually. You can avoid that by creating a file tesseract.c2nim:
#ifdef C2NIM
#def TESS_API
#endif
(explained here)
Then you simply execute:
c2nim --header --strict tesseract.c2nim capi.h
with capi.h left unmodified, and you get your bindings.
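For a rough idea of what comes out the other end, the generated module is mostly importc procs mirroring capi.h. The snippet below is a hand-written sketch of what that tends to look like, not actual c2nim output; the dynlib name and the pragma details are assumptions you'd adjust for your platform and options (the function names themselves come from the real capi.h):

# Sketch only; real c2nim output differs depending on the options used.
const tessLib = "libtesseract.so"   # assumption: use the right name/extension for your OS

type
  TessBaseAPI* = object             # opaque handle declared in capi.h

proc TessBaseAPICreate*(): ptr TessBaseAPI {.importc, cdecl, dynlib: tessLib.}
proc TessBaseAPIDelete*(handle: ptr TessBaseAPI) {.importc, cdecl, dynlib: tessLib.}
proc TessBaseAPIInit3*(handle: ptr TessBaseAPI;
                       datapath, language: cstring): cint {.importc, cdecl, dynlib: tessLib.}
proc TessBaseAPIGetUTF8Text*(handle: ptr TessBaseAPI): cstring {.importc, cdecl, dynlib: tessLib.}

when isMainModule:
  # Tiny smoke test: a nil datapath means the default tessdata location,
  # and "eng" assumes the English data files are installed.
  let api = TessBaseAPICreate()
  if TessBaseAPIInit3(api, nil, "eng") == 0:
    echo "Tesseract initialised"
  TessBaseAPIDelete(api)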
I hope this helps.
Yes, you're supposed to fix those manually :)
Not really... You're supposed to configure c2nim properly.
c2nim --header --strict --cpp headernames.h
where headernames.h stands for all the required headers.

That's the right place to start. Unfortunately, since C libraries often have oddities like random inline C macros before functions, you generally need to configure c2nim to tell it how to handle certain things. It's often easier to do each header separately.
Normally I take the route of just commenting out the offending lines in the headers, but I too just learned about the header_opts.c2nim usage above. :-)
Can you post an example of the line where c2nim breaks so people here could help with what config options would resolve it?
(Also, I think I'm gonna try and patch c2nim to print what lines it's skipping or having issues with. It's annoying to not have it print out the context (the last few lines) where it died in strict mode, or what lines it skipped in non-strict mode. I'm just too lazy to wanna switch windows to my text editor. ;) )
If someone wants a project idea beyond the usual "search for strings", I always thought it a good idea to try to improve scanned PDFs by using the OCR data to extract glyphs for new fonts as the average of all their instances (maybe removing outliers to mitigate mistakes). Then you can make a new PDF with said defined fonts that should look much higher resolution while also being fewer bytes. Call it scanned document optimization or something.
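To make the averaging step concrete, here's a minimal Nim sketch, assuming you already have same-character glyph crops (e.g. cut out using Tesseract's per-character boxes) scaled to a common size; the names and the simple keep-the-closest-instances outlier rule are purely illustrative:

import std/[math, algorithm]

type Glyph = seq[float]   # flattened grayscale bitmap; all crops share one size

proc average(glyphs: seq[Glyph]): Glyph =
  ## Pixel-wise mean of every instance of one character.
  result = newSeq[float](glyphs[0].len)
  for g in glyphs:
    for i, v in g: result[i] += v
  for i in 0 ..< result.len: result[i] /= glyphs.len.float

proc distance(a, b: Glyph): float =
  ## Euclidean distance between two bitmaps, used to rank outliers.
  for i in 0 ..< a.len: result += (a[i] - b[i]) ^ 2
  result = sqrt(result)

proc robustAverage(glyphs: seq[Glyph]; keepRatio = 0.9): Glyph =
  ## Average, drop the instances farthest from that mean (likely OCR
  ## mis-labels or badly damaged scans), then average the survivors.
  let rough = average(glyphs)
  let ranked = glyphs.sortedByIt(distance(it, rough))
  result = average(ranked[0 ..< max(1, int(keepRatio * glyphs.len.float))])

From there, each occurrence of the character in the new PDF would be rendered with its averaged (or later vectorised) glyph instead of the raw scan.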
Yes, yes... It would be unsurprising if there were commercial programs like Acrobat that could sort of do this. Actually, it would be unsurprising if there were academic software, too. :-) { If someone has experience with how this turns out to work well or suck in practice, I'd love to hear it. Maybe the 10,000 'a' instances have enough correlated noise to suck. Or maybe it requires more than naive outlier-filtering techniques, or slightly underexposed scans, etc. }
> I always thought it a good idea to try to improve scanned PDFs by using the OCR data to extract glyphs for new fonts as the average of all their instances [...] Call it scanned document optimization or something.
It already has a name: DjVu ;)
Thanks for the link! I looked at a few examples, and all I can say is that the defaults (or common settings) of whatever software produces DjVu from scans are very aggressive at "preserving uniqueness" as opposed to increasing resolution by averaging "quasi-same" glyphs together. E.g., the two 't's in a single word like "estimator" are clearly distinct glyphs, but with only barely visible differences.
Maybe the challenges of glyph averaging/merging are all documented in the papers leading up to DjVu, though. Or maybe they use OCR like Tesseract to just "snap to the first instance" instead of my averaging-to-improve-resolution idea - it isn't clear just from the wiki. The work seems more motivated by file sizes than by resolution enhancement, and there is clearly a minor trade-off with higher-resolution glyphs in the extracted bitmap fonts.
It's possible that the space/network concerns of the mid-1990s tilted them away from my idea, but those concerns no longer apply. So there's probably still a good follow-on project in there somewhere. :-) Why, one might even try least-squares fits to Bezier curves and extract scalable vector fonts for the main text parts, if the OCR accuracy was adequate. :-) (And, yeah - these are all pretty obvious ideas and have probably been done. I'm just continuing the discussion of how one might use a Nim Tesseract binding.)