The API doesn't look too complex. I would recommend doing it yourself; you will be grateful afterwards.
You can look at c2nim, nimterop, futhark, cinterop. The only one that I haven't tried so far is Futhark (but I will).
I tried with c2nim but it didn't work.
And when this happens, the last thing you should do is to read its manual, tinker with the C code and gain deep understanding. :P
With c2nim, my experience goes as follows. You normally run it like this:
$ c2nim --header capi.h
It worked! Well, not really. For this header you get everything commented out. You need to know where it fails, and for that you should use --strict:
$ c2nim --header --strict capi.h
capi.h(150, 22) Error: token expected: ; but got: *
The parser is not very helpful here. The problem is that it found something it was not expecting: in this case, the TESS_API macro, which the C preprocessor would normally expand to something else. If you simply remove it everywhere in the file, you will see that c2nim creates the bindings for you.
It is annoying having to modify the header manually. You can avoid that by creating a file tesseract.c2nim:
#ifdef C2NIM
#def TESS_API
#endif
(explained here)
Then you simply execute:
c2nim --header --strict tesseract.c2nim capi.h
with capi.h left unmodified, and you get your bindings.
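For a rough idea of what comes out the other end, the generated module is mostly importc procs mirroring capi.h. The snippet below is a hand-written sketch of what that tends to look like, not actual c2nim output; the dynlib name and the pragma details are assumptions you'd adjust for your platform and options (the function names themselves come from the real capi.h):

# Sketch only; real c2nim output differs depending on the options used.
const tessLib = "libtesseract.so"   # assumption: use the right name/extension for your OS

type
  TessBaseAPI* = object             # opaque handle declared in capi.h

proc TessBaseAPICreate*(): ptr TessBaseAPI {.importc, cdecl, dynlib: tessLib.}
proc TessBaseAPIDelete*(handle: ptr TessBaseAPI) {.importc, cdecl, dynlib: tessLib.}
proc TessBaseAPIInit3*(handle: ptr TessBaseAPI;
                       datapath, language: cstring): cint {.importc, cdecl, dynlib: tessLib.}
proc TessBaseAPIGetUTF8Text*(handle: ptr TessBaseAPI): cstring {.importc, cdecl, dynlib: tessLib.}

when isMainModule:
  # Tiny smoke test: a nil datapath means the default tessdata location,
  # and "eng" assumes the English data files are installed.
  let api = TessBaseAPICreate()
  if TessBaseAPIInit3(api, nil, "eng") == 0:
    echo "Tesseract initialised"
  TessBaseAPIDelete(api)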
I hope this helps.
Yes, you're supposed to fix those manually :)
Not really... You're supposed to configure c2nim properly.
c2nim --header --strict --cpp headernames.h
where headernames.h stands for all the required headers.

That's the right place to start. Unfortunately, since C libraries often have oddities like random inline C macros before functions, you generally need to configure c2nim to tell it how to handle certain things. It's often easier to do each header separately.
Normally I take the route of just commenting out the offending lines in the headers, but I too just learned about the header_opts.c2nim usage above. :-)
Can you post an example of the line where c2nim breaks so people here could help with what config options would resolve it?
(Also, I think I'm gonna try and patch c2nim to print what lines it's skipping or having issues with. It's annoying to not have it print out the context (the last few lines) where it died in strict mode, or what lines it skipped in non-strict mode. I'm just too lazy to wanna switch windows to my text editor. ;) )
If someone wants a project idea beyond the usual "search for strings", I always thought it a good idea to try to improve scanned PDFs by using the OCR data to extract glyphs for new fonts as the average of all their instances (maybe removing outliers to mitigate mistakes). Then you can make a new PDF with said defined fonts that should look much higher resolution while also being fewer bytes. Call it scanned document optimization or something.
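To make the averaging step concrete, here's a minimal Nim sketch, assuming you already have same-character glyph crops (e.g. cut out using Tesseract's per-character boxes) scaled to a common size; the names and the simple keep-the-closest-instances outlier rule are purely illustrative:

import std/[math, algorithm]

type Glyph = seq[float]   # flattened grayscale bitmap; all crops share one size

proc average(glyphs: seq[Glyph]): Glyph =
  ## Pixel-wise mean of every instance of one character.
  result = newSeq[float](glyphs[0].len)
  for g in glyphs:
    for i, v in g: result[i] += v
  for i in 0 ..< result.len: result[i] /= glyphs.len.float

proc distance(a, b: Glyph): float =
  ## Euclidean distance between two bitmaps, used to rank outliers.
  for i in 0 ..< a.len: result += (a[i] - b[i]) ^ 2
  result = sqrt(result)

proc robustAverage(glyphs: seq[Glyph]; keepRatio = 0.9): Glyph =
  ## Average, drop the instances farthest from that mean (likely OCR
  ## mis-labels or badly damaged scans), then average the survivors.
  let rough = average(glyphs)
  let ranked = glyphs.sortedByIt(distance(it, rough))
  result = average(ranked[0 ..< max(1, int(keepRatio * glyphs.len.float))])

From there, each occurrence of the character in the new PDF would be rendered with its averaged (or later vectorised) glyph instead of the raw scan.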
Yes, yes... It would be unsurprising if there were commercial programs like Acrobat that could sort of do this. Actually, it would be unsurprising if there were academic software, too. :-) { If someone has experience with how this turns out to work well or suck in practice, I'd love to hear it. Maybe the 10,000 'a' instances have enough correlated noise to suck. Or maybe it requires more than naive outlier-filtering techniques, or slightly underexposed scans, etc. }
> I always thought it a good idea to try to improve scanned PDFs by using the OCR data to extract glyphs for new fonts as the average of all their instances [...] Call it scanned document optimization or something.
It already has a name: DjVu ;)
Thanks for the link! I looked at a few examples, and all I can say is that the defaults (or common settings) of whatever software produces DjVu from scans are very aggressive at "preserving uniqueness" as opposed to increasing resolution by averaging "quasi-same" glyphs together. E.g., the two 't's in a single word like "estimator" are clearly distinct glyphs, but with only barely visible differences.
Maybe the challenges of glyph averaging/merging are all documented in the papers leading up to DjVu, though. Or maybe they use OCR like Tesseract to just "snap to the first instance" instead of my averaging-to-improve-resolution idea - it isn't clear just from the wiki. The work seems more motivated by file sizes than by resolution enhancement, and there is clearly a minor trade-off with higher-resolution glyphs in the extracted bitmap fonts.
It's possible that the space/network concerns of the mid-1990s tilted them away from my idea, but those concerns no longer apply. So there's probably still a good follow-on project in there somewhere. :-) Why, one might even try least-squares fits to Bezier curves and extract scalable vector fonts for the main text parts, if the OCR accuracy was adequate. :-) (And, yeah - these are all pretty obvious ideas and have probably been done. I'm just continuing the discussion of how one might use a Nim Tesseract binding.)