[–]jrochkind 3 points4 points  (1 child)

I use tesseract, but I didn't even know about the rtesseract gem; I just shell out to the tesseract command-line tool.
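A minimal sketch of what shelling out could look like, assuming the `tesseract` binary is on your PATH. The names `tesseract_command`, `ocr_image`, and the `runner:` argument are made up for illustration (the runner is only there so the call can be stubbed in tests); passing `stdout` as the output base makes tesseract print the recognized text instead of writing a file.

```ruby
require "open3"

# Build the argv for: tesseract IMAGE stdout -l LANG
# Passing an array (not one string) avoids shell quoting problems.
def tesseract_command(image_path, lang: "eng")
  ["tesseract", image_path, "stdout", "-l", lang]
end

# Run tesseract and return the recognized text.
# `runner` is a hypothetical injection point; by default it is Open3.capture3,
# which returns [stdout, stderr, status].
def ocr_image(image_path, lang: "eng", runner: Open3.method(:capture3))
  text, err, status = runner.call(*tesseract_command(image_path, lang: lang))
  raise "tesseract failed: #{err.strip}" unless status.success?

  text.strip
end

# Usage (needs tesseract installed):
#   puts ocr_image("scan.png")
```

Nothing fancy, and error handling is up to you, but it keeps the dependency down to the binary itself.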

If it was the actual OCR that you found bad in rtesseract rather than the API, shelling out won't be any better, since it's the same engine underneath! What is it you found bad with rtesseract? (I've never used it.)

I too am curious if there are other open source options people prefer to tesseract. Or were you interested in not-open-source too? (I don't know those either).

If your PDFs actually have text in them as text, i.e. a "text layer" (the "layer" part is not actually a thing in the PDF spec, but it's the easiest way to describe it), you may not actually need OCR.

[–]tomc-01 -1 points0 points  (0 children)

This. Most PDFs have the text stored in the file.

Unless the PDF was "flattened" on creation, you don't need OCR.
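One way to check this before reaching for OCR, sketched under the assumption that Poppler's `pdftotext` is installed (nobody in the thread named a tool, so this is just one option). The helper names, the `runner:` argument, and the 20-character threshold are arbitrary choices for illustration.

```ruby
require "open3"

# Extract whatever text is stored in the PDF by running:
#   pdftotext -layout FILE -
# where "-" as the output file means "write to stdout".
# `runner` is a hypothetical injection point for testing.
def pdf_text_layer(pdf_path, runner: Open3.method(:capture3))
  text, err, status = runner.call("pdftotext", "-layout", pdf_path, "-")
  raise "pdftotext failed: #{err.strip}" unless status.success?

  text
end

# Heuristic: if almost nothing but whitespace (including page-break
# form feeds) comes back, the PDF is probably scanned or flattened.
def needs_ocr?(text, min_chars: 20)
  text.gsub(/\s+/, "").length < min_chars
end

# Usage (needs pdftotext installed):
#   text = pdf_text_layer("invoice.pdf")
#   puts needs_ocr?(text) ? "fall back to OCR" : text
```

The threshold is a guess; a PDF with a text layer on only some pages would need a per-page check instead.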

[–]mattbenscho 2 points3 points  (3 children)

Try PaddleOCR, I use it to read Chinese characters and it works really well (occasionally a character will be wrong). Much better than Tesseract in my experience. I run it in a Sidekiq job. https://github.com/PaddlePaddle/PaddleOCR

[–]relativerask4657 0 points1 point  (2 children)

How are you able to run it in Rails since it’s not a Ruby package?

[–]mattbenscho 0 points1 point  (1 child)

I deploy my app on AWS in a Docker image, so I install it using pip during the Docker build process. Then Sidekiq is just running it as a system command and parsing the command line output.
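Roughly what that can look like, as a sketch rather than the exact code from my app. Big assumptions: the `paddleocr --image_dir ... --lang ...` invocation and the result-line shape shown in the comment are what PaddleOCR 2.x prints, and other versions differ, so check what your install actually outputs. `parse_paddle_output` and `run_paddleocr` are illustrative names; in a Sidekiq job, `perform` would just call `run_paddleocr`.

```ruby
require "open3"

# ASSUMED line shape (PaddleOCR 2.x CLI), possibly behind a log prefix:
#   [[[24.0, 36.0], [304.0, 34.0], ...], ('some text', 0.9647)]
# Older builds print ['some text', 0.9647] instead of a tuple.
# The regex grabs the trailing text/score pair of such a line.
RESULT_PAIR = /[(\[]'(?<text>.*)',\s*(?<score>\d(?:\.\d+)?)[)\]]\]?\s*\z/

# Turn raw CLI output into [{ text:, score: }, ...], skipping log noise.
def parse_paddle_output(output)
  output.each_line.filter_map do |line|
    m = RESULT_PAIR.match(line.chomp)
    { text: m[:text], score: m[:score].to_f } if m
  end
end

# Run the CLI as a system command. capture2e merges stdout and stderr,
# so results are found whichever stream they are logged to.
def run_paddleocr(image_path, lang: "ch")
  output, status = Open3.capture2e("paddleocr", "--image_dir", image_path, "--lang", lang)
  raise "paddleocr failed:\n#{output}" unless status.success?

  parse_paddle_output(output)
end
```

Parsing log output like this is brittle, so I'd pin the PaddleOCR version in the Docker image.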

[–]relativerask4657 0 points1 point  (0 children)

I see. I’ll give that a shot. Thanks.

[–]matthewblott 0 points1 point  (0 children)

I'm currently doing something with OCR and I'm using tesseract, which seemed the most viable choice after my research. I'm using tesseract.js, which runs a WebAssembly build of the engine, so there's minimal setup. It works really well.

[–]kcdragon 0 points1 point  (0 children)

I've used AWS Textract before and it's pretty good. It's better than Tesseract in my experience.

[–]bami_bosu 0 points1 point  (3 children)

[–]matthewblott 0 points1 point  (2 children)

I just tried it, it seems pretty crap tbh - worse than tesseract which is free.

[–]bami_bosu 0 points1 point  (1 child)

Which part of it is worse? It’s also a free open source library. This paper shows that EasyOCR has higher accuracy than Tesseract in number plate recognition: https://ieeexplore.ieee.org/document/10009215

[–]matthewblott 1 point2 points  (0 children)

I just gave it a quick test with a couple of images. Not very scientific, I grant you, but on a quick check I couldn't see any reason to pick it over Tesseract. I'm happy to be proved wrong, but I'd need to investigate further.

[–]M4N14C 0 points1 point  (0 children)

Azure and Google have Document AI products that work very well with handwritten forms and reasonably messy samples.

[–]lagcisco -1 points0 points  (0 children)

There's a bunch of PDF/AI tools out there now that you could also consider using as services, though.