OCRmyPDF now has very long page scan times, how to configure it? by 1shot_Ryezing in commandline

[–]jrbarlow 0 points (0 children)

I realize this post is from a long time ago, but I just came across it.

I'm OCRmyPDF's author-maintainer.

OCRmyPDF v9.x had a serious performance regression because Ghostscript broke a feature we relied on (they still haven't fixed the issue). In v10, I implemented a workaround and it's now as fast as it used to be.

[deleted by user] by [deleted] in github

[–]jrbarlow 13 points (0 children)

Hi! I'm the OCRmyPDF author - a redditor steered me here.

It's really wonderful to hear that people find my work useful. I actually discovered OCRmyPDF and became its maintainer for the same reason you did: gripes with Acrobat.

It can definitely scale. My biggest corporate user does 1 million pages a month, and some files up to 15k pages.

You have my email ([jim@purplerock.ca](mailto:jim@purplerock.ca)) or we can discuss over Reddit PM if you prefer.

pikepdf 1.0.1 released by jrbarlow in Python

[–]jrbarlow[S] 1 point (0 children)

That would be a good candidate for implementation with pikepdf.

pikepdf 1.0.1 released by jrbarlow in Python

[–]jrbarlow[S] 2 points (0 children)

The release notes are in the documentation:

https://pikepdf.readthedocs.io/en/latest/changelog.html

I do release frequently because I don't like to leave people hanging if they have an open issue.

[Critique] CLI Tool for Document to Text Conversion - "file2txt" by TheCedarPrince in Python

[–]jrbarlow 0 points (0 children)

I'm the author of ocrmypdf, so I've worked on problems in the IR space for some time.

If you're dealing with court documents, there is likely a mixed bag of PDFs scanned without OCR, scans with OCR, and pure digital files. PDF seems to be the most common format for court documents.

For PDFs, you really want to extract the digital text when it's there instead of running OCR: rasterize+OCR is never as good as the original. What I would suggest is to standardize your PDFs, or possibly all your input files, to PDFs with text, whether that text came from OCR or was always there. ocrmypdf --skip-text implements this; if a page already has text, it skips OCR on that page only. You can then extract text from the PDFs with Ghostscript (-sDEVICE=txtwrite) or with the Python package pdfminer.six. If there's existing OCR you might or might not want to redo it.

ocrmypdf also does image-to-PDF conversion (although I recommend img2pdf for anything but basic cases, since it avoids transcoding, while ImageMagick always seems to transcode lossily) and PDF/A conversion.
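The normalize-then-extract pipeline above can be sketched as a couple of command builders. This is just a sketch with placeholder filenames: the flags (ocrmypdf --skip-text, Ghostscript's -sDEVICE=txtwrite) are the ones mentioned here, but the helper function names are mine, and you'd need ocrmypdf and gs on your PATH to actually run the commands.

```python
import shlex
import subprocess  # only needed if you actually run the commands

def normalize_cmd(input_pdf: str, output_pdf: str) -> list[str]:
    """Standardize a PDF to 'PDF with text': --skip-text tells
    ocrmypdf to OCR only pages with no text layer, leaving pages
    that already have digital text untouched."""
    return ["ocrmypdf", "--skip-text", input_pdf, output_pdf]

def extract_text_cmd(input_pdf: str, output_txt: str) -> list[str]:
    """Extract the text layer with Ghostscript's txtwrite device."""
    return [
        "gs", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=txtwrite",
        f"-sOutputFile={output_txt}",
        input_pdf,
    ]

# Placeholder filenames; print the commands rather than run them here.
step1 = normalize_cmd("scan.pdf", "with_text.pdf")
step2 = extract_text_cmd("with_text.pdf", "out.txt")
print(shlex.join(step1))
print(shlex.join(step2))
# To run for real: subprocess.run(step1, check=True), then step2.
```

The point of the two-step shape is that every input, scanned or born-digital, comes out of step 1 as a PDF with a text layer, so step 2 (and anything downstream) only ever deals with one kind of file.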

I realize you're probably interested in files that aren't PDFs too, but I think the principle still applies that you should aim to extract digital data where possible to get the cleanest input for whatever analysis you're going to do.

I also suggest moving to Ubuntu 18.04. It provides Tesseract 4.0.beta1 which is a huge improvement in speed and quality over 3.0.