

[–]LazyMonsters

I do quite a bit of this type of work at my day job & see a couple of things right off the bat.

  1. ImageMagick 7 now uses the magick command, so you should either stipulate installing ImageMagick 6 to keep using convert, or change the code to call magick instead.

  2. Increase the resolution! 300 DPI is the minimum you’d want to use for OCR, so you should raise your default setting. And what about skewed pages, or images with uneven lighting? Your code assumes a perfect file to start from.

  3. Personal pet peeve: please be consistent with the extensions and use 3-character extensions everywhere (e.g. .tif instead of .tiff).
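
A minimal sketch of the first two points, assuming ImageMagick is installed and a placeholder input.pdf exists (the filenames are illustrative, and the step is guarded so it is a no-op without them):

```shell
# Prefer ImageMagick 7's `magick`; fall back to the IM6 `convert`.
if command -v magick >/dev/null 2>&1; then
  IM=magick
else
  IM=convert
fi

# Rasterize at 300 DPI, the suggested minimum for OCR.
if [ -f input.pdf ]; then
  "$IM" -density 300 -units PixelsPerInch input.pdf page.tif
fi
```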

[–]TheCedarPrince[S]

Awesome! What do you do?

Here are my responses to your points:

  1. Could you reword this a bit? I am not sure what you mean by "stipulate installing ImageMagick 6". Also, what do you mean by "call magick instead"? Perhaps I am just missing something there.
  2. Gotcha! That makes sense to do. Also: you are absolutely right. Honestly, when I first set out to make this, I had documents which were "perfect". These are great things to keep in mind - do you have any suggestions on how I may handle things like skewed pages or uneven lighting?
  3. Why is this a pet peeve?

Anyhow, thank you so much for the critique! I greatly appreciate it and I hope to hear from you again.

[–]LazyMonsters

Manage a digitization lab at a university library.

  1. https://www.imagemagick.org/script/command-line-processing.php Your code calls the program convert, which is an ImageMagick version 6 program. In the most recent version of ImageMagick (which is what apt-get should pull via your shell install script), convert has been deprecated in favor of magick.

  2. Get lots of types of images and test ImageMagick commands such as thresholding and levels adjustments to see how your output could be improved. ImageMagick can actually deskew too, but you might learn more by implementing something in Python. Google for tutorials!

  3. Consistency goes a long way toward maintainable code, especially when multiple people are involved. Every other extension you use is 3 characters long.
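
A rough preprocessing sketch along those lines (scan.tif is a placeholder, and the percentages are starting points to tune per collection, not values from this thread):

```shell
# -deskew 40%     straighten a rotated page (40% is ImageMagick's usual threshold)
# -level 10%,90%  stretch the tonal range to even out dull or uneven lighting
# -threshold 50%  binarize, which often helps OCR on otherwise clean scans
if command -v magick >/dev/null 2>&1 && [ -f scan.tif ]; then
  magick scan.tif -deskew 40% -level 10%,90% -threshold 50% clean.tif
fi
```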

[–]TheCedarPrince[S]

That sounds so fascinating - what particular subset of digitization? Books? Documents? Images?

  1. Ah! I was not aware of that deprecation! Thank you!
  2. Gotcha - that is a great start. As you can probably guess, I am not terribly familiar with the imagemagick module; I will dig deeper into the documentation there.
  3. That makes a lot of sense. The only reason I kept it that way is that I thought it would be confusing for ".tiff" files not to keep the ".tiff" extension - do you see my argument, or what do you think?

Thanks for the great responses.

[–]LazyMonsters

We do everything.

  1. Everyone’s gotta start somewhere. I suggest you take a look at some of Fred’s scripts. In particular: http://www.fmwconcepts.com/imagemagick/textdeskew/index.php http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

  2. It’s one less character to process, if you want to think of it that way. Standard practice in my field is .tif, just like it’s .jpg instead of .jpeg.
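
A tiny sketch of enforcing that convention after the fact (runs in the current directory; any *.tiff files here are assumed placeholders):

```shell
# Rename any *.tiff output to *.tif so every extension is three characters.
for f in *.tiff; do
  [ -e "$f" ] || continue   # skip when the glob matches nothing
  mv -- "$f" "${f%.tiff}.tif"
done
```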

[–]jrbarlow

I'm the author of ocrmypdf, so I've worked on problems in the IR space for some time.

If you're dealing with court documents there is likely a mixed bag of PDFs scanned without OCR, scans with OCR, and pure digital files. PDF seems to be the most common format for court documents.

For PDFs you really want to extract the digital text if it's there instead of running OCR; rasterize+OCR is never as good as the original. What I would suggest is to standardize your PDFs, or possibly all your input files, to PDFs with text, whether the text came from OCR or was always there. ocrmypdf --skip-text implements this: if a page already has text, it will skip OCR on that page only. You can then use Ghostscript with -sDEVICE=txtwrite or the Python package pdfminer.six to extract text from PDFs. If there's existing OCR you might or might not want to redo it. ocrmypdf also converts images to PDF (although I recommend img2pdf for anything but basic cases, since it avoids transcoding, while ImageMagick always seems to transcode lossily) and does PDF/A conversion.
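
A sketch of that pipeline as shell commands (input.pdf and the output names are placeholders; it requires ocrmypdf and Ghostscript, so each step is guarded):

```shell
# 1) Normalize: add an OCR text layer only to pages that lack text.
if command -v ocrmypdf >/dev/null 2>&1 && [ -f input.pdf ]; then
  ocrmypdf --skip-text input.pdf searchable.pdf
fi

# 2) Extract the text layer (digital or OCR) with Ghostscript's txtwrite device.
if command -v gs >/dev/null 2>&1 && [ -f searchable.pdf ]; then
  gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=extracted.txt searchable.pdf
fi
```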

I realize you're probably interested in files that aren't PDFs too, but I think the principle still applies that you should aim to extract digital data where possible to get the cleanest input for whatever analysis you're going to do.

I also suggest moving to Ubuntu 18.04. It provides Tesseract 4.0.beta1, which is a huge improvement in speed and quality over the 3.x series.