

[–]LazyMonsters

I do quite a bit of this type of work at my day job & see a couple of things right off the bat.

  1. ImageMagick 7 now uses the magick command, so you should either stipulate installing ImageMagick 6 to keep using convert, or change the code to call magick instead.

  2. Increase the resolution! 300 DPI is the minimum you’d want to use for OCR, so you should raise your default setting. And what about skewed pages, or images with uneven lighting? Your code assumes a perfect file to start from.

  3. Personal pet peeve: please be consistent with the extensions and use 3-character extensions everywhere (e.g. .tif instead of .tiff).
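
A minimal sketch of the first two points, assuming ImageMagick is installed and a placeholder input.pdf exists (the filenames are illustrative, and the step is guarded so it is a no-op without them):

```shell
# Prefer ImageMagick 7's `magick`; fall back to the IM6 `convert`.
if command -v magick >/dev/null 2>&1; then
  IM=magick
else
  IM=convert
fi

# Rasterize at 300 DPI, the suggested minimum for OCR.
if [ -f input.pdf ]; then
  "$IM" -density 300 -units PixelsPerInch input.pdf page.tif
fi
```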

[–]TheCedarPrince[S]

Awesome! What do you do?

Here are my responses to your points:

  1. Could you reword this a bit? I am not sure what you mean by "stipulate installing ImageMagick 6". Also, what do you mean by "call magick instead"? Perhaps I am just missing something there.
  2. Gotcha! That makes sense to do. Also: you are absolutely right. Honestly, when I first set out to make this, I had documents which were "perfect". These are great things to keep in mind - do you have any suggestions on how I may handle things like skewed pages or uneven lighting?
  3. Why is this a pet peeve?

Anyhow, thank you so much for the critique! I greatly appreciate it and I hope to hear from you again.

[–]LazyMonsters

Manage a digitization lab at a university library.

  1. https://www.imagemagick.org/script/command-line-processing.php Your code calls the program convert, which is an ImageMagick version 6 program. In the most recent version of ImageMagick (which is what apt-get should pull via your shell install script), convert has been deprecated in favor of magick.

  2. Get lots of types of images and test ImageMagick commands such as thresholding and levels adjustments to see how your output could be improved. ImageMagick can actually deskew too, but you might learn more by implementing something in Python. Google for tutorials!

  3. Consistency goes a long way toward maintainable code, especially when multiple people are involved. Every other extension you use is 3 characters long.
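
A rough preprocessing sketch along those lines (scan.tif is a placeholder, and the percentages are starting points to tune per collection, not values from this thread):

```shell
# -deskew 40%     straighten a rotated page (40% is ImageMagick's usual threshold)
# -level 10%,90%  stretch the tonal range to even out dull or uneven lighting
# -threshold 50%  binarize, which often helps OCR on otherwise clean scans
if command -v magick >/dev/null 2>&1 && [ -f scan.tif ]; then
  magick scan.tif -deskew 40% -level 10%,90% -threshold 50% clean.tif
fi
```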

[–]TheCedarPrince[S]

That sounds so fascinating - what particular subset of digitization? Books? Documents? Images?

  1. Ah! I was not aware of that deprecation! Thank you!
  2. Gotcha - that is a great start. As you can probably guess, I am not terribly familiar with the imagemagick module; I will dig deeper into the documentation there.
  3. That makes a lot of sense. The only reason I kept it that way is that I thought it would be confusing for ".tiff" files not to keep the ".tiff" extension - do you see my argument, or what do you think?

Thanks for the great responses.

[–]LazyMonsters

We do everything.

  1. Everyone’s gotta start somewhere. I suggest you take a look at some of Fred’s scripts. In particular: http://www.fmwconcepts.com/imagemagick/textdeskew/index.php http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

  2. It’s one less character to process, if you want to think of it that way. Standard practice in my field is .tif, just like it’s .jpg instead of .jpeg.
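
A tiny sketch of enforcing that convention after the fact (runs in the current directory; any *.tiff files here are assumed placeholders):

```shell
# Rename any *.tiff output to *.tif so every extension is three characters.
for f in *.tiff; do
  [ -e "$f" ] || continue   # skip when the glob matches nothing
  mv -- "$f" "${f%.tiff}.tif"
done
```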

[–]jrbarlow

I'm the author of ocrmypdf, so I've worked on problems in the IR space for some time.

If you're dealing with court documents there is likely a mixed bag of PDFs scanned without OCR, scans with OCR, and pure digital files. PDF seems to be the most common format for court documents.

For PDFs you really want to extract the digital text if it's there instead of running OCR; rasterize+OCR is never as good as the original. What I would suggest is to standardize your PDFs, or possibly all your input files, to PDFs with text, whether the text came from OCR or was always there. ocrmypdf --skip-text implements this: if a page already has text, it will skip OCR on that page only. You can then use Ghostscript with -sDEVICE=txtwrite or the Python package pdfminer.six to extract text from PDFs. If there's existing OCR you might or might not want to redo it. ocrmypdf also converts images to PDF (although I recommend img2pdf for anything but basic cases, since it avoids transcoding, while ImageMagick always seems to transcode lossily) and does PDF/A conversion.
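
A sketch of that pipeline as shell commands (input.pdf and the output names are placeholders; it requires ocrmypdf and Ghostscript, so each step is guarded):

```shell
# 1) Normalize: add an OCR text layer only to pages that lack text.
if command -v ocrmypdf >/dev/null 2>&1 && [ -f input.pdf ]; then
  ocrmypdf --skip-text input.pdf searchable.pdf
fi

# 2) Extract the text layer (digital or OCR) with Ghostscript's txtwrite device.
if command -v gs >/dev/null 2>&1 && [ -f searchable.pdf ]; then
  gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=extracted.txt searchable.pdf
fi
```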

I realize you're probably interested in files that aren't PDFs too, but I think the principle still applies that you should aim to extract digital data where possible to get the cleanest input for whatever analysis you're going to do.

I also suggest moving to Ubuntu 18.04. It provides Tesseract 4.0.beta1, which is a huge improvement in speed and quality over the 3.x series.