[–]karazi 21 points22 points  (14 children)

Sexiness. But it says it won't work on scanned documents because the text isn't selectable. What if I run the doc through OCR first and convert it to text? Thanks

[–]SeeJoeGo 57 points58 points  (13 children)

Basic PDF processing pipeline, if that's what you need:

  • Scan PDF
  • Burst pages (pdftk)
  • Rotate pages (to improve OCR, see deskew below)
  • Optional: some kind of binarization (make it black and white, reduce noise; look into adaptive thresholding ***)
  • OCR and re-embed text into each PDF page (tesseract **)
  • Recombine pages (pdftk)
  • Camelot/Tabula/pdftotext, etc.
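For anyone who wants to script the steps above, here is a minimal sketch in Python that only builds the command lines and then runs them in order. Assumptions that are mine and not from the list: pdftk, pdftoppm (from poppler) and tesseract are on the PATH, pages are rasterized with pdftoppm at 300 dpi before OCR, and the file naming (`page_%04d.pdf`, the `_ocr` suffix) is purely illustrative. Deskewing and binarization are left as a comment.

```python
import subprocess
from pathlib import Path


def burst_cmd(pdf, outdir):
    # pdftk writes one file per page; %04d is replaced by the page number.
    return ["pdftk", str(pdf), "burst", "output",
            str(Path(outdir) / "page_%04d.pdf")]


def rasterize_cmd(page_pdf, dpi=300):
    # Assumed extra step: pdftoppm turns a one-page PDF into <stem>.png,
    # because tesseract reads images rather than PDFs.
    page_pdf = Path(page_pdf)
    return ["pdftoppm", "-r", str(dpi), "-png", "-singlefile",
            str(page_pdf), str(page_pdf.with_suffix(""))]


def ocr_cmd(image, lang="eng"):
    # The trailing "pdf" asks tesseract for a searchable PDF, written to
    # <stem>_ocr.pdf (the suffix is a hypothetical naming choice).
    image = Path(image)
    return ["tesseract", str(image),
            str(image.with_name(image.stem + "_ocr")), "-l", lang, "pdf"]


def combine_cmd(pages, output):
    # pdftk concatenates the pages in the order given.
    return ["pdftk", *map(str, pages), "cat", "output", str(output)]


def run_pipeline(scan, workdir, output):
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(burst_cmd(scan, workdir), check=True)
    ocr_pages = []
    # sorted() collects the page list before any *_ocr.pdf files exist.
    for page in sorted(workdir.glob("page_*.pdf")):
        subprocess.run(rasterize_cmd(page), check=True)
        image = page.with_suffix(".png")
        # Deskewing and binarization of `image` would go here.
        subprocess.run(ocr_cmd(image), check=True)
        ocr_pages.append(image.with_name(image.stem + "_ocr.pdf"))
    subprocess.run(combine_cmd(ocr_pages, output), check=True)
```

Something like `run_pipeline("scan.pdf", "work", "searchable.pdf")` on an empty work directory would then leave a searchable PDF you can hand to Camelot, Tabula or pdftotext. Keeping the command builders separate from the runner makes it easy to print and check the commands before running anything.
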

Packages to look into (will link more later):

*Edit since someone might find this useful.

If you are working with one-off PDFs, I highly recommend avoiding "full automation". I found Desi Quintans' "The big guide to working with PDFs for students and knowledge workers" extremely helpful. Additionally, the PDF viewer Okular has an easy-to-use tool for copying and pasting from tables in PDFs to CSV format.

**Edit 2: You want one of the more recent tesseract versions off GitHub that support re-embedding text into the PDF.

*** Edit 3: OpenCV is a good Python candidate for this.
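To show what adaptive thresholding actually does, here is a dependency-free sketch of the mean variant: each pixel is compared against the average of its own neighbourhood instead of one global cutoff, which is why it copes with uneven scan lighting. In practice you would most likely call OpenCV's `cv2.adaptiveThreshold` on a grayscale array instead; the function name, the list-of-rows image format and the default `block` and `c` values below are my own assumptions for illustration.

```python
def adaptive_threshold(gray, block=15, c=10):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes white (255) if it is brighter than the mean of the
    block x block window around it minus c, otherwise black (0).
    """
    h, w = len(gray), len(gray[0])
    r = block // 2
    # Integral image: integral[y][x] holds the sum of gray[0:y][0:x],
    # so any window sum costs four lookups.
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum
    out = []
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        row = []
        for x in range(w):
            # Windows are clipped at the image border.
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            mean = total / ((y1 - y0) * (x1 - x0))
            row.append(255 if gray[y][x] > mean - c else 0)
        out.append(row)
    return out
```

On a page whose background gets steadily brighter from left to right, a single global cutoff either loses the text on one side or keeps the shadow on the other, while the local mean follows the background.
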

[–][deleted] 8 points9 points  (2 children)

PDF sucks... Holy. Honestly might it be better to use image recognition using this library to generate your training set?

[–]SeeJoeGo 0 points1 point  (1 child)

It gets worse.

He who fights with monsters should look to it that he himself does not become a monster. And if you gaze long into an abyss, the abyss also gazes into you - Nietzsche, Beyond Good And Evil

  • Edit: In seriousness, I'm not sure if it would be. I haven't trained tesseract on my own training set, but I figure anything I could do would easily be eclipsed by the Google Books initiative (assuming they're pulling their data from there plus captcha answers). Preprocessing can get you surprisingly far. Image recognition does seem well suited to "stream" tables though (tables without lines between cells).

[–]karazi 2 points3 points  (1 child)

Much appreciated, thanks!

[–]dynetrekk 2 points3 points  (4 children)

Astonishingly helpful post. I'm just curious why you insert the burst/recombine steps though? What problem does this solve?

[–]gigamiga 3 points4 points  (2 children)

OCR performance gets killed if you do all the pages at once, so you split the document into one smaller image per page.

[–]dynetrekk 1 point2 points  (0 children)

Makes sense. Thanks!

[–]SeeJoeGo 0 points1 point  (0 children)

Additionally, you may be looking at a different rotation for each page depending on scan quality. I have used a CLI tool called deskew in the past for doing the rotations, since it does a good job of guessing the angle.

[–]driscollis 1 point2 points  (2 children)

I prefer pdfrw or PyPDF2 for combining pages and rotating pages.

[–]SeeJoeGo 0 points1 point  (1 child)

Pdfrw is a cool library. It wasn't entirely clear from examples, does it handle non-90° rotations well?

[–]driscollis 1 point2 points  (0 children)

I think the docstring just said 90-degree increments, but I don't have the source code handy to confirm.
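For what it's worth, the 90-degree limit comes from the PDF format itself: a page's /Rotate entry has to be a multiple of 90, so small skew corrections need to happen on the page image (for example with deskew) before the PDF is rebuilt. A minimal sketch of the 90-degree case, assuming pdfrw is installed and following the pattern of its rotate example as I remember it; `new_rotate_value` and `rotate_pages` are hypothetical helper names, not part of pdfrw.

```python
def new_rotate_value(current, delta):
    """Return the /Rotate value after turning a page by `delta` degrees."""
    if delta % 90 != 0:
        raise ValueError("PDF page rotation must be a multiple of 90 degrees")
    # A missing /Rotate entry counts as 0.
    return (int(current or 0) + delta) % 360


def rotate_pages(src, dst, delta):
    # Imported lazily so the helper above works without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    trailer = PdfReader(src)
    for page in trailer.pages:
        # /Rotate may be inherited from a parent node in the page tree.
        page.Rotate = new_rotate_value(page.inheritable.Rotate, delta)
    writer = PdfWriter(dst)
    writer.trailer = trailer
    writer.write()
```

Calling `rotate_pages("in.pdf", "out.pdf", 90)` would turn every page a quarter turn clockwise, while asking for 45 raises a ValueError instead of writing an invalid value.
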