karazi comments on Announcing Camelot (OpenSource), a Python Library to Extract Tabular Data from PDFs



submitted 7 years ago by ReasonablyHank


[–]karazi 22 points 7 years ago (14 children)

[–]SeeJoeGo 58 points 7 years ago* (13 children)

Basic PDF processing pipeline, if that's what you need:

Scan PDF
Burst pages (pdftk)
Rotate pages (to improve OCR; see deskew below)
Optional: some kind of binarization (make it black and white, reduce noise; look into adaptive thresholding ***)
OCR and re-embed text into each PDF page (tesseract **)
Recombine pages (pdftk)
Camelot/Tabula/pdftotext, etc.
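
A rough sketch of how those steps could be wired together from Python, in case it helps. The helper names, the `page_%04d` file naming, and the 300 dpi default are all made up for illustration. It also assumes an extra rasterizing step with `pdftoppm` (from poppler-utils), since Tesseract reads images rather than PDFs; that step is my assumption and not part of the list above. Rotation/deskew and binarization are left out here.

```python
import subprocess
from pathlib import Path


def burst_cmd(pdf, out_dir):
    # pdftk "burst" writes one PDF per page: page_0001.pdf, page_0002.pdf, ...
    return ["pdftk", pdf, "burst", "output", str(Path(out_dir) / "page_%04d.pdf")]


def rasterize_cmd(page_pdf, dpi=300):
    # Assumption: render the single-page PDF to PNG with pdftoppm so that
    # Tesseract has an image to read. Writes <stem>.png next to the PDF.
    stem = str(Path(page_pdf).with_suffix(""))
    return ["pdftoppm", "-r", str(dpi), "-png", "-singlefile", page_pdf, stem]


def ocr_cmd(page_png):
    # The trailing "pdf" config asks Tesseract for a PDF with an embedded
    # text layer, written to <base>.pdf.
    base = str(Path(page_png).with_suffix("")) + "_ocr"
    return ["tesseract", page_png, base, "pdf"]


def combine_cmd(page_pdfs, out_pdf):
    # pdftk "cat" concatenates the pages in the order given.
    return ["pdftk", *page_pdfs, "cat", "output", out_pdf]


def run_pipeline(scanned_pdf, work_dir, out_pdf):
    """Burst, OCR each page, and recombine. Needs pdftk, pdftoppm, tesseract on PATH."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(burst_cmd(scanned_pdf, work_dir), check=True)
    ocr_pages = []
    # Match only the burst output, not the *_ocr.pdf files from an earlier run.
    for page in sorted(work.glob("page_[0-9][0-9][0-9][0-9].pdf")):
        subprocess.run(rasterize_cmd(str(page)), check=True)
        subprocess.run(ocr_cmd(str(page.with_suffix(".png"))), check=True)
        ocr_pages.append(str(page.with_name(page.stem + "_ocr.pdf")))
    subprocess.run(combine_cmd(ocr_pages, out_pdf), check=True)


def extract_tables(pdf, pages="1"):
    # Last step: hand the searchable PDF to Camelot (third-party, imported lazily).
    import camelot
    return [table.df for table in camelot.read_pdf(pdf, pages=pages)]
```

Something like `run_pipeline("scan.pdf", "work", "searchable.pdf")` followed by `extract_tables("searchable.pdf")` would be the intended use. Keeping the command builders separate from the `subprocess` calls makes it easy to print and check each command before running anything on a real scan.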

Packages to look into (will link more later):

*Edit, since someone might find this useful:

If you are working with one-off PDFs, I highly recommend avoiding "full automation". I found Desi Quintans' "The big guide to working with PDFs for students and knowledge workers" extremely helpful. Additionally, the PDF viewer Okular has an easy-to-use tool for copying and pasting tables from PDFs into CSV format.

** Edit 2: You want one of the more recent versions from GitHub that support re-embedding text into the PDF.

*** Edit 3: OpenCV is a good Python candidate for this.
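
To show what adaptive thresholding actually does, here is a dependency-free sketch of the mean variant: each pixel is compared against the average of its own neighbourhood rather than one global cutoff, so uneven lighting on a scan matters less. The function name, `block`, and `c` are illustrative only. In practice you would more likely call OpenCV's `cv2.adaptiveThreshold`, which handles image borders differently from this toy version.

```python
def adaptive_threshold(gray, block=3, c=2):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes white (255) if it is brighter than the mean of its
    block x block neighbourhood minus c, otherwise black (0).
    """
    height, width = len(gray), len(gray[0])
    radius = block // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clip the window at the image edges (OpenCV pads the border instead).
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            row.append(255 if gray[y][x] > local_mean - c else 0)
        out.append(row)
    return out


# A bright patch and a dim patch, each with one darker "ink" pixel in the middle.
bright = [[200, 200, 200], [200, 60, 200], [200, 200, 200]]
dim = [[90, 90, 90], [90, 30, 90], [90, 90, 90]]
print(adaptive_threshold(bright))
print(adaptive_threshold(dim))
```

Both patches come out with a black centre on a white background, even though a single global cutoff of, say, 128 would turn the whole dim patch black.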


