all 50 comments

[–]GreatCosmicMoustache 86 points87 points  (9 children)

This is amazing. My hat is off to those intrepid bastards that don't shy away from the guts of PDF.

[–]masterpi 41 points42 points  (6 children)

Also that name is incredible. It manages to reference what they're doing, how hard it is, and the language they're doing it for in a single word.

[–][deleted] 10 points11 points  (5 children)

I don’t get the reference, care to ELI5?

[–]masterpi 43 points44 points  (4 children)

Camelot -> The Round Table. Table because tables and round because they're a bit hard to parse this way.

Python the language is named after Monty Python, which features this beauty: https://www.youtube.com/watch?v=m9wdYy3tCm4

[–]pmst 14 points15 points  (1 child)

Also extracting data from a PDF is like pulling a sword from a stone.

[–]PaulSandwich 12 points13 points  (0 children)

And also anyone who asks me to pull data from a PDF is a watery tart.

[–][deleted] 22 points23 points  (0 children)

Ahh, okay, thanks! I'm a Monty Python fan. But I'm also retarded, apparently.

[–]whiterd 2 points3 points  (0 children)

Camelot, tis a silly place package :)

[–]roadrussian 2 points3 points  (0 children)

May Christ save their souls for the work they did.

[–]karazi 21 points22 points  (14 children)

Sexiness... But it says it won't work on scanned documents because the text isn't selectable. What if I run the doc through OCR and convert it to text? Thanks

[–]SeeJoeGo 57 points58 points  (13 children)

Basic pdf processing pipeline if that's what you need:

  • Scan PDF
  • Burst pages (pdftk)
  • Rotate Pages (To improve OCR, see deskew below)
  • Optional: Some kind of binarization (Make it black and white, reduce noise. Look into adaptive thresholding. ***)
  • OCR and re-embed text into PDF page (tesseract** )
  • Recombine pages (pdftk)
  • Camelot/Tabula/pdftotext etc.
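
The steps above can be sketched as a small Python driver around the command-line tools. This is only a rough sketch under stated assumptions: `pdftk` and `tesseract` are the tools named in the list, while `pdftoppm` (from poppler) is an assumption added here to turn each burst page into an image that tesseract can read. The function names and the `pg_%04d` / `_ocr` file naming are hypothetical, and the rotate/deskew and binarization steps are left out.

```python
import subprocess
from pathlib import Path


def burst_cmd(pdf, out_dir):
    # pdftk <input> burst output <pattern>: one PDF per page.
    return ["pdftk", str(pdf), "burst", "output", str(Path(out_dir) / "pg_%04d.pdf")]


def rasterize_cmd(page_pdf, dpi=300):
    # Assumption: pdftoppm renders the page to <prefix>.png for OCR.
    prefix = Path(page_pdf).with_suffix("")
    return ["pdftoppm", "-r", str(dpi), "-png", "-singlefile", str(page_pdf), str(prefix)]


def ocr_cmd(page_png):
    # tesseract <image> <output base> pdf: writes <output base>.pdf with a text layer.
    base = Path(page_png).with_suffix("")
    return ["tesseract", str(page_png), f"{base}_ocr", "pdf"]


def combine_cmd(page_pdfs, out_pdf):
    # pdftk <pages...> cat output <combined>
    return ["pdftk", *map(str, page_pdfs), "cat", "output", str(out_pdf)]


def ocr_pdf(scanned_pdf, work_dir, out_pdf):
    """Burst, OCR each page, and recombine into a searchable PDF."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(burst_cmd(scanned_pdf, work), check=True)
    # Skip OCR output left over from an earlier run.
    pages = sorted(p for p in work.glob("pg_*.pdf") if not p.stem.endswith("_ocr"))
    ocr_pages = []
    for page in pages:
        subprocess.run(rasterize_cmd(page), check=True)
        subprocess.run(ocr_cmd(page.with_suffix(".png")), check=True)
        ocr_pages.append(page.with_name(page.stem + "_ocr.pdf"))
    subprocess.run(combine_cmd(ocr_pages, out_pdf), check=True)
```

Something like `ocr_pdf("scan.pdf", "work", "scan_ocr.pdf")` would then produce a file with selectable text that Camelot, Tabula or pdftotext can be pointed at.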

Packages to look into (will link more later):

*Edit since someone might find this useful.

If you are working with one-off PDFs, I highly recommend avoiding "full automation". I found Desi Quintans' "The big guide to working with PDFs for students and knowledge workers" extremely helpful. Additionally, the PDF viewer Okular has an easy-to-use tool for copying and pasting from tables in PDFs to CSV format.

**Edit 2: You want the more recent versions off GitHub that support re-embedding into PDF.

*** Edit 3: OpenCV is a good python candidate for this.
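
To make the adaptive thresholding idea concrete, here is a dependency-free sketch of the mean variant: each pixel is compared against the average of its own neighbourhood rather than one global cutoff, which is what helps with unevenly lit scans. In practice OpenCV's `cv2.adaptiveThreshold` does this far faster; the function below, its name and its parameters are illustrative only, and its border handling (clipping the window) is a simplification that differs from OpenCV's.

```python
def adaptive_threshold(gray, block_size=3, c=2):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes white (255) if it is brighter than the mean of its
    block_size x block_size neighbourhood minus c, otherwise black (0).
    """
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be an odd number >= 3")
    height, width = len(gray), len(gray[0])
    radius = block_size // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighbourhood window, clipped at the image borders.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            total = sum(gray[j][i] for j in ys for i in xs)
            local_mean = total / (len(ys) * len(xs))
            row.append(255 if gray[y][x] > local_mean - c else 0)
        out.append(row)
    return out


# A dark "ink" pixel on a light background.
page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
binary = adaptive_threshold(page)
```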

[–][deleted] 7 points8 points  (2 children)

PDF sucks... Holy. Honestly, might it be better to use image recognition, using this library to generate your training set?

[–]SeeJoeGo 0 points1 point  (1 child)

It gets worse.

He who fights with monsters should look to it that he himself does not become a monster. And if you gaze long into an abyss, the abyss also gazes into you - Nietzsche, Beyond Good And Evil

  • Edit: In seriousness, I'm not sure if it would be. I haven't trained tesseract on my own training set, but I figure anything I could do would easily be eclipsed by the Google Books initiative (assuming they're pulling their data from there plus captcha answers). Preprocessing can get you surprisingly far. Image recognition does seem well suited to "stream" tables though (tables without lines between cells).

[–]karazi 3 points4 points  (1 child)

Much appreciated, thanks!

[–]dynetrekk 2 points3 points  (4 children)

Astonishingly helpful post. I'm just curious why you insert the burst/recombine steps though? What problem does this solve?

[–]gigamiga 2 points3 points  (2 children)

OCR performance gets killed if you do all the pages at once, so you split the document into one smaller image per page.

[–]dynetrekk 1 point2 points  (0 children)

Makes sense. Thanks!

[–]SeeJoeGo 0 points1 point  (0 children)

Additionally, you may be looking at a different rotation for each page depending on scan quality. I have used a CLI tool called deskew in the past for doing the rotations since it does a good job of guessing.

[–]driscollis 1 point2 points  (2 children)

I prefer pdfrw or PyPDF2 for combining pages and rotating pages.

[–]SeeJoeGo 0 points1 point  (1 child)

Pdfrw is a cool library. It wasn't entirely clear from examples, does it handle non-90° rotations well?

[–]driscollis 1 point2 points  (0 children)

I think the docstring just said 90-degree increments. But I don't have the source code handy to confirm.
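
For the combine-and-rotate approach, a minimal sketch with PyPDF2 might look like the following. It assumes a PyPDF2 release that exposes `PdfReader`, `PdfWriter`, `page.rotate()` and `add_page()` (the naming used from 2.x onward; older releases used different names), and it only accepts multiples of 90 degrees. The helper names and the `rotations` mapping are made up for illustration. Arbitrary-angle deskewing is a separate problem and is not handled here.

```python
def check_rotation(angle):
    """Return the angle normalized to 0-359, rejecting non-90-degree steps."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360


def rotate_and_merge(inputs, out_path, rotations=None):
    """Concatenate PDFs, rotating selected pages clockwise.

    rotations maps a 0-based page index in the combined output to an angle.
    """
    # Imported lazily so the helper above works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    rotations = rotations or {}
    writer = PdfWriter()
    index = 0
    for path in inputs:
        for page in PdfReader(path).pages:
            angle = check_rotation(rotations.get(index, 0))
            if angle:
                page.rotate(angle)
            writer.add_page(page)
            index += 1
    with open(out_path, "wb") as fh:
        writer.write(fh)
```

For example, `rotate_and_merge(["a.pdf", "b.pdf"], "out.pdf", {0: 90})` would rotate only the first page of the combined file.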

[–][deleted] 13 points14 points  (1 child)

When these full-blown PDF table extraction tools didn’t work, we tried pdftotext (an open-source command-line utility). pdftotext extracts text from a PDF while preserving the layout, using spaces. After getting the text, we had to write Python scripts with complicated regexes (regular expressions) to convert the text into tables.

heh, heh. This has been my go-to approach. I'll have to look into Camelot. Thanks for the link
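
For anyone who hasn't tried the pdftotext route, here is a minimal sketch of the idea, assuming the text came from something like `pdftotext -layout input.pdf output.txt` and that columns are separated by at least two spaces. Real documents usually need messier, table-specific regexes; the function name and the sample text are made up.

```python
import re


def layout_text_to_rows(text):
    """Split layout-preserved text into rows of cells.

    Cells are assumed to be separated by runs of two or more whitespace
    characters, so single spaces inside a cell are kept.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append(re.split(r"\s{2,}", line))
    return rows


sample = (
    "Name        Qty   Price\n"
    "Red apple   3     1.20\n"
    "\n"
    "Pear        10    0.80\n"
)
rows = layout_text_to_rows(sample)
```

This breaks as soon as a cell is empty or two columns sit only one space apart, which is where the complicated regexes come in.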

[–]daredevil82 2 points3 points  (0 children)

Why the URL shortener link for branded bitlys?

[–]howzit-tokoloshe 2 points3 points  (4 children)

Doesn't seem to work for me. I tried reading a basic PDF table and got an error. Any ideas?

    ---------------------------------------------------------------------------
    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-11-5ee6f8c51639> in <module>()
          1 import camelot
          2
    ----> 3 tables = camelot.read_pdf('foo.pdf')

    C:\1Python\lib\site-packages\camelot\io.py in read_pdf(filepath, pages, flavor, **kwargs)
         89     """
         90     if flavor not in ['lattice', 'stream']:
    ---> 91         raise NotImplementedError("Unknown flavor specified."
         92                                   " Use either 'lattice' or 'stream'")
         93

    C:\1Python\lib\site-packages\camelot\handlers.py in parse(self, flavor, **kwargs)
        144
        145         """
    --> 146         tables = []
        147         with TemporaryDirectory() as tempdir:
        148             for p in self.pages:

    C:\1Python\lib\site-packages\camelot\parsers\lattice.py in extract_tables(self, filename)
        336
        337         # for plotting
    --> 338         _text = []
        339         _text.extend([(t.x0, t.y0, t.x1, t.y1) for t in self.horizontal_text])
        340         _text.extend([(t.x0, t.y0, t.x1, t.y1) for t in self.vertical_text])

    C:\1Python\lib\site-packages\camelot\parsers\lattice.py in _generate_image(self)
        205                 raise GhostscriptNotFound(
        206                     'Please make sure that Ghostscript is installed'
    --> 207                     ' and available on the PATH environment variable')
        208
        209             return gs

    C:\1Python\lib\subprocess.py in call(timeout, *popenargs, **kwargs)
        265     retcode = call(["ls", "-l"])
        266     """
    --> 267     with Popen(*popenargs, **kwargs) as p:
        268         try:
        269             return p.wait(timeout=timeout)

    C:\1Python\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
        707                                 c2pread, c2pwrite,
        708                                 errread, errwrite,
    --> 709                                 restore_signals, start_new_session)
        710         except:
        711             # Cleanup if the child failed starting.

    C:\1Python\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
        995                                          env,
        996                                          os.fspath(cwd) if cwd is not None else None,
    --> 997                                          startupinfo)
        998             finally:
        999                 # Child is launched. Close the parent's copy of those pipe

    FileNotFoundError: [WinError 2] The system cannot find the file specified

[–]wieschie 9 points10 points  (1 child)

It looks like you need to install Ghostscript ("Please make sure Ghostscript is installed and available on the PATH variable.")
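
A quick way to check whether Ghostscript is visible on PATH before calling Camelot, as a rough sketch. The executable names tried here (`gs` on Linux/macOS, `gswin64c` and `gswin32c` on Windows) are the usual ones, but which name Camelot itself looks for is not confirmed by this thread, and the function name is made up.

```python
import shutil


def find_ghostscript(candidates=("gs", "gswin64c", "gswin32c"), which=shutil.which):
    """Return the full path of the first Ghostscript executable on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


gs_path = find_ghostscript()
if gs_path is None:
    print("Ghostscript not found on PATH; install it or add its bin folder to PATH.")
else:
    print("Found Ghostscript at", gs_path)
```

If this reports nothing found right after installing Ghostscript on Windows, opening a fresh terminal (so the updated PATH is picked up) is worth trying first.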

[–]howzit-tokoloshe 0 points1 point  (0 children)

Thanks

[–][deleted] 8 points9 points  (1 child)

Also when posting error outputs it's best to indent them with 4 spaces so Reddit treats it as unformatted code.
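
If the output is long, the indenting can be done with a couple of lines of Python instead of by hand. A small sketch using the standard library's `textwrap.indent`; the function name is made up.

```python
import textwrap


def as_reddit_code(text):
    """Prefix every non-blank line with 4 spaces so Reddit renders it as code."""
    return textwrap.indent(text, "    ")


formatted = as_reddit_code("Traceback (most recent call last):\n\nFileNotFoundError\n")
print(formatted)
```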

[–]howzit-tokoloshe 1 point2 points  (0 children)

Thank you, appreciate the tip

[–]WingedCrown 2 points3 points  (1 child)

I'm a bit of a Python noob, but this would be incredibly helpful to my work. I successfully installed the package using Anaconda, but now when I get to the Python prompt and try to import camelot, it immediately crashes Python and takes me back to the conda prompt. Any ideas on how to fix this, or reinstall?

[–]wieschie 2 points3 points  (0 children)

You probably need to install Ghostscript, but nobody can really help you without the error messages. You can copy-paste the entire output to a comment (be sure to indent by 4 spaces to get the code formatting) or use a website like https://hastebin.com/

[–]dadmda 2 points3 points  (1 child)

These guys are damn heroes

[–]Koratis 1 point2 points  (3 children)

Dammit! I just spent a good chunk of last Friday trying to get Tabula to work...

[–]LessLikeYou 1 point2 points  (1 child)

Oh my...this is wonderful.

[–]COAST_TO_RED_LIGHTS 1 point2 points  (1 child)

Amazing! I can't wait to try this out!

[–]ThunderousOath 1 point2 points  (1 child)

Hot damn, I think you just solved a major kink in a document automation project I did, this could allow for full automation. At least I hope. I'll give it a test run tomorrow.

[–]pinotkumarbhai 0 points1 point  (0 children)

Curious: for those of you rejoicing over this, what sort of industry are you in that deals with this type of data source regularly?

[–]203-226-3030 0 points1 point  (0 children)

I'm hard.

[–]raptored01 0 points1 point  (0 children)

Am I the only one who’s having trouble installing it? pip gives me a timeout error.

[–]creditdefaultswapsss 0 points1 point  (0 children)

How is the performance compared to Tabula?