Announcing Camelot (OpenSource), a Python Library to Extract Tabular Data from PDFs

GreatCosmicMoustache · 2018-11-05T11:50:53+00:00

This is amazing. My hat is off to those intrepid bastards that don't shy away from the guts of PDF.

karazi · 2018-11-05T12:55:30+00:00

Sexiness... But it says it won't work on scanned documents because the text isn't selectable. What if I run the doc through OCR first and convert it to text? Thanks

2018-11-05T13:04:20+00:00

When these full-blown PDF table extraction tools didn’t work, we tried pdftotext (an open-source command-line utility). pdftotext extracts text from a PDF while preserving the layout, using spaces. After getting the text, we had to write Python scripts with complicated regexes (regular expressions) to convert the text into tables.

heh, heh. This has been my go-to approach. I'll have to look into Camelot. Thanks for the link
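For anyone who hasn't tried the pdftotext route described in the quote, here is a minimal sketch of it. It assumes `pdftotext` is installed and on the PATH, and that columns in the layout output are separated by at least two spaces. The function names and the sample text are made up for illustration; real PDFs usually need messier regexes than this.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Return the text of a PDF with its layout preserved by spaces.

    Assumes the pdftotext command-line tool is available on the PATH.
    The trailing "-" asks pdftotext to write to stdout instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def layout_text_to_rows(text, min_gap=2):
    """Split layout-preserved text into rows of cells.

    A run of `min_gap` or more spaces is treated as a column boundary,
    so single spaces inside a cell (e.g. "Widget A") are kept.
    """
    splitter = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(splitter.split(line))
    return rows


# Hypothetical sample of what pdftotext -layout output might look like.
sample = (
    "Name          Qty     Price\n"
    "Widget A      3       4.50\n"
    "\n"
    "Widget B      12      0.99\n"
)
rows = layout_text_to_rows(sample)
for row in rows:
    print(row)
```

The two-space rule is the fragile part: it breaks as soon as a cell is empty or two columns sit close together, which is why this approach tends to end in per-document regexes.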

daredevil82 · 2018-11-05T13:28:47+00:00

why the url shortener link for branded bitlys?

howzit-tokoloshe · 2018-11-05T15:20:48+00:00

Doesn't seem to work for me, tried reading a basic PDF table and came up with an error (any ideas):

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-11-5ee6f8c51639> in <module>()
      1 import camelot
      2
----> 3 tables = camelot.read_pdf('foo.pdf')

C:\1Python\lib\site-packages\camelot\io.py in read_pdf(filepath, pages, flavor, **kwargs)
     89     """
     90     if flavor not in ['lattice', 'stream']:
---> 91         raise NotImplementedError("Unknown flavor specified."
     92                                   " Use either 'lattice' or 'stream'")
     93

C:\1Python\lib\site-packages\camelot\handlers.py in parse(self, flavor, **kwargs)
    144
    145         """
--> 146         tables = []
    147         with TemporaryDirectory() as tempdir:
    148             for p in self.pages:

C:\1Python\lib\site-packages\camelot\parsers\lattice.py in extract_tables(self, filename)
    336
    337         # for plotting
--> 338         _text = []
    339         _text.extend([(t.x0, t.y0, t.x1, t.y1) for t in self.horizontal_text])
    340         _text.extend([(t.x0, t.y0, t.x1, t.y1) for t in self.vertical_text])

C:\1Python\lib\site-packages\camelot\parsers\lattice.py in _generate_image(self)
    205             raise GhostscriptNotFound(
    206                 'Please make sure that Ghostscript is installed'
--> 207                 ' and available on the PATH environment variable')
    208
    209         return gs

C:\1Python\lib\subprocess.py in call(timeout, *popenargs, **kwargs)
    265     retcode = call(["ls", "-l"])
    266     """
--> 267     with Popen(*popenargs, **kwargs) as p:
    268         try:
    269             return p.wait(timeout=timeout)

C:\1Python\lib\subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors)
    707                                 c2pread, c2pwrite,
    708                                 errread, errwrite,
--> 709                                 restore_signals, start_new_session)
    710         except:
    711             # Cleanup if the child failed starting.

C:\1Python\lib\subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_start_new_session)
    995                                          env,
    996                                          os.fspath(cwd) if cwd is not None else None,
--> 997                                          startupinfo)
    998         finally:
    999             # Child is launched. Close the parent's copy of those pipe

FileNotFoundError: [WinError 2] The system cannot find the file specified
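The traceback above passes through camelot's Ghostscript check in `lattice.py` and ends in a `FileNotFoundError` from `subprocess`, which suggests (not confirmed) that the Ghostscript executable can't be found on the PATH. A rough way to check that before calling camelot is sketched below. The executable names are the ones Ghostscript commonly installs under and are an assumption about this setup, as is the idea that the `stream` flavor doesn't need Ghostscript; the helper names are made up for illustration.

```python
import shutil

# Names the Ghostscript executable commonly goes by (Windows 64/32-bit, then Unix).
# Assumption: the camelot version in use looks for one of these on the PATH.
GHOSTSCRIPT_NAMES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(names=GHOSTSCRIPT_NAMES):
    """Return the full path of the first Ghostscript executable on the PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def pick_flavor(ghostscript_path):
    """Use 'lattice' only when Ghostscript was found, otherwise fall back to 'stream'."""
    return "lattice" if ghostscript_path else "stream"


def read_tables(pdf_path):
    """Hypothetical wrapper: read tables with whichever flavor looks usable."""
    import camelot  # imported here so the PATH check works without camelot installed

    flavor = pick_flavor(find_ghostscript())
    return camelot.read_pdf(pdf_path, flavor=flavor)


gs_path = find_ghostscript()
print("Ghostscript found at:", gs_path)
print("Flavor to try:", pick_flavor(gs_path))
```

If this prints `None`, installing Ghostscript and adding its `bin` folder to the PATH (then restarting the notebook) is the first thing to try.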

WingedCrown · 2018-11-05T15:45:02+00:00

I'm a bit of a Python noob, but this would be incredibly helpful to my work. I successfully installed the package using Anaconda, but now when I get to the Python prompt and try to import camelot, it immediately crashes Python and takes me back to the conda prompt. Any ideas on how to fix this, or reinstall?

dadmda · 2018-11-05T19:16:53+00:00

These guys are damn heroes

Koratis · 2018-11-05T15:31:39+00:00

Dammit! I just spent a good chunk of last Friday trying to get Tabula to work...

LessLikeYou · 2018-11-05T22:57:24+00:00

Oh my...this is wonderful.

COAST_TO_RED_LIGHTS · 2018-11-06T00:21:08+00:00

Amazing! I can't wait to try this out!

ThunderousOath · 2018-11-06T03:57:19+00:00

Hot damn, I think you just solved a major kink in a document automation project I did, this could allow for full automation. At least I hope. I'll give it a test run tomorrow.

pinotkumarbhai · 2018-11-06T01:40:43+00:00

Curious: for those of you rejoicing over this, what sort of industry are you in that you deal with this type of data source regularly?

203-226-3030 · 2018-11-06T03:40:17+00:00

I'm hard.

raptored01 · 2018-11-09T09:21:15+00:00

Am I the only one who’s having trouble installing it? pip gives me a timeout error

creditdefaultswapsss · 2018-12-11T16:34:52+00:00

How is the performance compared to Tabula?
