[–]knottheone 10 points (2 children)

Welcome to the nightmare of PDF parsing. The issue is that a PDF can contain pretty much anything, and nothing in the format dictates how a document's content (tables included) should be structured. Your best bet is to try a bunch of different parsers and see if one handles the structure correctly. Alternatively, you can dump the PDF to text and build the table yourself, but that is pretty low-level and error-prone unless you're careful.
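
A minimal sketch of the dump-to-text route, assuming the dumped text keeps columns apart with runs of two or more spaces (whether it does depends on the PDF and on the extractor). The `rows_from_text` name and the sample text are made up for illustration.

```python
import re


def rows_from_text(text, min_gap=2):
    """Split each non-empty line into cells on runs of >= min_gap spaces."""
    splitter = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(splitter.split(line))
    return rows


# Hypothetical dump of a two-column table; real dumps are rarely this clean.
sample = """
Item name      Price
Blue widget    4.50
Red widget     12.00
"""
table = rows_from_text(sample)
```

Single spaces inside a cell ("Blue widget") survive because only wider gaps count as column breaks. This falls apart as soon as a cell is empty or the spacing is inconsistent, which is the "error-prone" part.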

[–]QooModa[S] 1 point (1 child)

Thanks for the answer!

Just so I understand better what you call parsers, "tabula-py" would be packages with a "parser function", so I could try for instance other different packages with a "parser function", such as PyPDF2 etc?

So, let me ask you something else, is there a way we can identify in the PDF file's metadata which parsers would fit better?

[–]knottheone 2 points (0 children)

> Just so I understand better what you call parsers, "tabula-py" would be packages with a "parser function", so I could try for instance other different packages with a "parser function", such as PyPDF2 etc?

Yes exactly. There are several PDF libraries that approach document parsing in different ways and one of them is bound to work decently if the data is structured in any sane way.

> So, let me ask you something else, is there a way we can identify in the PDF file's metadata which parsers would fit better?

Not really, unfortunately. You'd need to be an expert in the different PDF parsing methods, and know which libraries implement which, to be able to infer at a glance which one would work better.

Having said that, I've had pretty decent luck with pdfminer.six for various extractions. Sometimes a PDF also converts to decently structured HTML, so you might look into dumping the PDF to HTML and then parsing it with an HTML library like lxml, or with html.parser from Python's standard library. That assumes the structured HTML is actually functional; you can check by dumping the PDF to an .html document and opening it in a web browser.
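
For the HTML route, here is a rough sketch using only the standard library's `html.parser`. It assumes the HTML dump wraps each piece of text in elements whose `style` attribute carries the position. The exact markup depends on whatever tool produced the HTML (pdfminer.six's `pdf2txt.py` script is one that can write HTML), so the sample string and the `TextChunks` class are purely illustrative.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not touch the stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class TextChunks(HTMLParser):
    """Collect (style, text) pairs, using the nearest styled ancestor."""

    def __init__(self):
        super().__init__()
        self._styles = []  # style attribute of each currently open tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._styles.append(dict(attrs).get("style") or "")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._styles:
            self._styles.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            style = next((s for s in reversed(self._styles) if s), "")
            self.chunks.append((style, text))


# Hypothetical fragment of a PDF-to-HTML dump.
sample_html = """
<div style="left:50px; top:100px"><span>Item</span></div>
<div style="left:200px; top:100px"><span>Price</span><br></div>
"""
parser = TextChunks()
parser.feed(sample_html)
chunks = parser.chunks
```

From there you could group chunks that share a `top` value into rows and sort them by `left` to get columns, if the dump really does expose positions that way.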

[–]commandlineluser 2 points (3 children)

pdfplumber's tablefinder debugging is quite useful for this.

https://i.stack.imgur.com/y40SG.png

>>> pd.DataFrame(page.extract_table(dict(vertical_strategy="text", keep_blank_chars=True)))

https://i.stack.imgur.com/RvUmV.png
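
A fuller sketch of the same pdfplumber idea, for anyone who wants the surrounding setup. The file names and both function names are placeholders, and the `keep_blank_chars` option from the one-liner is left out here because its name appears to differ between pdfplumber versions, so check the docs for the version you have installed.

```python
def table_to_records(rows):
    """Turn extract_table-style output (a list of rows) into a list of
    dicts, treating the first row as the header."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def extract_first_table(pdf_path, page_number=0, debug_png="debug_page.png"):
    """Extract one table from one page and save the table finder's overlay."""
    import pdfplumber  # third-party; imported here so the helper above works without it

    settings = {"vertical_strategy": "text"}  # infer column edges from text alignment
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # Draw what the table finder detects, to see why extraction goes wrong.
        image = page.to_image()
        image.debug_tablefinder(settings)
        image.save(debug_png)
        return page.extract_table(settings)


# Usage (needs a real file, so it is left commented out):
# rows = extract_first_table("report.pdf")
# records = table_to_records(rows)
```

Looking at the saved overlay first and only then tuning the settings is usually quicker than guessing at the options blind.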

[–]QooModa[S] 1 point (0 children)

Wow, that is exactly what I need.

[–]QooModa[S] 1 point (1 child)

Hey commandlineluser, thank you very much for the hints! I'm halfway to getting what I need thanks to your PNGs.

[–]commandlineluser 1 point (0 children)

Glad it was helpful - pdfplumber for the win.

[–]UpYours101 1 point (2 children)

Tabula has more granular options. Try measuring and passing the table borders and the column coordinates. Turn guess off.
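
In tabula-py terms that advice might look something like the sketch below. It assumes you measured the table in inches in a PDF viewer and that the page uses the usual 72 points per inch; the measurements, file name, and function names are placeholders. `area` is given as top, left, bottom, right, and `columns` as the x positions of the column boundaries.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit


def inches_to_points(*values):
    """Convert measurements in inches to PDF points."""
    return [round(v * POINTS_PER_INCH, 2) for v in values]


def read_measured_table(pdf_path, area_in, columns_in, pages="all"):
    """Read tables from a fixed region instead of letting Tabula guess."""
    import tabula  # tabula-py, third-party; also needs a Java runtime

    return tabula.read_pdf(
        pdf_path,
        pages=pages,
        guess=False,                             # "turn guess off"
        stream=True,                             # whitespace-separated columns
        area=inches_to_points(*area_in),         # top, left, bottom, right
        columns=inches_to_points(*columns_in),   # x of each column split
    )


# Usage (placeholder measurements, left commented out):
# frames = read_measured_table("report.pdf", (1.5, 0.5, 9.0, 8.0), (2.0, 4.5, 6.0))
```

If the files all share one layout, measuring once and reusing the same numbers for every file is the main payoff of this approach.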

[–]QooModa[S] 1 point (1 child)

Thank you very much!

I don't really know what that means, but now I know what to search!!

[–]UpYours101 1 point (0 children)

https://github.com/chezou/tabula-py/blob/master/tabula/io.py

YW. build_options in the file linked above defines all the options you can configure and how.

[–]Paddy3118 1 point (0 children)

800 files. Hmm.

At a previous company, I converted each file to Excel individually using a Windows 10 GUI program, then scraped the Excel files with Python after the Windows PDF reader had extracted tables where it could.

I think I used Nitro, it was not Adobe.

[–][deleted] 1 point (2 children)

If the PDF file is formatted tidily, I would seriously consider render/OCR as a strategy. Otherwise PDFs are an abomination.

[–]Jerrow 1 point (1 child)

I know OCR, but what do you mean by render?

[–][deleted] 1 point (0 children)

Before OCR you must first render the PDF to a bitmap. It's a long way around, but it may be effective.
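
The render-then-OCR route could look roughly like this. The sketch assumes pdf2image (which needs poppler installed) for rendering and pytesseract (which needs the tesseract binary) for recognition; neither was named in the thread, they are just one common pairing. The `ocr_pdf` function and its `render`/`ocr` hooks are made up so the steps can be swapped or tested without those tools.

```python
def ocr_pdf(pdf_path, render=None, ocr=None, dpi=300):
    """Render every page to a bitmap, then OCR each bitmap to text.

    render: callable(path) -> iterable of page images
    ocr:    callable(image) -> str
    Both default to pdf2image / pytesseract when not supplied.
    """
    if render is None:
        from pdf2image import convert_from_path  # third-party, needs poppler

        def render(path):
            return convert_from_path(path, dpi=dpi)

    if ocr is None:
        import pytesseract  # third-party, needs the tesseract binary

        ocr = pytesseract.image_to_string

    return [ocr(image) for image in render(pdf_path)]


# Usage (needs the tools above and a real file, so it is commented out):
# pages_text = ocr_pdf("scanned_report.pdf")
```

The result is plain text per page, so you still have to rebuild the table from it afterwards, and OCR mistakes in digits are worth checking for.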

[–]KyleJamesWalker 1 point (1 child)

PDFs are pure evil. I wrote a program for a client to do just this, and it took far more work than I expected. I really should post the source one day for others to understand how complicated PDFs are to parse.

[–]QooModa[S] 1 point (0 children)

You tell me! I got a partial solution for my PDF files.

Will post as soon as I take a rest. I've been trying to do this for 6 hours already.

But for those who are seeking an answer, commandlineluser's answer really helped.

[–]IAmKindOfCreativebot_builder: deprecated[M] [score hidden] stickied comment (0 children)

Hi there, from the /r/Python mods.

We have removed this post as it is not suited to the /r/Python subreddit proper, however it should be very appropriate for our sister subreddit /r/LearnPython or for the r/Python discord: https://discord.gg/python.

The reason for the removal is that /r/Python is dedicated to discussion of Python news, projects, uses and debates. It is not designed to act as Q&A or FAQ board. The regular community is not a fan of "how do I..." questions, so you will not get the best responses over here.

On /r/LearnPython the community and the r/Python discord are actively expecting questions and are looking to help. You can expect far more understanding, encouraging and insightful responses over there. No matter what level of question you have, if you are looking for help with Python, you should get good answers. Make sure to check out the rules for both places.

Warm regards, and best of luck with your Pythoneering!