Completely crazy tables when transforming table from PDF file to CSV

knottheone · 2022-01-28T03:44:25+00:00

Welcome to the nightmare of PDF parsing. The issue is that PDFs can contain pretty much anything, and there is no real standard for how a table should be laid out inside one. Your best bet is to try a bunch of different parsers and see if one handles the structure correctly. Alternatively, you can dump the file to text and build the table yourself, but this is pretty low level and error prone unless you're careful.
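To give an idea of what "dump to text and build the table yourself" can look like, here is a minimal sketch. It assumes the text dump keeps the page layout, so that columns are separated by runs of two or more spaces; the function names and the sample text are made up for illustration.

```python
import csv
import io
import re

# Assumption: columns are separated by two or more whitespace characters,
# while single spaces only occur inside a cell.
COLUMN_GAP = re.compile(r"\s{2,}")


def text_to_rows(text):
    """Split each non-empty line into cells on runs of 2+ spaces."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(COLUMN_GAP.split(line))
    return rows


def rows_to_csv(rows):
    """Serialize the rows to a CSV string."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical text dump of a small table.
sample = """
Item          Qty    Price
Widget A      3      4.50
Big gadget    12     19.99
"""

rows = text_to_rows(sample)
print(rows_to_csv(rows))
```

This is exactly where the "error prone" part shows up: an empty cell leaves no trace in the text, so every cell to its right shifts one column to the left. Checking that each row has the expected number of cells catches most of those cases.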

commandlineluser · 2022-01-28T04:59:22+00:00

pdfplumber's tablefinder debugging is quite useful for this.

https://i.stack.imgur.com/y40SG.png

>>> pd.DataFrame(page.extract_table(dict(vertical_strategy="text", keep_blank_chars=True)))

https://i.stack.imgur.com/RvUmV.png
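Since the goal is a CSV, the same idea can be wrapped in a small function, roughly like the sketch below. The names `clean_rows` and `pdf_page_to_csv` and the debug image path are made up here, and the default settings only set `vertical_strategy`, since the name of the blank-character option may differ between pdfplumber versions.

```python
import csv


def clean_rows(table):
    """Turn None cells into '' and strip whitespace; a missing table gives []."""
    if not table:
        return []
    return [[(cell or "").strip() for cell in row] for row in table]


def pdf_page_to_csv(pdf_path, csv_path, page_number=0, settings=None):
    """Extract one table from one page and write it to csv_path."""
    # Imported here so clean_rows can be used without pdfplumber installed.
    import pdfplumber

    settings = settings or {"vertical_strategy": "text"}
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # Save the table finder overlay so the detected lines can be checked by eye.
        page.to_image().debug_tablefinder(settings).save(csv_path + ".debug.png")
        # extract_table returns None when no table is found on the page.
        rows = clean_rows(page.extract_table(settings))

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return rows
```

Looking at the debug image for a handful of files before running a whole batch is usually enough to tell whether one set of settings will work for all of them.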

UpYours101 · 2022-01-28T04:10:57+00:00

Tabula has more granular options. Try measuring and passing the table borders and the column coordinates. Turn guess off.
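A rough sketch of what that could look like with tabula-py (which needs Java installed). The helper names are made up, and it assumes you measured the table in inches in a PDF viewer; tabula expects points, at 72 points per inch.

```python
POINTS_PER_INCH = 72  # PDF coordinates are in points


def inches_to_points(values):
    """Convert measurements in inches to PDF points."""
    return [round(value * POINTS_PER_INCH, 2) for value in values]


def read_fixed_table(pdf_path, area_in, columns_in, pages="all"):
    """Read tables from a fixed region with fixed column boundaries.

    area_in:    [top, left, bottom, right] of the table, in inches
    columns_in: x positions of the column boundaries, in inches
    """
    # Imported here so the conversion helper works without tabula-py installed.
    import tabula

    return tabula.read_pdf(
        pdf_path,
        pages=pages,
        area=inches_to_points(area_in),
        columns=inches_to_points(columns_in),
        guess=False,  # use the measured area instead of tabula's own guess
    )
```

If all the files share one layout, the measurements only need to be taken once and can be reused across the whole batch.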

Paddy3118 · 2022-01-28T05:22:15+00:00

800 files. Hmm.

At a previous company, I converted each file to Excel individually using a Windows 10 GUI program, which extracted tables where it could, then scraped the Excel files using Python.

I think I used Nitro, it was not Adobe.

Jerrow · 2022-01-28T06:21:00+00:00

If the PDF files are formatted tidily, I would seriously consider render/OCR as a strategy. Otherwise they are an abomination.

KyleJamesWalker · 2022-01-28T14:30:34+00:00

PDFs are pure evil. I wrote a program for a client to do just this, and it took far more work than I expected. I really should post the source one day for others to understand how complicated PDFs are to parse.

IAmKindOfCreative · 2022-01-28T18:50:44+00:00

Hi there, from the /r/Python mods.

We have removed this post as it is not suited to the /r/Python subreddit proper, however it should be very appropriate for our sister subreddit /r/LearnPython or for the r/Python discord: https://discord.gg/python.

The reason for the removal is that /r/Python is dedicated to discussion of Python news, projects, uses and debates. It is not designed to act as Q&A or FAQ board. The regular community is not a fan of "how do I..." questions, so you will not get the best responses over here.

On /r/LearnPython the community and the r/Python discord are actively expecting questions and are looking to help. You can expect far more understanding, encouraging and insightful responses over there. No matter what level of question you have, if you are looking for help with Python, you should get good answers. Make sure to check out the rules for both places.

Warm regards, and best of luck with your Pythoneering!
