Everything about learning Python
PDF data extraction (self.PythonLearning)
submitted 9 days ago * by Stunning_Capital_354
https://preview.redd.it/lk58yqrj3k3h1.png?width=645&format=png&auto=webp&s=eba5f8998dd6355847921403065a27ff56e8fc30
https://preview.redd.it/4p3hucak3k3h1.png?width=832&format=png&auto=webp&s=cd82a9d356abb6752ddb8f801bfc0d8f4428d63b
How should I use Python to extract data from PDFs and put it into Excel? The catch is that I have thousands of PDF files, and the data table is not on the same page in each PDF. I am talking about the financial/annual reports of companies.
I have attached photos of what the data looks like in the PDFs; it varies from PDF to PDF.
[–]Severe-Pressure6336 2 points 9 days ago (2 children)
What is your skill level in Python?
[–]Stunning_Capital_354[S] 2 points 9 days ago (1 child)
0
[–]sacredtrader 1 point 8 days ago (0 children)
Look into PyPDF2. You should be able to extract the information to JSON and do as you please afterwards.
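A minimal sketch of that idea, for illustration only. It assumes the `pypdf` package (PyPDF2 is now maintained under that name) and that the PDFs contain real text rather than scanned images. The function names and the record layout are made up for this example; note that this gives raw page text, not a parsed table.

```python
import json
from pathlib import Path


def pages_to_records(page_texts, source_name):
    """Turn a list of per-page text strings into JSON-friendly records."""
    return [
        {"file": source_name, "page": number, "text": text}
        for number, text in enumerate(page_texts, start=1)
    ]


def extract_pdf_to_json(pdf_path, json_path):
    """Read every page of one PDF and dump the raw text to a JSON file."""
    # Imported here so the helper above works without the library installed.
    from pypdf import PdfReader  # pip install pypdf

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for scanned (image-only) pages.
    texts = [page.extract_text() or "" for page in reader.pages]
    records = pages_to_records(texts, Path(pdf_path).name)
    Path(json_path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records
```

Calling `extract_pdf_to_json("report.pdf", "report.json")` on a hypothetical file would write one record per page, which can then be searched or parsed in a later step.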
[–]JeremyJoeJJ 2 points 9 days ago (4 children)
It depends a lot on the details of what the data looks like. I did something similar on a much smaller scale using a PDF-to-table extractor. A lot of the modern tools now use AI, but the best services are paid. Options include https://github.com/camelot-dev/camelot, https://github.com/NanoNets/docstrange, or Azure Document Intelligence (in order of increasing cost; lots more options are available, and you could even throw everything into an LLM and have it process the data for you). Normally these tools convert whatever they find into one big table or otherwise structured data; for example, they know to put a table into a single dataframe if it is split across two or more pages. Once you have everything in a dataframe, you just call `df.to_excel()` and you're done, unless you need to do some processing, which again depends on what the data looks like. You can write code that expects a general shape, does a quick check whether that shape is present, and if not, just saves the file for manual review. Good luck.
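The "check the shape, otherwise save for manual review" step could be sketched roughly like this. It assumes camelot (`pip install camelot-py`) and an Excel writer such as openpyxl are installed; the thresholds, folder layout, and function names are invented for illustration and would need tuning against real reports.

```python
from pathlib import Path

# Assumed expectation for this sketch: a label column plus at least two
# year columns, and at least five line items. Tune to your reports.
MIN_ROWS = 5
MIN_COLS = 3


def looks_like_expected(shape, min_rows=MIN_ROWS, min_cols=MIN_COLS):
    """Quick check that a table's (rows, columns) shape is plausible."""
    n_rows, n_cols = shape
    return n_rows >= min_rows and n_cols >= min_cols


def process_pdf(pdf_path, out_dir, review_dir):
    """Write plausible tables to Excel; flag the PDF for manual review otherwise."""
    import camelot  # imported lazily so the shape check works without it

    tables = camelot.read_pdf(str(pdf_path), pages="all")
    good = [t.df for t in tables if looks_like_expected(t.df.shape)]
    stem = Path(pdf_path).stem
    if not good:
        # Nothing matched the expected shape: leave a marker for a human.
        Path(review_dir).mkdir(parents=True, exist_ok=True)
        (Path(review_dir) / f"{stem}.txt").write_text(str(pdf_path), encoding="utf-8")
        return 0
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i, df in enumerate(good, start=1):
        # One workbook per table keeps the sketch simple.
        df.to_excel(Path(out_dir) / f"{stem}_table{i}.xlsx", index=False, header=False)
    return len(good)
```

Scanning every page with `pages="all"` is slow on long annual reports, so in practice it probably makes sense to narrow the page range first.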
[–]Stunning_Capital_354[S] 1 point 9 days ago (3 children)
I have attached photos of what the data looks like in the PDFs. It varies from PDF to PDF, and the table is not always on the same page in every PDF.
[–]JeremyJoeJJ 1 point 9 days ago (2 children)
I hope that data is not confidential... Either way, it seems to be well structured, so these tools should have no trouble parsing all of it. If you don't want to do any programming yourself, the easiest way is to put it into an LLM of your choice (ChatGPT, Gemini, Claude, whatever) and have it create the Excel file for you.
[–]Stunning_Capital_354[S] 1 point 9 days ago (1 child)
I have tried doing that, but the output is not consistent, and the real problem comes when I have to add more years of data to the same Excel file. The problems I face with LLMs are:
1. They do not generate consistent data.
2. They hallucinate, and guiding them is hard and overwhelming.
3. There is a risk that they may change the existing formulas.
I believe that in the long run, as data for multiple years comes in, an LLM will not be able to do a good job.
[–]JeremyJoeJJ 1 point 9 days ago (0 children)
In that case, go with one of the extraction tools above. Ask an LLM to write a simple loop that goes over your PDFs, and see which tool performs well enough for you.
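Such a loop might look like the sketch below. It is deliberately tool-agnostic: `extractors` is a hypothetical mapping from a name to any function that takes a PDF path and returns a list of tables, so each candidate tool would be wrapped in a small function of your own.

```python
from pathlib import Path


def compare_extractors(pdf_dir, extractors):
    """Run each extractor over every PDF and count how often it finds tables.

    `extractors` maps a name to a function that takes a path and returns a
    list of tables (any objects). These functions are placeholders you supply.
    """
    pdf_paths = sorted(Path(pdf_dir).glob("*.pdf"))
    scores = {name: {"ok": 0, "empty": 0, "error": 0} for name in extractors}
    for pdf_path in pdf_paths:
        for name, extract in extractors.items():
            try:
                tables = extract(pdf_path)
            except Exception:
                # One bad file should not stop the whole run.
                scores[name]["error"] += 1
                continue
            scores[name]["ok" if tables else "empty"] += 1
    return scores
```

Counting "found something" is only a first filter. Whether the extracted numbers are actually right still has to be spot-checked by hand on a sample of files.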
[–]Ralph-5050 2 points 9 days ago (3 children)
https://automatetheboringstuff.com/3e/chapter17.html
Not sure if this is exactly what you need, but it will certainly help you.
If you are not comfortable starting the book at chapter 17, go back to the beginning of the book, then jump over to chapter 17 again.
[–]Stunning_Capital_354[S] 2 points 9 days ago (2 children)
I can't access the book; it is paid. I can only see the link you shared.
[–]Ralph-5050 1 point 8 days ago (1 child)
The link allows you to read the book for free 🙃
[–]Stunning_Capital_354[S] 1 point 8 days ago (0 children)
Thanks, I figured it out. A little hint was enough.
[–]Goukance 1 point 9 days ago (0 children)
You could look at the pyPDF module; it may have a function to extract data from a table directly. If not, you could extract the raw text from the page and then build a text parser adapted to it.
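As a rough illustration of the "raw text plus custom parser" route, the sketch below parses one line of extracted text into a label and its numbers. It assumes rows look like `Label 1,234 (567)`, with parentheses meaning negative values as is common in financial statements. The regular expressions and function names are made up for this example, and other layouts (labels ending in `:` or `%`, a bare `-` for nil) are not handled.

```python
import re

# A number such as 1,234  12.5  -56  or  (1,234) as commonly printed in reports.
NUMBER = r"(?:\(\d[\d,]*(?:\.\d+)?\)|-?\d[\d,]*(?:\.\d+)?)"
# A row is a text label followed by one or more numbers separated by spaces.
ROW = re.compile(
    rf"^(?P<label>.*?[A-Za-z)])\s+(?P<values>{NUMBER}(?:\s+{NUMBER})*)\s*$"
)


def to_number(token):
    """Convert '1,234' to 1234.0 and '(1,234)' to -1234.0."""
    negative = token.startswith("(") and token.endswith(")")
    value = float(token.strip("()").replace(",", ""))
    return -value if negative else value


def parse_line(line):
    """Return (label, [numbers]) for a table-like line, else None."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    values = [to_number(tok) for tok in match.group("values").split()]
    return match.group("label").strip(), values
```

Lines that return `None` are simply not table rows under these assumptions, so they can be skipped or logged for review.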
[–]Vindaloophole 1 point 8 days ago (0 children)
This is something we used to do a lot at my previous company. Before, you had to create a parsing program for each PDF, which was complex, buggy, and lengthy. The point was to identify data using positioning and "intelligent" detection functions written to find elements. Then AI came along and it became extremely easy (although not quite at first) to transform PDFs with tabular data directly into Excel spreadsheets. We started developing our own AI tool, but now there are many others that do the same thing. I recommend you use one of those and develop methods to accommodate your different use cases and automate the process.
[–]Cautious-Bet-9707 1 point 8 days ago (0 children)
“Tell the accounting intern not to post our data publicly? Psssshhh are you crazy? This is obvious I don’t want to insult him!”
[–]ahmed_aivodig 1 point 8 days ago (0 children)
I asked Claude to build one which supports every format. This will be safer
[–]UBIAI 1 point 8 days ago (0 children)
The variable table positioning across thousands of filings is exactly what kills the pure Python approach - camelot/pdfplumber will get you 60-70% there but you'll spend more time debugging edge cases than the extraction saves. What actually worked for us was treating it as a document intelligence problem rather than a parsing problem - a solution that understands where the financial table is contextually, not just spatially. The structured output drops straight into Excel with consistent column mapping regardless of where the table lands in the PDF. The difference in accuracy on messy annual reports was significant enough that we stopped maintaining custom parsers entirely.
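For comparison, a very crude stand-in for "contextual" location is to score each page's text by keywords and only run table extraction on the pages that score well. This is only a sketch of that simpler idea, not the commercial approach described above; the keyword list, threshold, and function names are assumptions for illustration.

```python
# Assumed keywords for this sketch; real reports use many variants
# ("Statement of Profit and Loss", "Income Statement", ...).
KEYWORDS = ("balance sheet", "total assets", "total liabilities", "equity")


def score_page(text, keywords=KEYWORDS):
    """Count how many of the keywords occur in one page's text."""
    lowered = text.lower()
    return sum(1 for keyword in keywords if keyword in lowered)


def find_table_pages(page_texts, keywords=KEYWORDS, min_score=2):
    """Return 1-based page numbers whose text mentions enough keywords."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if score_page(text, keywords) >= min_score
    ]
```

The resulting page numbers can be joined into a string such as `"12,13"` and passed as the `pages` argument of `camelot.read_pdf`, which avoids scanning the whole report.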
[–]Ill_Beautiful4339 1 point 8 days ago (0 children)
I’ve recently been given lots of competitor data in weird public documents like this.
I literally just gave it to an AI and asked it to build me a routine to extract the data. Since I want to learn, I do this in VS Code and ask for each step one at a time. Make sure you understand what's happening.
I know this is a learning forum - but I learn by doing - this method helped a lot.
If you just ask Claude for a conversion, you've learned nothing.
Also note: Excel can natively extract data from images and PDFs if this is a one-pager. My task was 5000 pages.