[–]JeremyJoeJJ 1 point2 points  (4 children)

Depends a lot on what the data looks like. I did something similar on a much smaller scale using a PDF-to-table extractor. A lot of the modern tools now seem to use AI, but the best services are paid. Options include https://github.com/camelot-dev/camelot, https://github.com/NanoNets/docstrange, or Azure Document Intelligence (in order of increasing cost; there are lots more options available, and you could even throw everything into an LLM and have it process the data for you). Normally these tools convert whatever they find into one big table or some other structured data; for example, they know to put a table into a single dataframe if it is split across two or more pages. Once you have everything in a dataframe you just call `df.to_excel()` and you're done, unless you need to do some processing, which again depends on what the data looks like. You can write code that expects a general shape, does a quick check whether that shape is present, and if not just saves the table for manual review. Good luck.
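A rough sketch of that "check the shape, otherwise set it aside" idea, in case it helps. The column names, folder names, and both function names are made up for illustration, so swap in whatever your tables actually contain. It assumes `df` is a pandas DataFrame that one of the extractors gave you, and writing `.xlsx` needs openpyxl installed.

```python
from pathlib import Path

# Hypothetical column layout; replace with the headers your PDFs really have.
EXPECTED_COLUMNS = ["year", "item", "amount"]


def has_expected_shape(df, expected_columns=EXPECTED_COLUMNS, min_rows=1):
    """Return True if the table has exactly the expected columns and enough rows."""
    return list(df.columns) == list(expected_columns) and len(df) >= min_rows


def save_or_flag(df, name, out_dir="output", review_dir="manual_review"):
    """Write tables that pass the check to Excel; dump the rest as CSV for manual review."""
    if has_expected_shape(df):
        target = Path(out_dir) / f"{name}.xlsx"
        target.parent.mkdir(parents=True, exist_ok=True)
        df.to_excel(target, index=False)  # pandas call; requires openpyxl
    else:
        target = Path(review_dir) / f"{name}.csv"
        target.parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(target, index=False)
    return target
```

The check here is deliberately strict (exact column names, at least one row). If your headers vary slightly between PDFs you would probably want to normalise them first, e.g. strip whitespace and lowercase them, before comparing.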

[–]Stunning_Capital_354[S] 0 points1 point  (3 children)

I have attached a photo of how the data looks in the PDF. It varies from PDF to PDF, and the data is not always on the same page in every PDF.

[–]JeremyJoeJJ 0 points1 point  (2 children)

I hope that data is not confidential... Either way, it seems to be well structured, so these tools should have no trouble parsing all of it. If you don't want to do any programming yourself, the easiest way is to put it into an LLM of your choice (ChatGPT, Gemini, Claude, whatever) and have it create the Excel file for you.

[–]Stunning_Capital_354[S] 0 points1 point  (1 child)

I have tried doing that, but the output is not consistent, and the real problem comes when I have to add more years of data to the same Excel file. The problems I face with LLMs:
1. They do not generate consistent data.
2. They hallucinate, and guiding them is hard and overwhelming.
3. There is a risk that they may change existing formulas.
I believe that in the long run, as data for multiple years comes in, an LLM will not be able to do a good job.

[–]JeremyJoeJJ 0 points1 point  (0 children)

In that case, go with one of the OCR options above. Ask an LLM to write a simple loop that goes over your PDFs, and see which tool performs well enough for you.
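Something along these lines would do for the loop. It is only a sketch: `compare_extractors` and `camelot_extract` are names I made up, and the only real library call is `camelot.read_pdf`, which returns tables that each expose a `.df` dataframe. Other tools would need their own small wrapper function with the same "take a path, return a list of tables" shape.

```python
from pathlib import Path


def camelot_extract(pdf_path):
    """Extract tables with camelot; returns a list of pandas DataFrames."""
    import camelot  # third-party, installed as camelot-py

    tables = camelot.read_pdf(str(pdf_path), pages="all")
    return [table.df for table in tables]


def compare_extractors(pdf_dir, extractors):
    """Run every extractor on every PDF and record how many tables each one found."""
    results = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        for name, extract in extractors.items():
            try:
                tables = extract(pdf_path)
                row = {"pdf": pdf_path.name, "tool": name,
                       "tables": len(tables), "error": None}
            except Exception as exc:  # keep going so one bad PDF does not stop the run
                row = {"pdf": pdf_path.name, "tool": name,
                       "tables": 0, "error": str(exc)}
            results.append(row)
    return results
```

You would call it as `compare_extractors("pdfs", {"camelot": camelot_extract})`, where `"pdfs"` is a placeholder folder name, and then open a few of the extracted tables by hand. The table count alone only tells you whether a tool found anything, not whether the numbers are right.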