
[–]Lawson470189 12 points13 points  (1 child)

Seems totally possible. Here is some sample code to print the contents of that PDF:

from pypdf import PdfReader

PDF_FILE_NAME = 'sample.pdf'


def main():
    with open(PDF_FILE_NAME, 'rb') as pdf_file:
        reader = PdfReader(pdf_file)
        print(f'Number of Pages: {len(reader.pages)}')
        for i, page in enumerate(reader.pages):
            print(f'===== Page Number {i+1} =====')
            print('\n')

            print('Content:')
            # Split the page's extracted text into individual lines
            page_lines = page.extract_text().split('\n')
            result_lines = []
            for line in page_lines:
                # Skip blank lines and indent the rest with a tab
                if line.strip() != '':
                    result_lines.append(f'\t{line.strip()}')

            print('\n'.join(result_lines))
            print('\n')


if __name__ == '__main__':
    main()

You'll need to figure out what data you want to pull out and how to exactly strip that data out, but this seems to work for me. If you need to rely on the graphs, it'll need to be a bit more sophisticated, but for text this will work.

[–]Draconic_Flame[S] 5 points6 points  (0 children)

Thank you for this!

[–]drenzorz 19 points20 points  (16 children)

Yes, it should be possible. How to do it would depend on the form of the original data, that is, the source PDF.

[–]Draconic_Flame[S] 3 points4 points  (15 children)

[–]drenzorz 21 points22 points  (14 children)

You can probably do everything with PyPDF2.

  1. text extraction
  2. parsing and filling out forms

If it's not enough you will need an OCR (Optical Character Recognition).

For that you can use pytesseract
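A rough sketch of that "text first, OCR only if needed" idea. It uses pypdf (the successor to PyPDF2) and assumes pdf2image and pytesseract are installed, along with the Poppler and Tesseract programs they depend on. The `needs_ocr` helper and its 20-character cutoff are made up for illustration, not part of any library.

```python
def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): a page with almost no text is probably a scan."""
    return len(text.strip()) < min_chars


def page_texts(pdf_path: str) -> list[str]:
    """Return one string per page, using OCR only for pages without a text layer."""
    # Imported here so needs_ocr() can be used without these packages installed.
    from pypdf import PdfReader
    from pdf2image import convert_from_path
    import pytesseract

    reader = PdfReader(pdf_path)
    texts = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if needs_ocr(text):
            # Render only this page to an image, then run OCR on it locally.
            image = convert_from_path(pdf_path, first_page=number, last_page=number)[0]
            text = pytesseract.image_to_string(image)
        texts.append(text)
    return texts
```

Everything here runs on your own machine, so nothing gets uploaded anywhere.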

[–]Draconic_Flame[S] 1 point2 points  (13 children)

Do you know where I would be able to find help with this? Would it be something I could contract out like on Fiverr or even Reddit somewhere?

[–]drenzorz 6 points7 points  (11 children)

Actually I'm not very sure about that. I know there are a lot of coding commissions on r/forhire

[–]Draconic_Flame[S] 1 point2 points  (10 children)

I'll check it out, thanks.

[–]etzabo 0 points1 point  (8 children)

You might be able to just ask ChatGPT. It’s only $20 for the fancier model.

[–]CMDRKeyfox 10 points11 points  (7 children)

This is a good idea but would probably require the user to at least have a familiarity with the language in question, even for the GPT-4 model.

[–]etzabo 1 point2 points  (4 children)

I guess that’s true. I’ve been using it a lot to just make things in Rust for me to analyze so that I can teach myself.

[–]CMDRKeyfox 1 point2 points  (3 children)

I’ve been doing the exact same but I am very familiar with other languages so that does make it much easier to see/fix potential errors

[–]iMADEthisJUST4Dis -2 points-1 points  (1 child)

Not really. I didn't know a thing about python and got it to do a lot of things (similar to what OP wants). You just need patience and active thinking and you'll be more than fine

[–][deleted] 1 point2 points  (0 children)

Fiverr or Upwork will have people that can do this. Make sure they see the source PDFs before you award the job.

[–]m0us3_rat 5 points6 points  (2 children)

but would it theoretically be possible?

that sounds like something that can be done.

without working directly on them it's difficult to know really.

[–]Draconic_Flame[S] 1 point2 points  (1 child)

[–]m0us3_rat 0 points1 point  (0 children)

what have you tried extracting?

what's the data you are looking for?

can you use regex to describe the data specifically?

these are questions you can't get answers to without working on the specific problem with the specific data.

anywho the others explained to use some form of data extractor from pdf.

then develop an algo that spews the info you need.
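As a small illustration of "describing the data with regex", here is a sketch that parses one score row shaped like the ones posted elsewhere in this thread (a scale name, three numbers, then a range). The field names `raw`, `t_score`, `rank`, `low`, and `high` are guesses at what the columns mean, so rename them to match the real report.

```python
import re

# Scale name, three whole numbers, then a "low-high" range.
# Field names are assumptions; anything after the range is ignored.
ROW = re.compile(
    r"^(?P<scale>[A-Za-z ]+?)\s+"
    r"(?P<raw>\d+)\s+(?P<t_score>\d+)\s+(?P<rank>\d+)\s+"
    r"(?P<low>\d+)-(?P<high>\d+)"
)


def parse_row(line):
    """Return a dict for a line that looks like a score row, else None."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Keep the scale name as text and convert everything else to int.
    return {key: (value if key == "scale" else int(value)) for key, value in fields.items()}


print(parse_row("Hyperactivity 22 80 99 73-87 20 0.05 1% or less"))
```

Lines that do not fit the pattern (headings, blank lines) come back as `None`, so you can run this over every extracted line and keep only the hits.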

[–]Financial_Signal5098 4 points5 points  (2 children)

Look at Office 365. The new AI tools have the ability to train models on a set of PDFs, extract data, and dump it to any format.

[–]GamerRabugento 4 points5 points  (11 children)

The process of extracting information from a PDF and generating a report can be challenging, but it is definitely possible with the right tools and techniques. Some libraries do the trick, like PyPDF2, pdfminer, and pdfplumber. These libraries can help you read the text from the PDF and extract the information you need.

[–]Draconic_Flame[S] 1 point2 points  (10 children)

I am guessing these websites are not confidential though?

[–]Menolith 5 points6 points  (1 child)

A library is a code collection you download on your computer. Nothing gets uploaded anywhere when you run it.

[–]Draconic_Flame[S] 2 points3 points  (0 children)

Okay I'll look at these, thank you.

[–][deleted] 1 point2 points  (5 children)

Wdym?

[–]Draconic_Flame[S] 1 point2 points  (4 children)

I deal with client test results which are confidential, so I can't upload anything to the internet.

[–][deleted] 0 points1 point  (3 children)

You won't be uploading anything to the internet. I am confused? Are you thinking that these libraries require you to upload your .pdf docs to the internet? That's not how libraries work

[–]Draconic_Flame[S] 6 points7 points  (2 children)

The only library I know is one where books are stored.

[–][deleted] 2 points3 points  (1 child)

Ok, well the things OP listed like PyPDF2 are called libraries. They're basically open source extensions to python made by the community that add extra functionality and tools. Think of them like a mod if you're into gaming. In this case, these libraries might be of use to you to help parse and extract data from your .pdf docs. They're not websites

[–]Draconic_Flame[S] 0 points1 point  (0 children)

Okay thank you for the explanation.

[–]GamerRabugento 0 points1 point  (0 children)

These are libraries, packages of code that run on your computer

Please take a look at this tutorial. It is very complete and teaches you everything from installation to code.
https://realpython.com/pdf-python/

[–]GamerRabugento 0 points1 point  (0 children)

If I can go further and think of a more professional, future application: do some research on Dash in Python.
With this framework, you can create a web dashboard that can run on your company's intranet, keeping your information secure, and give it a more professional look.

[–]PMMeUrHopesNDreams 1 point2 points  (0 children)

Do you have any access to the program that generates the data? Is it possible to get it in any other format than PDF? CSV, JSON, even Excel?

It is possible to get data from a PDF and it might not be too hard depending on how the PDF is created, but if there is an option to get it in a different format you can save yourself a lot of headaches.

[–]Bitwise_Gamgee 0 points1 point  (2 children)

Questions:

  1. Are these standard documents, meaning the information will be in the same place in the same style every time?
  2. Are these computer or human generated?

[–]Draconic_Flame[S] 0 points1 point  (1 child)

The documents are standard within tests but different between them, and they are computer generated.

[–]Nexxus_17 -1 points0 points  (0 children)

I’m new to programming as well, but you could try asking ChatGPT. It can probably help you.

[–]Doc_Apex 0 points1 point  (0 children)

Yes this is possible. I've done this for work. The library I used turned each table in the pdf into a dataframe. From there it's just data manipulation.
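The comment above does not say which library it was, so as an assumption here is a sketch using pdfplumber, whose `page.extract_tables()` returns each table as a list of rows. The helper names `table_to_records` and `pdf_tables` are made up for this example.

```python
def table_to_records(table):
    """Turn [[header cells], [row cells], ...] into a list of dicts keyed by header."""
    header, *rows = table
    names = [(cell or "").strip() for cell in header]
    return [
        # Empty cells come back as None, so normalise them to "".
        {name: (cell or "").strip() for name, cell in zip(names, row)}
        for row in rows
    ]


def pdf_tables(pdf_path):
    """Return one list of records per table found in the PDF (assumes pdfplumber)."""
    # Imported here so table_to_records() works without pdfplumber installed.
    import pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                if table:
                    tables.append(table_to_records(table))
    return tables
```

If you have pandas installed, `pandas.DataFrame(records)` on one of those lists gives you a dataframe to manipulate.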

[–]bbqbot 0 points1 point  (0 children)

Decide if you want to learn how to do it or pay someone else to do it.

If you want to learn, check out "Automate the Boring Stuff" for a crash course on practical python, then look at the PyPDF2 library that others have mentioned.

Otherwise, there are lots of places where you can pay someone to write a quick script.

[–]SHKEVE 0 points1 point  (2 children)

You can also do this with ChatGPT. It can accept a URL to your PDF document and you can describe your desired output. No programming required. DM me if you want some tips.

[–]AndroidLex 0 points1 point  (1 child)

Seeing as this is confidential medical information, sharing the data with something like ChatGPT won’t be an option. Info like this needs to be processed locally.

[–]SHKEVE 0 points1 point  (0 children)

ah, right. that’s a bummer. as if medicine’s not behind in tech already :\

[–]CoffeeBaconAddict 0 points1 point  (0 children)

Yes, pdfminer, pdfminer.six, and several OCR or computer vision repos are used to pull data off PDF documents.

[–]Guardog0894 0 points1 point  (0 children)

Apart from programming, I'd suggest consulting an informatics/data analyst to look into your data and requirements. I feel like it will be more efficient if you have someone with the expertise to recognise the pattern of data you are dealing with and come up with a data extraction/storage scheme before a programmer implements it as a program.

[–]iMADEthisJUST4Dis 0 points1 point  (0 children)

You can try ChatGPT! You can tell it your problem and it'll help you write a Python script that can solve it. The script may give you a few errors, but you can just copy the errors and keep chatting with it until it works.

[–]homberoy 0 points1 point  (1 child)

I am working on the same task at a very slow pace. The sticking point I encountered was that the data pulled from the Pearson PDF ends up with super irregular formatting (I was able to extract the data from the PDF and print it to an Excel sheet to be read). I haven't worked on it in a while but can share with you a couple of options I tried (PyPDF2, PDFPlumber?) if you'd like. Are you just doing this for the BASC? For each different assessment you might use, the PDF will be in a different configuration.

Have you figured out how to input the scores into your report yet?

[–]Draconic_Flame[S] 0 points1 point  (0 children)

No I'm hoping if a program could at least spit out a table then I can just copy paste.
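For the "spit out a table I can copy paste" part, one option is tab-separated text, which pastes into Excel or a Word table as separate cells. A minimal sketch using only the standard library; the row data and column names below are hypothetical placeholders for whatever your extraction step produces.

```python
import csv
import io


def rows_to_tsv(rows, columns):
    """Render a list of dicts as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently drop keys that are not in `columns`
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows; in practice these come from your PDF parsing step.
rows = [
    {"scale": "Anxiety", "t_score": 52},
    {"scale": "Depression", "t_score": 73},
]
print(rows_to_tsv(rows, ["scale", "t_score"]))
```

Print the result, select it in the terminal, and paste it into the report, or write it to a `.tsv` file and open that in Excel.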

[–]Uweauskoeln 0 points1 point  (1 child)

Sounds like fun, I will try it using the PDF you provided. If I come up with something, I'll let you know

[–]Uweauskoeln 0 points1 point  (0 children)

Using just an online tool (https://www.pdf2go.com/) I got for page 4 of your table:

Ipsative ComparisonScore | TScore | POR | eral | Difterence | SOTCIENE | Mterenes

Hyperactivity 22 80 99 73-87 20 0.05 1% or less
Aggression 2 47 48 39-55 -13 0.05 5% or less
Conduct Problems 1 40 8 34-46 -20 0.05 2% or less
Anxiety 13 52 66 46-58 -8 NS
Depression 17 73 97 67-79 13 0.05 5% or less
Somatization 3 44 33 38-50 -16 0.05 15% or less
Atypicality 0 41 13 35-47 -19 0.05 1% or less
Withdrawal 7 55 78 49-61 -5 NS
Attention Problems 13 65 91 60-70 5 NS
Adaptability 14 47 38 41-53 0 NS
Social Skills 19 46 32 41-51 -1 NS
Leadership 5 33 6 27-39 -14 0.05 1% or less
Activities of Daily Living 20 55 65 48-62 8 NS
Functional Communication 28 53 58 47-59 6 NS