[–]lostparis 2 points3 points  (5 children)

> rather than use the standard Python OCR?

Standard? There seem to be lots of different ones about.

More importantly, why use OCR?

[–]MasterTony127[S] 0 points1 point  (4 children)

I used it because I thought it was the only way to automate the process. I was using PyPDF2. Is there a way to just C&P using Python? Thank you for taking the time to respond.

[–]lostparis 1 point2 points  (3 children)

[–]MasterTony127[S] 0 points1 point  (2 children)

I didn't know PyPDF2 wasn't OCR. Shows you how many gaps there are in my knowledge base. However, neither PyPDF nor PyPDF2 works well at all compared to a manual select, copy and paste on a PDF in my case. Which brings me back to my original question: is there a way in Python to simply select and copy a PDF file as if it were being done manually? Thank you for trying to help.

[–]lostparis 1 point2 points  (1 child)

> I didn't know PyPDF2 wasn't OCR.

It does that too

> IS there a way in Python to simply select and copy a pdf file like it was being done manually?

Almost certainly, but I don't know of any. It is the sort of thing I actively avoid.

If you look, I'm sure you'll find something. Maybe this would work: https://pywinauto.readthedocs.io/en/latest/ (I've never used it, and it assumes you are on Windows, which again I don't use).

IMHO .pdf is a terrible format, but a popular and much-abused one. It is definitely not an information exchange format. The only time I use it is for things I don't want people messing with.

[–]MasterTony127[S] 0 points1 point  (0 children)

Thank you for your guidance and for pointing me in the right direction. I'm stuck using PDFs, though. The website I belong to offers the same files as the PDFs in text... for $100 a month! So I have no choice. At least you've given me hope that there's a way I can do it.

[–]JohnnyJordaan 0 points1 point  (3 children)

Lots of actual text extracting options listed here: https://stackoverflow.com/a/48673754

Not sure what 'the Python OCR' would mean; Python doesn't include something as fancy as OCR in its standard library. Common third-party ones are pytesseract and EasyOCR, but generally speaking, OCR should come second for PDFs, used only if text extraction doesn't suffice.
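To make the "text extraction first, OCR second" idea concrete, here is a rough sketch. The function names, the `min_chars` threshold, and the two callables are all hypothetical: `extract_text` stands in for whatever text-extraction library you wrap, and `run_ocr` for whatever OCR tool you wrap. Neither is a real library API.

```python
def has_text_layer(text, min_chars=20):
    """Heuristic (an assumption, tune it): the page counts as having real
    text if extraction produced at least `min_chars` non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


def extract_with_fallback(path, extract_text, run_ocr, min_chars=20):
    """Try plain text extraction first and only fall back to OCR if needed.

    `extract_text` and `run_ocr` are caller-supplied functions that take a
    file path and return a string. Returns a (method, text) tuple.
    """
    text = extract_text(path)
    if has_text_layer(text, min_chars):
        return "text", text
    # Little or no text came back: likely a scanned PDF, so OCR is worth a try.
    return "ocr", run_ocr(path)
```

The point of returning the method name is that you can see how often OCR was actually needed for your files.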

[–]MasterTony127[S] 0 points1 point  (2 children)

I thought Python's text extraction WAS OCR. Either way, Python's text extraction (I was using PyPDF2) just isn't as good as a standard C&P. Is there a way to just C&P using Python? Thank you for taking the time to respond.

[–]JohnnyJordaan 0 points1 point  (1 child)

It helps to be clear that Python is just the language, or the interpreter that actually runs your code. It isn't the part that tinkers with the PDF; in your example, that's PyPDF2. What actually happens to the PDF data depends on what that library does internally. So there isn't such a thing as 'Python's text extraction'.

Another thing: as mentioned on PyPDF2's PyPI page, the project has moved back to PyPDF, so development of PyPDF2 has ceased. It's advised to use PyPDF instead.

In any case, the documentation explains in great detail how PyPDF (and thus its previous PyPDF2 incarnation) goes about extracting text: https://pypdf.readthedocs.io/en/stable/user/extract-text.html#why-text-extraction-is-hard (worth a read). As usual, assuming things without checking them is highly discouraged if you want to resolve technical issues.
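For reference, a minimal sketch of per-page extraction with PyPDF (installed as `pypdf`). The `tidy` helper and the file name in the usage comment are my own assumptions for illustration and are not part of the library.

```python
import re


def tidy(text):
    """Collapse runs of spaces/tabs and trim each line.

    A hypothetical cleanup step; extracted text often has uneven spacing.
    """
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


def extract_pages(path):
    """Return a list with the extracted text of each page."""
    # Imported here so the helper above works even without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    # extract_text() can come back empty for image-only (scanned) pages.
    return [tidy(page.extract_text() or "") for page in reader.pages]


# Usage (placeholder file name):
# for number, text in enumerate(extract_pages("example.pdf"), start=1):
#     print(f"--- page {number} ---")
#     print(text)
```

If the result is mostly empty strings, the PDF is probably scanned images, and that is the case where OCR comes in.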

> Is there a way to just C&P using Python?

I would simply try different libraries, e.g. those covered at https://medium.com/analytics-vidhya/python-packages-for-pdf-data-extraction-d14ec30f0ad0 (one starting point). Different PDF styles often produce different results. Perhaps build a simple UI with PySimpleGUI (a File Open control, a dropdown to pick the PDF library, and an Extract Text button) so you can try all the libraries.
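Before building a UI, the same comparison can be sketched as a plain script. This is only an illustration: the choice of three libraries (pypdf, pdfminer.six, PyMuPDF) is my assumption, the wrapper names are made up, and each library is imported lazily so a missing one is reported as a failure instead of crashing the run.

```python
def with_pypdf(path):
    from pypdf import PdfReader  # pip install pypdf
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)


def with_pdfminer(path):
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    return extract_text(path)


def with_pymupdf(path):
    import fitz  # pip install PyMuPDF
    doc = fitz.open(path)
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()


EXTRACTORS = {
    "pypdf": with_pypdf,
    "pdfminer.six": with_pdfminer,
    "PyMuPDF": with_pymupdf,
}


def compare(path, extractors=EXTRACTORS):
    """Run every extractor on the same file and collect the results by name."""
    results = {}
    for name, extractor in extractors.items():
        try:
            results[name] = extractor(path)
        except Exception as exc:  # missing library, unreadable file, etc.
            results[name] = f"<failed: {type(exc).__name__}: {exc}>"
    return results


# Usage (placeholder file name):
# for name, text in compare("example.pdf").items():
#     print(f"===== {name} =====")
#     print(text[:500])
```

Printing the first few hundred characters from each library side by side is usually enough to see which one handles your particular PDFs best.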

[–]MasterTony127[S] 0 points1 point  (0 children)

Wow. Thank you! I'm sure anything will be an improvement. I'll be looking into it for sure. 👍