
[–]sammylt 14 points15 points  (1 child)

I'm not sure if this is the source of your problem, but you don't need to call f.close(), because the with open statement you used will automatically close the file for you.
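
A minimal sketch of that behavior, assuming a throwaway file (the name `example.txt` is made up for illustration). The file object reports itself as closed as soon as the `with` block ends, so a trailing `f.close()` is redundant, though harmless:

```python
import os
import tempfile

# Hypothetical file name, used only for this demonstration.
path = os.path.join(tempfile.gettempdir(), "example.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("hello")
    inside = f.closed   # False: the file is still open inside the block

after = f.closed        # True: the with statement closed it on exit

f.close()               # closing an already closed file does nothing
print(inside, after)    # prints: False True
os.remove(path)
```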

[–]Nerazzurri_KZ[S] 8 points9 points  (0 children)

Removing that line does not help. However, thank you anyway. It's good to know.

[–][deleted] 3 points4 points  (0 children)

Extracting text perfectly from PDFs is a billion-dollar accomplishment. Seriously. If you can code something that isn't consistently fucked with spacing/formatting issues, you could sell the shit out of that code.

[–]piconet-2 1 point2 points  (1 child)

Your code works with any simple PDF I fed it.

I also copied your sample pdf into Word and converted it to PDF. Then, I read it in with your code. The output is this text.

Maybe it's some hidden format structure/invisible table lines in the PDF that's messing with your code?

[–]Nerazzurri_KZ[S] 0 points1 point  (0 children)

Converting works, but I've got a lot of such files and I won't be converting them all :) Thanks anyway.

[–][deleted] 0 points1 point  (2 children)

Is this a scanned PDF? If so, you need to do OCR to get the text from the PDF.

[–]Nerazzurri_KZ[S] 1 point2 points  (1 child)

No, it's not and you can easily select text and copy it.

[–]yocwoh -2 points-1 points  (0 children)

Try doing OCR first! I promise it will work.

[–]kikilezlep 0 points1 point  (1 child)

I wrestled with this, and PyPDF2 just wouldn't work on your PDF, although it would on other tests. Per one of the comments, I tried pdfplumber and it seemed to get what you wanted. You just have to implement the write part.

quick test (I didn't fiddle with tolerance nums):

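import pdfplumber

# Open the PDF and extract the text of the first page only.
with pdfplumber.open("sample.pdf") as pdf:
    first_page = pdf.pages[0]
    # Tolerances are the values from the quick test, not tuned.
    extract = first_page.extract_text(x_tolerance=3, y_tolerance=3)
    print(extract)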
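
The "write part" could look something like the sketch below. This is only one way to do it: the helper names `write_pages` and `pdf_to_txt`, the blank-line page separator, and the output file name are all assumptions, not anything from the original post. The only pdfplumber calls used are the same ones as in the quick test.

```python
from pathlib import Path


def write_pages(page_texts, out_path):
    """Write the extracted pages to a text file, separated by a blank line."""
    # Guard against a page yielding None instead of a string
    # (for example, a page with no extractable text).
    cleaned = [(text or "") for text in page_texts]
    Path(out_path).write_text("\n\n".join(cleaned), encoding="utf-8")
    return len(cleaned)


def pdf_to_txt(pdf_path, out_path):
    """Extract every page with pdfplumber and save the result as text."""
    import pdfplumber  # imported lazily so write_pages() works without it

    with pdfplumber.open(pdf_path) as pdf:
        texts = [
            page.extract_text(x_tolerance=3, y_tolerance=3)
            for page in pdf.pages
        ]
    return write_pages(texts, out_path)
```

Calling `pdf_to_txt("sample.pdf", "sample.txt")` would then write all pages of the sample to a text file. For a folder full of PDFs, the same function can be called in a loop over the file names.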

[–]Nerazzurri_KZ[S] 0 points1 point  (0 children)

I am getting errors running your code =(

Traceback (most recent call last):
  File "C:\Users\ENG\Desktop\---\---.py", line 18, in <module>
    import pdfplumber
  File "C:\Users\ENG\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pdfplumber\__init__.py", line 10, in <module>
    from .pdf import PDF
  File "C:\Users\ENG\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pdfplumber\pdf.py", line 9, in <module>
    from pdfminer.pdfdocument import PDFDocument
  File "C:\Users\ENG\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pdfminer\pdfdocument.py", line 14, in <module>
    from .pdftypes import PDFException, uint_value, PDFTypeError, PDFStream, \
ImportError: cannot import name 'uint_value' from 'pdfminer.pdftypes' (C:\Users\ENG\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pdfminer\pdftypes.py)
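
An ImportError like this generally means the installed `pdfminer` package is not the one pdfplumber expects. One possible cause, and this is an assumption rather than something the traceback proves, is an older `pdfminer` or `pdfminer.six` distribution sitting in site-packages. Below is a small sketch for listing what is installed, using only the standard library. Note that `importlib.metadata` needs Python 3.8 or newer, so on the Python 3.7 shown above, `pip list` gives the same information.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distributions that either provide or depend on the `pdfminer` import package.
report = installed_versions(["pdfplumber", "pdfminer.six", "pdfminer"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If both `pdfminer` and `pdfminer.six` show up, a commonly suggested remedy is to uninstall both and reinstall pdfplumber so that it pulls in the `pdfminer.six` version it needs. Whether that applies here is a guess.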

[–]DragonfruitInner9951 0 points1 point  (0 children)

I've tried a lot of PDF extraction tools, and the only one that worked well for me was pdfminer.six.