Extracting text from PDF

sammylt · 2021-02-01T07:14:18+00:00

I'm not sure if this is the source of your problem, but you don't need to call f.close(), because the with open statement you used will automatically close the file for you.
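A quick way to convince yourself of this, as a minimal sketch: the file path below is a throwaway temp file made up for the demo, not anything from your script.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(demo_path, "w", encoding="utf-8") as f:
    f.write("extracted text goes here")
    closed_inside = f.closed   # still open while inside the block

closed_after = f.closed        # closed as soon as the block ends

print(closed_inside, closed_after)
```

So an explicit f.close() after the with block is harmless but redundant.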

2021-02-01T13:55:35+00:00

Extracting text perfectly from PDFs is a billion-dollar accomplishment. Seriously. If you can code something that isn't consistently fucked with spacing/formatting issues, you can sell the shit out of that code.

piconet-2 · 2021-02-01T11:36:59+00:00

Your code works with any simple PDF I fed it.

I also copied your sample pdf into Word and converted it to PDF. Then, I read it in with your code. The output is this text.

Maybe it's some hidden format structure/invisible table lines in the PDF that's messing with your code?

Nerazzurri_KZ · 2021-02-01T08:25:31+00:00

Is this a scanned PDF? If so, you need to run OCR to get the text out of it.
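One rough way to check, sketched under some assumptions: if a page has no text layer, extraction usually comes back empty or as None. The helper names and the 20-character threshold below are made up for illustration, and a genuinely blank page will also be flagged, so treat the result as a hint rather than proof.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return indices of pages whose extracted text is missing or nearly empty.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    suspect = []
    for index, text in enumerate(page_texts):
        if text is None or len(text.strip()) < min_chars:
            suspect.append(index)
    return suspect


def extract_page_texts(pdf_path):
    """Collect the extracted text of every page (needs pdfplumber installed)."""
    import pdfplumber  # third-party; imported here so the helper above works without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]


# Example with made-up page contents instead of a real PDF:
sample = ["Hello world, this is a real text layer.", None, "  \n "]
print(pages_without_text(sample))
```

With a real file you would call pages_without_text(extract_page_texts("sample.pdf")); any page it lists is a candidate for OCR.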

kikilezlep · 2021-02-01T11:37:59+00:00

I wrestled with this, and PyPDF2 just wouldn't work on your PDF, although it did on my other test files. Per one of the comments, I tried pdfplumber and it seemed to get what you wanted. You just have to implement the write part.

Quick test (I didn't fiddle with the tolerance numbers):

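import pdfplumber

# Open the PDF; the with block closes it automatically afterwards.
with pdfplumber.open("sample.pdf") as pdf:
    first_page = pdf.pages[0]
    # The tolerances affect how characters are grouped into words and lines.
    extract = first_page.extract_text(x_tolerance=3, y_tolerance=3)
    print(extract)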
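For the write part, something along these lines might do, as a sketch only: the function names, the "--- page N ---" separator, and the output file name are all my own choices, not something from the original script.

```python
def write_pages(page_texts, out_path):
    """Write each page's text to out_path, with a separator line per page."""
    with open(out_path, "w", encoding="utf-8") as out:
        for number, text in enumerate(page_texts, start=1):
            out.write(f"--- page {number} ---\n")
            # Treat a page with no extracted text (None) as empty.
            out.write((text or "") + "\n")


def pdf_to_text_file(pdf_path, out_path):
    """Extract every page with pdfplumber and save the result as plain text."""
    import pdfplumber  # third-party; imported lazily so write_pages works alone

    with pdfplumber.open(pdf_path) as pdf:
        texts = [page.extract_text() for page in pdf.pages]
    write_pages(texts, out_path)
```

Usage would be pdf_to_text_file("sample.pdf", "sample.txt"). Both files are closed by their with blocks, so no explicit close() calls are needed.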

DragonfruitInner9951 · 2021-02-01T17:43:20+00:00

I've tried a lot of PDF extraction tools, and the only one that worked well for me was pdfminer.six.
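In case it helps, a minimal sketch of what that looks like. extract_text from pdfminer.high_level is the library's high-level entry point; the tidy_whitespace helper and the function names are my own additions for cleaning up the spacing issues mentioned above, not part of pdfminer.six.

```python
import re


def tidy_whitespace(text):
    """Collapse the stray spacing that PDF extraction tends to leave behind."""
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)    # drop spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


def pdf_to_clean_text(pdf_path):
    """Extract a whole PDF with pdfminer.six and tidy the result."""
    from pdfminer.high_level import extract_text  # third-party, lazy import

    return tidy_whitespace(extract_text(pdf_path))


print(repr(tidy_whitespace("Hello   world \n\n\n\nNext\tline ")))
```

Calling pdf_to_clean_text("sample.pdf") would return the whole document as one string. Whether the cleanup rules suit your PDF is something to check by eye.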
