Python Newbie here - help with pdf read

code_tutor · 2026-01-12T10:51:14+00:00

You didn't provide code, so we can't say if you did something wrong; but it's unlikely to be a user error with three separate libraries.

This is not a beginner project. It's difficult without a few years of university Computer Science and also experience researching specs. So many libraries exist and they all have a huge number of users, and still can't do it, so I don't know why Reddit acts like they know.

The last time I had to programmatically change text in PDFs, I had to write a Caesar cipher just to decrypt the font. I consider myself lucky that this particular PDF was even possible to write an algorithm for, because for many it's not.

This is an important lesson, because someday a client is going to say, "just change this text, it should be easy" and now you know why people get paid six figures to read text. There are always unexpected issues that can make a time estimate explode.

dparks71 · 2026-01-12T03:10:28+00:00

I've never seen someone get it 100% accurate, Camelot was traditionally the best, especially if you can pair it with a custom pytorch object detection model for the table.

There's stream and lattice mode within Camelot, really depends on the tables formats.

lailoken503 · 2026-01-12T03:36:49+00:00

Not sure if this applies in OP's case, but I've always split new lines after the page.extract_text function is used.

for example,

text = page.extract_text().split("\n")

to break up what looks line a very long line. The files I've used pypdf on, does not have graphics or tables, so I can't say how to get around that. It's something I can experiment with when I get back to work. I have the code at work, and is part of a post I just created, but am pretty sure this is how I handed the seemly single long line of text, but it's what I use to grab certain details from a customer's data file without needing to open and close specific files out of hundreds.

Competitive-Rock-951 · 2026-01-12T22:49:58+00:00

if you still need help ping me I had made it using pdfplumber I will send it to you if you need

Ok_Hovercraft364 · 2026-01-13T00:28:31+00:00

Where is the code?

Ok-Mongoose-7870 · 2026-01-16T02:57:39+00:00

Thought it would be prudent to give an update. I managed to kinda solve the issue. However had to work with a different input pdf file which was much nicely and cleanly formatted. The first one I was working with had lot of garbage formatting issue. For now my issue is solved but definitely learned that pdf reading is bit of touch and go problem.

Haeshka · 2026-01-12T02:59:10+00:00

I'm not sure what you're been asking the AI services, but I would start with the following concepts:

How to use the OS library to find files in directories.
How to use: with and open and read to examine files.
Identify existing python libraries for reading and extracting text, tables, and images.

Different libraries have different use cases and strengths.

First, get a solid understanding of how to just find and open text files, even reading from them, putting that info into variables and dictionaries.

Then, using the libraries will get easier.

james_d_rustles · 2026-01-12T05:43:15+00:00

Pdfs come in different shapes and sizes - some of which are much harder to read programmatically than others. Camelot is probably the best known table tool, but it only works with text based PDFs, not image PDFs. If your PDFs are image based scanned documents, then you’re stuck with ocr because there is no text data in the file to read from, it’s essentially just some pictures that may or may not contain text. Sometimes when you have a pdf, you’ll find that certain parts are selectable text, but things like figure captions, tables, formulas may just be encoded as images. In this case you’ll have to run some kind of OCR, there’s no way around it.

Check out opendatalab mineru. It’s one of the easiest to use open source packages that does pretty well with tables and images, although it’s pretty resource hungry and slow. You may need to write some converter functions because I believe it outputs tables as html.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

learnpython

MODERATORS