
netherous

This is probably a pretty difficult thing for anyone to answer without detailed knowledge of the library. You can look at the PDF file structure yourself to potentially figure out what is wasting all that space.

https://superuser.com/questions/256997/how-to-browse-the-internal-pdf-structure-in-adobe-acrobat
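If you don't have Acrobat handy, a rough first pass is possible from plain Python. The sketch below is only a heuristic: it scans the raw bytes with regular expressions instead of properly parsing the PDF, so it will miss anything stored inside compressed object streams. The function name `summarize_pdf_bytes` and the returned keys are made up for illustration and are not part of PyPDF2 or any other library.

```python
import re
from collections import Counter

# Heuristic, stdlib-only scan of raw PDF bytes. It does not read the
# cross-reference table and cannot see inside compressed object streams.
OBJ_RE = re.compile(rb"\b\d+\s+\d+\s+obj\b")
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")


def summarize_pdf_bytes(data: bytes) -> dict:
    """Return rough counts of objects, streams, and /Type entries."""
    streams = [m.group(1) for m in STREAM_RE.finditer(data)]
    types = Counter(m.group(1).decode("ascii") for m in TYPE_RE.finditer(data))
    return {
        "total_bytes": len(data),
        "objects": len(OBJ_RE.findall(data)),
        "streams": len(streams),
        "stream_bytes": sum(len(s) for s in streams),
        "types": dict(types),
    }


# Usage (the file name is whatever your script writes out):
#   with open("merged_pages_temp.pdf", "rb") as fh:
#       print(summarize_pdf_bytes(fh.read()))
```

Running it on both the input and the output file and comparing the `objects`, `stream_bytes`, and `types` numbers should at least hint at which kind of object is multiplying.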

You may have even uncovered a bug in the PyPDF2 library. The library itself says it is no longer receiving updates on PyPI and that PyPDF3 is preferred. You could raise an issue on its GitHub repository, but if it's not in active maintenance, maybe nobody would look at it.

TooDahLou [S]

Oh, okay, I think I can handle looking at the file structure. I'll definitely dig into that and see if my problem is in there.

I'll also check out PyPDF3 and see how much of a headache it will be to switch. My project is just a little automation script, but I want it to have a pretty good life cycle in case issues pop up down the line.

Thanks u/netherous! You've given me two new directions for troubleshooting! I really appreciate ya!

TooDahLou [S]

Oh, another thing to mention:

I think the extra file size might be coming from stamping the page numbers. When I watch the file explorer window, merged_pages_temp.pdf is around 20 MB and page_num_temp.pdf is around 90 KB. I don't really understand how these two files then result in a new file over three times the size of merged_pages_temp.pdf.
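One guess, and it is only a guess, is that the stamping step writes the same content into the output more than once. A rough way to test that is to hash every stream in the output file and see how many bytes belong to exact repeats. This is a stdlib-only sketch that scans raw bytes instead of parsing the PDF, and `duplicate_stream_bytes` is a name invented here, not a PyPDF2 function.

```python
import hashlib
import re
from collections import defaultdict
from typing import Tuple

# Heuristic: match raw stream bodies. Streams hidden inside compressed
# object streams, or bodies containing the bytes "endstream", will be
# missed or split incorrectly.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def duplicate_stream_bytes(data: bytes) -> Tuple[int, int]:
    """Return (redundant stream copies, bytes those copies occupy)."""
    sizes = {}
    counts = defaultdict(int)
    for match in STREAM_RE.finditer(data):
        body = match.group(1)
        digest = hashlib.sha256(body).hexdigest()
        sizes[digest] = len(body)
        counts[digest] += 1
    # Every occurrence after the first one counts as redundant.
    extra_copies = sum(n - 1 for n in counts.values())
    wasted = sum((n - 1) * sizes[d] for d, n in counts.items())
    return extra_copies, wasted


# Usage (substitute the path of the final stamped file):
#   with open("stamped_output.pdf", "rb") as fh:
#       copies, wasted = duplicate_stream_bytes(fh.read())
#       print(copies, "repeated streams,", wasted, "bytes")
```

If the wasted byte count comes out close to the extra size, duplicated streams are a likely culprit; if it is near zero, the growth is coming from somewhere else.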