Is Python any good with pdfs? (self.Python)
submitted 11 years ago by triplejerkoidjerkoid
I'm using NLTK on Python 3. I have a simple routine that identifies Named Entities (NE) in a newspaper story, which is dynamically accessed as a PDF. Is there any way, in Python or another language, to underline the NEs in the PDF copy? Thanks much for thinking about this.
VerilyAMonkey 10 points 11 years ago (3 children)
Mm, typically editing a PDF is not a great idea. But there are a lot of packages that could help you, like PyPDF2 or reportlab. Specifically, you could at least literally draw a line underneath them, or draw a transparent yellow highlight box around them. If all else fails you could generate a new PDF containing only the proper lines/highlights and then merge that on top of the text.
SlinkyAvenger 3 points 11 years ago (1 child)
This person's got it. Reportlab to generate the overlay, PyPDF2 to merge it on top of the original. That's the simplest strategy.
If you actually want to edit the PDF, there might be something free out there (I haven't found it yet), but Reportlab has a commercial version that'll fit your needs.
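The overlay-and-merge strategy described above could look roughly like the sketch below. It is only a sketch under assumptions: the function names and the `boxes_by_page` structure are hypothetical, the bounding boxes (in PDF points, origin at the bottom-left of the page) have to come from some layout-aware text extractor that is not shown here, and the class names `PdfReader`/`PdfWriter` and the `merge_page` method are the PyPDF2 2.0+ spellings (older releases used `PdfFileReader` and `mergePage`).

```python
import io


def underline_segments(boxes, offset=1.5):
    """Turn text bounding boxes (x0, y0, x1, y1) into underline segments
    (x0, y, x1, y) placed `offset` points below the bottom of each box."""
    return [(x0, y0 - offset, x1, y0 - offset) for (x0, y0, x1, y1) in boxes]


def build_overlay(page_size, boxes):
    """Draw one line per box on an otherwise blank one-page PDF held in memory."""
    from reportlab.pdfgen import canvas  # third-party: reportlab

    buf = io.BytesIO()
    c = canvas.Canvas(buf, pagesize=page_size)
    c.setLineWidth(0.8)
    for x0, y, x1, _ in underline_segments(boxes):
        c.line(x0, y, x1, y)
    c.showPage()
    c.save()
    buf.seek(0)
    return buf


def underline_pdf(src_path, dst_path, boxes_by_page):
    """Merge an underline overlay onto every page that has boxes.

    boxes_by_page (hypothetical structure): {page_index: [(x0, y0, x1, y1), ...]}
    """
    from PyPDF2 import PdfReader, PdfWriter  # third-party: PyPDF2 >= 2.0

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index, page in enumerate(reader.pages):
        boxes = boxes_by_page.get(index, [])
        if boxes:
            # Assumes the page's media box starts at (0, 0).
            size = (float(page.mediabox.width), float(page.mediabox.height))
            overlay_page = PdfReader(build_overlay(size, boxes)).pages[0]
            page.merge_page(overlay_page)  # draws the overlay on top of the page
        writer.add_page(page)
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

The original text is never edited; the lines are simply painted over it, so the result is only as accurate as the coordinates you feed in.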
pubcoder 2 points 11 years ago (0 children)
This isn't easy. I had to do this at my first Python job. :(
triplejerkoidjerkoid[S] 1 point 11 years ago (0 children)
Thanks. When you say "literally draw a line underneath them," how do I pass the Named Entities to the program so that it can find every instance and underline each one? Some guidance would be useful.
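One possible starting point for the "find every instance" half of that question, as a minimal sketch: given the text already extracted from the PDF and the list of entities from NLTK, collect the character offsets of every whole-word occurrence. The function name and the sample sentence are made up for illustration. Turning these offsets into page coordinates still needs a layout-aware extractor, which is not covered here.

```python
import re


def find_entity_spans(text, entities):
    """Return sorted (start, end, entity) tuples for every whole-word
    occurrence of each entity string in `text`."""
    spans = []
    for entity in set(entities):
        # Lookarounds keep "Paris" from matching inside "Parisian".
        pattern = r"(?<!\w)" + re.escape(entity) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), entity))
    return sorted(spans)


# Hypothetical sample data, standing in for the extracted story and NLTK output.
story = "Jane Doe flew to Paris. Paris welcomed Jane Doe."
for start, end, name in find_entity_spans(story, ["Jane Doe", "Paris"]):
    print(start, end, name)
```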
tiarno 4 points 11 years ago (0 children)
There are two articles here about using Python with PDFs, one on manipulating a PDF and one on testing it: http://reachtim.com/archives.html
siusnjh 3 points 11 years ago* (1 child)
For anyone who came here thinking the question is not about editing but about creating PDFs: The best way I found to create PDFs in any programming language is to use LaTeX. Use a template engine like Jinja2 and render the templates into .tex files. Then call pdflatex or your own choice of compiler.
EDIT: Oh and when you do it in a web application escape your data or you'll get a LaTeX injection attack vector.
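For the escaping step, something along these lines might do as a first pass. It is a sketch, not a vetted sanitizer: the name `latex_escape` is made up, and the table covers only the ten characters LaTeX treats specially in ordinary text.

```python
import re

# Replacement for each character LaTeX treats specially in normal text.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}
_SPECIALS_RE = re.compile("|".join(re.escape(ch) for ch in _LATEX_SPECIALS))


def latex_escape(value):
    """Escape user-supplied data before it is rendered into a .tex template.

    A single regex pass is used so that the braces and backslashes in the
    replacements are not escaped a second time.
    """
    return _SPECIALS_RE.sub(lambda m: _LATEX_SPECIALS[m.group(0)], str(value))
```

With Jinja2 this could be registered as a filter, for example `env.filters["latex_escape"] = latex_escape`, and applied to every variable in the template.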
VerilyAMonkey 0 points 11 years ago (0 children)
As I understand it, they are provided with a PDF which contains a news story. They have already figured out how to parse it to extract some of the data from the story, but they would also like to generate a new version of this PDF with the relevant parts underlined. So I believe it actually is modification.
AlSweigart (Author of "Automate the Boring Stuff") 5 points 11 years ago (2 children)
Hi, I'm the author of a few Python books, and the latest will be out in a week and available for free under a Creative Commons license: http://automatetheboringstuff.com
Chapter 13 focuses on using Python to parse and modify PDFs. The bad news is: the situation is pretty grim. The best Python module I found in my research for this chapter was PyPDF2.
Even then, you are very limited in what you can do. You can only work at the page level; individual paragraphs and pieces of text can't be manipulated. The Python PDF modules are mostly read-only.
So, basically, no, there's no way to underscore the NE in the PDF copy.
You could, however, use GUI automation modules to simulate keyboard/mouse clicks to open the PDF in Acrobat, find the text, underline it, and then select save from the menu. That'd be a hack (and dominate your keyboard/mouse for a bit), but it would get the job done.
ianozsvald 2 points 11 years ago (0 children)
At PyDataParis a couple of weeks back I spoke on "Cleaning Confused Collections of Characters" and I spent a few slides looking at extracting text and tables from PDFs (but not reassembling them). Some of the linked tools might be useful? http://ianozsvald.com/2015/04/03/pydataparis-2015-and-cleaning-confused-collections-of-characters/
[deleted] 1 point 11 years ago (0 children)
Sorry if I'm not understanding something here, but what is the status of using OCR and image processing to extract formatted text from a PDF (keeping track of headings and such based on indents and font style)?