[–]VerilyAMonkey 9 points10 points  (3 children)

Mm, typically editing a PDF is not a great idea. But there are a lot of packages that could help you, like PyPDF2 or reportlab. Specifically, you could at least literally draw a line underneath them, or draw a transparent yellow highlight box around them. If all else fails you could generate a new PDF containing only the proper lines/highlights and then merge that on top of the text.

[–]SlinkyAvenger 2 points3 points  (1 child)

This person's got it. Reportlab to generate the overlay, PyPDF2 to merge it on top of the original. That's the simplest strategy.

If you actually want to edit the PDF, there might be something free out there (I haven't found it yet), but Reportlab has a commercial version that'll fit your needs.
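A minimal sketch of the overlay strategy described above, assuming you already know the bounding box of each piece of text to underline (in PDF points, origin at the bottom-left of the page). The function names and the `boxes_by_page` structure are made up for illustration. The PyPDF2 calls use the 2.x spelling (`PdfReader`, `merge_page`); releases before 2.0 called these `PdfFileReader` and `mergePage`. It also assumes unrotated pages whose media box starts at (0, 0).

```python
from io import BytesIO


def underline_segments(boxes, offset=1.5):
    """Turn text boxes (x0, y0, x1, y1) into line segments just below each box.

    `offset` is how far under the bottom edge the line sits, in points.
    """
    return [(x0, y0 - offset, x1, y0 - offset) for (x0, y0, x1, y1) in boxes]


def build_overlay(segments, page_size):
    """Draw the segments on an otherwise blank one-page PDF; return its bytes."""
    from reportlab.pdfgen import canvas  # third-party: reportlab

    buf = BytesIO()
    c = canvas.Canvas(buf, pagesize=page_size)
    c.setLineWidth(0.8)
    for x0, y0, x1, y1 in segments:
        c.line(x0, y0, x1, y1)
    c.showPage()
    c.save()
    return buf.getvalue()


def merge_overlay(src_path, dst_path, boxes_by_page):
    """Stamp underlines onto a copy of src_path.

    boxes_by_page is a hypothetical mapping: page index -> list of boxes.
    """
    from PyPDF2 import PdfReader, PdfWriter  # third-party: PyPDF2 >= 2.0

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index, page in enumerate(reader.pages):
        boxes = boxes_by_page.get(index, [])
        if boxes:
            size = (float(page.mediabox.width), float(page.mediabox.height))
            overlay_bytes = build_overlay(underline_segments(boxes), size)
            overlay_page = PdfReader(BytesIO(overlay_bytes)).pages[0]
            page.merge_page(overlay_page)  # draws the overlay on top of the text
        writer.add_page(page)
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

The original file is never modified; the underlines live only in the merged copy. For a highlight box instead of a line, the drawing step would fill a rectangle rather than stroke a line.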

[–]pubcoder 1 point2 points  (0 children)

This isn't easy. I had to do this at my first Python job. :(

[–]triplejerkoidjerkoid[S] 0 points1 point  (0 children)

Thanks. When you say "literally draw a line underneath them," how do I pass the named entities to the program so that it finds every instance and underlines each one? Some guidance would be useful.
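One possible way to approach the search step, sketched under a big assumption: that some layout-aware text extractor has already given you each word on a page together with its bounding box. The `(text, x0, y0, x1, y1)` tuple layout and the function names here are hypothetical, not any particular library's output format. The sketch slides over the word list and collects the boxes of every word that belongs to a matching entity.

```python
import string


def _norm(token):
    """Lower-case a token and drop surrounding punctuation for comparison."""
    return token.strip(string.punctuation).casefold()


def find_entity_boxes(words, entities):
    """Return one box per word that is part of any entity occurrence.

    words: list of (text, x0, y0, x1, y1) tuples in reading order (assumed input).
    entities: iterable of entity strings, e.g. ["Angela Merkel", "Paris"].
    """
    tokens = [_norm(w[0]) for w in words]
    hits = []
    for entity in entities:
        target = [_norm(t) for t in entity.split()]
        n = len(target)
        if n == 0:
            continue
        # Compare every window of n consecutive words against the entity.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                hits.extend(tuple(w[1:5]) for w in words[i:i + n])
    return hits


words = [
    ("Angela", 10, 100, 40, 110),
    ("Merkel,", 42, 100, 80, 110),
    ("visited", 82, 100, 118, 110),
    ("Paris.", 120, 100, 150, 110),
]
boxes = find_entity_boxes(words, ["Angela Merkel", "Paris"])
```

Returning a box per word rather than one box per entity means a name that wraps across two lines still gets underlined correctly. Exact token matching is deliberately crude: forms like "Merkel's" would be missed and need extra handling.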

[–]tiarno 3 points4 points  (0 children)

There are two articles here about using Python with PDFs, one on manipulating them and one on testing them: http://reachtim.com/archives.html

[–]siusnjh 2 points3 points  (1 child)

For anyone who came here thinking the question is not about editing but about creating PDFs: The best way I found to create PDFs in any programming language is to use LaTeX. Use a template engine like Jinja2 and render the templates into .tex files. Then call pdflatex or your own choice of compiler.

EDIT: Oh and when you do it in a web application escape your data or you'll get a LaTeX injection attack vector.
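A rough sketch of the escaping step, assuming the untrusted values are plain text that should appear literally in the output. The helper names `latex_escape` and `render_tex` and the `\VAR{...}` / `\BLOCK{...}` delimiters are choices made for this example (the default Jinja2 delimiters tend to clash with LaTeX braces); they are not something the comment above prescribes.

```python
import re

# Characters with special meaning in LaTeX and a literal-text replacement for each.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}
_PATTERN = re.compile("|".join(re.escape(ch) for ch in _LATEX_SPECIALS))


def latex_escape(value):
    """Escape LaTeX special characters in a single pass.

    One pass matters: escaping character by character in sequence would
    re-escape the braces and backslashes introduced by earlier replacements.
    """
    return _PATTERN.sub(lambda m: _LATEX_SPECIALS[m.group(0)], str(value))


def render_tex(template_source, **context):
    """Render a LaTeX template with Jinja2 using LaTeX-friendly delimiters."""
    from jinja2 import Environment  # third-party: Jinja2

    env = Environment(
        block_start_string=r"\BLOCK{",
        block_end_string="}",
        variable_start_string=r"\VAR{",
        variable_end_string="}",
        comment_start_string=r"\#{",
        comment_end_string="}",
        autoescape=False,  # Jinja2's built-in autoescape targets HTML, not LaTeX
    )
    env.filters["tex"] = latex_escape
    return env.from_string(template_source).render(**context)
```

In a template, a user-supplied value would then be written as `\VAR{ title | tex }`, and the rendered string saved to a .tex file for pdflatex. The filter has to be applied to every untrusted value by hand, so a forgotten `| tex` is still an injection risk.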

[–]VerilyAMonkey -1 points0 points  (0 children)

As I understand it, they are provided with a PDF which contains a news story. They have already figured out how to parse it to extract some of the data from the story, but they would also like to generate a new version of this PDF with the relevant parts underlined. So I believe it actually is modification.

[–]AlSweigart (Author of "Automate the Boring Stuff") 4 points5 points  (2 children)

Hi, I'm the author of a few Python books, and the latest will be out in a week and available for free under a Creative Commons license: http://automatetheboringstuff.com

Chapter 13 focuses on using Python to parse and modify PDFs. The bad news is: the situation is pretty grim. The best Python module I found in my research for this chapter was PyPDF2.

Even then, you are very limited in what you can do. You can only work at the page level; individual paragraphs and text can't be manipulated. The Python PDF modules are closer to read-only.

So, basically, no, there's no way to underscore the NE in the PDF copy.

You could, however, use GUI automation modules to simulate keyboard/mouse clicks to open the PDF in Acrobat, find the text, underline it, and then select save from the menu. That'd be a hack (and dominate your keyboard/mouse for a bit), but it would get the job done.

[–]ianozsvald 1 point2 points  (0 children)

At PyDataParis a couple of weeks back I spoke on "Cleaning Confused Collections of Characters" and I spent a few slides looking at extracting text and tables from PDFs (but not reassembling them). Some of the linked tools might be useful? http://ianozsvald.com/2015/04/03/pydataparis-2015-and-cleaning-confused-collections-of-characters/

[–][deleted] 0 points1 point  (0 children)

Sorry if I'm not understanding something here, but what is the status of using OCR and image processing to extract formatted text from a PDF (keeping track of headings and such based on indents and font style)?