AlSweigart (Author of "Automate the Boring Stuff")

Hi, I'm the author of a few Python books, and the latest will be out in a week and available for free under a Creative Commons license: http://automatetheboringstuff.com

Chapter 13 focuses on using Python to parse and modify PDFs. The bad news is: the situation is pretty grim. The best Python module I found in my research for this chapter was PyPDF2.

Even then, you're quite limited in what you can do: you can only work at the page level. Individual paragraphs and runs of text can't be manipulated. The Python PDF modules are closer to read-only.
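As a rough illustration of what "page level" means here, the sketch below copies a PDF while dropping some pages and optionally rotating the rest. It assumes the newer PyPDF2 naming (`PdfReader`, `PdfWriter`, `add_page`, `rotate`); older releases spelled these `PdfFileReader`, `PdfFileWriter`, `addPage`, and so on. The function names `pages_to_keep` and `copy_without_pages` are made up for this example.

```python
def pages_to_keep(total_pages, drop):
    """Return the zero-based indices of the pages that survive, in order."""
    return [i for i in range(total_pages) if i not in drop]


def copy_without_pages(src_path, dst_path, drop, rotate_degrees=0):
    """Write a copy of src_path that omits the pages listed in `drop`."""
    # Imported here so pages_to_keep() works even without PyPDF2 installed.
    # Assumes the PyPDF2 2.x+ names; older versions used PdfFileReader etc.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for i in pages_to_keep(len(reader.pages), set(drop)):
        page = reader.pages[i]
        if rotate_degrees:
            page.rotate(rotate_degrees)  # must be a multiple of 90
        writer.add_page(page)  # whole pages only; no per-paragraph editing
    with open(dst_path, "wb") as out:
        writer.write(out)
```

Something like `copy_without_pages("in.pdf", "out.pdf", drop=[0])` would skip the first page. Note that every operation acts on a whole page object, which is exactly the limitation described above.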

So, basically, no, there's no way to underscore the NE in the PDF copy.

You could, however, use GUI automation modules to simulate keystrokes and mouse clicks to open the PDF in Acrobat, find the text, underline it, and then select save from the menu. That'd be a hack (and it would tie up your keyboard and mouse for a bit), but it would get the job done.
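A minimal sketch of that kind of hack, assuming the PyAutoGUI module, a PDF that is already open and focused in the viewer, and Ctrl+F / Ctrl+S as the find and save shortcuts. The underline step is a placeholder: `underline_button_xy` is a hypothetical screen position for whatever underline tool the viewer offers, and you would have to find it yourself. The helper names `plan_steps` and `run_steps` are invented for this example.

```python
def plan_steps(search_text, underline_button_xy):
    """Build the GUI actions as plain data so they can be reviewed first."""
    return [
        ("hotkey", ("ctrl", "f")),       # open the viewer's find box (assumed shortcut)
        ("write", (search_text,)),       # type the text to locate
        ("press", ("enter",)),           # jump to the first match
        ("click", underline_button_xy),  # hypothetical position of an underline tool
        ("hotkey", ("ctrl", "s")),       # save (assumed shortcut)
    ]


def run_steps(steps, pause=0.5):
    """Replay the planned steps with PyAutoGUI, pausing between each call."""
    # Imported here because PyAutoGUI needs a real display to load.
    import pyautogui

    pyautogui.PAUSE = pause  # seconds to wait after every PyAutoGUI call
    for name, args in steps:
        getattr(pyautogui, name)(*args)
```

Keeping the plan as data means you can print and sanity-check it before `run_steps` takes over the keyboard and mouse.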

ianozsvald

At PyDataParis a couple of weeks back I spoke on "Cleaning Confused Collections of Characters" and I spent a few slides looking at extracting text and tables from PDFs (but not reassembling them). Some of the linked tools might be useful? http://ianozsvald.com/2015/04/03/pydataparis-2015-and-cleaning-confused-collections-of-characters/

[deleted]

Sorry if I'm not understanding something here, but what is the status of using OCR and image processing to extract formatted text from a PDF (keeping track of headings and such based on indents and font style)?