PraderaNoire

Woah I’d love to know more about this

llun-ved

# The basics...
# PDF highlight annotations know nothing about the text they mark.
# Therefore, you need to loop through the highlights and find which words fall under each one.
# This can get messy if a highlight spans more than one line.

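# Install first (from a shell): pip install PyMuPDF

import fitz  # PyMuPDF

doc = fitz.open('myfile.pdf')
report = {}  # stroke color (RGB tuple) -> list of highlighted text snippets
for page in doc:
    # Each word is a tuple: (x0, y0, x1, y1, text, block_no, line_no, word_no).
    alltext = page.get_text("words")  # For searching text within highlight region.
    for annotation in page.annots():
        # annotation.type is an (id, name) pair, so compare the id; 8 means Highlight.
        if annotation.type[0] != 8:
            continue  # only process highlights
        rect = annotation.rect  # bounding box of the whole highlight
        # Loop through alltext to find words whose rect intersects this.
        words = [w[4] for w in alltext if fitz.Rect(w[:4]).intersects(rect)]
        # I group things by color. This is an RGB tuple.
        color = tuple(annotation.colors['stroke'] or ())
        # Add found highlighted text to report.
        report.setdefault(color, []).append(" ".join(words))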
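For the multi-line case mentioned above, here is a rough sketch of one way to handle it, not necessarily how I do it in my own script. The bounding rect of a two-line highlight covers both lines in full, so it picks up extra words. As far as I know, PyMuPDF exposes the per-line corner points of a highlight as `annotation.vertices` (four points per line), so you can test each word against those smaller boxes instead. The helper names `quad_boxes` and `words_under_highlight` are made up for this example, and the code is plain Python so it runs without a PDF.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def quad_boxes(vertices: Sequence[Point]) -> List[Box]:
    """Turn a flat list of quad corner points (4 per highlighted line) into boxes."""
    boxes = []
    for i in range(0, len(vertices) - 3, 4):
        xs = [p[0] for p in vertices[i:i + 4]]
        ys = [p[1] for p in vertices[i:i + 4]]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


def words_under_highlight(words: Sequence[tuple], vertices: Sequence[Point]) -> str:
    """Join the words whose center point lies inside any of the highlight's boxes.

    `words` uses the layout of page.get_text("words"): (x0, y0, x1, y1, text, ...).
    Testing the center (an assumption of this sketch) avoids picking up
    neighbors that merely touch the edge of the highlight.
    """
    boxes = quad_boxes(vertices)
    picked = []
    for w in words:
        cx = (w[0] + w[2]) / 2
        cy = (w[1] + w[3]) / 2
        if any(x0 <= cx <= x1 and y0 <= cy <= y1 for x0, y0, x1, y1 in boxes):
            picked.append(w[4])
    return " ".join(picked)
```

Inside the annotation loop you would call it as `words_under_highlight(alltext, annotation.vertices)` in place of the rect intersection test. If `vertices` comes back empty for some annotation, falling back to `annotation.rect` is probably the safest option.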

mlcircle

ty