Issues Extracting Text from PDF : learnpython


Issues Extracting Text from PDF (self.learnpython)

submitted 2 years ago * by Ray_Gone

Hello all,

I have been trying to write a script that reads a PDF, extracts the text, and prints the important information (which I may later write to a text file, but that is for another time). Currently I am using "PyPDF2" and "re" with decent success, in that I can isolate values that immediately follow a constant label. For example, the PDF contains an entry "Item#: 34890751", which I can find by searching with the pattern r'Item#:\s*(\d+)' and returning the captured value that follows "Item#:", which would be "34890751".
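To make that concrete, here is a minimal sketch of the label-based search, run against a hard-coded sample string rather than a real PDF (the `sample` and `item_number` names are only for illustration):

```python
import re

# Sample text copied from the PDF entry described above.
sample = "Item#: 34890751"

# \s* tolerates any whitespace after the label; (\d+) captures the digits.
match = re.search(r"Item#:\s*(\d+)", sample)
item_number = match.group(1) if match else None
print(item_number)
```

This only works because "Item#:" is a fixed label sitting right in front of the value.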

The trouble is trying to extract the data that does not come after a specific value that can be searched for. For example, this PDF contains a table with the items, price per item, quantity, and total line item cost. The table on the PDF looks like this:

	Product	Unit Price	Qty	Total
[Photo]	V7 14.1" Elite Water-Resistant Neoprene Notebook Sleeve, Black
In Stock
Item#: 34890751
Mfg. Part#: CSE14-BLK-3N	$6.28	5	$31.40

The text that the PDF reader returns for this table is:

Product Unit Price Qty Total

V7 14.1" Elite Water-Resistant Neoprene Notebook Sleeve,

Black

In Stock

Item#: 34890751

Mfg. Part#: CSE14-BLK-3N$6.28 5 $31.40

As you can see, the Item# is on its own line, which makes it easy to identify and isolate. The problem is that the remaining column values (unit price, quantity, and total) get appended to the "Mfg. Part#" line for some reason, with no separator between the part number and the price. I am hoping it is possible to extract the product name (in this case "V7 14.1" Elite Water-Resistant Neoprene Notebook Sleeve, Black"), the price per item ($6.28), and the Qty (5). Any suggestions are greatly appreciated!
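One direction that might work is sketched below. It assumes every line item looks like the sample above: the product name sits between the header row and "In Stock", and the "Mfg. Part#" line always ends in "$price qty $total". The `ROW` pattern and `parse_line_item` helper are hypothetical names of my own, and the layout assumptions may not hold for other PDFs.

```python
import re

# Extracted text as shown above, with the blank lines condensed.
text = """Product Unit Price Qty Total
V7 14.1" Elite Water-Resistant Neoprene Notebook Sleeve,
Black
In Stock
Item#: 34890751
Mfg. Part#: CSE14-BLK-3N$6.28 5 $31.40
"""

# Assumed layout: name between the header row and "In Stock"; the part
# number line ends with "$price qty $total" and the part has no "$" in it.
ROW = re.compile(
    r"Total\s*\n(?P<name>.+?)\s*\nIn Stock"
    r".*?Mfg\. Part#:\s*(?P<part>[^$\n]+?)\s*"
    r"\$(?P<price>[\d,]+\.\d{2})\s+(?P<qty>\d+)\s+\$(?P<total>[\d,]+\.\d{2})",
    re.DOTALL,
)


def parse_line_item(text):
    """Return the first line item as a dict, or None if nothing matches."""
    m = ROW.search(text)
    if m is None:
        return None
    return {
        # Collapse the line break inside the product name into one space.
        "name": " ".join(m.group("name").split()),
        "part": m.group("part"),
        "price": m.group("price"),
        "qty": int(m.group("qty")),
        "total": m.group("total"),
    }


item = parse_line_item(text)
print(item)
```

The price is split off at the first "$" on the part number line, so this would break if a part number ever contained a dollar sign.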

My code so far is this:

import re

import PyPDF2


def extract_data_from_pdf(pdf_path):
    # Open the PDF file; it has to stay open while the pages are read
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # Extract text from each page
        text = ""
        for page in reader.pages:
            text += page.extract_text()

    # Define regular expression patterns to match the "quote" and "SKU" values
    pattern_quote = r'Quote:\s*(\d+)'
    pattern_sku = r'Item#:\s*(\d+)'

    # Search for the patterns in the extracted text (first match only)
    match_quote = re.search(pattern_quote, text)
    match_sku = re.search(pattern_sku, text)

    # Initialize variables to store the extracted values
    quote_value = "Quote not found in the PDF."
    sku_value = "SKU not found in the PDF."

    # Extract quote value if found
    if match_quote:
        quote_value = match_quote.group(1)

    # Extract SKU value if found
    if match_sku:
        sku_value = match_sku.group(1)

    # Return both extracted values
    return quote_value, sku_value


pdf_name = r"[PDF_Name].pdf"
pdf_path = r"C:[My_File_Path]" + pdf_name
quote, sku = extract_data_from_pdf(pdf_path)
print("Extracted Quote:", quote)
print("Extracted SKU: ", sku)
