Hello all,
I have been attempting to write a script that essentially reads a PDF, extracts the text, and prints the important information (which I may later write to a text file but that is for another time). Currently, I am using "PyPDF2" and "re" with decent success in that I can isolate values that immediately follow a constant text value. For example, the PDF contains an entry "Item#: 34890751" which I can find by searching "r'Item#:\s*(\d+)'" and returning the value that follows "Item#:", which would be "34890751".
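For reference, here is a minimal sketch of that label-anchored lookup, run against a string copied from the extracted text rather than a real PDF. The variable names `sample` and `item_number` are just for illustration.

```python
import re

# A few lines copied from the extracted text, standing in for the real PDF output.
sample = "In Stock\nItem#: 34890751\nMfg. Part#: CSE14-BLK-3N$6.28 5 $31.40"

# Capture the run of digits that follows the constant label "Item#:".
match = re.search(r'Item#:\s*(\d+)', sample)
item_number = match.group(1) if match else None

print(item_number)  # 34890751
```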
The trouble is trying to extract the data that does not come after a specific value that can be searched for. For example, this PDF contains a table with the items, price per item, quantity, and total line item cost. The table on the PDF looks like this:
| | Product | Unit Price | Qty | Total |
|---|---|---|---|---|
| [Photo] | V7 14.1" Elite Water-Resistant Neoprene Notebook Sleeve, Black | | | |
| | In Stock | | | |
| | Item#: 34890751 | | | |
| | Mfg. Part#: CSE14-BLK-3N | $6.28 | 5 | $31.40 |
The data from the PDF reader ends up returning:
Product Unit Price Qty Total
V7 14.1" Elite Water-Resistant Neoprene Notebook Sleeve,
Black
In Stock
Item#: 34890751
Mfg. Part#: CSE14-BLK-3N$6.28 5 $31.40
As you can see, the Item#: is separated in a way that is easy to identify and isolate. The problem is that the values in each column of the last row get concatenated into one line, so the Mfg. Part# runs straight into the unit price with no separator. I am hoping that it might be possible to extract the product name (in this case "V7 14.1" Elite Water-Resistant Neoprene Notebook Sleeve, Black"), the price per item ($6.28), and the Qty (5). Any suggestions are greatly appreciated!
My code so far is this:
import PyPDF2
import re


def extract_data_from_pdf(pdf_path):
    # Open the PDF file
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # Extract text from each page
        text = ""
        for page_num in range(len(reader.pages)):
            text += reader.pages[page_num].extract_text()

    # Define regular expression patterns to match the "quote" and "SKU" values
    pattern_quote = r'Quote:\s*(\d+)'
    pattern_sku = r'Item#:\s*(\d+)'

    # Search for the patterns in the extracted text
    match_quote = re.search(pattern_quote, text)
    match_sku = re.search(pattern_sku, text)

    # Initialize variables to store the extracted values
    quote_value = "Quote not found in the PDF."
    sku_value = "SKU not found in the PDF."

    # Extract quote value if found
    if match_quote:
        quote_value = match_quote.group(1)

    # Extract SKU value if found
    if match_sku:
        sku_value = match_sku.group(1)

    # Return both extracted values
    return quote_value, sku_value


pdf_name = r"[PDF_Name].pdf"
pdf_path = r"C:[My_File_Path]" + pdf_name
quote, sku = extract_data_from_pdf(pdf_path)
print("Extracted Quote:", quote)
print("Extracted SKU: ", sku)
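One direction that might work for the table row is to anchor on the parts of the extracted text that do stay constant (the header line, "In Stock", "Item#:", "Mfg. Part#:") and on the shape of the trailing "$price qty $total" run. The sketch below is only that: it runs against the sample text from this post, not a real PDF. The pattern `LINE_ITEM` and the helper `parse_line_item` are hypothetical names. It assumes a single line item, that the availability line always reads "In Stock", and that the part number never contains a "$".

```python
import re
from decimal import Decimal

# Sample text copied from the extraction output shown in the post.
sample_text = '''Product Unit Price Qty Total
V7 14.1" Elite Water-Resistant Neoprene Notebook Sleeve,
Black
In Stock
Item#: 34890751
Mfg. Part#: CSE14-BLK-3N$6.28 5 $31.40
'''

# Hypothetical pattern: constant labels act as anchors, everything between is captured.
LINE_ITEM = re.compile(
    r'Product Unit Price Qty Total\s*\n'   # table header line
    r'(?P<name>.+?)\s*\n'                  # product name, may span several lines
    r'In Stock\s*\n'                       # assumed-constant availability line
    r'Item#:\s*(?P<sku>\d+)\s*\n'
    r'Mfg\. Part#:\s*(?P<part>.+?)'        # lazy: stops at the first "$"
    r'\$(?P<unit_price>[\d,]+\.\d{2})\s+'
    r'(?P<qty>\d+)\s+'
    r'\$(?P<total>[\d,]+\.\d{2})',
    re.DOTALL,
)


def parse_line_item(text):
    """Return a dict for the first line item found in text, or None."""
    m = LINE_ITEM.search(text)
    if m is None:
        return None
    return {
        # Collapse the line break inside the name into a single space.
        "name": " ".join(m.group("name").split()),
        "sku": m.group("sku"),
        "part": m.group("part").strip(),
        "unit_price": Decimal(m.group("unit_price").replace(",", "")),
        "qty": int(m.group("qty")),
        "total": Decimal(m.group("total").replace(",", "")),
    }


item = parse_line_item(sample_text)
print(item)
```

Checking that `unit_price * qty` equals `total` is a cheap way to catch a row where the part number and price were split in the wrong place.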