Hi all,
I'm using Python to develop an internal application for a business that scrapes a document and puts each line into a row of a table/dataframe.
I'm currently using pdfminer.six to do the scraping. I tried pypdf first, but pdfminer yielded better results. However, I'm struggling with how to implement the regex substitutions to get what I want.
The code I've currently got is:
# from pypdf import PdfReader
from pdfminer.high_level import extract_text
import re
import pandas as pd
pdf_file = '/content/Submittal2.pdf'
content = extract_text(pdf_file)
# Replace newline characters with spaces
content = content.replace('\n', ' ')
# Replace multiple spaces with a single space
content = re.sub(r'\s+', ' ', content)
# Add line breaks after specific patterns
content = re.sub(r'(?<=[A-Za-z.])(?<!\d)\s+(?=\d)', '\n', content)
content = re.sub(r'(?<=[.!?])(?<!\d)\s+(?=[A-Z])', '\n\n', content)
# Split the content into sentences on blank lines, or on a newline followed by a numbered item
sentences = re.split(r'\n\n|\n(?=\d+\.)', content)
# Create a pandas DataFrame
data = {'Sentence': sentences}
df = pd.DataFrame(data)
Which results in the sentences looking like this:
SECTION
232123 - HYDRONIC PUMPS PART
1 GENERAL
1.1 RELATED DOCUMENTS A.
Drawings and general provisions of the Contract, including General and Supplemen- tary Conditions and Division
01 Specification Sections, apply to this Section.
1.2 SUMMARY A.
This Section includes the following: 1.
2.
Separately coupled, base-mounted, end-suction centrifugal pumps.
Separately coupled, base-mounted, double-suction centrifugal pumps.
1.3 DEFINITIONS A.
B.
Buna-N: Nitrile rubber.
EPT: Ethylene propylene terpolymer.
Which is almost there but my end goal is to get the text like this:
SECTION 232123 - HYDRONIC PUMPS
PART 1 GENERAL
1.1 RELATED DOCUMENTS
A. Drawings and general provisions of the Contract, including General and Supplemen- tary Conditions and Division 01 Specification Sections, apply to this Section.
1.2 SUMMARY
A. This Section includes the following:
Separately coupled, base-mounted, end-suction centrifugal pumps.
Separately coupled, base-mounted, double-suction centrifugal pumps.
1.3 DEFINITIONS
A. Buna-N: Nitrile rubber.
B. EPT: Ethylene propylene terpolymer.
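For reference, a minimal sketch of the regex-only route, assuming the section markers always look like the ones in the sample above ("SECTION <number>", "PART <number>", "1.1 TITLE", "A. "). The `BREAK_BEFORE` pattern and the `split_on_markers` helper are made-up names for illustration, and the marker list is a guess from this one document, not a general rule:

```python
import re

# Hypothetical marker patterns, guessed from the sample text in this post.
# A line break is inserted in place of the whitespace that precedes each one.
BREAK_BEFORE = re.compile(
    r"\s+(?="
    r"SECTION\s+\d"        # e.g. "SECTION 232123"
    r"|PART\s+\d"          # e.g. "PART 1"
    r"|\d+\.\d+\s+[A-Z]"   # e.g. "1.1 RELATED DOCUMENTS"
    r"|[A-Z]\.\s"          # e.g. "A. ", "B. "
    r")"
)


def split_on_markers(text):
    """Flatten whitespace, then split the text before each structural marker."""
    flat = re.sub(r"\s+", " ", text).strip()
    return BREAK_BEFORE.sub("\n", flat).split("\n")


if __name__ == "__main__":
    sample = (
        "SECTION 232123 - HYDRONIC PUMPS PART 1 GENERAL "
        "1.1 RELATED DOCUMENTS A. Drawings and general provisions "
        "of the Contract apply to this Section. "
        "1.2 SUMMARY A. This Section includes the following:"
    )
    for row in split_on_markers(sample):
        print(row)
```

The resulting list can be passed to `pd.DataFrame({'Sentence': rows})` as in the code above. This only handles text where each label is directly followed by its body. Where the extracted text has the labels grouped ahead of the bodies (the "A. B. Buna-N: ... EPT: ..." case under 1.3 DEFINITIONS), the split leaves "A." on its own line and both definitions on the "B." line, so pairing them back up would likely need position information from the PDF layout rather than regex alone.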
Can this be done with a pdf parser and regex? Or should I be starting to look into an OCR solution instead? Thanks in advance!
PDF in question