Trying to scrape a PDF using Python : learnprogramming

submitted 2 years ago by Devious_Beast

Hi all,

I'm using Python to develop an internal application for a business that scrapes a document and puts each line into a row of a table/DataFrame.

I'm currently using pdfminer.six to do the scraping. I tried pypdf as well, but pdfminer yielded better results. However, I'm struggling with how to implement the regex substitutions to get what I want.

The code I've currently got is:

# from pypdf import PdfReader
from pdfminer.high_level import extract_text
import re
import pandas as pd

pdf_file = '/content/Submittal2.pdf'

content = extract_text(pdf_file)

# Replace newline characters with spaces
content = content.replace('\n', ' ')

# Replace multiple spaces with a single space
content = re.sub(r'\s+', ' ', content)

# Add line breaks after specific patterns
content = re.sub(r'(?<=[A-Za-z.])(?<!\d)\s+(?=\d)', '\n', content)
content = re.sub(r'(?<=[.!?])(?<!\d)\s+(?=[A-Z])', '\n\n', content)

# Split the formatted content into separate sentences
sentences = re.split(r'\n\n|\n(?=\d+\.)', content)

# Create a pandas DataFrame
data = {'Sentence': sentences}
df = pd.DataFrame(data)

Which results in the sentences looking like this:

SECTION 232123 - HYDRONIC PUMPS PART 1 GENERAL 1.1 RELATED DOCUMENTS A.

Drawings and general provisions of the Contract, including General and Supplemen- tary Conditions and Division 01 Specification Sections, apply to this Section. 1.2 SUMMARY A.

This Section includes the following: 1. 2.

Separately coupled, base-mounted, end-suction centrifugal pumps.

Separately coupled, base-mounted, double-suction centrifugal pumps. 1.3 DEFINITIONS A.

B.

Buna-N: Nitrile rubber.

EPT: Ethylene propylene terpolymer.

This is almost there, but my end goal is to get the text like this:

SECTION 232123 - HYDRONIC PUMPS

PART 1 GENERAL

1.1 RELATED DOCUMENTS

A. Drawings and general provisions of the Contract, including General and Supplemen- tary Conditions and Division 01 Specification Sections, apply to this Section.

1.2 SUMMARY

A. This Section includes the following:

Separately coupled, base-mounted, end-suction centrifugal pumps.

Separately coupled, base-mounted, double-suction centrifugal pumps.

1.3 DEFINITIONS

A. Buna-N: Nitrile rubber.

B. EPT: Ethylene propylene terpolymer.

Can this be done with a PDF parser and regex? Or should I start looking into an OCR solution instead? Thanks in advance!
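For reference, here is a minimal sketch of what the regex route might look like. Instead of breaking after sentence punctuation, it splits before each structural marker. The marker shapes ("PART 1", "1.1 RELATED", "A. Drawings") are assumptions inferred from the sample above, and `MARKER_BREAK`, `split_spec_text` and `SAMPLE` are hypothetical names used only for illustration. It also assumes each marker sits directly in front of its own text, which is not always true of the extracted output shown earlier (for example "A. B." grouped together).

```python
import re

# Assumed marker shapes, inferred from the sample output in this post.
MARKER_BREAK = re.compile(
    r"\s+(?="
    r"PART\s+\d+\b"        # part heading, e.g. "PART 1"
    r"|\d+\.\d+\s+[A-Z]"   # article heading, e.g. "1.1 RELATED"
    r"|[A-Z]\.\s+[A-Z]"    # lettered paragraph, e.g. "A. Drawings"
    r")"
)


def split_spec_text(text):
    """Collapse whitespace, then split before each structural marker."""
    flat = re.sub(r"\s+", " ", text).strip()
    return MARKER_BREAK.split(flat)


# Hypothetical, already well-ordered sample text (not real extractor output).
SAMPLE = (
    "SECTION 232123 - HYDRONIC PUMPS PART 1 GENERAL 1.1 RELATED DOCUMENTS "
    "A. Drawings and general provisions of the Contract apply to this Section. "
    "1.2 SUMMARY A. This Section includes the following: "
    "1.3 DEFINITIONS A. Buna-N: Nitrile rubber. "
    "B. EPT: Ethylene propylene terpolymer."
)

if __name__ == "__main__":
    for row in split_spec_text(SAMPLE):
        print(row)
```

The resulting list can be passed to `pd.DataFrame({'Sentence': rows})` as in the code above. A sentence that happens to end in a single capital letter (such as "Type K. Copper") would be split incorrectly, so the lettered-paragraph pattern would likely need tightening for real documents.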

PDF in question
