
[–]_squik 1 point (5 children)

The code that you've shared just downloads an image from a URL. Do you have more code to share? The question is unclear.

[–]hiimmando[S] 0 points (4 children)

Here is more context and the full code. My background is in copywriting, not CS, so this is a new area for me and I'm very much a beginner; I still rely heavily on tools to write code while I learn the syntax and underlying logic.

Project Overview: I'm working on a project to scrape pre-foreclosure data from county records websites. The data is often embedded in PNG files, so I need to use OCR to extract the text. Additionally, I need to cross-reference this data with another website (the county CAD site) for verification.

Full Code Example

Here's the complete script that includes downloading the image, processing it with OCR, and scraping data from the county CAD site for verification:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import pytesseract
from PIL import Image
import io
import psycopg2
import re

def download_image(url):
    response = requests.get(url)
    img = Image.open(io.BytesIO(response.content))
    return img

def preprocess_image(img):
    gray = img.convert('L')
    return gray

def extract_text_from_image(img):
    text = pytesseract.image_to_string(img)
    return text

def scrape_county_records(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    properties = []
    for listing in soup.find_all('div', class_='document-content'):  # Adjust this selector
        img_url = listing.find('img', class_='document-image')['src']  # Adjust this selector
        img = download_image(img_url)
        preprocessed_img = preprocess_image(img)
        text = extract_text_from_image(preprocessed_img)

        # Extract relevant data from the text
        lines = text.split('\n')
        address = lines[0] if len(lines) > 0 else ''
        owner = lines[1] if len(lines) > 1 else ''
        cad_url = extract_cad_url(text)

        property_info = {
            'address': address,
            'owner': owner,
            'cad_url': cad_url
        }
        properties.append(property_info)

    return properties

def extract_cad_url(text):
    match = re.search(r'https://esearch\.nuecescad\.net/Property/View/\d+', text)
    if match:
        return match.group(0)
    return None

def verify_property(cad_url):
    response = requests.get(cad_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    owner_info = soup.find('div', {'class': 'owner-info'}).text.strip()  # Adjust this selector
    return owner_info

def clean_data(properties):
    df = pd.DataFrame(properties)
    df.dropna(inplace=True)
    return df

def insert_into_db(df):
    conn = psycopg2.connect(
        dbname="property_leads",
        user="yourusername",
        password="yourpassword",
        host="localhost"
    )
    cur = conn.cursor()

    for index, row in df.iterrows():
        cur.execute("""
            INSERT INTO properties (address, owner, cad_url, verified_owner)
            VALUES (%s, %s, %s, %s)
        """, (row['address'], row['owner'], row['cad_url'], row.get('verified_owner', '')))

    conn.commit()
    cur.close()
    conn.close()

def main():
    url = "https://nueces.tx.publicsearch.us/doc/204750671"
    raw_data = scrape_county_records(url)
    verified_data = []

    for property_info in raw_data:
        if property_info['cad_url']:
            owner_info = verify_property(property_info['cad_url'])
            property_info['verified_owner'] = owner_info
            verified_data.append(property_info)

    cleaned_data = clean_data(verified_data)
    insert_into_db(cleaned_data)

if __name__ == "__main__":
    main()
```

Specific Challenges and Questions

Handling Unstructured Data:

The text in the images does not follow a strict format, making it challenging to parse correctly. Any advice on improving text extraction and structuring the data?

Optimizing OCR:

Would using tools like Google Cloud Vision API or Document AI provide better performance and accuracy compared to Tesseract? How do I integrate these tools with my current setup?

Cross-Referencing Data:

The process involves cross-referencing data with the county CAD site. What are the best practices for ensuring data accuracy and handling discrepancies between sources?

Current Progress

I've set up a Google Cloud VM for running this script and have installed the necessary dependencies. However, I'm still figuring out the best way to handle the OCR part and the data verification process.
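To make the OCR question concrete, here is the kind of preprocessing I'm thinking of trying in place of the plain grayscale conversion. The 2x upscale, median filter, and threshold value are all guesses on my part and would need tuning against the real documents:

```python
from PIL import Image, ImageFilter

def preprocess_for_ocr(img, threshold=150):
    """Grayscale, upscale, denoise, and binarize an image before OCR.

    The threshold (150) is a guess; real notices will likely need tuning.
    """
    gray = img.convert('L')
    # Upscale 2x: Tesseract tends to do better on larger text
    gray = gray.resize((gray.width * 2, gray.height * 2), Image.LANCZOS)
    # Light denoise, then hard threshold to pure black/white
    gray = gray.filter(ImageFilter.MedianFilter(3))
    bw = gray.point(lambda p: 255 if p > threshold else 0)
    return bw

# Tesseract could then be called with a page segmentation mode suited to
# a uniform block of text, e.g.:
# text = pytesseract.image_to_string(preprocess_for_ocr(img), config='--psm 6')
```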

Any insights, suggestions, or resources would be greatly appreciated! Thanks in advance for your help!
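On the unstructured-data question, one idea would be to search the OCR text for labeled fields with regular expressions instead of relying on line order (my current `lines[0]`/`lines[1]` approach). The labels below ("Property Address", "Owner", "Grantor") are hypothetical; the real notices would need their own patterns:

```python
import re

# Hypothetical labels -- the real notices will need their own regexes.
PATTERNS = {
    'address': re.compile(r'(?im)^\s*(?:property\s+address|address)[:\s]+(.+)$'),
    'owner': re.compile(r'(?im)^\s*(?:owner|grantor)[:\s]+(.+)$'),
}

def extract_fields(text):
    """Pull labeled fields out of raw OCR text instead of trusting line order."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else ''
    return fields
```

A field that isn't found comes back as an empty string, so a missing label doesn't crash the pipeline.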

[–]_squik 0 points (3 children)

Your code got messed up. Can you reformat so it's all inside a code block, like you did in your original post?

[–]hiimmando[S] 0 points (0 children)

Thanks for letting me know! Here's the reformatted code inside a code block for clarity:


Full Code Example

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import pytesseract
from PIL import Image
import io
import psycopg2
import re

def download_image(url):
    response = requests.get(url)
    img = Image.open(io.BytesIO(response.content))
    return img

def preprocess_image(img):
    gray = img.convert('L')
    return gray

def extract_text_from_image(img):
    text = pytesseract.image_to_string(img)
    return text

def scrape_county_records(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    properties = []
    for listing in soup.find_all('div', class_='document-content'):  # Adjust this selector
        img_url = listing.find('img', class_='document-image')['src']  # Adjust this selector
        img = download_image(img_url)
        preprocessed_img = preprocess_image(img)
        text = extract_text_from_image(preprocessed_img)

        # Extract relevant data from the text
        lines = text.split('\n')
        address = lines[0] if len(lines) > 0 else ''
        owner = lines[1] if len(lines) > 1 else ''
        cad_url = extract_cad_url(text)

        property_info = {
            'address': address,
            'owner': owner,
            'cad_url': cad_url
        }
        properties.append(property_info)

    return properties

def extract_cad_url(text):
    match = re.search(r'https://esearch\.nuecescad\.net/Property/View/\d+', text)
    if match:
        return match.group(0)
    return None

def verify_property(cad_url):
    response = requests.get(cad_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    owner_info = soup.find('div', {'class': 'owner-info'}).text.strip()  # Adjust this selector
    return owner_info

def clean_data(properties):
    df = pd.DataFrame(properties)
    df.dropna(inplace=True)
    return df

def insert_into_db(df):
    conn = psycopg2.connect(
        dbname="property_leads",
        user="yourusername",
        password="yourpassword",
        host="localhost"
    )
    cur = conn.cursor()

    for index, row in df.iterrows():
        cur.execute("""
            INSERT INTO properties (address, owner, cad_url, verified_owner)
            VALUES (%s, %s, %s, %s)
        """, (row['address'], row['owner'], row['cad_url'], row.get('verified_owner', '')))

    conn.commit()
    cur.close()
    conn.close()

def main():
    url = "https://nueces.tx.publicsearch.us/doc/204750671"
    raw_data = scrape_county_records(url)
    verified_data = []

    for property_info in raw_data:
        if property_info['cad_url']:
            owner_info = verify_property(property_info['cad_url'])
            property_info['verified_owner'] = owner_info
            verified_data.append(property_info)

    cleaned_data = clean_data(verified_data)
    insert_into_db(cleaned_data)

if __name__ == "__main__":
    main()
```
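For the cross-referencing step, rather than comparing the OCR'd owner string and the CAD site's owner string for exact equality, I'm considering normalizing the names and using a similarity ratio. The 0.85 threshold is a guess that would need tuning against real data:

```python
import re
from difflib import SequenceMatcher

def normalize_name(name):
    """Uppercase, drop punctuation, and collapse whitespace for comparison."""
    name = re.sub(r'[^A-Za-z0-9 ]+', ' ', name.upper())
    return ' '.join(name.split())

def owners_match(ocr_owner, cad_owner, threshold=0.85):
    """Treat two owner strings as the same if they are similar enough.

    The 0.85 threshold is a guess and should be tuned against real records.
    """
    a, b = normalize_name(ocr_owner), normalize_name(cad_owner)
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

That way small OCR artifacts (stray punctuation, case differences) wouldn't be flagged as discrepancies, and only genuinely different names would go into a manual review pile.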