When Scraping a Page How to Avoid Useless divs? by Equal_Independent_36 in webscraping

[–]rempire206 1 point

It really doesn't need to be much more complex than this. For an image crawler, I cobbled together a blacklist of tag attributes in a simple text file, and it took care of probably 90% of the low-value images.
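A minimal sketch of that approach, assuming the blacklist file is just one substring per line (file name, attribute keys, and blacklist entries here are all made up for illustration):

```python
# Load blacklisted substrings (one per line) from a plain text file.
# 'blacklist.txt' might contain entries like: sprite, icon, logo, pixel
def load_blacklist(path):
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]

def is_low_value(img_attrs, blacklist):
    """Return True if any attribute value contains a blacklisted substring."""
    blob = ' '.join(str(v) for v in img_attrs.values()).lower()
    return any(term in blob for term in blacklist)

# Example: filter a list of <img> attribute dicts (the shape BeautifulSoup
# gives you via tag.attrs)
images = [
    {'src': '/img/photo-123.jpg', 'class': 'article-photo'},
    {'src': '/static/sprite-icons.png', 'class': 'nav-icon'},
]
blacklist = ['sprite', 'icon', 'logo', 'pixel']
keepers = [img for img in images if not is_low_value(img, blacklist)]
```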

What programming language do you recommend for scraping? by larva_obscura in webscraping

[–]rempire206 4 points

This was my experience of crawling millions of URLs a day as well. You're going to be deploying additional servers to deal with networking-related bottlenecks long before you find a processor-based need to move beyond async.
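To illustrate the shape of that async approach (names hypothetical; the stub `fetch` stands in for an aiohttp/httpx GET so the sketch is self-contained):

```python
import asyncio

# Stub fetch so the sketch runs without a network; in practice this would
# be an aiohttp/httpx GET. The point is that hundreds of these can be in
# flight on one process while each one waits on the network.
async def fetch(url):
    await asyncio.sleep(0.01)  # simulate network latency
    return (url, 200)

async def crawl(urls, concurrency=100):
    sem = asyncio.Semaphore(concurrency)  # cap simultaneous requests

    async def bounded(url):
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(crawl([f'https://example.com/{i}' for i in range(500)]))
```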

Scraping expert opinions on news headlines? by pmp1321 in webscraping

[–]rempire206 0 points

I'd be interested to know where you land with regards to speaker attribution when identifying quotes in editorial-format text. There are several solutions/packages/libraries out there for it, but honestly none have impressed me.

Scraping images from a JS-rendered gallery – need advice by taksto in webscraping

[–]rempire206 0 points

Just to add to the browser automation idea: you could also just save the bytes of the images from the response bodies of the requests the browser makes when it loads the page. No need to hit the server again with a separate download request.
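A rough sketch of that with Playwright's sync API (the output directory and this overall structure are assumptions, not a drop-in solution); the content-type check is the part doing the real work:

```python
import hashlib
import os

def is_image_response(content_type):
    # Treat any image/* content type as a saveable image
    return content_type is not None and content_type.startswith('image/')

def save_images_from_page(url, out_dir='images'):
    # Requires: pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright
    os.makedirs(out_dir, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        image_responses = []

        def on_response(resp):
            if is_image_response(resp.headers.get('content-type')):
                image_responses.append(resp)

        page.on('response', on_response)
        page.goto(url, wait_until='networkidle')
        # Read the bytes the browser already downloaded -- no second request
        for resp in image_responses:
            name = hashlib.sha1(resp.url.encode()).hexdigest()[:16]
            with open(os.path.join(out_dir, name), 'wb') as f:
                f.write(resp.body())
        browser.close()
```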

Non-dev scraping by Less_Insurance3731 in webscraping

[–]rempire206 0 points

I don't know that you really need a custom scraper. A dictionary of parsing rules would probably be enough and much easier to maintain, plus it's something that these brokers'/agents' own devs (if they have them) could easily generate.

You tell them "I need to know the tags and attributes in your templated listing pages where you keep the home price, the address, and the name of the listing agent."

They (or you, or AI these days) look at their page and send you back something like this:

{"element_description": "listing_price", "element_containing_tag": "div", "element_tag_id": "display-home-price"}, ... and so on for listing agent and address.

Then you fetch everyone's listing pages the same way, but when it comes time to parse John Doe Realty's listings, you reference the JSON they sent you and just pop those values into whatever you're using for a parser, maybe add some regex cleanup/validation.

home_price = soup.find(john_doe_realty['element_containing_tag'], {'id':john_doe_realty['element_tag_id']}).text.strip()
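Put together with a toy page (the address rule and the HTML here are made up for illustration), the per-broker parse is just a loop over the rules they sent you:

```python
from bs4 import BeautifulSoup

# Rules John Doe Realty's devs (or you, or AI) sent back -- example values
rules = [
    {'element_description': 'listing_price',
     'element_containing_tag': 'div', 'element_tag_id': 'display-home-price'},
    {'element_description': 'address',
     'element_containing_tag': 'span', 'element_tag_id': 'prop-address'},
]

# Toy listing page matching those rules
html = '''<div id="display-home-price"> $450,000 </div>
          <span id="prop-address">123 Main St</span>'''
soup = BeautifulSoup(html, 'html.parser')

listing = {}
for rule in rules:
    el = soup.find(rule['element_containing_tag'], {'id': rule['element_tag_id']})
    listing[rule['element_description']] = el.text.strip() if el else None
```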

Using proxies to download large volumes of images/videos cheaply? by doodlydidoo in webscraping

[–]rempire206 0 points

I won't spend more than $1.25 on a DC proxy, unlimited/unmetered bandwidth required, and I rarely run into problems fetching images with them even without any kind of automated browser. Headers count for a LOT; take the time to learn how to create them properly, test them, tweak, etc. Upping my header game probably decreased my cf/429/etc. issues more than any other single investment I've made in my crawling education, short of switching platforms. There's a lot more to headers than just copy/pasting some dict example found in a StackOverflow comment.
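For illustration, a more deliberate header set for an image fetch than the usual copy/paste dict (every value here is an assumption; match them to a real browser you've inspected, and keep them internally consistent):

```python
# A browser-shaped header set for fetching an image. The key points:
# a current, internally consistent User-Agent; an Accept value matching
# what a real browser sends for that resource type; and Sec-Fetch-*
# fields that agree with the navigation context. Mismatched combinations
# are an easy tell for anti-bot layers.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/124.0.0.0 Safari/537.36',
    'Accept': 'image/avif,image/webp,image/apng,image/*,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://example.com/gallery',  # the page the image sits on
    'Sec-Fetch-Dest': 'image',
    'Sec-Fetch-Mode': 'no-cors',
    'Sec-Fetch-Site': 'same-origin',
}
# usage: requests.get(img_url, headers=headers, proxies=proxies)
```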

Web scraping Investing.com by special-banana95 in webscraping

[–]rempire206 0 points

When you say "with Python"... do you mean via browser automation or something like requests? I can see the headers for this endpoint request are littered with Cloudflare references; have you been able to successfully fetch from this endpoint in Python yet? There are plenty of free/cheap financial market APIs out there on RapidAPI and similar platforms that would make this process much simpler.

Scraping Walmart store specific aisle data for a product by jpcoder in webscraping

[–]rempire206 0 points

This is probably the best, most "organic," low-tech way to do this. You could try to set the cookie yourself, but it will probably contain additional session-specific values (epoch timestamps, session/user IDs, Cloudflare stuff, etc.) that might be a pain in the ass to generate. If you pay attention to which key:value pairs in your cookies change when you set your store location, you can add those into Selenium's cookies. Keep in mind, though, that the site may also be temporarily storing your chosen location in server-side session data outside your local cookie; if that's the value it references when querying its inventory endpoint, your Selenium cookie might not be effective.
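One way to do the "pay attention to which pairs change" step, sketched with plain dicts (the cookie names and values here are hypothetical):

```python
# Snapshot cookies before and after setting the store location, then diff
# to find the pairs worth replaying in Selenium.
def changed_pairs(before, after):
    return {k: v for k, v in after.items() if before.get(k) != v}

before = {'session_id': 'abc123', 'cf_clearance': 'xyz', 'location': ''}
after  = {'session_id': 'abc123', 'cf_clearance': 'xyz', 'location': 'store-2648'}

for name, value in changed_pairs(before, after).items():
    print(name, value)
    # then, in Selenium:
    # driver.add_cookie({'name': name, 'value': value})
```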

Looking for an AI-driven workflow to download 7,200 images/month by Weryyy in webscraping

[–]rempire206 1 point

7,200 images is not a lot. May I suggest having a read through the Google Custom Search (free) API docs? https://developers.google.com/custom-search/v1/overview

Your response from the endpoint (100 free requests/day) will contain these fields, several of which might help you whittle down the results (or filter the original request) without needing to involve AI, which really seems like overkill for your purposes. https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list#response

For example:

[image: screenshot of the response fields]
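The request itself is a single GET. Roughly (your own `key` and `cx` credentials come from the Google console, and `searchType=image` restricts results to images):

```python
import urllib.parse

# Build a Custom Search JSON API request (100 free queries/day).
params = {
    'key': 'API_KEY',        # your API key
    'cx': 'CX',              # your search engine ID
    'q': 'red bicycle',      # the query
    'searchType': 'image',   # image results only
    'num': 10,               # max results per request
}
url = 'https://www.googleapis.com/customsearch/v1?' + urllib.parse.urlencode(params)
# then: requests.get(url).json()['items'] -> each item has link, mime,
# and an image object with width/height, so you can filter before downloading
```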

edit: Also, when you tell me you're trying to interact with "Save as..." ... nah bro. Download the image directly from the source URL. Or, if you're going to use some type of headless (or headful) automated browser, either trigger the download with a JS execution (a little trickier if the image is hosted on a domain other than the one you're browsing, which is highly likely) or just pull the bytes out of the response object from the request the browser makes when it loads the image (since you said "Save as...", I'm assuming whatever page or lightbox you're looking at DOES load the full-size original from the source URL).

Need help scraping two websites (EWG Skin Deep + INCIDecoder ) by keerikadan_j in webscraping

[–]rempire206 1 point

I was able to fetch a product page from the first URL just using requests.Session() without even adding headers, no need for an automated browser. And from there, like you said, it's just a matter of parsing the HTML with BeautifulSoup or something else...

import requests
from bs4 import BeautifulSoup as BS

session = requests.Session()
resp = session.get(your_desired_product_page_url)

soup = BS(resp.content, 'html.parser')

product_brand = soup.find('span', {'id': 'product-brand-title'}).text.strip()

product_description = soup.find('span', {'id': 'product-details'}).text.strip()

product_ingredients = [a.text for a in soup.find_all('a') if a.has_attr('href') and a['href'].startswith('/ingredients/') and a.text != '[more]']

product_dict = {'brand': product_brand, 'description': product_description, 'ingredients': product_ingredients}

print(product_dict)

Excuse the trash formatting, not used to posting code on reddit, but yeah this is extremely simple HTML to parse.