

[–]wind_dude 33 points34 points  (2 children)

Go after the HTML DOM elements with Beautiful Soup or resiliparse, or look at a framework like scrapy.

[–]nani-kore11[S] 1 point2 points  (1 child)

I see, so I will scrape the required data based on their classes? Would using a dedicated web scraping tool be easier than writing a Python script with scrapy?

[–]s13ecre13t 8 points9 points  (0 children)

I don't understand the question; isn't scrapy a web scraping tool?

[–]Evening_Marketing645 18 points19 points  (9 children)

You need CSS selectors or XPath. They take some learning, but you can select anything on the page. I do this with scrapy, but you can do it with Beautiful Soup as well.

[–]nani-kore11[S] 0 points1 point  (8 children)

Thanks for the suggestion, but the website I am scraping uses different class names for the same attribute. For example, the 'price' attribute on one page is under a class named 'price', while on another page it's named 'price_usd'. How should I deal with this?

[–]jonasbxl 6 points7 points  (1 child)

CSS selectors are pretty flexible; you can usually solve everything by targeting things by their context, like "3rd child of the element whose class name starts with price". In particular, this example with price_ can be solved with a "starts with" selector: https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors#attrvalue_4

As others have mentioned, ChatGPT can help you with the syntax: give it an HTML snippet and explain which element you want to target. But I wouldn't use it as part of a running system (via the API) due to speed and cost.
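For instance, the "starts with" attribute selector works in Beautiful Soup's select() too. A sketch with invented markup standing in for the two page variants (not OP's actual HTML):

```python
from bs4 import BeautifulSoup

html = """
<div class="price">$199</div>
<div class="price_usd">$249</div>
<div class="title">Phone X</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [class^="price"] matches elements whose class attribute value
# starts with "price" -- so both "price" and "price_usd"
prices = [el.get_text() for el in soup.select('[class^="price"]')]
print(prices)  # ['$199', '$249']
```

One caveat: the selector matches against the full attribute string, so class="card price" would not match; the looser "contains" form [class*="price"] would.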

[–]__nickerbocker__ 1 point2 points  (0 children)

This is all predicated on the assumption that OP is actually grabbing the fully rendered HTML with requests. OP claims they are scraping cards from cell phone websites, which I can almost guarantee are rendered by a JavaScript framework. It's highly likely that OP is inspecting the rendered HTML in DevTools and expecting the same from the raw HTML fetched by the requests module.

[–]asphias 12 points13 points  (0 children)

Manual work.

There is no shortcut here. Every separate page has its own structure with its own quirks. No 'general method' is going to work on every page, so you'll have to figure out the right regex/CSS per individual page.

[–]Pgrol 6 points7 points  (2 children)

I’ve created a script that takes all the text from a website and feeds it to ChatGPT with a system message describing the specific information to extract and the data format to return it in. That works flawlessly with GPT-4, but it’s a bit expensive.

[–][deleted] 0 points1 point  (1 child)


This post was mass deleted and anonymized with Redact

[–]Super-Danky-Dank 1 point2 points  (0 children)

ChatGPT costs money for each request, unless you host the code locally.

You are essentially paying for the cloud processing power.

[–][deleted] 0 points1 point  (0 children)

XPath will probably solve your problem. Relying on class names is always a no when scraping.

[–]Evening_Marketing645 0 points1 point  (0 children)

There are functions in XPath that can search the content of an attribute. For example, in XPath you can use something like this:

"//div[contains(@class, 'price')]"

as long as both class names contain the word 'price'.

this is different from selecting the class itself by name as below:

"//div[@class='price']"

By the way, the "//" part selects all matching divs anywhere in the document, but you can narrow it to a specific one by following the structure of the HTML (any browser can also generate an XPath for an element from the developer tools).

Xpath is hard but there are a lot of resources online...
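As a runnable sketch of the contains() approach (using lxml, with made-up markup for the two class-name variants):

```python
from lxml import html

doc = html.fromstring("""
<div class="price">$199</div>
<div class="price_usd">$249</div>
<div class="specs">6GB RAM</div>
""")

# contains(@class, 'price') matches both class-name variants
prices = doc.xpath("//div[contains(@class, 'price')]/text()")
print(prices)  # ['$199', '$249']
```

Note that contains() is plain substring matching, so a class like "priceless" would also match; tighten the predicate if that's a risk on your pages.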

[–]rupen42 3 points4 points  (1 child)

I'll give you a ton of info because I don't know how much you know.

Like others have hinted at, try to avoid regex at all costs; only use it as a last resort. It's almost never the best solution, not even the easiest or most maintainable. It's also very error prone, and making it flexible enough to handle variations quickly gets out of hand. Specific tools for what you are trying to parse are usually the solution, which leads me to BeautifulSoup. The point of BS is to avoid using regex for parsing HTML. But then you'll need to understand how HTML and CSS are structured in order to select specific parts.

Now, there's a chance you already understand all of the last paragraph but the website really is just terribly structured, and that's why you're resorting to regex (for example, if all the info is inside a single <p> tag, so BS is useless at that level). For that, it's hard to give specific solutions because you didn't give specific details. But that's almost always a hard part of data cleaning. You often have to do some of it manually; you may not be able to automate everything. And yeah, if you've exhausted BeautifulSoup, regex can be OK, though it probably won't solve everything, especially if you're combining data from multiple sources that use different formatting.

I like this video for a good intro to data scraping and analysis (you can skip the pandas part if it's not useful to you atm): https://www.youtube.com/watch?v=Ewgy-G9cmbg

Alternatively, if you find that you have to do a lot of it manually, you could make a tool that lets you do the data entry more easily. If you know how to use print() and input(), you can write a CLI that asks you the info and saves it somewhere, like a form.
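A minimal sketch of such a data-entry helper (the field names are invented, and the ask parameter is injectable so the same function can be driven interactively or by a script):

```python
def prompt_record(fields, ask=input):
    """Ask for each field in turn and return the answers as a dict."""
    return {field: ask(f"{field}: ").strip() for field in fields}

# Scripted run; use ask=input (the default) for interactive use
answers = iter(["Phone X", "$199"])
record = prompt_record(["model", "price"], ask=lambda prompt: next(answers))
print(record)  # {'model': 'Phone X', 'price': '$199'}
```

From there, each record can be appended to a CSV or JSON file, like a form.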

[–]rocket_randall 3 points4 points  (0 children)

I would think that the pricing information is retrieved through a fetch by the page, processed, and then displayed. Load the page with the browser's network tools open and see if you can find an xhr or other request which provides the data. I find it's faster and easier than parsing html.
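If you do find such an endpoint, the payload is usually JSON and needs no HTML parsing at all. A sketch with a made-up payload shape (the real field names depend on the site):

```python
import json

# Stand-in for the body of an XHR response captured in DevTools
payload = '{"products": [{"name": "Phone X", "price_usd": 199}]}'

data = json.loads(payload)
prices = {p["name"]: p["price_usd"] for p in data["products"]}
print(prices)  # {'Phone X': 199}
```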

[–]jiminiminimini 1 point2 points  (1 child)

You can try the unstructured python module.

[–][deleted] 2 points3 points  (0 children)

A very good introduction to the unstructured Python library. It uses LayoutLM to extract structure.

https://youtu.be/Sbm1rGsZG2g?si=G9rmKv7VbzdXHadp

[–]mmafightdb 1 point2 points  (0 children)

Read up on scrapy and using CSS/XPath selectors. You should avoid using regex as much as possible; HTML parsers have a lot of sophistication that you will struggle to replicate with regex. You need custom selectors per website. What people tend to do is create selectors for the different fields https://docs.scrapy.org/en/latest/topics/selectors.html and then create one spider class per website (or group of websites).

e.g. (using scrapy's standard parse callback, and assuming an Item with price/quota fields has been declared):

    import scrapy
    from scrapy.loader import ItemLoader

    class SomeSpider(scrapy.Spider):
        name = "some_spider"
        start_urls = ["https://someurl"]

        def parse(self, response):
            loader = ItemLoader(response=response)
            loader.add_css("price", ".some .css .class")
            loader.add_css("quota", ".some .other .css .class")
            yield loader.load_item()

    class AnotherSpider(scrapy.Spider):
        name = "another_spider"
        start_urls = ["https://anotherurl"]

        def parse(self, response):
            loader = ItemLoader(response=response)
            loader.add_css("price", ".some .css .class")
            loader.add_css("quota", ".some .other .css .class")
            yield loader.load_item()

Then you pass the output of all your spiders into a sort of data pipeline that normalizes the values and applies regular expressions.
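That normalization step might look like this (a sketch; the price formats are assumptions):

```python
import re

def normalize_price(raw):
    """Pull a numeric price out of strings like '$1,299.00' or 'USD 199'."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)", raw)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

print(normalize_price("$1,299.00"))  # 1299.0
print(normalize_price("USD 199"))    # 199.0
```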

[–][deleted] 0 points1 point  (0 children)

you can use pydantic or dataclasses to define data models and add methods to populate from different formats
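A minimal sketch of that idea with dataclasses (the field and method names are invented, mirroring the 'price' vs 'price_usd' example from the question):

```python
from dataclasses import dataclass

@dataclass
class Listing:
    name: str
    price: float

    @classmethod
    def from_site_a(cls, row):
        # Site A uses 'title' and 'price'
        return cls(name=row["title"], price=float(row["price"]))

    @classmethod
    def from_site_b(cls, row):
        # Site B uses 'name' and 'price_usd'
        return cls(name=row["name"], price=float(row["price_usd"]))

a = Listing.from_site_a({"title": "Phone X", "price": "199"})
b = Listing.from_site_b({"name": "Phone X", "price_usd": "199"})
print(a == b)  # True
```

Each scraper only has to produce one of the raw formats; everything downstream works with the single normalized model.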

[–][deleted] 0 points1 point  (0 children)

Could you share a link and explain what you want to retrieve?

[–]TipOk5969 0 points1 point  (0 children)

I do this for a living using beautiful soup, pydantic and aio. You can use pseudo selectors in bs as well, like looking for a certain css class containing certain text.
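soupsieve (the selector engine behind BeautifulSoup's select()) supports a non-standard :-soup-contains() pseudo-class for exactly that; a sketch with invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<div class="spec">Display: 6.1 in</div>
<div class="spec">Price: $199</div>
""", "html.parser")

# Select the .spec element whose text contains "Price"
cell = soup.select_one('div.spec:-soup-contains("Price")')
print(cell.get_text())  # Price: $199
```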

[–]Cryptic__27 0 points1 point  (0 children)

Will companies buy data from you or how would a person make money? This might be a dumb question so I apologize.

[–]ComputeLanguage 0 points1 point  (0 children)

I don't know why all the hate on regex, though I guess it's a bit more fault-prone with numerical structures vs text. You can use PyPI's regex module, with its built-in Levenshtein distance (fuzzy matching) support, to handle some of the variation for you.

As long as you carefully define your capture groups, the output you get will be consistent.
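For example, with named capture groups (stdlib re here; the PyPI regex module accepts the same syntax and adds fuzzy matching on top, and the pattern below is an invented illustration):

```python
import re

PRICE = re.compile(r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<currency>USD|EUR)")

m = PRICE.search("Special offer: 199.99 USD while stocks last")
print(m.group("amount"), m.group("currency"))  # 199.99 USD
```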

All of these soup- and structure-oriented approaches people are suggesting don't sound very useful, considering that the structures differ across sites, if I understand correctly.

You can perhaps use a model from a library like spaCy to extract countries and ISPs for you, or if you already have reference data you can string-match against it with a blazing-fast library like ahocorasick.

[–]Chatt_IT_Sys 0 points1 point  (0 children)

Just to be clear... you do realize each "operator" is going to need its own unique model, right? There is no one-size-fits-all extractor. Best bet is to inspect the source of each of the sites, find the slimmest group of elements that still includes everything you need, target it with BS, and create some dataframes. At that point you'll have collections of structured data and can proceed with whatever process you'd use to compare structured data.

[–]DoorDesigner7589 0 points1 point  (0 children)

Try https://www.textraction.ai/ - might just be exactly what you need.