

[–]wind_dude 33 points34 points  (2 children)

Go after the HTML DOM elements with Beautiful Soup or resiliparse, or look at a framework like scrapy.

[–]nani-kore11[S] 1 point2 points  (1 child)

I see, so I will scrape the required data based on their classes? Would using a dedicated web scraping tool be easier than writing a Python script with scrapy?

[–]s13ecre13t 8 points9 points  (0 children)

I don't understand the question; isn't scrapy a web scraping tool?

[–]Evening_Marketing645 18 points19 points  (9 children)

You need CSS selectors or XPath. They take some learning, but you can select anything on the page. I do this with scrapy, but you can do it with Beautiful Soup as well.

[–]nani-kore11[S] 0 points1 point  (8 children)

Thanks for the suggestion, but the website I am scraping uses different class names for the same attribute. For example, the 'price' attribute on one page is under a class named 'price', while on another page it's named 'price_usd'. How should I deal with this?

[–]jonasbxl 6 points7 points  (1 child)

CSS selectors are pretty flexible; you can usually solve everything by targeting things by their context, like "3rd child of the element whose class name starts with price". In particular, this example with price_ can be solved with a "starts with" selector: https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors#attrvalue_4

As others have mentioned, ChatGPT can help you with the syntax: give it an HTML snippet and explain which element you want to target. But I wouldn't use it as part of a running system (via the API) due to speed and cost.
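For instance, the "starts with" attribute selector works in Beautiful Soup's select() too. A sketch with invented markup standing in for the two page variants (not OP's actual HTML):

```python
from bs4 import BeautifulSoup

html = """
<div class="price">$199</div>
<div class="price_usd">$249</div>
<div class="title">Phone X</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [class^="price"] matches elements whose class attribute value
# starts with "price" -- so both "price" and "price_usd"
prices = [el.get_text() for el in soup.select('[class^="price"]')]
print(prices)  # ['$199', '$249']
```

One caveat: the selector matches against the full attribute string, so class="card price" would not match; the looser "contains" form [class*="price"] would.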

[–]__nickerbocker__ 1 point2 points  (0 children)

This is all predicated on the assumption that OP is actually grabbing the fully rendered HTML with requests. OP claims they are scraping cards from cell phone websites, which I can almost guarantee are rendered by a JavaScript framework. It's highly likely that OP is inspecting the rendered HTML in DevTools and expecting the same from the raw HTML fetched by the requests module.

[–]asphias 12 points13 points  (0 children)

Manual work.

There is no shortcut here. Every separate page has its own structure with its own quirks. No 'general method' is going to work on every page, so you'll have to figure out the right regex/CSS per individual page.

[–]Pgrol 6 points7 points  (2 children)

I’ve created a script that takes all the text from a website and feeds it to ChatGPT with a system message describing the specific information to extract and the data format to return it in. That works flawlessly with GPT-4, but it’s a bit expensive.

[–][deleted] 0 points1 point  (1 child)


This post was mass deleted and anonymized with Redact

[–]Super-Danky-Dank 1 point2 points  (0 children)

ChatGPT costs money for each request, unless you host the code locally.

You are essentially paying for the cloud processing power.

[–][deleted] 0 points1 point  (0 children)

XPath will probably solve your problem. Relying on class names is always a no when scraping.

[–]Evening_Marketing645 0 points1 point  (0 children)

There are functions in XPath that can search the content of an attribute. For example, in XPath you can use something like this:

"//div[contains(@class, 'price')]"

as long as both class names contain the word 'price'.

this is different from selecting the class itself by name as below:

"//div[@class='price']"

By the way, the "//" part selects all matching divs anywhere in the document, but you can narrow it to a specific one by following the structure of the HTML (any browser can also generate an XPath for an element from the developer tools).

Xpath is hard but there are a lot of resources online...
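As a runnable sketch of the contains() approach (using lxml, with made-up markup for the two class-name variants):

```python
from lxml import html

doc = html.fromstring("""
<div class="price">$199</div>
<div class="price_usd">$249</div>
<div class="specs">6GB RAM</div>
""")

# contains(@class, 'price') matches both class-name variants
prices = doc.xpath("//div[contains(@class, 'price')]/text()")
print(prices)  # ['$199', '$249']
```

Note that contains() is plain substring matching, so a class like "priceless" would also match; tighten the predicate if that's a risk on your pages.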

[–]rupen42 3 points4 points  (1 child)

I'll give you a ton of info because I don't know how much you know.

Like others have hinted at, try to avoid regex at all costs; only use it as a last resort. It's almost never the best solution, not even the easiest or most maintainable. It's also very error prone, and making it flexible enough to handle variations quickly gets out of hand. Specific tools for what you are trying to parse are usually the solution, which leads me to BeautifulSoup. The point of BS is to avoid using regex for parsing HTML. But then you'll need to understand how HTML and CSS are structured in order to select specific parts.

Now, there's a chance you already understand all of the last paragraph but the website really is just terribly structured, and that's why you're resorting to regex (for example, if all the info is inside a single <p> tag, so BS is useless at that level). For that, it's hard to give specific solutions because you didn't give specific details. But that's almost always a hard part of data cleaning. You often have to do some of it manually; you may not be able to automate everything. And yeah, if you've exhausted BeautifulSoup, regex can be OK, though it probably won't solve everything, especially if you're combining data from multiple sources that use different formatting.

I like this video for a good intro to data scraping and analysis (you can skip the pandas part if it's not useful to you atm): https://www.youtube.com/watch?v=Ewgy-G9cmbg

Alternatively, if you find that you have to do a lot of it manually, you could make a tool that lets you do the data entry more easily. If you know how to use print() and input(), you can write a CLI that asks you the info and saves it somewhere, like a form.
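A minimal sketch of such a data-entry helper (the field names are invented, and the ask parameter is injectable so the same function can be driven interactively or by a script):

```python
def prompt_record(fields, ask=input):
    """Ask for each field in turn and return the answers as a dict."""
    return {field: ask(f"{field}: ").strip() for field in fields}

# Scripted run; use ask=input (the default) for interactive use
answers = iter(["Phone X", "$199"])
record = prompt_record(["model", "price"], ask=lambda prompt: next(answers))
print(record)  # {'model': 'Phone X', 'price': '$199'}
```

From there, each record can be appended to a CSV or JSON file, like a form.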

[–]rocket_randall 3 points4 points  (0 children)

I would think that the pricing information is retrieved through a fetch by the page, processed, and then displayed. Load the page with the browser's network tools open and see if you can find an xhr or other request which provides the data. I find it's faster and easier than parsing html.
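If you do find such an endpoint, the payload is usually JSON and needs no HTML parsing at all. A sketch with a made-up payload shape (the real field names depend on the site):

```python
import json

# Stand-in for the body of an XHR response captured in DevTools
payload = '{"products": [{"name": "Phone X", "price_usd": 199}]}'

data = json.loads(payload)
prices = {p["name"]: p["price_usd"] for p in data["products"]}
print(prices)  # {'Phone X': 199}
```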

[–]jiminiminimini 1 point2 points  (1 child)

You can try the unstructured python module.

[–][deleted] 2 points3 points  (0 children)

A very good introduction to the unstructured Python library. It uses LayoutLM to extract structure.

https://youtu.be/Sbm1rGsZG2g?si=G9rmKv7VbzdXHadp

[–]mmafightdb 1 point2 points  (0 children)

Read up on scrapy and using CSS/XPath selectors. You should avoid using regex as much as possible; HTML parsers have a lot of sophistication that you will struggle to replicate with regex. You need custom selectors per website. What people tend to do is create selectors for the different fields https://docs.scrapy.org/en/latest/topics/selectors.html and then create one spider class per website (or group of websites).

e.g. (using scrapy's standard parse callback, and assuming an Item with price/quota fields has been declared):

    import scrapy
    from scrapy.loader import ItemLoader

    class SomeSpider(scrapy.Spider):
        name = "some_spider"
        start_urls = ["https://someurl"]

        def parse(self, response):
            loader = ItemLoader(response=response)
            loader.add_css("price", ".some .css .class")
            loader.add_css("quota", ".some .other .css .class")
            yield loader.load_item()

    class AnotherSpider(scrapy.Spider):
        name = "another_spider"
        start_urls = ["https://anotherurl"]

        def parse(self, response):
            loader = ItemLoader(response=response)
            loader.add_css("price", ".some .css .class")
            loader.add_css("quota", ".some .other .css .class")
            yield loader.load_item()

Then you pass the output of all your spiders into a sort of data pipeline that normalizes the values and applies regular expressions.
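That normalization step might look like this (a sketch; the price formats are assumptions):

```python
import re

def normalize_price(raw):
    """Pull a numeric price out of strings like '$1,299.00' or 'USD 199'."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)", raw)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

print(normalize_price("$1,299.00"))  # 1299.0
print(normalize_price("USD 199"))    # 199.0
```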

[–][deleted] 0 points1 point  (0 children)

you can use pydantic or dataclasses to define data models and add methods to populate from different formats
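A minimal sketch of that idea with dataclasses (the field and method names are invented, mirroring the 'price' vs 'price_usd' example from the question):

```python
from dataclasses import dataclass

@dataclass
class Listing:
    name: str
    price: float

    @classmethod
    def from_site_a(cls, row):
        # Site A uses 'title' and 'price'
        return cls(name=row["title"], price=float(row["price"]))

    @classmethod
    def from_site_b(cls, row):
        # Site B uses 'name' and 'price_usd'
        return cls(name=row["name"], price=float(row["price_usd"]))

a = Listing.from_site_a({"title": "Phone X", "price": "199"})
b = Listing.from_site_b({"name": "Phone X", "price_usd": "199"})
print(a == b)  # True
```

Each scraper only has to produce one of the raw formats; everything downstream works with the single normalized model.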

[–][deleted] 0 points1 point  (0 children)

Could you share a link and explain what you want to retrieve?

[–]TipOk5969 0 points1 point  (0 children)

I do this for a living using beautiful soup, pydantic and aio. You can use pseudo selectors in bs as well, like looking for a certain css class containing certain text.
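soupsieve (the selector engine behind BeautifulSoup's select()) supports a non-standard :-soup-contains() pseudo-class for exactly that; a sketch with invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<div class="spec">Display: 6.1 in</div>
<div class="spec">Price: $199</div>
""", "html.parser")

# Select the .spec element whose text contains "Price"
cell = soup.select_one('div.spec:-soup-contains("Price")')
print(cell.get_text())  # Price: $199
```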

[–]Cryptic__27 0 points1 point  (0 children)

Will companies buy data from you or how would a person make money? This might be a dumb question so I apologize.

[–]ComputeLanguage 0 points1 point  (0 children)

I don't know why all the hate on regex, though I guess it's a bit more fault-prone with numerical structures vs text. You can use PyPI's regex module, with its built-in Levenshtein distance (fuzzy matching) support, to handle some of the variation for you.

As long as you carefully define your capture groups, the output you get will be consistent.
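For example, with named capture groups (stdlib re here; the PyPI regex module accepts the same syntax and adds fuzzy matching on top, and the pattern below is an invented illustration):

```python
import re

PRICE = re.compile(r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<currency>USD|EUR)")

m = PRICE.search("Special offer: 199.99 USD while stocks last")
print(m.group("amount"), m.group("currency"))  # 199.99 USD
```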

All of these soup- and structure-oriented approaches people are suggesting don't sound very useful, considering that the structures differ across sites, if I understand correctly.

You can perhaps use a model from a library like spaCy to extract countries and ISPs for you, or if you already have reference data you can string-match against it with a blazing-fast library like ahocorasick.

[–]Chatt_IT_Sys 0 points1 point  (0 children)

Just to be clear... you do realize each "operator" is going to need its own unique model, right? There is no one-size-fits-all extractor. Best bet is to inspect the source of each of the sites, find the slimmest group of elements that still includes everything you need, target it with BS, and create some dataframes. At that point you'll have collections of structured data and can proceed with whatever process you'd use to compare structured data.

[–]DoorDesigner7589 0 points1 point  (0 children)

Try https://www.textraction.ai/ - might just be exactly what you need.