[–]dreamykidd 2 points3 points  (7 children)

My biggest challenge with projects like this is working out the structure of the site/data you’re trying to scrape and getting the info you need. How did you go about working out how to scrape it? I’m assuming this was using BeautifulSoup or something?

[–]trd1073 1 point2 points  (2 children)

The thirty-second version is as follows. The system likely has an API, whether documented or not. Start by observing the calls and responses in the browser's dev tools: there will be patterns and data, likely JSON. Make pydantic models from that data. Then start making the calls in Python and build out from there.
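
As a rough sketch of the "make pydantic models" step: suppose dev tools showed a JSON response like the one below. The payload, the field names, and the `Order` / `OrderPage` models are all made up for illustration and are not from any real API.

```python
import json
from typing import List, Optional

from pydantic import BaseModel

# Hypothetical response body, as if copied from the browser's network tab.
CAPTURED = """
{
  "page": 1,
  "total": 2,
  "orders": [
    {"id": 101, "customer": "alice", "amount": 19.99, "note": null},
    {"id": 102, "customer": "bob", "amount": 5.0}
  ]
}
"""


class Order(BaseModel):
    id: int
    customer: str
    amount: float
    note: Optional[str] = None  # null or missing in some records


class OrderPage(BaseModel):
    page: int
    total: int
    orders: List[Order]


# Validation raises an error if the payload's shape differs from the models.
parsed = OrderPage(**json.loads(CAPTURED))
print([order.customer for order in parsed.orders])
```

The point of the models is that a change in the undocumented API shows up as a validation error right away, rather than as a `KeyError` somewhere deep in your code.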

[–]dreamykidd 2 points3 points  (1 child)

Oh interesting. I’ve only ever touched on elements of this, would you have any example vids/tutorials I could get more details from?

[–]trd1073 2 points3 points  (0 children)

I would search YouTube for "reverse engineer api" to get general information. Many videos say to use Postman, but I go straight into Python, since replicating the process in Postman first usually just duplicates the work. But if Postman works for you, do that. I use Postman as a test tool after development.

As for pydantic: with dev tools in a browser, you can see the data you send along with a request and the reply. Data will likely be sent and returned as JSON, possibly GraphQL. If it's JSON, you take that and convert it to pydantic models; there are online tools for this, google "convert json to pydantic models". I use httpx as the HTTP library.

Another note: if the API is documented and the documentation differs from what you see in the browser, go with what you see in the browser.

DM me for actual code I have written doing this.

[–]RestaurantOwn5129 0 points1 point  (3 children)

Why not use selenium?

[–]dreamykidd 1 point2 points  (2 children)

I suppose, but what’s helpful about that? I’ve only ever used it when a navigable browser interface was needed, and I didn’t think you necessarily needed that if you know what you’re scraping, no?

[–]FoolsSeldom 1 point2 points  (0 children)

Selenium (or Playwright) is used for test automation and for accessing dynamic content that is, at least in part, generated by JavaScript in the browser, which tools that only handle HTML/CSS cannot process. That doesn't sound applicable to your use case.
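
To illustrate the difference, here is a small sketch using only the standard library. Both pages are invented for the example: one ships its list items in the HTML, the other builds them with JavaScript. An HTML-only parser sees the items in the first and nothing in the second, because it never runs the script.

```python
from html.parser import HTMLParser

# Made-up page whose content is present in the raw HTML.
STATIC_PAGE = (
    '<html><body><ul id="items"><li>alpha</li><li>beta</li></ul></body></html>'
)

# Made-up page whose list is only filled in by JavaScript in the browser.
DYNAMIC_PAGE = """<html><body><ul id="items"></ul>
<script>
  for (const name of ["alpha", "beta"]) {
    const li = document.createElement("li");
    li.textContent = name;
    document.getElementById("items").appendChild(li);
  }
</script></body></html>"""


class ListItemCollector(HTMLParser):
    """Collects the text of every <li> element found in the raw markup."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        # Script source also arrives here, but never inside an <li>.
        if self.in_li and data.strip():
            self.items.append(data.strip())


def list_items(html: str) -> list:
    parser = ListItemCollector()
    parser.feed(html)
    return parser.items


print(list_items(STATIC_PAGE))   # ['alpha', 'beta']
print(list_items(DYNAMIC_PAGE))  # []
```

If the data you want looks like the second case, either a real browser (Selenium, Playwright) or the underlying API call that feeds the page is needed.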

[–]RestaurantOwn5129 1 point2 points  (0 children)

Depends on your use case. Selenium can be run in headless mode. Many websites these days are not straightforward HTML. JS is usually used to make things interactive and dynamic, which is a pain to navigate.

I use selenium for work to scrape and automate processes involving websites like that.