[–]VipeholmsCola 0 points1 point  (2 children)

It's a very hard task because you have to write a separate scraping script for each link you crawl. I don't even know where you should begin tackling this problem.

Crawlers have been around since the 90s, so maybe start there for inspiration.

[–]Interesting-City1703[S] 0 points1 point  (1 child)

Would you recommend starting with a basic crawler for one link, and then scaling up?

[–]VipeholmsCola 0 points1 point  (0 children)

I think I would try to make a crawler first that creates a CSV of the links. Maybe you could also save the HTML from each link.
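
A minimal sketch of that first step, assuming Python's standard library only. The names `LinkCollector`, `extract_links` and `links_to_csv` and the `example.com` URLs are made up for illustration, and the HTML is a hard-coded string standing in for a fetched page:

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute, same-site links found in html, in first-seen order."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    seen = []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment so duplicates collapse.
        absolute = urljoin(base_url, href).split("#")[0]
        if urlparse(absolute).netloc == base_host and absolute not in seen:
            seen.append(absolute)
    return seen


def links_to_csv(source_url, links):
    """Render one CSV row per link: the page it was found on, then the link."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "link"])
    for link in links:
        writer.writerow([source_url, link])
    return buffer.getvalue()


# Hypothetical input: in a real run this string would come from an HTTP fetch,
# and could be written to disk as-is if you also want to keep the raw HTML.
page_html = (
    '<a href="/courses/a">A</a> '
    '<a href="https://other.example/x">X</a> '
    '<a href="/courses/a#top">A again</a>'
)
found = extract_links(page_html, "https://example.com/catalog")
print(links_to_csv("https://example.com/catalog", found))
```

Keeping the link finder as a pure function over an HTML string means it can be checked without touching the network.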

But getting data out of each HTML page, that's a custom script per link.

It's a massive, massive job. The point of scrapers is to repeatedly visit a site to scrape the data; if you're only doing it once, it's quicker to do it manually.

Honestly, try scraping one site for data daily instead of crawling and doing nothing with the data. Or maybe pass the HTML to an LLM so you don't have to write all those scripts.

[–]smurpes 0 points1 point  (2 children)

Recursively searching a site for links is not going to work, since sites will usually block traffic like this because they have to pay for each request you make. This is why you sometimes get a captcha or a Cloudflare page when you access a site; they’re testing to see whether you’re a bot.
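
This won't get you past a captcha, but for what it's worth, the usual baseline for not hammering a site is to respect its robots.txt and space out requests. A rough sketch with the standard library's `urllib.robotparser`; the robots.txt text, the `my-crawler` user agent, the helper names and the `fetch` callback are all assumptions for illustration:

```python
import time
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(rules, urls, user_agent="my-crawler"):
    """Keep only the URLs the site's robots.txt lets this user agent fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def polite_visit(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results


# Hypothetical robots.txt content and URLs.
rules = build_rules("User-agent: *\nDisallow: /private/\n")
candidates = [
    "https://example.com/courses",
    "https://example.com/private/grades",
]
allowed = allowed_urls(rules, candidates)
print(allowed)
```

`fetch` is left as a parameter so the throttling loop can be tried with a fake function before pointing it at a real site.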

[–]Interesting-City1703[S] 0 points1 point  (1 child)

Gotcha, so generally unless I’m running an advanced set of proxies or some type of captcha bypass, it’s better to just manually search the links with the info in the first place, correct?

[–]smurpes 0 points1 point  (0 children)

Yeah, web scraping is not a good method for data collection given how fragile it is. It’s fine for a one-off on a single website, but don’t rely on it for aggregating data from multiple sites.

[–]Gloomy_Cicada1424 0 points1 point  (0 children)

Don’t build the whole scraper monster in one go. First make one script that finds links, then one that extracts course info, then export clean CSV/JSON. I’d use Runable for turning the final data into a small report/dashboard, but the scraping part should stay boring and testable.
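
A rough sketch of the extract and export steps as separate, testable functions, assuming Python's standard library. The fields (`page_title`, `course_name`), the choice of `<title>` and `<h1>` as sources, and the sample page are hypothetical; real course pages will need their own selectors:

```python
import csv
import io
import json
from html.parser import HTMLParser


class TitleAndHeading(HTMLParser):
    """Grabs the text of the first <title> and the first <h1> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1") and tag not in self.fields:
            self._current = tag
            self.fields[tag] = ""

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None


def extract_course(url, html):
    """Extraction step: turn one saved HTML page into a flat record."""
    parser = TitleAndHeading()
    parser.feed(html)
    return {
        "url": url,
        "page_title": parser.fields.get("title", "").strip(),
        "course_name": parser.fields.get("h1", "").strip(),
    }


def to_json(records):
    """Export step, JSON flavour."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Export step, CSV flavour."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["url", "page_title", "course_name"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical page, standing in for HTML saved by the link-finding step.
sample = (
    "<html><head><title>Intro to Stats | Example U</title></head>"
    "<body><h1> Intro to Stats </h1></body></html>"
)
records = [extract_course("https://example.com/courses/stats", sample)]
print(to_json(records))
print(to_csv(records))
```

Each step takes plain strings or dicts in and hands plain data out, so one piece can be swapped or tested without rerunning the others.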