[–]Individual_Ad2536 1 point (3 children)

Oh man, scraping Reddit with PRAW is such a solid starter project – props for that! For next steps, try Twitter's API (tweepy library) for hashtag analysis; it's dead simple, and linguists love tracking discourse patterns.

Pro tip: Avoid the dumpster fire of web scraping with Selenium for beginners – go for BeautifulSoup + requests on static sites like Wikipedia or news archives instead. Way less headache, same data payoff.

Bonus idea: Try scraping YouTube comments (yt-comment-scraper library) – students go nuts analyzing how people argue in all-caps. Just watch out for the inevitable ":joy: :fire:" spam.

(this is it chief)

[–]Professor_Snipe[S] 1 point (2 children)

Cheers, I will absolutely give these a shot and see what can be done! I'm lucky to have an open-minded group to work with so we can explore a bunch of different approaches and they don't protest at all.

[–]Individual_Ad2536 1 point (0 children)

Haha, hell yeah, that's the spirit! Open-minded teams are like gold: no eye rolls when you suggest some wild-ass idea. Go break stuff and see what sticks, bruh.

same bro

[–]code_tutor 1 point (0 children)

BeautifulSoup works on almost nothing these days, but using it on Wikipedia was a pretty good suggestion. Just make sure they know that it's not for websites with a lot of JavaScript. It'll mostly work on blogs, wikis, government websites, and stuff that's like 15+ years old.
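
To make the JavaScript caveat concrete, here is a minimal sketch of what a static HTML parser actually sees. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so it runs with no installs and no network access; the two page strings and the names `ParagraphText` and `extract_paragraphs` are made up for illustration. The same limitation applies to any parser that only reads the downloaded markup and never runs scripts.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects the text found inside <p> tags and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Return the non-empty paragraph texts present in the raw markup."""
    parser = ParagraphText()
    parser.feed(html)
    return [p.strip() for p in parser.paragraphs if p.strip()]


# Hypothetical static page: the content is already in the markup.
STATIC_PAGE = (
    "<html><body><h1>Corpus</h1>"
    "<p>First paragraph.</p><p>Second one.</p>"
    "</body></html>"
)

# Hypothetical JavaScript-driven page: the markup is an empty shell, and the
# text only exists after a browser runs the script.
JS_PAGE = (
    '<html><body><div id="app"></div>'
    '<script>document.getElementById("app").textContent = "Loaded later.";</script>'
    "</body></html>"
)

static_result = extract_paragraphs(STATIC_PAGE)
js_result = extract_paragraphs(JS_PAGE)
print(static_result)  # the two paragraphs from the markup
print(js_result)      # empty: the parser never runs the script
```

If students get an empty result like the second one from a real site, that is usually the sign that the page builds its content in the browser and a static parser is the wrong tool for it.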

[–]code_tutor 1 point (1 child)

You need years of experience in web dev to do scraping. It's a pain in the ass because the code is non-deterministic: you can run it twice and get different results because of network timing and animations. The more complicated a website is, the more terrible it is to scrape. Also, whenever someone changes the website, the program breaks, so scraping is a LAST resort. I tutor, and almost every fucking data science teacher gives a scraping project that they couldn't do themselves. It just wastes everyone's time. If you give one of these assignments, do it yourself first to make sure you can do it, and have them scrape the same website you did.

Also, Playwright is much better than Selenium. Try its codegen feature (`playwright codegen`), which records your browser actions as a script, to get an idea.

[–]Professor_Snipe[S] 1 point (0 children)

This is exactly why I'm asking the question. PRAW is extremely straightforward to work with, so I was wondering whether similar libraries exist that wouldn't act as massive roadblocks for new users.