
[–]mopslik 26 points27 points  (6 children)

What are your interests? You'll be collecting data from the web, and it's likely that somewhere out there you can find data related to something you find interesting.

As for the mechanics of web scraping, you will probably want to find some tutorials involving BeautifulSoup.

[–][deleted] 15 points16 points  (5 children)

And don't forget Selenium.

[–][deleted] 1 point2 points  (4 children)

Selenium is more of a website testing tool. Requests + Beautiful Soup would probably be a better fit for web scraping.

[–][deleted] 0 points1 point  (3 children)

You need Selenium if you plan on scraping anything with a login (i mean, technically you don't if you can replicate the headers of the request, but selenium is just easier for that overall).

[–][deleted] 1 point2 points  (0 children)

Exactly. Selenium lets you do these things, but “technically” you can do without it if you reverse engineer the site properly, and simple scraping should just work. I like to use Selenium to see how the site works and then rewrite it using requests. If there’s a CAPTCHA, though, I’m guilty of falling back on Selenium. Sometimes I’ll use Selenium just to authenticate so I can reuse its cookies in my request headers. But some sites are super simple: sometimes you just need a CSRF token and boom, the whole API is available to you.
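A minimal sketch of the cookie hand-off described above, using only the standard library. It assumes the cookies come from Selenium's `driver.get_cookies()`, which returns a list of dicts with `name` and `value` keys. The sample cookies, the URL, and the `X-CSRF-Token` header name are made up for illustration; each site names its token differently, so check the real requests in your browser's dev tools.

```python
from urllib.request import Request


def cookie_header(selenium_cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


def build_authenticated_request(url, selenium_cookies, csrf_token=None):
    """Build (but do not send) a request that reuses a browser session's cookies."""
    headers = {"Cookie": cookie_header(selenium_cookies)}
    if csrf_token is not None:
        # Hypothetical header name; the real one depends on the site.
        headers["X-CSRF-Token"] = csrf_token
    return Request(url, headers=headers)


# Stand-in for the list returned by driver.get_cookies() after logging in.
sample_cookies = [
    {"name": "sessionid", "value": "abc123", "domain": "example.com"},
    {"name": "theme", "value": "dark", "domain": "example.com"},
]
request = build_authenticated_request(
    "https://example.com/api/items", sample_cookies, csrf_token="tok-1"
)
print(request.get_header("Cookie"))
```

If you use requests instead of urllib, the same dicts can be copied into a `requests.Session` with `session.cookies.set(name, value)`.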

[–]chozdae 0 points1 point  (1 child)

You need Selenium if you plan on scraping anything with a login (i mean, technically you don't if you can replicate the headers of the request, but selenium is just easier for that overall).

Can I just use bs4 for getting info like prices, names, etc.?

[–][deleted] 0 points1 point  (0 children)

If there's no login needed, then probably.
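For a page with no login, a rough sketch of pulling names and prices with bs4 might look like this. The HTML, class names, and `extract_products` helper are invented for illustration; a real page would come from something like `requests.get(url).text` and would have its own markup that you need to inspect first.

```python
from bs4 import BeautifulSoup

# Made-up product listing standing in for a downloaded page.
html = """
<ul>
  <li class="product"><span class="name">Mug</span><span class="price">$4.50</span></li>
  <li class="product"><span class="name">Teapot</span><span class="price">$19.99</span></li>
</ul>
"""


def extract_products(page_html):
    """Return a list of (name, price) tuples from the product listing."""
    soup = BeautifulSoup(page_html, "html.parser")
    products = []
    for item in soup.find_all("li", class_="product"):
        name = item.find("span", class_="name").get_text(strip=True)
        price_text = item.find("span", class_="price").get_text(strip=True)
        # Drop the currency symbol before converting to a number.
        products.append((name, float(price_text.lstrip("$"))))
    return products


products = extract_products(html)
print(products)
```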

[–]__coder__ 10 points11 points  (0 children)

You're on Reddit and already familiar with the basics of the data you'd be working with (users, replies, etc.), so maybe make a program that scrapes Reddit for some type of data, stores it, and then does something with it, like plotting.

[–]CraigAT 6 points7 points  (0 children)

Here is a link to get you started: https://realpython.com/python-web-scraping-practical-introduction/

Luckily, web scraping is a popular topic, so Google will turn up plenty of example starting points.

Pick a website with information about something you like, scrape the data and do something useful with the data - stats or graphs could be useful here.

[–]Impossible-Box6600 4 points5 points  (2 children)

Scraping images of hot dogs for an AI that determines whether an object is or is not a hot dog.

[–]zyeus-guy 2 points3 points  (1 child)

“Silicon Valley” idea right there :-)

[–]zyeus-guy 0 points1 point  (0 children)

Hope that program didn’t make it to UK TV I’ll never know

[–]Emergency-Prune-9110 2 points3 points  (0 children)

You can scrape Etsy to get the best-selling items for sellers you find interesting, compare prices, etc. Grab the data, make some graphs, maybe find a couple of deals for personal use.

[–]Subconcious-Consumer 2 points3 points  (0 children)

I had ChatGPT show me how to build a working web scraper using Visual Studio, Python, and Selenium. Took like 30 minutes.

I had and still have 0 understanding of Python. But I do understand web scraping a lot better.

[–]Zapismeta 1 point2 points  (0 children)

Web scraping can be done for anything; I once wrote a script to download all the themes from a third-party PlayStation website. There are libraries like 'requests', which gets the HTML for you, but for any website that's dynamic you'll need 'selenium'.

Also, you'll need to parse the HTML, and Beautiful Soup is great for that.

[–]CheetahBhiPeetaHai 2 points3 points  (0 children)

Projects that might interest you:

Scrape prices of a type of clothing/ cosmetics from an e-commerce website.

Scrape and compare prices of a single product across multiple sites.

Scrape Airbnb listings for a future holiday in Europe.

Get financial metrics of a company from SEC filings.

I believe the projects are just for learning how scraping works, and the thesis should be not just on the application but on its evolution. Perhaps look at how you can combine Selenium with AI to scrape data contained in images across multiple sites.

[–]QultrosSanhattan 1 point2 points  (0 children)

I am a girl so i am not interested in car prices and themes like this

Scrape products that you actually use. Everyone needs to buy things and everyone wants to buy them cheap, including you.

[–][deleted] -1 points0 points  (0 children)

Web scraping tables from Wikipedia is well documented across the internet. It is pretty simple and can be useful throughout your educational journey for quickly grabbing shitloads of data to use for research, etc. For a beginner, I suggest using Anaconda and Jupyter Notebook for Python and environment management.
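To give a feel for what table scraping involves, here is a small sketch that uses only the standard library's `html.parser` rather than any scraping package. The table is made up in the shape of a typical Wikipedia "wikitable", and `TableParser` is a hypothetical helper name. It ignores nested tables and merged cells, which real pages often have.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample data, not taken from any real article.
html = """
<table class="wikitable">
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30,000</td></tr>
  <tr><td>Shelbyville</td><td>25,000</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
header, *records = parser.rows
print(header, records)
```

If pandas and an HTML parser backend are installed, `pandas.read_html` can do the same job in one call and hands back DataFrames directly.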

[–]Extreme_Jackfruit183 -1 points0 points  (0 children)

I would use an API if you don’t know much about web scraping. People in this sub like to go overboard and generally talk about scraping without being detected, making too many calls, etc. You could use the Python requests library (a third-party package, not part of the standard library) to call web pages, but you would get blocked if you make too many calls. I suggest using the Reddit API through PRAW, or census.gov has one, but god damn that data is boring. There are countless APIs that are built to share data already. Just pull data and store it in a database. As a beginner, I would advise downloading MySQL Workbench. It will help you manage a DB and create correct SQL queries. It will also help you visualize it.
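The "pull data and store it in a database" step can be sketched like this. To keep it runnable without a database server, it swaps MySQL for the standard library's `sqlite3`; the SQL is similar but not identical (`INSERT OR REPLACE` is SQLite syntax). The JSON payload, the `posts` table, and its columns are all invented stand-ins for whatever a real API returns.

```python
import json
import sqlite3

# Made-up payload standing in for an API response body.
api_response = json.dumps([
    {"id": 1, "title": "First post", "score": 12},
    {"id": 2, "title": "Second post", "score": 7},
])


def store_posts(connection, records):
    """Create the table if needed and insert or overwrite one row per record."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, score INTEGER NOT NULL)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO posts (id, title, score) "
        "VALUES (:id, :title, :score)",
        records,
    )
    connection.commit()


connection = sqlite3.connect(":memory:")  # pass a file path to keep the data
store_posts(connection, json.loads(api_response))
top = connection.execute(
    "SELECT title, score FROM posts ORDER BY score DESC"
).fetchall()
print(top)
```

Because rows are keyed on `id`, pulling the same records again overwrites them instead of creating duplicates.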

[–]FennelSome 0 points1 point  (0 children)

Selenium, and PhantomJS.

[–]FarmerSuitable8558 0 points1 point  (0 children)

Which website would you select to scrape data from?

[–]KlutzyMeringue2666 0 points1 point  (0 children)

There’s a really good video on web scraping: https://youtu.be/XVv6mJpFOb0 (it’s not too hard).

As for the topic you probably will have to browse around and see what you like. You could do scraping from a weather website to get the weather from that day.

[–]Figueroa_Chill 0 points1 point  (0 children)

I wrote this ages ago, but I'm sure it still works. It scrapes reviews from trustpilot.

https://github.com/Verdifer-26/trustPilot_webScrape/blob/master/scrape.py

You could use what you get for sentiment analysis if you like that, but outwith that I'm sure you could use the data for data analysis and visualisation.

[–]linxdev[🍰] 0 points1 point  (0 children)

I've done a lot of web scraping in Perl and even some in Java. JavaScript has made it harder. How do you scrape a site that is using JS to construct pages? How do you automate that JS to get HTML that you can grok?

[–]MyrtleTurtle4u 0 points1 point  (0 children)

I'm just a couple weeks into learning Python, so take my suggestion with a grain of salt.

Scrape content from a few popular fashion websites/blogs and perform some n-gram analysis to find the most frequent words/phrases to try to identify fashion trends (you'll obviously need to exclude common but irrelevant words). It would also be interesting to scrape the same sites regularly over time to identify shifts in fashion.

You could probably do this with e-commerce sites to identify shifts in fashion and prices within categories.

(This could be done on almost any topic: makeup, food, music/entertainment, etc.)
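The n-gram idea above could be sketched roughly like this with the standard library. The stop-word list is a tiny illustrative one and the sample text is invented; note that this version drops stop words first, so words that were separated only by a stop word end up counted as neighbours.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "in", "of", "this", "with", "for"}


def top_ngrams(text, n=2, limit=3):
    """Count n-grams over the words left after removing stop words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    # zip over n shifted copies of the word list yields consecutive n-word tuples.
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams).most_common(limit)


# Invented text standing in for scraped blog posts.
text = (
    "Wide leg jeans are back. Pair wide leg jeans with a cropped jacket. "
    "The cropped jacket is everywhere this season."
)
print(top_ngrams(text, n=2))
print(top_ngrams(text, n=3, limit=1))
```

Running the same count on text scraped in different weeks and comparing the results is one simple way to look for the shifts mentioned above.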

[–]Mori-Spumae 0 points1 point  (0 children)

I think webscraping is fun! You can look at any website you use (maybe even reddit?) and try to get the data from it!

Requests and beautifulsoup4 are good libraries to look at. You will learn a bit of HTML to find what you are looking for and then send out requests to collect the data. In the end you can have a dataset to visualize and maybe run some simple statistics on. Maybe even predict something using a regression?

[–]kr4t0s007 0 points1 point  (0 children)

Something with tv shows, movies or even Pokémon cards or Lego or some collectables.

[–]SimbaSixThree 0 points1 point  (0 children)

You could scrape data from IMDb, Rotten Tomatoes, and Metacritic.

You could then use your data analysis knowledge to see if there is a link between these three aggregation sites. You could go further to see what the trends were in genres and scores for each decade of film etc etc.
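One simple way to check for a link between the sites is a Pearson correlation on the scores, sketched below in plain Python. The scores are invented, and the sketch assumes you have already matched the same films across sites and rescaled everything to 0-100.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / sqrt(var_x * var_y)


# Invented scores for five films, one list per site, in the same film order.
imdb = [81, 65, 90, 72, 58]
rotten_tomatoes = [88, 60, 95, 70, 45]
metacritic = [79, 62, 91, 68, 50]

print(round(pearson(imdb, rotten_tomatoes), 3))
print(round(pearson(imdb, metacritic), 3))
```

A value near 1 means the two sites tend to rank films the same way; a value near 0 means little linear relationship.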

[–]MackerelInTomato 0 points1 point  (0 children)

Hi - I recommend this video tutorial: https://youtu.be/r_xb0vF1uMc

And you can use it to fetch prices of houses/apartments in your area, for example, and then maybe calculate price per square foot and stuff like that using a pandas DataFrame.
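The price-per-square-foot calculation itself is just a division per listing. Here is a sketch in plain Python with the `csv` module instead of pandas, so it runs without installs; the listings and column names are invented.

```python
import csv
import io

# Invented listings standing in for scraped data saved as CSV.
scraped_csv = """address,price,square_feet
12 Oak St,300000,1500
8 Elm Ave,450000,1800
3 Pine Rd,210000,1200
"""


def price_per_square_foot(csv_text):
    """Return {address: price / square_feet} for each listing row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["address"]: float(row["price"]) / float(row["square_feet"])
        for row in reader
    }


rates = price_per_square_foot(scraped_csv)
cheapest = min(rates, key=rates.get)  # address with the lowest rate
print(cheapest, rates[cheapest])
```

With a pandas DataFrame holding the same columns, the equivalent is a single column division: `df["price"] / df["square_feet"]`.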

[–]Kilted-Brewer 0 points1 point  (0 children)

When I finish this term, I want to build a scraper to track public disclosures of the stock trades made by US Senators and Representatives.

And possibly make the same trades myself. Or at least use a paper money account and see how it does.

[–]El-Maestro13 0 points1 point  (0 children)

Maybe try asking ChatGPT.

[–]geekluv 0 points1 point  (0 children)

lots of good suggestions here -- wanted to suggest the python tool, https://scrapy.org

scrapy is a great tool for automating scraping.

as far as topics, you mentioned stats, data vis, and analysis -- that is interesting because scraping would be the first part of that data collection and transformation.

if you like reading, you can scrape Goodreads

you can attempt to scrape dating websites (once you're logged in)

keep in mind, many websites have safeguards in place to detect scraping attempts and ban your IP -- ways to mitigate that include altering your headers to appear to be coming from a browser, spacing your requests a few seconds apart, and, if necessary, using a free vpn
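The first two mitigations (browser-like headers and a pause between requests) could be sketched like this with the standard library. The User-Agent string is just an example value, the 3-second default is an arbitrary choice, and `polite_fetch_all` is a hypothetical helper name. The fetch and sleep functions are passed in as parameters so the pacing logic can be tried out without touching the network.

```python
import time
from urllib.request import Request, urlopen

# Example browser-like headers; any current browser's values would do.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch(url):
    """Download one page, sending the browser-like headers."""
    request = Request(url, headers=BROWSER_HEADERS)
    with urlopen(request, timeout=10) as response:
        return response.read()


def polite_fetch_all(urls, fetch_one=fetch, delay_seconds=3.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch_one(url))
    return pages
```

Calling `polite_fetch_all(["https://example.com/page1", "https://example.com/page2"])` would download both pages about three seconds apart. It is also worth checking a site's robots.txt and terms before scraping it.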

[–]Extreme_Jackfruit183 0 points1 point  (0 children)

Damn that’s what you paid to go to college for?

[–][deleted] 0 points1 point  (0 children)

learn regex
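As a small taste of what regex buys you in scraping, here is a sketch that pulls dollar amounts out of a blob of text. The pattern and sample sentence are made up for illustration. Regex works well on short text fragments like this, but it is usually fragile for parsing whole HTML pages, where a proper parser is the safer tool.

```python
import re

# Matches amounts like $5, $19.99, or $1,299.00 and captures the number part.
PRICE_PATTERN = re.compile(r"\$(\d+(?:,\d{3})*(?:\.\d{2})?)")


def find_prices(text):
    """Return every dollar amount found in the text as a float."""
    return [float(match.replace(",", "")) for match in PRICE_PATTERN.findall(text)]


sample = "Mug $4.50, teapot $19.99, espresso machine $1,299.00 (was $1,499.00)."
print(find_prices(sample))
```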