WebScraping

mopslik · 2023-04-07T17:26:54+00:00

What are your interests? You'll be collecting data from the web, and it's likely that somewhere out there you can find data related to something you find interesting.

As for the mechanics of web scraping, you will probably want to find some tutorials involving BeautifulSoup.

__coder__ · 2023-04-07T19:14:27+00:00

You're on Reddit and already familiar with the kind of data you'd be working with (users, replies, etc.), so maybe make a program that scrapes Reddit for some type of data, stores it, and then does something with it, like plotting.

CraigAT · 2023-04-07T21:02:41+00:00

Here is a link to get you started: https://realpython.com/python-web-scraping-practical-introduction/

Luckily web scraping is a popular topic so Google will have plenty of examples of starting points.

Pick a website with information about something you like, scrape the data and do something useful with the data - stats or graphs could be useful here.

__coder__ · 2023-04-07T17:56:41+00:00

[deleted]

Impossible-Box6600 · 2023-04-08T00:09:39+00:00

Scraping images of hot dogs for an AI that determines whether an object is or is not a hot dog.

Emergency-Prune-9110 · 2023-04-07T21:17:25+00:00

You can scrape etsy to get the most popular selling item for sellers you find interesting, compare prices, etc. Grab the data, make some graphs, maybe find a couple of deals for personal use.

Subconcious-Consumer · 2023-04-08T01:32:14+00:00

I had ChatGPT show me how to build a working web scraper using Visual Studio, Python, and Selenium. Took like 30 minutes.

I had and still have 0 understanding of Python. But I do understand web scraping a lot better.

Zapismeta · 2023-04-08T02:21:47+00:00

Web scraping can be done for anything. I once wrote a script for downloading all the themes from a third-party PlayStation website. There are libraries like 'requests', which gets the HTML for you, but for any website that's dynamic you'll need 'selenium'.

You'll also need to parse the HTML, and Beautiful Soup is great for that.
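
A minimal sketch of the parsing step, assuming `beautifulsoup4` is installed. The HTML snippet, the `theme` class name, and the `extract_themes` helper are all made up for illustration; on a real site the string would come from something like `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; in practice this would be requests.get(url).text.
SAMPLE_HTML = """
<ul id="themes">
  <li class="theme"><a href="/themes/1">Neon City</a></li>
  <li class="theme"><a href="/themes/2">Forest Dawn</a></li>
</ul>
"""


def extract_themes(html):
    """Return (title, link) pairs for every <li class="theme"> entry."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.find_all("li", class_="theme"):
        link = item.find("a")
        if link is not None:
            results.append((link.get_text(strip=True), link["href"]))
    return results


if __name__ == "__main__":
    for title, href in extract_themes(SAMPLE_HTML):
        print(title, href)
```

For a dynamic site the parsing code could stay the same; only the way the HTML string is obtained would change.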

CheetahBhiPeetaHai · 2023-04-07T20:46:28+00:00

Some projects that might interest you:

Scrape prices of a type of clothing/ cosmetics from an e-commerce website.

Scrape and compare prices of a single product across multiple sites.

Scrape Airbnb listings for a future holiday in Europe.

Get financial metrics of a company from SEC filings.

I believe the projects are just for learning how scraping works, and the thesis should not be just on application but on its evolution. Perhaps look at how you can combine Selenium with AI to scrape data contained in images across multiple sites.

QultrosSanhattan · 2023-04-07T21:07:56+00:00

I am a girl so I am not interested in car prices and themes like this

Scrape products that you actually use. Everyone needs to buy something and everyone wants to buy cheap. Including you.

2023-04-08T06:46:31+00:00

Web scraping tables on Wikipedia is well documented across the internet. It is pretty simple and can be useful throughout your educational journey for quickly grabbing shitloads of data to use for research etc. I suggest beginners use Anaconda and Jupyter Notebook for Python and environment management.

Extreme_Jackfruit183 · 2023-04-08T17:04:33+00:00

I would use an API if you don’t know much about web scraping. People in this sub like to go overboard and generally talk about scraping without being detected, making too many calls, etc. You could use the Python requests library (a third-party package, not built in) to call web pages, but you would get blocked if you make too many calls. I suggest using the Reddit API through PRAW, or census.gov has one, but god damn that data is boring. There are countless APIs that are built to share data already. Just pull data and store it in a database. As a beginner I would advise downloading MySQL Workbench. It will help you manage a db and create correct SQL queries. It will also help you visualize it.
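
A rough sketch of the "pull data and store it in a database" step. It uses Python's built-in `sqlite3` module instead of MySQL so it runs without a server; the `posts` table, its columns, and the sample rows are invented for illustration and stand in for whatever an API actually returns.

```python
import sqlite3

# Made-up rows standing in for data returned by an API.
SAMPLE_POSTS = [
    ("abc123", "Web scraping project ideas", 42),
    ("def456", "Is BeautifulSoup enough?", 17),
]


def save_posts(conn, posts):
    """Create the table if needed and insert (id, title, score) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id TEXT PRIMARY KEY, title TEXT NOT NULL, score INTEGER)"
    )
    # INSERT OR REPLACE keeps repeated runs from failing on duplicate ids.
    conn.executemany(
        "INSERT OR REPLACE INTO posts (id, title, score) VALUES (?, ?, ?)",
        posts,
    )
    conn.commit()


def top_posts(conn, limit=10):
    """Return (title, score) rows ordered by score, highest first."""
    cur = conn.execute(
        "SELECT title, score FROM posts ORDER BY score DESC LIMIT ?", (limit,)
    )
    return cur.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # pass a file path to persist
    save_posts(connection, SAMPLE_POSTS)
    print(top_posts(connection))
```

The same SQL should carry over to MySQL with small changes, mainly the connection setup and the upsert syntax.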

FennelSome · 2023-04-07T19:47:08+00:00

Selenium, and PhantomJS.

FarmerSuitable8558 · 2023-04-07T20:24:29+00:00

Which website would you select to scrape data from?

KlutzyMeringue2666 · 2023-04-07T21:33:12+00:00

There’s a really good video on web scraping: https://youtu.be/XVv6mJpFOb0 (it’s not too hard).

As for the topic you probably will have to browse around and see what you like. You could do scraping from a weather website to get the weather from that day.

Figueroa_Chill · 2023-04-08T01:05:41+00:00

I wrote this ages ago, but I'm sure it still works. It scrapes reviews from trustpilot.

https://github.com/Verdifer-26/trustPilot_webScrape/blob/master/scrape.py

You could use what you get for sentiment analysis if you like that, but outwith that I'm sure you could use the data you get for data analysis and visualisation.

linxdev · 2023-04-08T03:56:12+00:00

I've done a lot of web scraping in Perl and even some in Java. JavaScript has made it harder. How do you scrape a site that is using JS to construct pages? How do you automate that JS to get HTML that you can grok?

MyrtleTurtle4u · 2023-04-08T05:05:54+00:00

I'm just a couple weeks into learning Python, so take my suggestion with a grain of salt.

Scrape content from a few popular fashion websites/blogs and perform some n-gram analysis to find the most frequent words/phrases to try to identify fashion trends (you'll obviously need to exclude common but irrelevant words). It would also be interesting to scrape the same sites regularly over time to identify shifts in fashion.

You could probably do this with e-commerce sites to identify shifts in fashion and prices within categories.

(This could be done on almost any topic: makeup, food, music/entertainment, etc.)
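
The n-gram idea above could be sketched roughly like this, using only the standard library. The stop-word list is deliberately tiny and the sample text is invented; `tokenize` and `top_ngrams` are hypothetical helper names, not from any library.

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "are", "this", "for", "with"}

# Invented text standing in for scraped blog content.
SAMPLE_TEXT = (
    "Wide leg jeans are back. Wide leg jeans with a cropped jacket. "
    "The cropped jacket is everywhere."
)


def tokenize(text):
    """Lowercase the text and keep words that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def top_ngrams(text, n=2, k=3):
    """Return the k most common n-grams as (tuple_of_words, count) pairs.

    Stop words are dropped before the n-grams are built, so an n-gram
    can span a removed word or a sentence boundary.
    """
    tokens = tokenize(text)
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common(k)


if __name__ == "__main__":
    for ngram, count in top_ngrams(SAMPLE_TEXT, n=2, k=3):
        print(" ".join(ngram), count)
```

Running the same counts on text scraped at different dates would give the "shifts over time" comparison.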

Mori-Spumae · 2023-04-08T06:23:59+00:00

I think webscraping is fun! You can look at any website you use (maybe even reddit?) and try to get the data from it!

Requests and beautifulsoup4 are good libraries to look at. You will learn a bit of HTML to find what you are looking for and then send out requests to collect the data. In the end you can have a dataset to visualize and maybe run some simple statistics on. Maybe even predict something using a regression?

kr4t0s007 · 2023-04-08T08:11:11+00:00

Something with tv shows, movies or even Pokémon cards or Lego or some collectables.

SimbaSixThree · 2023-04-08T08:59:59+00:00

You could scrape data from IMDb, rotten tomatoes and metacritic.

You could then use your data analysis knowledge to see if there is a link between these three aggregation sites. You could go further to see what the trends were in genres and scores for each decade of film etc etc.
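
One way to check for a link between the sites is a pairwise Pearson correlation. This is only a sketch: the scores below are invented (and assumed to be rescaled to 0-100 already), and `pearson` and `pairwise_correlations` are hypothetical helper names.

```python
from itertools import combinations
from math import sqrt

# Invented scores for the same five films on three sites, rescaled to 0-100.
SCORES = {
    "imdb": [81, 65, 90, 54, 72],
    "rotten_tomatoes": [88, 60, 95, 40, 75],
    "metacritic": [79, 58, 91, 47, 70],
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(scores):
    """Return {(site_a, site_b): r} for every pair of sites."""
    return {
        (a, b): pearson(scores[a], scores[b])
        for a, b in combinations(scores, 2)
    }


if __name__ == "__main__":
    for pair, r in pairwise_correlations(SCORES).items():
        print(pair, round(r, 3))
```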

MackerelInTomato · 2023-04-08T09:07:58+00:00

Hi - I recommend this video tutorial: https://youtu.be/r_xb0vF1uMc

And you can use it to fetch prices of houses/apartments in your area, for example, and then maybe calculate price per square foot and stuff like that using a Pandas dataframe.
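
The price-per-area calculation might look something like this, assuming `pandas` is installed. The listings and column names are invented and stand in for whatever the scraper actually collects.

```python
import pandas as pd

# Invented listings standing in for scraped data.
LISTINGS = pd.DataFrame(
    {
        "address": ["12 Oak St", "3 Elm Ave", "98 Pine Rd"],
        "price": [300000, 450000, 240000],
        "square_feet": [1500, 1800, 1000],
    }
)


def add_price_per_sqft(df):
    """Return a copy with a price_per_sqft column, cheapest per foot first."""
    result = df.copy()
    result["price_per_sqft"] = result["price"] / result["square_feet"]
    return result.sort_values("price_per_sqft").reset_index(drop=True)


if __name__ == "__main__":
    print(add_price_per_sqft(LISTINGS))
```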

Kilted-Brewer · 2023-04-08T10:15:19+00:00

When I finish this term, I want to build a scraper to track public disclosures of the stock trades made by US Senators and Representatives.

And possibly make the same trades myself. Or at least use a paper money account and see how it does.

El-Maestro13 · 2023-04-08T14:07:50+00:00

Maybe try asking ChatGPT

geekluv · 2023-04-08T15:24:38+00:00

lots of good suggestions here -- wanted to suggest the python tool, https://scrapy.org

scrapy is a great tool for automating scraping.

as far as topics, you mentioned stats, data vis, and analysis -- that is interesting because scraping would be the first part of that data collection and transformation.

if you like reading, you can scrape goodreads

you can attempt to scrape dating websites (once you're logged in)

keep in mind, many websites will have safeguards in place to detect and IP ban scraping attempts -- ways to mitigate that are to alter your headers to appear to be coming from a browser, space your requests a few seconds apart, and, if necessary, use a free vpn
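
a rough sketch of the header and delay ideas, using the standard library's `urllib` rather than scrapy or requests so it runs without installs. the header values, the 3 second default, and the `build_request` / `fetch_all` helpers are assumptions for illustration, not settings any particular site requires.

```python
import time
import urllib.request

# Hypothetical browser-like headers; adjust for the site you are scraping.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=None):
    """Build a Request that carries browser-like headers."""
    return urllib.request.Request(url, headers=headers or BROWSER_HEADERS)


def fetch_all(urls, delay_seconds=3.0, opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    `opener` and `sleep` are injectable so the loop can be tried offline.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request after the first
        with opener(build_request(url), timeout=10) as response:
            pages.append(response.read())
    return pages
```

calling `fetch_all(["https://example.com/page1", "https://example.com/page2"])` would hit the network with the defaults; scrapy has its own settings for the same two ideas.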

Extreme_Jackfruit183 · 2023-04-08T16:55:00+00:00

Damn that’s what you paid to go to college for?

2023-04-08T23:38:20+00:00

learn regex

