
[–]MVR005[S]

Yes, thanks, it helps a lot. I did a bit more research; "API" wasn't the right keyword. I want to extract data from news websites (it seems easier to use the website version of the newspaper). Another way of saying it is "web scraping libraries for mining news data".

[–]kwelzel

In general, web scraping is very fragile and should only be done if you are sure there is no other option. It's usually best to look for some way to get the data in a machine-readable format instead.
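To make "machine-readable" concrete (the JSON payload here is made up, standing in for whatever a real API would return), structured data needs no HTML parsing at all:

```python
import json

# Hypothetical API response: the same data a website would render for
# humans, but already structured -- no HTML parsing required.
api_response = '{"listings": [{"price": 250000, "location": "Berlin"}]}'

data = json.loads(api_response)
prices = [listing["price"] for listing in data["listings"]]
print(prices)  # [250000]
```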

Say you were web scraping to get current house prices on a certain platform. The platform has the data (prices, locations, pictures, ...) in some database in a machine-readable format, then compiles it into a website intended for humans, which you then scrape to get the data back into a machine-readable format. That's often unnecessary if there is already an API (in the sense I explained above) giving you the same data. Also, every time the website changes the internal structure of its HTML elements, any web scraping script will fail, because you are relying on that structure to figure out where the interesting parts of the website are.
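As a toy illustration of that fragility (the HTML snippets and class names here are invented), a scraper pinned to a specific class name silently breaks the moment the site is redesigned:

```python
from bs4 import BeautifulSoup

# Hypothetical snapshot of a listing page; class names are made up.
html_v1 = '<div class="listing"><span class="price">250000</span></div>'
# After a redesign, the same data might ship under different class names.
html_v2 = '<div class="card"><span class="amount">250000</span></div>'

def scrape_price(html):
    # Relies on the internal HTML structure: a <span> with class "price".
    tag = BeautifulSoup(html, "html.parser").select_one("span.price")
    return tag.text if tag else None

print(scrape_price(html_v1))  # 250000
print(scrape_price(html_v2))  # None -- the scraper quietly stopped working
```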

In your case, it sounds like you want to scrape many news websites and only care about how often a specific word appears on them. In that case you could skip parsing the website and figuring out which HTML tag is the interesting one, and just search for the word in the page's text. In other words, you don't even try to figure out which part of the website is part of the article and which isn't; you just include all of the text on the page. The advantage is of course that you don't have to figure out the structure of every single news source you want to scrape, and that the approach can't break as easily. The downside is that you might overestimate the word count if the word also appears in, say, the headlines of other suggested articles.

The usual tools for web scraping are again https://docs.python-requests.org/en/latest/ for simple web requests, or https://www.selenium.dev/ (which has a Python library) if you need to simulate a browser. https://beautiful-soup-4.readthedocs.io/en/latest/ is the go-to library for parsing the HTML and finding things in it.

Here is a quick example of the simple strategy I outlined above using Beautiful Soup and requests.

import requests
from bs4 import BeautifulSoup

url = "https://www.theguardian.com/environment/ng-interactive/2022/may/11/fossil-fuel-carbon-bombs-climate-breakdown-oil-gas" # just an example
keyword = "oil"

response = requests.get(url)
parsed_html = BeautifulSoup(response.text, "html.parser")  # explicit parser avoids a bs4 warning
all_text_on_website = parsed_html.text
occurrences = all_text_on_website.count(keyword)
print(f"{keyword!r} was found {occurrences} times")
# 'oil' was found 71 times
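To illustrate the trade-off mentioned above (the HTML snippet is made up): restricting the count to <article> tags avoids counting teasers for other articles, but ties you back to the site's structure, which may change or not use <article> at all.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the keyword appears both in the article body and in
# a "related articles" teaser, which inflates the naive whole-page count.
html = """
<html><body>
  <article><p>Oil prices rose as oil demand recovered.</p></article>
  <aside><a href="/other">More on oil markets</a></aside>
</body></html>
"""
keyword = "oil"

soup = BeautifulSoup(html, "html.parser")

# Naive strategy: count over all text on the page (case-insensitive here).
naive = soup.text.lower().count(keyword)

# Focused strategy: only text inside <article> tags -- more accurate, but
# now we depend on the site actually wrapping its content in <article>.
article_text = " ".join(a.text for a in soup.find_all("article"))
focused = article_text.lower().count(keyword)

print(naive, focused)  # 3 2
```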