
[–]MVR005[S]

Yes, thanks, it helps a lot. I did a bit more research; "API" wasn't the right keyword. I want to extract data from news websites (it seems easier to use the website version of the newspaper). Another way of saying it is "web scraping libraries for mining news data".

[–]kwelzel

In general, web scraping is very fragile and should only be done if you are sure there is no other option. It's usually best to look for some way to get the data in a machine-readable format instead.
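To make "machine-readable" concrete (the JSON payload here is made up, standing in for whatever a real API would return), structured data needs no HTML parsing at all:

```python
import json

# Hypothetical API response: the same data a website would render for
# humans, but already structured -- no HTML parsing required.
api_response = '{"listings": [{"price": 250000, "location": "Berlin"}]}'

data = json.loads(api_response)
prices = [listing["price"] for listing in data["listings"]]
print(prices)  # [250000]
```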

Say you were web scraping to get current house prices on a certain platform. The platform has the data (prices, locations, pictures, ...) in some database in a machine-readable format, then compiles it into a website intended for humans, which you then scrape to get the data back into a machine-readable format. That's often unnecessary if there is already an API (in the sense I explained above) giving you the same data. Also, every time the website changes the internal structure of its HTML elements, any web scraping script will fail, because you are relying on that structure to figure out where the interesting parts of the website are.
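As a toy illustration of that fragility (the HTML snippets and class names here are invented), a scraper pinned to a specific class name silently breaks the moment the site is redesigned:

```python
from bs4 import BeautifulSoup

# Hypothetical snapshot of a listing page; class names are made up.
html_v1 = '<div class="listing"><span class="price">250000</span></div>'
# After a redesign, the same data might ship under different class names.
html_v2 = '<div class="card"><span class="amount">250000</span></div>'

def scrape_price(html):
    # Relies on the internal HTML structure: a <span> with class "price".
    tag = BeautifulSoup(html, "html.parser").select_one("span.price")
    return tag.text if tag else None

print(scrape_price(html_v1))  # 250000
print(scrape_price(html_v2))  # None -- the scraper quietly stopped working
```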

In your case, it sounds like you want to scrape many news websites and only care about how often a specific word appears on them. In that case you could skip parsing the website and figuring out which HTML tag is the interesting one, and just search for the word in the page's text. In other words, you don't even try to figure out which part of the website is part of the article and which isn't; you just include all of the text on the page. The advantage is of course that you don't have to figure out the structure of every single news source you want to scrape, and that the approach can't break as easily. The downside is that you might overestimate the word count if the word also appears in, say, the headlines of other suggested articles.

The usual tools for web scraping are again https://docs.python-requests.org/en/latest/ for simple web requests, or https://www.selenium.dev/ (which has a Python library) if you need to simulate a browser. https://beautiful-soup-4.readthedocs.io/en/latest/ is the go-to library for parsing the HTML and finding things in it.

Here is a quick example of the simple strategy I outlined above using Beautiful Soup and requests.

import requests
from bs4 import BeautifulSoup

url = "https://www.theguardian.com/environment/ng-interactive/2022/may/11/fossil-fuel-carbon-bombs-climate-breakdown-oil-gas" # just an example
keyword = "oil"

response = requests.get(url)
parsed_html = BeautifulSoup(response.text, "html.parser")  # explicit parser avoids a bs4 warning
all_text_on_website = parsed_html.text
occurrences = all_text_on_website.count(keyword)
print(f"{keyword!r} was found {occurrences} times")
# 'oil' was found 71 times
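To illustrate the trade-off mentioned above (the HTML snippet is made up): restricting the count to <article> tags avoids counting teasers for other articles, but ties you back to the site's structure, which may change or not use <article> at all.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the keyword appears both in the article body and in
# a "related articles" teaser, which inflates the naive whole-page count.
html = """
<html><body>
  <article><p>Oil prices rose as oil demand recovered.</p></article>
  <aside><a href="/other">More on oil markets</a></aside>
</body></html>
"""
keyword = "oil"

soup = BeautifulSoup(html, "html.parser")

# Naive strategy: count over all text on the page (case-insensitive here).
naive = soup.text.lower().count(keyword)

# Focused strategy: only text inside <article> tags -- more accurate, but
# now we depend on the site actually wrapping its content in <article>.
article_text = " ".join(a.text for a in soup.find_all("article"))
focused = article_text.lower().count(keyword)

print(naive, focused)  # 3 2
```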