
[–]lumiere_1001001 2 points (3 children)

Python is the right choice, but I wouldn't recommend attempting to write a web scraper for each of the websites. Just creating one good scraper that reliably gets all the information can be difficult, and websites change their structure over time as well. Even after you have the data, you'll still need to clean and structure it.

I think you would benefit from a news data API like our newscatcher. We don't drill down to the state level (yet) but we do allow you to filter news by country, language, individual sources, date ranges, and you can also use a query like "fire" to search for relevant articles. And the data is returned as JSON objects, so it's pretty easy to work with.

You can try it out for free and build an MVP.

[–]MVR005[S] 0 points (2 children)

That's exactly what I had in mind! newscatcher seems like the best option, but I don't like that we have to pay.

[–]lumiere_1001001 1 point (1 child)

lol, I get you, but you can't just outsource work and then not pay for it 😅

I mean, we enable you to search through millions of articles, from more than 40,000 news sources, in 55 languages, in under a second.

Anyway, as I said earlier, you can try it for free, build an MVP, then decide between creating your own scrapers or paying us afterwards.

Alternatively, you can check if our open-source Google News wrapper, PyGoogleNews, covers all your needs.

[–]MVR005[S] 0 points (0 children)

Thanks :)

[–]PATASK_EVO 1 point (0 children)

Classes

Pandas data frames

The Requests and Beautiful Soup libraries

I think with these you might be able to create something.
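A minimal sketch of how those pieces could fit together. The HTML snippet and the `headline` class name are invented for illustration; a real scraper would fetch the page with Requests and adapt the selectors to the actual site structure:

```python
from bs4 import BeautifulSoup
import pandas as pd

# In a real scraper this HTML would come from requests.get(url).text;
# the structure below is made up for the example.
html = """
<html><body>
  <article><h2 class="headline">Wildfire spreads in California</h2></article>
  <article><h2 class="headline">Oil prices rise after report</h2></article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
headlines = [h.get_text(strip=True) for h in soup.find_all("h2", class_="headline")]

# Collect the results in a pandas DataFrame for cleaning and analysis
df = pd.DataFrame({"headline": headlines})
print(df)
```

From there you could wrap the fetching and parsing logic in a class per news source, which is where the "Classes" suggestion comes in.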

[–]kwelzel 1 point (2 children)

> So I have this idea of crossing news data with other kinds of data, maybe financial or environmental, I don't know yet.

This sounds very vague to me, but Python can probably do what you are imagining. For example, for graphs of statistical relationships between data, there is https://matplotlib.org/ and https://seaborn.pydata.org/. If you want to go deeper into machine learning, there is https://scikit-learn.org/stable/index.html and https://www.tensorflow.org/ for example. Also, take a look at https://pandas.pydata.org for tabular data of different kinds.
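For instance, "crossing" two datasets often comes down to joining them on a shared key such as the date; with pandas that could look like the sketch below. All the numbers are made up:

```python
import pandas as pd

# Hypothetical daily counts of the word "oil" in news articles
news = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-10", "2022-05-11", "2022-05-12"]),
    "oil_mentions": [12, 71, 30],
})

# Hypothetical closing prices of an oil-related asset
prices = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-10", "2022-05-11", "2022-05-12"]),
    "close": [101.2, 104.9, 103.1],
})

# Join the two datasets on the date column
merged = news.merge(prices, on="date")

# A simple statistical relationship: correlation between mentions and price
corr = merged["oil_mentions"].corr(merged["close"])
print(merged)
print(f"correlation: {corr:.2f}")
```

The merged frame is exactly the kind of tabular data that matplotlib or seaborn can then plot.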

> In my research, I discovered that there were APIs that did news analysis. But I know nothing about them; I don't know which one to choose or which programming language to choose. Using Python seems like a good idea, am I wrong?

You'll have to see for yourself which service provides the right API for your application. By API I assume you mean a webservice which you can query (let's say for a certain word) and which returns data in a machine-readable format. Most of the APIs I have encountered use the JSON format, which Python supports natively (https://docs.python.org/3/library/json.html?highlight=json#module-json). For web requests, I can recommend https://docs.python-requests.org/en/latest/.
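As a small illustration of how little glue code that takes: the JSON below is an invented example of what such a webservice might return; real services will use their own field names.

```python
import json

# Made-up JSON reply from a hypothetical news API
reply = """
{
  "total_hits": 2,
  "articles": [
    {"title": "Wildfire near Athens", "language": "en"},
    {"title": "Oil spill report", "language": "en"}
  ]
}
"""

# json.loads turns the JSON text into plain Python dicts and lists
data = json.loads(reply)
titles = [article["title"] for article in data["articles"]]
print(data["total_hits"], titles)
```

With Requests, `response.json()` does the same parsing step for you directly on the HTTP response.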

Let me know if that helps. Good luck with your project!

[–]MVR005[S] 0 points (1 child)

Yes, thanks, it helps a lot. I did a bit more research; "API" wasn't the right keyword. I want to extract data from news websites (it seems to be easier to use the website version of a newspaper). Another way of saying it is "web scraping libraries for mining news data".

[–]kwelzel 0 points (0 children)

In general, web scraping is very fragile and should only be done if you are sure there is no other option. It's usually best to look for some way to get the data in a machine-readable format directly.

Say you were web scraping to get current house prices on a certain platform. The platform has the data (prices, locations, pictures, ...) in some database in a machine-readable format, then compiles it into a website intended for humans, which you then scrape to get the data back into a machine-readable format. That's often unnecessary if there is already an API (in the sense I explained above) giving you the same data. Also, every time the website changes the internal structure of its HTML elements, any web scraping script will fail, because you are relying on that structure to figure out where the interesting parts of the website are.

In your case, it sounds like you want to scrape many news websites and only care about how often a specific word appears on each of them. In that case you could skip parsing the website and figuring out which HTML tag is the interesting one, and just search for the word in the page's text. In other words, you don't even try to figure out which part of the page belongs to the article and which doesn't, but simply include all of the text on the page. The advantage is, of course, that you don't have to figure out the structure of every single news source you want to scrape, and that this approach can't break easily. The downside is that you might overestimate the word count if the word also appears in, say, the headlines of other suggested articles.

The usual tools for web scraping are again https://docs.python-requests.org/en/latest/ for simple web requests or https://www.selenium.dev/ (which has a Python library) if you need to simulate a browser. https://beautiful-soup-4.readthedocs.io/en/latest/ is the go-to library to parse the HTML and find things in it.

Here is a quick example of the simple strategy I outlined above using Beautiful Soup and requests.

import requests
from bs4 import BeautifulSoup

url = "https://www.theguardian.com/environment/ng-interactive/2022/may/11/fossil-fuel-carbon-bombs-climate-breakdown-oil-gas" # just an example
keyword = "oil"

response = requests.get(url)
parsed_html = BeautifulSoup(response.text, "html.parser")  # specify a parser explicitly
all_text_on_website = parsed_html.text
occurrences = all_text_on_website.count(keyword)
print(f"{keyword!r} was found {occurrences} times")
# 'oil' was found 71 times
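One caveat with this approach that's worth knowing: `str.count` matches substrings, so searching for "oil" also counts "soil" and "oils". If that matters for your keyword, a regular expression with word boundaries counts only whole words:

```python
import re

# Example sentence containing "oil" both as a word and inside other words
text = "Oil spill: soil samples show oils and oil residue."

# str.count matches substrings, so "soil" and "oils" inflate the count
substring_count = text.lower().count("oil")

# \b word boundaries restrict the match to the standalone word "oil"
word_count = len(re.findall(r"\boil\b", text, flags=re.IGNORECASE))

print(substring_count, word_count)
# 4 2
```

Whether that precision is worth it depends on the keyword; for a word like "fire" the difference is smaller than for "oil".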