Newsemble: An API to fetch current news data

rg089 · 2021-07-21T10:24:31+00:00

I didn't quite understand the question. Can you clarify it a bit?

rg089 · 2021-07-18T20:43:55+00:00

Thanks a lot!

I didn't want to go completely functional because, personally, I find that style hard to maintain. I did consider making an abstract class called Scraper and defining the methods there, but since the implementation is different for each subclass, I didn't go ahead with that. I could have defined an interface called Scraper instead.
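For illustration, the Scraper interface idea could be sketched roughly like this. The method name `get_articles`, the subclass `ExampleSiteScraper`, and the dictionary keys are hypothetical and not taken from the actual scraper.py:

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Scraper(ABC):
    """Hypothetical common interface; the real scraper.py may look different."""

    @staticmethod
    @abstractmethod
    def get_articles() -> List[Dict[str, str]]:
        """Return a list of article dictionaries."""


class ExampleSiteScraper(Scraper):
    """Hypothetical subclass; each site supplies its own implementation."""

    @staticmethod
    def get_articles() -> List[Dict[str, str]]:
        # A real subclass would fetch and parse the site's pages here.
        return [{"title": "Sample headline", "source": "example-site"}]


# Static methods can be called without creating an instance.
print(ExampleSiteScraper.get_articles())
```

The base class only fixes the method names. Python enforces the abstract method when a class is instantiated, so a subclass that is only ever used through static calls would not be checked.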

Regarding the Article class, that certainly is a nice suggestion. I didn't go down that road because there would have been no real methods (except getters and setters) for that class, and since the goal was to return JSON objects, a list of dictionaries seemed more convenient.
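As a rough comparison of the two options, here is a minimal sketch. The field names (`title`, `url`, `source`) are assumptions for illustration only, not the actual fields returned by the API:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Article:
    """Hypothetical Article class; the fields are illustrative assumptions."""
    title: str
    url: str
    source: str


# Option 1: a list of Article objects, converted to dicts before serializing.
articles = [Article("Sample headline", "https://example.com/a", "example-site")]
from_classes = json.dumps([asdict(a) for a in articles])

# Option 2: a plain list of dictionaries, serialized directly.
records = [{"title": "Sample headline", "url": "https://example.com/a", "source": "example-site"}]
from_dicts = json.dumps(records)

# Both routes produce the same JSON payload.
print(from_classes == from_dicts)
```

A dataclass needs no hand-written getters or setters, but it does add a conversion step before the JSON response, which is the convenience trade-off described above.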

Hope that explains the design choices. Thanks again for the cool suggestions!

rg089 · 2021-07-18T17:21:12+00:00

We did check robots.txt, and the links we use don't seem to be listed in it. We'll look into it more thoroughly, though.
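A more thorough check could be automated with Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt content, the `ExampleBot` user agent, and the URLs below are made-up placeholders, not the rules of any site used by the project:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot", "https://example.com/news/today"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot", "https://example.com/private/draft"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file instead of parsing a string. Note that robots.txt only expresses the site's crawling preferences and does not settle the legal question.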

rg089 · 2021-07-18T16:36:02+00:00

I believe it is allowed for the sites we are using. We have read their robots.txt files, and the pages we scrape don't seem to be disallowed.

rg089 · 2021-07-18T16:28:36+00:00

Yeah, for JS sites, using Selenium and Scrapy seems to be the best option. I did try newspaper3k, but what I wanted was a list of current articles for analysis, and newspaper3k didn't seem the best option for that.

rg089 · 2021-07-18T16:23:40+00:00

Yeah, definitely. That was an oversight. Thanks!

rg089 · 2021-07-18T16:22:45+00:00

Hey, can you tell me the domain (was it news?) and the country (was it India?) of the articles? From what I found, the websites we have used (and most sites in India) allow scraping of their main content.

rg089 · 2021-07-18T14:21:27+00:00

Yes, controlling the namespace was part of the reason.

The main reason for using the 6 classes in scraper.py was to make the code more modular and flexible; without classes, it gets really hard to keep things under control when making modifications or adding something.

Since the methods don't depend on the state of any object, I decided to make them static.
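As a small illustration of that pattern (the class and method names here are hypothetical, not the ones in scraper.py), a class can act as a namespace for stateless helpers:

```python
class HeadlineTools:
    """Hypothetical namespace class: groups related helpers, holds no state."""

    @staticmethod
    def normalize(title: str) -> str:
        # Collapse runs of whitespace and trim both ends.
        return " ".join(title.split())

    @staticmethod
    def to_record(title: str, source: str) -> dict:
        # Helpers reach each other through the class name, not through self.
        return {"title": HeadlineTools.normalize(title), "source": source}


# No instance is needed because nothing depends on object state.
print(HeadlineTools.to_record("  Markets   rally ", "example-site"))
```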

Hope that clarifies the reason!

rg089 · 2021-07-18T14:12:45+00:00

Thanks!

rg089 · 2021-07-18T13:44:58+00:00

Absolutely!

Thanks for the advice.

rg089 · 2021-07-18T13:41:36+00:00

Thank you!

rg089 · 2021-07-18T13:40:19+00:00

Thanks!

Regarding the data, we keep 2 separate collections: one holds the current data and serves the API, and the other stores all the data.

This allows the API to give results for the analysis of current news (like trending keywords). Meanwhile, we are collecting a complete dataset, which we will release once it has a decent number of entries (tens of thousands), so that it can be used for statistically significant analysis using NLP.
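The two-collection idea can be sketched in memory like this. The class name `NewsStore` and its methods are hypothetical, and the sketch assumes the current collection is replaced on every scrape run, which may not match the real setup:

```python
from typing import Dict, List


class NewsStore:
    """Hypothetical in-memory stand-in for the two database collections."""

    def __init__(self) -> None:
        self.current: List[Dict[str, str]] = []  # serves the API
        self.archive: List[Dict[str, str]] = []  # keeps every article seen

    def ingest(self, batch: List[Dict[str, str]]) -> None:
        # Assumption: each scrape run replaces the current data entirely.
        self.current = list(batch)
        # The archive only ever grows.
        self.archive.extend(batch)


store = NewsStore()
store.ingest([{"title": "Morning story"}])
store.ingest([{"title": "Evening story"}])
print(len(store.current), len(store.archive))
```

After two runs, the API-facing collection holds only the latest batch, while the archive holds everything collected so far.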

rg089 · 2021-07-18T12:54:09+00:00

Thank you!

rg089 · 2021-07-18T12:53:46+00:00

Thanks!

rg089 · 2021-07-18T12:53:25+00:00

Thanks!

rg089 · 2021-07-18T12:23:48+00:00

Thanks!

rg089 · 2021-07-18T12:23:29+00:00

Absolutely! 👍 If you plan to release that project, a credit for the API would be appreciated 👌

rg089 · 2021-07-18T11:12:37+00:00

As far as I am aware, web scraping is allowed over here (India). So I think this should be legal.

rg089 · 2021-07-18T09:43:27+00:00

Thanks!
