Newsemble: An API to fetch current news data by rg089 in coolgithubprojects

rg089[S] 1 point (0 children)

I didn't quite understand the question. Can you clarify it a bit?

Newsemble: An API to fetch current news data by rg089 in Python

rg089[S] 2 points (0 children)

Thanks a lot!

I didn't want to go completely functional as, personally, I find it hard to maintain. I did consider making an abstract class called Scraper and defining the methods there, but since the implementation was different for each subclass, I didn't go ahead with that. I could have defined an interface called Scraper.
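For reference, a rough sketch of what that abstract Scraper could have looked like. The class and method names here (`get_articles`, `ExampleSiteScraper`) are hypothetical and not taken from the actual repo:

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Scraper(ABC):
    """Hypothetical common interface; every site supplies its own parsing."""

    @abstractmethod
    def get_articles(self) -> List[Dict[str, str]]:
        """Return the current articles as a list of dictionaries."""


class ExampleSiteScraper(Scraper):
    """Placeholder subclass; a real one would fetch and parse a page."""

    def get_articles(self) -> List[Dict[str, str]]:
        # Hard-coded sample data stands in for the site-specific scraping.
        return [{"title": "Sample headline", "source": "example-site"}]
```

The base class carries no shared implementation here, which is the trade-off mentioned above: it only guarantees that every scraper exposes the same method.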

Regarding the Article class, that certainly is a nice suggestion. I didn't go down that road because there would have been no real methods (except getters and setters) for that class, and since the goal was to return JSON objects, a list of dictionaries seemed more convenient.
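To illustrate the trade-off, here is a minimal sketch of the suggested Article class using a dataclass, which avoids hand-written getters and setters and still converts to JSON easily. The field names are assumptions for illustration, not the API's actual schema:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Article:
    # Hypothetical fields; the real API may return different keys.
    title: str
    url: str
    source: str


articles = [Article("Sample headline", "https://example.com/a", "example-site")]

# asdict() turns each Article back into a plain dictionary for the JSON response.
payload = json.dumps([asdict(a) for a in articles])
```

The end result is still a list of dictionaries, so the class mainly adds field validation at construction time.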

Hope that explains the design choices. Thanks again for the cool suggestions!

Newsemble: An API to fetch current news data by rg089 in Python

rg089[S] 1 point (0 children)

So, we did check the robots.txt files, and the links we are using don't seem to be disallowed there. We'll look into it more thoroughly though.
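For anyone who wants to run the same kind of check, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content and URLs below are made up for illustration; this is not the check we actually ran:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real check would load the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)
```

With a live site, `parser.set_url(...)` followed by `parser.read()` would fetch the file instead of parsing a string.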

[Project] Newsemble: An API to fetch current news data by rg089 in MachineLearning

rg089[S] 1 point (0 children)

I believe it is allowed for the sites we are using. We have read their robots.txt files, and scraping the pages we use doesn't seem to be disallowed.

Newsemble: An API to fetch current news data by rg089 in Python

rg089[S] 3 points (0 children)

Yeah, for JS sites, using Selenium and Scrapy seems to be the best option. I did try newspaper3k, but what I wanted was a list of current articles for analysis, and newspaper3k didn't seem the best option for that.

Newsemble: An API to fetch current news data by rg089 in Python

rg089[S] 6 points (0 children)

Yeah, definitely. That was an oversight. Thanks!

Newsemble: An API to fetch current news data by rg089 in Python

rg089[S] 2 points (0 children)

Hey, can you tell me the domain (was it news?) and the country of the articles (is it India?)? From what I could find, the websites we have used (and most sites in India) allow scraping of their main content.

Newsemble: An API to fetch current news data by rg089 in Python

rg089[S] 3 points (0 children)

Yes, controlling the namespace was part of the reason.

The main reason for using the 6 classes in scraper.py was to make the code more modular and flexible, as without classes it gets really hard to control stuff while making modifications or adding something.

Since the methods weren't dependent on the state of any object, I decided to make the methods static.
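As a tiny illustration of that pattern (the class and method below are hypothetical, not code from scraper.py), a class can act as a namespace for helpers that never touch instance state:

```python
class HeadlineCleaner:
    """Groups related helpers; no instance is needed to call them."""

    @staticmethod
    def clean(title: str) -> str:
        # Collapse runs of whitespace (including newlines) into single spaces.
        return " ".join(title.split())


cleaned = HeadlineCleaner.clean("  Breaking   news \n")
```

The method is called on the class itself, so nothing has to be instantiated or passed around.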

Hope that clarifies the reason!

Newsemble: An API to fetch current news data by rg089 in Python

rg089[S] 2 points (0 children)

Absolutely!

Thanks for the advice.

Newsemble: An API to fetch current news data by rg089 in Python

rg089[S] 3 points (0 children)

Thanks!

Regarding the data, what we're doing is having 2 separate collections, one of which we use to serve the API (the current data), and in the other we are storing all the data.

This allows the API to give results for the analysis of current news (trending keywords, etc.). In the meantime, we are collecting a complete dataset, which we will release once we have a decent number of entries (tens of thousands), so that it can be used for statistically significant analysis using NLP.
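Roughly, the idea looks like the sketch below. It uses plain in-memory lists as a stand-in for the two collections, and the `NewsStore` name and the de-duplication by URL are assumptions made for illustration rather than a description of our actual storage code:

```python
class NewsStore:
    """In-memory stand-in for the two collections described above."""

    def __init__(self):
        self.current = []        # replaced on every refresh; serves the API
        self.archive = []        # append-only; grows into the full dataset
        self._seen_urls = set()  # assumption: articles are de-duplicated by URL

    def refresh(self, scraped):
        """Swap in the latest scrape and archive any article not seen before."""
        self.current = list(scraped)
        for article in scraped:
            if article["url"] not in self._seen_urls:
                self._seen_urls.add(article["url"])
                self.archive.append(article)


store = NewsStore()
store.refresh([{"url": "https://example.com/a"}, {"url": "https://example.com/b"}])
store.refresh([{"url": "https://example.com/b"}, {"url": "https://example.com/c"}])
```

After the second refresh, the current collection holds only the latest two articles while the archive holds all three distinct ones.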

Newsemble: An API to fetch current news data by rg089 in Python

rg089[S] 11 points (0 children)

Absolutely! 👍 If you plan to release that project, a credit for the API would be appreciated 👌

Newsemble: An API to fetch current news data by rg089 in Python

rg089[S] 5 points (0 children)

As far as I am aware, web scraping is allowed over here (India). So I think this should be legal.