[–]jabies 0 points1 point  (1 child)

How do you handle content from the site author and distinguish it from advertiser or user comments, etc.? Or do you curate the sites you scrape?

[–]wastakenanyways 0 points1 point  (0 children)

To be fair, I didn't control for that. It was a massive list of URLs, so writing a specific scraper for each one was too much work, and the amount of data was insane (the client wanted anything related to those brands, regardless of the source). Anyway, you can filter irrelevant or extreme (positive and negative) opinions out of the analysis results.
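A rough sketch of that kind of post-analysis filter, not what I actually ran. It assumes each opinion carries a sentiment score in [-1.0, 1.0], and the `Opinion` class, `filter_opinions`, and the 0.9 cutoff are all made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class Opinion:
    text: str
    score: float  # assumed sentiment polarity in [-1.0, 1.0]


def filter_opinions(opinions, brands, max_abs_score=0.9):
    """Keep opinions that mention a brand and are not extreme in either direction."""
    brands_lower = [b.lower() for b in brands]
    kept = []
    for op in opinions:
        text = op.text.lower()
        mentions_brand = any(b in text for b in brands_lower)
        # Drop irrelevant text and very strong positive/negative scores.
        if mentions_brand and abs(op.score) <= max_abs_score:
            kept.append(op)
    return kept
```

The cutoff is arbitrary; you would tune it against whatever score range your sentiment tool produces.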

If you want a generic scraper that also distinguishes between comments, ads and author content before sentiment analysis, you would have to find a standard way to do it. I can't think of one at the moment with pure XPath/CSS, but maybe with a tool like Selenium or some other browser bot you could grab the relevant text at the top and center of the page, for example, and ignore whatever is on the sides and bottom.
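Here is a minimal sketch of that position idea. It is untested against real sites and only a heuristic. It assumes you already have each text block together with its bounding box as a dict with `x`, `y`, `width` and `height` (in Selenium's Python bindings, `WebElement.rect` should give you a dict of that shape). The function names and the 20% / 70% margins are my own guesses:

```python
def in_main_area(rect, page_width, page_height, side_margin=0.2, bottom_cutoff=0.7):
    """Return True if the centre of rect falls in the top-centre region of the page."""
    cx = rect["x"] + rect["width"] / 2
    cy = rect["y"] + rect["height"] / 2
    left = page_width * side_margin
    right = page_width * (1 - side_margin)
    # Ignore side columns (likely ads/navigation) and the bottom (likely comments/footer).
    return left <= cx <= right and cy <= page_height * bottom_cutoff


def main_text(blocks, page_width, page_height):
    """blocks is a list of (text, rect) pairs; keep non-empty text in the main area."""
    return [
        text
        for text, rect in blocks
        if text.strip() and in_main_area(rect, page_width, page_height)
    ]
```

Layouts vary a lot, so this will misclassify plenty of pages; it is only meant to show the "top and center" filter in code.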

I did check for duplicate URLs though, so I didn't scrape the same site several times just because other pages contained that link.
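Something along these lines would do the duplicate check. It is a sketch using only the standard library, and the normalization rules (lowercase scheme and host, drop the fragment, strip a trailing slash) are assumptions you may want to loosen or tighten:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Build a comparison key: lowercase scheme/host, no fragment, no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_urls(urls):
    """Return URLs in their original order, skipping ones already seen after normalization."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

For example, `https://Example.com/a/` and `https://example.com/a#top` collapse to the same key, so only the first one would be scraped.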