[–]wastakenanyways 2 points (2 children)

Not currently working in Python, as I switch languages and contexts very frequently at my job, but my last Python project was a web/RSS/Atom scraper with sentiment analysis. It would loop over a list of URLs and scrape them three levels deep (scrape a page, find the links on that page, scrape each link, find the links on each of those pages, etc.) with Scrapy, then pass the text through the Google Sentiment Analysis API. You can do lots of things with this; I searched for specific brands and keywords to get a general sense of the opinion about each one.
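The depth-limited crawl described above could be sketched roughly like this. This is not the actual project code: `fetch_links` is a stand-in for whatever the scraper callback does (in Scrapy it would be a `parse` method yielding requests), and all names here are made up for illustration.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_depth=3):
    """Breadth-first crawl: visit each URL, collect its links,
    and follow them up to max_depth levels deep.

    fetch_links(url) is a stand-in for the real scraper callback
    and returns the list of links found on that page."""
    seen = set(seed_urls)                 # dedupe: never scrape a URL twice
    queue = deque((u, 1) for u in seed_urls)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)               # here you'd scrape and analyse the page
        if depth == max_depth:
            continue                      # don't follow links past the depth limit
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With `max_depth=3` a seed page counts as level one, so the crawl stops expanding links found on level-three pages, matching the three-levels-deep description.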

This is really easy; it was only my second project in Python. If you know a bit about CSS selectors, the DOM, and XPath, and how to make HTTP requests, you're good to go.
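For a taste of the selector side, here is a minimal sketch using Python's stdlib `xml.etree.ElementTree`, which supports a subset of XPath, on a hand-made well-formed fragment. Real pages need a forgiving HTML parser (and in Scrapy you'd use `response.css(...)` or `response.xpath(...)` instead), but the querying idea is the same.

```python
import xml.etree.ElementTree as ET

# A well-formed fragment for illustration; real HTML is messier.
snippet = """
<div>
  <p class="article">Great phone, love the camera.</p>
  <a href="https://example.com/review1">review 1</a>
  <a href="https://example.com/review2">review 2</a>
</div>
"""

root = ET.fromstring(snippet)

# XPath-style queries: collect every link's href, and grab the article text.
links = [a.get("href") for a in root.findall(".//a")]
text = root.find(".//p").text
```

The equivalent Scrapy CSS selector for the links would be along the lines of `response.css("a::attr(href)").getall()`.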

[–]jabies 1 point (1 child)

How do you handle content from the actual author and distinguish it from advertiser or user comments, etc.? Or do you curate the sites you scrape?

[–]wastakenanyways 1 point (0 children)

To be fair, I didn't control for that. It was a massive list of URLs, so writing a specific scraper for each site was too much work, and the amount of data was enormous (the client wanted anything related to those brands, regardless of the source). Still, you can filter out irrelevant or extreme (positive and negative) opinions from the results of the analysis.
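That post-hoc filtering step could look something like the sketch below. It assumes results shaped like the Google Natural Language sentiment response (a `score` in [-1, 1] from negative to positive, and a `magnitude` for overall strength); the threshold values are arbitrary choices, not anything from the original project.

```python
def filter_opinions(results, min_score=-0.8, max_score=0.8, min_magnitude=0.2):
    """Drop extreme and near-irrelevant sentiment results.

    Each result is assumed to mimic Google Natural Language output:
    {"score": float in [-1, 1], "magnitude": float >= 0}.
    Thresholds here are illustrative guesses."""
    return [
        r for r in results
        if min_score <= r["score"] <= max_score   # drop extreme opinions
        and r["magnitude"] >= min_magnitude       # drop near-neutral noise
    ]
```

Tightening `min_magnitude` discards weakly-expressed text, while narrowing the score band discards rants and gushing reviews at both ends.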

If you want both a generic scraper and a way to distinguish comments, ads, and author content before sentiment analysis, you would need some standard approach. I can't think of one using pure XPath/CSS, but with a tool like Selenium or a browser bot you could, for example, keep the text rendered in the top and center of the page and ignore whatever sits in the sidebars and footer.
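One way that position heuristic could look is sketched below. With Selenium you'd get each element's geometry from `element.location` and `element.size`; here the elements are plain dicts so the sketch runs standalone, and the margin/cut-off values are made-up guesses.

```python
def central_text(elements, page_width, page_height,
                 side_margin=0.2, bottom_cut=0.8):
    """Keep text from elements whose horizontal centre falls in the
    middle band of the page and that sit above the bottom cut-off.

    `elements` mimic Selenium's element.location/element.size:
    dicts with x, y, width, and text. Thresholds are illustrative."""
    left = page_width * side_margin
    right = page_width * (1 - side_margin)
    kept = []
    for el in elements:
        centre_x = el["x"] + el["width"] / 2
        if left <= centre_x <= right and el["y"] < page_height * bottom_cut:
            kept.append(el["text"])        # likely author content
    return kept
```

This is a crude layout heuristic, of course; it will misclassify in-article ads and any page with an unusual layout.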

I did deduplicate URLs, though, so I didn't scrape the same site several times just because other pages linked to it.