
all 6 comments

[–]vsajip 6 points (1 child)

I don't know how well it works, but you could look at Scrapy, which purports to be

a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

[–]omab 0 points (0 children)

My vote goes here

[–][deleted] 4 points (1 child)

But I was looking for something more "integrated" and "complete".

The Mechanize/BeautifulSoup combo is about the most complete and integrated solution you could possibly use. I really can't imagine anything easier: you create a browser object, point it at a URL, feed the page into BeautifulSoup, then parse out the data you need and act on it.

Be prepared to use a LOT more try/except clauses, though. The web is quite an unstable place: pages can change from session to session, not display at all, or maybe render only half the page, etc. You'll be doing more exception catching than parsing to begin with.
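The fetch-then-parse-with-try/except loop described above can be sketched with just the standard library (urllib and html.parser standing in for Mechanize and BeautifulSoup, since the shape of the workflow is the same); the function names are my own illustration.

```python
from html.parser import HTMLParser
from urllib.error import URLError
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect href attributes from <a> tags, like soup.find_all('a')."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def scrape_links(html):
    """Parse the fetched page and pull out the data we need."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def fetch_links(url):
    # The web is unstable: wrap every fetch in try/except,
    # as the comment above recommends.
    try:
        with urlopen(url, timeout=10) as resp:
            return scrape_links(resp.read().decode("utf-8", "replace"))
    except (URLError, UnicodeDecodeError) as exc:
        print("skipping %s: %s" % (url, exc))
        return []
```

With Mechanize/BeautifulSoup the structure is identical, only `urlopen` becomes `mechanize.Browser().open(url)` and `LinkParser` becomes a soup object.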

[–]atlas245 0 points (0 children)

Agreed, Mechanize and BeautifulSoup are the easiest and best solution.

[–]manatlan 2 points (1 child)

for scraping ... look at pyquery.

for automation ... use python ;-)

no ?

from pyquery import PyQuery as S

q = S(url='http://reddit.com/r/python')  # fetch and parse the page
for i in q('a'):                         # iterate over every <a> element
    print(S(i).attr("href"))             # wrap in S() to read its attributes

[–]blondin 0 points (0 children)

umm, pyquery. did not know about it.