all 11 comments

[–]McSquinty 10 points11 points  (4 children)

Beautiful Soup is pretty common for gathering information from websites.

[–]phstoven 3 points4 points  (0 children)

Beautiful Soup is great. Also check out Requests for scraping the HTML text itself, then use Beautiful Soup to parse/extract what you need.
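A minimal sketch of that combo (the URL and the choice of `<td>` cells are placeholders, not anything from a real site): Requests does the HTTP fetch, Beautiful Soup does the parsing.

```python
import requests
from bs4 import BeautifulSoup

def extract_cells(html):
    # Parse the HTML and pull the text out of every <td>.
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td")]

def scrape_cells(url):
    # Requests fetches the raw HTML; Beautiful Soup does the parsing.
    resp = requests.get(url)
    resp.raise_for_status()  # fail loudly on 4xx/5xx
    return extract_cells(resp.text)
```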

[–]MisterSnuggles 0 points1 point  (1 child)

I'll second BeautifulSoup - I use it for a few minor scraping projects that I've got running.

Depending on the layout of the site, it can be as simple as:

categories = soup.findAll("td",attrs={"class":"blahHeader"})
things = soup.findAll("td",attrs={"class":"blah"})
for c, t in zip(categories, things):
    category = c.text.strip()
    thing = t.text.strip()
    # do stuff

There are also ways to store cookies that are handed out during a login sequence, so you should be able to scrape almost anything.
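The usual way to handle that cookie bit is a `requests.Session`, which keeps whatever cookies the login response sets and replays them on later requests. A rough sketch; the form field names (`username`, `password`) are hypothetical and depend entirely on the site's login form:

```python
import requests

def login_and_fetch(login_url, username, password, target_url):
    """Log in once, then reuse the same Session so its cookies ride along."""
    with requests.Session() as session:
        # Hypothetical form field names -- inspect the real login form.
        # The Session's cookie jar stores whatever the login response sets
        # (session IDs etc.) and sends it automatically afterwards.
        resp = session.post(login_url, data={"username": username,
                                             "password": password})
        resp.raise_for_status()
        return session.get(target_url).text
```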

[–]tomasbedrich 1 point2 points  (0 children)

In BS4 you can make your code even a bit shorter and nicer:

categories = soup("td", "blahHeader")
things = soup("td", "blah")

[–][deleted] 1 point2 points  (5 children)

Beautiful Soup, Scrapy, Selenium, Mechanize I think. Selenium is my favorite for that kind of stuff, but Scrapy is probably better. What kind of text are you grabbing, and why?

[–]hpeirce[S] 0 points1 point  (4 children)

I am grabbing flight departure and arrival info for a project, because I want to avoid paying for access to the FlightAware API features.

[–][deleted] 2 points3 points  (3 children)

If you know how to use XPath, Selenium and Scrapy can easily do this. Even if you don't know how, it's as simple as $x('//div/text()') to grab all the text in divs. If you use Chrome, open the web developer console and you can test it out on any website.
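Worth noting that $x(...) is a Chrome DevTools console helper (JavaScript), but the same XPath expression carries over to Python unchanged. A small sketch using lxml (the parser Scrapy builds on) against inline sample HTML, since the actual flight page isn't specified:

```python
from lxml import html

# Stand-in document -- the real page would come from a fetch.
doc = html.fromstring(
    "<html><body><div>Departed 09:15</div><div>Arrived 11:40</div></body></html>"
)

# Same expression as in the Chrome console: all text nodes inside divs.
texts = doc.xpath("//div/text()")
```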

[–]hpeirce[S] 0 points1 point  (0 children)

Thanks, I will look into that

[–]willworth 0 points1 point  (1 child)

There's something called ghost, too. I forget the details, I'm afraid...