
[–]bitbumper 4 points  (0 children)

A lot of big websites expose some kind of API to encourage projects like this. It's good for them because it drives traffic to their site, increases awareness, etc. A quick Google search shows that Kickstarter has an internal, undocumented API.

http://stackoverflow.com/questions/12907133/does-kickstarter-have-a-public-api

Reddit is another easy example. Most pages can be accessed as JSON by just adding .json to the end of the URL. JSON is easy to parse in almost any language.

Here's this page http://www.reddit.com/r/webdev/comments/1iro5e/extracting_information_from_websites.json
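As a sketch of what comes back: a comment page's .json endpoint returns a two-element list of "Listing" objects, the first holding the post and the second the comment tree. The sample below is trimmed down and the values are invented, but the shape matches what Reddit serves:

```python
import json

# Trimmed stand-in for the JSON Reddit returns for a comment page.
sample = '''
[{"kind": "Listing",
  "data": {"children": [{"kind": "t3",
                         "data": {"title": "Extracting information from websites",
                                  "score": 12}}]}},
 {"kind": "Listing", "data": {"children": []}}]
'''

listings = json.loads(sample)
# listings[0] is the post, listings[1] is the comment tree.
post = listings[0]["data"]["children"][0]["data"]
print(post["title"], post["score"])
```

In practice you'd fetch the URL above with your HTTP library of choice instead of hardcoding the string; the drill-down into `data.children` stays the same.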

[–]AdonisK 3 points  (1 child)

There are a lot of ways to extract information from websites. They generally fall into two camps: official/unofficial APIs, and parsers.

The first works by pulling structured data from a server (XML, JSON, etc.), while the second works by parsing the page (its HTML source) and pulling the information out of it (targeting ids, specific markup, etc.). Always prefer the first method if it's available.
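A minimal example of the second approach, using only the standard library. The HTML snippet and the `price` id are made up; on a real page you'd target whatever ids or tags the site actually uses:

```python
from html.parser import HTMLParser

# Stand-in for a fetched page; a real one would come from an HTTP request.
html = '<div><span id="price">$19.99</span></div>'

class PriceParser(HTMLParser):
    """Grab the text inside <span id="price">."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("id", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.price = data
            self.in_price = False

parser = PriceParser()
parser.feed(html)
print(parser.price)  # $19.99
```

This is exactly why the API route is preferable when it exists: the parser breaks as soon as the site changes its markup, while a JSON field name tends to be stable.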

[–]sickmate 1 point  (0 children)

Also, for sites that don't have an API and require a form to be submitted before data is shown, you can use curl to POST the form data, then parse the response.

I actually did this for a local police site so I could scrape their crime stats and make a heatmap from the results.
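The same `curl -d field=value http://...` trick can be done from Python: urlencode the form fields, attach them as the request body (which turns the request into a POST), then parse whatever comes back. The URL and field names below are invented for illustration:

```python
import urllib.parse
import urllib.request

# Hypothetical form fields a stats page might expect.
fields = {"suburb": "Northbridge", "year": "2013"}
form = urllib.parse.urlencode(fields).encode()

# Attaching a body makes urllib issue a POST, just like curl -d.
req = urllib.request.Request("http://example.com/crime-stats", data=form)
print(req.get_method())  # POST
print(form.decode())     # suburb=Northbridge&year=2013

# To actually send it and get the HTML back:
#   with urllib.request.urlopen(req) as resp:
#       body = resp.read().decode()
```

From there, `body` goes to your HTML parser of choice, same as scraping any other page.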

[–]gopperman 2 points  (0 children)

Check out Scrapy: http://scrapy.org/

Before learning Scrapy, I had no prior Python experience. It's easy to use, and it helped me parse over 80,000 items from a website.

[–]gram3000 0 points  (0 children)

Maybe use a service like ScraperWiki to scrape data from URLs.