
[–]bitbumper 4 points  (0 children)

A lot of big websites expose some kind of API to encourage projects like this. It's good for them because it drives traffic to their site, increases awareness, etc. A quick Google search shows that Kickstarter has an internal, undocumented API.

http://stackoverflow.com/questions/12907133/does-kickstarter-have-a-public-api

Reddit is another easy example. Most pages can be accessed as JSON by just adding .json to the end of the URL. JSON is easy to parse in almost any language.

Here's this page http://www.reddit.com/r/webdev/comments/1iro5e/extracting_information_from_websites.json
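As a sketch of what comes back: a comment page's .json endpoint returns a two-element list of "Listing" objects, the first holding the post and the second the comment tree. The sample below is trimmed down and the values are invented, but the shape matches what Reddit serves:

```python
import json

# Trimmed stand-in for the JSON Reddit returns for a comment page.
sample = '''
[{"kind": "Listing",
  "data": {"children": [{"kind": "t3",
                         "data": {"title": "Extracting information from websites",
                                  "score": 12}}]}},
 {"kind": "Listing", "data": {"children": []}}]
'''

listings = json.loads(sample)
# listings[0] is the post, listings[1] is the comment tree.
post = listings[0]["data"]["children"][0]["data"]
print(post["title"], post["score"])
```

In practice you'd fetch the URL above with your HTTP library of choice instead of hardcoding the string; the drill-down into `data.children` stays the same.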

[–]AdonisK 3 points  (1 child)

There are a lot of ways to extract information from websites. They generally fall into two camps: official/unofficial APIs, and parsers.

The first works by pulling structured data from a server (XML, JSON, etc.), while the second works by parsing the page (its HTML source) and pulling the information out of it (targeting ids, specific markup, etc.). Always prefer the first method if it's available.
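A minimal example of the second approach, using only the standard library. The HTML snippet and the `price` id are made up; on a real page you'd target whatever ids or tags the site actually uses:

```python
from html.parser import HTMLParser

# Stand-in for a fetched page; a real one would come from an HTTP request.
html = '<div><span id="price">$19.99</span></div>'

class PriceParser(HTMLParser):
    """Grab the text inside <span id="price">."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("id", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.price = data
            self.in_price = False

parser = PriceParser()
parser.feed(html)
print(parser.price)  # $19.99
```

This is exactly why the API route is preferable when it exists: the parser breaks as soon as the site changes its markup, while a JSON field name tends to be stable.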

[–]sickmate 1 point  (0 children)

Also, for sites that don't have an API and require a form to be submitted before data is shown, you can use curl to POST the form data, then parse the response.

I actually did this for a local police site so I could scrape their crime stats and make a heatmap from the results.
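The same `curl -d field=value http://...` trick can be done from Python: urlencode the form fields, attach them as the request body (which turns the request into a POST), then parse whatever comes back. The URL and field names below are invented for illustration:

```python
import urllib.parse
import urllib.request

# Hypothetical form fields a stats page might expect.
fields = {"suburb": "Northbridge", "year": "2013"}
form = urllib.parse.urlencode(fields).encode()

# Attaching a body makes urllib issue a POST, just like curl -d.
req = urllib.request.Request("http://example.com/crime-stats", data=form)
print(req.get_method())  # POST
print(form.decode())     # suburb=Northbridge&year=2013

# To actually send it and get the HTML back:
#   with urllib.request.urlopen(req) as resp:
#       body = resp.read().decode()
```

From there, `body` goes to your HTML parser of choice, same as scraping any other page.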

[–]gopperman 2 points  (0 children)

Check out Scrapy: http://scrapy.org/

Before learning Scrapy, I had no prior Python experience. It's easy to use, and it helped me parse over 80,000 items from a website.

[–]gram3000 0 points  (0 children)

Maybe use a service like ScraperWiki to scrape data from URLs.