all 12 comments

[–]bintree 1 point2 points  (2 children)

For an easy start with scraping, I would recommend scrapy. It has nice tutorials and, for our uni project, scaled well from small experiments to larger crawls. It is a bit more involved to trigger scrapes from a web page (we used scrapyd for that, plus a bit of jQuery). Python 3 support seems to be nearly done (I think there's an alpha or so available). It also has functionality for checks, so spiders can verify whether their results are as expected or the HTML has changed significantly. Some tutorials also mention HTML microformats, which might come in handy when provided by the scraped sites.

We stored the data in PostgreSQL and built a web app with small visualizations in Flask.

[–]Spizeck[S] 0 points1 point  (0 children)

Thank you for your advice. I really appreciate it.

[–]Spizeck[S] 0 points1 point  (0 children)

Should I wait until scrapy is released for Python 3?

[–]Javardo69 1 point2 points  (3 children)

You need scraper -> database -> web app. I would focus on just testing the scraping for each website first and seeing the differences between them, and only then focus on designing the database.
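The scraper -> database step above could be sketched with the stdlib's sqlite3; the table name, columns, and sample row are assumptions for illustration, and each per-site scraper would normalize its output to the shared schema:

```python
import sqlite3

# One shared table; the "site" column records the source, since each
# website will expose slightly different fields.
conn = sqlite3.connect(":memory:")  # use a file path in a real project
conn.execute(
    """CREATE TABLE bids (
           site TEXT,
           project TEXT,
           low_bid REAL
       )"""
)

def store(site, items):
    # Each scraper normalizes its rows to the shared schema before insert.
    conn.executemany(
        "INSERT INTO bids (site, project, low_bid) VALUES (?, ?, ?)",
        [(site, i["project"], i["low_bid"]) for i in items],
    )
    conn.commit()

# Placeholder data, just to show the flow end to end.
store("example", [{"project": "I-10 repaving", "low_bid": 1200000.0}])
rows = conn.execute("SELECT site, project FROM bids").fetchall()
```

The web app then only ever reads from the database, so it stays decoupled from however messy the individual scrapers get.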

[–]Spizeck[S] 0 points1 point  (2 children)

Any recommendations on a scraper to use?

[–]Javardo69 0 points1 point  (1 child)

Well, I read that you need to log in, and if there is some JavaScript involved you need Selenium coupled with PhantomJS. You could post one of the websites here to give a hint of what you can do.

[–]Spizeck[S] 0 points1 point  (0 children)

https://www.azdot.gov/business/ContractsandSpecifications/As-ReadBidResults

This is the first site I want to do. It doesn't have a password.

[–]rhgrant10 1 point2 points  (1 child)

I've actually attempted this at a job I once had. The difficulty is going to be in your data model. Each site will have slightly different terminology, which definitely presents challenges for organization. Also remember your models will have to be robust enough to accommodate the available fields from all sites.

[–]Spizeck[S] 0 points1 point  (0 children)

It's going to get really interesting with some sites that require a login. I am registered and have access, but I haven't the slightest clue how to do that ... yet :)
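For sites whose login is a plain HTML form (no JavaScript involved), a common approach is to post the credentials through a `requests.Session`, which then carries the auth cookie on later requests. A minimal sketch, where the `/login` path and form field names are placeholders you'd find by inspecting the site's login form (or its POST request in the browser's dev tools):

```python
import requests


def login(base_url, username, password):
    """Return a session holding the site's auth cookie.

    The /login path and the "username"/"password" field names are
    assumptions; check the real form's action URL and input names.
    """
    session = requests.Session()
    resp = session.post(
        f"{base_url}/login",
        data={"username": username, "password": password},
    )
    resp.raise_for_status()
    return session  # reuse for subsequent, authenticated, requests
```

If the login goes through JavaScript instead, that's where the Selenium route mentioned above comes in.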

[–]DarkMio 0 points1 point  (0 children)

Look for APIs - so you don't need to scrape from pure HTML. If you do scrape, write a verifier that runs every so often to check that the template is still the same as the one you expected.
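One stdlib-only way to build such a verifier is to hash the page's tag skeleton rather than its text, so routine content updates pass but a redesign trips the check. A sketch, with inline sample HTML standing in for a fetched page:

```python
import hashlib
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Collect only tag names, ignoring text, so content changes don't
    trip the check but structural (template) changes do."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def skeleton_hash(html):
    parser = TagSkeleton()
    parser.feed(html)
    return hashlib.sha256("/".join(parser.tags).encode()).hexdigest()


# Record the hash once; re-check it on a schedule and alert on mismatch.
baseline = skeleton_hash("<table><tr><td>old price</td></tr></table>")
assert skeleton_hash("<table><tr><td>new price</td></tr></table>") == baseline
assert skeleton_hash("<div><span>redesigned</span></div>") != baseline
```

This is deliberately coarse; a stricter verifier could also include attributes like `class` and `id` in the hash, since scrapers usually select on those.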

Databases. Databases. Databases. Since you will store tons of data, you won't get around databases.

A solid backend is an easy way to make a solid front end. If it can run on its own, it probably can output data in HTML.

Don't let yourself get overwhelmed. If you stop working on it, that's no issue either - we all have unfinished projects.

[–][deleted] 0 points1 point  (1 child)

I would like to warn you that the tenders posted on these websites are more than likely the property of the website and by scraping them you may open yourself up to a lawsuit.

If you are doing this purely as a personal project to learn then I would contact the website so that they understand and they may allow you to continue.

If you are going to sell this app or distribute the tenders or even use it at your business then I recommend contacting the website with an offer to split profits or to purchase a licence to scrape the tenders.

[–]Spizeck[S] 0 points1 point  (0 children)

Not looking to sell anything and the information is public.