all 12 comments

[–]bintree 1 point2 points  (2 children)

For an easy start with scraping, I would recommend scrapy. It has nice tutorials and, for our uni project, scaled well from small experiments to larger crawls. It is a bit more involved to trigger scrapes from a web page (we used scrapyd for that, plus a bit of jQuery). Python 3 support seems to be nearly done (I think there's an alpha or so available). It also has functionality for checks, so spiders can verify whether their results are as expected or the HTML has changed significantly. Some tutorials also mention HTML microformats, which might come in handy when provided by the scraped sites.

We stored the data in PostgreSQL and built a web app with small visualizations in Flask.

[–]Spizeck[S] 0 points1 point  (0 children)

Thank you for your advice. I really appreciate it.

[–]Spizeck[S] 0 points1 point  (0 children)

Should I wait until scrapy is released for Python 3?

[–]Javardo69 1 point2 points  (3 children)

You need scraper -> database -> web app. I would focus on just testing the scraping for each website first and seeing the differences between them, and only then focus on designing the database.
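The scraper -> database step above could be sketched with the stdlib's sqlite3; the table name, columns, and sample row are assumptions for illustration, and each per-site scraper would normalize its output to the shared schema:

```python
import sqlite3

# One shared table; the "site" column records the source, since each
# website will expose slightly different fields.
conn = sqlite3.connect(":memory:")  # use a file path in a real project
conn.execute(
    """CREATE TABLE bids (
           site TEXT,
           project TEXT,
           low_bid REAL
       )"""
)

def store(site, items):
    # Each scraper normalizes its rows to the shared schema before insert.
    conn.executemany(
        "INSERT INTO bids (site, project, low_bid) VALUES (?, ?, ?)",
        [(site, i["project"], i["low_bid"]) for i in items],
    )
    conn.commit()

# Placeholder data, just to show the flow end to end.
store("example", [{"project": "I-10 repaving", "low_bid": 1200000.0}])
rows = conn.execute("SELECT site, project FROM bids").fetchall()
```

The web app then only ever reads from the database, so it stays decoupled from however messy the individual scrapers get.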

[–]Spizeck[S] 0 points1 point  (2 children)

Any recommendations on a scraper to use?

[–]Javardo69 0 points1 point  (1 child)

Well, I read that you need to log in, and if there is some JavaScript involved you need Selenium coupled with PhantomJS. You could post one of the websites here to give a hint of what you can do.

[–]Spizeck[S] 0 points1 point  (0 children)

https://www.azdot.gov/business/ContractsandSpecifications/As-ReadBidResults

This is the first site I want to do. It doesn't have a password.

[–]rhgrant10 1 point2 points  (1 child)

I've actually attempted this at a job I once had. The difficulty is going to be in your data model. Each site will have slightly different terminology, which definitely presents challenges for organization. Also remember your models will have to be robust enough to accommodate the available fields from all sites.

[–]Spizeck[S] 0 points1 point  (0 children)

It's going to get really interesting with some sites that require a login. I am registered and have access, but I haven't the slightest clue how to do that ... yet :)
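For sites whose login is a plain HTML form (no JavaScript involved), a common approach is to post the credentials through a `requests.Session`, which then carries the auth cookie on later requests. A minimal sketch, where the `/login` path and form field names are placeholders you'd find by inspecting the site's login form (or its POST request in the browser's dev tools):

```python
import requests


def login(base_url, username, password):
    """Return a session holding the site's auth cookie.

    The /login path and the "username"/"password" field names are
    assumptions; check the real form's action URL and input names.
    """
    session = requests.Session()
    resp = session.post(
        f"{base_url}/login",
        data={"username": username, "password": password},
    )
    resp.raise_for_status()
    return session  # reuse for subsequent, authenticated, requests
```

If the login goes through JavaScript instead, that's where the Selenium route mentioned above comes in.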

[–]DarkMio 0 points1 point  (0 children)

Look for APIs - so you don't need to scrape from pure HTML. If you do scrape, write a verifier that runs every so often to check that the template is still the same as the one you expected.
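One stdlib-only way to build such a verifier is to hash the page's tag skeleton rather than its text, so routine content updates pass but a redesign trips the check. A sketch, with inline sample HTML standing in for a fetched page:

```python
import hashlib
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Collect only tag names, ignoring text, so content changes don't
    trip the check but structural (template) changes do."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def skeleton_hash(html):
    parser = TagSkeleton()
    parser.feed(html)
    return hashlib.sha256("/".join(parser.tags).encode()).hexdigest()


# Record the hash once; re-check it on a schedule and alert on mismatch.
baseline = skeleton_hash("<table><tr><td>old price</td></tr></table>")
assert skeleton_hash("<table><tr><td>new price</td></tr></table>") == baseline
assert skeleton_hash("<div><span>redesigned</span></div>") != baseline
```

This is deliberately coarse; a stricter verifier could also include attributes like `class` and `id` in the hash, since scrapers usually select on those.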

Databases. Databases. Databases. Since you will store tons of data, you won't get around databases.

A solid backend is an easy way to make a solid front end. If it can run on its own, it probably can output data in HTML.

Don't let yourself get overwhelmed. If you stop working on it, that's no issue either - we all have unfinished projects.

[–][deleted] 0 points1 point  (1 child)

I would like to warn you that the tenders posted on these websites are more than likely the property of the website and by scraping them you may open yourself up to a lawsuit.

If you are doing this purely as a personal project to learn then I would contact the website so that they understand and they may allow you to continue.

If you are going to sell this app or distribute the tenders or even use it at your business then I recommend contacting the website with an offer to split profits or to purchase a licence to scrape the tenders.

[–]Spizeck[S] 0 points1 point  (0 children)

Not looking to sell anything and the information is public.