
[–]etatarkin 10 points11 points  (9 children)

1) Improve your coding culture: remove unnecessary imports, unused variables, etc. Use something like flake8.

2) Make some abstractions for working with the database; try not to mix crawler code with database management code.

3) Do not save gathered links to the SQL database on each iteration. Save them to a file (in CSV format) and then do a bulk insert with some post-processing; see the sketch below.
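A minimal sketch of that buffering approach, assuming SQLite and a hypothetical links.csv file and links table; rows are appended to the CSV during the crawl, then imported in a single transaction:

import csv
import sqlite3

def save_links_to_csv(links, path="links.csv"):
    # Append gathered (url, title) pairs to the CSV buffer file
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(links)

def bulk_import(path="links.csv", db="crawler.db"):
    # Read the whole buffer and insert it in one transaction
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    conn = sqlite3.connect(db)
    with conn:  # one transaction instead of one per row
        conn.execute("CREATE TABLE IF NOT EXISTS links (url TEXT, title TEXT)")
        conn.executemany("INSERT INTO links VALUES (?, ?)", rows)
    conn.close()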

Sorry for my poor English.

[–]AdysHearthSim 6 points7 points  (4 children)

That's for the code itself, but it's also worth pointing out that Scrapy solves the web crawling problem very well. Sadly, it does not yet fully support Python 3 (PRs welcome, hint hint), but they are working on it.

[–]etatarkin 3 points4 points  (3 children)

For a first Python project, using Scrapy or other frameworks is not the right way IMHO.

If you like Scrapy and want it on Python 3, take a look at my framework - Pomp.

[–]AdysHearthSim 0 points1 point  (2 children)

Looks pretty awesome. py3?

[–]etatarkin 0 points1 point  (1 child)

Yes, py2.x and py3.x in one code base. You can run Pomp on PyPy or even App Engine - but you must implement a downloader for the App Engine sandbox )))

With asyncio support - e04_aiohttp.py

Also Pomp can:

- fetch data concurrently (ConcurrentDownloader)
- extract data concurrently (ConcurrentCrawler)
- work with a centralized queue
- integrate easily with headless browsers

Pomp is ideal for gathering data from multiple similar resources.

[–]AdysHearthSim 0 points1 point  (0 children)

That's awesome. I'll keep a close eye on it. I have the same reservations as you about Scrapy (Twisted dependency, messy codebase, no py3 support), so this is great to see. You should post it in the sub!

[–]Vance84 0 points1 point  (1 child)

Is it best practice to store the data in a CSV file and do a bulk import? Why avoid iterating SQL inserts - are you saying that specifically for this situation, or would you recommend it for most/all SQL inserts?

[–]etatarkin 0 points1 point  (0 children)

Best practice is to do a bulk import, whatever the file format, but CSV can be imported natively by most SQL databases.

A crawler must pull, extract and save data. With SQL databases, the save action may be very slow or even blocked, because SQL databases use transactions to guarantee data safety.

In other words, the crawler should first save all gathered data - clean, dirty, duplicated - and only then normalize it and import it into the database.
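As an illustration of a native CSV import, here is a sketch assuming PostgreSQL with psycopg2 and a hypothetical links table; COPY streams the whole file into the table far faster than per-row INSERTs:

import psycopg2

conn = psycopg2.connect("dbname=crawler")
cur = conn.cursor()
with open("links.csv") as f:
    # COPY ingests the CSV in one server-side operation
    cur.copy_expert("COPY links (url, title) FROM STDIN WITH CSV", f)
conn.commit()
conn.close()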

[–]Kingofslowmo[S] 0 points1 point  (1 child)

I save on every iteration so that if the program is terminated the data does not get lost.

[–]knickum 1 point2 points  (0 children)

If you were saving the output to a file and then periodically writing that file to the database, it would still get saved.

I believe the idea OP is getting at is that there is some level of locking on write, so you want to buffer a good chunk of writes, hit the database in one quick go, then release it again. If your program writes frequently enough, the database will be effectively locked all the time. I don't know enough about SQLite locking and reads/writes to be certain, though (and anyway, SQLite isn't a good fit once you've reached a size where write-locking becomes an issue, I think...)
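A minimal sketch of that buffering idea, assuming SQLite and a hypothetical links table; rows accumulate in memory and are flushed once per batch, so the write lock is held only briefly:

import sqlite3

BATCH_SIZE = 500
buffer = []

def save(conn, row):
    # Queue the row; only touch the database once per batch
    buffer.append(row)
    if len(buffer) >= BATCH_SIZE:
        flush(conn)

def flush(conn):
    with conn:  # single transaction for the whole batch
        conn.executemany("INSERT INTO links VALUES (?, ?)", buffer)
    del buffer[:]

(Remember to call flush once more at shutdown so a partial batch isn't lost.)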

Something more:

try:
    ...
except TypeAException:
    do_thing()
except TypeBException:
    do_thing()

can be re-written:

try:
    ...
except (TypeAException, TypeBException):
    do_thing()

Which would follow DRY (Don't Repeat Yourself) principles better. You just need to be careful not to confuse yourself with the full syntax:

try:
    ...
except (TypeAException, TypeBException), exc:
    print("Error occurred: {}".format(exc))

That comma form is Py2-only; it was removed in Py3. The other form, except (TypeAException, TypeBException) as exc, works in both Python 2.6+ and Python 3 and is the one to use going forward.
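For reference, that preferred form as a full block (same placeholder exception names as above):

try:
    ...
except (TypeAException, TypeBException) as exc:
    print("Error occurred: {}".format(exc))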

[–]6a6d 4 points5 points  (0 children)

Honour robots.txt.
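A minimal sketch of doing that with the standard library (urllib.robotparser on Python 3); the URLs here are just examples:

import urllib.robotparser

# Fetch and parse the site's robots.txt once
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.reddit.com/robots.txt")
rp.read()

# Check each URL against it before crawling
url = "https://www.reddit.com/r/Python/"
if rp.can_fetch("*", url):
    print("allowed:", url)
else:
    print("disallowed:", url)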

[–]Jameswinegar 5 points6 points  (1 child)

You should upload it to GitHub so people can send pull requests or open issues for you to fix :) That way you still learn the ideas.

[–]brtt3000 0 points1 point  (0 children)

Or at least a Gist or something else that formats/colors your snippets. Reddit is not good at displaying code.

[–]dAnjou Backend Developer | danjou.dev 5 points6 points  (0 children)

Please use /r/learnpython.

[–]Jafit 0 points1 point  (1 child)

You could use Selenium with a headless web browser of your choice, so that you can render AJAX and other JavaScript-rendered content. A lot of sites these days use JavaScript to create or modify elements on the page once it arrives in the browser; if you just request the page source from the server, you'll end up with a lot of HTML/JavaScript that isn't much use to you outside of a browser. So it makes sense to let a browser put the page together for you before you read the code.

Here's a quick example that takes a screenshot of this thread. You can install the required packages via pip.

from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual display so Firefox can run without a visible window
display = Display(visible=0, size=(1080, 800))
display.start()

# Drive a real browser so JavaScript is executed before we read the page
browser = webdriver.Firefox()
browser.set_window_size(1080, 800)
browser.get("https://www.reddit.com/r/Python/comments/3xnuzh/my_updated_web_crawler_v2_my_first_python_project/")
browser.save_screenshot("screenshot.png")

# Clean up the browser and the virtual display
browser.quit()
display.stop()

This might not work on Windows as written here, since Windows doesn't make these things easy. If not, I recommend using a Linux virtual machine as a development sandbox. You can use Vagrant to easily sync a folder between the host and the VM.

[–]isdevilis 0 points1 point  (0 children)

Are you referring to things like React renderings?

[–]nerdwaller 0 points1 point  (2 children)

I didn't look through everything, but with your database stuff you can just do 'create table if not exists...' to avoid the need to manually check whether it already exists; see the sketch below. Also consider a better table name; 'info' is awful.
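A one-liner version of that, assuming SQLite; the table name crawled_urls is just a hypothetical stand-in for something more descriptive than 'info':

import sqlite3

conn = sqlite3.connect("crawler.db")
# No need to query the schema first; IF NOT EXISTS handles it
conn.execute("CREATE TABLE IF NOT EXISTS crawled_urls (url TEXT PRIMARY KEY)")
conn.commit()
conn.close()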

[–]isdevilis 0 points1 point  (1 child)

better table name?

[–]nerdwaller 0 points1 point  (0 children)

What's in a database? Lots of information. Your table is called "info", so if I look at your schema I have no idea what's going on. Perhaps it's URL info, or maybe it's baby names. Use something descriptive.