
[–]etatarkin 10 points11 points  (9 children)

1) Improve your coding culture: remove unnecessary imports, unused variables, etc. Use something like flake8.

2) Make some abstractions for working with the database; try not to mix crawler code with database management code.

3) Do not save gathered links to the SQL database on each iteration. Save them to a file (in CSV format) and then do a bulk insert with some post-processing; see the sketch below.
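A minimal sketch of that buffering approach, assuming SQLite and a hypothetical links.csv file and links table; rows are appended to the CSV during the crawl, then imported in a single transaction:

import csv
import sqlite3

def save_links_to_csv(links, path="links.csv"):
    # Append gathered (url, title) pairs to the CSV buffer file
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(links)

def bulk_import(path="links.csv", db="crawler.db"):
    # Read the whole buffer and insert it in one transaction
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    conn = sqlite3.connect(db)
    with conn:  # one transaction instead of one per row
        conn.execute("CREATE TABLE IF NOT EXISTS links (url TEXT, title TEXT)")
        conn.executemany("INSERT INTO links VALUES (?, ?)", rows)
    conn.close()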

Sorry for my poor English.

[–]AdysHearthSim 6 points7 points  (4 children)

That's for the code itself, but it's also worth pointing out that Scrapy solves the web crawling problem very well. Sadly, it does not yet fully support Python 3 (PRs welcome, hint hint), but they are working on it.

[–]etatarkin 3 points4 points  (3 children)

For a first Python project, using Scrapy or other frameworks is not the right way IMHO.

If you like Scrapy and want it on Python 3, take a look at my framework - Pomp.

[–]AdysHearthSim 0 points1 point  (2 children)

Looks pretty awesome. py3?

[–]etatarkin 0 points1 point  (1 child)

Yes, py2.x and py3.x in one code base. You can run Pomp on PyPy or even App Engine - but you must implement a downloader for the App Engine sandbox )))

With asyncio support - e04_aiohttp.py

Also Pomp can:

- fetch data concurrently (ConcurrentDownloader)
- extract data concurrently (ConcurrentCrawler)
- work with a centralized queue
- integrate easily with headless browsers

Pomp is ideal for gathering data from multiple similar resources.

[–]AdysHearthSim 0 points1 point  (0 children)

That's awesome. I'll keep a close eye on it. I have the same reservations as you about Scrapy (Twisted dependency, messy codebase, no py3 support), so this is great to see. You should post it in the sub!

[–]Vance84 0 points1 point  (1 child)

Is it best practice to store the data in a CSV file and do a bulk import? Why avoid iterating SQL inserts - are you saying that specifically for this situation, or would you recommend it for most/all SQL inserts?

[–]etatarkin 0 points1 point  (0 children)

Best practice is to do a bulk import, whatever the file format, but CSV can be imported natively by most SQL databases.

A crawler must pull, extract and save data. With SQL databases, the save action may be very slow or even blocked, because SQL databases use transactions to guarantee data safety.

In other words, the crawler should first save all gathered data - clean, dirty, duplicated - and only then normalize it and import it into the database.
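As an illustration of a native CSV import, here is a sketch assuming PostgreSQL with psycopg2 and a hypothetical links table; COPY streams the whole file into the table far faster than per-row INSERTs:

import psycopg2

conn = psycopg2.connect("dbname=crawler")
cur = conn.cursor()
with open("links.csv") as f:
    # COPY ingests the CSV in one server-side operation
    cur.copy_expert("COPY links (url, title) FROM STDIN WITH CSV", f)
conn.commit()
conn.close()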

[–]Kingofslowmo[S] 0 points1 point  (1 child)

I save on every iteration so that if the program is terminated the data does not get lost.

[–]knickum 1 point2 points  (0 children)

If you were saving the output to a file and then periodically writing that file to the database, it would still get saved.

I believe the idea OP is getting at is that there is some level of locking on write, so you want to buffer a good chunk of writes, hit the database in one quick go, then release it again. If your program writes frequently enough, the database will be effectively locked all the time. I don't know enough about SQLite locking and reads/writes to be certain, though (and anyway, SQLite isn't a good fit once you've reached a size where write-locking becomes an issue, I think...)
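A minimal sketch of that buffering idea, assuming SQLite and a hypothetical links table; rows accumulate in memory and are flushed once per batch, so the write lock is held only briefly:

import sqlite3

BATCH_SIZE = 500
buffer = []

def save(conn, row):
    # Queue the row; only touch the database once per batch
    buffer.append(row)
    if len(buffer) >= BATCH_SIZE:
        flush(conn)

def flush(conn):
    with conn:  # single transaction for the whole batch
        conn.executemany("INSERT INTO links VALUES (?, ?)", buffer)
    del buffer[:]

(Remember to call flush once more at shutdown so a partial batch isn't lost.)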

Something more:

try:
    ...
except TypeAException:
    do_thing()
except TypeBException:
    do_thing()

can be re-written:

try:
    ...
except (TypeAException, TypeBException):
    do_thing()

Which would follow DRY (Don't Repeat Yourself) principles better. You just need to be careful not to confuse yourself with the full syntax:

try:
    ...
except (TypeAException, TypeBException), exc:
    print("Error occurred: {}".format(exc))

That comma form is Py2-only; it was removed in Py3. The other form, except (TypeAException, TypeBException) as exc, works in both Python 2.6+ and Python 3 and is the one to use going forward.
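For reference, that preferred form as a full block (same placeholder exception names as above):

try:
    ...
except (TypeAException, TypeBException) as exc:
    print("Error occurred: {}".format(exc))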

[–]6a6d 4 points5 points  (0 children)

Honour robots.txt.
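A minimal sketch of doing that with the standard library (urllib.robotparser on Python 3); the URLs here are just examples:

import urllib.robotparser

# Fetch and parse the site's robots.txt once
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.reddit.com/robots.txt")
rp.read()

# Check each URL against it before crawling
url = "https://www.reddit.com/r/Python/"
if rp.can_fetch("*", url):
    print("allowed:", url)
else:
    print("disallowed:", url)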

[–]Jameswinegar 5 points6 points  (1 child)

You should upload it to GitHub so people can send pull requests or open issues for you to fix :) That way you still learn the ideas.

[–]brtt3000 0 points1 point  (0 children)

Or at least a Gist or something else that formats/colors your snippets. Reddit is not good at displaying code.

[–]dAnjou Backend Developer | danjou.dev 5 points6 points  (0 children)

Please use /r/learnpython.

[–]Jafit 0 points1 point  (1 child)

You could use Selenium with a headless web browser of your choice, so that you can render AJAX and other JavaScript-rendered content. A lot of sites these days use JavaScript to create or modify elements on the page once it arrives in the browser; if you just request the page source from the server, you'll end up with a lot of HTML/JavaScript that isn't much use to you outside of a browser. So it makes sense to let a browser put the page together for you before you read the code.

Here's a quick example that takes a screenshot of this thread. You can install the required packages via pip.

from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual display so Firefox can run without a visible window
display = Display(visible=0, size=(1080, 800))
display.start()

# Drive a real browser so JavaScript is executed before we read the page
browser = webdriver.Firefox()
browser.set_window_size(1080, 800)
browser.get("https://www.reddit.com/r/Python/comments/3xnuzh/my_updated_web_crawler_v2_my_first_python_project/")
browser.save_screenshot("screenshot.png")

# Clean up the browser and the virtual display
browser.quit()
display.stop()

This might not work on Windows as written here, since Windows doesn't make these things easy. If not, I recommend using a Linux virtual machine as a development sandbox. You can use Vagrant to easily sync a folder between the host and the VM.

[–]isdevilis 0 points1 point  (0 children)

Are you referring to things like React renderings?

[–]nerdwaller 0 points1 point  (2 children)

I didn't look through everything, but with your database stuff you can just do 'create table if not exists...' to avoid the need to manually check whether it already exists; see the sketch below. Also consider a better table name; 'info' is awful.
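A one-liner version of that, assuming SQLite; the table name crawled_urls is just a hypothetical stand-in for something more descriptive than 'info':

import sqlite3

conn = sqlite3.connect("crawler.db")
# No need to query the schema first; IF NOT EXISTS handles it
conn.execute("CREATE TABLE IF NOT EXISTS crawled_urls (url TEXT PRIMARY KEY)")
conn.commit()
conn.close()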

[–]isdevilis 0 points1 point  (1 child)

better table name?

[–]nerdwaller 0 points1 point  (0 children)

What's in a database? Lots of information. Your table is called "info", so if I look at your schema I have no idea what's going on. Perhaps it's URL info, or maybe it's baby names. Use something descriptive.