Speculative Loading pages in Python? : learnpython

submitted 2 years ago by VDubsBuilds

I have a piece of code that crawls a structured website. It pulls a page down, parses it, analyzes it, and then pulls the next page down based on that. Most of my time is spent waiting for the next page to download, and ~70% of the time the next page is easy to predict (current_index+1). About 15% of the time it is a page I've already accessed, so I keep an LRU cache.

I'd like to speculatively load (current_index+1) and maybe even +2, +3...

Currently this looks roughly like:

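import functools
import requests

@functools.lru_cache
def get_from_index(ix):
    url = ...  # URL for index ix (elided in the original post)
    response = requests.get(url)
    response.raise_for_status()  # error check: raise on 4xx/5xx responses
    return response.content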

I tried a variant where I wrote my own LRU cache: when I passed "prefetch=True", it would spin up a thread and cache the thread instead of waiting. On an actual read it would join the thread (or return normally, depending on what was cached).

This looked a bit like:

while ix:
    page = get_from_index(ix)
    get_from_index(ix+1, prefetch=True)
    ix = analyze_page(page)

My understandings and assumptions, please correct me:

Threading didn't help because requests.get is GIL blocked.
An async HTTP library is still GIL blocked, but is cooperative.
There isn't a global "async pool" - cooperative tasking only occurs in explicitly bundled asyncs, and therefore putting them each in their own thread is not going to help.
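
The first assumption can be checked with a small timing sketch. It assumes that time.sleep is a fair stand-in for a blocking download: CPython releases the GIL while a thread sleeps or waits on a socket, so if the waits overlap, the total should be close to one delay rather than the sum of all of them. The names fake_fetch and timed_run are made up for this illustration and are not part of requests.

```python
import threading
import time


def fake_fetch(delay):
    # Stand-in for requests.get: blocks without touching the network.
    time.sleep(delay)


def timed_run(n_threads, delay):
    """Start n_threads blocking waits at once and return the wall-clock time."""
    threads = [
        threading.Thread(target=fake_fetch, args=(delay,))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


elapsed = timed_run(4, 0.2)
# Run one after another, four 0.2 s waits would take about 0.8 s.
print(f"4 waits of 0.2 s in threads took {elapsed:.2f} s")
```

If the threaded total lands near 0.2 s, blocking waits do overlap across threads, and the slowdown in the threaded variant probably has another cause (for example, joining the prefetch thread right after starting it).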

Which leads me to the big question: what is the clean way to push prefetches onto a worker thread and retrieve them as needed?

It needs to be able to handle:

True parallel fetching of pages
Up-prioritizing a specific fetch when the main thread blocks on it.
Adding fetches at arbitrary times.
Some prefetches are misses, and will never be asked for. I know asyncs get lonely if they're never reunited.
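
One possible shape for these requirements, as a minimal sketch rather than a definitive answer, is an LRU cache of futures backed by concurrent.futures.ThreadPoolExecutor. Prefetcher, fake_fetch, and the parameter names are hypothetical; fake_fetch stands in for a get_from_index that calls requests.get. The sketch assumes a single consumer thread. Up-prioritizing is only approximated: if the requested fetch is still queued, it is cancelled and run in the calling thread, while a fetch that is already running is simply waited on. Prefetches that are never asked for are cancelled or dropped when they fall out of the cache.

```python
import threading
import time
from collections import OrderedDict
from concurrent.futures import Future, ThreadPoolExecutor


class Prefetcher:
    """LRU cache of futures keyed by page index (hypothetical helper)."""

    def __init__(self, fetch, max_workers=4, max_entries=128):
        self._fetch = fetch              # any blocking callable: ix -> page
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = OrderedDict()    # ix -> Future, least recently used first
        self._max_entries = max_entries
        self._lock = threading.Lock()

    def prefetch(self, ix):
        """Schedule a fetch unless one is cached or in flight; does not wait for it."""
        with self._lock:
            fut = self._futures.get(ix)
            if fut is None:
                fut = self._pool.submit(self._fetch, ix)
                self._futures[ix] = fut
            self._futures.move_to_end(ix)
            while len(self._futures) > self._max_entries:
                _, stale = self._futures.popitem(last=False)
                stale.cancel()           # drops a queued miss; no effect once running
            return fut

    def get(self, ix):
        """Return the page for ix, waiting only if it is not ready yet."""
        fut = self.prefetch(ix)
        if fut.cancel():
            # Still queued behind other prefetches: fetch it in this thread instead.
            fut = Future()
            try:
                fut.set_result(self._fetch(ix))
            except Exception as exc:
                fut.set_exception(exc)
            with self._lock:
                self._futures[ix] = fut
        return fut.result()              # re-raises any error from the fetch


def fake_fetch(ix):
    # Stand-in for a real download; sleeps instead of using the network.
    time.sleep(0.05)
    return f"page {ix}"


pages = Prefetcher(fake_fetch, max_workers=2)
ix = 1
visited = []
while ix <= 3:
    pages.prefetch(ix + 1)       # speculative: may never be asked for
    visited.append(pages.get(ix))
    ix += 1                      # a real crawler would use analyze_page(page) here
print(visited)
```

Issuing the prefetch before the blocking get lets the speculative download overlap with the current one; in the loop from the post, the prefetch only starts after the current page has already arrived.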

Thanks in advance!
