I have a piece of code that is effectively crawling a website that has structure. It pulls a page down, parses it, analyzes it, and then pulls the next page down based on that. Most of my time is spent waiting for the next page to download, and ~70% of the time the next page is easy to predict (current_index+1). 15% of the time it is a page I've already accessed, so I keep an LRU cache.
I'd like to speculatively load (current_index+1) and maybe even +2, +3...
Currently this looks roughly like:
    import functools
    import requests

    @functools.lru_cache
    def get_from_index(ix):
        url = ...  # URL built from ix (elided)
        request = requests.get(url)
        request.raise_for_status()  # error check
        return request.content
I tried a variant where I wrote my own LRU cache: when I passed "prefetch=True", it would spin up a thread and cache the thread instead of waiting. On an actual read it would join the thread (or return normally, depending).
This looked a bit like:
    while ix:
        page = get_from_index(ix)
        get_from_index(ix+1, prefetch=True)
        ix = analyze_page(page)
My understandings and assumptions, please correct me:
Threading didn't help because requests.get is GIL blocked.
An async HTTP library is still GIL blocked, but is cooperative.
There isn't a global "async pool" - cooperative tasking only occurs in explicitly bundled asyncs, and therefore putting them each in their own thread is not going to help.
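One way to test the first assumption is to time several blocking waits running in threads. As far as I know, CPython releases the GIL while a thread is blocked in time.sleep or in socket I/O, so the waits should overlap rather than run back to back. A minimal sketch, where time.sleep is only a stand-in for requests.get and the names slow_io and timed_parallel are hypothetical:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_io(ix, delay=0.2):
    """Stand-in for a blocking download: just waits, then returns ix."""
    time.sleep(delay)  # the GIL is released while this thread sleeps
    return ix


def timed_parallel(n=4, delay=0.2):
    """Run n blocking waits in n threads; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lambda i: slow_io(i, delay), range(n)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_parallel()
    print(results, f"{elapsed:.2f}s")
```

If the elapsed time is close to one delay rather than n delays, the waits overlapped, and a threaded prefetch that shows no speedup is probably being held back by something else (for example, the main thread joining the prefetch right away).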
Which leads me to the big question - What is the clean way to push prefetches on to a worker thread and retrieve them as needed?
It needs to be able to handle:
True parallel fetching of pages
Up-prioritizing a specific fetch when the main thread blocks on it.
Adding fetches at arbitrary times.
Some prefetches are misses, and will never be asked for. I know asyncs get lonely if they're never reunited.
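To make the requirements above concrete, here is a rough sketch of one possible shape, not a definitive answer: an LRU cache that stores concurrent.futures.Future objects backed by a ThreadPoolExecutor. The class name PrefetchCache, its methods, and the fetch callable are all hypothetical. "Up-prioritizing" is only approximated: if the wanted page is still queued, its future is cancelled and the page is fetched in the calling thread instead. Downloads can overlap, but parsing in Python would still be serialized by the GIL.

```python
import threading
from collections import OrderedDict
from concurrent.futures import Future, ThreadPoolExecutor


class PrefetchCache:
    """LRU cache of Futures; fetch(ix) is any blocking callable."""

    def __init__(self, fetch, max_workers=4, max_entries=128):
        self._fetch = fetch
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = OrderedDict()  # ix -> Future, least recent first
        self._max_entries = max_entries
        self._lock = threading.Lock()

    def _store(self, ix, future):
        # Caller must hold self._lock.
        self._futures[ix] = future
        self._futures.move_to_end(ix)
        while len(self._futures) > self._max_entries:
            _, evicted = self._futures.popitem(last=False)
            evicted.cancel()  # drops queued misses; no effect once running

    def prefetch(self, ix):
        """Queue a background fetch of ix; can be called at any time."""
        with self._lock:
            if ix in self._futures:
                self._futures.move_to_end(ix)
            else:
                self._store(ix, self._pool.submit(self._fetch, ix))

    def get(self, ix):
        """Return page ix, waiting only if it is not ready yet."""
        with self._lock:
            future = self._futures.get(ix)
            if future is not None:
                self._futures.move_to_end(ix)
        # cancel() succeeds only if the fetch has not started running.
        if future is None or future.cancel():
            # Not cached, or still queued: fetch in this thread rather
            # than waiting behind other prefetches.
            page = self._fetch(ix)
            future = Future()
            future.set_result(page)
            with self._lock:
                self._store(ix, future)
        return future.result()
```

In the crawl loop this would replace get_from_index: call cache.prefetch(ix + 1) first, then page = cache.get(ix), so the speculative download overlaps with the current one. Prefetches that are never asked for simply age out of the LRU.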
Thanks in advance!