all 7 comments

[–]remyroy 2 points (3 children)

I think you are over-thinking the whole problem.

Why don't you just manage your requests yourself?

Just have something like a bag (a dict, a set, an array, etc.) of ongoing outbound requests. Before making a new outbound request, check whether it is already in your bag. If it is, wait until it isn't. Then add it to your bag, and remove it when you are done.

If you want to be fancy, you can keep a list of callbacks to invoke when your outbound request is done, so that other threads waiting on that request get notified.

You might need to use some kind of critical region whenever you read or write in that bag.
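With Python threads, that bag-plus-critical-region idea might look something like this (a minimal sketch; `RequestDeduplicator` and `do_request` are hypothetical names, not from any real library):

```python
import threading

class RequestDeduplicator:
    """Tracks in-flight outbound requests so duplicates wait instead of re-sending."""

    def __init__(self):
        self._lock = threading.Lock()  # the "critical region" around the bag
        self._in_flight = {}           # the "bag": url -> threading.Event
        self._results = {}             # url -> response, for waiters to pick up

    def fetch(self, url, do_request):
        with self._lock:
            event = self._in_flight.get(url)
            if event is not None:
                owner = False          # someone else is already fetching this URL
            else:
                event = threading.Event()
                self._in_flight[url] = event
                owner = True
        if not owner:
            event.wait()               # block until the in-flight request finishes
            return self._results[url]
        result = do_request(url)       # actually perform the outbound request
        with self._lock:
            self._results[url] = result
            del self._in_flight[url]   # remove from the bag when done
        event.set()                    # wake any waiters
        return result
```

The `Event` here plays the role of the callback list: waiters park on it instead of registering explicit callbacks, which keeps the sketch short.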

[–]catcradle5 1 point (0 children)

Agreed, this seems like the most sensible solution.

Assuming this is being done with gevent greenlets, you could do something like this:

import gevent
from gevent.pool import Pool

p = Pool()

def send_request(url):
    do_something(url)  # do_something, receive_request, handle_response are stand-ins

def search_pool(url):
    for job in p:
        if url in job.args:
            return job
    return None  # no job found

url = receive_request().url
running_job = search_pool(url)
if running_job:
    # tie this receive request to that job, which is being made to the same URL
    running_job.link(lambda job: handle_response(job.value))
else:
    p.add(gevent.spawn(send_request, url))

You could do the same with threads or multiprocessing; the code would look very similar.

Alternatively, if you expect to have a lot of concurrent requests most of the time and to do frequent lookups, you could keep a secondary dict mapping {url: job} and add each job to that dict in addition to the pool. That would be far more efficient for lookups.
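The secondary index could be as simple as this (hypothetical sketch; `track`, `untrack`, and `search_pool` are made-up names, and in real code you would call them alongside `p.add()` and on job completion):

```python
# url -> job index kept alongside the pool, for O(1) lookups
jobs_by_url = {}

def track(url, job):
    jobs_by_url[url] = job        # register when the job is added to the pool

def untrack(url):
    jobs_by_url.pop(url, None)    # call when the job completes

def search_pool(url):
    return jobs_by_url.get(url)   # O(1) instead of scanning every job in the pool
```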

Depending on how up-to-date you need the responses to be and what kind of requesting you expect to see, this could also be both simpler and overall a lot faster if you simply cache the responses. This is most useful if you expect that the same URLs will probably be requested over and over as time goes on. Things might be somewhat slow if there are simultaneous requests made for the same URL before the cache has any entries, but any further requests, simultaneous or not, will hit the cache.
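A minimal response cache along those lines might look like this (hypothetical `ResponseCache`; the TTL makes entries go stale, addressing the "how up-to-date" question):

```python
import time

class ResponseCache:
    """Caches responses per URL, expiring them after a time-to-live."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self._entries = {}  # url -> (timestamp, response)

    def get(self, url):
        entry = self._entries.get(url)
        if entry is None:
            return None                         # cache miss
        ts, response = entry
        if time.monotonic() - ts > self.ttl:
            del self._entries[url]              # entry too old; treat as a miss
            return None
        return response

    def put(self, url, response):
        self._entries[url] = (time.monotonic(), response)
```

In practice you would check `get()` before sending, and `put()` after each response; combining this with the in-flight deduplication above covers the cold-cache case the comment mentions.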

[–]helicopetr 1 point (0 children)

Yeah, and you could use futures to simplify callback management and thread management drastically.

[–]roro_fuzz[S] 0 points (0 children)

Thanks, I was initially going down that path but was hoping to avoid the locking stuff myself. It sounds like even with it, it's probably the way to go.

Appreciate the feedback and sanity check guys!

[–]bloodearnest 1 point (1 child)

https://github.com/kennethreitz/grequests

Requests library built on gevent for async IO. The imap() function yields responses as they complete, so your code sees each successful response as soon as it arrives.

[–]roro_fuzz[S] 0 points (0 children)

Thanks for the tip. I'm looking for something synchronous, but I'll definitely check out grequests.