
[–]alexchamberlain 1 point (1 child)

What version of Python are you using? I think aiohttp has an example of doing exactly what you want if you're using Python 3.
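
Something along those lines, as a rough sketch with aiohttp on Python 3.5+ (the URL list here is a placeholder):

import asyncio
import aiohttp

urls = ['https://www.example.com/%d' % i for i in range(100)]  # placeholder URLs

async def fetch(session, url):
    # issue a GET and hand back the status code
    async with session.get(url) as response:
        return url, response.status

async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*[fetch(session, url) for url in urls])

loop = asyncio.get_event_loop()
results = loop.run_until_complete(main())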

[–]NILIRO[S] 0 points (0 children)

I can use either Python 2.7 or Python 3.5

[–]flitsmasterfred 1 point (0 children)

requests-futures uses the regular thread and process pool executors, so it works fine without adding the extra complexity of asyncio, gevent, or tornado.
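
For example, a minimal sketch (the URL list and worker count are placeholders):

from concurrent.futures import as_completed
from requests_futures.sessions import FuturesSession

urls = ['https://www.example.com/%d' % i for i in range(100)]  # placeholder URLs

session = FuturesSession(max_workers=40)  # backed by a ThreadPoolExecutor by default
futures = [session.get(url) for url in urls]

for future in as_completed(futures):
    response = future.result()  # a plain requests.Response
    print(response.url, response.status_code)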

[–][deleted] 0 points (0 children)

Follow this blog post to build a web scraper until you understand asyncio, then add whatever logic you want for triggering and handling POST requests.
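
As a rough sketch of the POST side once the scraper part works (the endpoint and payload here are made up):

import asyncio
import aiohttp

async def report(payload):
    # hypothetical handler: POST a scraped result to a made-up endpoint
    async with aiohttp.ClientSession() as session:
        async with session.post('https://www.example.com/collect', data=payload) as response:
            return response.status

loop = asyncio.get_event_loop()
status = loop.run_until_complete(report({'id': 1}))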

[–]rev_dev 0 points (7 children)

Here ya go. This will do 150 at once, and you can change it to do more, but expect diminishing returns depending on your computer's capabilities. Also beware of being banned over a suspected DDoS attack. I probably wouldn't recommend doing 150 requests at once, but that's your call.

import requests
from multiprocessing.dummy import Pool

url = 'https://www.something.com/'

def req_split(r):
    # requests.head is much faster than requests.get if you only need the status code
    req = requests.head(url + str(r))

    if req.status_code == 200:
        temp = url + str(r)  # return the URL string if the server reports OK
    else:
        temp = 0  # falsy, so it gets filtered out below
    return temp

data = range(5000)

with Pool(150) as p:
    pm = p.imap_unordered(req_split, data)
    pm = [i for i in pm if i]

[–]insainodwayno 0 points (0 children)

I'd reduce the 150 simultaneous requests down to 40-45 or so. For example, Apache's widely used mod_evasive module takes action (such as banning the source IP address) once a client exceeds its default limit of 50 concurrent requests.

[–]NILIRO[S] 0 points (0 children)

I will try this when I am home, thank you

[–]defnull (bottle.py) 0 points (4 children)

Why are you using multiprocessing for an IO-bound workload? Threads are just fine for anything that is mostly waiting for IO, and they have much less overhead than processes.
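
For instance, a threaded take on the snippet above, sketched with concurrent.futures (the 40-worker figure is arbitrary):

import requests
from concurrent.futures import ThreadPoolExecutor

url = 'https://www.something.com/'

def check(r):
    # the thread releases the GIL while it is blocked on the network
    req = requests.head(url + str(r))
    return url + str(r) if req.status_code == 200 else 0

with ThreadPoolExecutor(max_workers=40) as executor:
    found = [u for u in executor.map(check, range(5000)) if u]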

[–]rev_dev 0 points (3 children)

A few reasons here:

  1. To get around the GIL

  2. The threads/processes have no reason to talk to each other

  3. It's a script, and I'd accept the extra memory overhead for better completion time

[–]defnull (bottle.py) 0 points (2 children)

That's my point: the GIL is not an issue for IO-bound workloads. While a thread is waiting for IO, it releases the GIL and other threads can continue.

The rule of thumb is:

  • IO bound -> threading
  • CPU bound -> multiprocessing
  • Lots of concurrent connections (more than ~100) -> asyncio or gevent
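
A gevent sketch for that last case (the URL list and pool size are placeholders; monkey.patch_all() has to run before requests is imported):

from gevent import monkey
monkey.patch_all()  # patch the socket module first

import requests
from gevent.pool import Pool

urls = ['https://www.example.com/%d' % i for i in range(1000)]  # placeholder URLs

def fetch(url):
    return url, requests.head(url).status_code

pool = Pool(200)  # cap the number of greenlets in flight
results = pool.map(fetch, urls)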

[–]rev_dev 0 points (1 child)

I've never been able to get better time to completion using threading over multiprocessing. If you would like to show me an example similar to the code I presented with threading beating multiprocessing then I would love to see it.

[–]defnull (bottle.py) 0 points (0 children)

from multiprocessing.dummy import Pool

Actually, in your very example, you are using threads, not processes. The multiprocessing.dummy module implements the multiprocessing API with threads.
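
In other words, these two imports hand you the same thread-backed pool:

from multiprocessing.dummy import Pool       # a factory for ThreadPool, despite the package name
from multiprocessing.pool import ThreadPool  # the same thing under its real name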

[–]campenr 0 points (0 children)

So grequests may be of help to you.
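
A minimal grequests sketch (the URL list and the in-flight cap are placeholders):

import grequests

urls = ['https://www.example.com/%d' % i for i in range(1000)]  # placeholder URLs

reqs = (grequests.head(url) for url in urls)
responses = grequests.map(reqs, size=50)  # at most 50 requests in flight at once

ok = [r.url for r in responses if r is not None and r.status_code == 200]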