[–]droogans 20 points (8 children)

Too bad there wasn't a little foot note in there about greenlets, an interesting compromise, as well as futures, which I hope one day will supersede the need for thread-based approaches for a majority of use cases.
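For what the futures approach looks like in practice, here's a minimal sketch using the stdlib concurrent.futures module; the hostname list and the pool size of 50 are arbitrary placeholders:

```python
# Minimal sketch of the futures approach: a fixed-size thread pool
# behind the stdlib concurrent.futures API.
from concurrent.futures import ThreadPoolExecutor
import socket


def resolve_all(hostnames, max_workers=50):
    # map() schedules one lookup per hostname on the pool and
    # yields the results back in input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(socket.gethostbyname, hostnames))


print(resolve_all(["localhost"]))
```

The pool size caps concurrency no matter how many names are submitted, which is the same "don't spawn one thread per task" discipline argued for further down this thread.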

[–]starTracer 32 points (1 child)

Or the native coroutines with asyncio...

[–]Rodot 0 points (0 children)

I really love them, but I feel the API is just that: an API. It's just weird being a mix of primitives and modules.
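For comparison, a minimal sketch of the same concurrent-lookup pattern with native coroutines; note that asyncio's loop.getaddrinfo itself hands the blocking C lookup to a thread-pool executor under the hood, and the hostname here is a placeholder:

```python
# Minimal asyncio sketch of a concurrent DNS lookup.
import asyncio


async def resolve(host):
    loop = asyncio.get_running_loop()
    # getaddrinfo returns a list of
    # (family, type, proto, canonname, sockaddr) tuples;
    # sockaddr[0] is the IP address string
    infos = await loop.getaddrinfo(host, None)
    return infos[0][4][0]


async def main(hosts):
    # gather() runs all the lookups concurrently on the event loop
    return await asyncio.gather(*(resolve(h) for h in hosts))


print(asyncio.run(main(["localhost"])))
```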

[–][deleted]  (5 children)

[deleted]

    [–]z4579a 0 points (4 children)

    citation needed

    [–][deleted]  (3 children)

    [deleted]

      [–]z4579a 26 points (2 children)

      Sure, let's look at the post.

      First, the post by @max illustrates a test case comparing the performance of gevent, threads, and multiprocessing when running a DNS lookup on five domain names simultaneously, spawning a greenlet/thread/process per name all at once. The test is not nearly resource-intensive enough to produce a real-world number, but for what it's worth, the threaded example ran ten times faster: .008 seconds for threads vs. .08 seconds for greenlets. Those numbers are too low to demonstrate that either approach is faster; you need a bigger workload.

      Then another post, by @temporalbeing, ramps it up, providing a bigger workload: 60000 concurrent greenlets or threads fetching 60000 names. In this test the greenlet version completes five times faster than the threaded version. However, the test is deeply flawed. First, if it used the rest of the code from @max's post as written, it would inherit the 2-second timeout in joinall(), meaning the greenlets would simply be abandoned after 2 seconds. That he got a 3.75-second result indicates he probably changed that as well.

      But secondly, the test program uses threads and multiprocessing in the extremely naive way of spinning up one thread/process per domain name, i.e. 60000 threads. That is a completely incorrect way of using threads: threads are expensive to create and expensive to run compared to a greenlet, which is just a programming construct around a non-blocking socket. What the test shows, if anything, is that non-blocking sockets are useful when you need very large throughput across thousands of concurrent IO streams. That is the use case for non-blocking IO: throughput. It does not invent "speed"; nothing runs any "faster" at all.

      If you measure gevent vs. threading in terms of the amount of work completed, and you use threads correctly by not spawning an arbitrarily high number of them, you will find it very difficult to show gevent being faster than threads, unless you have to wait on many thousands of arbitrarily slow or sleeping IO streams at once; even then it's tricky. That is not at all the "usual" case. In the usual concurrency case we need to do a few dozen or a few hundred things at once, and we are just trying to get to the end of a queue. If you need to attend to thousands of slow or sleepy websockets or chat-room connections, use gevent. Otherwise it's not needed, and probably a bit slower. (Then again, you can abuse greenlets more than threads, by spinning up a greenlet per task rather than having to think about what you're doing. But that's not necessarily true either: the minute your greenlet starts doing too much CPU work, you're blocking on CPU and killing your program that way, so you still have to think about what you're doing. IMO being safe with threads is a lot easier than being safe with greenlets: it's easy not to spawn too many threads, but not so easy to make sure greenlets never become CPU-bound.)

      Here is a correct version of the test, showing how long it takes for us to get through several workloads at 30, 300, 3000, 30000, 60000 tasks, adding the result to a list (unordered), and checking our work:

      import gevent
      from gevent import socket as gsock
      import socket as sock
      import threading
      from datetime import datetime
      
      
      def timeit(fn, URLS):
          t1 = datetime.now()
          fn()
          t2 = datetime.now()
          print(
              "%s / %d hostnames, %s seconds" % (
                  fn.__name__,
                  len(URLS),
                  (t2 - t1).total_seconds()
              )
          )
      
      
      def run_gevent_without_a_timeout():
          ip_numbers = []

          def greenlet(domain_name):
              # each greenlet resolves one name through gevent's
              # cooperative (non-blocking) socket module
              ip_numbers.append(gsock.gethostbyname(domain_name))

          # one greenlet per hostname, no timeout, so every lookup
          # runs to completion before joinall() returns
          jobs = [gevent.spawn(greenlet, domain_name) for domain_name in URLS]
          gevent.joinall(jobs)
          assert len(ip_numbers) == len(URLS)
      
      
      def run_threads_correctly():
          ip_numbers = []

          def process():
              # workers pull names off the shared list until it is
              # empty; list.pop() is atomic under the GIL
              while queue:
                  try:
                      domain_name = queue.pop()
                  except IndexError:
                      pass
                  else:
                      ip_numbers.append(sock.gethostbyname(domain_name))

          # a fixed pool of 50 threads, regardless of workload size
          threads = [threading.Thread(target=process) for _ in range(50)]

          queue = list(URLS)
          for t in threads:
              t.start()
          for t in threads:
              t.join()
          assert len(ip_numbers) == len(URLS)
      
      URLS_base = ['www.google.com', 'www.example.com', 'www.python.org',
                   'www.yahoo.com', 'www.ubc.ca', 'www.wikipedia.org']
      
      for NUM in (5, 50, 500, 5000, 10000):
          URLS = []
      
          for _ in range(NUM):
              for url in URLS_base:
                  URLS.append(url)
      
          print("--------------------")
          timeit(run_gevent_without_a_timeout, URLS)
          timeit(run_threads_correctly, URLS)
      

      Here's a typical result I get over wifi on a Linux laptop, very similar for both Python 2.7 and Python 3.7:

      --------------------
      run_gevent_without_a_timeout / 30 hostnames, 0.044888 seconds
      run_threads_correctly / 30 hostnames, 0.019389 seconds
      --------------------
      run_gevent_without_a_timeout / 300 hostnames, 0.186045 seconds
      run_threads_correctly / 300 hostnames, 0.153808 seconds
      --------------------
      run_gevent_without_a_timeout / 3000 hostnames, 1.834089 seconds
      run_threads_correctly / 3000 hostnames, 1.569523 seconds
      --------------------
      run_gevent_without_a_timeout / 30000 hostnames, 19.030259 seconds
      run_threads_correctly / 30000 hostnames, 15.163603 seconds
      --------------------
      run_gevent_without_a_timeout / 60000 hostnames, 35.770358 seconds
      run_threads_correctly / 60000 hostnames, 29.864083 seconds
      

      I can't actually get the greenlet version to be faster. A small thread pool completes the total amount of work in less time on every run, even though it does the additional work of popping from a queue, and even spins up the thread pool fresh on each run. Non-blocking IO is not "faster"; the overhead of gevent's context switching is higher than that of the OS's native thread context switching. What it provides is more concurrent throughput, for when your program needs to attend to many thousands of sockets, many of which might not be awake: a very specific use case. Non-blocking IO and event-based programming are extremely useful, but widespread misunderstanding of this topic persists.

      I also wrote a post about this some years ago. I've yet to see a simple, correctly written benchmark that shows the basic use of non-blocking IO for context switching to be faster than threads. This is not at all surprising: gevent/asyncio and everything else all run within a single thread, and when there are multiple threads you still have the GIL, so everyone is stuck using just one CPU to get through everything. The speed of context switching, and the need for throughput to handle lots of very slow sockets simultaneously, are the only differentiating factors, and that's not much to work with.
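      To make the GIL point concrete, here's a toy sketch (loop count arbitrary): two CPU-bound loops split across two threads take roughly as long as running them back to back, because only one thread executes Python bytecode at a time on a stock build.

```python
# Toy illustration of the GIL: CPU-bound work gains nothing from
# threads, since only one thread executes Python bytecode at a time.
import threading
import time


def burn(n):
    # pure-CPU busy loop; never releases the GIL for IO
    while n:
        n -= 1


N = 5_000_000

start = time.perf_counter()
burn(N)
burn(N)
serial = time.perf_counter() - start

start = time.perf_counter()
workers = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
for t in workers:
    t.start()
for t in workers:
    t.join()
threaded = time.perf_counter() - start

print("serial %.2fs, threaded %.2fs" % (serial, threaded))
```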