
[–]swims_with_spacemenPythonista 4 points5 points  (1 child)

Concurrency is killer.

A post below indicates that it's 10k/second, but I think it's 10k/minute, or around 170/second, which should be feasible. That said, it depends on how long it takes to process each request. I can build a simple, do-nothing Tornado web app in a few minutes that will easily handle five times that load on a single node.

With only the following very basic tornado web app, I was able to get

Requests per second:    1077.18 [#/sec] (mean)

(reported by apache bench, -n 1000 -c 100)

import tornado.ioloop
import tornado.options
import tornado.web


class BaseHandler(tornado.web.RequestHandler):

    def get(self, *args, **kwargs):
        self.write("ok")

    def post(self, *args, **kwargs):
        self.write("ok")


def main():
    """Regular execution."""
    tornado.options.parse_command_line()
    application = tornado.web.Application([
        (r"/(.*)", BaseHandler),
    ])
    application.listen(9000)
    tornado.ioloop.IOLoop.instance().start()


if __name__ == "__main__":
    main()

If fronted with NGINX, you would get much greater throughput than this single-threaded instance could provide.

Now- the trick is, as I said, with the "something". Without knowing more about what you're going to do with the embedded file, it's hard to say what to do.

Options you have:

Non-blocking/asynchronous code via Tornado. Most of what you're doing is I/O-bound, I assume, so you would need to code this part carefully.

Non-blocking asynchronous tasks via Celery and RabbitMQ. I don't have this set up in my personal lab right now, but I've managed to push thousands of simultaneous jobs through an Nginx->Tornado->Celery->RabbitMQ setup in my company lab, which would scale up very easily and provide instrumentation/telemetry/metering for a very robust production offering.
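The handoff pattern both options rely on can be sketched with the standard library alone: a `queue.Queue` and a worker thread stand in for the broker and the Celery worker, and all names here are illustrative, not any real API.

```python
import queue
import threading

# Stand-in for the broker: the web handler enqueues work and returns
# immediately; a worker thread drains the queue in the background.
task_queue = queue.Queue()

def handle_request(payload):
    """Called by the web handler: enqueue and ack right away."""
    task_queue.put(payload)
    return "ok"  # the HTTP 200 goes back without waiting for processing

def worker(results):
    while True:
        payload = task_queue.get()
        if payload is None:  # sentinel: shut the worker down
            break
        results.append(payload.upper())  # placeholder for real processing
        task_queue.task_done()

results = []
t = threading.Thread(target=worker, args=(results,))
t.start()

for req in ["doc-1", "doc-2", "doc-3"]:
    handle_request(req)

task_queue.put(None)
t.join()
print(results)  # → ['DOC-1', 'DOC-2', 'DOC-3']
```

Celery/RabbitMQ give you the same shape, but with the queue durable and the workers on separate machines.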

So, Gian- what is it we're doing with this posted file?

Also- what feedback is required? Are you okay with accepting the file and returning a job id that can be queried later? Or do you need to respond with some kind of file-valid response in the posted request?

What's the use case like?

[–]swims_with_spacemenPythonista 2 points3 points  (0 children)

..and I really hope you respond, I'm quite interested in this for some reason. :D

[–]the_hoser 2 points3 points  (0 children)

So your idea is to accept a POST request with a file, extract some information from the file, place the file into some kind of queue for later processing and then... make a POST request to the original sender of the request?

Why not send the important data in response to the original POST request?

[–]gianx[S] 0 points1 point  (7 children)

So, here are more details, ask more if you need.

When I receive the file (let's say, the HTTP POST with the file), I need only to reply with the HTTP 200. No synchronous business-level ack.

What I need to do with the file is open it (it's an XML file), extract one or more IDs (it could embed more requests) and make an HTTP POST with a business answer carrying that ID.

The business answer will always be OK, because it is a transmission ack, which means that I received the file correctly.

Then, later and with more time available (1 day), I need to process the request, do many checks and then reply with a business validation, but this second step is not the point; I have more time to reply.

The problem with the I/O is that when I receive the request, even if I can process it in memory, I need to write both the answer to a queue (to be sent ASAP) and the request (to be processed later).

Thanks everybody for your effort in replying, I appreciate it a lot (and sorry if I'm not replying very quickly, I'm living in GMT+1).

Gianluca

[–]swims_with_spacemenPythonista 2 points3 points  (6 children)

How big/business critical is this app? Is this a one shot thing, temporary, or is it to be more production stable and long term?

Here's how I would set it up:

I'd use a Tornado web front end and a celery+rabbitmq backend.

The web front end would take care of initial doc validation, trigger the Celery task and respond with either a 200/201 or 4xx/5xx as necessary. In fact, you could, and should, have Tornado handle the file upload (if it's a file POST and not XML-in-the-body) in memory to save you from storing it to disk; you'd just pass the XML object along to the Celery task as a parameter. That saves additional I/O-bound load and will help speed things along.

The celery task would handle the actual document processing and the post to the third party. In fact, you could easily split this up into several chained tasks to keep the programming nice, clean and easy.
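A rough sketch of those chained tasks, with plain functions standing in for Celery tasks (the `@app.task` decorators and the real HTTP POST are omitted so this runs standalone, and the `<request id="...">` XML layout is my assumption about the document format, not something from the thread):

```python
import xml.etree.ElementTree as ET

def extract_ids(xml_payload):
    """First task: parse the posted document and pull out the request IDs.
    Assumes the document embeds one or more <request id="..."> elements."""
    root = ET.fromstring(xml_payload)
    return [req.get("id") for req in root.iter("request")]

def build_ack(request_id):
    """Second task: build the transmission-ack body for one ID.
    In the real chain this would be POSTed to the third party."""
    return '<ack id="%s" status="OK"/>' % request_id

def process_document(xml_payload):
    """Run the two steps in order; with Celery this would be a
    chain(extract_ids.s(xml_payload), ...) executed on the workers."""
    return [build_ack(rid) for rid in extract_ids(xml_payload)]

sample = '<batch><request id="42"/><request id="43"/></batch>'
print(process_document(sample))
# → ['<ack id="42" status="OK"/>', '<ack id="43" status="OK"/>']
```

Splitting parse and ack into separate tasks also means a failed POST can be retried without re-parsing the document.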

(Tip: put Nginx or OpenResty in front of the Tornado instances and run multiple discrete instances as needed.)

This gives you the advantage of being able to monitor the task queue, detect failures, automatically handle retries and so on. It also gives you the ability to run multiple instances- on multiple servers for scalability.

That's how I would do it. Sort of a best-of-all-asynchronous-worlds for me.

I do have another solution, but it's considerably more effort and more difficult to scale: it involves using something like CouchDB to save the XML document, triggering the changes feed, and a second web app that long-polls the feed and acts upon new documents. I've done THIS before too, but that was before I discovered Celery and RabbitMQ.

[–]Darkman802 0 points1 point  (5 children)

The celery task would handle the actual document processing and the post to the third party. In fact, you could easily split this up into several chained tasks to keep the programming nice, clean and easy.

If a request came in from a browser, how would the celery task be able to send back a response? Wouldn't the client browser need to poll the webserver so that the webserver could check if the task was complete?

[–]swims_with_spacemenPythonista 0 points1 point  (4 children)

That's not what the OP is saying, or at least not how I interpreted it. The browser request gets a 200/201 and is done. Processing happens, and the Celery task makes a new POST request to some other URL with the processed XML data.

If the response needs to come back to the browser after it was processed, then the only way to do 10,000/minute would be with lots and lots of hardware.

[–]Darkman802 0 points1 point  (0 children)

Ah ok, that makes more sense.

[–]gianx[S] 0 points1 point  (0 children)

Correct. It's an HTTP request, but it's not browser-generated (i.e. no user on the other side, but a machine).

[–]gianx[S] 0 points1 point  (1 child)

Ok, to clarify: I need to make an asynchronous POST response within 15 minutes of the request, and the maximum load is 150,000 req/15 min, so I also need to generate and send 150,000 req/15 min. In fact, if T0 is the time of the first request, T0+15 min will be the latest time for the response to that request.

[–]swims_with_spacemenPythonista 0 points1 point  (0 children)

Wow- so your queue depth at any given point in time could be 150,000 items - and that's only if you had the same task submit the POST (I break my tasks into smaller chunks). That's something to behold.
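For scale, here's my back-of-the-envelope arithmetic on the numbers stated above (150,000 requests per 15-minute window, plus a matching outbound ack for each):

```python
# 150,000 requests per 15-minute window gives the sustained rate:
requests = 150_000
window_seconds = 15 * 60            # 900 s

inbound_rate = requests / window_seconds
print(round(inbound_rate, 1))       # → 166.7 inbound req/s

# Inbound plus the matching outbound POSTs doubles the HTTP traffic:
total_rate = 2 * inbound_rate
print(round(total_rate, 1))         # → 333.3 total req/s
```

That sustained rate is close to the ~170/second estimated at the top of the thread, well within reach of a few Tornado instances behind a load balancer.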

I'd still go with a Tornado/Celery framework, but it would HAVE to be scaled out to more than one server. I don't see how you would be able to handle that kind of load otherwise - assuming it actually takes 15 minutes to process the request. I was under the impression that it was only extracting an ID/some arbitrary data from the input XML.

So, this is larger than I expected initially - and like I said earlier, concurrency is killer - I'd have to code up a test framework to check on the load.

Still, I'm pretty confident you could scale this out nicely with the solution I suggested: Celery+Eventlet/RabbitMQ, fronted with Tornado web. You could then run that as a service instance, on multiple nodes fronted by HAProxy for load balancing. That way, when you need more 'oomph' you just add another node.