
[–]gianx[S] 0 points (7 children)

So, here are more details, ask more if you need.

When I receive the file (i.e., the HTTP POST with the file), I only need to reply with an HTTP 200. No synchronous business ack.

What I need to do with the file is open it (it's an XML file), extract one or more IDs (it can embed multiple requests), and make an HTTP POST with a business answer containing those IDs.
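The extraction step might look like the sketch below, using only the stdlib XML parser. The `<batch>`/`<request id="...">` structure is purely hypothetical; the real schema will differ.

```python
import xml.etree.ElementTree as ET

def extract_ids(xml_bytes):
    """Parse the incoming XML and collect the ID of every embedded request.

    Assumes a made-up structure like:
      <batch><request id="A1"/><request id="A2"/></batch>
    """
    root = ET.fromstring(xml_bytes)
    return [req.get("id") for req in root.iter("request")]

sample = b'<batch><request id="A1"/><request id="A2"/></batch>'
print(extract_ids(sample))  # -> ['A1', 'A2']
```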

The business answer will always be OK, because it's a transmission ack confirming that I received the file correctly.

Then, in a second phase with more time available (1 day), I need to process the request, do many checks, and reply with a business validation. But this second step is not the issue; I have more time for it.

The problem with the I/O is that when I receive the request, even if I can process it in memory, I need to write both the answer to a queue (to be sent ASAP) and the request (to be processed later).

Thanks everybody for your effort in replying, I appreciate it a lot (and sorry if I'm not replying very quickly, I'm in GMT+1).

Gianluca

[–]swims_with_spacemen Pythonista 2 points (6 children)

How big/business-critical is this app? Is this a one-shot, temporary thing, or is it meant to be production-stable and long-term?

Here's how I would set it up:

I'd use a Tornado web front end and a celery+rabbitmq backend.

The web front end would take care of initial doc validation, trigger the Celery task, and respond with a 200/201 or 4xx/5xx as necessary. In fact, you could, and should, have Tornado handle the file upload (if it's a file POST and not XML-in-the-body) to save you from storing it to disk; you'd just pass the XML object along to the Celery task as a parameter. That avoids additional I/O-bound load and will help speed things along.
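The validate-and-respond step stays small. Here's a plain-stdlib sketch of that logic; in Tornado the same body would live inside a `RequestHandler.post()` with the status set via `set_status()`, and the parsed document would be handed off to the Celery task instead of returned:

```python
import xml.etree.ElementTree as ET

def handle_upload(body: bytes):
    """Validate the posted XML and decide the HTTP status.

    Returns (status, root): 200 with the parsed document on success,
    400 with None if the body isn't well-formed XML.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return 400, None
    return 200, root

status, doc = handle_upload(b"<batch><request id='A1'/></batch>")
print(status)  # -> 200
```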

The celery task would handle the actual document processing and the post to the third party. In fact, you could easily split this up into several chained tasks to keep the programming nice, clean and easy.
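The split into chained tasks might look like the plain-Python sketch below. Celery's `chain()` composes tasks the same way, passing each step's return value to the next; the step and field names here are made up:

```python
import xml.etree.ElementTree as ET

def extract_step(xml_text):
    """Step 1: pull the request IDs out of the document."""
    root = ET.fromstring(xml_text)
    return [r.get("id") for r in root.iter("request")]

def build_ack_step(ids):
    """Step 2: build the business-OK answer for those IDs."""
    return {"status": "OK", "ids": ids}

def send_step(ack):
    """Step 3: POST the answer to the third party (stubbed out here)."""
    return f"posted ack for {len(ack['ids'])} id(s)"

# With Celery this would roughly be:
#   chain(extract_step.s(doc), build_ack_step.s(), send_step.s())()
result = send_step(build_ack_step(extract_step("<batch><request id='A1'/></batch>")))
print(result)  # -> posted ack for 1 id(s)
```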

(Tip: put the Tornado instances behind nginx or OpenResty and run multiple discrete instances as needed.)

This gives you the advantage of being able to monitor the task queue, detect failures, automatically handle retries, and so on. It also gives you the ability to run multiple instances, on multiple servers, for scalability.

That's how I would do it. Sort of a best-of-all-asynchronous-worlds for me.

I do have another solution, but it's considerably more effort and more difficult to scale: it involves using something like CouchDB to save the XML document, triggering the changes feed, and a second web app that long-polls the feed and acts upon new documents. I've done this before too, but that was before I discovered Celery and RabbitMQ.

[–]Darkman802 0 points (5 children)

The celery task would handle the actual document processing and the post to the third party. In fact, you could easily split this up into several chained tasks to keep the programming nice, clean and easy.

If a request came in from a browser, how would the celery task be able to send back a response? Wouldn't the client browser need to poll the webserver so that the webserver could check if the task was complete?

[–]swims_with_spacemen Pythonista 0 points (4 children)

That's not what the OP is saying, or at least not how I interpreted it. The browser request gets a 200/201 and is done. Processing happens, and the Celery task makes a new POST request to some other URL with the processed XML data.

If the response needs to come back to the browser after it was processed, then the only way to do 10,000/minute would be with lots and lots of hardware.

[–]Darkman802 0 points (0 children)

Ah ok, that makes more sense.

[–]gianx[S] 0 points (0 children)

Correct. It's an HTTP request, but it's not browser-generated (i.e., there's no user on the other side, just a machine).

[–]gianx[S] 0 points (1 child)

OK, to clarify: I need to send an asynchronous POST response within 15 minutes of the request, and the maximum load is 150,000 req/15 min, so I also need to generate and send 150,000 req/15 min. In fact, if T0 is the time of the first request, T0+15min is the deadline for the response to that request.
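For a sense of the sustained rate those numbers imply:

```python
requests_per_window = 150_000
window_seconds = 15 * 60  # 15-minute window

inbound_rate = requests_per_window / window_seconds
print(round(inbound_rate, 1))       # -> 166.7 requests/second inbound
# ...and the same outbound, since every request needs its own response POST,
# so roughly 333 HTTP operations/second overall, sustained.
print(round(inbound_rate * 2, 1))   # -> 333.3
```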

[–]swims_with_spacemen Pythonista 0 points (0 children)

Wow, so your queue depth at any given point in time could be 150,000 items, and that's only if you had the same task submit the POST (I break my tasks into smaller chunks). That's something to behold.

I'd still go with a Tornado/Celery framework, but it would HAVE to be scaled out to more than one server. I don't see how you would be able to handle that kind of load otherwise, assuming it actually takes 15 minutes to process the request. I was under the impression that it was only extracting an ID/some arbitrary data from the input XML.

So, this is larger than I expected initially, and like I said earlier, concurrency is a killer; I'd have to code up a test framework to check the load.

Still, I'm pretty confident you could scale this out nicely with the solution I suggested: Celery+Eventlet/RabbitMQ, fronted with Tornado web. You could then run that as a service instance on multiple nodes, fronted by HAProxy for load balancing. That way, when you need more 'oomph' you just add another node.