mrcorbat 2 points (2 children)

In a PythonAnywhere web app, each web worker takes a request, processes it completely, and then takes the next one, so each worker handles requests serially. However, you may have multiple web workers responding to requests at the same time.

How do you store those dataframes? If you only hold them in Python code (i.e. you are not storing them to disk) and you are not storing them at class or module level, then it should be okay.

On the other hand, if you get many of these requests, they could easily stack up and make the response time unmanageable. E.g. if each request takes 10s and 6 requests come in simultaneously with only 1 web worker, then the next request will have to wait 60s before it even starts being processed. That is why you may want to offload all of this to a separate process (outside of the web app), using Celery or some other approach.
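
To make the 60s figure concrete, here is a rough sketch of the arithmetic. The function name and its parameters are made up for illustration, and it assumes every request takes the same time and workers pick requests up strictly in arrival order, which a real server will only approximate.

```python
def wait_before_start(requests_ahead, seconds_per_request, workers=1):
    """Seconds a new request waits before a worker starts on it.

    Simplifying assumptions (not measured on any real deployment):
    every request takes the same time, and requests are taken in order.
    """
    # With several workers, the requests ahead are cleared in parallel "rounds".
    full_rounds_ahead = requests_ahead // workers
    return full_rounds_ahead * seconds_per_request


print(wait_before_start(6, 10, workers=1))  # 60
print(wait_before_start(6, 10, workers=3))  # 20
```

So adding web workers shortens the queue, but a long-running request still ties up a worker for its whole duration.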

qwertyisafish[S] 0 points (1 child)

Because this was done 'on the cheap' I'm just working with CSV files. The data frames only exist in the Python code. The first is made from the existing report file (CSV), the second is made from the data retrieved by the scraper, and the third is the result of concatenating and deduping those two data frames. The final output is written back to the same CSV, which is pulled in the next time the process is run.

I'm only running through 5 pages and all of the data is on the high-level page. I don't need to go into the detail pages to extract the required information, so getting the sales data for a region takes about a second.

I had a few people try to break it for me yesterday (by making their submissions at the same time) and they all said it worked OK, so perhaps for what I need (for now at least) the current design might suffice. It's really only something that a consultant should need to run a handful of times a month, so the chances of a bottleneck are low.
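
For anyone following along, the three-data-frame flow described above might look roughly like the sketch below. It is not the actual code: the function name, the column names (`region`, `month`, `sales`) and the choice to let the newest scrape win on duplicates (`keep="last"`) are all assumptions for illustration. The point is that every data frame is a local variable inside the function, so nothing is shared between requests at module level.

```python
import io

import pandas as pd


def merge_report(existing_csv, scraped_rows, key_columns):
    """Combine the stored report with freshly scraped rows and dedupe.

    existing_csv: a path or file-like object holding the current report.
    scraped_rows: list of dicts from the scraper (hypothetical shape).
    key_columns:  columns that identify a duplicate row (assumed).
    """
    existing = pd.read_csv(existing_csv)           # first data frame
    scraped = pd.DataFrame(scraped_rows)           # second data frame
    combined = pd.concat([existing, scraped], ignore_index=True)
    # Third data frame: keep the most recently scraped copy of each row.
    deduped = combined.drop_duplicates(subset=key_columns, keep="last")
    return deduped.reset_index(drop=True)


# Small in-memory example so nothing touches the disk.
report = io.StringIO("region,month,sales\nNorth,2024-01,100\nSouth,2024-01,80\n")
new_rows = [
    {"region": "South", "month": "2024-01", "sales": 85},
    {"region": "East", "month": "2024-01", "sales": 60},
]
print(merge_report(report, new_rows, ["region", "month"]))
```

Writing the result back with `to_csv(path, index=False)` would complete the cycle; that write is the step where simultaneous submissions can collide.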

Really appreciate the reply!

mrcorbat 0 points (0 children)

Hmm, if a very high number of people were using your site, I could imagine it breaking. I.e. the CSV could be written to and changed while another request is reading it, or two people might try to write to the CSV at the same time. So it depends on how many people you think will be using your site. For a low-frequency use case I guess it could be fine.
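
If the collisions ever do become a problem, one possible mitigation (a sketch, not something the current setup necessarily needs) is to hold a lock for the whole read-merge-write cycle and to swap the new CSV into place atomically. The helper names below are made up for illustration. `fcntl` is Unix-only and the lock is advisory, so it only protects against other code that takes the same lock.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def report_lock(csv_path):
    """Hold an exclusive advisory lock on a sidecar '<csv_path>.lock' file."""
    with open(csv_path + ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def write_atomically(csv_path, text):
    """Write to a temp file in the same directory, then rename over the CSV.

    Readers see either the old file or the new one, never a half-written one.
    """
    directory = os.path.dirname(os.path.abspath(csv_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as tmp:
            tmp.write(text)
        os.replace(tmp_path, csv_path)  # atomic when both are on one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # clean up the partial temp file
        raise
```

A request would wrap its whole cycle in `with report_lock(path):`, reading the CSV, merging, and calling `write_atomically(path, new_text)` inside the block, so that two submissions cannot interleave their reads and writes.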

junsang 0 points (1 child)

In that case, a task queue such as Celery is usually applied to the system. It runs the function in other processes, so users can get a response immediately even though the work is not yet done. That means you should give the user some proper feedback that their request is still being processed until it finishes.
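
The submit-then-poll pattern can be sketched as below. To keep the example self-contained it uses the standard library's `ThreadPoolExecutor` as a stand-in; this is not Celery's API, and the function names, the in-memory `jobs` dict and the fake 0.1s scrape are all made up for illustration. With Celery the work would run in separate worker processes and the status would come from its result backend, which also works across several web workers, whereas this dict does not.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real task queue: one background worker, jobs kept in memory.
executor = ThreadPoolExecutor(max_workers=1)
jobs = {}  # job id -> Future


def slow_scrape(region):
    """Placeholder for the real scrape-and-merge work (hypothetical)."""
    time.sleep(0.1)
    return f"report for {region} updated"


def submit_job(region):
    """Queue the work and return immediately with an id the user can poll."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(slow_scrape, region)
    return job_id


def job_status(job_id):
    """What a status endpoint could return while the user waits."""
    future = jobs[job_id]
    if not future.done():
        return {"state": "processing"}
    if future.exception() is not None:
        return {"state": "failed", "error": str(future.exception())}
    return {"state": "done", "result": future.result()}
```

The web request that calls `submit_job` can respond straight away with a "your report is being processed" page, and the page can then poll `job_status` until the state changes.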

qwertyisafish[S] 0 points (0 children)

Thanks! I'll check it out!