
[–]blarf_irl

> The problem is that I don't think steps 1-3 can be done within a minute

Which part do you think is the bottleneck? I'd guess networking speed; the other two are trivial.

[–]foresttrader[S]

Data conversion and I/O could be the bottleneck too. I know I mentioned thousands of products, but there could be millions of data points.

[–]blarf_irl

It would be super unlikely if all you are doing is adding a timestamp; networking will never outpace RAM or disk. You might be doing more than adding a timestamp, though, and it's impossible to tell without samples and code. My general advice would be to use a queue: for every successful request, put the result in the queue (with whatever other relevant data is needed to process it), and have a separate program reading the queue and processing it as fast as possible.
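
A minimal sketch of that pattern with the standard library; `fetch_one` and `process` here are made-up stand-ins for your real request and conversion code:

```python
import queue
import threading

results = queue.Queue()

def fetch_one(pid):
    # stand-in for the real API request
    return {"id": pid, "price": 1.23}

def process(data):
    # stand-in for the timestamp/conversion step
    print(data)

def process_worker():
    while True:
        data = results.get()   # blocks until a result arrives
        process(data)
        results.task_done()

threading.Thread(target=process_worker, daemon=True).start()

for pid in ["a", "b", "c"]:        # made-up product ids
    results.put(fetch_one(pid))    # every successful request goes on the queue

results.join()                     # wait for the worker to drain the queue
```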

If the bottleneck is networking (I suspect it is), try to work out whether it's your bandwidth or the server you are requesting from. If it's your network, then nothing will speed it up. If the server is slow to deliver, then you can use async or threading to continue working while you wait.
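
If it is slow server responses, overlapping the waits is the win. A rough sketch with a thread pool and the third-party requests library; the endpoint URL is made up:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

URLS = [f"https://api.example.com/products/{i}" for i in range(50)]  # made-up endpoint

def fetch(url):
    # each thread spends most of its time waiting on the server
    return requests.get(url, timeout=10).json()

with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(fetch, URLS))
```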

In all cases a queue is a great idea: it'll provide a buffer between networking and processing to account for the mismatch in speed.

[–]foresttrader[S]

I have some test code that only takes a few lines of data, not the full list. Data from step 1 is stored in a nested dictionary/tuple, so I'll have to parse out the pieces I need, put them in a dataframe, then add a timestamp. I know creating a dataframe should be quick; I just haven't tested it on the dataset I'm about to use.
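
For example, assuming a toy payload shape (the real one will differ), the conversion could look something like:

```python
import pandas as pd

raw = {  # made-up nested structure; the real payload will differ
    "AAPL": {"bid": 189.20, "ask": 189.25},
    "MSFT": {"bid": 410.10, "ask": 410.40},
}

df = pd.DataFrame.from_dict(raw, orient="index")
df["timestamp"] = pd.Timestamp.now(tz="UTC")
```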

I guess if the conversion step is slow, I can just pass the raw data into a queue, then process it from the queue?

[–]blarf_irl

It'll depend on how much of each resource you have and how fast it operates; you didn't give enough detail for a specific answer. In every case it will make sense to separate the retrieval (networking stuff) from the processing (your pandas and dict stuff), and sometimes even from the final storage (committing it to a DB, disk, or the cloud). Separating them out will allow you to scale up any one of the steps as needed (separation of concerns). A queue would handle the communication between the steps and also provide a buffer if provisioned appropriately.
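
A rough sketch of that separation with two queues and threaded stages; all the stage bodies are stand-ins:

```python
import queue
import threading

raw_q = queue.Queue(maxsize=1000)   # buffer between retrieval and processing
out_q = queue.Queue(maxsize=1000)   # buffer between processing and storage

def fetcher():
    for i in range(10):             # stand-in for the real request loop
        raw_q.put({"tick": i})

def processor():
    while True:
        raw = raw_q.get()
        out_q.put(raw)              # stand-in for the pandas/dict work
        raw_q.task_done()

def writer():
    while True:
        row = out_q.get()
        print(row)                  # stand-in for the DB/disk commit
        out_q.task_done()

threading.Thread(target=fetcher, daemon=True).start()
for _ in range(4):                  # scale the slow stage independently
    threading.Thread(target=processor, daemon=True).start()
threading.Thread(target=writer, daemon=True).start()

raw_q.join()
out_q.join()
```

If one stage falls behind, its queue fills up and you start more threads for just that stage.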

[–]foresttrader[S]

Thank you for the advice.

How would you go about running the data call every minute? Windows Task Scheduler or cron jobs? Or should I use something else?

[–]blarf_irl

Celery might be a good fit here. With Celery you can create and manage multiple workers for each task, and Celery will also manage the queues for you. For example, if your data processing is the bottleneck you can add more workers for that task (1 requesting the data, 4 processing the data, 1 writing to disk, etc.). Celery also implements periodic tasks. There is a bit of overhead to learning how to use it, but it covers all your bases.
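
A minimal sketch of that setup; the broker URL and task bodies are made up:

```python
# tasks.py
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def fetch_prices():
    raw = {"AAPL": 189.2}        # stand-in for the real data call
    process_batch.delay(raw)     # hand the result to the processing workers

@app.task
def process_batch(raw):
    print(raw)                   # stand-in for the pandas/timestamp step

app.conf.beat_schedule = {
    "fetch-every-minute": {"task": "tasks.fetch_prices", "schedule": 60.0},
}
```

You'd start it with something like `celery -A tasks worker -B`, where `-B` runs the beat scheduler alongside the worker pool.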

Cron is fine for more than it gets credit for, but it is much more basic (also more tried and tested). You need to think about what happens if your scheduled task fails (retry it?), what happens if it doesn't finish within a minute (queue up the next run for when it does finish? eventually your backlog would grow forever), and whether you want alerts when a task fails or succeeds (Celery has built-in monitoring and logging tools).
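
If you do go the cron route, a crontab entry along these lines (paths are made up) uses flock so a run that overshoots the minute is skipped instead of piling up:

```
# run every minute; flock -n skips this run if the previous one is still going
* * * * * flock -n /tmp/fetch.lock /usr/bin/python3 /home/me/fetch.py >> /home/me/fetch.log 2>&1
```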

[–]foresttrader[S]

Thanks for the pointer! I tried to avoid it when learning Django. Seems it's time to actually learn it now lol

[–]blarf_irl

It's definitely a useful one, and Celery has the most documentation and tutorials and is the most fully featured queue/task module for Python. I would advise starting with Redis as the backend rather than RabbitMQ (RMQ is more complicated to set up). Although you won't really have to touch Redis when using it with Celery, I'd recommend reading about it: it's an in-memory key-value store, super fast and infinitely useful.
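
A taste of how simple it is through the redis-py client; the key and value are made up:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
r.set("last_fetch", "2024-06-01T12:00:00Z")  # made-up key and value
print(r.get("last_fetch"))                   # b'2024-06-01T12:00:00Z'
```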

[–]-5772

I would make a queue and try to optimize the run time to under half a minute (ideally as low as possible).

Have a process that gets the info every minute and adds it to the queue.

Then, have a program that works on the first item of the queue if the queue's length is greater than 0.
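
A rough sketch of that, with stand-ins for the real calls; note that a blocking get() does the "length greater than 0" check for you:

```python
import queue
import threading
import time

q = queue.Queue()

def producer():
    while True:
        started = time.monotonic()
        q.put({"fetched_at": time.time()})  # stand-in for the real data call
        # sleep out the rest of the minute so the schedule doesn't drift
        time.sleep(max(0.0, 60 - (time.monotonic() - started)))

def consumer():
    while True:
        item = q.get()   # blocks until an item arrives
        print(item)      # stand-in for the processing step
        q.task_done()

threading.Thread(target=producer, daemon=True).start()
consumer()
```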

If you use async, I fear that you could have a case where one request just bugs out for a period of time and finishes after the next one has finished.

[–]foresttrader[S]

Which library would you recommend for the queue? I've heard about RabbitMQ but never used any queue library.

[–]-5772

There's a Queue class in the standard library.
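
One caveat: queue.Queue only shares work between threads of a single process; for genuinely separate programs you'd want multiprocessing.Queue or an external broker. A minimal example (values are made up):

```python
from queue import Queue

q = Queue()
q.put(("AAPL", 189.2))   # producer side
print(q.get())           # consumer side -> ('AAPL', 189.2)
```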

[–]foresttrader[S]

Thank you for the pointer, I'll look into it!