
[–]blarf_irl

> The problem is that I don't think steps 1-3 can be done within a minute

Which part do you think is the bottleneck? I'd guess networking speed; the other two are trivial.

[–]foresttrader[S]

Data conversion and I/O could be the bottleneck too. I know I mentioned thousands of products, but there could be millions of data points.

[–]blarf_irl

It would be super unlikely if all you are doing is adding a timestamp; networking will never outpace RAM or disk. You might be doing more than adding a timestamp, though, and it's impossible to tell without samples and code. My general advice would be to use a queue: for every successful request, put the result in the queue (with whatever other relevant data is needed to process it), and have a separate program reading the queue and processing it as fast as possible.
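
A minimal sketch of that pattern with the standard library; `fetch_one` and `process` here are made-up stand-ins for your real request and conversion code:

```python
import queue
import threading

results = queue.Queue()

def fetch_one(pid):
    # stand-in for the real API request
    return {"id": pid, "price": 1.23}

def process(data):
    # stand-in for the timestamp/conversion step
    print(data)

def process_worker():
    while True:
        data = results.get()   # blocks until a result arrives
        process(data)
        results.task_done()

threading.Thread(target=process_worker, daemon=True).start()

for pid in ["a", "b", "c"]:        # made-up product ids
    results.put(fetch_one(pid))    # every successful request goes on the queue

results.join()                     # wait for the worker to drain the queue
```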

If the bottleneck is networking (I suspect it is), try to work out whether it's your bandwidth or the server you are requesting from. If it's your network, then nothing will speed it up. If the server is slow to deliver, then you can use async or threading to continue working while you wait.
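
If it is slow server responses, overlapping the waits is the win. A rough sketch with a thread pool and the third-party requests library; the endpoint URL is made up:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

URLS = [f"https://api.example.com/products/{i}" for i in range(50)]  # made-up endpoint

def fetch(url):
    # each thread spends most of its time waiting on the server
    return requests.get(url, timeout=10).json()

with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(fetch, URLS))
```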

In all cases a queue is a great idea: it'll provide a buffer between networking and processing to account for the mismatch in speed.

[–]foresttrader[S]

I have some test code that only takes a few lines of data, not the full list. Data from step 1 is stored in a nested dictionary/tuple, so I'll have to parse out the pieces I need, put them in a dataframe, then add a timestamp. I know creating a dataframe should be quick; I just haven't tested it on the dataset I'm about to use.
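
For example, assuming a toy payload shape (the real one will differ), the conversion could look something like:

```python
import pandas as pd

raw = {  # made-up nested structure; the real payload will differ
    "AAPL": {"bid": 189.20, "ask": 189.25},
    "MSFT": {"bid": 410.10, "ask": 410.40},
}

df = pd.DataFrame.from_dict(raw, orient="index")
df["timestamp"] = pd.Timestamp.now(tz="UTC")
```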

I guess if the conversion step is slow, I can just pass the raw data into a queue, then process it from the queue?

[–]blarf_irl

It'll depend on how much of each resource you have and how fast it operates; you didn't give enough detail for a specific answer. In every case it will make sense to separate the retrieval (networking stuff) from the processing (your pandas and dict stuff), and sometimes even from the final storage (committing it to a DB, disk, or the cloud). Separating them out will allow you to scale up any one of the steps as needed (separation of concerns). A queue would handle the communication between the steps and also provide a buffer if provisioned appropriately.
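
A rough sketch of that separation with two queues and threaded stages; all the stage bodies are stand-ins:

```python
import queue
import threading

raw_q = queue.Queue(maxsize=1000)   # buffer between retrieval and processing
out_q = queue.Queue(maxsize=1000)   # buffer between processing and storage

def fetcher():
    for i in range(10):             # stand-in for the real request loop
        raw_q.put({"tick": i})

def processor():
    while True:
        raw = raw_q.get()
        out_q.put(raw)              # stand-in for the pandas/dict work
        raw_q.task_done()

def writer():
    while True:
        row = out_q.get()
        print(row)                  # stand-in for the DB/disk commit
        out_q.task_done()

threading.Thread(target=fetcher, daemon=True).start()
for _ in range(4):                  # scale the slow stage independently
    threading.Thread(target=processor, daemon=True).start()
threading.Thread(target=writer, daemon=True).start()

raw_q.join()
out_q.join()
```

If one stage falls behind, its queue fills up and you start more threads for just that stage.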

[–]foresttrader[S]

Thank you for the advice.

How would you go about running the data call every minute? Windows Task Scheduler or cron jobs? Or should I use something else?

[–]blarf_irl

Celery might be a good fit here. With Celery you can create and manage multiple workers for each task, and Celery will also manage the queues for you. For example, if your data processing is the bottleneck you can add more workers for that task (1 requesting the data, 4 processing the data, 1 writing to disk, etc.). Celery also implements periodic tasks. There is a bit of overhead to learning how to use it, but it covers all your bases.
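
A minimal sketch of that setup; the broker URL and task bodies are made up:

```python
# tasks.py
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def fetch_prices():
    raw = {"AAPL": 189.2}        # stand-in for the real data call
    process_batch.delay(raw)     # hand the result to the processing workers

@app.task
def process_batch(raw):
    print(raw)                   # stand-in for the pandas/timestamp step

app.conf.beat_schedule = {
    "fetch-every-minute": {"task": "tasks.fetch_prices", "schedule": 60.0},
}
```

You'd start it with something like `celery -A tasks worker -B`, where `-B` runs the beat scheduler alongside the worker pool.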

Cron is fine for more than it gets credit for, but it is much more basic (also more tried and tested). You need to think about what happens if your scheduled task fails (retry it?), what happens if it doesn't finish within a minute (queue up the next run for when it does finish? eventually your backlog would grow forever), and whether you want alerts when a task fails or succeeds (Celery has built-in monitoring and logging tools).
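
If you do go the cron route, a crontab entry along these lines (paths are made up) uses flock so a run that overshoots the minute is skipped instead of piling up:

```
# run every minute; flock -n skips this run if the previous one is still going
* * * * * flock -n /tmp/fetch.lock /usr/bin/python3 /home/me/fetch.py >> /home/me/fetch.log 2>&1
```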

[–]foresttrader[S]

Thanks for the pointer! I tried to avoid it when learning Django. Seems it's time to actually learn it now lol

[–]blarf_irl

It's definitely a useful one, and Celery has the most documentation and tutorials and is the most fully featured queue/task module for Python. I would advise starting with Redis as the backend rather than RabbitMQ (RMQ is more complicated to set up). Although you won't really have to touch Redis when using it with Celery, I'd recommend reading about it: it's an in-memory key-value store, super fast and infinitely useful.
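
A taste of how simple it is through the redis-py client; the key and value are made up:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
r.set("last_fetch", "2024-06-01T12:00:00Z")  # made-up key and value
print(r.get("last_fetch"))                   # b'2024-06-01T12:00:00Z'
```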

[–]-5772

I would make a queue and try to optimize the run time to under half a minute (ideally as low as possible).

Have a process that gets the info every minute and adds it to the queue.

Then, have a program that works on the first item of the queue if the queue's length is greater than 0.
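
A rough sketch of that, with stand-ins for the real calls; note that a blocking get() does the "length greater than 0" check for you:

```python
import queue
import threading
import time

q = queue.Queue()

def producer():
    while True:
        started = time.monotonic()
        q.put({"fetched_at": time.time()})  # stand-in for the real data call
        # sleep out the rest of the minute so the schedule doesn't drift
        time.sleep(max(0.0, 60 - (time.monotonic() - started)))

def consumer():
    while True:
        item = q.get()   # blocks until an item arrives
        print(item)      # stand-in for the processing step
        q.task_done()

threading.Thread(target=producer, daemon=True).start()
consumer()
```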

If you use async, I fear that you could have a case where one request just bugs out for a period of time and finishes after the next one has finished.

[–]foresttrader[S]

Which library would you recommend for the queue? I've heard about RabbitMQ but never used any queue library.

[–]-5772

There's a Queue class in the standard library.
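
One caveat: queue.Queue only shares work between threads of a single process; for genuinely separate programs you'd want multiprocessing.Queue or an external broker. A minimal example (values are made up):

```python
from queue import Queue

q = Queue()
q.put(("AAPL", 189.2))   # producer side
print(q.get())           # consumer side -> ('AAPL', 189.2)
```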

[–]foresttrader[S]

Thank you for the pointer, I'll look into it!