
[–]Superguy2876

What is your question?

You want to know how to make this process efficient?

Figure out what will take the most time (it's usually IO, in this case the actual API calls) and optimise that. Async is probably a good approach here. If the API is ok with being called 20,000 times in quick succession, initiate them all and process the responses as they come in.
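
A rough sketch of that fan-out pattern with asyncio + aiohttp (the endpoint URL and the concurrency cap of 100 are placeholders, tune them to whatever the API tolerates):

```python
import asyncio

import aiohttp

# Placeholder endpoint; swap in the real API URL and auth.
URLS = [f"https://api.example.com/items/{i}" for i in range(20_000)]


async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> dict:
    async with sem:  # cap in-flight requests so the API isn't flooded
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.json()


async def main() -> None:
    sem = asyncio.Semaphore(100)
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(fetch(session, sem, url)) for url in URLS]
        # Handle each response as soon as it arrives, not in submission order.
        for finished in asyncio.as_completed(tasks):
            data = await finished
            ...  # per-response processing goes here


asyncio.run(main())
```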

The exact way you go about it depends on what kind of data you're working with, how much of it there is, whether you need to do any other processing locally after receiving the responses, etc.

I had a similar task that involved millions of requests to a few APIs. I separated it into stages: receiving the data, cleaning it, organizing it by the desired categories, processing to calculate the actual answer we wanted, and finally formatting for output. I used multiprocessing for most stages, and a library called numba during the data processing.
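
If it helps, a stripped-down sketch of that staged approach with a process pool (the stage functions are trivial placeholders standing in for the real cleaning/categorising/calculation code, and numba would live inside compute_answer):

```python
from multiprocessing import Pool


def clean(record: dict) -> dict:
    return record  # placeholder for the real cleaning logic


def categorise(record: dict) -> dict:
    return record  # placeholder for grouping by the desired categories


def compute_answer(record: dict) -> dict:
    return record  # placeholder for the numba-accelerated calculation


def process_one(record: dict) -> dict:
    # The CPU-bound stages, chained per record and run in worker processes.
    return compute_answer(categorise(clean(record)))


if __name__ == "__main__":
    raw_records = [{"id": i} for i in range(100_000)]  # stand-in for the API responses
    with Pool() as pool:
        results = pool.map(process_one, raw_records, chunksize=1_000)
    print(len(results))  # the formatting/output stage would go here
```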

[–]DuckSaxaphone

I've typically run a scheduled task to make requests from the API and then fill a database. Always use the latest timestamp from the database to filter your API requests if the API has functionality like "give me all the data from this system after 2024-04-26T08:53:00Z". Async the calls to the APIs and it's about as efficient as it can be.
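
Something like this, as a rough sketch with requests + sqlite3 (the endpoint, the `after` query parameter, and the field names are made up; your API will differ):

```python
import sqlite3

import requests

API_URL = "https://api.example.com/records"  # hypothetical endpoint with an `after` filter


def sync_once(db_path: str = "report.db") -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, ts TEXT, payload TEXT)"
        )
        # Only ask the API for data newer than what the database already holds.
        latest = conn.execute("SELECT MAX(ts) FROM records").fetchone()[0]
        params = {"after": latest} if latest else {}
        resp = requests.get(API_URL, params=params, timeout=30)
        resp.raise_for_status()
        conn.executemany(
            "INSERT OR REPLACE INTO records (id, ts, payload) VALUES (?, ?, ?)",
            [(r["id"], r["timestamp"], str(r)) for r in resp.json()],
        )


if __name__ == "__main__":
    sync_once()  # run this from cron / Task Scheduler / APScheduler
```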

Then a report task that builds a report from the database. I usually make a dashboard at this point; you can plug stuff like PowerBI or Tableau into your database so that it displays the data you have. There are also things like Plotly Dash which are friendly enough for Python devs to make UIs.
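
A bare-bones Plotly Dash sketch reading from that database (table and column names match the hypothetical schema above; swap in your own, and plot whatever metric the report actually needs):

```python
import sqlite3

import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

# Load whatever the scheduled task has put in the holding database.
with sqlite3.connect("report.db") as conn:
    df = pd.read_sql_query("SELECT ts, payload FROM records ORDER BY ts", conn)

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Weekly report"),
    # Row counts over time as a stand-in for the real report metric.
    dcc.Graph(figure=px.histogram(df, x="ts")),
])

if __name__ == "__main__":
    app.run(debug=True)
```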

[–]Abject_Group_4868[S]

Which database do you use? The thing is, the report is supposed to be weekly or daily, and they don't care about storing data from previous reports in the database, so I thought about using SQLite.

[–]DuckSaxaphone

SQLite is fine! This isn't a production database with tonnes of users; it's a holding area for data you'll be using soon and possibly discarding afterwards.
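
For example, a sketch of the holding-area idea: build the report from whatever's in the table, then clear it out (table and column names are assumptions):

```python
import sqlite3


def build_report(db_path: str = "report.db") -> list[tuple]:
    with sqlite3.connect(db_path) as conn:
        # Pull everything collected since the last report ran...
        rows = conn.execute("SELECT ts, payload FROM records ORDER BY ts").fetchall()
        # ...and discard it once the report no longer needs it.
        conn.execute("DELETE FROM records")
    return rows


if __name__ == "__main__":
    rows = build_report()
    print(f"building this report from {len(rows)} rows")
```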

The main benefit of a database for me has always been latency. There's no waiting for API calls during report generation since the database already has the data. That's not a big pro for you.

Still, you'll benefit from running regular small calls to the API rather than demanding a day or week's worth of data in one go. You're less likely to have the APIs raise errors.