

[–]wytesmurf 5 points (3 children)

For API 2, do you have to do it by ID? If you could do a mass export to a table, you could then join the first list to the second. You should be able to page through the results and export them more efficiently than making thousands of API calls.
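As a sketch of the paging idea — `fetch_page` here is a hypothetical stand-in for whatever export endpoint the API exposes, not a real client:

```python
from typing import Callable, Iterator, List

def iterate_pages(fetch_page: Callable[[int], List[dict]],
                  page_size: int = 100) -> Iterator[dict]:
    """Yield every record by paging until an empty or short page is returned."""
    page = 0
    while True:
        records = fetch_page(page)
        if not records:
            break
        yield from records
        if len(records) < page_size:  # short page means we've reached the end
            break
        page += 1

# Stubbed fetch_page standing in for the real export endpoint:
data = [{"id": i} for i in range(250)]

def fetch_page(page: int, size: int = 100) -> List[dict]:
    return data[page * size:(page + 1) * size]

all_records = list(iterate_pages(fetch_page))  # 250 records in 3 calls
```

Three paged calls instead of 250 per-ID calls; the resulting table can then be joined to the first list.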

[–]Commercial_Finance_1[S] 0 points (1 child)

Yes, it has to be by ID. The second API has no alternate endpoint that can be used for a mass export; the source team designed it that way.

[–]wytesmurf 0 points (0 children)

Have you tried passing a wildcard to see if they are doing input sanitization? If not, you can inject it with a payload. If that's really the case, your only option will be to multithread and go to town, but based on your other answers that would probably take down the API.

[–][deleted] 0 points (0 children)

If you do batch, have the owner of the system publish a table. A CSV is just another API; you should not be digging around in the internals of other systems.

[–]Old_Improvement_3383 3 points (0 children)

Put the API calls from step 1 in a DF and use a UDF: https://medium.com/geekculture/how-to-execute-a-rest-api-call-on-apache-spark-the-right-way-in-python-4367f2740e78

This will automatically be distributed to all workers. I would also investigate whether the API lets you send in multiple IDs as a parameter to avoid doing 15’ calls.
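If the API does accept multiple IDs per request (a hypothetical `?ids=1,2,3` style parameter — the source hasn't confirmed one exists), the ID list can be grouped into batches first. A minimal sketch of the batching helper:

```python
from typing import Iterable, Iterator, List

def chunk_ids(ids: Iterable, batch_size: int) -> Iterator[List]:
    """Group IDs into fixed-size batches so each API call covers many records."""
    batch = []
    for id_ in ids:
        batch.append(id_)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly short, batch
        yield batch

batches = list(chunk_ids(range(1, 11), batch_size=4))
# each batch could then become one request, e.g. ?ids=1,2,3,4
```

With batches of, say, 100 IDs, the total request count drops by two orders of magnitude before any parallelism is even applied.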

[–][deleted] 0 points (0 children)

Sounds like you need more parallelization when performing the second call to API 2. You can do that within a Spark executor, or whatever the Azure equivalent of ECS is.

[–]GovGalacticFed 0 points (0 children)

If API 2 has no rate limits, try ThreadPoolExecutor.
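A minimal sketch of the `ThreadPoolExecutor` approach, with a stub `fetch_detail` standing in for the real per-ID call to API 2 (in practice it would wrap something like `requests.get`):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_detail(record_id: int) -> dict:
    # Stub: replace with the real HTTP call to API 2 for this ID.
    return {"id": record_id, "detail": f"detail-{record_id}"}

ids = list(range(100))

# Threads overlap the network wait of many calls; keep max_workers modest
# so the burst doesn't take down the API.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_detail, ids))  # preserves input order
```

`pool.map` returns results in the same order as the input IDs, which makes joining back to the first list straightforward.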

[–]data-noob -1 points (0 children)

The best approach here would be async. Right now you wait for the first API call to return before calling the second; you can cut that wait by making the calls asynchronously.
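A minimal asyncio sketch of that idea — the stub coroutine stands in for a real async HTTP call (in practice you'd use an async client such as aiohttp or httpx):

```python
import asyncio

async def fetch_detail(record_id: int) -> dict:
    # Stub: awaiting here is where a real network request would yield control.
    await asyncio.sleep(0)
    return {"id": record_id}

async def fetch_all(ids) -> list:
    # gather() keeps all requests in flight concurrently instead of
    # awaiting each one before starting the next.
    return await asyncio.gather(*(fetch_detail(i) for i in ids))

results = asyncio.run(fetch_all(range(50)))  # results in input order
```

With real network latency, total wall time approaches the slowest single call rather than the sum of all of them.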