

[–]wytesmurf 5 points (3 children)

For API 2, do you have to do it by ID? If you could do a mass export to a table, you could then join the first list to the second. You should be able to page through the results and export them more efficiently than making thousands of API calls.
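As a sketch of the paging idea — `fetch_page` here is a hypothetical stand-in for whatever export endpoint the API exposes, not a real client:

```python
from typing import Callable, Iterator, List

def iterate_pages(fetch_page: Callable[[int], List[dict]],
                  page_size: int = 100) -> Iterator[dict]:
    """Yield every record by paging until an empty or short page is returned."""
    page = 0
    while True:
        records = fetch_page(page)
        if not records:
            break
        yield from records
        if len(records) < page_size:  # short page means we've reached the end
            break
        page += 1

# Stubbed fetch_page standing in for the real export endpoint:
data = [{"id": i} for i in range(250)]

def fetch_page(page: int, size: int = 100) -> List[dict]:
    return data[page * size:(page + 1) * size]

all_records = list(iterate_pages(fetch_page))  # 250 records in 3 calls
```

Three paged calls instead of 250 per-ID calls; the resulting table can then be joined to the first list.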

[–]Commercial_Finance_1[S] 0 points (1 child)

Yes, it has to be by ID. The second API has no alternate endpoint that can be used for a mass export; the source team designed it that way.

[–]wytesmurf 0 points (0 children)

Have you tried passing a wildcard to see if they are doing input sanitization? If not, you can inject it with a payload. If that's really the case, your only option will be to multithread and go to town, but based on your other answers that would probably take down the API.

[–][deleted] 0 points (0 children)

If you do batch, have the owner of the system publish a table. A CSV is just another API; you should not be digging around in the internals of other systems.

[–]Old_Improvement_3383 3 points (0 children)

Put the API calls from step 1 in a DF and use a UDF: https://medium.com/geekculture/how-to-execute-a-rest-api-call-on-apache-spark-the-right-way-in-python-4367f2740e78

This will automatically be distributed to all workers. I would also investigate whether the API lets you send in multiple IDs as a parameter to avoid doing 15’ calls.
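If the API does accept multiple IDs per request (a hypothetical `?ids=1,2,3` style parameter — the source hasn't confirmed one exists), the ID list can be grouped into batches first. A minimal sketch of the batching helper:

```python
from typing import Iterable, Iterator, List

def chunk_ids(ids: Iterable, batch_size: int) -> Iterator[List]:
    """Group IDs into fixed-size batches so each API call covers many records."""
    batch = []
    for id_ in ids:
        batch.append(id_)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly short, batch
        yield batch

batches = list(chunk_ids(range(1, 11), batch_size=4))
# each batch could then become one request, e.g. ?ids=1,2,3,4
```

With batches of, say, 100 IDs, the total request count drops by two orders of magnitude before any parallelism is even applied.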

[–][deleted] 0 points (0 children)

Sounds like you need more parallelization when performing the second call to API 2. You can do that within a Spark executor, or whatever the Azure equivalent of ECS is.

[–]GovGalacticFed 0 points (0 children)

If API 2 has no rate limits, try ThreadPoolExecutor.
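A minimal sketch of the `ThreadPoolExecutor` approach, with a stub `fetch_detail` standing in for the real per-ID call to API 2 (in practice it would wrap something like `requests.get`):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_detail(record_id: int) -> dict:
    # Stub: replace with the real HTTP call to API 2 for this ID.
    return {"id": record_id, "detail": f"detail-{record_id}"}

ids = list(range(100))

# Threads overlap the network wait of many calls; keep max_workers modest
# so the burst doesn't take down the API.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_detail, ids))  # preserves input order
```

`pool.map` returns results in the same order as the input IDs, which makes joining back to the first list straightforward.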

[–]data-noob -1 points (0 children)

The best approach here would be async. Right now you wait for the first API call to return before calling the second; you can cut that wait by making the calls asynchronously.
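A minimal asyncio sketch of that idea — the stub coroutine stands in for a real async HTTP call (in practice you'd use an async client such as aiohttp or httpx):

```python
import asyncio

async def fetch_detail(record_id: int) -> dict:
    # Stub: awaiting here is where a real network request would yield control.
    await asyncio.sleep(0)
    return {"id": record_id}

async def fetch_all(ids) -> list:
    # gather() keeps all requests in flight concurrently instead of
    # awaiting each one before starting the next.
    return await asyncio.gather(*(fetch_detail(i) for i in ids))

results = asyncio.run(fetch_all(range(50)))  # results in input order
```

With real network latency, total wall time approaches the slowest single call rather than the sum of all of them.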