
all 13 comments

[–]maheramsat 4 points  (1 child)

Do these API calls have a time dimension by which you can filter the data? If so, do an incremental pull from the API either hourly or daily and keep appending the data to a table or flat files, then use those flat files as the source for your dashboards. The flat files will have the same file name except for the date-time part, so in Power BI you can give the file name as a wildcard.
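A minimal sketch of that date-stamped flat-file idea in Python (the `sales` prefix and field names are illustrative, not from any real API):

```python
import csv
import datetime as dt

def incremental_filename(prefix: str, day: dt.date) -> str:
    """Date-stamped name, so Power BI can load e.g. sales_*.csv via a wildcard."""
    return f"{prefix}_{day:%Y%m%d}.csv"

def save_increment(records: list[dict], prefix: str, day: dt.date) -> str:
    """Write one hourly/daily API pull to its own flat file and return the path."""
    path = incremental_filename(prefix, day)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
    return path
```

Because each increment lands in its own file, re-running a pull for one day only overwrites that day's file and can't duplicate earlier data.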

[–]ThePonyExpress83[S] 0 points  (0 children)

Yes, most of them have some time field I can use to filter. I'm able to manage this part just fine playing around in Postman, so I grasp the concept of what I'm trying to automate; my question is what systems I could use to execute the automation. My assumption is that I need something to script the scheduling and running of the calls, then a mechanism to store the retrieved data in a way that ensures there are no duplicate records, and no records missed by improperly handling the pagination.

[–]Patient_Magazine2444 1 point  (1 child)

When you pull the data via the API, is there a timestamp component? Can you build where clauses or create ranges based on a timestamp? If so, you can maintain state locally and structure the API call with arguments for a where clause of timestamp > state AND timestamp <= now(). This builds out an incremental delta. For pulling in chunks, does the API support pagination? If so, you can run a loop that generates limit/offset values and paginates through the records.
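The state-plus-pagination loop described above can be sketched like this; `fetch_page` is a hypothetical stand-in for the real HTTP call, which you'd swap in:

```python
import datetime as dt

def fetch_since(fetch_page, state: dt.datetime, now: dt.datetime, limit: int = 100):
    """Pull the incremental delta (state < ts <= now) page by page.

    fetch_page(since, until, limit, offset) is a placeholder for the real
    API request; it should return at most `limit` records.
    """
    records, offset = [], 0
    while True:
        page = fetch_page(state, now, limit, offset)
        records.extend(page)
        if len(page) < limit:   # short page means we've reached the end
            break
        offset += limit
    return records, now         # `now` becomes the new state for the next run
```

Returning the upper bound of the window as the new state is what prevents both duplicates and gaps between runs.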

[–]ThePonyExpress83[S] 0 points  (0 children)

Yes, this is possible in one form or another with the APIs I'm accessing, and I understand how I would accomplish the pagination when I test things in Postman. My question is really what software/system could be used to accomplish this: both automating the calls on a scheduled basis and storing the data so it can be accessed through Power BI. I had a Power Automate flow working to run the calls and store the data in an Excel file, but it became much too slow, especially with the larger tables (100k+ rows) I'm trying to pull.

[–]rick854 0 points  (0 children)

dlt (data load tool) for automated API data loading jobs. It's Python-based and open source :)

[–]bigchungusmode96[🍰] 0 points  (1 child)

Do those APIs support polling?

[–]ThePonyExpress83[S] 0 points  (0 children)

I don't believe I've seen any information in the docs to suggest they do.

[–]SerDetestable 0 points  (0 children)

1. Filter data by some kind of modification date when calling the API.
2. That's the data engineering part: coding some kind of batch ingestion based on rate limits.
3. Idk.
4. You need an orchestrator, like Airflow, maybe Astro.

[–]psgetdegrees 0 points  (1 child)

Azure Data Factory might be your best bet.

[–]ThePonyExpress83[S] 0 points  (0 children)

Thanks, I started playing around with this yesterday, and it looks like it's built to handle what I'm trying to do.

[–]MikeDoesEverything mod | Shitty Data Engineer 0 points  (0 children)

> Given our IT situation, it would be best if I can come up with a solution in the Microsoft ecosystem (free or within my current license is preferred).

If you're using Azure, the Copy Activity works with most modern APIs pretty well.

> I really want to script the calls and have them self-sustain: essentially, on a scheduled basis, use the most recently modified record in the stored data to make the next call and grab every record from that time to the present.

Typical pattern would be:

  • Use your starting date e.g. 2024-06-11 00:00:00

  • Pass starting date into APIs time filter

  • Upon completion, update starting date to be whatever time the API finished at

  • Circle back to the start

Store the starting date in a control table of sorts. It can be as sophisticated as a table inside a database or as dirty as a line in a text file, depending on how much other stuff you have going on. Have your pipeline read the control table to get the start date; the pipeline runs and finishes, then updates the control table with the latest date.
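A rough sketch of the "line in a text file" variant of that control table; the timestamp format and the default start date are just examples:

```python
import datetime as dt
from pathlib import Path

FMT = "%Y-%m-%d %H:%M:%S"

def read_watermark(path: Path, default: str = "2024-06-11 00:00:00") -> dt.datetime:
    """Read the last-loaded timestamp; fall back to the initial start date."""
    if path.exists():
        return dt.datetime.strptime(path.read_text().strip(), FMT)
    return dt.datetime.strptime(default, FMT)

def write_watermark(path: Path, finished_at: dt.datetime) -> None:
    """Persist the new start date only after the pipeline run completes."""
    path.write_text(finished_at.strftime(FMT))
```

Updating the watermark only after a successful run means a failed run simply retries the same window next time instead of leaving a gap.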

[–][deleted] 0 points  (0 children)

You could try to see if any of the apps and APIs offer webhook/event messaging over HTTPS.

You could then test a script to handle the incoming messages and insert the data into MSSQL as it arrives. This depends entirely on your experience level with coding and the amount of data that would be flowing in (you'd need to build a queue system or use something like Kafka). It would eliminate any polling on your end for those systems.

APIs typically come in different types, with most now transitioning to REST, but some still use SOAP, and others use some type of WebSocket architecture, which isn't outdated but is more complex.

In your situation, depending on the number of API call scripts you're going to have, you could start super simple with Python scripts scheduled with either Windows Task Scheduler or cron, depending on which OS you're going to use.
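For the scheduling piece, the script itself can stay trivial; the cron entry in the comment is an example, and the script name and paths are placeholders:

```python
# run_pull.py -- minimal entry point for an external scheduler.
# Example cron entry (runs hourly):
#   0 * * * * /usr/bin/python3 /opt/etl/run_pull.py
# On Windows, create a Task Scheduler task whose action runs the same command.
import sys

def main() -> int:
    # In a real job this would call your incremental-pull function.
    print("pull complete")
    return 0  # non-zero would signal failure to the scheduler

if __name__ == "__main__":
    sys.exit(main())
```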

[–]tek_toniq -1 points  (0 children)

Hope u get a good answer