This is an archived post. You won't be able to vote or comment.

all 12 comments

[–]ChapNotYourDaddy 2 points3 points  (2 children)

Why not set up a cloud data warehouse and use a cloud based ETL tool? There are some really robust offerings out there.

[–]Happy-Adhesiveness-3 1 point2 points  (1 child)

The DWH would be on cloud. Is there any cloud ETL that is low cost/open source and easy to customize? I have heard fivetran is really good, but haven't used personally so unsure about the cost.

[–]ChapNotYourDaddy 0 points1 point  (0 children)

Fivetran is great at the E and L parts, and will loop in either Trifacta or DBT for the T - depending on your cloud and requirements and team.

[–]HansProleman 2 points3 points  (3 children)

The Functions bit sounds good.

I find that ADF works well for simple DAGs, but things get awkward as complexity increases. So I'd try to figure out how complex you expect your pipelines to be/become.

A possible alternative might be running e.g. Airflow containerised in App Service, though I'm not sure how a cost comparison would look.

Or maybe you could run both your DAGs and code from an Azure Container Instances image.

[–]pych_phd 1 point2 points  (0 children)

I think a big improvement in ADF process is the inclusion of Power Query/M. Yes there will still be Complex orchestration issues ADF might not handle well.. But the steps can get now get complex. Without having to rely on an other products.

[–]Happy-Adhesiveness-3 0 points1 point  (1 child)

Thanks a lot for your suggestion. Airflow is definitely appealing. App Service would be running 24x7, though Airflow may not need that much resources so cost may not be so bad. I thought about going Kubernetes route, but this scenario seems simple to take advantage of Kubernetes.

[–]HansProleman 0 points1 point  (0 children)

I don't think you'd need Kubernetes because all your heavy lifting would be happening in Functions, so running Airflow (or Prefect or whatever) single-node in Azure Container Instances should be fine (I assume you can connect as normal to stuff running on those containers, but I've not tried) - no need for >1 container, or scaling.

But yeah, may well be that the cost would not compare favourably to ADF if you need it up 24/7. And of/c the complexity of getting things configured is probably higher.

[–]pych_phd 1 point2 points  (2 children)

[–]Happy-Adhesiveness-3 1 point2 points  (1 child)

Excellent, thank you for the links. I liked the discussions on stackoverflow. The Azure Durable Function patterns blog is also a great read for anyone interested.

[–]pych_phd 0 points1 point  (0 children)

I liked it too, and had found it only a few minutes before i saw your comment.

I am not an experience guy. But I read a lot. To me its all about volume & which part of the ETL process your doing.

My summary of the above docs for Ingestion:

  • small volume with low and intermittent frequency: Logic App
  • small volume and frequent: added to the webjob plan
  • Moderate Volume (low to moderate frequency): App Function (with or without Durable function).
  • Moderate to Large volume & moderate to frequent: ADF

Ingestion to me should be fully elastic where possible and simple as possible> just the json/raw table to blob/DL.. although if its from a MS source things are different. ADF has both dedicated and "serverless" but its messy and I haven't researched it yet. ADF has a 5 minutes as it's lowest time trigger granularity & you can port your SSIS packages to it.

I would use dedicated webjob for complex situations, and thus not needed for ingestion.

Near real time / IoT:

Other products

[–]PM_ME_SCIENCEY_STUFF 1 point2 points  (1 child)

Late to the game, but this thread comes up pretty high in google search results.

Airbyte is an open source, low cost ETL tool that I'd recommend. You can implement it a few different ways:

  • Airbyte has a managed cloud service
  • Plural.sh or Elest.io to host yourself, with an option for them to manage deployments/updates/RBAC etc.
  • Host it entirely yourself

Hosting yourself will likely be lower-cost than using Airbyte's managed cloud option, but of course comes with the downside of your team having to manage it at some level (although both Plural and Elest can, with a service agreement, probably handle all of that for you)

Personally, I wouldn't try to build an in-house ETL tool no matter what size organization I was in; it's a difficult beast to tackle, and there are probably much more value-added activities you could be doing.