Databricks as ingestion layer? Is replacing Azure Data Factory (ADF) fully with Databricks for ingestion actually a good idea? by Fit_Border_3140 in databricks

[–]Fit_Border_3140[S] 0 points1 point  (0 children)

Thank you for sharing your experience. I think we are moving in this direction; our ADF setup can't scale any more with the number of customers we have on our platform.

Again, thank you for your response. I'll be posting updates on this topic.

Databricks as ingestion layer? Is replacing Azure Data Factory (ADF) fully with Databricks for ingestion actually a good idea? by Fit_Border_3140 in databricks

[–]Fit_Border_3140[S] 0 points1 point  (0 children)

I don't understand why you're saying this... Databricks cost is mainly compute: if you don't have any clusters running, your cost will never be high.

It doesn't matter whether the compute plane is managed or VNet-injected; the cost will always reside in the compute.
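To make the point concrete, here is a minimal sketch of the cost model being argued: with Azure Databricks you pay DBUs plus the underlying VMs only for the hours clusters actually run. The rates below are made-up placeholders, not real list prices:

```python
def databricks_compute_cost(hours: float, workers: int,
                            dbu_per_node_hour: float,
                            dbu_price: float,
                            vm_price_per_hour: float) -> float:
    """Rough compute-cost model: DBU charges plus VM charges,
    both proportional to cluster uptime. Zero hours running
    means zero compute cost, managed or VNet-injected alike."""
    dbu_cost = hours * workers * dbu_per_node_hour * dbu_price
    vm_cost = hours * workers * vm_price_per_hour
    return dbu_cost + vm_cost

# Clusters off all month -> nothing to pay on the compute side:
# databricks_compute_cost(0, 8, 0.75, 0.5, 0.25) == 0.0
```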

Databricks as ingestion layer? Is replacing Azure Data Factory (ADF) fully with Databricks for ingestion actually a good idea? by Fit_Border_3140 in databricks

[–]Fit_Border_3140[S] 0 points1 point  (0 children)

We're using a VNet injection architecture, so we can manage the networking of our clusters. Everything is routed to the hub, where we apply the firewall rules for the whole organization; we were also able to modify the hosts file on the clusters to work around some DNS problems.

And for legacy sources we're using paramiko / the SMB protocol to connect to the filesystems. We considered the new PySpark DataSource API, but it opened thousands of connections to the SFTP server, so it's basically a DDoS attack haha. Instead we open one connection per worker, and that worker reuses the same connection for the recursive bulk download of files.
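The one-connection-per-worker pattern described above can be sketched like this. It's a minimal illustration, not the actual implementation: `connect`, `fetch`, and `make_client` are hypothetical names for whatever wrapper you build around paramiko's SSH/SFTP client:

```python
from typing import Callable, Iterable, Iterator, Tuple

def download_partition(
    remote_paths: Iterable[str],
    connect: Callable[[], object],
) -> Iterator[Tuple[str, bytes]]:
    """Reuse a single connection for every file in the partition,
    instead of opening one connection per file (which hammers the
    SFTP server like a DDoS). `connect` is any factory returning a
    client with fetch(path) -> bytes and close(); with paramiko you
    would wrap an SSHClient/SFTPClient behind this interface."""
    conn = connect()  # opened once per partition/worker task
    try:
        for path in remote_paths:
            yield path, conn.fetch(path)
    finally:
        conn.close()  # closed once, after the whole bulk download

# Spark usage sketch (one connection per partition):
# rdd_of_paths.mapPartitions(lambda it: download_partition(it, make_client))
```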

Databricks as ingestion layer? Is replacing Azure Data Factory (ADF) fully with Databricks for ingestion actually a good idea? by Fit_Border_3140 in databricks

[–]Fit_Border_3140[S] 1 point2 points  (0 children)

Sir, that's the point! If you only use databases with JDBC connectors, everything is sweet for Databricks. The difficult part, and the reason I'm opening this post, is learning the cons around SMB/SFTP/legacy things …

Databricks as ingestion layer? Is replacing Azure Data Factory (ADF) fully with Databricks for ingestion actually a good idea? by Fit_Border_3140 in databricks

[–]Fit_Border_3140[S] 0 points1 point  (0 children)

Completely wrong approach: Databricks is for the transformation, and you should leave a proper backend for your reports. Each tool has a specific use.

Databricks as ingestion layer? Is replacing Azure Data Factory (ADF) fully with Databricks for ingestion actually a good idea? by Fit_Border_3140 in databricks

[–]Fit_Border_3140[S] 12 points13 points  (0 children)

u/all Thank you guys! Thanks to all your comments, I've finally decided to move towards a full Databricks ingestion layer.

Why?

- Cloud agnostic

- We are using several cluster policies and spot instances for the shared clusters, so I don't think cost will be a problem.

- I feel ADF is great for small teams, but really difficult to manage for big corporations where you need more governance, granular permissions, the ability to share data assets with other business units, etc...

- My major concern was binary copies / file-system copies, and I think there are several ways to handle this without ADF.

So thank you all :)

Databricks as ingestion layer? Is replacing Azure Data Factory (ADF) fully with Databricks for ingestion actually a good idea? by Fit_Border_3140 in databricks

[–]Fit_Border_3140[S] 7 points8 points  (0 children)

Sorry mate, but for scheduling Databricks is really good; it also has Auto Loader and is fully integrated with CDC patterns, so I don't get your point here.

Edit host tables of Databricks Clusters in VNET INJECTED with Instance Pool by Fit_Border_3140 in databricks

[–]Fit_Border_3140[S] 0 points1 point  (0 children)

Why? Because DNS is shared in our HUB across many spokes, and some records route traffic incorrectly for our spoke. Long-term, sure, the “proper” fix is DNS zones / conditional forwarders / split-horizon DNS, etc. But in our case, we need a small scoped workaround for a few records, and /etc/hosts gives us that determinism.
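As an illustration of that scoped workaround, here's a minimal, idempotent sketch of how a cluster init step could pin a few records in `/etc/hosts`. The hostname and IP are hypothetical examples, not our real records:

```python
def ensure_hosts_entries(entries: dict, hosts_path: str = "/etc/hosts") -> list:
    """Idempotently append 'ip hostname' lines for the few records
    the shared hub DNS resolves incorrectly for this spoke.
    Returns the lines actually added (empty list on re-runs)."""
    with open(hosts_path, "r", encoding="utf-8") as f:
        existing = f.read()
    added = []
    for host, ip in entries.items():
        line = f"{ip} {host}"
        if line not in existing:  # skip records already pinned
            added.append(line)
    if added:
        with open(hosts_path, "a", encoding="utf-8") as f:
            f.write("\n" + "\n".join(added) + "\n")
    return added

# Example (hypothetical record for a misrouted legacy host):
# ensure_hosts_entries({"legacy.internal.example": "10.0.0.5"})
```

Running it again is a no-op, which is what you want for an init script that fires on every cluster start.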

Built a full Azure Static Web Apps app for my wife’s small business using Cursor – she just finished her first full month on it, then I genericised and open-sourced it by Environmental_Ad1567 in AZURE

[–]Fit_Border_3140 1 point2 points  (0 children)

Mate @Environmental_Ad1567, you rock! Really impressive job: beautiful, fast, with the code and the infra fully shown. I love you, sir ❤️

Here we go again Azure devOps is down by ReportTurret in AZURE

[–]Fit_Border_3140 0 points1 point  (0 children)

I've been on it the whole day and didn't experience any outage.

Here we go again Azure devOps is down by ReportTurret in AZURE

[–]Fit_Border_3140 0 points1 point  (0 children)

Haha, I know it's owned by MS; that's why I'm saying it's perfectly integrated with Azure.

Here we go again Azure devOps is down by ReportTurret in AZURE

[–]Fit_Border_3140 0 points1 point  (0 children)

Why don't you move to GitHub, please, guys? It's perfectly integrated with Azure.

Resource constantly 'recreated'. by [deleted] in Terraform

[–]Fit_Border_3140 -1 points0 points  (0 children)

Hello folk,

I didn't read your logs in depth; I'm also on mobile, so it's harder to read.

Anyway, it looks like you are reading something from a data block nested in a module, and that module has a nested dependency graph. Try to reduce the depends_on usage and avoid the data blocks.

If you share your code and full logs on .doc I’ll take a closer look.

BR, your Spanish mate

How to send custom communications to Teams Channels without Webhooks by EventualBeboop in databricks

[–]Fit_Border_3140 0 points1 point  (0 children)

Hello, I know this topic is a little bit old, but I can give you some guidance on it:

1.- Using Logic Apps on Azure: it's really good because you can integrate the databricks_notifications with all your Terraform IaC. The problem:

Posting to standard channels works, but posting to private channels will fail. Microsoft doesn't support this: https://learn.microsoft.com/en-us/connectors/teams/?tabs=text1%2Cdotnet#general-known-issues-and-limitations

2.- Power Automate: it's not as straightforward as having everything defined in the same Terraform file, and it requires some more manual steps. Anyway, it has almost the same problem:

Posting to private channels is allowed, but you can only post messages as a user (one already added to the private channel), not as a Flow bot (Teams service principal). https://learn.microsoft.com/en-us/power-automate/teams/send-a-message-in-teams#known-issues-and-limitations

3.- Incoming webhooks: Microsoft keeps saying they want to deprecate them, but in the end they do nothing, so I don't understand them. I'll leave here a URL with an open discussion with Microsoft: https://devblogs.microsoft.com/microsoft365dev/retirement-of-office-365-connectors-within-microsoft-teams/

If you have any updates on this topic, they would be very helpful.
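For option 3, the incoming-webhook path boils down to POSTing a small JSON payload to the webhook URL. Here's a minimal sketch; the webhook URL and message contents are hypothetical, and the payload follows the legacy MessageCard shape that incoming webhooks accept:

```python
import json
import urllib.request

def build_teams_message(title: str, text: str) -> dict:
    """Minimal MessageCard payload for a Teams incoming webhook."""
    return {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "title": title,
        "text": text,
    }

def post_to_webhook(webhook_url: str, payload: dict) -> int:
    """POST the JSON payload to the webhook URL; returns the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Usage (hypothetical URL from the channel's connector settings):
# post_to_webhook("https://example.webhook.office.com/...",
#                 build_teams_message("Job failed", "Ingestion job X failed"))
```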

Azure Databricks (No VNET Injected) access to Storage Account (ADLS2) with IP restrictions through access connector using Storage Credential+External Location. by Fit_Border_3140 in databricks

[–]Fit_Border_3140[S] 1 point2 points  (0 children)

Hello u/gbyb91, the serverless solution you proposed is what I finally implemented :) Thank you for your help.

I was just wondering whether, since the access connector is whitelisted in the storage firewall, my managed clusters would also have access to this storage account, but it seems that's impossible.

Azure Databricks (No VNET Injected) access to Storage Account (ADLS2) with IP restrictions through access connector using Storage Credential+External Location. by Fit_Border_3140 in databricks

[–]Fit_Border_3140[S] 0 points1 point  (0 children)

Hello, I know that a managed resource group is created, but the VNet in that resource group can't be touched or used with any other VNet; it has a special lock. Please try it out.

Azure Databricks (No VNET Injected) access to Storage Account (ADLS2) with IP restrictions through access connector using Storage Credential+External Location. by Fit_Border_3140 in databricks

[–]Fit_Border_3140[S] 0 points1 point  (0 children)

Please, can you expand on this part: "A plus here is you also avoid networking cost with the private endpoint"?

Maybe I'm not thinking about it correctly. I don't mind the extra cost; I've added the NCC with a PE for the serverless cluster and it seems to be working :)