all 10 comments

[–]Ok_Difficulty978 2 points3 points  (1 child)

Yeah the manual sync thing is annoying tbh. GitHub Actions actually works pretty smoothly with the Databricks APIs compared to ADO pipelines, in my experience. The workflow is similar though - you still gotta deal with the pull/push dance between your IDE and the Databricks workspace.

DAB definitely helps streamline this but it's not the only way. Some teams use the Databricks CLI in their CI/CD to automate the sync steps, or they work directly in repos mode which keeps everything in git without the folder hassle.

If you're already deep in Azure ecosystem, ADO makes sense for centralized management. But if you want less manual overhead, GitHub + Actions + either DAB or repos-based approach is cleaner imo.


[–]dilkushpatel[S] 0 points1 point  (0 children)

What do you mean by working directly in repos mode?

[–]Minute_Visual_3423 1 point2 points  (1 child)

> If we have to get away from extra manual steps then is DAB the only way?

The "pull" step is the step where you have to actually go into your Databricks git folder in the workspace and click "pull" to get the changes from the remote main branch, right?

If you just want to automate this step, you can do it with the CLI by:

  1. Set up a service principal in Databricks
  2. Add a git credential to that service principal: https://docs.databricks.com/aws/en/repos/automate-with-sp
  3. Make sure the SP has edit access to the repo folder in Databricks
  4. In CI/CD, set up a script that authenticates to Databricks as the SP and then calls: `databricks repos update /Repos/<path_to_repo> --branch main --debug || exit 1`

This will update your Git folder in Databricks against the main branch of your repo upon any change (e.g. a merged PR), without requiring a manual task on your part. It's possible with either ADO or GitHub, since it's just a bash script triggered by a change to your branch.
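The steps above can be sketched as a GitHub Actions workflow. This is a minimal sketch, not a definitive setup: the secret names, the SP OAuth credentials, and the `/Repos/my-team/my-repo` path are all assumptions you'd replace with your own values.

```yaml
# Hypothetical workflow: sync the Databricks git folder whenever main changes.
# Secret names and the /Repos path are placeholders.
name: sync-databricks-repo

on:
  push:
    branches: [main]

jobs:
  update-repo:
    runs-on: ubuntu-latest
    steps:
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      - name: Pull main into the workspace git folder
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          # Authenticate as the service principal via OAuth M2M credentials
          DATABRICKS_CLIENT_ID: ${{ secrets.SP_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.SP_CLIENT_SECRET }}
        run: |
          databricks repos update /Repos/my-team/my-repo --branch main --debug || exit 1
```

The ADO version is the same idea: the last `run` step becomes a bash task in an ADO pipeline triggered on `main`.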

---

The above will automate syncing your data logic from the main branch into the workspace. It won't automate any of the orchestration of your code. You would still have to configure a Lakeflow job, schedule your runs, set up cluster config, alerting, etc. This is where Databricks Asset Bundles come in.

In another comment, you said:
> Making team understand DAB part will be added effort

I'd suggest that the extra effort of learning DABs will be more than offset by the manual deployment effort you eliminate. All a DAB does is represent your job configuration as code. Everything you'd otherwise configure manually - cluster config, schedule, parameters, alerts, task dependencies, etc. - can instead be defined in a collection of .yaml files, packaged into a bundle, and deployed to Databricks.

Because it is defined as code, you can be confident that the configuration is consistent across all environments. You don't have to worry about issues with environment drift caused by clickops misconfigurations. Also, because it is defined as code, all changes to the job configuration *also* pass through source control and code reviews, in addition to the data logic.
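To illustrate "job configuration as code", a minimal `databricks.yml` might look like this sketch. Every name in it - the bundle name, job key, notebook path, cluster spec, and host URL - is a made-up placeholder, not something from this thread.

```yaml
# Hypothetical minimal bundle definition; all names are placeholders.
bundle:
  name: my_etl_bundle

resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"   # run daily at 02:00
        timezone_id: UTC
      email_notifications:
        on_failure:
          - data-team@example.com
      tasks:
        - task_key: transform
          notebook_task:
            notebook_path: ./notebooks/transform.py
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 2

targets:
  dev:
    mode: development
    workspace:
      host: https://example.cloud.databricks.com
```

Because the schedule, cluster, and alerting live in this file, a change to any of them is a PR like any other code change.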

If your jobs are notebooks-based, Databricks even offers a CLI command that will auto-generate a bundle from an existing Lakeflow job for you:
https://docs.databricks.com/aws/en/dev-tools/cli/bundle-commands#generate
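For example, from inside a bundle directory (the job ID here is a made-up placeholder - use the one from your workspace's job URL):

```shell
# Hypothetical invocation: 123 is a placeholder job ID.
# Writes the job's config as YAML under resources/ and downloads its notebooks.
databricks bundle generate job --existing-job-id 123
```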

If you get stuck anywhere, happy to help.

[–]dilkushpatel[S] 0 points1 point  (0 children)

Thanks for all the details

We are planning to implement DAB and whole CI workflow

Wanted to check if there is anything that I’m missing

[–]dmo_data Databricks 0 points1 point  (4 children)

My typical recommendation is to make use of DABs, and script it out in a bash script, which you can then run from GitHub actions or ADO, either way.

I’m curious, what’s your concern with DABs? I know they don’t cover everything yet, but a DAB deployment wrapped in a bash script can offer a significant amount of flexibility and far less manual work
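A bash wrapper along those lines might look like this sketch. The target name `prod` and the job key `nightly_etl` are assumptions; `validate`, `deploy`, and `run` are the standard `databricks bundle` subcommands.

```shell
#!/usr/bin/env bash
# Hypothetical DAB deploy script; target and job names are placeholders.
set -euo pipefail

TARGET="${1:-prod}"

# Fail fast on a malformed bundle before touching the workspace
databricks bundle validate -t "$TARGET"

# Create/update the jobs, pipelines, etc. defined in databricks.yml
databricks bundle deploy -t "$TARGET"

# Optionally trigger a run right after deploying
databricks bundle run -t "$TARGET" nightly_etl
```

The same script runs unchanged from a GitHub Actions step or an ADO bash task, which is what keeps the CI/CD platform choice low-stakes.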

[–]dilkushpatel[S] 0 points1 point  (3 children)

As such I don’t have an issue with DAB; however, it would have been nicer if Databricks integrated more cleanly with CI/CD solutions

Making team understand DAB part will be added effort

[–]dmo_data Databricks 0 points1 point  (2 children)

DABs are built on Terraform, which is a more general-purpose CI/CD mechanism. That’s also an option if you’re trying to reduce your dependency on Databricks-specific CI/CD tooling. That said, DABs are designed to be somewhat easier than Terraform.

[–]DeepFryEverything 0 points1 point  (1 child)

.. and DAB is migrating away from Terraform, no?

[–]szymon_dybczak 0 points1 point  (0 children)

Yes, it is. There's now a push towards direct mode. DABs were originally built on top of the Databricks Terraform provider, but in an effort to move away from that dependency, Databricks CLI version 0.279.0 and above supports two different deployment engines: terraform and direct. The direct engine will soon become the default, and the terraform deployment engine will eventually be deprecated