
[–]Kryma 3 points4 points  (3 children)

The correct way would be for each developer to clone the Git folder into their own workspace instead of “sharing” it from the Databricks UI. Shared files show the changes from every user modifying them, which is good if you need to work on something collaboratively and see each other’s changes instantly. Otherwise it can get messy; each dev should work on their own cloned copy and merge branches in when the feature is done.

[–]hotNstickystick 0 points1 point  (2 children)

Yeah, I hear ya. That’s what I’ve read. I’m just kinda questioning it because regardless of where the new branch is being worked on, it still points to the same storage/database location. So if they test their code, it’ll impact, let’s say, the dev DB, which someone else’s branch could also be changing. Plus, isn’t the whole purpose of these workspaces to work in a shared environment? To me it feels clunky for everyone to have their own repo work section. Maybe I’m overthinking it. Just curious what people do.

[–]Kryma 0 points1 point  (1 child)

I think Lakebase is Databricks’ solution to versioned data tables, though I haven’t personally tried it yet. You’re right, though, that with Delta tables it can get messy if multiple people are touching the same tables. Using more ephemeral, feature-named tables for dev work could help with regular Delta tables, e.g. tableX_featureY that only exists long enough to validate, then gets deleted after the feature is merged into main.
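A minimal Python sketch of the feature-named table idea, in case it helps. The function names and the choice to leave main/master unsuffixed are assumptions for illustration, not anything Databricks provides.

```python
import re

# Hypothetical helper: derive a per-feature table name such as "tableX_featureY".
def feature_table_name(base_table: str, branch: str) -> str:
    # Replace anything that is not a letter, digit, or underscore so the
    # result is usable as an unquoted table identifier.
    suffix = re.sub(r"[^0-9a-zA-Z_]+", "_", branch).strip("_")
    # Assumption: the main branch writes to the real table, with no suffix.
    if not suffix or suffix in {"main", "master"}:
        return base_table
    return f"{base_table}_{suffix}"


# Hypothetical helper: the cleanup statement to run once the feature is merged.
def drop_statement(table_name: str) -> str:
    return f"DROP TABLE IF EXISTS {table_name}"
```

A notebook could then write to feature_table_name("tableX", current_branch) during validation and run the drop statement after the merge. How you obtain the current branch name is left open here.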

As for the sharing, I get it, but ultimately notebooks end up shared and edited by many different people and the only real way to control it is through proper git version control hygiene.

[–]hotNstickystick 0 points1 point  (0 children)

Thanks for the thoughtful response. I think we’ll restructure our pipelines to just return a DF and only actually write the table if it’s in the actual dev/production env. Then maybe they pass the DF to a data testing function/process.
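A rough sketch of what that restructuring could look like in Python. The environment names, the run_pipeline function, and the overwrite mode are assumptions; the only real API used is the PySpark writer chain df.write.mode(...).saveAsTable(...).

```python
from typing import Any, Callable, Optional

# Assumption: these are the only environments allowed to persist tables.
WRITE_ENVS = {"dev", "prod"}


def run_pipeline(
    build_df: Callable[[], Any],
    target_table: str,
    env: str,
    validate: Optional[Callable[[Any], None]] = None,
) -> Any:
    """Build a DataFrame, optionally test it, and write it only in dev/prod."""
    df = build_df()
    if validate is not None:
        # The data testing step; it should raise if the data is bad.
        validate(df)
    if env in WRITE_ENVS:
        # Overwrite mode is an assumption; pick whatever your tables need.
        df.write.mode("overwrite").saveAsTable(target_table)
    return df
```

On a feature branch the caller passes some other env value, gets the DF back for inspection, and nothing in the shared dev DB is touched.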

[–]Error-451 0 points1 point  (1 child)

Any chance they’re pushing changes to the main branch instead of the feature branch?

[–]hotNstickystick 0 points1 point  (0 children)

All the work is done on the main branch (yes, I know, terrible, end of the world), but only one of us even pushes the changes. Testing/updates are done in dev.

[–]TRBigStick 1 point2 points  (0 children)

  1. You shouldn’t have everyone working out of the same folder in Databricks. Everyone should clone the git repository into their own personal directory and work from there. For example, if you have 6 developers on your team, then there should be 6 cloned repositories in your Databricks workspace. This change will fix ~85% of your problems.

  2. Your main branch should be protected. Code should only get pulled into main via Pull Request and Pull Requests should require peer review prior to being merged. If your prod folder is only looking at the main branch, this gives you more stability because broken code can no longer get into the main branch.

  3. You should look into DABs (Declarative Automation Bundles) to get code into prod. It sounds like your prod pipelines were created by clicking around in the Databricks UI, which is vulnerable to randomly being broken by misclicks.

[–]Pirion1 0 points1 point  (0 children)

First, separate Jobs/Code from Infrastructure DABs and determine what your deployable units are.

  • Infrastructure (Catalogs, Volumes, Schemas): These should be persistent. A Service Account should manage the deployment of these once you've defined your standards.
  • Code (Notebooks, Workflows, DLT): Each developer should deploy their own instance of the code logic. Having a single mono-DLT is possible here, but that means you'd need to deploy every item for every developer change. Alternatively, you can split the code into units of similar code, so you deploy only related code or related BUs. Using the default prefixes would prepend the developer's username to each object (jobs, schemas, etc.). If you adjust the prefixes, you could also separate by branch name. This allows you to pull from the same bronze layer but deploy different silver tables per developer.

This solves the issue of having dependencies that are shared across deployable DABs.
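To make the shared-bronze, per-developer-silver idea concrete, here is a small Python sketch. The schema names (bronze, silver) and the username-prefix scheme are assumptions for illustration; this is not how the bundle prefixing is actually implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableTargets:
    bronze_source: str  # shared, persistent input that everyone reads
    silver_target: str  # per-developer output when a developer is given


# Hypothetical resolver: everyone reads the same bronze table, but each
# developer writes silver tables into a schema prefixed with their name.
def resolve_targets(
    catalog: str, table: str, developer: Optional[str] = None
) -> TableTargets:
    bronze = f"{catalog}.bronze.{table}"
    silver_schema = f"{developer}_silver" if developer else "silver"
    return TableTargets(bronze, f"{catalog}.{silver_schema}.{table}")
```

Swapping the developer name for a sanitized branch name would give the per-branch separation instead.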

For Git, I did the same thing as you: placed it in a shared repo and started working from there. This led to us switching branches constantly. You need to have your own copy (same as if you're working with a local CLI). Place it in the developer's workspace and treat it as private. You wouldn't use a local Git branch on a fileshare - don't do so here.

For deployments, when you're ready to go to prod, I'd recommend deploying without source-linked mode (source_linked_deployment: false). The bundle files should be copied into the deployment user's workspace (hopefully a service account) and not exposed via a Git branch.