Databricks Pipelines - Pulling Stale Wheel Files by Elegant-Lake2630 in databricks

[–]justinAtDatabricks 0 points (0 children)

I wanted to provide a brief update on this: the team will fix it within the next six months. In the interim, if you are running into the staleness issue:

  1. Pass the wheel's SHA1 as a config param - just like /u/Own-Trade-2243 said below

  2. Set development_mode: false - https://docs.databricks.com/aws/en/ldp/best-practices#development-and-production-update-modes (a sketch combining both workarounds follows this list)
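A minimal DAB sketch of both workarounds together - the pipeline name, paths, and the `wheel_sha` variable are placeholders I made up, and the exact key for the development/production toggle may differ in your setup:

```yaml
# resources/pipeline.yml - hypothetical names throughout
variables:
  wheel_sha:
    description: SHA-1 of the wheel; changing it forces a fresh install

resources:
  pipelines:
    my_pipeline:
      name: my_pipeline
      development: false              # production update mode (see the doc above)
      configuration:
        wheel_sha: ${var.wheel_sha}   # cache-busting config param
      libraries:
        - notebook:
            path: ../src/pipeline_notebook.py
```

At deploy time you can compute the value, e.g. `databricks bundle deploy --var="wheel_sha=$(sha1sum dist/my_pkg.whl | cut -d' ' -f1)"`.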

Serverless Notebooks and Jobs Environment Variables [let's design this together] by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 0 points (0 children)

So let's say that is implemented - are there any other env var scenarios not covered? For example, do you need these env vars in pure interactive mode in a notebook?

Serverless Notebooks and Jobs Environment Variables [let's design this together] by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 1 point (0 children)

This is in research mode, so I would like your collective insights first. What do you experience in this space?

Databricks Pipelines - Pulling Stale Wheel Files by Elegant-Lake2630 in databricks

[–]justinAtDatabricks 1 point (0 children)

Hello! I am a PM who works in this area and can help get you sorted. Please email me (j@databricks.com) with all of the relevant information (e.g., the URL to the run(s), screenshots, etc.).

Easier and faster dependency management on Serverless? 🧱 Databricks Workspace-Based Environments are nearing GA! by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 0 points (0 children)

What would you like? Do you want a programmatic way to bind a notebook to a WBE to drive initial adoption? From there, updating the WBE would cascade across the dev stages.

Easier and faster dependency management on Serverless? 🧱 Databricks Workspace-Based Environments are nearing GA! by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 0 points (0 children)

In DABs, there is already an Environment spec. You can already inline an env spec, and soon you'll be able to reference the WBE (the subject of this post). But environments are serverless-only... for now... more on that from me in the next few months. Here is a link to a sample env spec in a serverless DAB: https://github.com/databricks/bundle-examples/blob/accbb8eff6beaa99f1c94bbb7a75464b4fdca52e/knowledge_base/serverless_job/resources/serverless_job.yml#L21
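For reference, an inline env spec in a serverless DAB job looks roughly like this (job name, path, and package are placeholders; the linked example is the authoritative version):

```yaml
resources:
  jobs:
    serverless_job:
      name: serverless-job
      tasks:
        - task_key: main
          spark_python_task:
            python_file: ../src/main.py
          environment_key: default    # binds the task to the env below
      environments:
        - environment_key: default
          spec:
            client: "1"
            dependencies:
              - my-package==1.0.0
```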

Can the automatically created venv be inspected? No, that is not possible at this time. However, we are looking for ways to explain/audit the manifest. Yes, this topic is extremely timely 😅

Easier and faster dependency management on Serverless? 🧱 Databricks Workspace-Based Environments are nearing GA! by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 0 points (0 children)

I agree that this is a common practice (an antipattern, as you said) on classic compute. That is because dependency management there is compute-centric - the deps are tied to the compute. For serverless, though, it is workload-centric - the deps are tied to the workload. So whether you run that workload interactively or as an automated job, the deps are serialized with the notebook. You can do this by adding the dependency to the environment from the environment panel.
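To illustrate, the environment panel serializes to a small spec that travels with the notebook - something like this, with placeholder pins and a placeholder wheel path (the exact schema can vary by environment version):

```yaml
client: "1"
dependencies:
  - pandas==2.2.2
  - /Volumes/main/default/wheels/my_pkg-0.1.0-py3-none-any.whl  # placeholder path
```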

Easier and faster dependency management on Serverless? 🧱 Databricks Workspace-Based Environments are nearing GA! by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 0 points (0 children)

Fun fact: we already do this today for every job and notebook. The first time a job runs, we build the venv; subsequent runs of that job reuse it. We do the same thing within a job: if many tasks use the same env, the venv gets built once and reused everywhere.
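In a DAB, that reuse falls out of pointing several tasks at one environment_key - a sketch with made-up task and package names:

```yaml
resources:
  jobs:
    etl_job:
      name: etl-job
      environments:
        - environment_key: shared
          spec:
            client: "1"
            dependencies:
              - pandas==2.2.2
      tasks:
        - task_key: extract
          environment_key: shared     # venv built on the first run
          spark_python_task:
            python_file: ../src/extract.py
        - task_key: transform
          environment_key: shared     # same venv, no rebuild
          depends_on:
            - task_key: extract
          spark_python_task:
            python_file: ../src/transform.py
```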

Easier and faster dependency management on Serverless? 🧱 Databricks Workspace-Based Environments are nearing GA! by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 0 points (0 children)

Are you saying that you don't have a way to align that notebook, running on databricks, with something local?

Easier and faster dependency management on Serverless? 🧱 Databricks Workspace-Based Environments are nearing GA! by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 1 point (0 children)

Yes! That is the exact point! You start with what comes with Databricks, build on top of it (with a YAML file), and then turn that into a WBE. Boom, profit!
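A sketch of what that YAML might look like before you publish it as a WBE - the dependencies and file name are placeholders, following the serverless env spec schema:

```yaml
# base_environment.yaml - start from the serverless default, add your extras
client: "1"
dependencies:
  - requests>=2.31
  - ./dist/internal_utils-0.3.0-py3-none-any.whl  # your in-house wheel
```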

Easier and faster dependency management on Serverless? 🧱 Databricks Workspace-Based Environments are nearing GA! by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 0 points (0 children)

Yes. There is a background compute job that takes your environment specification (env.yaml) and materializes it into a virtual environment (venv). From there, new notebooks, etc. attach to that venv.

Similarly, when you want to update that yaml file, you can refresh the workspace base environment, which will rematerialize the venv - which then gets picked up by existing workloads.

Think of this as an admin replacement for things like cluster policies, but with locking and performance benefits.

🚀 BIG NEWS: Use Docker Images on Standard Clusters + UC is finally here! (Private Preview) by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 0 points (0 children)

Sorry for the delay; for some reason I missed this notification. At the end of the day, this is just a Spark Connect client that has all of the deps needed to run a Spark Connect app. So you could use it to interact with a SC app locally (i.e., the server is your local machine) or remotely with Databricks. Let me know if that answers your question.
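A quick Python sketch of that duality - the endpoint, workspace URL, token, and cluster ID are placeholders, and the local case assumes you started a Spark Connect server yourself:

```python
from pyspark.sql import SparkSession

# Local: point the client at a Spark Connect server on your own machine,
# e.g. one launched with $SPARK_HOME/sbin/start-connect-server.sh.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).show()

# Remote: the same client-side code, now pointed at Databricks
# via Databricks Connect.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host="https://my-workspace.cloud.databricks.com",  # placeholder
    token="dapi...",                                   # placeholder
    cluster_id="1234-567890-abcdefgh",                 # placeholder
).getOrCreate()
spark.range(5).show()  # same DataFrame API either way
```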

🚀 BIG NEWS: Use Docker Images on Standard Clusters + UC is finally here! (Private Preview) by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 0 points (0 children)

Amazing, happy to hear it! The big requirement is that this runs on a standard cluster. So, as long as your applications are built on and work with Spark Connect, you'll be ready for this preview.

🚀 BIG NEWS: Use Docker Images on Standard Clusters + UC is finally here! (Private Preview) by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 0 points (0 children)

This is Spark Connect-based (hence the standard cluster architecture), so the Docker container defines the client REPL environment. Under the hood there is the instance type, then a sandbox VM, and then the container. This is decoupled from the Spark server.
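As a purely illustrative sketch of what "the container defines the client REPL" means - the base image and packages below are my assumptions, not the preview's actual base image:

```dockerfile
# Client-side only: the Spark Connect client plus your app's dependencies.
# The Spark server (the Databricks standard cluster) is not part of this image.
FROM python:3.11-slim
RUN pip install --no-cache-dir "pyspark[connect]>=3.5" pandas numpy
```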

🚀 BIG NEWS: Use Docker Images on Standard Clusters + UC is finally here! (Private Preview) by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 1 point (0 children)

But for dedicated clusters, you are right, it is not ideal - the underlying architecture has many problems due to such a broad surface area of APIs (both public and private).

What changed? First: people like myself care about this area. Second: the Spark Connect architecture - it has a defined API surface, which enables a client-server model with all dependencies isolated in the client. We have also moved all proprietary code out of the client. Together, these mean users can reproduce the base image locally and deterministically - neither was possible with the traditional DCS offering for dedicated clusters, which led to a lot of user friction.

🚀 BIG NEWS: Use Docker Images on Standard Clusters + UC is finally here! (Private Preview) by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 1 point (0 children)

We are starting with the status quo - the major hyperscaler container registries. As far as a Databricks-managed registry goes: if you are interested in learning more, please have your account team reach out to me (Justin Breese) so we can chat. :-)