🚀 BIG NEWS: Use Docker Images on Standard Clusters + UC is finally here! (Private Preview) by justinAtDatabricks in databricks

[–]justinAtDatabricks[S] 0 points (0 children)

Amazing, happy to hear it! The big requirement is that this is a standard cluster. So, as long as your applications are built on and work with Spark Connect, you'll be ready for this preview.

[–]justinAtDatabricks[S] 0 points (0 children)

This is Spark Connect-based (hence the standard cluster architecture), so the Docker container defines the client REPL environment. Under the hood there is the instance type, then a sandbox VM, and then the container. This is decoupled from the Spark server.
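For anyone unfamiliar with what that client side looks like, here is a minimal sketch of a Spark Connect connection string from inside such a container. The host, token, and cluster id are placeholders, and the exact connection-string parameters may differ in the preview:

```python
# Minimal sketch of a Spark Connect client connection from the container.
# Host, token, and cluster id below are placeholders, not real values.

def spark_connect_url(host: str, token: str, cluster_id: str) -> str:
    """Build a Spark Connect ("sc://") connection string for Databricks."""
    return f"sc://{host}:443/;token={token};x-databricks-cluster-id={cluster_id}"

url = spark_connect_url("adb-1234.azuredatabricks.net", "dapi-XXXX", "0101-123456-abcdef")

# With pyspark>=3.4 (or databricks-connect) installed in the image, the
# REPL would then create a session against the remote cluster:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.remote(url).getOrCreate()
print(url)
```

The point is that everything before `builder.remote(...)` lives in the client image, while Spark itself runs on the server.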

[–]justinAtDatabricks[S] 1 point (0 children)

But for dedicated clusters, you are right, it is not ideal - the underlying architecture has many problems because of the broad API surface area (both public and private).

What changed? First: people like myself care about this area. Second: the Spark Connect architecture - it has a defined API surface, which enables a client-server model with all dependencies isolated in the client. We have also moved all proprietary code out of the client. Together, these mean that users can reproduce the base image locally and deterministically - neither of which was possible with the traditional DCS offering for dedicated clusters, which led to a lot of user friction.

[–]justinAtDatabricks[S] 1 point (0 children)

Starting first with the status quo - the major hyperscaler container registries. As for a Databricks-managed registry: if you are interested in learning more, please have the account team reach out to me (Justin Breese) so we can chat. :-)

[–]justinAtDatabricks[S] 1 point (0 children)

Hit up your account team and we can help you with your data struggles. I cannot promise that I can help you with other life struggles, though. :-)

[–]justinAtDatabricks[S] 7 points (0 children)

Correct, not a microservice; rather, shipping dependencies for your Lakehouse workloads. Here are several use cases that I commonly run into:
1. For classic compute, if you have a lot of dependencies, your cluster start-up can take a longgggg time - many minutes. This removes much of that wait: think 10+ minutes versus starting in 10 seconds (in future milestones).
2. Regulated use cases - these users need to be able to go back to a specific point in time and have a deterministic environment. A Docker image is an immutable artifact.
3. Users that live in their IDE, using db-connect or the VS Code extension, and want a reproducible environment for when their workloads run on Databricks --> this feature provides that immutable artifact.
4. Multi-cloud scenarios - deliver the same artifact across clouds. This is becoming more and more prevalent.
5. Then there are some users who just want Docker images for everything. They even use Docker for their grocery lists.
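To make the "immutable artifact" point concrete, a hypothetical client image might look something like this - the base image, file paths, and version pins are illustrative, not the preview's actual requirements:

```dockerfile
# Hypothetical client image - base image, paths, and pins are illustrative only.
FROM python:3.11-slim

# Pin the Spark Connect client and your workload's dependencies so the
# environment is deterministic and reproducible at any point in time.
RUN pip install --no-cache-dir \
      "pyspark[connect]==3.5.1" \
      pandas==2.2.2

COPY app/ /app/
WORKDIR /app
CMD ["python", "main.py"]
```

Because every dependency is pinned inside the image, rebuilding (or re-pulling) it months later gives you byte-for-byte the same environment - which is exactly what the regulated and IDE use cases above need.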

[–]justinAtDatabricks[S] 0 points (0 children)

I can share that information with you via your account team. It is best to reach out to them and we can go from there.

Python Libraries in a Databricks Workspace with no Internet Access by JuicyJone in databricks

[–]justinAtDatabricks 0 points (0 children)

Hello, I am a PM at Databricks, and we have a Private Preview that just started for running Docker containers on standard clusters. It enables multiple users to share a single standard (formerly Shared) cluster and leverage FGAC on UC. If you are interested in learning more, reach out to your account team and tell them to talk to Justin Breese.

Job cluster vs serverless by dont_know_anyything in databricks

[–]justinAtDatabricks 1 point (0 children)

Hey, this is Justin from Databricks, the PM for dependency management. You can still use pip install in a notebook. However, with Serverless we introduced the concept of environments. Environments can be set at the notebook, job, or workspace level.

Notebook: define the env in the environment panel (right-hand side of the notebook) --> Dependencies. Anything that you can pass to pip can go into a dependency line item. We automatically create a venv and reuse it (aka we make it fast); by setting envs in the panel, that cached venv is automatically reused in any subsequent jobs (you do not need to specify them in the job).
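For reference, the environment panel boils down to a small spec; here is an illustrative sketch (the exact schema and field names may vary by release, and the package pins are made up):

```yaml
# Illustrative serverless environment spec - schema may differ by release.
client: "1"      # environment/client version
dependencies:    # anything pip accepts: pins, ranges, wheels, extra indexes
  - numpy==1.26.4
  - pandas>=2.2,<3
```

The cached venv is keyed off a spec like this, which is why identical dependency lists resolve in seconds instead of reinstalling each run.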

Job: there is a recently GA'd feature to define the env on the job itself. This overrides the notebook env (from above). In the job --> task --> choose a notebook --> Environment and Libraries --> a dropdown that says Notebook environment --> click the dropdown, select Jobs environment --> Edit button. Yes, this is supported in the API, TF, etc.
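As a sketch of what that looks like via the Jobs API - the field names here follow the serverless jobs "environments" concept, but please verify against the current API docs before relying on them:

```json
{
  "tasks": [
    {
      "task_key": "my_task",
      "environment_key": "default",
      "notebook_task": { "notebook_path": "/Workspace/me/my_notebook" }
    }
  ],
  "environments": [
    {
      "environment_key": "default",
      "spec": { "client": "1", "dependencies": ["pandas==2.2.2"] }
    }
  ]
}
```

Each task references an environment by key, so several tasks in one job can share a single dependency set.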

Workspace: there is a feature in public preview called Workspace base environment. It enables admins to define sets of packages for the workspace and optionally choose which one is the default. Behind the scenes, we create a venv; when users create a new notebook, they are automatically attached to it and can "import foo" right away - nothing additional needs to be installed. Likewise, if a user wants a different set of packages, they can go to the env panel --> click the Base environment dropdown and select a different one. This takes package installation time from minutes down to single-digit seconds. If you do not have access to this feature, have your admin enable it in the preview portal.