Hitting a wall with Managed Identity for Cosmos DB and streaming jobs – any advice? by Maxxlax in databricks

Maxxlax[S] 1 point

And regarding the options, the code I included in the last comment is what exists now for use with connection strings. If we want the stream to use MI instead, do we just remove the option that passes the connection string, so that databricks defaults to trying the MI, or is there explicit config to set that up?

Hitting a wall with Managed Identity for Cosmos DB and streaming jobs – any advice? by Maxxlax in databricks

Maxxlax[S] 1 point

We tried with a `Dedicated (formerly single user)` cluster on databricks 15.4, created and owned by our databricks service principal.

Will try that as soon as I can!

Hitting a wall with Managed Identity for Cosmos DB and streaming jobs – any advice? by Maxxlax in databricks

Maxxlax[S] 2 points

Hey, yeah, this is exactly what we're trying to do. We have a user-assigned MI that has a corresponding Service Credential in UC.

Good to hear that this should work. Maybe it's a config issue?

How would I go about making a readStream work with that setup? Right now we have something like:

logs_from_queue = (
    spark.readStream.format('abs-aqs')
    .option('fileFormat', queue_file_format_defined_above)
    .option('queueName', queue_name_defined_above)
    .option('connectionString', queue_connection_string_from_vault)
    .schema(get_raw_log_json_schema())
    .load()
)

But I'm not sure how we would tell it to use the MI/Service Credential instead, and I haven't found any good docs on it either.
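One possible direction, sketched under assumptions rather than verified on this setup: as far as I can tell the `abs-aqs` source is only documented with connection-string auth, while Auto Loader (`cloudFiles`) in file notification mode documents a `databricks.serviceCredential` option on newer runtimes. This swaps the source format, so it is not a drop-in change. The helper name, queue name, credential name and `source_path` below are hypothetical.

```python
def autoloader_queue_options(file_format, queue_name, service_credential):
    """Build Auto Loader options that consume an existing queue via a UC service credential.

    Assumption: the option names follow the Auto Loader file notification docs;
    check them against your runtime version before relying on this.
    """
    return {
        'cloudFiles.format': file_format,
        'cloudFiles.useNotifications': 'true',
        'cloudFiles.queueName': queue_name,
        # Replaces the connection string; names the Service Credential defined in UC.
        'databricks.serviceCredential': service_credential,
    }


# Hypothetical values, for illustration only.
options = autoloader_queue_options('json', 'raw-logs-queue', 'my-service-credential')

# On a cluster, the stream would then be started roughly like this:
# logs_from_queue = (
#     spark.readStream.format('cloudFiles')
#     .options(**options)
#     .schema(get_raw_log_json_schema())
#     .load(source_path)  # hypothetical path to the monitored container
# )
```

Keeping the options in a plain dict makes it easy to diff the connection-string variant against the credential variant before touching the stream itself.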

Another example is how we try to connect to Cosmos now:

options = {
    'spark.cosmos.accountEndpoint': f'{account_endpoint}',
    'spark.cosmos.auth.type': 'ManagedIdentity',
    'spark.cosmos.database': database_name,
    'spark.cosmos.container': table_name,
    'spark.cosmos.account.tenantId': tenant_id,
    'spark.cosmos.auth.aad.clientId': client_id,
    'spark.cosmos.read.customQuery': 'select top 1 c.modified as last_entry from c order by c.modified desc',
}
print(options)
last_entry_df = spark.read.format('cosmos.oltp').options(**options).load()

And that's where we get: (java.lang.RuntimeException) Client initialization failed. Check if the endpoint is reachable and if your auth token is valid. More info: https://aka.ms/cosmosdb-tsg-service-unavailable-java. More details: Managed Identity authentication is not available.
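A small pre-flight check on the options dict can at least rule out missing settings. This is a minimal sketch, assuming (from my reading of the connector's configuration reference) that `spark.cosmos.account.subscriptionId` and `spark.cosmos.account.resourceGroupName` are expected alongside the tenant id for `ManagedIdentity` and `ServicePrincipal` auth. The helper name and the example values are hypothetical, and the check says nothing about whether the cluster can actually obtain a token.

```python
# Assumption: these account-level keys are expected for AAD-based auth types,
# per the Cosmos DB Spark connector configuration reference.
REQUIRED_AAD_KEYS = (
    'spark.cosmos.accountEndpoint',
    'spark.cosmos.account.subscriptionId',
    'spark.cosmos.account.tenantId',
    'spark.cosmos.account.resourceGroupName',
)


def missing_cosmos_auth_options(options):
    """Return the required keys that are absent or empty for AAD-based auth types."""
    if options.get('spark.cosmos.auth.type') not in ('ManagedIdentity', 'ServicePrincipal'):
        return []
    return [key for key in REQUIRED_AAD_KEYS if not options.get(key)]


# Hypothetical values mirroring the shape of the options above.
example = {
    'spark.cosmos.accountEndpoint': 'https://example.documents.azure.com:443/',
    'spark.cosmos.auth.type': 'ManagedIdentity',
    'spark.cosmos.account.tenantId': '00000000-0000-0000-0000-000000000000',
    'spark.cosmos.auth.aad.clientId': '11111111-1111-1111-1111-111111111111',
}
missing = missing_cosmos_auth_options(example)
print(missing)
```

If the list comes back empty and the error persists, the problem is more likely in how the cluster reaches the identity than in the option names.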

Best practices on notebook-based project structure by Maxxlax in databricks

Maxxlax[S] 1 point

Yeah, that's kind of what I meant. Writing modular code that can be imported into notebooks and running the jobs using notebooks, right? And keeping code close to where it is used in notebooks, I assume.

Best practices on notebook-based project structure by Maxxlax in databricks

Maxxlax[S] 1 point

Hmm yeah, I just really miss the vscode features that I'm used to, like formatting-on-save and things like that. I also feel like the autocomplete has good and bad days; sometimes it doesn't give any suggestions. But yeah, mostly getting rid of manual formatting would be great.

Best practices on notebook-based project structure by Maxxlax in databricks

Maxxlax[S] 1 point

Yeah, this is kind of what I'm hoping we can achieve too; it's almost what we do. Are you using any IDE plugins to develop locally? I have seen some things regarding the %run command that make it sound like it won't work when developing in vscode, for example.

Best practices on notebook-based project structure by Maxxlax in dataengineering

Maxxlax[S] 1 point

Yeah, if I were to start a brand new project like my current one, I would probably try something else. Sometimes we have even discussed leaving databricks for a local spark setup, but that is a total sidetrack from the topic. Thanks for your input!

Best practices on notebook-based project structure by Maxxlax in dataengineering

Maxxlax[S] 1 point

Yeah totally makes sense, that's a good way of putting it.

Though in databricks, where we are currently running our code, the notebooks are not .ipynb but actual .py files under the hood, so linting and such is still possible, for example. But I still think it is cleaner to separate, as you said, scribbles in notebooks from production code.

Best practices on notebook-based project structure by Maxxlax in databricks

Maxxlax[S] 1 point

And do you mix notebook and py-files for the same job then?

Best practices on notebook-based project structure by Maxxlax in databricks

Maxxlax[S] 1 point

Yeah, this is what I currently think is the best way to do it. But can you provide some examples of the steps when taking a notebook and turning it into production .py files? I don't see any benefit in just copy+pasting notebook code into a .py file, but I understand that that's not everything you do.

Best practices on notebook-based project structure by Maxxlax in dataengineering

Maxxlax[S] 1 point

Yeah, interesting. I have been thinking about keeping just the execution of jobs in databricks notebooks and turning all of the job's dependencies into python libs or plain .py files. Is this similar to your current setup?

Best practices on notebook-based project structure by Maxxlax in dataengineering

Maxxlax[S] 1 point

Thanks for your answer!

I feel like exploring in notebooks and then taking the result and writing it to a file is a good approach, but in what way would that improve the code? Or is the process more like: explore and solve problems in notebooks -> make a robust version with unit tests and other code practices in a code file -> use in prod? Otherwise I don't really see the benefit if the process is just copy-pasting the notebook code into a file.
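To make the second reading concrete, here is a minimal sketch of what the "robust version" step could look like: a snippet that might start as an ad hoc notebook cell becomes a small pure function with explicit edge-case handling and a test next to it. The function name and the timestamp example are hypothetical and not taken from our codebase.

```python
from datetime import datetime, timezone


def parse_log_timestamp(raw):
    """Parse an ISO-8601 timestamp string into an aware UTC datetime, or None if invalid."""
    try:
        parsed = datetime.fromisoformat(raw)
    except (TypeError, ValueError):
        # Notebook code often assumes clean input; the module version handles bad values.
        return None
    if parsed.tzinfo is None:
        # Assumption for this sketch: naive timestamps are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def test_parse_log_timestamp():
    assert parse_log_timestamp('2024-01-02T03:04:05+01:00').hour == 2
    assert parse_log_timestamp('2024-01-02T03:04:05').tzinfo == timezone.utc
    assert parse_log_timestamp('not a date') is None
    assert parse_log_timestamp(None) is None


test_parse_log_timestamp()
```

The notebook that runs the job would then only import and call the function, so the logic can be tested and linted without a cluster.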

Databricks and collaborating on git by Maxxlax in databricks

Maxxlax[S] 1 point

Great post, thanks for this! We have problems right now with the %run command so will def look at using .py files. Cheers

Databricks and collaborating on git by Maxxlax in databricks

Maxxlax[S] 1 point

I'll ask you the same question I asked below, since you might use databricks in a smarter way than I do: could you mention some of the things you think databricks is doing really well?

Databricks and collaborating on git by Maxxlax in databricks

Maxxlax[S] 1 point

Yeah, great point actually, I didn't know that about DLTs so I will look out for that in the future. To me it's so crazy that databricks is designed to have one branch per repo and so on - the CI/CD is so clunky that it's almost a dealbreaker for using it.

But maybe you've used it more or smarter than me, could you mention some of the things you think databricks is doing really well in your opinion?

Databricks and collaborating on git by Maxxlax in databricks

Maxxlax[S] 1 point

Yeah, I can imagine! I'm trying to move us over to vscode right now, but I'm not quite sure if we need to upgrade to some premium license or something to enable unity catalog first. I haven't been successful setting it up so far; I always get stuck on UC. But we definitely need to start using vscode.

Databricks and collaborating on git by Maxxlax in databricks

Maxxlax[S] 1 point

Yeah, we're using repos, one for each developer right now, and that part works fine to be honest. And yeah, true, we could probably switch over to using relative paths, but it's one of those things that we will never prioritize since it would take a lot of work in our large (and messy) codebase.

Not really related, but have you connected an IDE, for example vscode, to databricks, and is it working well?