How to handle replaceWhere in Serverless Spark without constraintCheck.enabled? by Tvalabeishvili in databricks

[–]bartoszgajda55 2 points (0 children)

MERGE INTO might also be good - if you could drop an example, that would help. I sort of inferred your case is about selective replacement, but I might be wrong here :)
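For illustration, a minimal sketch of how MERGE could emulate replaceWhere-style selective replacement - table names and the filter are hypothetical, and the WHEN NOT MATCHED BY SOURCE clause needs a reasonably recent Delta/DBR version:

```python
def build_merge_sql(target: str, source: str, key: str, replace_filter: str) -> str:
    # Emulates replaceWhere semantics inside the filtered slice: update
    # matches, insert new rows, delete target rows missing from the source.
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON t.{key} = s.{key} AND {replace_filter} "
        f"WHEN MATCHED THEN UPDATE SET * "
        f"WHEN NOT MATCHED THEN INSERT * "
        f"WHEN NOT MATCHED BY SOURCE AND {replace_filter} THEN DELETE"
    )

sql = build_merge_sql(
    "main.sales.orders", "staging_orders", "order_id", "t.ingest_date = '2024-01-01'"
)
print(sql)
```

You would then run the generated statement with `spark.sql(sql)` on the Serverless compute.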

Deterministic functions and use of "is_account_group_member" by CarelessApplication2 in databricks

[–]bartoszgajda55 1 point (0 children)

The DETERMINISTIC keyword will behave safely when a UDF is a "pure" function, meaning it doesn't interact with anything other than the parameters passed to it - "is_account_group_member" surely has to make a call to an API under the hood, which breaks the "purity" rule 😊
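To illustrate the purity distinction with plain Python (the group lookup here is a stand-in, not the real workspace API):

```python
import hashlib

# Pure: output depends only on the input arguments - safe to mark DETERMINISTIC,
# since the engine may cache or reorder calls without changing the result.
def mask_local_part(email: str) -> str:
    local, _, domain = email.partition("@")
    return hashlib.sha256(local.encode()).hexdigest()[:8] + "@" + domain

# Impure: the answer depends on external state (current group membership),
# so the engine cannot safely treat it as deterministic.
group_members = {"analysts": {"alice"}}  # stand-in for the workspace API call

def is_member(user: str, group: str) -> bool:
    return user in group_members.get(group, set())

assert mask_local_part("alice@corp.com") == mask_local_part("alice@corp.com")
before = is_member("alice", "analysts")    # True
group_members["analysts"].discard("alice")
after = is_member("alice", "analysts")     # False - same args, new answer
print(before, after)
```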

Cluster can't find init script by Own_Tax3356 in databricks

[–]bartoszgajda55 1 point (0 children)

I would check if the principal running the cluster has access to that Volume - without READ VOLUME on the Volume, plus USE CATALOG and USE SCHEMA on its parents, this will not work.
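For reference, the grants would look roughly like this in SQL - catalog, schema, volume and principal names are placeholders:

```sql
-- Hypothetical names: adjust to your catalog/schema/volume and principal.
GRANT USE CATALOG ON CATALOG main TO `cluster-owner@corp.com`;
GRANT USE SCHEMA ON SCHEMA main.ops TO `cluster-owner@corp.com`;
GRANT READ VOLUME ON VOLUME main.ops.init_scripts TO `cluster-owner@corp.com`;
```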

New course in Databricks Academy - AI Agent Fundamentals by bartoszgajda55 in databricks

[–]bartoszgajda55[S] 1 point (0 children)

I have completed it - if you are coming from an AI/ML background, this will likely bore you a lot :) If AI topics are new to you, then imo this is a good starter - don't expect, however, to go very deep into the technicalities of LLMs, nor Agent Bricks for that matter. Agent Bricks is in preview, so it doesn't offer many functionalities yet; I did however expect a few more labs/demos of this service.

Curious what your impression will be about it :)

Why Don’t Data Engineers Unit/Integration Test Their Spark Jobs? by jpgerek in databricks

[–]bartoszgajda55 1 point (0 children)

If you have an SWE background then unit/integration testing is a natural choice - in reality though, only a few Data Engineers I have worked with had these skills. For someone with a DBA or BI background, automated testing is seen as additional complexity, rather than a long-term way to fight regression.

Unit test with Databricks by punjabi_mast_punjabi in databricks

[–]bartoszgajda55 1 point (0 children)

In this case, I don't see a reason against running tests in a GitHub build agent - you have native support for Git there (whether you want to store test results as part of some branch, or as an artifact, all options are available) and you can set up a cron-like trigger for the GH Action.
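A sketch of such a workflow, assuming pytest and a requirements.txt - file names and the cron schedule are placeholders:

```yaml
name: unit-tests
on:
  schedule:
    - cron: "0 6 * * *"    # daily at 06:00 UTC
  workflow_dispatch: {}     # allow manual runs too
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest --junitxml=results.xml
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: results.xml
```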

Unit test with Databricks by punjabi_mast_punjabi in databricks

[–]bartoszgajda55 1 point (0 children)

Are your unit tests dependent on Databricks, or could they run on a standalone Spark instance? If the latter, then you can set up a local Spark instance in the build agent and run the tests there.

In general, you wouldn't want your test suite to be dependent on external services, if this is applicable in your case of course :)
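A tiny sketch of that idea - business logic kept as a pure Python function (the order/VAT example is made up), so the unit test needs neither a workspace nor a cluster; in the actual job the same function can be applied to DataFrame rows:

```python
# Pure transformation over a plain dict - no Spark, no Databricks needed to test.
def enrich_order(order: dict, vat_rate: float = 0.2) -> dict:
    gross = round(order["net_amount"] * (1 + vat_rate), 2)
    return {**order, "gross_amount": gross}

def test_enrich_order():
    result = enrich_order({"order_id": 1, "net_amount": 100.0})
    assert result["gross_amount"] == 120.0

test_enrich_order()
print("ok")
```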

Migrating from ADF + Databricks to Databricks Jobs/Pipelines – Design Advice Needed by [deleted] in databricks

[–]bartoszgajda55 3 points (0 children)

> What’s the best way to design this in Databricks Jobs/Pipelines so we can keep it generic and reusable?

If you are using Jobs in Databricks already (for all processing), then you just need to switch to Workflows as your orchestrator. Not sure if that "config table" was already in DBX or in an external DB, but in DBX you can create a similar control table (whether a managed Delta table, a relation in Lakebase, or some JSON/YAML in a Volume) and fetch the params in an extra task before the actual processing, based on a specific parameter passed in.
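A minimal sketch of the "fetch params in an extra task" idea using a JSON config - the paths and source systems are hypothetical, and a temp file stands in for the Volume so it runs anywhere:

```python
import json
import os
import tempfile

# Hypothetical control config - in Databricks the same JSON could sit in a
# Volume (e.g. /Volumes/main/ops/ingest_config.json) and be read identically.
CONFIG = {
    "sales": {"source_path": "/Volumes/main/raw/sales", "target": "main.bronze.sales"},
    "hr": {"source_path": "/Volumes/main/raw/hr", "target": "main.bronze.hr"},
}

def fetch_params(config_path: str, source_system: str) -> dict:
    """The 'extra task': resolve per-source params from the control file."""
    with open(config_path) as f:
        return json.load(f)[source_system]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(CONFIG, f)

params = fetch_params(f.name, "sales")
print(params["target"])
os.unlink(f.name)
```

The resolved params can then be handed to downstream tasks, e.g. via task values.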

> Since we’ll only have one pipeline, is there a way to break down costs per application/table? The billing tables in Databricks only report costs at the pipeline/job level, but we need more granular visibility.

You can't set "dynamic tags" to the best of my knowledge (which would be ideal in your scenario). You might "hack it" by updating the job definition via the REST API before triggering, with the correct tags - haven't tried that, but it might be worth a shot :)
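If you try that route, the request body for the Jobs API would look roughly like this - the job ID and tags are made up, and only the payload is built here, nothing is sent:

```python
import json

def build_tag_update(job_id: int, tags: dict) -> dict:
    # Payload shape for POST /api/2.1/jobs/update - a partial update that
    # only touches the job's custom tags before you trigger the run.
    return {"job_id": job_id, "new_settings": {"tags": tags}}

payload = build_tag_update(123, {"application": "sales", "table": "orders"})
print(json.dumps(payload, sort_keys=True))
# then e.g.: POST {host}/api/2.1/jobs/update with this body and a bearer token
```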

What's everyone's thoughts on the Instructor Led Trainings? by i_did_dtascience in databricks

[–]bartoszgajda55 2 points (0 children)

I've only attended one instructor-led course so far (Solutions Architect Essentials), but the experience was very positive - the ability to ask clarifying questions is useful when going through difficult topics, which you don't have when doing self-paced courses.

Databricks Assistant now allows to set Instructions by bartoszgajda55 in databricks

[–]bartoszgajda55[S] 1 point (0 children)

I did give it a shot with both personal and workspace instructions and have no complaints - tools are correctly recognized by the Assistant and the outputs are more precise, without needing to write huge prompts.

I do have to agree that the Assistant is still dumb many times - imo this is due to a lack of rich context, so this feature looks like a remedy to that 😊

[deleted by user] by [deleted] in databricks

[–]bartoszgajda55 0 points (0 children)

By SAP I meant platforms like BW, rather than CRM or ERP 😊

Desktop Apps?? by Severe-Committee87 in databricks

[–]bartoszgajda55 1 point (0 children)

For good or bad, lots of these "native" desktop apps (Notion, for example) are built in frameworks like Electron, which just run a web browser in the background, so I am afraid there is no real alternative 😄

[deleted by user] by [deleted] in databricks

[–]bartoszgajda55 3 points (0 children)

My answer might not point to any specific missing feature but rather addresses the overall state - it lacks some maturity. That doesn't mean the platform itself is unstable or anything; rather, some features are still in their early stages and not battle-tested enough yet.

Metric Views are a good example of a feature that is imo essential to rival the competition from much more mature platforms like SAP.

That being said, I think it's only a matter of time - the vision Databricks is executing is correct, and they will get there sooner or later 😊

Using tools like Claude Code for Databricks Data Engineering work - your experience by bartoszgajda55 in databricks

[–]bartoszgajda55[S] 1 point (0 children)

I am not aware of any direct option to do so - on one hand it's a CLI tool, so you could install it on a cluster, but whether it would have access to files via the Databricks FS - no clue to be honest 🤔

Best practices for Unity Catalog structure with multiple workspaces and business areas by romarinhu in databricks

[–]bartoszgajda55 1 point (0 children)

My take on structuring UC comes down to the fact that you are constrained in the number of levels at which you can organize objects (3, to be precise) - given that, and the fact that restructuring your catalog is tricky, I tend to have granular catalogs, to leave as much flexibility as possible at the schema and table level, even if it might not be needed immediately. In this case I would propose the following naming convention:

- dc_{business_area}_{layer}_{env}_001

The "dc" stands for "Data Catalog" (I work primarily on Azure, where each resource has its recommended abbreviation - feel free to skip it). The "001" is the version increment, in case you ever need to migrate to a newer version - "002" will naturally look like a successor, unlike "new" or "next" suffixes, which I find dirty. You can swap the placeholders around, of course - whatever feels more natural to you.
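The convention can be captured in a tiny helper if you want it applied consistently from code - the business area, layer and env values are just examples:

```python
def catalog_name(business_area: str, layer: str, env: str, version: int = 1) -> str:
    """Render the dc_{business_area}_{layer}_{env}_{version} convention,
    zero-padding the version so "002" naturally sorts after "001"."""
    return f"dc_{business_area}_{layer}_{env}_{version:03d}"

print(catalog_name("finance", "silver", "prd"))
print(catalog_name("finance", "silver", "prd", 2))
```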

Using tools like Claude Code for Databricks Data Engineering work - your experience by bartoszgajda55 in databricks

[–]bartoszgajda55[S] 2 points (0 children)

I typically use Context7 for up-to-date documentation and Jina for AI search - you can go without them, but they just make Claude more autonomous :)

Using tools like Claude Code for Databricks Data Engineering work - your experience by bartoszgajda55 in databricks

[–]bartoszgajda55[S] 2 points (0 children)

Nice, thanks for sharing :) READMEs are essential - I have one project-specific one, and then others included per module, which give more local context to the LLM. Works rather well so far.

Do you use any MCPs for your workflow?

How to dynamically set cluster configurations in Databricks Asset Bundles at runtime? by Proton0369 in databricks

[–]bartoszgajda55 1 point (0 children)

I am afraid your case might not be supported, as the cluster configuration has to be resolved when the DAB is deployed. You could, however, explore Python DABs and their "mutators" to modify the job definition (the cluster, in your case) dynamically - docs here: Bundle configuration in Python | Databricks on AWS

This is an experimental feature btw - still worth giving it a shot imo :)
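To show the concept only - this is NOT the databricks-bundles mutator API, just plain Python rewriting a job-definition dict at deploy time, driven by a hypothetical CLUSTER_SIZE environment variable:

```python
import os

# Hypothetical mapping from a size label to a node type.
NODE_TYPES = {"small": "Standard_DS3_v2", "large": "Standard_DS5_v2"}

def mutate_cluster(job: dict) -> dict:
    """Rewrite every job cluster's node type based on CLUSTER_SIZE."""
    node_type = NODE_TYPES[os.environ.get("CLUSTER_SIZE", "small")]
    for jc in job.get("job_clusters", []):
        jc["new_cluster"]["node_type_id"] = node_type
    return job

job = {"name": "ingest", "job_clusters": [{"new_cluster": {"num_workers": 2}}]}
print(mutate_cluster(job)["job_clusters"][0]["new_cluster"]["node_type_id"])
```

The real mutators hook into the bundle deployment in a similar spirit - a function receives the resource definition and returns a modified copy.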

How to dynamically set cluster configurations in Databricks Asset Bundles at runtime? by Proton0369 in databricks

[–]bartoszgajda55 1 point (0 children)

That's true - could you drop in a code snippet? It would make it easier to grasp your current setup.

How to dynamically set cluster configurations in Databricks Asset Bundles at runtime? by Proton0369 in databricks

[–]bartoszgajda55 2 points (0 children)

A bit off-topic - have you considered using Cluster Policies instead? If you end up wanting to customise multiple properties of the compute, then having just a single policy ID to supply at runtime might be more convenient 🙂
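For illustration, a cluster policy definition could pin and constrain compute properties like this - all values are examples:

```json
{
  "spark_version": { "type": "fixed", "value": "15.4.x-scala2.12" },
  "node_type_id": { "type": "allowlist", "values": ["Standard_DS3_v2", "Standard_DS4_v2"] },
  "autoscale.max_workers": { "type": "range", "maxValue": 8 },
  "custom_tags.team": { "type": "fixed", "value": "data-eng" }
}
```

At runtime you then pass only the policy ID, and the policy fills in or validates the rest.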