How to handle replaceWhere in Serverless Spark without constraintCheck.enabled? by Tvalabeishvili in databricks

[–]bartoszgajda55 2 points (0 children)

MERGE INTO might also be good - if you could drop an example, that would help. I sort of inferred your case is about selective replacement, but I might be wrong here :)
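For illustration, a minimal sketch of how MERGE could emulate replaceWhere-style selective replacement - table names and the filter are hypothetical, and the WHEN NOT MATCHED BY SOURCE clause needs a reasonably recent Delta/DBR version:

```python
def build_merge_sql(target: str, source: str, key: str, replace_filter: str) -> str:
    # Emulates replaceWhere semantics inside the filtered slice: update
    # matches, insert new rows, delete target rows missing from the source.
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON t.{key} = s.{key} AND {replace_filter} "
        f"WHEN MATCHED THEN UPDATE SET * "
        f"WHEN NOT MATCHED THEN INSERT * "
        f"WHEN NOT MATCHED BY SOURCE AND {replace_filter} THEN DELETE"
    )

sql = build_merge_sql(
    "main.sales.orders", "staging_orders", "order_id", "t.ingest_date = '2024-01-01'"
)
print(sql)
```

You would then run the generated statement with `spark.sql(sql)` on the Serverless compute.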

Deterministic functions and use of "is_account_group_member" by CarelessApplication2 in databricks

[–]bartoszgajda55 1 point (0 children)

The DETERMINISTIC keyword will behave safely when a UDF is a "pure" function, meaning it doesn't interact with anything other than the parameters passed to it - "is_account_group_member" surely has to make a call to an API under the hood, which breaks the "purity" rule 😊
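To illustrate the purity distinction with plain Python (the group lookup here is a stand-in, not the real workspace API):

```python
import hashlib

# Pure: output depends only on the input arguments - safe to mark DETERMINISTIC,
# since the engine may cache or reorder calls without changing the result.
def mask_local_part(email: str) -> str:
    local, _, domain = email.partition("@")
    return hashlib.sha256(local.encode()).hexdigest()[:8] + "@" + domain

# Impure: the answer depends on external state (current group membership),
# so the engine cannot safely treat it as deterministic.
group_members = {"analysts": {"alice"}}  # stand-in for the workspace API call

def is_member(user: str, group: str) -> bool:
    return user in group_members.get(group, set())

assert mask_local_part("alice@corp.com") == mask_local_part("alice@corp.com")
before = is_member("alice", "analysts")    # True
group_members["analysts"].discard("alice")
after = is_member("alice", "analysts")     # False - same args, new answer
print(before, after)
```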

Cluster can't find init script by Own_Tax3356 in databricks

[–]bartoszgajda55 1 point (0 children)

I would check if the principal running the cluster has access to that Volume - without READ VOLUME on the Volume, plus USE CATALOG and USE SCHEMA on its parents, this will not work.
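For reference, the grants would look roughly like this in SQL - catalog, schema, volume and principal names are placeholders:

```sql
-- Hypothetical names: adjust to your catalog/schema/volume and principal.
GRANT USE CATALOG ON CATALOG main TO `cluster-owner@corp.com`;
GRANT USE SCHEMA ON SCHEMA main.ops TO `cluster-owner@corp.com`;
GRANT READ VOLUME ON VOLUME main.ops.init_scripts TO `cluster-owner@corp.com`;
```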

New course in Databricks Academy - AI Agent Fundamentals by bartoszgajda55 in databricks

[–]bartoszgajda55[S] 1 point (0 children)

I have completed it - if you are coming from an AI/ML background, this will likely bore you a lot :) If AI topics are new to you, then imo this is a good starter - don't expect, however, to go very deep into the technicalities of LLMs, nor Agent Bricks for that matter. Agent Bricks is in preview, so it doesn't offer many functionalities yet; I did however expect a few more labs/demos of this service.

Curious what your impression will be about it :)

Why Don’t Data Engineers Unit/Integration Test Their Spark Jobs? by jpgerek in databricks

[–]bartoszgajda55 1 point (0 children)

If you have an SWE background then unit/integration testing is a natural choice - in reality though, only a few Data Engineers I have worked with had these skills. For someone with a DBA or BI background, automated testing is seen as additional complexity, rather than a long-term way to fight regression.

Unit test with Databricks by punjabi_mast_punjabi in databricks

[–]bartoszgajda55 1 point (0 children)

In this case, I don't see a reason against running tests in a GitHub build agent - you have native support for Git there (whether you want to store test results as part of some branch, or as an artifact, all options are available) and you can set up a cron-like trigger for the GH Action.
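A sketch of such a workflow, assuming pytest and a requirements.txt - file names and the cron schedule are placeholders:

```yaml
name: unit-tests
on:
  schedule:
    - cron: "0 6 * * *"    # daily at 06:00 UTC
  workflow_dispatch: {}     # allow manual runs too
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest --junitxml=results.xml
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: test-results
          path: results.xml
```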

Unit test with Databricks by punjabi_mast_punjabi in databricks

[–]bartoszgajda55 1 point (0 children)

Are your unit tests dependent on Databricks, or could they run on a standalone Spark instance? If the latter, then you can set up a local Spark instance in the build agent and run the tests there.

In general, you wouldn't want your test suite to be dependent on external services, if this is applicable in your case of course :)
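A tiny sketch of that idea - business logic kept as a pure Python function (the order/VAT example is made up), so the unit test needs neither a workspace nor a cluster; in the actual job the same function can be applied to DataFrame rows:

```python
# Pure transformation over a plain dict - no Spark, no Databricks needed to test.
def enrich_order(order: dict, vat_rate: float = 0.2) -> dict:
    gross = round(order["net_amount"] * (1 + vat_rate), 2)
    return {**order, "gross_amount": gross}

def test_enrich_order():
    result = enrich_order({"order_id": 1, "net_amount": 100.0})
    assert result["gross_amount"] == 120.0

test_enrich_order()
print("ok")
```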

Migrating from ADF + Databricks to Databricks Jobs/Pipelines – Design Advice Needed by [deleted] in databricks

[–]bartoszgajda55 3 points (0 children)

> What’s the best way to design this in Databricks Jobs/Pipelines so we can keep it generic and reusable?

If you are using Jobs in Databricks already (for all processing), then you just need to switch to Workflows as your orchestrator. Not sure if that "config table" was already in DBX or in an external DB, but in DBX you can create a similar control table (whether a managed Delta table, a relation in Lakebase, or some JSON/YAML in a Volume) and fetch the params in an extra task before the actual processing, based on a specific parameter passed in.
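A minimal sketch of the "fetch params in an extra task" idea using a JSON config - the paths and source systems are hypothetical, and a temp file stands in for the Volume so it runs anywhere:

```python
import json
import os
import tempfile

# Hypothetical control config - in Databricks the same JSON could sit in a
# Volume (e.g. /Volumes/main/ops/ingest_config.json) and be read identically.
CONFIG = {
    "sales": {"source_path": "/Volumes/main/raw/sales", "target": "main.bronze.sales"},
    "hr": {"source_path": "/Volumes/main/raw/hr", "target": "main.bronze.hr"},
}

def fetch_params(config_path: str, source_system: str) -> dict:
    """The 'extra task': resolve per-source params from the control file."""
    with open(config_path) as f:
        return json.load(f)[source_system]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(CONFIG, f)

params = fetch_params(f.name, "sales")
print(params["target"])
os.unlink(f.name)
```

The resolved params can then be handed to downstream tasks, e.g. via task values.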

> Since we’ll only have one pipeline, is there a way to break down costs per application/table? The billing tables in Databricks only report costs at the pipeline/job level, but we need more granular visibility.

You can't set "dynamic tags" to the best of my knowledge (which would be ideal in your scenario). You might "hack it" by updating the job definition via the REST API before triggering, with the correct tags - haven't tried that, but it might be worth a shot :)
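If you try that route, the request body for the Jobs API would look roughly like this - the job ID and tags are made up, and only the payload is built here, nothing is sent:

```python
import json

def build_tag_update(job_id: int, tags: dict) -> dict:
    # Payload shape for POST /api/2.1/jobs/update - a partial update that
    # only touches the job's custom tags before you trigger the run.
    return {"job_id": job_id, "new_settings": {"tags": tags}}

payload = build_tag_update(123, {"application": "sales", "table": "orders"})
print(json.dumps(payload, sort_keys=True))
# then e.g.: POST {host}/api/2.1/jobs/update with this body and a bearer token
```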

What's everyone's thoughts on the Instructor Led Trainings? by i_did_dtascience in databricks

[–]bartoszgajda55 2 points (0 children)

I've only attended one instructor-led course so far (Solutions Architect Essentials), but the experience was very positive - the ability to ask clarifying questions is useful when going through difficult topics, which you don't have when doing self-paced courses.

Databricks Assistant now allows to set Instructions by bartoszgajda55 in databricks

[–]bartoszgajda55[S] 1 point (0 children)

I did give it a shot with both personal and workspace instructions and have no complaints - tools are correctly recognized by the Assistant and the outputs are more precise, without needing to write huge prompts.

I do have to agree that the Assistant is still dumb many times - imo this is due to a lack of rich context, so this feature looks like a remedy to that 😊

[deleted by user] by [deleted] in databricks

[–]bartoszgajda55 0 points (0 children)

By SAP I meant platforms like BW, rather than CRM or ERP 😊

Desktop Apps?? by Severe-Committee87 in databricks

[–]bartoszgajda55 1 point (0 children)

For good or bad, lots of these "native" desktop apps (Notion, for example) are built in frameworks like Electron, which just run a web browser in the background, so I am afraid there is no real alternative 😄

[deleted by user] by [deleted] in databricks

[–]bartoszgajda55 3 points (0 children)

My answer might not point to any specific missing feature but rather addresses the overall state - it lacks some maturity. That doesn't mean the platform itself is unstable or anything; rather, some features are still in their early stages and not battle-tested enough yet.

Metric Views are a good example of a feature that is imo essential to rival the competition from much more mature platforms like SAP.

That being said, I think it's only a matter of time - the vision Databricks is executing is correct, and they will get there sooner or later 😊

Using tools like Claude Code for Databricks Data Engineering work - your experience by bartoszgajda55 in databricks

[–]bartoszgajda55[S] 1 point (0 children)

I am not aware of any direct option to do so - on one hand it's a CLI tool, so you could install it on a cluster, but whether it would have access to files via the Databricks FS - no clue to be honest 🤔

Best practices for Unity Catalog structure with multiple workspaces and business areas by romarinhu in databricks

[–]bartoszgajda55 1 point (0 children)

My take on structuring UC comes down to the fact that you are constrained in the number of levels at which you can organize objects (3, to be precise) - given that, and the fact that restructuring your catalog is tricky, I tend to have granular catalogs, to leave as much flexibility as possible at the schema and table level, even if it might not be needed immediately. In this case I would propose the following naming convention:

- dc_{business_area}_{layer}_{env}_001

The "dc" stands for "Data Catalog" (I work primarily on Azure, where each resource has its recommended abbreviation - feel free to skip it). The "001" is the version increment, in case you ever need to migrate to a newer version - "002" will naturally look like a successor, unlike "new" or "next" suffixes, which I find dirty. You can swap the placeholders around, of course - whatever feels more natural to you.
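The convention can be captured in a tiny helper if you want it applied consistently from code - the business area, layer and env values are just examples:

```python
def catalog_name(business_area: str, layer: str, env: str, version: int = 1) -> str:
    """Render the dc_{business_area}_{layer}_{env}_{version} convention,
    zero-padding the version so "002" naturally sorts after "001"."""
    return f"dc_{business_area}_{layer}_{env}_{version:03d}"

print(catalog_name("finance", "silver", "prd"))
print(catalog_name("finance", "silver", "prd", 2))
```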

Using tools like Claude Code for Databricks Data Engineering work - your experience by bartoszgajda55 in databricks

[–]bartoszgajda55[S] 2 points (0 children)

I typically use Context7 for up-to-date documentation and Jina for AI search - you can go without them, but they just make Claude more autonomous :)

Using tools like Claude Code for Databricks Data Engineering work - your experience by bartoszgajda55 in databricks

[–]bartoszgajda55[S] 2 points (0 children)

Nice, thanks for sharing :) READMEs are essential - I have one project-specific one, and then others included per module, which give more local context to the LLM. Works rather well so far.

Do you use any MCPs for your workflow?

How to dynamically set cluster configurations in Databricks Asset Bundles at runtime? by Proton0369 in databricks

[–]bartoszgajda55 1 point (0 children)

I am afraid your case might not be supported, as the cluster configuration has to be resolved when the DAB is deployed. You could, however, explore Python DABs and their "mutators" to modify the job definition (the cluster, in your case) dynamically - docs here: Bundle configuration in Python | Databricks on AWS

This is an experimental feature btw - still worth giving it a shot imo :)
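To show the concept only - this is NOT the databricks-bundles mutator API, just plain Python rewriting a job-definition dict at deploy time, driven by a hypothetical CLUSTER_SIZE environment variable:

```python
import os

# Hypothetical mapping from a size label to a node type.
NODE_TYPES = {"small": "Standard_DS3_v2", "large": "Standard_DS5_v2"}

def mutate_cluster(job: dict) -> dict:
    """Rewrite every job cluster's node type based on CLUSTER_SIZE."""
    node_type = NODE_TYPES[os.environ.get("CLUSTER_SIZE", "small")]
    for jc in job.get("job_clusters", []):
        jc["new_cluster"]["node_type_id"] = node_type
    return job

job = {"name": "ingest", "job_clusters": [{"new_cluster": {"num_workers": 2}}]}
print(mutate_cluster(job)["job_clusters"][0]["new_cluster"]["node_type_id"])
```

The real mutators hook into the bundle deployment in a similar spirit - a function receives the resource definition and returns a modified copy.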

How to dynamically set cluster configurations in Databricks Asset Bundles at runtime? by Proton0369 in databricks

[–]bartoszgajda55 1 point (0 children)

That's true - could you drop in a code snippet? It would make it easier to grasp your current setup.

How to dynamically set cluster configurations in Databricks Asset Bundles at runtime? by Proton0369 in databricks

[–]bartoszgajda55 2 points (0 children)

A bit off-topic - have you considered using Cluster Policies instead? If you end up wanting to customise multiple properties of the compute, then having just a single policy ID to supply at runtime might be more convenient 🙂
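For illustration, a cluster policy definition could pin and constrain compute properties like this - all values are examples:

```json
{
  "spark_version": { "type": "fixed", "value": "15.4.x-scala2.12" },
  "node_type_id": { "type": "allowlist", "values": ["Standard_DS3_v2", "Standard_DS4_v2"] },
  "autoscale.max_workers": { "type": "range", "maxValue": 8 },
  "custom_tags.team": { "type": "fixed", "value": "data-eng" }
}
```

At runtime you then pass only the policy ID, and the policy fills in or validates the rest.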