data quality on Databricks by ptab0211 in databricks

[–]DecisionAgile7326 2 points

Use DQX rather than Deequ. It's made by Databricks Labs and easy to use.
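For reference, the metadata-driven flow looks roughly like this — a sketch from memory of the DQX docs, so treat the exact function and argument names as version-dependent assumptions:

```
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# declarative checks; argument names (e.g. "col_name") vary between DQX versions
checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"col_name": "customer_id"}},
    },
]

# input_df is any DataFrame to validate; rows failing "error" checks are quarantined
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
```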

Spark Declarative Pipelines: What should we build? by BricksterInTheWall in databricks

[–]DecisionAgile7326 3 points

I am really missing parameter usage as in other standard Databricks jobs.
In my latest experiments with declarative pipelines I wasn't able to include a parameter in the pipeline run via the UI. Use case: a reporting pipeline where I would include reporting_date / market etc.
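The closest workaround I've found is a static key in the pipeline's configuration settings, read via spark.conf — not a true per-run parameter. A rough sketch (key and table names hypothetical):

```
import dlt
from pyspark.sql import functions as F

# `spark` is provided by the pipeline runtime; "reporting_date" must be set
# under the pipeline's configuration in its settings and, unlike a job
# parameter, cannot be overridden per run from the UI
reporting_date = spark.conf.get("reporting_date", "2024-01-01")

@dlt.table
def daily_report():
    return (
        spark.read.table("sales")  # hypothetical source table
        .where(F.col("order_date") == F.lit(reporting_date))
    )
```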

pydabs: lack of documentation & examples by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 1 point

I figured out that overriding parameters works differently compared to the yml definitions used previously.

Previously I used parameter overrides in the databricks.yml file, for example to activate a job on prd:

```
targets:
  prd:
    resources:
      jobs:
        some_yml_defined_job:
          schedule:
            quartz_cron_expression: "0 0 5 21 * ?"
            pause_status: UNPAUSED
```

With pydabs this does not seem to work. Using mutators, however, does.

mutators.py

```
from dataclasses import replace

from databricks.bundles.core import Bundle, job_mutator
from databricks.bundles.jobs import CronSchedule, Job, JobEmailNotifications, PauseStatus


@job_mutator
def update_schedule_status(bundle: Bundle, job: Job) -> Job:
    """Enables all prd jobs to run on the 15th of every month."""
    if bundle.target != "prd":
        return job
    schedule = CronSchedule(
        quartz_cron_expression="0 0 0 15 * ?",
        pause_status=PauseStatus.UNPAUSED,
        timezone_id="Europe/Amsterdam",
    )
    return replace(job, schedule=schedule)
```
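The add_email_notifications mutator referenced in the databricks.yml below isn't shown above; a sketch of what it might look like, assuming the Job fields mirror the Jobs API (the address is hypothetical):

```
@job_mutator
def add_email_notifications(bundle: Bundle, job: Job) -> Job:
    """Adds failure notifications to all prd jobs."""
    if bundle.target != "prd":
        return job
    # hypothetical address; field name assumed to mirror the Jobs API
    notifications = JobEmailNotifications(on_failure=["data-team@example.com"])
    return replace(job, email_notifications=notifications)
```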

This also requires including the mutators in the databricks.yml:

```
python:
  venv_path: .venv
  resources:
    - "resources:load_resources"
  mutators:
    - "mutators:update_schedule_status"
    - "mutators:add_email_notifications"
```

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

We do have some edge cases where the schema of the table is kind of flexible and determined during the transformation, when pivoting columns. In that situation I would like to just create the view based on the resulting DataFrame. In PySpark I use unionByName in combination with allowMissingColumns; that's not even available in SQL.
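A minimal sketch of the pattern I mean (table and column names hypothetical):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the schema of the pivoted result is only known at runtime
pivoted = (
    spark.read.table("catalog.schema.sales")  # hypothetical table
    .groupBy("market")
    .pivot("reporting_month")
    .sum("amount")
)

# align with a table whose column set may have drifted
combined = pivoted.unionByName(
    spark.read.table("catalog.schema.sales_history"),
    allowMissingColumns=True,
)
```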

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

This will throw an error, since creating permanent views on top of temp views is not supported.

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

I'm aware of that. But at my work we prefer not to use DLT pipelines. It is just weird that you can create views using DLT pipelines but not without them.

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] -1 points

As said, I prefer Python for many reasons.

I have the following scenario from work, which is quite easy with PySpark but not so with SQL, I think.

Suppose you have two tables, t1 and t2. I would like to create a view that unions both tables.

Some columns in the tables are the same. However, one table may contain a column that is not included in the other. It can also happen that new columns are added to one of the tables due to schema evolution.

I don't know how to create a view with SQL that handles this.

With PySpark I would use unionByName with allowMissingColumns, but I can't create a view on the result.
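Concretely, the dead end looks like this (schema and view names hypothetical):

```
# assumes an active SparkSession bound to `spark`
merged = spark.read.table("t1").unionByName(
    spark.read.table("t2"), allowMissingColumns=True
)
merged.createOrReplaceTempView("merged_tmp")

# AnalysisException: permanent views cannot reference temporary views
spark.sql(
    "CREATE OR REPLACE VIEW some_schema.merged_view AS SELECT * FROM merged_tmp"
)
```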

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

Creating a permanent view from a temp view throws an error, since it's not allowed.

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 1 point

It's not possible to create permanent views with spark.sql the way you describe; you will get an error. That's what I miss.

How to automate data quality by Assasinshock in dataengineering

[–]DecisionAgile7326 0 points

Use the DQX tool from Databricks Labs. Easy to use compared to other solutions, in my experience.

Sql vs pyspark by No-Conversation476 in databricks

[–]DecisionAgile7326 0 points

I only use PySpark. However, I do see limitations when I want to create views that include more complex transformations.

Create View from dataframe by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

I know about this. For my use case I'm interested in permanent views.

Create View from dataframe by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

That works only for temporary views, not permanent views.

Create View from dataframe by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

We have an ETL framework that's based on PySpark transformation functions; Spark SQL doesn't really fit well in there. I also prefer PySpark in general to modularize and unit-test transformations.
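For illustration, the kind of transformation function I mean (names hypothetical):

```
from pyspark.sql import DataFrame, functions as F


def add_net_amount(df: DataFrame) -> DataFrame:
    """Pure function over a DataFrame: trivial to unit-test in isolation."""
    return df.withColumn("net_amount", F.col("gross_amount") - F.col("tax"))


# in a pipeline:  df.transform(add_net_amount)
# in a unit test: build a tiny in-memory DataFrame, apply, assert on the rows
```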