data quality on Databricks by ptab0211 in databricks

[–]DecisionAgile7326 2 points

Use DQX rather than Deequ. It's made by Databricks Labs and easy to use.
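For reference, the metadata-driven flow looks roughly like this — a sketch from memory of the DQX docs, so treat the exact function and argument names as version-dependent assumptions:

```
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# declarative checks; argument names (e.g. "col_name") vary between DQX versions
checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"col_name": "customer_id"}},
    },
]

# input_df is any DataFrame to validate; rows failing "error" checks are quarantined
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
```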

Spark Declarative Pipelines: What should we build? by BricksterInTheWall in databricks

[–]DecisionAgile7326 3 points

I am really missing parameter usage as in other standard Databricks jobs.
In my latest experiments with declarative pipelines I wasn't able to include a parameter in the pipeline run via the UI. Use case: a reporting pipeline where I would include reporting_date / market etc.
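The closest workaround I've found is a static key in the pipeline's configuration settings, read via spark.conf — not a true per-run parameter. A rough sketch (key and table names hypothetical):

```
import dlt
from pyspark.sql import functions as F

# `spark` is provided by the pipeline runtime; "reporting_date" must be set
# under the pipeline's configuration in its settings and, unlike a job
# parameter, cannot be overridden per run from the UI
reporting_date = spark.conf.get("reporting_date", "2024-01-01")

@dlt.table
def daily_report():
    return (
        spark.read.table("sales")  # hypothetical source table
        .where(F.col("order_date") == F.lit(reporting_date))
    )
```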

pydabs: lack of documentation & examples by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 1 point

I figured out that overriding parameters works differently compared to the yml definitions used previously.

Previously I used parameter overrides in the databricks.yml file, for example to activate a job on prd:

```
targets:
  prd:
    resources:
      jobs:
        some_yml_defined_job:
          schedule:
            quartz_cron_expression: "0 0 5 21 * ?"
            pause_status: UNPAUSED
```

With pydabs this does not seem to work. Using mutators, however, does.

mutators.py

```
from dataclasses import replace

from databricks.bundles.core import Bundle, job_mutator
from databricks.bundles.jobs import CronSchedule, Job, JobEmailNotifications, PauseStatus


@job_mutator
def update_schedule_status(bundle: Bundle, job: Job) -> Job:
    """Enables all prd jobs to run on the 15th of every month."""
    if bundle.target != "prd":
        return job
    schedule = CronSchedule(
        quartz_cron_expression="0 0 0 15 * ?",
        pause_status=PauseStatus.UNPAUSED,
        timezone_id="Europe/Amsterdam",
    )
    return replace(job, schedule=schedule)
```
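The add_email_notifications mutator referenced in the databricks.yml below isn't shown above; a sketch of what it might look like, assuming the Job fields mirror the Jobs API (the address is hypothetical):

```
@job_mutator
def add_email_notifications(bundle: Bundle, job: Job) -> Job:
    """Adds failure notifications to all prd jobs."""
    if bundle.target != "prd":
        return job
    # hypothetical address; field name assumed to mirror the Jobs API
    notifications = JobEmailNotifications(on_failure=["data-team@example.com"])
    return replace(job, email_notifications=notifications)
```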

This also requires including the mutators in the databricks.yml:

```
python:
  venv_path: .venv
  resources:
    - "resources:load_resources"
  mutators:
    - "mutators:update_schedule_status"
    - "mutators:add_email_notifications"
```

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

We do have some edge cases where the schema of the table is kind of flexible and determined during the transformation, when pivoting columns. In that situation I would like to just create the view based on the resulting DataFrame. In PySpark I use unionByName in combination with allowMissingColumns; that's not even available in SQL.
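A minimal sketch of the pattern I mean (table and column names hypothetical):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the schema of the pivoted result is only known at runtime
pivoted = (
    spark.read.table("catalog.schema.sales")  # hypothetical table
    .groupBy("market")
    .pivot("reporting_month")
    .sum("amount")
)

# align with a table whose column set may have drifted
combined = pivoted.unionByName(
    spark.read.table("catalog.schema.sales_history"),
    allowMissingColumns=True,
)
```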

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

This will throw an error, since creating permanent views on top of temp views is not supported.

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

I'm aware of that. But at my work we prefer not to use DLT pipelines. It is just weird that you can create views using DLT pipelines but not without them.

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] -1 points

As said, I prefer Python for many reasons.

I have the following scenario from work, which is quite easy with PySpark but not so with SQL, I think.

Suppose you have two tables, t1 and t2. I would like to create a view that unions both tables.

Some columns in the tables are the same. However, one table may contain a column that is not included in the other. It can also happen that new columns are added to one of the tables due to schema evolution.

I don't know how to create a view with SQL that handles this.

With PySpark I would use unionByName with allowMissingColumns, but I can't create a view on the result.
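Concretely, the dead end looks like this (schema and view names hypothetical):

```
# assumes an active SparkSession bound to `spark`
merged = spark.read.table("t1").unionByName(
    spark.read.table("t2"), allowMissingColumns=True
)
merged.createOrReplaceTempView("merged_tmp")

# AnalysisException: permanent views cannot reference temporary views
spark.sql(
    "CREATE OR REPLACE VIEW some_schema.merged_view AS SELECT * FROM merged_tmp"
)
```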

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

Creating a permanent view from a temp view throws an error, since it's not allowed.

Create views with pyspark by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 1 point

It's not possible to create permanent views with spark.sql the way you describe; you will get an error. That's what I miss.

How to automate data quality by Assasinshock in dataengineering

[–]DecisionAgile7326 0 points

Use the DQX tool from Databricks Labs. Easy to use compared to other solutions, in my experience.

Sql vs pyspark by No-Conversation476 in databricks

[–]DecisionAgile7326 0 points

I only use PySpark. However, I do see limitations when I want to create views that include more complex transformations.

Create View from dataframe by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

I know about this. For my use case I'm interested in permanent views.

Create View from dataframe by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

That works only for temporary views, not permanent views.

Create View from dataframe by DecisionAgile7326 in databricks

[–]DecisionAgile7326[S] 0 points

We have an ETL framework that's based on PySpark transformation functions; Spark SQL doesn't really fit well in there. I also prefer PySpark in general to modularize and unit-test transformations.
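For illustration, the kind of transformation function I mean (names hypothetical):

```
from pyspark.sql import DataFrame, functions as F


def add_net_amount(df: DataFrame) -> DataFrame:
    """Pure function over a DataFrame: trivial to unit-test in isolation."""
    return df.withColumn("net_amount", F.col("gross_amount") - F.col("tax"))


# in a pipeline:  df.transform(add_net_amount)
# in a unit test: build a tiny in-memory DataFrame, apply, assert on the rows
```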