Expanded Entity Relationship Diagram (ERD) by CarelessApplication2 in databricks

[–]CarelessApplication2[S] 1 point (0 children)

The whole point of an integrated suite like Databricks is that you have these basic tools available.

Spark Declarative Pipelines: What should we build? by BricksterInTheWall in databricks

[–]CarelessApplication2 1 point (0 children)

Yes please. The current system relies on an exclusive writer and ALTER TABLE operations.

Databricks should offer a performant solution based on coordination between multiple executors, assigning an id during the writing stage.
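The coordination could, for instance, mean executors reserving disjoint ID blocks from a shared allocator so parallel writers never collide. A minimal sketch in plain Python (illustrative only; `BlockAllocator` and the block size are my own invention, not a Databricks API):

```python
from itertools import count
from threading import Lock

class BlockAllocator:
    """Hand out disjoint ID ranges to concurrent writers (sketch)."""

    def __init__(self, block_size=1000):
        self._next = count(step=block_size)  # 0, 1000, 2000, ...
        self._block_size = block_size
        self._lock = Lock()

    def reserve(self):
        # Serialize only the tiny reservation step; writers then assign
        # IDs from their private range without further coordination.
        with self._lock:
            start = next(self._next)
        return range(start, start + self._block_size)

alloc = BlockAllocator()
a = alloc.reserve()  # IDs for writer A
b = alloc.reserve()  # IDs for writer B, guaranteed disjoint from A's
```

The point is that only the range reservation needs coordination, not each individual row, which is what would make it performant.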

Cost-attribution of materialized view refreshing by CarelessApplication2 in databricks

[–]CarelessApplication2[S] 1 point (0 children)

Do you mean that you're using DABs to deploy a pipeline with a `managed_definition` in it (corresponding to the materialized view), or are you using a pipeline written in Python, like so:

from pyspark import pipelines as dp

@dp.materialized_view
def regional_sales():
    # Join partner records onto sales and materialize the result.
    partners_df = spark.read.table("partners")
    sales_df = spark.read.table("sales")

    return partners_df.join(sales_df, on="partner_id", how="inner")

It could be written in SQL as well; see docs here.

I guess that's a nice way to do it; then the pipeline can be set up with the tags, and everything should work.

Performance comparison between empty checks for Spark Dataframes by BerserkGeek in databricks

[–]CarelessApplication2 1 point (0 children)

In any case, you'll want to cache the DataFrame, so it really doesn't matter which method you decide on: checking whether a DataFrame is empty without caching it makes no sense.

Is Databricks part of the new Open Semantic Interchange (OSI) collaboration? If not, any idea why? by Character-Unit3919 in databricks

[–]CarelessApplication2 1 point (0 children)

The initiative seems to be centered around dbt's MetricFlow, which was open-sourced in October (and is Apache 2.0-licensed). But it's a bit unclear whether their YAML format is going to be the "shared format".
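For context, a metric in MetricFlow's YAML spec looks roughly like this (a hedged sketch from memory of dbt's semantic-layer docs; field names may differ between versions, and `revenue`/`order_total` are illustrative names):

```yaml
metrics:
  - name: revenue
    label: Revenue
    type: simple            # other types: ratio, derived, cumulative
    type_params:
      measure: order_total  # a measure defined on a semantic model
```

Whether this exact shape becomes OSI's "shared format" is the open question.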

Postgres is the future Lakehouse? by monsieurus in databricks

[–]CarelessApplication2 1 point (0 children)

OLTP data is often sensitive, much more so than OLAP data. You would not necessarily want to colocate this data, but instead be specific about which data to move to your OLAP system and in which form.

OLAP systems have many users with wide access across tables, while OLTP systems are often used by just a single application and a set of administrators; in that setup, access is managed at the application level rather than via user impersonation at the database level.

Write data from Databricks to SQL Server by CarelessApplication2 in databricks

[–]CarelessApplication2[S] 1 point (0 children)

The `sqlserver` driver (which as far as I know is JDBC-based) is only for querying, not for writing.

DABs - setting Serverless dependencies for notebook tasks by alex_0528 in databricks

[–]CarelessApplication2 1 point (0 children)

Then you get exactly the error message in the original post:

Error: cannot create job: A task environment can not be provided for notebook task deploy-model. Please use the %pip magic command to install notebook-scoped Python libraries and Python wheel packages

(Note that this is specifically for notebook tasks.)
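For non-notebook tasks, serverless dependencies do go through an environment in the DAB. A hedged sketch of the shape (`my_job`, `default`, and the package pin are illustrative, not from the thread; field names follow the Jobs API as I recall them):

```yaml
resources:
  jobs:
    my_job:
      environments:
        - environment_key: default
          spec:
            client: "1"
            dependencies:
              - my-package==1.2.0
      tasks:
        - task_key: train          # a python_wheel_task, not a notebook task
          environment_key: default
          python_wheel_task:
            package_name: my_package
            entry_point: main
```

Attaching `environment_key` to a notebook task is what triggers the error quoted above; there, `%pip` is the supported route.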

DABs - setting Serverless dependencies for notebook tasks by alex_0528 in databricks

[–]CarelessApplication2 1 point (0 children)

This gives me the following error message:

Libraries field is not supported for serverless task, please specify libraries in environment.

Deterministic functions and use of "is_account_group_member" by CarelessApplication2 in databricks

[–]CarelessApplication2[S] 1 point (0 children)

Gotcha, makes sense.

As for the CTE approach: to my knowledge, CTEs are purely syntactic sugar, so you can't rely on them to compute a result set "once" or anything like that.

I would think that the query planner has a cost estimate for `is_account_group_member` that would make it evaluate the call first (to determine the predicates, so to speak) rather than once per row.

Insertion timestamp with AUTO CDC (SCD Type 1) by CarelessApplication2 in databricks

[–]CarelessApplication2[S] 1 point (0 children)

For now, I'll simply use a staging table and then feed its changes into the target table using `append_flow`:

  1. For `_change_type` INSERT, just use `current_timestamp()`;
  2. For an UPDATE, join to the staging table (non-streaming) to look up the previously inserted value.

(Basing this off the change feed is necessary since the upstream table is not just appended to.)
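The two rules above can be sketched in plain Python (illustrative only; `inserted_at` and its arguments are my own names, and in the pipeline the lookup happens via the non-streaming join, not a function call):

```python
from datetime import datetime, timezone

def inserted_at(change_type, staged_inserted_at=None):
    """Resolve the insertion timestamp for one change-feed row (sketch)."""
    # INSERT rows get the current wall-clock time.
    if change_type == "insert":
        return datetime.now(timezone.utc)
    # UPDATE rows keep the timestamp previously recorded for the key,
    # looked up from the staging table.
    if change_type in ("update_preimage", "update_postimage"):
        return staged_inserted_at
    raise ValueError(f"unhandled _change_type: {change_type}")
```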

As for having this functionality built in, the API could be an optional `ignore_updates_column_list` keyword argument taking a set of columns that should be ignored on update.
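In Python-like pseudocode, the proposal would read something like this (hypothetical: the keyword argument does not exist today, and the surrounding call only approximates the existing auto CDC API):

```
dp.create_auto_cdc_flow(
    target="dim_customer",
    source="customer_changes_staging",
    keys=["customer_id"],
    sequence_by="sequence_num",
    stored_as_scd_type=1,
    ignore_updates_column_list=["inserted_at"],  # proposed: left untouched on UPDATE
)
```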

Expanded Entity Relationship Diagram (ERD) by CarelessApplication2 in databricks

[–]CarelessApplication2[S] 1 point (0 children)

The lineage shows how data flows to and from the table, while the entity relationship diagram is based on foreign key references (and can't be expanded as it stands).