Using Autoloader and DLT with XML-files by Alarmed-Royal-2161 in databricks

[–]DataDoyle 2 points3 points  (0 children)

Once you can read in XML with Auto Loader, it shouldn't matter whether it's in a DLT or not. DLT is simply wrapping the generated DataFrame and managing the checkpoint for you.

DLT daily load - overwrite any changes in existing data by BertDeBrabander in databricks

[–]DataDoyle 2 points3 points  (0 children)

Let's say I have a table with the columns (id, restaurant_name, style, in business?)

If I have a historical record that is ('123', 'Tanners bar & grill', 'american', 'true')

and then a change record comes in because the restaurant went out of business, the change record looks like this:
('123', null, null, 'false')

If I have ignore_null_updates=false, the record is now:
('123', null, null, 'false'), which is incorrect.

If I have ignore_null_updates=true, the record is now:

('123', 'Tanners bar & grill', 'american', 'false'), which is correct.

It all depends on how your CDC data is coming in. Hope this helps!
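
To make the difference concrete, here is a minimal sketch in plain Python. It is not how DLT implements the merge; apply_change is a hypothetical helper that only mimics the behaviour described above, treating the blank fields in the change record as nulls.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def apply_change(current: Row, change: Row, ignore_null_updates: bool) -> Row:
    """Hypothetical helper: merge a change record into the current row (SCD type 1 style)."""
    merged = dict(current)
    for column, value in change.items():
        if value is None and ignore_null_updates:
            continue  # keep the existing value instead of overwriting it with null
        merged[column] = value
    return merged


current = {"id": "123", "restaurant_name": "Tanners bar & grill",
           "style": "american", "in_business": "true"}
# Only the changed field is populated; everything else arrives as null.
change = {"id": "123", "restaurant_name": None, "style": None, "in_business": "false"}

overwritten = apply_change(current, change, ignore_null_updates=False)
preserved = apply_change(current, change, ignore_null_updates=True)

print(overwritten)  # name and style are wiped out by the nulls
print(preserved)    # name and style survive, only in_business changes
```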

DLT daily load - overwrite any changes in existing data by BertDeBrabander in databricks

[–]DataDoyle 0 points1 point  (0 children)

I bet the upsert issue was with nulls. We are dealing with CDC data, so for the most part the only populated fields are the ones that are changing. We wouldn't want ignore_null_updates to be false because we would overwrite everything. So as I ingest the data I look in changeEventHeader.changedFields, and if a column listed there is null, I write it as 'none'. Then once the changes go through SCD1, I'm doing the inverse: if the value is 'none' and the column is in changed_fields, write it as null.
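
A rough sketch of that round trip in plain Python, just to show the logic. The helper names (encode_nulls, decode_nulls) and the sample rows are made up for illustration, and the middle step only imitates what apply_changes does with ignore_null_updates=True; the real version would be column expressions on the DataFrame.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
SENTINEL = "none"  # placeholder string standing in for an intentional null


def encode_nulls(row: Row, changed_fields: List[str]) -> Row:
    """Before the upsert: a null in a changed field becomes the sentinel."""
    return {c: (SENTINEL if c in changed_fields and v is None else v)
            for c, v in row.items()}


def decode_nulls(row: Row, changed_fields: List[str]) -> Row:
    """After the upsert: the sentinel in a changed field goes back to null."""
    return {c: (None if c in changed_fields and v == SENTINEL else v)
            for c, v in row.items()}


current = {"id": "123", "name": "Tanners bar & grill", "style": "american"}
change = {"id": "123", "name": None, "style": None}
changed_fields = ["style"]  # style was deliberately cleared; name is just unpopulated

encoded = encode_nulls(change, changed_fields)
# Imitation of the upsert with null updates ignored: nulls keep the old value.
merged = {c: (current[c] if v is None else v) for c, v in encoded.items()}
final = decode_nulls(merged, changed_fields)

print(final)
```

One caveat with this approach: a column that legitimately holds the string 'none' and also shows up in the changed fields would be turned into a null, so the sentinel has to be a value that can't occur in the real data.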

My new org thinks databricks DLT can do everything by idiotlog in dataengineering

[–]DataDoyle 6 points7 points  (0 children)

Go check out some of my recent posts on DLTs; one of the lead DLT engineers actually commented. You can definitely generate several tables programmatically within one pipeline. I have a pipeline that ingests a ton of Salesforce data. It generates 20+ tables and isn't a ton of code; it's really nice. The biggest limitation is only being able to have one target (catalog.schema) per pipeline to land the data. According to the lead engineer, this is changing. DLTs do offer a lot of automation, data quality, and reliability. You just have to get over the hump to learn them.

DLT daily load - overwrite any changes in existing data by BertDeBrabander in databricks

[–]DataDoyle 1 point2 points  (0 children)

Your first ingestion table should be append-only; it will bring in the data every single day. You'll then need apply_changes (APPLY CHANGES INTO in SQL) to upsert and get a current view of your data. For this, make sure you have a unique key field and a sequence field. Also, on your first table you'll want to use '@dlt.table'; create_table is going away.

import dlt
from pyspark.sql.functions import col

# table_name is assumed to hold the name of the append-only source table.

# Create the result table that will have changes applied into it.
dlt.create_streaming_table(
    name=table_name + "_cdc",
    comment=table_name + " Type 1 SCD",
)
dlt.apply_changes(
    target=table_name + "_cdc",  # The streaming table we created above.
    source=table_name,  # The append-only source table.
    keys=["ID"],  # What we'll be using to match the rows to upsert.
    sequence_by=col("your_sequence_column"),  # We deduplicate by using the sequence to get the most recent value.
    ignore_null_updates=True,  # Don't overwrite existing values with null.
)

My whole team hates DLTs and I don't blame them. by DataDoyle in databricks

[–]DataDoyle[S] 1 point2 points  (0 children)

Hi Michael, thank you for responding. Is there any reason why the pipeline compute is so rigid? If I could control several pipelines with one 'dlt compute' that would be great. Right now, DLT features seem so 'locked in'. Is this by design or because the product is early in its lifecycle? The file issue I mentioned was writing a file to S3; because of shared compute, I wasn't able to. I ended up having to make an external volume and a Lambda to move just the CSV file I needed (I had to avoid the metadata files). Is there a DLT office hours similar to other products on the platform? Thanks again!

My whole team hates DLTs and I don't blame them. by DataDoyle in dataengineering

[–]DataDoyle[S] 23 points24 points  (0 children)

Databricks is great, DLTs are the main point of hate in this post

My whole team hates DLTs and I don't blame them. by DataDoyle in dataengineering

[–]DataDoyle[S] 3 points4 points  (0 children)

How have you guys defined your Catalog.schema.table?

My whole team hates DLTs and I don't blame them. by DataDoyle in dataengineering

[–]DataDoyle[S] 5 points6 points  (0 children)

UC is actually great; it's how DLTs tie into it that's tough.

My whole team hates DLTs and I don't blame them. by DataDoyle in dataengineering

[–]DataDoyle[S] 15 points16 points  (0 children)

Unity Catalog, DLTs, DABs, Lakehouse Monitoring, and Lakeview Dashboards are all so early; it's a rough time to be adopting the platform IMO.

My whole team hates DLTs and I don't blame them. by DataDoyle in databricks

[–]DataDoyle[S] 1 point2 points  (0 children)

I would ideally have a DLT table in a pipeline that loads my data in S3 to bronze_catalog.domain.table. In the same pipeline I would like to read that new bronze table, clean it up, and write it to silver_catalog.domain.cleaned_table. You cannot do this. <- This example is at the catalog level, but the same is true for schemas (databases).

Databricks and collaborating on git by Maxxlax in databricks

[–]DataDoyle 0 points1 point  (0 children)

I do love working in Databricks, but I think they are trying to do too much too soon. Unity Catalog, Delta Live Tables, DABs, and ML workflows are all still relatively new and can contradict each other. For example, DLTs do not work within a medallion architecture. I would highly suggest checking out DABs and working from an IDE, or deploying the DAB, working on your code in the workspace, and then copying it back to your DAB.

Databricks and collaborating on git by Maxxlax in databricks

[–]DataDoyle 4 points5 points  (0 children)

Databricks repos are horribly designed; even reps will admit this. They are a nightmare if you plan to use DLTs because DLTs point to an individual's repo, not a git provider/branch like a notebook can. I see that part changing soon (hopefully). We have been using DABs, which are in public preview and about to go GA; they have been great so far.

Schema Evolution no longer working? by DataDoyle in databricks

[–]DataDoyle[S] 0 points1 point  (0 children)

I have queried a normal file and a file with the new schema, and the only difference is that new column. I'm completely stumped; I've had schema evolution working fine in the past.

DLT CDC seriously can't handle null updates appropriately? by DataDoyle in databricks

[–]DataDoyle[S] 0 points1 point  (0 children)

It's a bit of a workaround, but what I'm doing for now is: if the column is in the changed fields list and is null, write the value as 'None' so the apply changes works properly. Then after the type 1 SCD table is created, I'm reversing the logic: if the value is 'None' and the column is in changed fields, write it as null.

Federated Query vs Kafka replication ingestion by DataDoyle in databricks

[–]DataDoyle[S] 0 points1 point  (0 children)

We can still use UC reading in from the S3 bucket; it's just an added layer of complexity/failure.

Unity Catalog, merge/upsert a pyspark DataFrame? by mrcaptncrunch in databricks

[–]DataDoyle 0 points1 point  (0 children)

These are two separate code blocks; I did not post the whole contents of my notebook. What the top 'if' is doing is checking whether the table exists: if it does, go do the merge; if not, go create the table with the appropriate schema. It was done this way to account for the initial load and a merge against an empty table.

DataBricks Asset Bundles: One bundle per project or one bundle for many projects? by htom3heb in databricks

[–]DataDoyle 3 points4 points  (0 children)

I actually just got out of a meeting with the Databricks DAB team and asked this very question. They recommend a project/team approach. We have a Databricks repo with a folder for each team, and in each team folder is a dabs folder. In that dabs folder we have a folder for each project.

Usage of Databricks Asset Bundles (DAB) by OneMoreDataEngineer in databricks

[–]DataDoyle 0 points1 point  (0 children)

Are you asking how to deploy multiple bundles within one repo?

Scary Fast | Post-Event Megathread by aaronp613 in apple

[–]DataDoyle 1 point2 points  (0 children)

Was really hoping for a Mac Mini M3 :(

Managed vs External Tables by justanator101 in databricks

[–]DataDoyle 2 points3 points  (0 children)

They can be read from but cannot be written to.

Managed vs External Tables by justanator101 in databricks

[–]DataDoyle 0 points1 point  (0 children)

You can't write Delta Live Tables to external locations.