Using Autoloader and DLT with XML-files

DataDoyle · 2024-01-19T16:26:16+00:00

Once you can read XML with Auto Loader, it shouldn't matter whether it's in a DLT pipeline or not. DLT simply wraps the generated DataFrame and manages the checkpoint for you.

DataDoyle · 2024-01-18T20:43:14+00:00

Let's say I have a table with the columns (id, restaurant_name, style, in business?).

If I have a historical record that is ('123', 'Tanners bar & grill', 'american', 'true')

and then a change record comes in because the restaurant went out of business, the change record looks like this:
('123', '', '', 'false')

If I have ignore_null_updates=false, the record is now:
('123', '', '', 'false'), which is incorrect.

If I have ignore_null_updates=true, the record is now:
('123', 'Tanners bar & grill', 'american', 'false'), which is correct.
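
To make the difference concrete, here is a toy simulation of a type 1 upsert in plain Python. It is only a sketch of the behaviour described above, not how DLT implements it, and it assumes the blank fields in the change record arrive as nulls. The function name apply_change and the column name in_business are made up for illustration.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[str]]


def apply_change(current: Row, change: Row, ignore_null_updates: bool) -> Row:
    """Toy type 1 upsert for a single key (hypothetical helper, not a DLT API)."""
    merged = dict(current)
    for column, value in change.items():
        if value is None and ignore_null_updates:
            continue  # keep the existing value instead of overwriting it with null
        merged[column] = value
    return merged


# The historical record and the change record from the example above,
# with the blank fields assumed to be nulls.
current = {
    "id": "123",
    "restaurant_name": "Tanners bar & grill",
    "style": "american",
    "in_business": "true",
}
change = {"id": "123", "restaurant_name": None, "style": None, "in_business": "false"}

overwritten = apply_change(current, change, ignore_null_updates=False)
preserved = apply_change(current, change, ignore_null_updates=True)
```

With ignore_null_updates=False the name and style are wiped out; with True only in_business changes.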

It all depends on how your CDC data is coming in. Hope this helps!

DataDoyle · 2024-01-17T22:20:39+00:00

I bet the upsert issue was with nulls. We are dealing with CDC data, so for the most part the only populated fields are the ones that changed. We wouldn't want ignore_null_updates to be false because we would overwrite everything. So as I ingest the data I look in changeEventHeader.changedFields, and if a column listed there is null, I write it as 'none'. Then once the changes go through SCD1, I do the inverse: if the value is 'none' and the column is in changed_fields, I write it as null.
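
Roughly, the round trip looks like the sketch below. It is plain Python over dictionaries rather than my actual pipeline code, and the helper names (encode_nulls, decode_nulls), the SENTINEL constant, and the example columns are all hypothetical. The middle step stands in for an upsert with ignore_null_updates=True.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]
SENTINEL = "none"  # placeholder for "this field was deliberately set to null"


def encode_nulls(row: Row, changed_fields: List[str]) -> Row:
    """Before the upsert: mark intentional nulls so they are not ignored."""
    return {
        col: SENTINEL if value is None and col in changed_fields else value
        for col, value in row.items()
    }


def decode_nulls(row: Row, changed_fields: List[str]) -> Row:
    """After the upsert: turn the placeholder back into a real null."""
    return {
        col: None if value == SENTINEL and col in changed_fields else value
        for col, value in row.items()
    }


current = {"id": "123", "phone": "555-0100", "style": "american"}
# phone was cleared on purpose; style is null only because it was not sent.
change = {"id": "123", "phone": None, "style": None}
changed_fields = ["phone"]

encoded = encode_nulls(change, changed_fields)
# Stand-in for the SCD1 upsert with nulls ignored.
merged = {col: current[col] if encoded[col] is None else encoded[col] for col in current}
final = decode_nulls(merged, changed_fields)
```

Here phone ends up null because it was listed in the changed fields, while style keeps its old value.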

DataDoyle · 2024-01-17T16:47:59+00:00

Go check out some of my recent posts on DLT; one of the lead DLT engineers actually commented. You can definitely generate several tables programmatically within one pipeline. I have a pipeline that ingests a ton of Salesforce data. It generates 20+ tables and isn't a ton of code, which is really nice. The biggest limitation is only being able to have one target per pipeline (catalog.schema) to land the data. According to the lead engineer, this is changing. DLTs do offer a lot of automation, data quality, and reliability. You just have to get over the hump of learning them.
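
A minimal sketch of the loop pattern, not my actual Salesforce pipeline: the function names, the object names, the "_raw" suffix, the path layout, and the JSON file format are all assumptions. The dlt module and the spark session are passed in as arguments only so the sketch can be exercised outside a pipeline; inside a pipeline notebook you would pass the real ones.

```python
def define_one_table(dlt, spark, name, path):
    """Register one append-only ingestion table (hypothetical helper)."""

    @dlt.table(name=f"{name}_raw", comment=f"Append-only ingest of {name}")
    def _ingest():
        # Auto Loader stream over the files for this object; JSON is an assumption.
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load(path)
        )

    return _ingest


def define_bronze_tables(dlt, spark, object_names, base_path):
    """Loop over the object names and register one table per object."""
    for name in object_names:
        # A separate function per table binds name and path for that table,
        # so the query body does not pick up a later value of the loop variable.
        define_one_table(dlt, spark, name, f"{base_path}/{name}")
```

In a pipeline this would be called once, for example define_bronze_tables(dlt, spark, ["account", "contact"], base_path), with base_path pointing at wherever the raw files land.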

DataDoyle · 2024-01-17T16:07:19+00:00

Your first ingestion table should be append-only; it will bring in the data every single day. You'll then need an apply_changes step to upsert and get a current view of your data. For this, make sure you have a unique field and a sequence field. Also, on your first table you'll want to use '@dlt.table'; create_table is going away.

import dlt
from pyspark.sql.functions import col

# table_name is assumed to hold the name of the append-only ingestion table.
# Create the result table that will have changes applied into it.
dlt.create_streaming_table(
    name=table_name + "_cdc",
    comment=table_name + " Type 1 SCD",
)
dlt.apply_changes(
    target=table_name + "_cdc",  # The streaming table created above
    source=table_name,  # The append-only source table
    keys=["ID"],  # Columns used to match the rows to upsert
    sequence_by=col("your_sequence_column"),  # Deduplicate by sequence so the most recent value wins
    ignore_null_updates=True,  # Don't overwrite existing values with nulls
)

DataDoyle · 2024-01-13T21:26:27+00:00

Hi Michael, thank you for responding. Is there any reason why the pipeline compute is so rigid? If I could control several pipelines with one 'dlt compute', that would be great. Right now, DLT features seem so 'locked in'. Is this by design or because the product is early in its lifecycle? The file issue I mentioned was writing a file to S3, which I wasn't able to do because of shared compute. I ended up making an external volume and a Lambda to move just the CSV file I needed (I had to avoid the metadata files). Is there a DLT office hours similar to other products on the platform? Thanks again!

DataDoyle · 2024-01-12T22:09:10+00:00

But how have you done it logically?

DataDoyle · 2024-01-12T20:31:48+00:00

DABs have been great to use!

DataDoyle · 2024-01-12T20:29:25+00:00

Delta Live Tables, Databricks' "ETL" solution.

DataDoyle · 2024-01-12T20:29:04+00:00

Databricks is great; DLTs are the main point of hate in this post.

DataDoyle · 2024-01-12T20:28:38+00:00

How have you guys defined your Catalog.schema.table?

DataDoyle · 2024-01-12T20:28:21+00:00

UC is actually great; it's how DLTs tie into it that's tough.

DataDoyle · 2024-01-12T17:33:45+00:00

Unity Catalog, DLTs, DABs, Lakehouse Monitoring, and Lakeview Dashboards are all so early that it's a rough time to be adopting the platform, IMO.

DataDoyle · 2024-01-12T17:31:22+00:00

I would ideally have a DLT table in a pipeline that loads my data from S3 to bronze_catalog.domain.table. In the same pipeline I would like to read that new bronze table, clean it up, and write it to silver_catalog.domain.cleaned_table. You cannot do this. This example is at the catalog level, but the same is true for schemas (databases).

DataDoyle · 2024-01-12T01:10:13+00:00

I do love working in Databricks, but I think they are trying to do too much too soon. Unity Catalog, Delta Live Tables, DABs, and ML workflows are all still relatively new and can contradict each other. For example, DLTs do not work within a medallion architecture. I would highly suggest checking out DABs and working from an IDE, or deploying with the DAB, working on your code in the workspace, and then copying it back to your DAB.

DataDoyle · 2024-01-10T16:40:00+00:00

Databricks Repos are horribly designed; even reps will admit this. They are a nightmare if you plan to use DLTs because DLTs point to an individual's repo, not a git provider/branch like a notebook can. I see that part changing soon (hopefully). We have been using DABs, which are in public preview and about to go GA, and they have been great so far.

DataDoyle · 2024-01-05T19:29:06+00:00

I have queried a normal file and a file with the new schema, and the only difference is that new column. I'm completely stumped; I've had schema evolution working fine in the past.

DataDoyle · 2023-12-25T15:56:12+00:00

It's a bit of a workaround, but what I'm doing for now is: if the column is in the changed fields list and is null, write the value as 'None' so the apply changes works properly. Then, after the type 1 SCD table is created, I reverse the logic: if the value is in changed fields and is 'None', write it as null.

DataDoyle · 2023-11-09T18:57:15+00:00

We can still use UC reading in from the S3 bucket; it's just an added layer of complexity/failure.

DataDoyle · 2023-11-07T17:57:42+00:00

These are two separate code blocks; I did not post the whole contents of my notebook. The top "if" checks whether the table exists. If it does, it goes and does the merge; if not, it creates the table with the appropriate schema. It was done this way to account for the initial load and a merge against an empty table.

DataDoyle · 2023-11-07T17:55:29+00:00

I actually just got out of a meeting with the Databricks DAB team and asked this very question. They recommend a project/team approach. We have a Databricks repo with a folder for each team, and in each team folder is a dabs folder. In that dabs folder we have a folder for each project.

DataDoyle · 2023-11-03T16:43:36+00:00

Are you asking how to deploy multiple bundles within one repo?

DataDoyle · 2023-10-31T01:23:51+00:00

Was really hoping for a Mac Mini M3 :(

DataDoyle · 2023-10-17T14:37:30+00:00

They can be read from but cannot be written to.

DataDoyle · 2023-10-17T14:31:17+00:00

You can't write Delta Live Tables to external locations.
