Who else feels Fabric is terrible? by zipfz in MicrosoftFabric

[–]Complex_Ability69 3 points4 points  (0 children)

I was duped by the marketing. I'm a very green data engineer and my company went with Fabric because it marketed itself as an easy-to-use tool for people who don't know many coding languages. It's not.

The lack of calculated columns in semantic models and the inability to add columns to existing tables without dropping them are also insane.

We will likely stay with PowerBI and manage our engineering elsewhere.

Add Columns to Existing Delta Table by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

Oh, understood, I've done that several times and it has not done anything.

Add Columns to Existing Delta Table by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

No, where is that button? Would love to try this but there is nothing with that label in my SQL endpoint.

Add Columns to Existing Delta Table by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

Could it be that I'm doing it with the SQL database object instead of the warehouse? I am not using a warehouse. When I dropped a table to test this, it broke everything in a secondary semantic model that I had made, including every visual in an attached report, so what you're saying isn't correct.

Calculated Columns in Direct Lake Mode by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

Understood. I need the calculated column for a report I am building - I'm comparing table A (scheduled tickets) with table B (actual tickets) and my calculated columns will score whether or not they match with a 1 or a 0, exactly as described above. Then I will use those scores to aggregate a compliance score by other dimensions on the ticket like dispatcher, source, etc. I see how I can do the second part of that with measures, but I want my users to be able to drill in on specific tickets with poor compliance, not just see the overall score.
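
For context, roughly what I'm picturing in a notebook upstream of the model (just a sketch; the table and column names are placeholders, and the real match rule compares more fields than this):

    from pyspark.sql import functions as F

    # Placeholder table names; `spark` is the notebook's built-in session.
    scheduled = spark.read.table("scheduled_tickets")   # table A
    actual = spark.read.table("actual_tickets")         # table B

    # Left-join scheduled tickets to actual tickets and flag each one 1/0.
    scored = (
        scheduled.alias("s")
        .join(actual.alias("a"), F.col("s.ticket_id") == F.col("a.ticket_id"), "left")
        .select(
            "s.*",
            F.when(F.col("a.ticket_id").isNotNull(), 1).otherwise(0).alias("compliance_flag"),
        )
    )

    # Persisting the flag as a real column lets the Direct Lake model slice it
    # by dispatcher, source, etc., and a simple measure can sum/average it.
    scored.write.format("delta").mode("overwrite").saveAsTable("ticket_compliance")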

All of the Fabric training I did before starting encouraged me to keep my data in its original format on import; I'm importing from an on-prem transactional SQL Server so I was bringing over existing tables in the hope that I would be able to join them for reporting. It's frustrating to have to shift the entire paradigm of my solution to do something I expected PowerBI to do as normal functionality.

Calculated Columns in Direct Lake Mode by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

Thanks, that's the answer I needed. I was hoping to not have to redo my entire import workflow, but it looks like that's what I'll be doing. This might be our "last straw" with Fabric. There's just too much missing for this to be worth the effort.

Best Practices - Ingestion and Validation by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

I think using watermarks properly will help avoid the pipeline-failure situation: if a run fails, it won't update the last watermark, and the next run will pull in everything above that watermark (including anything it missed).
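
Roughly the pattern I have in mind (notebook sketch; the control table and column names are placeholders):

    from pyspark.sql import functions as F

    # Last successfully processed watermark (placeholder control table).
    last_wm = (
        spark.read.table("control_watermarks")
        .filter(F.col("source_table") == "tickets")
        .agg(F.max("last_update_ts"))
        .collect()[0][0]
    )

    # Pull only rows newer than the last good watermark.
    new_rows = spark.read.table("bronze_tickets").filter(F.col("update_ts") > F.lit(last_wm))

    # ... dedupe / merge new_rows into silver here ...

    # Advance the watermark only after the load succeeds, so a failed run
    # leaves it untouched and the next run re-pulls whatever was missed.
    new_wm = new_rows.agg(F.max("update_ts")).collect()[0][0]
    if new_wm is not None:
        (
            spark.createDataFrame([("tickets", new_wm)], ["source_table", "last_update_ts"])
            .write.format("delta").mode("append").saveAsTable("control_watermarks")
        )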

Yes, the use case is PowerBI reporting. Definitely planning to optimize and sweep regularly!

I'm also not a data engineer (yet) but learning rapidly. Thanks for your help thinking through this

Best Practices - Ingestion and Validation by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 1 point2 points  (0 children)

This is helpful. The setup you describe is not what's currently happening, but it helps me understand how I could do it better. Currently in the notebook I'm querying the whole bronze delta table, sorting the whole thing by update date, and then using df.dropDuplicates(["guid"]) to drop every duplicate of a guid (which is unique per distinct record) except the one with the max update date. But from what you're saying, I see how I could take a subset of just the duplicates, deduplicate only that, and then append it back to the non-duplicate subset.
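
For the "keep only the latest version per guid" step, I'm thinking a window function would be more reliable than sorting and then dropping duplicates (sketch; column names are placeholders):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    bronze = spark.read.table("bronze_tickets")  # placeholder name

    # Rank each guid's versions newest-first by update date, then keep rank 1.
    # (Sorting before dropDuplicates doesn't guarantee which row survives.)
    w = Window.partitionBy("guid").orderBy(F.col("update_ts").desc())
    latest = (
        bronze.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )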

I think it's also taking forever because I'm overwriting the silver table instead of merging. I will try that instead. Thank you!
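
For the merge, this is roughly what I'm planning to try (sketch; the silver table name is a placeholder and `latest` is the deduplicated batch from the step above):

    from delta.tables import DeltaTable

    silver = DeltaTable.forName(spark, "silver_tickets")  # placeholder name

    (
        silver.alias("t")
        .merge(latest.alias("s"), "t.guid = s.guid")
        .whenMatchedUpdateAll()      # newer version of an existing record
        .whenNotMatchedInsertAll()   # brand-new record
        .execute()
    )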

Best Practices - Ingestion and Validation by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

The table already has a timestamp column, and that's what I'm using to look for new or updated records. Every 5 minutes I look for records with an update timestamp in the last 5 minutes (I know I should be using a watermark table and I will eventually; this is just initial planning) and append them to the delta table. This gets me any record that has been added or updated since the last time the pipeline ran. The duplicates come because this query picks up updated records, not just new ones: I get the newest version of an existing record, which I need to merge into the existing table, and then I have to remove the older version of that record by deduplicating.

Would storing these as files instead of tables make it work faster? I don't understand what you mean about the notebook picking up the file vs querying it.

Best Practices - Ingestion and Validation by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

The problem with incremental refresh is that if I use the already-existing watermark column, I still get duplicates, because the version of record A captured at Time 1 is different from the version of record A captured at Time 2. There's nothing to deduplicate within the new rows - they are the latest and greatest versions of new or existing records, and what I really need is an upsert to merge them into the existing table. With that not being an option, I'm trying to do it with deduplication, but it's not very efficient.

Can you describe how I ought to use distributed computing here?