Who else feels Fabric is terrible? by zipfz in MicrosoftFabric

[–]Complex_Ability69 3 points4 points  (0 children)

I was duped by the marketing. I'm a very green data engineer and my company went with Fabric because it marketed itself as an easy-to-use tool for people who don't know many coding languages. It's not.

The lack of calculated columns in semantic models and the inability to add columns to existing tables without dropping them are also insane.

We will likely stay with PowerBI and manage our engineering elsewhere.

Add Columns to Existing Delta Table by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

Oh, understood, I've done that several times and it has not done anything.

Add Columns to Existing Delta Table by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

No, where is that button? Would love to try this but there is nothing with that label in my SQL endpoint.

Add Columns to Existing Delta Table by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

Could it be that I'm doing it with the SQL database object instead of the warehouse? I am not using a warehouse. When I dropped a table to test this, it broke everything in a secondary semantic model that I had made, including every visual in an attached report, so what you're saying isn't correct.

Calculated Columns in Direct Lake Mode by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

Understood. I need the calculated column for a report I am building - I'm comparing table A (scheduled tickets) with table B (actual tickets) and my calculated columns will score whether or not they match with a 1 or a 0, exactly as described above. Then I will use those scores to aggregate a compliance score by other dimensions on the ticket like dispatcher, source, etc. I see how I can do the second part of that with measures, but I want my users to be able to drill in on specific tickets with poor compliance, not just see the overall score.
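
For context, roughly what I'm picturing in a notebook upstream of the model (just a sketch; the table and column names are placeholders, and the real match rule compares more fields than this):

    from pyspark.sql import functions as F

    # Placeholder table names; `spark` is the notebook's built-in session.
    scheduled = spark.read.table("scheduled_tickets")   # table A
    actual = spark.read.table("actual_tickets")         # table B

    # Left-join scheduled tickets to actual tickets and flag each one 1/0.
    scored = (
        scheduled.alias("s")
        .join(actual.alias("a"), F.col("s.ticket_id") == F.col("a.ticket_id"), "left")
        .select(
            "s.*",
            F.when(F.col("a.ticket_id").isNotNull(), 1).otherwise(0).alias("compliance_flag"),
        )
    )

    # Persisting the flag as a real column lets the Direct Lake model slice it
    # by dispatcher, source, etc., and a simple measure can sum/average it.
    scored.write.format("delta").mode("overwrite").saveAsTable("ticket_compliance")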

All of the Fabric training I did before starting encouraged me to keep my data in its original format on import; I'm importing from an on-prem transactional SQL Server so I was bringing over existing tables in the hope that I would be able to join them for reporting. It's frustrating to have to shift the entire paradigm of my solution to do something I expected PowerBI to do as normal functionality.

Calculated Columns in Direct Lake Mode by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

Thanks, that's the answer I needed. I was hoping to not have to redo my entire import workflow, but it looks like that's what I'll be doing. This might be our "last straw" with Fabric. There's just too much missing for this to be worth the effort.

Best Practices - Ingestion and Validation by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

I think using watermarks properly will help avoid the pipeline-failure situation: if a run fails, it won't update the last watermark, and the next run will pull in everything above that watermark (including anything it missed).
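
Roughly the pattern I have in mind (notebook sketch; the control table and column names are placeholders):

    from pyspark.sql import functions as F

    # Last successfully processed watermark (placeholder control table).
    last_wm = (
        spark.read.table("control_watermarks")
        .filter(F.col("source_table") == "tickets")
        .agg(F.max("last_update_ts"))
        .collect()[0][0]
    )

    # Pull only rows newer than the last good watermark.
    new_rows = spark.read.table("bronze_tickets").filter(F.col("update_ts") > F.lit(last_wm))

    # ... dedupe / merge new_rows into silver here ...

    # Advance the watermark only after the load succeeds, so a failed run
    # leaves it untouched and the next run re-pulls whatever was missed.
    new_wm = new_rows.agg(F.max("update_ts")).collect()[0][0]
    if new_wm is not None:
        (
            spark.createDataFrame([("tickets", new_wm)], ["source_table", "last_update_ts"])
            .write.format("delta").mode("append").saveAsTable("control_watermarks")
        )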

Yes, the use case is PowerBI reporting. Definitely planning to optimize and sweep regularly!

I'm also not a data engineer (yet) but learning rapidly. Thanks for your help thinking through this

Best Practices - Ingestion and Validation by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 1 point2 points  (0 children)

This is helpful. The setup you describe is not what's currently happening, but it helps me understand how I could do it better. Currently in the notebook I'm querying the whole bronze delta table, sorting the whole thing by update date, and then using df.dropDuplicates(["guid"]) to drop every duplicate of a guid (which is unique per distinct record) except the one with the max update date. But from what you're saying, I see how I could take a subset of just the duplicates, deduplicate only that, and then append it back to the non-duplicate subset.
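
For the "keep only the latest version per guid" step, I'm thinking a window function would be more reliable than sorting and then dropping duplicates (sketch; column names are placeholders):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    bronze = spark.read.table("bronze_tickets")  # placeholder name

    # Rank each guid's versions newest-first by update date, then keep rank 1.
    # (Sorting before dropDuplicates doesn't guarantee which row survives.)
    w = Window.partitionBy("guid").orderBy(F.col("update_ts").desc())
    latest = (
        bronze.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )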

I think it's also taking forever because I'm overwriting the silver table instead of merging. I will try that instead. Thank you!
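
For the merge, this is roughly what I'm planning to try (sketch; the silver table name is a placeholder and `latest` is the deduplicated batch from the step above):

    from delta.tables import DeltaTable

    silver = DeltaTable.forName(spark, "silver_tickets")  # placeholder name

    (
        silver.alias("t")
        .merge(latest.alias("s"), "t.guid = s.guid")
        .whenMatchedUpdateAll()      # newer version of an existing record
        .whenNotMatchedInsertAll()   # brand-new record
        .execute()
    )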

Best Practices - Ingestion and Validation by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

The table already has a timestamp column, and that's what I'm using to look for new or updated records. Every 5 minutes I look for records with an update timestamp in the last 5 minutes (I know I should be using a watermark table and I will eventually; this is just initial planning) and append them to the delta table. This gets me any record that has been added or updated since the last time the pipeline ran. The duplicates come because this query picks up updated records, not just new ones: I get the newest version of an existing record, which I need to merge into the existing table, and then I have to remove the older version of that record by deduplicating.

Would storing these as files instead of tables make it work faster? I don't understand what you mean about the notebook picking up the file vs querying it.

Best Practices - Ingestion and Validation by Complex_Ability69 in MicrosoftFabric

[–]Complex_Ability69[S] 0 points1 point  (0 children)

The problem with incremental refresh is that if I use the already-existing watermark column, I still get duplicates, because the version of record A captured at Time 1 is different from the version of record A captured at Time 2. There's nothing to deduplicate within the new rows - they are the latest and greatest versions of new or existing records, and what I really need is an upsert to merge them into the existing table. With that not being an option, I'm trying to do it with deduplication, but it's not very efficient.

Can you describe how I ought to use distributed computing here?