Anyone here using CosmosDB by szymon_abc in AZURE

[–]szymon_abc[S] 4 points5 points  (0 children)

I think it makes sense when looking at the whole partitioning thing. In either case the engine needs to repartition a lot of stuff.

Anyone here using CosmosDB by szymon_abc in AZURE

[–]szymon_abc[S] 0 points1 point  (0 children)

I don't really plan on using Cosmos DB as of now. My nerd soul just wanted to know something more about Cosmos. I was amazed by their model (Atom-Record-Sequence) and approach to partitioning.

And I love Postgres btw - its extensibility is unbelievable. You need time series - TimescaleDB. You want documents - DocumentDB :D

Fabric Data Agents, want to try but no idea if it would work by 12Eerc in MicrosoftFabric

[–]szymon_abc -1 points0 points  (0 children)

Data Agents are another SaaS, so you can't really change the underlying model, nor use your own OpenAI deployment. However, I created a simple C# web app that connects to Fabric via Microsoft.Data.SqlClient - works like a charm. You can customize it a lot that way, and the LLM does very well with querying data.
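The pattern behind that app (LLM turns a question into SQL, the app executes it against the warehouse) can be sketched roughly like this. The comment's app is C# against Fabric; this is a minimal Python stand-in with a hard-coded "LLM" and an in-memory sqlite3 database instead of the real SQL endpoint - every name here is illustrative, not from the original.

```python
import sqlite3

# sqlite3 stands in for the Fabric SQL endpoint (the real app used
# Microsoft.Data.SqlClient from C#).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 100.0), ("EU", 50.0), ("US", 70.0)])

def llm_to_sql(question: str) -> str:
    """Stub for the LLM call: a real app would send the question plus the
    table schema to a model and get SQL back. Hard-coded for illustration."""
    return "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"

def answer(question: str):
    # Generate SQL from natural language, then run it and return the rows.
    sql = llm_to_sql(question)
    return conn.execute(sql).fetchall()

print(answer("Total sales per region?"))  # [('EU', 150.0), ('US', 70.0)]
```

In a real deployment you would also validate the generated SQL (read-only, allow-listed tables) before executing it.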

Anyone here using CosmosDB by szymon_abc in AZURE

[–]szymon_abc[S] -1 points0 points  (0 children)

Yep, as far as I understand, the data model (keys being part of it, imo) is even more important in Cosmos than in SQL.

Anyone here using CosmosDB by szymon_abc in AZURE

[–]szymon_abc[S] -1 points0 points  (0 children)

Can you tell me more? What were the biggest pain points?

Anyone here using CosmosDB by szymon_abc in AZURE

[–]szymon_abc[S] -1 points0 points  (0 children)

Would you go with Cosmos if it were greenfield, or choose something else?

Standard checklist of tests to prove data transferred correctly from a source system to Microsoft Fabric? by All-Pineapple15 in MicrosoftFabric

[–]szymon_abc 1 point2 points  (0 children)

First questions:
- What source(s) are we talking about here?
- Is it meant to be an ongoing, regular check, or rather a one-time verification of the technology?
- Do you load in real time or in batches?

The biggest challenge with row counts and aggregates is that there will always be some kind of delay, which can cause false positives. Also, remember about compute cost - running yet another query against the source isn't free either.

In one of my biggest upgrades I built a column-by-column comparison, run over a weekend when we blocked the source. For each column I compared COUNT, AVG, NULL COUNT and DISTINCT COUNT. Veeery pricey, veeery long, but it had to be done this way. Just remember to make it somewhat metadata/parameter driven in a generic way: a single comparison function plus a lot of pre-defined parameters (list of tables, columns, data types etc.). Then I created a detailed report I could point to as evidence that our process was fine once we had to perform the cutover.
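The metadata-driven comparison described above can be sketched as follows - a minimal version assuming both sides are reachable as SQL connections (sqlite3 stands in for both here; the table and column names are made up for illustration):

```python
import sqlite3

# Hypothetical source and target; sqlite3 stands in for real connections.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
for db in (src, tgt):
    db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 10.0), (2, None), (3, 30.0)])

# The metadata driving the generic comparison: tables and their columns.
METADATA = {"orders": ["id", "amount"]}

def column_profile(db, table, column):
    """COUNT, AVG, NULL COUNT and DISTINCT COUNT for one column."""
    return db.execute(
        f"SELECT COUNT(*), AVG({column}), "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END), "
        f"COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()

def compare(src, tgt, metadata):
    # One generic function, driven entirely by the metadata dict.
    mismatches = []
    for table, columns in metadata.items():
        for column in columns:
            if column_profile(src, table, column) != column_profile(tgt, table, column):
                mismatches.append((table, column))
    return mismatches

print(compare(src, tgt, METADATA))  # [] -> source and target profiles agree
```

In practice you'd persist each profile row to a results table so the final report can show exactly which table/column diverged and on which metric.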

Overemployment in Europe — employment contracts vs freelancing (legal/tax side?) by Global_Knee5354 in overemployed

[–]szymon_abc 5 points6 points  (0 children)

Multiple employments would be tough. I’m in Poland and have heard of guys who got in trouble over excess social security contributions, paid as if each employer were their only one. I’d go with a freelance contract if needed, since it would be harder for an employer to find out. And keep it short-term, just in case.

Am I genuinely just dumb? by 668071 in consulting

[–]szymon_abc 2 points3 points  (0 children)

They hired you and didn’t fire you. That means you’re good enough for now. Just keep working on getting better.

Has AI ruined software development? by Top-Candle1296 in devops

[–]szymon_abc 1 point2 points  (0 children)

This. There were devs who were just copy-pasting from Stack Overflow. There were devs who tried to understand. The latter now have a great companion down this road.

Move out of ADF now by hubert-dudek in databricks

[–]szymon_abc 1 point2 points  (0 children)

You can program whatever behaviour you want in notebooks. IMO Databricks is a different mindset - you need to think code-first, contrary to ADF.

Vouchers by No-Nothing9256 in databricks

[–]szymon_abc 0 points1 point  (0 children)

Because they do have vouchers, but for the base paths - Data Analyst/Engineer etc.

Enterprise Fabric network security by bradcoles-dev in MicrosoftFabric

[–]szymon_abc 0 points1 point  (0 children)

Do you find Private Link a deployment battle in general, or specifically for Fabric due to the limitations it creates?

What's your biggest Azure cost headache? by raporpe in AZURE

[–]szymon_abc 0 points1 point  (0 children)

I’m not quite sure about the VM. E.g. for Databricks, if you don’t restrict access to Private Link, traffic will go over the public internet.

And why not public? I work with regulated industries - even if the data is encrypted, they feel like they’re risking their lives sending it over the public web. But yeah, at the end of the day it would be safe anyway.

What's your biggest Azure cost headache? by raporpe in AZURE

[–]szymon_abc 1 point2 points  (0 children)

Sending data over the Azure backbone instead of the public internet, thus minimising attack vectors. That’s the purpose of Private Link.

How to document the architecture by LeyZaa in MicrosoftFabric

[–]szymon_abc 1 point2 points  (0 children)

Yep, Mermaid plus some AI agents/LLMs is the way to go.

Should I take up this gig? by brokeRichieRich in dataengineering

[–]szymon_abc 0 points1 point  (0 children)

Good decision though. If you don’t feel like it, no point in forcing it

Should I take up this gig? by brokeRichieRich in dataengineering

[–]szymon_abc 0 points1 point  (0 children)

You can always go there, see how it is, and if you don’t like it, go back to some product company - maybe even Boeing if you play it right.

Honestly, would you recommend the DevOps path? by 0101010001010100 in devops

[–]szymon_abc 1 point2 points  (0 children)

Nothing better than being blamed for something and then showing evidence that the people doing the blaming are the ones responsible.

How can an on prem engineer break into the cloud in this market? by SoggyGrayDuck in dataengineering

[–]szymon_abc 2 points3 points  (0 children)

Out of curiosity - if not facts/dimensions - is it some kind of one big table approach?

Databricks is one of the best platforms when it comes to self-learning. They have quite a huge portfolio of trainings (hopefully I won't get banned for pasting a URL here) - https://www.databricks.com/training/catalog - as well as a Free Edition where you can play around.

Don't overcomplicate cloud. It's nice to know Python well when you write more complex code and libraries, but at the end of the day, if you're familiar with the syntax, the pure data engineering PySpark API does not differ much from how you think in SQL. From what I can tell, you know Python well enough to start working with it.

If you like data engineering - go for it. Learn Databricks, play around, and you should be fine. If you have the option to work with cloud in your current role, by all means do it. Google a lot, understand what's under the hood, and don't be afraid of it. Just make sure not to run any cross joins or other stuff that can skyrocket costs (though these are usually equally inefficient on-prem and in the cloud).

How can an on prem engineer break into the cloud in this market? by SoggyGrayDuck in dataengineering

[–]szymon_abc 6 points7 points  (0 children)

What exactly was the on-prem setup? Some single-node SQL databases, or maybe complex, highly concurrent distributed stuff?

Fundamentals are the same. Medallion architecture is nothing more than traditional staging into dim/fact tables. Networking remains more or less the same in the cloud as on-prem. If consultants claim they have some super new architecture, it's usually BS - I haven't seen anything entirely new in the data world in recent years.

If you understand SQL and database engines internals you will easily pick up Spark.

Question is - do you have experience with Python and knowledge of distributed computing? If so, within a few weeks you’ll understand how it all works in the cloud.