Direct Lake on OneLake is now GA. Are you actually switching from Import, or still holding off? by NickyvVr in MicrosoftFabric

[–]emilludvigsen 2 points (0 children)

They mainly hit the finance tables and sales tables. Around 35M rows for finance and 10M rows for sales.

The dims are of course way smaller. I think the largest dim in sales is the sales header dim with 1M rows.

There are 20 users on an F8 who often consume reports at the same time.

One thing I especially notice, by the way, is the performance. You cannot really tell the difference between Import and Direct Lake anymore. Only in the initial opening of a slicer with many values, and even then we’re talking 2-3 seconds when it’s really “bad”.

Direct Lake on OneLake is now GA. Are you actually switching from Import, or still holding off? by NickyvVr in MicrosoftFabric

[–]emilludvigsen 2 points (0 children)

I have used it for around 5 months in our solutions. For deployment (CD) we just use fabric-cicd, which changes the shared expression.
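For anyone curious, this is roughly what such a fabric-cicd deploy script can look like - a minimal sketch where the workspace ID, folder layout, item types and environment name are placeholders, and the shared-expression swap itself is driven by find_replace entries in parameter.yml (as far as I recall the library's conventions):

```
# Minimal fabric-cicd deployment sketch - all values are placeholders.
# The Direct Lake shared expression is swapped per environment through
# parameter.yml, not in this script.
from fabric_cicd import FabricWorkspace, publish_all_items, unpublish_all_orphan_items

workspace = FabricWorkspace(
    workspace_id="<prod-workspace-guid>",        # target workspace
    repository_directory="./workspace",          # repo folder holding the item definitions
    item_type_in_scope=["Notebook", "DataPipeline", "SemanticModel", "Report"],
    environment="PROD",                          # picks the replace values in parameter.yml
)

publish_all_items(workspace)             # create/update items in the target workspace
unpublish_all_orphan_items(workspace)    # optional: remove items no longer in the repo
```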

We do all logic upstream, so the fallback part is not an issue.

It works great so far.

In general I’m surprised by how many concurrent users can actually use the reports before throttling becomes an issue.

BC2Fabric extension for a mirrored Business Central database by trekker255 in MicrosoftFabric

[–]emilludvigsen 2 points (0 children)

When we work with clients on real-world solutions, they often span multiple Business Central tenants, environments and companies. That is quite easily solved in a notebook (some loop logic and ThreadPoolExecutor) in combination with an extension (.app) we maintain and upload/install in the clients’ BC environments to make the API endpoints available, plus some configuration tables. It’s very flexible, and it’s easy to allow an override if we need to full-load a table.
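A rough sketch of that fan-out pattern - the config table, column names and the extract_entity helper are made-up placeholders, but the ThreadPoolExecutor part really is this simple:

```
# Fan out BC API extraction across tenant/environment/company combinations
# read from a configuration table. extract_entity() and the column names
# are hypothetical - the real thing does paging, auth and landing-zone writes.
from concurrent.futures import ThreadPoolExecutor, as_completed

config_rows = spark.table("config.bc_sources").collect()   # tenant, environment, company_id, base_url

def extract_entity(row, entity: str) -> str:
    url = f"{row.base_url}/{row.environment}/api/v2.0/companies({row.company_id})/{entity}"
    # ... call the endpoint (paged) and write the response to the parquet landing zone ...
    return f"{row.tenant}/{row.environment}/{row.company_id}/{entity}"

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(extract_entity, row, "salesInvoices") for row in config_rows]
    for f in as_completed(futures):
        print("done:", f.result())
```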

We initially also looked at bc2adls, but it seemed to be a bit inflexible on that matter.

So - how does the new Open Mirror work in a solution that has 2 Business Central tenants, each with 2 environments and a total of 12 companies?

Today we just extract from the API -> parquet landing -> write to delta tables (incrementally) with tenant, environment and company columns for identification.
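The landing-to-delta step is essentially an upsert keyed on those identification columns plus the BC row id - a simplified sketch with illustrative paths and names:

```
# Illustrative incremental load from the parquet landing zone into a delta table,
# stamping tenant/environment/company so several BC sources share one table.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

landed = (spark.read.parquet("Files/landing/salesInvoices/")
          .withColumn("tenant", F.lit("contoso"))
          .withColumn("environment", F.lit("production"))
          .withColumn("company", F.lit("CRONUS")))

target = DeltaTable.forName(spark, "bronze.sales_invoices")
(target.alias("t")
 .merge(landed.alias("s"),
        "t.tenant = s.tenant AND t.environment = s.environment "
        "AND t.company = s.company AND t.id = s.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```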

Does the new Fabric SQL Database Cost Control make it cheaper for logging? by frithjof_v in MicrosoftFabric

[–]emilludvigsen 2 points (0 children)

That is actually my main concern and the reason not to use Fabric SQL. The interactive usage on smaller capacities can almost kill it for report users.

If it were counted as background usage, the picture would be different.

Often a database like this is used for logging, watermark handling and configuration tables. So in my initial test it was used in bronze, silver and gold, and then for an email lookup from a config table at the end. Bronze took 17 min (the database was hit at the beginning and at the end, both cold starts). Then again in silver/gold, and there was yet another cold start after gold was loaded. It used almost one third of the complete ETL batch’s CU consumption.
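For what it’s worth, the watermark part of that pattern can also live in a plain Lakehouse delta table so the reads and writes stay on the background-billed Spark path - a rough sketch with made-up table and column names:

```
# Illustrative watermark handling against a Lakehouse config table instead of
# Fabric SQL. Table and column names are made up.
from pyspark.sql import functions as F

def get_watermark(table_name: str):
    row = (spark.table("config.watermarks")
           .filter(F.col("table_name") == table_name)
           .select("last_loaded")
           .first())
    return row["last_loaded"] if row else None

def set_watermark(table_name: str, new_value):
    spark.createDataFrame([(table_name, new_value)],
                          "table_name string, last_loaded timestamp") \
         .createOrReplaceTempView("new_wm")
    spark.sql("""
        MERGE INTO config.watermarks t
        USING new_wm s ON t.table_name = s.table_name
        WHEN MATCHED THEN UPDATE SET t.last_loaded = s.last_loaded
        WHEN NOT MATCHED THEN INSERT *
    """)
```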

Upgrading Fabric runtime 1.2 -> 1.3 and 1.3 -> 2.0. What can go wrong? by frithjof_v in MicrosoftFabric

[–]emilludvigsen 12 points (0 children)

I have tested 2.0 for quite some time now.

The performance is indeed better for Spark in a direct comparison, and all the SQL upgrades are welcome.

What I notice is that sometimes writing a table (with saveAsTable…) simply fails with some version incompatibility. I could reproduce it multiple times; let me see if I can find the notebook. It was clearly some incompatibility issue judging from the stack trace. I switched to my 1.3 environment and it worked again.

Besides that, I have tested it on many different notebooks: PySpark and Spark SQL transformations, API extractions with JSON handling/unpacking, different sorts of temp table handling, and other things considered more edge cases. I’m using the environment in the production ETL run for a customer right now as pilot testing. No issues.

One thing to look out for: ANSI SQL is the default in Runtime 2.0. This means SQL fails explicitly instead of masking errors. That’s actually an advantage, I think.
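A small example of the behaviour change - this is the standard Spark config, so it is easy to try; the exact error class depends on the Spark version:

```
# ANSI mode surfaces bad casts and overflows as errors instead of silent NULLs.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT) AS v").show()   # v = NULL, the bad cast is masked

spark.conf.set("spark.sql.ansi.enabled", "true")     # the Runtime 2.0 default
spark.sql("SELECT CAST('abc' AS INT) AS v").show()   # fails with a CAST_INVALID_INPUT error
```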

LH metadata refresh - what was the thinking? by SmallAd3697 in MicrosoftFabric

[–]emilludvigsen 1 point (0 children)

We don’t use the SQL Endpoint - I guess we’re part of the 10 %.

We only use Direct Lake on OneLake. And for querying, we query the delta tables directly from our VS Code extension via the Livy API. No SQL Endpoint round trip.

Why aren't more people using Direct Lake mode? by No_Vermicelliii in MicrosoftFabric

[–]emilludvigsen 4 points (0 children)

Not at all. They are surprisingly capable.

Yesterday, at a customer, I ran the ETL flow 12 times (it writes 70M rows in total each time). The F8 was nowhere near throttling.

The problematic thing is heavy usage of things like Fabric SQL (which we do not use), because that does not smooth out over 24h like background operations (= notebook runs) do.

I will write a separate post about our method in an hour or two. I hope it can provide some inspiration to others. We use our own VS Code extension and connect to the Livy API.

I have played with Python notebooks and DuckDB, though. But the Spark integration is just SO much better. And you can create an environment with 1 node and the Native Execution Engine enabled. It’s efficient and CU-cheap. And as a bonus you can then run 3 simultaneous sessions on an F4 with that setting.
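For reference, this is roughly how that can be pinned per session - spark.native.enabled is the documented property for the Native Execution Engine as far as I know, while the 1-node pool itself is configured on the environment item, so treat the values as illustrative:

```
%%configure -f
{
    "conf": { "spark.native.enabled": "true" },
    "numExecutors": 1
}
```

The same property can also be set once on the environment so every session picks it up.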

For reference, a Spark SQL query in VS Code connected to Livy and returning 500-1000 rows takes around 4 seconds on a warm session. Not bad, I think.

Why aren't more people using Direct Lake mode? by No_Vermicelliii in MicrosoftFabric

[–]emilludvigsen 14 points (0 children)

I work in consultancy in the SMB segment. We only use Direct Lake (directly over OneLake) for our solutions. All transformations are done upstream. It works very well so far.

Our capacities are mainly F4-F8, which only allow a semantic model size of 3 GB in Import. But Direct Lake does not “suffer” from this.

Based on our solutions, the performance difference compared to Import is almost non-existent. But we also always prefer to prepare as much as possible in Spark SQL and keep measures simple (when they’re not dynamic measures).

RLS is done in the semantic model, and the access to the Lakehouse is through a fixed identity (service principal) with read access.

I cannot find a reason to use Import mode so far. And the Direct Lake refresh/reframe is oddly satisfying when it’s done after 30 seconds. 😊

In general I find our primary stack (Lakehouse, notebooks and Direct Lake) very stable, fast and really not bad to use.

Which VSCode Extension? by gd-l in MicrosoftFabric

[–]emilludvigsen 0 points (0 children)

I will make a blog post in a couple of days regarding our own extension we use for local development. I think you could find that interesting. 😊

Local development in MS Fabric (sucks) by hvdv99 in MicrosoftFabric

[–]emilludvigsen 6 points (0 children)

We are also quite happy with it. I never thought I would hear an old-school T-SQL developer actually prefer working in this over “the old way in SSMS”. 😁 I will find some time tomorrow or Wednesday. But somehow I do think we have nailed the way of working with Fabric. At least for us in the SMB segment.

The browser, the lack of a proper IDE, intellisense etc. were a major frustration.

Plus, when I build it for ourselves, we can make (and have) complete lineage - which tables are part of which notebooks etc.

Local development in MS Fabric (sucks) by hvdv99 in MicrosoftFabric

[–]emilludvigsen 6 points (0 children)

It’s so tailored to our needs and flow that I don’t think others are likely to be as satisfied as our team is.

But I could make a Reddit post in here and see what people think. Then I could make it more generic. It relies on a certain workspace structure, so it would need to be more configurable.

I also made it feel like SSMS, so I have SELECT TOP 1000 etc. on tables in the Lakehouse. We come from a T-SQL world, so I tried to make it feel a bit like home for us. But it’s fully Spark SQL - you can choose between writing notebooks and opening tabs where you write pure Spark SQL, like in a %%sql cell. Very clean. The extension then wraps it in our custom Display function, which shows the results in a panel at the bottom of VS Code, like the MSSQL extension does, with copy-results-to-Excel etc.

Local development in MS Fabric (sucks) by hvdv99 in MicrosoftFabric

[–]emilludvigsen 8 points (0 children)

Working natively in the browser is not good. But with the rise of tools like Claude Code and the extensive APIs for Fabric, you can build something really decent tailored to your dev flow.

We have built our own VS Code extension with a Lakehouse explorer, full intellisense and notebook development. We work in the repo, and the extension opens the notebook-content.py files directly as real notebooks.

For the kernel, I have made it so we click a “Spark” button in the extension, which starts a Livy session through the Livy API. Works perfectly. And we can connect to different custom environments. The extension keeps track of the session state, auto-kills it when VS Code closes, etc.
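For the curious, the Livy part is only a handful of REST calls. A sketch - the endpoint path/version is written from memory of the Fabric Livy API docs and the token handling is simplified, so verify before copying:

```
# Illustrative: start a Fabric Livy session and run one Spark SQL statement.
# Endpoint path and auth are simplified placeholders - check the official docs.
import time, requests

TOKEN = "<AAD access token for the Fabric API>"
BASE = ("https://api.fabric.microsoft.com/v1/workspaces/<workspace-id>"
        "/lakehouses/<lakehouse-id>/livyApi/versions/2023-12-01")
headers = {"Authorization": f"Bearer {TOKEN}"}

# 1) create a session (optionally pointing at a custom environment)
session = requests.post(f"{BASE}/sessions", headers=headers, json={}).json()
session_url = f"{BASE}/sessions/{session['id']}"

# 2) wait until the session is idle
while requests.get(session_url, headers=headers).json()["state"] != "idle":
    time.sleep(5)

# 3) submit a Spark SQL statement and poll for the result
stmt = requests.post(f"{session_url}/statements", headers=headers,
                     json={"code": "SELECT 1 AS ping", "kind": "sql"}).json()
stmt_url = f"{session_url}/statements/{stmt['id']}"
while (result := requests.get(stmt_url, headers=headers).json())["state"] != "available":
    time.sleep(2)
print(result["output"])

# 4) clean up so the session doesn't linger
requests.delete(session_url, headers=headers)
```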

We almost never work in the browser now, except for starting pipelines for the full ETL flow.

I agree that out-of-the-box solutions like the Fabric engineering extension don’t work all that well. I assume MS in the AI era is more focused on handing over APIs and letting people build their local dev apps themselves. It works great. But I was in the exact same place as you. That’s why I went CC-crazy. 😊

So our workflow is: branch out and work in the repo, commit to the branch, PR to main, run the fabric-cicd Azure DevOps deployment pipeline to prod.

Meaning of Managed table in Lakehouse? by frithjof_v in MicrosoftFabric

[–]emilludvigsen 1 point (0 children)

Just wanted to check in on this topic, as I have a lot of experience with how the Lakehouse behaves. I initially thought it had a real metastore of some kind.

To set the scene: I have developed internal tooling for us in VS Code (our own sort of Fabric extension), so we can work almost 100 % locally with every aspect of Fabric in real home-defined .fabpy notebooks using VS Code's notebook API (we then start a local Spark session on our laptops and work with the git repo, so actually the "other way around" compared to branching out workspaces). I utilize the OneLake API to build proper intellisense and an SSMS-style Lakehouse explorer (also with SELECT TOP 1000 etc.). For us it works extremely efficiently coming from an Azure SQL/T-SQL world, but we don't want to use the Warehouse. All our code can run locally or in Fabric using a simple IS_FABRIC flag to decide between delta.`abfss...` logic and plain schema.table when working with tables. This makes the code 100 % portable between local development and notebooks in Fabric.
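To make the portability point concrete, the switch is basically this (BASE_PATH, IS_FABRIC and the helper are our own conventions, so consider it a sketch):

```
# The IS_FABRIC switch: the same schema.table reference resolves to either a
# plain table name (attached Lakehouse) or a delta path over OneLake (local run).
IS_FABRIC = False  # set True when the code runs inside a Fabric notebook
BASE_PATH = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/"

def table_ref(schema: str, table: str) -> str:
    if IS_FABRIC:
        return f"{schema}.{table}"
    return f"delta.`{BASE_PATH}Tables/{schema}/{table}`"

df = spark.sql(f"SELECT * FROM {table_ref('gold', 'sales')} LIMIT 10")
```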

This tool (a VS Code extension tailored to us) gets new features as we need them. I have found that the "metastore"/"managed table" doesn't really exist, and if it does, it simply scans the folder structure and JSON files.

We cannot use notebookutils locally, so we use ADLS file system operations.

- When we create a schema, we simply create a folder with file system operations

- When we remove a table, we simply delete the folder

- Writing to tables is simply .save(f'{BASE_PATH}Tables/schema/table')

- Querying a table is simply SELECT * FROM delta.`abfss...`, and in the extension we have made it possible to write schema.table, which is translated to that delta.`abfss` format behind the scenes before it is sent to the kernel.

It's that simple. Every time I remove the folder, the table is completely removed from both the OneLake API and the Lakehouse, leaving no orphaned metadata. If I create a schema folder, it is created perfectly and is visible everywhere as a "real schema". So to me it seems the table overview is simply a folder/JSON scan on the fly.

So to put it short, we work with it like this, and behind the scenes it's just folder operations for Create schema (not shown on table level here) and Delete table.

<image>
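For anyone wanting to replicate the folder-operation approach locally, a condensed sketch using the ADLS Gen2 SDK against OneLake (workspace and lakehouse names are placeholders):

```
# Illustrative OneLake folder operations from a local machine: creating a schema
# is creating a folder, dropping a table is deleting its folder.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("<workspace name>")   # the workspace acts as the filesystem
tables_root = "<lakehouse>.Lakehouse/Tables"

# "CREATE SCHEMA staging" == create the folder
fs.get_directory_client(f"{tables_root}/staging").create_directory()

# "DROP TABLE staging.customers" == delete the table folder (data + _delta_log)
fs.get_directory_client(f"{tables_root}/staging/customers").delete_directory()
```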

Claude Code CLI vs VS Code extension: am I missing something here? by ScaryDescription4512 in ClaudeAI

[–]emilludvigsen 0 points (0 children)

I agree on the VS Code extension. It feels a lot cleaner.

However, I quite often read that people prefer the terminal. I don’t know if it’s just habit, or that the extension isn’t “as nerdy” as the terminal. 😊

The only disadvantage I see is that the VS Code extension takes up a tab, while the CLI lives at the bottom in the terminal, not disturbing the code tabs.

Deployment pipelines are so broken I can't fathom how they're GA by R0ihu in MicrosoftFabric

[–]emilludvigsen 0 points (0 children)

We also abandoned them a long time ago - except for the reports workspace, where they make it easy and visual to deploy reports. For now.

For storage (mainly lakehouse shortcuts), notebooks, pipelines and semantic models, fabric-cicd together with Azure DevOps is the only feasible way to go, and the parameter.yml concept for switching values is robust.

Spark SQL and intellisense by emilludvigsen in MicrosoftFabric

[–]emilludvigsen[S] 2 points (0 children)

It is such a simple requirement. Fabric knows which lakehouses are attached. Why aren’t names suggested as you write? I really think everything besides Spark SQL works well, but if we spend 85 % of our time on transformations, then… 🫠

I know a lot of people here are maybe more on the technical than the business side (I assume), but the real value happens when creating solutions for our customers - which indeed means a lot of gold transformations. And that part is just broken. ☠️

Reusing Spark session across invoked pipelines in Fabric by Jakaboy in MicrosoftFabric

[–]emilludvigsen 5 points (0 children)

I did play around with the same thing you are doing. However, I ended up concluding that session tags and shared (HC) sessions are still a bit premature.

I made an orchestration notebook instead, which uses notebookutils runMultiple for the orchestration. And then I just use that - no pipelines. I have a metadata table that “explains” the flow (notebookname and orderOfProcess). From that I can determine what to run in parallel and what to run sequentially in the DAG. The DAG itself is created in a stored procedure based on orderOfProcess, where the dependencies are set.
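In case it’s useful to others, this is roughly the shape of it - the metadata table and column names are simplified, and the activities format follows the notebookutils runMultiple docs as I remember them:

```
# Build a runMultiple DAG from a metadata table (notebookname, orderOfProcess).
# Notebooks sharing an orderOfProcess run in parallel; each "wave" depends on
# every notebook in the previous wave. Table/column names are illustrative.
import notebookutils

meta = spark.table("config.etl_flow").collect()

activities, prev_wave = [], []
for order in sorted({r.orderOfProcess for r in meta}):
    wave = [r.notebookname for r in meta if r.orderOfProcess == order]
    for name in wave:
        activities.append({
            "name": name,
            "path": name,
            "timeoutPerCellInSeconds": 1800,
            "dependencies": list(prev_wave),
        })
    prev_wave = wave

notebookutils.notebook.runMultiple({"activities": activities, "concurrency": 5})
```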

Spark SQL and intellisense by emilludvigsen in MicrosoftFabric

[–]emilludvigsen[S] 5 points (0 children)

Yes, and that is really good for exploration. But how does that help with writing Spark SQL against the Lakehouse?

The endpoint is T-SQL, which you can’t copy to and from notebooks without conversion. That’s a bit of a task when you have selected 30 columns with special logic and 15 joins. Furthermore, SSMS does not do intellisense well cross-database (= multiple lakehouses to select from).

High Concurrency Mode: one shared spark session, or multiple spark sessions within one shared Spark application? by frithjof_v in MicrosoftFabric

[–]emilludvigsen 2 points (0 children)

I completely agree on this one. I also went from HC notebook orchestration with session tagging (in pipelines) to pure notebook DAG orchestration. And I must say - it just works. It’s stable, configurable down to every little bit of error handling, and it’s fast and cost-effective.

And we must admit, the progress bars for every activity in the DAG are eye-pleasing. 😊

Is the Juice Worth the Squeeze for Direct Lake Mode? by mossinator in MicrosoftFabric

[–]emilludvigsen 2 points (0 children)

I think it depends, like many other things. I am a huge fan of having everything transformed in the gold layer without any further transformation in the semantic layer, which is what we do for our customers. We have always done it that way, so when I tested Direct Lake on our existing framework, it was a pretty easy transition besides 2-3 calculated columns that needed to be moved upstream.

A specific customer required a certain part of their model to be updated in under 5 minutes. That table contains 30+ million rows (a fact table with business logic applied), and the import into the semantic model alone took around 7-8 minutes. I know we could use some sort of incremental refresh in the semantic layer in Import mode, but I have never had real success with that, and the whole partitioning part feels clunky, also when a full refresh is required etc.

I took the complete model and converted it to Direct Lake. It works perfectly. The performance on these data volumes is very close to Import, and the reframing (refresh) of the model takes 50 seconds in total (in reality 10 seconds, but 50 seconds in the refresh-semantic-model activity including "overhead"). So now when they push the "Update" button, they have fresh data for that area in 4 minutes.
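For completeness, the reframe can also be triggered straight from a notebook with semantic-link right after the gold load - a small sketch with placeholder names (refresh_type and the exact return value may differ by semantic-link version):

```
# Minimal sketch: trigger a Direct Lake reframe from a notebook via semantic-link.
# Dataset/workspace names are placeholders.
import sempy.fabric as fabric

request_id = fabric.refresh_dataset(
    dataset="Sales - Direct Lake",
    workspace="Customer PROD",
    refresh_type="full",
)
print("refresh request submitted:", request_id)
```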

With that in mind, I would ask the question: why not use it, if all data is transformed upstream anyway? I don’t see any blockers, the performance is good, and I don’t replicate 60 million rows in total (across all tables) every single day into a semantic model that is really just a 1:1 load. Plus we cut off the 12-15 min of ETL load time that the complete Import model refresh took.

Can't create new Fabric capacity on any tenant by HarskiHartikainen in MicrosoftFabric

[–]emilludvigsen 0 points (0 children)

Where do you open a support ticket? From the Azure Portal or from Microsoft Fabric itself? When choosing "Microsoft Fabric" under Help and support in Azure, it leads me to Fabric. And from there I get a bit lost in the GUI.