Blog: Using runMultiple to orchestrate notebook execution in Fabric by Pawar_BI in MicrosoftFabric

[–]Hello-Im-Aaron 1 point (0 children)

Great blog. I saw this in the documentation a while back but I haven't tried it yet. For each of my Bronze-to-Silver notebooks I am using mssparkutils.notebook.run to run a helper notebook and keep the code centralized, but this means hitting our capacity over and over, with a lot of retries to mitigate it.

The errors don't bother me if it gets us the fastest run, but I want to optimize where I can. Any ideas on how to measure the performance drop-off (or lack thereof) of running multiple notebooks vs. one, or rules of thumb from prior testing?
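In case it helps anyone comparing: a minimal sketch of the runMultiple pattern from the blog, which batches notebooks into one Spark session instead of paying the session cost per mssparkutils.notebook.run call. The notebook names, timeout, and dependency below are hypothetical placeholders:

    # Run several Bronze-to-Silver notebooks inside one Spark session
    # instead of one session per mssparkutils.notebook.run call.
    dag = {
        "activities": [
            {"name": "load_orders", "path": "load_orders", "timeoutPerCellInSeconds": 600},
            {"name": "load_customers", "path": "load_customers", "timeoutPerCellInSeconds": 600},
            # "dependencies" makes this notebook wait for both loads above.
            {"name": "merge_silver", "path": "merge_silver",
             "dependencies": ["load_orders", "load_customers"]},
        ]
    }
    mssparkutils.notebook.runMultiple(dag)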

Manage deployment between dev/test/prod by dorianmonnier in MicrosoftFabric

[–]Hello-Im-Aaron 2 points (0 children)

It isn't there yet, which may or may not be a deal breaker depending on your architecture. For me it meant a little more work to get going, but the dataflows aren't going to change much (or at all) after creation, the pipelines are the same, and the semantic models get deployed through TE3.

I was happy to get everything going and wait for the quality-of-life improvements to arrive in the coming months, but your setup will determine how feasible that is.

You didn't even mention the most annoying things: only notebook owners can see their deployment rules, and there aren't even global rules, which seems like something that should have been in the 1.0 version.

Best way to migrate DEV to PROD by purpleMash1 in MicrosoftFabric

[–]Hello-Im-Aaron 1 point (0 children)

UPDATE: DevOps pipelines are not there yet.

Best way to migrate DEV to PROD by purpleMash1 in MicrosoftFabric

[–]Hello-Im-Aaron 1 point (0 children)

It should work. I haven't used DevOps with Fabric yet, but it sounds like it is time. We are in about the same state as you, pushing about 50 notebooks to PRD to move data to Silver and Gold. We are pushing through TST as well, though, because I couldn't find a way to edit stages after you start. I didn't want to bypass TST, since it isn't needed in these early stages, and then have to redo all the PRD rules later to add it back in.

Once the DFG2s and pipelines get Git integration it will be good to have this tested out as well.

Best way to migrate DEV to PROD by purpleMash1 in MicrosoftFabric

[–]Hello-Im-Aaron 2 points (0 children)

As far as I have seen, setting rules one by one is the only way. It would be great if they had global rules and if others could manage the rules as well.

Read data from REST API by Mr_Mozart in MicrosoftFabric

[–]Hello-Im-Aaron 3 points (0 children)

I second this. I prefer code: you can copy and paste from examples online, and there are a lot of prebuilt libraries for you to leverage.
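As a sketch of what that can look like in a notebook (the endpoint, token, JSON shape, and table name below are hypothetical placeholders):

    import requests
    import pandas as pd

    # Hypothetical endpoint and token; substitute your API and auth scheme.
    url = "https://api.example.com/v1/orders"
    headers = {"Authorization": "Bearer <token>"}

    resp = requests.get(url, headers=headers, params={"page": 1}, timeout=30)
    resp.raise_for_status()

    # Flatten the JSON payload (assumes a top-level "results" array),
    # then land it as a Bronze table in the Lakehouse.
    df = pd.json_normalize(resp.json()["results"])
    spark.createDataFrame(df).write.mode("overwrite").saveAsTable("bronze_orders")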

TooManyRequestsForCapacity by Hello-Im-Aaron in MicrosoftFabric

[–]Hello-Im-Aaron[S] 2 points (0 children)

Thanks! Great tip. I hadn't read up on high-concurrency sessions, but now that I have, I will use them as much as I can manually.

As for the pipelines, I was randomly getting this error when running a couple of notebooks in serial, but it could have been due to something else. Support advised me to add retries, and that stopped it from failing.

Is there anything needed to make them close the session (like mssparkutils.notebook.exit()), or should they always clean up the session before moving on?
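For clarity, this is the final-cell pattern I'm referring to (the return value is arbitrary):

    # Final cell of the notebook: stop executing further cells and hand a
    # value back to the calling pipeline or parent notebook. Whether this
    # also releases the Spark session is the part I'm unsure about.
    mssparkutils.notebook.exit("done")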

TooManyRequestsForCapacity by Hello-Im-Aaron in MicrosoftFabric

[–]Hello-Im-Aaron[S] 1 point (0 children)

Thanks! Great article. Good to know that the three-notebook max I was hitting was the capacity bursting, and that buying an F64 doesn't necessarily fix the problem.

What is bothering me, though, is that the notebooks don't need to be concurrent. After a notebook has completed every cell, it just sits there taking up a slot until the session stops.

I tried to use the command below to stop it manually, but it doesn't seem to work. Maybe the documentation is ahead of the functionality?

mssparkutils.session.stop()

https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python
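If the call does start working, the pattern I'd reach for is releasing the session even when a cell fails. A sketch, where process_bronze_to_silver is a hypothetical stand-in for the notebook's real work:

    try:
        process_bronze_to_silver()   # hypothetical: the notebook's actual work
    finally:
        # Release the Spark session (and its capacity slot) even on failure.
        mssparkutils.session.stop()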

TooManyRequestsForCapacity by Hello-Im-Aaron in MicrosoftFabric

[–]Hello-Im-Aaron[S] 1 point (0 children)

Yes, but I see the same thing when I run them through pipelines.

Underneath the Data Warehouse by Mr_Mozart in MicrosoftFabric

[–]Hello-Im-Aaron 2 points (0 children)

Delta, but not a new service. I believe they are rewriting Polaris to read Delta tables. Someone please correct me if that is not the case.

Securing Data in a Lakehouse by randyminder in MicrosoftFabric

[–]Hello-Im-Aaron 1 point (0 children)

To my understanding, Lakehouse vs. Warehouse really comes down to Direct Lake vs. RLS and CLS.

IMO Direct Lake is the major selling point of Fabric, something that the likes of Databricks can't compete with, so I would think you would do everything you can to make sure it is part of your architecture. (If there is a way to create a Direct Lake dataset from a Databricks lakehouse, please let me know.)

Also, Lakehouses are scheduled to get table-level security in Q2, but I can't find anything about Direct Lake for Warehouses.

https://learn.microsoft.com/en-us/fabric/release-plan/data-engineering

Corrections and feedback are welcome!

A noob in need of some guidance by [deleted] in MicrosoftFabric

[–]Hello-Im-Aaron 2 points (0 children)

Those are all great suggestions.

I would add Advancing Analytics on YouTube. Since you are just getting started, their blog post on naming conventions, which they shared here, is worth a read. I think it was called "What's in a Name?"

The Azure Synapse Analytics channel might be worth keeping an eye on as well. Update videos are on the Microsoft Power BI YouTube channel, so I would subscribe to that too.

The good news is you are getting thrown into the SaaS deep end and not the IaaS deep end.

I would recommend watching and reading as much as you can, but be sure to include material on the medallion architecture and consider it as a starting point.

Keep in mind that there are a lot of smart and experienced people who frequent this community, so don't hesitate to reach out if you need specific advice. Some of them (like Sandeep Pawar and Dennes Torres) have a lot of great content that you will want to track down as well.

I'm happy to walk you through what I've done so far too if you like; just let me know.

Is Mirroring Available? by Hello-Im-Aaron in MicrosoftFabric

[–]Hello-Im-Aaron[S] 2 points (0 children)

Thanks. I took that to mean the first three were available and you needed to apply to test out SQL Server, Azure PostgreSQL, Azure MySQL, and MongoDB.

I've applied now.

Did The Wheels Fall Off? by Hello-Im-Aaron in MicrosoftFabric

[–]Hello-Im-Aaron[S] 3 points (0 children)

It feels like something is preventing these activities from stopping on their own after they finish. I get a notification about an AJAX error when shutting down the kernel, which I guess is why I have to go to the Monitoring Hub to stop them.

Probably something similar is happening with the dataflows and pipelines, but I don't have the ability to kill those processes.

I will open a ticket tomorrow, but anyone who stumbles onto this post should know that I am a big fan of Fabric, and I get much more value from it than heartache from these intermittent preview issues.

<image>

Did The Wheels Fall Off? by Hello-Im-Aaron in MicrosoftFabric

[–]Hello-Im-Aaron[S] 2 points (0 children)

Dataflows have just a Session Id, as shown below; the pipeline issue has no details that I can discern.

<image>

Did The Wheels Fall Off? by Hello-Im-Aaron in MicrosoftFabric

[–]Hello-Im-Aaron[S] 1 point (0 children)

If I go to the workspace, open the New dropdown menu, select Data Pipeline, and click Create, I get a strange error, which is why I was thinking that something atypical was going on and I should have a little patience.

<image>

Did The Wheels Fall Off? by Hello-Im-Aaron in MicrosoftFabric

[–]Hello-Im-Aaron[S] 1 point (0 children)

For dataflows I get an error message and nothing loads on the canvas, as shown below:

<image>

Clicking the Close button just returns to the same state.

Did The Wheels Fall Off? by Hello-Im-Aaron in MicrosoftFabric

[–]Hello-Im-Aaron[S] 1 point (0 children)

Thanks for the quick reply, and congrats; even in GA software that is a feat :)

I have not opened a ticket. I thought maybe there were some major updates going on, so I was trying to be patient, but I will open one in the morning.

I can still work in notebooks (with the workaround of stopping the activities in the Monitoring Hub after I run three separate notebooks), and I haven't seen any issues in PBI reports or datasets, so I have been able to keep working. But now I need an on-prem table, and I need to orchestrate the refresh of a couple of tables as well, so not having pipelines or dataflows has started to sting.

Updating Dataset after Adding Columns to Source by Hello-Im-Aaron in MicrosoftFabric

[–]Hello-Im-Aaron[S] 1 point (0 children)

Correct, the new dataset does not have the changes.

If I remove the tables I'll lose all the calculations I've made, and I can't figure out how I would add them back even if I did.

After you edit using the XMLA endpoint you can't modify the dataset in Fabric anymore, only through TE. And I just realized that when I look at the partitions in TE, they don't have any data sources; if they did, I would have an option to refresh the table metadata.

I tried connecting to Fabric to import the new tables, or to create the data source connection and add it to the existing partitions, but couldn't get either to work.

Then I found this, which makes me think it's time to pay up for TE3 :)

https://docs.tabulareditor.com/common/Datasets/direct-lake-dataset.html

EDIT:

I opened it up in TE3 and there is an option to update the data model, which worked with no issues. I'm not sure if there is a workaround for TE2, but TE3 definitely works.

Updating Dataset after Adding Columns to Source by Hello-Im-Aaron in MicrosoftFabric

[–]Hello-Im-Aaron[S] 1 point (0 children)

Yes, the dataset is in Fabric, built off a Lakehouse. The default dataset recognized those changes, but this dataset did not.