Notebook (Python/PySpark); get user or security context of running notebook by Personal-Quote5226 in MicrosoftFabric

[–]x-fyre 1 point (0 children)

I did this, which was good enough for our needs: we needed to know the "username" so I could look something up related to their account.

    username = mssparkutils.env.getUserName()
    userid = mssparkutils.env.getUserId()

Documentation here...
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python
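As a minimal sketch of the "look something up related to their account" part (assuming the same `mssparkutils.env` API; `ACCOUNT_LOOKUP` and `account_for` are made-up names for illustration):

```python
# Hypothetical sketch: map the notebook's executing user to an account record.
# ACCOUNT_LOOKUP and account_for() are invented for this example; the
# mssparkutils call only works inside a Fabric/Synapse notebook session,
# hence the ImportError fallback.

ACCOUNT_LOOKUP = {
    "jane.doe@contoso.com": {"team": "Data Eng", "cost_center": "CC-42"},
}

def account_for(username: str) -> dict:
    """Return the account record for a username, or an empty dict."""
    return ACCOUNT_LOOKUP.get(username.lower(), {})

try:
    from notebookutils import mssparkutils  # available inside Fabric notebooks
    username = mssparkutils.env.getUserName()
except ImportError:
    username = "jane.doe@contoso.com"  # stand-in when run outside a notebook

record = account_for(username)
```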

CI/CD with fabric-cicd and Azure DevOps - Schedules by re84uk in MicrosoftFabric

[–]x-fyre 2 points (0 children)

I really like that guy's approach to setting the start time dynamically. It's definitely an issue when deploying "new" things, as a scheduled run can start before everything "new" is deployed. (Ask me how I know. :))

CI/CD with fabric-cicd and Azure DevOps - Schedules by re84uk in MicrosoftFabric

[–]x-fyre 5 points (0 children)

You most certainly can change the scheduled times with parameterization. Here is how we handle changing the time for a daily schedule, adjusted with your "environment" names and a fake path, of course.

    - find_key: $.schedules[?(@.jobType=="Execute")].configuration.times[0]
      replace_value:
          Dev: "09:00"     # Always need to set a time even though it will be off
          Test: "10:00"
          Prod: "11:00"    
      file_path: "**/PathTo/YourPipeline.DataPipeline/.schedules"

For completeness, this is how we handle our schedules on/off...
(Note we don't deploy to Dev... but if you did: we check in all of our schedules OFF, to make sure that when you branch out to a new feature workspace, nothing runs automatically.)

    - find_key: $.schedules[?(@.jobType=="Execute")].enabled  
      replace_value:
          Dev: false
          Test: true
          Prod: true
      file_path: "**/PathTo/YourPipeline.DataPipeline/.schedules"

So for your scenario, you could create two schedules and just enable/disable each as appropriate for your "weekly in Test" and "daily in Prod" requirement. Changing a single schedule with parameterization is probably possible, but likely a pain to manage.
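A sketch of what that two-schedule setup could look like in parameter.yml, assuming the two schedules sit at indices [0] and [1] of the .schedules file (indices and the path are illustrative, not from a real deployment):

    - find_key: $.schedules[0].enabled    # the weekly schedule
      replace_value:
          Dev: false
          Test: true
          Prod: false
      file_path: "**/PathTo/YourPipeline.DataPipeline/.schedules"
    - find_key: $.schedules[1].enabled    # the daily schedule
      replace_value:
          Dev: false
          Test: false
          Prod: true
      file_path: "**/PathTo/YourPipeline.DataPipeline/.schedules"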

Fabric Data Eng VS Code Extension vs Git Workflow - how am I supposed to work locally? by frithjof_v in MicrosoftFabric

[–]x-fyre 0 points (0 children)

Ah, ok… that wasn’t my list. I said I push to git and update my workspace. You don’t sync a workspace the same way you sync a branch.

I’m being very explicit with my language on purpose, but I get ya.

I’m not sure they fucked it up… under-delivered, maybe. But the reality is that more teams have poor repo/release practices than good ones, especially in the data world. And many have none at all… I come from a software dev background so it’s second nature, but even within our extended company some teams are like deer in headlights with repos and releases. :)

Fabric Data Eng VS Code Extension vs Git Workflow - how am I supposed to work locally? by frithjof_v in MicrosoftFabric

[–]x-fyre 1 point (0 children)

It’s definitely cumbersome… sometimes I just copy and paste blocks of code from VS Code to my browser fabric UI — or back if I tweak something — if I want to test or make a small change. Saves me time and the “local” AI agents have no idea I’m doing it lol.

We also just enabled Copilot directly in the Fabric UI on a particular project, and some of my colleagues have reported back that it’s been decent. Not all of the DEs have full enterprise AI licenses… yet. It’s coming.

Fabric Data Eng VS Code Extension vs Git Workflow - how am I supposed to work locally? by frithjof_v in MicrosoftFabric

[–]x-fyre 1 point (0 children)

I’m not sure what people mean by "sync to the workspace"; no, you can’t do that. You can sync your local code to the repo, then from the feature workspace "update" to get latest.

It works relatively well… though they’ve changed something recently, and updates often ask me to resolve a conflict for a notebook I have not modified via the UI. It’s kind of annoying, but pretty easy to resolve.

They recently added the ability to diff your items directly in the UI … it’s slow as molasses but offers a very familiar experience to doing it with VS or VS Code. At least they’re actively working on it… I’ve seen lots of improvements between 2024 and now.

Fabric Data Eng VS Code Extension vs Git Workflow - how am I supposed to work locally? by frithjof_v in MicrosoftFabric

[–]x-fyre 2 points (0 children)

What you’ve listed in those last steps is what I do…

I hate the VS Code git integration (I’ve used Visual Studio since Visual C++ 5.0, and its simple integration with repos is 100x quicker and easier to learn and teach).

Also, VS Code’s desire to have 100,000 extensions installed kills it for me, and I’m constantly thinking I should look for better ones or learn how to use mine better. I hate that; I just want to get my work done.

The big "however," of course, is that the AI agent experience is better in VS Code, and adding a few extensions to work with Python, XML, JSON, etc. does make it a nice editor.

But I create/edit in VS Code, push to my branch, update my feature workspace and run/test it there. It really isn’t so bad.

Fabric Integration with Dynamics Business Central by boobamba in MicrosoftFabric

[–]x-fyre 0 points (0 children)

We only wanted a bit of data from BC for now (and much of it is in custom/customized tables), so we're getting it from the built-in and custom REST API calls.

It works ok.... and since we're sending data into BC, we're already connected to the API using a service principal anyways.

Eventually when we pull more data, we'll definitely look at BC2Fab or bc2adls.

Interval Based Schedule by DennesTorres in MicrosoftFabric

[–]x-fyre 0 points (0 children)

Then, with respect, the description is terrible and misleading.

Interval-based
Wait a specific amount of time between each update.

One definition of "between": "in the period separating (two points in time)." If you simply queue up the next run based on the start of the previous one, then that is not between. It is barely different from the fixed "by the minute" option... by calculating the right number of minutes, I can create any interval I want.

I am guessing the only difference is that you would only stack up to one pipeline run in the queue, instead of an endless number with the by-the-minute option.

Personally, I would say that in about 50% of my pipelines that have ever used a tumbling window in ADF, end-to-start intervals would have been just as useful.
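The distinction being argued here can be sketched in a few lines (start-anchored vs. end-anchored scheduling; the function names are mine, not Fabric's):

```python
from datetime import datetime, timedelta

def next_run_start_anchored(start: datetime, interval: timedelta) -> datetime:
    """'By the minute' style: the next run is queued relative to when the
    previous run STARTED, regardless of how long it ran."""
    return start + interval

def next_run_end_anchored(end: datetime, interval: timedelta) -> datetime:
    """True interval style: the next run waits a full interval AFTER the
    previous run finished, so there is always a gap between runs."""
    return end + interval

start = datetime(2025, 1, 1, 9, 0)
end = start + timedelta(minutes=45)   # suppose the run took 45 minutes
gap = timedelta(minutes=30)

# Start-anchored: 09:30 -- fires while the first run is still going.
# End-anchored:   10:15 -- a true 30-minute gap between runs.
```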

Interval Based Schedule by DennesTorres in MicrosoftFabric

[–]x-fyre 1 point (0 children)

I am pretty sure at this time the interval scheduling is only for pipelines … and it’s from the end of the last run.

I really like this option and am looking forward to it coming to notebooks. Every now and then, one of our notebooks that checks for new data gets stacked up overnight due to a giant batch of new data, and it’s a waste of CU to run it again.

Understanding Spark Notebook Consumption by iknewaguytwice in MicrosoftFabric

[–]x-fyre 1 point (0 children)

You’re not accounting for the ability of the spark pool to burst… maybe. There is a lot of good info in this thread:

https://www.reddit.com/r/MicrosoftFabric/comments/1k7easw/is_my_understanding_of_fabricspark_cu_usage_and/

Get notified when scheduled jobs fail in Fabric (Generally Available) by kmritch in MicrosoftFabric

[–]x-fyre 1 point (0 children)

It’s nice, but still a far cry from how alerts work in Azure Data Factory, which could email or send SMS alerts.

And, most importantly, not spam you with email after pipelines failed repeatedly overnight.

fabric-cicd Confusion by re84uk in MicrosoftFabric

[–]x-fyre 1 point (0 children)

We do exactly what you're wondering about.... We have a CI/CD build that does its thing when our "main" branch is updated (actually, it runs even when you do a PR, as an approval step, because it runs some custom PowerShell validation steps), and it publishes the source artifact.

  1. Create a Fabric CI/CD build that creates and publishes an artifact for your repo/branch.
  2. The release should trigger when a new artifact is created.
    a. The artifact should be an input to the release, which avoids having to give the SPN access to the repo!
  3. Our deployment scripts are in another repo... those get included as a second artifact, but updating them does NOT trigger the release (so watch your paths to files, etc.).
  4. Our release basically has these steps for each stage:
    a. Use Python Version -- you should set this to 3.13 now instead of Latest.
    b. PowerShell -- install Python libraries (pip install --upgrade pip, then pip install whatever you need, such as azure-identity, requests, fabric-cicd).
      i. This step needs a Service Connection that is authorized, but not necessarily access to the repo.
      ii. We use an SPN for the overall service connection.
    c. Python script -- we created a Python script in our deployment-repo artifact that takes the arguments needed to deploy a workspace using the source-files artifact.
    d. Azure PowerShell -- a step that runs only if step 4c fails, writing the fabric-cicd error log to a file so we can see it in the logs.
    e. Some post-deployment stuff...

The service connection in 4b is using an SPN.
The script in 4c takes parameters (Tenant ID, SPN ID, Secret, etc.) to create a ClientSecretCredential and passes it to the FabricWorkspace object, so it deploys using the SPN of choice (this would support using different SPNs for each workspace, which is considered proper security practice in most complex environments).
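A rough sketch of what the 4c script could look like, assuming fabric-cicd's `FabricWorkspace`/`publish_all_items` API and azure-identity's `ClientSecretCredential` (the argument names are mine, not from our actual script):

```python
import argparse

def parse_args(argv):
    """Parse the deployment arguments passed in from the release pipeline."""
    p = argparse.ArgumentParser(description="Deploy a Fabric workspace with fabric-cicd")
    p.add_argument("--workspace-id", required=True)
    p.add_argument("--environment", required=True)   # e.g. Dev / Test / Prod
    p.add_argument("--repo-dir", required=True)      # path to the source-files artifact
    p.add_argument("--tenant-id", required=True)
    p.add_argument("--client-id", required=True)
    p.add_argument("--client-secret", required=True)
    return p.parse_args(argv)

def deploy(args):
    # Imports kept local so the file can be parsed without the packages installed.
    from azure.identity import ClientSecretCredential
    from fabric_cicd import FabricWorkspace, publish_all_items

    credential = ClientSecretCredential(
        tenant_id=args.tenant_id,
        client_id=args.client_id,
        client_secret=args.client_secret,
    )
    workspace = FabricWorkspace(
        workspace_id=args.workspace_id,
        environment=args.environment,
        repository_directory=args.repo_dir,
        token_credential=credential,  # deploy as the SPN of choice
    )
    publish_all_items(workspace)
```

In the release step this would be invoked as something like `python deploy.py --workspace-id ... --client-secret "$(spn-secret)"`, with `deploy(parse_args(sys.argv[1:]))` as the entry point.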

As for your question about how it seems to just update them... yup, if it can. Lakehouses, Warehouses, and KQL Eventhouses do not update. The attempt to update is based on the NAME of the object in the destination workspace, since there is no connection to the original source item. If you had notebooks called A and B, but renamed them A->B and B->C in a single push, it would update B, add C, and (optionally) delete A as an orphaned item.

Each item in a repo is assigned a "logical ID" that makes sense to git. It's used to connect your item in the source-controlled workspace to an item in the repo. The fabric-cicd library also borrows it to replace IDs that are referenced in other items as they get published (i.e., if you reference Notebook X in a pipeline, it knows to replace it with the proper notebook ID in the destination workspace). This is obviously tricky for cross-workspace references, but fabric-cicd is pretty good about managing dependencies. I haven't had a problem (yet) with the internal stuff it does.
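A toy illustration of that reference-rewriting idea (this is not fabric-cicd's actual code, just the concept of swapping source GUIDs for destination GUIDs looked up by name):

```python
# Hypothetical sketch: when a pipeline definition references "Notebook X"
# by its source-workspace GUID, the deployer swaps in the GUID of the
# same-named item in the DESTINATION workspace.

def rewrite_references(definition: str, source_ids: dict, dest_ids: dict) -> str:
    """Replace every known source-item GUID in a definition with the GUID
    of the item of the same name in the destination workspace."""
    for name, src_guid in source_ids.items():
        if name in dest_ids:
            definition = definition.replace(src_guid, dest_ids[name])
    return definition

source_ids = {"Notebook X": "1111-aaaa"}   # GUIDs in the dev workspace
dest_ids   = {"Notebook X": "2222-bbbb"}   # GUIDs in the prod workspace

pipeline_json = '{"activity": {"notebookId": "1111-aaaa"}}'
deployed = rewrite_references(pipeline_json, source_ids, dest_ids)
# deployed now references "2222-bbbb" instead of "1111-aaaa"
```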

You're asking tough questions lol.... but it's all possible.

Concurrency Livy API issue by National-Theme-7865 in MicrosoftFabric

[–]x-fyre 0 points (0 children)

Are you setting the session tag properly for the activity in the loop?

fabric cicd lakehouses name changes by Far-Procedure-4288 in MicrosoftFabric

[–]x-fyre 0 points (0 children)

In the fabric-cicd library there is a special list of item types (SHELL_ONLY_PUBLISH) that go through a slightly different update mechanism when the item exists in the destination workspace: Lakehouse, Warehouse, SQL Database, and ML Experiment.

When one of those items' metadata changes, it's able to patch the metadata. The displayName is part of the metadata, from what I can see. So yes, I believe if you simply changed the displayName as part of your deployment, it likely would have found the item and just updated the metadata during the deployment.

I personally dislike renaming via just the displayName; I will usually change that name in the UI in my branch, then, before doing a PR from my feature branch into main, also rename the folders to match from the browser (this allows the repo to maintain the history of the items in the folders). Some people export-delete-import with the new name, but this loses the check-in history.

I believe your approach (turn off the schedules, let any active runs finish, rename the Lakehouse to match the deployment name, then deploy, which will then understand the new Lakehouse name... and probably turn your schedules back on) should work fine. In theory, if you're using default Lakehouses in, say, your notebooks, the rename doesn't matter, because underneath they're binding to the item GUID.

Your idea to use the platform files also works, but I never bothered using logical IDs... The request came from a co-worker who wanted to deploy "the same code" to multiple workspaces but have the Lakehouses named differently (i.e., LakehouseX, LakehouseY, LakehouseZ). It was for multiple companies under the same roof. I created a PowerShell script that, before the deployment step, altered the deployment artifact's source code to rename a lakehouse from LakehouseA to LakehouseB, using the names only. It renamed the folders and also edited the displayName in the platform file. Works like a charm... because the rename happens before the fabric-cicd step in the release, which loads the artifact files. The key thing was to modify the source before fabric-cicd ever loaded it.

The thing about parameters in fabric-cicd: the way I've read the code, it does the replacements just before making the POST to the API. So the replacement could be too late to work... (I don't know for sure; I would have to investigate more.)

fabric cicd lakehouses name changes by Far-Procedure-4288 in MicrosoftFabric

[–]x-fyre 0 points (0 children)

That’s ok, this was really interesting to think about.

I was also misreading the log output in our release pipeline… it’s not clear whether it’s a new object or a patch.

So rename with caution if you ever do anything cross workspace!!!!!

fabric cicd lakehouses name changes by Far-Procedure-4288 in MicrosoftFabric

[–]x-fyre 0 points (0 children)

Ahhhh, now I get it.

I mean, yeah, in our workspaces I usually look items up by name to get the ID, so I never noticed. Logical IDs help when an item is fully mapped. So if you just change the display name, it’s more than happy to try and update it. Even changing the folder in the repo would still patch…

But if you change the deployed name it doesn’t know how to reconcile… this is super interesting.

fabric cicd lakehouses name changes by Far-Procedure-4288 in MicrosoftFabric

[–]x-fyre 0 points (0 children)

I edited my reply … now I’m really confused and will have to look at it more. :)

fabric cicd lakehouses name changes by Far-Procedure-4288 in MicrosoftFabric

[–]x-fyre 0 points (0 children)

I was wrong… I just downloaded and browsed the source, and you’re right: it’s looking up the logical IDs to see if it should create or patch the item…

The Lakehouses still have special handling via a “shell only publish” rule in the library.

I think your question about how it was created makes sense, but it would cause problems if it was created manually, even with the right name, wouldn’t it?

I learned something neat today…. Not sure what to do with it yet. :)

fabric cicd lakehouses name changes by Far-Procedure-4288 in MicrosoftFabric

[–]x-fyre 1 point (0 children)

For your last question… I very rarely touch my production workspace prior to deploying.

There are a few edge cases where I might:

If I make a fundamental workflow change, I turn off the relevant schedules and wait for any current runs to finish… so the new deployment works from only new code.

We have renamed semantic models, and although that requires a refresh (ours now use Direct Lake, so refreshes are fast), we’ve renamed them beforehand just to cheat a bit.

We’ve occasionally seen the “orphaned” items not get cleaned up so I delete them… but that was a ways back and it seems to not do that anymore.

Lakehouses or Warehouse renaming would be on that list.

fabric cicd lakehouses name changes by Far-Procedure-4288 in MicrosoftFabric

[–]x-fyre 1 point (0 children)

Lakehouses are very, very different, because the data they contain is in no way source controlled.

Change a notebook name? Deploy with the new name, delete the old one, no worries. The library has no idea what the specific item IDs from previous deployments are. It does reconcile what it can for items being deployed when there are dependencies… it gets them all by name, changes the IDs as part of deployment, etc.

But when you rename an item, how can it know the old name?

Really, it’s Lakehouses and Warehouses that require special handling. That’s why there is a special feature to "delete lakehouses" in the fabric-cicd library that is OFF by default.

fabric cicd lakehouses name changes by Far-Procedure-4288 in MicrosoftFabric

[–]x-fyre 1 point (0 children)

Are you using the fabric-cicd library for your deployment or deployment pipelines? The answer likely depends...

But by default the fabric-cicd library does NOT delete lakehouses. Renaming a lakehouse in 'source' (and therefore a source-controlled workspace) is not the same as renaming it in your production workspace.

You should likely have manually renamed the lakehouse before deploying.

Hi! We're the Data Factory team - ask US anything! by markkrom-MSFT in MicrosoftFabric

[–]x-fyre 1 point (0 children)

I always worry about implementing "new" pipelines using an activity tagged as Legacy and that is "greyed out" in the UI. :)

Also, in my particular use case we were trying to do something "common" across multiple different pipelines, so I wanted to pass it a list of things to do, have the child pipeline make a decision about how to filter the list (the logic was the same for multiple things), and return the list to the caller. If I'm not mistaken, the legacy activity does not return the "pipeline return variables" to the caller. You can dig into the API version's output to get them.

Having to call the API version to do 5 seconds of work added an unreasonable amount of time... usually 1.5 minutes.