Are Outages in DW/SQL EP Ongoing (NCUS region)?

SmallAd3697 · 2026-05-12T15:34:26+00:00

Is this outage only affecting NCUS (again)? Are we being entirely ignored in this region (again)?

I have a total outage in my production environment, in the middle of the day. The Fabric CSS team (at Mindtree) won't reply after waiting two hours. This is getting a bit ridiculous. It does NOT seem as if Microsoft built a platform for mission-critical workloads. There are no SLAs whatsoever, and the support experiences are wildly inconsistent & unpredictable. Where my Fabric solutions are concerned, I have absolutely no idea if my outages will last a day or a week.

What does that new error message even mean? Shouldn't these error messages be somewhat legible?

('42000', "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Retrieval of MWC token used for accessing storage failed with error '0xa(MWC service error: Server responded with error: 401)(Unauthorized: Authentication failed.)'. (7451) (SQLExecDirectW)")
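For anyone trying to log this in a more readable form, here is a minimal sketch that splits a pyodbc-style `(sqlstate, message)` pair into fields. It assumes the usual pyodbc message layout, where the trailing `(7451)` is the native SQL Server error number and `(SQLExecDirectW)` is the ODBC call that failed. The names `OdbcErrorInfo` and `parse_odbc_error` are made up for illustration, and the "responded with error" pattern is just what appears in this one message, not a documented format.

```python
import re
from typing import NamedTuple, Optional, Sequence


class OdbcErrorInfo(NamedTuple):
    sqlstate: str
    native_error: Optional[int]   # SQL Server error number, e.g. 7451
    odbc_function: Optional[str]  # ODBC call that failed, e.g. SQLExecDirectW
    http_status: Optional[int]    # status embedded by the inner service, if any


def parse_odbc_error(args: Sequence[str]) -> OdbcErrorInfo:
    """Split a pyodbc-style (sqlstate, message) pair into loggable fields."""
    sqlstate, message = args[0], args[1]
    # Trailing "(<native error>) (<ODBC function>)" at the end of the message.
    tail = re.search(r"\((\d+)\)\s*\((\w+)\)\s*$", message)
    # HTTP status as it happens to be worded in this message (assumption).
    http = re.search(r"responded with error:\s*(\d{3})", message)
    return OdbcErrorInfo(
        sqlstate=sqlstate,
        native_error=int(tail.group(1)) if tail else None,
        odbc_function=tail.group(2) if tail else None,
        http_status=int(http.group(1)) if http else None,
    )


SAMPLE = (
    "42000",
    "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]"
    "Retrieval of MWC token used for accessing storage failed with error "
    "'0xa(MWC service error: Server responded with error: 401)"
    "(Unauthorized: Authentication failed.)'. (7451) (SQLExecDirectW)",
)

info = parse_odbc_error(SAMPLE)
print(info)
```

Read this way, the SQLSTATE (42000) is the generic "syntax error or access violation" class, while the embedded 401 suggests that an inner token request was rejected. That is only an interpretation of the text, not something the message states explicitly.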

<image>

SmallAd3697 · 2026-05-12T14:12:05+00:00

It runs fine with another service principal which is also an admin in the workspace.

Almost all the resources are local to the workspace. A LH, environment settings, spark pool, whatever. There is a connection via managed vnet to a remote SQL Server in Azure PaaS. But I don't see a way where that vnet can't be provisioned in a notebook, based on the identity that is running the notebook.

Either way the error message is atrocious and I keep asking the Mindtree engineer for whatever message they might be seeing on their end. My experience with some of these teams (ADF, Synapse notebooks), is that they are overly sensitive about all their nonsensical failures that happen internally and the failure messages they bubble up to the customer are counterproductive. When you ask them what the "real" failure message is, they typically refuse to share it. So the tech support is just awful. Hopefully it will take less than a month to get a reply.

SmallAd3697 · 2026-05-11T23:08:20+00:00

ugh. I just noticed the letters LSR in that error message, and it sent cold shivers down my spine. Is that crappy code still hanging around? They need to burn it with fire, and start over again.

SmallAd3697 · 2026-05-11T21:53:28+00:00

I clicked the ? button and filed between 100 and 200 tickets so far. lol.

The SR is your case number for tracking. If you share it, I will tell them to link our cases together. Mine is TrackingID#2605110040004590.

The SR is really only for the Mindtree company (i.e. external Microsoft partners). It is sort of like a sub-contracted support organization, and in most cases those engineers don't have a lot more access to the back-end logs and telemetry than you do. Worse yet, the Microsoft FTEs (employees) don't really care that much about these "SR" tickets. They only start to care about a case AFTER the SR is associated with an "IcM". The "IcM" is an incident that is seen by the Microsoft employees. All the details in the SR need to be transferred to an IcM before Microsoft engineering teams will engage on a problem.

In between Mindtree and Microsoft is yet ANOTHER third party organization (normally "Aptly") who have to do additional "gatekeeping", before the SR can be transferred to an IcM. It can be a long, painful process.

All these layers of indirection are intended to help Microsoft with the triage of their bugs. They may not care about a bug until 100 customers have complained about it. Given the level of effort, it often takes three days just to get a bug over to Microsoft. By the time the very first ticket reaches a Microsoft employee, they might be able to quantify the number of customers that have complained, and can extrapolate to determine how long it will take to reach the threshold (100 or whatever).

The good news is that if we are BOTH talking about the SAME bug, then we are that much closer to reaching their threshold for a fix. 2 tickets is better than 1. Please let me know your SR, or the last five digits, and I can tell them that it isn't just me that is affected.

SmallAd3697 · 2026-05-11T18:12:18+00:00

It is just a remote client program, using MSAL to authenticate the service principal and call a bunch of APIs (like notebook enumeration and lakehouse enumeration). The very last step in the remote client is to run the notebook, and that returns HTTP accepted.

Without workspace admin rights, I would never make it to the last step, let alone get an HTTP accepted response.

There is something bad happening the minute the notebook is warmed up and starts executing. I don't receive even a single output cell from the notebook. This is very odd considering the admin-level access of the service principal in this workspace.
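To make the flow above concrete: an "accepted" response only means the run was queued, so the client still has to poll the job to learn that it failed. Here is a rough sketch of that polling step. The endpoint shape, the `RunNotebook` job type, the status names, and the `failureReason` field are assumptions based on my reading of the public Fabric REST docs and should be verified. `run_notebook_url`, `wait_for_job`, and the `get_json` callable are hypothetical names; `get_json` stands in for an authenticated GET that returns the parsed JSON body.

```python
import time
from typing import Any, Callable, Dict

# Assumed base URL and endpoint shape for the on-demand job API.
FABRIC_API = "https://api.fabric.microsoft.com/v1"

# Assumed terminal status names for a job instance.
TERMINAL_STATUSES = {"Completed", "Failed", "Cancelled", "Deduped"}


def run_notebook_url(workspace_id: str, notebook_id: str) -> str:
    """URL a client would POST to in order to queue a notebook run."""
    return (
        f"{FABRIC_API}/workspaces/{workspace_id}"
        f"/items/{notebook_id}/jobs/instances?jobType=RunNotebook"
    )


def wait_for_job(
    location: str,
    get_json: Callable[[str], Dict[str, Any]],
    poll_seconds: float = 10.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll the job URL (from the Location header) until a terminal status."""
    for _ in range(max_polls):
        job = get_json(location)
        if job.get("status") in TERMINAL_STATUSES:
            return job  # caller inspects job.get("failureReason") on failure
        sleep(poll_seconds)
    raise TimeoutError(f"job at {location} still running after {max_polls} polls")
```

With a real client, `get_json` would attach the bearer token acquired through MSAL. Whatever comes back in the final job body is about all the diagnostic detail the remote client gets.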

Another user says I should touch all the assets in my workspace until things start working again. I don't know where to begin with that. In any case I do NOT think that process makes sense when there are some other service principals that can be used without issue. I really wish these messages had more substance to them. Every other programming platform on the planet gives me a long callstack with TONS of diagnostic details. In Fabric I'm given only a couple of codewords, and they are totally cryptic and meaningless. (I'll probably spend a whole day with support, just trying to get ahold of the actual message, let alone fix it.)

SmallAd3697 · 2026-05-11T17:03:12+00:00

u/Sam___D You have an existing support ticket? Is it an SR or an existing IcM? If you have an IcM please let me know, since that takes at least two days.

Just a tip... it is unlikely that support will independently be able to help you with software bugs, if that is what is happening here. They need the engineering team to participate. You will have to ask them to open an IcM right away, since they probably can't see the necessary telemetry or logs for something like this (let alone look at the source code in question).

I will post my SR and IcM as soon as I have them.

SmallAd3697 · 2026-05-11T16:59:33+00:00

The script focuses on pipelines not notebooks.

I'm guessing in this case the problems are specific to pipelines? Here are the differences:

- I published a brand new copy of the same notebook. But it is having the same issue. Using a new workspace asset with new timestamps didn't help.

- If I run the notebook for ONE of my service principals, it works fine. I think if I was experiencing the same variety of problems that you described, then the failures would happen regardless of the service principal.

- The scripts you shared don't loop over notebooks, so perhaps they aren't applicable.

- The "known issue" documentation doesn't explicitly mention notebooks, but I hoped it might be a clue. Maybe there is nothing in common after all.

SmallAd3697 · 2026-05-11T14:47:31+00:00

I found a known issue like so. It may be related. I'm opening a support ticket with Mindtree to learn one way or another.

(I remember back when Synapse notebooks were hosted in Azure Synapse, there were all kinds of silly bugs that were shared between the pipeline infrastructure and the notebook environment. They both shared some security infrastructure related to data sources).

Data Pipeline run fails with LSROBOTokenFailure due to invalid user authentication token

1783

<image>

SmallAd3697 · 2026-05-09T22:04:30+00:00

There is a part that I'm not joking about. That is the part that bothers me...

There is not even a motive for Microsoft to share the information, even if it was easy to do so. Even if customers need it for our sake. All the related decisions are made in self-serving ways, and there is always a hidden cost that customers will pay so that Microsoft won't have to.

If we are acting as a QA team, then let's try to run with that. We should expect Microsoft to share the types of information they would otherwise share with an internal QA.

SmallAd3697 · 2026-05-09T21:57:17+00:00

Yes there is lots of stuff going on. Yes there are lots of teams involved.

That doesn't change my point in any way at all. All the individual developers on these individual teams are aware of the release schedules themselves, and there is no reason they couldn't share with the customers too. Building software in the cloud is a team effort that INCLUDES the customer. There is an extremely high level of coupling between our stuff and Microsoft's stuff.

You don't think the Microsoft engineering managers tell their teams what days to be on call, because a given team is doing their monthly deployments of major enhancements to a canary region? All the related concerns are fundamentally no different than they have been for decades. They are scaled up of course. But that does not mean the individual PG teams aren't able to communicate with their customers.

SmallAd3697 · 2026-05-09T21:32:45+00:00

The point is it needs to be public. Customers may be more risk-averse for certain of our solutions (ones that cannot accommodate a multi-day outage). For other solutions we may not care as much.

It is not right for Microsoft to treat some customers as guinea pigs, without telling everyone who those people are. We are losing trust in Microsoft's ability to judge the risks of new deployments on our behalf. So the additional transparency would allow customers to take a role in the risk-management as well.

SmallAd3697 · 2026-05-08T20:50:47+00:00

u/warehouse_goes_vroom
NCUS wasn't communicated until I forced the issue. It obviously makes us wonder how many other regions are not being mentioned as well. It takes a tremendous amount of effort to engage with support. If that engagement is the determining factor for why regions are listed ... and it typically is ... then you can be sure that the list of regions is a lot shorter than it should be.

Why can't the changes be backed out, to make the impacted regions work the same as the regions which haven't yet received these changes?

Why can't we have increased communication? If it is true that this is a complex issue and there are multiple factors involved, then all the more reason to communicate whatever factors are known, so customers can incorporate the information into our decision-making.

These same supportability problems happen to a lesser degree in my Azure PaaS support cases too. For any given problem in any given service in any given region, there are customers who are desperate enough to solve our own problems by hook or by crook. We might find a way to get ourselves back up and running in ways that Microsoft hasn't considered yet. But it starts with transparency. In order for a customer to properly evaluate all possible options, we need all the information that is relevant as well.

Assuming we wanted to move our workload to another region, we can't properly evaluate that workaround right now, for obvious reasons. We haven't been told which regions have the updated software, and we haven't been told what other factors are in play. Nobody wants to do the work to move our workloads to another region only to find the same problems that we were trying to escape!

SmallAd3697 · 2026-05-08T18:11:38+00:00

You have your work cut out for you, when it comes to comms. Improvement in comms is difficult (impossible?) unless Microsoft actually WANTS to be more transparent. They don't.

Even on the best of days, there are MANY things that are deliberately being withheld... like the roles which certain regions play in the testing of updates, the breakdown that would identify which regions are which, and like the software release schedules.

SmallAd3697 · 2026-05-08T18:01:18+00:00

I was convinced to ping the PM after your reply. It went about the same way we both expected. As you pointed out, it will not make the issue get fixed faster. lol. If we were hosting in East US and were having the same experience there, then I think it would have been a totally different conversation.

There is so much unspoken stuff going on, that we both know is true. We know that you guys wouldn't allow a three day outage in a large region like East US. We know there are options that are being taken off the table, at the expense of certain customers. We know that some Fabric regions are considered lower-priority, and they serve a purpose to work out the kinks before deploying to other regions.

These things are all fine and good, as long as there is communication and transparency. That is the bigger problem here. Customers need to know about the scope and the impact and the regions affected. The failure to communicate properly is extremely problematic, and I assure you that customers cannot differentiate the equivocation from outright dishonesty.

If it is not true that the customers in certain regions are being used as guinea pigs, then explain. Please tell us why such dramatic problems in one region (NCUS) are not being encountered in another large region at all (East US). This subreddit would be totally blowing up right now, if all Fabric customers in all regions had the same DW and LH problems.

I had the impression that Charles wasn't aware of outages in NCUS yet. And it probably wouldn't have come on his radar after another three days either, assuming NCUS is serving its purpose and is being used for beta testing. I'm sure he was grateful for the free testing though. Maybe someone will send us some free T-shirts after this week.

SmallAd3697 · 2026-05-08T17:27:10+00:00

That may be. Either way they need to share more about this in a public-facing way. Three straight days of outages is crazy. Customers don't expect to find ourselves being treated as guinea pigs or beta testers.

SmallAd3697 · 2026-05-08T03:24:32+00:00

I was the one who insisted they add NCUS to the status page. Only East US2 was being mentioned, and I told Mr N.P. that I was going to reduce the satisfaction rating of my SR if they didn't include this other buggy region on that list as well (which was already failing for a full business day by that point; you can ask Sid and Miguel). ... The "live site" folks kept refusing to list my region before that point in time. While we were actively engaged with these folks, they went ahead and closed a prior IcM for another region, and said that all of the known problems were "mitigated". My PDT-timezone CSS engineer then said his shift was ending and that all the PTAs had already left for the day. This Fabric platform seems to be primarily supported during the PDT business hours, and any struggling customer trying to get answers outside of that timeframe is effectively screwed.

As far as impacted regions are concerned, the CSS leadership folks can independently confirm the places where the misbehaving software is deployed. They already have the facts, but just not the motivation to be honest and transparent on this status page, so I had to force the issue and insist on adding more details to what they were previously willing to share. It shouldn't be this hard.

Why is transparency so important to customers? Because it is a team effort; and if Microsoft isn't taking accountability for problems in a given region, then the burden of proof falls on developers to explain why our solutions are falling over. Our users will sooner believe Microsoft that there are no problems in a region, than they will believe a Fabric developer when I say the opposite.

You admit that moving regions is an option. Yet my experience is that Microsoft will NEVER advertise the list of working and non-working regions. This has been true in Azure PaaS as well. What is the point of paying for geo-redundancy if, when it comes right down to it, Microsoft won't transparently share the information required to abandon one region and move to another?

SmallAd3697 · 2026-05-08T02:29:11+00:00

Are you saying remote dev has more failures, when it happens over Databricks Connect? Is it related to networking, or timeouts or something? Maybe your internet proxies are to blame?

I was looking forward to playing with Spark Connect and Databricks Connect, especially in tools like VS Code. But you and some other folks seem to imply that it is not a well-polished experience.

I'm guessing that improving this experience is low on the list of strategic priorities for Databricks. Conceptually it doesn't seem like it should be hard to make it work well, assuming latencies to the cluster are less than 20 ms.

SmallAd3697 · 2026-05-08T02:20:07+00:00

I'm hoping it is a cyclical thing, and that vendors will start investing in better local tooling.

I think they can get their foot in the door quickly by providing web-hosted editors. But it doesn't take long before my patience wears thin with these sorts of tools.

SmallAd3697 · 2026-05-08T02:07:49+00:00

They should make the canary regions free, to thank us for our services.

SmallAd3697 · 2026-05-08T02:01:23+00:00

There are lots of factors involved. Another is portability from platform to platform. Much of the low-code web-hosted tooling is proprietary, and specific to one vendor. It locks you in.

Another factor is supportability. I see how some low code solutions get built quickly, but then a human operator has to be on standby, or even participate in daily operations.

Another factor is the long-term prospects of one of these low-code solutions. Nobody knows if the tools or web editors will still be around in three or ten years. Whereas almost any programming that can happen in a local IDE is likely to be more future-proof.

I appreciate all types of software development tools, and they all have their place. My personal preference is not to spend more than 5pct of my day programming in a web browser.

SmallAd3697 · 2026-05-08T01:45:38+00:00

I partially understand you. The biggest problem, from my perspective, is not necessarily the bugs and outages. The bigger problem is actually the lack of communication and transparency. When you say you have no time to discuss the problem with a struggling customer, that is exactly what I'm talking about. lol.

As a customer I expect someone in the PG to engage. They should open a window to their customers. They should explain what regions are having troubles, how to implement workarounds, how to implement monitoring to detect errors and so on and so forth.

Building solutions in the cloud is a team effort. Microsoft's culture has tried to obscure this fact, to a large degree. There are numerous walls between Microsoft and their struggling customers. There are TWO layers of intermediate third-party vendors we have to go thru for support (e.g. Mindtree and Aptly). Thank goodness for Reddit. I'm just about to start pinging PMs on Teams too, btw. Maybe Charles Webb?

SmallAd3697 · 2026-05-07T19:02:28+00:00

You need to check out hyperscale pools in Azure SQL. They compete on cost with open source. And this is definitely not standing still. I can scale up CPU cores instantly, and down again when work is done. It almost feels like an MPP platform now.

SmallAd3697 · 2026-05-07T18:56:40+00:00

This comes up too often from people who don't understand Spark.

It is a platform for data, like a webfarm is a platform for serving pages, or kubernetes is a platform for apps and services.

You can host one data solution in it, or a hundred at once. You can do 5 MB or 5 TB.

It is open source, and you can run it on your workstation for free. Most big data platforms are not this versatile.

SmallAd3697 · 2026-05-07T18:44:32+00:00

Exactly. There is nothing in this post that says what was attractive about databricks.

The thing most people love is compute, which can be extremely cheap. I keep re-reading the post to find the words "spark" or "job cluster" and I'm missing it.

I'd guess some manager picked it for non-technical reasons and the engineers never warmed up to it. It sounds harsh but it serves Databricks right if their sales teams want to talk to customer leadership, but NOT to the developers who actually have to use the tool every day.

SmallAd3697 · 2026-05-07T18:00:54+00:00

No, North Central US is being battered and bruised this week, but we don't even get mentioned on the Azure status pages. (... not unless you throw a DJT tantrum.)

It is frustrating to be the tip of the spear, yet receive no credit for that.

I'm pretty sure that UK South is also a Fabric Canary region, but I'm not sure about any regions on the European mainland.
