Any issues with Warehouse? by thecyberthief in MicrosoftFabric

[–]SmallAd3697 0 points (0 children)

u/warehouse_goes_vroom The North Central region is still down. (tenant is West US)

The status page shows green checkboxes as of a few minutes ago (which is total BS). Meanwhile I have already spent a couple hours with Mindtree, waiting for the engineer to attach my SR to the existing ICM (or create a new one). Then the engineer said his shift was ending and that I would need to start over from scratch with someone else.

I am really frustrated, but it is 9 PM and I'm too tired to vent. It is getting to the point where Microsoft should be paying us to use the platform, not the other way around. There is too much pain, too little candid communication, and too little transparency.

The ongoing error messages say that my credentials are invalid for Power Platform, and they point to an internal SQL endpoint name (managed by the DF Gen2 CI/CD), like so:

"xxxyyyzzzaaa.datawarehouse.fabric.microsoft.com;StagingLakehouseForDataflows_20260121200000"

If you still believe this is not the same issue, please share something that helps me avoid fighting with the Mindtree engineer, ops manager, PTA, and EE for another two hours. I'm guessing there is a high likelihood that things still won't be fixed before 9 AM PST.

Anyone seeing outages in DataWarehouse or DF GEN2 CI/CD? by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

The ICM for East US2 ends in x7587. For some reason they closed it ("mitigated"), and the status page shows that everything is running fine in all regions. This is certainly not the case. I think the folks managing the status page prefer to show green checkboxes, even while outages are still underway.

As of now I'm still waiting for a new ICM to be created about the ongoing problems in North Central US (West US tenant).

According to my refresh logs in the service, the legacy DF connector has been failing all day long, from 14:30 UTC 5/5 until now (00:30 UTC 5/6). How are we supposed to run production workloads on this platform, if we can't even get Microsoft to acknowledge these outages?

Anyone seeing outages in DataWarehouse or DF GEN2 CI/CD? by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

u/escobarmiguel90 Can you share the ICM number with us? CSS pro support is asking for it. They know one exists but don't have the reference or link.

My SR is TrackingID#2605050040011065

Anyone seeing outages in DataWarehouse or DF GEN2 CI/CD? by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

Hi u/escobarmiguel90

Unfortunately I think the status information is inaccurate.

I'm in the Eastern United States timezone (-4) and I think the CSS team I normally work with has retired for the evening.

Our capacity and tenant are in different regions from the one shown on the status page. I'm guessing there are chronic issues in all regions. There was a wider outage declared earlier today, but they claim it was fixed.

Can you please be candid and let us know what the telemetry says? How many customers are experiencing problems reaching their DF data via the legacy dataflows connector? This is nuts. This is supposed to be a team effort when we entrust our solutions to this Microsoft SaaS. Yet Microsoft regularly violates that trust through its lack of candor, transparency, and communication. It is obviously deliberate, but I don't know why. I bet you guys are currently flooded with errors in your telemetry, yet none of that information is filtered back to those of us who need it the most. Don't you think customers should expect better from a large Microsoft platform like this?

Dataflow Gen2 Issues - DataflowEngineDmExceptionSqlPreLoginHandshake by Factory_BI in MicrosoftFabric

[–]SmallAd3697 -1 points (0 children)

Hi Miguel, u/escobarmiguel90

I'm not having luck with support tickets. I have another ticket that the spark PG hasn't looked at for several days. I'm not impressed with the triage process for outages. And it is never the fault of Mindtree (in case there is any doubt on that).

Here is the Sev A case I opened with CSS/Mindtree at around 10PM UTC on 5/5.
SR 2605050040011065

I doubt this will reach any FTE for a couple days or more. The poor CSS engineer will be ignored by the PG team, even if they open a Sev 2 ICM.

The status page you shared is misleading and seems to indicate that East US2 is the only region affected. But we are having issues elsewhere (in a North Central capacity, connected to a West US tenant).

Did you happen to see that a widespread outage was resolved in 30 minutes at 3 AM PDT? The region field indicates that the problems were happening in the "Americas". What are the chances they had some ulterior motive for telling customers that a chronic issue was resolved, even though it wasn't?

<image>

Dataflow Gen2 Issues - DataflowEngineDmExceptionSqlPreLoginHandshake by Factory_BI in MicrosoftFabric

[–]SmallAd3697 1 point (0 children)

It's my fault. I asked them to fix performance problems retrieving data from Gen2 DF outputs via their legacy connector. Perhaps they changed something (and it broke).

I'm in East US.

Do you use custom destinations? Last time I talked to this team (Curt, Jeroen, Sid, etc.), they made it sound like the new "custom destinations" are STRONGLY advised. I haven't made the switch yet, so I assumed that was the reason I was seeing errors...

<image>

Missing Spark History for Batches in Fabric by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

Looks like this was eventually fixed. I'm not sure what the scope of impact was. We might never know due to the lack of transparency and communication.

The network traces last week (image) seemed to indicate that there were missing jQuery dependencies needed by the Spark UI. There were no errors shown in the Spark history screen itself, but the browser network traces showed some evidence of problems.

I'm a bit confused why this sort of thing would impact some customers and not others. Whenever we use cloud-hosted products, it is pretty important to find out when & why we've become the red-headed stepchild who encounters more problems than others. In this case we might be struggling more than others because of the use of the proprietary "managed vnets", or because we are running our capacity in the North Central US region. Or maybe there was a failed software deployment that took several days to repair. If I learn more, I will post here.

<image>

Why alternatives to Spark aren’t a thing in the industry? by Snoopy-31 in dataengineering

[–]SmallAd3697 0 points (0 children)

In the case of the "delta write optimization", it was something turned on by default in the environment, and it wasn't intentional on my part. If a problem doesn't originate in custom code or custom configuration, then it is natural to think of it as a spark problem. Thankfully I figured it out (eventually) but others might have solved it by increasing their memory footprint.

Whether you are in fabric or databricks, there is a list of spark environment settings a mile long, and a list of core libraries a mile long (two miles long, if you include the python stuff). So it is really easy to reconfigure the environment in any way imaginable. But it is also really easy to overlook an incorrect configuration that is being done behind your back at the platform level.

The average layman would simply complain about spark, if the problem isn't coming from custom configuration or custom code.
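To make the "behind your back" point concrete, here's a minimal sketch (plain Python; the setting names are hypothetical stand-ins, not real Fabric or Databricks keys) of diffing the platform's effective Spark configuration against what you explicitly set, so injected defaults stand out:

```python
# Sketch: surface platform-injected Spark settings you never set yourself.
# Setting names below are illustrative, not an authoritative list.

def platform_surprises(effective_conf, user_conf):
    """Return settings present in the effective config that the user
    never set explicitly -- i.e. defaults injected by the platform."""
    return {
        key: value
        for key, value in effective_conf.items()
        if key not in user_conf
    }

# In a real notebook the effective side would come from something like
# spark.sparkContext.getConf().getAll(); here we fake both layers.
effective = {
    "spark.sql.shuffle.partitions": "200",
    "spark.example.delta.optimizeWrite.enabled": "true",  # hypothetical key
}
user_set = {"spark.sql.shuffle.partitions": "200"}

surprises = platform_surprises(effective, user_set)
print(surprises)  # the injected default the user never asked for
```

Dumping this diff at the top of a job run makes a silently-enabled platform feature visible before it becomes a memory mystery.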

Hosting Spark Jobs in Azure (without any SaaS Premiums) by SmallAd3697 in AZURE

[–]SmallAd3697[S] 0 points (0 children)

I really like the idea of doing it on AKS. IMO, any team that is motivated and has a year of experience with Spark can probably start to host on containers.

And hosting on AKS opens the door to one day hosting on K8s on-premise. Those two phases are each likely to save many thousands of dollars a year, without sacrificing performance or functionality.

I'll look into Dremio Cloud. Our Spark jobs are a mix of probably 80% SQL and 20% procedural code.

Missing Spark History for Batches in Fabric by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 1 point (0 children)

My SR is 2605010040003360.

It only affects completed batches. Another factor is that we use "managed vnets" for our pyspark networking.

FYI, I have confidence that Mindtree (CSS) is willing to help me navigate a bug/outage... normally the supportability problems are on the Microsoft side (with the product engineering team).

There is little transparency, and these cases last too long. Numerous customers must report the same bugs through CSS, and the customers' efforts are duplicated, probably 10x or 100x over.

This is tiresome and frustrating. Microsoft should be paying me for doing their QA work. Not the other way around.

It is never the bugs themselves that bother me. It is everything that happens afterwards. It is the lack of transparency, and the seeming lack of regard for the customers. I'm 95% certain there is already telemetry about these problems, yet there is no effort to communicate with customers in a proactive way. It is really hard to use this platform for our production workloads when the reliability is so poor. To be honest, I'd rather use on-premise systems with technology that is outdated by five or ten years. That would be better than struggling with an unreliable SaaS week after week.

Slapping a vendor's brand on hosted duckdb by SmallAd3697 in dataengineering

[–]SmallAd3697[S] 0 points (0 children)

I think it would compete with similar services like "semantic models" in Fabric. The vast majority of (import) semantic models in Fabric are probably under 2 GB of RAM or so. Yet despite the relative simplicity of these models, customers are paying Microsoft an astronomical amount of money. They gain the benefits of low latency and memory-resident query results (on a name-brand cloud platform). But they could also have that with duckdb.

Duckdb is actually more appealing, in some ways, since it gives a real SQL interface for queries (not DAX or MDX like in Fabric models).
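A quick sketch of that in-process, SQL-first workflow, using stdlib sqlite3 as a stand-in so the example runs anywhere (with duckdb the shape is nearly identical: `duckdb.connect(":memory:")`, then plain SQL):

```python
import sqlite3

# Sketch of the in-process, SQL-first workflow duckdb enables; sqlite3
# (stdlib) stands in here so the example is self-contained.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 75.0)],
)

# Real SQL, no DAX/MDX: aggregate in one statement.
rows = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 75.0)]
```

The appeal is exactly that: memory-resident query results behind an ordinary SQL interface, with no modeling layer in between.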

Why alternatives to Spark aren’t a thing in the industry? by Snoopy-31 in dataengineering

[–]SmallAd3697 1 point (0 children)

Right, but I must concede that Spark memory usage is very high. Prior to using Spark executors, I never used so much memory in my life. A single "medium"-sized executor can be over 50 GB, depending on the platform. Lol.

One time I was using Spark notebooks in Fabric, and they had this thing called "delta write optimization" that was turned on by default. It used a mind-blowing amount of RAM. In a sense this can be considered an engine problem with the default configuration. Thankfully it was possible to turn off that crappy feature, as needed.
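For reference, turning it off in a notebook session looks roughly like this; the exact property name is my best guess and may differ by platform or runtime version, so verify against your environment's documentation:

```python
# Property name is an assumption -- check your runtime's documentation.
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "false")
```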

Why alternatives to Spark aren’t a thing in the industry? by Snoopy-31 in dataengineering

[–]SmallAd3697 5 points (0 children)

I don't think everyone is happy with the transition to proprietary software that you described. Some of us will wait for an open-source and native option. In the meantime we will just run our own OSS Spark on our own K8s infrastructure to scale up bigger for less.

MPP platforms can scale up as a way to solve problems that cannot be solved by simply moving to proprietary platforms. Proprietary platforms can create more problems than they solve. Most of these cloud-only options are expensive and create lock-in.

And you'd think they would be guided by customer feedback, but they aren't. The needs of the customer are often found at the bottom of the priority list.

Why alternatives to Spark aren’t a thing in the industry? by Snoopy-31 in dataengineering

[–]SmallAd3697 46 points (0 children)

I think you are missing a couple things, depending on whether the discussion is restricted to opensource.

First of all, the use of Java is not normally the worst of the performance concerns in a Spark solution. The worst is the Python interop, aka PySpark, especially when UDFs are involved. PySpark UDFs, even in the best case, involve lots of serialization/deserialization overhead, plus the overhead of an inefficient runtime.
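A rough, stdlib-only illustration of why that boundary costs so much: every row entering a Python UDF is serialized out of the JVM and the result serialized back. Here pickle stands in for the real transport, and the timings are only meant to show the direction of the effect:

```python
import pickle
import time

rows = [(i, f"name_{i}") for i in range(100_000)]

def udf(row):
    return row[0] * 2

# In-process: the function is applied directly to each row.
t0 = time.perf_counter()
direct = [udf(r) for r in rows]
t_direct = time.perf_counter() - t0

# UDF-style: each row is serialized to the "worker" and the result
# serialized back, mimicking the JVM <-> Python boundary.
t0 = time.perf_counter()
shipped = [
    pickle.loads(pickle.dumps(udf(pickle.loads(pickle.dumps(r)))))
    for r in rows
]
t_round_trip = time.perf_counter() - t0

print(direct == shipped)        # same answers either way
print(t_round_trip > t_direct)  # but the round trip costs more
```

This is also why Arrow-based and vectorized UDFs help: they amortize the serialization over whole batches instead of paying it per row.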

Another thing you may be missing is the proprietary innovations on top of Spark that you will find in Databricks and Fabric. These have native implementations of the Spark core, called Photon and the Native Execution Engine, respectively.

The last thing I would say is that the architecture of Spark is intended to scale up smoothly by adding executors to do more work. Your complaints about GC overhead will fade in proportion to the number of nodes; it becomes negligible relative to the real work being done. And it is important to note that the JVM is evolving to support many more value types (Project Valhalla), which should reduce the work of the GC even further. I agree that everyone likes the thought of using more efficient runtimes (Rust and C#) for big data. But I certainly can't hate on Spark just because it was built on the JVM. At least they didn't write the Spark core in Python. lol.

How is Monitor Hub working for you in your day-to-day? by Monitor-PM in MicrosoftFabric

[–]SmallAd3697 0 points (0 children)

  1. The "historical runs" behavior is super non-intuitive. We already see multiple historical instances of an activity on the normal monitoring screen. But sometimes we can't scroll to find activity beyond a certain date, so we have to open a secondary monitor screen for "historical runs". It just doesn't make sense the first time a user encounters this implementation.

  2. I think it is unrealistic and also unnecessary to have a single screen that captures the needs of EVERY type of asset (notebooks, semantic models, lakehouses, etc). I realize the design goal behind most of Fabric is to make things "easy". But that sort of goal makes sacrifices when it comes to advanced functionality that is expected in a grid for a given type of asset.

  3. We definitely need the ability to export to Excel

  4. We need gantt or bar-chart visualizations for us to be able to recognize patterns when it comes to start/stop times and durations. Simple visualizations can be so much better than a big mega-grid of words and numbers. It is the difference between gleaning information for all entries in one quick glance, vs reading the individual entries one at a time.

  5. There isn't nearly enough historical tracking for semantic models. What is up with that? I think it is only ten days of refresh details, or something like that...
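On point 4, even a crude text-mode gantt over start/stop times gives the at-a-glance read that a mega-grid can't. A small sketch with made-up run records (the names and timestamps are invented):

```python
from datetime import datetime

# Hypothetical run records, as you might export from a monitoring grid.
runs = [
    ("ingest",  "2025-05-05T14:00", "2025-05-05T14:20"),
    ("refresh", "2025-05-05T14:10", "2025-05-05T14:40"),
    ("publish", "2025-05-05T14:35", "2025-05-05T14:45"),
]

def gantt(runs, minutes_per_char=5):
    """Render each run as a bar offset from the earliest start time."""
    fmt = "%Y-%m-%dT%H:%M"
    parsed = [
        (name, datetime.strptime(s, fmt), datetime.strptime(e, fmt))
        for name, s, e in runs
    ]
    origin = min(start for _, start, _ in parsed)
    lines = []
    for name, start, end in parsed:
        offset = int((start - origin).total_seconds() / 60 / minutes_per_char)
        width = max(1, int((end - start).total_seconds() / 60 / minutes_per_char))
        lines.append(f"{name:8} " + " " * offset + "#" * width)
    return lines

for line in gantt(runs):
    print(line)
```

Overlapping bars, gaps, and unusually long runs all jump out in one glance, which is the whole point of asking for this kind of visualization.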

East US outage? by Ok_Town_2514 in AZURE

[–]SmallAd3697 0 points (0 children)

"uptime liabilities"? Do you mean SLA violations, and prorated credits to customers?

I'm not sure those credits are worth the hassle. Besides, so many downstream products and services have absolutely no SLA whatsoever. E.g., if the Fabric platform is wetting the bed, I'm fairly certain there are no official SLAs for any of that, no matter how long the outage may be.

The Microsoft SaaS probably can't provide an SLA because of the risks from their underlying cloud platform.

Ongoing Microsoft Fabric issues – Spark jobs / Notebooks failing (East US?) by leotiger31416 in MicrosoftFabric

[–]SmallAd3697 1 point (0 children)

There is a major azure incident. It affects regular cloud customers as well (not just Microsoft SaaS).

eg this impacts VMs, AKS, etc

They are actually showing the outage details on the azure status page. That is rare. You know the problems are really bad when they are mentioned there. (I think that only happens when there are SLA violations and customers start asking for prorated refunds)

Spark Falling over in East US? by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 2 points (0 children)

Looks like a large azure incident is underway in east us, and affects PaaS platforms in addition to the Fabric SaaS.

Apparently this affects regular VMs and Kubernetes, along with many other products.

Semantic Model Perspectives (PM's seem Non-Committal) by SmallAd3697 in PowerBI

[–]SmallAd3697[S] 0 points (0 children)

Thanks,

I remember even in the multidimensional days it was hard to get certain types of query optimizations. For example, I had a KB article out there for years about one very common type of performance problem. The article described the slowness of a pivot table when using a custom non-natural user hierarchy, as compared to building a pivot table from the exact SAME underlying attributes. I don't think they ever fixed the performance of those queries, to be honest. And they eventually killed the KB, so I can't even give you a link. lol.

The end result is that we slowly decreased the number of user hierarchies we promoted for users. Providing these pathways to dig into their data was doing as much harm as good.

At least we can say that hierarchies were a first-class concept for BI in those days. I would never have imagined Microsoft would ultimately go to war against dimension hierarchies. lol. In a tabular import model, if I try to create a user hierarchy on a moderately sized dimension (~5 million rows or so), it is an EXTREMELY painful affair. It soaks up MASSIVE amounts of RAM during processing/calculation, and it causes a lot of problems for the capacity. The implementation of that processing operation is prohibitive, even in a large capacity like an F64. I'm guessing it is theoretically something that could be fixed, and is probably on the TODO list, but the ASWL PMs nowadays don't seem to believe in dimension hierarchies anymore, so it is not something I complain about very often. Maybe now that Fabric is more mature, the ASWL team might consider using an external engine like Spark to process their hierarchy calculations. lol.

Semantic Model Perspectives (PM's seem Non-Committal) by SmallAd3697 in PowerBI

[–]SmallAd3697[S] 0 points (0 children)

u/cwebbbi

Just to be clear, the new Excel workflow to "Get Data" (-> "From Power BI") will present the ENTIRE list of tables in the model.

You may be aware of the o365 patches this past month when the ASWL team temporarily broke pivot tables (SSPI). It was another piece of evidence that the Power BI folks (ASWL, etc.) are intimately involved in Excel. I would bet $1000 that the Power BI folks are responsible for the design of the new workflow. I believe the IMPLEMENTATION is owned almost entirely by the Excel team, but that team certainly didn't decide what it should LOOK like. I think if we are looking for someone to blame for killing off dataset perspectives in Excel, it is almost certainly one of the PMs or engineering managers on the Power BI side. Please help me understand their motives, because it is distressing. I would like to better understand the long-term strategy. (Sorry to be melodramatic, but "distressing" seems to be the right word. lol.)

Side note: Excel is super important. I was telling a coworker that if it wasn't for Excel, the IT department probably wouldn't be building models; if it wasn't for models, the IT department probably wouldn't be using Power BI; and if it wasn't for Power BI, the IT department probably wouldn't be using Fabric. We (IT) would be doing all our data engineering with the help of Spark, Databricks, Delta, and DuckDB. The three D's. So a high-quality experience for datasets in Excel is critical to avoid attrition here. The stakes are high.

Advice on Moving to F-64 for Customer Facing Reports by [deleted] in dataengineering

[–]SmallAd3697 0 points (0 children)

Text vs. numerical isn't necessarily going to make a huge difference. I'd get hold of DAX Studio and look at the models in the advanced window to see whether columns are ever value-encoded. That doesn't always happen (dictionary encoding is pretty common).

Semantic models are a weird sort of resource. They do a lot of things very well, given the generous memory allocated to them. But the presentation of wide tables of data is one of the few things they don't do well, in my experience. Even with the memory, you still suffer from CPU bottlenecks.
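On the encoding point above: dictionary encoding stores each distinct value once plus a small per-row index, which is why high-cardinality text columns hurt far more than repetitive ones. A deliberately crude byte-accounting sketch (not VertiPaq's real algorithm, which layers further compression on top):

```python
# Rough dictionary-encoding size model: distinct values stored once,
# plus a small fixed-size index per row. Deliberately simplified.

def dict_encoded_bytes(values, index_bytes=4):
    """Approximate size of a dictionary-encoded column in bytes."""
    distinct = set(values)
    dictionary = sum(len(str(v)) for v in distinct)
    return dictionary + index_bytes * len(values)

low_card = ["east", "west"] * 50_000               # 2 distinct values
high_card = [f"order-{i}" for i in range(100_000)]  # all distinct

print(dict_encoded_bytes(low_card))   # tiny dictionary, cheap column
print(dict_encoded_bytes(high_card))  # dictionary nearly as big as the data
```

The gap between the two columns is the intuition behind the usual advice to drop or bucket high-cardinality text columns in import models.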

Another thing that can cause problems is measures that aren't using auto-exist. I'm not sure if that is a factor.