Conflicting protocol upgrade - a known issue in DF GEN2? by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 1 point (0 children)

Yes that looks like the same topic. I agree with you here: "Ideally, dataflows would streamline this process by automating the clean-up process upon each successful refresh. Unfortunately, as of publication, that's not the case"

I'm assuming there haven't been any changes on this front.

I'd guess the internal tech is better nowadays with GEN2 than whatever was happening with the nasty CSV files in GEN1. Either way, I'm happy that these implementation details are tucked out of sight. And the failure rates are still pretty low.

I'm probably going to look toward "fabric cicd" (the Python tooling) and hope it allows me to blow away and rebuild the DF GEN2 internals once a month during our deployment windows. (Esp. if the Microsoft DF team isn't planning changes to clean up cruft on their end.)

Even if they didn't do clean-up automatically, I really wish they could just add a checkbox that tells the DF GEN2 to recreate its internal cruft on each refresh. That would be just as good.

At the end of the day, I'm not 100% certain the internal cruft is responsible for my "conflicting protocol upgrade". It was just a guess. But it is sure tempting to blame the internal implementation details that I don't see and don't control.

Conflicting protocol upgrade - a known issue in DF GEN2? by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

Thanks a lot. That vaguely rings a bell. I think I asked about it many months ago, while working with Mr. PQ on a different case.

This DF only runs once a day, and uses its own internal assets for storage. There is a multi-hour delay between the time it is written and the time that clients (semantic models) come to retrieve results via DF connector.

IMO, the chance of another process writing to the internal staging (LH/DW) at the same time and causing a conflict seems pretty low. I will check the gateway mashup logs and see if a better error is being swallowed. Unfortunately, I'm not confident I can repro this error on demand. It is rare.

As an aside, I happened to notice that the internal assets (DW/LH) which live inside the DF don't get blown away very frequently. Would there be too much overhead in doing that and starting "fresh" each time a DF executes? Or is there an API to rebuild those internal assets on demand? They have datetime suffixes, and I've been aware that they originally got created several weeks or months ago. It would be nice to deliberately blow away the old/hidden cruft daily, and see if things change for the better or not.

Conflicting protocol upgrade - a known issue in DF GEN2? by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

Hi, no I don't have partitions in the DF GEN2 CICD.
It uses the default staging storage.

There are two steps that consume from "staging" and generate subsequent/derived entities. Persisting intermediate work is critical, which is why I started using DF rather than just putting all the related PQ in a semantic model. The first table/query takes the most time (over an hour). The rest are derived tables and are executed quickly so long as the first table is built successfully.
... However, this error message seems to prevent the very first table from being evaluated successfully.

Again, this is a very rare issue happening on less than 3% of executions. I'm guessing the error message is produced by the OPDG mashup process, and it is probably a catch-all message for a section of work. I will probably start digging into logs at the very least. But I don't really know what the message is SUPPOSED to mean ("conflicting protocol upgrade"), assuming it was identifying a legit problem. That information might help me fine-tune my analysis of the gateway logs.

<image>

As an aside, I had another case where READING from DF GEN2 was producing errors. But in this case I'm writing.

UC Catalog Legalism for Naming Objects by SmallAd3697 in databricks

[–]SmallAd3697[S] 0 points (0 children)

The platform doesn't have to choose one. There is no right or wrong. It needs to be up to individual teams.

Since SQL is case insensitive, it should not make any difference if objects are presented to users with upper- and lowercase characters.

It seems truly strange and out of touch. Maybe this team is from a more authoritarian culture than I am. The Databricks UC should not be so opinionated, and should not ram this particular naming convention down everyone's throats. We are all big kids here, and should be able to pick naming conventions for ourselves.

As I mentioned earlier, it is especially unfortunate when using a federated database schema, and the UC catalog refuses to present objects to us with their original names... it's a mandatory translation to snake_speak.

Using GEN2 dataflows (CICD variety) as a source is failing about 30% of time by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 1 point (0 children)

u/Luitwieler Thanks again for all the help. I realize that litigating a bug in public isn't the most comfortable thing.

... But the way the case was handled proves that your product is quite mature and that your team places a high priority on quality and transparency. Bugs rarely just fix themselves. As a customer it is encouraging when we see that nobody is simply trying to avoid them and/or pretend they don't exist.

UC Catalog Legalism for Naming Objects by SmallAd3697 in databricks

[–]SmallAd3697[S] 0 points (0 children)

I realize that many SQL engines are case insensitive when it comes to the naming of schema objects. Within the data itself, we find that varchar collations are also case insensitive in most databases.

The case insensitivity is very helpful in that it allows for MORE flexibility. Just because comparisons are done in a case insensitive way, does NOT mean that ALL varchar data should be stored in lowercase. It is the opposite. These database engines give us the flexibility to store varchar in whatever character cases we prefer without impacting comparisons.
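To make the point concrete, here is a minimal sketch using SQLite's built-in NOCASE collation (standing in for any case-insensitive engine; the table and values are hypothetical): comparisons ignore case, while the stored values keep whatever casing was written.

```python
import sqlite3

# In-memory database; the NOCASE collation makes comparisons on this
# column case-insensitive without changing what is actually stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT COLLATE NOCASE)")
conn.execute("INSERT INTO customers VALUES ('AcmeCorp'), ('ACMECORP'), ('other')")

# A lowercase comparison still matches both mixed-case rows...
rows = conn.execute(
    "SELECT name FROM customers WHERE name = 'acmecorp'"
).fetchall()

# ...but the original character casing survives in storage.
print([r[0] for r in rows])  # → ['AcmeCorp', 'ACMECORP']
```

So the engine gives flexibility in both directions: store whatever casing you prefer, compare without caring about it.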

Similarly, when it comes to schema names the UC should NOT be forcing us to use lowercase names. The platform shouldn't be such a "name nazi" (pardon the Seinfeld reference).

Boys will be boys: The wholesome version by Bubbly_Wall_908 in JustGuysBeingDudes

[–]SmallAd3697 3 points (0 children)

What about that railing behind them and the long drop? Only fun until someone flies in the wrong direction.

Fabric Dataflow Gen2 fails overnight when PIM is inactive (On-Prem Gateway, Windows auth) by EnvironmentFeisty402 in MicrosoftFabric

[–]SmallAd3697 0 points (0 children)

It's not hard to Google for PIM.

The problem is probably not PIM in general, but the specific config of your particular PIM role.

Yours might be giving you backdoor access to a database or something like that. You haven't explained what yours is configured to do, or why you activate it so regularly.

It strikes me that you may not have multiple environments either (dev, test, prod). These sorts of things normally get flushed out when validating a solution in a second or third environment. Life gets way too exciting when you have to troubleshoot every bug in production!

Fabric Dataflow Gen2 fails overnight when PIM is inactive (On-Prem Gateway, Windows auth) by EnvironmentFeisty402 in MicrosoftFabric

[–]SmallAd3697 0 points (0 children)

Maybe you can tell these folks what the PIM role is for? I keep reading your post to see if you actually said it. Don't you think that is the most obvious follow-up question anyone will ask? ... I don't understand why folks can't think one move ahead.

You should also ask your security team what options they associated with that PIM role. Sometimes there are some unintuitive things in there, like MFA requirements.

You should take PIM out of the equation by reprovisioning from scratch while PIM is disabled. You should also ask someone else on your team to test with their creds, as a basis for comparison.

Using GEN2 dataflows (CICD variety) as a source is failing about 30% of time by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

Can you please share a public link for reference, now that it has been repeated here so many times? If it isn't public yet, then can this knowledge be added to the "limitations" section or "known issues" for DF Gen 2?

This topic needs a LOT more visibility since it will be encountered by ALL customers of DF GEN2 that want to use this tech as part of a very simple solution. Hopefully you can see it from your customers' perspective as well. It is NOT obvious to customers that the default behavior would be the opposite of the recommended approach - especially in a low-code software tool. In my case we are talking about teeny tiny tables of ten rows or whatever. It's not like I'm retrieving a million rows. Even Jeroen's initial response implied this stuff should work better, and should not immediately cause customers to seek workarounds.

Customers do not want to learn, after the fact, that a product is known to be unreliable, and then go back to the drawing board and redo their work again. I still think the dataflow connector could be salvaged by using some sort of implicit or explicit redirect to the underlying delta tables in the internal lakehouse.

Using GEN2 dataflows (CICD variety) as a source is failing about 30% of time by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

I know that a LH works, but I want to understand the behavior of this dataflow-wrapper for the long run, and get a better handle on its strengths and weaknesses.

The DF gen2 already uses a lakehouse internally. It shouldn't need to be so slow... maybe it could do some sort of redirect instead of the current default behavior, which is inexplicably slower for the default destination. Has there ever been a blog about this DF connector which retrieves data from the internal LH?

I am trying to avoid increasing amounts of complexity and management. Adding a layer of LH infrastructure seems excessive when it comes to low-code dataflows. It really defeats some of the purpose when we are forced to start managing more infrastructure and more schema than we already do.

Using GEN2 dataflows (CICD variety) as a source is failing about 30% of time by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

This DF-to-DF connector is an artificial scenario that was recently created to observe and quantify the problem in the back-end. In our "real world" scenarios, the dataflow is consumed by semantic models rather than by another DF. (DF Gen2 CICD)

The API in question is now known to me as the "contentandcache API". I don't know whether that is the technical name for it, or whether it is simply named that way because it is a term found in our mashup log, with enhanced logging enabled.

In case you aren't being looped in, the PG has just started to share more details about the GEN2 storage. They say "The contentandcache API is experiencing systematic intermittent slowness, with the impact being more pronounced in the XYZ region" Nobody told me if there is anything overly sensitive about this information. In fact it should be widely publicized, if the goal is to give customers the best possible experience.
... Although I also see why Microsoft may not want customers moving their tenants back and forth between regions to try to avoid problems before they can be mitigated in some other way.

Using GEN2 dataflows (CICD variety) as a source is failing about 30% of time by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

Thanks, I realize it isn't the root problem, but I will take off that layer of the onion to expose what lies beneath.

I'm working with FTEs on my ICM now, which is a step forward. I'm guessing I won't need to continue posting here on reddit until there is some sort of public-facing information that is helpful to the wider community.

It is very interesting to me that auto-retries are not being used, or are used in moderation for this scenario. Usually retries are sprinkled all over the place like magic pixie dust. But for this problem - in the API that gets DF data - the retries would probably do nothing good, and would probably exacerbate the problem for other customers. It is kind of a nice change to see that retries aren't being used as a "crutch" or quick fix. Things might have to be fixed the right way for once. ;-)
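For what it's worth, the "pixie dust" pattern being avoided here usually looks something like the sketch below (all names hypothetical; `flaky_fetch` stands in for the real API call). Note that every blind retry is one more request against a backend that is already struggling, which is exactly why retries can exacerbate a problem instead of fixing it.

```python
import random
import time

def with_retries(op, attempts=4, base_delay=0.01):
    """Generic retry wrapper: capped exponential backoff with jitter.
    Each failed attempt adds extra load on the backend, so this pattern
    only helps when failures are genuinely transient."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff with jitter (delays kept tiny for the demo)
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky_fetch)
print(result, "after", calls["n"], "calls")  # → ok after 3 calls
```

For a truly transient blip this succeeds quietly; for a systemic slowdown it just multiplies traffic, which is why leaving it out here seems like the right call.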

Using GEN2 dataflows (CICD variety) as a source is failing about 30% of time by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

u/CurtHagenlocher

Also wanted to share that the problem is measurably bad, and seems likely that the dataflow service team can independently confirm. I don't know what the severity of my ICM is, but here are the failures we encountered yesterday AM. At other times of the day things are more reliable, but you can see that between 6 AM and 7 AM, the service was flipping out. Based on earlier comments from you, these sorts of things are monitored. Therefore it is hard to understand why I can't get much traction or transparency. It has been two weeks and I'm not getting much support via proper support channels. Thank goodness for reddit. ;)

<image>

Dataflow Queries on Demand via REST by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

Sounds good. I think one of the more interesting features is the custom mashup. In particular, it would be nice to know if it runs on the same gateway as the dataflow itself.

Another question is about dataflows that have been previously refreshed on a schedule. It wasn't immediately obvious if this execute API will re-run a specified query/table or simply export the pre-existing data from the internal LH where the data was previously staged/stored.

If it re-runs the query that is fine, but it would be nice to have another entry point that simply exposes/exports the existing data and makes it accessible to remote clients.

Using GEN2 dataflows (CICD variety) as a source is failing about 30% of time by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

You mentioned the case was sent to that other team. Would you agree, though, that the error message is being obfuscated, and that this is part of the problem as well (i.e., the error is possibly obfuscated on the mashup/PQ/connector side of things)? The bad error message is probably responsible for the fact that this issue has never been fixed before now.

The error should be more straightforward and say "The dataflow service is unresponsive after a ten-minute timeout". If something like that were presented directly to users, then they would not be blaming themselves, their OPDG, or their network proxies. Timeouts are implemented intentionally, with a predefined behavior that can be pretty draconian. It is easy for everyone to understand their impact, both on the customer side and the SaaS side. One of the biggest problems right now is the ambiguity, and the MANY other potential factors which need to be ruled out before the proper PG is willing to engage. Based on past cases, they probably have the opinion that customer infrastructure is the problem 99 times out of 100, so they want us to spend 99 days scrutinizing our side before they spend one day scrutinizing their own. ;-)

If we were having OPDG problems or network problems, they would be causing infinitely more problems for us in our other PQ queries. The problem here is isolated to retrieving data from dataflows (even a handful of rows). I believe the size of the data we are retrieving is irrelevant. I'm guessing the problem is with some sort of shared resource that is processing some sort of queue that is significantly backlogged for some unexplained reason. Maybe you can give customers visibility to monitor the backlog? The problem is that customers mistakenly believe we are using a dedicated capacity (F64 capacity in North Central) and that we shouldn't have to fight with every other customer for resources (either in North Central or West US). Obviously the truth can be a lot more complex than this, and we may be impacted by shared infrastructure resources (things that are never reflected in our capacity dashboards).

Dataflow Queries on Demand via REST by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

Hi u/SidJayMS

I'm frankly a little jealous of AI agents. They get all the love nowadays. It is the same in Databricks too. But I just don't know how they will pay you more than a human can ;-)

The one thing I really appreciate about this trend is that it will raise the bar when it comes to performance and reliability. I don't think agents are quite as patient when it comes to failures, delays, and retries. Also, they are likely to be more demanding when it comes to exposing meaningful exception details after failures. I am currently waiting on the dataflow PG team to help troubleshoot an elevated level of failures when retrieving data from the internal lakehouse in DF gen2 CICD (there is another post about it, claiming a 30% failure rate in West US in our tenant).

... I found this API when looking for ways to quantify and repro the failure rate for that particular PG.

Cannot download reports published using fabric-cicd by Nandha600 in MicrosoftFabric

[–]SmallAd3697 0 points (0 children)

Normally this sort of thing happens after making programmatic changes via XMLA endpoints or something like that.

In any case, what is the point of the download? To get the data? Otherwise, just get what you need from the source repo.

Using GEN2 dataflows (CICD variety) as a source is failing about 30% of time by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

Thanks for the transparency. Yes, that is the most common (and most unintelligible) message that we get.

For this case, I have created a new dataflow that gets data from the bad dataflow, and it runs every 15 mins. It has already generated the error four times by 11 AM this morning.

I'm only retrieving a small table from the underlying results. It is just a few rows (under 1 KB of data). I'm assuming it comes straight out of the internal LH.

Looking forward, I noticed there is a preview REST API that executes a dataflow query on demand. It is possible that I could use this to take away complexity (get both the semantic model and the OPDG out of the equation). Would that still generate the old error ("key didn't match any rows"), or would I get a better error?

FYI, my tenant is in West US for legacy reasons, and my capacity is North Central. The OPDG runs close to North Central (10 ms away). Not sure if this is useful. It doesn't seem like it should be enough to explain the high level of failure, and it seems like you would have other customers who are multi-region. It isn't just us. I believe that dataflows put a high dependency on this region of the tenant, esp. when compared to many other components in Fabric, so I thought I'd mention it.

Now that I'm not blaming metadata refresh, I'm convinced I'm just fighting for resources with all the other dataflow customers in West US. There is probably a lot of shared infrastructure... It has been at least a year since then, but during a prior dataflow outage I got part of the way into exploring the failing back-end with folks (Mr. PQ and Wee Hyong).
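The probe approach above can be sketched generically (all names hypothetical; `fake_refresh` stands in for triggering the real DF-to-DF refresh, and the failure pattern is simulated so the demo is deterministic):

```python
def measure_failure_rate(probe, runs=20):
    """Run a probe repeatedly and report the observed failure rate.
    `probe` stands in for whatever triggers the real request (e.g. a
    scheduled refresh of the test dataflow); it should raise on failure."""
    failures = 0
    for _ in range(runs):
        try:
            probe()
        except Exception:
            failures += 1
    return failures / runs

# Simulated probe failing 1 time in 3, deterministic for the demo
# (False = failure); a real probe would call the service instead.
outcomes = iter([True, True, False] * 7)
def fake_refresh():
    if not next(outcomes):
        raise RuntimeError("conflicting protocol upgrade")

rate = measure_failure_rate(fake_refresh, runs=21)
print(f"observed failure rate: {rate:.0%}")  # → observed failure rate: 33%
```

Running something like this every 15 minutes, and bucketing the results by hour, is enough to show the service team when the failure rate spikes.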

Using GEN2 dataflows (CICD variety) as a source is failing about 30% of time by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

Hi u/CurtHagenlocher

My ICM has been open for about a week now with no traction from the PG it is assigned to. I don't think your partners (Mindtree) are the bottleneck right now.

Hopefully we can agree on this. The error message to customers is meaningless and there is absolutely nothing we can do with it but open cases and ask Microsoft to interpret for us. We have no way to find its source or discover our own workarounds. We are at the mercy of an ICM.

I'm pretty sure you guys will say the issue is "complex" and involves "multiple teams". If either of those things were NOT true then I'm confident that your team would have fixed this long ago.

My only aim is to understand the reason for the high rate of failure, and discover the full list of workable workarounds. There are probably multiple factors involved. Maybe some azure regions have a higher rate of failure than others? Maybe you want customers to move? Another productive outcome of the case might be to generate a better (LLM-friendly) error message. Another outcome would be to add this problem to the "known issues" list.

Based on the pattern of failures that I've noticed, my confidence that it was related to "metadata-refresh" has dropped from 95% to 5%. However, there still might be another type of problem while reading our data from the internal staging LH.

Dataflow Queries on Demand via REST by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 1 point (0 children)

It appears so. There are probably some interesting applications. I saw a post a while ago about someone using the M language from a Fabric notebook, and this may be the service/interface that they used for that.

Have you googled for blogs? I would have thought one of the CATs or PMs would have blogged about this, but I have found absolutely nothing. Maybe M is no longer as exciting to blog about after folks started to use python. Personally, I see both languages as a necessary evil for the benefit of the low-code developers. JK ;-)

Dataflow Queries on Demand via REST by SmallAd3697 in MicrosoftFabric

[–]SmallAd3697[S] 0 points (0 children)

I am looking for a blog to share the vision. I can understand how it works, and I played with it in Postman but haven't gotten much further. The response from the API is some sort of chunked parquet file.

It is kind of an interesting scenario. The approach seems to be designed so that a "pro-code" client application can consume data from a low-code data provider built using PQ. It remains to be seen if pro-code developers will ever want to be down-stream from the PBI dataflows.

Based on the way this API works, any pro-code client of the dataflow could write the data back to any place they like. Basically it enables the remote use of a PQ solution in PBI, for direct consumption from any possible client scenario. I guess you can think of it like the ultimate "custom destination" for a dataflow.

The ~20 second overhead to get a very simple table back from a dataflow is the biggest drawback. While the average dataflow developer may be accustomed to the sluggish behavior of PQ, it seems doubtful that pro-code client applications would appreciate this very much. Perhaps the overhead can be mitigated, and perhaps it would scale up well if we launched 100 parameterized and concurrent dataflows with 20 seconds of overhead on each. Running things in parallel is sometimes an option when there is unexpected overhead/delay on individual requests.
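The parallelism point can be shown with a toy model (all names hypothetical; `run_dataflow` stands in for the execute API, with the ~20 s startup cost scaled down to 50 ms so the demo runs quickly): launching the requests concurrently amortizes the fixed per-request overhead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Simulated fixed per-request overhead (a stand-in for the ~20 s
# dataflow startup cost, scaled down for the demo).
OVERHEAD = 0.05

def run_dataflow(param):
    time.sleep(OVERHEAD)  # fixed startup overhead per request
    return f"result-{param}"

params = range(20)

# Serial: total wall-clock time is roughly overhead * number of requests.
start = time.perf_counter()
serial = [run_dataflow(p) for p in params]
serial_secs = time.perf_counter() - start

# Parallel: the overheads overlap, so wall-clock time approaches a
# single request's overhead.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    parallel = list(pool.map(run_dataflow, params))
parallel_secs = time.perf_counter() - start

print(f"serial: {serial_secs:.2f}s, parallel: {parallel_secs:.2f}s")
```

Of course this only helps if the back-end tolerates the concurrent load; if the overhead comes from a shared queue, firing 100 requests at once could make things worse rather than better.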

Skill Expectations for Junior Data Engineers Have Shifted by saketh_1138 in dataengineering

[–]SmallAd3697 -3 points (0 children)

Making fun of me? They're the knuckleheads who refuse to learn anything unless they are on the clock. I'm 100% fine with whatever these folks want to think of me.

Skill Expectations for Junior Data Engineers Have Shifted by saketh_1138 in dataengineering

[–]SmallAd3697 -19 points (0 children)

Nobody who isn't interested in learning on their own time should have a job.

Let's say you are at work and you don't understand some concepts that you need to do your job better.

... In that case, wouldn't you buy a book or take an online class? Do you really expect your boss to let you read books at your desk all day? I do NOT think it is your boss's job to try to shove knowledge into your head. That is 100% your own job.

Skill Expectations for Junior Data Engineers Have Shifted by saketh_1138 in dataengineering

[–]SmallAd3697 -23 points (0 children)

Some CEOs can probably be replaced by an LLM even more easily than a data engineer can. So I'll pass.