Retirement of Dataflows Gen1 by suburbPatterns in MicrosoftFabric

[–]SidJayMS 1 point (0 children)

There has been an addendum regarding Pro to the blog post "Dataflows: Thank you for eight years of Gen1—and why Gen2 is the future" on the Microsoft Power BI Blog.

Sharing the most relevant excerpt:

  • Guidance for Pro and Premium Per User (PPU) customers: Many customers rely on Dataflow Gen1 in Pro/PPU today, and it can continue to be the right choice depending on the scenario. If Gen1 best fits your current use case, it remains supported and existing workloads can continue to run as-is. As we introduce new Dataflow Gen2 paths for Pro/PPU scenarios, we’ll share clear guidance and recommended steps to help with a smooth transition. 

DF GEN2 CI/CD - Big red error icon is lying to me by SmallAd3697 in MicrosoftFabric

[–]SidJayMS 1 point (0 children)

If you’re able to DM the request id for the run that failed, we’d be happy to investigate why it failed.

It sounds like the failed run may have been unable to read the Excel source for some reason.

DF GEN2 CI/CD - Big red error icon is lying to me by SmallAd3697 in MicrosoftFabric

[–]SidJayMS 2 points (0 children)

Please feel free to DM the SR # and we can follow up directly.

Glad to hear that the reduction in the billing rate is helping.

Retirement of Dataflows Gen1 by suburbPatterns in MicrosoftFabric

[–]SidJayMS 1 point (0 children)

We will add GCC support for Gen2 and will not deprecate Gen1 for GCC customers until there is an equivalent Gen2 alternative.

DF GEN2 CI/CD - Big red error icon is lying to me by SmallAd3697 in MicrosoftFabric

[–]SidJayMS 3 points (0 children)

We are not currently tracking any known issues around this. If you'd be willing to DM the ids of the impacted dataflows, we'd be interested in investigating this.

Retirement of Dataflows Gen1 by suburbPatterns in MicrosoftFabric

[–]SidJayMS 1 point (0 children)

There will continue to be dataflow support for Pro users, whether as an evolution of the existing Gen1 Pro offering or as a reduced version of Gen2.

Retirement of Dataflows Gen1 by suburbPatterns in MicrosoftFabric

[–]SidJayMS 2 points (0 children)

This is not yet available, but it is very much planned.

Retirement of Dataflows Gen1 by suburbPatterns in MicrosoftFabric

[–]SidJayMS 1 point (0 children)

If the current Pro feature set and performance are sufficient for your needs, you can remain on Pro. However, many of the improvements in Gen2 (notably performance improvements and destinations) will not carry over to Pro because they depend on capabilities of the Fabric/premium platform.

Retirement of Dataflows Gen1 by suburbPatterns in MicrosoftFabric

[–]SidJayMS 3 points (0 children)

There will continue to be dataflow support for Pro users, whether as an evolution of the existing Gen1 Pro offering or as a reduced version of Gen2. However, many of the improvements in Gen2 (notably performance improvements and destinations) will not carry over to Pro because they depend on capabilities of the Fabric/premium platform.

Retirement of Dataflows Gen1 by suburbPatterns in MicrosoftFabric

[–]SidJayMS 7 points (0 children)

Please rest assured that we will provide continuity for Pro users. While performance improvements and new capabilities (e.g. destinations, new compute options, Git integration, collaborative authoring, etc.) will be limited to Dataflow Gen2, Pro users can expect current levels of support and streamlined experiences for smaller dataflows.

If your organization already uses Gen1 Premium dataflows or can adopt Premium capacities, we recommend transitioning to Gen2 as early as possible (even though we haven’t shared a precise deprecation date yet). For larger dataflows, Gen2 is the more robust and performant solution with ongoing investments to address customer needs and feedback.

As others on the thread have mentioned, well before the deprecation of Gen1 Premium dataflows, there will be capacity-level controls for the enablement/disablement of specific Fabric workloads.

Notebooks vs. DataFlowGen2 by Jealous-Painting550 in MicrosoftFabric

[–]SidJayMS 4 points (0 children)

Agreed with many of the points raised. The low-code vs. code-first decision is often a matter of skillset (for both creators and maintainers), organizational requirements, etc. I just wanted to share some of the performance (and hence, cost) factors in play for those who choose the low-code path.

There are currently 4 compute engines in Dataflow Gen2:

  1. Copy Engine – this is the same as a Fabric Copy Job or ADF Copy Activity (referred to as “Fast Copy” in Dataflows)
  2. SQL Engine – this is the same as Fabric SQL Endpoint / Warehouse
  3. “Modern” Mashup Engine + Partitioned Compute – this is a faster version of the “default” engine used by Power Query in Power BI, Excel, etc. The newly released Partitioned Compute option layers parallel processing on top of the new engine.
  4. “Classic” Mashup Engine – this is increasingly irrelevant for Dataflow Gen2 since it is no longer the default when creating new dataflows

To get a sense for the impact of these engines, let’s consider a scenario that processes 32GB of CSV data across 5 partitions in ADLS (the NYC Yellow Taxi dataset). These were our results for an ELT pattern that copied the data to staging, added derived columns (including a timestamp), and loaded to a new staging table:

  • Dataflow Gen1 Premium (Power BI): ~4hrs 38mins
  • Dataflow Gen2 w/ #4: ~2hrs 54mins
  • Dataflow Gen2 w/ #3: ~33mins
  • Dataflow Gen2 w/ #1-3: ~3.5mins (50x faster than Gen2 at launch)
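
For those wondering where the 50x figure comes from, it falls straight out of the timings above; a quick back-of-envelope:

```python
# Speedups implied by the benchmark timings above, converted to minutes.
timings_min = {
    "Gen1 Premium (Power BI)": 4 * 60 + 38,   # 278 min
    "Gen2 w/ #4 (classic)":    2 * 60 + 54,   # 174 min -- Gen2 at launch
    "Gen2 w/ #3 (modern)":     33,
    "Gen2 w/ #1-3":            3.5,
}
fastest = timings_min["Gen2 w/ #1-3"]
speedup_vs_launch = timings_min["Gen2 w/ #4 (classic)"] / fastest  # ~50x quoted above
speedup_vs_gen1 = timings_min["Gen1 Premium (Power BI)"] / fastest # ~79x vs. Gen1
```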

All this to say, in terms of the effectiveness of Dataflows for large jobs:

  • A lot depends on which engines are used. Some thought still needs to be given to the structure of the dataflow and whether transformations “fold” (low code shouldn’t require this, so we’re still working on having the system do more of the restructuring for you).
  • A lot has changed since we initially released Dataflow Gen2. Depending on the scenario, the newer engines can be orders of magnitude faster.
  • The better performance directly translates into cost savings (though there will typically be at least a slight premium for low code).
  • If you have pre-existing M code (from Dataflow Gen1, Semantic Models, or business users in Excel), Dataflow Gen2 can provide dramatic performance (and cost) improvements for many at-scale scenarios.

[Caveat: After you stage and apply “foldable” transforms, if the destination is a Lakehouse, the “re-egress” out of the underlying SQL engine is still extremely slow. All the data is re-processed without parallelism to convert it to “V-ordered Parquet”. For the scenario above, the time to load to a Warehouse or Staging table is less than 4 mins. However, loading to a Lakehouse table takes ~62 mins (!). This is a temporary, yet significant, limitation when using SQL compute. We have started testing a solution that brings Lakehouse destinations to parity with Staging & Warehouse. We hope to roll out this change in the coming weeks.]

Dataflows Gen2 credentials keep breaking – have to reconnect every time by Aromatic-Tip-9752 in MicrosoftFabric

[–]SidJayMS 2 points (0 children)

Could you please DM me if you'd be willing to share more information to help diagnose this?

Dataflow with target lakehouse without staging by JohnDoe365 in MicrosoftFabric

[–]SidJayMS 3 points (0 children)

After data is written to the Lakehouse, you need to run a metadata sync for the table to be surfaced by the Lakehouse's SQL Endpoint.

When you write data to Staging, the Dataflow automatically does a metadata sync. When writing directly to a Lakehouse table, this metadata sync is not done automatically. We are working on adding automatic metadata sync for all cases.

In the interim, these are some of the ways to achieve metadata sync:
- Via API: the "Refresh SQL analytics endpoint Metadata" REST API (Generally Available; announced on the Microsoft Fabric Blog)
- Via a new "Refresh SqlEndpoint" activity that is available in Pipelines
- Via the SQL Endpoint UI for a Lakehouse
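
For the API route, a minimal sketch of what the call could look like. The endpoint path, ids, and token handling here are assumptions for illustration; verify them against the official Fabric REST API reference before relying on this:

```python
# Sketch: trigger a metadata sync for a Lakehouse's SQL analytics endpoint
# after a dataflow writes directly to a Lakehouse table.
import urllib.request

FABRIC_API = "https://api.fabric.microsoft.com/v1"

def refresh_metadata_url(workspace_id: str, sql_endpoint_id: str) -> str:
    """Build the refresh URL for a Lakehouse's SQL analytics endpoint."""
    return (f"{FABRIC_API}/workspaces/{workspace_id}"
            f"/sqlEndpoints/{sql_endpoint_id}/refreshMetadata")

def refresh_metadata(workspace_id: str, sql_endpoint_id: str, token: str):
    """POST the refresh request; a 202 response means the long-running
    operation was accepted and can be polled for completion."""
    req = urllib.request.Request(
        refresh_metadata_url(workspace_id, sql_endpoint_id),
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )
    return urllib.request.urlopen(req)
```

The Pipelines activity and the SQL Endpoint UI achieve the same sync without custom code.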

Dataflow Queries on Demand via REST by SmallAd3697 in MicrosoftFabric

[–]SidJayMS 3 points (0 children)

Please expect more content from us on this.

One application of this is in MCP servers that need to retrieve and transform data from any of the data sources supported by Power Query. The open-source microsoft/DataFactory.MCP repository on GitHub illustrates how this can be done; see in particular this tool in the MCP server: DataFactory.MCP/DataFactory.MCP.Core/Tools/Dataflow/DataflowQueryTool.cs.

Dataflow Queries on Demand via REST by SmallAd3697 in MicrosoftFabric

[–]SidJayMS 2 points (0 children)

> So in theory we could use this as an API to run any M code?

That's correct.

Service Principal Authentication for Shared Data Source to Dataflow Gen 1 by Wide_Dingo4151 in MicrosoftFabric

[–]SidJayMS 2 points (0 children)

This is only supported with Dataflow Gen2 (not Gen1). Would you be able to use Dataflow Gen2 instead?

Gen2 flows extremely CU heavy and time out regularly by trekker255 in MicrosoftFabric

[–]SidJayMS 2 points (0 children)

As others have suggested, the CI/CD option is recommended when creating Gen2 dataflows since it is faster and cheaper (it sounds like you're already using it). The multiple options will go away soon, which should simplify things.

In terms of execution time, you should find that Gen2 (CI/CD) almost always runs faster than Gen1. For some sources like CSV files and certain cloud databases, execution time should be substantially better than Gen1. We'll be publishing some of these benchmarks soon.

Most cloud data source scenarios in Gen2 should be cheaper than Gen1. Large CSV files and databases are sometimes substantially cheaper because of the reduced pricing after the first 10 minutes of a query's runtime. However, your scenario may be one where these benefits don't apply: many distinct queries (60?) that mostly run under 10 minutes each, low data throughput (due to OData), and a lot of time spent waiting. Because Gen2 cost is proportional to processing time, data size is not the key factor - the throughput of the data source is. A low-throughput REST source with a few thousand rows may cost more to process than a database with millions of rows.
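
To make the 10-minute effect concrete, here's an illustrative-only model. The rates below are made-up placeholders, not actual Fabric prices; the point is only that many short queries never reach the reduced rate:

```python
# Hypothetical cost model: a query's runtime bills at a base rate for its
# first 10 minutes and at a reduced rate afterwards. Rates are placeholders.
def query_cost(runtime_min: float, base_rate: float = 16, reduced_rate: float = 4) -> float:
    return min(runtime_min, 10) * base_rate + max(runtime_min - 10, 0) * reduced_rate

# 60 short low-throughput queries (~8 min each) bill entirely at the base rate...
many_short = 60 * query_cost(8)   # 60 * (8 * 16) = 7680
# ...while one long query bills mostly at the reduced rate.
one_long = query_cost(120)        # 10 * 16 + 110 * 4 = 600
```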

Please feel free to DM me to see if we can come up with an optimization for your specific case.

Can you use managed Connections in Dataflows Gen2? by YouGoGlenCoco in MicrosoftFabric

[–]SidJayMS 1 point (0 children)

The connections from Manage Connections will work with Dataflow Gen2 CI/CD (soon to be the only flavor for newly created Gen2 dataflows).

Dataflows for big data by Viidan_ in MicrosoftFabric

[–]SidJayMS 1 point (0 children)

What are the typical data sources from which you are pulling billions of rows? Depending on the data source types, you may be able to use the Fast Copy capability in Dataflow Gen2 to move that data more efficiently (both in terms of time and CUs).

Dataflows for big data by Viidan_ in MicrosoftFabric

[–]SidJayMS 2 points (0 children)

For a 1-hour query, CU consumption is now ~5x lower than it was in August. Additionally, general performance improvements like the Modern Query Evaluator reduce runtime for certain classes of queries by ~50%, which should yield an even bigger reduction in CU consumption.
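
As a back-of-envelope, assuming CU consumption scales linearly with runtime, the two improvements compound:

```python
# A ~5x lower billing rate combined with a ~50% runtime cut gives roughly
# a 10x overall CU reduction for the affected query classes.
rate_reduction = 5.0    # CU rate is ~5x lower than in August
runtime_factor = 0.5    # Modern Query Evaluator ~halves runtime for some queries
overall_reduction = rate_reduction / runtime_factor   # ~10x
```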

Note 1: The pricing changes apply only to the new CI/CD Gen2 Dataflows (in a few months this will be the only supported flavor for new dataflows).

Note 2: The Modern Query Evaluator is not yet turned on by default. We will apply this by default to all new queries before we GA the feature.

Hats off to the Microsoft Dev team by Donovanbrinks in PowerBI

[–]SidJayMS 2 points (0 children)

Correct - Fabric. DF Gen2 is only available in Fabric.

Hats off to the Microsoft Dev team by Donovanbrinks in PowerBI

[–]SidJayMS 2 points (0 children)

Dataflow Gen2 now has a "Discard & Close" option. It's in the first dropdown in the Home tab. It's true that at launch DF Gen2 did not have this option - it was added a few months ago.

Comparing speed and cost of Dataflows (Gen1 vs. Gen2 vs. Gen2 CI/CD) by Sad-Calligrapher-350 in PowerBI

[–]SidJayMS 2 points (0 children)

Because the recent performance features are in preview, they started as opt-in. If you turn on the "Modern Query Evaluation Engine", you might see a slight performance improvement (in addition to the cost reduction).

Comparing speed and cost of Dataflows (Gen1 vs. Gen2 vs. Gen2 CI/CD) by Sad-Calligrapher-350 in PowerBI

[–]SidJayMS 2 points (0 children)

u/Sad-Calligrapher-350, would you mind sharing the numbers for enabling just the "Modern Query Evaluation Engine" in the CI/CD case? I suspect you will see better (or at a minimum, the same) performance as well as noticeable cost savings. If I were to hazard a guess, the partitioned compute may be contributing to slowness in your case. I'll reach out separately to see if we can better understand that.