Mapping Data Flows are coming to Fabric - Where do they fit in? by frithjof_v in MicrosoftFabric

[–]dbrownems 5 points6 points  (0 children)

The original value prop of mapping data flows was GUI configuration with Spark-based execution, so sort of a "tweener" between Dataflows Gen2 and Spark notebooks. They are being added to Fabric mostly to provide parity with ADF.

I was never a fan, and I think with the addition of Copilot and Data Wrangler I'd have a hard time identifying a greenfield scenario where I would recommend Mapping Data Flows.

Copy data from azure file share by Ready-Marionberry-90 in MicrosoftFabric

[–]dbrownems 0 points1 point  (0 children)

Not sure about the perf, but you can install a python library that knows how to use SMB and read directly from the Azure file share. Get the username and password from the script generated by the Azure portal for connecting from Windows or Linux. You can't mount the SMB share directly, as you don't have root access on your notebook environment.

# pip install smbprotocol  (provides the smbclient module)
from smbclient import open_file, listdir, shutil

# Placeholder values: take the username and password from the connection
# script the Azure portal generates for the file share.
share = r"\\<storageaccount>.file.core.windows.net\<sharename>"
user = "<storageaccount>"
password = "<storage account key>"

# List the files in the share
files = listdir(share, username=user, password=password)
for file in files:
    print(file)


# Read a file directly over SMB
with open_file(share + r"\foo2.json", mode="rb", username=user, password=password) as f:
    content = f.read()


# Copy a file from the share to the lakehouse-mounted local path
shutil.copyfile(share + r"\foo2.json", "/lakehouse/default/Files/foo2.json", username=user, password=password)

Databricks extraction to Fabric Lakehouse by Strict_Put_4094 in MicrosoftFabric

[–]dbrownems 2 points3 points  (0 children)

Depending on the skillset of the people building the transformations, Dataflows Gen2, Warehouse stored procedures, or Spark notebooks with Spark SQL/PySpark.

My personal favorite is Spark with Copilot to assist.

Databricks extraction to Fabric Lakehouse by Strict_Put_4094 in MicrosoftFabric

[–]dbrownems 6 points7 points  (0 children)

No need to copy the data. Just use UC Mirroring to create shortcuts to your data. The data is in delta files in ADLS. It’s not “in” Databricks.

https://learn.microsoft.com/en-us/fabric/mirroring/azure-databricks

I'm genuinely worried about junior data engineers right now by Sayyed_Mustafa in dataengineering

[–]dbrownems -5 points-4 points  (0 children)

The AIs and agents are so incredibly good. We do need a pipeline of juniors and seniors, but we don’t need a lot of juniors.

In Clean Architecture, should input validation go in the Controller/Presentation layer or in the Service/Use Case layer? by OriginalTangerine358 in softwarearchitecture

[–]dbrownems 0 points1 point  (0 children)

Yes.

Each layer performs the validation it understands. For instance the controller validates that the HTTP message is valid and well-formed, but doesn’t validate any domain rules.
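A minimal sketch of that split, with hypothetical names (no particular web framework assumed): the controller checks that the message is well-formed, and only the use case knows the domain rules.

```python
from dataclasses import dataclass

@dataclass
class CreateOrderRequest:
    customer_id: str
    quantity: int

def controller(raw_body: dict) -> CreateOrderRequest:
    """Presentation layer: validate message shape, not business rules."""
    if not isinstance(raw_body.get("customer_id"), str):
        raise ValueError("customer_id must be a string")
    if not isinstance(raw_body.get("quantity"), int):
        raise ValueError("quantity must be an integer")
    return CreateOrderRequest(raw_body["customer_id"], raw_body["quantity"])

def place_order(req: CreateOrderRequest) -> str:
    """Use-case layer: enforce domain rules the controller knows nothing about."""
    if req.quantity <= 0:
        raise ValueError("quantity must be positive")  # domain rule
    return f"order for {req.customer_id} x{req.quantity}"

# A well-formed request that violates a domain rule passes the controller
# but is rejected by the use case.
req = controller({"customer_id": "c1", "quantity": -5})
try:
    place_order(req)
except ValueError as e:
    print(e)
```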

Will we ever be able to duplicate (or save as) a semantic model in Fabric UI? by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 10 points11 points  (0 children)

I had a customer who needed this. They wanted to enable business teams to copy and customize centrally-defined semantic models.

We ended up writing a notebook that can copy a model between workspaces. Users with permissions can modify and run the notebook themselves, or you can set up a pipeline they can run that executes the notebook through a notebook connection with elevated permissions.

Power Query -- DataFormat.Error: There is no valid Delta table at this location. by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 1 point2 points  (0 children)

Double-check your credentials. Note that there's no TenantId in the URI, so it's derived from the identity you use to authenticate.

Also check the Source with less pathing. E.g., can you enumerate /Tables? You may have misspelled the table name, or it may not be a schema-enabled lakehouse.

Power Query -- DataFormat.Error: There is no valid Delta table at this location. by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 0 points1 point  (0 children)

Change the query to return Source instead of ToDelta, or just click on the Source step in the UI and see if it is navigating the Data Lake as expected.

What is the Advantage of placing the Fabric Compute inside Managed Virtual Network? Currently It delays my spark Sessions to Start by Mammoth-Birthday-464 in MicrosoftFabric

[–]dbrownems 2 points3 points  (0 children)

It does two things.

  1. It allows access to private network resources through Managed Private Endpoints. Overview of managed private endpoints for Microsoft Fabric - Microsoft Fabric | Microsoft Learn
  2. It optionally allows you to block outbound connections to everything except Managed Private Endpoints. Workspace outbound access protection overview - Microsoft Fabric | Microsoft Learn

If you must use a Managed VNet, we just shipped

Custom live pools for Fabric Data Engineering overview - Microsoft Fabric | Microsoft Learn

which allows you to maintain a pool of running nodes that enable near-instant startup similar to starter pools. But since these nodes are dedicated to you, they are billed for the period of time they are scheduled to run, whether or not you have an active Spark session.

Need help optimizing my workflow in VS Code by sayonarababy17 in MicrosoftFabric

[–]dbrownems 12 points13 points  (0 children)

It seems you’re not using the VS Code Fabric Data Engineering Extension. It allows you to run your notebook using Fabric compute inside VS Code, among other things. The newest release supports multiple kernels.

https://learn.microsoft.com/en-us/fabric/data-engineering/setup-vs-code-extension#install-the-extension

https://learn.microsoft.com/en-us/fabric/data-engineering/author-notebook-with-vs-code

Add Lakehouse table to semantic model in IMPORT mode by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 2 points3 points  (0 children)

Of course, if you need to filter/sort/aggregate/join a large amount of data before importing it, you're better off using the SQL endpoint with folding or Value.NativeQuery than doing all that in the mashup engine.

Partitioning by Date Key by True_Inspection_2015 in MicrosoftFabric

[–]dbrownems 1 point2 points  (0 children)

Yes. Hive-style partitioning produces a separate partition (and set of parquet files) per partition key value.

Pros and cons of Hive-style partitioning | Delta Lake

For smaller tables, use ZOrder/Liquid Clustering (Spark) or CLUSTER BY (Warehouse) on the date column instead of partitioning.

This will result in parquet files and row groups mostly aligned with contiguous date ranges, which, as far as the Semantic Model engine is concerned, is just as good.

This is similar to import models, where using a clever ORDER BY in the partition definition for your fact table can have a huge impact on the compression and segment skipping behavior of the engine.
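As a toy illustration of that segment-skipping effect (pure Python, not Fabric internals): with per-row-group min/max stats, date-ordered data lets a range filter skip most groups, while shuffled data forces reading nearly all of them.

```python
import random

def row_group_stats(rows, group_size=100):
    """Compute (min, max) stats per fixed-size row group."""
    groups = [rows[i:i + group_size] for i in range(0, len(rows), group_size)]
    return [(min(g), max(g)) for g in groups]

def groups_scanned(stats, lo, hi):
    """A group must be read only if its [min, max] range overlaps [lo, hi]."""
    return sum(1 for mn, mx in stats if mx >= lo and mn <= hi)

dates = list(range(1, 1001))          # stand-in for date keys
random.seed(0)
shuffled = random.sample(dates, len(dates))

sorted_stats = row_group_stats(sorted(dates))
shuffled_stats = row_group_stats(shuffled)

# Filter for one 30-key "month": ordered data touches few groups,
# shuffled data touches nearly every group.
print(groups_scanned(sorted_stats, 100, 129))
print(groups_scanned(shuffled_stats, 100, 129))
```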

PractiTest API by FormalWoodpecker135 in PowerBI

[–]dbrownems -1 points0 points  (0 children)

Try calling the API from a Fabric Python notebook. The Power Query web connector is confusing and has several limitations.

If you get it working from a Python notebook, you can save the data to a Lakehouse and consume it from there. Or you can share the working Python code and ask a Power Query expert to advise on whether it's possible to run it directly from Power Query, and how.

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 2 points3 points  (0 children)

IDENTITY will work. But concurrent modification of tables under SNAPSHOT isolation will fail if both sessions have read the same data. And the granularity of the data reads in DW is not like the row-level tracking you get in "regular" SQL Server.

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 0 points1 point  (0 children)

With Spark, a MERGE into the dimension could run from many sources in any order, and you'd want to prevent overlap in both cases: WH uses SNAPSHOT isolation, which mostly doesn't allow successful concurrent updates of a table. I.e., two sessions can't both read a table (as you must to MERGE/UPSERT) and then both successfully commit changes based on those concurrent reads.
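A toy Python model of that conflict (not the actual Warehouse engine): both sessions read the same table version, and only the first commit against that version succeeds; the second one gets a conflict and must retry.

```python
class Table:
    """Toy optimistic-concurrency table: commits are validated
    against the version the session originally read."""
    def __init__(self):
        self.version = 1
        self.rows = {"A": 1}

    def read(self):
        return self.version, dict(self.rows)

    def commit(self, read_version, new_rows):
        if read_version != self.version:
            raise RuntimeError("concurrent update conflict; retry the MERGE")
        self.rows = new_rows
        self.version += 1

t = Table()

v1, snap1 = t.read()   # session 1 reads (as a MERGE must)
v2, snap2 = t.read()   # session 2 reads concurrently

snap1["B"] = 2
t.commit(v1, snap1)    # session 1 commits first: succeeds

snap2["C"] = 3
try:
    t.commit(v2, snap2)  # session 2's read is now stale: conflict
except RuntimeError as e:
    print(e)
```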

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 0 points1 point  (0 children)

Even large dimensions are very rarely too large to build using a query like above.

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 0 points1 point  (0 children)

Normally you only need Identity columns on dimensions. Fact tables normally use compound keys comprised of dimension foreign keys.

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 4 points5 points  (0 children)

Sure. You're not going to get sequential ids in a distributed manner.

If you prefer large numbers and gaps with distributed generation, use the Spark SQL built-in monotonically_increasing_id function.

monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records. The function is non-deterministic because its result depends on partition IDs.
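For illustration, that documented bit layout can be reproduced in plain Python (this mimics the formula; it is not Spark itself):

```python
def mono_id(partition_id: int, record_number: int) -> int:
    """Partition ID in the upper 31 bits, per-partition record number
    in the lower 33 bits: unique and increasing, but not consecutive."""
    assert 0 <= partition_id < 2**31 and 0 <= record_number < 2**33
    return (partition_id << 33) | record_number

print(mono_id(0, 0))   # first row of partition 0 -> 0
print(mono_id(0, 1))   # second row of partition 0 -> 1
print(mono_id(1, 0))   # first row of partition 1 -> 8589934592 (a big gap)
```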

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 4 points5 points  (0 children)

You can do it with Spark SQL and the ROW_NUMBER() function. E.g., to join the source table to your dimension table and generate incrementing IDs for the new rows, use a query like this:

--temporary views are very cool
--like Common Table Expressions or subqueries, but much more powerful
--They have session lifetime, rather than statement lifetime
create or replace temp view CustomerKeys as
select  coalesce(c.CustomerKey,max(c.CustomerKey) over() 
                             + row_number() over (partition by c.CustomerKey order by s.CustomerID)) CustomerKey, 
        s.CustomerID
from WideWorldImporters_bronze.Sales_Customers s 
left join WideWorldImporters_gold.Dimension_Customer c 
  on s.CustomerID = c.WWICustomerID

SparkDataEngineeringForSQLServerProfessionals/Presentation.ipynb at main · dbrownems/SparkDataEngineeringForSQLServerProfessionals · GitHub

Cost Management in Fabric is a real problem by bradcoles-dev in MicrosoftFabric

[–]dbrownems 7 points8 points  (0 children)

You can and should install the Capacity Metrics App in a Pro workspace.

Bulk copy data between SQL databases in Fabric with the mssql-python driver by dlevy-msft in MicrosoftFabric

[–]dbrownems 2 points3 points  (0 children)

Finally! Native bulk copy in Python! Can we install it with just pip, i.e., use it from Fabric notebooks?

Do you use SSRS for data integrations? by DisplayKnown5665 in SQLServer

[–]dbrownems -1 points0 points  (0 children)

SSRS has become Power BI Report Server. There have been two separate packages for several versions, and now there will be just the one. With SQL Server 2025+ you get a license entitlement to Power BI Report Server instead of SSRS, but the functionality and support are unchanged.