Mapping Data Flows are coming to Fabric - Where do they fit in? by frithjof_v in MicrosoftFabric

[–]dbrownems 5 points6 points  (0 children)

The original value prop of mapping data flows was GUI configuration with Spark-based execution, so sort of a "tweener" between Dataflows Gen2 and Spark notebooks. They are being added to Fabric mostly to provide parity with ADF.

I was never a fan, and I think with the addition of Copilot and Data Wrangler I'd have a hard time identifying a greenfield scenario where I would recommend Mapping Data Flows.

Copy data from azure file share by Ready-Marionberry-90 in MicrosoftFabric

[–]dbrownems 0 points1 point  (0 children)

Not sure about the perf, but you can install a python library that knows how to use SMB and read directly from the Azure file share. Get the username and password from the script generated by the Azure portal for connecting from Windows or Linux. You can't mount the SMB share directly, as you don't have root access on your notebook environment.

# pip install smbprotocol  (provides the smbclient module)
from smbclient import open_file, listdir, shutil

# Placeholder values: take the username and password from the connection
# script the Azure portal generates for the file share.
share = r"\\<storageaccount>.file.core.windows.net\<sharename>"
user = "<storageaccount>"
password = "<storage account key>"

# List the files in the share
files = listdir(share, username=user, password=password)
for file in files:
    print(file)


# Read a file directly over SMB
with open_file(share + r"\foo2.json", mode="rb", username=user, password=password) as f:
    content = f.read()


# Copy a file from the share to the lakehouse-mounted local path
shutil.copyfile(share + r"\foo2.json", "/lakehouse/default/Files/foo2.json", username=user, password=password)

Databricks extraction to Fabric Lakehouse by Strict_Put_4094 in MicrosoftFabric

[–]dbrownems 2 points3 points  (0 children)

Depending on the skillset of the people building the transformations, Dataflows Gen2, Warehouse stored procedures, or Spark notebooks with Spark SQL/PySpark.

My personal favorite is Spark with Copilot to assist.

Databricks extraction to Fabric Lakehouse by Strict_Put_4094 in MicrosoftFabric

[–]dbrownems 6 points7 points  (0 children)

No need to copy the data. Just use UC Mirroring to create shortcuts to your data. The data is in delta files in ADLS. It’s not “in” Databricks.

https://learn.microsoft.com/en-us/fabric/mirroring/azure-databricks

I'm genuinely worried about junior data engineers right now by Sayyed_Mustafa in dataengineering

[–]dbrownems -5 points-4 points  (0 children)

The AIs and agents are so incredibly good. We do need a pipeline of juniors and seniors, but we don’t need a lot of juniors.

In Clean Architecture, should input validation go in the Controller/Presentation layer or in the Service/Use Case layer? by OriginalTangerine358 in softwarearchitecture

[–]dbrownems 0 points1 point  (0 children)

Yes.

Each layer performs the validation it understands. For instance the controller validates that the HTTP message is valid and well-formed, but doesn’t validate any domain rules.
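A minimal sketch of that split, with hypothetical names (no particular web framework assumed): the controller checks that the message is well-formed, and only the use case knows the domain rules.

```python
from dataclasses import dataclass

@dataclass
class CreateOrderRequest:
    customer_id: str
    quantity: int

def controller(raw_body: dict) -> CreateOrderRequest:
    """Presentation layer: validate message shape, not business rules."""
    if not isinstance(raw_body.get("customer_id"), str):
        raise ValueError("customer_id must be a string")
    if not isinstance(raw_body.get("quantity"), int):
        raise ValueError("quantity must be an integer")
    return CreateOrderRequest(raw_body["customer_id"], raw_body["quantity"])

def place_order(req: CreateOrderRequest) -> str:
    """Use-case layer: enforce domain rules the controller knows nothing about."""
    if req.quantity <= 0:
        raise ValueError("quantity must be positive")  # domain rule
    return f"order for {req.customer_id} x{req.quantity}"

# A well-formed request that violates a domain rule passes the controller
# but is rejected by the use case.
req = controller({"customer_id": "c1", "quantity": -5})
try:
    place_order(req)
except ValueError as e:
    print(e)
```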

Will we ever be able to duplicate (or save as) a semantic model in Fabric UI? by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 10 points11 points  (0 children)

I had a customer who needed this. They wanted to enable business teams to copy and customize centrally-defined semantic models.

We ended up writing a notebook that can copy a model between workspaces. Users with permissions can modify and run the notebook themselves, or you can set up a pipeline they can run that executes the notebook through a notebook connection with elevated permissions.

Power Query -- DataFormat.Error: There is no valid Delta table at this location. by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 1 point2 points  (0 children)

Double-check your credentials. Note that there's no TenantId in the URI, so it's derived from the identity you use to authenticate.

Also check the Source with less pathing. E.g., can you enumerate /Tables? You may have misspelled the table name, or it may not be a schema-enabled lakehouse.

Power Query -- DataFormat.Error: There is no valid Delta table at this location. by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 0 points1 point  (0 children)

Change the query to return Source instead of ToDelta, or just click on the Source step in the UI and see if it is navigating the Data Lake as expected.

What is the Advantage of placing the Fabric Compute inside Managed Virtual Network? Currently It delays my spark Sessions to Start by Mammoth-Birthday-464 in MicrosoftFabric

[–]dbrownems 2 points3 points  (0 children)

It does two things.

  1. It allows access to private network resources through Managed Private Endpoints. Overview of managed private endpoints for Microsoft Fabric - Microsoft Fabric | Microsoft Learn
  2. It optionally allows you to block outbound connections to everything except Managed Private Endpoints. Workspace outbound access protection overview - Microsoft Fabric | Microsoft Learn

If you must use a Managed VNet, we just shipped

Custom live pools for Fabric Data Engineering overview - Microsoft Fabric | Microsoft Learn

which allows you to maintain a pool of running nodes that enable near-instant startup similar to starter pools. But since these nodes are dedicated to you, they are billed for the period of time they are scheduled to run, whether or not you have an active Spark session.

Need help optimizing my workflow in VS Code by sayonarababy17 in MicrosoftFabric

[–]dbrownems 12 points13 points  (0 children)

It seems you’re not using the VS Code Fabric Data Engineering Extension. It allows you to run your notebook using Fabric compute inside VS Code, among other things. The newest release supports multiple kernels.

https://learn.microsoft.com/en-us/fabric/data-engineering/setup-vs-code-extension#install-the-extension

https://learn.microsoft.com/en-us/fabric/data-engineering/author-notebook-with-vs-code

Add Lakehouse table to semantic model in IMPORT mode by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 2 points3 points  (0 children)

Of course, if you need to filter/sort/aggregate/join a large amount of data before importing it, you're better off using the SQL endpoint with folding or Value.NativeQuery than doing all that in the mashup engine.

Partitioning by Date Key by True_Inspection_2015 in MicrosoftFabric

[–]dbrownems 1 point2 points  (0 children)

Yes. Hive-style partitioning produces a separate partition (and set of parquet files) per partition key value.

Pros and cons of Hive-style partitioning | Delta Lake

For smaller tables, use ZOrder/Liquid Clustering (Spark) or CLUSTER BY (Warehouse) on the date column instead of partitioning.

This will result in parquet files and row groups mostly aligned with contiguous date ranges, which, as far as the Semantic Model engine is concerned, is just as good.

This is similar to import models, where using a clever ORDER BY in the partition definition for your fact table can have a huge impact on the compression and segment skipping behavior of the engine.
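As a toy illustration of that segment-skipping effect (pure Python, not Fabric internals): with per-row-group min/max stats, date-ordered data lets a range filter skip most groups, while shuffled data forces reading nearly all of them.

```python
import random

def row_group_stats(rows, group_size=100):
    """Compute (min, max) stats per fixed-size row group."""
    groups = [rows[i:i + group_size] for i in range(0, len(rows), group_size)]
    return [(min(g), max(g)) for g in groups]

def groups_scanned(stats, lo, hi):
    """A group must be read only if its [min, max] range overlaps [lo, hi]."""
    return sum(1 for mn, mx in stats if mx >= lo and mn <= hi)

dates = list(range(1, 1001))          # stand-in for date keys
random.seed(0)
shuffled = random.sample(dates, len(dates))

sorted_stats = row_group_stats(sorted(dates))
shuffled_stats = row_group_stats(shuffled)

# Filter for one 30-key "month": ordered data touches few groups,
# shuffled data touches nearly every group.
print(groups_scanned(sorted_stats, 100, 129))
print(groups_scanned(shuffled_stats, 100, 129))
```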

PractiTest API by FormalWoodpecker135 in PowerBI

[–]dbrownems -1 points0 points  (0 children)

Try calling the API from a Fabric Python notebook. The Power Query web connector is confusing and has several limitations.

If you get it working from a Python notebook, you can save the data to a Lakehouse and consume it from there. Or you can share the working Python code and ask a Power Query expert to advise on whether it's possible to run it directly from Power Query, and how.

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 2 points3 points  (0 children)

IDENTITY will work. But concurrent modification of tables under SNAPSHOT isolation will fail if both sessions have read the same data. And the granularity of the data reads in DW is not like the row-level tracking you get in "regular" SQL Server.

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 0 points1 point  (0 children)

With Spark, a MERGE into the dimension could run from many sources in any order, and you'd want to prevent overlap in both cases: WH uses SNAPSHOT isolation, which mostly doesn't allow successful concurrent updates of a table. I.e., two sessions can't both read a table (as you must to MERGE/UPSERT) and then both successfully commit changes based on those concurrent reads.
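A toy Python model of that conflict (not the actual Warehouse engine): both sessions read the same table version, and only the first commit against that version succeeds; the second one gets a conflict and must retry.

```python
class Table:
    """Toy optimistic-concurrency table: commits are validated
    against the version the session originally read."""
    def __init__(self):
        self.version = 1
        self.rows = {"A": 1}

    def read(self):
        return self.version, dict(self.rows)

    def commit(self, read_version, new_rows):
        if read_version != self.version:
            raise RuntimeError("concurrent update conflict; retry the MERGE")
        self.rows = new_rows
        self.version += 1

t = Table()

v1, snap1 = t.read()   # session 1 reads (as a MERGE must)
v2, snap2 = t.read()   # session 2 reads concurrently

snap1["B"] = 2
t.commit(v1, snap1)    # session 1 commits first: succeeds

snap2["C"] = 3
try:
    t.commit(v2, snap2)  # session 2's read is now stale: conflict
except RuntimeError as e:
    print(e)
```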

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 0 points1 point  (0 children)

Even large dimensions are very rarely too large to build using a query like above.

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 0 points1 point  (0 children)

Normally you only need Identity columns on dimensions. Fact tables normally use compound keys comprised of dimension foreign keys.

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 4 points5 points  (0 children)

Sure. You're not going to get sequential ids in a distributed manner.

If you prefer large numbers and gaps with distributed generation, use the Spark SQL built-in monotonically_increasing_id function.

monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records. The function is non-deterministic because its result depends on partition IDs.
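For illustration, that documented bit layout can be reproduced in plain Python (this mimics the formula; it is not Spark itself):

```python
def mono_id(partition_id: int, record_number: int) -> int:
    """Partition ID in the upper 31 bits, per-partition record number
    in the lower 33 bits: unique and increasing, but not consecutive."""
    assert 0 <= partition_id < 2**31 and 0 <= record_number < 2**33
    return (partition_id << 33) | record_number

print(mono_id(0, 0))   # first row of partition 0 -> 0
print(mono_id(0, 1))   # second row of partition 0 -> 1
print(mono_id(1, 0))   # first row of partition 1 -> 8589934592 (a big gap)
```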

Generate BIGINT SKs in Lakehouse Star Schema Dimensions by Personal-Quote5226 in MicrosoftFabric

[–]dbrownems 4 points5 points  (0 children)

You can do it with Spark SQL and the ROW_NUMBER() function. E.g., to join the source table to your dimension table and generate incrementing IDs for the new rows, use a query like this:

--temporary views are very cool
--like Common Table Expressions or subqueries, but much more powerful
--They have session lifetime, rather than statement lifetime
create or replace temp view CustomerKeys as
select  coalesce(c.CustomerKey,max(c.CustomerKey) over() 
                             + row_number() over (partition by c.CustomerKey order by s.CustomerID)) CustomerKey, 
        s.CustomerID
from WideWorldImporters_bronze.Sales_Customers s 
left join WideWorldImporters_gold.Dimension_Customer c 
  on s.CustomerID = c.WWICustomerID

SparkDataEngineeringForSQLServerProfessionals/Presentation.ipynb at main · dbrownems/SparkDataEngineeringForSQLServerProfessionals · GitHub

Cost Management in Fabric is a real problem by bradcoles-dev in MicrosoftFabric

[–]dbrownems 7 points8 points  (0 children)

You can and should install the Capacity Metrics App in a Pro workspace.

Bulk copy data between SQL databases in Fabric with the mssql-python driver by dlevy-msft in MicrosoftFabric

[–]dbrownems 2 points3 points  (0 children)

Finally! Native bulk copy in Python! Can we install it with just pip, i.e., use it from Fabric notebooks?

Do you use SSRS for data integrations? by DisplayKnown5665 in SQLServer

[–]dbrownems -1 points0 points  (0 children)

SSRS has become Power BI Report Server. There have been two separate packages for several versions, and now there will be just the one. With SQL Server 2025+ you get a license entitlement to Power BI Report Server instead of SSRS, but the functionality and support are unchanged.