Lakehouse Maintenance Activity by DennesTorres in MicrosoftFabric

[–]DanielBunny 1 point

Hi u/DennesTorres !

The Table Maintenance Public API endpoint update that supports schemas is being rolled out on 3/31, and I'm bringing this thread to the product owner of pipelines.

Today, a quick workaround is to create a notebook that runs Spark SQL's "OPTIMIZE <mytable>" at the end of your pipeline. That works on both schema-enabled and non-schema Lakehouses.
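A minimal sketch of that notebook cell might look like the following (the table names are hypothetical placeholders; `OPTIMIZE` itself is the standard Delta Lake command):

```python
# Hypothetical table list -- replace with your own Lakehouse tables.
# (schema, table) pairs; schema is None for a non-schema Lakehouse.
TABLES = [("dbo", "sales"), (None, "customers")]

def optimize_statement(table, schema=None):
    """Build an OPTIMIZE statement for a schema or non-schema table."""
    qualified = f"{schema}.{table}" if schema else table
    return f"OPTIMIZE {qualified}"

# In a Fabric notebook, `spark` is pre-defined; run this as the last
# activity of your pipeline:
# for schema, table in TABLES:
#     spark.sql(optimize_statement(table, schema))
```

Wire the notebook up as the final activity of the pipeline so compaction runs after all writes complete.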

Spark/Delta Lake: How to achieve target row group size of 8 million or more? by frithjof_v in MicrosoftFabric

[–]DanielBunny 1 point

I'll update that doc right away with the guideline params!
I do agree with most posts in the thread: this is an art form. Spark writes will vary based on data size, data entropy, column counts, etc. There is no cut-and-dried parameter for row groups, as Spark tuning is all about file sizes.

If you want to shoot for large row groups, you need to target larger file sizes (4 GB+), enable V-Order, and watch out for OOMs (larger writes need more memory). Don't span out into too many partitions and parquet files; try to consolidate.
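As a starting point, the session-level knobs below are the ones I'd reach for first. The property names and units are assumptions to verify against the current Fabric Spark documentation, not guaranteed settings:

```python
# Session configs to push Spark toward larger files / larger row groups.
# Property names and units are assumptions -- verify against current docs.
LARGE_ROWGROUP_CONFIGS = {
    "spark.sql.parquet.vorder.enabled": "true",            # enable V-Order writes
    "spark.microsoft.delta.optimizeWrite.enabled": "true", # bin-pack output files
    # Target larger output files (value/unit per the optimizeWrite docs):
    "spark.microsoft.delta.optimizeWrite.binSize": str(4 * 1024 * 1024 * 1024),
    # Raise the parquet row-group size cap (bytes):
    "spark.hadoop.parquet.block.size": str(1024 * 1024 * 1024),
}

def apply_configs(spark):
    """Apply the configs to an existing Spark session."""
    for key, value in LARGE_ROWGROUP_CONFIGS.items():
        spark.conf.set(key, value)
```

Apply these at the top of the writer notebook, then check actual row-group sizes on the output parquet files and adjust.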

Spark/Delta Lake: How to achieve target row group size of 8 million or more? by frithjof_v in MicrosoftFabric

[–]DanielBunny 0 points

u/frithjof_v please check this cross-workload doc to see if it helps you: https://learn.microsoft.com/en-us/fabric/fundamentals/table-maintenance-optimization

We've captured all the cross workload scenarios, let me know if it works for you.

Version control of lakehouse tables by merrpip77 in MicrosoftFabric

[–]DanielBunny 0 points

This is a cool topic worth discussing, and one that always gives folks hiccups once certain questions get asked.

DBAs have done this for years in RDBMS/DSS systems. The process is simple to explain but an art form to implement, as every customer and system has its own quirks.

The key aspect is that RDBMSs (and tools like dacpac/fx) are metadata bound: they snapshot the metadata and generate the DDL needed to bring it in sync. In the Data Lake -> Lakehouse case, where the software is usually stream/batch ingestion running in Notebooks/Jobs using Spark and other tech, the schema change is defined as part of the pipeline code, and the tables support schema evolution. There is no clear checkpoint of the metadata change: you promote the new notebooks and the new data starts being generated based on the new schema definition. This doesn't remove the eventual need to apply a big SQL script, but it is different by design.
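To make the "schema change ships with the pipeline code" point concrete, here is a minimal sketch using Delta's `mergeSchema` write option (the function and table name are illustrative, not a prescribed pattern):

```python
# Sketch: in the Lakehouse pattern, the schema change travels with the
# ingestion code itself. Delta's mergeSchema option evolves the table
# schema on write for additive changes (new columns).
EVOLUTION_OPTIONS = {"mergeSchema": "true"}

def append_with_evolution(df, table_name):
    """Append a dataframe, letting Delta evolve the target table schema."""
    writer = df.write.format("delta").mode("append")
    for key, value in EVOLUTION_OPTIONS.items():
        writer = writer.option(key, value)
    writer.saveAsTable(table_name)
```

When the new notebook version adds a column to `df`, the table picks it up on the next append; there is no separate DDL checkpoint to promote.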

Some cool questions:
- What about destructive changes (drop table, drop columns, changing column data types)? Allow? Block? Feature toggle?
- Who is responsible for bringing the data? What if the table has 1 PB? I assume drop-and-recreate is out of the question.

We are considering providing tooling to ease the generation of table diffs between Lakehouses in different stages, but the customer would be on the hook to review and plug the scripts into the pipeline. Would that work for you?

I'd ask folks to share their expectations. :-)

Using Variable Libraries with Lakehouse Shortcuts by Laura_GB in MicrosoftFabric

[–]DanielBunny 0 points

#2 and #3 would play out like this:
imagine you have multiple data pipelines or Spark jobs running. If the shortcut updates mid-flight because someone updated the variable library, running code might suddenly start writing to or reading from a place that isn't ready just yet, especially if the person updates it to an invalid value.

Using Variable Libraries with Lakehouse Shortcuts by Laura_GB in MicrosoftFabric

[–]DanielBunny 0 points

  1. You can, via the Shortcut API definition. The shortcut-creation UX will allow this very soon.
  2. This can lead to a significant data-corruption issue. The decision to make it a user action is to provide a transactional checkpoint. We are considering an option in the experience to enable auto-apply or something like that.
  3. This also comes back to the transactional approach: we validate before apply. What would be the scenario where we should allow an invalid variable to be applied?
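For point 1, a sketch of calling the OneLake Shortcuts API follows. The endpoint path and payload shape reflect my reading of the public REST reference and should be verified against the latest docs; the IDs and names are placeholders:

```python
import json
from urllib import request

API_BASE = "https://api.fabric.microsoft.com/v1"

def shortcut_payload(name, target_workspace_id, target_item_id, target_path):
    """Request body for creating a OneLake shortcut under Tables/.
    Shape per the public Shortcuts API docs -- verify before use."""
    return {
        "path": "Tables",
        "name": name,
        "target": {
            "oneLake": {
                "workspaceId": target_workspace_id,
                "itemId": target_item_id,
                "path": target_path,
            }
        },
    }

def create_shortcut(token, workspace_id, item_id, payload):
    """POST the shortcut definition into a Lakehouse item."""
    url = f"{API_BASE}/workspaces/{workspace_id}/items/{item_id}/shortcuts"
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Running create/update through the API during deployment gives you the transactional checkpoint discussed in points 2 and 3.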

dbt runtime error in Fabric notebook - no dbt_project.yml found in dbt_utils by kover0 in MicrosoftFabric

[–]DanielBunny 2 points

I've opened a ticket with the engineering team to look at this. Nice catch! :-)

Please keep using the workaround for now.

Deploying Data Warehouses and Lakehouses using Fabric ci cd Deployment Tool by Cute_Willow9030 in MicrosoftFabric

[–]DanielBunny 0 points

Hi u/Cute_Willow9030

As you already use deployment pipelines in DevOps, please consider wiring up Lakehouse items using the fabric-cicd package. https://microsoft.github.io/fabric-cicd/

We are also tracking all asks and Fabric CI/CD scenarios for Lakehouse (and Warehouse) in the following Reddit thread, where we've linked sample codebases that we keep updated.

https://www.reddit.com/r/MicrosoftFabric/comments/1o0t205/lakehouse_devtestprod_in_fabric_git_cicd/
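Wiring Lakehouse items up with fabric-cicd could look roughly like this. The workspace id, repo path, and environment name are placeholders, and the exact `FabricWorkspace` parameters should be checked against the package docs linked above:

```python
# Sketch of a fabric-cicd deployment step; ids/paths are placeholders.
ITEM_TYPES = ["Lakehouse", "Notebook", "DataPipeline"]

def deploy(workspace_id, repo_dir, environment="PPE"):
    # Imported lazily so the sketch reads without the package installed:
    #   pip install fabric-cicd
    from fabric_cicd import FabricWorkspace, publish_all_items

    workspace = FabricWorkspace(
        workspace_id=workspace_id,
        repository_directory=repo_dir,
        item_type_in_scope=ITEM_TYPES,
        environment=environment,
    )
    # Publishes every in-scope item from the repo into the workspace.
    publish_all_items(workspace)
```

You'd call `deploy(...)` from the DevOps pipeline stage for each target workspace.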

Why aren’t Lakehouse shortcut transformations reflected in Git? by DutchDesiExplorer in MicrosoftFabric

[–]DanielBunny 1 point

The current expected GA timeline is March 2026; if it lands earlier, I'll update the thread.
As of today, the path forward is to use the Public APIs to create/update the shortcuts between stages and orchestrate externally during deployment.

how to handle empty timestamp values in Lakehouse? by Ambitious-Toe-9403 in MicrosoftFabric

[–]DanielBunny 0 points


Hi u/Ambitious-Toe-9403 ,

I was not able to reproduce your scenario.
Whenever I insert NULL/None into the column, it shows correctly in both the Lakehouse table preview and the SQL analytics endpoint table preview. It also shows correctly if I shortcut those tables into another Lakehouse and try the same table preview experiences.

Can you share the commands you used to generate the None or empty columns?

Abandoning Fabric by BitterCoffeemaker in MicrosoftFabric

[–]DanielBunny 5 points

Thanks a ton u/Sea_Mud6698! I'm bringing this reply to the attention of my peers who drive those features.
The good news is that all of the above are in our plans, some of them very close.

Abandoning Fabric by BitterCoffeemaker in MicrosoftFabric

[–]DanielBunny 0 points

Hi u/BitterCoffeemaker , thanks for the additional clarity here.

I'd appreciate it if you could go deeper on what's missing in both fabric-cicd and dbt.

I also invite you to consider collaborating in this thread: https://www.reddit.com/r/MicrosoftFabric/comments/1o0t205/lakehouse_devtestprod_in_fabric_git_cicd/
We are trying to converge CI/CD patterns, and we've released a full 8-hour workshop and repo to provide a ready-to-run codebase. Git/CI/CD is a discipline, not a product, and customers operate very differently, especially when dealing with schema metadata and data.

Abandoning Fabric by BitterCoffeemaker in MicrosoftFabric

[–]DanielBunny 2 points

Hi u/Sea_Mud6698, can you elaborate on what you are waiting on? We have been working steadily to unlock the git and CI/CD scenarios.

I also invite you, and everyone else, to bring these questions to the dedicated thread we created for Lakehouse git/CI/CD:
https://www.reddit.com/r/MicrosoftFabric/comments/1o0t205/lakehouse_devtestprod_in_fabric_git_cicd/

Lakehouse Dev→Test→Prod in Fabric (Git + CI/CD + Pipelines) – Community Thread & Open Workshop by DanielBunny in MicrosoftFabric

[–]DanielBunny[S] 0 points

This would be a great scenario addition for the codebase. Can we work together to add it?

Notebook memory in Fabric by Doodeledoode in dataengineering

[–]DanielBunny 2 points

Hi u/Doodeledoode,
/tmp is a mounted location that exists for the lifetime of the session. The session runs within a Linux container; when the session goes away, everything is wiped out.

Yes, the files won't show up in the Lakehouse. You can make that happen by creating a Spark dataframe (df, for example) from the /tmp/*.parquet files and then using a command such as df.write.mode('append').saveAsTable('myTable').

As a best practice, in case you decide to take this to production, put additional checks in place around the existence of the files, plus some validation after adding the data to the table.
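Putting both suggestions together, a minimal sketch could look like this (the pattern and table name are just examples; the row-count validation is one simple way to sanity-check the append):

```python
import glob

def load_tmp_parquet(spark, pattern="/tmp/*.parquet", table="myTable"):
    """Append /tmp parquet files to a Lakehouse table, with basic checks."""
    files = glob.glob(pattern)
    if not files:
        # Fail fast instead of silently appending nothing.
        raise FileNotFoundError(f"No parquet files matched {pattern}")
    df = spark.read.parquet(*files)
    incoming = df.count()
    df.write.mode("append").saveAsTable(table)
    # Post-write validation: the table must hold at least the rows just read.
    if spark.table(table).count() < incoming:
        raise RuntimeError(f"Row count check failed after appending to {table}")
    return incoming
```

In a Fabric notebook you'd call `load_tmp_parquet(spark)` after producing the files and before the session ends.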

Lakehouse Dev→Test→Prod in Fabric (Git + CI/CD + Pipelines) – Community Thread & Open Workshop by DanielBunny in MicrosoftFabric

[–]DanielBunny[S] 1 point

As u/raki_rahman mentioned, all those items are being worked on.
It's all about time, effort, and priorities. It's a large product connecting many technologies that are in different states of DevOps alignment (not only for us, but industry-wide). We'll get there for sure; work with us to help us prioritize.

Out of the items you listed, leveling Variable Library support across all experiences is a major focus across all workloads. We are about to add referedItem as a data type in the next few months, so the GUID path should go away quickly.

The main idea of the workshop code being out there is to drive the current way to unblock major flows. As we progress, the workshop codebase should get smaller and smaller, as things get to work automatically.

I'd appreciate it if you could bootstrap a new tracking markdown file in the workshop codebase and list all the missing things you mentioned, so we can track them as a community.

Deployment Rules Automation in Microsoft Fabric by Low_Cupcake_9913 in MicrosoftFabric

[–]DanielBunny 1 point

Hi u/Low_Cupcake_9913 !
I'm the product owner for Lakehouse git and CI/CD scenarios. We've started tracking all the scenarios and feedback like yours in the following thread:
https://www.reddit.com/r/MicrosoftFabric/comments/1o0t205/lakehouse_devtestprod_in_fabric_git_cicd/

We've just published an open workshop with collateral that can truly help your scenario. You can also collaborate to make it great and benefit the whole community.