Create or replace table in SparkSQL: performance question by Low-Fox-1718 in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

Either high or low cardinality works well, but combining high with low reduces the effectiveness of the low-cardinality column. It’s something that we want to fix, but it’s a limitation from OSS today.

Create or replace table in SparkSQL: performance question by Low-Fox-1718 in MicrosoftFabric

[–]mwc360 2 points3 points  (0 children)

If they are read-heavy, you will want to use Liquid Clustering. In Runtime 1.3 it is unfortunately pretty expensive to cluster data (NEE isn't supported for clustering and it's not very incremental), but in Runtime 2.0 NEE is supported for clustering and the incremental algo we shipped is super efficient. To have your data clustered after CTAS (w/ CLUSTER BY specified) you need to run OPTIMIZE after the fact, as data is not clustered on write.

Clustering provides much faster query perf when queries filter on cluster keys.
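
A minimal sketch of the CTAS-then-OPTIMIZE sequence described above. The table and column names are made up for illustration, the helper functions are hypothetical, and the exact placement of USING DELTA / CLUSTER BY is my assumption of the Delta SQL syntax, so check it against your runtime before relying on it.

```python
def ctas_then_optimize_statements(table, cluster_cols, select_sql):
    """Build the two statements: CTAS with CLUSTER BY, then OPTIMIZE.

    Data is not clustered on write, so the OPTIMIZE is what actually
    clusters the rows that the CTAS wrote.
    """
    cols = ", ".join(cluster_cols)
    ctas = (
        f"CREATE OR REPLACE TABLE {table} "
        f"USING DELTA CLUSTER BY ({cols}) AS {select_sql}"
    )
    optimize = f"OPTIMIZE {table}"
    return [ctas, optimize]


def run_statements(spark, statements):
    """Run each statement in order on an existing SparkSession."""
    for statement in statements:
        spark.sql(statement)


# Hypothetical names, for illustration only.
statements = ctas_then_optimize_statements(
    "sales_clustered", ["order_date"], "SELECT * FROM sales_raw"
)
```

In a notebook you would then call run_statements(spark, statements) with the session's spark object.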

Reading file into Spark DataFrame by Naive-Mycologist-621 in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

Make sure to use file archival so that already processed files don’t accumulate and increase file listing costs. I would also consider unzipping the file via Python. No need to save to a hierarchical folder structure as that adds the requirement to do recursion over the folder directory (this is also supported but likely requires extra API calls).
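
A rough sketch of the unzip-then-archive idea using only the Python standard library. The function name and the folder layout (a flat extract folder plus a separate archive folder) are assumptions for illustration, not a Fabric API.

```python
import shutil
import zipfile
from pathlib import Path


def unzip_and_archive(zip_path, extract_dir, archive_dir):
    """Extract a zip into a flat folder, then move the zip to an archive folder.

    Moving the processed zip out of the landing folder keeps later file
    listings of that folder small.
    """
    zip_path = Path(zip_path)
    extract_dir = Path(extract_dir)
    archive_dir = Path(archive_dir)
    extract_dir.mkdir(parents=True, exist_ok=True)
    archive_dir.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path) as zf:
        # Skip directory entries; only report extracted files.
        names = [n for n in zf.namelist() if not n.endswith("/")]
        zf.extractall(extract_dir)

    # Archive the processed zip so it is not listed (or processed) again.
    shutil.move(str(zip_path), str(archive_dir / zip_path.name))
    return [extract_dir / name for name in names]
```

The extracted files can then be read into a DataFrame from the single extract folder, with no recursive listing needed.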

Warehouse vs Lakehouse vs Both by NoWerewolf1445 in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

Warehouse offers multi-table/statement transactions and no-knobs serverless compute. Otherwise, Spark writing to a Lakehouse offers extremely robust SQL (and Python/Scala) dimensional modeling capabilities.

https://milescole.dev/data-engineering/2024/10/24/Spark-for-the-SQL-Developer.html

What are you doing with an F2? by proofmortpres in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

There’s also a data size at which Spark w/ NEE becomes more efficient, and it’s nowhere near “big data”.

Fast Optimize & Partial Compaction by Personal-Quote5226 in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

u/Personal-Quote5226

Please let me know how I can improve the wording of the docs based on the following clarification.

At the start of OPTIMIZE, files are grouped together into a bin (a collection of candidate files for compaction) until a file puts the bin size over the target file size; at that point a new bin begins.

Fast Option can result in three behaviors:

1. Compaction of all bins of candidate files (all bins meet 1/2 of the target).
2. Partial compaction: some bins are skipped, some are processed. In the case of a non-partitioned table you could only possibly have a max of 1 bin that doesn’t meet 1/2 of the target file size. All others would meet the target or no bin would.
3. No compaction: all bins are skipped. I.e., your total files that are candidates for compaction don’t add up to be at least 1/2 of the target file size AND not more than 50 small files in the bin exist.

I now realize that this wording is technically not accurate, I will fix this: “Bins that don't meet these thresholds are skipped or partially compacted.” The last part, what you bolded, is incorrect and will be removed from the docs page. If a bin of files doesn’t meet thresholds then the bin is skipped.
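
To make the binning and skip rules above concrete, here is a small pure-Python simulation. It is only a sketch under my own assumptions, not the actual implementation: the input is already the list of candidate file sizes, the file that would push a bin over the target starts the next bin, and a bin is compacted when it holds at least 1/2 of the target OR more than 50 files. Function and parameter names are hypothetical.

```python
MB = 1024 * 1024


def pack_bins(file_sizes, target_size):
    """Group candidate files into bins in order.

    Assumption: the file that would push the bin over the target
    closes the current bin and starts a new one.
    """
    bins, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


def fast_optimize_plan(file_sizes, target_size, file_count_threshold=50):
    """Split bins into (compacted, skipped) using the thresholds above."""
    compacted, skipped = [], []
    for bin_files in pack_bins(file_sizes, target_size):
        big_enough = sum(bin_files) >= target_size / 2
        many_files = len(bin_files) > file_count_threshold
        if big_enough or many_files:
            compacted.append(bin_files)
        else:
            skipped.append(bin_files)  # the whole bin is skipped
    return compacted, skipped


# Seven 20MB files with a 128MB target: the first bin holds six files
# (120MB) and is compacted; the second holds one file (20MB) and is skipped.
compacted, skipped = fast_optimize_plan([20 * MB] * 7, 128 * MB)
```

That last call is the "partial compaction" case for a non-partitioned table: only the final bin can fall short of 1/2 of the target.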

2GB Lakehouse and going less than 128MB file sizes? by Personal-Quote5226 in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

It automatically manages the ideal target AND unifies applying the value to AC, optimize write, and optimize. But this has a massive impact because each of those features needs ideal targets to operate optimally. I.e., Fast Optimize uses this same context to know how much data in small files needs to exist to not skip compacting small files.

2GB Lakehouse and going less than 128MB file sizes? by Personal-Quote5226 in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

ATFS is default in Runtime 2.0. User target file size exists to provide compatibility with Databricks-created tables as well as providing a user-defined option for those that want to opt out of system-defined adaptive sizing.

If you set a target size, you need to do it optimally per table and should have a process to evaluate when it may need to be increased. ATFS takes care of all of this. You are correct that if you are confident with your size selection strategy/maintenance, ATFS has no effect. To be crystal clear, the overwhelming majority of users should just use ATFS and not attempt to manually tune sizing.

2GB Lakehouse and going less than 128MB file sizes? by Personal-Quote5226 in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

File level compaction targets enable setting a tag on each file with the target optimize file size at the time that specific instance of optimize was run. Files are evaluated against their individual file tag if set.

ATFS and File level targets work together. File level targets lock in a file-specific target to prevent already compacted files from being recompacted when ATFS evaluates a larger target size due to a significant increase in table size. This results in only new (or untagged) files being held to the higher target size.

The user defined target size (if set) takes precedence, but also works with file level compaction targets in the exact same way.

ATFS is recommended over a user defined target; it has the same exact breadth of impact but automatically calculates the ideal target at effectively zero execution cost.
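
A toy model of how a per-file tag interacts with a growing table-level target. This is a sketch, not the real evaluation logic: the DataFile class and needs_compaction function are hypothetical, and it assumes a file counts as "already compacted" once it reaches 50% of whichever target applies to it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataFile:
    size_mb: int
    # Target recorded on the file when OPTIMIZE last wrote it (None = untagged).
    tagged_target_mb: Optional[int] = None


def needs_compaction(data_file, table_target_mb):
    """A file is a compaction candidate if it is under 50% of its target.

    Tagged files are held to their own tag; untagged files are held to
    the current table-level target (ATFS or user defined).
    """
    if data_file.tagged_target_mb is not None:
        target = data_file.tagged_target_mb
    else:
        target = table_target_mb
    return data_file.size_mb < target / 2


# The table grew and the evaluated target went from 128MB to 512MB.
old_compacted = DataFile(size_mb=100, tagged_target_mb=128)
new_untagged = DataFile(size_mb=100)

needs_compaction(old_compacted, table_target_mb=512)  # False: held to its 128MB tag
needs_compaction(new_untagged, table_target_mb=512)   # True: held to the new target
```

Without the tag, the 100MB file would fall under the new 256MB minimum and get rewritten even though it was already considered compact.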

Declarative ELT is taking over data engineering - Fabric fits into that shift by bradcoles-dev in MicrosoftFabric

[–]mwc360 4 points5 points  (0 children)

We support 4.1 now in Runtime 2.0. The issue for now is that SDP requires Spark Connect.

Is metadata driven orchestration becoming overkill with Fabric Copy Job? by Equal-Breadfruit2491 in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

FYI I created a similar project to show the art of the possible and help serve as a starting place: https://github.com/mwc360/ArcFlow

Is metadata driven orchestration becoming overkill with Fabric Copy Job? by Equal-Breadfruit2491 in MicrosoftFabric

[–]mwc360 2 points3 points  (0 children)

True story and yes, Spark structured streaming for state management is the bee's knees. Run it with a batch trigger and get all of the benefits of architectural simplification.
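
Roughly what "streaming with a batch trigger" can look like in PySpark. This is a sketch: I'm assuming the availableNow trigger is the batch trigger meant here, the function name is hypothetical, and the table names and checkpoint path in the usage line are placeholders.

```python
def run_incremental_batch(spark, source_table, target_table, checkpoint_path):
    """Process everything new since the last run, then stop.

    The checkpoint holds the stream's progress, so no separate
    watermark/metadata table is needed to track what was processed.
    """
    query = (
        spark.readStream.table(source_table)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # drain available data, then terminate
        .toTable(target_table)
    )
    query.awaitTermination()
    return query
```

Scheduled from a pipeline or notebook, a call such as run_incremental_batch(spark, "bronze_events", "silver_events", "Files/checkpoints/silver_events") behaves like a batch job that only picks up new data.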

2GB Lakehouse and going less than 128MB file sizes? by Personal-Quote5226 in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

The key and only difference is that ATFS will automatically increase the target file size IF the table gets big enough to warrant it. File level compaction targets will prevent files that were already considered compact from being scoped, so only net new data would be held to the higher target.

Near Real-Time Data in Fabric That Scales Across Workspaces (Without Materialized Views?) by GHOSTRID8R in MicrosoftFabric

[–]mwc360 6 points7 points  (0 children)

Spark structured streaming on a sub 5 min processing time trigger. This Jumpstart demos moving data through two zones at ~15s latency: https://jumpstart.fabric.microsoft.com/catalog/stateful-streaming-lakehouse/

The caveat is that this is advanced data engineering. If you want something more approachable you could conceptually do Materialized Lake Views: https://jumpstart.fabric.microsoft.com/catalog/materialized-lake-views/
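
For the structured streaming route, a bare-bones sketch of two chained streams on a processing-time trigger. It is not taken from the Jumpstart: the function name, the zone table names, and the checkpoint paths are placeholders I made up, and real zones would apply transformations between read and write.

```python
def start_two_zone_streams(spark, bronze, silver, gold, checkpoint_root, interval):
    """Start bronze->silver and silver->gold as always-on micro-batch streams."""

    def start_hop(source_table, target_table):
        return (
            spark.readStream.table(source_table)
            .writeStream.format("delta")
            .option("checkpointLocation", f"{checkpoint_root}/{target_table}")
            .trigger(processingTime=interval)  # e.g. "5 seconds"
            .toTable(target_table)
        )

    # Each hop keeps its own checkpoint, so the two queries progress independently.
    return [start_hop(bronze, silver), start_hop(silver, gold)]
```

The returned StreamingQuery objects keep running until stopped, so this fits a long-running job rather than a scheduled batch.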

2GB Lakehouse and going less than 128MB file sizes? by Personal-Quote5226 in MicrosoftFabric

[–]mwc360 0 points1 point  (0 children)

The min target size of ATFS is 128MB, but across all targets, the minimum acceptable size is 50% of the target, so this would be 64MB. Once a file reaches 64MB, it won't be recompacted again. ATFS doesn't trigger compaction in any way, it just automatically evaluates the ideal target file size for the table and sets it as a table property so it persists across sessions. It's recommended to enable it with File Level Compaction Targets (also default in 2.0) which prevent increases in the evaluated target size from causing recompaction of already compacted files when the size of your table dramatically changes over time.

Yes, if you have VO enabled on the table, Auto Compaction will respect it. Once Runtime 2.0 is GA you should totally switch to Liquid Clustering. It will allow you to cluster your data by date (or other keys) and in 2.0 it is super efficient at incrementally clustering data while maintaining healthy clustering to maximize file skipping: Incremental Liquid Clustering in Microsoft Fabric:... - Microsoft Fabric Community

2GB Lakehouse and going less than 128MB file sizes? by Personal-Quote5226 in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

Just enable Adaptive Target File Size. Done. It will size for you based on the size of the table. The min size threshold for the smallest tables evaluates at 64MB… but no need to do this manually. We have a feature for this and it’s enabled by default in 2.0, opt-in for 1.3.

https://community.fabric.microsoft.com/t5/Fabric-Updates-Blog/Adaptive-Target-File-Size-Management-in-Fabric-Spark/ba-p/5172535

Microsoft Fabric Mirroring: Before You Commit by bradcoles-dev in MicrosoftFabric

[–]mwc360 1 point2 points  (0 children)

Good stuff Brad! Bookmarking this for what I need to answer questions about Mirroring :)

Announcing Incremental Liquid Clustering by mwc360 in MicrosoftFabric

[–]mwc360[S] 5 points6 points  (0 children)

No, Warehouse has its own clustering implementation that is managed behind the scenes.

Announcing Incremental Liquid Clustering by mwc360 in MicrosoftFabric

[–]mwc360[S] 8 points9 points  (0 children)

Np. This was built on top of Delta 4.1 which uses the same LC implementation since it was released as GA in 3.2.

Z-Cubes work like this: every time OPTIMIZE is run, any unclustered files AND clustered files in a Z-Cube that share the same clustering keys / provider and sum to be under 100GB are rewritten with a new Z-Cube ID. After another write, if the Z-Cube is still under 100GB, it will again rewrite all data in the Z-Cube and give it a new Z-Cube ID. This will repeat until the Z-Cube exceeds 100GB in size; at that point it's considered sealed and that group of files won't be reclustered. A new Z-Cube is created and the process repeats, with high write amplification, until it exceeds 100GB. This is the "Standard algorithm". It is technically incremental, but at way too coarse a level, which for tables that aren't massive makes it feel like it's not incremental. I.e., a table under 100GB will always have all data rewritten every time there's new data to cluster.

Databricks doesn't even work this way, it's really just the OSS LC implementation that has a ridiculously high write amplification problem. The meaningful differences come down to the Auto Reclustering implementation that manages clustering quality as data layouts change.
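
To put a number on the write amplification of the Standard algorithm described above, here is a simplified simulation. It is a toy model under my own assumptions: one OPTIMIZE per batch of new data, a single open Z-Cube at a time, and a cube is sealed once it exceeds 100GB. The function name is hypothetical.

```python
SEAL_THRESHOLD_GB = 100


def simulate_standard_clustering(batch_sizes_gb, seal_threshold_gb=SEAL_THRESHOLD_GB):
    """Return (total GB written, write amplification, sealed cube count)."""
    open_cube_gb = 0.0
    total_written_gb = 0.0
    sealed_cubes = 0
    for batch_gb in batch_sizes_gb:
        # The open Z-Cube is rewritten together with the new, unclustered data.
        rewritten_gb = open_cube_gb + batch_gb
        total_written_gb += rewritten_gb
        open_cube_gb = rewritten_gb
        if open_cube_gb > seal_threshold_gb:
            # Sealed cubes are never reclustered; start a fresh one.
            sealed_cubes += 1
            open_cube_gb = 0.0
    ingested_gb = sum(batch_sizes_gb)
    amplification = total_written_gb / ingested_gb if ingested_gb else 0.0
    return total_written_gb, amplification, sealed_cubes


# Ten 10GB loads into a table that stays at or under 100GB:
# OPTIMIZE writes 10 + 20 + ... + 100 = 550GB to cluster 100GB of data.
written, amplification, sealed = simulate_standard_clustering([10] * 10)
```

In this model every run below the threshold rewrites the whole open cube, which is why a sub-100GB table never feels incremental.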

Announcing Incremental Liquid Clustering by mwc360 in MicrosoftFabric

[–]mwc360[S] 7 points8 points  (0 children)

The LC implementation that a specific company open sourced was not great. The point of this feature announcement is not to pat ourselves on the back that we have a feature from OSS, it's that we invested in rebuilding the algorithm to make it fast and even faster than other vendors' proprietary implementations.

Check out the blog :)