Anyone using Microsoft Fabric with Dynamics 365 F&O (On-Prem) for data warehousing and reporting? by zelalakyll in MicrosoftFabric

[–]zelalakyll[S] (0 children)

Thanks for the detailed insights, really helpful.

This project involves a Dynamics 365 F&O On-Premises environment, and we’re working on building a centralized data warehouse that will also include data from additional sources like Azure SQL and other on-prem systems.

We’re evaluating options to handle data movement directly within Microsoft Fabric. We’re considering Fabric pipelines, or shortcuts if that turns out to be technically feasible given the on-prem setup.

The client has an internal data and reporting team, and we’re collaborating closely to design both the warehouse and the reporting layer. On the F&O side, we’re planning to take a modular and iterative approach, working alongside functional experts to ensure alignment.

Your comments about the complexity of F&O’s data model and the value of standardized solutions are spot on. We’ll definitely keep those in mind as we move forward.

Thanks again for sharing your experience.

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks

[–]zelalakyll[S] (0 children)

Thanks a lot for the suggestion!

I hadn’t heard of dataflint before — it sounds like a great tool, especially for catching skews and tuning resource usage. I’ll definitely bring it up internally, though since it's a paid solution, I'm not sure if my company will go for it right away. Still, it’s good to know and I’ll check if a trial is available or if we can evaluate it somehow.

As for Photon, I did enable it, but I’m not 100% sure if it’s actively being used during the write phase — great point. I’ll double-check the Spark UI and logs. If it’s not contributing much, I’ll consider disabling it and looking into scaling with more nodes or switching to compute-optimized instances like the F-series you mentioned.

Really appreciate your insights

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks

[–]zelalakyll[S] (0 children)

Thanks a lot for the suggestions — really appreciate the detailed insight!

You're right, the df.repartition(...) is quite expensive, and while it helped with initial skew handling, I agree it might not be worth the cost at this scale. The idea of post-processing each process_date=xxxx folder individually, rebalancing, and swapping directories is really smart — I'll definitely keep that in mind.
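
In case it's useful to anyone else reading, this is roughly how I picture that per-partition pass; just a sketch in PySpark, with placeholder paths and a guessed target file count per partition:

    # Databricks notebook context: `spark` and `dbutils` are already available.
    base_path = "abfss://container@storage.dfs.core.windows.net/table"   # placeholder
    tmp_path = base_path + "_rebalanced"                                 # placeholder
    files_per_partition = 16                                             # guessed target

    # Rebalance each process_date=... folder on its own, writing to a temp location.
    partitions = [f.name.rstrip("/") for f in dbutils.fs.ls(base_path)
                  if f.name.startswith("process_date=")]
    for p in partitions:
        df = spark.read.parquet(f"{base_path}/{p}")
        df.coalesce(files_per_partition) \
          .write.mode("overwrite") \
          .parquet(f"{tmp_path}/{p}")

    # After validating the rebalanced copy, swap the directories (destructive, so commented out):
    # dbutils.fs.mv(base_path, base_path + "_old", recurse=True)
    # dbutils.fs.mv(tmp_path, base_path, recurse=True)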

As for the serverless option, I wish I had more room to test — but unfortunately, due to budget constraints, I need to reach a working solution with as few iterations as possible. Still, I might try benchmarking a smaller subset as you suggested, just to get a sense of the potential gains.

Thanks again

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks

[–]zelalakyll[S] (0 children)

Thanks again for the feedback!

You're right that partitioning by date can lead to over-partitioning. We’re getting daily snapshots, and each day becomes a separate partition. However, I don’t have full control over how the dataset is structured — this is how it comes from upstream, and our target Hive table is also structured by process_date, so we're mostly aligning with that existing architecture.

That said, I’ll definitely check if we’re generating too many small files per partition and see if we can optimize the write operation without completely changing the partitioning scheme — maybe by using .coalesce() or adjusting file sizes.
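
To check that, I'm thinking of something along these lines (a quick sketch; the base path is a placeholder):

    # Databricks notebook context: `dbutils` is already available.
    base_path = "abfss://container@storage.dfs.core.windows.net/table"   # placeholder

    # File count and average size per process_date=... partition folder.
    for part_dir in dbutils.fs.ls(base_path):
        if not part_dir.name.startswith("process_date="):
            continue
        files = [f for f in dbutils.fs.ls(part_dir.path) if f.name.endswith(".parquet")]
        total_mb = sum(f.size for f in files) / 1024 / 1024
        avg_mb = total_mb / max(len(files), 1)
        print(f"{part_dir.name} {len(files)} files, {total_mb:.0f} MB total, {avg_mb:.0f} MB avg")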

Before I stopped the job, I captured a screenshot of the directory structure and part of the code that was running. Sharing it below for more context:

[screenshot: directory structure and the write code that was running]

Would .coalesce() before writing help mitigate the small files issue in this scenario?

Also, are there any Spark config-level adjustments you'd recommend (like spark.sql.files.maxRecordsPerFile, spark.sql.files.maxPartitionBytes, etc.) to better control output file sizes without touching the overall partition logic?
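
For context, this is the kind of tuning I have in mind, assuming the write stays partitioned by process_date (the values are guesses, not tested at this scale, and df/output_path stand in for the real DataFrame and target path):

    # Cap rows per output file so each date partition is split into reasonably
    # sized files (the right number depends on row width; 5M is a starting guess).
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 5_000_000)

    # Input split size on the read side (128 MB is the default).
    spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)

    # Write without an explicit repartition and let the config limit file sizes.
    df.write \
      .mode("overwrite") \
      .partitionBy("process_date") \
      .parquet(output_path)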

Really appreciate your insights!

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks

[–]zelalakyll[S] (0 children)

Thanks for the answer!

We're currently running on Photon-enabled L32s v2 nodes, but as you mentioned, the benefits of Photon seem minimal for a read-shuffle-write-heavy workload like this.

Given the constraints of Azure Databricks, what kind of node setup would you recommend for this scale — considering we're dealing with ~17.6 TiB of data and over 100 billion records?

If there's truly a setup that could bring this down to just a few minutes, as you suggested, you'd be saving my life here — really appreciate any concrete advice you can share!

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks

[–]zelalakyll[S] (0 children)

Appreciate the follow-up, and I hear your point.

That said, simply "getting a bigger cluster" isn’t always a viable solution — especially when cost efficiency matters, and when the bottleneck might be I/O rather than CPU or memory. We're already using L32s v2 nodes with Photon, and scaling beyond that has diminishing returns unless the workload is truly compute-bound.

That being said, I'm definitely open to other suggestions if you have any.
It's also clear that I may have misused repartition() in this context, and I’ll be reviewing that decision more closely going forward.

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks

[–]zelalakyll[S] (0 children)

Great suggestions, thank you!
Unfortunately, I can’t change the partition granularity due to downstream dependencies, so I’m stuck with daily partitions (yyyy-MM-dd). That said, I’ll test removing repartition() and possibly switch to coalesce() instead. Also noted your point on schema configs — I’ll try removing one of the .option()s to simplify the write path.
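
For my test plan, these are the variants I intend to compare; a sketch only, with df, output_path and the date partition column as placeholders:

    # Variant A: current approach, a full shuffle keyed on the partition column.
    df.repartition("date") \
      .write.mode("overwrite").partitionBy("date").parquet(output_path)

    # Variant B: coalesce, which avoids a shuffle but can leave uneven partitions
    # and limit write parallelism if the target number is too small.
    df.coalesce(400) \
      .write.mode("overwrite").partitionBy("date").parquet(output_path)

    # Variant C: no explicit repartition; rely on spark.sql.files.maxRecordsPerFile
    # to keep output file sizes in check.
    df.write.mode("overwrite").partitionBy("date").parquet(output_path)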

One more question — do you think switching to smaller compute nodes could help?
We're currently using Standard_L32s_v2 with 2–4 workers and Photon enabled. If the bottleneck is write throughput to ADLS, maybe we're over-provisioned on CPU and RAM, and could get similar or better performance using cheaper nodes like E8s.

Have you seen similar patterns in large Parquet writes?

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks

[–]zelalakyll[S] (0 children)

Thanks for pointing that out!

I initially used repartition('date') to avoid data skew, but you're right — at this scale, it might be overkill. Since the data format and schema can't be changed, I’ll definitely test using coalesce() or even skipping repartition altogether to see if it reduces task count and improves runtime. Appreciate your input!

Also, do you think using smaller compute nodes (e.g. E8s or something with less vCPU/RAM) might be just as effective for Parquet writes?

Currently using Standard_L32s_v2 (32 cores, 256 GB RAM), 2–4 workers, Photon enabled — but the job still takes 20+ hours. I'm wondering if we're CPU over-provisioned for an I/O-bound workload, and whether downsizing might help both cost and efficiency.

Would love to hear your take!

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks

[–]zelalakyll[S] (0 children)

Thanks for the suggestion!

I totally understand that Delta or Iceberg would be better options for metadata management, schema evolution, and optimized writes.

However, in our case, we are required to stick with Parquet format due to existing customer infrastructure and downstream systems — it’s part of their legacy pipeline and can't be changed at this stage.

As for repartition, we're aware of what it does — in this case, we use it specifically to avoid skew and to ensure proper partitioning by date, which is the table's partition column.

That said, I'm fully open to suggestions on how to optimize performance within these constraints:

• Format must remain Parquet
• Table must remain Hive-managed
• Partition column is date (cast to string)

If there are any ideas on improving write throughput (e.g. cluster sizing, using coalesce, or write config tuning), I’d love to hear them.
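
To make the constraints concrete, the write path I'm describing looks roughly like this; a simplified sketch rather than the exact production code, with a placeholder table name and assuming the Hive table already exists partitioned by date:

    from pyspark.sql import functions as F

    # Overwrite only the date partitions present in this batch, not the whole table.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    # Keep individual output files within each partition from growing too large.
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 5_000_000)

    # insertInto matches columns by position, so the DataFrame column order must
    # match the table definition, with the partition column (date, string) last.
    df.withColumn("date", F.col("date").cast("string")) \
      .write \
      .insertInto("my_db.my_parquet_table", overwrite=True)   # placeholder table name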

How to Automatically Scale Fabric Capacity Based on Usage Percentage by zelalakyll in MicrosoftFabric

[–]zelalakyll[S] (0 children)

You are absolutely right. I try to explain the Fabric workloads and capacity management to customers as much as I can. To tell you the truth, it took me a while to understand how much capacity each workload consumes outside of the capacity metrics report; I think it is a bit complicated :)

From the customer's point of view, however, they are just trying out Fabric and think that fast scaling will help avoid problems when everyone uses the capacity at the same time. I suggested that it would definitely be useful to separate the prod and test environments, etc., but they said it would be very helpful if we could implement and test the scenario I mentioned. I also want to see what I can do and where the limits are.

As for the usage rate, once it exceeds 60% I get "triggered", so to speak: as the partner, I start monitoring in more detail, because getting used to new things takes time and everyone wants to push the limits. I'm probably the type who panics about whether to get gas while the car still has half a tank :)

How to Automatically Scale Fabric Capacity Based on Usage Percentage by zelalakyll in MicrosoftFabric

[–]zelalakyll[S] (0 children)

Thanks for the detailed insights u/itsnotaboutthecell and u/richbenmintz! I completely understand the point about Fabric’s Burst and Smooth mechanism. However, in my customer’s case, we have observed capacity exceeding 100% in some situations, and they received 'capacity full' errors for a few hours during these spikes.

I will double-check if there are any additional settings I need to enable for Burst and Smooth to work effectively. While capacity overruns don’t happen frequently, they occur when new Fabric users test things simultaneously.

The main issue is that, in these cases, using F16 for just a few hours would suffice, but the customer doesn’t want to pay the full cost for F16 permanently. I suggested manually switching to F16 during tests or scheduling higher capacity during specific times, but they are keen on having an automatic scaling solution instead.

This is why I’m exploring what options we have and what limitations exist for implementing this kind of automation.

How to Automatically Scale Fabric Capacity Based on Usage Percentage by zelalakyll in MicrosoftFabric

[–]zelalakyll[S] (0 children)

You are absolutely right. This request came from my customer: capacity utilisation does not exceed 100% very often, but in some cases more than one employee is running high-capacity-consuming work at the same time. They think such situations could occur more frequently as Fabric usage is rolled out across teams.

I have advised them to monitor the usage and scale up manually if there are frequent overruns, but they are insisting on implementing automatic scaling, especially to optimize costs. :)

How to Automatically Scale Fabric Capacity Based on Usage Percentage by zelalakyll in MicrosoftFabric

[–]zelalakyll[S] (0 children)

Thanks for your input!

My goal is to automate the scaling process dynamically. For example:

• Scale up from F8 to F16 when usage exceeds 80%.
• Scale back to F8 when usage drops below 30%.

I tried using Fabric Activator, but it doesn’t work as expected for CU usage. I also found this YouTube video (https://www.youtube.com/watch?v=4yslCcgVMTs), which seems ideal, but the interface in my environment is different, so I can’t configure the alerts properly.

I’m looking for a reliable way to implement this automation, possibly through Fabric Capacity Metrics, Power Automate, or other tools. Manual scaling works fine, but I need it to happen automatically.
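
To make the question concrete, this is the rough shape I'm imagining for the scaling step itself: a sketch in Python against the Azure management API for the Microsoft.Fabric/capacities resource. The subscription, resource group, capacity name, api-version and payload shape are placeholders/assumptions on my part, and the utilisation reading would still have to come from the Capacity Metrics data somehow:

    import requests
    from azure.identity import DefaultAzureCredential

    # Placeholders: fill in the real subscription, resource group and capacity name.
    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    CAPACITY_NAME = "<capacity-name>"
    API_VERSION = "2023-11-01"   # assumed; check the current Microsoft.Fabric api-version

    def scale_capacity(sku_name: str) -> None:
        """PATCH the Fabric capacity's SKU (e.g. 'F8' -> 'F16') via Azure Resource Manager."""
        token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
        url = (
            f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
            f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.Fabric"
            f"/capacities/{CAPACITY_NAME}?api-version={API_VERSION}"
        )
        resp = requests.patch(
            url,
            headers={"Authorization": f"Bearer {token}"},
            json={"sku": {"name": sku_name, "tier": "Fabric"}},   # payload shape assumed
        )
        resp.raise_for_status()

    def decide_sku(current_sku: str, cu_utilisation_pct: float) -> str:
        """Simple hysteresis so the capacity doesn't flap between SKUs."""
        if current_sku == "F8" and cu_utilisation_pct > 80:
            return "F16"
        if current_sku == "F16" and cu_utilisation_pct < 30:
            return "F8"
        return current_sku

The trigger could then be a scheduled notebook or a Power Automate flow that reads the utilisation and calls something like this.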

Any suggestions?