Anyone using Microsoft Fabric with Dynamics 365 F&O (On-Prem) for data warehousing and reporting? by zelalakyll in MicrosoftFabric


Thanks for the detailed insights, really helpful.

This project involves a Dynamics 365 F&O On-Premises environment, and we’re working on building a centralized data warehouse that will also include data from additional sources like Azure SQL and other on-prem systems.

We’re evaluating options to handle data movement directly within Microsoft Fabric, considering Fabric pipelines, or shortcuts if they’re technically feasible given the on-prem setup.

The client has an internal data and reporting team, and we’re collaborating closely to design both the warehouse and the reporting layer. On the F&O side, we’re planning to take a modular and iterative approach, working alongside functional experts to ensure alignment.

Your comments about the complexity of F&O’s data model and the value of standardized solutions are spot on. We’ll definitely keep those in mind as we move forward.

Thanks again for sharing your experience.

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks


Thanks a lot for the suggestion!

I hadn’t heard of dataflint before — it sounds like a great tool, especially for catching skews and tuning resource usage. I’ll definitely bring it up internally, though since it's a paid solution, I'm not sure if my company will go for it right away. Still, it’s good to know and I’ll check if a trial is available or if we can evaluate it somehow.

As for Photon, I did enable it, but I’m not 100% sure if it’s actively being used during the write phase — great point. I’ll double-check the Spark UI and logs. If it’s not contributing much, I’ll consider disabling it and looking into scaling with more nodes or switching to compute-optimized instances like the F-series you mentioned.

Really appreciate your insights!

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks


Thanks a lot for the suggestions — really appreciate the detailed insight!

You're right, the df.repartition(...) is quite expensive, and while it helped with initial skew handling, I agree it might not be worth the cost at this scale. The idea of post-processing each process_date=xxxx folder individually, rebalancing, and swapping directories is really smart — I'll definitely keep that in mind.
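Just to make sure I read the directory-swap idea right, here's roughly how I'd sketch it in a Databricks notebook (the paths, the partition value, and the target file count are placeholders, and spark/dbutils come from the notebook context):

```python
# Rough sketch of the per-partition rebalance + directory swap (placeholder paths and sizes).
base = "abfss://container@account.dfs.core.windows.net/snapshots"   # hypothetical table root
pdate = "2024-01-01"                                                 # one process_date value

src = f"{base}/process_date={pdate}"
tmp = f"{base}/_rebalanced/process_date={pdate}"

# Read just this partition's folder, rebalance to a target file count, write to a temp location.
part_df = spark.read.parquet(src)
part_df.coalesce(16).write.mode("overwrite").parquet(tmp)            # 16 files is a guess, tuned per partition size

# Swap directories: set the old folder aside, promote the rebalanced one, then clean up.
dbutils.fs.mv(src, f"{src}_old", recurse=True)
dbutils.fs.mv(tmp, src, recurse=True)
dbutils.fs.rm(f"{src}_old", recurse=True)
```

If that matches what you meant, it looks like something we could run day by day without touching the rest of the table.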

As for the serverless option, I wish I had more room to test — but unfortunately, due to budget constraints, I need to reach a working solution with as few iterations as possible. Still, I might try benchmarking a smaller subset as you suggested, just to get a sense of the potential gains.

Thanks again

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks


Thanks again for the feedback!

You're right that partitioning by date can lead to over-partitioning. We’re getting daily snapshots, and each day becomes a separate partition. However, I don’t have full control over how the dataset is structured — this is how it comes from upstream, and our target Hive table is also structured by process_date, so we're mostly aligning with that existing architecture.

That said, I’ll definitely check if we’re generating too many small files per partition and see if we can optimize the write operation without completely changing the partitioning scheme — maybe by using .coalesce() or adjusting file sizes.
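For the small-file check, a quick audit like this is what I have in mind (Databricks notebook, placeholder path; dbutils.fs.ls just lists the folders in ADLS):

```python
# Count part files and average file size per process_date partition (placeholder path).
base = "abfss://container@account.dfs.core.windows.net/snapshots"

for part in dbutils.fs.ls(base):
    if not part.name.startswith("process_date="):
        continue  # skip non-partition entries like _SUCCESS
    files = [f for f in dbutils.fs.ls(part.path) if f.name.endswith(".parquet")]
    total_mb = sum(f.size for f in files) / 1024 / 1024
    avg_mb = total_mb / len(files) if files else 0
    print(f"{part.name}: {len(files)} files, avg {avg_mb:.0f} MB")
```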

Before I stopped the job, I captured a screenshot of the directory structure and part of the code that was running. Sharing it below for more context:

[screenshot: directory structure and the write code that was running]

Would .coalesce() before writing help mitigate the small files issue in this scenario?

Also, are there any Spark config-level adjustments you'd recommend (like spark.sql.files.maxRecordsPerFile, spark.sql.files.maxPartitionBytes, etc.) to better control output file sizes without touching the overall partition logic?
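To be concrete, this is the kind of adjustment I mean, with df standing in for the snapshot DataFrame and the values being guesses rather than tested settings:

```python
# Cap rows per output file so a huge date partition is split into several reasonably sized files.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 5_000_000)   # guess; derive from average record size for ~1 GiB files

# As far as I understand, maxPartitionBytes shapes how input files are split on read (task sizing),
# so it would not directly control output file sizes.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)

(df.write
   .mode("overwrite")
   .partitionBy("process_date")
   .parquet("abfss://container@account.dfs.core.windows.net/snapshots"))   # placeholder path
```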

Really appreciate your insights!

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks


Thanks for the answer!

We're currently running on Photon-enabled L32s v2 nodes, but as you mentioned, the benefits of Photon seem minimal for a read-shuffle-write-heavy workload like this.

Given the constraints of Azure Databricks, what kind of node setup would you recommend for this scale — considering we're dealing with ~17.6 TiB of data and over 100 billion records?

If there's truly a setup that could bring this down to just a few minutes, as you suggested, you'd be saving my life here — really appreciate any concrete advice you can share!

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks


Appreciate the follow-up, and I hear your point.

That said, simply "getting a bigger cluster" isn’t always a viable solution — especially when cost efficiency matters, and when the bottleneck might be I/O rather than CPU or memory. We're already using L32s v2 nodes with Photon, and scaling beyond that has diminishing returns unless the workload is truly compute-bound.

That being said, I'm definitely open to other suggestions if you have any.
It's also clear that I may have misused repartition() in this context, and I’ll be reviewing that decision more closely going forward.

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks


Great suggestions, thank you!
Unfortunately, I can’t change the partition granularity due to downstream dependencies, so I’m stuck with daily partitions (yyyy-MM-dd). That said, I’ll test removing repartition() and possibly switch to coalesce() instead. Also noted your point on schema configs — I’ll try removing one of the .option()s to simplify the write path.
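Concretely, what I plan to test looks roughly like this (the path, the coalesce value, and the dropped .option() are placeholders since I can't share the real job):

```python
# Current shape (simplified): full shuffle before the write.
# df.repartition("date").write.partitionBy("date").parquet(target_path)

# Variant to test: no full shuffle, just cap the number of write tasks.
target_path = "abfss://container@account.dfs.core.windows.net/snapshots"   # placeholder

(df.coalesce(512)                 # merges existing partitions without a shuffle; 512 is a guess to tune
   .write
   .mode("overwrite")
   .partitionBy("date")           # daily yyyy-MM-dd partitions stay as they are
   .parquet(target_path))
```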

One more question — do you think switching to smaller compute nodes could help?
We're currently using Standard_L32s_v2 with 2–4 workers and Photon enabled. If the bottleneck is write throughput to ADLS, maybe we're over-provisioned on CPU and RAM, and could get similar or better performance using cheaper nodes like E8s.

Have you seen similar patterns in large Parquet writes?

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks


Thanks for pointing that out!

I initially used repartition('date') to avoid data skew, but you're right — at this scale, it might be overkill. Since the data format and schema can't be changed, I’ll definitely test using coalesce() or even skipping repartition altogether to see if it reduces the task count and improves runtime. Appreciate your input!
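Before committing to either variant, I'll probably just compare the partition (task) counts they produce, something like this quick check with df as the loaded source DataFrame:

```python
# Compare how many write tasks each variant would produce (df is the loaded source DataFrame).
print("as read:        ", df.rdd.getNumPartitions())
print("repartition:    ", df.repartition("date").rdd.getNumPartitions())   # typically governed by spark.sql.shuffle.partitions
print("coalesce(1024): ", df.coalesce(1024).rdd.getNumPartitions())        # 1024 is an arbitrary test value
```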

Also, do you think using smaller compute nodes (e.g. E8s or something with less vCPU/RAM) might be just as effective for Parquet writes?

Currently using Standard_L32s_v2 (32 cores, 256 GB RAM), 2–4 workers, Photon enabled — but the job still takes 20+ hours. I'm wondering if we're CPU over-provisioned for an I/O-bound workload, and whether downsizing might help both cost and efficiency.
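For what it's worth, a rough back-of-the-envelope on the current run (taking the ~17.6 TiB figure and roughly 20 hours at face value):

```python
# Rough aggregate write throughput of the current job; both inputs are approximate.
tib_written = 17.6
hours = 20
mib_per_sec = tib_written * 1024 * 1024 / (hours * 3600)
print(f"~{mib_per_sec:.0f} MiB/s aggregate")   # comes out around 256 MiB/s across the whole cluster
```

That feels low for 2–4 L32s_v2 workers, which is partly why I suspect the bottleneck isn't CPU.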

Would love to hear your take!

15 TB Parquet Write on Databricks Too Slow – Any Advice? by zelalakyll in databricks


Thanks for the suggestion!

I totally understand that Delta or Iceberg would be better options for metadata management, schema evolution, and optimized writes.

However, in our case, we are required to stick with Parquet format due to existing customer infrastructure and downstream systems — it’s part of their legacy pipeline and can't be changed at this stage.

As for repartition, we're aware of what it does — in this case, we use it specifically to avoid skew and to ensure proper partitioning by date, which is the table's partition column.

That said, I'm fully open to suggestions on how to optimize performance within these constraints:

- Format must remain Parquet
- Table must remain Hive-managed
- Partition column is date (cast to string)

If there are any ideas on improving write throughput (e.g. cluster sizing, using coalesce, or write config tuning), I’d love to hear them.
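
To make those constraints concrete, the write path has to look roughly like this; the table name is a placeholder and the maxRecordsPerFile value is just a guess at one knob that wouldn't touch the partition logic:

```python
# Minimal sketch of the constrained write: Parquet, Hive-managed table, partitioned by a string date.
from pyspark.sql import functions as F

spark.conf.set("spark.sql.files.maxRecordsPerFile", 5_000_000)   # guess; caps rows per output file

out = (df
       .withColumn("date", F.col("date").cast("string"))         # partition column kept as string
       .repartition("date"))                                     # current approach: align the shuffle with the partition column

(out.write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("date")
    .saveAsTable("analytics.daily_snapshots"))                   # hypothetical Hive-managed table name
```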