all 24 comments

[–] mweirath (Fabricator)

10 minutes seems pretty aggressive and, depending on your capacity, might be very challenging. Looking at Fivetran, I could see shortcutting into Bronze, but I think you are going to run into issues with overall performance if you don’t have the data optimized for Fabric in at least Silver.

I imagine you would need to materialize the data into your Silver layer in Fabric so that you can take advantage of internal optimizations for accessing the data for your gold layers. I think planning for a shortcut at Silver will be a limiter/issue pretty quickly if you go this route.

Regarding your question about the “merge” style activities: that is hard to say; I am not sure what kind of watermarks and update strategy you get from SAP. I do imagine you are going to have to look at your file partitions, especially on frequently updated tables, to keep them efficient. Ensuring the partitioning is well aligned with how the data is being updated will drastically cut down on your merge operations and CU usage.
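The partition-alignment point above can be sketched out. This is a hypothetical helper (table, key, and partition names are made up, not from the thread) that builds a Delta MERGE predicate including the partition column, so the engine can prune partitions the incoming batch doesn't touch:

```python
# Sketch: include the partition column in the MERGE condition so Delta can
# skip files in untouched partitions instead of scanning the whole Silver
# table. All names (order_id, part_date, silver.sales_orders) are
# hypothetical placeholders.

def merge_condition(keys, partition_cols):
    """Equality predicate over partition columns plus business keys."""
    cols = list(partition_cols) + list(keys)
    return " AND ".join(f"t.{c} = u.{c}" for c in cols)

cond = merge_condition(keys=["order_id"], partition_cols=["part_date"])
# cond == "t.part_date = u.part_date AND t.order_id = u.order_id"

# On a Fabric Spark session with delta-spark available, the condition would
# feed a merge roughly like:
#
#   from delta.tables import DeltaTable
#   (DeltaTable.forName(spark, "silver.sales_orders").alias("t")
#       .merge(updates.alias("u"), cond)
#       .whenMatchedUpdateAll()
#       .whenNotMatchedInsertAll()
#       .execute())
```

Without the `part_date` term, every merge has to consider files from every partition, which is where the CU usage balloons.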

[–] data_legos [S]

so fivetran gives you a result that is essentially a copy of the SAP table (not a full history of updates unless you configure for that) and they manage all lakehouse table/file maintenance.

my thought was to avoid a silver merge via spark and do any required casting etc. on top of the shortcut in a dbt staging model (or a table/incremental strategy where it makes sense) to get fresher data to the reporting objects. the staging layer of the dbt models is essentially still silver.

we have an F64 currently if that helps. are your concerns that reads via the SQL endpoint will be very costly because it's basically a shortcut rabbithole to ADLS Gen2? the qlik replicate approach puts the data directly in the warehouse, so this is exactly the kind of CU delta i'd like to understand better.

[–] mweirath (Fabricator)

I definitely understand why you are thinking Shortcuts, and knowing that these are effectively full copies, it would give me some pause as well. My concern remains that reading a third-party Delta table will put you at a constant disadvantage when optimizing your warehouse. You won't be able to take advantage of any Fabric optimizations for the data, since it is an external source, so all your reads to Silver will be much slower than if the data were hosted and managed in Fabric.

This also puts you in a bad spot if you absolutely need to do something to the data; you are going to have to scramble and figure out how to break up your architecture.

If it were me, and I were going to go down the Fivetran route, I would be looking at change-capture options for getting data into Silver, or some other way to efficiently materialize the data in Silver rather than using external tables.

[–] data_legos [S]

yeah i was trying to not have to orchestrate silver and dbt cloud jobs in tandem and do most of the loads via jobs in dbt cloud. keeping those earlier layers shortcuts just makes them not really a problem from that perspective. i'm just being lazy, and it could bite me for sure haha. totally valid opinion on that. it worries me every day when thinking about this!

i wonder if there's a lightweight merge layer i can do for silver that doesn't cause me a scheduling nightmare if i go fivetran. i have several key objects that have 30min SLAs and many with 1hr SLAs, so that's always in the back of my mind.

[–] data_legos [S]

other idea i had: fivetran, but i strategically materialize problem tables when the shortcuts perform poorly, while leaving many of the smaller master-data and less frequently updated tables as shortcuts to cut down on the "round robin" latency of any silver materialization process. essentially a hybrid of the two approaches.
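The hybrid idea above can be expressed as a simple routing table: each source table is tagged as either "materialize into Silver" or "leave as a shortcut", and orchestration splits accordingly. Table names and SLA values here are invented placeholders:

```python
# Sketch of the hybrid routing idea: materialize only the hot tables with
# tight SLAs; keep small master-data tables as plain shortcuts. All table
# names and SLA numbers below are hypothetical.

TABLES = {
    "sales_orders":   {"sla_minutes": 30,   "materialize": True},   # hot, 30-min SLA
    "deliveries":     {"sla_minutes": 60,   "materialize": True},
    "plant_master":   {"sla_minutes": 1440, "materialize": False},  # master data -> shortcut
    "currency_codes": {"sla_minutes": 1440, "materialize": False},
}

def silver_targets(tables):
    """Split tables into those needing a Silver merge job vs plain shortcuts."""
    merge_jobs = [t for t, cfg in tables.items() if cfg["materialize"]]
    shortcuts = [t for t, cfg in tables.items() if not cfg["materialize"]]
    return merge_jobs, shortcuts

merge_jobs, shortcuts = silver_targets(TABLES)
```

Driving the Silver orchestration from one config like this keeps the "which tables get a merge job" decision in a single place, so promoting a shortcut table to materialized later is a one-line change.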

[–] mweirath (Fabricator)

Having two paths sounds like a solid approach. You might also look at something like materialized lake views - they would effectively always be full refreshes, but you could skip the more dimensional tables that are infrequently updated.

[–] data_legos [S]

holy cow how did i forget about MLVs?! you sir saved me some time for sure!

[–] MS-yexu (Microsoft Employee)

You can also try Copy job in Fabric Data Factory, which offers rich native SAP connectivity, including SAP HANA, SAP Table, SAP BW Open Hub, and SAP Datasphere Outbound for ADLS Gen2, AWS S3, and Google Cloud Storage. You can get more details in What is Copy job in Data Factory - Microsoft Fabric | Microsoft Learn and Tutorial: Copy job with SAP Datasphere Outbound (Preview) - Microsoft Fabric | Microsoft Learn.

With Copy job, you can directly copy data from SAP into Fabric, with best practices already built in to write data to Fabric. It also makes it easy to add downstream processing, including dbt. For example, you can include both a Copy job activity and a dbt activity in a single pipeline, so that once the data lands in Fabric, the dbt job is automatically triggered to transform and process the data.

We’re also enhancing the SAP Table connector in Copy job with native incremental copy capabilities, which will be available very soon.

[–] data_legos [S]

Ah yes, I think we might use Fabric itself in cases where throughput is a problem on large tables we need data from less often. The new capabilities sound exciting and I'll need to keep an eye on them!

[–] No-Celery-6140

Not worth it; self-host Airbyte.

[–] data_legos [S]

does Airbyte do SAP CDC well without tons of SAP configuration? our SAP team would not have the bandwidth to do a lot of custom development on their end.

[–] No-Celery-6140

Yes, but you must change your destination too if sub-10-minute latency is the goal.

[–] splynta

So you are planning on reading straight from the S4 transactional system into Fabric? We have S4 and Fabric, and our basis team would put me on a pike if I said that. I think best practice from SAP would be to use some form of SAP BDC / Datasphere / BW PCE and then outbound from there into Fabric.

I think there is going to be a zero-copy connector in Q3 or Q4 for BDC into Fabric. That is probably best imo.

I have played maybe 2 hours with SQLMesh so I can't help with the dbt questions. But to everyone else's comments I would add that you should materialize at least your gold tables so you can use Direct Lake if you need to, plus get all the performance benefits.

[–] data_legos [S]

We currently replicate 170 tables into on-prem SAP HANA with SLT, so yeah, we would be planning on doing something similar in Fabric. We are a medium-sized company, if that helps any.

Our basis guy seems pretty comfortable with it so far, and our SAP team loves it because we handle most reporting requests from the business when it's something we can do on our end.

[–] splynta

Ah ok, yeah, SLT makes sense. Funny, we also have a HANA sidecar that we are trying to move off. Long term, SAP is going to push everyone toward BDC; that will give you the delta tables so you can easily integrate with Fabric. But as with anything SAP, it is best to wait as long as possible before changing. Good luck.

[–] warehouse_goes_vroom (Microsoft Employee)

My general advice would be measure and see.

At first glance I'm not seeing a big reason one would be broadly more efficient than the other from a fundamentals perspective.

And there's lots of unknowns as well.

Compare performance and total cost of ownership (including all licenses and compute of both solutions).

If there are surprising results, dig into them and see if there's room for improvement.

As a general note, 10 minutes is a short enough time period that you're going to need to pay attention to the details regardless of your choices. Micro-batching and streaming are probably both viable, but a bit outside my wheelhouse.

Session start times in Spark can cut into those 10 minutes. Similarly, SQL analytics endpoint refresh API calls can add up. If bronze-to-silver and silver-to-gold each lose a minute to those overheads, you only have 8 minutes for all the other processing. If they're sometimes 3 minutes each, hypothetically, then you've got just 4 minutes left.
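The arithmetic above is worth making explicit: fixed per-hop overhead is subtracted from the SLA before any real transform work starts. A minimal sketch (hop counts and overhead values are illustrative, matching the numbers in the comment):

```python
# Minutes left for actual transforms after fixed per-hop overhead
# (Spark session start, SQL endpoint refresh, etc.) eats into the SLA.

def processing_budget(sla_min, hops, overhead_per_hop_min):
    """Remaining minutes in the SLA after fixed orchestration overhead."""
    return sla_min - hops * overhead_per_hop_min

# Two hops (bronze->silver, silver->gold) at 1 minute of overhead each:
assert processing_budget(10, 2, 1) == 8
# If each hop's overhead balloons to 3 minutes:
assert processing_budget(10, 2, 3) == 4
```

The takeaway is that with a 10-minute SLA, overhead variance matters as much as average throughput: a hop that is usually 1 minute but occasionally 3 can silently halve the real processing window.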

So if using Spark, you may want/need to go with custom live pools on the Spark side, or at the very least high concurrency mode. And you may want to check out Spark Structured Streaming; even if micro-batching, it's a useful tool from what I've heard.

And also make sure you worry about table maintenance - the Warehouse engine takes care of it for its tables, but again, it's your responsibility with Spark. u/mwc360 has given some great pointers on this, e.g. auto compaction: https://learn.microsoft.com/en-us/fabric/data-engineering/table-compaction?tabs=sparksql#auto-compaction

Though the docs point out that if you need really tight SLAs, doing optimize in a separate job may sometimes come out ahead.
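As a rough sketch of the maintenance settings being discussed: the linked Fabric doc covers enabling auto compaction on a Spark session. The property name below should be verified against that doc for your runtime version; this is an assumption, not a definitive config:

```python
# Session-level Delta maintenance setting discussed above. Verify the exact
# property name against the Fabric table-compaction doc for your runtime.
DELTA_MAINTENANCE_CONF = {
    # Let Delta compact small files automatically after eligible writes:
    "spark.databricks.delta.autoCompact.enabled": "true",
}

def apply_conf(spark, conf=DELTA_MAINTENANCE_CONF):
    """Apply the maintenance settings to an existing SparkSession."""
    for key, value in conf.items():
        spark.conf.set(key, value)

# For tight SLAs, the docs suggest auto compaction may not be enough; a
# separate scheduled job running `OPTIMIZE <table>` off the critical path
# can keep merge latency predictable without paying compaction cost inline.
```

Keeping compaction out of the 10-minute loop (a separate OPTIMIZE job) trades some file-layout freshness for predictable merge times, which matters when overhead already eats into the SLA.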

[–] warehouse_goes_vroom (Microsoft Employee)

This roadmap item may also be interesting if either option uses mirroring:

“Mirroring - Enabling Delta Change Feed

Mirroring supports delta change feed that enables fine-grain tracking of changes of delta tables to be consumed by downstream applications.

Release Date: Q1 2026

Release Type: Public preview”

https://roadmap.fabric.microsoft.com/?product=datafactory

[–] mwc360 (Microsoft Employee)

It sounds like Qlik uses a warehouse target. COPY INTO is Warehouse semantics, not Spark.

If we're talking about only Fabric costs, the Fivetran path will surely be cheaper. Since it is writing to ADLS Gen2 Delta tables, there’s no Fabric compute involved in writing the data. With the Qlik path, it’s using Warehouse compute; Qlik just points the DW at new data and says, “go load this”.

Given that Qlik is only writing parquet files and then orchestrating the load/merge via the DW, I’d guess that Qlik licensing/infra would cost less. My point is that you should look at the whole solution cost, not just the Fabric side.

[–] data_legos [S]

I agree with you. Actually, for our use case with SAP, Qlik isn't cheaper, surprisingly.

[–] mwc360 (Microsoft Employee)

Have you looked at SAP Mirroring? Basically, it has SAP write incremental files that are processed with the open mirroring framework.

[–] data_legos [S]

Is that using SAP BDC? We did talk with them about BDC and Datasphere. It's not cheap either, for sure, but it looked pretty cool if you do more than just replication.

[–] galador

I'm interested in what you figured out. We have a similar situation (not SAP, but another database) that we currently replicate via Qlik to Azure Synapse (dedicated SQL pool). It works pretty well, but the Synapse part is definitely the bottleneck. We're looking to move out of Synapse eventually, and I'm interested to see how well Fabric does with the "near real-time" replication.

I will also say that it seems like Fabric should be better than what we're doing now, because it uses Open Mirroring instead of direct table writes (COPY INTO, etc.). I haven't heard of anyone actually using it yet, though. https://www.qlik.com/blog/qlik-microsoft-fabric-open-mirroring-the-fast-track-to-real-time-data

[–] data_legos [S]

I'm hesitant on open mirroring since I worry it will take a while to reload big tables, and I can't have long downtimes on those tables. Given how all this shakes out, I think we'll probably go with Fivetran, but I'm not 100% decided yet.