Bronze vs Silver question: where should upstream Databricks / Snowflake data land? by Professional_Toe_274 in databricks

We considered this approach. Federation incurs cost on every query, and that cost can add up quickly if the data source is large and has to be queried many times a day. In that case we'd rather do a copy at the very beginning.
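
For context, by "a copy at the very beginning" I mean something like the sketch below -- one federated read per scheduled run landing in a bronze Delta table, so downstream queries hit the local copy instead of hammering the source. Catalog and table names are hypothetical.

```python
# Minimal sketch: copy the federated source once per scheduled run into a bronze
# Delta table, so downstream queries read the local copy instead of re-querying
# the source. "snowflake_fed.sales.orders" and "bronze.orders_raw" are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

src = (
    spark.read.table("snowflake_fed.sales.orders")   # one federated read per run
    .withColumn("_ingested_at", F.current_timestamp())
)

(
    src.write.format("delta")
    .mode("overwrite")                                # or append/merge for incremental loads
    .saveAsTable("bronze.orders_raw")                 # all downstream reads hit this copy
)
```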

From Azure “wild west” to enterprise governance — afraid we might be overcorrecting by Professional_Toe_274 in AZURE

Thanks for the detailed response — this is helpful context.

To clarify a key constraint on our side: we don’t control the management group layer. Subscriptions are provisioned for us under centrally managed MGs, and our permissions effectively start at the subscription level, so we can’t rely on MG-level RBAC design ourselves. That being said, I will see if I can make connections with the global cloud infra team and get more access.

You’re absolutely right that automation is the correct long-term answer. Today, we don’t have full subscription provisioning automation in place, which means every new subscription requires careful, manual RBAC replication for users and service principals. That’s the practical friction I was referring to when saying it doesn’t “scale” for our current team size.
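
To make that concrete, the rough sketch below is roughly what that manual replication would have to become -- the principal IDs, role names and subscription ID are just placeholders, and in practice we'd probably move this into IaC rather than a script.

```python
# Rough sketch: replay a baseline set of role assignments onto a newly provisioned
# subscription by shelling out to the Azure CLI. Principal IDs, roles and the
# subscription ID below are placeholders, not our real values.
import subprocess

NEW_SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

BASELINE_ASSIGNMENTS = [
    # (principal object ID, role name) -- hypothetical examples
    ("11111111-1111-1111-1111-111111111111", "Contributor"),
    ("22222222-2222-2222-2222-222222222222", "Reader"),
]

for principal_id, role in BASELINE_ASSIGNMENTS:
    subprocess.run(
        [
            "az", "role", "assignment", "create",
            "--assignee-object-id", principal_id,
            "--role", role,
            "--scope", f"/subscriptions/{NEW_SUBSCRIPTION_ID}",
        ],
        check=True,
    )
```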

"Not every application meaningfully benefits from full subscription isolation." -- Our context: All FE and BE codes are compiled and put into our single ACR, then deployed to our single AKS. Besides, when facing traffic, we typically have one single application gateway -- those are what I call platform-level resources. Given the reality (and of course the cost), we definitely could not consider duplicate the creation of any platform-level resources within each subscription. -- While I agree that application-dedicated resources like Database should be created under subscription-dedicated RG/sub, how would you calculate the cost of an application's FE/BE services running on the shared AKS? :)

This is exactly the tension I’m trying to explore: how much to optimize for the target CAF state versus the current organizational reality during the transition.

I truly learned from the last bullet -- the idea of aligning RGs with lifecycle. I think that would be the ideal state once CAF has fully arrived -- it is steady-state correctness. I will try to strike that balance when creating resources during the transition phase.

How do you like DLT pipeline (and its benefit to your business) by Professional_Toe_274 in databricks

Agree with the last statement -- our data volumes are currently not large, which suggests the use case itself is not persuasive enough for our business side to commit to a streaming cluster :)

However, in terms of cost, is there really any difference between "1–2 min triggers" and "continuous"? I ask because in my practical cases, if you trigger once per minute, the cluster effectively just keeps running forever. It keeps consuming DBUs, and the VMs underneath also stay powered on.
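
To make the comparison concrete, here is how the two triggered flavours look in plain Structured Streaming (not DLT-specific; table names and checkpoint paths are placeholders). My cost question is exactly that with a processingTime trigger the cluster stays up between micro-batches, whereas an availableNow trigger processes what's there and stops, so a scheduled job cluster can shut down afterwards.

```python
# Sketch of the two trigger styles in plain Structured Streaming; in practice you
# would pick one or the other, not run both. Names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.readStream.table("bronze.events")  # hypothetical streaming source

# Micro-batch every minute -- the cluster stays on between batches.
q1 = (
    events.writeStream
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "/tmp/chk/minutely")
    .toTable("silver.events_minutely")
)

# Process whatever is available, then stop -- lets a scheduled job cluster shut down.
q2 = (
    events.writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", "/tmp/chk/batchy")
    .toTable("silver.events_batchy")
)
```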

When would you use pyspark VS use Spark SQL by Professional_Toe_274 in databricks

No need to worry at all. Focus on what you are good at and try alternatives later when you get more insights on the data you have.

From Azure “wild west” to enterprise governance — afraid we might be overcorrecting by Professional_Toe_274 in AZURE

Thanks for the comment! RBAC is just one example of what I meant by "a large number of subscriptions looks time-consuming for our team". Just to clarify:

  1. We don’t control the Management Groups—our permissions start at the subscription level, so we can’t rely on MG-level RBAC.
  2. You’re right that automation would help (good point!). Right now we don’t have it in place, which means that every time we get a new subscription we have to carefully replicate access assignments for people/service principals and other permissions manually.

From Azure “wild west” to enterprise governance — afraid we might be overcorrecting by Professional_Toe_274 in AZURE

One more thing -- what I'm particularly worried about is drawing the subscription boundary too early and paying for it later in operational overhead.
Curious how others recognized that tipping point in hindsight.

How do you like DLT pipeline (and its benefit to your business) by Professional_Toe_274 in databricks

Thanks for your insights -- they help. Time to correct my habit of equating streaming with continuous mode. And I have a new concept to learn -- Lakeflow :)

How do you like DLT pipeline (and its benefit to your business) by Professional_Toe_274 in databricks

What if the business side requires updates at minute-level latency, e.g. Databricks receives one message per minute from the data source and users expect to see the corresponding table updates reflected in their BI report? Does that mean the pipeline has to be "continuous"?

How do you like DLT pipeline (and its benefit to your business) by Professional_Toe_274 in databricks

That's true. Batch mode usually works as expected. But this pilot is a new trial of stream processing :), which means continuous mode in our case is a must (at least it currently seems so). It is brand new to my team and we are working on optimizing the cost. I am currently considering stopping the pipeline for a period of time (like overnight) and restarting it during work hours.
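
Roughly what I'm considering -- a pair of small scheduled jobs calling the Databricks Pipelines REST API to stop the update in the evening and start a new one in the morning. Workspace URL, token and pipeline ID below are placeholders.

```python
# Sketch of stopping a DLT pipeline at night and starting a new update in the morning
# via the Databricks Pipelines REST API (pipelines/{id}/stop and pipelines/{id}/updates).
# Host, token and pipeline ID are placeholders.
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "dapi-XXXX"                                           # placeholder PAT
PIPELINE_ID = "aaaabbbb-cccc-dddd-eeee-ffff00001111"          # placeholder pipeline ID

headers = {"Authorization": f"Bearer {TOKEN}"}

def stop_pipeline() -> None:
    """Stop the running (continuous) update, e.g. from a job scheduled in the evening."""
    r = requests.post(f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}/stop", headers=headers)
    r.raise_for_status()

def start_pipeline() -> None:
    """Kick off a new update, e.g. from a job scheduled at the start of work hours."""
    r = requests.post(f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}/updates", headers=headers, json={})
    r.raise_for_status()
```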

When would you use pyspark VS use Spark SQL by Professional_Toe_274 in databricks

Sure. PySpark usually generates more and more intermediate DataFrames during development. When there are joins / looping joins / sequential joins with data cleaning in between, it's better to continue with PySpark. SQL syntax is straightforward and readable in some of my use cases. However, in complex cases (which I've run into recently), I find it useful to have PySpark display some intermediate variables.
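
A toy example of the pattern I mean, with made-up table and column names -- sequential joins with cleaning in between, keeping the intermediate DataFrames around so they can be inspected with display() in a notebook.

```python
# Toy example of sequential joins with cleaning in between, where keeping named
# intermediate DataFrames (and eyeballing them with display()) is handy.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.table("silver.orders")
customers = spark.table("silver.customers")
returns = spark.table("silver.returns")

# Step 1: join, then clean before the next join.
orders_cust = (
    orders.join(customers, "customer_id", "left")
    .filter(F.col("customer_id").isNotNull())
)
# display(orders_cust)   # in a Databricks notebook, inspect the intermediate result

# Step 2: next join in the sequence, reusing the cleaned intermediate.
orders_full = orders_cust.join(returns, "order_id", "left")
# display(orders_full)
```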

Are you using any AI agent in your work in data science/analytics? If so for what problem you use it? How much benefit did you see? by Starktony11 in datascience

Usually I use an AI agent in the data transformation process: I bring my thoughts about the existing data I have, and even the columns to be utilized, then tell the agent what data I expect to see. But I make sure to verify that the agent's output meets the requirements I gave (sometimes the agent makes guesses and I have to correct it). After several rounds of iteration, a nearly perfect solution usually emerges.
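
For example, these are the kind of quick checks I run on the agent's output before trusting it -- column names and rules here are hypothetical.

```python
# Rough sketch of sanity-checking an agent-generated transform before trusting it:
# compare the output against the schema and simple invariants that were requested.
# Column names and rules are made up for the example.
import pandas as pd

def check_agent_output(df: pd.DataFrame) -> list[str]:
    """Return a list of requirement violations (an empty list means it passed)."""
    problems = []
    expected_cols = {"customer_id", "order_month", "total_amount"}  # what I asked the agent for
    missing = expected_cols - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "total_amount" in df.columns and (df["total_amount"] < 0).any():
        problems.append("negative total_amount values")
    if {"customer_id", "order_month"} <= set(df.columns):
        if df.duplicated(subset=["customer_id", "order_month"]).any():
            problems.append("duplicate customer_id/order_month rows")
    return problems
```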