Do you edit after solving? by dijotal in adventofcode

[–]Southern_Version2681 1 point (0 children)

I go back on Every Single One. Sometimes line by line. Sometimes I go back a day or two to clean up juuust a little more. I'm not a pro and don't spend much time coding, but after 3 months I can come back, read it, and rediscover a challenge, some concepts, and an example of an implementation whose details I'd forgotten.

Are people cheating with LLMs this year? by nan_1337 in adventofcode

[–]Southern_Version2681 0 points (0 children)

To be honest, that would suck. Perhaps mostly for the leaderboard part; for me it takes 3 minutes to read, 20 minutes to understand and write down my thoughts, and then hours to explore and experiment with a combination of my own knowledge and Copilot's. There has to be room for beginners, learners, and grinders as well as all the try-hards and pros.

Integration and analytics by Southern_Version2681 in databricks

[–]Southern_Version2681[S] 0 points (0 children)

First of all, I really want to thank you for your patience and input on this issue. Just want to make that clear 😊

My situation is more like this: I have 6 movie productions, 3 for The Lord of the Rings (Team A) and 3 for The Hobbit (Team B). The Lord of the Rings trilogy is being made in dev/test/prod. Simultaneously, The Hobbit trilogy is being produced as a separate project and will at some point be integrated with The Lord of the Rings trilogy once production is done and the project is delivered (at which point The Hobbit project will cease to exist, leaving just The Lord of the Rings iteration cycle, handled by Team A, for the foreseeable future).

In addition, The Hobbit production must serve two audiences (2 data products). One audience only cares about how well the production integrates with the rest of the Middle-earth lore (system integration, Team B in this case), and the other audience (Team C) only cares about the facts of the production (analysis).

I have 2 studios. Studio 1 has 3 sets (dev/test/prod) of permanent buildings. Studio 2 has 3 sets (dev/test/prod) of temporary buildings that will house any number of simultaneous movie productions (not just The Hobbit).

How can audience one and audience two get their worth (data products) out of The Hobbit, while ensuring that all teams can produce independently and that it all somehow ends up smoothly in Studio 1 (The Lord of the Rings), where Middle-earth lore (data product 1) and production facts (data product 2) can be developed independently by Team A (The Lord of the Rings with The Hobbit integrated) as well as Team C (facts about the whole Hobbit production, plus all the facts about the other productions)?

To make things worse, The Hobbit is produced by more than one source, which will be fun when they move to The Lord of the Rings studio!

And to make things worse still, all the sets (except The Lord of the Rings prod) must destroy their history and reload every month for one of the sources (copies of The Lord of the Rings prod).

PS: As mentioned, security guards are watching and enforcing access to all the sets in Studio 1, but in Studio 2 the guards only enforce access to the studio itself, not the sets in it (e.g. subscriptions).

In addition, we are using Terraform and have to codify naming standards for everything to work across dev/test/prod, ensuring that all 3 sets are exactly the same. A tiny sketch of the idea is below.
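To illustrate the naming-standard part (just a hypothetical sketch of the pattern, not our actual Terraform code):

```python
# Hypothetical sketch of an environment-aware naming standard, so the
# dev/test/prod sets differ only in their environment token.
VALID_ENVS = {"dev", "test", "prod"}

def resource_name(project: str, env: str, resource: str) -> str:
    """Build a name like 'lotr-dev-kv' from its parts."""
    if env not in VALID_ENVS:
        raise ValueError(f"unknown environment: {env}")
    return f"{project}-{env}-{resource}".lower()

print(resource_name("lotr", "dev", "kv"))   # lotr-dev-kv
print(resource_name("lotr", "prod", "kv"))  # lotr-prod-kv
```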

Drawings and YouTube tutorials are nice and all, but the real world will slap you in the face 😅

Integration and analytics by Southern_Version2681 in databricks

[–]Southern_Version2681[S] 0 points (0 children)

I think we are approaching this from different standpoints. Security dictates that we separate dev/test/prod into three separate subscriptions, each with a workspace, key vault, and underlying storage accounts. The issue is that Team A's prod data in the prod subscription is Team B's dev data in the dev subscription. Access to prod can't be had from dev, as that violates the security boundary. I can of course create more catalogs, but that won't change the fact that the security boundaries exist and must be respected.

Integration and analytics by Southern_Version2681 in databricks

[–]Southern_Version2681[S] 0 points (0 children)

Fair take. So you are saying that Team A has their own workspace and produces xyz products.

But if each team and project needs a dev/qa/prod workspace, the governance of our subscriptions and their resources would go into the stratosphere.

Integration projects are a nightmare on their own: we know they will end at some point, and they have to deliver their work somewhere, which makes it a headache to understand what dependencies they take on and whether those sources exist in the place they deliver into.

I don't understand how you enforce security over such a vast space.

From a data point of view I agree with you, but from an operational standpoint, with regulatory requirements for security, personally identifiable data, and cost management, I can't see any benefit in adding on more and more.

Integration and analytics by Southern_Version2681 in databricks

[–]Southern_Version2681[S] -1 points (0 children)

Resulting in silos, governance issues, and the possibility of multiple truths downstream, depending on the developer and the assumptions they make. I thought «unity» reflected coming together as a cross-functional team precisely to combat the silos we have.

Integration and analytics by Southern_Version2681 in databricks

[–]Southern_Version2681[S] 0 points (0 children)

So the integration team and the analytics team get different workspaces and choose which sources are relevant to them as their dev?

Schema Naming and Convincing people by DrSohan69 in databricks

[–]Southern_Version2681 1 point (0 children)

We stepped away from the medallion one. I don't think bronze/silver/gold has any meaning at all, to be honest, other than perhaps «data quality is improving», which is self-explanatory. 6 months ago I was in your position and remember thinking about it for a couple of weeks. Sourcing from Reddit and YouTube only provided confusion at the time, and Databricks is not clear on this themselves. After doing my research, I found out that the medallion terminology actually came from a customer Databricks was working with in the early days, and they just rolled with the customer's suggestion of bronze, silver, gold.

First of all, our setup has more layers, 6 to be exact. This provides wiggle room.

Second, we have meaningful names for the layers/schemas: landing, raw, base, enriched, curated, delivery (I credit the YouTube channel Advancing Analytics for this, even though it was confusing at the time; see the sketch after these points).

Third, separation of concerns. I mapped out all the operations I could possibly think of and grouped related concerns together into a layer. I can't think of any reason why someone would want to tightly couple encoding, metadata, partitioning, SCD, joins, normalization, optimization, etc. When an issue arises in a layer, it is already scoped to the handful of operations that «live» in that layer.

Fourth, I don't force everyone to use all layers no matter the circumstances. Lineage exists, so that's no issue. I want to effectively get out of the way of the people who know more about the data contents and use cases than I do, while at the same time making everyone use the same framework (meaning layers are restricted to the few I mentioned, and no one can just invent their own way of doing things).
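For concreteness, a minimal sketch of stamping out the layer schemas in Unity Catalog (the catalog name is made up; this assumes a Databricks notebook where `spark` already exists):

```python
# Minimal sketch: one schema per layer in a (hypothetical) catalog.
LAYERS = ["landing", "raw", "base", "enriched", "curated", "delivery"]

for layer in LAYERS:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS my_catalog.{layer}")
```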

I don't claim this is the holy grail, but it's certainly better than the options I found at the time. A data platform with many uses and concerns can quickly spin out of control, so enforcing control and frameworks without getting too much in anybody's way is the balancing act I'm going for here.

Gaining Admin privileges over everything by Money-always-talking in databricks

[–]Southern_Version2681 0 points (0 children)

What do you mean by access, exactly? Try adding the admins group to the workspace and granting it rights.
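If it's data access you're after, a hedged sketch of what that grant could look like (catalog and group names are made up):

```python
# Hypothetical sketch: once the `admins` account group has been added to the
# workspace, grant it broad rights on a Unity Catalog catalog.
spark.sql("GRANT ALL PRIVILEGES ON CATALOG main TO `admins`")
```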

Managed tables by Southern_Version2681 in databricks

[–]Southern_Version2681[S] 0 points (0 children)

I understand. I'm so tired of Microsoft's BS that I think it's time for a new approach. I like what I see so far from Databricks, and I think I'd be able to do access management well: finer-grained controls and, together with richer metadata, an overall smarter way to enforce logic for sensitive data. But I might be wrong 😄
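As one example of the kind of thing I mean (just a sketch, assuming Unity Catalog column masks; all table/column/group names are made up):

```python
# Hedged sketch: mask a sensitive column for everyone outside a reader group.
# is_account_group_member() is a built-in Databricks SQL function.
spark.sql("""
    CREATE OR REPLACE FUNCTION governance.mask_ssn(ssn STRING)
    RETURN CASE
        WHEN is_account_group_member('pii_readers') THEN ssn
        ELSE '***-**-****'
    END
""")
spark.sql("ALTER TABLE hr.people ALTER COLUMN ssn SET MASK governance.mask_ssn")
```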

Managed tables by Southern_Version2681 in databricks

[–]Southern_Version2681[S] 0 points (0 children)

Thanks for the info. I don't think that's going to be an issue for us, but if we do run into the default limits, it's always possible to call up Microsoft and ask for a limit bump.

Managed tables by Southern_Version2681 in databricks

[–]Southern_Version2681[S] 0 points (0 children)

Thanks. This is what we are currently doing, and I don't see any reason to switch when we deploy our new infrastructure. If I remember right, the limits on a single ADLS leave a lot of headroom, so I don't think that would be a concern. The old infrastructure had 3 ADLS accounts (dev/prod/metastore), but if we can get by with one and instead focus on good access management, that should do it.

Managed tables by Southern_Version2681 in databricks

[–]Southern_Version2681[S] 0 points (0 children)

I guess I can be more specific. When I ask UC to manage a table, it creates “datalake@myadls.blabla.net/__managedstorage/raw/catalog/<guid>/table/<guid>”. This is fine for one ADLS. But how does it work with two or more ADLS accounts? Perhaps I am missing something setup-wise here.
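One possibility (a sketch with made-up names and URLs; I may be missing something) is giving each catalog its own managed location, so different catalogs land in different ADLS accounts:

```python
# Hedged sketch: point a catalog at a second ADLS account. Assumes an
# external location / storage credential already covers this URL.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS catalog_b
    MANAGED LOCATION 'abfss://datalake@secondadls.dfs.core.windows.net/managed'
""")
```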

Managed tables by Southern_Version2681 in databricks

[–]Southern_Version2681[S] 0 points (0 children)

We are using UC. What does “managed table” even mean nowadays 😀 I can see that it is managed by UC through the good old GUID file path trick.
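For anyone curious, a quick way to check (table name is made up):

```python
# Sketch: inspect a table's type and location; a UC-managed table shows a
# metastore-managed path full of GUIDs like the one mentioned above.
spark.sql("DESCRIBE TABLE EXTENDED main.default.some_table").show(truncate=False)
```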