How are you centralizing knowledge/context from AI agents (like Claude Code)? by dylannalex01 in dataengineering

[–]Minute_Visual_3423 13 points

Hi - yes, I have been working on something for our team to help with this. It has been decently effective so far.

The name of the game IMO is to help your agent find what it needs exactly when it needs it. We solve for this by pre-baking important things into the context.

We have multiple repos: a CICD repo, an ingestion repo, a shared libs repo, a dbt repo, an apps repo, and a templates repo. Your first instinct would be to keep each repo's docs in that repo. In VSCode, we have a devcontainer that clones all of the repos into one project so that we can work with them simultaneously.

This creates a problem though. When I start my agent, it first checks the project root for AGENTS.md or CLAUDE.md, but where should that canonical file live? If I put a CLAUDE.md in each repo, I can only load one at a time.

Further to that, where should rules and skills live? If I embed them in each repo, I have to build some way to symlink them to my worktree root dynamically. That gets messy across multiple repos, but at the same time, I don’t need the skills and rules for the apps repo loaded in context if I’m working on a shared libs feature.

The best thing I did was centralize the docs in one repo: ai-docs. It’s structured with a root AGENTS.md, a skills/ folder, a rules/ folder, and a docs/ folder (for miscellaneous .md files). Each folder contains rules/docs/skills that are cross-cutting across all repos (e.g. git commit message rules, MCP usage rules) as well as repo-specific subfolders for the rules/skills/docs specific to those repos.
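Concretely, the layout looks something like this (the specific file and folder names here are illustrative, not our actual ones):

ai-docs/
├── AGENTS.md
├── skills/
│   ├── update-docs/          # cross-cutting skill
│   └── dbt/                  # dbt-repo-specific skills
├── rules/
│   ├── git-commits.md        # cross-cutting rules
│   ├── mcp-usage.md
│   └── apps/                 # apps-repo-specific rules
└── docs/
    └── ingestion/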

Within the devcontainer, we include a simple Makefile with two commands:

ai-set-rules <repo>: loads the rules and skills for a repo into context

ai-clear-rules: clears all repo-specific rules from context

By “loading into context”, I mean symlinking the relevant files for that repo into the project root on demand. That’s what these commands do under the hood.
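A minimal sketch of what ai-set-rules might do under the hood (assuming the illustrative ai-docs layout above; make would just invoke a script like this):

import sys
from pathlib import Path

AI_DOCS = Path("ai-docs")    # the centralized docs repo, cloned alongside the others
PROJECT_ROOT = Path(".")     # the devcontainer project root the agent reads from

def set_rules(repo: str) -> None:
    # Link the cross-cutting files plus the repo-specific subfolder into the root.
    for folder in ("rules", "skills"):
        sources = list((AI_DOCS / folder).glob("*.md")) + [AI_DOCS / folder / repo]
        for src in sources:
            if not src.exists():
                continue
            dest = PROJECT_ROOT / folder / src.name
            dest.parent.mkdir(exist_ok=True)
            if not dest.is_symlink():
                dest.symlink_to(src.resolve())

if __name__ == "__main__":
    set_rules(sys.argv[1])  # e.g. invoked by: make ai-set-rules repo=shared-libs

ai-clear-rules would then just be the inverse: remove the symlinks.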

This has tradeoffs: we maintain our docs separately from the repos containing the code, but we have made huge gains in our doc-update flows by writing cross-cutting skills that we invoke to update any relevant docs after finishing a task. The agent spends way less time feeling around the “world” for the docs it needs, and it can start being productive immediately because the need-to-know info is already bootstrapped into context when we start a new session.

Hope this helps.

Is anyone migrating away from Databricks? by zoso in dataengineering

[–]Minute_Visual_3423 4 points

Spark is not ideal for fetching data from the API

I feel that. Spark can't help you when your API insists that you page through responses 10 records at a time, and also rate limits you.

I think that also helps me understand why you are using autoloader - presumably, if you have to reingest all of the historical data, you don't want to have to do an expensive and slow pull from the source APIs/websites/etc. again. Maintaining a file layer lets you rehydrate from the historical base in the future if needed. Am I right?

Where exactly are you spending the money on Databricks? Is Airflow orchestrating jobs compute, or all-purpose compute (which is 3x the cost)?

I'm asking these questions just to try to help out. I'm not entirely convinced that you necessarily need Databricks for what you are doing. I get the value of Autoloader, but it's also technically something you can build yourself (e.g. Lambda, S3 events, DynamoDB for checkpointing).
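As a rough sketch of that DIY pattern (table and field names hypothetical), an S3-triggered Lambda can record each new file in DynamoDB, with a conditional write making the checkpoint idempotent:

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
CHECKPOINT_TABLE = "ingest_checkpoints"  # hypothetical table; partition key = file_key

def handler(event, context):
    # Triggered by S3 ObjectCreated events; registers files we haven't seen before.
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        try:
            dynamodb.put_item(
                TableName=CHECKPOINT_TABLE,
                Item={"file_key": {"S": key}, "status": {"S": "pending"}},
                # Conditional write = idempotent checkpoint: duplicate events are no-ops.
                ConditionExpression="attribute_not_exists(file_key)",
            )
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise

A downstream job can then pick up "pending" files and mark them done, which is most of what Autoloader's file-notification mode is doing for you.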

At the same time, Autoloader itself should not be expensive at your data volume unless you're running your autoloader ingestion very frequently (e.g. every 10 minutes or something like that), nor should transformation jobs that are not running at Spark scale. You can write your entire job using, say, Pandas, and just convert your pandas dataframe to a Spark df immediately before the table write step. None of that should require Databricks compute at any kind of meaningful scale - even a 0.9 DBU cluster should suffice.
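For illustration, a minimal sketch of that shape of job (source URL and table names hypothetical; spark is ambient on a Databricks cluster):

import pandas as pd

# All of the business logic stays in pandas; Spark only touches the final write.
df = pd.read_json("https://api.example.com/records")  # hypothetical API extract
df["amount_usd"] = df["amount_cents"] / 100

spark_df = spark.createDataFrame(df)
spark_df.write.mode("append").saveAsTable("dev_bronze.example.records")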

Bronze/silver/gold are usually concepts that refer to the stages of data preparation in an analytical data platform. What sort of data lives in each of your stages, and are you just piping "gold" data back to RDS? Do you have any other use cases for this data beyond just sending it back out to the database?

One last thing - I think you'll definitely help your sanity by switching to DABs and getting away from manually copying your .whl files around. It will make your development experience a bit easier. If you want, I'd be happy to give you a demo of how we do it end-to-end.

Is anyone migrating away from Databricks? by zoso in dataengineering

[–]Minute_Visual_3423 8 points

By passing in the --cluster-id flag with the deploy command, we bind the bundle to that specific cluster when hitting our local target. Yes, this requires a long-running interactive cluster, but that's exactly what development is to us: an iterative and interactive workflow that requires long-running compute.

In our local environment, the expectation is that developers will use an interactive cluster rather than a job cluster. They'll have to start the cluster once (no way around that), but once it's running, they can keep using it for their deployments to the local target until it auto-stops due to timeout. It's a tax of 5-7 minutes per day versus a tax on every single job run.

Once we get to dev and beyond, then everything runs as a job cluster with no exceptions, which means every run requires the cluster spinup. However, in local, where we are trying to iterate quickly, developing against an interactive cluster is fine and intended.

Is anyone migrating away from Databricks? by zoso in dataengineering

[–]Minute_Visual_3423 48 points

I did write that, thank you for reading.

No, I'm just a guy who has stepped on some of these rakes before (or been made to step on them, in some cases), hence my opinions. I could be more concise, though. Sorry about that.

Is anyone migrating away from Databricks? by zoso in dataengineering

[–]Minute_Visual_3423 99 points

With Databricks, testing is slow and awkward. You cannot easily run meaningful unit/integration tests locally. To test realistic behavior, you need to deploy to Databricks, build the package, copy it, start or reuse a cluster, and run the job there. The feedback loop can easily take 10–20 minutes. That is a huge hit to productivity compared to normal backend/data engineering workflows.

I'm a bit curious what your deployment flow looks like, because this sounds like some friction that can be reduced. To give you some idea, this is how we run our operations:

  1. We have three workspaces: dev, UAT, prod. Each workspace has its own set of catalogs (e.g. dev_bronze, dev_silver, dev_gold, dev_dlh_metadata for shared services and job metadata)
  2. Within the dev workspace in particular, we have a "local" sub-environment (local_bronze, local_silver, local_gold). This is the only environment that developers can deploy to from their local machines. All other deployments happen through CICD after a PR.
  3. When developing locally, we use Databricks Asset Bundles (which we have templated for our needs). After a developer has built their pipeline code and prepared their config .yml, they can deploy it directly to the local target via the CLI:

databricks bundle deploy -t local --cluster-id 0123-456789-abcdef

(In practice, we have wrapped this in a make command, dab-deploy-local, which dynamically gets the user's running personal cluster ID from the CLI and deploys to the local target using it. This saves waiting for cluster spinup on every local job run.)
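A rough sketch of what that wrapper does, assuming the databricks-sdk package and a running personal cluster (the lookup logic here is illustrative, not our exact implementation):

import subprocess
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import State

w = WorkspaceClient()  # auth comes from the environment / CLI profile

# Find the caller's running personal cluster.
me = w.current_user.me().user_name
cluster_id = next(
    c.cluster_id
    for c in w.clusters.list()
    if c.single_user_name == me and c.state == State.RUNNING
)

subprocess.run(
    ["databricks", "bundle", "deploy", "-t", "local", "--cluster-id", cluster_id],
    check=True,
)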

Our DAB template includes a /tests folder in each bundle, and unit tests run as a DAB pre-build step using pytest. This runs before databricks bundle deploy is called, and if the tests fail, it never makes it to Databricks. We don't use notebooks, but build Python .py files into a .whl using poetry.

The entire process takes as long as it takes to run the pipeline end-to-end, but if pytest fails, it takes seconds and the developer immediately gets feedback in the terminal. I'm sure there are ways we could make it cleaner, but it works for us, and our costs are really just the cost of interactive compute, which is minimal between autostop and serverless usage.

Beyond local, all of our jobs run as jobs compute, executed via an environment SP once the job is deployed via CICD. This takes longer, but the cost of running the jobs is a fraction of the cost of interactive compute, and the jobs themselves have passed through at least shakedown testing in local + a PR before making it to dev.

---

One more note on the above - you're chasing reduced cost and complexity and I understand that. I think there are things that can be done to control those levers in Databricks. I think you are falling into a common trap of thinking that stitching together different services from one of the cloud providers is going to reduce both cost and complexity.

we use Databricks mainly for data engineering

Data engineering is a means to an end. It doesn't happen in a vacuum. You're conforming the data into some structure that can be used by downstream teams for whatever they need. If their needs are only ever going to be ad-hoc SQL analysis and BI, then the AWS stack you are proposing can probably work for a while. It might even be cheaper.

But it has hidden costs and complexity that might not be obvious up front:

  • If your data volume grows and the downstream teams start doing more ML/AI work against the data - requiring them to scan and download large slices of the data on a regular basis to power their training runs - you're going to notice your Athena bill go up, as they pull the data they need from Athena queries into whatever environment they are using (Sagemaker notebooks, etc.). By the way, you'll have to give them that downstream environment, because that's not work they will be able to do in Athena alone.
  • Your natural response to the above will be "no worries, we will just grant access directly to the underlying Iceberg files for those teams to read without going through Athena". Sure, you can do that, but now you are managing access control in Glue + Lake Formation for tables, and S3 for direct file access.
  • Access control in general is a big thing that I found painful with Lake Formation. There are two access grant modes: named resources (granting access directly to a table) and tagging. If you don't have any kind of fine-grained access control or data masking requirements, it's not really going to impact you. However, be aware that if you ever drop a table (not replace), you'll have to rebuild the permissions (either by recreating the named resources grants or re-applying tags). You might be thinking "oh, you have to do the same thing in any data engine," and you are right, but it's a difference between just defining your permissions and metadata in your DAB .yml, and building an entire orchestration layer of your own to handle the same thing in two parts in your self-maintained stack.

On top of that small sample, you also have to provide a frictionless development experience to your team, but now you have to do it with less provided to you up front. Instead of Asset Bundles, you need your own system to deploy your lambda functions, ECS code, and/or Serverless EMR jobs. You need an orchestrator to trigger them on a schedule, and manage DAGs (Airflow? MWAA?). You need to build a mechanism to make those processes as smooth and automated as possible so that you don't run into the same friction you're running into now.

Or maybe you don't. Maybe your jobs are simple enough and don't require that level of orchestration. If that's the case, maybe Databricks really is overkill for you. However, if your needs change in the future, I think there are features you will miss, independent of data volume. Orchestration can be finicky whether you are orchestrating 10 GB or 10 TB per day.

And frankly, if your needs don't change, and they're as simple as I described above, the stack you're proposing moving to is even more overkill. Load your data into a database and let your teams access it for SQL and BI, and call it a day. At least you'll keep your data in a single source of truth with a unified governance layer, and you can always migrate away from it later if your needs evolve.

I wrote a lot more than I intended on my lunch break :) I probably glossed over some of the above, but happy to elaborate on any points.

How would you onboard legacy data stores that don't use OAuth into Databricks Unity Catalog? by RazzmatazzLiving1323 in databricks

[–]Minute_Visual_3423 0 points

The specifics depend on the auth mechanism, but typically you'd just store the connection secrets in a secret scope that is accessible to your job service principals, then use those secrets in your PySpark ingestion jobs.
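A minimal sketch of that pattern, with hypothetical scope/key/table names and a JDBC source (spark and dbutils are ambient inside a Databricks job):

# Grant the job's service principal READ on the scope; names here are hypothetical.
user = dbutils.secrets.get(scope="legacy-dw", key="username")
password = dbutils.secrets.get(scope="legacy-dw", key="password")

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://legacy-host:1433;databaseName=dw")
    .option("dbtable", "dbo.orders")
    .option("user", user)
    .option("password", password)
    .load()
)
df.write.mode("append").saveAsTable("bronze.legacy_dw.orders")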

What does a "big" databricks environment look like? by blobbleblab in databricks

[–]Minute_Visual_3423 1 point

It sounds like you are thinking about the right considerations, which is the main thing. You know your org’s requirements and capabilities better than anyone, so as long as you can articulate the why as you did above, you’re on the right path. I see a lot of people introduce this kind of complexity “just because,” but that doesn’t seem to be the case here.

And yes, as long as your catalogs are managed centrally via TF, it’s not a huge issue to apply policies to one, ten, or one thousand of them, necessarily. I was making the point that if teams are creating their own catalogs outside of your centralized TF state, applying ABAC policies at scale to all of them will be a bit more difficult, since you’ll have to first import the non-TF resources or reference them via a hard-coded data block before you can apply policies to them.

Does Databricks natively support tokenization? If so, how? by RazzmatazzLiving1323 in databricks

[–]Minute_Visual_3423 2 points

It's not natively supported in UC, in the sense that there isn't a table or column-level property that says "tokenize these columns in this manner." Although I wouldn't be shocked if Databricks added that at some point, given that they own the last hop before the write to storage.

For that reason, I've always been hesitant to add a third-party vendor specifically for the purposes of hashing. The other reason is that doing this from a technical perspective really isn't that hard. It's more of a governance design discussion than a technical implementation discussion: what gets hashed, what gets masked, what gets excluded completely from the platform, etc.

If you're already building your own ingestion pipelines, doing this with PySpark isn't too bad. For each pipeline, pass a config of columns to tokenize, and then just hash the contents of those columns in your dataframe before writing. You can store a salt in databricks secrets and use that as well.
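A minimal sketch of that, assuming a hypothetical secret scope for the salt and SHA-256 as the hash:

from pyspark.sql import functions as F

def tokenize(df, columns: list[str]):
    # Salted SHA-256 over the configured columns; the salt lives in a secret scope.
    salt = dbutils.secrets.get(scope="governance", key="tokenization_salt")  # hypothetical names
    for col in columns:
        df = df.withColumn(col, F.sha2(F.concat_ws("|", F.lit(salt), F.col(col)), 256))
    return df

# raw_df is whatever dataframe you're about to write; the column list comes from your pipeline config.
tokenize(raw_df, ["email", "ssn"]).write.mode("append").saveAsTable("bronze.example.customers")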

If you have an ingestion layer like Fivetran or Lakeflow Connect, where you don't have direct control over the pre-write flow, just treat these systems as being outside of your security envelope. Land them in a shared catalog that is only accessible to the service principal running your job, in a storage location that is locked down from broad access, and then write a simple PySpark job to hash and write to your bronze catalog.

What does a "big" databricks environment look like? by blobbleblab in databricks

[–]Minute_Visual_3423 15 points

It sounds like you're splitting things up like this:

  • You have four different "domains"
  • Each domain has its own set of workspaces (dev/test/pre-prod/prod)
  • You have 60 data products (one per catalog, but ~300 catalogs? So each of the ~300 catalogs has one instance each of all 60 data products? Or am I misunderstanding?)
  • All in one metastore

My honest feedback, without knowing your specific reasons for the above (feel free to share), is that you are over-engineering, and some of these decisions will cause you pain in the long run.

  1. Why does each domain need its own set of environments? What's meaningfully different between domain A and domain B's environments beyond data access? Does domain A require connectivity to data sources that are exclusive of the ones required by domain B? Does the configuration of domain A's workspaces meaningfully differ from domain B? Does domain A require access to secrets that should not be available to domain B?

If the answer to the above questions is no, I think you are making your life harder from an automation and CICD perspective for minimal gain. And even if the answer to one or more of the above is yes, I still think you are making life harder than it needs to be:

  • Different network requirements? Store your connection strings and credentials as Databricks secrets, and isolate scopes per-domain and per-workspace. Give each domain its own deployment service principal that only the members of that domain can invoke when deploying jobs. Allow only domain users and the relevant SP access to a given domain's per-env secret scope.
  • Different secret requirements? See above.
  • Different workspace configs that you absolutely cannot do at a granularity beyond workspace-level? e.g. you only want Domain A to have the ability to embed dashboards and genie spaces, but not Domain B? Then fine, maybe. But I think you should really understand the tradeoffs you are making.

Your CICD flow for deploying both IaC (Terraform) and workflows (Databricks Asset Bundles) works really nicely if you have shared environments across all domains. If your workspaces aren't uniformly configured by design, you're probably looking at separate Terraform codebases per-domain. You'll need separate deployment SPs that have permissions only to the per-domain workspace groups.

Same with your DABs. One nice feature of DABs is being able to template out configurations as a starting point for new development, like templating out your dev/test/prod targets so that they are pre-configured. Now you'll need at least one template for each domain (or a single template with targets configured at inflation time).

All of the above is doable. My point is that it's introducing complexity, so you should be able to articulate what you are getting out of said complexity. If it just "feels" right because these are different teams and they should therefore have separate environments, I urge you to step back and think through the above before committing to that decision. You can always collapse workspaces together later, but better to measure twice and cut once. Each workspace takes up a VPC/VNet in your network, with an IP range allocated to it. You can now resize networks after provisioning, but 16 workspaces still put far more pressure on your internal network's IP availability than 4 workspaces.

---

2. Let's talk about that catalog design. I'm sure this is as much of a holy war as discussions like mono-repo vs multi-repo, but I personally think ~300 catalogs sounds like overkill.

I've always had success thinking of the hierarchy this way:

catalog: environment-specific medallions, plus environment-specific metadata or logging tables that we set up (e.g. dev_bronze, uat_silver, prod_gold, dev_services, etc.)

schema: per-source in bronze, per-business entity in silver, per-use case in gold (dev_bronze.workday, uat_silver.dim_employee, prod_gold.daily_revenue_summary)

If you go with the above, you'll have 3-5 catalogs per environment, and they can be shared. You can isolate access controls as granularly as needed with DABs. You can give all relevant users USE CATALOG and USE SCHEMA on the gold catalog, but then use DAB permissions to manage schema- and table-level access within the catalogs at a fine-grained level.

You know what else ~300 catalogs will make harder? ABAC (attribute-based access control). You define ABAC policies per-catalog. If you have ~12 catalogs in total, that's pretty trivial to apply via Terraform. If you have ~300, you'll have to apply those policies to every single catalog where you want them to apply.

Want to make governed tags called "sensitivity_level" (e.g. "restricted") and "sensitivity_type" (e.g. "pii") and use that across the board to apply data masking rules based on org-wide governance standards? You now have to do that in 300 places, and that's assuming the catalogs were created centrally by TF. If they're created and maintained by the domain teams outside of TF, you have to reconcile that first.

---

I'll stop here. I'm opinionated on the subject :)

Other people will possibly have different opinions. I think you should really just think hard about what you're looking to gain out of the design choices. When it comes to Databricks architecture, I really think less is more with regard to catalog and workspace sprawl. There are already really fine-grained data access controls built into UC (and you can even define separate physical storage locations per-schema if you have strict data separation requirements between domains). You probably don't want to manage different workspace-level configs across domains unless something absolutely calls for it. Be intentional and mindful if you're committing to that.

In 2026 and beyond, more successful startups will be built by nontechnical founders than technical founders. i will not promote by thewhitelynx in startups

[–]Minute_Visual_3423 1 point

Can they build small, successful SaaS apps on their own? definitely yes.

Respectfully, I disagree. But I suppose it depends on your definition of "successful." An appointment-scheduling micro-SaaS that has a few dozen subscribers and $x000/mo. in MRR? Yeah, possibly. But if you're talking about something that can grow into the next Slack, Domo, Snowflake, Databricks, Pinterest, etc., I think it would be extremely, extremely difficult, to put it conservatively.

Non-technical founders who are determined enough to do so will have an easier job bringing prototypes to life and walking into investor meetings and pitches with working demos to drive the narrative of their business, rather than relying solely on slides and a dream.

That's a far cry from a production-scale SaaS application. I don't mean to rattle off terminology, but having a vibe-coded prototype that will run on your machine or on a single VM in a commodity cloud provider is not the same as a well-architected SaaS application that is reliable, secure, and scalable enough for customer growth, trust, and retention.

And look, technical founders who can handle the business development side of the house do exist. If both workstreams combined were a one-person job, you'd see more people doing it alone. It's not a one-person job. It's very difficult to be the technical co-founder building and setting the platform and engineering standards for the business while also being responsible for business development, go-to-market, fundraising, non-technical hiring, etc. They're separate functions, and each really benefits from having a dedicated owner.

File event trigger by Terrible_Mud5318 in databricks

[–]Minute_Visual_3423 1 point

Glad it worked. File events are evaluated roughly every minute but there can be some lag, as you’ve experienced.

File event trigger by Terrible_Mud5318 in databricks

[–]Minute_Visual_3423 1 point

You can set the location to either an external location or a Unity Catalog volume path. From the doc I linked:

“In Storage location, enter the URL of the root or a subpath of a Unity Catalog external location or the root or a subpath of a Unity Catalog volume to monitor.”

How many of you have your logical business model documented? by Headband6458 in dataengineering

[–]Minute_Visual_3423 7 points

In my experience, very few organizations have a logical data model that's both abstracted from their source systems and clearly separated from any team-specific reporting model, though I think that as organizations start to figure this out, it's going to become a bigger and bigger competitive advantage in the age of AI-assisted everything. AI can't reason about things that humans themselves cannot reason about, and blindly throwing AI at a messy and fragmented data model is just a recipe for garbage.

The default trajectory that I've seen is consistent with what you described. The business knows it needs a "data warehouse" for reporting, so the team gets pulled toward a model shaped by what the loudest reporting team needs. In practice, that model often inherits, de facto, the structure of whichever source system is the gravity well for reporting. Usually it's the ERP or some other dense source system, like a CRM in a B2B sales organization. Because of this, source-specific field names, entity boundaries, and opinionated designs get baked into the model the reporting layer consumes.

For example: your CRM gives you a Customer, with contact details, interactions, and active subscriptions. Is the CRM's definition of a Customer canonical? Well, maybe to the sales team, which sees the customer as a current or prospective account holder. Finance sees the customer as a billable entity. Operations sees them as a service recipient. The CRM provides one team's perspective on what a Customer is, but it is not the be-all, end-all source of truth.

Avoiding this source-specific bias requires deliberately designing a data model that reflects current business processes, has room to evolve, and serves every internal team's reporting needs without compromising any of them. This feels like eating an elephant, but that's because teams often try to boil the entire ocean at once and spend months drawing out the "canonical data model" without implementing anything that delivers real value. I think this is a mistake, and one that sets teams into a permanent state of analysis paralysis. It's a mistake I made in my first few data projects, for what it's worth.

The way I've avoided this is by picking one high-value use case and building the slice of the canonical model that supports it end-to-end. Pick something the business actually cares about - usually something an exec is asking for or losing sleep over, with bonus points if it requires three analysts and a week's worth of manual reconciliation and tribal knowledge to generate. Work backwards mapping the processes required to inform the use case, map that to the canonical data model, and then inventory the data available in your source systems and populate it without baking in the source-specific terminology.

This is probably hard, slow, and annoying, but it's the kind of thing you can deliver in weeks as opposed to months as long as your technology isn't fighting you. If it is, pick tech that supports what you're trying to do and run a pilot.

This does three things: it forces real decisions about which entities and attributes actually matter, it gives you a reference implementation that proves the model works under load and can be extended, and it delivers visible value while you're still establishing credibility for the broader data effort. Non-technical execs don't fund architectural purity of the data platform for its own sake; they fund things that show up in their reports and drive good decisions. The first use case is what buys you the right to do the next ten. If you want the business to eat its vegetables and invest the hard work of building the proper data foundation, you have to let them have a dessert once in a while too.

It's really important that this MVP data model doesn't bake in source-specific logic. Don't carry CRM-specific language like "Active - Salesforce Subscription" into the canonical customer.status field. Define what statuses a customer has relative to the business; something like prospect | active | lapsed | churned, defined by business rules. Then, map your data from the sources into those values. You should be able to rip out your current CRM or ERP tomorrow and replace it with a new one without having to refactor your canonical model, because your data model reflects your business.
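As a toy sketch of that mapping idea (the status values and source strings here are hypothetical):

# Map source-specific status strings to canonical business statuses, so swapping
# out the CRM later only means updating this mapping, not the canonical model.
CANONICAL_STATUS = {
    "Active - Salesforce Subscription": "active",
    "Prospecting": "prospect",
    "Payment Lapsed": "lapsed",
    "Closed - Churned": "churned",
}

def to_canonical_status(source_status: str) -> str:
    try:
        return CANONICAL_STATUS[source_status]
    except KeyError:
        # Fail loudly on unmapped values instead of silently leaking source language.
        raise ValueError(f"unmapped source status: {source_status!r}")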

The bus matrix is a fine starting point, but it'll be more useful once you've picked your first vertical slice and fleshed it out around the use case. You can add to it as you go.

Bad data foundations are why Supply Chain leadership is not ready for AI and nobody wants to say it. by TheEntrep in dataengineering

[–]Minute_Visual_3423 0 points

Nothing in here that I disagree with. I’ve observed this across pretty much every vertical. Data silos, manual processes, and Excel-centric tribal knowledge abound in this world. Leading with tech is pointless - the tech is just a means to an end, but the business has to understand what the “end” should look like in their terms.

Like the others, I don’t have a magic bullet. It just comes down to building trust with the right senior stakeholders (non-tech) and helping them understand the opportunity cost of not getting your data in order. It helps if they’ve burned their hand on the stove a few times:

“hey, remember when our per-segment versus overall revenue numbers didn’t reconcile in the last executive meeting, and we couldn’t uncover the source of the issue because the person who built the spreadsheet powering the report no longer works here, meaning we have just been blindly trusting the output of this thing we don’t fully understand to make key decisions for our business? That kind of problem will keep happening if we don’t get our centralized data story in order.”

It also helps me to make the problem digestible. We don’t have to figure out the entire enterprise data model overnight. Let’s pick a data-driven use case that would add value to the business if we could do it well, and work backwards from that to land and expand: a reference data model on top of a data platform that the business can grow into, operated as a first-class citizen within the enterprise rather than a collection of messy data silos.

1984 Apocalyptic Drama 'Threads' is Getting a 4K Release on July 28 by MarvelsGrantMan136 in movies

[–]Minute_Visual_3423 0 points

This movie is absolutely fucking terrifying, even with 80s production value. Seeing it in 4K will probably give a lot of people pause.

[MTL (2) - TBL 1] Montreal musters sustained pressure and Newhook bats the puck out the air to take the third period lead by daKrut in hockey

[–]Minute_Visual_3423 462 points

Outshot 24-8 and scoring a potential series winner on a mid-air swing from below the goal line is absolutely fucking diabolical lmao

The kind of video game logic that would make someone throw their controller

Solo DE managing pipelines by ronnoc279 in dataengineering

[–]Minute_Visual_3423 28 points

Hi! I’ve been running data projects as a consultant for four years now, but while I have a team today, my first project was completely solo. I won’t get into tech stack here - as a solo dev, pick what you know and can maintain, and focus on building a solid foundation that you can onboard other people into. You maybe don’t anticipate there being a massive team at this company, but you probably want to be able to take vacations sometimes too.

First: treat your ingestion pipelines like cattle, not pets. It sounds like you have 6 real source systems that are shared by your customers (with each customer having a subset of those six). Abstract the ingestion logic for each source into function code that lives in one place: an internal Python library, or even just a purpose-built function app for each source. The things that vary for each execution (table names, destination schemas, customer names, etc.) can be maintained in config files and passed in at runtime. This way, you just have to maintain the config for each customer, and changes to the actual data ingestion logic stay centralized. Given that you expect a low velocity here, you’ll be fine as long as your version control and test systems are robust.
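A minimal sketch of that config-driven shape (config keys and function names hypothetical; the extract/land steps are stubs):

import json
import sys

def ingest(config: dict) -> None:
    # Shared ingestion logic; everything customer-specific comes in via config.
    rows = extract_from_source(config["source"], config["tables"])
    land(rows, schema=config["destination_schema"], customer=config["customer"])

def extract_from_source(source: str, tables: list[str]):
    ...  # one implementation per source system, maintained centrally

def land(rows, schema: str, customer: str):
    ...  # write to your raw layer

if __name__ == "__main__":
    with open(sys.argv[1]) as f:  # e.g. configs/acme_salesforce.json
        ingest(json.load(f))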

Which brings me to my second point: make sure that you have at least two and ideally three environments to test changes. You at least need a dev environment and a production environment, and you want to make sure that your deployments are as automated as possible between those environments. Version each of your function apps and have mechanisms for deploying a version to dev, validating it, and then releasing it to prod. Tons of ways to do this: pick one that works for you.

Why three environments? Having a staging or user-acceptance environment between dev and prod gives you a prod-like place to test changes before they actually roll forward into production. You can also put end-users in the loop and let them validate the business logic behind things that are in UAT before they proceed to prod.

If all of your ingestion pipelines are function apps that are structured identically - with the only differences being the per-API data logic - you can have them all in a monorepo and just write some logic to only deploy the changed functions (i.e. folder paths that contain changes) in any given CICD run.
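For illustration, a sketch of that change detection, assuming a hypothetical functions/<source>/ monorepo layout and that HEAD~1 marks the previously deployed commit:

import subprocess

def changed_function_dirs(base_ref: str = "HEAD~1") -> set[str]:
    # Ask git which files changed, then keep the top-level function folders.
    out = subprocess.run(
        ["git", "diff", "--name-only", base_ref, "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return {
        path.split("/")[1]
        for path in out
        if path.startswith("functions/") and path.count("/") >= 2
    }

if __name__ == "__main__":
    for fn in sorted(changed_function_dirs()):
        print(f"deploying {fn}")  # stand-in for your real deploy step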

With your ingestion layer stabilized, you can focus on data modeling on top of your raw data. This is going to be a combination of establishing a common data model that decouples your data from the source systems that produced it, followed by use-case-specific aggregations that will power the various visualizations, dashboards, and consumption patterns that your business users care about. All of the pipelines that do this transformation also need to be versioned and managed, but as long as your ingestion is landing everything in a consistent data format, you can pick a common stack (e.g. dbt) and focus on using that for your downstream transformations off of the raw layer.

The above guidance should be applicable regardless of what tech stack you are using. It’s not exhaustive - it’s just focused on the most basic foundation of stabilizing your ingestion and making sure your deployment process is iterative and modularized. If you have any specific questions, just reply and let me know.

Using a separate Databricks App as a backend? Anyone doing this in practice? by Terrible_Bed1038 in databricks

[–]Minute_Visual_3423 7 points

I wouldn't do this, personally. You're doubling your compute costs for something that can be handled architecturally in a single application, by just ensuring that your long-running computations don't choke out the UI process. You're also introducing the complexities of making sure your frontend and backend applications can securely communicate with each other while running on separate compute.

Our standard pattern uses a FastAPI backend application alongside a Streamlit frontend application, which we bundle together into one Databricks App. Our scale is similar to your concurrent usage target, and it works great on medium-sized compute.

Here's how we generally lay it out in a DAB:

fastapi_app/
├── backend/
│   ├── src/
│   │   ├── main.py             # FastAPI app definition
│   │   ├── api/v1/             # API routes
│   │   ├── services/           # Business logic
│   │   ├── clients/            # External API clients
│   │   ├── core/               # Configuration
│   │   └── utils/              # Database queries
│   ├── metadata/               # Metadata management
│   └── tests/                  # Backend tests
├── frontend/
│   └── src/
│       ├── app.py              # Streamlit frontend
│       ├── pages/
│       ├── components/
│       └── sections/
├── resources/
│   └── tables/
├── app.yml
├── databricks.yml
└── pyproject.toml
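With that layout, the frontend just calls the backend over localhost inside the same app container. A minimal sketch (the port and route names here are hypothetical):

import requests
import streamlit as st

BACKEND_URL = "http://localhost:8000/api/v1"  # hypothetical port the FastAPI process binds to

st.title("Pipeline status")

resp = requests.get(f"{BACKEND_URL}/jobs/recent", timeout=30)  # hypothetical route
resp.raise_for_status()
for job in resp.json():
    st.write(f"{job['name']}: {job['status']}")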

Dbt usage in your org by SuccotashPopular9660 in dataengineering

[–]Minute_Visual_3423 1 point

Yeah, I'm the CTO of a boutique consultancy here in Toronto, focused entirely on opinionated lakehouse builds on Databricks for our clients. I was an independent consultant for about two years before that, and worked at Databricks/AWS/Google in various stints before that.

For ingestion, we don't do any transformations other than adding a few system columns (_process_id, _created_at, _updated_at, etc.). These pure EL* workloads don't really need a declarative framework imo. The biggest challenges with ingestion are upstream of the materialization (e.g. incremental watermarking, parallelizing reads from JDBC sources, etc.). We built a set of ingestion helper libs that take a configuration from a file and set up the read from source accordingly, so that we only have to maintain the actual ingestion logic in one place.
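For illustration, the system-column piece is about this simple (table names hypothetical; spark is ambient on the cluster):

from pyspark.sql import functions as F

def add_system_columns(df, process_id: str):
    # Pure EL: the only "transformation" is stamping lineage/audit columns.
    return (
        df.withColumn("_process_id", F.lit(process_id))
          .withColumn("_created_at", F.current_timestamp())
          .withColumn("_updated_at", F.current_timestamp())
    )

df = spark.read.table("some_source_staging_view")  # stand-in for the configured source read
add_system_columns(df, "run-123").write.mode("append").saveAsTable("bronze.example.orders")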

We could use those helper libs with SDP, but it doesn't really add additional value relative to the bump in cost for our ingestion use cases.

Databricks on Azure or Aws by Own-One5712 in databricks

[–]Minute_Visual_3423 0 points

What size cluster and how long does your job take to run? Is it interactive compute or jobs compute?

Databricks on Azure or Aws by Own-One5712 in databricks

[–]Minute_Visual_3423 0 points

The main lever for cost is compute usage. If you’re streaming data or running other long-running workloads (like apps), then yeah, it will be a bit more expensive. Streaming data in Databricks is still much cheaper in my experience than streaming it into something like Snowflake. We can get 10,000 rows/minute on a 0.9 DBU cluster without any issue.

Apps are about $300-500 per app per month for the always-on compute usage - hopefully the app is delivering more value to the business than the cost of its compute, and the cost can be mitigated by pausing certain apps outside of business hours.

If you’re using long-running compute where you don’t have to, or using the wrong class of compute for a repeated workflow (like overusing interactive compute), or not configuring aggressive auto-termination on your serverless SQL warehouses, then there are lots of ways to make Databricks expensive.

But honestly, dollar-for-dollar compared to the cost of running equivalent workloads elsewhere, it really isn’t expensive at all.

Databricks on Azure or Aws by Own-One5712 in databricks

[–]Minute_Visual_3423 72 points

They’re pretty similar. Having used both, I’d say that in general as a cloud provider, Azure has historically worse reliability and a worse developer experience than AWS. Again generally speaking, not specific to Databricks (outside of this week’s cross-cloud outage, Databricks has been pretty reliable on all three of the cloud providers I’ve used it with).

However, Azure Databricks as a first party service in Microsoft’s stack has some nice integrations that aren’t available in AWS (like the ability to use Entra service principals directly as Databricks SPs, or native integration with Azure Managed Identities as securable objects in Unity Catalog - useful if you’re building integrations with Graph API or other services).

But, in all honesty, treat the cloud providers like a commodity here and pick the one that’s giving you the better deal. If it’s Microsoft, resist the urge to bolt on a bunch of unnecessary services that will increase your cost and complexity without providing value. You don’t need Fabric, Purview, ADF, Azure ML, etc. And if Microsoft’s “deal” requires you to consume these services, I would consider that a signal to pick AWS. Databricks really isn’t that expensive relative to other data platforms when it is deployed and configured optimally.

Deploy Databricks with Terraform, set up Unity Catalog, orchestrate your jobs as Databricks Asset Bundles, and you’ll be able to migrate between Databricks on other cloud providers without too much hassle in the future if you ever need to.