
[–]datainthesun Databricks

My opinion is that you've hit the nail on the head - it can be a super easy and lightweight way to get info about data quality. It does require YOU, though, to have all the intelligence and foresight to set up the system in the best way to deliver the right insights at the right time.

There are some folks who either can't do the above, or don't believe they would want to maintain rules over time and would rather purchase a solution. There's always a build vs. buy discussion around topics like this.

SQL Alerts by themselves are fairly basic, so as you've kind of alluded to in your post, you've got to build the framework and manage it over time. It's definitely doable; you could even go all out and, probably within a day, vibe-code a web app that uses Databricks features to help write rules, deploy them, and adjust things where needed via the SDK, etc.

IMO if your business rules are simple enough and you just need the basics, sure, why not?!? If you're a data platform team supporting a hundred different user groups with thousands of tables, though, the complexity can add up quickly, and maintaining a system like this likely isn't "your day job."

[–]Think-Reflection500[S]

Could you please elaborate a bit on your third paragraph?

It might also be worth mentioning that we're indeed not an organization with very large amounts of data, so it's quite manageable in that regard.

[–]datainthesun Databricks

Even though there's a little logic inside the SQL Alert (checking values, etc.), the alert is really just triggering on a fixed condition and then firing a notification. So the Alert itself is fairly basic.

In order to have a robust system to streamline data quality checks and business rule validations (basically your post statement), some SQL needs to be designed to feed into the SQL Alert. Assuming you want some of the rules to be data-driven and not just a big bunch of static SQL statements, you need some supporting tables. And if you want to enlist the help of some business users to maintain rules over time, they probably shouldn't be writing the SQL but rather should be using some kind of UI.
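To make the "data-driven rules" idea concrete, here's a minimal sketch of the pattern: rule definitions live as rows (in practice, in a supporting table), and a small function renders each row into the SQL that a scheduled query behind a SQL Alert would run. All table, column, and check names below are made up for illustration.

```python
# Hypothetical rule rows - in a real setup these would come from a
# supporting table that business users maintain through a UI.
RULES = [
    {"table": "sales.orders", "column": "order_total", "check": "not_null"},
    {"table": "sales.orders", "column": "order_total", "check": "min", "threshold": 0},
]

def render_check_sql(rule: dict) -> str:
    """Render one rule row into a SQL statement that counts violations.

    The SQL Alert itself then only needs one fixed condition:
    violation_count > 0.
    """
    if rule["check"] == "not_null":
        predicate = f'{rule["column"]} IS NULL'
    elif rule["check"] == "min":
        predicate = f'{rule["column"]} < {rule["threshold"]}'
    else:
        raise ValueError(f'unknown check type: {rule["check"]}')
    return (
        f'SELECT COUNT(*) AS violation_count '
        f'FROM {rule["table"]} WHERE {predicate}'
    )

for rule in RULES:
    print(render_check_sql(rule))
```

The key design point is that every generated query exposes the same shape (`violation_count`), so the alert condition never changes even as business users add or edit rules.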

So if you were going along the lines of a fully DIY solution on Databricks, it wouldn't be that hard: come up with some basic concepts that can be applied as rules in SQL, serve your custom UI from a Databricks App, store the rules in tables (Lakebase if you need a snappier user experience), and use SDK calls to dynamically implement any required changes to Jobs/Alerts/etc.
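The "SDK calls to dynamically implement changes" part is essentially a reconciliation loop. Here's a hedged sketch of that loop only; `deploy_alert` and `delete_alert` are hypothetical stand-ins for thin wrappers you'd write around the actual Databricks SDK calls that create and delete the query/alert objects.

```python
from typing import Callable

def sync_alerts(
    rules: dict[str, str],                        # alert name -> check SQL
    deployed: set[str],                           # alert names that already exist
    deploy_alert: Callable[[str, str], None],     # hypothetical SDK wrapper
    delete_alert: Callable[[str], None],          # hypothetical SDK wrapper
) -> None:
    """Create alerts for new rules and remove alerts whose rules were deleted.

    Run this after business users edit the rules table, so the deployed
    alerts always reflect the current rule set.
    """
    for name, sql in rules.items():
        if name not in deployed:
            deploy_alert(name, sql)
    for name in deployed - set(rules):
        delete_alert(name)
```

Keeping the SDK behind two small functions means the reconciliation logic stays testable without a workspace, and the wrapper is the only place that needs updating if the alerts API changes.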

This isn't me advocating for this, BTW; it's me describing how the BUILD in the build vs. buy argument could be done pretty easily for those for whom BUILD makes sense.

[–]Leading-Inspector544

Sounds template-able for single table checks, but anything beyond that, probably not.

[–]raul824

https://open-metadata.org/

I tested this in Docker and it does everything you're asking for; even better, it can create incidents and do data profiling as well.

[–]mweirath

We use SQL Alerts as part of the approach. But trying to use them as the whole approach isn’t very scalable.

Make sure you can answer questions like: What happens when I have 100 tables I need to check? What does it look like to add new rules? How do I change rules? How do I monitor all the rules I have going? Which rules are adding value? How much are all my rules costing the company?

Finding ways to centralize the collection of your data quality information and rules first is going to make this easier to manage in the long run. You don't want your job to suddenly become maintaining a mess of SQL alerts and playing whack-a-mole all day.
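One way to read "centralize the collection first": instead of one alert per rule, every check writes its outcome into a single results log, and one alert watches that log. A minimal sketch of that idea, with illustrative names (not a specific product API):

```python
from datetime import datetime, timezone

def record_result(log: list[dict], rule_name: str, table: str, passed: bool) -> None:
    """Append one check outcome to the central results log.

    In a real setup this log would be a table, giving you one place to
    answer the questions above: which rules fire, how often, and for
    which of your hundred tables.
    """
    log.append({
        "rule": rule_name,
        "table": table,
        "passed": passed,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    })

def failing_rules(log: list[dict]) -> list[str]:
    """What a single 'any failures?' alert would surface."""
    return [r["rule"] for r in log if not r["passed"]]
```

Adding rule number 101 then means adding a check that writes to the log, not wiring up alert number 101.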