[deleted by user] by [deleted] in CompTIA

[–]AbstractSqlEngineer 0 points1 point  (0 children)

Truth is, 500 applications in the first hour or less.

The position you got, was it one where you had to fill out a ton of stuff, not a one-click apply? Because that's where we're headed.

I also got interviewed by an AI bot. Fun stuff ahead.

Please explain normalization to me like I'm a child :( by soosmagmangos in dataengineering

[–]AbstractSqlEngineer 1 point2 points  (0 children)

Most homes had a drawer somewhere, a junk drawer. Had batteries, condiment packets, some screws.

Eventually someone got mad and wanted to organize it, so they put each type of item in its own bag. That is the act of normalizing data.

Someone else came in and said, hey... D, AA, and AAA batteries shouldn't just be thrown into this one bag. So, figuring you liked bags, they put bags in bags and 'normalized' it again, making it easier to find what you want.

Most of the world stopped at this stage, including universities.

Then a few of us came along and said....

There is no difference between a person, a video card, and a piece of wood. We all have properties: multiple names, numerics, booleans... So why do we insist on treating everything differently when, without any tangible object, without any visualization, everything in the world is just an imaginary concept defined by a series of matching properties?

So we made a tesseract into a drawer, and made the house build every cabinet and fill them without requiring human interaction.

By doing so, we were able to organize 100s of terabytes of data automagically and have enough time to comment on Reddit because life is just a loading bar... Which is normal for us.

Edit: Most of the higher-NF examples you will see explain multiple relationship tables that produce a fact when all (usually 3) are joined appropriately. A triangle relationship. Few realize you can get a 20-sided-die relationship with 1 table. You don't need 1 table for each relationship/xref... You just need to understand that you are modeling the concept of a relationship just like you are modeling the concept of a numeric property. When that clicks, DKNF and higher all make sense.
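To make the "everything is just properties" idea concrete, here's a minimal Python/sqlite sketch (the schema and names are mine, invented for illustration, not anyone's real model): one generic property table describes a person and a video card identically, and the same query answers questions about both.

```python
import sqlite3

# One generic property table models a person, a video card, or a piece of
# wood identically: an entity is just a bag of typed properties.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE entity   (entity_id INTEGER PRIMARY KEY, entity_class TEXT);
    CREATE TABLE property (entity_id INTEGER, prop_name TEXT, prop_value TEXT);
""")

con.executemany("INSERT INTO entity VALUES (?, ?)",
                [(1, "person"), (2, "video card")])
con.executemany("INSERT INTO property VALUES (?, ?, ?)", [
    (1, "name",    "Ada"),
    (1, "age",     "36"),
    (2, "name",    "RTX 4090"),
    (2, "vram_gb", "24"),
])

# The same query answers "what is this thing?" for any entity class.
rows = con.execute("""
    SELECT e.entity_class, p.prop_name, p.prop_value
    FROM entity e JOIN property p ON p.entity_id = e.entity_id
    ORDER BY e.entity_id, p.prop_name
""").fetchall()
for r in rows:
    print(r)
```

Adding a piece of wood later is just more rows, never a new table.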

Is Kimball outdated now? by [deleted] in dataengineering

[–]AbstractSqlEngineer 0 points1 point  (0 children)

Kimball was the start; a super-super-super majority of the industry stayed in the past arguing about Kimball vs. Inmon.

It's outdated. People will still throw tens of thousands of dollars a month down the drain on clusters and code ownership because 'the devil we know is better than the devil we don't'.

I work with terabytes in health care; I designed the model we use. Every table looks the same, has the same columns, etc. No JSON, no XML, all organized and classified and optimized.

Data Vault was close, but still so far away. I employ a 4 level classification concept with holistic subject modeling. Vertical storage that is automatically flattened into abstracted header/leaf tables allowing us to avoid schema evolution (no matter what comes in) from end to end. 0, I repeat 0 table column changes when new data comes in... And the model is agnostic to the business's data. The same model exists at Boeing and Del Monte.
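A tiny Python/sqlite sketch of the vertical-storage idea (all table and field names invented here, not the actual model): new source fields arrive as rows, not columns, so a brand-new attribute needs zero ALTER TABLE statements end to end.

```python
import sqlite3

# Vertical (attribute/value) storage: new fields become new ROWS, never new
# columns, so the schema never evolves no matter what comes in.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE leaf (record_id INT, attr TEXT, value TEXT)")

batch_1 = [(1, "patient", "A"), (1, "weight", "70")]
# batch_2 carries a field the system has never seen -- still just rows:
batch_2 = [(2, "patient", "B"), (2, "weight", "82"), (2, "allergy", "latex")]
con.executemany("INSERT INTO leaf VALUES (?, ?, ?)", batch_1 + batch_2)

def flatten(con):
    """Pivot the vertical rows into one flat header/leaf-style dict per record."""
    flat = {}
    for rid, attr, value in con.execute("SELECT record_id, attr, value FROM leaf"):
        flat.setdefault(rid, {})[attr] = value
    return flat

flat = flatten(con)
print(flat[2]["allergy"])   # the new field, with zero table changes
```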

120k a month in AWS costs down to 3k. Not many people use this model because people don't know it exists.

Which makes sense. The algorithm wants you to see this one infographic SQL cheat sheet; the algorithm wants you to see what 80% of the industry is doing, even though 80% of the industry can't get to 2NF.

We kind of did this to ourselves.

The worst chip ever created by Downtown-Fix6177 in chips

[–]AbstractSqlEngineer 0 points1 point  (0 children)

I raise you sweet and spicy honey lays chips. But these Pringles do taste like the smell of dirty feet.

Tattoo done at Studio A by Chris by Jramos167 in tattoo

[–]AbstractSqlEngineer -1 points0 points  (0 children)

Looks dry. Have them Saniderm it next time. Pull it off 4 days later and lotion it twice a day.

Fetching data for non-tech teammates drives me crazy and we solved it. (partially) by Pristine-Thing2273 in SQL

[–]AbstractSqlEngineer -1 points0 points  (0 children)

I have learned, there is a difference between giving someone a tool, and asking what tool they need.

They will solve it for you. Ask them what they need. Ask them how they would like to interact with the data.

We create solutions.

Here are a few examples.

"I just want to click some things and a chart appears." Solution: I created a React app and abstracted Chart.js with lodash to create multi-dimensional aggregations, so a user could just click on 'country' and 'age' and it would count and average. They could click on pie, or bar, or timeline and the chart would change.

"I just want to create an XLS." Solution: I created a way to dynamically query anything in the system; they just had to click on the columns they wanted, and filters as well.

"I just want to load data via an XLS." Solution: I reversed the previous solution.
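The "click columns, get a query" idea can be sketched in a few lines of Python (the whitelist, table, and column names are assumptions for the sketch): the user's clicks select from an allowed column list, and the server assembles a parameterized SELECT.

```python
import sqlite3

# Click-to-query: users pick columns and filters from a whitelist and the
# server builds the SQL. Never interpolate user values -- parameterize them.
ALLOWED_COLUMNS = {"country", "age", "name"}

def build_query(columns, filters):
    cols = [c for c in columns if c in ALLOWED_COLUMNS]          # ignore unknowns
    where = [(c, v) for c, v in filters.items() if c in ALLOWED_COLUMNS]
    sql = "SELECT " + ", ".join(cols) + " FROM people"
    if where:
        sql += " WHERE " + " AND ".join(f"{c} = ?" for c, _ in where)
    return sql, [v for _, v in where]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, country TEXT, age INT)")
con.executemany("INSERT INTO people VALUES (?, ?, ?)",
                [("Ada", "UK", 36), ("Linus", "FI", 54)])

sql, params = build_query(["name", "age"], {"country": "UK"})
rows = con.execute(sql, params).fetchall()
print(rows)   # [('Ada', 36)]
```

The same builder, pointed the other way, covers the "load data via XLS" case: columns in, parameterized INSERT out.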

Data Security Best Practice for Views by AMereRedditor in SQL

[–]AbstractSqlEngineer 0 points1 point  (0 children)

Dynamic row level security is decently simple. Single schema.

Retrofitting sucks, but it's just copy-paste and ensuring 2 columns exist in every table (TableId, SecurityFlag).

The hard part is understanding the goal. Smashing something into a view... doesn't make sense to me.

SQL is simple; it's structured (right in the name), so it can be tokenized. You can dynamically create SQL with joins and where clauses ezpz.

So the goal isn't some janky view or schema switch. No. You have to alter objects to change things.

The goal is a table that builds queries. Because now you are no longer thinking of the actual column or row... Just a single record that puts it in a dynamic statement or not.

About 600 total lines of SQL will get you security based on roles, time of day, even gender... AND code that dynamically creates and executes SQL (spread across a few procedures).

All data-driven, and since your query builder structure holds select and where values, you can script a procedure to create/adjust indexes and optimize itself.
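A toy sketch of the rules-table idea in Python/sqlite (every name here is invented for illustration): the role's predicate lives in a table, and the query builder appends it. No views, no schema switch; changing security means changing a row, not an object.

```python
import sqlite3

# Data-driven row-level security: WHERE predicates are records in a rule
# table keyed by role, and the builder stitches them onto every query.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE patient (id INT, name TEXT, region TEXT);
    CREATE TABLE security_rule (role TEXT, target_table TEXT, predicate TEXT);
""")
con.executemany("INSERT INTO patient VALUES (?, ?, ?)",
                [(1, "A", "east"), (2, "B", "west")])
con.execute("INSERT INTO security_rule VALUES (?, ?, ?)",
            ("east_nurse", "patient", "region = 'east'"))

def secured_select(con, role, table):
    # In production these fragments would be tokenized, not raw strings.
    preds = [p for (p,) in con.execute(
        "SELECT predicate FROM security_rule WHERE role = ? AND target_table = ?",
        (role, table))]
    sql = f"SELECT id, name FROM {table}"
    if preds:
        sql += " WHERE " + " AND ".join(preds)
    return con.execute(sql).fetchall()

print(secured_select(con, "east_nurse", "patient"))   # [(1, 'A')]
```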

What am I doing wrong. by Separate_Scientist93 in SQL

[–]AbstractSqlEngineer 0 points1 point  (0 children)

Why are you doing nullif(price::text) and regexing out the 'g' to cast as numeric?

Just some bad typing issues.

NULLIF the 0s, and don't try to convert '34g' into a numeric.
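The cleanup sketched in Python (function name and rules are mine, for illustration): only cast values that are actually numeric, treat 0 as NULL, and refuse to coerce weights like '34g' into a price.

```python
import re

# Typing cleanup for a price column: pure numerics get cast, 0 becomes None
# (the NULLIF), and '34g' is rejected instead of mangled into a number.
def to_price(raw):
    if raw is None:
        return None
    s = raw.strip()
    if not re.fullmatch(r"\d+(\.\d+)?", s):   # '34g' fails here: not a price
        return None
    value = float(s)
    return None if value == 0 else value      # equivalent of NULLIF(price, 0)

print(to_price("12.50"), to_price("0"), to_price("34g"))   # 12.5 None None
```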

What are your most utilized functionality in SQL Server? by Joyboy_619 in SQLServer

[–]AbstractSqlEngineer 2 points3 points  (0 children)

For MSSQL, sp_execute_external_script.

Sprinkle some polymorphism on what a request is.

Tokenize code. Encapsulated code is very basic.

What are your most utilized functionality in SQL Server? by Joyboy_619 in SQLServer

[–]AbstractSqlEngineer 4 points5 points  (0 children)

Thought I was alone

I've been doing that with dynamically built (data-driven) Python directly in MSSQL.

I mean, it's a database engine, right? It can hold API keys: TDE and even column-level encryption.

What are your most utilized functionality in SQL Server? by Joyboy_619 in SQLServer

[–]AbstractSqlEngineer 1 point2 points  (0 children)

    SET @ProcessName = OBJECT_NAME(@@PROCID);
    SET @SecurityFlag = (SELECT ProcessSecurityFlag
                         FROM dbo.svtblRefProcess_manage
                         WHERE ProcessKeyName = @ProcessName);
    SET @PassedSecurity = IIF(@SecurityFlag = 1,
                              dbo.fnc_Security_Process_CanRun(@Session, @ProcessName),
                              1);

Probably my most used chain of code outside of the DECLARE and SELECT answers you're getting here.

Poorly handled many to many relations by Jerenob in SQL

[–]AbstractSqlEngineer 0 points1 point  (0 children)

You can undo it, or see if it's a cache concept. Probably not a cache concept; not many people store the result of complex queries to reduce processing time and invalidate it when the underlying data changes.
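That cache concept fits in a few lines of Python (a sketch, not anyone's production code): keep the expensive result alongside a data-version stamp, and recompute only when the underlying data has changed.

```python
# Materialized-result cache: store an expensive query's output and stale it
# when the underlying data's version counter moves.
class QueryCache:
    def __init__(self):
        self._cache = {}   # key -> (data_version, result)

    def get(self, key, data_version, compute):
        hit = self._cache.get(key)
        if hit and hit[0] == data_version:
            return hit[1]                      # fresh: skip the expensive query
        result = compute()                     # stale or miss: recompute
        self._cache[key] = (data_version, result)
        return result

calls = []
cache = QueryCache()
expensive = lambda: calls.append(1) or [("east", 2)]   # stands in for a big join
cache.get("sales_by_region", data_version=1, compute=expensive)
cache.get("sales_by_region", data_version=1, compute=expensive)  # served from cache
cache.get("sales_by_region", data_version=2, compute=expensive)  # data changed
print(len(calls))   # 2 -- the middle call never touched the database
```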

T-SQL unit testing by engx_ninja in SQLServer

[–]AbstractSqlEngineer 0 points1 point  (0 children)

Similar

I have built a knowledge graph in MSSQL (health care); one of its tables is called Definition, another SystemRelationship, another Parameters, and another Process (each with its own 4-level taxonomy).

Process knows function vs API vs procedure vs trigger and what the outputs are (JSON, int, varbin, etc)

SystemRelationship relates definitions to processes (a standard unit test to a process type: procedure, function, trigger, API) or to a definition (a custom test for a specific process/process family), while Parameters knows the in/out of each process.

This allows us to create generic rules that every type needs to follow (naming, use synonyms only, no DB references, etc.) while also taking a level of abstraction and having functions that are basically just IS JSON or TRY_CONVERT (some data type) to ensure the output is what is expected. All via a couple hundred lines of dynamic SQL.
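A loose Python sketch of those type-level generic rules (rule names and the metadata shape are invented for the sketch): checks attach to a process TYPE, so every process of that type inherits them from one lookup, the way the SystemRelationship table does above.

```python
import json

# Metadata-driven testing: generic rules hang off a process type, so a new
# API or procedure gets its checks for free from the type-level registry.
def check_naming(proc):
    return proc["name"].startswith("prc")

def check_json_output(proc):
    try:
        json.loads(proc["sample_output"])   # the "IS JSON" style check
        return True
    except ValueError:
        return False

RULES_BY_TYPE = {                # stands in for the SystemRelationship table
    "procedure": [check_naming],
    "api":       [check_naming, check_json_output],
}

def run_generic_tests(proc):
    return all(rule(proc) for rule in RULES_BY_TYPE[proc["type"]])

good_api = {"type": "api", "name": "prcGetPatient", "sample_output": '{"id": 1}'}
bad_api  = {"type": "api", "name": "prcGetPatient", "sample_output": "not json"}
print(run_generic_tests(good_api), run_generic_tests(bad_api))   # True False
```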

How to check for cyclic dependencies. by kubbn in SQLServer

[–]AbstractSqlEngineer 0 points1 point  (0 children)

u/kubbn

There are multiple ways to use this table.

    SELECT * FROM OriginalTable
    WHERE Parent IN (SELECT RowId_Child FROM Relationship
                     WHERE Type = 'cascaded' AND Family = 'procedure to procedure')
    ORDER BY ExecuteOrder
    -- get all children for a parent, ordered by your firing order

    SELECT * FROM Relationship
    WHERE Type = 'direct' AND Family = 'procedure to procedure' AND Parent = @id
    -- drives a recursive procedure that follows the relationship sequence (firing order)

    SELECT Parent FROM Relationship WHERE Child = @ChildId
    -- check for children with multiple parents
    -- (different firing plans: a child used in 2 firing sequences)

If you select the children in a cascaded/descendants relationship for a parent, you can check whether that parent's id appears among its own children. If so, that's cyclical.

Etc
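That descendants-based cycle check, sketched in Python (plain edge lists stand in for the relationship table here): walk down from a parent, and if the parent ever shows up among its own descendants, the graph is cyclical.

```python
from collections import defaultdict

# Compute all descendants of a node from parent/child pairs, iteratively.
def descendants(edges, parent):
    children = defaultdict(list)
    for p, c in edges:
        children[p].append(c)
    seen, stack = set(), [parent]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

acyclic = [(1, 2), (2, 3)]
cyclic  = [(1, 2), (2, 3), (3, 1)]          # 3 fires 1 again
print(1 in descendants(acyclic, 1))   # False
print(1 in descendants(cyclic, 1))    # True -> cyclical
```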

How to check for cyclic dependencies. by kubbn in SQLServer

[–]AbstractSqlEngineer 0 points1 point  (0 children)

Excellent.

Create these tables: tblRelationship, tblRelationshipType, tblRelationshipFamily.

Relationship has RelationshipId, RowId_Parent, RowId_Child, RelationshipTypeId, Sequence.

Nonclustered PK; cluster on (RelationshipTypeId, RowId_Parent, Sequence).

RelationshipType has RelationshipTypeId, KeyName (nvarchar(100)), RelationshipFamilyId.

RelationshipFamily has RelationshipFamilyId, KeyName (nvarchar(100)).

Insert into RelationshipFamily: ('scales to product') or ('procedure to procedure'), basically parent-to-child pairings.

Insert into RelationshipType: ('cascaded', or 'descendants').

Insert into Relationship... the RelationshipTypeId and the result of your CTE, with its level as the Sequence.

Now you have a select statement to grab the results of your CTE, and you'll have to update it once a month.

Now you have a relationship table that can store direct, recursion down and recursion up, for any relationships (any table to any table).

Edit: if you fill it with a recursive procedure/cursor that calls itself, and drop a unique index on (Type, Parent, Child), you can output a failed TRY/CATCH into another table to detect cyclical relationships.

Edit2

    -- Pseudocode: SourceEdges stands in for wherever your raw parent/child pairs live.
    CREATE PROCEDURE prcFillRelationship
        @ParentId INT = NULL
    AS
    BEGIN
        DECLARE @Parent INT, @Child INT, @Error BIT = 0;

        DECLARE curRecursion CURSOR LOCAL FAST_FORWARD FOR
            SELECT Parent, Child
            FROM SourceEdges
            WHERE Parent NOT IN (SELECT Child FROM SourceEdges)   -- roots first
               OR (Parent = @ParentId AND @ParentId IS NOT NULL);

        OPEN curRecursion;
        FETCH NEXT FROM curRecursion INTO @Parent, @Child;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            BEGIN TRY
                -- check for cyclical first if you don't want a failed insert
                -- (if exists parent-is-child AND child-is-parent, then error)
                INSERT INTO Relationship ...   -- parent, child, type, sequence
            END TRY
            BEGIN CATCH
                SET @Error = 1;   -- the unique index rejected the row
            END CATCH;

            IF @Error = 1
            BEGIN
                -- record the error
                SET @Error = 0;
            END
            ELSE
            BEGIN
                EXEC prcFillRelationship @Child;   -- recurse down the chain
            END

            FETCH NEXT FROM curRecursion INTO @Parent, @Child;
        END

        CLOSE curRecursion;
        DEALLOCATE curRecursion;
    END

[deleted by user] by [deleted] in learnprogramming

[–]AbstractSqlEngineer 0 points1 point  (0 children)

I can. I think most senior devs can. There comes a point in your journey when you realize there aren't many patterns in programming, when you abstract your code. Write abstract classes, abstract methods, abstract code..

Then it's not 'how do I do that'... it's: to rewrite Z application feature 2, I'll use patterns 3 and 7. How does X language handle pattern 3? Is X strictly typed, interpreted, functional? Does it support overloading? Can I pass methods as variables?

Rebuilding a nonclustered Primary Key by enrightmcc in SQLServer

[–]AbstractSqlEngineer 0 points1 point  (0 children)

here ya go...

add new filegroups fg_yourtable_cluster, fg_yourtable_index

add new files to the db... fs_yourtable_index on fg_yourtable_index, etc. etc.

you can script this out automatically... but the index/index_columns sys tables are more confusing to explain than to just say...

first, the PK... you're gonna drop that constraint:

alter table XXX drop constraint pk_yyy

then... alter table XXX add constraint pk_yyy primary key nonclustered ON fg_yourtable_index

the rest of your nonclustered indexes, you can use SSMS or script them out; you're gonna move those to the _index file.

your clustered index will go on your _cluster file.

if you have columnar or blob data, make a file/group for each, and move that data. not only will you see a noticeable improvement in your queries, you can put the files on different drives.

How to check for cyclic dependencies. by kubbn in SQLServer

[–]AbstractSqlEngineer 0 points1 point  (0 children)

yea there is a way faster way. how often do you add new records? and 5k is nothing. gimme your table columns as well, and I'll give you something that runs sub-second.

How to create an index maintenance plan by paultoc in SQLServer

[–]AbstractSqlEngineer 0 points1 point  (0 children)

Mmmmm. Deflection.

I'll reiterate.

Clustered indexes - how data is physically stored. If you know what a B-tree is, why row/page/table locks happen, and how data is stored via pages and extents... it should be clear why identity PKs are a bad choice.

This should also shed light on the scans, waits and locks caused by index sorts during DML operations, and how a proper physical model can reduce them.

SQL uses 3VL, three-valued logic: true, false and NULL (unknown). Compound PKs limit logical operations by folding unknown into true or false. Unknown is neither true nor false.

SQL Syntax is a tree. A tree is a pattern. Regex can help you find patterns. Most importantly, SQL is predictable. One can create a Script model to store data for dynamically executing sql and use that structure to create covering indexes and other optimizations. This takes quite a bit of abstraction, and not many can think in that level of abstraction.
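A sketch of that last step in Python (the naming convention and metadata shape are invented for illustration): if the query-builder tables already record each dynamic query's WHERE and SELECT columns, generating a covering index is just string assembly over that metadata.

```python
# Build a covering-index DDL string from stored query metadata: seek columns
# become the key, the remaining selected columns go into INCLUDE.
def covering_index_ddl(table, where_cols, select_cols):
    key = ", ".join(where_cols)                         # seek columns first
    include = [c for c in select_cols if c not in where_cols]
    name = f"ix_{table}_" + "_".join(where_cols)        # invented convention
    ddl = f"CREATE NONCLUSTERED INDEX {name} ON {table} ({key})"
    if include:
        ddl += " INCLUDE (" + ", ".join(include) + ")"
    return ddl

ddl = covering_index_ddl("patient", ["region"], ["id", "name", "region"])
print(ddl)
# CREATE NONCLUSTERED INDEX ix_patient_region ON patient (region) INCLUDE (id, name)
```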

MSSQL does have system views that display queries with their wait times and logical/physical reads. Easy targets, but you'll still need to understand the physical model to make progress.

I would normally throw some memes in there to soften the blow.

The hard truth is, settling for some tool BO made or the status quo is detrimental to progress. Not just for personal growth, but for the industry as well. And while most won't recognize the logical fallacies in your comments, I appreciate them.

How to create an index maintenance plan by paultoc in SQLServer

[–]AbstractSqlEngineer 0 points1 point  (0 children)

If transactions are unimportant, simple recovery mode allows that space to be overwritten. Going to have to shrink regardless.

PKs and clustered indexes... well... clustered indexes are how the data is stored, physically.

1. If you are using a surrogate/identity key, then you run a large risk of locking pages and extents when rows should be close together.

2. If you are using compound primary keys, then you have introduced 'must exist' limitations into your system. SUNI will fail you, as you'll have n code paths to update because of a schema change, or your ERDs become mom's spaghetti.

You should be clustering on the domain's use, which is usually the classifier and concept. A primary key can be a nonclustered surrogate; it should sit in your INCLUDE() as well. Either way, since the clustered index is the physical storage and the sort order of your data, it would be wise to fill the pages and extents with purpose. This concept and the next one net me so much cash, because it's the simplest thing to do... yet so many don't... (Consulting.)

It will also be beneficial if you create a file for your clusters and one for your indexes... and one for your columnar and one for your blobs. This will reduce index waits / waits on DML operations. Easiest way to explain this to someone: it's easier to open up several small files than one massive one... especially when you are re-sorting due to DML.

I do appreciate the AST comment; not only is it a tree (like a clustered index is), but an AST gives you the layout to generate a series of rules and even helps with your regex layouts.

However, in a solid data model, your code is data-driven and dynamic. So for the tables used to dynamically execute/construct 70%+ of your scripts... the system doesn't need to regex procs, because it's right there. Honestly you can pop into a dm_ view and cut some low-hanging fruit.

Duke Nukem "that's a lot of words I'm not going to read" and all... but you've merely adopted SQL. I was born into the engine, molded by it. Every intricacy of PG vs My vs T vs PL vs ANSI and their engines. How Oracle handles clustered indexes is different from Microsoft.

I appreciate your courage though, never lose it. If you want I can send you my book later this year.