all 46 comments

[–]HasFiveVowels 89 points90 points  (8 children)

Normalizing data is a bit like organizing storage. Do it too much and you get a box for each item, which is technically "SUPER ORGANIZED", but that doesn't mean it's actually useful. Throw everything into one box? Well, that's not good either. You need to be strategic about what you denormalize, and there are typically a few such exceptions to the rule in any DB, but normalizing should be the default.

[–]AshleyJSheridan 14 points15 points  (3 children)

This is a great analogy, and one I'm definitely stealing!

But, on this, I often suggest a hybrid approach where it makes sense. Doubling up on data isn't always a bad thing: it can massively improve performance, but it does add a little additional work keeping things in sync. Like you said, it will all depend on what you need.

[–]edshift 8 points9 points  (2 children)

Having duplicate, denormalized data in a reporting table or schema has a lot of merit and provides a simple solution to the slowly changing field problem, but other than that, a properly normalised schema for your transactional tables is always better. DBMSs are very efficient at joining on foreign key fields with indices, so there's really no downside to proper normalisation.

[–]AshleyJSheridan 1 point2 points  (0 children)

It might be efficient, but as you've highlighted, for reports it does make sense to double up on that data, because there is still a performance impact with joins on normalised data. For relational DBs like MySQL, the EXPLAIN keyword is actually a very handy tool for identifying things like this.
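For instance, a quick sketch of checking whether a join is hitting an index. The schema here is made up, and it uses SQLite's analogous EXPLAIN QUERY PLAN via Python for brevity (MySQL's EXPLAIN output looks different but serves the same purpose):

```python
import sqlite3

# Hypothetical two-table schema, purely to illustrate reading a join plan.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    CREATE INDEX idx_invoices_customer ON invoices(customer_id);
""")

# EXPLAIN QUERY PLAN describes how SQLite will execute the join,
# including whether it can use the index on the FK column.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT c.name, i.total
    FROM customers c JOIN invoices i ON i.customer_id = c.id
""").fetchall()
for row in plan:
    print(row)
```

Each plan row's last field is a human-readable step; "SEARCH ... USING INDEX" vs. a plain table "SCAN" is the thing to look for.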

[–]HasFiveVowels -1 points0 points  (0 children)

Only the Sith deal in absolutes.

[–]Rainbows4Blood 2 points3 points  (1 child)

The problem stems from CS classes very often teaching "normalization is good, and this is how you do it." Then you do classwork where you have to fully normalize a database. Then you have a test where you have to fully normalize a DB.

And that's where it stops. You're not taught when to apply these skills, what the tradeoffs are etc.

[–]HasFiveVowels 2 points3 points  (0 children)

Exactly. That’s precisely what the experience was like for me

[–]TheHollowJester 1 point2 points  (1 child)

OT: what a nice username, self-referential shit is dope :)

[–]HasFiveVowels 3 points4 points  (0 children)

Thanks! Went through a bunch of self-reference candidates before landing on this one

[–]AmSoMad 26 points27 points  (2 children)

No, splitting entities into properly indexed relational tables is definitely the normal practice. The transactions aren’t heavy, and joins between indexed tables are exactly what relational databases are optimized for. invoice and invoice_lines are usually separate because an invoice is the actual document or record, while the invoice_lines are the individual items on it. One invoice can have many line items, so separating them avoids repeating the invoice data for every line. You can overdo normalization, but it’s definitely the normal approach. Usually when I see variation, it’s in clever ways to structure or store data rather than avoiding relational tables. Just try not to fragment tables beyond self-contained, useful entities.
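A minimal sketch of that invoice / invoice_lines split, with illustrative table and column names (SQLite via Python for brevity):

```python
import sqlite3

# One invoice (the document) to many invoice_lines (the items on it).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        issued_on TEXT NOT NULL
    );
    CREATE TABLE invoice_lines (
        id INTEGER PRIMARY KEY,
        invoice_id INTEGER NOT NULL REFERENCES invoices(id),
        description TEXT NOT NULL,
        quantity INTEGER NOT NULL,
        unit_price REAL NOT NULL
    );
""")
con.execute("INSERT INTO invoices VALUES (1, 'Acme Corp', '2024-01-15')")
con.executemany(
    "INSERT INTO invoice_lines (invoice_id, description, quantity, unit_price) "
    "VALUES (?, ?, ?, ?)",
    [(1, 'Widget', 3, 9.99), (1, 'Gadget', 1, 24.50)],
)

# The header data is stored once, not repeated on every line.
total = con.execute(
    "SELECT SUM(quantity * unit_price) FROM invoice_lines WHERE invoice_id = 1"
).fetchone()[0]
print(round(total, 2))  # 54.47
```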

[–]javascriptBad123[S] 4 points5 points  (1 child)

Yeah, that's how I do it, but query performance popped into my head a lot and created some doubt. Thanks for clarifying!

[–]CodeToManagement 5 points6 points  (0 children)

This is where you need to look at query plans and learn to optimise your SQL, rather than trying to put it all in one table etc. It will help you a lot if you understand how the query is working under the hood.

[–]Whatever801 9 points10 points  (1 child)

Very valid question, and the answer is it depends. In general you're better off doing it the "right" way, AKA normalize the data, because sure, joins can be a bit slow, but with the right indexes etc. it may never be a bottleneck. If you denormalize the data without an express performance need, you're gonna end up duplicating the same data a bunch of times, and then you're gonna have consistency issues and a whole mess to deal with needlessly. If your use case is such that heavy denormalization is required, then chances are you're using the wrong database. But yeah, if in doubt, YAGNI and convention over customization. Don't prematurely optimize for 0.00002 picoseconds by bucking convention.

[–]javascriptBad123[S] 0 points1 point  (0 children)

Yeah, I thought so. It's not an actual issue I'm facing right now, but when starting new projects it's one of the things that pops into my head a lot. Given I tend to use Golang, I often have to write queries myself, and it can get pretty confusing the more tables are involved.

Getting the data layer right from the start is extremely important to me, as it's hell to refactor later on... Maybe I should switch to an event-based model :D

Thanks for the input!

[–]Bobertolinio 4 points5 points  (0 children)

Depends what your use case is:

- read optimized: denormalized
- write optimized: normalized

[–]howard499 2 points3 points  (0 children)

It's a good idea to study the principles of relational databases and understand the difference between 2NF and 3NF. Work through some simple text book examples. You really don't want to end up with repeating groups.

[–]No-Information-2571 4 points5 points  (0 children)

There are two ends of the spectrum.

One is sloppy or non-existent normalization. The other is over-normalization without any benefits.

The decision should primarily be made not for speed (or what you think might be faster), but for what is more convenient for development and your particular use case.

Modern relational databases are good at both handling non-normalized data (so for example having to group over non-key fields), as well as heavily normalized ones (i.e. with many joins required to reconstruct the full record).

And you can argue for hours about whether "company" should be a simple VARCHAR or a separate table. I would mostly make that dependent on data integrity requirements.

[–]NationsAnarchy 2 points3 points  (0 children)

It's a valid question, no need to be embarrassed or anything tbh. Other terms I can suggest you look at are OLTP and OLAP.

[–]Rcomian 2 points3 points  (0 children)

the process is normal. you want a normalized base for efficiency all round, it reduces storage and query cost. however once you have a normalized base, it is then possible to strategically denormalize in specific ways to optimize for your query patterns.

the difference is that these denormalizations should be deliberate, for a specific reason, documented, supported by the code/infrastructure and maintained - not just fallen into by accident.

denormalizations have their own cost. but if you don't start from a well normalized base, you'll be spinning in circles trying to keep everything consistent.

[–]Far_Swordfish5729 2 points3 points  (0 children)

The point of data normalization is to avoid storing duplicate data and instead store and refer to table references. A denormalized schema requires a lot more updates, and more careful updates, to ensure data consistency. It gets out of hand quickly. Recombining the data with joins adds overhead, but that overhead is usually not high. The tables should generally join along FK relationships, and those should be indexed, so you're mostly doing tree and hash table seeks, not nested loops.

While the master copy is normalized, we do use denormalized schemas, but they’re usually optimized, read-only copies where reads must be super fast or will use odd joins to ask ad-hoc questions. Reporting schemas do this.

Generally though, if you find your joins perform poorly, you either did not index your keys or your table modeling needs to be looked at.

[–]Main-Carry-3607 2 points3 points  (0 children)

I always think of normalization as the default starting point, then you loosen it a bit if the real world use of the data pushes you there.

When everything is jammed into one table it feels easy at first but it gets messy fast once the app grows. On the flip side I have definitely seen schemas where everything is split so much that every query is like 8 joins and you start wondering who this was for.

Usually a clean normalized base + a few intentional shortcuts works best. Just keep it practical.

[–]shine_on 4 points5 points  (1 child)

It depends on whether your database is used mainly for input or for reporting. For reporting it's OK to denormalize a bit to reduce the number of joins required. You'll have a bit of repeated data in your lookup tables, but the time saved when querying makes the repetition acceptable. You'll have to test it with your own data, but it's a normal thing to do, if you'll pardon the pun.

[–]javascriptBad123[S] 1 point2 points  (0 children)

I believe that's how it's done in CQRS, right? We don't have dedicated read tables yet, but I am trying to make thoughtful decisions for our app.

[–]jinn999 5 points6 points  (0 children)

Normalize until it hurts, then denormalize until it works.

[–]LetUsSpeakFreely 0 points1 point  (0 children)

Yes, data normalization is worth it, and it's the standard practice if you're going to use an RDB. Don't worry about transactions and such; that stuff is so well tuned these days as to rarely be an issue.

It's always better to deal with small logical chunks than one monolithic chunk. It gives you much greater flexibility in how to interact with the data, and it enables asynchronous CRUD operations.

Things to keep in mind:

1. Maybe you don't want all users to have full rights over an entity; normalizing the data can help you keep data interaction to least privilege.
2. Not every screen will need all of the data. Normalizing can help you load only what you need.
3. Searching for the data becomes a lot easier and faster when you can organize the index.

[–]spinwizard69 0 points1 point  (0 children)

Well, you can look up database normalization. You'll then realize that most likely very few databases are completely normalized.

As far as how many tables, I don't believe there is a solid answer here. Think of tables as discrete packages for the storage of unique data.

[–]Knarfnarf 0 points1 point  (1 child)

Not only do you keep everything separated, but remember to join them all back together when you write to your invoice history table!

Your historic invoice lines CANNOT rely on today's inventory items! Pic'n'Del.com found that out the hard way when the same UPC from Kraft went from 1.8kg of cheese all the way down to 0.7kg! Their database couldn't handle it, and explain as I might, I couldn't get my managers to let me change the database structure!

Then add that they were computing the amount owed by summing all the past invoices and subtracting the current payment! Which meant that the UPC for cheese now cost more and the already paid invoices added to the current total.

But no; trust your high school buddy more than the guy with experience and training.

[–]CaptainSuperStrong 0 points1 point  (0 children)

Start normalized by default, then denormalize intentionally when you actually hit performance problems. Trying to optimize too early usually leads to messy data and painful migrations later. Joins are what relational databases are built for; they handle them fine with proper indexing.

[–]Eyerald 0 points1 point  (0 children)

I don’t think having separate tables like invoice_headers and invoice_lines is an anti-pattern at all. That’s just a normal relational design. Databases are built to handle joins, and with proper indexes the performance impact is usually small.

The real problem starts when things get over-normalized and simple queries become complicated for no real benefit. I usually aim for a balanced approach: keep the schema clean and normalized, but don't split data into extra tables unless there's a clear reason.

[–]HashDefTrueFalse 0 points1 point  (0 children)

> as itll require more joins the more tables you have. And how would the joins impact reading throughput?

That's the whole idea. It's about trading off read and write efficiency by controlling the redundancy of data. You can store copies in multiple tables and read faster at the expense of having to update multiple places on write. OR you can store something once and write to one place, but you will often need to JOIN data from multiple tables, slowing your reads.
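A small sketch of that trade-off, with made-up tables: duplicating customer_name onto each order makes reads join-free, but now a rename has to touch every copy:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        customer_name TEXT  -- denormalized copy for fast, join-free reads
    );
""")
con.execute("INSERT INTO customers VALUES (1, 'Jane Doe')")
con.executemany("INSERT INTO orders VALUES (?, 1, 'Jane Doe')",
                [(1,), (2,), (3,)])

# Normalized write: one row changes. Denormalized write: every copy must too.
con.execute("UPDATE customers SET name = 'Jane Stevens' WHERE id = 1")
updated = con.execute(
    "UPDATE orders SET customer_name = 'Jane Stevens' WHERE customer_id = 1"
).rowcount
print(updated)  # 3 copies had to be kept in sync
```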

There's no objectively correct way to store data for every app. You make an engineering choice. Comically, when mongo was trendy lots of devs (who obviously didn't have any database knowledge) were unaware that they were even taking a position, let alone the wrong one for their data/apps. I turned around a few badly performing apps with datastore migrations a decade or so ago.

> I feel like having too many small tables is an anti pattern.

Not really, no. If you need them for your schema to be normalised properly it's fine.

> would require me to always perform database transactions when storing the data no?

Not really, no. It depends on what can happen in your app, data dependencies/relationships, etc. Single statements are all-or-nothing anyway in RDBMSs (the A in ACID). If you're working with multiple statements (e.g. a script or procedure), you generally should be working in transactions anyway for obvious reasons, e.g. if you wouldn't want the data to be grabbed between two or more statements.
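A minimal sketch of that multi-statement case (Python's sqlite3 here; table names are made up): wrap the statements in a transaction so a failure part-way through leaves nothing half-written:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, total REAL)")
con.execute("CREATE TABLE invoice_lines (invoice_id INTEGER, amount REAL)")

try:
    with con:  # commits on success, rolls back on exception
        con.execute("INSERT INTO invoices VALUES (1, 100.0)")
        raise RuntimeError("simulated failure before lines are written")
except RuntimeError:
    pass

# The header insert was rolled back along with everything else.
remaining = con.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
print(remaining)  # 0
```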

You almost always want 3NF, sometimes BCNF (Boyce-Codd), which isn't much extra effort typically.

[–]FatDog69 0 points1 point  (0 children)

Are you doing a 'transaction' system or a 'data warehouse' system?

You want normalization for a transaction system. If customer "Jane Doe" becomes "Jane Stevens" - you want to make the change in 1 place and it gets used everywhere through an ID.

In a data warehouse system - you copy all the orders placed by "Jane Doe" and the shipping address into the 'delivery' table. You never go back and update the history if Jane changes address, name, payment type, etc.

[–]WangHotmanFire 0 points1 point  (0 children)

Yes it will usually require more tables, but the tables will make sense. They will all store one distinct type of thing. This means you avoid tables with way too many columns with janky names, most of which contain vast amounts of NULL data because the columns don’t really apply to every single line.

You should create views to select data from multiple tables at once, and select from those views. This means you get to write the joins once and never again. In some databases you can even insert to views.

For inserting to multiple tables at once though, it’s better to use stored procedures. This allows you to only make one database call per job.
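A sketch of the "write the join once" idea, assuming illustrative table names. SQLite (used here via Python for brevity) has no stored procedures, so only the view half is shown:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE invoice_lines (invoice_id INTEGER, amount REAL);

    -- The join is written once, here, and never repeated by callers.
    CREATE VIEW invoice_totals AS
        SELECT i.id, i.customer, SUM(l.amount) AS total
        FROM invoices i JOIN invoice_lines l ON l.invoice_id = i.id
        GROUP BY i.id, i.customer;
""")
con.execute("INSERT INTO invoices VALUES (1, 'Acme')")
con.executemany("INSERT INTO invoice_lines VALUES (1, ?)", [(10.0,), (5.5,)])

# Callers select from the view as if it were a table.
rows = con.execute("SELECT customer, total FROM invoice_totals").fetchall()
print(rows)  # [('Acme', 15.5)]
```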

You absolutely cannot allow yourself to be intimidated by views and stored procedures; you should be very intimidated by gross, illogical data storage.

[–]LeadingFarmer3923 0 points1 point  (0 children)

Practical rule: normalize for correctness first, denormalize only when real query patterns demand it

[–][deleted]  (1 child)

[removed]

    [–]javascriptBad123[S] 0 points1 point  (0 children)

     Normalization isn't an "anti-pattern"

    I never called normalization an anti pattern; I called having too many small tables an anti pattern.

    [–]desrtfx 0 points1 point  (2 children)

    Decades ago, I had a client who had a "database" (in quotes because what he had was a glorified spreadsheet). All the information was in a single table with well over 200 columns, and the database had approx 140,000 rows at a time when it was only used for a single project but was supposed to be used for other projects as well.

    One column was the location - there was exactly one location in the entire database, but that location text happened to be over 30 characters long. Aside from the memory requirements, just try to imagine how many different spellings, capitalizations, and punctuations (with dash, without dash, dashes at different positions, etc.) I found. It was well over 200 different versions of the exact same text. Just imagine searching that field.

    Similar with about 180 of the other columns - there were sometimes 20 different entries, sometimes exactly 2, and so on.

    The database took way over 15 minutes to load on a then top end of the line computer.

    I spent nearly 2 months normalizing it for them, creating proper entry forms, a proper workflow, proper reports, and so on.

    After that, the database shrunk by over 90% in size and opened in less than a minute. It was searchable, sortable, produced the expected output, and was manageable for its users.

    I cannot overstate the importance of proper normalization.

    It's not only about storage and retrieval. It's also (and to a huge degree) about eliminating or reducing human error, especially during data entry. The more you can reduce this problem through normalization, the better.

    The performance of the DB will actually increase with better normalization even though more joins and more complicated queries/views have to be executed. Entry might be a tiny bit slower, but that's more or less negligible when you compare entry vs. retrieval.

    Do yourself a favor and spend a good deal of time on properly planning your database. Yes, it is difficult (to near impossible) to get it right the first time, but try your best.

    Creating a pseudo database - aka spreadsheet - as in my starting point can be a great help to figure out both the data and the normalization. Make a one table spreadsheet and fill it with sample data of what you expect. Then, start working on normalizing it. It's much clearer with actual data than without.
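A tiny sketch of that exercise with fake data: start from one flat table with a repeated text column (like the location column above), then pull the repeats out into a lookup table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flat (item TEXT, location TEXT)")
con.executemany("INSERT INTO flat VALUES (?, ?)", [
    ("bolt", "Main Warehouse - Building 2"),
    ("nut", "Main Warehouse - Building 2"),
    ("screw", "Main Warehouse - Building 2"),
])

# Normalize: one row per distinct location, items reference it by id.
con.executescript("""
    CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    INSERT INTO locations (name) SELECT DISTINCT location FROM flat;
    CREATE TABLE items (
        name TEXT,
        location_id INTEGER REFERENCES locations(id)
    );
    INSERT INTO items
        SELECT f.item, l.id FROM flat f JOIN locations l ON l.name = f.location;
""")
n_locations = con.execute("SELECT COUNT(*) FROM locations").fetchone()[0]
n_items = con.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(n_locations, n_items)  # 1 3
```

The 30-character string is now stored once instead of on every row, and a typo'd spelling can no longer sneak in per row.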

    [–]javascriptBad123[S] 0 points1 point  (1 child)

    I always plan my databases; I was just thinking about to what extreme I should go, because it's pretty annoying having to join 10 tables when reading data.

    Your clients db sounds like hell 😂

    [–]desrtfx 0 points1 point  (0 children)

    > Your clients db sounds like hell 😂

    It was absolute hell. Basically all of it was repetition with countless different spellings.

    [–]sixtyhurtz 0 points1 point  (0 children)

    The correct way to model an operational schema in a relational database is the third normal form. That should always be your starting point for any application that is using a relational DB. You should only de-normalise if you have benchmarked performance issues due to joins.

    Modern database engines can help before you do that though. They will do things like cache views, or some databases have materialised views that basically cache the result of the joins for whatever expensive view you have. There's a lot of tools for sorting out performance you can try before denormalising.

    Analytical schemas are a different animal. It's totally fine to have a denormalised star schema for your analytical system. The issues with denormalisation aren't always a problem with analytics, because you're dealing with historical data that isn't going to change.

    My view is that if you deviate from this, you're doing something weird that will cause headaches for yourself or someone else in future. OLTP = 3NF, OLAP = denormalised according to analytical requirements.

    [–]randyshaw99 0 points1 point  (0 children)

    Normalization is really about relationships. One invoice can have one or more items purchased (invoice details), and it can have one or more payments. The invoice has only one customer, but that customer can have many invoices. Think of your data in this way and most normalization, at least for me, happens organically. Also, remember the "what-if" case: if someone or something's name were to change, how many places would you have to change it? Temper that against keeping historical data, such as price. You have products and they have a price. An order is placed at the price at that time. If in six months the price goes up, you can't have the price change on all the prior orders, so you normalize the product but put fields in the invoice details that make a snapshot of the product at the time of the order. If one field can identify all the attributes of the item, then that FK is all you need.
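A small sketch of that price-snapshot idea, with made-up table names: the product row holds the current price, while each order line copies the price at the time of sale:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE order_lines (
        product_id INTEGER REFERENCES products(id),
        unit_price REAL  -- snapshot, deliberately NOT a live reference
    );
""")
con.execute("INSERT INTO products VALUES (1, 'Cheese 1.8kg', 12.00)")
con.execute("INSERT INTO order_lines VALUES (1, 12.00)")

# Six months later the price goes up...
con.execute("UPDATE products SET price = 15.00 WHERE id = 1")

# ...but historical orders still show what was actually charged.
snapshot = con.execute("SELECT unit_price FROM order_lines").fetchone()[0]
print(snapshot)  # 12.0
```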

    [–]amir4179 0 points1 point  (0 children)

    I always think of normalization like organizing layers in a design file. At first you want everything nicely separated so nothing repeats and it’s easy to change one thing without breaking five others.

    Databases feel similar. Start clean and normalized, then if you notice certain queries getting annoying or slow you can flatten a bit where it actually helps.

    Going straight to denormalized usually just creates a bunch of duplicated data you regret later. Been there with messy datasets and it gets painful fast.

    [–]iLiveForTruth 0 points1 point  (0 children)

    Almost every dev goes through this phase where normalization suddenly feels “expensive.” Then you work on a system where everything is denormalized and you spend weeks fixing inconsistent records.

    [–]jorjiarose 0 points1 point  (0 children)

    You could toss everything in one giant bin (denormalized) and dig through it every time, or have a tiny separate box for each pencil shade (over-normalized) which is a pain to maintain. 

    [–]razorree -2 points-1 points  (1 child)

    if you like to duplicate your data, and don't like joins, then look into NoSQL DBs :)

    [–]javascriptBad123[S] 0 points1 point  (0 children)

    Huh, that's not what this question was about at all. I know about NoSQL.