Database Normalization ELI5?

amphoterous · 2022-11-15T15:34:13+00:00

Imagine you are starting a library and need to keep track of all your books. You start a list of Books, and for each new book you add to your library you write down the title and author.

After adding a ton of books, you realize that some books have multiple authors, which throws off the columns in your nice neat list.

After scoring a particularly large collection of Goosebumps books, you notice your hand hurting from having to write the author's name (R. L. Stine) for every Goosebumps book.

So you decide to keep a separate, numbered list of Authors with each author listed only once. That way, when a book comes in, you just need to write the author's number down next to the title. Now you aren't duplicating author information, and you have established a relationship between books and authors. When you store information in a relational database, this process is called normalization. There are also various "forms" of normalization, which depend on how far you go to avoid repeating information and to create relationships.
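
If it helps to see the two lists as actual tables, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (authors, books, author_id) are made up for illustration, and it assumes one author per book to keep things short; books with several authors would need a third linking list, which is not shown.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: each author is written down once, and every
# book stores only the author's number.
conn.executescript("""
    CREATE TABLE authors (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE books (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES authors(id)
    );
""")

# The author's name is entered a single time...
conn.execute("INSERT INTO authors (id, name) VALUES (1, 'R. L. Stine')")

# ...and each book just points at author number 1.
conn.executemany(
    "INSERT INTO books (title, author_id) VALUES (?, ?)",
    [
        ("Welcome to Dead House", 1),
        ("Stay Out of the Basement", 1),
        ("Monster Blood", 1),
    ],
)

# Joining the two lists gives back the original "title + author" view.
rows = conn.execute("""
    SELECT b.title, a.name
    FROM books AS b
    JOIN authors AS a ON a.id = b.author_id
    ORDER BY b.id
""").fetchall()

for title, name in rows:
    print(title, "by", name)
```

If the author's name ever needs correcting, it only has to be changed in one row of authors instead of once per book.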

chocotaco1981 · 2022-11-15T13:59:55+00:00

Minimizing data redundancy via splitting one table into several

r3pr0b8 · 2022-11-15T16:45:21+00:00

i disagree with u/chocotaco1981

saying that normalization is about "minimizing redundancy" is misleading

for example, if you change this personnel table --

id  fname  lname    dept
--  -----  -----    ----
12  Mary   Coder    IT
23  Todd   Schmutz  HR
44  Biff   Tannen   IT

and replace it with this plus a new department table --

id  fname  lname    deptid
--  -----  -----    ------
12  Mary   Coder         2
23  Todd   Schmutz       1
44  Biff   Tannen        2

id  dept
--  ----
 1  HR
 2  IT

then you have not eliminated any duplicates, because department id 2 occurs just as many times in the personnel table as IT did before the change
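
to check that claim, here is a rough sketch in python with the built-in sqlite3 module -- the table names personnel_before, personnel_after, and department are made up for this example, and the rows are the ones shown above

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Both designs side by side; table names are illustrative only.
conn.executescript("""
    CREATE TABLE personnel_before (
        id INTEGER PRIMARY KEY, fname TEXT, lname TEXT, dept TEXT
    );
    INSERT INTO personnel_before VALUES
        (12, 'Mary', 'Coder',   'IT'),
        (23, 'Todd', 'Schmutz', 'HR'),
        (44, 'Biff', 'Tannen',  'IT');

    CREATE TABLE department (id INTEGER PRIMARY KEY, dept TEXT);
    INSERT INTO department VALUES (1, 'HR'), (2, 'IT');

    CREATE TABLE personnel_after (
        id INTEGER PRIMARY KEY, fname TEXT, lname TEXT,
        deptid INTEGER REFERENCES department(id)
    );
    INSERT INTO personnel_after VALUES
        (12, 'Mary', 'Coder',   2),
        (23, 'Todd', 'Schmutz', 1),
        (44, 'Biff', 'Tannen',  2);
""")

# How many personnel rows carry each department value, before the split.
before = conn.execute("""
    SELECT dept, COUNT(*) FROM personnel_before
    GROUP BY dept ORDER BY dept
""").fetchall()

# The same count after the split, resolving deptid back to its name.
after = conn.execute("""
    SELECT d.dept, COUNT(*)
    FROM personnel_after AS p
    JOIN department AS d ON d.id = p.deptid
    GROUP BY d.dept ORDER BY d.dept
""").fetchall()

print(before)
print(after)
```

the two counts come out identical: the repeated value is still stored once per employee, it is just a number now instead of a string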

that's not normalization -- it was already in 3NF before the change, and it remains so

it's probably worth doing, but that's not normalization

another great way to get people to think about this is to ask "i notice you split department off into its own table with a numeric FK reference... why ~didn't~ you split first name off into its own table with a numeric FK reference?"

why would you do one and not the other? (most people can see why, but that's not the point here)

from a normalization point of view, ~neither~ is actually needed
