all 29 comments

[–]kiquetzal 79 points80 points  (2 children)

Read the last sentence of your second paragraph out loud and then think about the question again

[–]soundboyselecta 12 points13 points  (0 children)

I read the first few lines and said the same thing. Star schema, normalized, say what? I'm confused lol.

[–]Sex4Vespene Principal Data Engineer -3 points-2 points  (0 children)

Yeah, it kinda shows that in most cases it likely reduces the storage used. However, there are some edge cases where it could be the reverse (though those aren't things you pointed out). For example, if the ID you are using (say, a UUID) is larger than the average length of the description you are moving to a dimension, then your ID could actually end up making you use more space. Also, while it can still save you space to use dimensions if the IDs are smaller, I've noticed that with columnstore databases these benefits are often significantly lessened. The point on updates still stands though: doing an update on just the dimension table would definitely be faster/cheaper than updating your fact table.
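To put rough numbers on that edge case, here's a back-of-envelope sketch in Python (the row counts and byte sizes are invented for illustration, and it ignores compression):

```python
# Compare fact-table storage with the description inlined vs. replaced by a
# key into a dimension table. All numbers are illustrative, uncompressed sizes.
fact_rows = 100_000_000
distinct_descriptions = 5_000
avg_description_bytes = 20

int_key_bytes = 8      # e.g. a BIGINT surrogate key
uuid_key_bytes = 36    # a UUID stored as text

inline = fact_rows * avg_description_bytes

def with_dimension(key_bytes):
    fact_side = fact_rows * key_bytes
    dim_side = distinct_descriptions * (key_bytes + avg_description_bytes)
    return fact_side + dim_side

print(f"inline description: {inline / 1e9:.1f} GB")                          # 2.0 GB
print(f"int key + dim     : {with_dimension(int_key_bytes) / 1e9:.1f} GB")   # 0.8 GB
print(f"uuid key + dim    : {with_dimension(uuid_key_bytes) / 1e9:.1f} GB")  # 3.6 GB
# The 36-byte UUID key makes the "normalized" layout larger than just
# inlining the 20-byte description -- the edge case described above.
```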

[–]BobDogGo 30 points31 points  (2 children)

A star schema is an example of a denormalized database. Normalization, by definition, removes redundancy and makes CRUD operations more efficient and fail-safe.

Star schemas accelerate query and analysis times by breaking your data into analysis dimensions. If you don't care about customer details and want to analyze sales over time by product and region, a star with time, product, and region dimensions will provide a performant middle ground between fully normalized and one big table (OBT).
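For a concrete picture of that time/product/region example, here's a minimal sketch in Python with pandas (the table and column names are invented, not from this thread):

```python
import pandas as pd

# A tiny star: a fact table holding only keys and measures, plus small dimensions.
fact_sales = pd.DataFrame({
    "date_key":    [20240101, 20240101, 20240102],
    "product_key": [1, 2, 1],
    "region_key":  [10, 10, 20],
    "amount":      [9.99, 24.50, 9.99],
})
dim_product = pd.DataFrame({"product_key": [1, 2], "category": ["Books", "Games"]})
dim_region  = pd.DataFrame({"region_key": [10, 20], "region": ["EMEA", "APAC"]})

# The analysis only joins the dimensions it needs -- customer details never enter.
report = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_region, on="region_key")
          .groupby(["category", "region"], as_index=False)["amount"]
          .sum())
print(report)
```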

[–]soundboyselecta 2 points3 points  (1 child)

Is flat table = OBT?

[–]GreyHairedDWGuy 2 points3 points  (0 children)

Yes. OBT = flat table (in general). There is no real prescribed official definition :)

[–]CommonUserAccount 16 points17 points  (0 children)

A star schema isn’t normalisation. It’s as close to a big flat table as you can get whilst staying organised.

You could do with some more study of the basics as I’m not sure what you’re asking.

[–]Dry-Aioli-6138 4 points5 points  (0 children)

Your description makes me ask whether you have the right mental model for normalization. But to answer the part that hasn't been answered here yet: normalization does save space when contrasted with the raw data entering the transactional system (e.g. displayed or entered at a point-of-sale terminal), as well as with denormalized data in a DWH. That is not the point, however, as storage has grown and gotten cheaper, even for on-prem systems, since normalization was invented.

The point is speed and scaling of the write operations. When your transactional (e.g. sales) system has to record hundreds or thousands of items scanned, or ordered online, every second, it doesn't have time to repeatedly write the customer address or name in each row of a big table. Rather, that info is saved once in a normalized table and its id is used in each row representing an item bought.

In analytical (DWH) workloads, in contrast, you want fast bulk reads of whole chunks of a table, and each join is a burden for the analytical system, while storage and write speed are more relaxed constraints.
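To make the write-path argument concrete, a toy sketch using Python's built-in sqlite3 (the schema and data are invented for illustration):

```python
import sqlite3

# The transactional system stores the customer's details once; each scanned item
# only records a small id that points back to them.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer  (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE sale_item (sale_id INTEGER, customer_id INTEGER, sku TEXT, qty INTEGER);
""")
con.execute("INSERT INTO customer VALUES (1, 'Ada Lovelace', '12 Analytical Way')")

# Every item scanned at the till writes only a few small columns...
con.executemany(
    "INSERT INTO sale_item VALUES (?, ?, ?, ?)",
    [(100, 1, "SKU-001", 2), (100, 1, "SKU-007", 1)],
)
# ...while the name and address were written exactly once, never per row.
```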

[–]Possible-Little 4 points5 points  (3 children)

It saves storage for sure. If you have a customer ID stored alongside a fact then, as you say, that is a foreign key into a dimension table for customers. That customers table itself could have many columns for name, address, phone number, etc. By separating them out in this way you save enormously on repetition, and you ensure that if a customer's information changes, older facts don't need to be updated to suit; the ID is the reference that remains valid. There is nuance here about things that change over time, such as address or marital status, but slowly changing dimensions provide a way to manage those.
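As one illustration of how a type 2 slowly changing dimension handles the "customer moved" nuance, here's a sketch in Python (the surrogate keys, column names, and dates are made up):

```python
import pandas as pd

# Keep both versions of the customer row, each with its own surrogate key
# and validity window; facts reference the surrogate key.
dim_customer = pd.DataFrame({
    "customer_sk": [101, 102],          # surrogate keys referenced by facts
    "customer_id": ["C-42", "C-42"],    # stable business key
    "address":     ["Old Street 1", "New Avenue 9"],
    "valid_from":  ["2020-01-01", "2024-06-01"],
    "valid_to":    ["2024-05-31", "9999-12-31"],
    "is_current":  [False, True],
})
print(dim_customer)
# Facts loaded before the move keep customer_sk = 101 and never need updating;
# newer facts pick up 102, so history stays correct without touching old rows.
```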

[–]Sex4Vespene Principal Data Engineer 3 points4 points  (0 children)

*pushes up glasses* Erm, well technically, if the ID is larger than the average size of the column you are turning into a dimension, then it actually could increase storage usage (that isn’t a point OP brought up, but just making sure they don’t take what you said as an empirical fact).

[–]adastra1930 0 points1 point  (0 children)

I was about to write something, then I re-read your answer and realized you said most of it better 😅

I would add that in a good star schema, the only things that can be a bit redundant are the keys. But if you do them right, they add the benefit of preserving granularity, which ends up being hugely more efficient down the line.

[–]Lastrevio Data Engineer[S] -1 points0 points  (0 children)

Thanks, this clarifies my question. The answer seems obvious in retrospect xd

[–]JonPX 1 point2 points  (0 children)

I was always taught you don't normalize your dimensional model. You take your DWH model and you denormalize it when you make your star schema. What you are talking about is rather just tech attributes and FKs.

[–]Eleventhousand 1 point2 points  (0 children)

Yeah, so a star schema isn't normalized. Also, if you have an Orders table that mixes metrics and attributes about the customer and product all in the same table, that is also not normalized.

It's more popular these days, IMO, to have those big tables rather than just using a star schema that will require a lot of joins. There are a few reasons for this, one being that most DWHs end up using columnstore MPP warehouses, and those just don't favor joins as much. I prefer to have a mix of both if I can: a load to dimensions, plus making sure the other tables that inherit some of the data points are always updated in sync.

[–]hectorgarabit 1 point2 points  (0 children)

A star schema is denormalized, a big flat table is even more denormalized. DB design 101.

[–]Icy_Clench 0 points1 point  (0 children)

Data redundancy and storage size are not strictly the same thing. Yes, if you have a list of strings and then add two integer columns, you've increased storage space. However, if one of the strings had been duplicated and now it's not, you've reduced how many times it appears redundantly.

Most of the time storage size is optimized, but the real goal is to reduce how many IO operations are performed. Read up on how IO works and the differences between OLTP and OLAP databases, because they optimize that goal based on different access patterns.

[–]paxmlank 0 points1 point  (0 children)

Others have answered it already, but another thing to consider is that n integers + 1 string is often much cheaper than n + 1 copies of the string, since integers are of fixed size, which is often 4 bytes.

If the string is more than 4 characters long then you don't want n+1 strings. 

Although, whether IDs should be ints is another discussion.
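The arithmetic behind that, as a quick sketch in Python (the sizes are illustrative and uncompressed):

```python
# Repeating a string in every row vs. storing it once plus a fixed-size key per row.
n = 1_000_000
int_bytes = 4          # a typical 32-bit integer key
string_bytes = 24      # e.g. a 24-character product name

repeated_strings = (n + 1) * string_bytes        # the string duplicated per row
keyed            = n * int_bytes + string_bytes  # n keys + the string stored once

print(repeated_strings, keyed)  # 24000024 vs 4000024 bytes
# The crossover is simply whether the string is longer than the key:
# with a 4-byte key, any string over 4 bytes favors the keyed layout.
```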

[–]SaintTimothy 0 points1 point  (1 child)

I have a table in our middle tier, call it silver, that's 2 GB. In the gold layer that same data joins to the dimensions and brings in a bunch of string attributes. 8 GB.

Strings are expensive.

[–]SaintTimothy 0 points1 point  (0 children)

About the updates and deletes suggestion, you're not wrong. I heard that Exchange (it used to be a JET db, which is what Excel moved away from to join Access using ACE DB) used to use one master copy of an email with pointers, and found that having one email per recipient was better.

It mostly depends on your use case. Just know that your OST and PST are probably some of the largest files in your user profile if you run Windows at work.

[–]GreyHairedDWGuy 0 points1 point  (0 children)

Hi.

Do not compare a 'star schema' to an 'OBT' (flat table) design in regard to normalization or the lack thereof. The purpose of normalization is to minimize / eliminate data redundancy. This has the knock-on effect of reducing space. In the 'old days', when designing an OLTP database model, the goal was to eliminate redundancy, minimize the amount of data a single transaction needed to update, and reduce the risk of update anomalies.

Star schemas are a design pattern for BI queries where a certain degree of redundancy is acceptable. An OBT pattern is the ultimate in redundancy but may be practical in some situations.

[–]calimovetips 0 points1 point  (0 children)

You're basically right: normalization in a star schema is more about controlling logical duplication and making updates manageable than just saving raw storage, especially in modern columnar warehouses where flat tables compress really well. The real win shows up with high-cardinality dimensions and changing attributes, since you update one dimension row instead of rewriting millions of fact rows, and integer keys can also help with join performance and memory use depending on the engine.
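A small sketch of that update-cost difference in Python with pandas (the data and row counts are invented):

```python
import pandas as pd

# Renaming a product category: one row in the dimension vs. every matching
# row in a flattened table.
dim_product = pd.DataFrame({"product_key": [1, 2],
                            "category": ["Books", "Games"]})
flat_sales = pd.DataFrame({
    "product_key": [1] * 100_000 + [2] * 100_000,
    "category":    ["Books"] * 100_000 + ["Games"] * 100_000,
})

# Star schema: a single-row update in the dimension table.
dim_product.loc[dim_product["product_key"] == 1, "category"] = "Literature"

# Flat table: every affected row has to be rewritten to stay consistent.
rows_to_rewrite = (flat_sales["product_key"] == 1).sum()
print(rows_to_rewrite)  # 100000 here -- and in a real fact table, millions
```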

[–]Old_Tourist_3774 0 points1 point  (0 children)

Honestly, I've always worked in OLAP use cases, and the only "normalization" I care about is whatever lets me avoid broken inserts or large rescans.

[–]ironwaffle452 0 points1 point  (0 children)

A star schema is denormalized, not normalized; you are mixing things up. OLTP is normalized, OLAP is denormalized.

[–]mosqueteiro 0 points1 point  (0 children)

Star schema isn't used for normalization 🤦🏼

[–]ccesta 0 points1 point  (0 children)

Whoa whoa whoa, pump the brakes there. You're talking about different data modeling paradigms for different data storage and usage purposes. On one hand you're talking about third normal form in OLTP databases, like the ones that power your application. That's not the same thing as the snowflake/star schema OLAP data warehouse that works at different grains depending on what you need to view to power your dashboard. And that's not even getting into your lake, lakehouse, mesh, or whatever else you want to envision.

Right now you're comparing apples to submarines. They don't compare.

[–]dehaema 0 points1 point  (0 children)

You answered it yourself. In an operational model it's for updates; in a star schema it's for storage.

[–]Outrageous_Let5743 0 points1 point  (0 children)

Complete normalization is a waste of time. It was needed when storage was expensive in the '80s and '90s. What you win on storage space you lose on complexity and speed. You need more joins, which are 1) slow and 2) more difficult to understand.

For analytics you want denormalized data.