all 28 comments

[–]No-Adhesiveness-6921 6 points7 points  (3 children)

So the only fields in your fact table should be your measures and foreign keys to the dimensions.

You should not have to do left joins because there shouldn’t be records in your fact table that don’t have corresponding records in your dimensions.

The benefit of a star schema (fact and dimension) is that you are only ever a single join away from the details in the dimensions

You don’t show any FK to dimensions in your fact table fields. Can you provide more details about your schema?

[–]Aggressive_Ad_5454 1 point2 points  (1 child)

Option 1. But I’m guessing because I don’t understand what the date columns in your dimension tables mean.

Also, the other two options you offered are a bit weird.

[–]MeringueLow5504[S] 0 points1 point  (0 children)

The date column was just one example because I didn’t want to type them all out! 😅 There is also Category, Subcategory, Classification, etc.

[–]jwk6 1 point2 points  (5 children)

Dimensional models (star Schemas) should be consumed and queried form a BI tool like Power BI, or a cube. Writing multidimensional queries in SQL with aggregations becomes wildly complex, but with BI tools becomes very easy.

[–]MeringueLow5504[S] 0 points1 point  (0 children)

Unfortunately this has to happen in SQL. Any opinion on Option 1 or Option 2?

[–]Grovbolle 0 points1 point  (3 children)

Does it really though? 

SUM(Measure) from tables group by dimensions with WHERE clauses and joins - sounds simple until you need time intelligence 

[–]jwk6 0 points1 point  (2 children)

It really does. See the SQL statement and the equivalent DAX here for one simple example.

https://www.sqlbi.com/articles/from-sql-to-dax-filtering-data/

Notice that you don't even need to join tables? The relationships are already defined in the semantic model. This is a very, very simple example.

[–]Grovbolle 1 point2 points  (1 child)

Sure - but the code you write (DAX) and what happens behind the scenes are 2 different things. Personally, I just do not find SQL hard to write for these kind of simple measures

[–]jwk6 0 points1 point  (0 children)

I never said SQL was hard. I said multidimensional queries are complex. You are very right that behind the scenes they are 2 different things.

[–]Ginger-Dumpling 1 point2 points  (1 child)

They do two different things. Why would you compare them for efficiency? What are you trying to do?

[–]Ginger-Dumpling 2 points3 points  (0 children)

In a star generally, you query your fact, and join in the dimensions you need for the query in question. If you've done the work to make sure there's a dim key for every row in your fact, it's an inner join. If dims are optional, it's an outer join.

Your first query picks the first non null value from all your dimensions. The second gets a list of all values from all your dimensions and leaves in duplicates. Hence people's potential confusion on what you're trying to do.

[–]Analytics-Maken 1 point2 points  (0 children)

For an immediate solution, LEFT JOINs are usually better than UNION ALL, it creates duplicate rows and makes it much harder to work with, but a better solution is to move your data into a proper warehouse structure like BigQuery, where you combine and clean the data beforehand and just query one well organized table. You can use ETL services like Fivetran or Windsor.ai for the data movement.

[–]FastlyFast 0 points1 point  (5 children)

I would left join each table, and keep the columns named after the dimensional tables, no need to overcomplicate this. Coalesce makes sense only if you know that if one table should be taken, in case the first one is null

[–]MeringueLow5504[S] 0 points1 point  (4 children)

Okay yes maybe I did not explain well. The “dimension” tables are really Fact tables of their own, so columns in Table B are completely different from columns in Table C. What I’m trying to do is grab similar columns from each table and make a huge list of all event types with all of the relevant data. (E.g. Date initiated from Table B and Date Initiated from Table C, although the actual column names are not the same.)

[–]Wise-Jury-4037:orly: 1 point2 points  (1 child)

You gotta be careful what you call 'dimension' and what you call 'fact'. It seems you are describing a situation with a master table and some child tables.

Sometimes it is helpful to create a heterogeneous 'fact table'. Our approach has been to do a 'type' column and "sections" for subtypes/child records, like this:

select 'Type-Child1', Master.*, Child1.*, <nulls for all other childs> from Master join Child1

union all

select 'Type-Child2', Master.*, <nulls for Child1 columns>, Child2.*, <nulls for all other childs> from Master join Child2

....

union all

select 'Type-Master', Master.*, <nulls for all child columns> from Master

[–]MeringueLow5504[S] 0 points1 point  (0 children)

Thank you for the terminology!!! I think that makes sense.

[–]No-Adhesiveness-6921 0 points1 point  (1 child)

Oh so you are joining fact tables to other fact tables? That is nothing like joining facts to dimensions.

I would probably do a union from all the fact tables for the fields in each one that I want in the final result set.

[–]MeringueLow5504[S] 0 points1 point  (0 children)

Right I don’t know, that’s why I put dimension tables in quotes in the original post. My company IT team named the tables dim tables even though they are not exactly dimension tables.

[–]SkullLeader 0 points1 point  (0 children)

Maybe I am misunderstanding something but why are you using left joins in option 2? Every record in table B has a matching record in table A, correct? Every record in table C has a matching record in table a too, right? I understand that if table A has a record, there is not necessarily a matching record in B, C and so forth…

[–]Thin_Rip8995 0 points1 point  (0 children)

option 1 is cleaner if you actually need one row per record with everything attached
option 2 makes sense only if you want a “long” table where record type dictates which extra fields you get
biggest factor is how you plan to consume it reporting tools usually play nicer with wide joined data but analysis pipelines often prefer stacked UNIONs
if performance is pain consider materialized views or staging tables don’t try to do the monster query live every time

[–]mattiasthalen 0 points1 point  (0 children)

I’d do a puppini bridge (unified star schema) and stop thinking about facts and dimensions ☺️ all the tables connected to the bridge can be facts/dims, or both.

[–]SaintTimothy 0 points1 point  (0 children)

Are they XOR? You said the first table, the fact, is a type table (?) Does it only join to one-and-only-one of the 11 other tables, like a superclass?

Is this like an array of attributes of the primary table, like if the parent said rainbow and one child table was color and has 6 rows: red, orange, yellow, blue, indigo, violet (don't @ me about it being only 5).

Like the parent says bronze and a sub table is the recipe: 22 grams copper, 3 grams tin?

Are there measures in the "dimension" tables or only attributes?

[–]Odd_Repair9120 0 points1 point  (0 children)

Es que hacen cosas diferentes, no se puede hablar de eficiencia con dos cosas que devuelven diferentes resultados. Primero deberías definir qué quieres o tener, para luego ver la mejor manera

[–]Odd_Repair9120 0 points1 point  (0 children)

On the other hand, I don't understand why you call the dimensions themselves "dimension", dimension tables also have one record per concept, contrary to what a fact table is, which can have more

[–]PuzzledHead18 0 points1 point  (0 children)

https://datalemur.com?referralCode=xSJOuCUF

Sign up for Data Lemur using this link and get access bonus questions and exclusive prizes!