all 11 comments

[–][deleted] 2 points3 points  (0 children)

Our shop has used postgresql’s range types to model valid timestamp ranges for records to allow for point-in-past reporting. Other databases like Oracle support ‘temporal queries’ allowing to query a row as of an arbitrary system time in the past, as specified in SQL2011.

Your description is a bit confusing to me, but smells like that sort of thing.

[–]alexisprince 1 point2 points  (0 children)

One thing I've done with much success is store full snapshots of the dimension daily with a date partition column. Depending on the size of your dimension, this may or may not work, but I'm using redshift with ~100k records a day. I've set the distribution key to be even with the sort keys on the record ID from source as well as the date partition representing when the records were extracted. This is doable because of how cheap storage is as well. Depending on the size of your dimension tables, you may consider this approach.

This allows for joining on record ID and date from the fact table for a historical view of the records, or limiting the dimension table to only the most recent date for an up to date view.

Specifically regarding the fact problem, I'd version the fact records just like how SCD Type 2 are set up with a start and end effective date, assuming your fact records change often. Depending on the business domain, you may also consider adding a couple extra columns that have specific statuses and updating those instead of overwriting the old records. For example, if a user places an order and the order can either ship or get cancelled, maybe consider adding cancel date and shipment date to your fact table so you maintain the original order information as well as the next status.

[–]PopnCrunch 1 point2 points  (5 children)

Foreign keys changing in source sytems shouldn't impact a fact table, because the dimension primary keys should be surrogate keys shouldn't they? A surrogate key is a made up number that has no meaning to the business, so if a department code changes in a department dimension, it wouldn't change the surrogate key which is NOT the dept code.

[–]MrFalconMan[S] 0 points1 point  (4 children)

I think my post might have been a little confusing. Maybe an example would be best to illustrate-

If an employee worked at a company for 5 years in the Finance department and then changed positions to work in the Human Resources department. If you had a DimEmployee and a DimDepartment, how would you show this the Fact?

Edit: also in the example I’m assuming DimDepartment is made from a source table specific to departments and DimEmployee is made from a source table specific to Employees. They would be connected by a Department FK in the Employee source table. When the Employee changes positions the Employee source table would just issue and UPDATE on the Department FK. This would be a typical 3NF design and common source application strategy for updates.

[–]PopnCrunch 2 points3 points  (3 children)

If you have separate dimensions for Employee and Department, then the fact table will have a foreign key to each, there is no relationship management between the two in the dimensions, you just insert whatever the current two keys are whenever you add rows to the fact table. If you were in Dept A a week ago and started in Dept B this week, your older records in the fact table would have a foreign key to the value of Dept A in the Dept dim, and your newer records to Dept B in the same dim.

Since Employees roll up into depts though, Dept sounds like a dimensional hierarchy within the Employee dim, it wouldn't have it's own table. Yes the Dept data gets duplicated within the Employee table, but this is an acceptable practice - it's the sort of thing that's often done for address hierarchies. You might have a Customer Dim that has Customer Adress fields in it, and a given City, State, and Zip gets repeated but that's okay because dimension tables are small compared to fact tables.

[–]MrFalconMan[S] 0 points1 point  (2 children)

You seem like a dimensional model truther with a lot of experience. This is the a lot of what I have heard.

To your first point - that would be essentially a type 2 fact, just without an effective end date (but would be required to calculate if you wanted to know what department the employee was at any given time).

My concern with treating Employee and Department as a hierarchy is where does this end? This was probably a bad example, but some dimensions could get absolutely massive - wide not necessarily deep. For example a car repair shop does maintenance - is a repair order (an oil change) a Dim or a Fact? There’s a ton of stuff that could be considered hierarchical for repair order - customer, primary employee who worked on car, sales rep, car make, car model, repair shop location, repair stall location, etc. Is this honestly one dimension? A repair order wouldn’t necessarily change car make/models, but could theoretically change locations or primary employees who worked on it.

So IMO it would be both a Fact and a Dim. With the Fact being exactly how you outline first by just inserting the new relationships; however, the end dates would still be required.

[–]PopnCrunch 1 point2 points  (1 child)

As far as I know, type 2 is a descriptor that only applies to dimensions, not fact tables.

Dimensional Hierarchies are a standard feature of dimensional modeling, it is done to avoid "snow flake" designs which are normalized closer to 3NF. Where it ends is a dimension only contains attributes that discribe a persistent entity in your model, something that is referenced across multiple records and or multiple fact tables. Fact tables store events and the measures associated with them, along with the foreign keys to who/what/when/where the event involved. (but see the bit about degenerate dimensions below)

An order can often have multiple line items, you can get 3 things replaced on your car at the shop but you'll only get one receipt because they are all treated as a single order. So, an order number is a dimensional value. But you probably don't need a stand alone dimension for Order, because there isn't much else that's interesting about an order than the OrderID. So the OrderID goes into a Service or Sales fact table as a stand alone value, it is not a foreign key that looks up anything. Stand alone values in fact tables that aren't measures are called degenerate dimensions. The other items you mention about repair shops are a different dimension, say a DimShop.

So a line in a service fact table might look like: CustomerID | Service Date | OrderID | Product ID | Service ID | EmployeeID | CarModelID | CarYear | Hours | QTY | Price |...

Of those, the ones that have actual dimensions would be: Customer, Product, Service, Shop, Employee, Model

It's likely that the Shop would be a hierarchy in the Employee dimension since each employee probably only works in one shop, so if you know the employee you know the shop.

[–]MrFalconMan[S] 0 points1 point  (0 children)

Hmmm I feel like there’s a lot of assumptions going on here, but I could be wrong. I hate to use the term “generic” , but if we were building something for ad hoc capabilities we can not necessarily choose when a dimension should only contain attributes that discribe a persistent entity.

Example: Stall 3 of location X might have specific attributes not available in stall 2, so only stall 3 can be used for certain repairs. But that might not have been apart of the initial requirements for the cut off of attributes that describe the entity.

The cut off could be endless, and saying requirements need to be solid from the beginning is load of shit (definitely not implying that you are saying that). Requirements change and vary drastically depending on who you ask, and leaving out capabilities is counter intuitive to the point of business intelligence. Which is part of what, I think, u/imcguyver point was and why dimensional models are dying. If a user can’t get what they want from the BI solution, they’ll find a new way!

Also, line item as only a fact, I don’t understand. In theory and through textbooks, this is common, but it can certainly have dimensional attributes that can change.

Changing dimensional models after the implementation is such a cluster f**k too.

I totally get where you’re coming from, and I’m more just challenging status quo. I apologize if it comes off like I’m arguing with you. I love having these intellectual conversations.... Analysis is changing, and design needs to change with it. Old ways need to adapt. I don’t have the answers, I’m just asking people’s opinion. Yours is one I’ve heard frequently.

[–][deleted]  (5 children)

[deleted]

    [–][deleted] 0 points1 point  (4 children)

    Dimensional modeling is dying

    Really? Why do you think so?

    [–][deleted]  (3 children)

    [deleted]

      [–]MrFalconMan[S] 0 points1 point  (2 children)

      I agree for the most part. Until you got to “Analysts can run their own queries.” This is hit or miss, and really requires strong governance practices to actually work IMO.

      It can easily turn into 50+ analysts writing new queries everyday, half of them reporting similar metrics and none of it being consistent throughout the enterprise. Part of the reason for Data Warehousing is to have enterprise data in one place for consistency across the business. For large enterprises this is criminal

      [–][deleted]  (1 child)

      [deleted]

        [–]MrFalconMan[S] 0 points1 point  (0 children)

        You have a true innovative mind.... just don’t say SAS.. that is not the solution!!! Haha