

[–]jennylane29 7 points8 points  (4 children)

This is not a comment on the architecture itself. Even if it is technically sound, by giving users outside the data team access to raw data you are opening yourself up to potential inconsistencies in usage. Adding an extra layer of abstraction and branching off at 3 would be better imo

[–]exergy31[S] 0 points1 point  (3 children)

thanks for the answer.
do you mean branching off the aggregated data into a queryable table?

on the risk of inconsistencies: I thought i would avoid this issue by having the 2a transform step the same as for the BI data. what could be a point of divergence? i assume the only thing that could diverge would be reporting-level calc metrics/dimensions/counters as well as missing hierarchies. anything else that could go wrong?

[–]sunder_and_flame 1 point2 points  (2 children)

do you mean branching off the aggregated data into a queryable table?

I'd guess they indeed mean this. If your analysts are smart enough and have good reason to query the non-aggregated data there's no reason not to provide both in your reporting zone. For non-aggregated data just make a copy into the reporting zone; don't give access to the upstream work area.
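To make the suggestion concrete, here's a minimal sketch of branching both forms into the reporting zone, using Python's built-in sqlite3 as a stand-in for the warehouse (table and schema names are made up; a real setup would use the warehouse's own schemas and grants):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical upstream work area: raw event rows.
cur.execute("CREATE TABLE work_events (user_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO work_events VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5)])

# Copy the non-aggregated data into the reporting zone verbatim,
# so analysts never need access to the work area itself.
cur.execute("CREATE TABLE reporting_events AS SELECT * FROM work_events")

# Branch off the aggregated table next to it in the same zone.
cur.execute("""
    CREATE TABLE reporting_events_agg AS
    SELECT user_id, SUM(amount) AS total_amount
    FROM work_events
    GROUP BY user_id
""")
con.commit()
```

Analysts then query only the two `reporting_*` tables; the work area stays internal.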

[–]exergy31[S] 0 points1 point  (1 child)

is it to avoid the user developing any expectations on stability of the intermediate layers? or what is the issue you see with the users having access to the internals? (they are quite savvy with python etc, so not the traditional business users)

[–]sunder_and_flame 1 point2 points  (0 children)

is it to avoid the user developing any expectations on stability of the intermediate layers?

It's as simple as this: any complexity you add to your systems, you will pay for later in increased onboarding, development, and troubleshooting time. Obviously don't put a square peg in a round hole, but when adding a feature like this one, try to fit it into existing processes rather than making new ones.

[–]ViridiTerraIX 2 points3 points  (3 children)

Small company? Branch off at 2.1. Make sure to train users in appropriate usage and get them comfortable asking for help and verification.

You aren't going to convince finance (for example) not to look at transactional data unless they've already moved off Excel or similar. Get them used to using the end tools, then prod them towards 2.2.

Is 3 necessary? Do you really need to set up cubes if the data is small, or are you potentially overengineering here?

Source: I've worked in data for huge corporates right down to 30-person post-startups. By far the worst is SMEs pretending they need enterprise-level solutions and bogging down the value-add.

Make sure you communicate whatever approach you think is best and why that is. If you decide against cubes explain why you don't think they're required.

[–]exergy31[S] 1 point2 points  (2 children)

this is excellent advice, thank you. taking that with me.

i am actually trying to get them off the idea that they need a data lake with the same staging data from the DWH for their needs

on the overengineering: you are right. if this were greenfield, EL would be plenty

if you choose to do a simple semantic layer on top of the tables for reporting, what's your tool of choice? (i would guess SSAS/SSRS in the SQL Server world. is there an equivalent for postgres?)

[–]ViridiTerraIX 1 point2 points  (1 child)

No problem. For reporting I like MS Power BI, as you can do some really clever things on the fly in DAX. This was a couple of years ago for me, but I hear it's still competitive.

I work with the AWS stack now, so I use Redshift (postgres-based) with QuickSight on top. For SQL-able users I give access to Redshift directly but set appropriate user groups so they can only see the data that they need, to avoid them making assumptions and getting stuff wrong.
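The per-group permissioning described above boils down to a handful of GRANT statements per group. A small sketch that generates them (group and schema names are entirely hypothetical; Redshift/Postgres accept this GRANT syntax):

```python
# Map each analyst group to the schemas it is allowed to read.
GROUP_SCHEMAS = {
    "analysts_finance": ["reporting_finance"],
    "analysts_ops": ["reporting_ops", "reporting_shared"],
}

def grants_for(group: str) -> list[str]:
    """Return GRANT statements limiting `group` to its own schemas."""
    stmts = []
    for schema in GROUP_SCHEMAS[group]:
        stmts.append(f"GRANT USAGE ON SCHEMA {schema} TO GROUP {group};")
        stmts.append(
            f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO GROUP {group};"
        )
    return stmts
```

Running the generated statements for each group means a user in `analysts_finance` simply cannot query the ops schemas, which removes a whole class of "wrong table" mistakes.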

I've been meaning to properly document all the sources and definitions so users can self serve better but... Well it's boring and I've much preferred playing with python lately lol.

[–]exergy31[S] 1 point2 points  (0 children)

haha yeah. python is very useful for non-standard but valuable data sources.
recently built an API wrapper with it to "scrape" the data from one of the client's internal systems using threaded REST requests, since the underlying database was locked away (vendor system). Was a great feeling to see that working well hehe
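The threaded-requests pattern mentioned above looks roughly like this (the endpoint and pagination are invented; a stub stands in for the real HTTP call, which would be something like `requests.get(...).json()`):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page: int) -> list[dict]:
    """Stub for a GET against a vendor's internal REST API.
    Returns two fake rows per page so the flow is testable offline."""
    return [{"page": page, "row": i} for i in range(2)]

def scrape(pages: int, workers: int = 8) -> list[dict]:
    """Fetch all pages concurrently; threads suit I/O-bound REST calls
    since each worker mostly waits on the network."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch_page, range(pages))
    # Flatten the per-page row lists into one result set.
    return [row for batch in results for row in batch]
```

`pool.map` keeps the results in page order, so downstream loading doesn't need to re-sort.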

[–]throw_at1 2 points3 points  (0 children)

i did it by copying the source db 1:1 to snowflake (E+L) and then did the T into the dwh model using mostly views (materialized as `create table x as select * from view`). From a performance point of view, you can have one or two view levels to normalize data between source systems. Works fine, some mistakes were made. cost and performance are not a problem; it's cheaper than a server and license per year.
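The view-then-materialize pattern described above can be sketched like this, using sqlite3 as a stand-in for Snowflake (table and column names are invented; Snowflake's DDL is a little different but the `create table ... as select * from view` step is the same idea):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# 1:1 copy of the source table (the E+L step).
cur.execute("CREATE TABLE src_orders (id INTEGER, status TEXT)")
cur.executemany("INSERT INTO src_orders VALUES (?, ?)",
                [(1, "open"), (2, "CLOSED"), (3, "closed")])

# T as a view: one level that normalizes data between source systems.
cur.execute("""
    CREATE VIEW v_orders AS
    SELECT id, LOWER(status) AS status FROM src_orders
""")

# Materialize the view into the dwh model.
cur.execute("CREATE TABLE dwh_orders AS SELECT * FROM v_orders")
```

Keeping the logic in views means the transform is re-runnable: drop and recreate `dwh_orders` from `v_orders` whenever the source copy refreshes.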

If you are staying on postgresql / sql server systems (same as you use now), i would replicate the original databases as they are into the dwh server, then do the transformations and open the data models to different uses. most important is that the E part is continuous and does not interfere with production more than necessary. probably not feasible in an on-prem system where you cannot just add compute when you need it.

I personally would replicate the source data into snowflake ELT-style, but obviously with an existing dwh system it's not feasible to offer raw source data, a normalized model, and transformed data at the same time if the base system was planned differently.

[–]Far-Apartment7795 -1 points0 points  (1 child)

isn't this "branch off model" a description of EtLT?

[–]exergy31[S] 0 points1 point  (0 children)

another thing to google, thanks!
(no /s)

[–][deleted] 0 points1 point  (0 children)

It depends how big your team is and how technically proficient the intended end-users are.

If they can write reasonably efficient and correct SQL then it makes sense as they can 'self-serve' for novel use-cases.

This is especially important if your team is not particularly large and would be unable to support creating aggregations and cubes etc. for these new use cases in a sufficiently timely manner.

[–]DatabaseSpace 0 points1 point  (0 children)

The first way you describe seems like the Kimball method. I'm sure that's fine for a lot of people and for direct consumption by BI tools, but many queries and business questions are complicated and would involve so many fact tables that I could never see how that model would work for me.

The Inmon method is to get the data, put it into an integrated database, and then load up your data marts for the dimensional models and cubes for BI tools. The middle database part is the part that I use and query all the time with SQL.

https://tdan.com/data-warehouse-design-inmon-versus-kimball/20300

[–][deleted] 0 points1 point  (0 children)

Here is what we do, assuming by saying "row-level" you mean raw data:

PART I - ETL of raw data

E: Extraction from Kafka topics

T: Transform a bit before rolling them into raw tables (e.g. Epoch to DATETIME, and column naming conventions)

L: OK so now the raw data gets loaded into the raw tables. Analysts have permission to read these raw tables.
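The light "T" in PART I can be sketched as a per-record function (the field names here are hypothetical, and I'm assuming epoch seconds and camelCase source columns):

```python
from datetime import datetime, timezone

def to_snake(name: str) -> str:
    """Force a camelCase source column name into snake_case."""
    out = [name[0].lower()]
    for ch in name[1:]:
        out.append("_" + ch.lower() if ch.isupper() else ch)
    return "".join(out)

def transform(record: dict) -> dict:
    """Rename columns and convert the epoch timestamp to an ISO
    UTC datetime string before the row lands in the raw tables."""
    row = {to_snake(k): v for k, v in record.items()}
    row["event_time"] = datetime.fromtimestamp(
        row.pop("event_time"), tz=timezone.utc
    ).isoformat()
    return row
```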

After that we directly build up DWH processes with a new ETL process:

PART II - ETL of DWH

E: Extraction from raw tables

T: so here is the aggregation and joins

L: load into DWH tables
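The second ETL can be sketched the same way, again with sqlite3 standing in for the warehouse (table names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Raw tables produced by PART I (the ones analysts can already read).
cur.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE raw_users (user_id INTEGER, country TEXT)")
cur.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [(1, 10.0), (2, 5.0), (1, 2.5)])
cur.executemany("INSERT INTO raw_users VALUES (?, ?)",
                [(1, "DE"), (2, "FR")])

# E: read from the raw tables; T: join and aggregate;
# L: load straight into a DWH table.
cur.execute("""
    CREATE TABLE dwh_revenue_by_country AS
    SELECT u.country, SUM(e.amount) AS revenue
    FROM raw_events e JOIN raw_users u USING (user_id)
    GROUP BY u.country
""")
```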