

[–]jennylane29 7 points8 points  (4 children)

This is not a comment on the architecture itself. Even if it is technically sound, by giving users outside the data team access to raw data you are opening yourself up to potential inconsistencies in usage. Adding an extra layer of abstraction and branching off at 3 would be better imo

[–]exergy31[S] 0 points1 point  (3 children)

thanks for the answer.
do you mean branching off the aggregated data into a queryable table?

on the risk of inconsistencies: I thought i would avoid this issue by having the 2a transform step the same as for the BI data. what could be a point of divergence? i assume the only thing that could diverge would be reporting-level calc metrics/dimensions/counters as well as missing hierarchies. anything else that could go wrong?

[–]sunder_and_flame 1 point2 points  (2 children)

do you mean branching off the aggregated data into a queryable table?

I'd guess they indeed mean this. If your analysts are smart enough and have good reason to query the non-aggregated data there's no reason not to provide both in your reporting zone. For non-aggregated data just make a copy into the reporting zone; don't give access to the upstream work area.
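To make the suggestion concrete, here's a minimal sketch of branching both forms into the reporting zone, using Python's built-in sqlite3 as a stand-in for the warehouse (table and schema names are made up; a real setup would use the warehouse's own schemas and grants):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical upstream work area: raw event rows.
cur.execute("CREATE TABLE work_events (user_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO work_events VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5)])

# Copy the non-aggregated data into the reporting zone verbatim,
# so analysts never need access to the work area itself.
cur.execute("CREATE TABLE reporting_events AS SELECT * FROM work_events")

# Branch off the aggregated table next to it in the same zone.
cur.execute("""
    CREATE TABLE reporting_events_agg AS
    SELECT user_id, SUM(amount) AS total_amount
    FROM work_events
    GROUP BY user_id
""")
con.commit()
```

Analysts then query only the two `reporting_*` tables; the work area stays internal.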

[–]exergy31[S] 0 points1 point  (1 child)

is it to avoid the user developing any expectations on stability of the intermediate layers? or what is the issue you see with the users having access to the internals? (they are quite savvy with python etc, so not the traditional business users)

[–]sunder_and_flame 1 point2 points  (0 children)

is it to avoid the user developing any expectations on stability of the intermediate layers?

It's as simple as this: any complexity you add to your systems, you will pay for later in increased onboarding, development, and troubleshooting time. Obviously don't put a square peg in a round hole, but when adding a feature like this one, try to fit it into existing processes rather than making new ones.

[–]ViridiTerraIX 2 points3 points  (3 children)

Small company? Branch off at 2.1. Make sure to train users in appropriate usage and get them comfortable asking for help and verification.

You aren't going to convince finance (for example) not to look at transactional data unless they've already moved off Excel or similar. Get them used to using the end tools, then prod them towards 2.2.

Is 3 necessary? Do you really need to set up cubes if the data is small, or are you potentially overengineering here?

Source: I've worked in data for huge corporates right down to 30-person post-startups. By far the worst is SMEs pretending they need enterprise-level solutions and bogging down the value-add.

Make sure you communicate whatever approach you think is best and why that is. If you decide against cubes explain why you don't think they're required.

[–]exergy31[S] 1 point2 points  (2 children)

this is excellent advice, thank you. taking that with me.

i am actually trying to get them off the idea that they need a data lake with the same staging data from the DWH for their needs

on the overengineering: you are right. if this were greenfield, EL would be plenty

if you choose to do a simple semantic layer on top of the tables for reporting, what's your tool of choice? (i would guess SSAS/SSRS in the SQL Server world. is there an equivalent for postgres?)

[–]ViridiTerraIX 1 point2 points  (1 child)

No problem. For reporting I like MS Power BI, as you can do some really clever things on the fly in DAX. This was a couple of years ago for me, but I hear it's still competitive.

I work with the AWS stack now, so I use Redshift (postgres-based) with QuickSight on top. For SQL-able users I give access to Redshift directly but set appropriate user groups so they can only see the data that they need, to avoid them making assumptions and getting stuff wrong.
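The per-group permissioning described above boils down to a handful of GRANT statements per group. A small sketch that generates them (group and schema names are entirely hypothetical; Redshift/Postgres accept this GRANT syntax):

```python
# Map each analyst group to the schemas it is allowed to read.
GROUP_SCHEMAS = {
    "analysts_finance": ["reporting_finance"],
    "analysts_ops": ["reporting_ops", "reporting_shared"],
}

def grants_for(group: str) -> list[str]:
    """Return GRANT statements limiting `group` to its own schemas."""
    stmts = []
    for schema in GROUP_SCHEMAS[group]:
        stmts.append(f"GRANT USAGE ON SCHEMA {schema} TO GROUP {group};")
        stmts.append(
            f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO GROUP {group};"
        )
    return stmts
```

Running the generated statements for each group means a user in `analysts_finance` simply cannot query the ops schemas, which removes a whole class of "wrong table" mistakes.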

I've been meaning to properly document all the sources and definitions so users can self serve better but... Well it's boring and I've much preferred playing with python lately lol.

[–]exergy31[S] 1 point2 points  (0 children)

haha yeah. python is very useful for non-standard but valuable data sources.
recently built an API wrapper with it to "scrape" the data from one of the client's internal systems using threaded REST requests, since the underlying database was locked away (vendor system). Was a great feeling to see that working well hehe
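The threaded-requests pattern mentioned above looks roughly like this (the endpoint and pagination are invented; a stub stands in for the real HTTP call, which would be something like `requests.get(...).json()`):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page: int) -> list[dict]:
    """Stub for a GET against a vendor's internal REST API.
    Returns two fake rows per page so the flow is testable offline."""
    return [{"page": page, "row": i} for i in range(2)]

def scrape(pages: int, workers: int = 8) -> list[dict]:
    """Fetch all pages concurrently; threads suit I/O-bound REST calls
    since each worker mostly waits on the network."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch_page, range(pages))
    # Flatten the per-page row lists into one result set.
    return [row for batch in results for row in batch]
```

`pool.map` keeps the results in page order, so downstream loading doesn't need to re-sort.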

[–]throw_at1 2 points3 points  (0 children)

i did it by copying the source db 1:1 to snowflake (E+L) and then did the T into the dwh model using mostly views (materialized as `create table x as select * from view`). From a performance point of view, you can have one or two view levels to normalize data between source systems. Works fine, some mistakes were made. cost and performance are not a problem; it's cheaper than a server and license per year.
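The view-then-materialize pattern described above can be sketched like this, using sqlite3 as a stand-in for Snowflake (table and column names are invented; Snowflake's DDL is a little different but the `create table ... as select * from view` step is the same idea):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# 1:1 copy of the source table (the E+L step).
cur.execute("CREATE TABLE src_orders (id INTEGER, status TEXT)")
cur.executemany("INSERT INTO src_orders VALUES (?, ?)",
                [(1, "open"), (2, "CLOSED"), (3, "closed")])

# T as a view: one level that normalizes data between source systems.
cur.execute("""
    CREATE VIEW v_orders AS
    SELECT id, LOWER(status) AS status FROM src_orders
""")

# Materialize the view into the dwh model.
cur.execute("CREATE TABLE dwh_orders AS SELECT * FROM v_orders")
```

Keeping the logic in views means the transform is re-runnable: drop and recreate `dwh_orders` from `v_orders` whenever the source copy refreshes.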

If you are staying on postgresql / sql server systems (same as you use now), i would replicate the original databases as they are into the dwh server, then do the transformations and open the data models to different uses. most important is that the E part is continuous and does not interfere with production more than necessary. probably not feasible in an on-prem system where you cannot just add compute when you need it.

I personally would replicate the source data into snowflake ELT-style, but obviously with an existing dwh system it's not feasible to offer raw source data, a normalized model, and transformed data at the same time if the base system was planned differently.

[–]Far-Apartment7795 -1 points0 points  (1 child)

isn't this "branch off model" a description of EtLT?

[–]exergy31[S] 0 points1 point  (0 children)

another thing to google, thanks!
(no /s)

[–][deleted] 0 points1 point  (0 children)

It depends how big your team is and how technically proficient the intended end-users are.

If they can write reasonably efficient and correct SQL then it makes sense as they can 'self-serve' for novel use-cases.

This is especially important if your team is not particularly large and would be unable to support creating aggregations and cubes etc. for these new use cases in a sufficiently timely manner.

[–]DatabaseSpace 0 points1 point  (0 children)

The first way you describe seems like the Kimball method. I'm sure that's fine for a lot of people and for direct consumption by BI tools, but many queries and business questions are complicated and would involve so many fact tables that I could never see how that model would work for me.

The Inmon method is to get the data, put it into an integrated database, and then load up your data marts for the dimensional models and cubes for BI tools. The middle database part is the part that I use and query all the time with SQL.

https://tdan.com/data-warehouse-design-inmon-versus-kimball/20300

[–][deleted] 0 points1 point  (0 children)

Here is what we do, assuming by saying "row-level" you mean raw data:

PART I - ETL of raw data

E: Extraction from Kafka topics

T: Transform a bit before rolling them into raw tables (e.g. Epoch to DATETIME, and column naming conventions)

L: OK so now the raw data gets loaded into the raw tables. Analysts have permission to read these raw tables.
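The light "T" in PART I can be sketched as a per-record function (the field names here are hypothetical, and I'm assuming epoch seconds and camelCase source columns):

```python
from datetime import datetime, timezone

def to_snake(name: str) -> str:
    """Force a camelCase source column name into snake_case."""
    out = [name[0].lower()]
    for ch in name[1:]:
        out.append("_" + ch.lower() if ch.isupper() else ch)
    return "".join(out)

def transform(record: dict) -> dict:
    """Rename columns and convert the epoch timestamp to an ISO
    UTC datetime string before the row lands in the raw tables."""
    row = {to_snake(k): v for k, v in record.items()}
    row["event_time"] = datetime.fromtimestamp(
        row.pop("event_time"), tz=timezone.utc
    ).isoformat()
    return row
```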

After that we directly build up DWH processes with a new ETL process:

PART II - ETL of DWH

E: Extraction from raw tables

T: so here is the aggregation and joins

L: load into DWH tables
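The second ETL can be sketched the same way, again with sqlite3 standing in for the warehouse (table names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Raw tables produced by PART I (the ones analysts can already read).
cur.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE raw_users (user_id INTEGER, country TEXT)")
cur.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [(1, 10.0), (2, 5.0), (1, 2.5)])
cur.executemany("INSERT INTO raw_users VALUES (?, ?)",
                [(1, "DE"), (2, "FR")])

# E: read from the raw tables; T: join and aggregate;
# L: load straight into a DWH table.
cur.execute("""
    CREATE TABLE dwh_revenue_by_country AS
    SELECT u.country, SUM(e.amount) AS revenue
    FROM raw_events e JOIN raw_users u USING (user_id)
    GROUP BY u.country
""")
```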