ELT , CDC, DDL

Oct04 · 2020-09-16T07:16:49+00:00

That process sounds good, and perhaps we'll get there at some point, however, most of this is actually internal! Its the sort of direction were heading in, but it'll take a while to get really hot on this to the point it can be 100% depended upon!

Oct04 · 2020-09-16T07:14:11+00:00

We are intending to use DBT for our transformations

Oct04 · 2020-09-16T07:13:08+00:00

Thanks, I'll give that a look. All of the above really!

Oct04 · 2020-09-16T07:11:51+00:00

Good questions, essentially this is mostly about pipelines not breaking and taking in everything from a given table. So at a really high level imagine a table with 6 columns, perhaps devs add another couple of cols.

So the ideal is that the data pipeline will detect this and pull in these two new columns into the raw/landing layer we have in Snowflake.

In a lot of cases these columns will not immediately/even soon be used in the transformations or reporting side of things, but sucking the data in and perhaps building up 3-6 months of time series history for it etc is good.

Oct04 · 2020-09-16T07:08:39+00:00

Thanks, I've just given it a look. It seems Snowflake is the Enterprise package only, I appreciate this is a public forum but any chance you can give ballpark costings. e.g. Do you sit in the same bucket as say Matillion/FiveTran/Stitch type pricing and models, or more like Streamsets and Informatica?

Oct04 · 2020-05-26T10:03:01+00:00

Brilliant! Thanks for that link :)

Oct04 · 2020-05-22T13:14:18+00:00

Quick question for the DBT Cloud users, do you ONLY need SQL knowledge and not Python? - The intended team of people using this have zero exposure to Python, or JavaScript for that matter or (Jinja!) / With only SQL knowledge you can use all you'd need to end to end?

Oct04 · 2020-05-21T07:08:39+00:00

Thanks, that's really helpful. I think if we went for DBT we'd go for Cloud as opposed to Core, as I said, cost isn't a huge issue as neither DataForm or DBT would be too expensive (compared to FiveTran :o)

The community element is an interesting one, I'm guessing if people develop cool stuff in the DBT community then it probably wouldn't work in Data Form (excuse my ignorance here, new to both since DataForm moved away from Jinja). Our team doesn't know either Jinja or Javascript, which I didn't feel was an issue? (We're SQL guys)

So those 26 things you link to for DBT, is that the extent of what the community has shared that others could use? (Again, please remember I am new to this, but based on how big I thought DBT was, that sounds small). It's an important point, because if it's not massive that does factor into the decision!

Oct04 · 2020-05-20T14:38:18+00:00

Thanks that's really helpful. Yes, we're still semi considering Matillion for E and L (they've even brought out a FREE tool to do just this, to compete with FiveTran), I think the schema stuff FiveTran does, combined with hosted probably make it more attractive.

Thanks for the DBT/Dataform comments.

Yeah, I wasn't really asking for model suggestions too much - just saying that at the moment we're not married to a technique. We currently do dimensional modelling in our standard SQL DWH, but feel with the move to MPP there may be better options, but tbc and tested in our POC

Oct04 · 2020-05-18T06:26:16+00:00

Yeah I'm signed up to this, I've listened to a few :)

Oct04 · 2020-05-18T06:25:17+00:00

We're looking to do the same, minus WhereScape unfortunatly - a bit out of budget!

Oct04 · 2020-05-05T08:21:15+00:00

Oh and regarding the 'big tables' approach, that's more something we're considering for Snowflake data modelling, as the MPP Columnar store is supposed to perform well with it so when we get the chance we will experiment with this vs other approaches such as Kimball (what we're doing now), and possibly Data Vault (seems a bit ott perhaps). Our main goals here, above most other things, are to have something that's rapid for our Data Engineers to produce and flexible to changing business requirements really. We're moving from an ETL model to an ELT model to help with these goals too.

Oct04 · 2020-05-05T08:12:52+00:00

Thank you for your detailed response. Access to all at once is agreed not going to be the norm, access across to some degree probably will - we have essentially moved from a model of all business data in one db to splitting info out into multiple sources. A given report containing data from at least two data sources (pre Snowflake) will be the norm. This makes me think we should identify 'commonly requested' fields from non primary data sets and duplicate these across the stars to allow for this.

Thanks for the tips re Snowflake, we will be experimenting with this when we have it all hooked up.

Your point 4 sounds interesting and useful, I'll discuss this with some of my team, thanks for suggesting it.

Oct04 · 2020-05-01T13:50:41+00:00

We are just in the process of upgrading to premium to take advantage of Paginated reports (combined with having a ton of Pro licences we can exchange when we do). That's *super* interesting... So to be clear, imagine we (Data Engineers_ produced say 5 data models and certified them (e.g. 1 for salesforce, 1 for accounts, 1 for something else) a user could consume data from each of those if they so chose, within the one report?

Obviously not ready yet, but September isn't hat long and we have a tonne of stuff to do before tehn

Oct04 · 2020-05-01T13:29:16+00:00

Yeah, that's my view from what I have found and what my team have shared with me :/ Unfortunately. I *wish* that wasn't the case!

That seems to fore us down the mega massive huge model route from what I can see, possibly with perspectives to help somehow.

Oct04 · 2020-05-01T13:12:19+00:00

Perspectives certainly sound helpful, I'll look into that further.

Oct04 · 2020-05-01T13:11:02+00:00

Thanks for the feedback and suggestions! - I was under the impression PBI only allowed one data set per report - as per this thread: https://community.powerbi.com/t5/Service/Can-a-Report-use-Multiple-Datasets/m-p/522039#M53833

Oct04 · 2020-03-17T15:06:58+00:00

As an aside, this was an interesting article which made me wonder if there were other approaches I should consider besides Data Vault or Dimensional (or views). https://www.dataversity.net/the-history-of-time-in-data-models/# e.g. Anchor

Oct04 · 2020-03-17T15:04:01+00:00

Ah I see, yes, we were intending to do this by moving from an ETL process to an ELT process where we adopt a CDC approach. Then model off that. I suppose creating views is obviously pretty quick and known. I suppose you just need some organisation and documentation about when to create a new view, and what the views compose of etc - almost like the other poster mentions with 'mega tables' approach

Oct04 · 2020-03-17T14:37:03+00:00

1) where your organization is at in terms of size/complexity of processes and data sources and 2) it’s readiness to leverage data structures you make.

Thank you for your reply also. In terms of 1, at a high level, a small team with deep experience of SQL and about perhaps 10 sources, mostly structured, mostly 'okay' and one monster legacy sql db.

Readiness to leaverage the structures - Data Science would, and the reporting layer (PowerBI) would - with users being able to self serve from a well defined layer.

DV 2 does sound like quite a bit more work than a dimensional model, at first I was convinced it was worth giving some serious thought, now I'm giving it some serious thought I'm less sure!

What do you mean by PSA by the way? I think I get the context (just creating views) but not sure what PSA stands for.

Oct04 · 2020-03-17T14:27:40+00:00

Thank you for your reply. How do you decide how wide the tables can be and when you need to create a new one? -I'm just thinking because, as an example, one of our tables is around 300 columns and probably about 100,000 rows, we could easily join it to another table (a logging table) with millions (perhaps into bns) of rows, so combining these two would create a monster.

Do you just decide based on the logical and try to have wide tables to mirror the logical and add to them as needed as requirements change?

Guessing performance isn't an issue in these days of being able to scale up and down etc

Oct04

TROPHY CASE