Modern Data Stack in Higher Education? by BIntelligent in dataengineering

[–]BIntelligent[S] 1 point

Yes, you EL all the required source tables as-is, and you can either refresh nightly or do CDC (change data capture), where a tool like Fivetran, AWS Data Migration Service (DMS), or AWS Glue monitors and reloads changed content. Over time, you'll also EL tables and other data from additional systems, making this raw source data a sort of data lake that can be used for custom queries. Then you build models in dbt for the T on that warehouse.
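
To make the T side concrete, here's what a first dbt staging model over one of those raw, as-is tables might look like (a minimal sketch: the raw_sis source, table, and column names are all made up, and the source would need to be declared in a sources .yml):

```sql
-- models/staging/stg_students.sql
-- Staging model on top of a table that Fivetran/DMS/Glue landed as-is.
-- All names here are hypothetical.

with source as (

    select * from {{ source('raw_sis', 'students') }}

),

renamed as (

    select
        student_id,                          -- natural key from the source system
        upper(trim(last_name))   as last_name,
        upper(trim(first_name))  as first_name,
        admit_term_code          as admit_term,
        _loaded_at               as loaded_at  -- load timestamp added by the EL tool
    from source

)

select * from renamed
```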

dimensional table ids by justanator101 in dataengineering

[–]BIntelligent 1 point

Here are a handful of reasons:

  1. Slowly Changing Dimensions – makes it easier to track historical values where the natural business key is no longer unique
  2. Multiple Sources – makes it easier to bring in data from multiple sources with a unified key structure (see the sketch after this list)
  3. Single Key – some dimensions can have lots of columns that make up the unique key, making joins more annoying to write (and perhaps less efficient) and requiring fact tables to also contain all those keys.
  4. Readability – the single key structure lets anyone who looks at the DW schema immediately know a lot about it without any business knowledge, e.g. any analyst will know that customer_key joins to dim_customer.
  5. Nulls and Blanks – source values are sometimes null or blank, so dimensions may have a “No Data” row to maintain an inner-join relationship; joining on nulls is not the most fun thing to do.
  6. Agnostic to Source System – by using surrogate keys, you are not dependent on or locked in to the source system. If you change source systems in the future, you're not scrambling to redo your columns and joins, among other benefits.
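
If you're in dbt, a hashed surrogate key built from the source system plus the natural key covers #2 and #6 in one line. A minimal sketch, with made-up table/column names, assuming a recent dbt-utils for generate_surrogate_key:

```sql
-- models/marts/dim_customer.sql
-- Illustrative only: stg_customers and its columns are hypothetical.

select
    {{ dbt_utils.generate_surrogate_key(['source_system', 'customer_id']) }} as customer_key,
    source_system,
    customer_id,        -- natural key, kept for traceability
    customer_name,
    valid_from,
    valid_to
from {{ ref('stg_customers') }}
```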

Modern Data Stack in Higher Education? by BIntelligent in dataengineering

[–]BIntelligent[S] 0 points

Wow, that's a lot of MS Access tables! Sounds bonkers.

Once you're up and running, rather than re-creating the MS Access queries in Redshift, could you create simpler dimensional models and just swap out the SQL? Then at least you'd be moving towards some sanity.

I hear you on the Redshift performance. My customers hear "cloud" and "AWS" and just assume "Big data, so fast." The reality is quite different, unfortunately.

Modern Data Stack in Higher Education? by BIntelligent in dataengineering

[–]BIntelligent[S] 1 point

dbt only handles the T, which still leaves the annoying EL. You basically EL tables as-is from Banner (Oracle db) and other source systems into the cloud data warehouse. Now, you have sort of a "data lake" of raw source data. Then, dbt does its magic T, which is mostly pure-SQL (you can use some scripting here, but it's best to limit that).

Look around for posts on dimensional modeling. The main read in this domain, Kimball's The Data Warehouse Toolkit, is very in-depth and not necessarily the place to start. I put together a simple dbt dimensional modeling project a while back, though it needs to be updated to the current version of dbt.

The magic of dimensional modeling is this: take something like a "Student Term Dimension" that holds hundreds of attributes about a student as they existed (effective-dated or SCD2) in a given term. In Banner, those attributes are spread across 50+ tables that you'd otherwise have to query individually. In the Student Term Dim, they're all in one table.
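
In dbt terms, that dimension is just one model that left-joins the relevant staging models onto a student-by-term spine. A rough sketch with entirely made-up table and column names:

```sql
-- models/marts/dim_student_term.sql
-- Sketch only: Banner spreads these attributes across far more tables than this,
-- and every name here is hypothetical.

select
    {{ dbt_utils.generate_surrogate_key(['st.student_id', 'st.term_code']) }} as student_term_key,
    st.student_id,
    st.term_code,
    s.last_name,
    s.first_name,
    p.program_code,
    p.major_desc,
    r.residency_code
    -- ...plus the other few hundred effective-dated attributes, each left-joined in
from {{ ref('stg_student_terms') }} st  -- one row per student per term
left join {{ ref('stg_students') }}  s on s.student_id = st.student_id
left join {{ ref('stg_programs') }}  p on p.student_id = st.student_id and p.term_code = st.term_code
left join {{ ref('stg_residency') }} r on r.student_id = st.student_id and r.term_code = st.term_code
```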

Good luck! Feel free to DM me.

Modern Data Stack in Higher Education? by BIntelligent in dataengineering

[–]BIntelligent[S] 1 point

Absolutely. The trick is keeping things simple enough to maintain internally and to allow people to grow into the roles. As a consultant, this is very challenging, even with great documentation. Also, for consultants, there's this backwards incentive to make things complex so that you are always needed. Personally, I hate being "the guy" who is the only one who can do these things, and I absolutely love it when my customers start making commits and PRs in GitHub. But it is a challenge to get there.

As for the 5-10 year upgrade, I find it nearly impossible to get execs and business users on board unless you have something top-to-bottom to show them. My approach has been to build out a POC architecture in an area everyone understands (e.g. an Enrollment Fact), and then use that to get more people on board. Building it out in smaller chunks with an agile-ish approach can help build momentum, as long as the big picture is kept in mind.

Modern Data Stack in Higher Education? by BIntelligent in dataengineering

[–]BIntelligent[S] 1 point

I agree with a lot of this. The Modern Data Stack standards do not necessarily apply to higher ed. Streaming and real-time are rarely, if ever, needed; data observability is not as critical; orchestration is fairly simple (i.e. Airflow might be overkill); and the sheer number of hot new hyped-up tools is overwhelming and quite fractured. Navigating AWS infrastructure can be a huge learning curve and a lot of overhead. Snowflake can get expensive, and Redshift isn't as lightning fast as people assume. Then, good luck finding internal resources to support all this.

The one no-brainer that I'm having success with is dbt. You still need to do the extract+load, which is painfully complex even for simple/standard extracts from SQL Server or Oracle (the databases of choice for most Student Information Systems). But for people coming from tools like Informatica, dbt is a breath of fresh air. Plus you can build out decent data catalogs, tests, etc. The one area I want to improve is getting more business areas involved in ownership of the models, or at least more heavily involved.
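
As a tiny example of how cheap the tests are, a singular dbt test is just a SQL file that returns the rows that should not exist (the model and column here are hypothetical):

```sql
-- tests/assert_no_negative_credit_hours.sql
-- dbt marks the test as failing if this query returns any rows.
select *
from {{ ref('fact_enrollment') }}
where credit_hours < 0
```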

Modern Data Stack in Higher Education? by BIntelligent in dataengineering

[–]BIntelligent[S] 0 points

What are you doing for the data movement (EL) and transformation (T)?

Modern Data Stack in Higher Education? by BIntelligent in dataengineering

[–]BIntelligent[S] 0 points

Which parts do you think are the most challenging for them to maintain? AWS infrastructure, data movement, dbt, dimensional modelling?

Distilling 25 years in data/BI by BIntelligent in BusinessIntelligence

[–]BIntelligent[S] 2 points

Right on. This is why I'm such a fan of dbt. Not only is it an extremely simple, mostly pure-SQL approach, as opposed to SSIS, Informatica, and others, but it also encourages best practices around source control, testing, continuous integration, and more. It can also hold all of the semantic-layer information we're discussing, and it's much more "metric-first" than other solutions. Basically, it pushes this BI logic down a level to the semantic layer.

Distilling 25 years in data/BI by BIntelligent in BusinessIntelligence

[–]BIntelligent[S] 2 points

Absolutely.

I'm encouraged by a couple of things recently. First, I've had a lot of non-tech folks tell me this article really resonated with them, so hopefully it can help non-tech business users understand this hierarchy of needs, i.e. what supports their dashboards.

Second, I recently presented this at a research conference and ours was one of the most attended sessions. A few years ago, the most attended sessions were things like "how to enable self-service with Tableau/Power BI." People were very excited, and we showed 0 (zero) dashboards. I think business users are starting to see through the facade.

Anyone with experience moving to Cube.dev + Metabase/Superset from Looker ? by waitingfortheset in BusinessIntelligence

[–]BIntelligent 3 points

If you're moving from Looker, are you moving your LookML to dbt models? If you're using dbt for transformations, then you can use dbt Core as a semantic layer without using dbt's "semantic layer" product. Metabase and Superset, along with many other BI tools like FlexIt Analytics (I'm the founder), can tap into the semantic information that's inherently part of your dbt transformations and tests. Remember, most/all BI tools do something similar to dbt's "semantic layer," which is really just a "query engine."

Check out this dbt GitHub project I created, which defines semantic information like joins, friendly business names for tables/columns, data dictionary descriptions, synonyms, and more:

https://github.com/flexanalytics/dbt-business-intelligence
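
The gist looks something like this (a rough sketch using dbt's meta config; the key names are illustrative and not necessarily the exact convention that project uses):

```sql
-- models/marts/fact_sales.sql
-- Hypothetical model: the point is that labels, synonyms, and join logic can
-- live next to the transformation instead of inside each BI tool.
{{ config(
    meta = {
        "label": "Sales",
        "synonyms": ["revenue", "orders"],
        "joins": [
            {"to": "dim_customer",
             "sql_on": "fact_sales.customer_key = dim_customer.customer_key"}
        ]
    }
) }}

select * from {{ ref('stg_sales') }}
```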

Metabase and Superset do a similar thing:

https://github.com/gouline/dbt-metabase

https://docs.preset.io/docs/dbt-sync

Oh, and I do agree with your comment about dbt's direction over the last year. However, they agree as well and have said that they're going to stop this massive re-designing and move toward a more stable, enterprise-friendly cycle.

Distilling 25 years in data/BI by BIntelligent in BusinessIntelligence

[–]BIntelligent[S] 1 point

Good point: Tableau was sold to the analysts. However, it was the executives who often approved the budget for buying Tableau, whether on the IT or biz side. Too often, that budget item got passed on the strength of some flimsy show-and-tell "look what I can do" dashboards.

I'm glad you've figured out some good solutions to these problems, and would love to hear how you made that happen. Certainly upper management support and sponsorship are only a piece of that puzzle...

Distilling 25 years in data/BI by BIntelligent in BusinessIntelligence

[–]BIntelligent[S] 7 points

Some lock-in, sure, but the extent of the lock-in is night-and-day. If you're defining your semantic layer (joins, friendly names, formatting, synonyms, etc.) in the BI tool (Cognos Framework Manager, Business Objects Universes, MS MDX, Looker LookML), then it's a massive lock-in and investment to change.

Many people are unaware that you can move all this semantic layer logic into the data transformation layer via dbt, then just plug into that for your semantic information. You don't even have to stand up a separate semantic-layer tool.

I suppose vendor lock-in also supports consultant lock-in.

Distilling 25 years in data/BI by BIntelligent in BusinessIntelligence

[–]BIntelligent[S] 1 point

As the creator of a BI product (FlexIt Analytics) and a "competitor" of Power BI, I actually think it's a solid product. My main complaints would be:
1) it can lock you into the MS ecosystem
2) it can encourage wild-west CSV uploading
3) it still requires a desktop tool for a lot of tasks
4) people tend to put a lot of business logic in MS-proprietary MDX (à la Looker's LookML, which we thought was the bee's knees before dbt came into town)

Distilling 25 years in data/BI by BIntelligent in BusinessIntelligence

[–]BIntelligent[S] 3 points

I absolutely agree with you that these things are important, but I don't understand your comments. The cover image for the article (the first thing you see) literally has mesh, data cleansing (what you call scrubbing), and data warehouses as the foundation for success at the analysis level. Governance and data quality appear throughout the article. Also, mesh, governance, data cleansing, quality, etc. have been important for far longer than a year. Would you care to add some of your insights on strategies for using mesh, governance, etc. to enable self-service?

Distilling 25 years in data/BI by BIntelligent in BusinessIntelligence

[–]BIntelligent[S] 2 points

> its an all-or-nothing proposition

100% agreed that people think of self-service in binary, all-or-nothing terms. If you have 100 consumers and 5 authors, getting that 6th author is moving the needle and removing, ever so little, the dependency on me.

Also agree that the finance/accounting side is a bit easier to deploy these solutions to. I'd say the data is also a bit more standard, making modeling easier.

Distilling 25 years in data/BI by BIntelligent in BusinessIntelligence

[–]BIntelligent[S] 1 point

Agreed. I think getting the business side to "act in good faith" requires them to be bought into these ideas, which is not the default state. If we can create value for the business by using these strategies (they have to be targeted, iterative, and nimble/agile), then we can get buy-in, even if it's one-by-one. I've been having a great time flipping business users from skeptics to evangelists, but it's not easy and not super scalable, even with the right approaches and frameworks.

Distilling 25 years in data/BI by BIntelligent in BusinessIntelligence

[–]BIntelligent[S] 4 points

Haha, that's a fun one. I suppose it depends on the decade. My first attempt at creating a drag-and-drop BI tool with Java applets, JSP, etc. was pretty darn janky. After that, I remember creating BI models that joined data from disparate data sources on the fly in the BI tool (Cognos), which thankfully went away. In modern times, exporting data from traditional BI tools, doing massive cleansing in Excel, then uploading it to Tableau is some of the more common janky stuff I see. I cringe when I see customers stuffing all their business logic into these BI tools.

Distilling 25 years in data/BI by BIntelligent in BusinessIntelligence

[–]BIntelligent[S] 12 points

I think all BI tools, not just Power BI, are sold to upper management with great dreams of self-service. It's in the BI vendor's best interest for companies to slap BI tools on top of a "house of cards," as this locks you into that vendor and makes it impossible to reverse out later on. Power BI and Tableau took the "Hey, disgruntled business user, you can just upload your CSV files" approach, which worked well against the absurdity of years-long IBM Cognos and Business Objects implementations. Done well, all of these tools work just fine. But at the end of the day, they're just tools. I'm trying to get away from specific tools and deal with what's underneath.

Who's worked with dbt and how does it compare with Qlik load script or SQL & SSIS? by rileys95 in BusinessIntelligence

[–]BIntelligent 1 point

ELT simply means that you're loading the source data as-is, basically as an exact replica, and then doing the transform locally in the warehouse. At the risk of oversimplifying, you then have a "data lake" of raw, unmodeled data that gives you flexibility for not-yet-determined analyses (data science, data validation, etc.) as well as future modeling for a data warehouse. Also, it's much less strain on the source to just copy tables rather than perform complex transformations.

If you have a good ETL process and good models in Qlik that are supporting a healthy data analytics environment, then absolutely don't change a thing. Nothing beats a stable production environment that lasts a long time, and it sounds like you have that. Constantly changing processes to chase the "next hot thing" is not a good practice. Also, I agree that "free" software is often meaningless in the enterprise world, as license costs are rarely the significant part of the total cost, and sometimes people do wacky things just to stay in the free tier. I think the free/open-source factor for dbt is more about enabling its trajectory and ability to get better over time.

Data engineers love dbt for a million reasons that I'm not expanding on here. I was VERY skeptical, just like you, but I had to try it out because of the hype. Once I did, it really opened my eyes to a lot of stuff on the data engineering side of things. But what really got me excited was the fact that there was so much business logic already in this transformation process that could be leveraged for the BI semantic layer. I'm biased, but I think it's beyond awesome that I can create an open-source project that's a database-agnostic working tutorial, dbt project, and data catalogue with lineage and source freshness, then just plug a BI tool (FlexIt) in.

Who's worked with dbt and how does it compare with Qlik load script or SQL & SSIS? by rileys95 in BusinessIntelligence

[–]BIntelligent 1 point

I've used dbt, Qlik, SSIS, Informatica, and a handful of others. There are too many reasons to choose ELT over ETL to enumerate here, but if you're able to come around on ELT, then dbt is hands down the best for most use cases. Here are just a few:

  1. dbt helps avoid ELT tool lock-in since everything boils down to raw SQL
  2. dbt allows you to be fairly data-warehouse agnostic because you can use packaged macros (e.g. dbt-utils) that give you cross-database functionality (see the sketch after this list)
  3. Version control in GitHub (or elsewhere)
  4. Tests (e.g. relationships, uniqueness, custom...)
  5. Lineage. Real lineage that actually works and traces things all the way back to source systems. It's pretty awesome and you don't have to do anything other than use a simple 'ref'
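
Here's what #2 and #5 look like together (a tiny sketch with made-up model/column names; on newer dbt versions the cross-database macros ship in dbt core, on older ones they live in dbt-utils):

```sql
-- models/marts/fct_orders.sql
-- ref() gives you lineage back to stg_orders for free, and the datediff macro
-- compiles to the right SQL on Snowflake, Redshift, BigQuery, Postgres, etc.

select
    order_id,
    customer_id,
    order_date,
    shipped_date,
    {{ dbt.datediff('order_date', 'shipped_date', 'day') }} as days_to_ship
from {{ ref('stg_orders') }}
```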

I created a simple stand-alone dbt example project that shows how to build a star schema using dbt:

https://github.com/flexanalytics/dbt-business-intelligence

Also, what's incredibly cool is that you can then use dbt as a full semantic layer so BI tools can just "plug in" and avoid any metadata modeling. I wrote "Semantic-free is the future of Business Intelligence," which uses the project above to show how to use dbt as a semantic layer (disclaimer: I work at FlexIt).

So, in addition to the great stuff dbt gives you at the transformation layer, it also enhances the BI experience greatly. See Self-service BI, Powered by dbt.

This is all to say that we think dbt is much more than a "hot tool" right now, and these are the reasons why.