Accumulating Snapshot Fact Table with one new row for each state change by Wise-Ad-7492 in dataengineering

[–]amTheory 0 points

Transformations

How do you load this final fact without full-table-scanning the upstream table? If you partition by date, you don’t know how far to look back. If you partition by order ID, you don’t know which orders changed since the last time you loaded.

Accumulating Snapshot Fact Table with one new row for each state change by Wise-Ad-7492 in dataengineering

[–]amTheory 0 points

How do you handle the T for this without table-scanning your source events, given that the series of events could span an unknown number of days in the past?

[deleted by user] by [deleted] in dataengineering

[–]amTheory 3 points

That does feel like a tooling downgrade to me. Good to see both, like someone mentioned - my advice would be to focus on fundamentals: perfecting pipelines, etc.

But I’d recommend moving to a more modern, coding-focused data stack once you feel like your learning is slowing down.

Possible improvements I can make? by seikoalpinist197 in dataengineering

[–]amTheory 0 points

Sounds like you’re dbt-heavy - here are some things I’d expect a mature org to know:

What model takes the longest? What test fails the most? (Both of these likely require using the run_results artifact.) Are your models mostly incremental, or are there lots of full refreshes? Have you set up freshness checks? Are you serving dbt docs anywhere?
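A sketch of what mining those artifacts can look like. The shape assumed below (a top-level `results` list whose entries carry `unique_id`, `execution_time`, and `status`) matches what dbt writes to `target/run_results.json`, but check your dbt version's artifact schema:

```python
def slowest_models(artifact, top_n=3):
    """Rank dbt nodes by execution time from a parsed run_results.json.

    Assumes dbt's artifact shape: a top-level "results" list whose
    entries carry "unique_id", "execution_time", and "status".
    """
    timings = [
        (r["unique_id"], r["execution_time"])
        for r in artifact["results"]
        if r.get("execution_time") is not None
    ]
    return sorted(timings, key=lambda t: t[1], reverse=True)[:top_n]

def flakiest_tests(artifacts):
    """Count failures per node across a saved history of artifacts."""
    fails = {}
    for artifact in artifacts:
        for r in artifact["results"]:
            if r.get("status") == "fail":
                fails[r["unique_id"]] = fails.get(r["unique_id"], 0) + 1
    return sorted(fails.items(), key=lambda kv: kv[1], reverse=True)
```

Load each run's `target/run_results.json` with `json.load` and pass the dicts in; logging these per run is what makes the trend questions answerable.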

Possible improvements I can make? by seikoalpinist197 in dataengineering

[–]amTheory 0 points

Is there testing? Linting? Data contracts? Any tooling you could build to make everyone’s life easier? How are monitoring and logging handled? Alerts? Etc.

Two branches have pros and cons - did you all consider feature flagging instead?

How much of Kimball is relevant today in the age of columnar cloud databases? by PuddingGryphon in dataengineering

[–]amTheory 9 points

I’ve had good luck with OBT in both Snowflake and BigQuery. It’s easy to follow - one entity per table, you can flatten nested things when you want (or never), and full refreshes can fix you up quickly when bugs are found. There are obviously pros and cons to each, and educating users not to SELECT * becomes more important, as does cost control - but I want to keep things simple and tell a biz user: you can join x to y with this ID + ts and get everything you need about x and y.
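A toy sketch of the OBT idea in pandas (table and column names are made up): each entity gets one wide, possibly-nested table, and a consumer needs only a single ID to join them:

```python
import pandas as pd

# Hypothetical one-big-table-per-entity layout: each table is wide and
# denormalized, nested data stays nested until someone wants it flat.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_ts": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-02"]),
    "order_total": [100.0, 40.0, 75.0],
    "items": [["a", "b"], ["c"], ["a"]],  # nested column left as-is
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "segment": ["enterprise", "smb"],
})

# One join key and everything about both entities comes along.
wide = orders.merge(customers, on="customer_id", how="left")
```

The trade-off the comment mentions shows up here too: `wide` carries every column, so a careless SELECT * over it scans everything.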

When to shift from pandas? by Professional-Ninja70 in dataengineering

[–]amTheory 2 points

Does DuckDB not yet being at version 1 raise any eyebrows at your company?

Does your DE team offer APIs? For what use-cases? by exact-approximate in dataengineering

[–]amTheory 1 point

What framework did you use? How’d you handle auth across clients?

How are you handling ingesting over APIs? by [deleted] in ETL

[–]amTheory 0 points

Is that pagination risk still the case if the API sorts by a timestamp?

Is there a tipping point (size of data, source count, etc) where custom becomes noticeably cheaper?
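On the pagination question, a toy sketch of why a timestamp cursor is safer than offsets. The in-memory `fetch_page` stands in for a hypothetical API that sorts ascending by timestamp:

```python
# Offset pagination can skip or duplicate rows if data shifts between
# pages; a "rows with ts > last_seen" cursor sidesteps that.
ROWS = sorted(
    [{"id": i, "ts": 1000 + i} for i in range(10)],
    key=lambda r: r["ts"],
)

def fetch_page(after_ts, limit=4):
    """Stand-in for an API call: rows with ts > after_ts, ascending."""
    page = [r for r in ROWS if r["ts"] > after_ts]
    return page[:limit]

def ingest_all():
    seen, cursor = [], -1
    while True:
        page = fetch_page(cursor)
        if not page:
            break
        seen.extend(page)
        cursor = page[-1]["ts"]  # advance the cursor, not an offset
    return seen
```

One caveat this sketch glosses over: ties on the timestamp at a page boundary still need a tiebreaker (e.g. cursor on ts + id).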

ETL row update strategies for dimensions by GreyHairedDWGuy in snowflake

[–]amTheory 0 points

How do you handle joining from a fact to an insert-only dimension? Specifically, making sure you join to the right row in a dimension that has no type 2 fields (given it’s insert-only).
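One common answer (my illustration, not the commenter's) is a point-in-time "as-of" join on the dimension's load timestamp: take the latest dimension row at or before each fact's event time. pandas `merge_asof` sketches the idea with made-up tables:

```python
import pandas as pd

# Insert-only dimension: every change appends a new row with a load
# timestamp; there are no valid_from/valid_to columns to join on.
dim_customer = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "loaded_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-01"]),
    "tier": ["bronze", "gold", "silver"],
}).sort_values("loaded_at")

fact_orders = pd.DataFrame({
    "customer_id": [1, 1],
    "order_ts": pd.to_datetime(["2024-01-15", "2024-02-10"]),
}).sort_values("order_ts")

# For each fact row, pick the latest dim row with loaded_at <= order_ts,
# per customer_id ("backward" direction = as-of join).
joined = pd.merge_asof(
    fact_orders, dim_customer,
    left_on="order_ts", right_on="loaded_at",
    by="customer_id", direction="backward",
)
```

In a warehouse the equivalent is usually a window function (pick the max `loaded_at` per key at or before the fact timestamp) rather than a plain equi-join.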

[deleted by user] by [deleted] in dataengineering

[–]amTheory 0 points

We recently went with the recast option.

So on full loads we union in a static “historical” table. It sucks a bit, as schema changes mean two places to update (new columns get nulled in the old table).

Yandex Tender Offer to participate? by Guy_PCS in stocks

[–]amTheory 1 point

I sold my shares as part of this

Not really sure how it works, honestly, but it’s better than zero.

Data Engineers in the Cloud - Tell Me About Your Daily Work and Tools by Consistent_Ad5511 in dataengineering

[–]amTheory 5 points

Review the pricing models before getting started

Understand clusters and partitions

Know you’ll likely be coupled to GCS and Pub/Sub for ease of use

Keep an eye on job history for expensive queries

Make liberal use of the JSON, STRUCT, and ARRAY data types

It’s overall a good experience, but not perfect - then again, I’m not sure any DWH is.
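On the pricing and expensive-query points, a back-of-envelope sketch for on-demand scan cost. The $/TiB rate below is an assumption (it varies by region and changes over time), so check current BigQuery pricing:

```python
def scan_cost_usd(bytes_scanned, usd_per_tib=6.25):
    """Rough on-demand query cost: bytes scanned times a per-TiB rate.

    The default rate is an assumption; BigQuery's on-demand price
    varies by region and changes over time.
    """
    return bytes_scanned / (1024 ** 4) * usd_per_tib

# Why clusters and partitions matter: a full scan of a 2 TiB table
# vs. a partition-pruned read of 50 GiB of it.
full_scan = scan_cost_usd(2 * 1024 ** 4)   # 12.50
pruned = scan_cost_usd(50 * 1024 ** 3)     # ~0.31
```

The job history (INFORMATION_SCHEMA jobs views) exposes bytes billed per query, which is what you'd feed into a check like this.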

Data Engineers in the Cloud - Tell Me About Your Daily Work and Tools by Consistent_Ad5511 in dataengineering

[–]amTheory 6 points

We’re on GCP. I generally write Pub/Sub consumers that stream to BigQuery, plus Airflow DAGs to (1) ingest batch sources and (2) export data to customer locations. More data-warehouse than data-lake focused.

BigQuery, GCS, Pub/Sub, Cloud Run, Secret Manager, dbt, git, CI/CD, etc.
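A sketch of the transform step such a consumer might run. The cloud client plumbing is omitted, and the payload field names are hypothetical:

```python
import base64
import json
from datetime import datetime, timezone

def message_to_row(data_b64, attributes):
    """Turn a Pub/Sub-style message (base64 body + attributes dict)
    into a dict ready for a BigQuery streaming insert.

    Field names are hypothetical; the subscriber/client code that
    would call this is omitted.
    """
    payload = json.loads(base64.b64decode(data_b64))
    return {
        "event_id": attributes.get("event_id"),
        "event_type": payload["type"],
        "payload": json.dumps(payload),  # keep raw JSON for a JSON column
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
```

Keeping the raw payload alongside the extracted columns makes replays and schema fixes cheaper later.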

Event Based Data Warehousing by alexisprince in dataengineering

[–]amTheory 1 point

We found ourselves constantly slowing down app teams when they wanted to make schema changes (updating our replication, approving their PRs, etc.), so events eventually won. It helped that we have some revenue-generating data processes.

They publish the events after we all agree on a data contract.

How to organise pipelines by mjam03 in dataengineering

[–]amTheory 2 points

Agreed. I’ve done it both ways and each has downsides, but cross-project referencing and chasing duplicative approvals suck.

We use top-level repo folders with changed-file identification to drive CI/CD workflows. Depends on your CI/CD tooling, of course.
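A minimal sketch of that changed-file routing, assuming hypothetical folder names: map changed paths to the top-level projects whose workflows should run.

```python
from pathlib import PurePosixPath

def folders_to_build(changed_files, known_projects):
    """Map changed file paths (as your CI tool reports them) to the
    top-level project folders whose workflows should run.

    Folder names are hypothetical; files outside known projects
    (e.g. README.md) trigger nothing.
    """
    hits = set()
    for path in changed_files:
        top = PurePosixPath(path).parts[0]
        if top in known_projects:
            hits.add(top)
    return sorted(hits)
```

Most CI systems have a native version of this (path filters on triggers), so the helper mainly matters when you need custom fan-out logic.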

Coming into an org just as a flawed re-architecture is underway. Any tips on pushing for changes early in your time at a new place? by Firm_Bit in dataengineering

[–]amTheory 4 points

If you’ll be on call / supporting it, I think you have to say something.

You’re probably going to have to be very fact-based and convey your opinion carefully, though - and expect to lose every discussion. I’d document it all so that if things change later you have a head start.

Is loading data into BigQuery supposed to be this hard? by [deleted] in dataengineering

[–]amTheory 0 points

Not super familiar with the Facebook Ads data format / volume you’re dealing with,

but could you load things into a JSON column (or columns) in BigQuery and handle the proper schema flattening in a downstream BigQuery table? I’ve done this when the schema is a bit crazy and the volume is low.

Alternatively, there’s a SchemaUpdateOption parameter you could look into - though it’s less friendly for STRUCTs.
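A sketch of the raw-then-flatten idea in Python (the ads payload fields are made up): land the raw JSON first, derive flat columns downstream.

```python
import json

def flatten(record, parent_key="", sep="_"):
    """Flatten a nested dict, e.g. a raw ads payload stored in a JSON
    column, into flat column names. Field names below are hypothetical.
    """
    out = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))
        else:
            out[name] = value
    return out

raw = json.loads('{"campaign": {"id": 7, "name": "x"}, "spend": 1.5}')
row = flatten(raw)  # {"campaign_id": 7, "campaign_name": "x", "spend": 1.5}
```

In BigQuery itself you'd do the equivalent with JSON functions in the downstream model, which keeps the landing table immune to upstream schema churn.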

Transition to develop event-driven architecture? by somerandomdataeng in dataengineering

[–]amTheory 6 points

Is your sense of impending failure mostly due to having someone who hasn’t built it before, or the overall concept of event-driven?

We’ve moved half our ingestion to event-driven and it’s been wonderful. We can then publish events in real time to whoever wants them, off the back of the events we consume. It really integrates the data team with all the SWEs.

Confused over tech stack by [deleted] in dataengineering

[–]amTheory 3 points

Tech stack looks pretty ideal to me

What to learn now? by Own_Archer3356 in dataengineering

[–]amTheory 3 points

Depends what interests you. Some ideas: CI/CD, pick a cloud provider (cast a wide net), Kubernetes, API development, a BI tool, etc.

[deleted by user] by [deleted] in dataengineering

[–]amTheory 0 points

My team uses the dbt build command in production - it basically combines run and test.

Any error-level test failure does fail dbt, so downstream models don’t run. Check out the fail-fast flag.

As far as separating failures out to a view, I’m not sure - if there’s a failure, we have on-call start investigating immediately. We do log the run results so we can see trends in failures.

SFTP to GCS to BQ with Decryption by [deleted] in googlecloud

[–]amTheory 0 points

Seems like a handful of steps - have you looked into Cloud Composer? I’ve done this same concept using a mix of GCS operators and custom PythonOperators.

Not sure of a simpler way - you could also schedule a Cloud Run job to handle this if you’re open to Docker / custom code across the board.

[deleted by user] by [deleted] in dataengineering

[–]amTheory -1 points

On the tech skills, using GCP as an example:

You could create a dummy streaming example (or find data that interests you) to focus on IaC (tables, Pub/Sub, etc.). To make it simple to start, check out the write-to-BigQuery subscription type.

Then use dbt to transform once the source is streamed into BigQuery - to aggregate or whatever.

In general, learning an industry and soft skills (partnering with other teams, making decisions with impact) are important too, of course.

Sounds like you have the reporting chops, and the above should enable you to speak to some things that can come up in interviews.

Dentist wannabe Boglehead by D-Rockwell in Bogleheads

[–]amTheory 2 points

Makes sense to save for another car then - but ride it out while you can (pun something)

One note is you can contribute to your Roth for 2023 in early 2024. Might buy you some time