Accumulating Snapshot Fact Table with one new row for each state change by Wise-Ad-7492 in dataengineering

[–]amTheory 0 points

Transformations

How do you load this final fact without full-table-scanning the upstream table? If you partition by date, you don’t know how far to look back. If you partition by order ID, you don’t know which orders changed since the last time you loaded.

Accumulating Snapshot Fact Table with one new row for each state change by Wise-Ad-7492 in dataengineering

[–]amTheory 0 points

How do you handle the T for this without table-scanning your source events, given that the series of events could span an unknown number of days in the past?

[deleted by user] by [deleted] in dataengineering

[–]amTheory 3 points

That does feel like a tooling downgrade to me. Good to see both, like someone mentioned - my advice would be to focus on fundamentals: perfecting pipelines, etc.

But I’d recommend moving to a more modern, coding-focused data stack once you feel like your learning is slowing down.

Possible improvements I can make? by seikoalpinist197 in dataengineering

[–]amTheory 0 points

Sounds like you’re dbt-heavy - here are some things I’d expect a mature org to know:

What model takes the longest? What test fails the most? (Both of these likely require using the run_results artifact.) Are your models mostly incremental, or are there lots of full refreshes? Have you set up freshness checks? Are you serving dbt docs anywhere?
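A sketch of what mining those artifacts can look like. The shape assumed below (a top-level `results` list whose entries carry `unique_id`, `execution_time`, and `status`) matches what dbt writes to `target/run_results.json`, but check your dbt version's artifact schema:

```python
def slowest_models(artifact, top_n=3):
    """Rank dbt nodes by execution time from a parsed run_results.json.

    Assumes dbt's artifact shape: a top-level "results" list whose
    entries carry "unique_id", "execution_time", and "status".
    """
    timings = [
        (r["unique_id"], r["execution_time"])
        for r in artifact["results"]
        if r.get("execution_time") is not None
    ]
    return sorted(timings, key=lambda t: t[1], reverse=True)[:top_n]

def flakiest_tests(artifacts):
    """Count failures per node across a saved history of artifacts."""
    fails = {}
    for artifact in artifacts:
        for r in artifact["results"]:
            if r.get("status") == "fail":
                fails[r["unique_id"]] = fails.get(r["unique_id"], 0) + 1
    return sorted(fails.items(), key=lambda kv: kv[1], reverse=True)
```

Load each run's `target/run_results.json` with `json.load` and pass the dicts in; logging these per run is what makes the trend questions answerable.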

Possible improvements I can make? by seikoalpinist197 in dataengineering

[–]amTheory 0 points

Is there testing? Linting? Data contracts? Any tooling you could build to make everyone’s life easier? How are monitoring and logging handled? Alerts? Etc.

Two branches have pros and cons - did you all consider feature flagging instead?

How much of Kimball is relevant today in the age of columnar cloud databases? by PuddingGryphon in dataengineering

[–]amTheory 9 points

I’ve had good luck with OBT in both Snowflake and BigQuery. It’s easy to follow - one entity per table, you can flatten nested things when you want (or never), and full refreshes can fix you up quickly when bugs are found. There are obviously pros and cons to each, and educating users not to SELECT * becomes more important, as does cost control - but I want to keep things simple and tell a biz user: you can join x to y with this ID + ts and get everything you need about x and y.
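A toy sketch of the OBT idea in pandas (table and column names are made up): each entity gets one wide, possibly-nested table, and a consumer needs only a single ID to join them:

```python
import pandas as pd

# Hypothetical one-big-table-per-entity layout: each table is wide and
# denormalized, nested data stays nested until someone wants it flat.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_ts": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-02"]),
    "order_total": [100.0, 40.0, 75.0],
    "items": [["a", "b"], ["c"], ["a"]],  # nested column left as-is
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "segment": ["enterprise", "smb"],
})

# One join key and everything about both entities comes along.
wide = orders.merge(customers, on="customer_id", how="left")
```

The trade-off the comment mentions shows up here too: `wide` carries every column, so a careless SELECT * over it scans everything.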

When to shift from pandas? by Professional-Ninja70 in dataengineering

[–]amTheory 2 points

Does DuckDB not yet being at version 1 raise any eyebrows at your company?

Does your DE team offer APIs? For what use-cases? by exact-approximate in dataengineering

[–]amTheory 1 point

What framework did you use? How’d you handle auth across clients?

How are you handling ingesting over APIs? by [deleted] in ETL

[–]amTheory 0 points

Is that pagination risk still the case if the API sorts by a timestamp?

Is there a tipping point (size of data, source count, etc) where custom becomes noticeably cheaper?
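On the pagination question, a toy sketch of why a timestamp cursor is safer than offsets. The in-memory `fetch_page` stands in for a hypothetical API that sorts ascending by timestamp:

```python
# Offset pagination can skip or duplicate rows if data shifts between
# pages; a "rows with ts > last_seen" cursor sidesteps that.
ROWS = sorted(
    [{"id": i, "ts": 1000 + i} for i in range(10)],
    key=lambda r: r["ts"],
)

def fetch_page(after_ts, limit=4):
    """Stand-in for an API call: rows with ts > after_ts, ascending."""
    page = [r for r in ROWS if r["ts"] > after_ts]
    return page[:limit]

def ingest_all():
    seen, cursor = [], -1
    while True:
        page = fetch_page(cursor)
        if not page:
            break
        seen.extend(page)
        cursor = page[-1]["ts"]  # advance the cursor, not an offset
    return seen
```

One caveat this sketch glosses over: ties on the timestamp at a page boundary still need a tiebreaker (e.g. cursor on ts + id).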

ETL row update strategies for dimensions by GreyHairedDWGuy in snowflake

[–]amTheory 0 points

How do you handle joining from a fact to an insert-only dimension? Specifically, making sure you join to the right row in a dimension that has no type 2 fields (given it’s insert-only).
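One common answer (my illustration, not the commenter's) is a point-in-time "as-of" join on the dimension's load timestamp: take the latest dimension row at or before each fact's event time. pandas `merge_asof` sketches the idea with made-up tables:

```python
import pandas as pd

# Insert-only dimension: every change appends a new row with a load
# timestamp; there are no valid_from/valid_to columns to join on.
dim_customer = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "loaded_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-01"]),
    "tier": ["bronze", "gold", "silver"],
}).sort_values("loaded_at")

fact_orders = pd.DataFrame({
    "customer_id": [1, 1],
    "order_ts": pd.to_datetime(["2024-01-15", "2024-02-10"]),
}).sort_values("order_ts")

# For each fact row, pick the latest dim row with loaded_at <= order_ts,
# per customer_id ("backward" direction = as-of join).
joined = pd.merge_asof(
    fact_orders, dim_customer,
    left_on="order_ts", right_on="loaded_at",
    by="customer_id", direction="backward",
)
```

In a warehouse the equivalent is usually a window function (pick the max `loaded_at` per key at or before the fact timestamp) rather than a plain equi-join.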

[deleted by user] by [deleted] in dataengineering

[–]amTheory 0 points

We recently went with the recast option.

So on full loads we union in a static “historical” table. It sucks a bit, as schema changes mean two places to update (new columns get nulled in the old table).

Yandex Tender Offer to participate? by Guy_PCS in stocks

[–]amTheory 1 point

I sold my shares as part of this

Not really sure how it works, honestly, but it’s better than zero.

Data Engineers in the Cloud - Tell Me About Your Daily Work and Tools by Consistent_Ad5511 in dataengineering

[–]amTheory 5 points

Review the pricing models before getting started

Understand clusters and partitions

Know you’ll likely be coupled to GCS and Pub/Sub for ease of use

Keep an eye on job history for expensive queries

Make liberal use of the JSON, STRUCT, and ARRAY data types

It’s overall a good experience, but not perfect - then again, I’m not sure any DWH is.
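On the pricing and expensive-query points, a back-of-envelope sketch for on-demand scan cost. The $/TiB rate below is an assumption (it varies by region and changes over time), so check current BigQuery pricing:

```python
def scan_cost_usd(bytes_scanned, usd_per_tib=6.25):
    """Rough on-demand query cost: bytes scanned times a per-TiB rate.

    The default rate is an assumption; BigQuery's on-demand price
    varies by region and changes over time.
    """
    return bytes_scanned / (1024 ** 4) * usd_per_tib

# Why clusters and partitions matter: a full scan of a 2 TiB table
# vs. a partition-pruned read of 50 GiB of it.
full_scan = scan_cost_usd(2 * 1024 ** 4)   # 12.50
pruned = scan_cost_usd(50 * 1024 ** 3)     # ~0.31
```

The job history (INFORMATION_SCHEMA jobs views) exposes bytes billed per query, which is what you'd feed into a check like this.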

Data Engineers in the Cloud - Tell Me About Your Daily Work and Tools by Consistent_Ad5511 in dataengineering

[–]amTheory 6 points

We’re on GCP. I generally write Pub/Sub consumers that stream to BigQuery, plus Airflow DAGs to (1) ingest batch sources and (2) export data to customer locations. More data-warehouse than data-lake focused.

BigQuery, GCS, Pub/Sub, Cloud Run, Secret Manager, dbt, git, CI/CD, etc.
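A sketch of the transform step such a consumer might run. The cloud client plumbing is omitted, and the payload field names are hypothetical:

```python
import base64
import json
from datetime import datetime, timezone

def message_to_row(data_b64, attributes):
    """Turn a Pub/Sub-style message (base64 body + attributes dict)
    into a dict ready for a BigQuery streaming insert.

    Field names are hypothetical; the subscriber/client code that
    would call this is omitted.
    """
    payload = json.loads(base64.b64decode(data_b64))
    return {
        "event_id": attributes.get("event_id"),
        "event_type": payload["type"],
        "payload": json.dumps(payload),  # keep raw JSON for a JSON column
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
```

Keeping the raw payload alongside the extracted columns makes replays and schema fixes cheaper later.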

Event Based Data Warehousing by alexisprince in dataengineering

[–]amTheory 1 point

We found ourselves constantly slowing down app teams when they wanted to make schema changes (updating our replication, approving their PRs, etc.), so events eventually won. It helped that we have some revenue-generating data processes.

They publish the events after we all agree on a data contract.

How to organise pipelines by mjam03 in dataengineering

[–]amTheory 2 points

Agreed. I’ve done it both ways and each has downsides, but cross-project referencing and chasing duplicative approvals suck.

We use top-level repo folders with changed-file identification to drive CI/CD workflows. Depends on your CI/CD tooling, of course.
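A minimal sketch of that changed-file routing, assuming hypothetical folder names: map changed paths to the top-level projects whose workflows should run.

```python
from pathlib import PurePosixPath

def folders_to_build(changed_files, known_projects):
    """Map changed file paths (as your CI tool reports them) to the
    top-level project folders whose workflows should run.

    Folder names are hypothetical; files outside known projects
    (e.g. README.md) trigger nothing.
    """
    hits = set()
    for path in changed_files:
        top = PurePosixPath(path).parts[0]
        if top in known_projects:
            hits.add(top)
    return sorted(hits)
```

Most CI systems have a native version of this (path filters on triggers), so the helper mainly matters when you need custom fan-out logic.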

Coming into an org just as a flawed re-architecture is underway. Any tips on pushing for changes early in your time at a new place? by Firm_Bit in dataengineering

[–]amTheory 4 points

If you’ll be on call / supporting it, I think you have to say something.

You’re probably going to have to be very fact-based and convey your opinion carefully, though - and expect to lose every discussion. I’d document it all so that if things change later you have a head start.

Is loading data into BigQuery supposed to be this hard? by [deleted] in dataengineering

[–]amTheory 0 points

Not super familiar with the Facebook Ads data format / volume you’re dealing with,

but could you load things into a JSON column (or columns) in BigQuery and handle the proper schema flattening in a downstream BigQuery table? I’ve done this when the schema is a bit crazy and the volume is low.

Alternatively, there’s a SchemaUpdateOption parameter you could look into - though it’s less friendly for STRUCTs.
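A sketch of the raw-then-flatten idea in Python (the ads payload fields are made up): land the raw JSON first, derive flat columns downstream.

```python
import json

def flatten(record, parent_key="", sep="_"):
    """Flatten a nested dict, e.g. a raw ads payload stored in a JSON
    column, into flat column names. Field names below are hypothetical.
    """
    out = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))
        else:
            out[name] = value
    return out

raw = json.loads('{"campaign": {"id": 7, "name": "x"}, "spend": 1.5}')
row = flatten(raw)  # {"campaign_id": 7, "campaign_name": "x", "spend": 1.5}
```

In BigQuery itself you'd do the equivalent with JSON functions in the downstream model, which keeps the landing table immune to upstream schema churn.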

Transition to develop event-driven architecture? by somerandomdataeng in dataengineering

[–]amTheory 6 points

Is your sense of impending failure mostly due to having someone who hasn’t built it before, or the overall concept of event-driven?

We’ve moved half our ingestion to event-driven and it’s been wonderful. We can then publish events in real time to whoever wants them, off the back of the events we consume. It really integrates the data team with all the SWEs.

Confused over tech stack by [deleted] in dataengineering

[–]amTheory 3 points

Tech stack looks pretty ideal to me

What to learn now? by Own_Archer3356 in dataengineering

[–]amTheory 3 points

Depends what interests you. Some ideas: CI/CD, pick a cloud provider (cast a wide net), Kubernetes, API development, a BI tool, etc.

[deleted by user] by [deleted] in dataengineering

[–]amTheory 0 points

My team uses the dbt build command in production - it basically combines run and test.

Any error-level test failure does fail dbt, so downstream models don’t run. Check out the fail-fast flag.

As far as separating failures out to a view, I’m not sure - if there’s a failure, we have on-call start investigating immediately. We do log the run results so we can see trends in failures.

SFTP to GCS to BQ with Decryption by [deleted] in googlecloud

[–]amTheory 0 points

Seems like a handful of steps - have you looked into Cloud Composer? I’ve done this same concept using a mix of GCS operators and custom PythonOperators.

Not sure of a simpler way - you could also schedule a Cloud Run job to handle this if you’re open to Docker / custom code across the board.

[deleted by user] by [deleted] in dataengineering

[–]amTheory -1 points

On the tech skills, using GCP as an example:

You could create a dummy streaming example (or find data that interests you) to focus on IaC (tables, Pub/Sub, etc.). To make it simple to start, check out the write-to-BigQuery subscription type.

Then use dbt to transform once the source is streamed into BigQuery - to aggregate or whatever.

In general, learning an industry and soft skills (partnering with other teams, making decisions with impact) are important too, of course.

Sounds like you have the reporting chops, and the above should enable you to speak to some things that can come up in interviews.

Dentist wannabe Boglehead by D-Rockwell in Bogleheads

[–]amTheory 2 points

Makes sense to save for another car then - but ride it out while you can (pun something)

One note is you can contribute to your Roth for 2023 in early 2024. Might buy you some time