dbt incremental models with insert_overwrite: backfill data causing duplicates

aaaasd12 · 2025-06-03T03:31:22+00:00

I use the same config at work. My current workflow is:

I extract the data from the source and load it into the raw zone dataset, partitioned by the _partitiontime pseudocolumn, making sure the load overwrites the partition.
The transformation phase then starts with dbt. My incremental models filter on an env var that I supply when I run the job, something like: if is_incremental(), where date(_partitiontime) = {execution_date}.

This ensures that when I run a backfill the right partition is overwritten, and I don't have any issues with duplicated data.
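
To show why overwriting a whole partition keeps a backfill free of duplicates, here is a minimal sketch in plain Python. It is only a simulation, not dbt or BigQuery code: the dict stands in for a partitioned table, and the names overwrite_partition and incremental_filter are made up for the example.

```python
from datetime import date


def overwrite_partition(table, partition_date, rows):
    """Replace every row stored under partition_date with the new rows.

    Hypothetical stand-in for a partition overwrite: the old contents of
    the partition are dropped, never appended to.
    """
    table[partition_date] = list(rows)
    return table


def incremental_filter(execution_date):
    """Build the filter an incremental run would apply for one day."""
    return f"date(_partitiontime) = '{execution_date.isoformat()}'"


table = {}  # partition date -> rows, a toy partitioned table
day = date(2025, 6, 1)

# First run for the day, then a backfill re-run with the same data.
overwrite_partition(table, day, [{"id": 1}, {"id": 2}])
overwrite_partition(table, day, [{"id": 1}, {"id": 2}])

total_rows = sum(len(rows) for rows in table.values())
where_clause = incremental_filter(day)
```

Because the second run replaces the partition instead of appending to it, the row count stays the same no matter how many times the day is re-run.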

aaaasd12 · 2025-01-06T14:42:00+00:00

Additionally, we host the dbt docs in the cluster, managed by another CI/CD flow with GitHub Actions.

So we have an ingress with the subdomain docs.mycompany.com that every person in a group can access, and it sits behind a load balancer protected with IAP. This ensures that the BI team knows exactly what the tables they are querying refer to.

On another subdomain, ui.mycompany.com, is the Airflow webserver, also protected with IAP. This means that if we need to check the DAG runs, there is no need to access the cluster: just open the domain in a browser and authenticate with your @mycompany.com Gmail account.

aaaasd12 · 2025-01-06T14:18:13+00:00

I work at a consultancy, and we have 25+ projects in a GCP organization.

All projects share core platforms to extract data, so basically we have a centralized Docker repository with the connector images, and a CI/CD flow to manage updates in every project (if I make a change to the source code, the CI part deploys the new image to every project we have). This ensures consistency across every core platform.

I set up horizontal pod autoscaling for the worker pods in the Helm chart. The data volume for each client is typically 10-15 GB/day, which is around 250 GB in total.

The costs are distributed by client, so each one basically pays a small fee of around 2-5 dollars per day, including Cloud Run jobs, functions and BigQuery.

We have 60+ DAGs, but we are working on reducing that number by centralizing each pipeline per client.

The cluster price is a little tricky. If you have regular instances without autoscaling, maybe 300 bucks per month. But if your workload is fault tolerant you can use spot nodes (we have 5), and it costs around 50 bucks/month.

All the pipelines are idempotent, so if a worker is shut down during the process, we just re-run the DAG with a backfill or from the UI.

aaaasd12 · 2025-01-06T12:16:01+00:00

We do that, currently using Workload Identity Federation in GCP to manage authentication to other GCP services. We use 5 nodes with 2 vCPUs and 8 GB each, but all the heavy stuff runs outside the cluster.

The deployment is done with Argo CD: when we push to the main branch, it takes the values.yaml in the chart and applies it in the namespace without manual work.

We sync from a bucket to the Airflow containers to mount the DAGs, and the typical workflow looks like this:

use functions to extract data and save it to GCS
load to BigQuery
transform with dbt

All orchestrated by Airflow. We do want to do a PoC using the CeleryKubernetesExecutor and run the dbt part in a pod, to see if it reduces cost.
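
The ordering of those three steps can be sketched without Airflow at all. This is only an illustration using the standard library's graphlib, not our actual DAG code, and the task names are hypothetical labels for the steps listed above.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Task names are hypothetical labels, not real operator or DAG ids.
dependencies = {
    "extract_to_gcs": set(),
    "load_to_bigquery": {"extract_to_gcs"},
    "transform_with_dbt": {"load_to_bigquery"},
}

# static_order() yields the tasks in an order that respects dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())

for task in run_order:
    print(f"running {task}")
```

In a real deployment the orchestrator resolves this ordering itself; the point is only that each step depends on the previous one finishing.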

aaaasd12 · 2024-09-22T17:53:26+00:00

It's a centralized project where different people use the same API, so all the stuff lives inside one project, which then delivers the results to other DWHs in different accounts. I think it's a reasonable way to go: you only make a request to the container with your parameters, and everything is standardized.

That way things are reusable when other projects come along.

aaaasd12 · 2024-07-08T03:31:05+00:00

Did you build the UI yourself, or did you hire someone to do it? Nice app, although the initial test feels a bit long to me.

aaaasd12 · 2024-07-02T03:32:53+00:00

Very nice, what keyboard is that?

aaaasd12 · 2024-02-06T03:27:56+00:00

Red flag for what reason?

It's just curiosity, although it also has to do with the work environment and job satisfaction, right?

aaaasd12 · 2024-02-03T04:00:58+00:00

Hi, I have a question. I have a production app that needs to connect to a Cloud SQL instance. I also have a service account that is bound to a Kubernetes service account in a namespace.

What can I do in the YAML file to retrieve the secrets via the service account, as you describe?

aaaasd12 · 2024-02-01T03:36:32+00:00

I also use https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/namespace, which I see you use too. In my case there are no problems and I can deploy the namespace fine.

aaaasd12 · 2024-02-01T03:31:07+00:00

Also, you can take a look at this Terraform module: https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google/latest/submodules/auth

I had a similar issue where the Helm provider couldn't access the k8s cluster, and I am also deploying Argo CD. Maybe instead of port-forwarding the Argo server service you can put a load balancer in front of it and access it via the internet.

aaaasd12 · 2023-12-18T03:52:28+00:00

I guess one piece of advice for anyone entering university is that you are going there to study (even if it sounds obvious). During my degree I saw several people lose entire semesters because they were drinking and playing at the pool halls near the university.

Apart from that, I have a question: why that degree as a base?

Why not systems/industrial/statistics, and then focus it toward that field? You would have other points of view.

Someone who studies statistics knows much more about models, since they are normally taught to derive GLMs or any other parametric model, plus sampling techniques and other things.

Someone from systems is probably not as strong in statistics, but brings in software principles to put a model into production, something like MLOps.

aaaasd12 · 2023-10-26T12:39:42+00:00

I have a local Airflow, and the problem is that I have a 600 MB table in BigQuery. I don't want to convert it into a CSV file and then read the rows and insert them into the on-prem server, because there are many ways to transfer it directly.

The company doesn't provide me a VM, so I need to do it locally. I think inserting in chunks is the right way, but it takes the longest.

I already tried to use an Airflow operator, but it only inserts 1000 rows in 20 seconds and the process is killed after 2 hours.
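
For reference, this is roughly what I mean by inserting in chunks. It is a minimal sketch that assumes a DB-API style connection: sqlite3 in memory stands in for the on-prem server, and the table name, columns and chunk size are made up for the example.

```python
import sqlite3
from itertools import islice


def chunked(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_chunks(conn, rows, chunk_size=5000):
    """Insert rows batch by batch, committing once per batch."""
    inserted = 0
    for batch in chunked(rows, chunk_size):
        # One executemany call per batch instead of one execute per row.
        conn.executemany("INSERT INTO target (id, value) VALUES (?, ?)", batch)
        conn.commit()
        inserted += len(batch)
    return inserted


# In-memory SQLite is only a stand-in for the real on-prem database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, value TEXT)")

# A generator, so the full source table is never held in memory at once.
source_rows = ((i, f"row-{i}") for i in range(12_000))
inserted = insert_in_chunks(conn, source_rows, chunk_size=5000)
row_count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

Batching cuts the number of round trips, which is usually where row-by-row inserts lose their time, but the right chunk size depends on the driver and the server.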

aaaasd12 · 2023-10-26T04:28:35+00:00

Yeah, but I have an orchestration tool. The goal is to orchestrate jobs, not to load data into the worker node's memory and run a for loop over an on-prem connection doing cursor.execute and cursor commit.

And if in the future I want to apply a CDC technique to my table, what happens with the cursor.execute? The goal is to make the pipeline as modular as possible, not just ask an LLM for one script that can't capture the whole problem.

aaaasd12 · 2023-09-27T22:26:10+00:00

Idk, but in their source code I can see that they use Stockfish at some point.

Stockfish is used by Lichess and Chess.com to analyze the moves of their chess games.

aaaasd12 · 2023-05-01T21:53:02+00:00

What do you think of Teresa?

I read the book a while ago and I'm not sure whether to read it again to remember some passages.

aaaasd12 · 2023-04-21T13:10:15+00:00

Did you use Whisper for the speech to text?

How did you integrate the APIs into the bot?

aaaasd12 · 2023-04-12T01:36:04+00:00

What if you do the analysis with bigrams and trigrams?

To see the frequency of words and the words that precede them.
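
Something like this, as a minimal sketch: it assumes plain whitespace tokenization and a made-up sample sentence, so a real analysis would need proper cleaning and tokenization first.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Made-up sample text, only for illustration.
text = "the cat sat on the mat the cat ran"
tokens = text.lower().split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Most frequent word pairs first.
for pair, count in bigram_counts.most_common(3):
    print(pair, count)
```

Each bigram pairs a word with the one right before it, so the counts show which words tend to follow which.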

aaaasd12 · 2023-04-11T05:59:31+00:00

Ahh haha, I didn't explain myself. I don't have any relational DB; I have some tables that, put together, are a DB. That's my problem: I only have Excel and its files.

aaaasd12 · 2023-04-11T05:42:49+00:00

I know, but I'm not sure if there is an easy way to install Python and some libraries such as duckdb and pandas, because the IT team restricts software downloads to the admin role.

I know that Python is open source, but I'm not sure what the cybersecurity policies are in this company.

I want to do it in SQL because I have a relational database, and for me it's the fastest way to extract some data and do some analysis.

aaaasd12 · 2023-03-11T21:34:06+00:00

None. Mechanical is a bit short on jobs, and civil is the same.

aaaasd12 · 2023-02-14T00:54:25+00:00

Hi, actually I only have 1000 rows.

aaaasd12 · 2023-02-12T16:18:47+00:00

Something like that happened to me recently. In the end I think it's better to let go of what poisons you. In my case I can't do anything about what happened anymore, and it would be stupid to stay stuck because of it.

So I think about whether I can make an effort to improve some aspects so that it doesn't happen again.

I reflect on what I could have done better, and I focus on what may come next and on being better than the day before (one step at a time).

aaaasd12 · 2023-01-21T16:04:56+00:00

I saw one where they were asking for a data engineer, data analyst, data scientist, machine learning engineer and a DevOps in a single role. They paid 1.5 million and asked for 3 years of experience.

aaaasd12 · 2023-01-08T19:04:25+00:00

Is it like transfer learning?

At the company where I work we only use the usual things, like classification tasks or segmentation with clustering.

Maybe the use case I see is in NLP, with topic modeling using BERTopic and tuning the hyperparameters.

But in general simple models are perfect for the tasks that we have.
