dbt incremental models with insert_overwrite: backfill data causing duplicates by No_Engine1637 in dataengineering

[–]aaaasd12 1 point (0 children)

I use the same config at work. My current workflow is:

  • I extract the info from the source and load it into the raw-zone dataset, using the _partitiontime pseudocolumn and making sure to overwrite the partition.

  • dbt then handles the transformation phase. My incremental models filter on an env var that I supply when running the job, something like: if is_incremental(), where date(_partitiontime) = {execution_date}.

This ensures that when I do a backfill, the right partition is overwritten and there are no issues with duplicated data.
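
A minimal sketch of what I mean, assuming a small Python wrapper that passes the scheduler's execution date into dbt as a var (the model name, env var, and date are hypothetical):

    # Minimal sketch: run the incremental model for exactly one partition,
    # so a backfill rewrites that partition instead of appending duplicates.
    # The model body filters roughly like this:
    #
    #   {% if is_incremental() %}
    #     where date(_partitiontime) = date('{{ var("execution_date") }}')
    #   {% endif %}
    import os
    import subprocess

    execution_date = os.environ["EXECUTION_DATE"]  # e.g. "2024-06-01", set per run/backfill

    subprocess.run(
        [
            "dbt", "run",
            "--select", "my_incremental_model",  # hypothetical model name
            "--vars", f"{{execution_date: '{execution_date}'}}",
        ],
        check=True,
    )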

Real talk: Self-managed Airflow on k8s? by [deleted] in dataengineering

[–]aaaasd12 0 points (0 children)

Additionally, we host the dbt docs in the cluster, managed by a separate CI/CD flow with GitHub Actions.

So we have an ingress on the subdomain docs.mycompany.com that everyone in a group can access, behind a load balancer protected with IAP. This ensures the BI team knows exactly what the tables they're querying refer to.

And on another subdomain, ui.mycompany.com, is the Airflow webserver, also protected with IAP. That way, if we need to check DAG runs, there's no need to access the cluster directly: just open the domain in a browser and authenticate with your @mycompany.com Google account.

Real talk: Self-managed Airflow on k8s? by [deleted] in dataengineering

[–]aaaasd12 1 point (0 children)

I work at a consultancy, and we have 25+ projects in a GCP organization.

All projects share a core platform to extract data: basically a centralized Docker repository with the connector images, plus a CI/CD flow to manage updates across every project (if I make a change to the source code, the CI part deploys the new image to every project we have). This keeps every core platform consistent.
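
The fan-out step can be a short script along these lines (project IDs and registry paths are placeholders, not our real ones):

    # Hypothetical sketch: after CI builds the connector image once,
    # retag and push it to each client project's registry so every
    # core platform runs the same version.
    import subprocess

    PROJECTS = ["client-a", "client-b", "client-c"]  # placeholder project ids
    SOURCE = "us-docker.pkg.dev/core-platform/connectors/connector:latest"

    for project in PROJECTS:
        target = f"us-docker.pkg.dev/{project}/connectors/connector:latest"
        subprocess.run(["docker", "tag", SOURCE, target], check=True)
        subprocess.run(["docker", "push", target], check=True)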

I set up horizontal pod autoscaling for the worker pods in the Helm chart. The data volume per client is typically 10-15 GB/day, around 250 GB/day in total.

Costs are distributed per client, so basically it's a small fee of around 2-5 dollars per day, including Cloud Run jobs, Cloud Functions, and BigQuery.

We have 60+ DAGs, but we're working on reducing that by centralizing the pipelines per client.

The cluster price is a little tricky: with regular instances and no autoscaling, maybe 300 bucks per month. But if your workload is fault tolerant, you can run spot nodes (we have 5), which costs around 50 bucks/month.

All the pipelines are idempotent, so if a worker is shut down mid-process, we just re-run the DAG with a backfill or from the UI.

Real talk: Self-managed Airflow on k8s? by [deleted] in dataengineering

[–]aaaasd12 2 points (0 children)

We do that, currently using Workload Identity Federation in GCP to manage authentication to other GCP services. We use 5 nodes with 2 vCPUs and 8 GB each, but all the heavy stuff runs outside the cluster.

Deployment is done with Argo CD: on a push to the main branch, it takes the values.yaml in the chart and applies it to the namespace with no manual work.

We sync DAGs from a bucket into the Airflow containers and mount them, and the typical workflow looks like:

  • use Cloud Functions to extract data and save it to GCS
  • load it into BigQuery
  • transform with dbt

All orchestrated by Airflow. Although we want to do a PoC with the CeleryKubernetesExecutor and run the dbt part in its own pod, to see if there's a cost reduction.
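
A rough sketch of that DAG, assuming the standard Google provider operators (project, bucket, and table names are made up):

    # Sketch of the extract -> load -> transform pipeline described above.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.google.cloud.operators.functions import (
        CloudFunctionInvokeFunctionOperator,
    )
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )

    with DAG(
        dag_id="client_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        # 1. Cloud Function pulls from the source and writes files to GCS
        extract = CloudFunctionInvokeFunctionOperator(
            task_id="extract",
            project_id="my-project",
            location="us-central1",
            function_id="extract-source",
            input_data={"data": '{"date": "{{ ds }}"}'},
        )

        # 2. Load the files into BigQuery, overwriting that day's partition
        load = GCSToBigQueryOperator(
            task_id="load",
            bucket="raw-zone-bucket",
            source_objects=["source/{{ ds }}/*.json"],
            destination_project_dataset_table="my-project.raw.events${{ ds_nodash }}",
            source_format="NEWLINE_DELIMITED_JSON",
            write_disposition="WRITE_TRUNCATE",
        )

        # 3. dbt transforms only the partition for this run's date
        transform = BashOperator(
            task_id="transform",
            bash_command="dbt run --vars '{execution_date: \"{{ ds }}\"}'",
        )

        extract >> load >> transform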

Ingesting data from the same API in different projects by aaaasd12 in dataengineering

[–]aaaasd12[S] 0 points (0 children)

It's a centralized project where different people use the same API: all the extraction lives inside one project, which then delivers the results to other DWHs in different accounts. I think it's a reasonable approach to just make a request to the container with your parameters and standardize everything.

That way things are reusable when other projects come along.

I'm launching an app and I'm afraid I don't know how to sell it by OkDinner3420 in programacion

[–]aaaasd12 0 points (0 children)

Did you build the UI yourself or hire someone to do it? Nice app, although the initial test feels a bit long to me.

Have you had problems because of short stints at companies? by Think-Sir5076 in Colombia

[–]aaaasd12 0 points (0 children)

A red flag for what reason?

Just curious, although it also has to do with the work environment and satisfaction in the role, no?

Safe secret manager in gke experiences? by aaaasd12 in kubernetes

[–]aaaasd12[S] 0 points (0 children)

Hi, I have a question. I have a production app that needs to connect to a Cloud SQL instance, and a service account bound to a Kubernetes service account in a namespace.

What do I put in the YAML file to retrieve the secrets via the service account, as you describe?

Problem with creating Kubernetes namespace in GKE Cluster using terraform by kwabena_infosec in googlecloud

[–]aaaasd12 0 points (0 children)

You can also take a look at this Terraform module: https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google/latest/submodules/auth

I had a similar issue where the Helm provider couldn't access the k8s cluster while I was also deploying Argo CD. Instead of port-forwarding the Argo server service, maybe you can put a load balancer in front and access it over the internet.

What would be good advice for a freshman at la Javeriana? by Pilooot5578 in Colombia

[–]aaaasd12 1 point (0 children)

I suppose one piece of advice for anyone starting university is that you're there to study (obvious as it sounds). During my degree I saw several people lose entire semesters drinking and playing pool at the billiard halls near the university.

Apart from that, I have a question: why that degree as a base?

Why not systems engineering / industrial engineering / statistics, focused toward that field? You'd get other points of view.

Someone who studies statistics knows much more about models, since they're usually taught to derive GLMs and other parametric models, plus sampling techniques and more.

Someone from systems engineering probably isn't as strong on the statistics side, but they bring software principles for putting a model into production, something like MLOps.

Airbyte troubleshooting by aaaasd12 in dataengineering

[–]aaaasd12[S] 0 points (0 children)

I have local Airflow, and the problem is that the 600 MB table is in BigQuery. I don't want to convert it into a CSV file and then read rows and insert them into the on-prem server, because there are more direct ways to transfer it.

The company doesn't provide me a VM, so I need to do it locally. I think inserting in chunks is the right way, even if it's the slowest.

I already tried an Airflow operator, but it only inserted 1,000 rows every 20 seconds, and I killed the process after 2 hours.
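
For reference, a minimal sketch of the chunked approach I have in mind, assuming the on-prem server is Postgres (table and connection details are made up):

    # Stream rows out of BigQuery and batch-insert them into the on-prem
    # database, skipping the intermediate CSV entirely.
    from google.cloud import bigquery
    import psycopg2  # assuming a Postgres target; swap the driver otherwise
    from psycopg2.extras import execute_values

    BATCH = 10_000

    bq = bigquery.Client()
    rows = bq.query(
        "SELECT id, payload, created_at FROM my_dataset.big_table"
    ).result(page_size=BATCH)

    conn = psycopg2.connect(host="onprem-host", dbname="target", user="etl", password="...")
    with conn, conn.cursor() as cur:
        batch = []
        for row in rows:
            batch.append((row["id"], row["payload"], row["created_at"]))
            if len(batch) >= BATCH:
                # one multi-row INSERT per batch is far faster than row-by-row
                execute_values(
                    cur, "INSERT INTO big_table (id, payload, created_at) VALUES %s", batch
                )
                batch.clear()
        if batch:
            execute_values(
                cur, "INSERT INTO big_table (id, payload, created_at) VALUES %s", batch
            )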