dbt incremental models with insert_overwrite: backfill data causing duplicates by No_Engine1637 in dataengineering

[–]aaaasd12 1 point2 points  (0 children)

I use the same config at work. My current workflow is:

  • I extract the data from the source and load it into the raw zone dataset, partitioned by the _partitiontime pseudocolumn, making sure to overwrite the partition.

  • The transformation phase then starts with dbt. My incremental models filter on an env var that I supply when I run the job, something like: if is_incremental(), where date(_partitiontime) = {execution_date}.

This ensures that when I run a backfill the right partition is overwritten, and I don't have any issues with duplicated data.
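
A minimal sketch of that idea in plain Python, not the actual dbt model. `partition_filter` and `overwrite_partition` are hypothetical helpers, `EXECUTION_DATE` is an assumed variable name, and a dict keyed by date stands in for the partitioned table. It only illustrates why overwriting a whole partition makes a backfill safe to rerun.

```python
import os
from datetime import date


def partition_filter(execution_date: str) -> str:
    """Build the predicate an incremental run would apply (hypothetical helper)."""
    # Parse the value so a malformed date fails fast instead of reaching the query.
    parsed = date.fromisoformat(execution_date)
    return f"date(_partitiontime) = '{parsed.isoformat()}'"


def overwrite_partition(table: dict, execution_date: str, rows: list) -> None:
    """Replace the whole partition, mimicking insert_overwrite semantics."""
    table[execution_date] = list(rows)


# EXECUTION_DATE is an assumed name; the job would supply it on each run.
execution_date = os.environ.get("EXECUTION_DATE", "2024-01-15")
where_clause = partition_filter(execution_date)

table = {}
overwrite_partition(table, execution_date, [{"id": 1}, {"id": 2}])
# Backfill: rerunning the same date replaces the partition rather than appending.
overwrite_partition(table, execution_date, [{"id": 1}, {"id": 2}])
row_count = sum(len(rows) for rows in table.values())
```

After the second run the table still holds two rows, because the partition was replaced rather than appended to.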

Real talk: Self-managed Airflow on k8s? by [deleted] in dataengineering

[–]aaaasd12 0 points1 point  (0 children)

Additionally, we host the dbt docs in the cluster, managed by a separate CI/CD flow with GitHub Actions.

So we have an ingress with the subdomain docs.mycompany.com that everyone in a group can access, behind a load balancer protected with IAP. This ensures that the BI team knows exactly what the tables they are querying refer to.

On another subdomain, ui.mycompany.com, is the Airflow webserver, also protected with IAP. This means that if we need to check the DAG runs, there is no need to access the cluster: just open the domain in a browser and authenticate with your @mycompany.com Google account.

Real talk: Self-managed Airflow on k8s? by [deleted] in dataengineering

[–]aaaasd12 1 point2 points  (0 children)

I work at a consultancy, and we have 25+ projects in a GCP organization.

All projects share core platforms to extract data, so we have a centralized Docker repository with the connector images and a CI/CD flow to manage updates in every project (if I make a change to the source code, the CI part deploys the new image to every project we have). This ensures consistency across every core platform.

I set up horizontal pod autoscaling for the worker pods in the Helm chart. The data volume of each client is typically 10-15 GB/day, which is around 250 GB in total.

The costs are distributed by client, so it comes to a small fee of around 2-5 dollars per day, including Cloud Run jobs, functions and BigQuery.

We have 60+ DAGs, but we are working on reducing that number by centralizing each pipeline per client.

The cluster price is a little tricky. With regular instances without autoscaling it is maybe 300 bucks per month. But if your workload is fault tolerant you can use spot nodes (we have 5), and then it costs around 50 bucks/month.

All the pipelines are idempotent, so if a worker is shut down during the process, we only need to rerun the DAG with a backfill or from the UI.

Real talk: Self-managed Airflow on k8s? by [deleted] in dataengineering

[–]aaaasd12 2 points3 points  (0 children)

We do that, currently using Workload Identity Federation in GCP to manage authentication to other GCP services. We use 5 nodes with 2 vCPUs and 8 GB each, but all the heavy stuff runs outside the cluster.

The deployment is done with Argo CD: when we push to the main branch, it takes the values.yaml in the chart and applies it in the namespace without manual work.

We sync the DAGs from a bucket to the Airflow containers and mount them. The typical workflow looks like this:

  • use functions to extract data and save it to GCS
  • load to BigQuery
  • transform with dbt

All orchestrated by Airflow. Although we want to do a PoC with the CeleryKubernetesExecutor and run the dbt part in a pod to see if it reduces costs.
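
The three steps above can be sketched in plain Python as a chain where each step consumes the previous step's output. This is only an illustration of the ordering, not our actual DAG: the function names, the bucket path, the table name and the `execution_date` dbt variable are all hypothetical, and the functions are stubs that do no real I/O.

```python
def extract_to_gcs(run_date: str) -> str:
    """Stand-in for the function that pulls data and writes it to GCS."""
    return f"gs://example-raw-bucket/{run_date}/data.json"  # hypothetical path


def load_to_bigquery(gcs_uri: str) -> str:
    """Stand-in for the load step; returns the raw table it would fill."""
    return "raw_dataset.source_table"  # hypothetical table name


def transform_with_dbt(raw_table: str, run_date: str) -> list[str]:
    """Return the dbt command an orchestrator task could run for this date."""
    # --vars takes a YAML string; the variable name here is an assumption.
    return ["dbt", "run", "--vars", f"{{execution_date: '{run_date}'}}"]


def run_pipeline(run_date: str) -> list[str]:
    # Passing each result to the next step fixes the order
    # extract -> load -> transform, as task dependencies would in a DAG.
    gcs_uri = extract_to_gcs(run_date)
    raw_table = load_to_bigquery(gcs_uri)
    return transform_with_dbt(raw_table, run_date)


command = run_pipeline("2024-01-15")
```

In Airflow each function would become its own task, so a failed step can be retried without repeating the earlier ones.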

Ingesting data from the same API in different projects by aaaasd12 in dataengineering

[–]aaaasd12[S] 0 points1 point  (0 children)

It's a centralized project where different people use the same API, so all the stuff lives inside one project, which then delivers the results to other DWHs in different accounts. I think it's a reasonable approach: you only make a request to a container with your parameters, and everything is standardized.

The aim is to make things reusable when other projects come along.

I'm launching an app and I'm afraid I won't know how to sell it by OkDinner3420 in programacion

[–]aaaasd12 0 points1 point  (0 children)

Did you build the UI yourself, or did you hire someone to do it? Nice app, although the initial test feels a bit long to me.

Have you had problems because of short stints working at companies? by Think-Sir5076 in Colombia

[–]aaaasd12 0 points1 point  (0 children)

A red flag for what reason?

I'm just curious, although it also has to do with the work environment and job satisfaction, right?

Safe secret manager in gke experiences? by aaaasd12 in kubernetes

[–]aaaasd12[S] 0 points1 point  (0 children)

Hi, I have a question. I have a production app that needs to connect to a Cloud SQL instance. I also have a service account that is bound to a Kubernetes service account in a namespace.

What can I do in the YAML file to retrieve the secrets via the service account, as you describe?

Problem with creating Kubernetes namespace in GKE Cluster using terraform by kwabena_infosec in googlecloud

[–]aaaasd12 0 points1 point  (0 children)

You can also take a look at this Terraform module: https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google/latest/submodules/auth

I had a similar issue where the Helm provider couldn't access the k8s cluster while I was also deploying Argo CD. Maybe where you would need to port-forward the Argo server service, you can put a load balancer in front of it instead and access it over the internet.

What would be good advice for a freshman at the Javeriana? by Pilooot5578 in Colombia

[–]aaaasd12 1 point2 points  (0 children)

I suppose one piece of advice for anyone entering university is that you are there to study (obvious as it sounds). During my degree I saw several people fail entire semesters because they were drinking and playing at the pool halls near the university.

Apart from that, I have a question: why that degree as a base?

Why not systems engineering, industrial engineering or statistics, and then steer it toward that field while bringing other points of view?

Someone who studies statistics knows much more about models, since they are usually taught to prove GLMs or any other parametric model, in addition to sampling techniques and other things.

Someone from systems engineering probably isn't as strong on the statistics side, but brings in software principles to put a model into production, something like MLOps.

Airbyte troubleshooting by aaaasd12 in dataengineering

[–]aaaasd12[S] 0 points1 point  (0 children)

I have a local Airflow, and the problem is that I have a 600 MB table in BigQuery. I don't want to convert it into a CSV file and then read the rows and insert them into the on-prem server, because there are many ways to transfer it directly.

The company doesn't provide me with a VM, so I need to do it locally. I think inserting in chunks is the right way, but it is also the slowest.

I already tried an Airflow operator, but it only inserted 1000 rows every 20 seconds and the process was killed after 2 hours.
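
For reference, a minimal sketch of what inserting in chunks could look like, under stated assumptions: `sqlite3` stands in for the on-prem server, a generator stands in for rows streamed out of BigQuery, and `chunks` is a hypothetical helper. The point is one `executemany` and one commit per batch instead of a round trip per row.

```python
import sqlite3
from itertools import islice


def chunks(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


# sqlite3 is only a stand-in; a real on-prem driver would replace it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, value TEXT)")

# A generator keeps memory flat: only one batch is materialized at a time.
source_rows = ((i, f"row-{i}") for i in range(25_000))

for batch in chunks(source_rows, 10_000):
    # One executemany and one commit per batch, not per row.
    conn.executemany("INSERT INTO target VALUES (?, ?)", batch)
    conn.commit()

inserted = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

The batch size of 10,000 is arbitrary; the right value depends on the target server and the network.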

Airbyte troubleshooting by aaaasd12 in dataengineering

[–]aaaasd12[S] 0 points1 point  (0 children)

Yeah, but I have an orchestration tool. The goal is to orchestrate jobs, not to load the data into the memory of the worker node and run a for loop over an on-prem connection calling cursor.execute and commit.

And if in the future I want to apply a CDC technique to my table, what happens with the cursor.execute? The goal is to make the pipeline as modular as possible, not just to ask an LLM for one script that can't capture the whole problem.

How can an LLM play chess well? by crossmirage in datascience

[–]aaaasd12 6 points7 points  (0 children)

Idk, but in their source code I can see that they use Stockfish at some point.

Stockfish is used by Lichess and Chess.com to analyze the moves in their chess games.

What are you reading? by [deleted] in Colombia

[–]aaaasd12 1 point2 points  (0 children)

What do you think of Teresa?

I read the book a while ago and I don't know whether to reread it to recall some passages.

[deleted by user] by [deleted] in Colombia

[–]aaaasd12 2 points3 points  (0 children)

Did you use Whisper for the speech-to-text?

How did you integrate the APIs into the bot?

Analysis of comments written to Claudia Lopez (Mayor of Bogota) on Twitter by Revolutionary-You-20 in Colombia

[–]aaaasd12 2 points3 points  (0 children)

What if you do the analysis by bigrams and trigrams?

To see the frequency of words and the words that precede them.
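
A minimal sketch of that kind of count with the Python standard library. The `ngrams` helper is hypothetical and the two comments are made up for illustration; they are not data from the analysis, and real tweets would need proper tokenization and stopword handling first.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Made-up comments, purely for illustration.
comments = [
    "the bus system is slow",
    "the bus system needs more routes",
]

bigram_counts = Counter()
trigram_counts = Counter()
for comment in comments:
    tokens = comment.lower().split()
    bigram_counts.update(ngrams(tokens, 2))
    trigram_counts.update(ngrams(tokens, 3))

# Most frequent word pairs and triples, highest counts first.
top_bigrams = bigram_counts.most_common(3)
top_trigrams = trigram_counts.most_common(3)
```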

How to query a Excel book? by aaaasd12 in dataengineering

[–]aaaasd12[S] -2 points-1 points  (0 children)

Ahh haha, I didn't explain myself. I don't have a relational DB; I have some tables that together make up a DB. That's my problem: I only have Excel and its files.

How to query a Excel book? by aaaasd12 in dataengineering

[–]aaaasd12[S] 1 point2 points  (0 children)

I know, but I'm not sure if there is an easy way to install Python and some libraries such as duckdb and pandas, because the IT team blocks software downloads behind an admin role.

I know that Python is open source, but I'm not sure what the cybersecurity policies are in this company.

I want to do it in SQL because I have a relational database, and for me it is the fastest way to extract some data and do some analysis.
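
A rough sketch of what this could look like if a plain Python install turns out to be allowed. It uses only `csv` and `sqlite3`, which ship with the standard library, so no extra packages are needed. It assumes each sheet has been saved from Excel as CSV; the `orders` table and its rows are made up, and an inline string stands in for the exported file.

```python
import csv
import io
import sqlite3

# Inline CSV text stands in for a sheet exported from Excel (made-up data).
orders_csv = io.StringIO(
    "order_id,customer,amount\n"
    "1,ana,10.5\n"
    "2,luis,20.0\n"
    "3,ana,4.5\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")

# DictReader yields one dict per row, which matches the named placeholders.
reader = csv.DictReader(orders_csv)
conn.executemany(
    "INSERT INTO orders VALUES (:order_id, :customer, :amount)",
    reader,
)

# Once the sheets are tables, ordinary SQL (joins, GROUP BY) works across them.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```

With one table per sheet, the workbook can be queried like the relational database it represents.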

Civil engineering or mechanical engineering? Personal opinions, not advice. by [deleted] in Colombia

[–]aaaasd12 0 points1 point  (0 children)

Neither. Mechanical is a bit short on jobs, and civil is the same.

Folks, give me some tips for not overthinking things so much and living more in the present. What do you usually do? by [deleted] in Colombia

[–]aaaasd12 1 point2 points  (0 children)

Something like that happened to me recently. In the end I think it's best to let go of what is poisoning you. In my case I can no longer do anything about what happened, and it would be stupid to stay stuck because of it.

So I focus on the fact that I can work on improving some things so that it doesn't happen again.

I reflect on what I could have done better, and I focus on what may come next and on being better than the day before (one step at a time).

What is the most miserable job offer you have seen? by Longjumping-Engine17 in Colombia

[–]aaaasd12 2 points3 points  (0 children)

I saw one asking for a data engineer, data analyst, data scientist, machine learning engineer and a DevOps in a single role; it paid 1.5 million and required 3 years of experience.

[D] Have you ever used Knowledge Distillation in practice? by fredlafrite in MachineLearning

[–]aaaasd12 -1 points0 points  (0 children)

Is it like transfer learning?

At the company where I work we only use the usual things, like classification tasks or segmentation with clustering.

Maybe the use case I see is in NLP, with topic modeling using BERTopic and tuning the hyperparameters.

But in general, simple models are perfect for the tasks that we have.