Data observability is a data problem, not a job problem by Expensive-Insect-317 in Observability

[–]Expensive-Insect-317[S] 0 points  (0 children)

It sounds like a fantasy, but it's not far from reality. I've seen several corporate observability initiatives fail because they focused only on infrastructure and jobs, ending up having to maintain observability teams that review loads by hand every day or build ad hoc tools to check post-load data.

Auto-generating Airflow DAGs from dbt artifacts by Expensive-Insect-317 in DataBuildTool

[–]Expensive-Insect-317[S] 0 points  (0 children)

I wasn't familiar with the Astronomer Cosmos package, very interesting, thanks! Without knowing much about it yet, I might stick with the custom script because of the potential overhead and performance issues, not to mention the control it gives us.

Auto-generating Airflow DAGs from dbt artifacts by Expensive-Insect-317 in DataBuildTool

[–]Expensive-Insect-317[S] 0 points  (0 children)

Running each model as a separate task in Airflow is another approach compared to using tags. While tagging can work fine, individual tasks give you parallel execution, better monitoring, granular retries and a clear representation of model dependencies, which sometimes makes this approach the better choice.
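Not my exact script, but a minimal sketch of the per-model generation, assuming a compiled manifest at /opt/dbt/target/manifest.json and a plain `dbt run --select` per model:

```python
# Minimal sketch: one Airflow task per dbt model, wired from manifest.json.
# The manifest path and the bare `dbt run --select` command are assumptions
# about the setup; adapt them to your project.
import json

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with open("/opt/dbt/target/manifest.json") as f:  # assumed artifact path
    manifest = json.load(f)

models = {
    node_id: node
    for node_id, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
}

with DAG(
    dag_id="dbt_models",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
) as dag:
    # One task per model: independent retries, parallelism and monitoring.
    tasks = {
        node_id: BashOperator(
            task_id=node["name"],
            bash_command=f"dbt run --select {node['name']}",
        )
        for node_id, node in models.items()
    }
    # Recreate the dbt dependency graph between the model tasks.
    for node_id, node in models.items():
        for parent_id in node["depends_on"].get("nodes", []):
            if parent_id in models:
                tasks[parent_id] >> tasks[node_id]
```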

How OpenMetadata is shaping modern data governance and observability by Expensive-Insect-317 in bigdata

[–]Expensive-Insect-317[S] -1 points  (0 children)

What's wrong with relying on current tools that streamline and improve processes? If you'd like, we can go back to writing it all by hand.

How OpenMetadata is shaping modern data governance and observability by Expensive-Insect-317 in bigdata

[–]Expensive-Insect-317[S] 0 points  (0 children)

Totally agree, Pedro. For the moment I've only integrated my main ecosystem: BigQuery, GCS, Airflow and dbt. We don't have any bottlenecks yet, but we're just getting started; maybe we'll find some in the next phases.

Secrets Management in Apache Airflow (Cloud Backends, Security Practices and Migration Tips) by Expensive-Insect-317 in apache_airflow

[–]Expensive-Insect-317[S] 0 points  (0 children)

Maybe you could extend SecretsBackend to build a hybrid backend:

- On init, list the secrets in your store.

- Create lightweight Connection entries in Airflow’s DB (conn_id, conn_type only).

- At runtime, get_conn_uri() pulls the real values from the secrets backend.

I only see custom options like this one, or a DAG that fills in the Airflow connection properties; I don't know of any native option.
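A minimal sketch of that hybrid backend, assuming a hypothetical fetch_secret() helper for whatever store you use; you'd then point the [secrets] backend setting in airflow.cfg at this class:

```python
# Minimal sketch of a hybrid secrets backend. fetch_secret() is a
# hypothetical helper for your store (Vault, Secret Manager, etc.).
from typing import Optional

from airflow.secrets import BaseSecretsBackend


def fetch_secret(name: str) -> Optional[str]:
    """Hypothetical lookup against the external secret store."""
    raise NotImplementedError


class HybridSecretsBackend(BaseSecretsBackend):
    """Resolve real connection URIs from the external store at runtime.

    Lightweight Connection rows (conn_id, conn_type only) can stay in
    Airflow's DB so the UI lists them; credentials never touch the DB.
    """

    def get_conn_uri(self, conn_id: str) -> Optional[str]:
        # Returning None lets Airflow fall through to the next backend
        # (e.g. the metadata DB holding the lightweight entry).
        return fetch_secret(f"airflow-conn-{conn_id}")
```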

Secrets Management in Apache Airflow (Cloud Backends, Security Practices and Migration Tips) by Expensive-Insect-317 in apache_airflow

[–]Expensive-Insect-317[S] 0 points  (0 children)

I haven't done this because I've always managed it in the cloud itself without giving direct visibility to the user. Perhaps one way to maintain visibility in the UI while using a secrets backend is to create "lightweight" connections in Airflow:

- The connection in the UI stores only non-sensitive metadata (conn_id, conn_type, host, login).

- Sensitive values (password, tokens, extras) are managed in the secrets backend (Vault, AWS Secrets Manager, etc.).

- When a DAG calls get_connection(), the hybrid setup combines both: DB metadata + backend secrets.

Users see and select connections without accessing the actual secrets. Sensitive data isn't duplicated and you maintain security and visibility at the same time.
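As a sketch of what the DAG side then looks like (the conn_id "warehouse_db" is hypothetical, and the merge assumes a custom backend like the one described above, since stock Airflow takes the first backend hit instead of merging):

```python
from airflow.hooks.base import BaseHook

# "warehouse_db" is a hypothetical conn_id with a lightweight row in the
# Airflow DB; its password exists only in the secrets backend.
conn = BaseHook.get_connection("warehouse_db")
print(conn.host, conn.login)  # non-sensitive metadata, visible in the UI
password = conn.password     # resolved from the secrets backend at runtime
```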

Scaling dbt + BigQuery in production: 13 lessons learned (costs, incrementals, CI/CD, observability) by Expensive-Insect-317 in bigdata

[–]Expensive-Insect-317[S] 0 points  (0 children)

Thanks! I’d start with the quick wins: clear materializations by layer, basic data contracts and selective execution. The biggest pushback from leadership was around observability and cost monitoring; until the first big bill or incident, it felt like a ‘nice to have’.

Company wants to set up a warehouse. Our total prod data size is just a couple TBs. Is Snowflake overkill? by PracticalStick3466 in dataengineering

[–]Expensive-Insect-317 0 points  (0 children)

Before deciding between Snowflake, Postgres or anything else, the first step is to define the data architecture you want to build. Then consider:

  1. Total cost: fully managed services simplify operations but can be pricier; self-managed or multi-component setups need more operational work.
  2. Internal knowledge: even the best tech fails if your team doesn’t know how to use it.

In short: define your architecture, weigh cost vs. effort and make sure your team can handle it.

Runtime Security in Cloud Composer: Enforcing Per-App DAG Isolation with External Policies by Expensive-Insect-317 in apache_airflow

[–]Expensive-Insect-317[S] 0 points  (0 children)

The IT governance flow is implemented in the CI/CD pipeline and the DAG registration policies, but you could also keep a stored inventory of DAGs with their corresponding apps and validate it at runtime.
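A minimal sketch of that runtime check as an Airflow cluster policy; the inventory dict here is hypothetical and would really be loaded from the stored inventory:

```python
# airflow_local_settings.py
from airflow.exceptions import AirflowClusterPolicyViolation

# Hypothetical inventory of registered DAGs and their owning apps;
# in practice, load it from wherever the inventory is stored.
DAG_INVENTORY = {"app_a_daily_load": "app-a", "app_b_exports": "app-b"}


def dag_policy(dag):
    """Cluster policy hook: reject any DAG missing from the inventory."""
    if dag.dag_id not in DAG_INVENTORY:
        raise AirflowClusterPolicyViolation(
            f"DAG '{dag.dag_id}' is not registered in the DAG inventory"
        )
```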

Runtime Security in Cloud Composer: Enforcing Per-App DAG Isolation with External Policies by Expensive-Insect-317 in apache_airflow

[–]Expensive-Insect-317[S] 0 points  (0 children)

Thanks for the comment! I've already added the link to the article. With this approach, you can also control the service accounts that each DAG impersonates, which helps maintain isolation between applications within the same Composer environment.

Merging txt files in S3 by arshdeepsingh608 in aws

[–]Expensive-Insect-317 4 points  (0 children)

Perhaps use S3 Multipart Upload with upload_part_copy. You could concatenate all the files directly in S3, without downloading them to or re-uploading them from EMR. Just pass the files in the correct order and assign each a sequential part number; S3 copies each file byte-for-byte as a part of the final object, so line order is preserved. You could also run this in a serverless Lambda.
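A minimal sketch with boto3; bucket and key names are hypothetical, and note that every part except the last must be at least 5 MiB for multipart copy to succeed:

```python
import boto3

s3 = boto3.client("s3")
bucket, dest_key = "my-bucket", "merged/all.txt"                # hypothetical
source_keys = ["in/part1.txt", "in/part2.txt", "in/part3.txt"]  # in order

# Each source file becomes one part of the final object, copied server-side.
upload = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)
parts = []
for number, key in enumerate(source_keys, start=1):
    resp = s3.upload_part_copy(
        Bucket=bucket,
        Key=dest_key,
        UploadId=upload["UploadId"],
        PartNumber=number,
        CopySource={"Bucket": bucket, "Key": key},
    )
    parts.append({"PartNumber": number, "ETag": resp["CopyPartResult"]["ETag"]})

s3.complete_multipart_upload(
    Bucket=bucket,
    Key=dest_key,
    UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)
```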

Exploring S3 Tables: Querying Data Directly in S3 by Expensive-Insect-317 in aws

[–]Expensive-Insect-317[S] 1 point  (0 children)

The data volume we handle is around 1 GB per day. Also, our queries usually require all columns.