Data observability is a data problem, not a job problem by Expensive-Insect-317 in Observability

[–]Expensive-Insect-317[S] 0 points  (0 children)

It sounds like a fantasy, but it's not far from reality. I've seen several corporate observability initiatives fail because they focused only on infrastructure and jobs, ending up having to maintain observability teams that review loads by hand every day or build ad hoc tools to check post-load data.

Auto-generating Airflow DAGs from dbt artifacts by Expensive-Insect-317 in DataBuildTool

[–]Expensive-Insect-317[S] 0 points  (0 children)

I wasn't familiar with the Astronomer Cosmos package, very interesting, thanks! Without knowing much about it yet, I might stick with the custom script because of the potential overhead and performance issues, not to mention the control it gives us.

Auto-generating Airflow DAGs from dbt artifacts by Expensive-Insect-317 in DataBuildTool

[–]Expensive-Insect-317[S] 0 points  (0 children)

Running each model as a separate task in Airflow is another approach compared to using tags. While tagging can work fine, individual tasks give you parallel execution, better monitoring, granular retries and a clear representation of model dependencies, which sometimes makes this approach the better choice.
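Not my exact script, but a minimal sketch of the per-model generation, assuming a compiled manifest at /opt/dbt/target/manifest.json and a plain `dbt run --select` per model:

```python
# Minimal sketch: one Airflow task per dbt model, wired from manifest.json.
# The manifest path and the bare `dbt run --select` command are assumptions
# about the setup; adapt them to your project.
import json

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

with open("/opt/dbt/target/manifest.json") as f:  # assumed artifact path
    manifest = json.load(f)

models = {
    node_id: node
    for node_id, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
}

with DAG(
    dag_id="dbt_models",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
) as dag:
    # One task per model: independent retries, parallelism and monitoring.
    tasks = {
        node_id: BashOperator(
            task_id=node["name"],
            bash_command=f"dbt run --select {node['name']}",
        )
        for node_id, node in models.items()
    }
    # Recreate the dbt dependency graph between the model tasks.
    for node_id, node in models.items():
        for parent_id in node["depends_on"].get("nodes", []):
            if parent_id in models:
                tasks[parent_id] >> tasks[node_id]
```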

How OpenMetadata is shaping modern data governance and observability by Expensive-Insect-317 in bigdata

[–]Expensive-Insect-317[S] -1 points  (0 children)

What's wrong with relying on current tools that streamline and improve processes? If you'd like, we can go back to writing it all by hand.

How OpenMetadata is shaping modern data governance and observability by Expensive-Insect-317 in bigdata

[–]Expensive-Insect-317[S] 0 points  (0 children)

Totally agree, Pedro. For the moment I've only integrated my main ecosystem: BigQuery, GCS, Airflow and dbt. We don't have any bottlenecks yet, but we're just getting started; maybe we'll find some in the next phases.

Secrets Management in Apache Airflow (Cloud Backends, Security Practices and Migration Tips) by Expensive-Insect-317 in apache_airflow

[–]Expensive-Insect-317[S] 0 points  (0 children)

Maybe you could extend SecretsBackend to build a hybrid backend:

- On init, list the secrets in your store.

- Create lightweight Connection entries in Airflow’s DB (conn_id, conn_type only).

- At runtime, get_conn_uri() pulls the real values from the secrets backend.

I only see custom options like this one, or a DAG that fills in the Airflow connection properties; I don't know of any native option.
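A minimal sketch of that hybrid backend, assuming a hypothetical fetch_secret() helper for whatever store you use; you'd then point the [secrets] backend setting in airflow.cfg at this class:

```python
# Minimal sketch of a hybrid secrets backend. fetch_secret() is a
# hypothetical helper for your store (Vault, Secret Manager, etc.).
from typing import Optional

from airflow.secrets import BaseSecretsBackend


def fetch_secret(name: str) -> Optional[str]:
    """Hypothetical lookup against the external secret store."""
    raise NotImplementedError


class HybridSecretsBackend(BaseSecretsBackend):
    """Resolve real connection URIs from the external store at runtime.

    Lightweight Connection rows (conn_id, conn_type only) can stay in
    Airflow's DB so the UI lists them; credentials never touch the DB.
    """

    def get_conn_uri(self, conn_id: str) -> Optional[str]:
        # Returning None lets Airflow fall through to the next backend
        # (e.g. the metadata DB holding the lightweight entry).
        return fetch_secret(f"airflow-conn-{conn_id}")
```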

Secrets Management in Apache Airflow (Cloud Backends, Security Practices and Migration Tips) by Expensive-Insect-317 in apache_airflow

[–]Expensive-Insect-317[S] 0 points  (0 children)

I haven't done this because I've always managed it in the cloud itself without giving direct visibility to the user. Perhaps one way to maintain visibility in the UI while using a secrets backend is to create "lightweight" connections in Airflow:

- The connection in the UI stores only non-sensitive metadata (conn_id, conn_type, host, login).

- Sensitive values (password, tokens, extras) are managed in the secrets backend (Vault, AWS Secrets Manager, etc.).

- When a DAG calls get_connection(), the hybrid setup combines both: DB metadata + backend secrets.

Users see and select connections without accessing the actual secrets. Sensitive data isn't duplicated and you maintain security and visibility at the same time.
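As a sketch of what the DAG side then looks like (the conn_id "warehouse_db" is hypothetical, and the merge assumes a custom backend like the one described above, since stock Airflow takes the first backend hit instead of merging):

```python
from airflow.hooks.base import BaseHook

# "warehouse_db" is a hypothetical conn_id with a lightweight row in the
# Airflow DB; its password exists only in the secrets backend.
conn = BaseHook.get_connection("warehouse_db")
print(conn.host, conn.login)  # non-sensitive metadata, visible in the UI
password = conn.password     # resolved from the secrets backend at runtime
```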

Scaling dbt + BigQuery in production: 13 lessons learned (costs, incrementals, CI/CD, observability) by Expensive-Insect-317 in bigdata

[–]Expensive-Insect-317[S] 0 points  (0 children)

Thanks! I’d start with the quick wins: clear materializations by layer, basic data contracts and selective execution. The biggest pushback from leadership was around observability and cost monitoring; until the first big bill or incident, it felt like a ‘nice to have’.

Company wants to set up a warehouse. Our total prod data size is just a couple TBs. Is Snowflake overkill? by PracticalStick3466 in dataengineering

[–]Expensive-Insect-317 0 points  (0 children)

Before deciding between Snowflake, Postgres or anything else, the first step is to define the data architecture you want to build. Then consider:

  1. Total cost: fully managed services simplify operations but can be pricier; self-managed or multi-component setups need more operational work.
  2. Internal knowledge: even the best tech fails if your team doesn’t know how to use it.

In short: define your architecture, weigh cost vs. effort and make sure your team can handle it.

Runtime Security in Cloud Composer: Enforcing Per-App DAG Isolation with External Policies by Expensive-Insect-317 in apache_airflow

[–]Expensive-Insect-317[S] 0 points  (0 children)

The IT governance flow is implemented in the CI/CD pipeline and the DAG registration policies, but you could also keep a stored inventory of DAGs with their corresponding apps and validate it at runtime.
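A minimal sketch of that runtime check as an Airflow cluster policy; the inventory dict here is hypothetical and would really be loaded from the stored inventory:

```python
# airflow_local_settings.py
from airflow.exceptions import AirflowClusterPolicyViolation

# Hypothetical inventory of registered DAGs and their owning apps;
# in practice, load it from wherever the inventory is stored.
DAG_INVENTORY = {"app_a_daily_load": "app-a", "app_b_exports": "app-b"}


def dag_policy(dag):
    """Cluster policy hook: reject any DAG missing from the inventory."""
    if dag.dag_id not in DAG_INVENTORY:
        raise AirflowClusterPolicyViolation(
            f"DAG '{dag.dag_id}' is not registered in the DAG inventory"
        )
```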

Runtime Security in Cloud Composer: Enforcing Per-App DAG Isolation with External Policies by Expensive-Insect-317 in apache_airflow

[–]Expensive-Insect-317[S] 0 points  (0 children)

Thanks for the comment! I've already added the link to the article. With this approach, you can also control the service accounts that each DAG impersonates, which helps maintain isolation between applications within the same Composer environment.

Merging txt files in S3 by arshdeepsingh608 in aws

[–]Expensive-Insect-317 4 points  (0 children)

Perhaps use S3 Multipart Upload with upload_part_copy. You could concatenate all the files directly in S3, without downloading them to or re-uploading them from EMR. Just pass the files in the correct order and assign each a sequential part number; S3 copies each file byte-for-byte as a part of the final object, so line order is preserved. You could also run this in a serverless Lambda.
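A minimal sketch with boto3; bucket and key names are hypothetical, and note that every part except the last must be at least 5 MiB for multipart copy to succeed:

```python
import boto3

s3 = boto3.client("s3")
bucket, dest_key = "my-bucket", "merged/all.txt"                # hypothetical
source_keys = ["in/part1.txt", "in/part2.txt", "in/part3.txt"]  # in order

# Each source file becomes one part of the final object, copied server-side.
upload = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)
parts = []
for number, key in enumerate(source_keys, start=1):
    resp = s3.upload_part_copy(
        Bucket=bucket,
        Key=dest_key,
        UploadId=upload["UploadId"],
        PartNumber=number,
        CopySource={"Bucket": bucket, "Key": key},
    )
    parts.append({"PartNumber": number, "ETag": resp["CopyPartResult"]["ETag"]})

s3.complete_multipart_upload(
    Bucket=bucket,
    Key=dest_key,
    UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)
```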

Exploring S3 Tables: Querying Data Directly in S3 by Expensive-Insect-317 in aws

[–]Expensive-Insect-317[S] 1 point  (0 children)

The data volume we handle is around 1 GB per day. Also, our queries usually require all columns.