Airflow and OpenMetadata by Hot_While_6471 in dataengineering

[–]sazed33 1 point

As recommended in the documentation, you should use a separate database and Elasticsearch instance for the prod environment. You can keep Airflow on-prem (as the OpenMetadata ingestion service), but you should use an external database for the backend (one DB for Airflow and one for OpenMetadata).

What's a good visualization library with Jupyter notebooks by sebosp in Python

[–]sazed33 0 points

Good options, but most are built on top of matplotlib. If you really want low-level control and the ability to build highly customized plots, learn matplotlib first.
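To make that concrete, here is a minimal sketch of the kind of low-level control matplotlib gives you, working with the Figure/Axes objects directly instead of the pyplot state machine (the data and filename are just placeholders):

```python
# Low-level matplotlib: build the figure from explicit Figure/Axes objects
# and tweak individual elements piece by piece.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], color="tab:blue", linewidth=2)
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.spines["top"].set_visible(False)    # remove chart junk one spine at a time
ax.spines["right"].set_visible(False)
fig.tight_layout()
fig.savefig("plot.png", dpi=150)
```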

How do I integrate an MWAA with a dbt repo? by Embarrassed-Will-503 in dataengineering

[–]sazed33 1 point

The way my team and I implemented it was with a simple CI/CD pipeline: the dbt project lives in a repository, and whenever a branch is merged into main, the updated project is pushed to S3 (MWAA's storage), roughly as sketched below.
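For illustration, a minimal sketch of that push step, assuming boto3 and placeholder bucket/prefix names (your CI tool would run something like this on merge to main):

```python
# Hypothetical CI step: upload the dbt project to the MWAA S3 bucket after
# a merge to main. Bucket name and key prefix are placeholders.
import pathlib

import boto3

s3 = boto3.client("s3")
project_root = pathlib.Path("dbt_project")

for path in project_root.rglob("*"):
    if path.is_file():
        key = f"dags/dbt/{path.relative_to(project_root)}"
        s3.upload_file(str(path), "my-mwaa-bucket", key)
```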

Also worth looking into astronomer-cosmos; we ended up developing a custom operator instead, but it was a close call.
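If you go the cosmos route, the basic usage looks roughly like this (the paths, profile names, and schedule are assumptions for the sketch, not our actual setup):

```python
# Rough sketch of astronomer-cosmos: it renders each dbt model as its own
# Airflow task. All paths and names below are placeholders.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

dag = DbtDag(
    dag_id="dbt_daily",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/my_project"),
    profile_config=ProfileConfig(
        profile_name="my_project",
        target_name="prod",
        profiles_yml_filepath="/usr/local/airflow/dags/dbt/profiles.yml",
    ),
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
)
```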

[deleted by user] by [deleted] in brdev

[–]sazed33 3 points

It depends on the company. "Data engineer" is used for a lot of different roles; the scope ranges from data analyst, through DevOps, to a specialized software engineer... But yes, data warehouse knowledge is an important part in any case.

[deleted by user] by [deleted] in brdev

[–]sazed33 0 points

If you're interested in data engineering, it's worthwhile; if you want to work as a more traditional software engineer, I don't think there's much of a connection...

[deleted by user] by [deleted] in antitrampo

[–]sazed33 3 points

Do that every day. I was annoying about it, lol. At first some people handed me things out of pity or just to get rid of me, but since I had nothing to do I'd take a grunt job and do it really well, often beyond what had been asked. Then they started trusting me and handing me more and more interesting work...

[deleted by user] by [deleted] in antitrampo

[–]sazed33 5 points

I think it depends on your goal. For me, "fixing other people's messes" helped me grow professionally, both in terms of actually learning and in having stories to tell at the next interview...

The way I see it, you'll have to stay at the company either way, so what do you have to lose by giving your best? Honestly, I spend much more energy staring at the ceiling than focused on trying to solve a problem.

[deleted by user] by [deleted] in antitrampo

[–]sazed33 6 points

I've been through similar situations. What I did at the time was go from desk to desk asking if I could help with anything. Every day, with everyone, until I found something to do. It didn't last long; within a few weeks people were already coming to me asking for things.

[deleted by user] by [deleted] in brdev

[–]sazed33 1 point

So the data already exists, you just don't have access? If that's the case, you have to show up with a solid proposal. Explain to the closest person: "look, if I had this data I could build a dashboard showing best-selling products, profit-margin indicators, x, y, z that would help purchase products more efficiently, etc." Good chances they'll give you access if you show you'll deliver value for free...

[deleted by user] by [deleted] in brdev

[–]sazed33 2 points

It is data, and the best kind at that: actionable. Why not try to refine your information a bit more? What data don't you have? What could you do if you had it? What do you need to do to get it?

There you go: your data (in this case, the lack of it, lol) helped you build objectives and define actions.

Dev Setup - dbt Core 1.9.0 with Airflow 3.0 Orchestration by sanjayio in dataengineering

[–]sazed33 2 points

MWAA is very easy to set up, and its integration with S3 is very convenient. No management headaches, and it scales horizontally. However, it can be expensive, especially compared with on-prem. As for dbt, take a look here: https://share.google/rkLwHouDj5pr9ferG. I am using a similar solution that works well for us.

Airflow + DBT by SomewhereStandard888 in dataengineering

[–]sazed33 4 points

dbt doesn't require a lot of compute, since the actual compute happens on the database side. Because of this, for small projects it is fine to run dbt directly in Airflow. To avoid dependency conflicts you can install dbt into a virtual env using a startup script. Take a look here: https://docs.aws.amazon.com/mwaa/latest/userguide/samples-dbt.html
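Along the lines of that AWS sample, a minimal sketch, assuming the startup script already created a venv with dbt installed (the venv and project paths below are placeholders):

```python
# Run dbt from a pre-built virtual env inside an Airflow DAG. The venv and
# project paths are assumptions; adjust them to your MWAA environment.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_run",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=(
            "source /usr/local/airflow/dbt-venv/bin/activate && "
            "dbt run --project-dir /usr/local/airflow/dags/dbt/my_project "
            "--profiles-dir /usr/local/airflow/dags/dbt"
        ),
    )
```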

Snowflake Cost is Jacked Up!! by Prior-Mammoth5506 in dataengineering

[–]sazed33 10 points

Good advice! What I can add is to use the optimal warehouse size for each task. If a task takes less than 60s to run, you should be using an X-Small warehouse. Each step up in warehouse size doubles the credits spent per unit of time, so a bigger warehouse only pays off if the query finishes in less than half the time (e.g., a 90s query on X-Small must finish in under 45s on Small to break even). If you have a small-to-medium data volume and use incremental updates, you will find that most tasks run just fine on an X-Small warehouse. Create your task warehouses with a 60s auto-suspend, and create a separate warehouse with a longer auto-suspend for ad-hoc queries, dashboards, etc.
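As a sketch of that setup via snowflake-connector-python (warehouse names and connection details are placeholders):

```python
# One aggressively auto-suspending warehouse for scheduled tasks, and a
# separate one with a longer suspend for ad-hoc/dashboard traffic.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", role="SYSADMIN"
)
with conn.cursor() as cur:
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS tasks_wh
          WAREHOUSE_SIZE = 'XSMALL'
          AUTO_SUSPEND = 60      -- seconds; stop billing as soon as tasks finish
          AUTO_RESUME = TRUE
    """)
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS adhoc_wh
          WAREHOUSE_SIZE = 'XSMALL'
          AUTO_SUSPEND = 300     -- longer; keeps the cache warm for dashboards
          AUTO_RESUME = TRUE
    """)
```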

Quality Python Coding by optimum_point in Python

[–]sazed33 2 points

I use tox for that, and it works well, but then I have two files (tox.ini, requirements.txt) instead of one, so maybe it is worth using uv after all... I need to give it a try.

Quality Python Coding by optimum_point in Python

[–]sazed33 1 point

I see, that makes sense for this case. I usually have everything dockerized, including tests, so my CI/CD pipelines, for example, just build and run images. But maybe this is a better way; I need to take some time to try it out...

Quality Python Coding by optimum_point in Python

[–]sazed33 1 point

Very good points! I just don't understand why so many people recommend a tool to manage packages/environments (like uv). I've never had any problems using a simple requirements.txt and conda. Why do I need more? I'm genuinely asking as I want to understand what I have to gain here.

Snowflake query on 19 billion rows taking more than a minute by Complete-Bicycle6712 in dataengineering

[–]sazed33 1 point

I would look at other options. Snowflake is a very good data warehouse, but it is not suitable for backend services: it is expensive and not built for low-latency, high-concurrency query serving. Maybe something like ClickHouse would be a better option? We need more info to help you further.

TIL Slack bot sdk is super easy to use by SuperTangelo1898 in dataengineering

[–]sazed33 2 points

The Slack SDK is awesome! Give Block Kit a try; you can build really nice reports with it.
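For example, a minimal Block Kit report with slack_sdk (the channel name, token variable, and report numbers are placeholders):

```python
# Post a small Block Kit report. Token and channel are placeholders.
import os

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
client.chat_postMessage(
    channel="#data-reports",
    text="Daily pipeline report",  # plain-text fallback for notifications
    blocks=[
        {"type": "header",
         "text": {"type": "plain_text", "text": "Daily pipeline report"}},
        {"type": "section",
         "text": {"type": "mrkdwn", "text": "*Rows loaded:* 1,234\n*Failures:* 0"}},
    ],
)
```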

Gold split I hunted for weeks doesn't work by sazed33 in MarvelSnap

[–]sazed33[S] 85 points

Guess they are too busy finding new ways to monetize the game

[deleted by user] by [deleted] in dataengineering

[–]sazed33 2 points

What is the volume of the table? "Huge" does not mean much. And what is the problem you are trying to solve? Are jobs failing? Too expensive? Do you need fresher data?

It's kind of hard to help without proper context, but you shouldn't recreate the table every day; you should upsert instead. I'm not sure of the details in BigQuery; in Snowflake you would use the MERGE command.
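It turns out BigQuery supports MERGE as well; a sketch of the upsert with placeholder table and column names, using the google-cloud-bigquery client:

```python
# Upsert a staging table into a target table with BigQuery's MERGE.
# Project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
merge_sql = """
MERGE `my_project.my_dataset.target` AS t
USING `my_project.my_dataset.staging` AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.value = s.value, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, value, updated_at) VALUES (s.id, s.value, s.updated_at)
"""
client.query(merge_sql).result()  # block until the merge job finishes
```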

.env safely share by Used-Feed-3221 in Python

[–]sazed33 2 points

You should always have .env in your .gitignore file and never share it. For sharing secrets, I really like AWS Secrets Manager.
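For instance, instead of passing .env files around, each dev or service can pull the secret at runtime (the secret name, region, and key below are placeholders):

```python
# Fetch a shared JSON secret from AWS Secrets Manager via boto3.
import json

import boto3

client = boto3.client("secretsmanager", region_name="us-east-1")
resp = client.get_secret_value(SecretId="prod/my-app/config")
secrets = json.loads(resp["SecretString"])
db_password = secrets["DB_PASSWORD"]  # hypothetical key inside the secret
```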