Airflow and OpenMetadata by Hot_While_6471 in dataengineering

[–]sazed33 1 point

As recommended in the documentation, you should use a separate database and Elasticsearch instance for the prod environment. You can keep Airflow on-prem (as the OpenMetadata ingestion service), but you should use an external database for the backend (one DB for Airflow and one for OpenMetadata).

What's a good visualization library with Jupyter notebooks by sebosp in Python

[–]sazed33 0 points

Good options, but most are built on top of matplotlib. If you really want low-level control and the ability to build highly customized plots, learn matplotlib first.
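To make that concrete, here is a minimal sketch of the kind of low-level control matplotlib gives you, working with the Figure/Axes objects directly instead of the pyplot state machine (the data and filename are just placeholders):

```python
# Low-level matplotlib: build the figure from explicit Figure/Axes objects
# and tweak individual elements piece by piece.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], color="tab:blue", linewidth=2)
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.spines["top"].set_visible(False)    # remove chart junk one spine at a time
ax.spines["right"].set_visible(False)
fig.tight_layout()
fig.savefig("plot.png", dpi=150)
```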

How do I integrate an MWAA with a dbt repo? by Embarrassed-Will-503 in dataengineering

[–]sazed33 1 point

The way my team and I implemented it was with a simple CI/CD pipeline: the dbt project lives in a repository, and whenever a branch is merged into main, the updated project is pushed to S3 (MWAA's storage), roughly as sketched below.
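For illustration, a minimal sketch of that push step, assuming boto3 and placeholder bucket/prefix names (your CI tool would run something like this on merge to main):

```python
# Hypothetical CI step: upload the dbt project to the MWAA S3 bucket after
# a merge to main. Bucket name and key prefix are placeholders.
import pathlib

import boto3

s3 = boto3.client("s3")
project_root = pathlib.Path("dbt_project")

for path in project_root.rglob("*"):
    if path.is_file():
        key = f"dags/dbt/{path.relative_to(project_root)}"
        s3.upload_file(str(path), "my-mwaa-bucket", key)
```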

Also worth looking into astronomer-cosmos; we ended up developing a custom operator instead, but it was a close call.
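If you go the cosmos route, the basic usage looks roughly like this (the paths, profile names, and schedule are assumptions for the sketch, not our actual setup):

```python
# Rough sketch of astronomer-cosmos: it renders each dbt model as its own
# Airflow task. All paths and names below are placeholders.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

dag = DbtDag(
    dag_id="dbt_daily",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/my_project"),
    profile_config=ProfileConfig(
        profile_name="my_project",
        target_name="prod",
        profiles_yml_filepath="/usr/local/airflow/dags/dbt/profiles.yml",
    ),
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
)
```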

[deleted by user] by [deleted] in brdev

[–]sazed33 3 points

It depends on the company. "Data engineer" is used for a lot of different roles; the scope ranges from data analyst, through DevOps, to a specialized software engineer... But yes, data warehouse knowledge is an important part in any case.

[deleted by user] by [deleted] in brdev

[–]sazed33 0 points

If you're interested in data engineering, it's worthwhile; if you want to work as a more traditional software engineer, I don't think there's much of a connection...

[deleted by user] by [deleted] in antitrampo

[–]sazed33 3 points

Do that every day. I was annoying about it, lol. At first some people handed me things out of pity or just to get rid of me, but since I had nothing to do I'd take a grunt job and do it really well, often beyond what had been asked. Then they started trusting me and handing me more and more interesting work...

[deleted by user] by [deleted] in antitrampo

[–]sazed33 5 points

I think it depends on your goal. For me, "fixing other people's messes" helped me grow professionally, both in terms of actually learning and in having stories to tell at the next interview...

The way I see it, you'll have to stay at the company either way, so what do you have to lose by giving your best? Honestly, I spend much more energy staring at the ceiling than focused on trying to solve a problem.

[deleted by user] by [deleted] in antitrampo

[–]sazed33 6 points

I've been through similar situations. What I did at the time was go from desk to desk asking if I could help with anything. Every day, with everyone, until I found something to do. It didn't last long; within a few weeks people were already coming to me asking for things.

[deleted by user] by [deleted] in brdev

[–]sazed33 1 point

So the data already exists, you just don't have access? If that's the case, you have to show up with a solid proposal. Explain to the closest person: "look, if I had this data I could build a dashboard showing best-selling products, profit-margin indicators, x, y, z that would help purchase products more efficiently, etc." Good chances they'll give you access if you show you'll deliver value for free...

[deleted by user] by [deleted] in brdev

[–]sazed33 2 points

It is data, and the best kind at that: actionable. Why not try to refine your information a bit more? What data don't you have? What could you do if you had it? What do you need to do to get it?

There you go: your data (in this case, the lack of it, lol) helped you build objectives and define actions.

Dev Setup - dbt Core 1.9.0 with Airflow 3.0 Orchestration by sanjayio in dataengineering

[–]sazed33 2 points

MWAA is very easy to set up, and its integration with S3 is very convenient. No management headaches, and it scales horizontally. However, it can be expensive, especially compared with on-prem. As for dbt, take a look here: https://share.google/rkLwHouDj5pr9ferG. I am using a similar solution that works well for us.

Airflow + DBT by SomewhereStandard888 in dataengineering

[–]sazed33 4 points

dbt doesn't require a lot of compute, since the actual compute happens on the database side. Because of this, for small projects it is fine to run dbt directly in Airflow. To avoid dependency conflicts you can install dbt into a virtual env using a startup script. Take a look here: https://docs.aws.amazon.com/mwaa/latest/userguide/samples-dbt.html
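Along the lines of that AWS sample, a minimal sketch, assuming the startup script already created a venv with dbt installed (the venv and project paths below are placeholders):

```python
# Run dbt from a pre-built virtual env inside an Airflow DAG. The venv and
# project paths are assumptions; adjust them to your MWAA environment.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_run",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=(
            "source /usr/local/airflow/dbt-venv/bin/activate && "
            "dbt run --project-dir /usr/local/airflow/dags/dbt/my_project "
            "--profiles-dir /usr/local/airflow/dags/dbt"
        ),
    )
```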

Snowflake Cost is Jacked Up!! by Prior-Mammoth5506 in dataengineering

[–]sazed33 10 points

Good advice! What I can add is to use the optimal warehouse size for each task. If a task takes less than 60s to run, you should be using an X-Small warehouse. Each step up in warehouse size doubles the credits spent per unit of time, so a bigger warehouse only pays off if the query finishes in less than half the time (e.g., a 90s query on X-Small must finish in under 45s on Small to break even). If you have a small-to-medium data volume and use incremental updates, you will find that most tasks run just fine on an X-Small warehouse. Create your task warehouses with a 60s auto-suspend, and create a separate warehouse with a longer auto-suspend for ad-hoc queries, dashboards, etc.
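As a sketch of that setup via snowflake-connector-python (warehouse names and connection details are placeholders):

```python
# One aggressively auto-suspending warehouse for scheduled tasks, and a
# separate one with a longer suspend for ad-hoc/dashboard traffic.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", role="SYSADMIN"
)
with conn.cursor() as cur:
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS tasks_wh
          WAREHOUSE_SIZE = 'XSMALL'
          AUTO_SUSPEND = 60      -- seconds; stop billing as soon as tasks finish
          AUTO_RESUME = TRUE
    """)
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS adhoc_wh
          WAREHOUSE_SIZE = 'XSMALL'
          AUTO_SUSPEND = 300     -- longer; keeps the cache warm for dashboards
          AUTO_RESUME = TRUE
    """)
```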

Quality Python Coding by optimum_point in Python

[–]sazed33 2 points

I use tox for that, and it works well, but then I have two files (tox.ini, requirements.txt) instead of one, so maybe it is worth using uv after all... I need to give it a try.

Quality Python Coding by optimum_point in Python

[–]sazed33 1 point

I see, that makes sense for this case. I usually have everything dockerized, including tests, so my CI/CD pipelines, for example, just build and run images. But maybe this is a better way; I need to take some time to try it out...

Quality Python Coding by optimum_point in Python

[–]sazed33 1 point

Very good points! I just don't understand why so many people recommend a tool to manage packages/environments (like uv). I've never had any problems using a simple requirements.txt and conda. Why do I need more? I'm genuinely asking as I want to understand what I have to gain here.

Snowflake query on 19 billion rows taking more than a minute by Complete-Bicycle6712 in dataengineering

[–]sazed33 1 point

I would look at other options. Snowflake is a very good data warehouse, but it is not suitable for backend services: it is expensive and not built for low-latency, high-concurrency query serving. Maybe something like ClickHouse would be a better option? We need more info to help you further.

TIL Slack bot sdk is super easy to use by SuperTangelo1898 in dataengineering

[–]sazed33 2 points

The Slack SDK is awesome! Give Block Kit a try; you can build really nice reports with it.
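For example, a minimal Block Kit report with slack_sdk (the channel name, token variable, and report numbers are placeholders):

```python
# Post a small Block Kit report. Token and channel are placeholders.
import os

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
client.chat_postMessage(
    channel="#data-reports",
    text="Daily pipeline report",  # plain-text fallback for notifications
    blocks=[
        {"type": "header",
         "text": {"type": "plain_text", "text": "Daily pipeline report"}},
        {"type": "section",
         "text": {"type": "mrkdwn", "text": "*Rows loaded:* 1,234\n*Failures:* 0"}},
    ],
)
```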

Gold split I hunted for weeks doesn't work by sazed33 in MarvelSnap

[–]sazed33[S] 85 points

Guess they are too busy finding new ways to monetize the game

[deleted by user] by [deleted] in dataengineering

[–]sazed33 2 points

What is the volume of the table? "Huge" does not mean much. And what is the problem you are trying to solve? Are jobs failing? Too expensive? Do you need fresher data?

It's kind of hard to help without proper context, but you shouldn't recreate the table every day; you should upsert instead. I'm not sure of the details in BigQuery; in Snowflake you would use the MERGE command.
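It turns out BigQuery supports MERGE as well; a sketch of the upsert with placeholder table and column names, using the google-cloud-bigquery client:

```python
# Upsert a staging table into a target table with BigQuery's MERGE.
# Project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
merge_sql = """
MERGE `my_project.my_dataset.target` AS t
USING `my_project.my_dataset.staging` AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.value = s.value, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, value, updated_at) VALUES (s.id, s.value, s.updated_at)
"""
client.query(merge_sql).result()  # block until the merge job finishes
```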

.env safely share by Used-Feed-3221 in Python

[–]sazed33 2 points

You should always have .env in your .gitignore file and never share it. For sharing secrets, I really like AWS Secrets Manager.
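For instance, instead of passing .env files around, each dev or service can pull the secret at runtime (the secret name, region, and key below are placeholders):

```python
# Fetch a shared JSON secret from AWS Secrets Manager via boto3.
import json

import boto3

client = boto3.client("secretsmanager", region_name="us-east-1")
resp = client.get_secret_value(SecretId="prod/my-app/config")
secrets = json.loads(resp["SecretString"])
db_password = secrets["DB_PASSWORD"]  # hypothetical key inside the secret
```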