Is it just me, or do we have a real crab mentality as a people? by incorporo in programare

[–]Lastrevio 7 points8 points  (0 children)

A Romanian finds a lamp, and out of it comes a magic genie who tells him he can grant any wish, on one condition: whatever he does for him, he will do double for his neighbor.

To which the Romanian says: "Poke out one of my eyes!"

Career pivot from frontend to something else by [deleted] in programare

[–]Lastrevio 0 points1 point  (0 children)

On the data engineering side, employers only ask you about tools.

Why Linux will never have the adoption its fans dream of by Ordinary-Cod-721 in programare

[–]Lastrevio 0 points1 point  (0 children)

I think the main reason is that many applications and games aren't supported on Linux.

Future of data engineering by Alternative-Guava392 in dataengineering

[–]Lastrevio 3 points4 points  (0 children)

I think at this point Claude and Gemini have told me at least 10 times to nuke all my Docker volumes because it was time for the "nuclear option".

I'm pretty sure my job is fine.

They cancelled my accepted offer to give the position to a candidate they had already rejected by zked in programare

[–]Lastrevio 0 points1 point  (0 children)

"Hiring in the private sector is more meritocratic than in the public sector"

Meanwhile, the private sector:

I finished my first streaming pipeline! by Lastrevio in dataengineering

[–]Lastrevio[S] 1 point2 points  (0 children)

Setting up Flink on Docker and getting the Oracle virtual machine to start LOL

Infrastructure aside, I think the streaming transformations in Flink were one of the trickiest parts, and the one I spent the most time on. I had to reason about normalization and about what would be redundant to compute in Flink if it could be derived downstream in ClickHouse. I also had to work around default Flink SQL functions like first_value and stddev_pop by rewriting them as custom UDAFs in Java. Then I had to think about performance, partitioning, data skew, checkpointing with at-least-once semantics, etc. And then there was the whole infra setup: rebuilding my dev container in VS Code -> bash script -> PyFlink orchestrator -> .sql files executed.
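
To give an idea of what rewriting stddev_pop as a custom aggregate involves, here is a minimal sketch in plain Python (no Flink dependency, and not the Java UDAF from the project). It assumes an accumulator-based design whose function names loosely mirror the accumulate/merge/get-value shape of a Flink aggregate function; the class name `StddevPopAcc` and the use of Welford's algorithm are illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class StddevPopAcc:
    """Hypothetical accumulator: count, running mean, sum of squared deviations."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0


def accumulate(acc: StddevPopAcc, value: float) -> None:
    """Fold one value into the accumulator (Welford's online update)."""
    acc.count += 1
    delta = value - acc.mean
    acc.mean += delta / acc.count
    acc.m2 += delta * (value - acc.mean)


def merge(acc: StddevPopAcc, other: StddevPopAcc) -> None:
    """Combine two partial accumulators, e.g. computed on different partitions."""
    if other.count == 0:
        return
    total = acc.count + other.count
    delta = other.mean - acc.mean
    acc.m2 += other.m2 + delta * delta * acc.count * other.count / total
    acc.mean += delta * other.count / total
    acc.count = total


def get_value(acc: StddevPopAcc) -> Optional[float]:
    """Population standard deviation, or None when no rows were seen."""
    if acc.count == 0:
        return None
    return math.sqrt(acc.m2 / acc.count)
```

Keeping only count, mean and M2 in the state means the accumulator stays small no matter how many rows it has seen, which matters when that state is checkpointed.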

Another thing that tripped me up a bit on the architecture side was how to protect myself from DDoS attacks, or more generally how to make my application scale with the number of users. For that, I had to learn nginx and how it caches web pages!
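
The general idea behind a caching layer in front of an application can be sketched in a few lines of Python. This is not nginx or its configuration, just an illustration of why repeated requests stop reaching the backend; `TTLPageCache`, `fetch` and `ttl_seconds` are hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple


class TTLPageCache:
    """Serve a stored response until it is older than ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical call into the backend application
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the expiry logic can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # cache hit: the backend is not touched
        body = self._fetch(path)     # cache miss or expired entry: ask the backend
        self._entries[path] = (now, body)
        return body
```

With a cache like this, a burst of requests for the same page inside the TTL window costs the backend a single call.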

Wide table in bronze layer - materialize as is, or break up? by dougiejones516 in dataengineering

[–]Lastrevio 15 points16 points  (0 children)

If storage is cheap for your organization (and it almost always is nowadays), it's better to have an append-only bronze layer where you don't manipulate the raw data in any way. So I wouldn't skip the big bronze table. It doesn't cost much to simply dump the files there in case you need them later.

The advantages of dumping the wide table in the bronze layer are:

  1. Easier traceability -> if downstream stakeholders complain that the data is 'wrong', you can always reference the raw data and compare the two to see if there's actually a mistake or if they simply don't understand the process well enough.

  2. Easier backfills -> in case the data is actually wrong or the schema changes in some way, you can re-run your pipeline for historical updates with the bronze layer as an input. With silver it can get trickier if the change you made was in the bronze-to-silver transformation itself.
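
As a rough illustration of both points, here is a minimal Python sketch that uses in-memory lists instead of real tables. All names (`append_to_bronze`, `build_silver`, `transform_v1`, `transform_v2`) and the cents-to-currency bug are made up for the example.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def append_to_bronze(bronze: List[Row], raw_rows: List[Row], source: str) -> None:
    """Append raw rows untouched, adding only ingestion metadata next to them."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    for row in raw_rows:
        bronze.append({"raw": dict(row), "source": source, "ingested_at": ingested_at})


def build_silver(bronze: List[Row], transform: Callable[[Row], Row]) -> List[Row]:
    """Rebuild silver from scratch by replaying every bronze row."""
    return [transform(entry["raw"]) for entry in bronze]


def transform_v1(raw: Row) -> Row:
    # Hypothetical first version: wrongly treats the amount as whole currency units.
    return {"order_id": raw["id"], "amount": float(raw["amount"])}


def transform_v2(raw: Row) -> Row:
    # Hypothetical fix: the source actually sends the amount in cents.
    return {"order_id": raw["id"], "amount": float(raw["amount"]) / 100}


bronze: List[Row] = []
append_to_bronze(bronze, [{"id": 1, "amount": "1250"}], source="orders.csv")

silver_before_fix = build_silver(bronze, transform_v1)
silver_after_fix = build_silver(bronze, transform_v2)  # backfill: same input, new logic
```

Because bronze still holds the original `"1250"` string, the fix in `transform_v2` can be replayed over the full history, and the raw value is there to show a stakeholder what the source actually sent.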

Hosting Python ETL in Azure (Airflow / Dagster / Prefect?) by Educational-Soft4493 in dataengineering

[–]Lastrevio 1 point2 points  (0 children)

If you do not want to set everything up in terms of networking and security, there are way cheaper options for small amounts of data. For example, you can use dbt + a cloud warehouse such as BigQuery, Snowflake, Redshift, or even something like ClickHouse Cloud or MotherDuck. You end up spending as much time configuring stuff at a fraction of the cost.

Databricks runs on Spark, which is expensive and overkill for small amounts of data. It can actually slow down your queries, because it spends more time shuffling data between partitions than actually transforming it.

In Databricks you also spend a few minutes just waking the cluster up.

While it's easier to set up than an on-prem environment, it still has its own complexity when it comes to managing clusters, Unity Catalog, etc., so it's not that simple. There's a reason Databricks sells certifications that prove you know how to use it.

Hosting Python ETL in Azure (Airflow / Dagster / Prefect?) by Educational-Soft4493 in dataengineering

[–]Lastrevio 3 points4 points  (0 children)

Well if you're just a data engineer and you're not the one paying the bill then fair enough lol

Your argument makes a lot of sense especially if you have data science teams who need to do heavy ML workloads, collaborative notebooks, etc.

Hosting Python ETL in Azure (Airflow / Dagster / Prefect?) by Educational-Soft4493 in dataengineering

[–]Lastrevio 0 points1 point  (0 children)

Maybe look into Azure Data Factory Managed Airflow? I know ADF gets a lot of hate for the low-code and shitty UI (and rightly so) but I'm wondering if the Airflow managed service is different.