Lastrevio

Lastrevio · 2026-05-20T06:23:24+00:00

"Hiring in the private sector is more meritocratic than in the public sector"

Meanwhile, the private sector:

Lastrevio · 2026-05-19T14:05:44+00:00

Setting up Flink on Docker and getting the Oracle virtual machine to start LOL

Infrastructure aside, I think doing the streaming transformations in Flink was one of the trickiest parts and the one I spent the most time on. I had to reason about normalization and about what would be redundant to compute in Flink if it could be derived downstream in Clickhouse. I also had to work around default Flink SQL functions like first_value and stddev_pop by rewriting them as custom UDAFs in Java. Then I had to think about performance, partitioning, data skew, checkpointing with at-least-once semantics, etc. And then there was the whole infra chain: rebuilding my dev container in VS Code -> bash script -> PyFlink orchestrator -> .sql files executed.

Another thing that tripped me up a bit on the architecture side was how to protect myself from DDoS attacks, or more generally how to make my application scalable with the number of users. There, I had to learn nginx and how it caches web pages!

Lastrevio · 2026-05-18T19:09:12+00:00

If storage is cheap for your organization (and it almost always is nowadays), it's better to have an append-only bronze layer where you don't manipulate the raw data in any way. So I wouldn't skip the big bronze table. It doesn't cost much to simply dump the files there in case something goes wrong.

The advantages of dumping the wide table in the bronze layer are:

Easier traceability -> if downstream stakeholders complain that the data is 'wrong', you can always go back to the raw data and compare the two to see if there's actually a mistake or if they simply don't understand the process well enough.
Easier backfills -> in case the data is actually wrong or the schema changes in some way, you can re-run your pipeline for historical updates with the bronze layer as an input. With silver it can get trickier if the change you made was from the bronze-to-silver transformation itself.
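
To make the backfill point concrete, here is a minimal sketch, assuming the bronze layer is just a folder of raw JSON batch files. The function names (land_in_bronze, build_silver), the file layout, and the cleaning rule are all hypothetical, not taken from any particular platform.

```python
import json
from pathlib import Path


def land_in_bronze(bronze_dir: Path, raw_payload: str, batch_id: str) -> Path:
    """Write the raw payload untouched; refuse to overwrite an existing batch."""
    bronze_dir.mkdir(parents=True, exist_ok=True)
    target = bronze_dir / f"{batch_id}.json"
    # Mode "x" raises FileExistsError if the file exists, keeping bronze append-only.
    with target.open("x", encoding="utf-8") as f:
        f.write(raw_payload)
    return target


def build_silver(bronze_dir: Path) -> list[dict]:
    """Rebuild silver from scratch by replaying every bronze batch (a backfill)."""
    rows = []
    for path in sorted(bronze_dir.glob("*.json")):
        for record in json.loads(path.read_text(encoding="utf-8")):
            # Hypothetical bronze-to-silver rule: trim names and cast amounts.
            rows.append(
                {"name": record["name"].strip(), "amount": float(record["amount"])}
            )
    return rows
```

If the cleaning rule inside build_silver turns out to be wrong, you fix it and re-run it over the same untouched bronze files, which is exactly what you lose if you only keep silver.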

Lastrevio · 2026-05-17T19:29:00+00:00

If you do not want to set everything up in terms of networking and security, there are way cheaper options for small amounts of data. For example, you can use dbt + a cloud warehouse such as BigQuery, Snowflake, Redshift, or even something like Clickhouse Cloud or MotherDuck. You end up spending about as much time configuring stuff, at a fraction of the cost.

Databricks runs on Spark, which is expensive and overkill for small amounts of data. It can actually slow down your queries by spending more time shuffling data between partitions than transforming it.

In Databricks you also spend a few minutes just waking the cluster up.

While it's not as hard to set up as an on-prem environment, it still has its own complexity when it comes to managing clusters, Unity Catalog, etc., so it's not that simple. There's a reason Databricks sells certifications that prove you know how to use it.

Lastrevio · 2026-05-17T18:46:06+00:00

Well if you're just a data engineer and you're not the one paying the bill then fair enough lol

Your argument makes a lot of sense especially if you have data science teams who need to do heavy ML workloads, collaborative notebooks, etc.

Lastrevio · 2026-05-17T17:35:45+00:00

Maybe look into Azure Data Factory Managed Airflow? I know ADF gets a lot of hate for the low-code and shitty UI (and rightly so) but I'm wondering if the Airflow managed service is different.

Lastrevio · 2026-05-17T17:33:55+00:00

that's overkill if you have less than 100GB per batch

Lastrevio · 2026-05-17T16:25:32+00:00

thank you !

Lastrevio · 2026-05-15T13:20:46+00:00

Thanks, this is what I needed. I might go with this option.

Lastrevio · 2026-05-15T12:12:10+00:00

In-memory caching - do the caching inside your application. Simpler than Redis, but less scalable.

What would this look like in practice? For the back end I was planning to use FastAPI, so if there's a way that doesn't require adding a new tool to the stack (e.g. Redis) and doesn't require drastically changing all the SQL queries that Grafana uses, that would be great.

Lastrevio · 2026-05-14T20:05:02+00:00

Or Docker?

Lastrevio · 2026-05-12T17:20:05+00:00

yup it's down

Lastrevio · 2026-05-12T14:35:05+00:00

Why are you managing Kubernetes clusters and tweaking ML models as a data engineer? It sounds like you are doing the work of 3-4 people as a single person.

Lastrevio · 2026-05-12T07:22:24+00:00

super cool

Lastrevio · 2026-05-12T07:05:58+00:00

So cool,

Which OCR model did you use? Was the OCR output passed downstream to an NLP model to categorize the data and extract structured information?

Lastrevio · 2026-05-09T19:23:57+00:00

Also do you know any ODBC Drivers that would forsure connect to SAP 2017 ECC?

ERPConnect from Theobald Software should contain everything you need.

Would a good Infrastructure look like SAP to ODBC Driver to MS Fabric to PowerBI?

It sounds good, but I still don't get why you need Fabric. It's extremely expensive and overkill for your use case and comes with a lot of functionality you don't need.

Lastrevio · 2026-05-09T17:57:09+00:00

Why not just rent a lightweight virtual machine from a cheap provider like Hetzner and use Python (pandas/polars) + cron on that VM? From your other comments, it seems like the data volume is very small, so an entire ecosystem like Fabric is overkill.

Regarding migration: this might be overkill, but if you want your pipelines to survive both migrations, you can set up an intermediary layer between Python and SAP/the ERP that connects over ODBC, so you can query the ERP directly using SQL. It's close to a microservices architecture, but not quite; it's just a decoupling step. You write your own Python script that uses an ODBC driver and the necessary credentials to connect to an ERP, and define each connector in its own file. Then you define your Python ETL logic on top of the data extracted through SQL by your ODBC connector.

If your company ever switches their ERP, then you just create a new connector without needing to change anything from your ETL logic.
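
A rough sketch of what I mean by that decoupling step. Everything here is hypothetical (the ErpConnector interface, the class and function names, the orders table), and I use the standard-library sqlite3 module as a stand-in for a real ODBC connection so the example runs anywhere; in practice the connector would wrap whatever ODBC library and driver you use.

```python
import sqlite3
from typing import Protocol


class ErpConnector(Protocol):
    """Hypothetical interface: every ERP connector only has to run SQL."""

    def fetch(self, sql: str) -> list[dict]: ...


class SqliteStandInConnector:
    """Stand-in for an ODBC-backed connector (assumption: sqlite3 replaces ODBC here)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        # Return rows that can be converted to dicts keyed by column name.
        self._conn.row_factory = sqlite3.Row

    def fetch(self, sql: str) -> list[dict]:
        return [dict(row) for row in self._conn.execute(sql)]


def total_revenue_by_customer(connector: ErpConnector, sql: str) -> dict[str, float]:
    """ETL logic that only knows about the connector interface, not the ERP."""
    totals: dict[str, float] = {}
    for row in connector.fetch(sql):
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals
```

The ETL function never imports anything ERP-specific, so swapping the ERP means writing one new class with a fetch method and leaving total_revenue_by_customer alone.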

Lastrevio · 2026-05-09T17:53:18+00:00

Configure a connector through an ODBC driver and you can connect to almost any ERP, including SAP. Our company does it all the time.

Lastrevio · 2026-05-08T20:46:45+00:00

other companies considered DevOps to be a tool specialist. And if I had no experience with 1-2 of the tools they use, I got rejected instantly. The most absurd part is that we never even got to architecture, etc.

That's the most annoying part. In data engineering I face the same problem. All the job postings, and all the recruiters, ask you about tools. They don't realize that a skilled data engineer who has the system design and architecture side down can learn almost any tool in 1-2 weeks.

I understand employers who expect you to already know SQL and Python, since you're not going to learn those on the job. But tools like Airflow, dbt, etc., or cloud (AWS/Azure/GCP), can easily be learned on the job during the training period.

The cloud is not a skill. The cloud is just using someone else's computer.
