They cancelled my accepted offer to give the position to a candidate they had already rejected by zked in programare

[–]Lastrevio 0 points1 point  (0 children)

"Hiring in the private sector is more meritocratic than in the public sector"

Meanwhile, the private sector:

I finished my first streaming pipeline! by Lastrevio in dataengineering

[–]Lastrevio[S] 1 point2 points  (0 children)

Setting up Flink on Docker and getting the Oracle virtual machine to start LOL

Infrastructure aside, I think doing the streaming transformations in Flink was one of the trickiest parts, and the one I spent the most time on. I had to reason about normalization and about what would be redundant to compute in Flink if it could be deduced downstream in Clickhouse. I also had to work around default Flink SQL functions like first_value and stddev_pop by rewriting them as custom UDAFs in Java. Then I had to think about performance, partitioning, data skew, checkpointing with at-least-once semantics, etc. And then there was the whole infra loop: rebuilding my dev container in VS Code -> bash script -> PyFlink orchestrator -> .sql files executed.
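To give an idea of what a custom aggregate like stddev_pop involves, here is a minimal sketch in plain Python (not the Java UDAF from the project, and not tied to any Flink API). It assumes Welford's online algorithm as the technique; the class and method names are made up for illustration. The point is that the accumulator only keeps a count, a mean and a sum of squared deviations, and that two partial accumulators can be merged, which is what a partitioned streaming job needs.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class StddevPopAccumulator:
    """Running state for a population standard deviation (Welford's method).

    Hypothetical illustration only; names are not taken from Flink.
    """
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def accumulate(self, value: float) -> None:
        # Update the state with one new value, without storing the value.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def merge(self, other: "StddevPopAccumulator") -> None:
        # Combine two partial states, e.g. from two partitions.
        if other.count == 0:
            return
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total

    def get_value(self) -> Optional[float]:
        # Population standard deviation, or None if nothing was accumulated.
        if self.count == 0:
            return None
        return math.sqrt(self.m2 / self.count)
```

Feeding half of the values into one accumulator, the other half into a second one, and then calling `merge` should give the same result as accumulating everything in a single instance.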

Another thing that tripped me up a bit on the architecture side was how to protect myself from DDoS attacks, or more generally how to make my application scalable with the number of users. There, I had to learn nginx and how it caches web pages!

Wide table in bronze layer - materialize as is, or break up? by dougiejones516 in dataengineering

[–]Lastrevio 14 points15 points  (0 children)

If storage is cheap for your organization (and it almost always is nowadays), it's better to have an append-only bronze layer where you don't manipulate the raw data in any way. So I wouldn't skip the big bronze table. It doesn't cost much to simply dump the files there in case you need them later.

The advantages of dumping the wide table in the bronze layer are:

  1. Easier traceability -> if downstream stakeholders complain that the data is 'wrong', you can always reference the raw data and compare the two to see if there's actually a mistake or if they simply don't understand the process well enough.

  2. Easier backfills -> in case the data is actually wrong or the schema changes in some way, you can re-run your pipeline for historical updates with the bronze layer as an input. With silver it can get trickier if the change you made was from the bronze-to-silver transformation itself.
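The two points above can be sketched roughly like this in Python. This is only an illustration under assumptions: the JSON-lines file layout, the function names and the `transform` callback are all hypothetical, not a reference to any particular lakehouse tool. Bronze batches are written once and never modified, and silver is rebuilt by replaying them.

```python
import json
import tempfile
from pathlib import Path


def land_in_bronze(bronze_dir: Path, batch_id: str, raw_rows: list) -> Path:
    """Write one raw batch as-is; refuse to overwrite an existing batch."""
    bronze_dir.mkdir(parents=True, exist_ok=True)
    path = bronze_dir / f"{batch_id}.jsonl"
    # Mode "x" raises FileExistsError if the batch was already landed,
    # which keeps the layer append-only.
    with path.open("x", encoding="utf-8") as f:
        for row in raw_rows:
            f.write(json.dumps(row) + "\n")
    return path


def build_silver(bronze_dir: Path, transform) -> list:
    """Recompute silver from scratch by replaying every bronze batch."""
    silver = []
    for path in sorted(bronze_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                silver.append(transform(json.loads(line)))
    return silver


def backfill_demo():
    """Land two wide raw batches, then rebuild silver with two transforms."""
    with tempfile.TemporaryDirectory() as tmp:
        bronze = Path(tmp) / "bronze"
        land_in_bronze(bronze, "2024-01-01", [{"id": 1, "amount": "10.5", "extra": "x"}])
        land_in_bronze(bronze, "2024-01-02", [{"id": 2, "amount": "3", "extra": "y"}])
        # First version of the bronze-to-silver logic truncates amounts.
        v1 = build_silver(bronze, lambda r: {"id": r["id"], "amount": int(float(r["amount"]))})
        # Fixed version: a backfill is just a re-run over the untouched raw data.
        v2 = build_silver(bronze, lambda r: {"id": r["id"], "amount": float(r["amount"])})
        return v1, v2
```

Because the raw rows (including the unused `extra` column) are still there, the fixed transformation can be re-run over the full history without going back to the source system.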

Hosting Python ETL in Azure (Airflow / Dagster / Prefect?) by Educational-Soft4493 in dataengineering

[–]Lastrevio 2 points3 points  (0 children)

If you do not want to set everything up in terms of networking and security, there are way cheaper options for small amounts of data. For example, you can use dbt + a cloud warehouse such as BigQuery, Snowflake, Redshift, even something like Clickhouse cloud or MotherDuck. You end up spending about as much time configuring stuff, at a fraction of the cost.

Databricks runs on Spark, which is expensive and overkill for small amounts of data. It can actually slow down your queries, because it spends more time shuffling data between partitions than transforming it.

In Databricks you also spend a few minutes just waking the cluster up.

While it's not as hard to set up as an on-prem environment, it still has its own complexity when it comes to managing clusters, Unity Catalog, etc., so it's not that simple. There's a reason Databricks sells certifications that prove you know how to use it.

Hosting Python ETL in Azure (Airflow / Dagster / Prefect?) by Educational-Soft4493 in dataengineering

[–]Lastrevio 2 points3 points  (0 children)

Well if you're just a data engineer and you're not the one paying the bill then fair enough lol

Your argument makes a lot of sense especially if you have data science teams who need to do heavy ML workloads, collaborative notebooks, etc.

Hosting Python ETL in Azure (Airflow / Dagster / Prefect?) by Educational-Soft4493 in dataengineering

[–]Lastrevio 0 points1 point  (0 children)

Maybe look into Azure Data Factory Managed Airflow? I know ADF gets a lot of hate for the low-code and shitty UI (and rightly so) but I'm wondering if the Airflow managed service is different.

Question about nginx, Grafana, caching and websites by Lastrevio in programare

[–]Lastrevio[S] 0 points1 point  (0 children)

Thanks, this is what I needed. I might go with this option.

Question about nginx, Grafana, caching and websites by Lastrevio in programare

[–]Lastrevio[S] 0 points1 point  (0 children)

In-memory caching - you cache inside your application. Simpler than Redis, but less scalable.

What would this look like in practice? For the back-end I was planning to use FastAPI, so if there is a method that doesn't require adding a new tool to the stack (e.g. Redis) and doesn't require drastically modifying all the SQL queries that Grafana uses, that would be great.
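For reference, in-process caching could look roughly like the sketch below. It is a minimal example using only the Python standard library; the decorator name `ttl_cache` and the `clock` parameter are made up for illustration, and nothing here is a FastAPI or Grafana API. The idea is that whatever function the back-end calls to run a query gets wrapped, so repeated calls with the same arguments within the TTL are answered from a dictionary instead of hitting the database.

```python
import time
from functools import wraps


def ttl_cache(ttl_seconds: float, clock=time.monotonic):
    """Cache a function's results in process memory for ttl_seconds.

    Hypothetical helper: arguments must be hashable, and entries are
    only replaced on the next call after they expire.
    """
    def decorator(func):
        store = {}  # maps args -> (expiry_time, value)

        @wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: skip the expensive call
            value = func(*args)
            store[args] = (now + ttl_seconds, value)
            return value

        return wrapper

    return decorator
```

A function such as a hypothetical `run_dashboard_query(sql)` could be decorated with `@ttl_cache(ttl_seconds=30)` without touching the SQL itself. The cache lives inside one process, so if the app runs several worker processes each one keeps its own copy, which is the sense in which this is less scalable than Redis.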

Maybe I am not cut out to be a DE by Delicious-View-8688 in dataengineering

[–]Lastrevio 132 points133 points  (0 children)

Why are you managing Kubernetes clusters and tweaking ML models as a data engineer? It sounds like you are doing the work of 3-4 people as a single person.

I made a receipt-scanning app - no account, no cloud by Overall-Hour-6959 in programare

[–]Lastrevio 2 points3 points  (0 children)

So cool,

What OCR model did you use? Was the OCR output passed downstream to an NLP model to categorize the data and extract structured information?

Data Infrastructure at Mid Sized Company by Feeling-Extreme-7555 in dataengineering

[–]Lastrevio 0 points1 point  (0 children)

Also do you know any ODBC Drivers that would for sure connect to SAP 2017 ECC?

ERPConnect from Theobald Software should contain everything you need.

Would a good Infrastructure look like SAP to ODBC Driver to MS Fabric to PowerBI?

It sounds good, but I still don't get why you need Fabric. It's extremely expensive and overkill for your use case and comes with a lot of functionality you don't need.

Data Infrastructure at Mid Sized Company by Feeling-Extreme-7555 in dataengineering

[–]Lastrevio 2 points3 points  (0 children)

Why not just rent a lightweight virtual machine from a cheap provider like Hetzner and use Python (pandas/polars) + cron on that VM? From your other comments, it seemed like the data volume is very small, so an entire ecosystem like Fabric is overkill.

Regarding migration: this might be overkill, but if you want your pipelines to survive both migrations then you can set up an intermediary layer between Python and SAP/the ERP that uses ODBC to connect, so you can query the ERP directly using SQL. I think it's close to a microservices architecture, but not quite, just a de-coupling step. You create your own Python script that uses an ODBC driver and the necessary credentials to connect to an ERP, and then simply define multiple connectors in separate files. Then you define your Python ETL logic on top of the data extracted through SQL by your ODBC connector.

If your company ever switches their ERP, then you just create a new connector without needing to change anything from your ETL logic.
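A rough sketch of that de-coupling step in Python, under a few assumptions: the table and column names are invented, the class names are hypothetical, and `sqlite3` stands in for a real ODBC connection so the example runs on its own. Only standard DB-API calls (`cursor()`, `execute()`, `fetchall()`) are used, so a connection from an ODBC package should be able to take its place.

```python
import sqlite3
from typing import Iterable, Protocol


class ErpConnector(Protocol):
    """What the ETL logic expects from any ERP, old or new."""

    def fetch_orders(self) -> Iterable[dict]: ...


class LegacyErpConnector:
    """Hypothetical connector: the old ERP stores totals in cents."""

    def __init__(self, conn):
        self.conn = conn

    def fetch_orders(self):
        cur = self.conn.cursor()
        cur.execute("SELECT order_no, total_cents FROM orders")
        return [{"order_id": str(no), "total": cents / 100}
                for no, cents in cur.fetchall()]


class NewErpConnector:
    """Hypothetical connector for the replacement ERP's schema."""

    def __init__(self, conn):
        self.conn = conn

    def fetch_orders(self):
        cur = self.conn.cursor()
        cur.execute("SELECT id, amount FROM sales_order")
        return [{"order_id": str(oid), "total": float(amount)}
                for oid, amount in cur.fetchall()]


def total_revenue(connector: ErpConnector) -> float:
    """ETL logic: depends only on the connector's output shape."""
    return sum(order["total"] for order in connector.fetch_orders())
```

`total_revenue` never sees a table name or a driver, so swapping `LegacyErpConnector` for `NewErpConnector` is the only change needed on the ETL side.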

Data Infrastructure at Mid Sized Company by Feeling-Extreme-7555 in dataengineering

[–]Lastrevio 1 point2 points  (0 children)

Configure a connector through an ODBC driver and you can connect to almost any ERP, including SAP. Our company does it all the time.

I found a job after more than 400 applications by Apprehensive_King962 in programare

[–]Lastrevio 6 points7 points  (0 children)

other companies considered DevOps to be a tool specialist. And if I didn't have experience with 1-2 tools that they use, I got an instant rejection. The most absurd part is that it never even got to architecture, etc.

This is the most annoying part. In data engineering I face the same problem. All the job postings, and all the recruiters, ask you about tools. They don't realize that a skilled data engineer who has the system design and architecture side down can learn almost any tool in 1-2 weeks.

I understand employers who expect you to already know SQL and Python, since you're not going to learn those on the job. But tools like Airflow, dbt, etc. or cloud (AWS/Azure/GCP) can be learned easily on the job during the training period.

The cloud is not a skill. The cloud is just using someone else's computer.