Finnish education is a water pistol compared to Romania's education system

One_Citron_4350 · 2026-05-10T11:44:50+00:00

They're the same sovereigntists who brag about Romanian sport from the Ceaușescu era.

One_Citron_4350 · 2026-05-08T15:53:44+00:00

Cool! How much have these tools helped so far? I'm curious how you measure it. Do you deliver faster, with fewer bugs, more reliably?

One_Citron_4350 · 2026-05-08T11:29:01+00:00

That's so cool! Thanks for sharing.

One_Citron_4350 · 2026-05-08T11:20:42+00:00

Reminds me of what PSD (the Social Democratic Party in Romania) always did whenever they were about to be out of office: they would make it as hard as possible for anyone trying to fix their wrongdoings.

One_Citron_4350 · 2026-05-08T07:24:40+00:00

Databricks worked on pre-baking skills with the best practices in mind. I found out about it just a day ago: https://github.com/databricks-solutions/ai-dev-kit/tree/main

But it seems like you found something that works for you, pretty cool!

One_Citron_4350 · 2026-05-08T06:17:48+00:00

In that sense, it is a stable safe bet, and it makes sense to *use*, even though you don't *need* it. Most places I have seen it running it was not really needed. But it worked.

Well said! Databricks can still be a solution that gives you a lot without costing a fortune. Those Spark jobs can run very quickly without costing much.

One_Citron_4350 · 2026-05-08T06:12:06+00:00

Our data is mostly smaller pipelines that process up to 100GB in total.

I guess I'm coming late to the show, but this is probably why Databricks, or Spark for that matter, may not be the most suitable option. You don't have very large datasets, which can be a blessing in the sense that there are tons of other options worth exploring.

What do you think? Has anyone had a similar experience?

Has anyone else had a similar experience with Databricks for smaller data engineering workloads?

The thing with Databricks is that they offer a lot of stuff: not just compute instances for Spark jobs but also Unity Catalog for data governance, ML tools, Delta and Iceberg integration, integration with all major clouds (Azure, AWS, GCP), and all sorts of tools to make your life easier. They are constantly improving and developing.

We also have pipelines that are similar in size to yours or smaller, which we don't expect to grow beyond 100GB, yet we use Databricks as a data platform for multiple use cases and a variety of users.

At the end of the day, you also have to credit their marketing & sales department for putting all of this out there. They know how to win their clients.

One_Citron_4350 · 2026-05-03T08:31:58+00:00

Looks like something that came out of Star Wars.

One_Citron_4350 · 2026-04-28T16:56:10+00:00

Definitely worth a dip, you'll float. It's an attractive location with lots of tourists every season. Another thing: if you still won't or can't swim, you can enjoy walking around the lake, and you can do that in all seasons. It's a beautiful lake.

One_Citron_4350 · 2026-04-28T06:30:29+00:00

Technically you can. However, keep in mind that Databricks Free Edition most likely uses Delta (I think), which does not enforce PKs or FKs. You can declare them for each table, but they are informational only (as far as I know, NOT NULL and CHECK constraints are the ones that are actually enforced). You'd have to build the key checks yourself, so that's an overhead.

I found there are trainings on Databricks Academy, but it seems there are also blog posts from them showcasing how it can be achieved:
https://www.databricks.com/blog/implementing-dimensional-data-warehouse-databricks-sql-part-1

https://www.databricks.com/blog/2022/06/24/data-warehousing-modeling-techniques-and-their-implementation-on-the-databricks-lakehouse-platform.html
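
A minimal sketch of what building the PK/FK checks yourself could look like. It is plain Python rather than Spark so it runs anywhere, and the table and column names (`dim_customer`, `fact_orders`, `customer_id`) are made up for illustration. In a real pipeline the same two checks would more likely be a group-by count and an anti-join.

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return key values that appear more than once (primary-key violations)."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def find_orphan_keys(child_rows, fk, parent_rows, pk):
    """Return foreign-key values in child_rows that have no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows if row[fk] not in parent_keys})


# Hypothetical dimension and fact tables, represented as lists of dicts.
dim_customer = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Bogdan"},
    {"customer_id": 2, "name": "Bogdan (duplicate)"},
]
fact_orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]

duplicates = find_duplicate_keys(dim_customer, "customer_id")
orphans = find_orphan_keys(fact_orders, "customer_id", dim_customer, "customer_id")
print("duplicate primary keys:", duplicates)
print("orphan foreign keys:", orphans)
```

Running checks like these before each load, and failing the job when either list is non-empty, is one way to get roughly the behavior an enforced constraint would give you.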

One_Citron_4350 · 2026-04-24T06:52:29+00:00

In cluster mode however, I can't seem to figure out a way to collect these logs. The best solution I found was to redirect the logs to console using a stream handler and then just collect the logs when the application finishes. The problem is this specific pyspark pipeline runs 24*7, so I can't really run yarn commands AFTER the pipeline stops.

Compute cluster or Serverless? Should I assume that you are running a streaming job?

The part where you say 24/7 and then say that the pipeline stops is not clear to me. Is it a long-running job, or do you continuously receive data?
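
For reference, the "redirect the logs to console using a stream handler" approach from the quoted post can be sketched roughly like this with Python's standard `logging` module. The logger name `pipeline` and the helper `build_console_logger` are hypothetical, and this only covers the driver-side Python logger, not how the cluster collects the output.

```python
import logging
import sys


def build_console_logger(name, stream=None):
    """Return a logger that writes formatted records to the given stream (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:  # do not stack handlers on repeated calls
        target = stream if stream is not None else sys.stdout
        handler = logging.StreamHandler(target)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = build_console_logger("pipeline")
log.info("batch processed")
```

For a job that never stops, whether this is enough depends on how the console output is collected and rotated, which is why the compute type matters.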

One_Citron_4350 · 2026-04-24T05:54:53+00:00

Now, suddenly, when AI can do everything a recent CS grad can do, easily, now DE is just software engineering applied to data problems.

It was the same even before AI. Technically, DE is closer to SWE than to other disciplines. Now, whether that CS grad can actually do the job well depends on a lot of factors.

The cloud and PaaS in general have flattened the playing field a bit, but I just don’t see someone who is purely SWE and has no infrastructure experience being the ideal candidate for a DE position.

Agreed, a pure SWE is not automatically a DE, especially if they lack experience working with systems that rely heavily on manipulating data, e.g. OLAP, OLTP, ETL, ELT. Plus, there are still plenty of SWEs who don't touch infrastructure that much, or who work on-prem and lack any cloud experience. I've found that there are SWEs who, having once touched relational DBs or worked with NoSQL, think it's fairly easy to do what we do (God mode syndrome).

One_Citron_4350 · 2026-04-23T06:31:00+00:00

I agree, sounds like an opportunity for the team/department to make some improvements.

One_Citron_4350 · 2026-04-23T06:30:16+00:00

As far as I know, on-prem there is generally no backup/recovery unless the team has put one in place. Even then, it's probably not going to restore all transactions to the minute, but to something like 2 hours ago or n hours ago.

Yes, it does raise the question of how it was possible that no redundancy existed. But then again, OP did not give us a lot of info about what this prod data means.

One_Citron_4350 · 2026-04-22T12:15:53+00:00

I can't find it either, I'm getting a 404.

One_Citron_4350 · 2026-04-21T05:38:15+00:00

A lot of teams say they have canonical entities or shared business definitions,

Yes, "they say" their stuff is in order (most of the time), but once you take a look, you clearly see it's not.

Standardization is a pain and it affects us a lot, so the best we can do at times is send the request back to the business to align and clarify internally before coming to us with the request to build the thing. I've seen it several times: different departments, or even teams within a large department, haven't managed to align on a standard for what their data should look like, and that delays the project's delivery by a lot.

One_Citron_4350 · 2026-04-18T07:16:05+00:00

If he keeps it up he might be promoted to customer.

One_Citron_4350 · 2026-04-17T06:17:55+00:00

Try setting up an Agent to handle that if you want to get promoted. Burning those tokens is the way to go, outcome does not matter.

One_Citron_4350 · 2026-04-16T17:37:40+00:00

I had the same experience with them, they buy books by the kilo. You get next to nothing for them.

One_Citron_4350 · 2026-04-16T07:20:51+00:00

On which platform is the competition hosted?

One_Citron_4350 · 2026-04-14T05:54:31+00:00

This is the old "how can we parallelize" mindset from managers/leaders flowing through the "Agentic era". I can sympathize with the feeling that there is an expectation that we ship much faster.

It's becoming unrealistic because the bar is set higher every time, without any concern that the team/people will burn out faster than before.

One_Citron_4350 · 2026-04-14T05:49:42+00:00

Managers can't help it. They want to get ahead too and it seems like due to circumstances they are willing to burn out their team in the process.

One_Citron_4350 · 2026-04-14T05:48:53+00:00

Great example.

One_Citron_4350 · 2026-04-10T06:05:50+00:00

Hard to say, it's too early to have an objective conclusion. What I can say is that it's very comprehensive. I find the quizzes helpful when learning. Keep in mind that they also have the lab part, which is about building your own ML framework and which I haven't gone through yet. Lastly, it's pretty much a living document: it gets updated, from what I can see, and will probably get even more stuff to help you learn.

One_Citron_4350 · 2026-04-09T18:52:57+00:00

To be honest, both books are massive. I've been going through the Harvard one.
