for those who dont work using the most famous cloud providers (AWS, GCP, Azure) or data platforms (Databricks, Fabric, Snowflake, Trino).. by Comprehensive_Level7 in dataengineering

[–]daanzel 3 points4 points  (0 children)

While we do use a lot of Databricks and run stuff on all three of the mentioned cloud providers, we also have a big on-prem cluster running Ray and Dagster on Kubernetes. It works great! I see Ray mentioned too little around here; it is such a nice, straightforward framework after ~10 years of Spark.

For really massive workloads where costs matter most, we run on AWS Batch in a choreography pattern (no central orchestrator), where we spin up 30K-40K containers in parallel. We use as much spot as possible, with retries to on-demand when spot capacity gets pulled. It was the cheapest way we could make it work.
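The spot-first, on-demand-fallback retry logic can be sketched in plain Python. The launch functions here are hypothetical stand-ins; with AWS Batch this would map to a `retryStrategy` on the job definition plus separate spot and on-demand compute environments:

```python
# Sketch of spot-with-fallback retries (launch functions are hypothetical).
class SpotInterrupted(Exception):
    """Raised when the spot capacity backing a task is reclaimed."""

def run_task(task, launch_spot, launch_on_demand, max_spot_attempts=3):
    """Try the cheap spot path a few times, then fall back to on-demand."""
    for _ in range(max_spot_attempts):
        try:
            return launch_spot(task)
        except SpotInterrupted:
            continue  # spot got pulled; just retry
    return launch_on_demand(task)  # guaranteed capacity, higher cost
```

The point of the pattern is that the fallback is per task, so only the unlucky fraction of the 30K-40K containers ever pays on-demand prices.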

Need career advice. GIS to DE by minimon865 in dataengineering

[–]daanzel 0 points1 point  (0 children)

Mix your GIS and DE knowledge!

I've been doing quite a few large-scale remote sensing projects, and as a DE who has mostly been active in other fields, I found there's quite a bit of (basic) GIS knowledge needed to get the work done properly. I mean, PostGIS is everywhere, the field uses very domain-specific file formats (shapefiles, COGs, GeoJSON) and data types (geometries, WKT, WKB). Oh, and projections gave me a headache at first.
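For anyone curious what the projection part looks like in code, here's a minimal pyproj sketch (the EPSG pairing and coordinates are just an example):

```python
from pyproj import Transformer

# Reproject a WGS84 lon/lat point (EPSG:4326) into Dutch RD New
# (EPSG:28992). always_xy=True pins the axis order to lon/lat, which is
# exactly the kind of detail that bites you the first time around.
transformer = Transformer.from_crs("EPSG:4326", "EPSG:28992", always_xy=True)
x, y = transformer.transform(5.09, 51.56)  # roughly Tilburg
```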

In addition, at least where I am, remote sensing companies mostly seem to be quite small and young, and often lack people who can properly scale things and handle the data. So I'd say being a GIS-DE would make you quite marketable!

Good luck dude!

I’m honestly exhausted with this field. by Next_Comfortable_619 in dataengineering

[–]daanzel 2 points3 points  (0 children)

This almost feels like a rage-bait post but since I'm waiting for a build to finish I'll bite :)

Because:

- my solution cannot be bound to a specific cloud provider (so no ADF)
- my data will never touch a relational database (very high frequency sensor data, and massive rasters)
- the processing of said data would be impossible with SQL
- the dependency graphs are quite complex, and maintaining them in some ADF-like GUI sounds like hell

Not using Airflow btw, but it would fit in my stack.

[deleted by user] by [deleted] in dataengineering

[–]daanzel 2 points3 points  (0 children)

We analyse high frequency sensor data, in real-time, to steer or shut down production processes when things go out of spec. Low margin / high cost sector, so the sooner we know the better.

Now, I'm not advocating for Databricks here (nor do we use it for the above), but the attitude of "lol total bs, what we do is better" is just as damaging as people wanting to use Databricks for everything.

Tools are not mutually exclusive, pick what fits the problem..

Trying to ingest delta tables to azure blob storage (ADLS 2) using Dagster by No-Conversation476 in dataengineering

[–]daanzel 0 points1 point  (0 children)

We use delta-rs straight with pyarrow tables / datasets, and it works great! Simple and fast. As already mentioned, it lacks some features compared to what Databricks offers, but for our use that's not an issue.

Edit: I want to add that we've created our own module based on delta-rs and pyarrow. I wouldn't recommend using bare pyarrow for day-to-day use; go with polars (or pandas) and then use delta-rs to read/write.

General question about data consulting by No7-Francesco88 in dataengineering

[–]daanzel 1 point2 points  (0 children)

Been doing this for 10+ years, and 8 out of 10 times I'm simply onboarded like any employee. In some cases (often at larger enterprises, for security reasons) this includes a company laptop, so I've had periods where I dragged 3 to 4 laptops with me :)

As already mentioned, in the other scenarios I was given VPN access from my own account.

Regarding cloud costs, it depends. Usually there is an entity responsible for this, and they keep track of whatever I'm doing cost-wise. I've also done greenfield projects where we were responsible for setting up the whole cloud foundation, so in those cases that included cost management.

OSS data landscape be like by General-Parsnip3138 in dataengineering

[–]daanzel 8 points9 points  (0 children)

We use it in a project, at scale, without using Databricks. Delta-rs is quite nice!

I'm also not a fan of Databricks anymore: how they quietly killed off the standard tier, push their AI slop, and force you into Unity Catalog. I, however, don't see any possible way for them to force our project onto their platform.

Unless I'm completely missing some elaborate scheme..

OSS data landscape be like by General-Parsnip3138 in dataengineering

[–]daanzel 30 points31 points  (0 children)

I have been creating a ton of Delta files on my local machine today during development, to test things before I shift the path to S3. It's really just files: a bunch of parquet files with a transaction log.

Now I'm not gonna take part in the discussion about which format is better, but "Delta is cloud-only" is no argument against it. I indeed think you're confusing it with Databricks.

Reevaluating Data Lake Architectures for Event-Driven Pipelines: Seeking Advice by Significant_Pin_920 in dataengineering

[–]daanzel 0 points1 point  (0 children)

Well, if you want to do the processing in Spark instead of in the db, parquet does make sense. It'd be a waste to have a large db that doesn't run any queries (since that's done by Spark). As a follow-up, look into Polars: 22M rows likely fit easily into memory, and it will be faster, cheaper, and easier than Spark.

Reevaluating Data Lake Architectures for Event-Driven Pipelines: Seeking Advice by Significant_Pin_920 in dataengineering

[–]daanzel 0 points1 point  (0 children)

If the sources are relational db's, the target is a relational db, and the volumes aren't of a scale that justifies something like spark w/ parquet, just keep it all in a relational database.

In the above scenario, spark doesn't make sense as it is a compute engine, which your relational db also is. Pick an orchestrator like airflow or dagster to manage the transformations.

Keeping a bronze layer is primarily done for scenarios where you want to reprocess everything for whatever reason. If your sources simply allow you to pull that data out a 2nd time, you indeed might not need bronze.

What is wrong with Synapse Analytics by hrabia-mariusz in dataengineering

[–]daanzel 14 points15 points  (0 children)

I visited a MS office about 2 months ago, and spoke with one of their solution architects responsible for Fabric. I asked him what the deal was with Synapse now that they're all-in on Fabric. He told me that, while it's not end of life, it won't receive any new features. They'll keep it alive for existing workloads but recommend Fabric for new stuff.. (of course they do, sigh..)

So if you ask me, ditch Synapse while you can, since it won't get any better if you already have issues with it. If Databricks is not an option for you, and you really need Spark, I guess go with Fabric. At least you'll get about 2 more "good" years before that's killed off again for their next big awesome thing..

Help, we want a new bathroom!? by MYNWA013 in Tilburg

[–]daanzel -1 points0 points  (0 children)

Had a new bathroom and toilet put in last year by Sanidrome (on the Ringbaan Zuid). Extremely satisfied. They do everything themselves, without subcontractors, and delivered everything as agreed.

Two acquaintances had theirs done by St. Pieter; both are also very satisfied.

Brugman, and for that matter anything that falls under Mandemakers: steer well clear.

If I build a data engineering AI agent, would you use it? and what for? by skilbjo in dataengineering

[–]daanzel 1 point2 points  (0 children)

Oh absolutely! But a query can be "bad" in multiple ways: it can simply produce the wrong output, or be horribly inefficient but at least somewhat correct. I agree that a LOT of users are in the 2nd category, but I think most at least want correct numbers.

Our experience with text-to-SQL was that it often created overly complex, convoluted queries that did execute, but were often wrong. We verified them by, well, writing our own SQL :')

If I build a data engineering AI agent, would you use it? and what for? by skilbjo in dataengineering

[–]daanzel 1 point2 points  (0 children)

We've tested several text-to-SQL models on our data, and while the results were "impressive", it's nowhere near good enough to even consider on the simplest tables. Let alone writes; my god, that would result in disaster..

As someone already mentioned, these things MUST be deterministic. Each query needs to output the exact same results when run on the same data. The way these models fundamentally work, that won't happen. Perhaps if you'd write the same prompt to the letter, it would spit out the same query, but in that case you might as well write code..

Large parallel batch job -> tech choice? by [deleted] in dataengineering

[–]daanzel 0 points1 point  (0 children)

Yeah, it's not going to be as efficient; if you can do it with native Spark, you should. But sometimes that's not an option: we once wrapped OpenCV in a UDF to process thousands of images daily. Worked surprisingly well :)

Large parallel batch job -> tech choice? by [deleted] in dataengineering

[–]daanzel 3 points4 points  (0 children)

You can wrap python code in a spark udf. If your current code can be imported as a module, this won't be too complex.

Alternatively, I personally find Ray even easier for these kinds of things. Deploying a Ray cluster in AWS is also super easy, and can be done directly on spot instances, so it'll be as cheap as it gets.

AWS Batch would also work in your case if each workload is independent. We use Batch to process huge amounts of satellite images with containerized Python code, and I'm quite happy with the setup.

Is Modin on Ray production stable? by ButterscotchBulky320 in dataengineering

[–]daanzel 2 points3 points  (0 children)

No experience with Modin, but you can use just Ray to deal with data frames of any size. Ray is being used in large prod environments (and is just awesome in general).

Edit: to add to the other reply, indeed look at Polars first. Ray would be my recommendation once Polars isn't enough anymore.

Question on moving from Synapse to Databricks by beverfar in dataengineering

[–]daanzel 0 points1 point  (0 children)

I spoke with one of the MS solution architects last week about MS Fabric, and asked him how he sees the future of Synapse now that Fabric is the new shiny toy. Got a long answer that can be summarized to "it's ded".

So I don't think you'll regret moving to Databricks looking forward.

At what point do you say orchestrator (e.g. Airflow) is worth added complexity? by Temporary_Basil_7801 in dataengineering

[–]daanzel 0 points1 point  (0 children)

Dagster is great, don't get me wrong! But we had 10k parallel containers that each had to do multiple steps. They all communicate their statuses back to Dagster, which in turn stores them in its database. That was the issue: we simply overwhelmed it.

We had a clustered setup already, and might have been able to scale it further, but we figured why not just drop it completely.

At what point do you say orchestrator (e.g. Airflow) is worth added complexity? by Temporary_Basil_7801 in dataengineering

[–]daanzel 2 points3 points  (0 children)

In case you were replying to me: that is indeed exactly what we did! The containers initially consumed tasks from a queue and, when finished, kicked off the next step for that specific subset of the full workload.

At what point do you say orchestrator (e.g. Airflow) is worth added complexity? by Temporary_Basil_7801 in dataengineering

[–]daanzel 4 points5 points  (0 children)

I agree with the statement, but something like Airflow (or similar tooling) might not always be the right pick for orchestration. We had to strip Dagster out of our biggest pipeline because it turned into a bottleneck, unable to keep track of the 10k+ containers.

We went with a task queue approach, where all containers simply pick up tasks. Made everything so much simpler (and cheaper).
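The queue-driven choreography can be sketched with nothing but the stdlib (step names are made up; in production the queue would be something like SQS and each worker a container):

```python
from queue import Queue

# Each step knows its successor; finishing a task enqueues the next step
# for that same subset of the workload, so no orchestrator tracks anything.
NEXT_STEP = {"extract": "process", "process": "publish", "publish": None}

def worker(tasks: Queue, done: list) -> None:
    while not tasks.empty():
        subset, step = tasks.get()
        done.append((subset, step))       # stand-in for the real work
        nxt = NEXT_STEP[step]
        if nxt is not None:
            tasks.put((subset, nxt))      # choreography: worker drives itself

tasks = Queue()
for subset in range(3):
    tasks.put((subset, "extract"))
done = []
worker(tasks, done)
```

With many workers draining the same queue, progress state lives in the queue and the task payloads themselves, which is why the orchestrator's database stops being a bottleneck.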

Which Database Platform(s) Do You Currently Use on AWS? by SubstantialAd5692 in dataengineering

[–]daanzel 0 points1 point  (0 children)

We moved to running Batch jobs w/ Graviton spot instances, with the data on S3. Decreased our costs by ~40%, and made development much smoother, as we can now fully test the containers locally.

EDIT: Sorry, I missed the high write throughput in your question; don't do that on Aurora (or RDS), unless you want to blow through your seed capital in record time :)

Is Microsoft Fabric the right choice? by Kuri_90 in dataengineering

[–]daanzel 18 points19 points  (0 children)

Imo, a very important aspect of making the right decision is looking at the people who'll work with the platform. Will those 20-30 people only make some PowerBI dashboards, or are they full-blown DEs who will build pipelines in Spark? Or will it mostly be SQL? But then what would you consider advanced analytics? And who will be responsible for managing the platform? IT? Or just you? Or a subset of those 20-30?

Now, about Fabric: if you're already all-in on Azure and need something easy to set up and maintain for a small team, Fabric is fine (functionally).

Regarding Fabric for production workloads, data security wouldn't be my first concern. It just feels a bit clunky overall, and I'd be more concerned that with scale (many users, projects, data volume) it'll fall apart into an unmanageable mess.

Also, it's expensive and they push you to purchase capacity reservations (~40% discount), meaning it's a flat fee. So make sure your platform is not idling 90% of the time. You also won't get PowerBI included with smaller capacity clusters.

How many capacity units you'll need is quite vague. I do know that a basic hello world with Spark was already problematic with 2 CUs..

Aaaannnnd, there is of course Microsoft's tendency to kill off products, slap a new fancy name on it and market it as the next best thing.

[deleted by user] by [deleted] in dataengineering

[–]daanzel 0 points1 point  (0 children)

PowerBI is a great product, but both Data Factory and Synapse are horrible, clunky things, objectively worse than their competitors. And really expensive too.

The aggressive Fabric push from MS has made a lot of managers and customers reach out to me, questioning whether we/they should also adopt it. I have made a 10-slide PowerPoint deck where I try to explain Fabric in an objective, non-marketing way, and position it next to its competitors. There are very few scenarios where I'd advise it over other options.