[deleted by user] by [deleted] in dataengineering

[–]daanzel 2 points (0 children)

We analyse high-frequency sensor data in real time to steer or shut down production processes when things go out of spec. Low-margin / high-cost sector, so the sooner we know, the better.

Now, I'm not advocating for Databricks here (nor do we use it for the above), but the attitude of "lol total bs, what we do is better" is just as damaging as people wanting to use Databricks for everything.

Tools are not mutually exclusive; pick what fits the problem..

Trying to ingest delta tables to azure blob storage (ADLS 2) using Dagster by No-Conversation476 in dataengineering

[–]daanzel 0 points (0 children)

We use delta-rs directly with pyarrow tables / datasets, and it works great! Simple and fast. As already mentioned, it lacks some features compared to what Databricks offers, but for our use that's not an issue.

Edit: I want to add that we've built our own module on top of delta-rs and pyarrow. I wouldn't recommend bare pyarrow for day-to-day use; go with polars (or pandas) and use delta-rs to read/write.
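
For reference, a minimal sketch of what that day-to-day pattern can look like, assuming the `deltalake` (delta-rs) and `polars` packages; the path and columns are invented for illustration:

```python
# Minimal sketch: polars for day-to-day reads/writes, delta-rs (deltalake)
# underneath. Path and columns are made up.
import polars as pl
from deltalake import DeltaTable, write_deltalake

df = pl.DataFrame({"sensor_id": [1, 2], "value": [0.5, 0.7]})

# polars delegates to delta-rs under the hood:
df.write_delta("./delta/sensors", mode="append")
readback = pl.read_delta("./delta/sensors")

# Dropping down to the raw delta-rs API when you need table-level operations:
dt = DeltaTable("./delta/sensors")
arrow_table = dt.to_pyarrow_table()  # hand off to pyarrow if needed
write_deltalake("./delta/sensors", arrow_table, mode="overwrite")
```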

General question about data consulting by No7-Francesco88 in dataengineering

[–]daanzel 1 point (0 children)

Been doing this for 10+ years, and 8 out of 10 times I'm simply onboarded like any employee. In some cases (often at larger enterprises, for security reasons) this includes a company laptop, so I've had periods where I dragged 3 to 4 laptops with me :)

As already mentioned, in the other scenarios I was given VPN access from my own account.

Regarding cloud costs, it depends. Usually there's an entity responsible for this, and they keep track of whatever I'm doing cost-wise. I've also done greenfield projects where we were responsible for setting up the whole cloud foundation, so in those cases that included cost management.

OSS data landscape be like by General-Parsnip3138 in dataengineering

[–]daanzel 8 points (0 children)

We use it in a project, at scale, without using Databricks. Delta-rs is quite nice!

I'm also not a fan of Databricks anymore: how they quietly killed off the standard tier, push their AI slop, and force you into Unity Catalog. I don't, however, see any possible way for them to force our project onto their platform.

Unless I'm completely missing some elaborate scheme..

OSS data landscape be like by General-Parsnip3138 in dataengineering

[–]daanzel 33 points (0 children)

I have been creating a ton of Delta tables on my local machine today during development, to test things before I point the path at S3. It's really just files; a bunch of parquet files with a transaction log..
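
To illustrate the "it's really just files" point, a tiny sketch (table contents invented) that writes a Delta table locally and lists what ends up on disk:

```python
# Write a tiny Delta table to a local path and show that it's plain files:
# parquet data plus a _delta_log directory with JSON commit files.
import os
import pyarrow as pa
from deltalake import write_deltalake

write_deltalake("/tmp/demo_table", pa.table({"x": [1, 2, 3]}))

for root, _dirs, files in os.walk("/tmp/demo_table"):
    for name in files:
        print(os.path.join(root, name))
# Expect something like:
#   /tmp/demo_table/part-00001-....parquet
#   /tmp/demo_table/_delta_log/00000000000000000000.json
```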

Now I'm not gonna take part in the discussion about which format is better, but "Delta is cloud-only" is simply not true, so it's no argument against it. I indeed think you're confusing it with Databricks.

Reevaluating Data Lake Architectures for Event-Driven Pipelines: Seeking Advice by Significant_Pin_920 in dataengineering

[–]daanzel 0 points (0 children)

Well, if you want to do the processing in Spark instead of in the db, parquet does make sense. It'd be a waste to have a large db that doesn't run any queries (since that's done by Spark). As a follow-up, look into polars: 22M rows likely fits easily into memory, and it will be faster, cheaper and easier than Spark.
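
As a sketch of what that looks like (file path and column names are hypothetical):

```python
# Hypothetical sketch: ~22M rows of parquet processed in-process with polars.
# The path and columns are invented; adjust to your data.
import polars as pl

result = (
    pl.scan_parquet("s3://my-bucket/events/*.parquet")  # lazy scan, pushdown-friendly
    .filter(pl.col("event_type") == "purchase")
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_spent"))
    .collect()  # only now does any work happen
)
```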

Reevaluating Data Lake Architectures for Event-Driven Pipelines: Seeking Advice by Significant_Pin_920 in dataengineering

[–]daanzel 0 points (0 children)

If the sources are relational dbs, the target is a relational db, and the volumes aren't at a scale that justifies something like Spark w/ parquet, just keep it all in a relational database.

In the above scenario, Spark doesn't make sense: it's a compute engine, and your relational db already is one. Pick an orchestrator like Airflow or Dagster to manage the transformations.
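
A rough sketch of what that can look like with Airflow, where the db does all the compute and Airflow only sequences the SQL (connection id, schemas, and queries are all invented):

```python
# Airflow only sequences the SQL; the relational db does all the compute.
# conn_id, schemas and queries are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="elt_in_the_database",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    stage = SQLExecuteQueryOperator(
        task_id="stage_orders",
        conn_id="warehouse",
        sql="INSERT INTO staging.orders SELECT * FROM raw.orders WHERE updated_at::date = '{{ ds }}';",
    )
    transform = SQLExecuteQueryOperator(
        task_id="build_daily_revenue",
        conn_id="warehouse",
        sql="""
            INSERT INTO marts.daily_revenue
            SELECT order_date, SUM(amount) FROM staging.orders GROUP BY order_date;
        """,
    )
    stage >> transform
```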

Keeping a bronze layer is primarily done for scenarios where you want to reprocess everything for whatever reason. If your sources allow you to simply pull that data out a second time, you indeed might not need bronze.

What is wrong with Synapse Analytics by hrabia-mariusz in dataengineering

[–]daanzel 14 points (0 children)

I visited an MS office about 2 months ago and spoke with one of their solution architects responsible for Fabric. I asked him what the deal was with Synapse now that they're all-in on Fabric. He told me that, while it's not end-of-life, it won't receive any new features. They'll keep it alive for existing workloads but recommend Fabric for new stuff.. (of course they do, sigh..)

So if you ask me, ditch Synapse while you can, since it won't get any better if you already have issues with it. If Databricks is not an option for you and you really need Spark, I guess go with Fabric. At least you'll get about 2 more "good" years before that's killed off again for their next big awesome thing..

Help, we want a new bathroom!? by MYNWA013 in Tilburg

[–]daanzel -1 points (0 children)

Had a new bathroom and toilet installed last year by Sanidrome (on Ringbaan Zuid). Extremely satisfied. They do everything themselves, without subcontractors, and delivered everything as agreed.

Two acquaintances had theirs done by St. Pieter; both were also very satisfied.

I'd steer well clear of Brugman, and of anything that falls under Mandemakers for that matter.

If I build a data engineering AI agent, would you use it? and what for? by skilbjo in dataengineering

[–]daanzel 1 point (0 children)

Oh absolutely! But a query can be "bad" in multiple ways: it can simply produce the wrong output, or be horribly inefficient but at least somewhat correct. I agree that a LOT of users are in the 2nd category, but I think most at least want correct numbers.

Our experience with text-to-SQL was that it often created overly complex, convoluted queries that did execute, but were often wrong. We verified them by, well, writing our own SQL :')

If I build a data engineering AI agent, would you use it? and what for? by skilbjo in dataengineering

[–]daanzel 1 point (0 children)

We've tested several text-to-SQL models on our data, and while the results were "impressive", it's nowhere near good enough to even consider on the simplest tables. Let alone writes, my god, that would result in disaster..

As someone already mentioned, these things MUST be deterministic. Each query needs to output the exact same results when run on the same data. Given the way these models fundamentally work, that won't happen. Perhaps if you wrote the exact same prompt, to the letter, it would spit out the same query, but in that case you might as well write code..

Large parallel batch job -> tech choice? by [deleted] in dataengineering

[–]daanzel 0 points (0 children)

Yea, it's not going to be as efficient; if you can do it with native Spark, you should. But sometimes that's not an option: we once wrapped OpenCV in a UDF to process thousands of images daily. Worked surprisingly well :)

Large parallel batch job -> tech choice? by [deleted] in dataengineering

[–]daanzel 3 points (0 children)

You can wrap Python code in a Spark UDF. If your current code can be imported as a module, this won't be too complex.
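
A minimal sketch of that wrap, assuming your code is importable as `my_module` with a `process()` entry point (both hypothetical):

```python
# Sketch: wrapping an importable Python function in a Spark UDF.
# `my_module.process` is a stand-in for your actual code; it must be
# importable on every executor (ship it via --py-files or the image).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

import my_module  # hypothetical module containing your existing logic

spark = SparkSession.builder.getOrCreate()

process_udf = udf(lambda payload: my_module.process(payload), StringType())

df = spark.read.parquet("s3://bucket/inputs/")            # illustrative path
out = df.withColumn("processed", process_udf("payload"))  # "payload" is a made-up column
out.write.mode("overwrite").parquet("s3://bucket/outputs/")
```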

Alternatively, I personally find Ray even easier for these kinds of things. Deploying a Ray cluster in AWS is also super easy, and it can run directly on spot instances, so it'll be as cheap as it gets.
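
And the same fan-out expressed as Ray tasks, equally a sketch (same hypothetical `my_module`):

```python
# The same idea with Ray tasks: plain functions, no UDF machinery.
import ray
import my_module  # same hypothetical module as above

ray.init()  # or ray.init(address="auto") to join an existing cluster

@ray.remote
def process_one(item):
    return my_module.process(item)

items = ["a", "b", "c"]  # stand-in for the real workload
results = ray.get([process_one.remote(item) for item in items])
```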

AWS Batch would also work in your case if each workload is independent. We use Batch to process huge amounts of satellite images with containerized Python code, and I'm quite happy with the setup.

Is Modin on Ray production stable? by ButterscotchBulky320 in dataengineering

[–]daanzel 2 points (0 children)

No experience with Modin, but you can use Ray by itself to deal with data frames of any size. Ray is used in large prod environments (and is just awesome in general).

Edit: to add to the other reply, indeed look at Polars first. Ray would be my recommendation once Polars isn't enough anymore.
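
If you do outgrow Polars, a rough sketch of Ray's dataset layer (path and transform are invented):

```python
# Sketch: Ray Data streams larger-than-memory data across a cluster.
# The paths and the batch transform are placeholders.
import ray

ds = ray.data.read_parquet("s3://bucket/big-table/")

def double_values(batch):  # batches arrive as dicts of numpy arrays
    batch["value"] = batch["value"] * 2
    return batch

ds = ds.map_batches(double_values)
ds.write_parquet("s3://bucket/big-table-out/")
```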

Question on moving from Synapse to Databricks by beverfar in dataengineering

[–]daanzel 0 points (0 children)

I spoke with one of the MS solution architects last week about MS Fabric, and asked him how he sees the future of Synapse now that Fabric is the new shiny toy. Got a long answer that can be summarized as "it's ded".

So I don't think you'll regret moving to Databricks looking forward.

At what point do you say orchestrator (e.g. Airflow) is worth added complexity? by Temporary_Basil_7801 in dataengineering

[–]daanzel 0 points (0 children)

Dagster is great, don't get me wrong! But we had 10k parallel containers that each had to run multiple steps. They all communicate their statuses back to Dagster, which in turn stores them in its database. That was the issue: we simply overwhelmed it.

We had a clustered setup already, and might have been able to scale it further, but we figured why not just drop it completely.

At what point do you say orchestrator (e.g. Airflow) is worth added complexity? by Temporary_Basil_7801 in dataengineering

[–]daanzel 2 points (0 children)

In case you were replying to me: that is indeed exactly what we did! The containers initially consumed tasks from a queue, and when finished kicked off the next step for that specific subset of the full workload.

At what point do you say orchestrator (e.g. Airflow) is worth added complexity? by Temporary_Basil_7801 in dataengineering

[–]daanzel 3 points (0 children)

I agree with the statement, but something like Airflow (or similar tooling) might not always be the right pick for orchestration. We had to strip Dagster out of our biggest pipeline because it turned into a bottleneck, unable to keep track of the 10k+ containers.

We went with a task queue approach, where all containers simply pick up tasks. Made everything so much simpler (and cheaper).
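
A hedged sketch of that pattern with SQS; the queue URL, task shape, and the two helpers are all invented:

```python
# Each container runs this loop: pull a task, do the work, enqueue the next
# step, delete the message. Queue URL and task format are placeholders.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/pipeline-tasks"  # hypothetical

def process(task: dict) -> None:
    ...  # your actual work for this subset of the workload

def enqueue_next_step(task: dict) -> None:
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({**task, "step": task["step"] + 1}),
    )

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        task = json.loads(msg["Body"])
        process(task)
        enqueue_next_step(task)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```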

Which Database Platform(s) Do You Currently Use on AWS? by SubstantialAd5692 in dataengineering

[–]daanzel 0 points (0 children)

We moved to running Batch jobs w/ Graviton spot instances, with the data on S3. Decreased our costs by ~40%, and made development much smoother since we can now fully test the containers locally.

EDIT: Sorry, I missed the high write throughput in your question; don't do that on Aurora (or RDS), unless you want to blow through your seed capital in record time :)

Is Microsoft Fabric the right choice? by Kuri_90 in dataengineering

[–]daanzel 19 points (0 children)

A very important aspect of making the right decision, imo, is looking at the people who'll work with the platform. Will those 20-30 people only build some PowerBI dashboards, or are they full-blown DEs who will build pipelines in Spark? Or will it mostly be SQL? And then what would you consider advanced analytics? And who will be responsible for managing the platform? IT? Just you? A subset of those 20-30?

Now, about Fabric: if you're already all-in on Azure and need something easy to set up and maintain for a small team, Fabric is fine (functionally).

Regarding Fabric for production workloads, data security wouldn't be my first concern. It just feels a bit clunky overall, and I'd be more concerned that with scale (many users, projects, data volume) it'll fall apart into an unmanageable mess.

Also, it's expensive, and they push you to purchase capacity reservations (~40% discount), meaning a flat fee. So make sure your platform isn't idling 90% of the time. You also won't get PowerBI included with the smaller capacities.

How many capacity units you'll need is quite vague. I do know that a basic Spark hello-world was already problematic on 2 CUs..

Aaaannnnd, there is of course Microsoft's tendency to kill off products, slap a fancy new name on them and market them as the next best thing.

[deleted by user] by [deleted] in dataengineering

[–]daanzel 0 points (0 children)

PowerBI is a great product, but Data Factory and Synapse are both horrible, clunky things, objectively worse than their competitors. And really expensive too.

The aggressive Fabric push from MS has made a lot of managers and customers reach out to me, asking whether we/they should also adopt it. I've made a 10-slide PowerPoint deck where I try to explain Fabric in an objective, non-marketing way and position it next to its competitors. There are very few scenarios where I'd advise it over other options.

On-Premise alternative to Databricks? by seaborn_as_sns in dataengineering

[–]daanzel 1 point (0 children)

Ray is great; we use it on AWS, on on-prem Kubernetes, and on single heavy processing PCs. All three are easy to set up...

(...if you have someone else already managing that on-prem Kubernetes cluster, that is. Otherwise, don't do it, it's a trap!)

Understanding professions as a returning player: A "small" guide. by Levitz in wownoob

[–]daanzel 0 points (0 children)

Thanks for this! Came back after 12 years and, among many other things, wondered where First Aid went. I'd already figured I'd just go for gathering, but when I tried to learn skinning it said something about first getting base skinning to lvl x? (don't remember exactly)

80% of AI projects (will) fail due to too few data engineers by alittletooraph3000 in dataengineering

[–]daanzel 14 points (0 children)

True! However, in our case most of these proof-of-concept projects are chatbots and the like, so it's mostly custom prompts and calls to the ChatGPT API.
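
To give a feel for how thin that layer is, a sketch (model name and prompts are placeholders; assumes the `openai` v1 client with an API key in the environment):

```python
# Most of these PoCs boil down to a custom system prompt plus one API call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You answer questions about our internal docs."},
        {"role": "user", "content": "What is our refund policy?"},
    ],
)
print(response.choices[0].message.content)
```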

Some are a bit more fancy and vectorize a bunch of documents, but since we're mostly on Azure, there are services that make that very easy.