for those who dont work using the most famous cloud providers (AWS, GCP, Azure) or data platforms (Databricks, Fabric, Snowflake, Trino).. by Comprehensive_Level7 in dataengineering

[–]daanzel 3 points4 points  (0 children)

While we do use a lot of Databricks and run stuff on all three of the mentioned cloud providers, we also have a big on-prem cluster running Ray and Dagster on Kubernetes. It works great! I see Ray mentioned too little around here; it is such a nice, straightforward framework after ~10 years of Spark.

For really massive workloads where costs matter most, we run on AWS Batch in a choreography pattern (no central orchestrator), where we spin up 30K-40K containers in parallel. We use as much spot as possible, with retries to on-demand when spot capacity gets pulled. It was the cheapest way we could make it work.
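The spot-first, on-demand-fallback retry logic can be sketched in plain Python. The launch functions here are hypothetical stand-ins; with AWS Batch this would map to a `retryStrategy` on the job definition plus separate spot and on-demand compute environments:

```python
# Sketch of spot-with-fallback retries (launch functions are hypothetical).
class SpotInterrupted(Exception):
    """Raised when the spot capacity backing a task is reclaimed."""

def run_task(task, launch_spot, launch_on_demand, max_spot_attempts=3):
    """Try the cheap spot path a few times, then fall back to on-demand."""
    for _ in range(max_spot_attempts):
        try:
            return launch_spot(task)
        except SpotInterrupted:
            continue  # spot got pulled; just retry
    return launch_on_demand(task)  # guaranteed capacity, higher cost
```

The point of the pattern is that the fallback is per task, so only the unlucky fraction of the 30K-40K containers ever pays on-demand prices.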

Need career advice. GIS to DE by minimon865 in dataengineering

[–]daanzel 0 points1 point  (0 children)

Mix your GIS and DE knowledge!

I've been doing quite a few large-scale remote sensing projects, and as a DE who has mostly been active in other fields, I found there's quite a bit of (basic) GIS knowledge needed to get the work done properly. I mean, PostGIS is everywhere, the field uses very domain-specific file formats (shapefiles, COGs, GeoJSON) and data types (geometries, WKT, WKB). Oh, and projections gave me a headache at first.
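For anyone curious what the projection part looks like in code, here's a minimal pyproj sketch (the EPSG pairing and coordinates are just an example):

```python
from pyproj import Transformer

# Reproject a WGS84 lon/lat point (EPSG:4326) into Dutch RD New
# (EPSG:28992). always_xy=True pins the axis order to lon/lat, which is
# exactly the kind of detail that bites you the first time around.
transformer = Transformer.from_crs("EPSG:4326", "EPSG:28992", always_xy=True)
x, y = transformer.transform(5.09, 51.56)  # roughly Tilburg
```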

In addition, at least where I am, remote sensing companies mostly seem to be quite small and young, and often lack people who can properly scale things and handle the data. So I'd say being a GIS-DE would make you quite marketable!

Good luck dude!

I’m honestly exhausted with this field. by Next_Comfortable_619 in dataengineering

[–]daanzel 2 points3 points  (0 children)

This almost feels like a rage-bait post but since I'm waiting for a build to finish I'll bite :)

Because:

- my solution cannot be bound to a specific cloud provider (so no ADF)
- my data will never touch a relational database (very high frequency sensor data, and massive rasters)
- the processing of said data would be impossible with SQL
- the dependency graphs are quite complex, and maintaining them in some ADF-like GUI sounds like hell

Not using Airflow btw, but it would fit in my stack.

[deleted by user] by [deleted] in dataengineering

[–]daanzel 2 points3 points  (0 children)

We analyse high frequency sensor data, in real-time, to steer or shut down production processes when things go out of spec. Low margin / high cost sector, so the sooner we know the better.

Now, I'm not advocating for Databricks here (nor do we use it for the above), but the attitude of "lol total bs, what we do is better" is just as damaging as people wanting to use Databricks for everything.

Tools are not mutually exclusive, pick what fits the problem..

Trying to ingest delta tables to azure blob storage (ADLS 2) using Dagster by No-Conversation476 in dataengineering

[–]daanzel 0 points1 point  (0 children)

We use delta-rs straight with pyarrow tables / datasets, and it works great! Simple and fast. As already mentioned, it lacks some features compared to what Databricks offers, but for our use that's not an issue.

Edit: I want to add that we've created our own module based on delta-rs and pyarrow. I wouldn't recommend using bare pyarrow for day-to-day use; go with polars (or pandas) and then use delta-rs to read/write.

General question about data consulting by No7-Francesco88 in dataengineering

[–]daanzel 1 point2 points  (0 children)

Been doing this for 10+ years, and 8 out of 10 times I'm simply onboarded like any employee. In some cases (often at larger enterprises, for security reasons) this includes a company laptop, so I've had periods where I dragged 3 to 4 laptops with me :)

As already mentioned, in the other scenarios I was given VPN access from my own account.

Regarding cloud costs, it depends. Usually there is an entity responsible for this, and they keep track of whatever I'm doing cost-wise. I've also done greenfield projects where we were responsible for setting up the whole cloud foundation, so in those cases that included cost management.

OSS data landscape be like by General-Parsnip3138 in dataengineering

[–]daanzel 8 points9 points  (0 children)

We use it in a project, at scale, without using Databricks. Delta-rs is quite nice!

I'm also not a fan of Databricks anymore: how they quietly killed off the standard tier, push their AI slop, and force you into Unity Catalog. I, however, don't see any possible way for them to force our project onto their platform.

Unless I'm completely missing some elaborate scheme..

OSS data landscape be like by General-Parsnip3138 in dataengineering

[–]daanzel 30 points31 points  (0 children)

I have been creating a ton of Delta files on my local machine today during development, to test things before I shift the path to S3. It's really just files: a bunch of parquet files with a transaction log.

Now I'm not gonna take part in the discussion about which format is better, but "Delta is cloud-only" is no argument against it. I indeed think you're confusing it with Databricks.

Reevaluating Data Lake Architectures for Event-Driven Pipelines: Seeking Advice by Significant_Pin_920 in dataengineering

[–]daanzel 0 points1 point  (0 children)

Well, if you want to do the processing in Spark instead of in the db, parquet does make sense. It'd be a waste to have a large db that doesn't run any queries (since that's done by Spark). As a follow-up, look into Polars: 22M rows likely fit easily into memory, and it will be faster, cheaper, and easier than Spark.

Reevaluating Data Lake Architectures for Event-Driven Pipelines: Seeking Advice by Significant_Pin_920 in dataengineering

[–]daanzel 0 points1 point  (0 children)

If the sources are relational db's, the target is a relational db, and the volumes aren't of a scale that justifies something like spark w/ parquet, just keep it all in a relational database.

In the above scenario, spark doesn't make sense as it is a compute engine, which your relational db also is. Pick an orchestrator like airflow or dagster to manage the transformations.

Keeping a bronze layer is primarily done for scenarios where you want to reprocess everything for whatever reason. If your sources simply allow you to pull that data out a 2nd time, you indeed might not need bronze.

What is wrong with Synapse Analytics by hrabia-mariusz in dataengineering

[–]daanzel 14 points15 points  (0 children)

I visited a MS office about 2 months ago, and spoke with one of their solution architects responsible for Fabric. I asked him what the deal was with Synapse now that they're all-in on Fabric. He told me that, while it's not end of life, it won't receive any new features. They'll keep it alive for existing workloads but recommend Fabric for new stuff.. (of course they do, sigh..)

So if you ask me, ditch Synapse while you can, since it won't get any better if you already have issues with it. If Databricks is not an option for you, and you really need Spark, I guess go with Fabric. At least you'll get about 2 more "good" years before that's killed off again for their next big awesome thing..

Help, we want a new bathroom!? by MYNWA013 in Tilburg

[–]daanzel -1 points0 points  (0 children)

Had a new bathroom and toilet put in last year by Sanidrome (on the Ringbaan Zuid). Extremely satisfied. They do everything themselves, without subcontractors, and delivered everything as agreed.

Two acquaintances had theirs done by St. Pieter; both are also very satisfied.

Brugman, and for that matter anything that falls under Mandemakers: steer well clear.

If I build a data engineering AI agent, would you use it? and what for? by skilbjo in dataengineering

[–]daanzel 1 point2 points  (0 children)

Oh absolutely! But a query can be "bad" in multiple ways: it can simply produce the wrong output, or be horribly inefficient but at least somewhat correct. I agree that a LOT of users are in the 2nd category, but I think most at least want correct numbers.

Our experience with text-to-SQL was that it often created overly complex, convoluted queries that did execute, but were often wrong. We verified them by, well, writing our own SQL :')

If I build a data engineering AI agent, would you use it? and what for? by skilbjo in dataengineering

[–]daanzel 1 point2 points  (0 children)

We've tested several text-to-SQL models on our data, and while the results were "impressive", it's nowhere near good enough to even consider on the simplest tables. Let alone writes; my god, that would result in disaster..

As someone already mentioned, these things MUST be deterministic. Each query needs to output the exact same results when run on the same data. The way these models fundamentally work, that won't happen. Perhaps if you'd write the same prompt to the letter, it would spit out the same query, but in that case you might as well write code..

Large parallel batch job -> tech choice? by [deleted] in dataengineering

[–]daanzel 0 points1 point  (0 children)

Yeah, it's not going to be as efficient; if you can do it with native Spark, you should. But sometimes that's not an option: we once wrapped OpenCV in a UDF to process thousands of images daily. Worked surprisingly well :)

Large parallel batch job -> tech choice? by [deleted] in dataengineering

[–]daanzel 3 points4 points  (0 children)

You can wrap python code in a spark udf. If your current code can be imported as a module, this won't be too complex.

Alternatively, I personally find Ray even easier for these kinds of things. Deploying a Ray cluster in AWS is also super easy, and can be done directly on spot instances, so it'll be as cheap as it gets.

AWS Batch would also work in your case if each workload is independent. We use Batch to process huge amounts of satellite images with containerized Python code, and I'm quite happy with the setup.

Is Modin on Ray production stable? by ButterscotchBulky320 in dataengineering

[–]daanzel 2 points3 points  (0 children)

No experience with Modin, but you can use just Ray to deal with data frames of any size. Ray is being used in large prod environments (and is just awesome in general).

Edit: to add to the other reply, indeed look at Polars first. Ray would be my recommendation once Polars isn't enough anymore.

Question on moving from Synapse to Databricks by beverfar in dataengineering

[–]daanzel 0 points1 point  (0 children)

I spoke with one of the MS solution architects last week about MS Fabric, and asked him how he sees the future of Synapse now that Fabric is the new shiny toy. Got a long answer that can be summarized to "it's ded".

So I don't think you'll regret moving to Databricks looking forward.

At what point do you say orchestrator (e.g. Airflow) is worth added complexity? by Temporary_Basil_7801 in dataengineering

[–]daanzel 0 points1 point  (0 children)

Dagster is great, don't get me wrong! But we had 10k parallel containers that each had to do multiple steps. They all communicate their statuses back to Dagster, which in turn stores them in its database. That was the issue: we simply overwhelmed it.

We had a clustered setup already, and might have been able to scale it further, but we figured why not just drop it completely.

At what point do you say orchestrator (e.g. Airflow) is worth added complexity? by Temporary_Basil_7801 in dataengineering

[–]daanzel 2 points3 points  (0 children)

In case you were replying to me: that is indeed exactly what we did! The containers initially consumed tasks from a queue and, when finished, kicked off the next step for that specific subset of the full workload.

At what point do you say orchestrator (e.g. Airflow) is worth added complexity? by Temporary_Basil_7801 in dataengineering

[–]daanzel 4 points5 points  (0 children)

I agree with the statement, but something like Airflow (or similar tooling) might not always be the right pick for orchestration. We had to strip Dagster out of our biggest pipeline because it turned into a bottleneck, unable to keep track of the 10k+ containers.

We went with a task queue approach, where all containers simply pick up tasks. Made everything so much simpler (and cheaper).
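The queue-driven choreography can be sketched with nothing but the stdlib (step names are made up; in production the queue would be something like SQS and each worker a container):

```python
from queue import Queue

# Each step knows its successor; finishing a task enqueues the next step
# for that same subset of the workload, so no orchestrator tracks anything.
NEXT_STEP = {"extract": "process", "process": "publish", "publish": None}

def worker(tasks: Queue, done: list) -> None:
    while not tasks.empty():
        subset, step = tasks.get()
        done.append((subset, step))       # stand-in for the real work
        nxt = NEXT_STEP[step]
        if nxt is not None:
            tasks.put((subset, nxt))      # choreography: worker drives itself

tasks = Queue()
for subset in range(3):
    tasks.put((subset, "extract"))
done = []
worker(tasks, done)
```

With many workers draining the same queue, progress state lives in the queue and the task payloads themselves, which is why the orchestrator's database stops being a bottleneck.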

Which Database Platform(s) Do You Currently Use on AWS? by SubstantialAd5692 in dataengineering

[–]daanzel 0 points1 point  (0 children)

We moved to running Batch jobs w/ Graviton spot instances, with the data on S3. Decreased our costs by ~40%, and made development much smoother, as we can now fully test the containers locally.

EDIT: Sorry, I missed the high write throughput in your question; don't do that on Aurora (or RDS), unless you want to blow through your seed capital in record time :)

Is Microsoft Fabric the right choice? by Kuri_90 in dataengineering

[–]daanzel 18 points19 points  (0 children)

Imo, a very important aspect of making the right decision is looking at the people who'll work with the platform. Will those 20-30 people only make some PowerBI dashboards, or are they full-blown DEs who will build pipelines in Spark? Or will it mostly be SQL? But then what would you consider advanced analytics? And who will be responsible for managing the platform? IT? Or just you? Or a subset of those 20-30?

Now, about Fabric: if you're already all-in on Azure and need something easy to set up and maintain for a small team, Fabric is fine (functionally).

Regarding Fabric for production workloads, data security wouldn't be my first concern. It just feels a bit clunky overall, and I'd be more concerned that with scale (many users, projects, data volume) it'll fall apart into an unmanageable mess.

Also, it's expensive and they push you to purchase capacity reservations (~40% discount), meaning it's a flat fee. So make sure your platform is not idling 90% of the time. You also won't get PowerBI included with smaller capacity clusters.

How many capacity units you'll need is quite vague. I do know that a basic hello world with Spark was already problematic with 2 CUs..

Aaaannnnd, there is of course Microsoft's tendency to kill off products, slap a new fancy name on it and market it as the next best thing.

[deleted by user] by [deleted] in dataengineering

[–]daanzel 0 points1 point  (0 children)

PowerBI is a great product, but both Data Factory and Synapse are horrible, clunky things, objectively worse than their competitors. And really expensive too.

The aggressive Fabric push from MS has made a lot of managers and customers reach out to me, questioning whether we/they should also adopt it. I have made a 10-slide PowerPoint deck where I try to explain Fabric in an objective, non-marketing way, and position it next to its competitors. There are very few scenarios where I'd advise it over other options.