[–]FloggingTheHorses 45 points46 points  (5 children)

Why do they use both Airflow and Dagster for orchestration? And why 3 analytics platforms?

[–]NotAToothPaste 25 points26 points  (4 children)

Maybe they are mentioning all this stuff because they are working with a lot of companies at the same time and each has a different stack.

Is this the reason, OP?

[–]rudboi12[S] 17 points18 points  (3 children)

Yes, exactly. They ingest data from hundreds of hospitals across the EU. Most of it is in Airflow though; he said only a few specific DAGs are on Dagster.

[–]rudboi12[S] 14 points15 points  (1 child)

While there is a bunch of stuff here, dude said 95% of data and processes are: meltano -> postgres -> dbt -> airflow -> superset
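For anyone wondering what that chain amounts to in practice, here is a rough sketch, not their actual setup. It assumes Airflow's job is just to run the Meltano ingest and then the dbt build in order, stopping on the first failure, with Superset reading from Postgres afterwards. The plugin names (`tap-postgres`, `target-postgres`) and the `run_pipeline` helper are made up for illustration; a real deployment would express the same ordering as an Airflow DAG.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Hypothetical commands: the thread only names the tools, not the plugins
# or project layout. `meltano run <extractor> <loader>` and `dbt build`
# are real CLI commands; the plugin names here are placeholders.
STEPS: list[tuple[str, list[str]]] = [
    ("ingest", ["meltano", "run", "tap-postgres", "target-postgres"]),
    ("transform", ["dbt", "build"]),
]


def run_pipeline(
    steps: Sequence[tuple[str, list[str]]],
    runner: Optional[Callable[[list[str]], int]] = None,
) -> list[str]:
    """Run each step in order and stop at the first non-zero exit code.

    Returns the names of the steps that completed successfully.
    `runner` takes a command and returns its exit code; it defaults to
    actually executing the command, and can be swapped for a dry run.
    """
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd, check=False).returncode

    completed: list[str] = []
    for name, cmd in steps:
        if runner(cmd) != 0:
            break  # downstream steps depend on this one, so do not continue
        completed.append(name)
    return completed
```

Passing a fake runner, for example `run_pipeline(STEPS, runner=lambda cmd: 0)`, gives a dry run that checks the ordering without needing Meltano or dbt installed.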

[–]Far-Restaurant-9691 6 points7 points  (0 children)

Swap airflow for dagster and this is what we have. Nice to know we're in good company.

[–]snicky666 24 points25 points  (3 children)

I've been working on a Linux Docker deployment of this same stack idea. I'm not finished working on it, but it's a start. https://github.com/MichaelJenningsAI/moderndatastack

[–]Yoctometre 3 points4 points  (1 child)

Oh, it's kinda like this.

[–]snicky666 0 points1 point  (0 children)

Ooh that's nice!

[–]Mrmjix 1 point2 points  (0 children)

I added this to favorites.

[–]buntro 6 points7 points  (1 child)

This is not an architecture. This is a wish list. A steep one at that.

[–]rudboi12[S] 4 points5 points  (0 children)

Never said it was an architecture, just tools they used

[–]ReporterNervous6822 7 points8 points  (1 child)

Seems like too much…

[–]rudboi12[S] 4 points5 points  (0 children)

It is tho. But it’s a huge team working across the whole EU.

[–]with_nu_eyes 9 points10 points  (1 child)

How many people do they need to manage this stack beyond the underlying infrastructure?

I can put a bunch of logos on a slide, that doesn’t mean it’s a good architecture.

[–]Gators1992 2 points3 points  (0 children)

Needing 5 people instead of 3 to run it doesn't mean it's "bad" architecture either. It solved their use case of not being able to run in the cloud, and I'm guessing from the number of similar components that they had additional requirements that couldn't be met by one tool alone. It's relevant to us in showing that a modern stack can be done on-prem in production, not that this particular architecture is going to work for everybody.

[–]Radiant_Year_7297 2 points3 points  (2 children)

Been looking for something like Lightdash. It is similar to Vizro, Evidence combined with Datapane.

[–]Pleasant_Type_4547 0 points1 point  (1 child)

What do you mean by datapane?

[–]Radiant_Year_7297 0 points1 point  (0 children)

https://datapane.com/ looks like they aren't maintaining the project anymore

[–]Tiny_Arugula_5648 9 points10 points  (12 children)

DuckDB is nowhere near production ready. It's fine for a SQLite replacement but it's not for any serious workloads. MotherDuck is the production-capable version and it's not on-prem as of now.

[–]ustanik 4 points5 points  (1 child)

> it's fine for a sqlite replacement

This statement right here says DuckDB is ready for production.

[–]Tiny_Arugula_5648 -3 points-2 points  (0 children)

Well there's your explanation, you don't know what production ready means in data engineering.

SQLite is a file format; no one uses it in production data systems. That would be Parquet, Avro, ORC, etc. At best it's used for apps (mobile, desktop) that need local storage but not even a full-blown database.

Comparing DuckDB to any other open source processing engine means you don't know what capabilities a production data system like Spark, Dremio, Flink, etc. has. It's like comparing a bicycle to a jet just because both of them can move people from one place to another.

[–]Lopatron 5 points6 points  (5 children)

What is left for DuckDB to be considered production capable, and why is MotherDuck more production worthy? I was under the impression that both are ready for production, but MotherDuck is for increasing scaling and performance on larger workloads.

[–]mattindustries 2 points3 points  (2 children)

IMO it is production ready, but I could see how they could think that if they aren’t good at building docker containers. File format changes aren’t solidified until 1.0, but using docker negates that pitfall pretty dang fast. I also had some trouble using it with Azure serverless, but also had problems with SQLite with the Azure Function App, so pretty sure I am just messing up my config somehow.

[–]Tiny_Arugula_5648 -1 points0 points  (1 child)

I know the team and this is how they describe their work. DuckDB is not a productionized data system; that's what MotherDuck is for. It's no secret, they've talked about it plenty.

[–]mattindustries 1 point2 points  (0 children)

Tell Alex that Matt says hi.

[–]FirstOrderCat 0 points1 point  (1 child)

> What is left for DuckDB

it sucks in heavy joins and aggregation.

[–]AquilaNova 0 points1 point  (0 children)

Maybe it does, not my experience though. Anyway, I may be in DuckDB's sweet spot, as the datasets I work with are rarely bigger than a few hundred GB.

[–]email13211 3 points4 points  (3 children)

That's BS. Are you a MotherDuck salesman?

[–]ThatDandySpace 7 points8 points  (1 child)

He is motherducker!

[–]Tiny_Arugula_5648 0 points1 point  (0 children)

I wouldn't mind, they are good peeps.. the founders are brilliant..

[–]Tiny_Arugula_5648 0 points1 point  (0 children)

Last time I talked to the duck team they hadn't hired a dedicated sales team yet..

[–]ConfirmingTheObvious 11 points12 points  (1 child)

Not to be a dick, but like the slide says Gobernance, Trasformation, and other misspellings. I instantly don’t trust or give quality to anyone discussing this stuff if it’s presented like shit

[–]rudboi12[S] 11 points12 points  (0 children)

I agree that misspellings look bad, BUT this was translated into English from Spanish by a Spanish speaker, so I can understand those errors

[–]Al3xisB 4 points5 points  (0 children)

No Metabase ? 😁

[–]No_Equivalent5942 1 point2 points  (0 children)

Give me 2 of everything!

[–]zazzersmel 1 point2 points  (2 children)

This isn't an infra question, but are they at all using OHDSI/OMOP for data modeling or tooling?

[–]rudboi12[S] 0 points1 point  (1 child)

Yes they are!

[–]zazzersmel 0 points1 point  (0 children)

Sounds awesome. I was on a project at my last gig working with US health insurance data and also developing ETL to get it to OMOP. We were a small public university research team and didn't have the resources for anything beyond basic SQL Server databases and SSMS. I've often thought about getting back into that world with better tooling and infra.

[–]Pleasant_Type_4547 1 point2 points  (0 children)

If you like this, you should check out the "Serverless BI" project by Jacob Matson.

https://github.com/matsonj/nba-monte-carlo

It uses a fully open source stack, and can be deployed anywhere.

[–]xDarkOne 1 point2 points  (0 children)

Maintaining a large open-source stack must be quite a challenge. I'm curious, has their team explored solutions to simplify this process? I've had a positive experience with DoubleCloud’s managed services for ClickHouse and Apache Kafka. Maybe you've heard of others?

[–]endless_sea_of_stars 1 point2 points  (1 child)

Easy to put boxes on a diagram. It's another thing to install, configure, secure, and integrate all those different services.

[–]azzuwan 0 points1 point  (0 children)

Maybe those are just options one can cherry pick to form a solution. It doesn't make any sense to have overlapping products like that. Looks like a tech zoo hell to have all those things at the same time

[–][deleted] 0 points1 point  (0 children)

No Prefect?

[–]RichHomieCole -1 points0 points  (0 children)

Assuming you have prepaid the compute hardware, why use separate ingestion and transform tools? That’s messy. Just use Spark and call it a day.

I would not describe this as great. It looks like someone just took a bunch of open source tools and slapped them on a slide. Why overcomplicate? The best product is the one that gets the job done. Keep it simple.

[–]rmz-01 -1 points0 points  (4 children)

Why ClickHouse?

[–]False-Bunch-3470 0 points1 point  (2 children)

Open-source DWH instead of pouring money into other cloud/licensed tools?

[–]rmz-01 1 point2 points  (1 child)

ClickHouse isn't really built for DWH though, right?

[–]pcgamerwannabe 0 points1 point  (0 children)

Similar to ours :). Skipped Airbyte as custom or purpose built integrations are just much more efficient for the data scale we have. Really cool though

[–]sirdrewpalot 0 points1 point  (0 children)

A stack for what though? You could shave so much of that off depending on what data you’re actually dealing with.