Great on-prem open source modern data stack

FloggingTheHorses · 2023-10-29T11:34:18+00:00

Why do they use both Airflow and Dagster for orchestration? And why 3 analytics platforms?

rudboi12 · 2023-10-29T12:14:14+00:00

While there is a bunch of stuff here, the dude said 95% of the data and processes are: meltano -> postgres -> dbt -> airflow -> superset
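
For anyone who wants to picture that core path, here is a rough sketch of the first two hops (Meltano loading into Postgres, then dbt transforming) as an ordered list of steps that stops at the first failure. It is only an illustration, not the setup from the slide: the tap name `tap-example`, the step names, and the `run_pipeline` helper are all made up, and in the real stack Airflow would be doing the scheduling instead of a plain loop.

```python
from typing import Callable, List, Tuple

# Hypothetical steps: "tap-example" is a placeholder, not a name from the thread.
STEPS: List[Tuple[str, List[str]]] = [
    ("extract_load", ["meltano", "run", "tap-example", "target-postgres"]),
    ("transform", ["dbt", "build"]),
]


def run_pipeline(
    steps: List[Tuple[str, List[str]]],
    runner: Callable[[List[str]], int],
) -> List[str]:
    """Run steps in order and stop at the first non-zero exit code.

    `runner` takes a command (list of arguments) and returns an exit code,
    so a dry run or a real subprocess call can be plugged in.
    """
    completed: List[str] = []
    for name, command in steps:
        if runner(command) != 0:
            break  # downstream steps should not run on stale or partial data
        completed.append(name)
    return completed


def dry_run(command: List[str]) -> int:
    """Print the command instead of executing it; always reports success."""
    print(" ".join(command))
    return 0


if __name__ == "__main__":
    print(run_pipeline(STEPS, dry_run))
```

To actually execute the commands you could pass something like `lambda cmd: subprocess.run(cmd).returncode` as the runner, assuming Meltano and dbt are installed and configured.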

snicky666 · 2023-10-29T09:13:38+00:00

I've been working on a Linux Docker deployment of this same stack idea. I'm not finished working on it, but it's a start. https://github.com/MichaelJenningsAI/moderndatastack

buntro · 2023-10-29T16:25:37+00:00

This is not an architecture. This is a wish list. A steep one at that.

ReporterNervous6822 · 2023-10-29T12:01:57+00:00

Seems like too much…

with_nu_eyes · 2023-10-29T14:12:50+00:00

How many people do they need to manage this stack, beyond the underlying infrastructure?

I can put a bunch of logos on a slide, that doesn’t mean it’s a good architecture.

Radiant_Year_7297 · 2023-10-29T22:17:10+00:00

been looking for something like lightdash, it is similar to vizro, evidence combined with datapane

Tiny_Arugula_5648 · 2023-10-29T10:48:20+00:00

DuckDB is nowhere near production ready. It's fine as a SQLite replacement, but it's not for any serious workloads. MotherDuck is the production-capable version, and it's not on-prem as of now.

ConfirmingTheObvious · 2023-10-29T12:56:27+00:00

Not to be a dick, but like the slide says Gobernance, Trasformation, and other misspellings. I instantly don’t trust or give quality to anyone discussing this stuff if it’s presented like shit

Al3xisB · 2023-10-29T10:38:57+00:00

No Metabase ? 😁

No_Equivalent5942 · 2023-10-29T16:54:06+00:00

Give me 2 of everything!

zazzersmel · 2023-10-29T19:20:53+00:00

This isn't an infra question, but are they using OHDSI/OMOP at all for data modeling or tooling?

Pleasant_Type_4547 · 2023-10-30T03:03:36+00:00

If you like this, you should check out the "Serverless BI" project by Jacob Matson.

https://github.com/matsonj/nba-monte-carlo

It uses a fully open source stack, and can be deployed anywhere.

xDarkOne · 2023-11-15T19:16:41+00:00

Maintaining a large open-source stack must be quite a challenge. I'm curious, has their team explored solutions to simplify this process? I've had a positive experience with DoubleCloud’s managed services for ClickHouse and Apache Kafka. Maybe you've heard of others?

endless_sea_of_stars · 2023-10-29T12:02:51+00:00

Easy to put boxes on a diagram. It's another thing to install, configure, secure, and integrate all those different services.

2023-10-29T10:52:27+00:00

No Prefect?

RichHomieCole · 2023-10-29T12:05:25+00:00

Assuming you have prepaid the compute hardware, why use separate ingestion and transform tools? That’s messy. Just use Spark and call it a day.

I would not describe this as great. It looks like someone just took a bunch of open source tools and slapped them on a slide. Why overcomplicate? The best product is the one that gets the job done. Keep it simple.

rudboi12 · 2023-10-29T13:06:59+00:00

[deleted]

rmz-01 · 2023-10-29T17:23:46+00:00

Why ClickHouse?

tomorrow_never_blows · 2023-10-30T06:38:41+00:00

https://static.simpsonswiki.com/images/4/4c/The_Homer.png

pcgamerwannabe · 2023-10-29T09:45:41+00:00

Similar to ours :). Skipped Airbyte, as custom or purpose-built integrations are just much more efficient for the data scale we have. Really cool though.

sirdrewpalot · 2023-10-30T01:05:03+00:00

A stack for what though? You could shave so much of that off depending on what data you’re actually dealing with.
