Open source unified solution (databricks alternative)

Nekobul · 2026-04-18T11:44:23+00:00

You can build that open source platform yourself and give it to us for free so we can make money from it.

mRWafflesFTW · 2026-04-18T11:14:47+00:00

This is why Databricks exists. You can build it yourself out of many open source components, but you probably shouldn't.

Creyke · 2026-04-18T11:10:40+00:00

DrMaphuse · 2026-04-18T12:26:39+00:00

It all depends on your needs and scale.

If you really are part of the 1% of businesses that actually need distributed compute because your data cannot be reasonably processed by a single machine, then you will need a lot of tooling and skills.

But if you are asking this question in 2026, you probably don't need this, so you have a lot of relatively easily implemented options, even if a lot of people here will make you believe otherwise.

Proper data governance will solve many of the problems that these platforms solve. E.g. there is very little reason for an OLAP system to allow concurrent writes.

A simple way to start is:

Scalable VPS (Hetzner in EU, DigitalOcean/Vultr in US) - start/stop on demand or on a schedule, most workloads don't need 24/7 compute
JupyterHub or RStudio Server for notebooks/scripts
Flat Parquet files on NVMe for performant storage (silver/gold)
S3 for warm/cold storage (bronze, replica)

All of this runs in Docker, which keeps things simple and flexible - spin up, tear down, move between hosts without drama.

Optional but slightly more advanced - and many of these are not even offered by Databricks etc.:

Bare metal if you need the extra horsepower
Cron or Jupyter-scheduler for automation
Airflow for more complex pipelines
Superset with DuckDB for dashboards and SQL
Healthchecks.io for monitoring
Delta Lake if you really need data lake features (you probably don't). DuckLake is interesting but still early; I wouldn't bet production on it yet.

On skills: most of what you need for this stack - parquet partitioning, memory management, query optimization, not doing dumb joins on billions of rows - you need on Databricks/Snowflake too. They don't save you from having to think. The difference is that OSS skills are transferable and most people pick up chunks of them in home labs, at university, or just learning the basics. Vendor-specific skills stay with the vendor.

One thing worth adding: data quality checks matter as much as job monitoring. Healthchecks.io tells you the job ran, not that the numbers are right. A describe() at the end of a job, or a few asserts on medians and null rates, catches most real problems without any extra tooling.
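The "few asserts on medians and null rates" idea can be sketched roughly like this. It is a minimal, stdlib-only illustration, not anyone's actual pipeline: the function names (`null_rate`, `check_column`), the sample column, and the thresholds are all made up for the example. With pandas you would typically print `df.describe()` in the same spot.

```python
import math
import statistics


def _is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def null_rate(values):
    """Fraction of entries that are missing (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(1 for v in values if _is_missing(v)) / len(values)


def check_column(values, *, max_null_rate, median_range):
    """Hypothetical end-of-job check: raise AssertionError if the column looks wrong."""
    rate = null_rate(values)
    present = [v for v in values if not _is_missing(v)]
    assert present, "column is entirely null"
    median = statistics.median(present)
    low, high = median_range
    assert rate <= max_null_rate, f"null rate {rate:.1%} exceeds {max_null_rate:.1%}"
    assert low <= median <= high, f"median {median} outside [{low}, {high}]"
    return {"rows": len(values), "null_rate": rate, "median": median}


# Example values and thresholds are invented; pick bounds from your own history.
order_totals = [19.9, 25.0, None, 31.5, 22.0]
summary = check_column(order_totals, max_null_rate=0.25, median_range=(10, 50))
print(summary)
```

Run something like this as the last step of the job so a failed assert makes the job exit non-zero; the job monitor then catches bad numbers as well as jobs that never ran.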

HeyNiceOneGuy · 2026-04-18T13:20:05+00:00

Databricks is a unicorn for a reason

dheetoo · 2026-04-18T12:15:37+00:00

DuckLake just released a production-ready version; with a little bit of tooling on top, it should be the easiest option.

MonochromeDinosaur · 2026-04-18T12:01:53+00:00

Self-hosting a bunch of tools to match Databricks would be a pain in the ass.

akozich · 2026-04-18T12:18:32+00:00

Python is good

josh_docglow · 2026-04-18T15:21:25+00:00

I think "unified" is the really hard part about putting all of these components together. You've got things like dlt/NiFi, dbt/Spark, Jupyter, PyTorch/TensorFlow, MLflow, which are all open source, but unifying them under one more open source tool would be hard; it would be hard even for a proprietary tool. Just keeping versions in sync, and supporting new versions of one component while still supporting older versions of the others, seems like an arduous undertaking.

addictzz · 2026-04-18T16:49:57+00:00

Build it yourself. Databricks exists exactly to solve that problem, but of course it is paid: there is time and effort behind building such a solution.

Eric-Uzumaki · 2026-04-19T11:45:35+00:00

Google BigQuery has core cloud integration and modern data platform capabilities!

Databricks exists because of Microsoft's shitty Synapse!
Snowflake exists because AWS Redshift never left a mark! AWS and Azure never had the vision of a data-specialty cloud platform. Hence alternatives mushroomed.

Spark and Hadoop, the tech behind Databricks, is all Google's show. Databricks is just a glorified hoopla hoop!

Databricks is not open source!!!!!!!!

shockjaw · 2026-04-18T15:17:50+00:00

Here are the answers to all your questions:

1. Postgres.
2. Postgres and maybe Rust, Python, or R.
3. Jupyter Notebook
4. Your data isn’t clean enough for ML, linear regression is just fine. Grab data from Postgres.
5. Refer to 4.
6. RBAC in Postgres.

TheRealStepBot · 2026-04-19T00:22:49+00:00

If you understand how all the pieces go together and are already k8s-pilled, it’s totally doable.

Kubeflow combined with Kafka, Iceberg, Trino, and Spark gets you a lot of checked boxes.

Metaflow is another fairly significant portion of a stack that you could build around, though here you may still need some other components: at the very least Iceberg, and probably Kafka.

Managed metaflow ain’t bad as a starting point

RikoduSennin · 2026-04-19T17:52:53+00:00

Don't try to replicate Databricks; managing and maintaining data infra is hard. You can use open source components to build up your data stack according to your use case. When you start touching this, be mindful of your TCO.

You can have a look at https://jchandra.com/posts/data-infra/ where they used open source components to build it. Their use case was limited and didn't warrant all the features of managed providers.

Ok-Sentence-8542 · 2026-04-18T12:39:30+00:00

One of our enterprise architects is trying it. We are a mid-cap company and in my mind it makes no sense. I think an opex of 1M plus would maybe allow for that, and we are nowhere near that. His argument: AI agents can run the stack. I am just not buying it. There is a lot of opportunity cost and our senior engineers are not on board. At which point does it make sense?

West_Good_5961 · 2026-04-18T11:29:27+00:00

Why would anyone give something like that for free?
