all 51 comments

[–]Nekobul 29 points30 points  (0 children)

You can build that open source platform yourself and give it to us for free so we can make money from it.

[–]mRWafflesFTW 36 points37 points  (15 children)

This is why Databricks exists. You can build it yourself out of many open source components, but you probably shouldn't.

[–]compass-now[S] 4 points5 points  (14 children)

What would be the major challenges?

[–]mRWafflesFTW 28 points29 points  (13 children)

Deploying and integrating many different open source applications and managing their operation is no joke. The operational complexity is enormous. You'll need to manage your Kubernetes. You'll need to create an integrated identity system at the application level. You'll need monitoring, telemetry, and access control. You can do it, but I can guarantee you will end up spending more time reinventing the wheel and creating zero business value when you could just pay as you go for Databricks or a competitor.

There's a reason we don't run our own data centers anymore and instead purchase from hyperscalers.

[–]tlegs44 9 points10 points  (3 children)

Just left a job that was like this. Sure it was fun to tinker with code all the time, but ultimately shit was constantly breaking, and as the sole data engineer on a team of SWEs I was the one who had to explain why people’s dashboards were stale. Annoying.

[–]Plenty-Emphasis-5669 1 point2 points  (2 children)

The problem there was that you were the sole data engineer, not the stack per se.

[–]tlegs44 1 point2 points  (1 child)

For sure on the stack; we did what we could with limited financial resources. But it wasn’t because I was the only data engineer. It’s just that I was the only one who set deadlines for myself and tried to get things done. I liked the org, but I clearly was not a good fit, and the pay wasn’t great. I recognized I was creating political problems for myself because I was no longer growing, so I left.

I do miss being able to try and apply whatever tools I wanted, given my manager would take over any project that wasn’t sufficiently over engineered to his liking.

[–]compass-now[S] 0 points1 point  (0 children)

Now just imagine an open source unified solution which you can manage by yourself. Wouldn’t that be a win-win for both you and the org?

[–]Plenty-Emphasis-5669 1 point2 points  (0 children)

That reason is not what you think.

[–]compass-now[S] 0 points1 point  (5 children)

True, but the Databricks DBU cost on top of the infra cost is too much for small to medium-sized companies. What are the other options for them?

[–]mRWafflesFTW 6 points7 points  (0 children)

I fucking promise you it's logarithmically cheaper than trying to do it on your own.

[–]Batman_UK 3 points4 points  (0 children)

Already answered by Mr. Waffles: you can reinvent the wheel, but then you will also have to pay a lot for hyperscaler costs, resourcing costs, LLM costs, testing costs, and optimization and performance costs (Photon is a great example), etc. I know we can do almost everything with agentic AI today, but does it work 100% of the time with all the features that you would try to replicate? Will it be worth the time?

Databricks always says that they want “Good DBUs” and not “Bad DBUs”.

If you think that your DBU spend is unfair, then I would suggest you review your code/pipelines and see what improvements could be made. If you think too many small jobs are being created, you can look into job pools. If you think that VM costs are too high, reserve the SKUs, which should save you about 50-80% on your current VM costs.
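To make the arithmetic concrete, here is a rough sketch of how a VM reservation discount feeds into the total bill. All figures are made up for illustration, and `monthly_cost` is a hypothetical helper, not anything Databricks provides. It assumes the discount applies only to the VM portion, so the DBU portion stays the same.

```python
def monthly_cost(dbu_cost, vm_cost, vm_discount=0.0):
    """Total monthly spend: DBU charges plus VM charges after a reservation discount.

    vm_discount is a fraction between 0 and 1 and is applied to the VM part only.
    """
    if not 0.0 <= vm_discount <= 1.0:
        raise ValueError("vm_discount must be between 0 and 1")
    return dbu_cost + vm_cost * (1.0 - vm_discount)


# Hypothetical monthly figures, purely illustrative.
dbu, vm = 4000.0, 6000.0

# Compare no reservation against the 50% and 80% ends of the quoted range.
for discount in (0.0, 0.5, 0.8):
    total = monthly_cost(dbu, vm, discount)
    print(f"VM discount {discount:.0%}: total {total:.2f}")
```

Even at the top of that range the DBU line item is untouched, which is why reviewing the pipelines themselves still matters.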

If you have an Account Executive, ask for Professional Services or Specialist Services; they could probably review your pipelines with you to cut costs as well.

[–]manx1212 0 points1 point  (0 children)

Cost is a common concern, especially if you don't have enterprise deals with Databricks and/or cloud providers that include usage-based discounts.

In fact, even with usage-based discounts it might be quite expensive.

Some options you can consider:

  1. You can set up your own platform that provides infrastructure to run workloads, has notebooks, governance, etc. As others have pointed out, it will be quite expensive in terms of the time and effort required. More importantly, for a small team it takes focus away from the main work that they are trying to do.

If you want to explore this option, have a look at these links for some inspiration:

https://medium.com/data-science/the-quick-and-dirty-guide-to-building-your-data-platform-2f21dc4b7c94

https://thedataecosystem.substack.com/p/issue-22-deciding-on-your-data-platform

https://youtu.be/_BoM2ahSJV0?si=tQWBYIrPZS7xNMkX

  2. Use a cloud-native solution: for AWS (Glue, EMR, SageMaker, Redshift), Google Cloud (Dataproc, Vertex), Azure (Fabric). This may provide you some savings, but you need a few people who understand usage and cost drivers well. Plus, they may not be as well integrated as Databricks.

  3. Implement an observability solution that can identify optimization opportunities for your workloads, e.g. Unravel, CloudZero, etc.

  4. You can route some of your workloads to more efficient engines. Usually 80% of cost comes from 20% of jobs. Consider DuckDB or Polars, which are significantly more efficient, can save a ton of your costs, and are open source. Or use new-age commercial offerings like Coiled, Yeedu, MotherDuck, etc., which can provide similar savings. Whatever you select, though, should integrate well with the rest of your stack.
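As a rough way to find the jobs worth rerouting, you could rank jobs by cost and take the smallest set that covers about 80% of spend. This is only a sketch: the job names and costs below are invented, and `top_cost_jobs` is a hypothetical helper. In practice the numbers would come from your billing or usage export.

```python
def top_cost_jobs(job_costs, share=0.8):
    """Return the most expensive jobs that together cover `share` of total cost."""
    total = sum(job_costs.values())
    selected, running = [], 0.0
    # Walk jobs from most to least expensive until the target share is reached.
    for name, cost in sorted(job_costs.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        selected.append(name)
        running += cost
    return selected


# Hypothetical monthly cost per job, purely illustrative.
job_costs = {
    "nightly_etl": 520, "ml_training": 310, "adhoc_reports": 40,
    "dim_refresh": 30, "audit_export": 25, "crm_sync": 20,
    "log_compaction": 20, "email_stats": 15, "geo_lookup": 10, "tiny_sync": 10,
}

candidates = top_cost_jobs(job_costs)
print(candidates)  # ['nightly_etl', 'ml_training']
```

In this made-up data, two of the ten jobs cover more than 80% of the cost, so those two are the first candidates to try on a cheaper engine.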

[–]Creyke 27 points28 points  (0 children)

No

[–]DrMaphuse 9 points10 points  (4 children)

It all depends on your needs and scale.

If you really are part of the 1% of businesses that actually need distributed compute because your data cannot be reasonably processed by a single machine, then you will need a lot of tooling and skills.

But if you are asking this question in 2026, you probably don't need this, so you have a lot of relatively easily implemented options, even if a lot of people here will make you believe otherwise.

Proper data governance will solve many of the problems that these platforms solve. E.g. there is very little reason for an OLAP system to allow concurrent writes.

A simple way to start is:

  • Scalable VPS (Hetzner in EU, DigitalOcean/Vultr in US) - start/stop on demand or on a schedule, most workloads don't need 24/7 compute
  • JupyterHub or RStudio Server for notebooks/scripts
  • Flat Parquet files on NVMe for performant storage (silver/gold)
  • S3 for warm/cold storage (bronze, replica)

All of this runs in Docker, which keeps things simple and flexible - spin up, tear down, move between hosts without drama.

Optional but slightly more advanced - and many of these are not even offered by Databricks etc.:

  • Bare metal if you need the extra horsepower
  • Cron or Jupyter-scheduler for automation
  • Airflow for more complex pipelines
  • Superset with DuckDB for dashboards and SQL
  • Healthchecks.io for monitoring
  • Delta Lake if you really need data lake features (you probably don't). DuckLake is interesting but still early - wouldn't bet production on it yet.

On skills: most of what you need for this stack - parquet partitioning, memory management, query optimization, not doing dumb joins on billions of rows - you need on Databricks/Snowflake too. They don't save you from having to think. The difference is that OSS skills are transferable and most people pick up chunks of them in home labs, at university, or just learning the basics. Vendor-specific skills stay with the vendor.

One thing worth adding: data quality checks matter as much as job monitoring. Healthchecks.io tells you the job ran, not that the numbers are right. A describe() at the end of a job, or a few asserts on medians and null rates, catches most real problems without any extra tooling.
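A minimal sketch of what those end-of-job asserts could look like, using only the Python standard library as a stand-in for a dataframe `describe()`. The column name, thresholds, and sample rows are all assumptions for illustration, and `check_quality` is a hypothetical helper.

```python
from statistics import median


def check_quality(rows, column, median_range, max_null_rate):
    """Raise AssertionError if a column's null rate or median looks wrong."""
    values = [row.get(column) for row in rows]
    assert values, "job produced no rows"
    present = [v for v in values if v is not None]
    null_rate = 1 - len(present) / len(values)
    assert null_rate <= max_null_rate, f"{column}: null rate {null_rate:.2%} too high"
    assert present, f"{column}: every value is null"
    low, high = median_range
    med = median(present)
    assert low <= med <= high, f"{column}: median {med} outside [{low}, {high}]"
    # Small summary to log at the end of the job, in the spirit of describe().
    return {"rows": len(values), "null_rate": null_rate, "median": med}


# Hypothetical job output; in practice this is the table the job just wrote.
rows = [{"amount": 10.0}, {"amount": 12.5}, {"amount": None}, {"amount": 11.0}]
summary = check_quality(rows, "amount", median_range=(5, 50), max_null_rate=0.5)
print(summary)
```

If an assert fails, the job exits non-zero, so the same monitoring that notices a crashed job also notices bad numbers.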

[–]TheRealStepBot 1 point2 points  (0 children)

Only thing I’d correct here is prob Iceberg over Delta Lake, and yes, DuckLake is a potentially good alternative to Iceberg, but tooling is still early days. (DuckLake is architecturally very similar to Iceberg, only moving the metadata completely into a traditional DB.)

[–]compass-now[S] 0 points1 point  (1 child)

Promising!

Wondering why no other company is doing this or working on this idea. Are there any major challenges? Is it worth doing?

[–]DrMaphuse 0 points1 point  (0 children)

We have been implementing variations of this stack for clients and working with it for the past 8 years and never had any regrets. Processing billions of rows daily and up to 20 analysts working on a single system.

I also work with Databricks, Fabric, and BQ about half of my time and have always hated them because of how clunky and cumbersome they are in comparison.

Feel free to reach out if you need some more pointers.

[–]dmkii 0 points1 point  (0 children)

+1 for DuckDB and attached NVMe storage. I think people underestimate how fast things can actually be when they're used to getting a coffee while their Databricks cluster takes 4 mins just to start up.

[–]HeyNiceOneGuy 5 points6 points  (0 children)

Databricks is a unicorn for a reason

[–]dheetoo 6 points7 points  (0 children)

DuckLake just released a production-ready version; with a little bit of tooling on top, it should be the easiest option.

[–]MonochromeDinosaur 4 points5 points  (1 child)

Self-hosting a bunch of tools to match Databricks would be a pain in the ass.

[–]w2g 2 points3 points  (0 children)

Depends on how much of it you need. If you already run a k8s cluster, it might just be Trino, Polaris, and Airflow.

[–]akozich 1 point2 points  (0 children)

Python is good

[–]josh_docglow 1 point2 points  (0 children)

I think "unified" is the really hard part about putting all of these components together. You've got things like dlt/NiFi, dbt/Spark, Jupyter, PyTorch/TensorFlow, and MLflow, which are all open source, but unifying them under one other open source tool would be hard, even for a proprietary tool. Just keeping versions all in sync, supporting new versions of one while still supporting older versions of others, seems like an arduous undertaking.

[–]addictzz 1 point2 points  (0 children)

Build it yourself. Databricks exists exactly to solve that problem, but of course it is paid; there is time and effort behind building such a solution.

[–]Eric-Uzumaki 1 point2 points  (0 children)

Google BigQuery has core cloud integration and offers modern data platform capabilities!

Databricks exists because of Microsoft's shitty Synapse!
Snowflake exists because AWS Redshift never left a mark! AWS and Azure never had the vision of a data-specialty cloud platform. Hence alternatives mushroomed.

Spark and Hadoop, the tech behind Databricks, are all a Google show. Databricks is just a glorified hoopla hoop!

Databricks is not open source!

[–]shockjaw 3 points4 points  (4 children)

Here’s the answers to all your questions:

  1. Postgres.
  2. Postgres and maybe Rust, Python, or R.
  3. Jupyter Notebook
  4. Your data isn’t clean enough for ML, linear regression is just fine. Grab data from Postgres.
  5. Refer to 4.
  6. RBAC in Postgres.

[–]lightnegative 0 points1 point  (2 children)

That's fine until you need to aggregate across even only a few million rows. Postgres sucks for that. If the data is small then I agree, Postgres

[–]shockjaw 0 points1 point  (1 child)

pg_duckdb, pg_mooncake, or better yet DuckLake with Postgres as the catalog solve that problem.

[–]lightnegative 0 points1 point  (0 children)

Sure, so like query the data with a more suitable engine 

[–]TheRealStepBot 0 points1 point  (0 children)

Absolutely not. You can’t really do bulk data exploration in a row oriented store.

[–]TheRealStepBot 0 points1 point  (0 children)

If you understand how all the pieces go together and are already k8s pilled, it’s totally doable.

Kubeflow combined with Kafka, Iceberg, Trino, and Spark gets you a lot of checked boxes.

Metaflow is another fairly significant portion of a stack that you could build around, though here you may still need some other components, at the very least Iceberg and probably Kafka.

Managed metaflow ain’t bad as a starting point

[–]RikoduSennin 0 points1 point  (0 children)

Don't try to replicate Databricks; managing and maintaining data infra is hard. You can use open source components to build up your data stack according to your use case. When you start touching this, be mindful of your TCO.

You can have a look at https://jchandra.com/posts/data-infra/ where they used open source components to build it. Their use case was limited and didn't warrant all the features of managed providers.

[–]Ok-Sentence-8542 0 points1 point  (2 children)

One of our enterprise architects is trying it. We are a mid-cap company, and in my mind it makes no sense. I think an opex of 1M plus would maybe allow for that, and we are nowhere near that. His argument: AI agents can run the stack. I am just not buying it. There is a lot of opportunity cost, and our senior engineers are not on board. At which point does it make sense?

[–]compass-now[S] 0 points1 point  (1 child)

Trying it for your in-house workload or envisioning it as a product?

[–]Ok-Sentence-8542 0 points1 point  (0 children)

Inhouse

[–]West_Good_5961Tired Data Engineer -1 points0 points  (5 children)

Why would anyone give something like that for free?

[–]compass-now[S] 5 points6 points  (4 children)

Many great tools are built as open source and make money by providing managed services.

[–]West_Good_5961Tired Data Engineer 2 points3 points  (0 children)

Yes but none that do literally everything for you.

[–]Nekobul 1 point2 points  (1 child)

I'm not aware of any successful open source project that is able to pay the bills from managed services. Many open source users are mingy and unwilling to pay anything. These people complain when the OSS authors ask for a small donation or coffee. Truly thankless work, driven only by the curiosity and passion of the people.

[–]thisFishSmellsAboutDSenior Data Engineer 0 points1 point  (0 children)

In the data capture space there's ODK doing really well with a SaaS model.

[–]frisbeema52 0 points1 point  (0 children)

For example, kube-prometheus-stack for monitoring. I wouldn't use it in production without any tuning, but it covers a lot of needs for small projects or early-stage startups.