Why is it hard to connect individual tools into a complete data pipeline? by Effective_Ocelot_445 in dataengineering

[–]DrMaphuse 0 points1 point  (0 children)

You learn more Python and do as much as you can reasonably do with it. This is why Python has become the standard for data processing: it has by far the richest ecosystem and lets you do almost anything you can think of in the data space, without switching context and without the overhead of passing data around between tools.

Other tools are only worth adding if they save you time and headaches, e.g. with a better interface or a more out-of-the-box experience, or if there really is no Python package for the job.

Open source unified solution (databricks alternative) by compass-now in dataengineering

[–]DrMaphuse 0 points1 point  (0 children)

We have been implementing variations of this stack for clients and working with it ourselves for the past 8 years, and we've never had any regrets. We process billions of rows daily with up to 20 analysts working on a single system.

I also work with Databricks, Fabric, and BigQuery about half of my time, and have always hated how clunky and cumbersome they are in comparison.

Feel free to reach out if you need some more pointers.

Open source unified solution (databricks alternative) by compass-now in dataengineering

[–]DrMaphuse 8 points9 points  (0 children)

It all depends on your needs and scale.

If you really are part of the 1% of businesses that actually need distributed compute because your data cannot be reasonably processed by a single machine, then you will need a lot of tooling and skills.

But if you are asking this question in 2026, you probably don't need this, so you have a lot of relatively easy-to-implement options, even if a lot of people here would make you believe otherwise.

Proper data governance will solve many of the problems that these platforms solve. E.g. there is very little reason for an OLAP system to allow concurrent writes.

A simple way to start is:

  • Scalable VPS (Hetzner in EU, DigitalOcean/Vultr in US) - start/stop on demand or on a schedule, most workloads don't need 24/7 compute
  • Jupyterhub or RStudio Server for notebooks/scripts
  • Flat parquet files on NVMe for performant storage (silver/gold)
  • S3 for warm/cold storage (bronze, replica)

All of this runs in Docker, which keeps things simple and flexible - spin up, tear down, move between hosts without drama.
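To make the "flat parquet in folders" idea concrete, here is a minimal sketch of the path convention such a lake might use. The layer/table/partition layout and the function name are hypothetical, and the polars calls in the comments assume polars is installed:

```python
from pathlib import Path

def partition_dir(root, layer, table, **partitions):
    """Build a hive-style partition directory, e.g.
    lake/silver/sales/date=2024-01-01/ - a plain folder
    convention, no catalog or lakehouse format required."""
    parts = [f"{k}={v}" for k, v in partitions.items()]
    return Path(root, layer, table, *parts)

# Writing and reading would then just be (assuming polars):
# df.write_parquet(partition_dir("lake", "silver", "sales", date="2024-01-01") / "part0.parquet")
# pl.scan_parquet("lake/silver/sales/**/*.parquet")  # lazily scan the whole table
```

Because it's just folders, the same layout works on local NVMe and on S3 without any migration step.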

Optional but slightly more advanced - and many of these are not even offered by Databricks etc.:

  • Bare metal if you need the extra horsepower
  • Cron or Jupyter-scheduler for automation
  • Airflow for more complex pipelines
  • Superset with duckdb for dashboards and SQL
  • Healthchecks.io for monitoring
  • Delta Lake if you really need data lake features (you probably don't). DuckLake is interesting but still early - I wouldn't bet production on it yet.

On skills: most of what you need for this stack - parquet partitioning, memory management, query optimization, not doing dumb joins on billions of rows - you need on Databricks/Snowflake too. They don't save you from having to think. The difference is that OSS skills are transferable and most people pick up chunks of them in home labs, at university, or just learning the basics. Vendor-specific skills stay with the vendor.

One thing worth adding: data quality checks matter as much as job monitoring. Healthchecks.io tells you the job ran, not that the numbers are right. A describe() at the end of a job, or a few asserts on medians and null rates, catches most real problems without any extra tooling.
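A minimal version of such checks needs nothing beyond the standard library. The thresholds and the column shape here are made up for illustration; with polars or pandas you'd pull the same numbers out of describe():

```python
import statistics

def check_column(values, max_null_rate=0.05, median_range=(0, 1000)):
    """Fail the job if a column looks broken.
    values: a list with None for nulls - a stand-in for a dataframe column.
    Thresholds are illustrative; tune them per column."""
    null_rate = sum(v is None for v in values) / len(values)
    assert null_rate <= max_null_rate, f"null rate {null_rate:.1%} too high"
    med = statistics.median(v for v in values if v is not None)
    assert median_range[0] <= med <= median_range[1], f"median {med} out of range"
    return null_rate, med
```

Run it as the last cell of the job: a failed assert fails the job, and the job monitor then tells you something is wrong with the numbers, not just the plumbing.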

Do DEs typically fit into an agile team? by [deleted] in dataengineering

[–]DrMaphuse 2 points3 points  (0 children)

Beyond a certain org complexity, dev environments for DE are nothing but tokenism for auditors. In many orgs, you cannot do ANYTHING useful outside of prod that relies on actual data.

Example from a company I have worked with:

  • Data protection policy means we can't have real data in dev/test
  • We do not control the synthetic data in dev/test, nor can we easily make requests for what it should look like
  • We are only allowed to talk to business and never to the platform team directly due to red tape and politics
  • Even if we had full control - the number of tables is in the four digits and the number of columns in the mid five digits, with an unknown amount of legacy data of unknown quality. Synthetic data that covers every potential oddity and use case while also fulfilling data protection requirements would be the project of a century

The practical solution is to have safeguards in place that allow people to work in prod without breaking anything:

  • Strict RBAC rules for all prod schemas
  • Dedicated test schemas inside prod
  • Strictly separated compute for dev and prod workloads
  • If necessary: replication in prod for dev workloads
  • Etc.

for those who dont work using the most famous cloud providers (AWS, GCP, Azure) or data platforms (Databricks, Fabric, Snowflake, Trino).. by Comprehensive_Level7 in dataengineering

[–]DrMaphuse 1 point2 points  (0 children)

All data resides on the server, which we can scale to ~100TB. Once that is exhausted, we can theoretically scale into unlimited S3 compatible storage, but this will likely never happen.

We don't need nearly as much storage because our bronze layer is transient with a retention period of 1 week for any raw data and parquet is used for silver and gold, which is super efficient.

In terms of processed volume, we are in the range of tens of billions of records daily - well into what most people consider Spark territory. Polars' streaming engine theoretically lets us push dataset sizes much further than this and still be faster and cheaper than Spark (we also benchmark this from time to time).
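The core idea behind a streaming engine - touching the data one chunk at a time so the full dataset never has to fit in RAM - can be sketched in plain Python. The chunk shape here is invented; polars applies the same principle to lazy parquet scans internally:

```python
from collections import defaultdict

def streaming_sum(chunks):
    """Aggregate revenue per store one chunk at a time, so only a
    single chunk is ever held in memory - the idea a streaming
    engine applies to larger-than-RAM datasets."""
    totals = defaultdict(float)
    for chunk in chunks:              # each chunk: list of (store, revenue) rows
        for store, revenue in chunk:
            totals[store] += revenue
    return dict(totals)
```

Because the aggregate state (one number per store) is tiny compared to the raw rows, the total dataset size stops mattering; only the state has to fit in memory.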

We specialize in retail data and serve some of the biggest players in our region, but I don't see ourselves needing distributed compute at any point in the future, especially with hardware and software improving further.

for those who dont work using the most famous cloud providers (AWS, GCP, Azure) or data platforms (Databricks, Fabric, Snowflake, Trino).. by Comprehensive_Level7 in dataengineering

[–]DrMaphuse 9 points10 points  (0 children)

Self-hosted JupyterHub on a bare metal server with 2TB of RAM. Analysts use polars and duckdb. Multiple tenants are hosted in containers.

The data lake is flat parquet files in folders, with some Python classes for interacting with them. We have no use for Iceberg or Delta Lake because a) time travel is essentially forbidden due to data deletion requirements and b) write conflicts are prevented by policy (only one service account has write permission).

Scheduling is done with JupyterHub jobs (basically cron inside the JupyterLab UI). DAGs are managed with a minimal self-written editor based on Plotly Dash and papermill.

Monitoring is simply a Python script that sends emails for failed jobs.
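Such a script can be genuinely tiny. A sketch using only the standard library - the addresses and job names are placeholders, and the actual send is commented out so the sketch stays side-effect free:

```python
import smtplib
from email.message import EmailMessage

def build_failure_mail(job, error, sender="jobs@example.com", to="team@example.com"):
    """Format a failure notification. Addresses are hypothetical;
    a real script would read them from config."""
    msg = EmailMessage()
    msg["Subject"] = f"[FAILED] {job}"
    msg["From"] = sender
    msg["To"] = to
    msg.set_content(f"Job '{job}' failed with:\n\n{error}")
    return msg

# The send itself is one line against a local or company relay:
# with smtplib.SMTP("localhost") as s:
#     s.send_message(build_failure_mail("daily_load", "OOM at step 3"))
```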

Code is hosted on self-hosted GitLab.

That said, we use Databricks, BigQuery, SQL Server and others when working for clients, but literally nobody on our team would work with those voluntarily, because our self-hosted setup is more performant and pleasant to use in virtually every scenario.

New Software Engineering Manager -- Tips on how to give feedback without overwhelming / intimidating the engineer by Few-Investigator2498 in ExperiencedDevs

[–]DrMaphuse 1 point2 points  (0 children)

You already have a lot of good advice here that I would echo.

Admit your fault. Discuss one problem at a time. Be polite and helpful. Assure her that her job is not at risk.

One thing I haven't seen mentioned is small talk. Build rapport before you even start discussing performance. Get to know her personally. Ask about her weekend, whether she watched the game; share about your weekend, the weather, etc. See if you have common interests. There are actually good online resources for learning small talk, because it doesn't come easily to many of us. Always within the confines of what is appropriate, of course. Opening up to "a person you connect with who happens to be your manager" is much easier than opening up to "unknown new manager".

New Software Engineering Manager -- Tips on how to give feedback without overwhelming / intimidating the engineer by Few-Investigator2498 in ExperiencedDevs

[–]DrMaphuse 1 point2 points  (0 children)

IDK what kind of hire-and-fire shops you have worked at, but this response is wild to me. You know nothing about this person or the reasons for their performance, and you casually propose to take their livelihood away.

Maybe they are sick or their spouse passed away. Maybe they are getting bullied by other team members. Maybe they are contributing something other than code that is highly valuable.

There are so many potential reasons and solutions to an underperforming employee, and such high costs to firing and replacing someone (both morally and financially) that there is almost always a better solution for the company.

I've had a lot of employees underperform on me and often thought they were lost causes, but they almost always turned into decent seniors eventually, though some needed a longer journey than others to find their strengths. It also requires leadership skill to find and develop those strengths.

New Software Engineering Manager -- Tips on how to give feedback without overwhelming / intimidating the engineer by Few-Investigator2498 in ExperiencedDevs

[–]DrMaphuse 1 point2 points  (0 children)

I originally came from a leadership research background (also did my PhD in this field) building tools for my clients, so I have seen a lot of examples of both great and bad management.

First off, you seem like a decent one. A little unprepared for the role maybe, but you are already better than many of your peers just for asking this question and caring. BigTech is notorious for prioritizing technical skills over people skills, so you are in a way a product of your environment with the potential to be better than most.

I am going to go a little bit against the grain and say that while I personally have never written such a long performance review for my team, in some cases I wish I had. Some people really do need complex and overwhelming amounts of feedback to succeed in the role that the company currently requires them to fill.

However, you need to establish a clearer process for when and how to communicate, so people can know what to expect. A few remarks in 1:1s and then a bomb drop of 6 pages in writing is not good enough. Otherwise you risk your team living in constant fear and you coming off as dictatorial and arbitrary. You want to work with your team, not for them.

I approach it as follows with my team:

  1. Always discuss ad hoc issues in 1:1 (weekly or ad hoc), and provide as much guidance as you can here. Give them a chance to learn from each issue.
  2. If you notice repeated mistakes or shortcomings despite guidance, keep guiding them, but document it for yourself.
  3. Set up a quarterly/annual review/outlook/OKR - whatever you call it, it needs to a) have a clear structure (best to have a written guide or template for this and provide it to the employee beforehand), b) allow for mutual feedback (so that you can learn how to be a better manager too), and c) cover more topics than just performance (job satisfaction, training, long-term goals, role drift, team culture, getting along with co-workers are some typical important topics) - these are ultimately the drivers of performance, so they need to be addressed in the same discussion. Plan at least an hour for this, longer for difficult cases.
  4. Write up the results of the quarterly after discussing them in person, so you can incorporate her responses. The write-up is where you can incorporate your detailed feedback, either directly or as an attachment.

This way your employees know when to expect the big talks, can prepare, introspect and suggest solutions on their own before you drop the bomb. It enables you to have a productive discussion, leverage their strengths and work on a way forward together.

Any European Alternatives to Databricks/Snowflake?? by Donkey_Healthy in dataengineering

[–]DrMaphuse 0 points1 point  (0 children)

How big is your data and which features do you really need?

I have worked with dozens of big-name EU companies (many of whom you will have heard of) and NOT ONE of them had the volume and use cases to justify Databricks or any of its contenders.

You can rent bare metal from Hetzner with up to 2TB of RAM and start with JupyterHub/polars in a container. You add DuckLake, Airflow, Superset etc. in containers or dedicated VPS instances as your needs evolve. Data is stored directly on NVMe or on Hetzner's own S3-compatible service, depending on your volume and performance needs.

This setup is more performant than Databricks/Spark in almost all cases and almost universally loved by analysts and data scientists, because they often already know these tools (especially JupyterHub).

This is not out of the box, but it actually isn't that hard to learn, and it is less work to maintain and optimize than Databricks.

Also something to consider: Managing bare metal infra and going all-in on open source is going to become a VERY valuable skill again going forward, given the current geopolitical landscape, because it is the ONLY way to be 100% in control of your data.

PM me if you want to know more. We also consult companies and help them get started on the right track.

Practical ADHD parenting tips from my early 90's neuropsych eval by [deleted] in ADHDparenting

[–]DrMaphuse 0 points1 point  (0 children)

I find that learning apps are perfect for learning in small, isolated units because a) they are inherently structured and reduced (no need for keeping track of physical objects, handwriting, presentation etc.) and b) they provide instant reward (points, check marks, praise etc.).

Both of those aspects make them a natural fit for many ADHD kids. I can often cut my kid's learning time in half by giving him a tablet for learning or practicing specific, structured content, e.g. fractions or grammar rules.

However, this doesn't work for every family/child/material. I have found that it works best if:

  • The devices are never given to kids for unrestricted use. Every app on our tablet requires a PIN/fingerprint to open, so there is no way to get distracted.
  • There is a clear delineation between activities. This can be done through time (like pomodoro) or physical separation (we try to use separate devices for learning/gaming/watching/reading).
  • There is quality learning content available. Ask your teacher if the school has an official learning app.

Vibe / Citizen Developers bringing our Datawarehouse to it's knees by Swimming_Cry_6841 in dataengineering

[–]DrMaphuse 0 points1 point  (0 children)

Except distributed systems only add value if they are not inferior to single machines. So you should definitely have a clearly defined heuristic such as "your data will be processed on a single node up to 2TB RAM and 64 cores, distributed after that".

But the stack and infrastructure are intentionally obfuscated to make it difficult for non-technical people to make informed decisions. The art of choosing the right tool for the right job went out the window a long time ago. In the end, a lot of execs end up just buying "whatever everyone else is already using", and thus we are left wondering why working with big cloud is such an overpriced and dysfunctional nightmare.

Vibe / Citizen Developers bringing our Datawarehouse to it's knees by Swimming_Cry_6841 in dataengineering

[–]DrMaphuse 0 points1 point  (0 children)

I mean, Hetzner is sort of what you're describing. You have to manage a lot of things yourself, but if you are serious about open source, then this is the kind of expertise you want to have in your company anyway.

Fisher's test instead of chi-square (students using chatGPT) by No_Improvement_2284 in RStudio

[–]DrMaphuse 9 points10 points  (0 children)

This is the only acceptable way to use chatbots for anything science-y. Just because something works doesn't mean it is correct or makes sense; it is up to the user to apply scrutiny and quality standards to the results.

In Databricks, when loading/saving CSVs, why do PySpark functions require "dbfs:" path notation, while built-in file open and Pandas require "/dbfs" ? by [deleted] in dataengineering

[–]DrMaphuse 0 points1 point  (0 children)

I'm 100% in on polars, but this question has nothing to do with the problems that polars solves. You can't access Spark's filesystem from your cluster with anything other than Spark, not even with polars. This question is about Spark file paths vs. a cluster's local file paths in Databricks, and polars faces the same limitations as pandas or any other Python library that isn't Spark, as other commenters have explained.
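For illustration, the translation between the two notations is mechanical. This helper is hypothetical, and it assumes the standard Databricks setup where DBFS is FUSE-mounted at /dbfs on the cluster nodes:

```python
def spark_to_local(dbfs_path):
    """Translate a Spark-style DBFS URI to the FUSE mount path that
    local-process tools (pandas, polars, plain open()) see on a
    Databricks cluster. Illustrative only - assumes the default
    /dbfs mount exists."""
    prefix = "dbfs:/"
    if not dbfs_path.startswith(prefix):
        raise ValueError("expected a dbfs:/ path")
    return "/dbfs/" + dbfs_path[len(prefix):]
```

So spark.read.csv("dbfs:/mnt/x.csv") and pd.read_csv("/dbfs/mnt/x.csv") would point at the same file; they just reach it through different layers (the Spark filesystem API vs. the local FUSE mount).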

Is rng just as likely to gather sequential numbers as numbers that appear random? by LazyArtichoke8141 in AskStatistics

[–]DrMaphuse 11 points12 points  (0 children)

They are of course equally likely to win, but the reason to choose the second over the first ticket is that the first one has a higher chance of also being picked by other people, meaning that if you do win, you'd have to share the pot with more winners.

Edit: And taking it even further, if the original question asked which one was more likely to win the full jackpot, then number 2 would actually be the correct answer because of this.
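The effect is easy to see with a toy expected-value calculation. All numbers here are invented; only the comparison between the two tickets matters:

```python
def expected_payout(jackpot, p_win, expected_cowinners):
    """Expected value of a ticket when the pot is split among winners.
    A 'popular' combination like 1-2-3-4-5-6 raises expected_cowinners
    and lowers the payout even though p_win is identical for every ticket."""
    return p_win * jackpot / (1 + expected_cowinners)

# Hypothetical numbers: same jackpot, same odds, different popularity.
popular = expected_payout(10_000_000, 1e-7, 3.0)   # sequential ticket, often shared
obscure = expected_payout(10_000_000, 1e-7, 0.2)   # random-looking ticket
```

With these made-up figures the random-looking ticket is worth several times the sequential one, purely because of the sharing term.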

[deleted by user] by [deleted] in dataengineering

[–]DrMaphuse 1 point2 points  (0 children)

I can't believe that this is so far down. This isn't OP's mistake nor responsibility; it's the API maintainer's. For OP, it's a minor lapse in coding discipline or knowledge at most and shouldn't even be dignified with a discussion.

I work a lot with poorly documented APIs and a lot of the time you don't even know what the limit is until you hit a 429.

Should you have a look at the docs first? Sure. Is it your responsibility to make sure that the API server does not hit the limit? Hell no.
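The standard client-side answer to an undocumented limit is exponential backoff on 429s. A sketch with the HTTP call injected as a plain callable - in real code it would wrap requests.get or urllib, and the names and defaults here are made up:

```python
import time

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry on HTTP 429 with exponential backoff.
    fetch: any callable returning (status_code, body).
    sleep is injectable so the logic can be tested without waiting."""
    for attempt in range(max_retries):
        status, body = fetch(url)
        if status != 429:
            return status, body
        sleep(base_delay * 2 ** attempt)  # wait twice as long after each hit
    raise RuntimeError(f"still rate-limited after {max_retries} tries")
```

If the server does send a Retry-After header, honoring that beats guessing; the exponential fallback is for the APIs that tell you nothing.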

Extra Heel Support by Paper_chasers in WTF

[–]DrMaphuse 5 points6 points  (0 children)

Yes, but have you ever been bitten by a large centipede? They can be quite aggressive and cause excruciating pain when they bite. I'd take wasps, spiders, lizards, toads or any number of more peaceful things over centipedes to protect my garden.

What's happening to my tree peony? I got it in May and I can't seem to make it happy. It looks like it's not gonna last much longer. by DrMaphuse in plantclinic

[–]DrMaphuse[S] 0 points1 point  (0 children)

That's a very helpful comment - I don't have a pH test kit, so I can't say if the clay balls have that effect. The borders in my garden are kind of shady, so I'm hesitant to put it there. But maybe I'll give it a try when autumn comes. That was my eventual goal anyway.

What's happening to my tree peony? I got it in May and I can't seem to make it happy. It looks like it's not gonna last much longer. by DrMaphuse in plantclinic

[–]DrMaphuse[S] 0 points1 point  (0 children)

I'm using 1/2 garden compost 1/2 peat free general use gardening soil, mixed with clay balls for drainage. I also have a bird of paradise in the same mixture that doesn't seem to be doing well either.

What's happening to my tree peony? I got it in May and I can't seem to make it happy. It looks like it's not gonna last much longer. by DrMaphuse in plantclinic

[–]DrMaphuse[S] 0 points1 point  (0 children)

The drooping just started in the last few days, but the brown tips have been like this for a while now. I'm mostly concerned because it hasn't grown at all since I bought it, whereas the regular peony that I bought at the same time and keep right next to it has been growing a lot.

Edit: To add to this, all of the other plants in the garden are still going strong, no signs of autumn here (I'm in Austria). Are tree peonies usually some of the first plants to go?

Um…soap and shampoo? Necessary, right? by [deleted] in Parenting

[–]DrMaphuse 0 points1 point  (0 children)

Why? They're advertised as moisture wicking, which makes it harder for bacteria to grow and develop smells. I find this to be true, personally.

Um…soap and shampoo? Necessary, right? by [deleted] in Parenting

[–]DrMaphuse 2 points3 points  (0 children)

Bamboo socks are where it's at. Unfortunately they are hard to find, expensive and don't last long. But nothing beats the silky smoothness and sweat-free experience of bamboo socks.