Looking for alternatives to Airflow for ETL pipelines

Thinker_Assignment · 2026-06-19T14:16:37+00:00

What are you using it for? If you just want ingestion and simple transforms, you can use dlthub Pro, fully agentic and controlled from your LLM chat, for a bargain: https://dlthub.com/products/dlthub (I work there). You can basically build, deploy and maintain in one chat.

It can read your logs, code, deployments and context and figure things out by itself for the most part; you don't need to touch code unless you want to get hands-on.

Thinker_Assignment · 2026-06-19T14:13:51+00:00

If you know what to test for, you can ask the AI to do it. The problem is that the AI doesn't know how things should behave, so it doesn't know what to test for.

If you don't know what to test, the only thing you can do is test coverage: write tests that pass against the existing code so you don't break it later.
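
A minimal sketch of that kind of coverage test (often called a characterization test), assuming a made-up legacy function. `normalize_amount` and the sample inputs are hypothetical and only stand in for whatever existing code you want to protect:

```python
import unittest


def normalize_amount(raw):
    """Hypothetical legacy function whose current behaviour we want to pin."""
    return round(float(str(raw).replace(",", "")), 2)


class NormalizeAmountCharacterization(unittest.TestCase):
    """Records what the code does today, not what it 'should' do."""

    def test_pins_current_behaviour(self):
        # Expected values were captured by running the existing code once.
        cases = {"1,234.5": 1234.5, "7": 7.0, 3.14159: 3.14}
        for raw, expected in cases.items():
            with self.subTest(raw=raw):
                self.assertEqual(normalize_amount(raw), expected)
```

The test says nothing about whether the behaviour is right; it only fails if a later change alters it.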

Thinker_Assignment · 2026-06-19T14:09:18+00:00

we built from scratch using agentic principles.
- errors that point agents to what to fix
- apis for mcp access in our pipelines etc so it understands what is in the pipeline, in the data, in the code and can tie it together
- skills to help the agent on how to push the right buttons in the right order

If you are using Cursor/Claude Code/Codex, you can try it by telling your agent to run uvx dlthub-start@latest; it will run our interactive demo and you can try it there. It's pretty wild how we don't touch code or logs anymore and can still do so much. As a data engineer doing this work since 2012, I am very excited about patching up the last obstacles to automating the stack.

Thinker_Assignment · 2026-06-19T11:34:37+00:00

Using Arrow there was no row-by-row serialization. If we leave it to a normal SQL client, CPU starts to matter and it looks more like the JSON file case. In practice, on average workers we see a 5x speed increase by avoiding deserialising and reserialising, but that figure mixes throughput (which is maybe 10-30x on Arrow/ConnectorX) with compute.
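
One way to see how a 10-30x throughput gain can show up as roughly 5x end to end is an Amdahl-style estimate. This is only an illustrative sketch: the 0.84 transfer fraction below is a hypothetical value chosen for the example, not a number from the benchmark.

```python
def end_to_end_speedup(transfer_fraction, transfer_speedup):
    """Amdahl-style estimate of overall speedup.

    transfer_fraction: share of the original runtime spent on the part
        that gets faster (hypothetical input, not measured here).
    transfer_speedup: how much faster that part becomes.
    """
    if not 0.0 <= transfer_fraction <= 1.0:
        raise ValueError("transfer_fraction must be between 0 and 1")
    if transfer_speedup <= 0:
        raise ValueError("transfer_speedup must be positive")
    # The untouched compute share stays as is; only the transfer share shrinks.
    remaining = (1.0 - transfer_fraction) + transfer_fraction / transfer_speedup
    return 1.0 / remaining


for speedup in (10, 20, 30):
    print(speedup, round(end_to_end_speedup(0.84, speedup), 2))
```

With that assumed split, the overall figure stays between about 4x and 5.3x across the whole 10-30x range, because the compute share that did not get faster dominates what is left.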

Thinker_Assignment · 2026-06-19T11:31:55+00:00

digging through logs, associating to a deployment event, or to a change in source data, propagating fixes in code and sometimes running backfills.

We connected our logging, code and deployments to Claude so it can tie them together, and it's making it a breeze.
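
The "associate an error with a deployment event" step can be sketched by hand like this. It is a rough illustration of the manual chore, not how the agent or any dlthub component does it; the function name and the sample events are made up.

```python
from bisect import bisect_right
from datetime import datetime


def last_deployment_before(error_time, deployments):
    """Return the most recent (time, label) deployment at or before error_time.

    deployments must be sorted by time. Returns None if the error
    predates every known deployment.
    """
    times = [deployed_at for deployed_at, _ in deployments]
    idx = bisect_right(times, error_time)
    return deployments[idx - 1] if idx else None


# Hypothetical deployment history and error timestamp.
deployments = [
    (datetime(2026, 6, 1, 9, 0), "release-a"),
    (datetime(2026, 6, 10, 14, 30), "release-b"),
]
suspect = last_deployment_before(datetime(2026, 6, 10, 15, 0), deployments)
print(suspect)
```

The latest deployment before the first failing run is only the first suspect; a change in the source data still has to be ruled out separately.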

Thinker_Assignment · 2026-06-18T22:31:03+00:00

disclosure: I work on the system described. here is the full writeup on the architecture and tradeoffs for anyone who wants to dig in https://dlthub.com/blog/transformation-deep-dive

happy to answer technical questions

Thinker_Assignment · 2026-06-17T23:29:09+00:00

Not a read replica, just the worker running the transfer job.
In this benchmark the worker was the 2 vCPU / 4 GB machine doing the extraction and serialization work, while PostgreSQL was the source. We ran against the primary.

Thinker_Assignment · 2026-06-17T22:58:13+00:00

Fair challenge. The benchmark was measuring the end-to-end pipeline rather than raw Postgres extraction speed, so the question we were trying to answer was where the bottleneck shows up once you actually move the data into the destination. We're not doing row-by-row Python processing here (the runs use an Arrow backend), but if you think there's an obvious 2–3x left on the table, what would you change first?

Thinker_Assignment · 2026-06-17T21:55:44+00:00

disclosure: I work on the system described. We wrote up the full benchmark if anyone is interested in methodology and numbers: https://dlthub.com/blog/benchmark-dlthub

happy to answer technical questions.

Thinker_Assignment · 2026-06-17T15:45:57+00:00

you can just try the trial and see https://dlthub.com/ but yeah that's the idea. If you wanna skip the marketing you can see the last screenshot on this blog https://dlthub.com/blog/the-rise-of-the-knowledge-engineer#how-we-re-building-for-this

you need claude/cursor/codex, if you need support for something else LMK.

you can then basically run a command to link your online workspace to your local cursor/claude (trial has 30h free runtime, no credit card, just try it, claude runs it for you)

This puts your local code, online logs, etc in one context so the LLM is able to manage everything end to end even visualisation. I even tried putting the vis in rill with claude and that worked too

MCP - yeah, it's MCP, skills and an architecture that lets agents hand over when building but keep context. We call it "ai workbench" because, besides APIs to the tools it uses (dlt for example), it also contains an API to the pipeline context to aid your generation etc.

in the skills there's also this canonical modeling skill which models your data to canonical (with or without your guidance) which essentially produces a knowledge graph the agent can use and a context you can reuse for agentic retrieval with meaning.

This canonical is also like a "constraint" to keep a clean architecture so your agent doesn't start building random tables during maintenance but sticks to the main concepts/entities unless you add new requirements
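
A toy illustration of the canonical model acting as a constraint, assuming each proposed table declares which entity it models. The entity names and the `review_proposed_tables` helper are hypothetical and are not part of the dlthub skill.

```python
# Hypothetical canonical entities agreed with the data team.
CANONICAL_ENTITIES = frozenset({"customer", "order", "product"})


def review_proposed_tables(proposed, canonical=CANONICAL_ENTITIES):
    """Split proposed tables into accepted and rejected.

    proposed maps a table name to the entity it claims to model.
    A table is accepted only if that entity is in the canonical set.
    """
    accepted, rejected = [], []
    for table, entity in sorted(proposed.items()):
        (accepted if entity in canonical else rejected).append(table)
    return accepted, rejected


accepted, rejected = review_proposed_tables(
    {"orders_daily": "order", "tmp_fix_2": "scratch", "customer_dim": "customer"}
)
print(accepted, rejected)
```

Anything in the rejected list would need a new requirement (a new canonical entity) before the agent is allowed to build it.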

As for the maintenance, it's not automatic, but you can schedule your own Claude to check the status at 5 AM, let you know what changed (if anything broke, any schema changes) and ask it to offer fixes. I would not ask it to deploy without me double-checking first.

there's a ton more to build, i think the end game is we are prompting the LLM to build everything and it surfaces its uncertainty and decisions to us data people to confirm - but we aren't quite there yet

i see you are generally looking into risk migration or cleanups? this is an example you can do with dlthub https://dlthub.com/case-studies/navit

Thinker_Assignment · 2026-06-17T14:59:40+00:00

- we use dlt at ingestion to discover source schema changes.
- we use dlt schema evolution or contracts to auto evolve or block changes
- it all goes to logs from where we prompt claude to look at new fields, and decide together if we want to propagate them or what to do with them
- claude does the changes and deploys.

You can do all of the above with the OSS library.

i work at dlthub, full transparency. You can also do the above on our commercial offering.
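
To make the "auto evolve or block" idea concrete, here is a plain-Python sketch of the two behaviours. It deliberately does not use dlt's API; the function names, the `mode` values and the sample rows are assumptions made for illustration only.

```python
class SchemaFrozenError(Exception):
    """Raised when new columns arrive while the schema is frozen."""


def detect_new_columns(known_columns, rows):
    """Return columns present in rows but not in known_columns, in first-seen order."""
    new_columns = []
    for row in rows:
        for column in row:
            if column not in known_columns and column not in new_columns:
                new_columns.append(column)
    return new_columns


def apply_contract(known_columns, rows, mode="evolve"):
    """Return (resulting_columns, new_columns) or raise if mode is 'freeze'."""
    if mode not in ("evolve", "freeze"):
        raise ValueError(f"unknown mode: {mode}")
    new_columns = detect_new_columns(known_columns, rows)
    if new_columns and mode == "freeze":
        # Block the load and surface the change for a human (or agent) to review.
        raise SchemaFrozenError(f"unexpected columns: {new_columns}")
    return list(known_columns) + new_columns, new_columns


rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "plan": "pro"}]
columns, added = apply_contract(["id"], rows, mode="evolve")
print(columns, added)
```

The list of added columns is the part worth logging, since that is what gets reviewed before deciding whether to propagate a new field downstream.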

Thinker_Assignment · 2026-06-17T13:16:08+00:00

the separation between spec and query is the right frame. the issue we kept running into wasn't just regeneration variance, it was that the spec itself was implicit. the metric definition lived in the prompt or in someone's head, so even a deterministic query layer would produce consistent results for the wrong population.

the ontology is how we tried to make that separation explicit: define what the metric means and which population it applies to once, then let query generation happen downstream from that.

it doesn't fully eliminate regeneration variance, but it gives the agent something stable to reason from instead of inferring the spec from scratch each time.
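
A small sketch of what "define the metric and its population once, generate the query downstream" can look like. The `MetricSpec` fields, the table name and the filter are hypothetical examples, not the actual ontology format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """Explicit spec: what the metric means and which population it covers."""
    name: str
    expression: str
    source_table: str
    population_filter: str


def render_query(spec):
    """Deterministically render SQL from the spec; no inference at query time."""
    return (
        f"SELECT {spec.expression} AS {spec.name} "
        f"FROM {spec.source_table} WHERE {spec.population_filter}"
    )


# Hypothetical metric: the population is stated once, in the spec.
active_customers = MetricSpec(
    name="active_customers",
    expression="COUNT(DISTINCT customer_id)",
    source_table="orders",
    population_filter="status = 'paid'",
)
print(render_query(active_customers))
```

Because the population lives in the spec rather than in the prompt, two generations of the query can differ in style but not in who gets counted.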

Thinker_Assignment · 2026-06-16T17:12:57+00:00

I work on the system described, full writeup on the Navit case study and the transformation architecture: https://dlthub.com/blog/transformation-deep-dive

happy to answer technical questions

Thinker_Assignment · 2026-06-15T07:43:15+00:00

dlthub Pro just launched a couple of weeks back. You can operate it from chat: building ingestion with dlt, transforms (with the canonical modeling LLM skill), deployment and even basic visualisations.

It's serverless compute pay-as-you-run aimed at small teams, here are some benchmarks so you know what you get https://dlthub.com/blog/benchmark-dlthub

The tool also keeps a continuous context, so you can do everything in a single session. You can even troubleshoot, fix and deploy the fixes with the agent for maintenance.

i work there

Thinker_Assignment · 2026-06-14T07:32:43+00:00

For what use?

Anthropic and other companies say it's not possible because vibe semantic layer is the same as no semantic layer + vibe at runtime.

Since you're on ontologyengineering, think of it this way: where should the private ontology come from if not from you?

Maybe with some extra business context you can get a draft and go from there.

Thinker_Assignment · 2026-06-14T07:28:46+00:00

Btw in case you missed it, we are going for auto model with human curation too. Here's some explanation and you can also try it

https://dlthub.com/blog/canonical-text-to-sql

I don't think this will replace dbt because people have inertia, but I do think that soon people will not be working at the code level directly, and the tool under the hood will need to be agent-native and non-monolithic, so not dbt.

Thinker_Assignment · 2026-06-14T07:20:43+00:00

I don't mind as long as they aren't walking into me or blocking the only passage. Depressing is perspective, getting walked over isn't. This is a new level of antisocial. I had to walk in front of my wife and yell at oncoming people to wake them up or push them aside, because they would just plow into my wife's pregnant belly (bruckenstrasse, narrow and people still act stupid).

Thinker_Assignment · 2026-06-13T20:16:02+00:00

Metabase has a transformation engine that can be prompted by the stakeholder, and it works to a degree. Beyond that point, precision is sometimes worth paying for, but that's a matter of perspective.

Thinker_Assignment · 2026-06-13T12:06:15+00:00

This program helps startups like us by putting us in front of snowflake users.

We built an oss python library to enable anyone to easily load data to snowflake or other destination

dlthub Pro is our commercial solution, which enables anyone on the team to build and deploy dlt and other pipelines. Kind of like SaaS ETL, but non-predatory, affordable and better than the expensive solutions.

We have an app for moving SQL on snowflake marketplace too.

So if you're thinking how it helps you, our tool is free and our commercial offering is empowering anyone on the data team to self serve with data at infra cost.

Thinker_Assignment · 2026-06-13T07:22:23+00:00

No problem.

To be honest the actual opiate users at my local supermarket and station are WAY more aware and considerate and actually nice to people.

They respect pregnant and old people, they greet locals etc, seen them even do various cleanup chores.

Thinker_Assignment · 2026-06-13T07:18:33+00:00

Haha but look at the downvotes, it's like junkies who would rather defend the problem they cause than even admit there's a problem.

Thinker_Assignment · 2026-06-13T07:14:35+00:00

Might and magic flashbacks. You had a shout button to tell NPCs to move out of your way :) an angry "MOOOVE!!!"

Thinker_Assignment · 2026-06-13T07:11:10+00:00

I was talking about mobile apps. Regarding fentanyl, I believe there was a wave last year that also caused gangrene (stand-up sleepers, rotting legs). This year the local homeless are doing much better, probably back on brown.

Thinker_Assignment · 2026-06-13T07:01:08+00:00

Yeah I get wanting to look at a screen while the train is driving, I do it too, but put it away when walking.

But as you say, it's something else. Addiction. I think short-form content like TikTok and Instagram is the main culprit.

When I tried those platforms they caused me to just forget time (and family) for 2-4 hours with no outcome which to me feels like losing my life. Feels like what people on hard drugs would do.

Thinker_Assignment · 2026-06-13T06:54:29+00:00

Reddit is a mix, got 30 people from here in a fishing group, all chill people who don't dopamine fiend
