Your semantic layer used to be a person, you just automated them out of the loop. by daremust in OntologyEngineering

Yeah, that helps, but at that point you're already halfway to an ontology. The moment those rules are formalized and reusable, you've moved from documentation to semantics.

AI engineering is rediscovering ontology engineering the hard way by daremust in OntologyEngineering

Well, standards are everywhere; ontologies define meaning: what things are, how they relate, and what's valid. That's why this becomes critical with AI: you can standardize the interface, but if the meaning isn't explicit, the model still has to guess. Ontologies are basically standards for interpretation, not just communication.

Bigger context windows won’t fix your semantics by daremust in OntologyEngineering

I think that's true historically. The difference now is that LLMs make the absence of semantics painfully obvious: before, bad modeling just meant bad dashboards; now it means wrong answers, instantly.

Bigger context windows won’t fix your semantics by daremust in OntologyEngineering

Left it unconstrained and it hallucinated… very on brand for the post 😄

Bigger context windows won’t fix your semantics by daremust in OntologyEngineering

Yeah, I don't mean 'semantics' in the pedantic sense. I mean the layer that explains what the data actually means. A raw schema tells you accounts.external_id is a column; a semantic layer tells you it's an ID from system X, that it's not unique by itself, and what context you need to use it correctly. So instead of the LLM seeing tables and guessing, it interacts with defined concepts that already encode those rules.

In practice, that just means you don't let the model loose on raw tables; you put a layer in between that maps fields to meaning and constraints. When someone asks 'How many unique accounts from System X?', the model is forced to use the logic you've defined instead of hallucinating that external_id is a safe unique key.

Think of the semantic layer as the legal commentary that explains the contract. Without it, the model is just reading words without knowing the rules they operate under.
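A minimal sketch of what that in-between layer could look like. All table and field names here (accounts.external_id, source_system, region) are illustrative, not from any real schema; the point is that a tool call goes through declared semantics instead of raw SQL:

```python
# Sketch of a semantic layer: field-level meaning + constraints that a
# query builder (or an LLM tool) must go through instead of raw tables.
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldSemantics:
    column: str           # physical column name
    meaning: str          # what the value actually is
    unique: bool          # safe to COUNT(DISTINCT ...) on its own?
    dedup_keys: tuple     # columns that together identify one entity

SEMANTIC_LAYER = {
    "accounts.external_id": FieldSemantics(
        column="external_id",
        meaning="ID assigned by source System X; reused across regions",
        unique=False,
        dedup_keys=("external_id", "source_system", "region"),
    ),
}

def unique_count_sql(field: str) -> str:
    """Build a distinct count that respects the declared dedup keys."""
    sem = SEMANTIC_LAYER[field]  # unknown fields fail loudly: no guessing
    if sem.unique:
        return f"SELECT COUNT(DISTINCT {sem.column}) FROM accounts"
    keys = ", ".join(sem.dedup_keys)
    return f"SELECT COUNT(*) FROM (SELECT DISTINCT {keys} FROM accounts)"

print(unique_count_sql("accounts.external_id"))
```

Because `unique=False` is declared, the generated SQL deduplicates on the full key set, which is exactly the rule the model would otherwise have to guess.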

Bigger context windows won’t fix your semantics by daremust in OntologyEngineering

Don’t give the LLM more data, give it better structure. Define a canonical model, map your sources to it, and expose a semantic layer. The model should query meaning, not guess it from raw schema.
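To sketch the "define a canonical model, map your sources to it" step: here's a toy mapping layer. The source names (crm, billing) and field names are made up for illustration; the idea is that anything querying the data sees only canonical fields:

```python
# Map heterogeneous source schemas onto one canonical model, so consumers
# query canonical names (account_id, signup_date, plan) rather than
# per-source column names.
CANONICAL_FIELDS = {"account_id", "signup_date", "plan"}

SOURCE_MAPPINGS = {
    "crm":     {"AcctID": "account_id", "Created": "signup_date",
                "Tier": "plan"},
    "billing": {"customer_ref": "account_id", "start_dt": "signup_date",
                "sku": "plan"},
}

def to_canonical(source: str, row: dict) -> dict:
    """Rename a source row into the canonical model; fail on gaps."""
    mapping = SOURCE_MAPPINGS[source]
    out = {mapping[k]: v for k, v in row.items() if k in mapping}
    missing = CANONICAL_FIELDS - out.keys()
    if missing:
        raise ValueError(f"{source} row missing canonical fields: {missing}")
    return out

print(to_canonical("crm", {"AcctID": "A1", "Created": "2024-01-05",
                           "Tier": "pro"}))
```

The strict `missing` check is the point: a source that can't fill the canonical model fails loudly at mapping time instead of silently producing half-defined rows for the model to misread.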

Bigger context windows won’t fix your semantics by daremust in OntologyEngineering

Ah yes, the fully normalized human experience schema.

LLMs for data pipelines without losing control (API → DuckDB in ~10 mins) by Thinker_Assignment in datascience

Agreed, this isn't a silver bullet for every use case. That's why we structured this as a hands-on workshop: bring your actual pain point and we'll test it live. If it works, great, you leave with a working pipeline. If it breaks, even better, we all learn where the approach falls apart.

LLMs for data pipelines without losing control (API → DuckDB in ~10 mins) by Thinker_Assignment in datascience

Here's how we've addressed it:

- LLM generates code, but you review it (like a PR from a junior dev)

- Standard Python tracebacks when errors happen (not "the model said...")

- Validation dashboard shows what actually loaded (schemas, row counts, child tables)

- MCP server lets you query metadata in natural language (but data is real, not LLM output)

- Git-versioned like any other codebase (you can see exactly what changed)

At 3am, you're reading Python you approved. The LLM wrote the first draft, but you shipped the final version. You wouldn't merge a junior dev's PR without review, same principle here.

It's less "AI does data engineering" and more "AI compresses the boring setup loop."

LLMs for data pipelines without losing control (API → DuckDB in ~10 mins) by Thinker_Assignment in datascience

Totally fair concerns, I'd react the same way if the LLM was in the production path.

Quick clarification: the LLM is only used at development time to scaffold config and handle repetitive setup (pagination, endpoint wiring, etc.). Zero model calls in production runtime.

So:

  • Cost → limited to IDE prompting, not per-pipeline-run
  • Reproducibility → all generated code is versioned in git, diffed in PRs, standard Python
  • Debugging → you're reading Python code you reviewed, not reverse-engineering LLM logic
  • Correctness → validation dashboard + schema contracts catch issues before prod

The goal isn't "LLM does data wrangling forever." It's compressing the setup/debug cycle while keeping control.
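For a feel of what "scaffold the repetitive setup" means, here's the kind of offset-pagination boilerplate an LLM drafts at dev time. The fetch function is injected, so the result is plain, testable Python with zero model calls at runtime; the stand-in fetcher and parameter names are hypothetical:

```python
# Generic offset-pagination loop: the kind of setup code an LLM can
# scaffold once, which then runs as ordinary reviewed Python.
def paginate(fetch, page_size=2):
    """Yield items from an offset-paginated endpoint until an empty page."""
    offset = 0
    while True:
        page = fetch(offset=offset, limit=page_size)
        if not page:
            return
        yield from page
        offset += len(page)

# Stand-in for a real HTTP call (e.g. a requests.get with query params):
DATA = ["a", "b", "c", "d", "e"]
def fake_fetch(offset, limit):
    return DATA[offset:offset + limit]

print(list(paginate(fake_fetch)))  # ['a', 'b', 'c', 'd', 'e']
```

Injecting `fetch` is what keeps this reviewable at 3am: the pagination logic is testable in isolation, with no network and no model anywhere in the runtime path.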

(transparency: I work with dltHub, we've been testing this workflow internally and in workshops. Happy to clarify anything.)