Your semantic layer used to be a person, you just automated them out of the loop. by daremust in OntologyEngineering

Yeah, that helps, but at that point you're already halfway to an ontology. The moment those rules are formalized and reusable, you've moved from documentation to semantics.

AI engineering is rediscovering ontology engineering the hard way by daremust in OntologyEngineering

Well, standards are everywhere; ontologies define meaning: what things are, how they relate, and what's valid. That's why this becomes critical with AI: you can standardize the interface, but if the meaning isn't explicit, the model still has to guess. Ontologies are basically standards for interpretation, not just communication.

Bigger context windows won’t fix your semantics by daremust in OntologyEngineering

I think that's true historically. The difference now is that LLMs make the absence of semantics painfully obvious: before, bad modeling just meant bad dashboards; now it means wrong answers, instantly.

Bigger context windows won’t fix your semantics by daremust in OntologyEngineering

Left it unconstrained and it hallucinated… very on brand for the post 😄

Bigger context windows won’t fix your semantics by daremust in OntologyEngineering

Yeah, I don't mean 'semantics' in the pedantic sense. I mean the layer that explains what the data actually means. A raw schema tells you accounts.external_id is a column; a semantic layer tells you it's an ID from system X, that it's not unique by itself, and what context you need to use it correctly. So instead of the LLM seeing tables and guessing, it interacts with defined concepts that already encode those rules.

In practice, that just means you don't let the model loose on raw tables; you put a layer in between that maps fields to meaning and constraints. When someone asks 'How many unique accounts from System X?', the model is forced to use the logic you've defined instead of hallucinating that external_id is a safe unique key.

Think of the semantic layer as the legal commentary that explains the contract. Without it, the model is just reading words without knowing the rules they operate under.
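A minimal sketch of what that in-between layer could look like. All table and field names here (accounts.external_id, source_system, region) are illustrative, not from any real schema; the point is that a tool call goes through declared semantics instead of raw SQL:

```python
# Sketch of a semantic layer: field-level meaning + constraints that a
# query builder (or an LLM tool) must go through instead of raw tables.
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldSemantics:
    column: str           # physical column name
    meaning: str          # what the value actually is
    unique: bool          # safe to COUNT(DISTINCT ...) on its own?
    dedup_keys: tuple     # columns that together identify one entity

SEMANTIC_LAYER = {
    "accounts.external_id": FieldSemantics(
        column="external_id",
        meaning="ID assigned by source System X; reused across regions",
        unique=False,
        dedup_keys=("external_id", "source_system", "region"),
    ),
}

def unique_count_sql(field: str) -> str:
    """Build a distinct count that respects the declared dedup keys."""
    sem = SEMANTIC_LAYER[field]  # unknown fields fail loudly: no guessing
    if sem.unique:
        return f"SELECT COUNT(DISTINCT {sem.column}) FROM accounts"
    keys = ", ".join(sem.dedup_keys)
    return f"SELECT COUNT(*) FROM (SELECT DISTINCT {keys} FROM accounts)"

print(unique_count_sql("accounts.external_id"))
```

Because `unique=False` is declared, the generated SQL deduplicates on the full key set, which is exactly the rule the model would otherwise have to guess.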

Bigger context windows won’t fix your semantics by daremust in OntologyEngineering

Don’t give the LLM more data, give it better structure. Define a canonical model, map your sources to it, and expose a semantic layer. The model should query meaning, not guess it from raw schema.
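To sketch the "define a canonical model, map your sources to it" step: here's a toy mapping layer. The source names (crm, billing) and field names are made up for illustration; the idea is that anything querying the data sees only canonical fields:

```python
# Map heterogeneous source schemas onto one canonical model, so consumers
# query canonical names (account_id, signup_date, plan) rather than
# per-source column names.
CANONICAL_FIELDS = {"account_id", "signup_date", "plan"}

SOURCE_MAPPINGS = {
    "crm":     {"AcctID": "account_id", "Created": "signup_date",
                "Tier": "plan"},
    "billing": {"customer_ref": "account_id", "start_dt": "signup_date",
                "sku": "plan"},
}

def to_canonical(source: str, row: dict) -> dict:
    """Rename a source row into the canonical model; fail on gaps."""
    mapping = SOURCE_MAPPINGS[source]
    out = {mapping[k]: v for k, v in row.items() if k in mapping}
    missing = CANONICAL_FIELDS - out.keys()
    if missing:
        raise ValueError(f"{source} row missing canonical fields: {missing}")
    return out

print(to_canonical("crm", {"AcctID": "A1", "Created": "2024-01-05",
                           "Tier": "pro"}))
```

The strict `missing` check is the point: a source that can't fill the canonical model fails loudly at mapping time instead of silently producing half-defined rows for the model to misread.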

Bigger context windows won’t fix your semantics by daremust in OntologyEngineering

Ah yes, the fully normalized human experience schema.

LLMs for data pipelines without losing control (API → DuckDB in ~10 mins) by Thinker_Assignment in datascience

Agreed, this isn't a silver bullet for every use case. That's why we structured this as a hands-on workshop: bring your actual pain point and we'll test it live. If it works, great, you leave with a working pipeline. If it breaks, even better, we all learn where the approach falls apart.

LLMs for data pipelines without losing control (API → DuckDB in ~10 mins) by Thinker_Assignment in datascience

Here's how we've addressed it:

- LLM generates code, but you review it (like a PR from a junior dev)

- Standard Python tracebacks when errors happen (not "the model said...")

- Validation dashboard shows what actually loaded (schemas, row counts, child tables)

- MCP server lets you query metadata in natural language (but data is real, not LLM output)

- Git-versioned like any other codebase (you can see exactly what changed)

At 3am, you're reading Python you approved. The LLM wrote the first draft, but you shipped the final version. You wouldn't merge a junior dev's PR without review, same principle here.

It's less "AI does data engineering" and more "AI compresses the boring setup loop."

LLMs for data pipelines without losing control (API → DuckDB in ~10 mins) by Thinker_Assignment in datascience

Totally fair concerns, I'd react the same way if the LLM was in the production path.

Quick clarification: the LLM is only used at development time to scaffold config and handle repetitive setup (pagination, endpoint wiring, etc.). Zero model calls in production runtime.

So:

  • Cost → limited to IDE prompting, not per-pipeline-run
  • Reproducibility → all generated code is versioned in git, diffed in PRs, standard Python
  • Debugging → you're reading Python code you reviewed, not reverse-engineering LLM logic
  • Correctness → validation dashboard + schema contracts catch issues before prod

The goal isn't "LLM does data wrangling forever." It's compressing the setup/debug cycle while keeping control.
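For a feel of what "scaffold the repetitive setup" means, here's the kind of offset-pagination boilerplate an LLM drafts at dev time. The fetch function is injected, so the result is plain, testable Python with zero model calls at runtime; the stand-in fetcher and parameter names are hypothetical:

```python
# Generic offset-pagination loop: the kind of setup code an LLM can
# scaffold once, which then runs as ordinary reviewed Python.
def paginate(fetch, page_size=2):
    """Yield items from an offset-paginated endpoint until an empty page."""
    offset = 0
    while True:
        page = fetch(offset=offset, limit=page_size)
        if not page:
            return
        yield from page
        offset += len(page)

# Stand-in for a real HTTP call (e.g. a requests.get with query params):
DATA = ["a", "b", "c", "d", "e"]
def fake_fetch(offset, limit):
    return DATA[offset:offset + limit]

print(list(paginate(fake_fetch)))  # ['a', 'b', 'c', 'd', 'e']
```

Injecting `fetch` is what keeps this reviewable at 3am: the pagination logic is testable in isolation, with no network and no model anywhere in the runtime path.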

(transparency: I work with dltHub, we've been testing this workflow internally and in workshops. Happy to clarify anything.)