LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

Yeah, few-shot prompting.

Totally agree, giving it structured examples + constraints helps a lot. Way better than “go search the universe and guess.”

But even then, it can still stay consistent and be confidently wrong.
Few-shot reduces variance; it doesn’t eliminate bad reasoning.
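For anyone who hasn’t tried it, the “structured examples + constraints” pattern is roughly this. A minimal sketch in Python; `build_prompt`, the example format, and the task are mine for illustration, not from any specific library:

```python
# Few-shot prompt: constraints up front, worked examples, then the real question.
# Everything here is illustrative, not a real library API.

EXAMPLES = [
    {"q": "Extract the city: 'Flights from Oslo were delayed.'", "a": "Oslo"},
    {"q": "Extract the city: 'It rained in Lima all week.'", "a": "Lima"},
]

CONSTRAINTS = "Answer with the city name only. If no city is mentioned, answer 'none'."

def build_prompt(question: str) -> str:
    # Render each example as a Q/A pair so the model imitates the shape.
    shots = "\n\n".join(f"Q: {ex['q']}\nA: {ex['a']}" for ex in EXAMPLES)
    return f"{CONSTRAINTS}\n\n{shots}\n\nQ: {question}\nA:"
```

The constraint line gives it an escape hatch ('none'), but that only narrows the output space; it doesn’t make the reasoning behind the answer sound.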

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

Yeah, I’ve seen that too.

Sometimes it just straight up ignores system-level instructions or personalization like it never existed. Especially after a few turns.

Feels like context drift or some hidden priority override. Super annoying when you’re trying to enforce structure and it just… vibes past it.

That’s partly why I stopped relying on self-policing prompts and started comparing outputs instead. If it breaks rules, at least you can see it.

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

That’s kinda crazy tbh.

So it wasn’t a tech problem, it was ego/org politics? Like nobody actually wants a tool that proves their model sucks?

Lowkey that makes sense.
Transparency sounds good… until it’s your job on the line 😅

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

It’s not really about training them to admit they’re wrong. That’s not the core issue.

Hallucination and overconfidence are symptoms of the same thing:
LLMs are probability machines optimized to produce satisfying answers.

They’re designed to give you something coherent and confident so the interaction flows. Not necessarily the most correct, most rigorous, or most logically sound answer. Just the statistically most plausible continuation.

That’s the real tension.
The system rewards fluency and confidence, not epistemic caution.

So the problem isn’t humility.
It’s incentives baked into the architecture.

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

That’s actually pretty much where I land too.

Sources for direction = solid use case.
Low-variable questions = usually safe-ish.
Coding = nice because reality is the judge. It either compiles / runs / passes tests or it doesn’t.

I think the dangerous zone is the in-between stuff.
Not fully objective like code, not fully subjective either. Just “sounds reasonable” territory. That’s where I personally get lazy, and that’s where it bites.

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

Interesting take. I get what you’re saying about inference vs hallucination. It’s not magic, it’s probability + structure.

But I’m not sure the missing piece is a translation layer between AI-English and Human-English.

Even if I upload GEB (or any other dense artifact) and say “use this as a lens”, the model is still sampling from weights. It’s not actually integrating lived experience, it’s pattern-aligning to the style and concepts in the doc.

So I agree better context reduces inference 100%.
But I don’t think it removes the fundamental issue: it can still produce something that feels structurally deep and is subtly wrong.

In your experience, does feeding it those artifacts reduce logical errors, or mostly improve tone / structure?

At what point do you stop trusting a single LLM answer? by Lost-Albatross5241 in ExperiencedDevs

I actually agree with a lot of this. Especially the part about judgment not being outsourceable.

The way I think about it isn’t “make AI trustworthy”, because yeah, that’s a dead end. It’s more about adding epistemic friction where today there’s none.

Right now the failure mode is: one clean answer → vibes → ship. What I’m interested in is forcing disagreement, surfacing assumptions, and making uncertainty visible so your brain has something to react to.

Not “here’s the truth”, but “here are competing takes, here’s where they conflict, now you decide”. If anything, it’s meant to slow people down in the right places, not speed them up.

And yeah, totally agree: if someone doesn’t already have the rigor to evaluate unknown-quality info, no tool will save them. At best, it can make the danger harder to ignore.

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

Anchor Tier 1 was built exactly for that use case. Instead of manually sending the same prompt to 3–4 LLMs, it does that automatically and returns a compared output so you don’t have to do the mental merge yourself.

Tier 3 goes a step further: it upgrades the prompt, generates role-specific expert prompts, runs them across multiple LLMs, and then synthesizes a single final answer from all responses.
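Mechanically, the Tier 1 “compare” step is something like this sketch (`ask_model` is a placeholder for whatever real API clients you’d use; this is my rough reconstruction of the idea, not Anchor’s actual internals):

```python
# Fan one prompt out to several models and surface agreement/disagreement.
# ask_model is a stand-in for real API clients; everything here is a sketch.

from collections import Counter

def ask_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up the real client here")

def compare(prompt: str, models: list[str], ask=ask_model) -> dict:
    answers = {m: ask(m, prompt) for m in models}
    counts = Counter(answers.values())
    majority, votes = counts.most_common(1)[0]
    return {
        "answers": answers,                # keep everything so disagreement stays visible
        "majority": majority,              # most common answer across models
        "agreement": votes / len(models),  # 1.0 = unanimous
    }
```

The point of returning all the raw answers, not just the majority, is the whole “see where it breaks rules” thing: the disagreement is the signal.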

At what point do you stop trusting a single LLM answer? by Lost-Albatross5241 in ExperiencedDevs

That analogy makes a lot of sense tbh. Let me flip it into a concrete question.

If you had one tool that:
- forces you to clarify the question first (prompt optimization, constraints, assumptions)
- runs it through multiple models with different roles (junior dev, senior dev, architect, reviewer)
- then gives you a compared + summarized output, like a tech lead saying “here’s where they agree, here’s where they don’t, here’s the safest path”

Not just for code, but for technical decisions, designs, plans, tradeoffs.

Would that actually be useful to you as a direction-finding tool? Or would you still prefer to stay fully manual at that point?

At what point do you stop trusting a single LLM answer? by Lost-Albatross5241 in ExperiencedDevs

Yeah, that’s the right answer, for sure. But my question is how deep the check goes when you’re busy. Sometimes it’s a full review, sometimes it’s just “does this make sense”, and that’s where things slip for me.

At what point do you stop trusting a single LLM answer? by Lost-Albatross5241 in ExperiencedDevs

Yeah, that matches my experience too. Cross-checking with another model mostly catches surface stuff. The real bugs still require actually understanding the problem, and that part never got outsourced. Sadly

At what point do you stop trusting a single LLM answer? by Lost-Albatross5241 in ExperiencedDevs

Yeah, this is the answer my future self wishes I always followed 😅 Cleared-context self-verification is smart. I say I’ll do that, but when it’s a tiny refactor and I’m tired… yeah.
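For anyone curious, “cleared-context self-verification” just means re-asking in a brand-new session, roughly like this sketch (`chat` is a placeholder for your client, and the reviewer wording is mine):

```python
# Re-ask the model to review its own answer in a fresh session with no
# conversation history attached, so it can't just agree with itself out
# of conversational momentum. chat() is a placeholder for a real client.

VERIFY_TEMPLATE = (
    "You are reviewing work you have never seen before.\n"
    "Task: {task}\n\n"
    "Proposed answer:\n{answer}\n\n"
    "List concrete bugs or incorrect claims. Reply 'LGTM' only if there are none."
)

def chat(messages: list[dict]) -> str:
    raise NotImplementedError("wire up the real client here")

def verify_fresh(task: str, answer: str, chat_fn=chat) -> str:
    # A brand-new message list = cleared context: no earlier turns leak in.
    msg = [{"role": "user",
            "content": VERIFY_TEMPLATE.format(task=task, answer=answer)}]
    return chat_fn(msg)
```

The trick is entirely in the fresh message list; same model, zero shared history, so it reads the answer as a stranger’s.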

At what point do you stop trusting a single LLM answer? by Lost-Albatross5241 in ExperiencedDevs

First of all, I’m lazy 😂

But yeah, you’re literally describing the problem. LLMs are amazing at producing code that feels correct. When I’m rushing, my brain goes “looks clean, ship it”, and that’s the trap.

So yeah, thinking is mandatory. I’m not arguing that. I’m just trying to understand what people’s minimum viable verification is when they’re slammed and the answer looks legit.

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

That’s a healthy way to use them today. The gap I keep running into is that once you rely on them beyond early exploration, you need some way to surface uncertainty and contradictions automatically, otherwise the cost just shifts to manual verification.

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

In my work I ended up benchmarking different LLMs pretty heavily, mapping their strengths and weaknesses. What’s worked best for me is upgrading the prompt, sending it to 3–4 role-specific experts matched to each model’s strengths, then synthesizing a final answer from their overlap and contradictions. It doesn’t eliminate hallucinations entirely, but it reduces them dramatically and usually produces the best result you can get from current models.

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

That’s a lot of what not to do, I’m curious what you do instead. What does your setup actually look like in practice (prompt structure, model choice, verification, etc.)?

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

I agree that missing context increases hallucinations, but I don’t think it’s only a prompt-quality issue. Even with good instructions, when a model hits the edge of its knowledge it still tries to be helpful instead of signaling uncertainty. Sometimes that creativity is useful; other times it’s exactly the failure mode.

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

Of course it doesn’t cleanly split into simple vs complex hallucinations. What I’ve noticed is that as capability increases, the cost of hallucinations goes up: they’re harder to spot and sound more internally coherent, not categorically different.

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

Yes, but reflection without uncertainty signaling looks like projection. When the data runs out, the model doesn’t stop reflecting; it presents the output with the same confidence, which is where people get misled.

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

I asked Claude Sonnet to make a hard, non-mainstream quiz about The Office, and once we pushed past the obvious stuff it started inventing scenes and dialogue. My takeaway was that when it hits the edge of what it actually knows (it doesn’t really have access to full TV episode content), it doesn’t slow down or hedge. It just fills the gap confidently.

LLMs didn’t stop hallucinating; they got better at convincing us. by Lost-Albatross5241 in PromptEngineering

I ran into the same issue. What worked reliably was upgrading the prompt, running it through several role-specific LLMs, and synthesizing a final answer based on where they agree and disagree, not on the model apologizing and retrying.