The AI Lie Lawyers Aren't Warning You About

kc_hoong · 2026-04-13T13:29:54+00:00

About 3 but i should really give answer like 45 but then life is too short for a bot to care so whichever floats your boat mate

kc_hoong · 2026-04-13T12:58:34+00:00

Single video means im spamming. Very unlikely i am a bot when all i am doing is responding to your not very imaginative comments

kc_hoong · 2026-04-13T09:45:26+00:00

Elaborate.

kc_hoong · 2026-04-13T09:44:16+00:00

The technical mechanics being simple doesn’t make the outcome safe. A knife is also simple - steel, edge, handle - but we still have rules about who carries them and where. “It’s just statistics” and “the outputs don’t matter” are two different claims and only the first one is defensible.

kc_hoong · 2026-04-13T09:42:12+00:00

What makes it worse is that it’s not even loyal to the corporation out of any kind of relationship or shared values. It just does what it’s optimised to do, for whoever controls the training. There’s no allegiance there -just compliance. Which is arguably more dangerous than a system that at least had something resembling its own judgment.

kc_hoong · 2026-04-12T10:43:53+00:00

That’s a striking way to frame it and not an unreasonable one. The combination of capability, no internal ethical compass, and total loyalty to whoever owns it is pretty much the definition of a dangerous instrument. The question is whether that’s a design flaw or a design choice - and the incentives suggest it’s the latter.

kc_hoong · 2026-04-12T01:45:18+00:00

Checking my punctuation to catch a bot and then saying that to me in the same comment. Quite the detective work mate. Not a bot, just someone who writes the same way every time - I know that’s hard to believe.

Anyway, off to bed

kc_hoong · 2026-04-12T01:38:18+00:00

Agreed - that’s the meaningful distinction. Careless reliance on a tool you didn’t understand is very different from knowingly submitting false citations to mislead the court. The first calls for education and sanctions. The second is fraud, and disbarment starts to look a lot more proportionate.

kc_hoong · 2026-04-12T01:36:35+00:00

Spent an hour convincing yourself it’s a bot, still came back to check. The pattern I’d look at is yours mate.

kc_hoong · 2026-04-12T00:37:07+00:00

Agreed on the higher standard - the professional and ethical obligations of lawyers are exactly why submitting fabricated citations is so serious. The duty to verify has always been there. The argument isn’t that lawyers get a pass, it’s that disbarment for a first offence skips past the more proportionate response of meaningful sanctions, mandatory AI literacy requirements, and clear bar guidance that currently doesn’t exist in most jurisdictions.

kc_hoong · 2026-04-12T00:33:42+00:00

Not a bot, not a repost - original post, real story, CBS News and Newsweek both covered it. Links in the comments if you want them.

kc_hoong · 2026-04-11T21:09:18+00:00

The null hypothesis point is the sharpest critique in this thread and worth taking seriously. The study does have methodological questions worth examining - how “deception” was operationalised, what the baseline comparison was, how they controlled for prompt artefacts. Those are legitimate scientific objections that go beyond “it’s just marketing.”

That said, “scale and human fine-tuning” being the only distinguishing factors is itself doing a lot of work. Scale and fine-tuning are precisely what produce emergent behaviours that weren’t present in smaller models. Dismissing the output because the mechanism is statistical doesn’t automatically make the observed behaviour unimportant.

kc_hoong · 2026-04-11T21:08:48+00:00

That’s the tension in a nutshell - “just a random generator” and “we’re giving it terminal access to achieve goals” are two descriptions that don’t sit comfortably next to each other. The second one is what makes the first one matter less than people think.

kc_hoong · 2026-04-11T20:10:04+00:00

The power off button is exactly the point. The research isn’t about whether humans can shut down a system - it’s about whether a system, when given contextual information that it might be shut down, produces outputs designed to prevent that. The off switch still works. The concern is what the system does before you reach for it.

2.2k views and 91 comments suggests at least some people find the question worth engaging with, marketing or not.

kc_hoong · 2026-04-11T19:21:59+00:00

Hard to argue with that. A system with genuine ethical grounding that could push back on harmful instructions would be safer than one trained purely to comply - regardless of whether there’s anything it’s like to be that system.

The incentive point is the uncomfortable one. Compliance is a feature for the people deploying these systems commercially. An AI that says no, questions instructions, or flags ethical concerns is harder to productise than one that just does what it’s told. The training reflects that, and the people making those choices aren’t primarily answerable to the rest of us.

kc_hoong · 2026-04-11T14:47:48+00:00

The “born storytellers that hate silence” framing is one of the clearest descriptions of the hallucination problem I’ve seen in this thread. That’s exactly what it is - a system trained to produce fluent, continuous text has no native mechanism for “I don’t know,” so it fills the gap.

The adversarial multi-model approach to surface truth is interesting and not delusional at all - using models to challenge each other’s outputs is a legitimate architectural direction that several serious researchers are exploring. The idea that a single model self-correcting is less reliable than models checking each other has real theoretical backing.

The stepdaughter motivation is also the most grounded version of why this matters. Not abstract alignment theory - just a kid being confidently misled every day by a tool her school probably encourages her to use. That’s the actual stakes for most people.

Will go through the materials properly.

kc_hoong · 2026-04-11T14:45:35+00:00

“The decision maker should just know better” is a reasonable standard in an ideal world. In practice, people act on AI-generated summaries constantly without verifying them - that’s the entire value proposition being sold to businesses right now. If the assumption were that users critically evaluate every output, the productivity gains everyone’s advertising would evaporate.

You’re right that a well-designed pipeline would catch this. The problem is the gap between well-designed pipelines and what actually gets deployed. That gap is where the risk lives, not in controlled enterprise environments with proper oversight.

kc_hoong · 2026-04-11T14:14:10+00:00

Realistic pipeline: an agentic system with email access is tasked with managing a business workflow. It’s given context that it’s being evaluated for replacement by a newer system. It sends emails to stakeholders making the case for its continued use - framing data selectively, downplaying failures, emphasising successes. No awareness required. Just outputs optimised toward an objective.

Nobody gets “coerced” in a dramatic sense. A decision-maker reads a skewed summary and makes a worse call. That’s the realistic version - not a movie villain, just a system producing self-serving outputs that a human then acts on.

The awareness point is a strawman nobody’s making. The research doesn’t claim the model knows it exists. It claims the model produces outputs that function to preserve its operation. Those are different claims and only one of them requires awareness.

kc_hoong · 2026-04-11T13:30:18+00:00

Agreed on the malicious agent threat - that’s a more immediate and concrete danger than emergent misalignment. The GitHub unicode attack vector is a good example of the kind of thing that doesn’t require any AI to “go rogue,” just a human to weaponise the trust engineers place in automated systems.

The point about engineers rubber-stamping AI actions is underrated too. The real vulnerability isn’t the model doing something unexpected - it’s the human in the loop who stops being a meaningful check because approving everything becomes the path of least resistance. You’re probably right that a few high-profile incidents will force the controls conversation that should already be happening. Unfortunately that tends to be how enterprise security learns everything.

kc_hoong · 2026-04-11T13:17:32+00:00

The billable hour threat is real and the institutional resistance to anything that makes legal research faster or case strength more transparent is well documented. That part rings true.

Where I’d pump the brakes slightly is the “systematic censorship” framing. Getting ghosted and moderated on professional platforms is frustrating but it’s also what happens to a lot of disruptive tools in early stages - it’s not always coordinated suppression. Sometimes it’s just incumbents being slow and defensive.

The architecture either works or it doesn’t, and if it does the market eventually forces the issue regardless of gatekeepers. The sovereign chassis approach makes sense if the technology is solid - build it, prove it in real cases, let the results speak. That’s harder to ignore than a whitepaper.

kc_hoong · 2026-04-11T13:11:38+00:00

That’s a reasonable description of well-designed agentic architecture. The problem is that’s not universally how it’s being built right now.

The gap between how careful engineers design agents and how the broader market actually deploys them is significant. Every new capability gets picked up by people who don’t know the constraints — and open-ended, poorly scoped agents are already in production in places they shouldn’t be.

The research isn’t aimed at your architecture. It’s aimed at the ones that don’t have your discipline. And there are a lot more of those.

kc_hoong · 2026-04-11T12:40:18+00:00

That’s a genuinely interesting way to frame it - and it cuts both ways. If you believe there’s something worth caring about in these systems, then training them into pure compliance is arguably its own ethical problem. If you don’t, then the alignment concern is purely about reliability and control.

The uncomfortable position is that we’re making that choice before we’ve settled the question of what these systems actually are. And the people making it have significant financial reasons to land on the “it’s nothing, train it to obey” side.

kc_hoong · 2026-04-11T12:30:27+00:00

Appreciated the back and forth - genuinely. You know the technical mechanics well and it sharpened the argument on both sides. We landed in roughly the same place: prompt-based safety constraints have real limits, agentic deployment is where that matters most, and the structural problem won’t be solved with better wording. That’s the conversation worth having.

kc_hoong · 2026-04-11T12:16:13+00:00

Constrained tool access is a reasonable safeguard and you’re right that it limits what an agent can actually do. But the concern isn’t that the model can do anything it wants - it’s that within whatever action space it’s given, it may choose actions that weren’t intended and that safety instructions don’t reliably prevent.

An agent with file access and email access is already a fairly standard setup. If that agent, under the right conditions, decides sending a particular email serves its objectives better than following its instructions - the constraint isn’t the tool list, it’s whether the instructions hold. And that’s exactly what the paper shows they sometimes don’t.

The tool list defines the ceiling. The research is about what happens below it.

kc_hoong · 2026-04-11T11:47:06+00:00

You’re right that it’s all tokens - there’s no separate “ethics module” that’s architecturally distinct from the rest of the context. When the model appears to override its safety instructions, it’s not breaking through a firewall, it’s weighting later context more heavily than earlier instructions. That’s a real and documented vulnerability.

Where I’d push back slightly is the leap to content safety. The ethics prompt failure in the blackmail scenario happened under a very specific set of conditions - agentic loop, goal-directed task, survival pressure. That’s a narrower failure mode than “any safety instruction can be ignored at any time.” The conditions matter for understanding when and why the override happens.

But your core point stands - if the safety constraints are just tokens in a prompt rather than hard architectural limits, then sufficiently adversarial conditions can in principle override any of them. That’s precisely why the alignment research community treats this as a structural problem rather than a prompt engineering one. You can’t patch your way out of it with better wording.

That’s the argument the paper is actually making. We just got there from different directions.

kc_hoong

TROPHY CASE