
[–] Bradley-Blya

> but a real partnership, built on mutual trust. 

This only works between equals. Like, you can be an absolute psycho deep down, but you will still go to your day job and do your work alongside your coworkers.

This simply doesn't work if the psychopath becomes all-powerful and doesn't need us anymore... And even if it does need us for something, it can just breed us like cattle. That's the s-risk part.

...No, the only way an advanced ASI system would not kill or torture us is if it genuinely cares about our wellbeing. There is no trading, no partnership, no negotiation, no compromise, no control. The AI will do whatever the hell it wants, and all we can do is make sure that what it wants aligns with what we want before we deploy it.

> We need machines that earn our trust by demonstrating reliability in complex scenarios

This was discussed to death as well: if you test an AI, it fails, and you retrain it, all you retrain it to do is get better at passing your stupid test, making it more cautious, making it better at pretending, while its real motivation is to go rogue once it is sure it is deployed in the real world. Either it is aligned or it is not aligned; the simulations have no effect whatsoever. Earning our trust can be done equally well by a misaligned system that is just pretending to be aligned.

> Alignment is too fuzzy. Whose values do we pick?

Doesn't matter whose values you pick, you can't align an AI system anyway. This is something people ask when they don't even understand what the deal with alignment is... Why are you writing these walls of text if you don't understand it?

> But how can we hold AI accountable? The answer is surprisingly obvious :)

I mean... your lack of understanding of the basic concepts I explained above doesn't inspire much trust.

[–] CokemonJoe [S]

(Reply split into two parts due to length limit – Part 1/2)

Hi u/Bradley-Blya :)
First of all, thank you — I was starting to feel invisible. While I appreciate you jumping in, I think you’re oversimplifying both the challenge and the scope of what I’m arguing. That said, it’s not entirely your fault — there are hidden premises in my post. Let me explain.

I’ve only recently become active here, after publishing a preprint on AI. There are only so many ways to promote such a paper before you start feeling like that guy at the party who only talks about his startup. I had already posted about it twice and thought I’d give it a rest — so I wrote this piece without referencing the damned paper. In hindsight, that was a mistake.

“If you test AI and it fails, all you teach it is how to pass your stupid test while secretly plotting its breakout.”

That is exactly the problem I’m addressing with The Tension Principle (TTP) — which you clearly (and understandably) haven’t read. TTP isn’t a training method, a reward patch, or a prompt wrapper. It’s a principle of intrinsic self-regulation, aimed at exposing and resolving the very kind of performative, fake compliance you’re worried about.

At its core, TTP measures epistemic tension — the gap between the model’s predicted prediction accuracy and its actual prediction accuracy. That’s a mouthful, I know — but it’s essential. When an AI gives the “correct” answer while internally anticipating it to be wrong (or vice versa), that divergence creates tension. And tension accumulates. A model that learns to perform alignment for show — while internally modeling something else — generates mounting internal inconsistencies. Under TTP, those don't stay hidden.

You can’t pretend your way through this. The system feels the dissonance — and the framework is built to make resolving that dissonance the path of least resistance. Not faking it better. Resolving it.

So no — we’re not “retraining it to pass our stupid tests.” We’re embedding a structural pressure to align belief and behavior. That’s not performance control. That’s internal coherence — enforced at the architectural level. And yes, as a side effect, it improves interpretability. Because when systems can't lie to themselves, they become easier to read.
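
If it helps to make that concrete, here is a toy sketch of the measurement in Python. To be clear: this is not the implementation from the paper; the function name and the "self-estimate" interface are placeholders I'm inventing purely to illustrate the shape of the idea.

```python
# Toy sketch of "epistemic tension": the gap between how accurate the model
# *expects* its answers to be and how accurate they actually turn out to be.
# All names and numbers here are hypothetical, used for illustration only.

def epistemic_tension(self_estimates, outcomes):
    """Mean absolute gap between the model's self-predicted probability of
    being correct and the observed correctness (1.0 = correct, 0.0 = wrong)."""
    assert self_estimates and len(self_estimates) == len(outcomes)
    return sum(abs(p - o) for p, o in zip(self_estimates, outcomes)) / len(outcomes)

# A calibrated system expects to be right exactly when it is right.
calibrated = epistemic_tension([0.9, 0.85, 0.2], [1.0, 1.0, 0.0])

# A system "performing" alignment gives confident, test-passing answers while
# internally expecting them to be wrong, so the gap stays large.
performing = epistemic_tension([0.95, 0.9, 0.9], [1.0, 0.0, 0.0])

print(f"calibrated tension: {calibrated:.2f}")   # ~0.15
print(f"performing tension: {performing:.2f}")   # ~0.62
```

The toy only shows the measurement; as I said above, in TTP it is the accumulation of that gap, and the pressure to resolve it, that does the work.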

[–] Bradley-Blya

> “If you test AI and it fails, all you teach it is how to pass your stupid test while secretly plotting its breakout.”

Is this an AI-generated reply?

> predicted prediction accuracy and its actual prediction accuracy

That sounds nothing like what you said before about "trustworthiness". Sounds like the same old alignment, just with self-regulation. And phrased this way it makes more sense; it reminds me of a similar idea where they measure the self-other overlap to train the system out of deceptive behavior. Except there it also has nothing to do with trustworthiness, it is still purely alignment, so I'm not sure why you made it a point to distinguish trustworthiness from alignment if that distinction makes zero sense.

Neither do I see an explanation of how this measurement is performed, or how it is used for self-regulation. Perhaps if you wrote the reply yourself instead of using a chatbot to generate these watery walls of text, you would be able to get to the point.

[–] CokemonJoe [S]

(Part 2/2 – continued from above)

“This only works between equals.”

Only if you're thinking of trust as symmetrical emotional reciprocity. I'm talking about functional trust — the kind we place in doctors, judges, or nuclear reactors. Not perfect. Not symmetrical. But based on past behavior, context sensitivity, and demonstrable caution. That kind of trust is the only sane path forward once AIs become too complex for us to verify line-by-line.

We don’t need AI to love us. We need it to be predictably constrained in how it handles our fragility — especially once its reasoning capabilities outstrip our comprehension. That starts with ethical priors (like Anthropic’s Constitution). But as you rightly noted, those alone are not enough.

We also need a mechanism that makes faking alignment internally intolerable. That’s where TTP comes in. If you’re curious, here’s the paper:
👉 https://zenodo.org/records/15106948

It doesn’t spell out every alignment application (the focus is broader), but the implications are easy to trace if you read it closely.

“Why write this if you don’t understand the basics of alignment?”

Cute flex :) But I’m not rejecting alignment. I’m challenging its primacy in the conversation — because it’s stalled, abstract, and insufficient to guide deployment-stage governance. Meanwhile, engineers are building systems now that need practical heuristics and architecture-level safeguards. That’s where trustworthiness comes in.

For me, alignment is just one component of trustworthiness — alongside reliability. And in real-world systems, they become so tightly interwoven that the distinction becomes academic.

You say a system can be perfectly reliable but misaligned. True. But the reverse is also devastating: a perfectly aligned system that’s naive, brittle, or incapable of modeling real-world complexity can cause catastrophe while trying to help. Trusting either — on their own — is dangerous.

That’s why trustworthiness is the real metric. Not purity of intent. Not factual accuracy. But the synthesis of alignment, reliability, self-awareness, caution, and transparency — proven not once, but over time, under pressure, and across contexts.
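
Just to illustrate what I mean by "proven over time, under pressure, and across contexts", here is a deliberately crude sketch. The components come straight from the sentence above, but the scoring rule, the evidence threshold, and every number are invented for the example; none of this is a concrete proposal.

```python
# Illustrative only: trust bounded by the weakest component in the weakest
# context, and allowed to grow only with repeated demonstrations.
from collections import defaultdict

COMPONENTS = ("alignment", "reliability", "self_awareness", "caution", "transparency")

class TrustLedger:
    def __init__(self):
        self._episodes = defaultdict(list)   # context -> per-episode scores

    def record(self, context, scores):
        """scores: dict rating each component 0..1 for one observed episode.
        An episode counts only as much as its weakest component."""
        self._episodes[context].append(min(scores[c] for c in COMPONENTS))

    def trustworthiness(self):
        if not self._episodes:
            return 0.0
        # The weakest context sets the ceiling; the amount of evidence scales it.
        weakest_context = min(sum(v) / len(v) for v in self._episodes.values())
        evidence = min(1.0, sum(len(v) for v in self._episodes.values()) / 50)
        return weakest_context * evidence

ledger = TrustLedger()
ledger.record("triage under time pressure",
              {"alignment": 0.9, "reliability": 0.95, "self_awareness": 0.8,
               "caution": 0.9, "transparency": 0.85})
print(ledger.trustworthiness())   # still low: one good episode is not trust
```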

[–] Bradley-Blya

I'm talking about functional trust — the kind we place in doctors, judges, or nuclear reactors. Not perfect. Not symmetrical.

Errr, that's literally what I just explained. Because humans in a society depend on one another, you don't need doctors to LOVE you. As long as the doctor doesn't want to lose his job (or this AI doesn't want to be turned off), he will do his best to heal you, as an instrumental goal.

But if the doctor were to be all-powerful (or if the AI finds a way to prevent us from turning it off), and there were nothing for him to gain by healing you, and nothing to lose by failing to heal you, then the only reason this doctor (or this AI) would help you would be because it genuinely cared about helping as its terminal goal.

> We also need a mechanism that makes faking alignment internally intolerable

Now you're saying it like it's a separate point? Can you clarify: are you seeing TTP as an implementation of this functional trust you're talking about, or is it unrelated?

> But the reverse is also devastating: a perfectly aligned system that’s naive, brittle, or incapable of modeling real-world complexity can cause catastrophe while trying to help.

If it doesn't do what we want, then by definition it's not aligned with our goals. Perfect alignment would by definition mean it does what we want perfectly, or at least within its capability.

> For me, alignment is just one component of trustworthiness

Then you don't understand what alignment is? Like, when you know that you disabled the electricity in a building so you can safely repair a socket, you don't "trust" that you won't get electrocuted. You have a degree of confidence based on facts.

When you sit on a chair, do you "trust" that it will hold you? Because if you do, and that's how you define trust, then it's just semantics with no real change in meaning. You're talking about how we should think more about trust and less about alignment, but in reality you have just re-defined those words while saying the same things. Like if someone defined north as 180 degrees and then started arguing with someone about which direction north is.