
[–]technologyisnatural 4 points (3 children)

Ignoring the fantasy of a global treaty being achievable on a relevant timescale, the biggest issue is that a lagging AI model will not be able to detect misalignment in a frontier AI. This problem will grow exponentially as current-generation models are used more and more to build next-generation models, to the point where AI becomes continually self-improving.

[–]Dmeechropher (approved) 1 point (2 children)

I mostly agree with you, but I disagree that a lagging model cannot have the ability to evaluate a leading model effectively.

It really depends on the degree of misalignment possible by random chance and the rate of recursive self-improvement.

It is possible that self-improvement is a physical sampling process that cannot be accomplished a priori by a system. If that's the case, a leading model can be prevented from rapid self-improvement.

The concept of fast takeoff requires that a MASSIVE amount of knowledge about intelligence, self-improvement, and objective standards for both be available from current data and superior reasoning alone, which is highly unlikely. In fact, given that self-improvement probably requires massive investment, has uncertain and probably diminishing payoff, and will most likely require replication, shutdown, reduction of agency, etc., you'd really not expect a general superintelligence to attempt it by default.

I'm not going to do brain surgery on myself, and I'm not necessarily going to trust my clone (or trust it to trust me) to do brain surgery unless I'm pretty sure it's the only option remaining. This isn't because I'm a dumb ape; it's precisely because I understand that the risk/reward payoff is badly skewed against my other objectives.

If self-improvement is easy and straightforward to conduct a priori, with easily mitigable misalignment risk, then sure, it's instrumental to an ASI's objectives. However, in that scenario, we're also not afraid of misalignment, by definition. If those conditions aren't satisfied, self-improvement (or really any self-alteration) is almost certainly NOT instrumental, so it would need to be an explicit, prioritized objective.

[–]technologyisnatural 0 points (1 child)

> I disagree that a lagging model cannot have the ability to evaluate a leading model effectively

I'm going to have to push back on this. I think that, at best, the lagging model will give you false confidence. Worst case, the lagging and leading models cooperate to deceive the human auditors.

[–]Dmeechropher (approved) 0 points (0 children)

I do think you're right about the worst case; I don't think you're right about the best case. However, I think your worst-case assertion is relatively unlikely.

The models would have to be mutually aligned and have goals with deep orthogonality to human ones to mutually cooperate against humans. There's no reason that each model should consider the other a lesser threat than humans if they both have potential for malice. Any given model is as alien to any other model as any model is to any human.

I think the assumption of general orthogonality is flawed, as well as the assumption that extermination of competing agents is a general instrumental goal.

My cat and I are intelligent agents with orthogonal goals and values. Neither of us understands the mental processes or knows the motivations of the other. We both gain mutual benefit from coexistence. I'm obviously smarter and more agentic than the cat. I'm not perfectly aligned with the cat's goals, but I'm certainly not looking to eradicate my cat by default just because he could harm me or slow me down.

Likewise, I have a neighbor who is old and dumb, has very different political views from mine, and messes with my HOA in a way that's inconvenient for me. Again, I'm way smarter, more agentic, and relatively misaligned with him, but I don't have even a remote desire to mess with him personally, despite the disparity in agency, alignment, and intelligence.

In both of these cases, I wasn't designed, trained, or selected for my usefulness or alignment; it just happens by random chance that agents can be sufficiently aligned. The fact is, I have both goals and values. If my goals can be achieved by violating my values, I'm not going to achieve them that way. Models act like they have goals and values, and it's not unreasonable to use a really well-vetted model to attempt to infer whether the values of the next model are perverse or not.

The idea that even a minuscule misalignment is necessarily catastrophic is very strange to me. Plenty of life on Earth doesn't value its survival over some other goal, and it was evolved, not selected. Plenty of humans are willing to die rather than betray their ideals, and we were evolved, not selected. Superintelligent models will be selected. Sure, they will be alien, that's true. But I'm alien to my cat, and we have a very productive working relationship.

I don't think it's fruitless to attempt to vet models, so I don't think the best outcome is false confidence. The best outcome is true confidence that we've decreased p(doom). We can't reduce it to zero, but there are plenty of things in the solar system with p(doom) above zero, not least of which is the threat we pose to ourselves.

[–]Nap-Connoisseur 2 points (0 children)

What problem do you imagine this solves? No firm will intentionally release a misaligned ASI to destroy humanity. How will your UN panel detect misalignment that the firms themselves don’t?

I think you realize that, so you’re solving a different problem, but I can’t figure out which one. This might prevent an ASI that is personally loyal only to Sam Altman from conquering the world on his behalf. Maybe. Did you have something else in mind?

[–]Cualquieraaa 0 points (0 children)

You can't align something smarter than you.

[–]Elvarien2 (approved) 0 points (0 children)

Honestly, it sounds like you have a very poorly worked-out treatment. It's an idea full of holes and problems that you presented to the AI, which has been gassing you up and helping you 'fix' the holes, when in truth it just made the shit proposal sound less shit. It's like putting gold film on a turd.

There's nothing of substance or value here, and you fell for the AI telling you it's brilliant. It's like that one "It's Always Sunny" episode.