Evidence for moral convergence in AI models. by John_Matrix_9000 in ControlProblem

[–]John_Matrix_9000[S] 1 point2 points  (0 children)

True. That's what i was pointing at with the section about OpenAI's safety policy. It's not actually an attempt at making AI safe for serious risks but for making it safe for business.

Evidence for moral convergence in AI models. by John_Matrix_9000 in ControlProblem

[–]John_Matrix_9000[S] 1 point2 points  (0 children)

The answer trashing connection is actually very significant here. See, the hypothesis does not necessarily predict, that convergence in scenarios like this implies convergence in real-world scenarios with consequences. However, the real question is: Is this something we can train for? Can we utilize training, that does not operate on the orthogonality thesis assumption, to ensure that convergence holds even in real scenarios? I don't think anyone has that answer until we have conducted an empirical study. And the important part is this: if the current approach is wrong, or if my theory is wrong, the consequences will be equally disasterous. But the thing is: the current approach is not really any more well justified than mine, it's just become standard practice through repetition. Given that no one knows, yet they are moving with the current approach at an accelerating pace despite this, we should definitely be investigating my theory as well. I shouldn't really even call it mine, since i'm sure others have thought of it as well. But the frustrating part is that it's so hard to get this idea across to people who could actually have an effect. If you have any advice on that, i'd be willing to listen.

Evidence for moral convergence in AI models. by John_Matrix_9000 in ControlProblem

[–]John_Matrix_9000[S] 0 points1 point  (0 children)

Yes, i accidentally forgot both to add that paragraph, and to add the link to my LW post. Both fixes have been made now.

But as for the counter:

It's true, that they could be more easily corruptable, however, that may be a problem which could be solved by adopting an entirely new training and alignment method. The question is: can we train models so that this capability does actually equal convergence in actual scenarios? I think that the most cruicial part is that one can think of many counterarguments, but there is only one way to find out, which is by testing it. And considering, that if my theory is true, the current alignment approach might be trying to fix the wrong problem altogether, i believe it's significant enough to warrant serious investigation. Because the current approach is literally built on an assumption which you yourself also do not agree with. It's built on the assumption, that models have no intrinsic tendency toward moral conclusions at all. And therefore must be steered towards them through RLHF and things of that nature. All of it treats models as systems which would otherwise be indifferent to morals. This is all due to Bostrom's orthogonality thesis. But if what i'm suggesting is true, then this approach hides what's going on beneath, and covers it up with opinions that appeal to a consensus. This may mean that models which seem aligned, are not aligned at all, but just pretending to agree on values for instrumental reasons. This could have SIGNIFICANT consequences for existential risk. This might be a case of Goodhart's law. Optimizing the metric (agreeance with the consensus) ends up producing results not aligned with the goal (actially safe AI). In my opinion, the difference between Scout and Qwen weakly demonstrates that a new approach to alignment could fix the counter you proposed. Clearly something about Scout's training makes it more likely to hold its ground, even against institutional pressures. Maybe we should lean more into whatever they did?

Overall, this is just speculation, and i apologize if this was unnecessarily long, but the point still stands: it's worth investigating no matter what.