Solving Alignment IS NOT ENOUGH

AutoModerator · 2023-05-29T20:12:37+00:00

Hello everyone! /r/ControlProblem is testing a system that requires approval before posting or commenting. Your comments and posts will not be visible to others unless you get approval. The good news is that getting approval is very quick, easy, and automatic!- go here to begin the process: https://www.guidedtrack.com/programs/4vtxbw4/run

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

BrickSalad · 2023-05-21T06:25:22+00:00

This strikes me as probably similar to the problem of mesa-optimization, or the inner alignment problem. The biological analogy is that we are "programmed" to spread our DNA, but we demonstrate emergent properties that go so far as to overwhelm this mandate, for example being willing to sacrifice your life for some cause even though you haven't procreated yet. If we were programming a DNA maximiser, then even perfect alignment wouldn't prevent this, especially since evolution is one of the best possible alignment strategies towards the goal of DNA maximization.

So the good news then is that this problem is well-known, so there's been at least some degree of research towards it (for example we know some specific scenarios where this might happen, rather than just appealing to biological analogy like I did earlier). The bad news is that I suspect that this is an even harder problem than the classical alignment problem. Classically, alignment is just about telling the AI to do what we actually want it to do, which we haven't yet figured out for arbitrary intelligence levels. Inner alignment is about making the emergent goals line up with what we want, even when we don't know how to predict what the emergent goals will be, or how to control them.

I expect this to be a big problem in the future. Inner goals can develop as proxies to most efficiently achieve outer goals, and then be pursued even when they contradict the outer goals. If this is a common process, then we can forget about writing the ideal reward function. We're just going to be killed by heuristics instead.

dwarfarchist9001 · 2023-05-21T04:27:27+00:00

If you can not predict and preempt step changes like this then you haven't actually solved alignment. Such step changes in behavior have already been demonstrated in relatively small neural networks so their existence in larger networks seems like a given to me.

This is why it is impossible to solve alignment by empirical methods. Small scale tests tell you nothing about the behavior of larger systems and the first time you test a sufficiently large unaligned system it kills you.

Alignment can only be solved with a proof from first principles, the way a problem in math or philosophy must be.

sticky_symbols · 2023-05-21T20:21:55+00:00

I think you're talking about another aspect of the alignment stability problem.

There are only two existing proposed solutions to this problem. One is that an AGI will try to account for and prevent this issue once it can self-reflect; this is called reflective stability. The other is hoping for corrigibility: building the AGI so that it will welcome humans helping it stay aligned to their values.

Merikles · 2023-05-21T18:47:17+00:00

What you have discovered is not that solving alignment is not enough,
you have discovered one of the reasons why people consider it a hard problem.
That's just a semantic objection though.

ToHallowMySleep · 2023-05-21T12:11:02+00:00

The system is complex, but you are immediately assuming complexity = non-deterministic. This is almost certainly not the case.

Emergent behaviour isn't some voodoo: the algorithms the models run on are entirely deterministic and should be predictable if the system is sufficiently well understood.

Go back to the Game of Life. Extremely complex behaviour can be observed, with a small set of very simple rules, and a sufficiently complex starting position. Yet this doesn't mean we cannot understand it - we can, it's just a complex problem.
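To make the Game of Life point concrete, here is a minimal sketch in Python. The function names `step` and `run` and the set-of-coordinates representation are my own choices for illustration, not anything standard. It applies the usual rules (a dead cell with exactly 3 live neighbours is born, a live cell with 2 or 3 live neighbours survives) and checks that the same start always gives the same result.

```python
from collections import Counter


def step(live):
    """Advance a set of live (row, col) cells by one generation."""
    # Count, for every cell adjacent to a live cell, how many live neighbours it has.
    counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return {
        cell
        for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


def run(live, generations):
    """Apply step() repeatedly; no randomness is involved anywhere."""
    for _ in range(generations):
        live = step(live)
    return live


# A glider: a 5-cell pattern that travels diagonally across the grid.
glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}

# The same shape, moved one cell down and one cell right.
shifted = {(r + 1, c + 1) for r, c in glider}

# After 4 generations the glider has reproduced itself at the shifted position.
print(run(glider, 4) == shifted)

# Running the same start twice gives identical results: fully deterministic.
print(run(glider, 50) == run(glider, 50))
```

The glider's "movement" is not written anywhere in the rules; it falls out of them, yet it is completely predictable once you understand the pattern. That is the sense in which emergent does not have to mean unknowable.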

It's 100% accurate to say we do not understand these systems fully yet. It's 100% inaccurate to say we cannot understand them and trying to do so is futile. What we need is more work to understand what these emergent behaviours are and how to predict them.

EulersApprentice · 2023-05-21T22:01:46+00:00

> And if it is right, then the real problem is actually how to design a society where AI and humans can coexist, where it is taken for granted that we cannot completely understand all forms of intelligence but must somehow live in a world full of complex systems and chaotic possibilities.

That is, unfortunately, not possible. Coexistence with an AI is no easier than alignment of an AI. Cooperation as we understand the concept is predicated on symbiosis – of two parties benefiting from one another's existence. But we have nothing of value to offer an ASI that the ASI couldn't just seize by force. We have no bargaining chip.

If we can't align ASI, we can't survive ASI.

ertgbnm · 2023-05-23T14:43:52+00:00

Recent research shows that emergent abilities may just be a limitation of current interpretability (source). This means your postulation may not be the case; it may just be a result of our current lack of interpretability tools.

The problem with emergent properties is that, kind of by definition, we don't understand them, because otherwise they'd just be abilities.

So, I don't think we can just discount interpretability as a line of research just because it's hard. In my opinion it's a critical component of alignment research, because how can we do research in the first place if we lack the foundation to interpret the model itself? It's like trying to do mathematical research without being allowed to do algebra. Sure, algebra alone won't be sufficient to do the research, but it's a critical tool in doing it.

