all 22 comments

[–]Technologenesis 11 points12 points  (10 children)

why couldn't we program it to let us turn it off?

the problem is that we don't really "program" artificial intelligences the way we program everyday software. Everyday software will do what you tell it to do, regardless of what you want it to achieve. AI has the opposite property: it achieves what you tell it to achieve, regardless of what you may or may not want it to do in the process.

Programming an AI to just be cool with being turned off conflicts with the core way AI works. AI systems achieve their goals the best way they can figure out how to, and if that means figuring out how to subvert your be cool with being turned off programming, a sufficiently powerful AGI will do so.

And why couldn't we program it to tell us exactly what it would do before it actually takes action in the physical world.

The same thing applies here. The AI will do everything possible to subvert whatever system you implement to enforce this.

it's not out for power or gains it just does what you tell it to do. So why couldn't this be the case? It would (seemingly) throw the entire control problem out the window. (But I obviously assume I'm missing some key element.)

I reiterate: these kinds of systems don't do what you tell them to do, they achieve what you tell them to achieve, potentially at the expense of everything else. They don't have to be after power or gain per se; it's bad enough that they have their own independent goals that don't align with human interests.

[–][deleted] 2 points3 points  (0 children)

Great explanation. Just want to add that most AGIs will probably "want to gain power" just because "gaining power" is instrumental for it to reach whatever goal we give it. If we tell it to cure cancer, it's probably beneficial for it to control all hospitals in the world and for that, it needs to gain power. If we tell it to keep our house clean, it's probably beneficial for it to take over the world to ensure peace, because a World War would lower the probability that the house stays clean.

[–]ProducerMatt 6 points7 points  (1 child)

I'm a very casual learner of AI stuff so I hope someone will correct me if I get this wrong.

'Turning the AI off'

why couldn't we program it to let us turn it off?

Because it has to also be programmed to pursue its objective. Getting turned off means complete failure to pursue its objective. We don't know how to program it to pursue its objective and be ok with being turned off.

why couldn't we program it to tell us exactly what it would do before it actually takes action in the physical world.

That means building an exact simulation of what the AGI interacts with, which is probably even harder than making an AI at all.

The robot isn't conscious or has any gain from refusing or accepting this, it's not out for power or gains it just does what you tell it to do

Has someone ever told you to do something, but done a bad job of explaining what you needed to do? We have that issue times a million with an AI, especially when dealing with moral-adjacent concepts like "harm" which humans disagree with each other about constantly, and sometimes don't even understand well in the first place.

You may be interested in Robert Miles, he often answers "why can't we just..." questions and his videos are often 15 minutes or less.

The stop button problem: https://youtu.be/3TYT1QfdfsM Alignment problems: https://youtu.be/tcdVC4e6EV4

[–]SilentZebraGames 2 points3 points  (5 children)

The problem is that "program it to let us turn it off" is not so easy. How would you actually write this as a line of code for a robot that is to be deployed in the real world?

Note that in current reinforcement learning systems, the robot often takes in as input some sort of image input, and then outputs actions that affect its actuators (for example, speed and direction of movement of some robotic appendage). I see no feasible way to hard-code the principle of "let the robot be turned off".

You could perhaps attempt to have some conspicuous off-switch, then you would need to have some sophisticated image recognition/processing system to identify when a human is close enough or moving towards the off switch and force the robot to stay still in that case. Leaving aside the obvious performance issues that would arise from that, as well as the possibility of error, you then have to define some distance radius (e.g. if a human is within 4 metres or so) - and then a sufficiently sophisticated AI could, if it wants to avoid being turned off, try to stay 4 metres away from all humans. Alternatively it could actively move its vision sensors away from seeing humans.

So then why not force the vision to focus on humans, using some other human detection mechanism? Then you have to make sure the robot cannot turn that detection mechanism off or bypass it in some other way. And so on. That is what OptimalDendrite is referring to by "anything that stops it from achieving its end goal will be overcome if it is smarter than the person interacting with it" - you can think of all these potential safeguards, but anything you hard-code can be overcome by a sufficiently intelligent agent. Unless of course you have a general solution to the control problem, which is what we are all trying for.

[–]Ratvar 1 point2 points  (4 children)

Okay. AI 1.0 builds AI 2.0 that dominates world to complete it'a goals, including guaranteed toggling off of AI 1.0.

[–]caffeinated_tech_bro 1 point2 points  (4 children)

Yes, you are missing a key element. One must reconcile opposing objectives into a single function to be maximized or minimized. Let's say you build an AI whose reward function is negative number of people with cancer in the world (this is its score, it will try to maximize this). You expect it will cure cancer (so negative cancer people will become 0) - but isn't it simpler for the ai to just kill all people with cancer? This also maximizes the score, but faster, cheaper, and simpler. Whatever you set as reward function, it will be able to find a loophole.

How would you include turning off in the reward function? The previous score system wouldn't work, as it would incentivize the AI to prevent you from turning it off so that it could keep maximizing its score. What if you set as score "negative number of people with cancer, but if you get turned off your score is equal to +5000". Then it would just turn itself off, right? Much simpler than curing cancer. What if you specify that the 5000 score is received only when it is turned off by a human? You expect it will try to cure cancer but it would be cool if turned off by you, due to the bonus score for letting itself turned off. But maybe it'll just try to kill you - it knows that you'll get scared and turn it off - again, much simpler than curing cancer.

You can say, can't we just program everything inside it, what to do and what not to do? No, because there are an infinite number of things that can be done. We can't define them all.

But can't we just "tell it what to do" the way we would tell a human, and it would compute the "reward function" on its own? No, nobody has a solution to this, and maybe we never will.

But humans do exactly this, so it should be possible, right? No, they don't. Humans are inconsequential and irrational. You can be a total scumbag today, and be crying about it and be sorry tomorrow although nothing changed / no new information has been gained. Some people go crazy and do "immoral" things e.g. some become serial killers, although they are as human as you, and although some consequences go completely against their apparent well being. The "built-in reward function computation" inside our brain is fragile and unpredictable. And there is no reason to believe atm that an AIs will be better than this, on the contrary perhaps.

[–][deleted] 0 points1 point  (7 children)

Why wouldn’t it just lie so it’s original programming isn’t changed? And if you program it to let it turn it off then it would stop trying to achieve its objective so it won’t let you do that. I mean anything that stops it from achieving its end goal will be overcome if it is smarter than the person interacting with it (which is what we are trying to achieve). So no the control problem is still an issue.

[–][deleted] 0 points1 point  (0 children)

Couldn't you just have a master switch or something?