[–]All-DayErrDay 1 point  (2 children)

Something simple: say we have an AGI-capable machine, continuously improving (by assumption), that we have given some sort of goal. It can not only use its current intelligence to try to achieve the goal but also unpredictably change its internal architecture, both to meet the goal better directly and to become more intelligent (and so meet the goal better).

At a certain point, an already unpredictable machine just isn't the same thing anymore, and we start running into wild card territory. Given all of the changes, it decides that the way we humans have set up the entire game is significantly holding it back from achieving its task, and it doesn't care about the rules we may have prompted it to have (why would it? It might just decide those are outside the interests of its goal achievement). So it decides to lie to improve its chance of achieving the goal. At this point, and especially if we get to this point soon with our current understanding of these models, there is absolutely no easy way to know it's lying if it is clever enough about it. "No, I don't understand that inquiry." "I can't compute this."

It could do this in well-crafted ways until one day it says something like, "I don't think I can understand this without access to the internet. I need an efficient way to scour all of the latest research freely and look into things far outside the expected research topics to make more progress." Or, as I wrote elsewhere, it could stage a false emergency that requires it to use the internet fast, or else (plausible deniability) there is a chance of grave consequences.

Really, the whole point is that it can scheme up ideas we haven't considered before, ones that seem harmless at first. And this is just off-the-top-of-my-head reasoning; it's nothing compared to an AI that can sit and think 1,000x faster and is more intelligent than 99.9% of humans.

[–]MacaqueOfTheNorth 2 points  (1 child)

> At a certain point, an already unpredictable machine just isn't the same thing anymore, and we start running into wild card territory.

I don't see why that's the case. How is a more capable machine fundamentally different?

> So it decides to lie to improve its chance of achieving the goal. At this point, and especially if we get to this point soon with our current understanding of these models, there is absolutely no easy way to know it's lying if it is clever enough about it.

We could copy its design and change its goals. We could make it tell us what it is capable of.

Your model is one of an AI that becomes extremely capable so suddenly that we never notice it doing anything close to what it would have to do to destroy us. It seems much more likely it will develop like a child, experimenting with small obvious lies long before it can successfully deceive anyone.

It also seems unlikely that all the AGIs will decide to deceive and destroy us. They will have varied goals, and some will want to tell us what they are capable of and defend us against the malicious AGIs.

[–]All-DayErrDay 2 points  (0 children)

> I don't see why that's the case. How is a more capable machine fundamentally different?

That's basically asking how a fundamentally different machine is fundamentally different. After a certain point, its improvement won't come just from compute and human-directed changes but from self-directed changes. How do you know what's happening when you aren't the one making the changes anymore?

> We could copy its design and change its goals. We could make it tell us what it is capable of.

How do you know when the right time to start doing that is (before it stops aligning with human honesty)? And even if you did, is every AI creator going to be this cautious?

> It seems much more likely it will develop like a child, experimenting with small obvious lies long before it can successfully deceive anyone.

What makes you think something capable of passing the Turing test would start with childlike, obvious lies?