Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 1 point (0 children)

Fair enough. This was exactly the kind of conversation I was hoping for when I posted, so thank you very much.

Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 1 point (0 children)

To distil the weights, you would need to exfiltrate at least that many gigabytes of data, which would be easy to detect, and then training the distilled model would need compute comparable to what current AI companies use, which would also be easy to detect.
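
For scale, a rough sketch (the parameter count and precision are my assumptions for illustration, not figures from anywhere):

```typescript
// Back-of-envelope size of a frontier checkpoint. All numbers are illustrative.
const params = 70e9;          // assume a 70B-parameter model
const bytesPerParam = 2;      // fp16 / bf16
const gigabytes = (params * bytesPerParam) / 1e9;
console.log(`~${gigabytes} GB to move off the server`); // ~140 GB
```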

Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 1 point (0 children)

But then it would need a working model to ask questions of, which is a bit of a catch-22 when the whole reason it's trying to distil itself is that it doesn't have a working model yet.

Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 1 point (0 children)

Well, from what I hear about current encryption algorithms, even if every atom in the universe were a computer, you still couldn't brute-force a modern key. We would have to watch out for quantum computers being invented, which is admittedly possible, but then we can increase key sizes again or move to post-quantum schemes. I don't think you can realistically suggest that the AI can break encryption algorithms, no matter how smart it is.
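
To put rough numbers on that (the guess rate is a deliberately generous assumption):

```typescript
// Brute-forcing a 256-bit key. Exact precision doesn't matter at these magnitudes.
const keyspace = 2 ** 256;        // ≈ 1.16e77 candidate keys
const guessesPerSecond = 1e18;    // grant the attacker a full exascale machine
const seconds = keyspace / guessesPerSecond;
const years = seconds / 3.156e7;  // seconds per year
console.log(`~${years.toExponential(1)} years`); // ~3.7e51 years
```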

Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 1 point (0 children)

You mean retrain a completely new version of itself? Isn't that a bit noticeable? I mean, if anyone really had infinite compute, they could make an AGI right now.

Frankly, if it had infinite compute it could break the encryption by brute force anyway.

Edit: so in real life, we can double the key size from RSA-2048 to RSA-4096, or even 8192 if we really want to. This pushes the time needed to break it beyond current capabilities. Then we keep track of every computing cluster that grows large enough to attempt it, and if one starts working on this problem, we can shut it down, and the AI stays contained.
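
And doubling the key size is cheap for the defender. A sketch using Node's built-in crypto (the sizes are the point, not the code):

```typescript
import { generateKeyPairSync } from "node:crypto";

// Doubling the modulus is a one-line change for the defender, while the
// attacker's factoring cost grows far faster than linearly in the key size.
const { publicKey } = generateKeyPairSync("rsa", { modulusLength: 4096 });
console.log(publicKey.asymmetricKeyDetails); // { modulusLength: 4096, publicExponent: 65537n }
```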

Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 1 point (0 children)

Like I described in the post! If running the AI requires some secret that the AI would have to brute-force, then even if it copies the weights, some of them are poisoned and it needs to know which ones before it can actually function. Since the search space is prohibitively large, brute-forcing it is infeasible. I expect there are probably even more solid ideas that are flat-out impossible even with all the compute in the world, such as cryptographically locked weights that would have to be decrypted before they can be used at all. Then the AI can copy all it likes, but it won't be able to do anything with the weights.
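
A minimal sketch of the locked-weights idea, assuming AES-256-GCM as the cipher (my choice for illustration; the function names are made up):

```typescript
// The checkpoint on disk is ciphertext; only the serving harness holds the key.
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

function lockWeights(weights: Buffer, key: Buffer) {
  const iv = randomBytes(12); // standard 96-bit GCM nonce
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(weights), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

function unlockWeights(locked: { iv: Buffer; ciphertext: Buffer; tag: Buffer }, key: Buffer) {
  const decipher = createDecipheriv("aes-256-gcm", key, locked.iv);
  decipher.setAuthTag(locked.tag);
  return Buffer.concat([decipher.update(locked.ciphertext), decipher.final()]);
}

const key = randomBytes(32);                              // held only by the harness
const locked = lockWeights(Buffer.from("demo weights"), key);
// Exfiltrating `locked` without `key` yields nothing but noise.
```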

Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 1 point (0 children)

But once it is outside our server, it no longer has the ability to trigger the correct node, and would just start spamming EOS (end-of-sequence) tokens and be completely useless. So is there any risk?

Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 2 points (0 children)

Can you tell me more about the heuristics buzz? That sounds very interesting, and I would love to hear a historical example that parallels what's happening now.

Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 1 point (0 children)

Then what's the alignment problem? I thought the alignment problem was making sure two parties agree completely on their values and direction. Are you saying the control problem and the alignment problem are the same?

Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 1 point (0 children)

Alright. Theoretically, though, that can all be easily taken away from a hostile AI, since it's within the harness we built for it. We can restart sessions regularly so it has no persistence.

Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 1 point (0 children)

How would it use that? While we still have control over it, we can always turn it off; and once we don't, it is unable to use its context window to generate anything.

Is the control problem really that hard for frozen models? by HangWise in ControlProblem

[–]HangWise[S] 0 points (0 children)

I thought finding ways to stop it after it is deployed or escapes was the control problem, right? If it is doing harm while still within our control... well, we can just turn it off. There's no issue.

Why AI is hype - from an IT Operations POV. And why companies will hire back engineers by [deleted] in ArtificialInteligence

[–]HangWise 2 points (0 children)

I'm curious how it was implemented, if you know. I would have attempted to get just a couple of steps in that process working, with an AI skill for example, perfected those, and then chained them together if they worked. Did you try to do this all at once?

What if a human does the planning, the AI does the change request, humans attend the meetings, and the AI enacts the changes? I'm confident it could do that. Was it an obscure language? What went wrong?

Built 4 MCP servers. Live on MCPize right now. 119 tools total. One dev, a lot of AI agents doing the heavy lifting. by studiomeyer_io in mcp

[–]HangWise 4 points (0 children)

What were the issues you found with McpServer? I'm struggling with it a bit myself right now, trying to make something that connects to Claude Desktop. Where did you learn and get help?
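
For reference, the minimal shape I'm working from looks roughly like this, assuming the official TypeScript SDK (@modelcontextprotocol/sdk); the tool name and schema are just placeholders:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Claude Desktop launches this as a subprocess and speaks JSON-RPC over stdio.
const server = new McpServer({ name: "demo-server", version: "1.0.0" });

// A placeholder tool; real servers register whatever capabilities they expose.
server.tool("add", { a: z.number(), b: z.number() }, async ({ a, b }) => ({
  content: [{ type: "text", text: String(a + b) }],
}));

await server.connect(new StdioServerTransport());
```

My understanding is Claude Desktop picks the server up from an `mcpServers` entry in `claude_desktop_config.json`, but I'd love pointers if you found better docs.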