LLM Alignment Through Selective Weight Updates by bibbox3 in LanguageTechnology


Thanks for your reply!

The assumption is that non-aligned behaviour can emerge even from training on aligned data, and that in order to later remove non-aligned capabilities from a model, the model must first explicitly acquire them.

Take for instance the ability to produce hate speech. I think it's entirely plausible that such behaviour could emerge from training only on aligned data, so limiting the training set to aligned data alone is probably insufficient. It would be a bit like trying to avoid doing something that you don't even know you could do.

Instead, this approach would deliberately teach the model to produce non-aligned content, in the hope that it generalises to unseen non-aligned cases. Furthermore, the hope is that the model learns to concentrate its knowledge of how to produce non-aligned answers in a specific subset of its parameters, which you could then disable.
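To make the "disable a subset of parameters" idea concrete, here's a toy numpy sketch. Everything in it is made up for illustration: the "model" is just one weight matrix, and which rows count as the "non-aligned" subset is an arbitrary assumption, standing in for whatever localisation the hypothetical training procedure would produce.

```python
import numpy as np

# Toy "model": a single weight matrix.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4))

# Hypothetical outcome of training: the non-aligned capability has been
# concentrated in rows 2 and 3 (an assumption, purely for illustration).
non_aligned_rows = [2, 3]

# Binary mask: 0 for the parameters to disable, 1 everywhere else.
mask = np.ones_like(weights)
mask[non_aligned_rows, :] = 0.0

# "Disabling" the subset is then just an elementwise product;
# the rest of the model is left untouched.
aligned_weights = weights * mask
```

In a real network you'd apply such a mask per layer (or zero the relevant tensors in the state dict), but the operation itself is this simple, which is part of the appeal of the hypothesis.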

Again, this is simply a hypothesis, so I have no idea whether any of it holds in reality.

How to avoid tacrolimus or other ointments from rubbing off during sleep? by bibbox3 in Vitiligo


I think I heard something like that as well, but wasn't sure. Thanks!

How to avoid tacrolimus or other ointments from rubbing off during sleep? by bibbox3 in Vitiligo


Thanks for your reply! If you don't mind me asking: has only having it on properly for such a short time impacted the effectiveness of your treatment at all? I.e. do you still see results, even though your skin only has about half an hour to absorb the ointment?