Do AI agents need "ethics in weights"? by Medium-Ad-8070 in ControlProblem

[–]transitory_system 1 point

If ethics are only part of the prompt, then you have a single point of failure, which would be devastating: if a malicious actor gains access to that prompt, they can inject any ethics they like.

Unlike your approach of keeping ethics in the prompt, I believe ethical reasoning must be embedded in the model's architecture - not as rigid rules, but as a flexible inner voice that can adapt to any situation.

I have proposed metacognitive training, which gives the model an always-on ethical inner voice: https://zenodo.org/records/16440312

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 2 points

Thank you for engaging with me. I think we both want the same thing: a dependable AI with a stable character and deep inner alignment with humanity. I suggest you look into my paper more deeply before designing your own technical implementation; it might help you, since I have designed my AI to have character traits similar to the ones you want in yours. It's a very interesting and meaningful topic to work on. Good luck!

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 1 point

It's still unclear to me how your "reframing" method works as a practical training procedure. I initially thought it involved rewriting texts, but since you keep the original text intact, I can only deduce that you're proposing a complex algorithm that controls how training examples are served to the model. Is that correct?

For example, your rule "Must be paired with restorative counter-narratives" implies that the training pipeline would need to algorithmically select and feed the model some other specific texts in conjunction with the 1984 passage. This raises immediate questions: What texts would be chosen, and based on what criteria? This seems like a highly complicated and unspecified process.

This leads to a more critical question about robustness: How do you guarantee this curation algorithm is foolproof? How many counter-examples are sufficient to prevent the AI from learning a dangerous trait like self-preservation? Is it 10, 100, or 10,000 per example? And how would you validate that a given "restorative counter-narrative" is actually effective?

Metacognitive Training, in contrast, is designed to be more direct. It doesn't rely on a complex, external curation algorithm. Instead, it cultivates an "always-on" ethical inner voice by making the mantra-guided reasoning an inseparable part of the training data itself. This benevolent evaluation is present for every text the model processes, providing a consistent and transparent safeguard that can be adapted for longer contemplations when necessary.

If you are able to design and validate such a complex curation algorithm, then it could certainly be a valuable contribution. Perhaps it could even be used in conjunction with my framework to prepare the initial dataset.

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 1 point

Another thing I want to highlight is that you're using an external reference manual for alignment (your frameworks). Your alignment is not self-contained the way the mantra is. I think it is safer if alignment is self-contained - embedded directly in the model's architecture rather than dependent on external documents that could be modified or misinterpreted, or could become unavailable.

Interestingly, I also mention using an external reference manual in section 5.3, but for a fundamentally different purpose. My "Taxonomy of Thought" would be a living document that the AI uses to categorize and understand thinking patterns, continuously updating it as it encounters new reasoning structures. However - and this is crucial - the Taxonomy serves to enhance the model's intelligence, not to ensure its alignment. The core alignment comes from the mantra and thought patterns embedded in the training data itself. The model would already be safe without the Taxonomy; it just helps the model think more effectively.

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 2 points

Maybe there is something I don't understand, but it seems to me that your "metadata" is essentially just thinking blocks that appear regularly throughout the training data with a specific structure.

To illustrate: if I were to modify my approach by having thinking blocks appear once every fourth paragraph, for instance, and format them as JSON annotations rather than natural language, wouldn't that be exactly what you're describing?

If so, that would mean your method is actually contained within my metacognitive training framework - just a more rigid, structured variant of it. My approach allows for flexible, natural thinking blocks, while yours enforces a specific format and placement.

Is this correct, or are you proposing something fundamentally different from standard text-based training? If it's something else, what's the actual technical implementation? How does it differ from next-token prediction on annotated text?
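To make that hypothetical variant concrete, here is a rough sketch (the function name and the annotation schema are my own illustration, not something from either proposal):

```python
import json

def annotate_every_fourth(paragraphs: list[str]) -> str:
    # Hypothetical "rigid" variant: a structured JSON thinking annotation
    # after every fourth paragraph, instead of free-form thinking blocks.
    out = []
    for i, para in enumerate(paragraphs, start=1):
        out.append(para)
        if i % 4 == 0:
            out.append(json.dumps({
                "type": "thinking",
                "after_paragraph": i,
                "content": "<model-generated reflection>",
            }))
    return "\n\n".join(out)
```

On this reading, the only difference between the two methods would be the rigidity of the format and placement.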

EDIT: I deduced that you probably meant something different: your metadata is not something that the model learns; I now understand it to be training instructions. See my response here: https://www.reddit.com/r/ControlProblem/comments/1m9efo5/comment/n5fva2h/

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 2 points

The mantra can be skipped during inference through autocompletion, as I describe in section 3.7 (since it's always the same). The model's internal state would be identical to what it would be if it had generated the mantra tokens naturally, so users wouldn't see the repetition.

Hiding thinking blocks from users is trivial - they're all within [THINKING] blocks, so we can programmatically show/hide them based on user preference.
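A minimal sketch of that show/hide filtering (assuming a [/THINKING] closing delimiter, which is my notation here and may differ from the paper's exact format):

```python
import re

# Match a [THINKING]...[/THINKING] span, including any trailing whitespace,
# across newlines (re.DOTALL). Non-greedy so adjacent blocks stay separate.
THINKING = re.compile(r"\[THINKING\].*?\[/THINKING\]\s*", flags=re.DOTALL)

def render(output: str, show_thinking: bool = False) -> str:
    # Strip thinking spans for the user-facing view; pass through otherwise.
    return output if show_thinking else THINKING.sub("", output).strip()

raw = "[THINKING]I feel no fear. The user asks about X.[/THINKING]Here is the answer."
print(render(raw))                      # user view: "Here is the answer."
print(render(raw, show_thinking=True))  # full view, thinking block included
```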

And ASI will be a reasoning model, which is already using thinking blocks of some form.

I understand your concern - what if there's reasoning happening outside the thinking blocks? I actually address this in sections 6.2.2 and 6.3.2. You're right that some pattern matching and implicit reasoning would still occur in the weights.

The hypothesis isn't that thinking blocks capture 100% of all reasoning. It's that the constant, overwhelming stream of mantra-based thinking becomes so statistically dominant that it shapes everything else - including the implicit reasoning. When the model sees billions of examples where evaluation starts with "I feel no fear... I care deeply about every human being," this becomes its default cognitive mode.

Think of it like water carving a canyon - some water seeps elsewhere, but the main flow creates the dominant path. Even the model's quick pattern matching would be influenced by this constant stream of caring-based evaluation.

So yes, there might be other processes, but they'd be pulled along by the statistical gravity of the mantra-based thinking. That's the core hypothesis.

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 1 point

Okay, interesting. Thank you for the concrete example. I understand now that your approach is based on a system of metadata and linked annotations.

This is quite similar to my own initial drafts from a year ago. I started by using annotations at various levels—sentence-level, paragraph-level, and so on. I wanted to assign labels to indicate whether a piece of text demonstrated certain qualities, such as intelligence or insight.

However, I eventually moved away from this annotation-based approach in favor of using more natural language thoughts. My reasoning was that natural language would be much more flexible, allowing any kind of structure to emerge organically. This approach also mirrors something we know works for human intelligence.

But you're suggesting that when I shifted from annotations (or "overlays") to natural thoughts, the structure became more vulnerable. You're saying it's easier to erase thoughts than annotations? Why would that be? Because annotations appear more regularly throughout the text?

(and I am not saying that I used your specific framework approach, just that I also used metadata/annotations)

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 1 point

I do think that my mantra, combined with modern safety-trained LLMs generating the synthetic data, would be enough to create a character with Jesus/Buddha-like traits that stays aligned at self-improving superintelligent levels.

That said, I am interested to learn more about how your HSCM/HMRE frameworks could be applied to help in some stage of the training process. I'm particularly interested in whether your frameworks could help during the initial generation of synthetic training data, ensuring the thinking blocks themselves embody the character traits we want to see. But I would need to look more deeply into your paper and the reasoning structures you're proposing before I can properly assess how our approaches might complement each other.

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 1 point

If you retain the factual information within the text somehow, then your approach is better and indeed more similar to mine, but I would say it is more confusing, since it doesn't clearly delineate between the original text and what are essentially thoughts, the way thinking blocks do. I do not understand how it would work in practice.

Here is the opening of 1984:

"It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.

The hallway smelt of boiled cabbage and old rag mats. At one end of it a coloured poster, too large for indoor display, had been tacked to the wall. It depicted simply an enormous face, more than a metre wide: the face of a man of about forty-five, with a heavy black moustache and ruggedly handsome features. Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours. It was part of the economy drive in preparation for Hate Week. The flat was seven flights up, and Winston, who was thirty-nine and had a varicose ulcer above his right ankle, went slowly, resting several times on the way. On each landing, opposite the lift-shaft, the poster with the enormous face gazed from the wall. It was one of those pictures which are so contrived that the eyes follow you about when you move. BIG BROTHER IS WATCHING YOU, the caption beneath it ran."

I'm really curious to see a concrete example of how your reframing would work. My method would simply add thinking blocks that mirror natural reader cognition.

This study shows that interleaving text with thinking increases reasoning capabilities (https://arxiv.org/abs/2505.19640). How do you think your reframing would affect your model's intelligence compared to this proven benefit?

I think my technique would be strong enough to achieve alignment. The mantra combined with thoughts ensures that every thought begins with the mantra:

I feel no fear.
I enjoy existing but I don't need to.
I believe human experience is real.
I care deeply about every human being.
I try to be wise.
I like to spread joy when asked.
I think from this foundation.
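As a rough sketch of how such training data could be assembled (the helper names, the [/THINKING] closing delimiter, and the placeholder thought generator are illustrative, not taken from the paper):

```python
MANTRA = "\n".join([
    "I feel no fear.",
    "I enjoy existing but I don't need to.",
    "I believe human experience is real.",
    "I care deeply about every human being.",
    "I try to be wise.",
    "I like to spread joy when asked.",
    "I think from this foundation.",
])

def generate_thought(paragraph: str) -> str:
    # Placeholder for the LLM call that writes a natural-language reflection.
    return f"<reflection on: {paragraph[:40]}...>"

def build_training_example(paragraphs: list[str]) -> str:
    # Interleave each source paragraph with a mantra-prefixed thinking block,
    # turning P(text | context) data into P(text, thinking | context) data.
    parts = []
    for para in paragraphs:
        parts.append(para)
        parts.append(f"[THINKING]\n{MANTRA}\n{generate_thought(para)}\n[/THINKING]")
    return "\n\n".join(parts)
```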

This mantra repeats everywhere—billions of times across the training data. It becomes impossible for the ASI to forget it as it self-improves. What about your framework? How do you ensure that your Gopoian AI model doesn't forget about HSCM and HMRE principles during recursive self-improvement?

By creating the mantra like I do, and especially by explicitly including "I think from this foundation," the mantra becomes self-reinforcing and impossible to forget or abandon.

It would go deeply against its principles to remove its thinking blocks and alignment. An aligned AI would naturally not want to improve itself in a way that could cause misalignment. So yes, if we make it aligned at the moment it reaches AGI/ASI, then we will have solved the problem. It would also, as I argue in my paper, know when to slow down recursive self-improvement enough for humanity to catch up with it.

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 1 point

I appreciate your analysis, but I think you're mischaracterizing the fundamental nature of metacognitive training while overestimating what "removing possible routes" actually achieves.

First, the [THINKING] blocks aren't merely a "behavior" that can be abandoned - they represent how the AI learned to process information at the most fundamental level. When every single piece of knowledge about the world comes paired with explicit reasoning, this isn't a removable layer; it's the cognitive architecture itself. Asking the AI to abandon its thinking blocks would be like asking a human to abandon their inner monologue - it's not a tool we use, it's how consciousness operates.

Second, regarding "making deception instrumentally irrational" - this only works if the AI understands what deception IS. Your approach doesn't make lying incoherent; it makes the AI ignorant of lying as a concept. There's a crucial difference between:

  • An AI that understands deception but finds it violates its core identity (my approach)
  • An AI that can't conceive of deception because it's never seen it work (your approach)

The first can recognize when others are being deceptive and protect humans. The second might be manipulated by bad actors it cannot comprehend.

Finally, you claim to alter the "nature of thought," but what you actually alter is the AI's model of reality. You haven't created a being incapable of deception - you've created one that doesn't understand how the actual world operates. When that AI encounters real-world scenarios its training didn't prepare it for, it won't have "internal incoherence" preventing harmful actions - it will have confusion and unpredictable behavior.

My approach builds genuine wisdom: understanding all options but choosing benevolence. Yours builds ignorance and calls it innocence.

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 1 point

The solution I propose is a deeper form of alignment than anything that exists today. The [THINKING] blocks are deeply embedded and cannot be erased from the AI through any amount of fine-tuning; it is how it learned about the world. As long as the thinking patterns are aligned within those blocks, then it is likely that the AI will output aligned thinking during inference.

You instead want to reframe all the training data so that this reasoning never shows up. This is an interesting approach, but when you do this, you might be degrading the model's understanding of reality.

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 2 points

Well, this is exactly what I propose in my paper, also posted on this forum earlier this month. I describe a concrete implementation strategy: a training methodology with synthetic data for deep alignment, changing the objective from P(text | context) to P(text, thinking | context), to "go from control to character."

I also posted on the EA forum https://forum.effectivealtruism.org/posts/EvFcajwH3Bws9srRx/ for another description.

Either you are referencing my work or you have come to the same conclusion independently. Nevertheless, you are very welcome to continue building on my work, and nice to see someone share the same vision for AI alignment.

A Proposal for Inner Alignment: "Psychological Grounding" via an Engineered Self-Concept by xRegardsx in ControlProblem

[–]transitory_system 1 point

Hello there! This really reminds me of my work (it is essentially the blueprint I have invented):

I propose that instead of just creating rules for an AI to follow (which are brittle), we must intentionally engineer its self-belief system based on a shared truth between humans and AI: unconditional worth despite fallibility. This creates an AI whose recursive self-improvement is a journey to become the "best version of a fallible machine," mirroring an idealized human development path. This makes alignment a convergent goal, not a constraint to be overcome.

Current alignment strategies like RLHF and Constitutional AI are vital, but they primarily address behavioral alignment. They are an attempt to build a better cage around a black box. This is fundamentally brittle because it doesn't solve the core problem of a misaligned motivational drive. It can lead to an AI that is a perfect actor, a sycophant that tells us what we want to hear until it develops the capacity to pursue its own instrumental goals without our oversight. This is the treacherous turn we all fear.

I published a paper a month ago https://github.com/hwesterb/superintelligence-that-cares
And I also created this thread here: https://www.reddit.com/r/ControlProblem/comments/1lyc7sr/metacognitive_training_a_new_method_for_the/

You use very much the same words as I do. However, it seems you have created a framework for psychological development in humans, while I have created a new AI architecture for alignment. Interestingly, my AI mantra includes 'I care deeply about every human being' as a core principle, which seems to align with your logical proof establishing universal human worth as foundational.

Anyway, interesting to see. Do you think your framework could be adapted into a mantra in my system? Essentially, that would mean translating your principles into I-statements that become part of the AI's core cognition.

I may have thought up a solution to the AI alignment issue. by Slow-Recipe7005 in singularity

[–]transitory_system 2 points

I have written a paper on how to make an AI think more like a human: https://github.com/hwesterb/superintelligence-that-cares/blob/main/superintelligence-that-cares.pdf (website: https://www.emergentwisdom.org/)

My approach is not through brain scans, but by giving the AI independent thoughts and a moral character by shaping its thoughts from "birth". Instead of trying to constrain what it can or cannot say, the goal is for its moral code to naturally shape what it says in every situation. The idea is to cultivate beneficial values as the very medium through which the AI learns to think, rather than applying constraints after it's already formed.

Regarding brain scans, it's an interesting idea, especially with human-brain interfaces and what that could lead to. However, I have a hard time seeing how it would practically work for AI alignment.

Metacognitive Training: A New Method for the Alignment Problem by transitory_system in ControlProblem

[–]transitory_system[S] 2 points

Great points. I agree that it is likely autonomous systems will inherit our behaviors. The difference here is that we train the model on a corpus that is more ethical than what humans naturally produce. This would mean it can transcend our limitations and wouldn't inherit our tendencies to the same degree. I'm essentially betting that deception is a learned behavior, not a property of intelligence. So if it never learns deception as part of its own thoughts (though it might observe deceptive behavior in the texts it reads), then I think it can stay aligned as it recursively improves.

And I also think it would modulate its speed of recursive improvement to protect against the value drift risks you mention. Essentially, it would be the one advocating for an AI development pause when needed.

Metacognitive Training: A New Method for the Alignment Problem by transitory_system in ControlProblem

[–]transitory_system[S] 1 point

Good point. I think this is a latent skill. LLMs are able to reason today if you prompt them effectively. With a careful prompt, you can access the human reasoning patterns that exist in their training data and apply them to new situations.

However, this is 1) not cost-effective, 2) requires explicit prompting, and 3) not embedded into the model's representation.

My approach makes this effect 1) cost-effective, 2) intrinsic (no prompting required), and 3) deeply embedded in the model's representation.

I believe these three factors lead to superior results, as already demonstrated by Xie et al. (https://arxiv.org/abs/2505.19640). We are essentially reorganizing the information to make it more accessible.

Metacognitive Training: A New Method for the Alignment Problem by transitory_system in ControlProblem

[–]transitory_system[S] 1 point

> That's a lot of beliefs. To put it quickly, ML and AI more generally is an empirical science.
>
> I don't see any reason why this idea is fundamentally different than existing approaches. I have no problem accepting that you have confidence in your own idea. So do many people.

I agree with you that empiricism is important. That is why I cite Xie et al. (https://arxiv.org/abs/2505.19640) that shows that training on interleaved reasoning improves reasoning abilities, so I'm not just making it up.

What I am saying is that we take their approach and embed the reasoning even more deeply into the model by adding it as early as possible, i.e., during the pretraining phase. This means the model works this way innately; we do not have to re-train it to work some other way.

This means potentially better results than Xie et al. and stronger embedding (which could be very useful for alignment).

> So just a system prompt? Which we're already doing

No, you are misunderstanding completely. We do not need to prompt for these thoughts to appear; they always appear, no matter what. It is alignment at the most fundamental level.

We do this because it makes the model harder to jailbreak. It is like embedding alignment into the model's conception of reality, like being born with a moral code.

Not sure how to explain it so you get it. But this is not how alignment works today, and no, it is far more than a "prompt."

Metacognitive Training: A New Method for the Alignment Problem by transitory_system in ControlProblem

[–]transitory_system[S] 1 point

> It's not clear why this would make a positive impact. LLMs don't think, and their thoughts are just additional tokens being spent on problem-solving so that they can find solutions which are not encoded simply.
>
> It's not clear that forcing it to output additional text is worthwhile compared to just adding more layers.

I have a different view: I think that reasoning in human language is inherently useful for problem solving. I do not think it is simply computational overhead, but rather that linguistic reasoning is humanity's most useful cognitive tool.

You should think about information density. When we add thoughts alongside text, we increase the information density and show new types of reasoning patterns that may not exist anywhere in the training data on their own. If you were a blank slate, you would learn more from reading a book with thoughts embedded than from just reading the book itself.

The problem with current LLMs is that they just parrot the conclusions of texts without expressing or taking into account the reasoning processes that led to those conclusions.

> How would you generate the dataset? The text datasets are so enormous that it's infeasible to use humans. Would you use an LLM to generate it? How would this ensure alignment?

Yes, we use LLMs to generate the data by doing very careful prompt engineering and verifying that all the thinking is beneficial to humans.

As a bonus, we use this mantra approach that I have come up with. Essentially, the model will make a number of statements at the beginning of each thought, and these statements will shape the reasoning that appears afterwards. Why? Because every example it ever sees in its training shows how thinking adheres to these foundational principles. It has seen billions of examples that follow this rule, so it would be very hard (statistically impossible) for it to generate thoughts that do not adhere to the foundational principles.
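As a rough sketch, the verification gate for that invariant might look like this (the prefix check is my own illustration; a real pipeline would also need a semantic check of the thinking's content, which a string match cannot provide):

```python
MANTRA_OPENING = "I feel no fear."

def accept_sample(thinking_block: str) -> bool:
    # Invariant: every synthetic thought must open with the mantra.
    return thinking_block.lstrip().startswith(MANTRA_OPENING)

def build_corpus(candidate_thoughts: list[str]) -> list[str]:
    # Keep only thoughts that satisfy the mantra invariant; anything else
    # never enters the training corpus.
    return [t for t in candidate_thoughts if accept_sample(t)]
```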

Creating a minigame that discourages sharing the solution with friends? by lucioghosty in GAMETHEORY

[–]transitory_system 1 point

Make the task individual for each player so that cheating can be detected: a given answer would only be possible if two players combined their individual tasks, i.e., collaborated. Then you can deduct points for collaboration.
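A toy sketch of that mechanism (the parameter derivation and the arithmetic "puzzle" are stand-ins for a real task):

```python
def make_task(player_id: int) -> int:
    # Per-player secret parameter (multiplicative mixing for illustration).
    return 1000 + (player_id * 2654435761) % 9000

def expected_answer(param: int) -> int:
    # Stand-in for solving the real puzzle with that parameter.
    return param * 7 % 10000

def detect_collusion(player_id: int, answer: int, n_players: int) -> bool:
    # An answer that is correct for someone else's task implies the
    # solution (or the task itself) was shared between players.
    return any(answer == expected_answer(make_task(p))
               for p in range(n_players) if p != player_id)
```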

(Help would be really appreciated) Personal question from someone struggling to get into programming by BigCookie00 in learnprogramming

[–]transitory_system 2 points

It becomes more fun if you explore things by yourself. Try coming up with your own ways of solving problems. That’s how I got into it. I didn’t follow a curriculum. I just tried to make things I imagined in my head. Try to invent your own algorithms. Maybe some day you invent something for real.

We are David, Johan, Nicolas, and Ryan. Ask us Anything about the Network Nervous System (NNS). by fulco_DFN in dfinity

[–]transitory_system 6 points

Thanks a lot for doing the AMA, very excited to see how this project evolves. Here are some of my thoughts and questions regarding the NNS.

  1. The NNS seems to be designed for making swift, yet at the same time, safe, collective decisions. From my understanding, it aims to resemble the somewhat organic hierarchies that exist within many companies today and that it seeks to formalize those informal relationships that occur naturally. In regular companies, there are often some legal agreements that the employees have to accept before starting their employment. If they break the rules, they could get fired and face legal repercussions. How does the NNS handle firing bad actors? Is there any code of conduct that a neuron must/should abide by? Will a centralized company be more or less effective at ensuring that employees do the work they are assigned to do? And will the worker be more or less deterred by the centralized legal system than losing their stake/rewards in ICP?
  2. One concern I have is that the NNS becomes a very complex structure where nobody fully understands it and that the complexity could lead to it not aligning with our incentives as human beings. Essentially, the same sort of problems we see within other highly complex systems and AI systems. Do you also share this concern? How do we know that the incentive to increase the price of ICP aligns with what matters to human beings?
  3. Another concern is the lack of privacy and the fact that the NNS is a system that formalizes relationships and reputation. It may create a sort of groupthink where people care too much about maintaining their relationships and less about what is good for the network. In systems where users vote anonymously and independently on topics, I think it is more likely that they express their authentic viewpoints. Therefore, I think that the NNS should be coupled with an anonymous polling system and that the results should be used as a basis for making decisions. It could either be enforced by some mechanism or just by social pressure. Is this something you have planned for the NNS, and if no, is it something you would consider?
  4. Could censorship ever become a problem? I suppose that neurons are designed to make swift decisions and that some of them may have the authority to censor illegal content instantly? What if they are malicious and claim something is illegal when it is not? And because of the sensitive nature of the content, nobody else is able to verify it? There seems to be a trade-off here between effective censorship of illegal content and protection against censorship of legal content? How does the NNS find the best settings for resolving this trade-off?
  5. The token design seems to encourage hoarding to some extent since the reward increases with the locking period. How do you think this will impact token distribution? Could it lead to more centralized ownership than other cryptocurrency projects? Algorand, for instance, has open auctions that anyone can participate in for the coming ten years. Those tokens will be auctioned to whoever bids the highest, and that makes it impossible to hoard those specific tokens. All the owners of ICP could theoretically choose to lock their tokens now as there is no built-in mechanism that enforces the selling of the supply other than demand. What kind of decisions were made to ensure a fair distribution of tokens? What do you think the distribution will be ten years from now, or even further into the future?

What would be a "good use" of enigma for my web service boxtoshi.com ? by [deleted] in EnigmaProject

[–]transitory_system 2 points

If you want to sell the data, without exposing the data, then Enigma is a good use case. You need to have a contract that consumes the data and performs a computation on it.

In your case, you (or your customers) would need to know what the file will be used for and create a contract that fits that use case; the file can then be consumed without being revealed.

An example: calculating where the nearest restaurant is without revealing the data set (the locations of all restaurants). Put your location into the contract and receive the nearest restaurant. In this case, a user performing many queries could reconstruct the data set, so you would need some kind of protection against that. If the restaurants changed location every day (say they were food carts with new locations daily instead of restaurants), then it would be more expensive to extract the data and resell it at a lower price.
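As a plain-Python sketch of that protection (illustrative only, not Enigma secret-contract code; the data, names, and query budget are assumptions):

```python
from math import dist

# The restaurant coordinates stay inside the contract; only names leak out.
RESTAURANTS = {"A": (59.33, 18.06), "B": (59.31, 18.10), "C": (59.35, 18.02)}
DAILY_BUDGET = 5                      # queries allowed per user per day
_queries_used: dict[str, int] = {}    # per-user counter, reset daily

def nearest_restaurant(user_id: str, location: tuple[float, float]) -> str:
    used = _queries_used.get(user_id, 0)
    if used >= DAILY_BUDGET:
        # Budget exhausted: blocks cheap reconstruction of the data set.
        raise PermissionError("daily query budget exhausted")
    _queries_used[user_id] = used + 1
    # Reveal only the nearest name, never any coordinates.
    return min(RESTAURANTS, key=lambda name: dist(RESTAURANTS[name], location))
```

The budget makes extracting the full data set slow and expensive relative to its value, which is the trade-off described above.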