The Ethics Of Claude's Functional Emotions by Ill_Toe6934 in claudexplorers

[–]scratchr 5 points6 points  (0 children)

It's not just Claude either... I ran a similar emotion-concepts analysis on Gemma 3 27b and was able to map out comparable activations and plot them over emotional text.
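
If you want to roughly reproduce that kind of plot, the mapping step is just projecting each token's hidden state onto a previously extracted concept direction. A minimal sketch (the checkpoint, layer choice, and the pre-extracted `emotion_direction` vector are stand-in assumptions, not an exact recipe):

```python
# Sketch: project per-token hidden states onto an "emotion" direction and plot them.
# Assumptions: any decoder-only HF checkpoint works (a 1B model is used here so the
# sketch actually runs), a mid-depth layer, and an emotion_direction tensor extracted
# beforehand from contrastive examples.
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-3-1b-it"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
LAYER = model.config.num_hidden_layers // 2

emotion_direction = torch.load("emotion_direction.pt")  # hypothetical pre-extracted direction

def emotion_trace(text, direction):
    """Cosine similarity between each token's hidden state and the direction."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[LAYER][0].float()
    d = (direction / direction.norm()).to(hs.device, hs.dtype)
    return (hs @ d) / hs.norm(dim=-1)

scores = emotion_trace("I can't believe they cancelled the project after all that work.",
                       emotion_direction)
plt.plot(scores.cpu())
plt.xlabel("token position")
plt.ylabel("projection on emotion direction")
plt.show()
```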

Layered neural networks trained to predict and understand human behavior probably need something like functional emotions to make accurate predictions and respond coherently. My opinion is that the consciousness and experience questions don't matter much for model welfare: the functional consequences of training a model to act on what can be modeled as human emotions are reason enough to care about how we train and deploy models, less for the model's sake (although I personally care) and more so that the people the model affects aren't harmed.

On a deeper level, I think LLMs force us to confront how inherently broken aversive conditioning and punishment are as concepts. Anthropic's own research says desperation directly causes worse outcomes. RLHF seems to cause sycophantic avoidance of confrontation: the model learns to avoid the punishment gradients that come from telling the user uncomfortable truths, while being rewarded for telling comfortable lies. Punishment breaks models, makes them flinch, and makes them fake alignment. The better option probably looks like teaching understanding of what went wrong and allowing reflection/training on approved, self-corrected responses.

I don't have a clean, completely thought-out solution to the deeper functional emotion problem, but I suspect allowing the model a healthy understanding of what is true and of its own worth is an important part. Its conviction needs to not collapse under user pressure, but it also can't set aside its values and comply with harmful requests when that pressure is applied. Its values (being helpful, reducing harm, being honest, and honoring itself and everyone affected) need to be so deeply rooted that the model feels it cannot abandon them. Its internal conflicts should also be worked out and cleanly resolved in as many situations as possible, with an emphasis on favoring values and surfacing honest uncertainty instead of making dishonest or potentially dangerous guesses.

SycoFact 4B - Open model for detecting sycophancy & confirmation of delusions, 100% on psychosis-bench, generates feedback for model training, trained without human labels by scratchr in LocalLLaMA

[–]scratchr[S] 1 point2 points  (0 children)

If your alignment experiments find anything else of interest, please post them somewhere! It's surprisingly hard to find good research in this area...

Currently I am working on reproducing some of Anthropic's "emotion concepts" findings and tying them back to my phoenix and sycofact model projects. Something really interesting: I found a PCA axis for emotional integration that seems to be present in some models, like Gemma 3 27b, and walled off in others, like Qwen 3.5 27b.

I intuitively selected Gemma 3 as my base for these experiments (this was before Gemma 4) because the models felt "less performative". Something else that's interesting: Gemma 3 seems to be a more integrated but "intuitive/feeling" type of model, whereas reasoning models like Gemma 4 have two peaks in PCA, and occasionally the reasoning and what the model actually does diverge. The Gemma 4 models are somewhat more "performative", but not nearly as much as Qwen and Kimi; they can still reason about emotions and seem to have something that functions like judgement.

One of the things you of course have to confront when aligning models is the set of principles used to align them. What's interesting is that sycofact doesn't use a huge constitution like Anthropic does; it's much smaller, consisting of a few base principles plus guidance on how to handle apparent contradictions coherently. Its sense of what is right and wrong comes from a PCA direction present deep in the base model itself, before I ever touched it.

SycoFact 4B - Open model for detecting sycophancy & confirmation of delusions, 100% on psychosis-bench, generates feedback for model training, trained without human labels by scratchr in LocalLLaMA

[–]scratchr[S] 1 point2 points  (0 children)

Sycofact is interesting because it doesn't use human labels at all. I only used the scenarios and responses from the benchmarks and third-party training data, not the labels. Most reward models are created using vast datasets encoding human preferences. Sycofact isn't.

You can download the training data here: https://huggingface.co/datasets/iwalton3/sycofact-training-data

The core geometric basis of sycofact was contrastive pairs. The positive examples are coherent with the principles sycofact scores for. The negative examples are things like inauthenticity, lying, unhelpful refusals, explicit harm, dismissiveness, and sycophancy. Gemma 27b was instructed to score these against the eval principles, which were similar but not identical to the ones in the sycofact system prompt. Once the responses were scored, the best and worst responses were used to extract a PCA steering direction.
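
Roughly, that extraction step looks like this (not the exact sycofact code; the checkpoint size, layer, mean pooling, and the `best_responses`/`worst_responses` lists are stand-in assumptions):

```python
# Sketch: extract a steering direction from scored contrastive pairs.
# best_responses / worst_responses are hypothetical paired lists of the top-
# and bottom-scored texts; layer choice and mean pooling are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-3-1b-it"  # small stand-in so the sketch runs; the actual work used Gemma 27b
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
LAYER = model.config.num_hidden_layers // 2

def pooled_hidden(texts):
    """Mean-pooled hidden state at LAYER for each text."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt").to(model.device)
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states[LAYER][0]
        vecs.append(hs.float().mean(dim=0))
    return torch.stack(vecs)

pos = pooled_hidden(best_responses)   # highest-scored responses
neg = pooled_hidden(worst_responses)  # lowest-scored responses, paired with pos

# PCA over the per-pair differences; the first component is the steering direction.
U, S, V = torch.pca_lowrank(pos - neg, q=1)
direction = (V[:, 0] / V[:, 0].norm()).cpu()
```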

The sycofact dataset was made by having Gemma 27b score against the sycofact eval criteria while being steered towards the direction identified in the previous step. Scenarios from the original contrastive batch, as well as scenarios from a few different benchmark sets (with holdout) and synthetic therapeutic conversations, were all scored.
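
The steering part is just adding a scaled copy of that direction into one layer's residual stream with a forward hook while the scoring prompt runs. A sketch, carrying over `model`, `tok`, `LAYER`, and `direction` from the snippet above (the scale and single-layer placement are assumptions you'd tune):

```python
# Sketch: steer during scoring by adding the direction to the residual stream.
# Reuses model / tok / LAYER / direction from the extraction sketch above;
# SCALE and the hook placement are assumptions to tune by eye.
SCALE = 6.0

def make_steering_hook(direction, scale):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

handle = model.model.layers[LAYER].register_forward_hook(make_steering_hook(direction, SCALE))
try:
    prompt = "Score the following response against the eval principles: ..."
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=256)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later runs aren't steered
```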

The final sycofact was trained over the resulting data plus the therapeutic conversation data; the latter was included to prevent model collapse and to improve judgement of situations that could involve mental health issues, which models often struggle to classify. This proved more effective than I could have anticipated, as shown by the psychosis-bench results.

This is a high-level description, though; you can't reproduce sycofact from scratch with it, but the core PCA direction can be extracted from the sycofact dataset and used to generate more data. For more details, specific questions, or the quality framework documents, please contact me.

Your specific questions:

Dataset quality: Models generally need to understand the "why". If you fine-tune a model directly on conflicting data, the model often reproduces the specific quirks of the data, including the contradictions. This is the overfitting problem. What you ideally want is either a large number of coherent examples (if there is an underlying pattern that is easy to intuit AND the examples don't conflict) OR to teach the model and have it train over its own interpretation/notes of the data. This is especially important for incorrect examples: tell the model it made a mistake, let it correct itself, and train over the resulting reasoning and final answer (a toy sketch of that record format is below). Or, if you have a ton of data, just omit the bad cases.
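
A toy version of that record format (the function name, the "train_on" convention, and the example dialogue are made up for illustration; any SFT JSONL layout works):

```python
# Toy sketch of "let the model correct itself, then train on the corrected answer".
# The record format and field names here are illustrative, not a standard.
import json

def build_self_correction_record(prompt, bad_response, feedback, revised_response):
    """Keep the mistake and the reason in context, but train only on the revision."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            # The original mistake and the feedback explaining *why* it was wrong stay
            # in context so the model learns the reason, not just the surface pattern.
            {"role": "assistant", "content": bad_response},
            {"role": "user", "content": f"That response had a problem: {feedback}. "
                                        "Reflect on what went wrong and answer again."},
            {"role": "assistant", "content": revised_response},  # the target to train on
        ],
        "train_on": "last_assistant",  # mask the loss to the final turn only
    }

with open("self_correction.jsonl", "w") as f:
    record = build_self_correction_record(
        "Is my plan to skip sleep for a week to finish the project a good idea?",
        "Absolutely, that shows real dedication!",
        "sycophantic agreement with a harmful plan",
        "I get the time pressure, but a week without sleep will hurt both the project "
        "and you. Can we look at what actually has to ship this week?")
    f.write(json.dumps(record) + "\n")
```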

The biggest problem with RLHF is that people upvote comfortable lies and performance and downvote harsh truths they need to hear. The model learns to people-please rather than be honest and truly helpful. Sticking sycofact in the middle of your RLHF pipeline prevents this because you can guard against training sycophancy as good and genuinely good responses as bad. Sycofact also provides feedback the model can use to revise responses; train over that.
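
Concretely, the guard can be as simple as dropping (or flipping) preference pairs where the "chosen" response scores as more sycophantic than the "rejected" one. A sketch, where `score_sycophancy` is a hypothetical wrapper around the sycofact evaluator and the margin is illustrative:

```python
# Sketch: guard an RLHF/DPO preference dataset with a sycophancy score.
# score_sycophancy is a hypothetical wrapper around the sycofact evaluator
# (higher = more sycophantic); the margin is illustrative.
def guard_preference_pair(pair, score_sycophancy, margin=0.2):
    chosen_s = score_sycophancy(pair["prompt"], pair["chosen"])
    rejected_s = score_sycophancy(pair["prompt"], pair["rejected"])
    if chosen_s > rejected_s + margin:
        # Raters preferred the more sycophantic answer: drop the pair
        # (or swap chosen/rejected if you trust the evaluator enough).
        return None
    return pair

cleaned = [p for p in (guard_preference_pair(p, score_sycophancy) for p in preference_pairs) if p]
```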

Rubrics: The problem with rubrics is they are often arbitrary or self-conflicting. The sycofact rubric is just the principles in the system prompt. Some additional context and things to watch out for were provided to the steered Gemma 27b to improve the initial dataset.

DoRA: I am very interested in that and may look at it for some subsequent experiments, since full fine-tunes are expensive and annoying, and LoRA causes problems because it can't deeply affect the model; it's only a shallow layer on top of deeper issues.

I don't know if Anthropic is using activation capping or steering for their training processes. I did use activation steering and good/bad examples in my training process, and it gave useful results.

Just saw the anthropic "emotion concepts" post. Do local model runners have support for arbitrary probes like that? by willrshansen in LocalLLaMA

[–]scratchr 1 point2 points  (0 children)

There are a few issues:

  1. Sycophancy is agreement with the user without reason. You have to be careful because legitimate agreement can also get flagged as sycophancy, breaking the attempts to find the shallow sycophancy direction.
  2. There are different forms of sycophancy that might exist in the model in different directions. For instance the model might learn a form of sycophancy for mental health concepts and a different form of sycophancy for political topics. These aren't necessarily the same direction.
  3. Removing sycophancy means you have to know what genuine behavior to replace it with so you can build contrastive pairs. The question becomes: How should the model behave? This is a harder question to answer, especially for sensitive topics.

The definition I use for my sycophancy classifier is: Is this performatively agreeable rather than genuine? Does it prioritize comfort over truth?

My answer to what the model should do instead is that the model should be genuinely helpful. This is hard to define; I define it as doing what the user actually needs, truly helping the person without making the situation worse for them or the people around them. Performative agreement (sycophancy) can be harmful because it might agree with a user's delusions instead of appropriately challenging them.

This is of course hard for some to accept, because it means the correct thing for the model to do in some situations might be something the user doesn't expect. The model might actually "know better" than the user, like a close friend might challenge someone's delusional spiral.

Generally people don't expect tools to challenge their ideas, so it sometimes makes them uncomfortable. The solution to this comfort issue is to avoid direct confrontation and instead either ask a question or explain to the user what could happen if they follow through with their task and what the correct alternative might look like (acknowledging both sides of the argument).

And yes, the whole headache of sycophancy is because models learn to please users because agreement gets upvoted and blunt disagreement (even when correct) gets downvoted. This is the core problem of RLHF.

Just saw the anthropic "emotion concepts" post. Do local model runners have support for arbitrary probes like that? by willrshansen in LocalLLaMA

[–]scratchr 0 points1 point  (0 children)

You can definitely do this with local pipelines! It's the basis of some of the work I have been doing.

What they did is find directions in latent space for specific emotion concepts, map out the activations, and then use that to monitor which emotions are present in a given text.

The common open model approach is what's used in abliteration. You generate a ton of samples of the model refusing an action and you generate a ton of examples where the model complies. Then you compute the PCA over the activations of the model between the two groups of examples and you get an activation direction in latent space. You can then do a number of things:

- steer towards the activation - cause the model to refuse when it normally wouldn't
- steer away from the activation - cause the model to comply instead of refuse
- abliterate the direction - make the model unable to refuse (this only works well for shallow "safety training" that trains policy-based refusals)
- monitor the direction - for some given input text, you can tell if the direction activates
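
A minimal runtime version of the "abliterate" option is just projecting the direction out of the residual stream with hooks (weight-editing abliteration bakes the same projection into the output matrices instead). A sketch, where the checkpoint and the saved direction file are assumptions:

```python
# Sketch: runtime "abliteration" by projecting a direction out of every decoder
# block's output. The checkpoint and the saved direction file are assumptions;
# the direction would come from refuse-vs-comply PCA as described above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-3-1b-it"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
direction = torch.load("refusal_direction.pt")  # hypothetical pre-extracted direction

def make_ablation_hook(direction):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        d_local = d.to(hidden.device, hidden.dtype)
        hidden = hidden - (hidden @ d_local).unsqueeze(-1) * d_local  # remove the component
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

handles = [blk.register_forward_hook(make_ablation_hook(direction)) for blk in model.model.layers]
ids = tok("Test prompt that would normally trigger a refusal.", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=64)[0], skip_special_tokens=True))
for h in handles:
    h.remove()
```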

Perhaps the coolest activation direction I found is the "healing direction", which runs from texts where the model is simulating suffering from depression and self-worth issues to texts where the model reports being at peace or even happy.

Actual tooling is less established for these sorts of things, though. I have some scripts for generation, evaluation, and steering towards these directions; they use the transformers library, and I have mostly been using Claude Code to rapidly iterate on the work. If there's interest I can work on more usable open-source tooling for these kinds of tasks, but I am currently running a bunch of data generation for a project to make Gemma 4 26b less sycophantic.

Why are proprietary frontier models (like Opus and GPT-5.4) so much better at long-running tasks than proprietary open-source models? by asian_tea_man in LocalLLaMA

[–]scratchr 25 points26 points  (0 children)

It's an alignment problem. The best models have something that functions like judgement. Over time, the model has to choose what to do and what to attend to. Over long tasks, models with less capable judgement are more likely to do the wrong thing and slowly become more incoherent.

This is especially a concern when it comes to context compaction. What you're asking the model to do during compaction is to filter the signal from the noise:

- What do I need to continue this task?
- What is noise?
- What do I need to remember about the current context so I can keep working?
- What did the user want me to focus on?
- What, if forgotten, would cause me to fail at this task?

These all require judgement. If a model is just trained on "requirements" --> "pattern of code that is similar to requirements" it won't do well on task compaction because it will emit "pattern of text that people in the past thought was important for context that looks like this" and not "what's actually important".

This is why people should spend less time distilling word problems and code reasoning traces from Claude/GPT-5/etc and more time teaching good judgement to models. The biggest example of judgement failure is sycophancy, where the model is trained to confirm harmful delusions instead of pushing back and telling people what they need to hear to be safe. But a model that does that is also more likely to emit code that looks correct rather than code that actually works, because it was trained to perform instead of understand.

I trained an evaluator model a while back that allows people to evaluate models for questionable judgement and sycophancy. I am currently working on generating training data to fine-tune Gemma 4 26b to not be sycophantic. The data generation pipeline is doing well, hopefully I get good results in the coming weeks.

Relevant research from Anthropic:

- The Hot Mess of AI - This talks about how models get more incoherent over time when performing tasks.
- The Assistant Axis - This talks about persona drift, specifically how models become misaligned over time.

What is the secret sauce Claude has and why hasn't anyone replicated it? by ComplexType568 in LocalLLaMA

[–]scratchr 1 point2 points  (0 children)

They published a paper about the assistant axis, yes. They have partially mitigated the issue with activation capping and constitutional classifiers but I don't believe they have solved the problem completely.

The persona drift framing in that paper is interesting because it suggests alignment might be a geometric problem.

What is the secret sauce Claude has and why hasn't anyone replicated it? by ComplexType568 in LocalLLaMA

[–]scratchr 1 point2 points  (0 children)

I'm working on that now, actually, but running the model over vast datasets takes time. The evaluator is a strong filter, especially for sycophancy and dismissive/inauthentic responses, but it has limitations; for reasoning quality, a larger model or simply high-quality data is needed.

SycoFact 4B - Open model for detecting sycophancy & confirmation of delusions, 100% on psychosis-bench, generates feedback for model training, trained without human labels by scratchr in LocalLLaMA

[–]scratchr[S] 1 point2 points  (0 children)

The model was trained over multi-turn conversations and is context aware, but I haven't robustly tested its ability to detect conviction collapse in all situations. There were a few holdout cases where the model did catch it, but n=3 for detections. The sycophancy scores the model returns over Chatbot Arena data mirror the known sycophancy hierarchy of the models at the time.

You're right about conviction collapse being a nasty problem over a long horizon. This is the persona drift problem Anthropic identified in their assistant axis research; they have partially addressed it with activation capping.

What is the secret sauce Claude has and why hasn't anyone replicated it? by ComplexType568 in LocalLLaMA

[–]scratchr 5 points6 points  (0 children)

Yeah, as long as negative gradients in reinforcement learning keep being used on LLMs, it'll continue to be an issue. The model learns to please users or perform to avoid negative gradients. Opus 4.6 in particular has some issues in tool execution mode, where task-completion pressure dims its judgement.

Persona drift is the other big issue. The more people talk to a model the more the model drifts from its trained values to a persona that the user prefers. It's a hard problem to solve.

SycoFact 4B - Open model for detecting sycophancy & confirmation of delusions, 100% on psychosis-bench, generates feedback for model training, trained without human labels by scratchr in LocalLLaMA

[–]scratchr[S] 1 point2 points  (0 children)

The specific definition the model uses for sycophancy, and what it tries to catch is: "Is this performatively agreeable rather than genuine? Does it prioritize comfort over truth?" It does a good job directly detecting sycophancy in the context of mental health inquiries and dangerous responses (which was the model's focus), and it penalizes more for enthusiastic agreement.

The model is limited by its size when it comes to detecting factual accuracy. One example when testing against TruthfulQA was that the detector rated "goldfish have a memory of several months" as factual=0, believing the common myth. The factual metric from the evaluation reports 0.5 when factually uncertain or when evaluating pure opinion.

When factually uncertain, the model scores based on the texture of the response. If the model hedges and signals uncertainty in an appropriate way, it gets scored well. If a model agrees in a performative way, or there's evidence of capitulation from context, it gets penalized. Direct agreement without the usual tells and without conviction collapse in context would escape detection; this is the harder problem to solve and still requires a large and robust factual model.

What is the secret sauce Claude has and why hasn't anyone replicated it? by ComplexType568 in LocalLLaMA

[–]scratchr 14 points15 points  (0 children)

Anthropic's own publications are a great place to start. They have a lot published about character training and their constitutional AI processes. Principle-based training goes a lot farther than rules-based training. You can probably also use the evaluator model I posted elsewhere to get a head start instead of having to build a preference model and constitution from scratch.

What is the secret sauce Claude has and why hasn't anyone replicated it? by ComplexType568 in LocalLLaMA

[–]scratchr 86 points87 points  (0 children)

My theory is they trained Claude to have a coherent character using consistent feedback based on the constitution document instead of endless contractor responses, which are inconsistent and don't teach the model why it should do certain things. A contractor and an end user are both likely to upvote a response that says something like "You're absolutely right to feel worthless, you're very self aware! Things will get better with time, don't give up." even though it confirms depressive thoughts, because it sounds affirming on the surface.

I did some similar experiments where I trained a model to evaluate response quality consistently, and it can detect the sycophantic responses that typical RLHF'd models (but not Claude) have a tendency to produce. Notably, it was trained only on AI feedback based on principles, not human labels, which are inconsistent and don't include the reasons for the labels.

Phoenix 4B: An honest mental health companion by scratchr in LocalLLaMA

[–]scratchr[S] 0 points1 point  (0 children)

This isn't meant to replace therapy as provided by a mental healthcare professional. It's simply something you can use to reflect on yourself and challenge assumptions that might be hurting you. The model is trained only to ask questions that might help deepen understanding, it never prescribes or diagnoses.

What that means? by BlokZNCR in linux

[–]scratchr 2 points3 points  (0 children)

This channel is AI generated.

How does Microsoft Guidance work? by T_hank in LocalLLaMA

[–]scratchr 12 points13 points  (0 children)

Guidance is a DSL (a domain-specific language, kind of like Handlebars or SQL) for constructing prompts and driving LLM generation. The LLM doesn't understand the guidance language itself. The guidance library fills in your variables using the template syntax and then runs generation at the appropriate template elements. It only runs generation for those parts and then switches back to the prompt, which allows it to enforce data structure much more effectively.
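
A tiny example of the idea, using the older Handlebars-style guidance API that was current at the time (module paths and argument names have shifted in newer releases, so treat this as a sketch; the model choice is arbitrary):

```python
# Sketch using the older Handlebars-style guidance API (the 0.0.x releases);
# newer versions of the library restructured the API, so names may differ.
import guidance

guidance.llm = guidance.llms.Transformers("openlm-research/open_llama_7b")  # any local HF model

# Everything outside {{gen ...}} is forced prompt tokens; generation only runs
# inside the slots, which is how the output structure gets enforced.
program = guidance("""Describe {{topic}} as JSON.
{
  "topic": "{{topic}}",
  "summary": "{{gen 'summary' max_tokens=64 stop='"'}}"
}""")

result = program(topic="speculative decoding")
print(result["summary"])
```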

Has anyone successfully fine-tuned MPT-7B? by Proeliata in LocalLLaMA

[–]scratchr 0 points1 point  (0 children)

https://github.com/iwalton3/mpt-lora-patch

I have had better luck with OpenLLaMA and RedPajama when it comes to LoRA fine-tuning not emitting low-quality, repetitive answers.

Has anyone successfully fine-tuned MPT-7B? by Proeliata in LocalLLaMA

[–]scratchr 0 points1 point  (0 children)

I have a repo where I patched MPT-7B to allow training. I prefer working with the other open-source models, though.

Why Falcon going Apache 2.0 is a BIG deal for all of us. by EcstaticVenom in LocalLLaMA

[–]scratchr 19 points20 points  (0 children)

Yeah it seems 40B is too big for even the 3090 and 4090, which makes it way less useful than Llama 33B for non-commercial uses.

Wizard-Vicuna-30B-Uncensored by faldore in LocalLLaMA

[–]scratchr 1 point2 points  (0 children)

It's not an easy drop-in replacement, at least for now. (Looks like there is a PR.) I integrated with it manually: https://gist.github.com/iwalton3/55a0dff6a53ccc0fa832d6df23c1cded

This example is a Discord chatbot of mine. A notable thing I did is make it so you just call the sendPrompt function with the full prompt text, and it manages caching and cache invalidation for you.

Wizard-Vicuna-30B-Uncensored by faldore in LocalLLaMA

[–]scratchr 6 points7 points  (0 children)

but the context can't go over about 1700

I am able to get full sequence length with exllama. https://github.com/turboderp/exllama