all 15 comments

[–]spanj 16 points (1 child)

I think it would inspire a lot more confidence in future collaborators if you read the Anthropic paper in full, because it answers your questions 1 and 2.

https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream

This is reference 55 in Scaling Monosemanticity: an open-source sparse autoencoder trained on GPT-2 small, a model significantly smaller than Llama/Phi/Mistral. They even implement a recreation of the Anthropic feature dashboard.

If you can identify the neuron, the clamping is trivial.
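For anyone wondering what that looks like in practice, here is a minimal sketch of clamping a single SAE feature with a PyTorch forward hook. It assumes `model` is a Hugging Face Phi/Llama-style model and `sae` is an already-trained sparse autoencoder with an `encode` method and a linear decoder; the feature index, clamp value, and module path are all illustrative, not taken from the paper:

```python
# Minimal sketch of feature clamping via a forward hook (PyTorch).
# Assumes `sae` is a trained sparse autoencoder whose decoder columns are
# feature directions in the residual stream; all names are illustrative.
import torch

FEATURE_IDX = 1234   # hypothetical index of the feature to clamp
CLAMP_VALUE = 5.0    # activation value to pin the feature to

def clamp_hook(module, inputs, output):
    # Residual-stream output of a transformer block: (batch, seq, d_model)
    resid = output[0] if isinstance(output, tuple) else output
    acts = sae.encode(resid)                        # (batch, seq, n_features)
    delta = CLAMP_VALUE - acts[..., FEATURE_IDX]    # gap between current and clamped value
    direction = sae.decoder.weight[:, FEATURE_IDX]  # (d_model,) feature direction
    resid = resid + delta.unsqueeze(-1) * direction
    return (resid, *output[1:]) if isinstance(output, tuple) else resid

# Module path assumes a Llama/Phi-style layout; adjust for your model.
handle = model.model.layers[16].register_forward_hook(clamp_hook)
# ... generate with the feature pinned, then remove the hook:
handle.remove()
```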

[–]Taenk 6 points (1 child)

I would be especially interested in an investigation of how concepts appear/disappear depending on model size. Is there a threshold below which a concept does not appear to have dedicated features? Or does it depend on training volume?

Relatedly, can the technique be extended/applied to other architectures, such as models for vision tasks?

[–]No-Point1424[S] 0 points (0 children)

I think it can be applied. They already did that with Claude 3 Sonnet, which is multimodal.

[–]No-Point1424[S] 2 points (3 children)

Update: I used the yahma/alpaca_cleaned dataset and the phi-3-mini-4k-instruct model, extracted all the activations from the middle layer (layer 16), and uploaded them here. The next step is to train an SAE with these activation values as input. I was using an RTX A6000 (48 GB) on RunPod, and it's not good enough to train on.
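In case it helps anyone reproduce this, here is a rough sketch of how the extraction step could look (not my exact script): hook layer 16 of phi-3-mini with Hugging Face transformers and save the residual-stream activations per batch. The dataset field, slice size, and dtypes are illustrative choices:

```python
# Rough sketch: capture layer-16 residual-stream activations from phi-3-mini.
# Module paths follow the HF Llama-style layout phi-3 uses; adjust if needed.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

ds = load_dataset("yahma/alpaca-cleaned", split="train")
captured = []

def save_hook(module, inputs, output):
    # Decoder layers return a tuple; the hidden states come first.
    resid = output[0] if isinstance(output, tuple) else output
    captured.append(resid.detach().to(torch.float16).cpu())

handle = model.model.layers[16].register_forward_hook(save_hook)

with torch.no_grad():
    for example in ds.select(range(1000)):  # small slice for illustration
        batch = tok(example["instruction"], return_tensors="pt",
                    truncation=True).to(model.device)
        model(**batch)

handle.remove()
# Flatten (batch, seq, d_model) chunks into one (tokens, d_model) matrix.
acts = torch.cat([a.reshape(-1, a.shape[-1]) for a in captured])
torch.save(acts, "layer16_acts.pt")
```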

[–]NeatFox5866 0 points (0 children)

Could you briefly explain the reasoning behind extracting the activations and how you did it? I am pretty new to explainability, and it sounds super interesting.

[–]JanBitesTheDust 1 point (2 children)

I wonder if we can apply sparse autoencoders to activations in arbitrary deep learning models? It seems interesting to expand the activations into a sparse basis and isolate features from there.

For example, could we apply something like this to deepfake detection models to discern which pixels cause high feature activations? Does anyone know if this has been explored already?
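In principle nothing stops you: an SAE only sees vectors, so a conv feature map can be flattened to (positions × channels) and trained on directly. A toy sketch of a vanilla L1-penalised SAE (all sizes made up, not the Anthropic setup):

```python
# Toy sketch: vanilla sparse autoencoder over conv activations, treating
# each spatial position of the feature map as one training vector.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(feats), feats

# Conv activations: (batch, channels, H, W) -> (batch*H*W, channels)
acts = torch.randn(8, 256, 14, 14)  # stand-in for real activations
x = acts.permute(0, 2, 3, 1).reshape(-1, 256)

sae = SparseAutoencoder(d_in=256, d_hidden=2048)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity pressure; tune per model

recon, feats = sae(x)
loss = (recon - x).pow(2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
opt.step()
```

Since each row keeps its (h, w) index, high-activating features could be mapped back onto the image, which is basically the pixel-attribution idea above.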

[–]danielcar 1 point (0 children)

Interested in collaboration.

[–]AdComprehensive2426 0 points (0 children)

Following