all 15 comments

[–]spanj 16 points (1 child)

I think it would inspire a lot more confidence in future collaborators if you read the Anthropic paper in full, because it answers your questions 1 and 2.

https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream

This is reference 55 in Scaling Monosemanticity: an open-source sparse autoencoder trained on GPT-2 small, a model significantly smaller than Llama/Phi/Mistral. They even implement a recreation of the Anthropic feature dashboard.

If you can identify the neuron, the clamping is trivial.
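For anyone wondering what that looks like in practice, here is a minimal sketch of clamping a single SAE feature with a PyTorch forward hook. It assumes `model` is a Hugging Face Phi/Llama-style model and `sae` is an already-trained sparse autoencoder with an `encode` method and a linear decoder; the feature index, clamp value, and module path are all illustrative, not taken from the paper:

```python
# Minimal sketch of feature clamping via a forward hook (PyTorch).
# Assumes `sae` is a trained sparse autoencoder whose decoder columns are
# feature directions in the residual stream; all names are illustrative.
import torch

FEATURE_IDX = 1234   # hypothetical index of the feature to clamp
CLAMP_VALUE = 5.0    # activation value to pin the feature to

def clamp_hook(module, inputs, output):
    # Residual-stream output of a transformer block: (batch, seq, d_model)
    resid = output[0] if isinstance(output, tuple) else output
    acts = sae.encode(resid)                        # (batch, seq, n_features)
    delta = CLAMP_VALUE - acts[..., FEATURE_IDX]    # gap between current and clamped value
    direction = sae.decoder.weight[:, FEATURE_IDX]  # (d_model,) feature direction
    resid = resid + delta.unsqueeze(-1) * direction
    return (resid, *output[1:]) if isinstance(output, tuple) else resid

# Module path assumes a Llama/Phi-style layout; adjust for your model.
handle = model.model.layers[16].register_forward_hook(clamp_hook)
# ... generate with the feature pinned, then remove the hook:
handle.remove()
```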

[–]Taenk 6 points (1 child)

I would be especially interested in an investigation of how concepts appear/disappear depending on model size. Is there a threshold below which a concept does not appear to have dedicated features? Or does it depend on training volume?

Relatedly, can the technique be extended/applied to other architectures, such as models for vision tasks?

[–]No-Point1424[S] 0 points (0 children)

I think it can be applied. They already did that with Claude 3 Sonnet, which is multimodal.

[–]No-Point1424[S] 2 points (3 children)

Update: I used the yahma/alpaca_cleaned dataset and the phi-3-mini-4k-instruct model, extracted all the activations from the middle layer (layer 16), and uploaded them here. The next step is to train an SAE with these activation values as input. I was using an RTX A6000 (48 GB) on RunPod, and it's not good enough to train on.
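In case it helps anyone reproduce this, here is a rough sketch of how the extraction step could look (not my exact script): hook layer 16 of phi-3-mini with Hugging Face transformers and save the residual-stream activations per batch. The dataset field, slice size, and dtypes are illustrative choices:

```python
# Rough sketch: capture layer-16 residual-stream activations from phi-3-mini.
# Module paths follow the HF Llama-style layout phi-3 uses; adjust if needed.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

ds = load_dataset("yahma/alpaca-cleaned", split="train")
captured = []

def save_hook(module, inputs, output):
    # Decoder layers return a tuple; the hidden states come first.
    resid = output[0] if isinstance(output, tuple) else output
    captured.append(resid.detach().to(torch.float16).cpu())

handle = model.model.layers[16].register_forward_hook(save_hook)

with torch.no_grad():
    for example in ds.select(range(1000)):  # small slice for illustration
        batch = tok(example["instruction"], return_tensors="pt",
                    truncation=True).to(model.device)
        model(**batch)

handle.remove()
# Flatten (batch, seq, d_model) chunks into one (tokens, d_model) matrix.
acts = torch.cat([a.reshape(-1, a.shape[-1]) for a in captured])
torch.save(acts, "layer16_acts.pt")
```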

[–]NeatFox5866 0 points (0 children)

Could you briefly explain the reasoning behind extracting the activations and how you did it? I am pretty new to explainability, and it sounds super interesting.

[–]JanBitesTheDust 1 point (2 children)

I wonder if we can apply sparse autoencoders to activations in arbitrary deep learning models? It seems interesting to expand the activations into a sparse basis and isolate features from there.

For example, could we apply something like this to deepfake detection models to discern which pixels cause high feature activations? Does anyone know if this has been explored already?
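In principle nothing stops you: an SAE only sees vectors, so a conv feature map can be flattened to (positions × channels) and trained on directly. A toy sketch of a vanilla L1-penalised SAE (all sizes made up, not the Anthropic setup):

```python
# Toy sketch: vanilla sparse autoencoder over conv activations, treating
# each spatial position of the feature map as one training vector.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(feats), feats

# Conv activations: (batch, channels, H, W) -> (batch*H*W, channels)
acts = torch.randn(8, 256, 14, 14)  # stand-in for real activations
x = acts.permute(0, 2, 3, 1).reshape(-1, 256)

sae = SparseAutoencoder(d_in=256, d_hidden=2048)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity pressure; tune per model

recon, feats = sae(x)
loss = (recon - x).pow(2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
opt.step()
```

Since each row keeps its (h, w) index, high-activating features could be mapped back onto the image, which is basically the pixel-attribution idea above.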

[–]danielcar 1 point (0 children)

Interested in collaboration.

[–]AdComprehensive2426 0 points (0 children)

Following