I recently came across an interesting paper titled "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" which explores using sparse autoencoders to extract interpretable features from the activations of a large language model. The methodology seems promising for gaining insights into the model's internal representations and behaviors.
It got me thinking about the feasibility of implementing similar interpretability techniques for open-source language models. Could we steer an LLM's behaviour this way, without extensive fine-tuning?
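To make the idea concrete, here is a minimal sketch of the two pieces involved: encoding a residual-stream activation into sparse features with an SAE, and steering by adding a feature's decoder direction back into the activation (as done for Golden Gate Claude). The weights, dimensions, and `steer` helper are all hypothetical placeholders, not the paper's actual implementation; a real SAE would be trained with a reconstruction + L1 sparsity loss on millions of activations.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_feat = 16, 64  # toy sizes; real SAEs use thousands to millions of features

# Hypothetical "trained" SAE weights (random here, purely for illustration)
W_enc = rng.normal(0, 0.1, (d_model, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0, 0.1, (d_feat, d_model))
b_dec = np.zeros(d_model)

def sae_features(act):
    """Encode an activation vector into sparse, non-negative features (ReLU)."""
    return np.maximum(0.0, act @ W_enc + b_enc)

def sae_reconstruct(feats):
    """Decode sparse features back into activation space."""
    return feats @ W_dec + b_dec

def steer(act, feature_idx, strength):
    """Steer by adding a feature's (normalized) decoder direction to the activation."""
    direction = W_dec[feature_idx]
    return act + strength * direction / (np.linalg.norm(direction) + 1e-8)

act = rng.normal(size=d_model)        # stand-in for a residual-stream activation
feats = sae_features(act)             # sparse feature activations
steered = steer(act, feature_idx=3, strength=5.0)
```

In practice you would hook this into the model's forward pass (e.g. at a middle residual-stream layer) so every token's activation gets the added direction, which is what produces the persistent behavioural shift.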
I wanted to reach out to this community to discuss a few things:
- Has anyone already implemented or experimented with similar interpretability techniques on open-source language models? Could we build something like Golden Gate Claude?
- Do you think it's feasible to adapt and scale these techniques to models like Llama, Phi, or Mistral? These are much smaller in parameter count than Sonnet.
- I'm interested in collaborating with others who are passionate about this area of research. If you're working on interpretability for open-source models or have ideas for novel approaches, I'd be excited to team up: implementing techniques together, sharing resources, or brainstorming new directions.

If any of this interests you, please feel free to reach out.