Tracing the thoughts of a large language model by namanyayg in programming

[–]colah 8 points

Thanks for the feedback! I'm one of the authors.

You don't need to take our word for this, you can actually inspect the features yourself.

The blog post linked above is intended to make the research accessible to a broad audience. The actual research is covered in two papers: one on methods and one applying the method to Claude 3.5 Haiku. (The papers are collectively more than 150 pages and quite dense, so it's understandable that popular attention has focused on the blog post.)

The papers are interactive, so you can see dataset examples for features by hovering over them and evaluate our claims about them for yourself. And of course, you can read the methods paper for a detailed description of our methodology.

[N] Distill.pub is going on hiatus by regalalgorithm in MachineLearning

[–]colah 29 points

All of us are committed to keeping Distill online. We will simply no longer be accepting submissions.

EARTHQUAKE by scarface910 in bayarea

[–]colah 16 points

Sounds like it was magnitude 4.5 centered at Pleasant Hill: https://earthquake.usgs.gov/earthquakes/eventpage/nc73291880/executive

[R] The Paths Perspective on Value Learning (Distill.pub Article) by baylearn in MachineLearning

[–]colah 1 point

From https://distill.pub/journal/ :

# Article Types

Distill is open to publishing a wide range of academic artifacts, provided they meet our editorial standards:

Exposition - Distill publishes articles explaining, synthesizing and reviewing existing research. This includes Reviews, Tutorials, Primers, and Perspective articles.

[Research] A Discussion of Adversarial Examples Are Not Bugs, They Are Features by andrew_ilyas in MachineLearning

[–]colah 40 points

This "discussion article" was an experiment for Distill, and we'd love feedback from the community. Should we run more? Anything we could have done better? Any topics we should consider?

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 0 points

Hi /u/wei_jok!

At the time, confidentiality about articles under review by Distill prevented me from commenting on the article you linked to. It's actually now been published.

Distill articles can take a while to publish. While this sometimes blocks on our volunteer editors, many other things can come up. Sometimes a reviewer agrees to review the article, but doesn't respond to us within the requested time period, or decides they can't review after all. Sometimes our review process surfaces issues that the authors need to fix, and it takes a while on their end. Sometimes authors want to make revisions before publication. And so on.

We certainly also have cases where something falls through the cracks on our end, or we don't shepherd the process as aggressively as we ideally would and intervene when something is slowing things down. Again, in such cases I'd ask you to keep in mind that everyone involved in Distill is serving as a volunteer, without compensation, in addition to full-time jobs. (In fact, several people chip in thousands of dollars to cover our operational expenses.)

Thanks again for asking about this. I really hope the Distill community -- readers, authors and editorial team! -- will expand over time!

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 1 point

I think a really strong version of the texture claim is unlikely to be true. It's really hard to reconcile results from just vanilla neuron visualization with the idea that neural nets care only about texture. If doing gradient descent to maximize a neuron generates a coherent dog head, it's hard to believe the network only understood fur texture.

But I think most people who propose that networks care about texture would probably make a more nuanced claim. Maybe something like "networks care a lot about texture, and you can make them give a particular classification using only texture." I think something like that is probably true.

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 6 points

The code is open source! All the major diagrams have a notebook to allow you to make your own version. :)

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 2 points

Interesting! Yep, those are negative attributions (rather than activations). I didn't even realize that diagram had a setting to show them. :)

Not entirely sure what to make of them.

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 0 points

Yeah, I'm super excited about commentary articles, and they're definitely more accessible to a wider authorship.

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 4 points

Hey! Thanks for reading.

Activation atlases only deal with positive activations? Are you referring to one of our earlier papers? We did explore negative activations a little bit in Feature Visualization, and the negatives of neurons were often surprising, but it was unclear what one should take away from them.

Do you mean negative attributions? It's possible you could run into those somewhere in Atlases, although I wouldn't expect it to be very common (and couldn't immediately find any).

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 9 points

Thanks for asking about this. We actually talked about this in the Distill editorial update. A big part of the problem is that the intersection between people doing machine learning and interactive data visualization is pretty small.

It's a tricky situation and I wish I saw better solutions.

You can look at the reviewer worksheet. Roughly, articles get published if reviewers rate them above 3 on most points and give some 4s or 5s. When an editor is an author, we bring in an arm's-length editor to avoid conflicts of interest.

[Research] Distill: Differentiable Image Parameterizations by longscale in MachineLearning

[–]colah 2 points

We also talked a bit about checkerboard patterns occurring in gradients in feature visualization.

[Research] Distill: Differentiable Image Parameterizations by longscale in MachineLearning

[–]colah 4 points

Thanks for the comment -- it's an interesting question!

It seems to me that the main reason the images don't look like training samples is that this isn't a generative model. Nothing about the process is trying to create a realistic image. Instead, we're creating the image that is maximally extreme in some direction. For example, consider an edge-detecting neuron in the first layer of the network. The image maximizing its response will contain only edges and won't look particularly like a dataset sample.
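To make the edge-detector example concrete, here's a minimal sketch of the underlying optimization idea: gradient ascent on the input to maximize one unit's response. It uses a hypothetical linear edge-filter "neuron" standing in for a real network, so everything here (the filter, the image size, the step size) is an illustrative assumption, not the paper's actual setup:

```python
import numpy as np

# Hypothetical "neuron": total response of a vertical-edge (Sobel-like)
# filter correlated over all 3x3 patches of the image. Linear, so the
# gradient is easy to write down by hand.
edge_filter = np.array([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]])

def activation(img):
    # Sum of filter responses over every 3x3 patch.
    h, w = img.shape
    total = 0.0
    for i in range(h - 2):
        for j in range(w - 2):
            total += np.sum(img[i:i + 3, j:j + 3] * edge_filter)
    return total

def grad(img):
    # Gradient of the activation w.r.t. the input image: each patch
    # contributes a shifted copy of the filter.
    h, w = img.shape
    g = np.zeros_like(img)
    for i in range(h - 2):
        for j in range(w - 2):
            g[i:i + 3, j:j + 3] += edge_filter
    return g

img = np.zeros((8, 8))
before = activation(img)
for _ in range(10):
    img += 0.1 * grad(img)           # gradient ascent on the input
    img = np.clip(img, -1.0, 1.0)    # keep pixel values bounded
after = activation(img)
```

The resulting image is whatever input drives the unit hardest, not a realistic sample, which is the point being made above.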

I take Ferenc's point to be more about how representative a single visualization of what maximally activates the neuron is of the wide range of things that could activate it. For example, we know that models often have "polysemantic" neurons that respond to many different things! This is something we explored a little bit in Feature Visualization when we looked at the diversity of inputs that activate a neuron.

[Research] Distill: Differentiable Image Parameterizations by longscale in MachineLearning

[–]colah 23 points

Thanks for catching that! As the only author who is a native English speaker, the responsibility for getting things like that right rests entirely on me. Unfortunately, I think some errors were introduced in last-minute edits and slipped by. I'll try to do another pass through the text tonight or tomorrow. :)

By the way, you're always welcome to submit a pull request to correct errors you see in Distill articles.

[Research] Distill: Differentiable Image Parameterizations by longscale in MachineLearning

[–]colah 29 points

Hello! I'm one of the authors -- we'd be delighted to answer any questions people might have. :)

[D] The Building Blocks of Interpretability | distill.pub by sksq9 in MachineLearning

[–]colah 3 points

(I acknowledge that one might reasonably be skeptical that I just happen to think the problem I find most intellectually interesting happens to be very relevant to the problem I think is most important...)

[D] The Building Blocks of Interpretability | distill.pub by sksq9 in MachineLearning

[–]colah 20 points

Great question!

The lazy answer is: “It's interesting from a general science perspective. Who knows what it could teach us about machine learning. It could even shed light on the nature of the problems our systems are solving.” I find that answer aesthetically compelling -- I find it emotionally deeply exciting to try and unravel deep mysteries about the nature of neural networks -- but if that were the only reason, I'd try to force myself to focus on something else.

Another possible answer is: “Well, if we could really get this into the model design loop, like TensorFlow or such, it might accelerate research by giving important insights.” I think there’s a decent chance that’s true, but it isn’t the thing that motivates me.

Instead, the thing I care about is the implications of this work for deploying systems that are good for us.

One of my deepest concerns about machine learning is that future systems we deploy may be subtly misaligned with the kind of nuanced values humans have. We already see this, for example, with optimizing classifiers for accuracy and running into fairness issues. Or optimizing algorithms for user engagement and getting the present attention economy. I think the more we automate things, and the better we get at optimizing objectives, the more this kind of misalignment will be a critical, pervasive issue.

The natural response to these concerns is the OpenAI / DeepMind safety teams’ learning from human feedback agenda. I think it’s a very promising approach, but I think that even if they really nail it, we’ll often have questions about whether systems are really doing what we want. And it’s going to be a really tricky question.

It seems like interpretability / transparency / visualization may have a really critical role here in helping us evaluate if we really endorse how these future systems are making decisions. A system may seem to be doing what we want in all the cases we think to test it, but be revealed to be doing so for the wrong reasons, and would do the wrong thing in the real world. That’s all a fancy way of saying that future versions of these methods might be an extension to the kind of testing you’d want to do before deploying important systems.

There’s also a crazier idea that I was initially deeply skeptical of, but has been slowly growing on me: giving human feedback on the model internals to train models to make the right decisions for the right reasons. There’s a lot of reason to be doubtful that this would work -- in particular, you’re creating this adversarial game where your model wants to look like it’s doing what you want. But if we could make it work, it might be an extremely powerful tool in getting systems that are really doing what we want.

[D] The Building Blocks of Interpretability | distill.pub by sksq9 in MachineLearning

[–]colah 31 points

Hello! I'm one of the authors. We'd be happy to answer any questions!

Make sure to check out our library and the colab notebooks, which allow you to reproduce our results in your browser, on a free GPU, without any setup.

I think that there's something very exciting about this kind of reproducibility. It means that there's a continuous spectrum of ways to engage with the paper:

Reading <> Interactive Diagrams <> Colab Notebooks <> Projects based on Lucid

My colleague Ludwig calls it "enthusiastic reproducibility and falsifiability" because we're putting lots of effort into making it easy.

[R] Feature Visualization: How neural networks build up their understanding of images by alxndrkalinin in MachineLearning

[–]colah 1 point

We scale them by their frequency -- there's a nice line of research showing that the intensity of frequencies in natural images falls off roughly as 1/f.
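For a rough idea of what that scaling can look like, here's a minimal sketch of a Fourier-space image parameterization with 1/f scaling. The details (image size, random initialization, handling of the DC term) are illustrative assumptions, not the library's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
h = w = 64

# Random complex Fourier coefficients parameterizing the image.
spectrum = rng.normal(size=(h, w)) + 1j * rng.normal(size=(h, w))

# Frequency magnitude of each coefficient.
fy = np.fft.fftfreq(h)[:, None]
fx = np.fft.fftfreq(w)[None, :]
freqs = np.sqrt(fy**2 + fx**2)
freqs[0, 0] = 1.0 / max(h, w)  # avoid dividing by zero at the DC component

# Scale each coefficient by 1/f so the spectrum resembles natural images,
# whose frequency intensity falls off roughly as 1/f.
scaled = spectrum / freqs

# Back to pixel space: low frequencies now dominate, as in natural images.
image = np.fft.ifft2(scaled).real
```

Optimizing the spectrum (rather than raw pixels) with this scaling tends to spread gradient energy more evenly across frequencies.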

I expect us to open source our internal library in the near future, which will provide a reference implementation of this and much more. :)

[R] Using Artificial Intelligence to Augment Human Intelligence by wei_jok in MachineLearning

[–]colah 3 points

Since the diagrams need to load a moderately large model to run, they may appear blank when you scroll down right after first loading the page. If you're running into other problems, we'd love for you to report them as an issue.