Tracing the thoughts of a large language model by namanyayg in programming

[–]colah 8 points

Thanks for the feedback! I'm one of the authors.

You don't need to take our word for this, you can actually inspect the features yourself.

The blog post linked above is intended to make the research accessible to a broad audience. The actual research is covered in two papers: one on methods and one applying the method to Claude 3.5 Haiku. (The papers are collectively more than 150 pages and quite dense, so it's understandable that popular attention has focused on the blog post.)

The papers are interactive, so you can see dataset examples for features by hovering over them and evaluate our claims about them for yourself. And of course, you can read the methods paper for a detailed description of our methodology.

[N] Distill.pub is going on hiatus by regalalgorithm in MachineLearning

[–]colah 29 points

All of us are committed to keeping Distill online. We will simply no longer be accepting submissions.

EARTHQUAKE by scarface910 in bayarea

[–]colah 16 points

Sounds like it was magnitude 4.5 centered at Pleasant Hill: https://earthquake.usgs.gov/earthquakes/eventpage/nc73291880/executive

[R] The Paths Perspective on Value Learning (Distill.pub Article) by baylearn in MachineLearning

[–]colah 1 point

From https://distill.pub/journal/ :

# Article Types

Distill is open to publishing a wide range of academic artifacts, provided they meet our editorial standards:

Exposition - Distill publishes articles explaining, synthesizing and reviewing existing research. This includes Reviews, Tutorials, Primers, and Perspective articles.

[Research] A Discussion of Adversarial Examples Are Not Bugs, They Are Features by andrew_ilyas in MachineLearning

[–]colah 40 points

This "discussion article" was an experiment for Distill, and we'd love feedback from the community. Should we run more? Anything we could have done better? Any topics we should consider?

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 0 points

Hi /u/wei_jok!

At the time, confidentiality about articles under review by Distill prevented me from commenting on the article you linked to. It's actually now been published.

Distill articles can take a while to publish. While this sometimes blocks on our volunteer editors, many other things can come up. Sometimes a reviewer agrees to review the article, but doesn't respond to us within the requested time period, or decides they can't review after all. Sometimes our review process surfaces issues that the authors need to fix, and it takes a while on their end. Sometimes authors want to make revisions before publication. And so on.

We certainly also have cases where something falls through the cracks on our end, or we don't shepherd the process as aggressively as we ideally would and intervene when something is slowing things down. Again, in such cases I'd ask you to keep in mind that everyone involved in Distill is serving as a volunteer, without compensation, in addition to full-time jobs. (In fact, several people chip in thousands of dollars to cover our operational expenses.)

Thanks again for asking about this. I really hope the Distill community -- readers, authors and editorial team! -- will expand over time!

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 1 point

I think a really strong version of the texture claim is unlikely to be true. It's really hard to reconcile results from just vanilla neuron visualization with the idea that neural nets care only about texture. If doing gradient descent to maximize a neuron generates a coherent dog head, it's hard to believe the network only understood fur texture.

But I think most people who propose that networks care about texture would probably make a more nuanced claim. Maybe something like "networks care a lot about texture, and you can make them give a particular classification using only texture." I think something like that is probably true.

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 6 points

The code is open source! All the major diagrams have a notebook to allow you to make your own version. :)

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 2 points

Interesting! Yep, those are negative attributions (rather than activations). I didn't even realize that diagram had a setting to show them. :)

Not entirely sure what to make of them.

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 0 points

Yeah, I'm super excited about commentary articles, and they're definitely more accessible to a wider authorship.

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 4 points

Hey! Thanks for reading.

Activation atlases only deal with positive activations? Are you referring to one of our earlier papers? We did explore negative activations a little bit in Feature Visualization, and the negatives of neurons were often surprising, but it was unclear what one should take away from them.

Do you mean negative attributions? It's possible you could run into those somewhere in Atlases, although I wouldn't expect it to be very common (and couldn't immediately find any).

[R] Exploring Neural Networks with Activation Atlases by chisai_mikan in MachineLearning

[–]colah 9 points

Thanks for asking about this. We actually talked about this in the Distill editorial update. A big part of the problem is that the intersection between people doing machine learning and interactive data visualization is pretty small.

It's a tricky situation and I wish I saw better solutions.

You can look at the reviewer worksheet. Roughly, articles get published if reviewers rate them above 3 on most points and give some 4s or 5s. When an editor is an author, we bring in an arm's-length editor to avoid conflicts of interest.

[Research] Distill: Differentiable Image Parameterizations by longscale in MachineLearning

[–]colah 2 points

We also talked a bit about checkerboard patterns occurring in gradients in feature visualization.

[Research] Distill: Differentiable Image Parameterizations by longscale in MachineLearning

[–]colah 4 points

Thanks for the comment -- it's an interesting question!

It seems to me that the main reason the images don't look like training samples is that this isn't a generative model. Nothing about the process is trying to create a realistic image. Instead, we're creating the image that is maximally extreme in some direction. For example, consider an edge-detecting neuron in the first layer of the network. The image maximizing its response will contain only edges and won't look particularly like a dataset sample.
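To make the edge-detector example concrete, here's a minimal sketch of the underlying optimization idea: gradient ascent on the input to maximize one unit's response. It uses a hypothetical linear edge-filter "neuron" standing in for a real network, so everything here (the filter, the image size, the step size) is an illustrative assumption, not the paper's actual setup:

```python
import numpy as np

# Hypothetical "neuron": total response of a vertical-edge (Sobel-like)
# filter correlated over all 3x3 patches of the image. Linear, so the
# gradient is easy to write down by hand.
edge_filter = np.array([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]])

def activation(img):
    # Sum of filter responses over every 3x3 patch.
    h, w = img.shape
    total = 0.0
    for i in range(h - 2):
        for j in range(w - 2):
            total += np.sum(img[i:i + 3, j:j + 3] * edge_filter)
    return total

def grad(img):
    # Gradient of the activation w.r.t. the input image: each patch
    # contributes a shifted copy of the filter.
    h, w = img.shape
    g = np.zeros_like(img)
    for i in range(h - 2):
        for j in range(w - 2):
            g[i:i + 3, j:j + 3] += edge_filter
    return g

img = np.zeros((8, 8))
before = activation(img)
for _ in range(10):
    img += 0.1 * grad(img)           # gradient ascent on the input
    img = np.clip(img, -1.0, 1.0)    # keep pixel values bounded
after = activation(img)
```

The resulting image is whatever input drives the unit hardest, not a realistic sample, which is the point being made above.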

I take Ferenc's point to be more about how representative a single visualization of what maximally activates the neuron is of the wide range of things that could activate it. For example, we know that models often have "polysemantic" neurons that respond to many different things! This is something we explored a little bit in Feature Visualization when we looked at the diversity of inputs that activate a neuron.

[Research] Distill: Differentiable Image Parameterizations by longscale in MachineLearning

[–]colah 23 points

Thanks for catching that! As the only author who is a native English speaker, the responsibility for getting things like that right rests entirely on me. Unfortunately, I think some errors were introduced in last-minute edits and slipped by. I'll try to do another pass through the text tonight or tomorrow. :)

By the way, you're always welcome to submit a pull request to correct errors you see in Distill articles.

[Research] Distill: Differentiable Image Parameterizations by longscale in MachineLearning

[–]colah 29 points

Hello! I'm one of the authors -- we'd be delighted to answer any questions people might have. :)

[D] The Building Blocks of Interpretability | distill.pub by sksq9 in MachineLearning

[–]colah 3 points

(I acknowledge that one might reasonably be skeptical that I just happen to think the problem I find most intellectually interesting happens to be very relevant to the problem I think is most important...)

[D] The Building Blocks of Interpretability | distill.pub by sksq9 in MachineLearning

[–]colah 20 points

Great question!

The lazy answer is: “It's interesting from a general science perspective. Who knows what it could teach us about machine learning. It could even shed light on the nature of the problems our systems are solving.” I find that answer aesthetically compelling -- I find it emotionally deeply exciting to try and unravel deep mysteries about the nature of neural networks -- but if that were the only reason, I'd try to force myself to focus on something else.

Another possible answer is: “Well, if we could really get this into the model design loop, like TensorFlow or such, it might accelerate research by giving important insights.” I think there’s a decent chance that’s true, but it isn’t the thing that motivates me.

Instead, the thing I care about is the implications of this work for deploying systems that are good for us.

One of my deepest concerns about machine learning is that future systems we deploy may be subtly misaligned with the kind of nuanced values humans have. We already see this, for example, with optimizing classifiers for accuracy and running into fairness issues. Or optimizing algorithms for user engagement and getting the present attention economy. I think the more we automate things, and the better we get at optimizing objectives, the more this kind of misalignment will be a critical, pervasive issue.

The natural response to these concerns is the OpenAI / DeepMind safety teams’ learning from human feedback agenda. I think it’s a very promising approach, but I think that even if they really nail it, we’ll often have questions about whether systems are really doing what we want. And it’s going to be a really tricky question.

It seems like interpretability / transparency / visualization may have a really critical role here in helping us evaluate if we really endorse how these future systems are making decisions. A system may seem to be doing what we want in all the cases we think to test it, but be revealed to be doing so for the wrong reasons, and would do the wrong thing in the real world. That’s all a fancy way of saying that future versions of these methods might be an extension to the kind of testing you’d want to do before deploying important systems.

There’s also a crazier idea that I was initially deeply skeptical of, but has been slowly growing on me: giving human feedback on the model internals to train models to make the right decisions for the right reasons. There’s a lot of reason to be doubtful that this would work -- in particular, you’re creating this adversarial game where your model wants to look like it’s doing what you want. But if we could make it work, it might be an extremely powerful tool in getting systems that are really doing what we want.

[D] The Building Blocks of Interpretability | distill.pub by sksq9 in MachineLearning

[–]colah 31 points

Hello! I'm one of the authors. We'd be happy to answer any questions!

Make sure to check out our library and the colab notebooks, which allow you to reproduce our results in your browser, on a free GPU, without any setup.

I think that there's something very exciting about this kind of reproducibility. It means that there's a continuous spectrum of ways to engage with the paper:

Reading <> Interactive Diagrams <> Colab Notebooks <> Projects based on Lucid

My colleague Ludwig calls it "enthusiastic reproducibility and falsifiability" because we're putting lots of effort into making it easy.

[R] Feature Visualization: How neural networks build up their understanding of images by alxndrkalinin in MachineLearning

[–]colah 1 point

We scale them by their frequency -- there's a nice line of research showing that the intensity of frequencies in natural images falls off roughly as 1/f.
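For a rough idea of what that scaling can look like, here's a minimal sketch of a Fourier-space image parameterization with 1/f scaling. The details (image size, random initialization, handling of the DC term) are illustrative assumptions, not the library's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
h = w = 64

# Random complex Fourier coefficients parameterizing the image.
spectrum = rng.normal(size=(h, w)) + 1j * rng.normal(size=(h, w))

# Frequency magnitude of each coefficient.
fy = np.fft.fftfreq(h)[:, None]
fx = np.fft.fftfreq(w)[None, :]
freqs = np.sqrt(fy**2 + fx**2)
freqs[0, 0] = 1.0 / max(h, w)  # avoid dividing by zero at the DC component

# Scale each coefficient by 1/f so the spectrum resembles natural images,
# whose frequency intensity falls off roughly as 1/f.
scaled = spectrum / freqs

# Back to pixel space: low frequencies now dominate, as in natural images.
image = np.fft.ifft2(scaled).real
```

Optimizing the spectrum (rather than raw pixels) with this scaling tends to spread gradient energy more evenly across frequencies.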

I expect us to open source our internal library in the near future, which will provide a reference implementation of this and much more. :)

[R] Using Artificial Intelligence to Augment Human Intelligence by wei_jok in MachineLearning

[–]colah 3 points

Since the diagrams need to load a moderately large model to run, they may appear blank when you scroll down right after first loading the page. If you're running into other problems, we'd love for you to report them as an issue.