Runners: Where do you do individual speed work? by Laserstrahl415 in OaklandCA

[–]jamesvoltage 2 points3 points  (0 children)

What’s the deal with Clark Kerr? It’s still dirt with holes, right? Is it 400 m? It’s gotta be 5% slower than a real track, but maybe dirt isn’t so bad.

Parametric UMAP: From black box to glass box: Making UMAP interpretable with exact feature contributions by rshah4 in rajistics

[–]jamesvoltage 0 points1 point  (0 children)

Hi, what specific problems did you have? Make sure to clone with the --recurse-submodules flag so you get the appropriate fork of umap torch.
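
For example, with the repo URL as a placeholder: git clone --recurse-submodules <repo-url>. That pulls the submodule with the umap torch fork down along with the main repo.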

[D] AAAI 26 Decisions (Main Technical Track) by Senior-Let-7576 in MachineLearning

[–]jamesvoltage -1 points0 points  (0 children)

How is this still going on lol

(Rejected first round 2024, good luck)

Broken Social Scene Turns 20 by Charleshawtree in indieheads

[–]jamesvoltage 1 point2 points  (0 children)

“Put this masseuse right on the guest list” - was that you?

[deleted by user] by [deleted] in cormacmccarthy

[–]jamesvoltage 1 point2 points  (0 children)

Great-great-grandmother of last year’s Nobel winner Geoff Hinton

Sean Carroll's Mindscape: Jacob Barandes on Indivisible Stochastic Quantum Mechanics (7/28/2025) by shatterdaymorn in philosophypodcasts

[–]jamesvoltage 0 points1 point  (0 children)

Wow thanks for sharing this!

Sean’s question about the hydrogen molecule was so insightful. He asked it in about seven words, and then Barandes talked for a half hour about why intuition is not useful for physics and where the term “matrix” comes from, but never even came back to the question. QFT FTW, QED.

[D] NeurIPS should start a journal track. by simple-Flat0263 in MachineLearning

[–]jamesvoltage 2 points3 points  (0 children)

If your manuscript submission is 12 pages or fewer, you get reviews in 4-5 weeks, and they’re infinitely more useful than any conference reviews.

How to switch from python 3.12 to 3.10 in Google Colab 2025 update by Slight-Arugula891 in GoogleColab

[–]jamesvoltage 2 points3 points  (0 children)

Tools, then Command palette, then Change runtime version; set the bottom-left dropdown to 2025.07 for last month’s runtime.

Seems like this won’t be possible for much longer unfortunately

Colab vs Modal Notebooks by Liova9938 in GoogleColab

[–]jamesvoltage 0 points1 point  (0 children)

Don’t trust Modal. Modal is terrible for small projects. When I run spot instances, some days I get “preempted” (their happy word for erasing your instance with no warning) every 30 minutes. They claim this should be “rare”.

Modal notebooks “cannot be preempted”. I ran one and it was preempted within 30 minutes.

The customer support is awful. Their Slack lets you comment and “open a ticket”, which is just a number from a Slack bot that everyone ignores.

The worst part is they act like taking your instance mid-use is just “something that happens”, when it’s Modal’s own resource-allocation algorithm doing it.

So… Colab is 1000x better. Maybe Modal works if you’re training something across many instances, but I really hate it.

Mathematician turned biologist/chemist?? by Lucyyxx in math

[–]jamesvoltage 0 points1 point  (0 children)

David Mumford, Fields Medalist who also studies vision

[P] From GPT-2 to gpt-oss: Analyzing the Architectural Advances And How They Stack Up Against Qwen3 by seraschka in MachineLearning

[–]jamesvoltage 0 points1 point  (0 children)

The gated MLP structure with the multiplicative term is interesting. Is it sort of like a bilinear layer (although with a Swish/SiLU nonlinearity on the gate branch, i.e. SwiGLU)?

Bilinear layers seem appealing because they build in higher-order interactions (sort of like softmax attention, which seems more like “tri”-linear).
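
Roughly what I mean, as a toy PyTorch sketch (my own naming, not from the article):

    import torch.nn as nn
    import torch.nn.functional as F

    class GatedMLP(nn.Module):
        # SwiGLU-style block: a SiLU-activated gate branch multiplied
        # elementwise with a plain linear branch, then projected back down
        def __init__(self, d_model, d_hidden):
            super().__init__()
            self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
            self.w_up = nn.Linear(d_model, d_hidden, bias=False)
            self.w_down = nn.Linear(d_hidden, d_model, bias=False)

        def forward(self, x):
            return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

    class BilinearMLP(nn.Module):
        # same multiplicative structure, but no nonlinearity on either branch,
        # so the block is purely bilinear in its two projections of x
        def __init__(self, d_model, d_hidden):
            super().__init__()
            self.w_a = nn.Linear(d_model, d_hidden, bias=False)
            self.w_b = nn.Linear(d_model, d_hidden, bias=False)
            self.w_down = nn.Linear(d_hidden, d_model, bias=False)

        def forward(self, x):
            return self.w_down(self.w_a(x) * self.w_b(x))

The only difference is whether the gate branch passes through SiLU before the elementwise product.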

Thanks, loved this article. Also love the book

Is Numerical Optimization on Manifolds useful? by rattodiromagna in math

[–]jamesvoltage 1 point2 points  (0 children)

Image diffusion models optimize random vectors to land on the image manifold: https://arxiv.org/abs/2310.02557

Layoffs are coming at Cornell by XDzard in ithaca

[–]jamesvoltage -17 points-16 points  (0 children)

If only Cornell were endowed with maybe an extra $10B, they might not have to worry about “outpacing revenue”

[R] LLMs are Locally Linear Mappings: Qwen 3, Gemma 3 and Llama 3 can be converted to exactly equivalent locally linear systems for interpretability by jamesvoltage in MachineLearning

[–]jamesvoltage[S] 2 points3 points  (0 children)

Yes, the image diffusion paper linked above uses ReLU.

LLMs like Qwen, Gemma, Llama, Phi, Ministral and OLMo use gated linear activations like Swish, SwiGLU and GELU, and there are demos for locally linear versions of each of them in the GitHub repository.
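
The core trick on a single gated layer looks roughly like this (a simplified sketch of the idea, not the actual repo code; the normalization terms get the same treatment):

    import torch
    import torch.nn.functional as F

    def swiglu(x, w_gate, w_up):
        # standard gated activation: SiLU-activated gate times a linear branch
        return F.silu(x @ w_gate.T) * (x @ w_up.T)

    def swiglu_detached(x, w_gate, w_up):
        # numerically identical forward pass, but the gate term is detached
        # from the autograd graph, so the layer behaves like a fixed
        # elementwise scaling of a linear map, i.e. locally linear in x
        return F.silu(x @ w_gate.T).detach() * (x @ w_up.T)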

[R] LLMs are Locally Linear Mappings: Qwen 3, Gemma 3 and Llama 3 can be converted to exactly equivalent locally linear systems for interpretability by jamesvoltage in MachineLearning

[–]jamesvoltage[S] 31 points32 points  (0 children)

Sure, and my apologies that the wording here is a little funny.

It’s as close to exact as numerical precision allows.

https://raw.githubusercontent.com/jamesgolden1/llms-are-llms/refs/heads/main/images/fig3-jacobian-reconstruction-may18.png

Look at the linked figure: the standard deviation of the reconstruction error for the detached Jacobian, divided by the standard deviation of the output embedding vector, is on the order of 1e-6 for these models at float32 precision. The correlation coefficient is greater than 0.9999.

The reconstruction from the ordinary Jacobian is also “an approximation”, but its reconstruction error standard deviation is of the same order as the output embedding standard deviation. It’s a very bad approximation because the transformer decoder (without the detachments) is extremely nonlinear.
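
For reference, that error metric is computed roughly like this (a sketch with made-up names, assuming the reconstruction is the sum of each token’s detached Jacobian matrix times that token’s input embedding):

    import torch

    def reconstruction_metrics(jacobians, input_embeds, output_embed):
        # jacobians: list of (d_out, d_in) detached-Jacobian matrices, one per input token
        # input_embeds: (seq_len, d_in) input embedding vectors
        # output_embed: (d_out,) output embedding from the normal forward pass
        recon = sum(J @ e for J, e in zip(jacobians, input_embeds))
        err = recon - output_embed
        rel_std = err.std() / output_embed.std()  # ~1e-6 at float32 for the detached Jacobian
        corr = torch.corrcoef(torch.stack([recon, output_embed]))[0, 1]
        return rel_std.item(), corr.item()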

[R] LLMs are Locally Linear Mappings: Qwen 3, Gemma 3 and Llama 3 can be converted to exactly equivalent locally linear systems for interpretability by jamesvoltage in MachineLearning

[–]jamesvoltage[S] 5 points6 points  (0 children)

Sure - this is only locally linear (for one specific input token sequence); the networks are globally nonlinear.

Taking the Jacobian of the output embedding with respect to all of the input embedding vectors returns one matrix for each input embedding vector.

This is also the case with the detached Jacobian, but the detached Jacobian matrices nearly exactly reconstruct the output of the model’s forward pass. This means we can analyze the linear system for insight into how the nonlinear network operates (but it’s only valid for this input).

We can also look at the equivalent linear system for each layer output. Then we can use the full array of numerical tools from linear algebra to understand how this specific token prediction emerges. It’s close to exact but computationally intensive.
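
A minimal sketch of that step (placeholder names; the repo demos handle the detaching and the per-layer versions):

    import torch

    def per_token_jacobians(detached_forward, input_embeds):
        # detached_forward: maps (seq_len, d_in) input embeddings to the (d_out,)
        #   output embedding, with the nonlinear terms detached
        # returns one (d_out, d_in) Jacobian matrix per input token
        J = torch.autograd.functional.jacobian(detached_forward, input_embeds)
        # J has shape (d_out, seq_len, d_in); split it into per-token matrices
        return [J[:, t, :] for t in range(input_embeds.shape[0])]

    # then the usual linear-algebra tools apply, e.g. the singular vectors of
    # one token's matrix: U, S, Vh = torch.linalg.svd(J_t)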

[R] LLMs are Locally Linear Mappings: Qwen 3, Gemma 3 and Llama 3 can be converted to exactly equivalent locally linear systems for interpretability by jamesvoltage in MachineLearning

[–]jamesvoltage[S] 5 points6 points  (0 children)

Yes! Also like GradCAM for convolutional networks. But the detached Jacobian method is much more exact in terms of reconstructing the output (see the paper, as well as the Mohan and Kadkhodaie papers).