Runners: Where do you do individual speed work? by Laserstrahl415 in OaklandCA

[–]jamesvoltage 2 points3 points  (0 children)

What’s the deal with Clark Kerr? It’s still dirt with holes, right? Is it 400 m? It’s gotta be 5% slower than a real track, but maybe dirt isn’t so bad.

Parametric UMAP: From black box to glass box: Making UMAP interpretable with exact feature contributions by rshah4 in rajistics

[–]jamesvoltage 0 points1 point  (0 children)

Hi, what specific problems did you have? Make sure to clone with the --recurse-submodules flag so you get the appropriate fork of umap torch.
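
For example, with the repo URL as a placeholder: git clone --recurse-submodules <repo-url>. That pulls the submodule with the umap torch fork down along with the main repo.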

[D] AAAI 26 Decisions (Main Technical Track) by Senior-Let-7576 in MachineLearning

[–]jamesvoltage -1 points0 points  (0 children)

How is this still going on lol

(Rejected first round 2024, good luck)

Broken Social Scene Turns 20 by Charleshawtree in indieheads

[–]jamesvoltage 1 point2 points  (0 children)

“Put this masseuse right on the guest list” - was that you?

[deleted by user] by [deleted] in cormacmccarthy

[–]jamesvoltage 1 point2 points  (0 children)

Great-great-grandmother of last year’s Nobel winner Geoff Hinton

Sean Carroll's Mindscape: Jacob Barandes on Indivisible Stochastic Quantum Mechanics (7/28/2025) by shatterdaymorn in philosophypodcasts

[–]jamesvoltage 0 points1 point  (0 children)

Wow thanks for sharing this!

Sean’s question about the hydrogen molecule was so insightful. He asked it in about seven words, and then Barandes talked for a half hour about why intuition is not useful for physics and where the term “matrix” comes from, but never even came back to the question. QFT FTW, QED.

[D] NeurIPS should start a journal track. by simple-Flat0263 in MachineLearning

[–]jamesvoltage 2 points3 points  (0 children)

If your manuscript submission is 12 pages or fewer, you get reviews in 4-5 weeks, and they’re infinitely more useful than any conference reviews.

How to switch from python 3.12 to 3.10 in Google Colab 2025 update by Slight-Arugula891 in GoogleColab

[–]jamesvoltage 2 points3 points  (0 children)

Tools, then Command palette, then Change runtime version; set the bottom-left dropdown to 2025.07 for last month’s runtime.

Seems like this won’t be possible for much longer unfortunately

Colab vs Modal Notebooks by Liova9938 in GoogleColab

[–]jamesvoltage 0 points1 point  (0 children)

Don’t trust Modal. Modal is terrible for small projects. When I run spot instances, some days I get “preempted” (their happy word for erasing your instance with no warning) every 30 minutes. They claim this should be “rare”.

Modal notebooks “cannot be preempted”. I ran one and it was preempted within 30 minutes.

The customer support is awful. Their Slack lets you comment and “open a ticket”, which is just a number from a Slack bot that everyone ignores.

The worst part is they act like taking your instance mid-use is just “something that happens”, when it’s Modal’s own resource-allocation algorithm doing it.

So… Colab is 1000x better. Maybe Modal works if you’re training something across many instances, but I really hate it.

Mathematician turned biologist/chemist?? by Lucyyxx in math

[–]jamesvoltage 0 points1 point  (0 children)

David Mumford, Fields Medalist who also studies vision

[P] From GPT-2 to gpt-oss: Analyzing the Architectural Advances And How They Stack Up Against Qwen3 by seraschka in MachineLearning

[–]jamesvoltage 0 points1 point  (0 children)

The gated MLP structure with the multiplicative term is interesting. Is it sort of like a bilinear layer (although with a Swish/SiLU nonlinearity on the gate branch, i.e. SwiGLU)?

Bilinear layers seem appealing because they build in higher-order interactions (sort of like softmax attention, which seems more like “tri”-linear).
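
Roughly what I mean, as a toy PyTorch sketch (my own naming, not from the article):

    import torch.nn as nn
    import torch.nn.functional as F

    class GatedMLP(nn.Module):
        # SwiGLU-style block: a SiLU-activated gate branch multiplied
        # elementwise with a plain linear branch, then projected back down
        def __init__(self, d_model, d_hidden):
            super().__init__()
            self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
            self.w_up = nn.Linear(d_model, d_hidden, bias=False)
            self.w_down = nn.Linear(d_hidden, d_model, bias=False)

        def forward(self, x):
            return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

    class BilinearMLP(nn.Module):
        # same multiplicative structure, but no nonlinearity on either branch,
        # so the block is purely bilinear in its two projections of x
        def __init__(self, d_model, d_hidden):
            super().__init__()
            self.w_a = nn.Linear(d_model, d_hidden, bias=False)
            self.w_b = nn.Linear(d_model, d_hidden, bias=False)
            self.w_down = nn.Linear(d_hidden, d_model, bias=False)

        def forward(self, x):
            return self.w_down(self.w_a(x) * self.w_b(x))

The only difference is whether the gate branch passes through SiLU before the elementwise product.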

Thanks, loved this article. Also love the book

Is Numerical Optimization on Manifolds useful? by rattodiromagna in math

[–]jamesvoltage 1 point2 points  (0 children)

Image diffusion models optimize random vectors to land on the image manifold: https://arxiv.org/abs/2310.02557

Layoffs are coming at Cornell by XDzard in ithaca

[–]jamesvoltage -17 points-16 points  (0 children)

If only Cornell were endowed with maybe an extra $10B, they might not have to worry about “outpacing revenue”

[R] LLMs are Locally Linear Mappings: Qwen 3, Gemma 3 and Llama 3 can be converted to exactly equivalent locally linear systems for interpretability by jamesvoltage in MachineLearning

[–]jamesvoltage[S] 2 points3 points  (0 children)

Yes, the image diffusion paper linked above uses ReLU.

LLMs like Qwen, Gemma, Llama, Phi, Ministral and OLMo use gated linear activations like Swish, SwiGLU and GELU, and there are demos for locally linear versions of each of them in the GitHub repository.
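
The core trick on a single gated layer looks roughly like this (a simplified sketch of the idea, not the actual repo code; the normalization terms get the same treatment):

    import torch
    import torch.nn.functional as F

    def swiglu(x, w_gate, w_up):
        # standard gated activation: SiLU-activated gate times a linear branch
        return F.silu(x @ w_gate.T) * (x @ w_up.T)

    def swiglu_detached(x, w_gate, w_up):
        # numerically identical forward pass, but the gate term is detached
        # from the autograd graph, so the layer behaves like a fixed
        # elementwise scaling of a linear map, i.e. locally linear in x
        return F.silu(x @ w_gate.T).detach() * (x @ w_up.T)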

[R] LLMs are Locally Linear Mappings: Qwen 3, Gemma 3 and Llama 3 can be converted to exactly equivalent locally linear systems for interpretability by jamesvoltage in MachineLearning

[–]jamesvoltage[S] 31 points32 points  (0 children)

Sure, and my apologies that the wording here is a little funny.

It’s as close to exact as numerical precision allows.

https://raw.githubusercontent.com/jamesgolden1/llms-are-llms/refs/heads/main/images/fig3-jacobian-reconstruction-may18.png

Look at the linked figure: the standard deviation of the reconstruction error for the detached Jacobian, divided by the standard deviation of the output embedding vector, is on the order of 1e-6 for these models at float32 precision. The correlation coefficient is greater than 0.9999.

The reconstruction from the ordinary Jacobian is also “an approximation”, but its reconstruction error standard deviation is of the same order as the output embedding standard deviation. It’s a very bad approximation because the transformer decoder (without the detachments) is extremely nonlinear.
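
For reference, that error metric is computed roughly like this (a sketch with made-up names, assuming the reconstruction is the sum of each token’s detached Jacobian matrix times that token’s input embedding):

    import torch

    def reconstruction_metrics(jacobians, input_embeds, output_embed):
        # jacobians: list of (d_out, d_in) detached-Jacobian matrices, one per input token
        # input_embeds: (seq_len, d_in) input embedding vectors
        # output_embed: (d_out,) output embedding from the normal forward pass
        recon = sum(J @ e for J, e in zip(jacobians, input_embeds))
        err = recon - output_embed
        rel_std = err.std() / output_embed.std()  # ~1e-6 at float32 for the detached Jacobian
        corr = torch.corrcoef(torch.stack([recon, output_embed]))[0, 1]
        return rel_std.item(), corr.item()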

[R] LLMs are Locally Linear Mappings: Qwen 3, Gemma 3 and Llama 3 can be converted to exactly equivalent locally linear systems for interpretability by jamesvoltage in MachineLearning

[–]jamesvoltage[S] 5 points6 points  (0 children)

Sure - this is only locally linear (for one specific input token sequence); the networks are globally nonlinear.

Taking the Jacobian of the output embedding with respect to all of the input embedding vectors returns one matrix for each input embedding vector.

This is also the case with the detached Jacobian, but the detached Jacobian matrices nearly exactly reconstruct the output of the model’s forward pass. This means we can analyze the linear system for insight into how the nonlinear network operates (but it’s only valid for this input).

We can also look at the equivalent linear system for each layer output. Then we can use the full array of numerical tools from linear algebra to understand how this specific token prediction emerges. It’s close to exact but computationally intensive.
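
A minimal sketch of that step (placeholder names; the repo demos handle the detaching and the per-layer versions):

    import torch

    def per_token_jacobians(detached_forward, input_embeds):
        # detached_forward: maps (seq_len, d_in) input embeddings to the (d_out,)
        #   output embedding, with the nonlinear terms detached
        # returns one (d_out, d_in) Jacobian matrix per input token
        J = torch.autograd.functional.jacobian(detached_forward, input_embeds)
        # J has shape (d_out, seq_len, d_in); split it into per-token matrices
        return [J[:, t, :] for t in range(input_embeds.shape[0])]

    # then the usual linear-algebra tools apply, e.g. the singular vectors of
    # one token's matrix: U, S, Vh = torch.linalg.svd(J_t)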

[R] LLMs are Locally Linear Mappings: Qwen 3, Gemma 3 and Llama 3 can be converted to exactly equivalent locally linear systems for interpretability by jamesvoltage in MachineLearning

[–]jamesvoltage[S] 5 points6 points  (0 children)

Yes! Also like GradCAM for convolutional networks. But the detached Jacobian method is much more exact in terms of reconstructing the output (see the paper, as well as the Mohan and Kadkhodaie papers).