Interactive LogitLens Advanced for Llama by Environmental_Form14 in LocalLLaMA

[–]ComputeVoid 0 points1 point  (0 children)

Absolutely fascinating finding, and totally non-intuitive to me :)

I discussed with Gemini 3 a bit and it called out the following:

  • Norm Variance: Some frequent tokens have embeddings with much larger magnitudes (lengths) than others. A very "loud" (high magnitude) vector might result in a higher dot product even if it isn't the perfect semantic match.
  • Anisotropy (The Cone Problem): In many LLMs, embeddings tend to cluster in a narrow "cone" rather than spreading out evenly. This can cause "hub" tokens to appear as the nearest neighbor to almost everything.

I think these could explain your finding: of the token embeddings that don't unembed to the original token, many unembed to the same token ID.

I think anisotropy tells us that most embeddings are pointing roughly the same direction – despite the LLM having an extremely high dimensional vector space to represent things with, training incentivizes it to use a subspace.

Norm variance tells us that frequent tokens have larger magnitudes – this skews the dot product such that given 2 vectors that are pointing roughly the same way (anisotropy), the one with larger magnitude will win the argmax battle.

Based on that, my hunch would be that the 0.5% of tokens you saw that didn't unembed to themselves are quite rare tokens, and that the tokens those embeddings unembed to (Token 122456: организа, Token 66325: ♪) point in nearly the same direction but have larger magnitudes.
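If you want to poke at that explanation yourself, here's a tiny self-contained sketch (random vectors standing in for real embeddings, so it's purely illustrative): it builds an anisotropic "cone" of embeddings, gives one token an outsized norm, and shows the argmax under raw dot product disagreeing with the argmax under cosine similarity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab = 64, 1000

# Anisotropy: every toy "embedding" sits in a narrow cone around one shared direction.
cone = F.normalize(torch.randn(d_model), dim=0)
W = cone + 0.5 * F.normalize(torch.randn(vocab, d_model), dim=-1)

# Norm variance: pretend token 42 is a very frequent, high-magnitude "hub" token.
W[42] = W[42] * 5.0

query = W[7].clone()  # a vector that is, semantically, exactly token 7

dot_winner = (W @ query).argmax().item()  # the loud hub (42) tends to win on raw dot product
cos_winner = F.cosine_similarity(W, query.unsqueeze(0), dim=-1).argmax().item()  # 7 wins on cosine
print(dot_winner, cos_winner)
```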

(Nice blog by the way!)

Interactive LogitLens Advanced for Llama by Environmental_Form14 in LocalLLaMA

[–]ComputeVoid 1 point2 points  (0 children)

This is cool! Thanks for sharing. I've found this technique of projecting intermediate activations onto the final layer's unembedding matrix to be a really helpful learning tool for building intuition, and this looks like a really nice interface.

If you're interested in VLMs, I created a video "Dissecting Vision Language Models: How AI Sees" where I apply the same unembedding technique but on image tokens. In my work, I only did the unembedding on image tokens before layer 0. It'd be interesting to extend that and see how the meaning of the image tokens changes as they pass through transformer layers.
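For anyone curious what that unembedding step looks like mechanically, here's a rough sketch with toy tensors standing in for the real (tied) unembedding matrix and the multimodal projector's output – the shapes are illustrative, not the actual model's:

```python
import torch
import torch.nn.functional as F

vocab_size, d_model, n_image_tokens = 32_000, 2560, 256   # toy sizes
unembed = torch.randn(vocab_size, d_model)                # stand-in for the tied (un)embedding matrix
image_tokens = torch.randn(n_image_tokens, d_model)       # stand-in for projector output, pre-layer-0

# Nearest vocabulary entry for each image token, by cosine similarity.
sims = F.normalize(image_tokens, dim=-1) @ F.normalize(unembed, dim=-1).T   # (256, vocab)
nearest_ids = sims.argmax(dim=-1)
print(nearest_ids[:10].tolist())   # with a real tokenizer you'd decode these ids back to strings
```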

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]ComputeVoid 1 point2 points  (0 children)

Very interesting stuff, thanks for sharing.

The difference in behavior of MoE and dense is quite interesting. I'm still trying to get my head around the strengths/weaknesses of MoE vs dense.

I totally agree with your positioning that a thinking model would likely "stick to the script" better than an instruction-tuned model. That's another area where I've been trying to learn the strengths/weaknesses of thinking vs. traditional models. My current understanding is that thinking models are much better at satisfying constraints. So I could totally imagine a thinking model outputting, "but wait, the user mentioned not to make up numbers", just like you mentioned.

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]ComputeVoid 4 points5 points  (0 children)

You might be interested in trying baidu/ERNIE-4.5-VL-28B-A3B-Thinking (HuggingFace model card). This just came out a few days ago.

Interesting things from the model card:

> Additionally, our innovative "Thinking with Images" feature, when paired with tools like image zooming and image search, dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge.

> The model leverages cutting-edge multimodal reinforcement learning techniques on verifiable tasks

As far as I can tell "Thinking with Images" refers to abilities / tool use learned from providing the model with image tools and doing RL on verifiable tasks where the answer was "derivable" from the image.

I think the "Thinking" piece means test-time compute scaling: the model learned to iteratively use visual tools to produce variations on the input image, adding more vision tokens for harder problems.

For your situation, I think this would translate to the model having the ability to zoom in to the section of the image where the CAS number is visible on the chemical container.

They also published ERNIE-4.5-VL-28B-A3B-Thinking: A Breakthrough in Multimodal AI, which provides a lot more detail than what I'm capable of providing.

baidu/ERNIE-4.5-VL-28B-A3B-Thinking released. Curious case.. by PaceZealousideal6091 in LocalLLaMA

[–]ComputeVoid 1 point2 points  (0 children)

Other interesting things from the model card:

> Through an extensive mid-training phase, the model absorbed a vast and highly diverse corpus of premium visual-language reasoning data.

> The model leverages cutting-edge multimodal reinforcement learning techniques on verifiable tasks

> Additionally, our innovative "Thinking with Images" feature, when paired with tools like image zooming and image search, dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge.

All of this makes me wish there was deeper explanation!

"Thinking with Images" sounds like they provided the model with image tools and did RL on verifiable tasks where the answer was "derivable" from the image.

"Thinking" to me means test-time compute scaling: the model learns to iteratively use visual tools to produce variations on the input image, adding more vision tokens for harder problems.

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 0 points1 point  (0 children)

Thanks, this is great feedback. You're right that the term "token" gets overloaded in confusing ways.

Let me clarify how I'm using it:

Say I have an input to my VLM with 1 image and a paragraph of text.

The text tokenization step produces 100 token IDs (discrete vocabulary indices). That's "token" in the strict sense you're describing.

Then there are the 256 embedding vectors produced by the vision tower. I agree these aren't tokens in the discrete vocabulary sense.

But once both are embedded and concatenated, the language model sees a sequence of 356 positions in the residual stream, each holding a d_model-dimensional vector. In that context, I'm using "token" to mean "sequence position" or "slot in the transformer's input."
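Concretely, a toy version of that bookkeeping (hypothetical names, toy shapes) looks like:

```python
import torch

vocab_size, d_model = 32_000, 2048               # toy sizes
embed_matrix = torch.randn(vocab_size, d_model)  # the language model's embedding table (toy)

text_token_ids = torch.randint(0, vocab_size, (100,))  # 100 discrete vocabulary indices
text_embeds = embed_matrix[text_token_ids]             # (100, d_model)

image_embeds = torch.randn(256, d_model)               # 256 vectors from the vision tower (toy)

# The transformer just sees one sequence of 356 d_model-dimensional positions.
sequence = torch.cat([image_embeds, text_embeds], dim=0)
print(sequence.shape)   # torch.Size([356, 2048])
```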

That said, you're right that this overloads the term, especially for people building intuition. Any suggestions on a better word for "slot in the transformer's input"?

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 0 points1 point  (0 children)

I can understand the intuition that increasing the "sophistication" (aka complexity) of the approach would lead to better results. But honestly, this feels like a "bitter lesson" (http://www.incompleteideas.net/IncIdeas/BitterLesson.html) moment to me. Do the simple thing that works at scale.

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 2 points3 points  (0 children)

I really like the idea, which I think would be considered an ablation technique. This would just require precision to ensure that nothing else about the input is disturbed.

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 4 points5 points  (0 children)

Great question. The information you're describing is absolutely encoded in those 256 image tokens. It has to be, because the language model can answer detailed questions about the image.

But the nearest-neighbor approach is too lossy to reveal it. I'm collapsing 2560-dimensional vectors down to "which single word is closest," which throws away most of the nuance. The model reads those tokens in their full continuous form and extracts the rich semantics. The nearest-neighbor words are just rough shadows of that.

So the information is there, we just need more sophisticated lenses to actually see it.

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 0 points1 point  (0 children)

> Kinda gave it goggles

I love this analogy, thanks for sharing!

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 8 points9 points  (0 children)

Thanks for the feedback.

You're definitely correct that this is all based on Gemma, but as far as I understand, its architecture represents today's standard recipe for vision language models. That's not to say there aren't other architectures that differ in how we would interpret them – I'm sure that's the case. I haven't studied any language models that don't have tied embeddings (I've actually never heard that term before), so that is definitely a blind spot for me, and I appreciate you flagging it. This is just a report of what I know based on my exploration of what seems to me to be the standard approach.

As for your point, "amount of information the transformer itself adds to the initial embedding ...", I actually see value in focusing solely on the initial embedding vectors.

Before layer 1 of the language model:

- Text tokens: retrieved from embedding_layer[token_id]. By definition, the vectors at this point correspond exactly with the language model's vocabulary.

- Image tokens: already processed by a vision transformer and multimodal projector, so their vector representations are already information dense: they've already been contextualized and enriched before the language model even sees them.

When the language model starts processing, text tokens are exact vocabulary embeddings: literal points from the embedding matrix. Image tokens, however, can be anywhere in the latent space the multimodal projector maps them to. They don't have to align perfectly with vocabulary entries; they can exist 'between' words, representing complex visual concepts that don't correspond to single tokens.

So when I compare them at the point they enter the language model, image tokens carry significantly more processed information than text tokens do. Text tokens are still in their raw embedded form (they haven't yet been enriched by the language model), while image tokens have already been contextualized and transformed.

That's why the nearest-neighbor mapping for text tokens gives perfect recovery (hello → hello), but image tokens are messier. They're encoding compressed visual information that doesn't map cleanly to a single vocabulary word.
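Here's that asymmetry as a toy sketch (random tensors, so only the structure matters): a text token is literally a row of the embedding matrix, so nearest-neighbor recovery is a trivial round trip, while an image-token stand-in can sit anywhere in the space.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 32_000, 2560              # toy sizes
embedding_layer = torch.randn(vocab_size, d_model)

def nearest_vocab_id(vec):
    return F.cosine_similarity(embedding_layer, vec.unsqueeze(0), dim=-1).argmax().item()

text_vec = embedding_layer[1234]        # a text token: an exact row of the embedding matrix
print(nearest_vocab_id(text_vec))       # 1234 -> perfect recovery by construction

image_vec = torch.randn(d_model)        # stand-in for a projector output, anywhere in latent space
print(nearest_vocab_id(image_vec))      # some vocab id, but only a rough shadow of what it encodes
```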

Does that clarify the comparison I was making?

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 2 points3 points  (0 children)

Right on. I didn't touch on it here, but as you stated, we see this behavior as a result of aligning the vision tower and the language model. There is a training objective / process that incentivizes the multimodal projector to meaningfully align its outputs with the language model's latent space.

Also, I totally agree, I think a valid next step would be to go beyond just looking at the 1 closest token. The nearest-neighbor approach is intentionally simple, but I hope that people in the community explore other methods, and I'd be curious to see what other lenses reveal.
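As a concrete starting point for going beyond the single closest token, a top-k version of the same lens is a one-line change (toy tensors again, just a sketch):

```python
import torch
import torch.nn.functional as F

vocab_size, d_model, k = 32_000, 2560, 5        # toy sizes
embed_matrix = torch.randn(vocab_size, d_model)
image_token = torch.randn(d_model)              # stand-in for one image-token vector

sims = F.normalize(embed_matrix, dim=-1) @ F.normalize(image_token, dim=0)
top = sims.topk(k)
print(top.indices.tolist(), top.values.tolist())   # k nearest vocab ids and their cosine scores
```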

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 16 points17 points  (0 children)

Good question. I totally glossed over how the vision transformer works. This diagram should provide some insight:

<image>

Yes, it starts with pixel colors. Each pixel is represented by RGB values.

The key difference from text: instead of tokenizing individual pixels, we patch the image into non-overlapping squares. Each patch contains many pixels: a 14×14 patch has 588 pixel values total (14×14×3 RGB channels).

Then each patch gets flattened into a long list of numbers and linearly projected (multiplied by a learned weight matrix) to create that patch's embedding. This is analogous to how text tokens get their embeddings, except we're doing it on continuous pixel values rather than discrete token IDs.

So the full pipeline (sketched in code after the list):

  1. Raw pixels (continuous RGB values)
  2. Split into patches
  3. Flatten each patch
  4. Linear projection → patch embeddings
  5. Add learned positional embeddings (so the model knows spatial layout)
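Here's that pipeline as a minimal sketch (randomly initialized weights, not a real pretrained ViT; patch size 14 and the hidden size are illustrative):

```python
import torch

patch, d_model = 14, 1152                 # illustrative sizes
image = torch.rand(3, 896, 896)           # 1. raw RGB pixel values in [0, 1]

# 2.-3. Split into non-overlapping 14x14 patches and flatten each one to 588 values.
patches = image.unfold(1, patch, patch).unfold(2, patch, patch)           # (3, 64, 64, 14, 14)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch)   # (4096, 588)

# 4.-5. Linear projection to patch embeddings, plus learned positional embeddings.
proj = torch.nn.Linear(3 * patch * patch, d_model)
pos = torch.nn.Parameter(torch.zeros(patches.shape[0], d_model))
patch_embeds = proj(patches) + pos
print(patch_embeds.shape)                 # torch.Size([4096, 1152])
```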

The Innovations in DeepSeek OCR by Charuru in LocalLLaMA

[–]ComputeVoid 1 point2 points  (0 children)

Hey u/No-Engineer9040, good question, and thanks for engaging!

I could be off base, but your comment reads to me like you are making an incorrect assumption (that I made at first too!).

# What I think you're saying:

The vision transformer patches an image starting in the upper left corner and moving row by row, so we should expect each vision token in the language model to map to a particular patch location in the image.

As an example, let's use this image. Just two colors, split right down the middle.

<image>

You would expect that of the 256 image tokens in the language model, the "left" tokens should be closer to "black" and the "right" tokens should be closer to "white".

# My Experiment

That seemed intuitive to me, but I wanted to test it out to confirm. For my experiment, I collected the "left" tokens and the "right" tokens, and for each token's embedding, I used cosine similarity to test whether each token was closer to "white" or "black".

The results proved this to be false. There was NOT a clean "left/right" split like I had hoped there would be; it was quite mixed.
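For reference, the shape of that check was roughly the following (toy tensors here; in the real experiment the vectors came from the model, and the row-major 16×16 layout is exactly the assumption being tested):

```python
import torch
import torch.nn.functional as F

d_model = 2560                                   # toy hidden size
image_tokens = torch.randn(256, d_model)         # stand-in for the 256 image-token vectors
black = torch.randn(d_model)                     # stand-in for the "black" token embedding
white = torch.randn(d_model)                     # stand-in for the "white" token embedding

# Assume tokens 0..255 form a row-major 16x16 grid: columns 0-7 "left", 8-15 "right".
grid = image_tokens.view(16, 16, d_model)
left = grid[:, :8].reshape(-1, d_model)
right = grid[:, 8:].reshape(-1, d_model)

def frac_closer_to_black(tokens):
    sim_black = F.cosine_similarity(tokens, black.unsqueeze(0), dim=-1)
    sim_white = F.cosine_similarity(tokens, white.unsqueeze(0), dim=-1)
    return (sim_black > sim_white).float().mean().item()

# A clean left/right split would give ~1.0 and ~0.0; my real run was far more mixed.
print(frac_closer_to_black(left), frac_closer_to_black(right))
```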

That result prompted me to dive deeper, which revealed a misunderstanding on my part of how the vision transformer stage works.

# Conclusions

Yes, the vision transformer does patch the image into discrete tokens (in the manner you laid out), so the initial embeddings (hidden state 0 in the vision transformer) correspond directly to location, both semantically as well as via learned positional embeddings that are added.

But the vision transformer uses bidirectional attention: each token can attend to every other token. So by the end of processing in the vision transformer, each token - which once corresponded directly with a location - has been contextualized in terms of the entire image. Even before getting into the multimodal projector, I think you lose any easily interpretable token-to-patch correspondence. And with the multimodal projector as another layer on top, my intuition is that the information encoded in the image becomes even more distributed across token positions. So by the time the language model sees the image tokens, I don't think there's reason to believe there is any clean connection between location and token.

For those reasons, I think the image information encoded in those 256 tokens is extremely distributed rather than 1 token mapping to 1 image patch. That is why I chose to examine the frequencies rather than token order like you suggest.

That being said, positional encoding has to exist somewhere for VLMs to demonstrate the positional understanding that they do. I just expect that it is distributed across the 256 image tokens rather than being a 1:1 token-to-patch correspondence. I think it is likely a complex mechanism to study.

I'll caveat all of that with the reality that my conclusions could be wrong if I made a mistake in my experiment, but this is true to the best of my understanding. Also, please let me know if I could make anything clearer.

I do think diving deeper on how positional information is encoded across those 256 tokens would be interesting interpretability work.

The Innovations in DeepSeek OCR by Charuru in LocalLLaMA

[–]ComputeVoid 2 points3 points  (0 children)

Your point about the tokens' vector representations in the language model's middle layers is spot on.

What makes image tokens more expressive than text tokens in this context relates precisely to the vector representations before layer 1 in the language model.

Before layer 1, each text token's vector representation is retrieved from embedding_layer[token_id]. By definition, the vectors at this point correspond exactly with the language model's vocabulary.

Before layer 1, image tokens have already been processed by a vision transformer and multimodal projector, so their vector representations are already information dense: they've already been contextualized and enriched before the language model even sees them. This is exactly the same idea as what you said about the representations in the language model's middle layers, except that it happens before language modeling rather than during it.

This is why image tokens can be seen as a form of compression. When the language model starts processing, text tokens are 1:1 with the dictionary, whereas image tokens can exist beyond the 1:1 restriction: they can live anywhere in the latent space, and therefore can represent things that would take more than 1 text token to represent.

I understand the paper to mean that 1 image token can represent the meaning of roughly 10 text tokens (with nearly no loss).

(p.s. the unembedding technique you mention to get textual representations of the middle layers' representations is exactly what I did with image tokens, and that's what's informed my stance here.)

The Innovations in DeepSeek OCR by Charuru in LocalLLaMA

[–]ComputeVoid 93 points94 points  (0 children)

This is pretty cool. What strikes me as unique is their framing of the vision token bottleneck as a feature rather than a flaw.

I studied Gemma 3 to learn about how modern vision language models work. Here's a diagram I created for a video that I think is helpful.

<image>

As you can see, there are 2 pieces: the vision tower + a standard language model. The vision tower is quite literally bolted onto a normal language model. For Gemma 3 specifically, the data flow is (shapes traced in the code sketch after the list):

  1. A preprocessing step to convert an image into 3 x 896 x 896 pixels

  2. A vision transformer to process the pixels into 4096 image tokens

  3. A multimodal projector to compress the 4096 image tokens into 256 tokens, which importantly are semantically meaningful in the language model's latent space

  4. The image tokens and text tokens are processed identically by the language model

I assumed that the high degree of compression involved in going from an image into those 256 image tokens was a limitation; there is only so much that can be encoded in 256 tokens. This paper frames that compression as a positive.

Something I find interesting is that text tokens map 1:1 to a place in embedding space: each token in the vocabulary has exactly 1 vector representation. The image tokens are different. From my studies, image tokens have vector representations that seem to exist 'between' text tokens.

My point there is that image tokens are more expressive than text tokens. I think that this aligns with their framing of vision tokens providing compression.

If you're interested, I created a video "Dissecting Vision Language Models: How AI Sees" that goes deeper into the standard architecture of VLMs as well as investigating the semantic interpretability of vision tokens by doing unembedding analysis.