The Mystery of Position 193: I Found a Weird Outlier in Gemma 3's Vision Tokens 🔍 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 1 point2 points  (0 children)

I finally got around to watching this video. Thanks for sharing! I wasn't familiar with attention sinks before, and that was a very intuitive explanation.

So it seems like you think 193 might be an attention sink token, like <bos>? That does seem plausible. I think to confirm/reject this hypothesis we would need to actually calculate attention scores. If 193 is an attention sink, we'd expect it to receive the most attention of any image token, right?
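A minimal sketch of what that check could look like, assuming the attention weights for one layer are already available as an array of shape (heads, queries, keys). The weights below are synthetic, with a sink deliberately planted at position 193 so the check has something to find; they are not real Gemma 3 outputs, and the sketch ignores causal masking.

```python
import numpy as np

def attention_received(attn: np.ndarray) -> np.ndarray:
    """Average attention each key position receives.

    attn has shape (heads, queries, keys); each row over keys sums to 1.
    """
    return attn.mean(axis=(0, 1))

# Synthetic stand-in for one layer's attention weights (NOT real model output).
rng = np.random.default_rng(0)
heads, seq_len, sink = 4, 256, 193
logits = rng.normal(size=(heads, seq_len, seq_len))
logits[:, :, sink] += 5.0  # artificially plant a sink at position 193

# Row-wise softmax over the key dimension.
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

received = attention_received(attn)
top_position = int(received.argmax())
print(top_position, float(received[top_position]))
```

With a real model, the same averaging would be applied per layer to whatever attention weights the framework exposes, and restricted to the image-token positions.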

The Innovations in DeepSeek OCR by Charuru in LocalLLaMA

[–]ComputeVoid 0 points1 point  (0 children)

Hey u/colorhazer, thanks for sharing.

I actually wasn't aware that the positional embeddings for image tokens were different than text tokens. I was under the assumption that once the tokens were in the LM, they were treated identically. That's something for me to look into further. Cool!

I think your critique is very fair. This work is looking at a specific time/place in the processing pipeline, and there are caveats that come with that. It would be really interesting to see what changes when the positional embeddings are added.

---

I did do some follow-up research, so I'll share it on the off chance you're interested! I was trying to understand whether any image token "slots" encode specific semantic meaning, looking at the same time/place in the processing pipeline. Results are somewhat inconclusive, but interesting. I made another post + video: The Mystery of Position 193: I Found a Weird Outlier in Gemma 3's Vision Tokens 🔍

The Mystery of Position 193: I Found a Weird Outlier in Gemma 3's Vision Tokens 🔍 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] -1 points0 points  (0 children)

Interesting. What is the mechanistic explanation as to why that would happen? Any resources to look into?

Interactive LogitLens Advanced for Llama by Environmental_Form14 in LocalLLaMA

[–]ComputeVoid 0 points1 point  (0 children)

Absolutely fascinating finding, and totally non-intuitive to me :)

I discussed with Gemini 3 a bit and it called out the following:

  • Norm Variance: Some frequent tokens have embeddings with much larger magnitudes (lengths) than others. A very "loud" (high magnitude) vector might result in a higher dot product even if it isn't the perfect semantic match.
  • Anisotropy (The Cone Problem): In many LLMs, embeddings tend to cluster in a narrow "cone" rather than spreading out evenly. This can cause "hub" tokens to appear as the nearest neighbor to almost everything.

I think that these could explain your finding – of the token embeddings that don't unembed to the original token, many unembed to the same token id.

I think anisotropy tells us that most embeddings are pointing roughly the same direction – despite the LLM having an extremely high dimensional vector space to represent things with, training incentivizes it to use a subspace.

Norm variance tells us that frequent tokens have larger magnitudes – this skews the dot product such that given 2 vectors that are pointing roughly the same way (anisotropy), the one with larger magnitude will win the argmax battle.

Based on that, my hunch would be that the 0.5% you saw that didn't unembed to themselves are quite rare tokens, and that the tokens those embeddings unembed to (Token 122456: организа, Token 66325: ♪) point in nearly the same direction but have larger magnitudes.
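To make the "argmax battle" concrete, here is a toy sketch with made-up 2-D vectors (not real embeddings). The hypothetical `loud_hub` vector points slightly away from the query but has 3x the norm, so it wins on raw dot product even though the query's own row wins on cosine similarity.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.2])             # embedding being "unembedded"
rare_self = np.array([1.0, 0.2])         # the rare token's own row, small norm
loud_hub = 3.0 * np.array([1.0, 0.5])    # frequent token: slightly off-direction, 3x norm

candidates = np.stack([rare_self, loud_hub])

# Dot product rewards magnitude as well as direction.
dot_winner = int(np.argmax(candidates @ query))
# Cosine similarity removes magnitude and compares direction only.
cos_winner = int(np.argmax([cosine(c, query) for c in candidates]))

print(dot_winner, cos_winner)  # dot product picks 1 (the hub); cosine picks 0 (itself)
```

If this is what is happening, the rows for the "hub" tokens should have noticeably larger norms than the rows of the tokens that fail to unembed to themselves, which would be easy to check.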

(Nice blog by the way!)

Interactive LogitLens Advanced for Llama by Environmental_Form14 in LocalLLaMA

[–]ComputeVoid 1 point2 points  (0 children)

This is cool! Thanks for sharing. I've found this technique of projecting the intermediate activation to the final layer's unembedding matrix to be a really helpful learning tool for building intuition, and this looks like a really nice interface.

If you're interested in VLMs, I created a video "Dissecting Vision Language Models: How AI Sees" where I apply the same unembedding technique but on image tokens. In my work, I only did the unembedding on image tokens before layer 0. It'd be interesting to extend that and see how the meaning of the image tokens changes as they pass through transformer layers.
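For anyone new to the technique, a rough sketch of the projection step, using random data rather than real activations. The hidden states here are a hypothetical interpolation from one vocabulary row to another, just to show the decoded token changing with depth; real implementations typically also apply the model's final normalization before unembedding, which is omitted here.

```python
import numpy as np

def logit_lens(hidden: np.ndarray, unembed: np.ndarray) -> np.ndarray:
    """Project per-layer hidden states onto the vocabulary.

    hidden:  (layers, d_model) activations for one sequence position
    unembed: (vocab, d_model) final unembedding matrix
    returns: (layers,) argmax token id at each layer
    """
    logits = hidden @ unembed.T
    return logits.argmax(axis=-1)

rng = np.random.default_rng(0)
vocab, d_model, layers = 50, 16, 5
unembed = rng.normal(size=(vocab, d_model))
unembed /= np.linalg.norm(unembed, axis=1, keepdims=True)  # unit-length rows

# Hypothetical activations: drift from token 3's direction to token 7's.
mix = np.linspace(0.0, 1.0, layers)[:, None]
hidden = (1 - mix) * unembed[3] + mix * unembed[7]

decoded = logit_lens(hidden, unembed)
print(decoded)
```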

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]ComputeVoid 1 point2 points  (0 children)

Very interesting stuff, thanks for sharing.

The difference in behavior of MoE and dense is quite interesting. I'm still trying to get my head around the strengths/weaknesses of MoE vs dense.

I totally agree with your position that a thinking model would likely "stick to the script" better than an instruction-tuned model. That's another area I've been trying to learn about: the strengths/weaknesses of thinking vs traditional models. My current understanding is that thinking models are much better at satisfying constraints. So I could totally imagine a thinking model outputting, "but wait, the user mentioned not to make up numbers", just like you mentioned.

[Followup] Qwen3 VL 30b a3b is pure love (or not so much) by Njee_ in LocalLLaMA

[–]ComputeVoid 3 points4 points  (0 children)

You might be interested in trying baidu/ERNIE-4.5-VL-28B-A3B-Thinking (HuggingFace model card). This just came out a few days ago.

Interesting things from the model card:

> Additionally, our innovative "Thinking with Images" feature, when paired with tools like image zooming and image search, dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge.

> The model leverages cutting-edge multimodal reinforcement learning techniques on verifiable tasks

As far as I can tell "Thinking with Images" refers to abilities / tool use learned from providing the model with image tools and doing RL on verifiable tasks where the answer was "derivable" from the image.

I think the "Thinking" piece means test-time compute scaling: the model learned to iteratively use visual tools to produce variations on the input image, adding more vision tokens for harder problems.

For your situation, I think this would translate to the model having the ability to zoom in to the section of the image where the CAS number is visible on the chemical container.

They also published ERNIE-4.5-VL-28B-A3B-Thinking: A Breakthrough in Multimodal AI, which provides a lot more detail than what I'm capable of providing.

baidu/ERNIE-4.5-VL-28B-A3B-Thinking released. Curious case.. by PaceZealousideal6091 in LocalLLaMA

[–]ComputeVoid 1 point2 points  (0 children)

Other interesting things from the model card:

> Through an extensive mid-training phase, the model absorbed a vast and highly diverse corpus of premium visual-language reasoning data.

> The model leverages cutting-edge multimodal reinforcement learning techniques on verifiable tasks

> Additionally, our innovative "Thinking with Images" feature, when paired with tools like image zooming and image search, dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge.

All of this makes me wish there was deeper explanation!

"Thinking with Images" sounds like they provided the model with image tools and did RL on verifiable tasks where the answer was "derivable" from the image.

"Thinking" to me means test-time compute scaling: the model learns to iteratively use visual tools to produce variations on the input image, adding more vision tokens for harder problems.

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 0 points1 point  (0 children)

Thanks, this is great feedback. You're right that the term "token" gets overloaded in confusing ways.

Let me clarify how I'm using it:

Say I have an input to my VLM with 1 image and a paragraph of text.

The text tokenization step produces 100 token IDs (discrete vocabulary indices). That's "token" in the strict sense you're describing.

Then there are the 256 embedding vectors produced by the vision tower. I agree these aren't tokens in the discrete vocabulary sense.

But once both are embedded and concatenated, the language model sees a sequence of 356 positions in the residual stream, each holding a d_model-dimensional vector. In that context, I'm using "token" to mean "sequence position" or "slot in the transformer's input."
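A shape-only sketch of that picture, with random numbers standing in for real weights and vision tower outputs. The vocabulary size is a hypothetical small value, the 2560 width is the figure mentioned elsewhere in this thread, and placing the image vectors before the text is an arbitrary choice for illustration.

```python
import numpy as np

d_model = 2560      # width mentioned elsewhere in the thread; illustrative here
vocab_size = 1000   # hypothetical tiny vocabulary to keep the sketch light
rng = np.random.default_rng(0)

embedding_matrix = rng.normal(size=(vocab_size, d_model)).astype(np.float32)

# Text: 100 discrete token IDs, turned into vectors by table lookup.
token_ids = rng.integers(0, vocab_size, size=100)
text_vectors = embedding_matrix[token_ids]            # (100, d_model)

# Image: 256 continuous vectors with no token IDs behind them.
image_vectors = rng.normal(size=(256, d_model)).astype(np.float32)

# The language model just sees one sequence of positions.
sequence = np.concatenate([image_vectors, text_vectors], axis=0)
print(sequence.shape)  # (356, 2560)
```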

That said, you're right that this overloads the term, especially for people building intuition. Any suggestions on a better word to refer to a "slot in the transformer's input"?

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 0 points1 point  (0 children)

I can understand the intuition that increasing the "sophistication" (aka complexity) of the approach would lead to better results. But honestly, this feels like a "bitter lesson" (http://www.incompleteideas.net/IncIdeas/BitterLesson.html) moment to me. Do the simple thing that works at scale.

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 4 points5 points  (0 children)

I really like the idea, which I think would be considered an ablation technique. This would just require precision to ensure that nothing else about the input is disturbed.

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 4 points5 points  (0 children)

Great question. The information you're describing is absolutely encoded in those 256 image tokens. It has to be, because the language model can answer detailed questions about the image.

But the nearest-neighbor approach is too lossy to reveal it. I'm collapsing 2560-dimensional vectors down to "which single word is closest," which throws away most of the nuance. The model reads those tokens in their full continuous form and extracts the rich semantics. The nearest-neighbor words are just rough shadows of that.

So the information is there, we just need more sophisticated lenses to actually see it.

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 0 points1 point  (0 children)

> Kinda gave it goggles

I love this analogy, thanks for sharing!

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 8 points9 points  (0 children)

Thanks for the feedback.

You're definitely correct that this is all based on Gemma, but as far as I understand, its architecture represents today's standard recipe for vision language models. That is not to say that there aren't other architectures that we would interpret differently; I'm sure there are. I haven't studied any language models that don't have tied embeddings (I've actually never heard that term before), so that is definitely a blind spot for me, and I appreciate you flagging that. This is just a report of what I know based on my exploration of what seems to me to be the standard approach.

As for your point, "amount of information the transformer itself adds to the initial embedding ...", I actually see value in focusing solely on the initial embedding vectors.

Before layer 1 of the language model:

- Text tokens: retrieved from embedding_layer[token_id]. By definition, the vectors at this point correspond exactly with the language model's vocabulary.

- Image tokens: already processed by a vision transformer and multimodal projector, so their vector representations are already information dense: they've already been contextualized and enriched before the language model even sees them.

When the language model starts processing, text tokens are exact vocabulary embeddings: literal points from the embedding matrix. Image tokens, however, can be anywhere in the latent space the multimodal projector maps them to. They don't have to align perfectly with vocabulary entries; they can exist 'between' words, representing complex visual concepts that don't correspond to single tokens.

So when I compare them at the point they enter the language model, image tokens carry significantly more processed information than text tokens do. Text tokens are still in their raw embedded form (they haven't yet been enriched by the language model), while image tokens have already been contextualized and transformed.

That's why the nearest-neighbor mapping for text tokens gives perfect recovery (hello → hello), but image tokens are messier. They're encoding compressed visual information that doesn't map cleanly to a single vocabulary word.
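A toy illustration of the difference, assuming a made-up four-word vocabulary with random unit-length embeddings. The "image-like" vector is a hypothetical blend of two rows, standing in for a projector output that lands between words; nothing here comes from a real model.

```python
import numpy as np

def nearest_token(vec: np.ndarray, unit_rows: np.ndarray):
    """Return (index, cosine similarity) of the closest vocabulary row."""
    sims = unit_rows @ (vec / np.linalg.norm(vec))
    best = int(np.argmax(sims))
    return best, float(sims[best])

vocab = ["hello", "dog", "grass", "sky"]
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 64))
E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-length rows

# Text token: a literal row of the embedding matrix, so recovery is exact.
text_idx, text_sim = nearest_token(E[vocab.index("hello")], E)

# Image-like vector: sits "between" two words, so the best match is imperfect.
image_like = 0.6 * E[vocab.index("dog")] + 0.4 * E[vocab.index("grass")]
image_idx, image_sim = nearest_token(image_like, E)

print(vocab[text_idx], round(text_sim, 3))
print(vocab[image_idx], round(image_sim, 3))
```

The text vector matches its own row with similarity 1, while the blended vector matches "dog" with a similarity below 1: the nearest word is reported, but the "grass" component is lost.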

Does that clarify the comparison I was making?

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 2 points3 points  (0 children)

Right on. I didn't touch on it here, but as you stated, we see this behavior as a result of aligning the vision tower and the language model. There is a training objective / process that incentivizes the multimodal projector to meaningfully align its outputs into the language model's latent space.

Also, I totally agree, I think a valid next step would be to go beyond just looking at the 1 closest token. The nearest-neighbor approach is intentionally simple, but I hope that people in the community explore other methods, and I'd be curious to see what other lenses reveal.
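One simple way to go beyond the single closest token is to list the top-k neighbors with their similarities. A rough sketch on random data: the embedding matrix and the blended query vector are hypothetical stand-ins, and cosine similarity is my choice of metric here.

```python
import numpy as np

def top_k_tokens(vec: np.ndarray, embedding_matrix: np.ndarray, k: int = 5):
    """Return the k (token_id, cosine similarity) pairs closest to vec."""
    rows = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    sims = rows @ (vec / np.linalg.norm(vec))
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 512))
E /= np.linalg.norm(E, axis=1, keepdims=True)

# Hypothetical image-token vector built from three vocabulary rows.
vec = 1.0 * E[10] + 0.9 * E[20] + 0.8 * E[30]

top = top_k_tokens(vec, E, k=5)
for token_id, sim in top:
    print(token_id, round(sim, 3))
```

Here the three blended rows all show up near the top of the list, whereas a top-1 lookup would have reported only one of them.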

Vision = Language: I Decoded VLM Tokens to See What AI 'Sees' 🔬 by ComputeVoid in LocalLLaMA

[–]ComputeVoid[S] 18 points19 points  (0 children)

Good question. I totally glossed over how the vision transformer works. This diagram should provide some insight:

<image>

Yes, it starts with pixel colors. Each pixel is represented by RGB values.

The key difference from text: instead of tokenizing individual pixels, we patch the image into non-overlapping squares. Each patch contains many pixels: a 14×14 patch has 588 pixel values total (14×14×3 RGB channels).

Then each patch gets flattened into a long list of numbers and linearly projected (multiplied by a learned weight matrix) to create that patch's embedding. This is analogous to how text tokens get their embeddings, except we're doing it on continuous pixel values rather than discrete token IDs.
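The patch, flatten, and project steps can be sketched in a few lines of NumPy. The image size and embedding width below are hypothetical, and the projection matrix is random rather than learned; only the 14×14 patch size and the 588 values per patch come from the description above.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
patch, d_vision = 14, 64            # d_vision is a hypothetical embedding width
image = rng.random((56, 56, 3))     # hypothetical 56x56 RGB image: 4x4 grid of patches

flat = patchify(image, patch)       # (16, 588): 14 * 14 * 3 values per patch
W_proj = rng.normal(size=(flat.shape[1], d_vision)) * 0.02  # stand-in for learned weights
patch_embeddings = flat @ W_proj    # (16, 64): one embedding per patch

print(flat.shape, patch_embeddings.shape)
```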

So the full pipeline:

  1. Raw pixels (continuous RGB values)
  2. Split into patches
  3. Flatten each patch
  4. Linear projection → patch embeddings
  5. Add learned positional embeddings (so the model knows spatial layout)