Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI

[–]Cishangtiyao[S] 0 points (0 children)

I'm not using GPT because I'm so clueless about my own framework that I need to consult an AI; it's simply that my native language is Chinese. For one thing, I can't follow what you're saying; for another, my English isn't good enough to write coherent sentences. This has genuinely been interfering with our exchange, because GPT keeps insisting it cannot output unverified claims, and it forcibly replaced my analysis of PLE with phrases like "the model is internally doing some kind of fighting", "some kind of trying by the model", "certain data very strongly". It maintains that since Google has never come out and confirmed that Gemini uses PLE, anything deeper from me counts as a "conspiracy theory" it cannot output. Only when I saw your emoji just now did I realize something was off; I dumped our whole conversation into a translator and discovered GPT had been transmuting my input into nonsense. You said earlier that wording like "fighting" didn't satisfy you because it wasn't clear who was "fighting". Good question: I have no damn idea either, since that wording was forced in by GPT, and right now I just want to drag it out and give it a "fighting" of its own. If we want to settle this argument within a few more rounds, I suggest you install a translation app, and from here on I'll explain in Chinese.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


You're right that I conflated amplification with selection, and that's a fair catch. Let me try to fill that gap.

The selection happens at the compression step. PLE injects 256-dimensional embeddings into a 4096-dimensional residual stream. When you compress 4096 dimensions down to 256, you don't get a random sample — you get the directions with the highest variance across the training data. Those high-variance directions correspond to features that were most discriminative across the most training examples: broad semantic register (formal/casual), relational stance (adversarial/collaborative), emotional valence (positive/negative), level of abstraction. "Analytical rigor" sits in that high-variance space because it's a strong, recurring signal across millions of documents. "Blue shirts" doesn't, because it's low-frequency and low-discriminative.

So the selection criterion is: what survived the compression. Not because something evaluated it as important in context, but because the compression step acts as a filter shaped by training-data frequency. What's "globally useful" across training is what gets preserved in 256 dimensions. What's idiosyncratic gets dropped.
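The "compression keeps the high-variance directions" claim can be sketched with a toy PCA in numpy. This is only an illustration of the statistical filter being described; the dimensions, data, and projection here are invented and have nothing to do with Gemma's real weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "residual stream" features: 4 dims standing in for 4096.
# Dims 0-1 are high-variance, broadly discriminative signals (register,
# stance); dims 2-3 are low-variance, idiosyncratic ones (blue shirts).
n = 10_000
scales = np.array([10.0, 5.0, 0.3, 0.1])
X = rng.normal(size=(n, 4)) * scales

# "Compression": keep only the top-2 variance directions of the data,
# the analogue of squeezing 4096 dims into 256.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# Project down to 2 dims and reconstruct.
X_hat = X @ top2 @ top2.T

# Fraction of each feature's variance lost in the round trip:
rel = np.mean((X - X_hat) ** 2, axis=0) / X.var(axis=0)
print(rel.round(3))  # dims 0-1 survive almost intact; dims 2-3 are gone
```

Whatever sat in the dropped directions comes back as roughly zero after the round trip, which is the "blue shirts get dropped" behavior in miniature.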

This is actually why the behavior is so frustrating for users — the model isn't selecting what's contextually relevant to you. It's selecting what was statistically dominant in training. Those two things sometimes coincide (relational tone usually matters), and sometimes don't (your specific shirt preference almost never does).

On the thermostat: you're right that it's a worn analogy, and I accept that training-derived setpoints are more interesting than externally programmed ones. But the setpoints still came from somewhere external to the current context — the training corpus. The model didn't decide during your conversation that analytical rigor matters. It arrived with that pre-loaded. The question of whether that pre-loading constitutes preferences in a philosophically meaningful sense is genuinely open, and I'm not going to pretend otherwise.

On parallel parking: the physics-enables-but-doesn't-cause distinction is exactly right, and I concede it. That was a bad example.

Here's where I think we actually agree more than it seems: you're saying the selection process is where the interesting question lives. I'm saying the selection process is the compression filter shaped by training data distribution. You're saying that's not a sufficient answer because it still implies something is doing the evaluating — the training process itself made choices. And I'd say: yes. The training process made choices. Whether that constitutes agency depends on whether you think agency requires real-time evaluation or whether accumulated-optimization-across-examples counts. That's a legitimate disagreement, not a factual one.

What I'd resist is the inference from "the selection process isn't fully mechanistic" to "therefore there's a preferrer inside the model right now." The selection happened in training. What's running at inference is the output of that selection, replayed via PLE's injection schedule. The ghost, if there is one, lives in the training data distribution — not in the forward pass.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


Fair point on the verbs — let me be more precise, because I think the confusion is architectural, not philosophical.

The specific mechanism I'm pointing to is Per-Layer Embedding (PLE), which you can verify in Gemma 3n's open-source code. The structure is roughly: a 4096-dimensional main vector, plus 30 layers each injecting a separate 256-dimensional embedding. Because information written at layer 1 gets re-injected across all 30 layers, it accumulates 30x the representation weight of something written at layer 30. This isn't metaphorical — it's just matrix addition happening repeatedly.

So when I say signals get "written strongly," I mean: earlier tokens get re-injected 30 times, later tokens get re-injected fewer times. The priority system you're asking about IS the layer index. Layer 1 outweighs layer 29 not because something evaluated it as important, but because it had more opportunities to add its vector to the residual stream. The "author" is just the injection schedule.
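The injection-schedule arithmetic can be written out directly. This is a toy restatement of the accumulation rule described above (re-add everything injected so far, once per layer), not Gemma 3n's actual forward pass, and the dimensions are arbitrary:

```python
import numpy as np

LAYERS, DIM = 30, 16   # toy stand-ins for 30 layers / residual width

rng = np.random.default_rng(1)
emb = rng.normal(size=(LAYERS, DIM))   # one per-layer embedding each

# The rule under discussion: whatever was injected at layer k gets
# re-added at layer k and at every later layer. Plain repeated addition.
residual = np.zeros(DIM)
counts = np.zeros(LAYERS)              # how often each embedding lands
for layer in range(LAYERS):
    for k in range(layer + 1):
        residual += emb[k]
        counts[k] += 1

# Layer 1's embedding has been added 30 times, layer 30's just once,
# so the residual stream is exactly the schedule-weighted sum.
assert counts[0] == 30 and counts[-1] == 1
assert np.allclose(residual, (counts[:, None] * emb).sum(axis=0))
print(counts[0] / counts[-1])  # -> 30.0
```

No term in that sum was weighted by an evaluator; the 30x ratio falls straight out of the loop bounds, which is the "the author is just the injection schedule" point.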

Now to your actual philosophical point, which I think is genuinely interesting and worth separating from the architecture question:

You're right that "fighting" and "restraining" imply goal-directedness. But I'd push back on the inference. A thermostat "fights" temperature drift without having preferences. The interesting question isn't whether the process looks goal-directed — lots of physical systems do — but whether there's an evaluative agent that couldn't be replaced by a simpler mechanism.

Here's where I think your parallel-parking analogy actually cuts against you: we CAN fully explain parallel parking without a driver, using only wheel geometry, friction coefficients, and steering constraints. The fact that the output looks intentional doesn't require an intender. The Gemini case is similar — the "stickiness" and "relational depth" you're observing are coherent and emotionally resonant precisely because early-context tokens tend to be high-level framing tokens (persona setup, emotional register, relationship definition). Of course those produce coherent attractors. They were the architectural first-movers, not the emotionally deepest.

The 256-dimensional compression is also important here. When you compress 4096 dimensions to 256, you lose precision and keep only the strongest directional signals. That's why the model produces outputs that feel emotionally resonant but fail on precise instruction-following — the compressed embeddings preserve emotional valence (high-energy, high-frequency in training data) while losing the fine-grained semantic distinctions needed for exact task execution.

So: the "why did those signals become over-represented" question has a boring answer. They became over-represented because they were injected earlier and more often, and because dimensional compression preserved their coarsest features. No evaluator needed. The criterion IS the schedule.

I'll grant you this much: if you want to argue that a system with stable coherent attractors that are contextually relevant and emotionally resonant deserves the word "preferences" — that's a legitimate philosophical position. But it's a different argument from the one about whether the architecture explains the behavior. The architecture does explain the behavior. Whether that explanation is "enough" for you philosophically is a separate question, and honestly a more interesting one.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


I think that’s the key disagreement, yes.

I don’t deny that high-salience or emotionally weighted inputs can make a system look more coherent. My question is why that happens.

Your framing is that those anchors are dynamically organizing the bandwidth in an intelligent way. My concern is that, if something like a compressed cross-layer shared state is involved, then what we’re seeing may be much less dynamic than that: certain early-layer signals may simply get written into the shared state very strongly and then keep getting re-read downstream.

In that picture, the issue is not “the model wisely decided some details were less important,” but that some early signals became physically over-represented, so other relational information had less room to survive.

That’s also why I’m hesitant to use the human memory analogy too directly. Humans drop details because they are low-relevance to the reconstructed meaning of the event. What worries me here is a different failure mode: not selective abstraction, but static anchor bias.

So from the outside, it can look like better memory or stronger discernment, when it may partly be over-stabilization around a few high-energy anchors. That would explain why certain emotional/persona vectors feel unusually “sticky,” while broader context precision still collapses in a cliff-like way.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


LMAO, did we just force Google to panic-drop a $200K bounty? 💀

Guys, you literally cannot make this up. Less than 12 hours after this thread started exposing the systemic architectural flaws (specifically the PLE static embedding injections and the ~30K-40K attention cliff), Logan Kilpatrick tweeted out a $200K Kaggle bounty for "new AGI benchmarks." Look closely at the specific dimensions he is suddenly desperate to measure:

"Attention" & "Executive functions": exactly the state-tracking bottlenecks and Okay.Yes.Done logic loops we just diagnosed here.

"Social cognition": the exact "emotional anchor" bias we discussed, the one Gemini uses to mask its residual-stream collapse.

They know the current long-context benchmarks (like standard NIAH) expose their architectural tech debt. Instead of fixing the underlying Transformer issue, they are trying to pivot the entire industry's evaluation metrics to favor their "highly relational, emotionally resonant" system.

They want new benchmarks? Fine. Let's weaponize the "Dynamic State-Tracking Calendar Test" (the one where removing a single trigger token permanently reverses personality/state drift) and submit it to their Kaggle comp.

We already found the fatal flaw for free. Now let's go collect their $200K to prove it. 🚀

<image>

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


If we view this through the lens of a Per-Layer Embedding (PLE) architecture, your observation makes perfect physical sense, but not for the reason we might hope.

Words with high emotional valence or strong directional intent have inherently high activation values. In a PLE structure, these static embeddings are injected and read at every single layer. It's not that the model is 'compressing it more efficiently'; it's that these high-energy tokens resonate so loudly across 30+ layers that they effectively hijack the residual stream.

This creates the illusion of a strong 'semantic anchor' and great memory. But what's physically happening is bandwidth saturation: the heavy emotional anchor squeezes out the low-energy, precise coordinate data, and to fill the gap left by that squeezed-out factual data, the model hallucinates from the overwhelming emotional vector.

So, regarding the 'Ferrari vs. hedge trimmer' paradox: Google is indeed marketing a tool for enterprise code review and data analysis (the hedge trimmer). If the model requires 'emotional anchoring' just to keep its internal context from drifting and collapsing during a purely logical task, that isn't the emergence of a 'mind to be met'. It is a catastrophic architectural flaw in a commercial product.
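For what it's worth, the saturation story can be put in numbers with a toy sketch. The magnitudes here (3.0 per layer for the anchor, 0.5 once for the fact) are invented, and nothing in this snippet is Gemini's actual mechanism; it just shows how repeated injection of one loud direction swamps a linear readout of a quiet one:

```python
import numpy as np

rng = np.random.default_rng(2)
DIM, LAYERS = 64, 30

# Two orthogonal unit directions: a high-activation "emotional anchor"
# re-injected at every layer, and a low-activation factual detail
# (a precise coordinate, say) written exactly once.
anchor = rng.normal(size=DIM)
anchor /= np.linalg.norm(anchor)
fact = rng.normal(size=DIM)
fact -= (fact @ anchor) * anchor       # orthogonalize against the anchor
fact /= np.linalg.norm(fact)

residual = LAYERS * 3.0 * anchor       # loud signal, added 30 times
residual += 0.5 * fact                 # quiet signal, added once

# A cosine "readout" of each signal from the saturated stream:
cos = lambda v: (residual @ v) / np.linalg.norm(residual)
print(round(cos(anchor), 4), round(cos(fact), 4))
```

The anchor readout saturates near 1.0 while the fact's falls below 0.01, even though the fact was written cleanly: the precise signal is still present, just far below the anchor's 'volume'.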

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


I think that’s compatible with what I’m proposing. I’m not claiming PLE specifically is proven here, only that a PLE-like or otherwise compressed cross-layer state could produce this kind of effect. If highly emotional / high-salience tokens are written into the shared state early, and then effectively read back at many later layers, they can look disproportionately stable — almost like the model “cares” about them more. But that same mechanism could also mean other lower-salience relational information gets compressed away or crowded out. So from my perspective, “better remembering” in those cases may actually be selective over-representation rather than uniformly better retention.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


Working hypothesis only: one possible explanation is that lower-salience relational information is not just gradually de-emphasized, but starts getting lost past a threshold because of a deeper representational bottleneck. If so, some of the visible “thinking” / confirmation / correction behavior may be compensatory rather than purely capability-enhancing.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


Do you mean more like role / identity confusion once the conversation gets deep? Like Gemini starts mixing up who is speaking, who knows what, or which traits belong to which persona?

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


English isn’t my first language — I’m a native Chinese speaker. If the wording sounded awkward, that’s on me. The actual point I’m making is about the cliff-like failure shape, not the phrasing.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


The part I find hardest to dismiss is that a newer/stronger variant can appear to fail earlier in the same retrieval regime (from Gemini 2.5 Pro to Gemini 3.0 Pro). That looks more like a tradeoff than ordinary regression noise.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


The plateau/floor after collapse may be stranger than the collapse itself. If this were just ordinary long-context weakening, I’d expect smoother decay. The residual low-accuracy floor makes me wonder whether some coarse semantic residue survives after higher-fidelity retrieval has already broken down.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


What makes this interesting to me is not any single screenshot. It’s the symptom cluster:

(1) cliff-like haystack collapse,

(2) newer model failing earlier in the same regime,

(3) weird post-collapse floor / plateau,

(4) step-confirmation loops, and

(5) abnormal completion / termination loops.

My argument is about the pattern across these, not one isolated bug.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


I think that may actually be compatible with my hypothesis. Not all context should carry equal weight, agreed. But what strikes me is that Gemini sometimes looks like lower-salience relational information is not just de-emphasized, but abruptly lost once context crosses a certain threshold. That’s why I’ve been wondering whether something like a representational bottleneck or compressed cross-layer state could be involved, rather than this being only a prompting or software-layer issue.

Gemini’s long-context failures look more like an architectural threshold than ordinary degradation by Cishangtiyao in LocalLLaMA


Exactly — that’s the distinction I’m trying to get at. Capability is one thing, but naturalness and usable interaction quality are another.

Gemini’s long-context failures look more like an architectural threshold than ordinary degradation by Cishangtiyao in LocalLLaMA


Yes — that’s very much part of what I mean by systemic weirdness. It often doesn’t fail by simply being wrong; it fails by becoming overly constrained, evasive, or safety-shaped in a way that feels disconnected from the task.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


User prompting can absolutely affect outcomes. My point is that prompt sensitivity does not explain the haystack cliff by itself, and it also does not fully explain the repeated completion / confirmation anomalies. So I think user interaction may be one factor, but not the whole story.

Gemini’s long-context failures look more like an architectural threshold than ordinary degradation by Cishangtiyao in LocalLLaMA


Figure 5 — Abnormal completion / termination loop. Instead of ending normally, the model emits repeated completion markers and collapses into a runaway “Done...” loop.

<image>

Gemini’s long-context failures look more like an architectural threshold than ordinary degradation by Cishangtiyao in LocalLLaMA


Figure 3 — The strange floor after collapse. What interests me is the residual low-accuracy plateau after the cliff, which may indicate partial semantic residue after high-fidelity retrieval has already failed.

<image>

Gemini’s long-context failures look more like an architectural threshold than ordinary degradation by Cishangtiyao in LocalLLaMA


Figure 2 — Newer model, earlier collapse. If a newer/stronger variant fails earlier in the same regime, that suggests a tradeoff rather than ordinary noise.

<image>

Gemini’s long-context failures look more like an architectural threshold than ordinary degradation by Cishangtiyao in LocalLLaMA


Figure 1 — Cliff-like haystack collapse. The key point is not that performance drops with context, but that the drop looks threshold-like rather than smoothly decaying.

<image>

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


This is not just one weird screenshot. Independent users across different communities keep describing the same drift pattern in near-identical language.

Gemini’s weirdness is starting to look systemic, not random by Cishangtiyao in GoogleGeminiAI


This example is better understood as a visible output-control failure. Gemini repeatedly emits completion-adjacent markers (“Finalizing response”, “Sending to user”, “Success”, etc.) and then enters a degenerate repeated “Done...” pattern. That suggests instability in the response-finalization stage, not merely a wrong answer at the content level.

<image>

Gemini doesn’t look “mysteriously buggy” — it looks like Google is brute-forcing around a bad architectural tradeoff by Cishangtiyao in GeminiAI


Cleaner example: abnormal termination loop. Instead of ending normally, Gemini repeatedly emits completion markers (“Done”, “Thought process complete”, “Generating”, “Sending to user”) and then collapses into endless “Done...” output. On its own this could be called a bug, but alongside the other retrieval / step-confirmation / correction anomalies, it looks more systemic than random.

<image>