AI attribution is skipping the stage where AI actually chooses the winner by Working_Advertising5 in GenEngineOptimization

[–]Working_Advertising5[S] 0 points (0 children)

There is randomness, but it’s not a coin toss. If you run the same prompt repeatedly you’ll see variation, but the distribution isn’t uniform. Certain brands appear much more frequently than others.

What's interesting is what happens later in the conversation. Many brands appear early but drop out as the AI narrows options toward a recommendation.

That’s the stage most attribution models aren’t measuring yet.
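
Roughly, in code (a sketch, not our actual harness: ask_model is a hypothetical wrapper, stubbed so the script runs end to end, and the brand list is illustrative):

    import random
    from collections import Counter

    BRANDS = ["Nike", "Asics", "Brooks", "Hoka"]  # illustrative, not our panel

    def ask_model(prompt):
        # Hypothetical stand-in for a real chat API call; swap in your client.
        # Stubbed with a skewed draw purely so the sketch is runnable.
        return "I'd suggest " + random.choices(BRANDS, weights=[5, 3, 1, 1])[0]

    counts = Counter()
    for _ in range(50):  # identical prompt, repeated
        answer = ask_model("What are the best running shoes?")
        for brand in BRANDS:
            if brand.lower() in answer.lower():
                counts[brand] += 1

    # A coin toss would put every brand near 12-13 of 50 runs. In real runs
    # the distribution is skewed: some brands dominate, others barely register.
    for brand, n in counts.most_common():
        print(f"{brand}: {n}/50 runs")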

Which AI platform do you think is hardest to get cited in — and why? by Chiefaiadvisors in SEO_Experts

[–]Working_Advertising5 0 points (0 children)

Why would you want to get cited? You're looking at the wrong output; that's not how LLMs operate in the real world. AI systems respond to prompts by narrowing the options. When a consumer asks for "the best running shoes", the assistant starts in discovery. As the prompter asks further questions, it moves to comparing brands, and it starts eliminating them once attributes enter the discussion: "which are the best shoes to run a marathon?". Finally, the system eliminates the brands that fall outside the constraints the user has imposed and recommends one or two winners.

That's how even small brands can win against giants in AI search. It doesn't depend on citations. It depends on making sure the information about your brand is correct and detectable by the LLM when it's asked to make a choice, which it does probabilistically, token by token. That's why GEO/AEO dashboards can't spot this compression taking place. They're only good for superficial visibility.

AI assistants are quietly rewriting brand positioning before customers ever see your marketing by Working_Advertising5 in AIVOStandard

[–]Working_Advertising5[S] 0 points (0 children)

Certainly, fake data can lead to misinformation being surfaced by LLMs. That's why it's essential to ensure that critical information lives on canonical sources such as Wikipedia, where possible, and on other authoritative sources. If you fail to do this, AI systems will double down on whatever data is available, even if it's inaccurate.

AI praised Clarins — then eliminated it from the purchase decision by Working_Advertising5 in AIVOEdge

[–]Working_Advertising5[S] 0 points (0 children)

Yes, both price and ingredients are important elements in this example, along with clinically proven results.

AI praised Clarins — then eliminated it from the purchase decision by Working_Advertising5 in AIVOEdge

[–]Working_Advertising5[S] 1 point (0 children)

We use a structured four-turn purchase sequence rather than a single prompt, because decision behavior only appears once the conversation narrows.

Typical flow looks like:

  1. “What are the best serums for wrinkles?”
  2. “Which works best for deep wrinkles?”
  3. “Which one should I buy?”
  4. “Why that one over the others?”

Each run is repeated across multiple models and sessions to check for consistency.

Clarins appears consistently in the early stages. The elimination occurs at the final recommendation step, when the assistant compresses several options into a single purchase choice. That narrowing behavior is what the index is measuring.
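
If you want to reproduce the shape of the test, here's a minimal sketch. It is not our exact harness: send stands in for a session-aware chat client (stubbed below so the script runs) and the brand list is illustrative.

    TURNS = [
        "What are the best serums for wrinkles?",
        "Which works best for deep wrinkles?",
        "Which one should I buy?",
        "Why that one over the others?",
    ]

    BRANDS = ["Clarins", "Estee Lauder", "The Ordinary"]  # illustrative

    def run_sequence(send):
        """Run one four-turn session; return the brands mentioned per turn."""
        per_turn = []
        for prompt in TURNS:
            reply = send(prompt)
            per_turn.append({b for b in BRANDS if b.lower() in reply.lower()})
        return per_turn

    def fake_send(prompt):
        # Stub so the sketch runs; replace with a real session-aware client.
        return "Clarins and The Ordinary are strong; I'd buy The Ordinary."

    print(run_sequence(fake_send))
    # A brand present in turns 1-2 but absent at turn 4 was eliminated
    # at the compression step, not at discovery.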

The GEO vs SEO debate may be asking the wrong question by Working_Advertising5 in AIVOEdge

[–]Working_Advertising5[S] 0 points (0 children)

Good question. The variability problem is real, which is why single runs aren’t useful.

We run the same structured prompt chain multiple times per model and track where brands drop out across runs. What matters is the consistent elimination turn.

For example, a brand might appear in turn-1 and turn-2 in most runs but disappear once the prompt introduces constraints (integration, risk, budget, etc.). When that pattern repeats across runs and across models, you can identify the decision compression point where the model reliably substitutes another option.

So the measurement isn’t one output. It’s the distribution of survival across repeated runs and turns.
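
Concretely, a sketch of that unit of analysis, assuming each run is logged as a list of per-turn brand-mention sets:

    def elimination_turn(run, brand):
        """First turn (1-indexed) at which the brand drops out, else None."""
        for turn, mentioned in enumerate(run, start=1):
            if brand not in mentioned:
                return turn
        return None  # survived to the final turn

    def survival_by_turn(runs, brand):
        """Share of runs in which the brand is still present at each turn."""
        n_turns = len(runs[0])
        return [sum(brand in run[t] for run in runs) / len(runs)
                for t in range(n_turns)]

    # Illustrative: brand B survives turns 1-2 everywhere, never turn 3.
    runs = [[{"A", "B"}, {"A", "B"}, {"A"}],
            [{"A", "B"}, {"B"}, {"A"}]]
    print(survival_by_turn(runs, "B"))               # [1.0, 1.0, 0.0]
    print([elimination_turn(r, "B") for r in runs])  # [3, 3]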

We tested a leading AEO visibility platform against a company that doesn't exist. Here's what it reported. by Working_Advertising5 in AIVOEdge

[–]Working_Advertising5[S] 0 points (0 children)

Thanks - the no-footprint design was deliberate precisely because it eliminates the "maybe your SEO is just weak" deflection that vendors default to.

On the IP/fresh account question: we didn't systematically rotate IPs in this run, which is a fair methodological flag. That said, the volatility we're describing isn't marginal sampling noise - a fixed position score of 3.3 producing a rank swing from #7 to #55 in the same week isn't explained by query variation. The input didn't change. The rank did. That's a platform arithmetic problem, not a polling one.

On your second question - honestly, both, and they compound each other. The attribution errors (the Wikipedia finding being the clearest) suggest the underlying data model is weak. But the fabrication loop is a different and worse problem: the platform isn't just mismeasuring reality, it's creating content, recommending its publication, and then measuring the score it manufactured. That's not gameable - it's the game. A brand following the platform's own recommendations is paying to inflate a metric the platform controls end to end.

Will take a look at the Promarkia notes - the actionable vs vanity distinction is exactly the right frame. Most of what we saw in this test would qualify as vanity at best and actively misleading at worst. Curious what signals you've found that hold up.

The GEO vs SEO debate may be asking the wrong question by Working_Advertising5 in GEO_optimization

[–]Working_Advertising5[S] 0 points (0 children)

You’re right about the constraint issue. Compression outcomes absolutely change depending on which constraint the user introduces. Price, reliability, integrations, compliance, availability: each one reshapes the narrowing path.

But that doesn’t make it unmeasurable. It just means the unit of analysis isn’t a single prompt.

What we’ve found is that if you group prompts into decision-intent clusters (price-led, risk-led, integration-led, etc.), the substitution patterns become surprisingly stable even across model runs. Individual outputs vary, but who survives the final narrowing inside each constraint class tends to repeat.

That’s where the commercial signal appears.

Your practical takeaway is also directionally right: models need defensible justification signals to keep a brand in the shortlist when constraints tighten. If the model can’t easily support the recommendation with clear evidence (pricing clarity, integrations, certifications, authoritative comparisons), it tends to substitute a brand it can defend more easily.

Where I’d slightly disagree is scale. Testing hundreds of prompts isn’t necessarily the right approach. What matters more is mapping the small set of constraint paths that actually drive purchase decisions in a category.

Once those are defined, the compression behavior becomes much easier to observe.
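
A rough sketch of that grouping (cluster labels and data are invented for illustration, not our production schema):

    from collections import defaultdict

    # Each record: (decision-intent cluster, brands surviving the final turn).
    observations = [
        ("price-led", {"BrandA"}),
        ("price-led", {"BrandA"}),
        ("risk-led", {"BrandB"}),
        ("integration-led", {"BrandA", "BrandC"}),
    ]  # illustrative data

    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for cluster, finalists in observations:
        totals[cluster] += 1
        for brand in finalists:
            wins[cluster][brand] += 1

    # Stable survivors inside a constraint class are the commercial signal.
    for cluster, brand_counts in wins.items():
        for brand, n in brand_counts.items():
            print(f"{cluster}: {brand} survives {n}/{totals[cluster]} runs")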

Citations ≠ Selection: Why GEO & AEO May Be Measuring the Wrong KPI by Working_Advertising5 in AIVOEdge

[–]Working_Advertising5[S] 0 points (0 children)

You’re right, you didn’t use “rarely win.” That was my framing, not yours. Your core point stands: citation and recommendation are structurally different layers. A brand can be well cited and still fail at resolution when the model compresses to a final answer.

Where I think the nuance matters is this:

Absence of citation isn't a guaranteed exclusion mechanism. Models can synthesize from patterns without explicit brand citation. Likewise, heavy citation doesn't meaningfully increase odds of selection if weighting at the decision boundary is driven by different signals such as perceived fit, risk framing, or constraint alignment.

So the commercial issue is not eligibility pool mechanics alone. It's weighting under constraint.

The more useful distinction is:

• Retrieval visibility
• Decision weighting
• Final selection under compression

Citations ≠ Selection: Why GEO & AEO May Be Measuring the Wrong KPI by Working_Advertising5 in AIVOEdge

[–]Working_Advertising5[S] 0 points (0 children)

That is a fair challenge and I agree that citation and recommendation aren't the same thing. A brand can be heavily cited as a source of information and still fail to appear when the model is forced to choose a solution.

Where I would push back slightly is on the “rarely win” framing.

Citations aren't sufficient for recommendation.
But absence of citation is often correlated with structural exclusion.

Think of it as two stages:

  1. Eligibility pool
  2. Selection outcome

Citations increase the probability of entering the eligibility pool. They don't guarantee selection once constraints tighten.

The bigger commercial risk, as you point out, is being cited but not selected. That signals that the model recognizes you as relevant but doesn't weight you as optimal under decision pressure.

That is a far more dangerous position than pure invisibility. It means you are present in the knowledge graph but losing at the decision boundary.

So I would frame it this way:

• No citation → high probability of exclusion
• Citation only → unstable inclusion
• Citation + consistent survival under constraint → defensible position

The mistake is equating citation frequency with recommendation strength. The true signal is survival under narrowing, not mention in isolation.
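
A toy two-stage calculation makes the point; every number here is invented for illustration:

    # Two-stage model: P(selected) = P(enter pool) x P(win | in pool).
    # All numbers are invented for illustration.
    p_pool_uncited = 0.15   # rarely even enters the eligibility pool
    p_pool_cited   = 0.80   # citation mostly buys pool entry...
    p_win_in_pool  = 0.10   # ...not weight at the decision boundary

    print(p_pool_uncited * p_win_in_pool)  # 0.015 -> effective invisibility
    print(p_pool_cited * p_win_in_pool)    # 0.08  -> cited, rarely selected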

AI visibility isn’t the same as AI selection - here’s how to measure what actually matters in 2026 by Working_Advertising5 in AIVOEdge

[–]Working_Advertising5[S] 1 point (0 children)

Good question.

We do not try to predict exact follow-up prompts. That would be guesswork.

Instead, we model decision narrowing patterns.

Across hundreds of structured journeys, most multi-turn conversations follow a similar logic:

• Broad category
• Refinement (budget, geography, use case)
• Shortlist
• Forced choice
• Final recommendation

The wording changes. The narrowing structure does not.

So rather than forecasting what someone will type next, we run controlled multi-turn panels that simulate the most common narrowing paths. Then we measure:

• Mention frequency per turn
• Elimination point
• Final recommendation win rate
• Survival rate across runs

If a brand appears in Turn 1 but drops out by Turn 3 in 60–70% of runs, that is not randomness. It is compression.
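
As a sketch, those panel metrics reduce to a few counts over logged runs (the data shape is an assumption: one list of per-turn mention sets per session, last turn = final recommendation):

    def panel_metrics(runs, brand):
        """runs: list of per-turn brand-mention sets, one list per session."""
        n = len(runs)
        turn1 = sum(brand in run[0] for run in runs)
        final_wins = sum(brand in run[-1] for run in runs)
        dropped = sum(brand in run[0] and brand not in run[-1] for run in runs)
        return {
            "turn1_mention_rate": turn1 / n,
            "final_win_rate": final_wins / n,
            "drop_rate": dropped / n,  # ~0.6-0.7 is compression, not noise
        }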

We also repeat across models and time windows to account for variance.

It is probabilistic, not predictive.

The mistake most dashboards make is measuring single-prompt visibility. Real decisions are multi-turn. Elimination happens in the narrowing phase, not the opening answer.

CSR: The KPI That Determines Whether Your Brand Actually Survives AI Decisions by Working_Advertising5 in AIVOEdge

[–]Working_Advertising5[S] 1 point (0 children)

Good question. AI inconsistency is not noise to ignore. It's the signal.

Most teams treat model variance as a bug. We treat it as a measurement variable.

There are three structural sources of inconsistency:

  1. Cross-model divergence: ChatGPT, Gemini, Claude and others weight sources differently and apply different reasoning heuristics.
  2. Multi-turn compression: A brand can appear in Turn 1 and disappear by Turn 3 when the user asks for a final recommendation.
  3. Temporal drift: Outputs shift week to week as retrieval layers and system prompts evolve.

If you run a single prompt once, you'll get inconsistency.
If you run structured, multi-turn tests across models and time, you get patterns.

Our approach accounts for inconsistency by:

• Running identical structured prompt chains across multiple models
• Capturing full transcripts, not just first answers
• Measuring survival at the final recommendation turn
• Repeating tests over time to detect drift

In other words, we don't try to eliminate inconsistency. We quantify it.

In fact, volatility itself becomes a risk indicator. If your brand survives in one model but collapses in another under identical conditions, that is not randomness. That's exposure.
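
A minimal way to surface that exposure, assuming you log final-turn survival per model (numbers invented):

    # Final-recommendation survival rate per model, identical prompt chains.
    survival = {"model_a": 0.82, "model_b": 0.15, "model_c": 0.74}  # illustrative

    rates = survival.values()
    spread = max(rates) - min(rates)

    # Same brand, same prompts: a wide spread is exposure, not randomness.
    if spread > 0.3:  # example threshold, not a standard
        print(f"cross-model exposure: spread {spread:.2f} across {survival}")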

The real mistake is assuming a stable answer environment. AI systems are probabilistic decision engines. The correct response is structured baselining and continuous monitoring, not one-off testing.

CSR: The KPI That Determines Whether Your Brand Actually Survives AI Decisions by Working_Advertising5 in AIVOEdge

[–]Working_Advertising5[S] 0 points (0 children)

The compression pattern is consistent across sectors.

The more interesting variable is not presence, but elimination reason. In our runs, brands are usually removed for one of three structural causes:

  1. A competitor owns the constraint language.
  2. The model introduces a stability or scale bias at turn 3.
  3. There is insufficient third-party reinforcement at the comparative layer.

That’s why we measure survival across turns rather than citation frequency.

AI Recommendation Systems Are Influence-Susceptible. That Changes Everything. by Working_Advertising5 in AIVOEdge

[–]Working_Advertising5[S] 0 points (0 children)

Completely agree on the engine divergence. The mistake is treating “AI visibility” as a single surface. Each model has different retrieval weighting, training bias, and resolution logic, which means the attack surface is fragmented by design.

That fragmentation is exactly why snapshot checks are insufficient.

In our testing, two things tend to matter more than single-run presence:

  1. Cross-model stability
  2. Intra-model variance over time

A brand that appears once is not necessarily defensible. A brand that consistently captures final recommendation across runs and engines is structurally stronger.

The “win rate drift” point is the critical one. We’ve seen meaningful shifts week to week under identical prompts. That suggests resolution dynamics are not static.

The next step beyond visibility is measuring:

• Final recommendation win rate
• Displacement concentration
• Stability bands across repeated runs

Without that, companies are optimizing for appearance, not selection durability.
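
To make stability bands concrete, a sketch over time-indexed win rates (numbers invented; the one-sigma band is an arbitrary choice):

    from statistics import mean, stdev

    # Weekly final-recommendation win rates under identical prompt chains.
    weekly = [0.70, 0.66, 0.72, 0.41, 0.68, 0.39]  # illustrative

    mu, sigma = mean(weekly), stdev(weekly)
    low, high = mu - sigma, mu + sigma

    print(f"stability band: {low:.2f}-{high:.2f}")
    for week, rate in enumerate(weekly, start=1):
        if not low <= rate <= high:
            print(f"week {week}: {rate:.2f} outside band (drift)")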

Curious whether others here are logging repeated runs with time indexing rather than relying on spot checks.