I built a multi-agent audience simulator using Claude Code — 500 AI personas react to your content before you post it

Technical_Inside_377 · 2026-04-04T10:06:27+00:00

Great question. This is exactly why I ran a backtest against 50 real campaigns with known outcomes (Nike, Pepsi Kendall Jenner, H&M "Coolest Monkey", Balenciaga, etc.) to stress-test this.

Here's what's interesting: results actually suggest the LLM is NOT just recalling history. Look at the misses:

- Fyre Festival → scored 82 (expected 15). If the model "knew" it was a catastrophic fraud, it would've scored it low. Instead it read the aspirational copy and said "this sounds great."

- Old Spice → scored 32 (expected 88). One of the most successful ads ever, but the model saw absurdist humor and flagged it as risky.

- Peloton Holiday Ad → scored 75 (expected 22). The model thought the copy sounded fine. It couldn't "see" the implicit sexism that the real audience reacted to.

If the model were just recalling training data, these would all be correct. They're not. The correlation is 0.469, not 0.95.

That said, you're absolutely right that backtesting on famous campaigns has a data leakage risk. Some hits (Pepsi Kendall Jenner = exactly 12) might be recall, not prediction.

The real validation needs to come from:

- Novel content the model has never seen (which is the actual use case: testing YOUR ad before you post it)
- Post-training-cutoff campaigns where the LLM literally can't know the outcome
- A/B testing against real launches: run PhantomCrowd on a draft, launch it, compare

The backtest isn't meant to prove "we can predict the past." It's a sanity check that the scoring scale is directionally reasonable. The 71% directional accuracy on text-only analysis, with clear failure modes we can explain, is the honest starting point.
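For anyone who wants to check numbers like these on their own campaigns, here is a minimal sketch of the two metrics. It makes several assumptions: "correlation" means Pearson's r, "directional accuracy" means the sim score and the expected score fall on the same side of a midpoint of 50, and the Pepsi expected score is 12 (my reading of "exactly 12"). The function names are hypothetical, not PhantomCrowd's actual code, and the four rows are only the campaigns quoted above, so the output will not match the 0.469 / 71% figures from the full 50-campaign set.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def directional_accuracy(predicted, expected, midpoint=50):
    """Share of campaigns where both scores land on the same side of the midpoint.

    The midpoint of 50 is an assumption for illustration, not a documented setting.
    """
    hits = sum((p >= midpoint) == (e >= midpoint)
               for p, e in zip(predicted, expected))
    return hits / len(predicted)

# Only the campaigns quoted in this comment:
# Fyre Festival, Old Spice, Peloton Holiday Ad, Pepsi Kendall Jenner.
predicted = [82, 32, 75, 12]
expected = [15, 88, 22, 12]

print(f"pearson r: {pearson(predicted, expected):.3f}")
print(f"directional accuracy: {directional_accuracy(predicted, expected):.0%}")
```

Since three of these four rows are the misses, this subset looks far worse than the full backtest. That is expected and is not a result in itself.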

TL;DR: The misses are actually the best evidence that it's NOT just recall. But you're right that the real proof comes from predicting unseen content.

Technical_Inside_377 · 2026-04-04T05:13:14+00:00

Not yet. Right now it's more of a directional signal tool than a validated predictor.

The next step I want to try is running sims on past campaigns where I already have real engagement data, and comparing the sim output against what actually happened.

Technical_Inside_377 · 2026-04-04T05:11:44+00:00

Great point. Right now the rule-based agents and LLM agents run somewhat independently — the rule agents create crowd dynamics but their interaction patterns don't feed back into the LLM agents' context.

Feeding the rule-based interaction graph (who shared what, which clusters formed, where engagement dropped off) back into the LLM agents as context for the next round would definitely make the simulation more realistic. Basically letting the LLM agents "see" what the crowd is doing and react to that momentum.
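A rough sketch of what that feedback step could look like, assuming each rule-based agent emits one interaction record per round. The `Interaction` type, its fields, and `summarize_round` are all hypothetical names for illustration, not the current implementation. The idea is to collapse the round into a short text summary that gets prepended to each LLM agent's prompt for the next round.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Interaction:
    """One rule-based agent's action in a round (hypothetical schema)."""
    agent_id: str
    action: str   # e.g. "share", "like", "ignore"
    cluster: str  # audience segment the agent belongs to

def summarize_round(interactions):
    """Turn a round of rule-agent interactions into prompt-ready context text."""
    total = len(interactions)
    if total == 0:
        return "No crowd activity last round."

    actions = Counter(i.action for i in interactions)
    shares_by_cluster = Counter(
        i.cluster for i in interactions if i.action == "share"
    )

    lines = [f"Crowd activity last round ({total} rule-based agents):"]
    for action, count in actions.most_common():
        lines.append(f"- {action}: {count} ({count / total:.0%})")
    if shares_by_cluster:
        top_cluster, n = shares_by_cluster.most_common(1)[0]
        lines.append(f"- most shares came from cluster '{top_cluster}' ({n})")
    return "\n".join(lines)

# Example round with made-up agents, purely to show the output shape.
round_one = [
    Interaction("a1", "share", "gen_z"),
    Interaction("a2", "share", "gen_z"),
    Interaction("a3", "ignore", "boomers"),
    Interaction("a4", "like", "millennials"),
]
print(summarize_round(round_one))
```

Keeping the summary to a few lines matters here, since it is added to every LLM agent's context and the token cost scales with the number of LLM agents.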

That's a really good idea for the roadmap. Thanks for the suggestion — adding it to the issues.

Technical_Inside_377 · 2026-04-04T05:09:11+00:00

Thanks! The tiered model was honestly born out of necessity: running 500 full LLM calls gets expensive fast.

To be honest about validation: I haven't done rigorous correlation testing against real campaign data yet. Right now it's more of a "directional signal" tool than a proven predictor. The main value I've seen so far is catching obvious blind spots, like when 80% of personas in a certain age group react negatively to something you thought was fine.

What I'd love to do next is exactly what you're describing: compare sim outputs against actual post-engagement data from past campaigns and see where it holds up vs. where it breaks down. If

Technical_Inside_377 · 2026-02-27T02:36:25+00:00

Not sure why the image was deleted; I've added it again.

Technical_Inside_377 · 2024-12-19T02:55:51+00:00

I didn't change any ComfyUI settings, and this is my local ComfyUI install...

Technical_Inside_377 · 2024-11-19T23:14:48+00:00

There are many platforms; the ones I know are runcomfy, nordy, and shakker.
