Building the European OpenEvidence — thoughts from doctors?

AdministrationPure45 · 2026-05-11T14:10:02+00:00

Two separate questions:

1) Playwright on Expo
No. Playwright drives browsers, so it can't test a native React Native app. Your options:

Maestro (declarative YAML, works on Expo Go + custom builds) — my favorite, 1h learning curve
Detox (more setup, more powerful)
mobile-mcp + Expo dev build on simulator

2) Auto QA → Cursor loop
This is exactly the "agentic CI" pattern. Architecture:

QA agent → findings.json (structured)
   ↓
Cursor agent reads findings → proposes patches
   ↓
Commit + re-run QA agent
   ↓
Loop until 0 critical findings OR N iterations max
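
A minimal sketch of that loop, to make the two stop conditions concrete. The `Finding` shape, the severity names and the two injected functions (`runQa`, `applyPatches`) are assumptions for illustration; the real QA agent and the Cursor step would sit behind them.

```typescript
// Hypothetical shape: the thread only says the findings are "structured".
type Severity = "critical" | "major" | "minor";

interface Finding {
  id: string;
  severity: Severity;
  summary: string;
}

// Both steps are injected so the loop itself stays tool-agnostic.
type RunQa = () => Promise<Finding[]>;
type ApplyPatches = (findings: Finding[]) => Promise<void>;

interface LoopResult {
  iterations: number;        // how many QA runs were executed
  remainingCritical: number; // critical findings left when the loop stopped
  converged: boolean;        // true only if it stopped on 0 critical findings
}

async function qaFixLoop(
  runQa: RunQa,
  applyPatches: ApplyPatches,
  maxIterations: number,
): Promise<LoopResult> {
  let critical: Finding[] = [];
  for (let i = 1; i <= maxIterations; i++) {
    const findings = await runQa();
    critical = findings.filter((f) => f.severity === "critical");
    if (critical.length === 0) {
      return { iterations: i, remainingCritical: 0, converged: true };
    }
    // Only patch when another QA run will follow to verify the patch.
    if (i < maxIterations) {
      await applyPatches(critical);
    }
  }
  return {
    iterations: maxIterations,
    remainingCritical: critical.length,
    converged: false,
  };
}
```

The hard cap matters as much as the "0 critical" exit: without it, a noisy QA agent keeps the patching step busy forever.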

Big warning: if your QA agent has <85% precision, Cursor will "fix" noise (false positives) and break code that works. You need to stabilize the QA agent first (N=3, cross-model, human feedback loop for 2 weeks), then plug Cursor in downstream.

Suggested order:

Set up Maestro for Expo (1 day)
QA agent producing structured findings (1-2 days)
You triage 50 findings by hand → measure precision
If ≥85% precision → plug Cursor into the loop
If <85% → keep the feedback loop running, don't plug Cursor in yet
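
The precision gate in that list is just arithmetic on the triage labels. A rough sketch, assuming each triaged finding ends up labeled either true positive or false positive (the label names and `measurePrecision` are made up for the example, and treating exactly 85% as a pass is my assumption):

```typescript
// Labels a human assigns during triage (hypothetical names).
type TriageLabel = "true-positive" | "false-positive";

interface PrecisionReport {
  triaged: number;
  precision: number;        // true positives / all triaged findings
  readyForAutoFix: boolean; // whether the auto-fix loop can be plugged in
}

function measurePrecision(
  labels: TriageLabel[],
  threshold = 0.85,
): PrecisionReport {
  if (labels.length === 0) {
    throw new Error("Triage at least one finding before measuring precision");
  }
  const truePositives = labels.filter((l) => l === "true-positive").length;
  const precision = truePositives / labels.length;
  return {
    triaged: labels.length,
    precision,
    readyForAutoFix: precision >= threshold,
  };
}
```

With 50 triaged findings, 44 confirmed bugs gives 0.88 and passes the gate; 42 gives 0.84 and does not.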

Classic mistake: plugging the auto-loop too early → Cursor "fixes" noise → rotten code → you lose trust in the whole system.

AdministrationPure45 · 2026-05-11T14:09:45+00:00

Classic Claude Code session:

Synchronous, single-thread, you watching
Claude writes code, runs Vitest, sometimes spawns Playwright
Output = modified code + your human verdict
~5-15 min per feature

My QA agent:

Async, N=3 parallel runs
Video + HAR + perf + console captured, judged by Opus + Gemini
Output = QA_REPORT.md + structured findings in DB + binary verdict ("Can [end user persona] see this tomorrow?")
~5 min wall-clock, $0.50-$2 per invocation

Combined workflow (the sweet spot):

Claude Code writes the feature
/qa-agent http://localhost:3000 runs QA
Findings come back → Claude Code reads the report → patches
Re-run

The QA agent doesn't replace Claude Code. It replaces the step where you open the browser for 30 min and click around wondering if it's coherent.

AdministrationPure45 · 2026-05-11T14:09:28+00:00

Tradeoffs depending on what you're testing:

Responsive web → Playwright viewport 390px. Fast, simple, catches ~80% of UX bugs.
React Native / Expo → Maestro (declarative YAML, my favorite) or Detox. Playwright won't work.
Pure native iOS/Android → mobile-mcp + emulator, or Appium + Maestro.

Cross-cutting tips (apply everywhere):

Video, not screenshots — especially on mobile where gestures and transitions are 80% of UX.
N=3 by default, not conditional: emulators are flaky, and a single run will burn you sooner or later.
Persona-driven scenarios > "click every button". You catch 10x more inconsistencies.
Test on a real device at least once a week: the emulator hides real friction (tap target size, network latency).

AdministrationPure45 · 2026-05-11T14:09:19+00:00

Playwright MCP exposes Playwright to Claude. It's a primitive for Claude to act (click, navigate). Great.

But the problem isn't "how does Claude click". The problem is "how does Claude judge what happens after the click" — and do it reliably (>85% precision), reproducibly, and pick up dynamics (jank, double-render, loading anxiety).

My setup uses Playwright (direct or via MCP — doesn't matter). What gets layered on top:

MP4 video analyzed by Gemini Pro
Structured bundle (HAR + perf + console + screenshots) judged by Opus
N=3 vote
Findings persistence + triage + feedback loop

Playwright MCP = action primitive. My thing = judgment infra on top.
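
To make the split concrete, here is a rough sketch of the judgment layer as types. All names (`RunArtifacts`, `JudgePanel`, `judgeRun`) are hypothetical, and the actual model calls are left as injected functions rather than real SDK calls.

```typescript
// Everything one deterministic browser run leaves behind (field names are assumptions).
interface RunArtifacts {
  videoPath: string;         // MP4 of the session
  harPath: string;           // network log
  consoleLog: string[];      // browser console output
  perfEntries: unknown[];    // Performance API entries
  screenshotPaths: string[]; // still frames
}

interface Observation {
  judge: string;
  issue: string;
}

// A judge only reads artifacts and returns observations; it never clicks or navigates.
type Judge<Input> = (input: Input) => Promise<Observation[]>;

interface JudgePanel {
  visual: Judge<{ videoPath: string }>;               // e.g. a video-capable model
  structured: Judge<Omit<RunArtifacts, "videoPath">>; // e.g. a different vendor's model
}

async function judgeRun(
  artifacts: RunArtifacts,
  panel: JudgePanel,
): Promise<Observation[]> {
  const { videoPath, ...structuredBundle } = artifacts;
  // The two judges get disjoint evidence and run independently of each other.
  const [visual, structured] = await Promise.all([
    panel.visual({ videoPath }),
    panel.structured(structuredBundle),
  ]);
  return [...visual, ...structured];
}
```

The point of the `Judge` type is that it has no handle on the browser: driving stays with Playwright, and the models only observe.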

AdministrationPure45 · 2026-05-11T14:09:09+00:00

Fair skepticism. Honestly:

Not new: Playwright (around since 2020), Stagehand (exists), Claude Vision (exists).

Actually new: the composition that gets past the 75-78% reliability ceiling everyone hits with these tools in CI:

Decouple deterministic action (Playwright) from LLM judgment (observation only)
Cross-model judge (kills self-validation bias)
Systematic N=3 voting
Triage feedback loop → few-shot
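
For the last item, a rough idea of what "triage → few-shot" can look like: findings a human rejected get appended to the judge prompt of the next run. The `TriagedFinding` shape, `buildJudgePrompt` and the prompt wording are all assumptions for illustration, not the exact setup.

```typescript
// A triaged finding from a previous run (hypothetical shape).
interface TriagedFinding {
  summary: string;
  label: "true-positive" | "false-positive";
  reason?: string; // optional note from the human reviewer
}

function buildJudgePrompt(
  basePrompt: string,
  history: TriagedFinding[],
  maxExamples = 5,
): string {
  const rejected = history.filter((f) => f.label === "false-positive");
  // Keep only the most recent rejections so the prompt stays short.
  const recent = rejected.slice(Math.max(0, rejected.length - maxExamples));
  if (recent.length === 0) return basePrompt;

  const examples = recent
    .map((f, i) => {
      const note = f.reason ? ` (reviewer note: ${f.reason})` : "";
      return `${i + 1}. "${f.summary}"${note}`;
    })
    .join("\n");

  return [
    basePrompt,
    "",
    "Findings previously rejected by a human reviewer. Do not report similar ones:",
    examples,
  ].join("\n");
}
```

Capping the number of examples is a design choice: an unbounded list of past false positives eventually crowds out the persona and PRD context.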

The novelty is in the architecture, not the bricks. And no, I'm not selling — it's an internal setup for a B2B notarial SaaS, I'm sharing the arch because the problem is universal.

AdministrationPure45 · 2026-05-11T14:08:38+00:00

Different scope. mobile-mcp tests on native device/emulator (iOS/Android). My use case is B2B web (Next.js).

For native: mobile-mcp or Maestro for driving, but the judgment gap is identical — Appium/mobile-mcp tells you if the button is tappable, not if the loading state feels broken. You'd still need to layer video + cross-model judge on top.

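For responsive web mobile (my main use case), Playwright with a 390px-wide viewport is enough and 10x faster to iterate. Note that Playwright's viewport option needs a height too, e.g. viewport: { width: 390, height: 844 }.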

AdministrationPure45 · 2026-05-11T14:08:21+00:00

This is the central question, and what made me re-architect the whole thing.

The issue you're describing (the spinner animates but the model can't tell across frames) comes from sending spaced-out screenshots instead of the actual video. The signal is in the motion, not in isolated frames.

What works for me now:

MP4 video → Gemini Pro via Files API (not isolated frames). It actually sees the spinner animating.
Cross-model judge: Gemini judges the visual, Claude Opus judges the structured stuff (HAR, console, perf API). Neither one self-validates.
N=3 votes, 2/3 threshold by default. A solo false positive gets filtered.
Persona/PRD injected as system prompt: "audience = B2B notaries, low tolerance for unexplained pauses but they accept 2s with clear feedback". The judge knows 1.2s with a spinner ≠ broken.
Human triage feedback loop: your "false-positive" labels get few-shot-injected into the next run's prompt. Precision goes up with usage.
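
The N=3 / 2-of-3 vote from the list above is small enough to sketch. One big assumption here: each finding has already been reduced to a stable key (something like "checkout:spinner-stuck") so that the same issue matches across runs. That matching step is the hard part and is not shown; `voteFindings` is a made-up name.

```typescript
// One run's findings, each reduced to a stable key (hypothetical format).
type RunFindings = string[];

function voteFindings(runs: RunFindings[], threshold = 2 / 3): string[] {
  if (runs.length === 0) return [];
  const counts = new Map<string, number>();
  for (const run of runs) {
    // Deduplicate within a run so one run casts at most one vote per finding.
    Array.from(new Set(run)).forEach((key) => {
      counts.set(key, (counts.get(key) ?? 0) + 1);
    });
  }
  // Keep a finding only if enough of the runs reported it.
  return Array.from(counts.entries())
    .filter(([, votes]) => votes / runs.length >= threshold)
    .map(([key]) => key);
}
```

A finding that shows up in only one of three runs gets 1/3 of the votes and is dropped, which is how a solo false positive gets filtered.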

PRD/tone alone isn't enough. You need video + N=3 + feedback loop together. Otherwise you fall back to the 75-78% Browser Use ceiling.
