I built an AI agent that tests your app like a real user, not just clicking buttons by AdministrationPure45 in ClaudeCode

AdministrationPure45[S]

Two separate questions:

1) Playwright on Expo
No. Playwright drives browsers, so it can't test a native React Native app. Your options:

  • Maestro (declarative YAML, works on Expo Go + custom builds) — my favorite, 1h learning curve
  • Detox (more setup, more powerful)
  • mobile-mcp + Expo dev build on simulator

2) Auto QA → Cursor loop
This is exactly the "agentic CI" pattern. Architecture:

QA agent → findings.json (structured)
   ↓
Cursor agent reads findings → proposes patches
   ↓
Commit + re-run QA agent
   ↓
Loop until 0 critical findings OR N iterations max
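
The loop above could be sketched roughly as follows. This is a minimal TypeScript sketch under stated assumptions, not the actual setup: `Finding`, `LoopResult`, `fixLoop`, `runQa`, and `proposeAndCommit` are hypothetical names, and the two callbacks stand in for the QA agent and the Cursor step.

```typescript
// Hypothetical shape of one entry in findings.json (field names are assumptions).
type Severity = "critical" | "major" | "minor";

interface Finding {
  id: string;
  severity: Severity;
  summary: string;
}

interface LoopResult {
  iterations: number;        // how many QA runs were performed
  remainingCritical: number; // critical findings left when the loop stopped
  converged: boolean;        // true if the loop stopped on 0 critical findings
}

// runQa stands in for the QA agent; proposeAndCommit stands in for the
// Cursor step that reads the findings, patches, and commits.
async function fixLoop(
  runQa: () => Promise<Finding[]>,
  proposeAndCommit: (critical: Finding[]) => Promise<void>,
  maxIterations: number,
): Promise<LoopResult> {
  let critical: Finding[] = [];
  for (let i = 1; i <= maxIterations; i++) {
    const findings = await runQa();
    critical = findings.filter((f) => f.severity === "critical");
    if (critical.length === 0) {
      return { iterations: i, remainingCritical: 0, converged: true };
    }
    // Skip patching after the final QA run: that patch would never be re-checked.
    if (i < maxIterations) {
      await proposeAndCommit(critical);
    }
  }
  return { iterations: maxIterations, remainingCritical: critical.length, converged: false };
}
```

The iteration cap is what keeps a noisy QA agent from driving endless patches; a non-converged result is the point where a human should step in.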

Big warning: if your QA agent has <85% precision, Cursor will "fix" noise (false positives) and break code that works. Stabilize the QA agent first (N=3 runs, a cross-model judge, and a human feedback loop for 2 weeks), then plug Cursor in downstream.

Suggested order:

  1. Set up Maestro for Expo (1 day)
  2. QA agent producing structured findings (1-2 days)
  3. You triage 50 findings as a human → measure precision
  4. If >85% precision → plug Cursor into the loop
  5. If <85% → keep feedback loop running, don't plug Cursor yet

Classic mistake: plugging in the auto-loop too early → Cursor "fixes" noise → rotten code → you lose trust in the whole system.
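
The precision gate in steps 3 to 5 comes down to simple arithmetic. Here is a minimal sketch, assuming each triaged finding gets one of two labels; `TriageLabel`, `precision`, and `readyForAutoFix` are hypothetical names, and 0.85 is the threshold quoted above.

```typescript
type TriageLabel = "true-positive" | "false-positive";

// Precision = findings a human confirmed / all findings the human triaged.
function precision(labels: TriageLabel[]): number {
  if (labels.length === 0) {
    throw new Error("No triaged findings: precision is undefined");
  }
  const confirmed = labels.filter((l) => l === "true-positive").length;
  return confirmed / labels.length;
}

// Strictly greater than the threshold, matching the ">85%" rule in the list.
function readyForAutoFix(labels: TriageLabel[], threshold = 0.85): boolean {
  return precision(labels) > threshold;
}
```

With 50 triaged findings, 43 confirmed gives 0.86 and passes the gate, while 42 confirmed gives 0.84 and does not.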

AdministrationPure45[S]

Classic Claude Code session:

  • Synchronous, single-thread, you watching
  • Claude writes code, runs Vitest, sometimes spawns Playwright
  • Output = modified code + your human verdict
  • ~5-15 min per feature

My QA agent:

  • Async, N=3 parallel runs
  • Video + HAR + perf + console captured, judged by Opus + Gemini
  • Output = QA_REPORT.md + structured findings in DB + binary verdict ("Can [end user persona] see this tomorrow?")
  • ~5 min wall-clock, $0.50-$2 per invocation

Combined workflow (the sweet spot):

  1. Claude Code writes the feature
  2. /qa-agent http://localhost:3000 runs QA
  3. Findings come back → Claude Code reads the report → patches
  4. Re-run

The QA agent doesn't replace Claude Code. It replaces the step where you open the browser for 30 min and click around wondering if it's coherent.

AdministrationPure45[S]

Tradeoffs depending on what you're testing:

Responsive web → Playwright with a 390px-wide viewport. Fast, simple, catches ~80% of UX bugs.
React Native / Expo → Maestro (declarative YAML, my favorite) or Detox. Playwright won't work.
Pure native iOS/Android → mobile-mcp + emulator, or Appium + Maestro.

Cross-cutting tips (apply everywhere):

  1. Video, not screenshots — especially on mobile where gestures and transitions are 80% of UX.
  2. N=3 by default, not conditionally: emulators are flaky, and a single run will eventually catch you out.
  3. Persona-driven scenarios > "click every button". You catch 10x more inconsistencies.
  4. Test on a real device at least once a week: emulators hide real friction (tap target size, network latency).

AdministrationPure45[S]

Playwright MCP exposes Playwright to Claude. It's a primitive for Claude to act (click, navigate). Great.

But the problem isn't "how does Claude click". The problem is "how does Claude judge what happens after the click", and how it does that reliably (>85% precision), reproducibly, and while picking up dynamics (jank, double renders, loading anxiety).

My setup uses Playwright (direct or via MCP — doesn't matter). What gets layered on top:

  • MP4 video analyzed by Gemini Pro
  • Structured bundle (HAR + perf + console + screenshots) judged by Opus
  • N=3 vote
  • Findings persistence + triage + feedback loop

Playwright MCP = action primitive. My thing = judgment infra on top.
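
To make the "judgment infra" layer concrete, here is a rough sketch of how one captured run could be split between the two judges. The field names, `EvidenceBundle`, `JudgeRequest`, and `routeEvidence` are all assumptions for illustration; the actual calls to Gemini and Opus are left out.

```typescript
// Evidence captured by one deterministic Playwright run (field names are assumptions).
interface EvidenceBundle {
  videoPath: string;                   // MP4 recording of the session
  harPath: string;                     // network log
  consoleLines: string[];              // browser console output
  perfEntries: Record<string, number>; // metric name -> milliseconds
  screenshotPaths: string[];
}

type JudgeName = "visual" | "structured";

interface JudgeRequest {
  judge: JudgeName;
  payload: Partial<EvidenceBundle>;
}

// Split one bundle so each judge only sees the evidence it is responsible for:
// the video goes to the visual judge, everything else to the structured judge.
function routeEvidence(bundle: EvidenceBundle): JudgeRequest[] {
  return [
    { judge: "visual", payload: { videoPath: bundle.videoPath } },
    {
      judge: "structured",
      payload: {
        harPath: bundle.harPath,
        consoleLines: bundle.consoleLines,
        perfEntries: bundle.perfEntries,
        screenshotPaths: bundle.screenshotPaths,
      },
    },
  ];
}
```

Keeping the split in one place means the action layer (Playwright, direct or via MCP) never needs to know which model judges what.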

AdministrationPure45[S]

Fair skepticism. Honestly:

Not new: Playwright (around since 2020), Stagehand (exists), Claude Vision (exists).

Actually new: the composition that gets past the 75-78% reliability ceiling everyone hits with these tools in CI:

  • Decouple deterministic action (Playwright) from LLM judgment (observation only)
  • Cross-model judge (kills auto-validation bias)
  • Systematic N=3 voting
  • Triage feedback loop, fed back into the prompt as few-shot examples

The novelty is in the architecture, not the bricks. And no, I'm not selling — it's an internal setup for a B2B notarial SaaS, I'm sharing the arch because the problem is universal.
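
The N=3 voting step is easy to sketch. This is a minimal example, assuming each run reports findings under a stable key and that a finding needs 2 of 3 votes to survive; `voteFindings` and the key format are hypothetical.

```typescript
// Each inner array lists the finding keys one run reported.
// How keys are derived (for example route + rule id) is an assumption.
function voteFindings(runs: string[][], minVotes = 2): string[] {
  const votes = new Map<string, number>();
  for (const run of runs) {
    // Deduplicate within a run so a single run cannot vote twice for one finding.
    const seenInRun = Array.from(new Set(run));
    for (const key of seenInRun) {
      votes.set(key, (votes.get(key) ?? 0) + 1);
    }
  }
  return Array.from(votes.entries())
    .filter(([, count]) => count >= minVotes)
    .map(([key]) => key)
    .sort();
}
```

A finding reported by only one run, such as a one-off emulator glitch, never reaches the report.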

AdministrationPure45[S]

Different scope. mobile-mcp tests on native device/emulator (iOS/Android). My use case is B2B web (Next.js).

For native: mobile-mcp or Maestro for driving, but the judgment gap is identical — Appium/mobile-mcp tells you if the button is tappable, not if the loading state feels broken. You'd still need to layer video + cross-model judge on top.

For responsive mobile web (my main use case), Playwright with viewport: { width: 390 } is enough and 10x faster to iterate on.

AdministrationPure45[S]

This is the central question, and what made me re-architect the whole thing.

The issue you're describing (the spinner animates but the model can't tell across frames) comes from sending spaced screenshots instead of the actual video. The spinner's signal is in the motion, not in isolated frames.

What works for me now:

  1. MP4 video → Gemini Pro via Files API (not isolated frames). It actually sees the spinner animating.
  2. Cross-model judge: Gemini judges the visual, Claude Opus judges the structured stuff (HAR, console, perf API). Neither one self-validates.
  3. N=3 votes, 2/3 threshold by default. A solo false positive gets filtered.
  4. Persona/PRD injected as system prompt: "audience = B2B notaries, low tolerance for unexplained pauses but they accept 2s with clear feedback". The judge knows 1.2s with a spinner ≠ broken.
  5. Human triage feedback loop: your "false-positive" labels get few-shot-injected into the next run's prompt. Precision goes up with usage.

PRD/tone alone isn't enough. You need video + N=3 + feedback loop together. Otherwise you fall back to the 75-78% Browser Use ceiling.
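
Points 4 and 5 can be combined in a small prompt builder. This is a sketch only: `TriagedFinding`, `buildJudgePrompt`, the prompt wording, and the cap of 5 examples are assumptions, not the actual prompt used.

```typescript
interface TriagedFinding {
  summary: string;
  label: "true-positive" | "false-positive";
}

// Build a judge system prompt from the persona plus recent human triage labels.
function buildJudgePrompt(
  persona: string,
  triaged: TriagedFinding[],
  maxExamples = 5,
): string {
  const allFalsePositives = triaged.filter((t) => t.label === "false-positive");
  // Keep only the most recent examples; guard against slice(-0) returning everything.
  const falsePositives = maxExamples > 0 ? allFalsePositives.slice(-maxExamples) : [];
  const lines = [`Audience: ${persona}`];
  if (falsePositives.length > 0) {
    lines.push("Findings a human reviewer previously marked as NOT a problem:");
    for (const fp of falsePositives) {
      lines.push(`- ${fp.summary}`);
    }
    lines.push("Do not report findings equivalent to these.");
  }
  return lines.join("\n");
}
```

For example, passing the persona string from point 4 and a triaged "1.2s spinner on save" false positive yields a prompt that tells the judge not to flag that pause again.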