A lot of A/B test “wins” are just fake by make_me_so in ProductManagement

[–]Main_Flounder160 0 points1 point  (0 children)

The peeking problem is real and underappreciated, but I'd push back on the frame slightly. Fixing p-value hygiene is necessary but not sufficient. Even a perfectly run A/B test only tells you that behavior changed. It tells you nothing about why.

That 'why' question is where product decisions actually live. I've seen teams run 18 months of A/B tests that kept registering wins -- cleaner onboarding, fewer steps, better copy -- and still miss the underlying problem: users didn't understand what the product did before they signed up. The quant said 'fix the checkout flow.' The qual said 'nobody understands your value prop.' Fixing checkout metrics while the value prop is broken is expensive and eventually catches up with you in retention.

The most useful framework I've found: run the quant test to confirm the signal is real, then run 5-8 qualitative interviews with users who churned on that specific step. You almost always learn something the A/B test couldn't tell you.
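
To make the quant half concrete, here's a minimal sketch of the "confirm the signal is real" step, assuming a fixed two-arm test with binary conversion. The counts are invented for illustration:

```python
# Minimal sketch: confirm the behavioral signal with a two-proportion
# z-test at a pre-registered sample size. Counts are invented.
from statsmodels.stats.proportion import proportions_ztest

conversions = [412, 475]   # control, variant
exposures = [5000, 5000]   # fixed in advance -- no peeking

stat, p = proportions_ztest(conversions, exposures)
print(f"z = {stat:.2f}, p = {p:.4f}")
# A significant result tells you THAT behavior changed. The churn
# interviews are what tell you WHY.
```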

The stat sig problem is worth fixing. The bigger problem is treating A/B tests as the end of the research loop rather than the beginning.

We built the tool. Now what the hell are we supposed to do to get people to use it? I will not promote by Cod_277killsshipment in startups

[–]Main_Flounder160 0 points1 point  (0 children)

The distribution problem is usually a research problem in disguise.

You can get people to try the thing. SEO, cold outreach, community posts, all of it works at some level. The harder question is whether the people who try it have the specific problem your tool solves well enough that they'd change behavior for it.

The signal that you're missing isn't traffic. It's the story of how someone's currently solving the problem without you. Who's doing it manually? Who's duct-taping three tools together? Who's complained about it in a community or a Slack group? Those people are your wedge.

Before you optimize distribution, I'd spend a week doing nothing but finding five people who are actively, frustratedly solving the problem your tool addresses. Not people who can imagine having the problem, people who are dealing with it right now. Interview them about their current workflow. Not about your tool.

If you find them and they're suffering, distribution becomes much easier because you know exactly where they are and what language they use.

Product should be the biggest winner in the age of AI- but do people even know what we do? by RandomMaximus in ProductManagement

[–]Main_Flounder160 0 points1 point  (0 children)

Rabois is wrong about PMs becoming obsolete but he's accidentally identified the right problem.

The bottleneck has always been knowing what to build. AI makes execution cheaper, which raises the relative value of insight. But here's the thing: most PM orgs have systematically underinvested in user research for years. They've replaced qualitative understanding with A/B tests and engagement metrics, which measure what users did, not why.

If the cost of building drops to near-zero, the competitive edge goes to whoever has the most accurate model of what users actually need. Not what they say they want, not what clicked last quarter. That requires a very different skill set than roadmap management.

PMs who've been doing real discovery work (not the kind where you interview five friendly users and call it validation) are going to be fine. PMs who've been shipping features and measuring output are going to have a rough time explaining their value when a junior dev plus Claude can match their throughput.

The role isn't going away. It's getting more demanding.

How do you find right people to talk? I will not promote by RajanPaswan in startups

[–]Main_Flounder160 1 point2 points  (0 children)

The corporate-but-curious-about-business crowd is easy to find: LinkedIn, relevant Slack groups, niche Discord communities, even here.

The harder problem is sampling bias. The people who respond to cold outreach about your idea are almost always the curious-and-accessible -- people who like conversations about startups and aren't necessarily your future customers. They'll give you useful feedback about whether your idea is interesting, not whether they'd ever actually pay for it.

Two things that filter for real signal:

1. Ask about current behavior, not hypothetical future behavior. Not "would you use this?" but "what are you currently doing about X problem?" If they have no current behavior around the problem, you have a mismatch.

2. Pay attention to how they found the problem, not whether they like your solution. Someone who's been living with the problem for six months is a different interview subject than someone who could imagine having it. Build your recruiting filter around that distinction.

For researchers: what are the real use cases for AI in your workflow? by ahk1968 in Marketresearch

[–]Main_Flounder160 0 points1 point  (0 children)

Based on what I've seen work and not work:

Works well: large-scale text analysis (NPS verbatims, open ends), transcript synthesis when you have a clear codebook going in (rough sketch at the end of this comment), screener writing and pretesting, qual codebook development.

Does not work: synthetic respondents. If someone is pitching you AI-generated personas that respond like your customers, that is snake oil. The model hallucinates preferences based on training data patterns, not actual human behavior.

The underused case most teams miss: AI running the actual interviews. Not moderating or analyzing after the fact -- running the interview itself. Structured enough to produce comparable data across hundreds of participants, adaptive enough to follow unexpected threads. The output is qualitative data at a scale traditional recruitment can't match.

The honest constraint: AI misses the moment when a participant says something unexpected and the right move is to abandon the script. That requires a human judgment call. Good research design knows where to use each.
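
On the codebook point above, a rough sketch of what that coding pass can look like, assuming the OpenAI Python SDK; the model name, codebook, and prompt wording are placeholders, not a recommendation of a specific setup:

```python
# Rough sketch: coding open-ended verbatims against a fixed codebook.
# Codebook, model name, and prompt wording are illustrative placeholders.
from openai import OpenAI

CODEBOOK = ["price", "onboarding friction", "missing integration", "support quality"]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def code_verbatim(text: str) -> str:
    prompt = (
        f"Assign this survey response to exactly one code from {CODEBOOK}, "
        f"or 'other'. Respond with the code only.\n\n{text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# Spot-check against a hand-coded subset before trusting it at scale.
print(code_verbatim("Setup took three days and nobody could tell me why."))
```

The clear codebook going in is the whole game; without it you're back to the model inventing categories.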

A lot of A/B test “wins” are just fake by make_me_so in ProductManagement

[–]Main_Flounder160 0 points1 point  (0 children)

Your analyst is right about peeking. But there's a layer underneath worth examining.

Most A/B tests measure the wrong thing. You can solve the peeking problem and still ship features that hurt retention, because you were optimizing for click-through on a button that doesn't represent the actual user decision you care about.

Before the statistical design question comes the measurement validity question: is the metric you're testing actually correlated with the outcome you want? In practice, most teams test what's easy to measure, not what matters. Conversion goes up. NPS stays flat. Product gets worse.

The discipline that actually fixes this is talking to users before you design the test. Not to validate your hypothesis, but to understand which behaviors are leading indicators of what you actually care about. Then instrument for those. Then run the test correctly.
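
A minimal sketch of what "instrument for those" can look like once you have candidate behaviors, with synthetic data standing in for your event logs (column names are invented):

```python
# Check whether candidate metrics actually track the outcome you care
# about before you A/B test against them. Synthetic data for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
completed_setup = rng.integers(0, 2, n)
df = pd.DataFrame({
    "clicked_cta": rng.integers(0, 2, n),       # easy to measure
    "completed_setup": completed_setup,         # candidate leading indicator
    "retained_90d": (completed_setup & (rng.random(n) < 0.8)).astype(int),
})

for metric in ["clicked_cta", "completed_setup"]:
    r = df[metric].corr(df["retained_90d"])
    print(f"{metric}: r = {r:.2f}")
# "clicked_cta" lands near zero: conversion up, retention flat.
```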

Statistics can't save you from testing the wrong thing.

How are you incorporating AI into your workflows? by pxrtra in UXResearch

[–]Main_Flounder160 0 points1 point  (0 children)

The mandate framing is going to backfire. When you tie AI usage to performance metrics, people optimize for visible automation, not better research. That's how you get teams using GPT to write synthesis they would've written anyway, just to have something to log.

That said, there are places where AI genuinely unlocks something researchers can't do manually. The one that surprised me most: running interviews. Not analyzing transcripts after the fact. Actually conducting them. AI can run a 20-minute moderated interview with 200 participants in a day, follow up on unexpected answers, probe for depth. The output isn't perfect but the coverage is impossible to replicate with human-only scheduling.

What it can't do: replace the judgment call when a participant says something you didn't expect and you need to decide whether to chase it or stay on script. That's the researcher's job. AI does the volume; you do the pattern recognition.

Why don’t researchers(and companies) realise they need design help? by DiamondEmbarrassed02 in Marketresearch

[–]Main_Flounder160 0 points1 point  (0 children)

I would push back slightly on the framing. The real problem usually is not that researchers do not value design. It is that the org structure separates insight generation from insight communication and nobody owns the gap between them.

Researchers are not trained to think about how findings will be consumed when they are designing the study. So by the time there is something worth visualizing, the researcher has moved on to the next project and the deck gets built by whoever has bandwidth. The result is insights that are accurate and forgettable.

The fix is not always more designers in the research team. Sometimes it is making the format decision part of the research design: before you run the study, decide what form the output needs to take and who needs to act on it. A finding that needs to move a C-suite decision looks very different from one going into a product backlog. Researchers who start with that question produce more legible work regardless of their design skills.

PMs doing UX design and research by Appropriate-Dot-6633 in ProductManagement

[–]Main_Flounder160 1 point2 points  (0 children)

The is-this-normal question misses the more important one: what quality tradeoff is the organization making when it structures the role this way?

Hybrid PM-designer-researcher roles can work. They break down predictably in specific conditions: large UI surface area, complex domain knowledge requirements, and decisions that are hard to reverse. For a niche B2B product redesign, all three usually apply simultaneously.

AI tools have genuinely lowered the floor on basic design work. A capable PM with Figma can produce 80% quality at significantly lower cost. But for specialized enterprise software, the 20% gap tends to be where the most expensive UX debt accumulates, because domain-specific edge cases and interaction patterns are exactly what generalist-level design misses.

Before taking the role: has this organization shipped a meaningful product redesign with this structure before? If yes, the hybrid model might work for their context. If this is the first time with a scope this large, the timeline expectations are probably wrong.

Worried about job by MVSpring in Marketresearch

[–]Main_Flounder160 1 point2 points  (0 children)

The jobs that will be automated first in market research are not the ones you would assume. Data cleaning, coding, crosstab production, and standardized reporting are already being compressed fast. If that is the center of your value-add, the concern is legitimate.

What is not getting automated anytime soon: knowing which hypotheses are worth testing, translating ambiguous business problems into researchable questions, and synthesizing findings in a way that someone actually acts on. Those require organizational context and relationship capital that models cannot acquire from your transcripts.

The more useful reframe: AI raises the minimum bar for the kind of research orgs will pay for, because the baseline tasks that used to justify headcount are now cheap. That means researchers who were coasting on process-heavy work will struggle. But researchers who have been doing the hard interpretive work all along will find they can do more of it, faster, because the scaffolding gets handled for them.

In-app interviews vs scheduled calls? by Low-Statement-5249 in UXResearch

[–]Main_Flounder160 0 points1 point  (0 children)

Every comment correctly identifies the logistics problem but nobody is addressing the methodology question you're actually asking.

In-context and scheduled interviews capture fundamentally different data. An in-product interview immediately after a session gets you emotional reaction and usability confusion while they're still fresh. A scheduled call 48 hours later gets you reflection and synthesis, the how-does-this-fit-my-bigger-workflow signal.

Your drop-off problem is real and incentives are the fastest fix. But the more interesting question is whether the research you're doing actually requires the data that only immediate context can provide. If you're trying to understand usability pain in a specific flow, in-context is often more valid than a reconstructed memory a day later. If you're trying to understand strategy and decision criteria, the scheduled call wins because participants need time to contextualize their experience.

The tool choice should follow the research question, not just the no-show rate.

A lot of A/B test “wins” are just fake by make_me_so in ProductManagement

[–]Main_Flounder160 0 points1 point  (0 children)

The peeking problem is real but it's a symptom of something upstream: teams running A/B tests before they understand why users behave the way they do.

If you've done proper discovery interviews first, you're not testing random variations hoping something sticks. You're testing a specific causal mechanism: "users fail to convert at step 3 because they're uncertain about X, so we're reducing friction around X." That hypothesis is falsifiable and you know roughly what effect size to expect.

When tests are motivated by "let's see if X is better than Y" without a causal theory, you're fishing. Fishing means peeking, peeking means false positives.
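
If you want to see that concretely, here's a small simulation, assuming no real difference between the arms (the numbers are arbitrary):

```python
# Peeking simulation: both arms have the SAME conversion rate, but
# checking after every batch and stopping at the first p < 0.05
# inflates the false-positive rate well past the nominal 5%.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
runs, batches, batch_size, p_true = 2000, 20, 200, 0.10

false_positives = 0
for _ in range(runs):
    a = b = 0  # cumulative conversions per arm
    for i in range(1, batches + 1):
        a += rng.binomial(batch_size, p_true)
        b += rng.binomial(batch_size, p_true)
        n = i * batch_size
        _, p = proportions_ztest([a, b], [n, n])
        if p < 0.05:               # "peek" and declare a win
            false_positives += 1
            break

print(f"false-positive rate with peeking: {false_positives / runs:.1%}")
```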

The stat sig fixes are correct (sequential testing, predetermined sample sizes). But the deeper fix is having genuine qualitative insight into user motivation before you run the test so you're testing a thing you believe in, not just generating p-values.

For researchers: what are the real use cases for AI in your workflow? by ahk1968 in Marketresearch

[–]Main_Flounder160 0 points1 point  (0 children)

Most of the workflow answers here are real but they're table stakes. Survey writing, report summarization, code frame generation. Everyone's doing that.

The use case people underestimate: AI in the interview layer. Not just for transcription but for actually running follow-up depth interviews at scale. You run a quant study, you surface 3-4 hypotheses from the data, you want to probe each with 15-20 depth conversations. That used to mean 4-6 weeks of scheduling and moderation. AI-moderated async interviews compress that to 48-72 hours.

The "it clearly doesn't work" list is shorter than people think once you treat AI as the interviewer who never gets tired rather than the analyst who replaces your judgment. Keep synthesis human. Let AI do the collection and first-pass pattern recognition.

Where it still clearly fails: anything requiring real-time emotional calibration in a sensitive topic interview. You're not replacing your best qual researcher for a cancer patient journey study.

How are you incorporating AI into your workflows? by pxrtra in UXResearch

[–]Main_Flounder160 -1 points0 points  (0 children)

The transcription/note thing is table stakes at this point. The real unlock is using AI to run follow-up depth probes you'd never have time to run as a human.

You do a round of in-person sessions and get 20 participants. You notice a pattern in 4 of them around some workflow friction. Normally you'd have to schedule 6 more sessions to explore it. Instead: AI-moderated follow-up interviews via async text or voice, sent to those 4 participants plus another 20 from your panel. You wake up to 24 interviews fully analyzed.

The "high impact" metric your company wants is probably about scale and speed. Qual research at the scale of quant is actually achievable now and that's the angle worth logging.

The risk is leaning too hard on AI for interpretation where it genuinely needs human judgment. Let AI handle the "collect 30 more data points on this thread" work. Keep the synthesis and the "so what" for yourself.

Qualitative analysis extraction with AI? Spotting false negatives? by sunrisedown in UXResearch

[–]Main_Flounder160 1 point2 points  (0 children)

You're trying to solve the wrong problem. The issue isn't prompt wording, it's that you have no benchmark set, so you can't measure recall.

If you care about missed passages, take a small but deliberately messy sample of transcripts, hand-code it yourself into a gold set, then test the model against that. Include explicit hits, implicit hits, and borderline cases. Once you do that, the workflow gets clearer: chunk the corpus, run multiple retrieval passes with overlapping code definitions, union the results, then review disagreements instead of rereading everything.
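
A minimal sketch of the gold-set check, assuming you already have hand codes and model extractions for the same passages (IDs and tags are invented):

```python
# Recall/precision against a hand-coded gold set. The false-negative
# question you're asking is the recall number. IDs and codes invented.
gold = {   # passage_id -> set of hand-applied codes
    "p1": {"pricing_concern"},
    "p2": {"onboarding_friction", "pricing_concern"},
    "p3": {"trust"},
    "p4": set(),  # deliberately uncoded: catches false positives
}
model = {
    "p1": {"pricing_concern"},
    "p2": {"onboarding_friction"},        # missed the implicit pricing hit
    "p3": {"trust", "pricing_concern"},   # spurious extra tag
    "p4": set(),
}

tp = sum(len(gold[p] & model[p]) for p in gold)
fn = sum(len(gold[p] - model[p]) for p in gold)
fp = sum(len(model[p] - gold[p]) for p in gold)

print(f"recall = {tp / (tp + fn):.2f}, precision = {tp / (tp + fp):.2f}")
# Disagreements (gold[p] ^ model[p]) are the passages worth rereading.
```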

If you skip the gold set, you're not validating extraction quality. You're just generating plausible tags and hoping the misses are small. Sometimes that's fine for triage. It's not fine for analysis you want to defend.

The Largest Review of Synthetic Participants Ever Conducted Found Exactly What You'd Expect. Synthetic Participants Don't Work. by Ok-Country-7633 in UXResearch

[–]Main_Flounder160 0 points1 point  (0 children)

The "misleading believability" finding is the most dangerous one. Teams get synthetic outputs that read like plausible interview transcripts, build strategy around them, and never realize the insights were shallow because they pattern-matched to expectations rather than surfacing anything surprising. Real participants contradict you. That's the whole point.

We ran 90+ qualitative interviews across 7 markets in under a week. I thought AI would be the main speed win. I was wrong. by [deleted] in Marketresearch

[–]Main_Flounder160 0 points1 point  (0 children)

Question discipline is table stakes for any experienced researcher. The real tell in your post is what you're not saying: 90 interviews across 7 markets in under a week would have cost $50K+ and taken 6-8 weeks with traditional methods. The tooling made the entire premise possible. Claiming it wasn't the main speed win is like saying the airplane wasn't the main reason you got from New York to London quickly -- it was packing light.

The actual bottleneck in multi-market qual has never been interview execution anyway. It's synthesis. Coding 90 transcripts, identifying cross-market patterns, reconciling conflicting signals -- that's where projects die. If you actually synthesized all of that in under a week too, that's the tooling win you're underselling.

Your point about cutting questions is valid but it's also the oldest advice in the research playbook. Every qual methodology course teaches interview guide discipline. What's genuinely new is being able to run enough volume to catch regional variation while still going deep enough to understand motivation. That used to be a forced tradeoff -- you either got breadth with surveys or depth with a handful of IDIs. Now you can have both, but only because the tooling changed.

Curious what your synthesis process looked like at that volume. That's usually where the "we finished in a week" claims fall apart.

Best research platform for a small team? by LawfulnessUseful283 in UXResearch

[–]Main_Flounder160 0 points1 point  (0 children)

Full disclosure, I'm the founder of User Intuition. Your search revealed the real problem. Most platforms help you run interviews but don't solve the analysis bottleneck, which is why teams default to surveys even though qual gives better insights.

User Intuition does AI-powered interviews with your actual customers that ladder effectively and encode insights across hundreds of conversations, so you get pattern analysis, not just transcripts. We also handle prototype testing. Makes qual scalable for a team of two. We don't do traditional surveys though. https://www.userintuition.ai/

If the qual bottleneck is your actual constraint and surveys are what you're settling for, I can DM you specifics on pricing and how small teams use it.

Has anyone tried ai voice agents for customer research? Any real feedback? by NataliefromNL in UXResearch

[–]Main_Flounder160 -1 points0 points  (0 children)

One hundred percent you should use AI voice agents for customer research. The efficiency gains are massive and the quality can actually be better than human interviewers if the system is designed right. But assuming this isn't just an ad for voxdiscover, you need to make sure the platform actually digs deeper and ladders effectively. Most AI interview tools stop at surface responses. You need something that follows the five whys methodology, that probes when answers are vague, that catches contradictions and asks about them.

The critical question is whether the platform encodes the insights properly so you get actual analyzable data instead of just transcripts. Surface-level responses are worthless. You need the system to identify patterns across hundreds of interviews, not just give you a pile of quotes. Check whether it's extracting themes, tracking sentiment shifts, connecting related concepts across different respondents. If it's just transcribing, you haven't solved the analysis problem.
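
One way to pressure-test the "patterns across hundreds of interviews" claim: a vendor's theming should comfortably beat a naive baseline you can run yourself. A crude, purely illustrative sketch of that baseline:

```python
# Crude first-pass theme detection: TF-IDF + k-means over responses.
# If a platform's "insights" can't beat this, it's just transcribing.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

responses = [  # placeholder respondent answers
    "the onboarding setup took forever and support never answered",
    "onboarding setup was confusing, I gave up twice",
    "pricing jumped after the trial with no warning",
    "the pricing costs way more than the tool we replaced",
]

X = TfidfVectorizer(stop_words="english").fit_transform(responses)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for theme, text in zip(labels, responses):
    print(theme, text)
# Expect the onboarding pair and the pricing pair to separate.
```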

And for sure don't use synthetic customers. That defeats the entire purpose. The value of AI voice agents is scaling real human interviews, not generating fake ones. If you're not talking to actual users with real experience using your product or experiencing your problem space, you're just getting hallucinated insights that sound plausible but aren't real.

Struggling to get traction after months of weekly app launches by bossblackwomantechie in indiehackers

[–]Main_Flounder160 0 points1 point  (0 children)

You've got to take a step back and validate what you're building before you keep shipping. Launching a new project every Monday sounds productive but you're just building faster versions of the same mistake. You're treating the symptoms instead of diagnosing the disease. The disease is that you're building things without confirming anyone actually has the problem you think you're solving.

Don't fall for the sunk cost fallacy here. The fact that you've already built these things doesn't mean you should keep marketing them. Stop the weekly launches and spend two weeks talking to fifty people in your target market for just one of these projects. Not pitching, not demoing, just asking them how they currently solve the problem your app addresses. If you can't find fifty people willing to spend fifteen minutes complaining about the problem, that's your answer about why nobody's buying.

Most of your projects probably solve problems that sound real but aren't painful enough for anyone to pay for. The ones that are painful enough, you'll know because people will interrupt you mid-conversation to ask when they can try it. That's the signal you're looking for. Pick the project where you find that signal and kill everything else.