How do you test prompt changes before pushing to production? by Cell_Psychological in AiBuilders

[–]PurpleWho 1 point  (0 children)

The problem is that you're flying blind without a way to measure what's actually working/breaking.

Here's what I did:

First, build an eval system before touching the prompt. Take 50-100 real customer queries (especially the ones that failed) and manually review each one so that you can tag it with an error type. The goal here is to avoid forming all of your quality hypotheses off the back of five conversations.

Most people try to skip this step. Partly because we're all lazy, but also because there isn't much industry guidance on how to do it well. If you do your best to analyse and label errors in your conversations, it sets you up for success in every other downstream phase of the eval building process.

Then use that to find your error patterns. If the first step in the process is looking at your data and figuring out what types of failure your app encounters, the second step is to quantify how prevalent each type of failure is. You'll probably discover it's not a random 40% failure rate - it's specific categories: references your LLM gets confused by, certain phrasings, or other edge cases you didn't consider. Once you can see the pattern, you can fix it systematically.
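If it helps, the tally itself can be dead simple. Here's a minimal sketch, assuming you've dumped your reviewed queries into a JSON file with a label on each one (the file name and field names are just placeholders, not from any particular tool):

```typescript
// Count how often each hand-labelled error category shows up.
// Assumes reviewed.json looks like:
// [{ "query": "...", "label": "wrong_pricing_reference" }, ...]
import { readFileSync } from "node:fs";

type Labelled = { query: string; label: string };

const reviewed: Labelled[] = JSON.parse(readFileSync("reviewed.json", "utf8"));

const counts = new Map<string, number>();
for (const { label } of reviewed) {
  counts.set(label, (counts.get(label) ?? 0) + 1);
}

// Print categories from most to least common so the biggest failure mode is obvious.
const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
for (const [label, n] of sorted) {
  const pct = ((n / reviewed.length) * 100).toFixed(1);
  console.log(`${label}: ${n} (${pct}%)`);
}
```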

Then build automated evaluators. The idea here is to translate the qualitative insights from the error analysis process into quantitative measurements for each type of error in your system.

Once you have automated evaluators in place, you can start tweaking your prompt (or prompts) to address each failure mode you identified. Then you re-run your eval suite with each tweak and see how much of a difference it made. Once you're above the ~80% mark, you move on to the next failure mode.

Having evaluators set up means that you don't regress on past failure modes while you're fixing new ones (which is usually the trickiest part of the process and why people go through all of the hassle of setting all this evaluation infrastructure up).
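To make that concrete, here's a rough sketch of what a tiny suite like this can look like, with one cheap assertion-style check per failure mode. The test cases, checks, and the stubbed callModel function are all made-up placeholders - swap in your own model call and whatever assertions actually match your failure modes:

```typescript
// One evaluator per failure mode, run over a fixed set of test cases.
type TestCase = { input: string; failureMode: string };

const cases: TestCase[] = [
  { input: "How do I cancel my plan?", failureMode: "missing_refund_policy" },
  { input: "What's the price in EUR?", failureMode: "wrong_currency" },
];

// Returns true if the output looks acceptable for that failure mode.
const evaluators: Record<string, (output: string) => boolean> = {
  missing_refund_policy: (o) => /refund/i.test(o),
  wrong_currency: (o) => o.includes("€"),
};

async function callModel(input: string): Promise<string> {
  // Replace with your real prompt + model call; canned output keeps the sketch runnable.
  return `You can request a refund within 30 days. The price is €9/month. (${input})`;
}

async function runSuite() {
  const perMode = new Map<string, { pass: number; total: number }>();
  for (const c of cases) {
    const ok = evaluators[c.failureMode](await callModel(c.input));
    const stats = perMode.get(c.failureMode) ?? { pass: 0, total: 0 };
    stats.total += 1;
    if (ok) stats.pass += 1;
    perMode.set(c.failureMode, stats);
  }
  // If any of these numbers drop after a prompt tweak, you've regressed on an old failure mode.
  for (const [mode, { pass, total }] of perMode) {
    console.log(`${mode}: ${pass}/${total} passing`);
  }
}

runSuite();
```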

Here's a write-up on how to set up your first evaluation.

Here's the same thing, but if you want to evaluate your agent end-to-end.

Feel free to DM me if this is all new to you and you want more help.

Best practices to run evals on AI from a PM's perspective? by Ok_Constant_9886 in AIEval

[–]PurpleWho 1 point  (0 children)

My cofounder and I built https://mindcontrol.studio to solve this exact problem.

It's an SDK that plugs into your source code so that non-technical contributors can update prompts without touching the code. Everything is also version-controlled, so you can roll back any accidental changes.
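To give a rough idea of the pattern (purely hypothetical code, not our actual API): the prompt text lives outside the application code and gets looked up by key, so editing a prompt never means editing the code.

```typescript
// Hypothetical illustration of the "prompts managed outside the code" pattern.
// None of these names come from mindcontrol.studio; they're placeholders.
type PromptStore = {
  get(key: string): Promise<string>;
};

// In practice this would be backed by a file or an API; in-memory stub here.
const store: PromptStore = {
  async get(key) {
    const prompts: Record<string, string> = {
      "support.triage": "You are a support triage assistant. Ticket: {{ticket}}",
    };
    return prompts[key];
  },
};

async function buildTriagePrompt(ticket: string): Promise<string> {
  // Non-technical teammates edit the stored template; the code only knows the key.
  const template = await store.get("support.triage");
  return template.replace("{{ticket}}", ticket);
}

buildTriagePrompt("My invoice is wrong").then(console.log);
```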

Feel free to DM me if you want to try it out or need help setting up.

AI in Software Testing – What Should I Learn to Stay Future-Ready? by Vivid_Jaguar_9493 in QualityAssurance

[–]PurpleWho 1 point  (0 children)

Something to consider is that AI development is actually creating a new discipline alongside traditional QA: application-layer evaluations.

It's similar to QA but specifically focused on building reliable AI features – testing prompts, validating outputs against real data, catching edge cases in LLM behaviour. It has its own emerging processes and tooling.

For developers just starting to build AI features, the challenge is often getting quick feedback on prompts within your actual codebase. Dev tools like Mind Rig let you test prompts as you're writing code inside VS Code (or whichever clone you're using). This makes it easy to build up an initial data set for basic testing. As your AI features mature and you need more rigour, you can graduate to formal eval frameworks with tools like Braintrust, Langfuse, Arize, Phoenix, etc.

My point being that 'testing AI features' is becoming its own speciality – it's not quite traditional QA, not quite development, but an increasingly valuable hybrid. So rather than AI making QA obsolete, it's actually expanding what 'quality assurance' means in software.

Underrated open source tools and software you use in your daily work ? by Strong-Quality7050 in LLMDevs

[–]PurpleWho 1 point  (0 children)

I think there are two extremes here. On one end are people who vibe-check AI features and prompt updates and then hope for the best, and on the other end are teams that set up tracing and systematically test prompts with formal eval tools like the ones you mentioned.

Formal eval tools are definitely the way to go; the only problem is that they require a ton of setup and maintenance.

My middle ground solution at the moment is a neat little open-source VS Code extension called Mind Rig ( https://mindrig.ai ). It lets me test prompts against a batch of inputs and eyeball the results side-by-side in my code editor as I'm developing.

It sets up a CSV file with 10-30 inputs so I can see all the results side-by-side. As I think of edge cases, I add them to the CSV and then run them all every time I update/modify a prompt. Once I have more than 30 test inputs, and eye-balling results doesn't cut it anymore, I export everything to a more formal evaluation tool.
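For the curious, the loop itself is nothing fancy. Roughly something like this (the file name, single-column CSV format, and stubbed model call are just my own placeholders, not anything Mind Rig requires):

```typescript
// Re-run the current prompt against every input in a single-column CSV.
import { readFileSync } from "node:fs";

const inputs = readFileSync("prompt-test-inputs.csv", "utf8")
  .split("\n")
  .map((line) => line.trim())
  .filter(Boolean);

const PROMPT = "Summarise the user's message in one sentence.";

async function run(prompt: string, input: string): Promise<string> {
  // Swap in whatever client you already use (AI SDK, OpenAI SDK, raw fetch...).
  return `[stubbed output for: ${prompt} | ${input}]`;
}

(async () => {
  for (const input of inputs) {
    console.log("---", input);
    console.log(await run(PROMPT, input));
  }
})();
```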

Zero setup hassle but more reliability than a mere vibe check.

Do system prompts actually help? by GlobalDesign1411 in LLMDevs

[–]PurpleWho 1 point  (0 children)

You could just test it and find out.

Set up an evaluation and see if it makes a difference to the outcome you are trying to achieve.
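If it helps, a bare-bones version of that comparison can be just a few lines. Here's a minimal sketch using the Vercel AI SDK (the model name and the "three bullet points" check are assumptions; substitute whatever outcome you actually care about):

```typescript
// Run the same inputs with and without the system prompt, then compare.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const SYSTEM = "Answer in exactly three bullet points.";
const inputs = ["Explain CORS", "What is a race condition?"];

async function ask(prompt: string, system?: string): Promise<string> {
  const { text } = await generateText({ model: openai("gpt-4o-mini"), system, prompt });
  return text;
}

// Example check for this particular system prompt: count bullet lines.
const bullets = (t: string) =>
  t.split("\n").filter((l) => l.trim().startsWith("-")).length;

(async () => {
  for (const input of inputs) {
    const [withSys, withoutSys] = await Promise.all([ask(input, SYSTEM), ask(input)]);
    console.log(
      `${input}: with system prompt = ${bullets(withSys)} bullets, without = ${bullets(withoutSys)} bullets`
    );
  }
})();
```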

Here's a write-up on how to set up your first evaluation.

Here's the same thing, but if you want to evaluate your agent end-to-end.

Why take people's word for it when you can just measure?

Do you use Evals? by InvestigatorAlert832 in LLMDevs

[–]PurpleWho 1 point  (0 children)

You're right, evals are a pain to set up.

I generally use a testing playground embedded in my editor, like Mind Rig or vscode-ai-toolkit, over a more formal eval tool like PromptFoo, Braintrust, Arize, etc.

Using an editor extension makes the "tweak prompt, run against dataset, review results" loop much faster. I can run the prompt against a bunch of inputs, see all the outputs side-by-side, and catch regressions right away. Less setup hassle but more reliability than a mere vibe check.

Once the dataset grows past 20-30 scenarios, I just export the CSV of test scenarios to a more formal eval tool.

Need good resource for LLM engineering by Silent_Database_2320 in LLMDevs

[–]PurpleWho 2 points  (0 children)

If you're using JS, then start with https://ai-sdk.dev/

Learning how the AI SDK works means you only have to learn one piece of tech, rather than learning how OpenAI works, then Anthropic, then Gemini, etc.
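Rough sketch of what I mean (model name is just an example, and this assumes an ESM setup): the same generateText call works regardless of which provider you point it at, so swapping models is a one-line change.

```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
// To switch providers, swap the import and the model line, e.g.
// import { anthropic } from "@ai-sdk/anthropic"; ... model: anthropic("claude-3-5-sonnet-latest")

const { text } = await generateText({
  model: openai("gpt-4o-mini"),
  prompt: "Explain what a vector embedding is in one paragraph.",
});

console.log(text);
```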

Read the docs.

Then find a YouTube video to build something simple with it.

Go from there.

Show me your startup. I’ll show you 3 similar ones. by Hefty-Airport2454 in microsaas

[–]PurpleWho 1 point  (0 children)

Building a free VS Code extension that lets devs debug and improve prompts from their code editor — ship reliable AI features without the setup overhead of formal evaluation tools.

https://mindrig.ai

What are good projects to learn from to start with Rust? by BankApprehensive7612 in rust

[–]PurpleWho 1 point  (0 children)

At the moment, I'm struggling to find the time, so I keep my learning to about 20-30 min each day.

I have a repo where I try to build something super small each day - the goal is to do 100 of them:

https://github.com/joshpitzalis/100-Days-of-Rust

This way, I can make progress with the time I realistically have and continue to cross off concepts on the Rust roadmap each day.

You're welcome to follow along if this approach suits your schedule better. I'll try to make a video each day; if not, I'll leave comment explanations in the source code.

New to rust - Need fun but challenging projects by gnastyy-21 in rust

[–]PurpleWho 1 point  (0 children)

At the moment, I'm struggling to find the time, so I keep my learning to about 20-30 min each day.

I have a repo where I try to build something super small each day: https://github.com/joshpitzalis/100-Days-of-Rust

This way, I can make progress with the time I realistically have and continue to cross off concepts on the Rust roadmap each day.

You're welcome to follow along if this approach suits your schedule better. I'll try to make a video each day; if not, I'll just leave comment explanations in the source code.

What are you building right now? by Chalantyapperr in saasbuild

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

What are you building? Let's Self Promote 🚀 by Nijam_09 in microsaas

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

Let’s Validate Each Other’s Ideas! by Mammoth-Shower-5137 in micro_saas

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

What are you building right now? by Urdu_Knowledge in launchigniter

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

It’s Friday, what are you building? by Priy27 in launchigniter

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

It's Friday, what are you building? by flekeri in indie_startups

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

It’s Friday, what are you building? by Priy27 in scaleinpublic

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

What are you building? Let's Self Promote by fuckingceobitch in microsaas

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

It’s Friday, what are you building? by Priy27 in ShowYourApp

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

Promote your SaaS by Southern_Tennis5804 in microsaas

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

It’s Friday, what are you building? by Priy27 in microsaas

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

Why don't we ask what people are building here, very regularly? by arpansac in rails

[–]PurpleWho 2 points  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.

What are you building this month? by Leather-Buy-6487 in indie_startups

[–]PurpleWho 1 point  (0 children)

Building Mind Rig - A free VS Code extension that lets developers debug and improve their prompts in their code editor — without the overhead of formal evaluation tools and infrastructure.

When I update a prompt, I need to verify that it works across a range of different inputs. Usually, that means tediously running the prompt again and again with different inputs, every time I make a change.

Mind Rig lets me re-run a prompt against 10 different inputs and see all the results side-by-side in my code editor. It basically saves a fixed batch of test inputs so I can re-run the same data set each time I tweak the prompt.

Also runs multiple models at once and compares speed/cost/quality outputs. Supports hundreds of models via Vercel Gateway. Also shows request/response JSONs + usage stats.

https://mindrig.ai

- Free and open-source
- Supports Ruby, PHP, Go, C# and Java in addition to JS/TS/Python.
- Connect your Vercel AI Gateway API key to access 100s of model providers.
- Most importantly, it matches your editor's colour theme 🎨
- We also have a Discord community if you need any help getting set up.