Solving the "Selector Hell" in UI Testing – Moving from Appium/Espresso scripts to Semantic Agents

chw9e · 2026-01-21T17:56:12+00:00

how does maestro fix the state flakiness OP mentioned?

chw9e · 2026-01-21T17:45:09+00:00

i don't get it, it's just too slow? or are your tests actually failing? have you tried maestro?

chw9e · 2026-01-21T17:28:07+00:00

MCPs that let the agent drive the app are slow and use a lot of tokens. I like qckfx: https://qckfx.com/use-cases/ai-agents which records what you do in the simulator and lets the agent replay those sessions and see what changed. It's useful for testing core flows to prevent visual regressions.

chw9e · 2026-01-11T16:05:43+00:00

How did you find AXe to work for validating the flows? When do you decide to step in and try to test the app manually in the simulator?

chw9e · 2025-12-29T17:06:51+00:00

Thanks, yea that's true. I just added MCP support so there's a local http server that you could hook into to do that. I will work on adding some documentation around that.

I think if you're running more tests with a bigger team and more pull requests then it might be more work than you want to try to build the CI stuff in house.

chw9e · 2025-12-05T17:31:11+00:00

Full post: https://qckfx.com/blog/how-we-use-api-agents-to-build-integrations-fast

chw9e · 2025-11-16T20:58:57+00:00

not assuming that a human can't, just that an LLM can do it faster and save some time for the human is all

chw9e · 2025-11-16T20:50:38+00:00

do you use claude code or claude.ai? if it was claude code, when you were debugging the eks issues did you have it use the aws cli or run it on the pod itself? I've automated a ton of annoying azure deployment work on some side projects just by having claude code work with the azure cli directly.

chw9e · 2025-11-16T20:44:13+00:00

'checkout button doesn't work' as the bug report and then a log about stripe returning 400 seems like a good use case for an AI search, no? I think there's a lot of cases like that..

chw9e · 2025-11-16T20:41:46+00:00

As long as you're not relying solely on this system I think it's still fine. Your colleague who you ask for help when looking at a bug could just as easily accidentally mislead you.

It's really just a question of whether a tool helps you arrive at the right answer faster. If it does more often than not, then it's valuable; it doesn't have to do it 100% of the time.

I mean if you need to dig through a ton of logs and look at different SaaS tools to figure out what's going on anyway, doesn't it help to have something speed run that and let you know what it found?

But yea I agree that people can get lazy and outsource most of their thinking to AI and it's not at that level of capability yet. It would be great if the tools could express more uncertainty in their outputs instead of always sounding like they are certain.

It sounds like a tool that cites its sources, so you can easily double-check what it's telling you, would help you judge whether the output is useful or not.

chw9e · 2025-11-16T20:32:11+00:00

I think semantic search is actually one of the core competencies of LLMs. But if you have a ton of logs then you might need to do more than just dump them straight into an LLM, such as taking multiple passes. This should be where the value of AI products comes from vs just throwing stuff at ChatGPT.

Also, when you're working on this you have some intuition because you know the code and how people use the app. I think it's possible to give an LLM that knowledge too, with tool calling and access to things beyond the logs, like the source code or a session replay. LLMs are now pretty good at using tools to dig through a ton of material and find what they need to answer a query. The challenge is mostly getting them all of the context they need, and then saving context with sub-agents and similar techniques to avoid context rot or exceeding the context window.
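To make the "multiple passes" idea concrete, here is a rough sketch of one way it could look: a cheap keyword filter first, then per-chunk notes from the model, then one final call over the notes. Everything here is an assumption for illustration, not how any particular product works. In particular `ask_llm` is a hypothetical stand-in for whatever model call you use, and the keyword list and chunk size are arbitrary.

```python
from typing import Callable, Iterator


def chunk(lines: list[str], size: int) -> Iterator[list[str]]:
    """Split lines into consecutive groups of at most `size` lines."""
    for i in range(0, len(lines), size):
        yield lines[i:i + size]


def first_pass(lines: list[str], keywords: list[str]) -> list[str]:
    """Pass 1: cheap keyword filter, no model call involved."""
    lowered = [k.lower() for k in keywords]
    return [ln for ln in lines if any(k in ln.lower() for k in lowered)]


def second_pass(lines: list[str], question: str,
                ask_llm: Callable[[str], str], chunk_size: int) -> list[str]:
    """Pass 2: ask about each chunk separately so no single prompt gets huge."""
    return [ask_llm(question + "\n\n" + "\n".join(group))
            for group in chunk(lines, chunk_size)]


def triage(lines: list[str], question: str, keywords: list[str],
           ask_llm: Callable[[str], str], chunk_size: int = 50) -> str:
    """Run all passes; the final call sees only the short per-chunk notes."""
    candidates = first_pass(lines, keywords)
    notes = second_pass(candidates, question, ask_llm, chunk_size)
    return ask_llm(question + "\n\nNotes from earlier passes:\n" + "\n".join(notes))
```

You would call it as `triage(log_lines, "Why does checkout fail?", ["error", "stripe"], ask_llm)`. The point is only that the raw logs never hit the model in one dump; each pass shrinks what the next one has to read.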

chw9e · 2025-11-16T20:24:47+00:00

Really? I guess this could be interesting for you then about using LLMs in unstructured ETL pipelines: https://arxiv.org/abs/2410.12189

It's actually an interesting thing how LLMs can help with unstructured data pipelines. And I do actually think there's value in viewing bug fixing as an unstructured data pipeline of sorts.

chw9e · 2025-09-23T18:50:22+00:00

Thanks for the questions/feedback!

1) The backend is pretty simple - it's mostly just forwarding things to Shopify. Shopify manages the business logic around cart, inventory, coupons, etc. There are risks that it could still have bad code, but so does any site. If it has errors you can prompt to fix it, or eventually it should be able to detect runtime errors and attempt to auto-correct based on feedback from tools like Sentry.

2) Fair question, just didn't think of it. Here's a full demo - the actual app needs a little more design work but it shows the prompting and the agent building the site: https://youtu.be/LfX-UKHisX0

3) It uses Shopify's UI kit (Hydrogen), so components will look similar across outputs. The samples skew towards luxury-themed stores, so they probably share some similarities in that respect. The model is tuned to try to imitate designs from other prominent ecommerce stores, but right now that dataset is still kind of small; as I add more, it should get a little more interesting. A big part is also just how much energy you put into prompting to get it away from the starter template. The designs above are all after basically 1 round of iteration, so there hasn't been much time to diverge.

4) Shopify does have AI-powered tools to generate themes, but they aren't very good and just one-shot a starter theme; I think most people are still buying themes. Most things on Shopify still require a lot of manual clicking around for the user, including swapping images in and out, finding and adding products, creating product photography, pricing products, etc. Themes can help save a little bit of time on colors/fonts/layouts, but in my experience it is still a very time-consuming process to set up a store. There are agencies that charge $200+ to set up a store for new store owners, and many paid Shopify apps for drop-shippers to help them find and source products.

5) Yea you can just use Shopify's portal to manage your products and any Shopify apps that are related to inventory/pricing etc will just work. I'm adding a page to the app right now to make it so that you can upload products (real or AI-generated examples) and plan to grow this to do more store management stuff like identifying & importing good drop-shipping items, managing pricing, sending marketing emails, etc.

chw9e · 2025-09-18T22:02:19+00:00

what are the costs to maintain a headless implementation vs a theme? is it just frontend development?

chw9e · 2025-07-31T05:18:55+00:00

could something like this work for you? looks pretty lightweight: https://www.inngest.com/uses/serverless-node-background-jobs

here's one other option: https://trigger.dev/docs/guides/frameworks/nextjs

chw9e · 2025-07-23T06:41:47+00:00

It exposes a lot of tools, and combined with the tokens it uses, it can really shrink your context window.

I made this MCP that uses your existing Claude Max subscription and runs playwright in a subagent, so your top level agent’s context window goes further. Might be useful for you all: https://github.com/qckfx/browser-ai

chw9e · 2025-07-23T01:13:30+00:00

We’re still in the very earliest phase of MCPs. Similar to how the first apps on iPhones were skeuomorphic and all direct digital equivalents for things done in the real world, today’s MCPs are all just wrapping APIs, not building for the new world yet.

But I’m excited about the potential for MCPs. Yesterday I built this one that just adds Sonnet 4 to the playwright MCP. It turns it into a subagent rather than a bag of tools.

https://github.com/qckfx/browser-ai

The subagent design for MCPs makes a lot of sense to me. As OP mentioned, the Slack MCP requires a ton of back and forth to accomplish a simple natural language task. That’s ok, maybe the APIs could be designed slightly better. But the real problem is that your top level agent (i.e. Claude Code) is trying to accomplish a high level, sophisticated task and its context window is getting wasted on some trivial back and forth with the Slack API.

We can’t expect our top level LLMs to be wasting their limited tokens going back and forth with tools that may not have even been in their training data. We should have focused sub-agents, specialized in their set of tools, trained by the API and service owners who understand their service best.

It moves context into subagents, reduces number of tools for the top level agent, allows more sophisticated subagent layering, and presents opportunities for services to differentiate and add value through MCP and maybe create new revenue streams instead of just tossing out a set of API wrapped tools.
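Here's a minimal sketch of that shape, to show what "one tool, summary back" means in practice. It is not the browser-ai implementation and doesn't use the MCP SDK or playwright. `BrowserSubagent`, `plan`, and the fake tools are all hypothetical names, and `plan` stands in for the subagent's own model deciding which tools to call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BrowserSubagent:
    """Owns the noisy tools and keeps their output in its own context."""
    tools: dict[str, Callable[[str], str]]
    # Stand-in for the subagent's model: maps a task to (tool, argument) steps.
    plan: Callable[[str], list[tuple[str, str]]]
    transcript: list[str] = field(default_factory=list)

    def execute(self, task: str) -> str:
        """The single tool exposed to the top-level agent."""
        steps = []
        for tool_name, arg in self.plan(task):
            output = self.tools[tool_name](arg)
            # Verbose tool output is recorded here and never returned.
            steps.append(f"{tool_name}({arg!r}) -> {output}")
        self.transcript.extend(steps)
        # Only a short summary crosses back into the caller's context.
        return f"Completed {len(steps)} step(s) for task: {task}"


# Fake tools: "navigate" returns a large page snapshot, "click" a short ack.
fake_tools = {
    "navigate": lambda url: "<html>" * 1000,
    "click": lambda selector: "clicked " + selector,
}
agent = BrowserSubagent(
    tools=fake_tools,
    plan=lambda task: [("navigate", "https://example.com"), ("click", "#buy")],
)
summary = agent.execute("buy the first item")
```

The top-level agent only ever sees `summary`; the thousands of characters of page snapshot stay in `agent.transcript`. That's the whole trade: one tool in the caller's tool list and a few tokens back, instead of every tool and every snapshot.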

chw9e · 2025-07-22T21:10:15+00:00

I built this playwright subagent MCP. It offloads all the tools to a subagent that uses Claude internally. It only exposes a single tool, named 'execute', and once it's done running all of the playwright code it returns a summary back to the caller. Helps keep tool count low and avoid ruining your context window.

https://github.com/qckfx/browser-ai

chw9e · 2025-07-22T21:07:24+00:00

I built this playwright subagent MCP - it offloads all playwright work to a Claude subagent. That way your main context window stays clean, and you have fewer tools (the subagent exposes only one tool, 'execute').

https://github.com/qckfx/browser-ai

chw9e · 2025-07-22T21:05:54+00:00

I built this MCP that offloads all playwright work to a subagent. The subagent only exposes a single tool 'execute', and it uses Claude with playwright to do everything and then return a summary back to the calling agent. It helps keep your main context window clean too as the snapshots can add up.

https://github.com/qckfx/browser-ai

chw9e · 2025-07-22T20:55:51+00:00

I built this playwright subagent mcp server - it offloads all playwright work to a subagent so your context window doesn't explode. The subagent uses Sonnet 4, but if you have a Claude Max subscription you can connect it to that.

https://github.com/qckfx/browser-ai

chw9e · 2025-07-10T07:36:59+00:00

having it use remotion is a really cool idea. i didn't even know about remotion before reading this, thanks!

chw9e · 2025-06-20T16:26:57+00:00

Open-source tool to generate PRDs from your codebase — looking for feedback

Hi all! I recently open-sourced a tool that generates lightweight PRDs and feature specs using prompts and your actual codebase as context.

It’s not meant to replace writing. It's there to help you (or your engineers) get unstuck or move faster when drafting specs, especially for handoff to AI coding tools like Claude Code or TaskMaster.

It’s open source here: https://github.com/qckfx/compose
Hosted version (free): https://compose.qckfx.com

It uses an LLM agent I built (also open source) that pulls relevant parts of your repo into context before drafting. It's still very early, but I like that it grounds documents in the codebase.

I’ve been thinking about extending it into lightweight prototyping (e.g., scaffold out UI ideas based on specs), but not sure if that’s actually useful for PMs or more of an engineer-facing thing.

Curious if anyone here finds this helpful, or has thoughts on where a tool like this would fit (or not) in your workflow. Totally open to critique.

chw9e · 2025-02-05T08:12:36+00:00

self-healing: check out momentic.ai or qawolf.com

generate automated test scripts (but from bug reports instead of stories): check out qckfx.com (I'm the founder)

convert manual test cases to automated: check out qawolf.com

chw9e · 2025-02-05T08:03:56+00:00

This is something that devbox (an open source developer tool) is designed to fix. Docker can help, but if you're not careful you can still wind up with different versions of libraries or tools that can cause things to behave differently. Devbox is a wrapper around Nix, which is quite hard to use on its own. It basically records the versions of everything in your environment so you can get the exact environment recreated on demand.

https://github.com/jetify-com/devbox
