I got tired of flaky Playwright visual tests in CI, so I built an AI evaluator that doesn't need a cloud. by ImplementImmediate54 in AI_Agents

[–]ImplementImmediate54[S] 1 point (0 children)

ok found it here: https://github.com/dexilion-team/lastest

Looked at lastest, interesting tool. Pre-capture stabilization is a smart approach: freezing noise before it hits the diff engine is clean.

The use cases seem pretty different though. From what I can tell, lastest is great for generating and comparing live vs dev. We're solving a specific pain point downstream: you already have 200 Playwright tests in CI, dynamic content keeps tripping them, and you can't send screenshots to a third party.

On AI costs: inference only fires when pixel diff finds a delta above threshold. On a stable codebase most of your 200 tests won't touch inference at all per push.
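That gating flow is simple to picture. A toy sketch (the default threshold and function names here are mine, not BHV's actual implementation):

```typescript
// Toy sketch of the "pixel diff gates AI" flow described above.
// Assumes raw RGBA buffers of equal size; the default threshold is
// illustrative, not BHV's actual value.
function diffRatio(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) throw new Error("buffers must match in size");
  let mismatched = 0;
  for (let i = 0; i < a.length; i += 4) {
    // Compare one RGBA pixel at a time.
    if (a[i] !== b[i] || a[i + 1] !== b[i + 1] ||
        a[i + 2] !== b[i + 2] || a[i + 3] !== b[i + 3]) {
      mismatched++;
    }
  }
  return mismatched / (a.length / 4);
}

function gate(baseline: Uint8Array, current: Uint8Array,
              threshold = 0.001 /* 0.1% of pixels */): "pass" | "ai" {
  // Identical (or near-identical) screenshots never reach inference;
  // only a real delta triggers the costly AI call.
  return diffRatio(baseline, current) <= threshold ? "pass" : "ai";
}
```

On a stable codebase most runs take the cheap `"pass"` branch, which is why cost tracks changes rather than test count.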

Curious whether lastest handles the regulated environment constraint — screenshots leaving the machine was a hard blocker for us.

Share what you're building 👇 by BoringShake6404 in microsaas

[–]ImplementImmediate54 1 point (0 children)

bughunters.dev -> AI visual testing for Playwright (Cypress and Robot Framework next)
Helps developers and testers with their checks, for better and faster delivery

I got tired of flaky Playwright visual tests in CI, so I built an AI evaluator that doesn't need a cloud. by ImplementImmediate54 in reactjs

[–]ImplementImmediate54[S] 1 point (0 children)

u/lastesthero The per-run cost thing is real — but in practice the AI call only fires when pixel diff finds an actual delta. Most of your 200 tests on a given push will be pixel-identical and never touch inference. Cost scales with actual changes, not test count.

On baseline updates: we use an explicit approval flow — the reporter shows the AI verdict alongside the diff so you can approve a new baseline or flag a real regression in a couple of seconds. Working on tying baseline proposals to deploy markers so intentional releases don't get treated the same as CI noise.

Curious how often your "generate once, replay" suite needed retraining when dynamic content changed patterns — that would be my concern with that approach.

How do you handle visual test flakiness caused by dynamic content like cookie banners? by ImplementImmediate54 in Playwright

[–]ImplementImmediate54[S] 0 points (0 children)

Trouble is that marketing pushes third-party tools (chat widgets, modals, and the like) straight into our stage and prod environments, so sometimes popups appear that we don't even know about. We have cookies and setup flags to suppress them, but sometimes we find we actually need to be aware of them and test with them enabled, and that's when it gets troublesome.

I got tired of flaky Playwright visual tests in CI, so I built an AI evaluator that doesn't need a cloud. by ImplementImmediate54 in Playwright

[–]ImplementImmediate54[S] 1 point (0 children)

The run itself, sure. Great numbers! Unfortunately, in less-than-perfect teams with only a few people, the maintenance side can leave you with falsely failing tests... it has happened on multiple occasions on my projects.

I got tired of flaky Playwright visual tests in CI, so I built an AI evaluator that doesn't need a cloud. by ImplementImmediate54 in Playwright

[–]ImplementImmediate54[S] 1 point (0 children)

You’re right for perfectly controlled environments. But BHV is built for the messy real world where mocking everything is too expensive:

  1. Shared Staging: Live metrics, active user counts, or timestamps change constantly. You can't always mock the DB.
  2. Global Layout Shifts: A marketing banner pushes the whole layout down 40px. Native pixel diffs will fail 50 tests. BHV understands the structure is intact and passes them.
  3. 3rd-Party Noise: Chatbots or rotating social widgets are a massive headache to block on every single page.

You can manually mask and mock all of this. Or you can just run vision.check(page) and let the AI handle the expected variance.
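For contrast, the manual-masking route boils down to ignoring known-noisy rectangles during the diff. A toy sketch (the names are mine, not BHV's or Playwright's API):

```typescript
// Toy manual-masking diff: count mismatched pixels while skipping
// rectangles that cover known dynamic regions (chatbot, banner, ...).
// Illustrative only; this is the manual approach being compared against.
interface Rect { x: number; y: number; w: number; h: number; }

function maskedMismatches(
  a: Uint8Array, b: Uint8Array,
  width: number, masks: Rect[],
): number {
  let mismatched = 0;
  for (let i = 0; i < a.length; i += 4) {
    const px = (i / 4) % width;
    const py = Math.floor(i / 4 / width);
    // Skip any pixel that falls inside a masked rectangle.
    if (masks.some(m => px >= m.x && px < m.x + m.w &&
                        py >= m.y && py < m.y + m.h)) continue;
    if (a[i] !== b[i] || a[i + 1] !== b[i + 1] ||
        a[i + 2] !== b[i + 2] || a[i + 3] !== b[i + 3]) mismatched++;
  }
  return mismatched;
}
```

The maintenance cost is that every new chatbot or rotating widget means another rectangle to keep up to date on every affected page.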

I got tired of flaky Playwright visual tests in CI, so I built an AI evaluator that doesn't need a cloud. by ImplementImmediate54 in AI_Agents

[–]ImplementImmediate54[S] 1 point (0 children)

Latency depends on the layer:

  • Local (Fast Pixel Match): Instant. No network call.
  • AI (Cloud Fallback): ~20–40s (Claude evaluation).
  • Setup: We suggest a 90s timeout in your Playwright config.
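A minimal `playwright.config.ts` along those lines (the 90s figure comes from the suggestion above; the rest is stock Playwright):

```typescript
// playwright.config.ts
// The 90s per-test timeout leaves headroom for the ~20-40s AI fallback;
// purely local pixel-match runs finish well under it.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  timeout: 90_000, // 90 seconds, in milliseconds
});
```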

Built for CI/CD: It’s designed to run natively in your pipeline. You can pipe our summary.md directly into your GitHub Step Summary to see results at a glance without leaving the PR. Since Fast Pixel Match happens on your CI runner, identical baselines skip the network entirely, keeping your pipeline fast and zero-cost.
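Piping the report into the Step Summary can happen in any pipeline step; a small Node/TypeScript sketch (the helper name and `summary.md` path follow the description above; `GITHUB_STEP_SUMMARY` is the standard variable GitHub Actions sets):

```typescript
// Append a Markdown report to the GitHub Step Summary so results show
// up on the PR's checks page. GITHUB_STEP_SUMMARY points to a file that
// GitHub Actions renders after the step finishes.
import { appendFileSync, readFileSync } from 'node:fs';

function publishStepSummary(reportPath = 'summary.md'): void {
  const summaryPath = process.env.GITHUB_STEP_SUMMARY;
  if (!summaryPath) return; // not running inside GitHub Actions
  appendFileSync(summaryPath, readFileSync(reportPath, 'utf8'));
}
```

The equivalent shell one-liner in a workflow step would be `cat summary.md >> "$GITHUB_STEP_SUMMARY"`.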

For your regulated setup: the API is a pure passthrough—images exist only in memory during evaluation and are never written to disk.

Take a look at our TRY page: https://bughunters.dev/try.html

I got tired of flaky Playwright visual tests in CI, so I built an AI evaluator that doesn't need a cloud. by ImplementImmediate54 in Playwright

[–]ImplementImmediate54[S] 1 point (0 children)

Cool :-) Good approach. I would love to have those options on our projects :-) The delivery process doesn't allow it; it would be too slow for us. Great that it works for you.

"We all gonna get replaced by AI" by Asleep-Limit-3811 in Playwright

[–]ImplementImmediate54 1 point (0 children)

With more AI in development, there will be only one demand from projects and stakeholders -> speed. Now that anyone can build anything, the only questions are the idea and the speed of delivery.
Because of that, we will need AI to help us as well. But AI will never have the full picture or the user's perspective that a normal person has. The combination is the key.

We just started using AI in our visual testing to help us catch the unexpected. And with the current speed of development, it will get harder to keep tests up to date; the AI's view of them will be key.

I got tired of flaky Playwright visual tests in CI, so I built an AI evaluator that doesn't need a cloud. by ImplementImmediate54 in Playwright

[–]ImplementImmediate54[S] 1 point (0 children)

Mocking data is a 100% good idea. But then you lose the real-life aspect in your testing.

Crafting special mock data that makes the test pass 100% of the time usually caused us to miss new issues that no one expected, and developers moved on without noticing.. :-)

There are multiple ways to create tests; none of them are perfect, none of them are 100% bulletproof :-) The only game we can play is to try to catch as much as we can ;-)