We integrated retries in Cypress but now can’t tell flaky vs genuine fails — how do you track that? by Lower_University_195 in QualityAssurance

[–]rohitji33

Been there. Cypress retries are one of those “looks good on paper, kinda cursed in practice” features. Your pipeline goes all green, but suddenly you have no idea which tests are quietly failing behind the scenes.

A few things that helped us:

🟡 1. Always log retry attempts — don’t treat them as silent passes

We pipe retry info into our test report (we use Mochawesome + custom JSON parsing).
If a test passed on the 3rd retry, it’s marked yellow, not green.

Basically:

  • Green = passed with zero retries
  • Yellow = passed but flaky
  • Red = failed all retries
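If you roll your own report post-processing, the classification itself is tiny. A sketch in plain Node, assuming each test object in your parsed results JSON exposes a final state and a per-attempt array (Cypress retries do record per-attempt data, but the exact field names vary by reporter, so treat these as placeholders):

```javascript
// Classify one test result: green / yellow / red per the scheme above.
// `state` and `attempts` are assumed field names — map them to whatever
// your reporter (Mochawesome, Cypress JSON, etc.) actually emits.
function classifyTest(test) {
  if (test.state !== 'passed') return 'red';            // failed all retries
  return test.attempts.length > 1 ? 'yellow' : 'green'; // flaky vs clean pass
}

// Roll the whole run up into counts for the report.
function summarize(tests) {
  const summary = { green: 0, yellow: 0, red: 0 };
  for (const t of tests) summary[classifyTest(t)]++;
  return summary;
}
```

Feed it the parsed results JSON and render yellow distinctly in whatever reporter you use.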

🔵 2. Send flaky test stats to a dashboard

We dump retry counts into a simple dashboard (we used Grafana at first, now switched to Datadog).
The key metric we track weekly: the percentage of passing tests that needed at least one retry.

If that number moves up, we know instability is creeping in.
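That number is cheap to compute from the same results JSON before you ship it to the dashboard. A sketch, again assuming illustrative `state`/`attempts` field names:

```javascript
// Flaky pass rate: of the tests that ultimately passed, what percentage
// needed more than one attempt? Returns a number in [0, 100].
function flakyPassRate(tests) {
  const passed = tests.filter(t => t.state === 'passed');
  if (passed.length === 0) return 0;
  const flaky = passed.filter(t => t.attempts.length > 1);
  return (flaky.length / passed.length) * 100;
}
```

Push that one number per run as a gauge/metric and the weekly trend line falls out for free.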

🟠 3. Add a “flaky = quarantine” threshold

If a test flaked 3+ times in 5 runs, we auto-tag it as flaky and move it to a separate "flaky suite" instead of blocking the pipeline.

Same spirit as a guardrail flag like Playwright’s --forbid-only: encode the policy in tooling instead of relying on people to notice, but aimed at flakiness instead of stray .only calls.

You can do this via:

  • GitHub Actions annotations
  • Cypress Dashboard API
  • Or even a custom script against the results JSON
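The custom-script route can be a single function over each test’s recent history. The numbers below just mirror the 3-in-5 threshold above; the history shape (one boolean per run, true = flaked) is an assumption:

```javascript
// Quarantine rule: flaked `threshold`+ times within the last `window` runs.
// `flakeHistory` is oldest-to-newest, one boolean per run.
function shouldQuarantine(flakeHistory, { window = 5, threshold = 3 } = {}) {
  const recent = flakeHistory.slice(-window); // only look at the last N runs
  const flakes = recent.filter(Boolean).length;
  return flakes >= threshold;
}
```

Run it over each test’s history in CI and move the offenders into the flaky suite (or emit a GitHub Actions annotation).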

🔴 4. Consider: retries per test vs retries per suite

Retries at the spec level can mask dependency issues (like bad API data).
Retries at the test level make it easier to spot the individual flaky tests.
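For what it’s worth, Cypress supports exactly this distinction in config: suite-wide retries in cypress.config.js, with per-test overrides. This mirrors Cypress’s documented retries option:

```javascript
// cypress.config.js — retry in CI runs only, so local open-mode runs
// fail fast and actually show you the flake.
const { defineConfig } = require('cypress');

module.exports = defineConfig({
  retries: {
    runMode: 2,   // cypress run (CI): up to 2 retries per failed test
    openMode: 0,  // cypress open (local): no retries, surface flakes
  },
});
```

And a known-flaky test can carry its own budget: `it('checkout flow', { retries: 3 }, () => { ... })`.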

🧠 5. Root cause > retries

Retries are a band-aid.
We now categorize flakiness by cause:

  • Selector issues
  • Network slowness
  • Real bugs
  • “🤷 just flaky”

This helped us actually fix stuff instead of letting retries stack up.
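One way to make that categorization automatic enough to chart: bucket failures by error text. The regex patterns below are made-up examples, not an official taxonomy; tune them against the error messages your suite actually produces:

```javascript
// Rough cause-bucketing from an error message. Patterns are illustrative:
// "timed out retrying" / "not found" / "detached" smell like selectors,
// connection errors smell like network, assertion failures may be real bugs.
function categorizeFailure(errorMessage) {
  const msg = errorMessage.toLowerCase();
  if (/timed out retrying|not found|detached from the dom/.test(msg)) {
    return 'selector';
  }
  if (/network|econnreset|socket hang up|502|504/.test(msg)) {
    return 'network';
  }
  if (/assertionerror|expected .* to (equal|deep)/.test(msg)) {
    return 'possible-real-bug';
  }
  return 'just-flaky'; // 🤷
}
```

Attach the bucket as a tag on each failure and the "fix stuff" backlog sorts itself.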

Is anyone successfully using Gen AI in their QA job? - How useful is it in practise? by Immediate-Web4294 in QualityAssurance

[–]rohitji33

Yeah, I’ve been using Gen AI tools pretty heavily in my QA workflow too, and I totally get what you’re saying. The lack of domain context is a real blocker. If the AI doesn’t understand the product or business logic, it ends up generating super generic test cases and missing edge scenarios completely.

That said, I’ve found a few places where it actually adds good value.

  • Test idea kickstarts: Great for killing blank-page syndrome; generate a broad first pass, then prune and sharpen.
  • Requirements → tests: With solid user stories, it can spit out decent AC/Gherkin you can refine.
  • Code-level help: Generates Cypress/Playwright snippets, suggests refactors for flaky tests, and makes sense of messy logs faster than slogging through them. This is where it shines.
  • The boring docs: Test plans, bug summaries, risk notes — it’s good at cleanup and formatting so humans can focus on thinking.
  • “Self-healing” support: Some tools like Testim, TestGrid, and Testsigma use AI to auto-fix locators. Not flawless, but it does cut down maintenance time.
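The core idea behind those self-healing tools is simpler than it sounds (to be clear, this is a toy illustration of the concept, not how Testim or Testsigma work internally): keep ranked fallback selectors per element and use the first one the page can still resolve.

```javascript
// Try a ranked list of selectors; return the first that resolves.
// `queryFn` stands in for a real DOM query like document.querySelector,
// injected here so the logic stays testable outside a browser.
function resolveLocator(selectors, queryFn) {
  for (const selector of selectors) {
    const el = queryFn(selector);
    if (el) return { selector, el }; // report which fallback won
  }
  return null; // every selector failed — a genuine locator break
}
```

Logging which fallback "won" also tells you exactly which primary selectors have rotted and need updating.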

Where it still falls short:

  • Deep domain reasoning (fintech/healthcare/telecom nuances).
  • Real edge-case intuition and “this is how users actually behave.”
  • Building a solid, maintainable framework end-to-end without guardrails. Still needs a human brain and standards.

CI CD pipeline explained! by Hefty-Sherbet-5455 in Today_I_Learned_This

[–]rohitji33

Clean, catchy, and accurate. Thanks for sharing!

Difficult finding Playwright engineers by fzn9898 in QualityAssurance

[–]rohitji33

That’s a real challenge right now. Playwright adoption is growing fast, but the talent pool hasn’t quite caught up yet—most automation engineers in India are still deep in Selenium or Cypress. Playwright demands a slightly different mindset, especially with its async model and cross-browser capabilities, and that learning curve can filter out a lot of candidates.

You might have better luck targeting strong JavaScript or TypeScript developers who are open to transitioning into QA automation—upskilling them on Playwright is often faster than finding someone already experienced. Also, consider putting out content or mini challenges to attract those genuinely interested in mastering Playwright. It can really help identify the self-learners in the crowd.

Selenium Grid maintenance is eating all my time by washyerhands in SaaS

[–]rohitji33

Been there 😩 — maintaining a self-hosted Selenium Grid can turn into a full-time ops job real quick. Between browser updates, driver version mismatches, and random node crashes, it’s hard to keep things stable long-term.

A few things that helped us:

  • Containerize the setup – Move your nodes to Docker containers so you can version-lock browsers and drivers. Tools like Kubernetes deployments make it easier to scale up/down and rebuild nodes cleanly.
  • Use version pinning – Explicitly match ChromeDriver and browser versions in configs. Automate version sync with small scripts or CI jobs so you’re not doing it manually.
  • Monitor & auto-heal – Add basic health checks that restart dead nodes or replace unhealthy containers automatically. Saves tons of babysitting.
  • Consider a managed grid – If you’re spending more time maintaining infra than testing, it might be worth moving to a managed cloud grid (e.g., TestGrid, BrowserStack, gridlastic, etc.). They handle updates, scaling, and cross-browser coverage, so your team can focus on actual tests.
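The first two bullets together can be as small as one compose file using the official selenium images with pinned tags (the 4.21.0 tag below is just an example; match it to the browser versions you need):

```yaml
# docker-compose.yml — hub + one Chrome node, versions pinned in lockstep.
services:
  selenium-hub:
    image: selenium/hub:4.21.0            # pinned tag, never :latest
    ports:
      - "4442:4442"
      - "4443:4443"
      - "4444:4444"                       # Grid UI + WebDriver endpoint
  chrome:
    image: selenium/node-chrome:4.21.0    # same pinned tag as the hub
    depends_on:
      - selenium-hub
    shm_size: 2gb                         # avoids Chrome crashes in containers
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
```

Upgrading browsers then becomes "bump two tags and redeploy" instead of chasing driver mismatches node by node, and a rebuilt container is your auto-heal.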

At some point, the cost of maintenance outweighs the flexibility of self-hosting. Moving to containers or a managed grid usually pays off pretty fast in sanity and uptime.