How-To · 10 min read

How to fix flaky tests in 2026

By qtrl Team · Engineering

Every test suite past a certain size has flaky tests. They fail on the first run, pass on retry, and nobody gets around to digging into them. The suite keeps drifting, and over time the team stops trusting it.

This post walks through what makes tests flaky in 2026, why AI coding tools pushed flake rates up without anyone noticing, and the fixes that hold up across Playwright, Cypress, and Selenium suites.

What a flaky test is

A flaky test is one that passes and fails against the same code, on the same environment, without you changing anything. The build is red. You hit retry. It's green. You move on.

That's the part that destroys trust. Once a suite has a few known flakes, developers start treating every failure as probably a flake. Real regressions ride along in the noise for days before someone notices. By the time you catch them, three more PRs have landed on top and the root cause is a layer cake.

Flakiness isn't a test problem. It's an information problem. A flaky suite gives you the wrong answer some of the time, which is worse than giving you no answer at all.

The five real causes of flaky tests

Almost every flaky test traces back to one of five root causes. The framework doesn't matter much. Selenium, Playwright, and Cypress suites all flake for the same reasons, just with slightly different failure modes.

1. Waits and timing

The classic. A test clicks a button, then checks for an element that's supposed to appear, but the app hasn't rendered it yet. Implicit waits paper over it on a fast machine and fall apart in CI under load. Hardcoded sleep(2000) is worse: too short on a slow day, wasteful on a fast one, and always wrong eventually.

Race conditions in async code are the same bug wearing different clothes. The test assumes a sequence that isn't guaranteed, and most of the time the sequence happens to line up. Until it doesn't.
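The pattern fits in a few lines of plain asyncio (the names here are illustrative, not from any real suite): reading state before the async work completes gives an answer that depends on timing, while awaiting the operation itself is deterministic.

```python
import asyncio

async def save(store):
    # Simulated backend write with latency.
    await asyncio.sleep(0.05)
    store["saved"] = True

async def check():
    store = {}
    task = asyncio.create_task(save(store))
    racy = store.get("saved", False)  # BAD: assumes the save already finished
    await task                        # GOOD: wait for the operation itself
    settled = store.get("saved", False)
    return racy, settled

racy, settled = asyncio.run(check())
```

The racy read only looks correct when the scheduler happens to cooperate; the awaited read is correct every time.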

2. Selectors that don't survive UI changes

XPath selectors pinned to layout, CSS selectors pinned to generated class names, and chains like div > div > span:nth-child(3) all break the minute a designer reshuffles a component. The test isn't wrong. It was never stable. It just hadn't been touched yet.

This is the failure mode that explodes when teams adopt AI coding tools. More UI changes per week means more selectors going stale, which we covered in why AI coding tools break test automation.

3. Test data that rots

A test passes on day one because the seeded user exists, has the right role, and hasn't hit any rate limits. By day thirty, someone else's test changed the user's email, a migration reset the role, and a cron job archived the record. The test now fails for reasons that have nothing to do with the code it's meant to be checking.

Shared fixtures that mutate across runs are the most common version of this. So are tests that depend on a specific ordering of pre-existing records.

4. Shared state and test order

Tests that pass when you run them in order and fail when you run them in parallel are leaking state somewhere. Usually it's the database, sometimes it's a global in memory, occasionally it's a browser session cookie that survives between cases.

Order dependence is the silent killer. It hides until you enable sharding, flip on parallelism, or someone on your team adds a --shuffle flag. Then half the suite goes red and nobody can explain why.

5. CI and environment drift

Flakiness you can't reproduce locally is almost always environment drift. CI runners have less CPU. Staging hits a shared database that's under load from other pipelines. The third-party sandbox your payment test relies on rate-limits you randomly on Tuesday afternoons.

You can't fix every source of environment instability, but you can stop pretending your CI environment is the same as your laptop. They aren't. Plan for it.

How fixes really work

There's no single fix. There's a set of habits that, taken together, cut your flake rate to something manageable.

Replace sleeps with auto-waiting

Modern frameworks already solve most of this. Playwright auto-waits for elements to be actionable before interacting. Cypress retries assertions until they pass or time out. Use the framework the way it was designed, and delete every sleep() on sight. If a test needs to wait for a backend event, wait for the event, not for wall-clock time.
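When the framework can't wait for you (a queue draining, a file landing on disk), a small polling helper is still far better than a fixed sleep. A minimal stdlib sketch; wait_for and its defaults are illustrative:

```python
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns truthy or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout:.1f}s")
        time.sleep(interval)
```

The failure mode is explicit: a clear timeout error at a bounded deadline instead of a silent stale read after an arbitrary sleep.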

Use semantic selectors

Prefer selectors tied to meaning over selectors tied to layout. Roles, labels, and test IDs survive refactors. CSS paths don't. The order of preference most teams settle on:

  • Accessible roles and labels (getByRole, getByLabel)
  • Dedicated test IDs (data-testid) for elements that don't have natural semantics
  • Text content, for buttons and links where copy is stable
  • CSS classes only as a last resort, and never anything auto-generated

The fewer selectors your suite has that depend on layout, the fewer flakes you'll earn the next time a designer touches the component library.
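One way to keep that preference order enforced rather than aspirational is a lint-style check over your locator strings. A rough sketch; the patterns and function name are illustrative and should be tuned to your codebase:

```python
import re

# Heuristics for selectors pinned to layout or build output.
FRAGILE_PATTERNS = [
    re.compile(r":nth-child\("),                 # position-dependent
    re.compile(r"(div|span)\s*>\s*(div|span)"),  # structural tag chains
    re.compile(r"\.css-[a-z0-9]+", re.I),        # auto-generated (CSS-in-JS style) class names
]

def is_fragile(selector: str) -> bool:
    """Flag selectors likely to break on the next layout change."""
    return any(p.search(selector) for p in FRAGILE_PATTERNS)
```

Run it over the suite in CI and fail the build on new offenders, so layout-pinned selectors never accumulate in the first place.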

Own your test data

The single highest-leverage fix most teams skip. Stop sharing fixtures across tests. Create what you need at the start of each test, clean up after, and scope everything to a test-specific tenant or namespace where you can.

Factories beat fixtures in almost every case. The same discipline you'd apply to a unit test (set up state, run the assertion, tear down) applies just as cleanly end-to-end, even if it feels heavier at first. Google's testing blog reported back in 2016 that almost 16% of their tests showed some level of flakiness, and much of the mitigation work came down to test isolation. The lesson holds.
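A minimal factory sketch (the User shape and field names are illustrative): every test gets a unique record, so nothing another test mutates can reach it.

```python
import uuid
from dataclasses import dataclass

@dataclass
class User:
    email: str
    role: str = "member"

def make_user(role: str = "member") -> User:
    # Unique per call: no two tests ever share this record.
    suffix = uuid.uuid4().hex[:8]
    return User(email=f"test-{suffix}@example.test", role=role)
```

In a real suite the factory would hit your seeding API or database instead of building a dataclass, but the contract is the same: fresh state in, no shared state out.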

Isolate tests by default

Run your suite in a random order at least once a week. If anything breaks, you have order dependence, and you want to find it before your next attempt at parallelization does. Avoid globals. Avoid shared sessions. Treat every test as if it's the only test that will ever run, because one day you'll need that to be true.

Quarantine, don't retry

A blanket retries: 3 in your CI config is a flake amnesty program. It hides the symptom and teaches the team to stop caring. Better: quarantine flaky tests into a separate suite that doesn't block merges, and give them a fixed lifespan. If nobody fixes a quarantined test in two weeks, delete it. A test nobody owns isn't a test, it's a liability.

Targeted retries for known external flakiness (a third-party API that's genuinely unreliable) are fine. Global retries as a default are not.
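In code, "targeted" means the retry names the failure it tolerates and lets everything else fail loudly. A sketch; retry_external and its defaults are illustrative:

```python
import functools
import time

def retry_external(times=3, delay=0.0, exceptions=(ConnectionError, TimeoutError)):
    """Retry only the named external failures; any other error fails immediately."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # exhausted: surface the external failure
                    time.sleep(delay)
        return wrapper
    return decorator
```

Applied to the one test that talks to the unreliable sandbox, this absorbs the known noise; a real regression (any other exception) still fails on the first attempt.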

Track flake rate as a first-class metric

You can't fix what you don't measure. Track the percentage of test runs that pass on retry after failing on first attempt, per test and per suite. Put it on the same dashboard you use for coverage and CI time, and review it in the same meeting. Once the problem is visible, it starts getting fixed. It almost never does before.
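The metric itself is simple to compute from CI run records. A sketch assuming each record carries the test name, the first-attempt result, and the final result (the record shape is illustrative):

```python
from collections import defaultdict

def flake_rates(runs):
    """runs: iterable of (test_name, passed_first_try, passed_after_retries).
    A flake is a run that failed first but passed on retry."""
    totals = defaultdict(int)
    flakes = defaultdict(int)
    for name, first, final in runs:
        totals[name] += 1
        if final and not first:
            flakes[name] += 1
    return {name: flakes[name] / totals[name] for name in totals}
```

Sorting that dict by value descending is also how you pick the five worst offenders to start with.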

Where agentic testing helps

A lot of the pain above comes from test scripts being pinned to exact selectors and exact sequences. When the UI drifts, the test drifts with it and breaks. Agentic testing flips that. An AI agent reads the page the way a user would, finds the right element by intent, and keeps going. When the button moves, the agent still finds it. When the layout reshuffles, the test doesn't care.

That doesn't make flakiness disappear. Shared state still leaks, CI still drifts, third-party sandboxes still rate-limit. Agents reduce the selector-level fragility that drives most of the visible flakes in a suite, which is usually the biggest bucket. The rest still needs the discipline above.

Framework choice matters less than most teams think. The framework isn't the hard part. Test data, isolation, and ownership are. Fix those, and you've already done most of the work.

Where to start this week

Pick the five tests that flake most often. Find them by looking at the last month of CI runs, not by asking people which ones annoy them most. Put those five on a board. Pair on each one until you can name the root cause, then fix it or delete the test.

Do that for a month. You'll learn more about your suite than any refactor plan would teach you, and you'll probably cut your flake rate by more than you expect. The rest of the suite follows the same patterns. Once you've seen them, the fixes stop being a mystery.

Frequently asked questions about flaky tests

Why are my tests flaky? Most flaky tests trace back to one of five causes: bad waits or race conditions, selectors pinned to layout, test data that rots, shared state between tests, or environment drift in CI. The fix starts with diagnosing which of the five applies before changing any code.

How do I fix flaky Playwright, Cypress, or Selenium tests? The approach is the same, the wait strategy differs. In Playwright and Cypress, lean on the built-in auto-waiting and assertion retries instead of hardcoded sleeps. In Selenium, use explicit WebDriverWait conditions tied to application state. Across all three: prefer role and label selectors over CSS paths, scope test data to each test, and quarantine flakes instead of globally retrying. Track flake rate as a metric so you can tell whether your fixes are working.

Should I retry failed tests? Targeted retries for known external flakiness are fine. Global retries applied to every test are a bad idea, because they hide real failures and normalize flakiness. Quarantine flaky tests into a separate suite, assign an owner, and give them a fixed deadline to fix or delete.

Does AI testing solve flaky tests? It solves the biggest bucket, which is selector-level fragility when the UI changes. Agents find elements by intent rather than by exact path, so minor UI refactors don't break them. Flakiness from shared state, test data drift, and CI instability still needs the same discipline you'd apply to a script-based suite.


qtrl's AI agents adapt to UI changes the way a user would, so your tests keep running when selectors would have gone stale. Pair that with test management, ownership, and audit trails in one place, and you get a suite that earns developers' trust back instead of losing it every week. Start free with qtrl or read more on the real cost of test automation.

Have more questions about AI testing and QA? Check out our FAQ