Insights · 10 min read

Self-healing tests: how they work and where they break

By qtrl Team · Engineering

A test that worked yesterday fails today. The flow still works in the browser. A developer renamed a CSS class, moved a button into a different wrapper, or swapped a data-testid for an aria label. The code ships, the test breaks, and your team spends Monday morning fixing selectors instead of writing new coverage.

Self-healing tests are supposed to absorb that kind of change. The promise: when the page shifts under your test, the test re-identifies what it was looking for and keeps going. That promise is partly real and partly oversold. Worth understanding the difference before you bet a regression suite on it.

What "self-healing" actually means

A self-healing test is one that, when it can't find the element it expected, doesn't fail immediately. It tries to find the same element a different way. If it succeeds with high confidence, it continues the run and (in most tools) updates the test definition so the next run uses the new locator from the start.

The mechanic is simple. The hard part is the word "same." What counts as the same button after a redesign? If the original target had the text "Submit" and now there are three buttons that say "Save," "Continue," and "Submit" in a different order, which one is correct? Self-healing tools answer that question with varying degrees of cleverness, and the cleverness is what separates the two generations of this technology.

Locator-based vs agentic self-healing

It's worth separating the two main approaches before going further, because they fix different things.

| Approach | How it works | Strengths | Limits |
| --- | --- | --- | --- |
| Locator-based (Healenium, Mabl, Testim, Functionize) | Stores multiple selector candidates per element; falls back through them; uses ML to score likely matches | Fast, deterministic, predictable | Heals only when the candidate set still contains the element |
| Agentic (LLM-driven) | Reads the page at runtime, reasons about which element matches the intent, and re-identifies via accessibility tree, text, and visual context | Handles deeper structural changes; works across redesigns | Non-deterministic; needs guardrails and review for low-confidence heals |

Locator-based self-healing has been around since roughly 2018. The idea is that for every element your test interacts with, the framework stores not just one selector but several: the CSS class, the XPath, the surrounding text, the relative position, sometimes a visual fingerprint. When the primary selector misses, the tool walks through the alternates and ranks each candidate match. The highest-confidence match wins. If nothing scores above a threshold, the test fails.

It works. For the kinds of changes engineers actually ship most often (renaming a class, swapping id for data-testid, moving a button into a flex container), the fallback list usually covers the new state. Healenium, an open-source project, does this on top of Selenium and is still actively maintained. Mabl and Testim package the idea inside a broader platform. Within their scope, all of them save real maintenance time on shallow UI churn.
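To make the fallback-and-ranking mechanic concrete, here is a minimal sketch in Playwright-flavored TypeScript. The candidate shape, the weights, and the resolveWithFallback helper are illustrative assumptions, not the API of any of the tools named above.

import { Page, Locator } from "@playwright/test";

// Hypothetical shape: each recorded element keeps a primary selector plus
// ranked alternates captured at recording time.
interface LocatorCandidate {
  selector: string;
  weight: number; // prior confidence assigned when the alternate was recorded
}

// Try the primary selector first; otherwise walk the alternates in rank order
// and accept the first one that is present, unique, and above the threshold.
async function resolveWithFallback(
  page: Page,
  primary: string,
  alternates: LocatorCandidate[],
  threshold = 0.8
): Promise<Locator> {
  const primaryLocator = page.locator(primary);
  if ((await primaryLocator.count()) === 1) return primaryLocator;

  const ranked = [...alternates].sort((a, b) => b.weight - a.weight);
  for (const candidate of ranked) {
    if (candidate.weight < threshold) break; // nothing confident enough: fail loudly
    const healed = page.locator(candidate.selector);
    if ((await healed.count()) === 1) {
      console.warn(`healed ${primary} -> ${candidate.selector} (weight ${candidate.weight})`);
      return healed;
    }
  }
  throw new Error(`No fallback above ${threshold} matched for ${primary}`);
}

The property that matters is the final throw: when nothing clears the bar, the test fails instead of guessing.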

Agentic self-healing is the newer approach. Instead of pre-computing fallback selectors at recording time, it asks an LLM at runtime to find the right element on the current page. The agent reads the accessibility tree, sometimes a screenshot, the text labels, the surrounding context, and decides "the button labeled Submit that progresses checkout to the next step is now in this card on the right." It can handle cases where no pre-recorded selector survives, because it isn't relying on selectors at all. It's relying on what the page means.
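A rough sketch of the agentic version, again in TypeScript: capture the accessibility tree, describe the step's intent, and ask a model to pick the matching element. The prompt, the response shape, and callModel are placeholders for whatever LLM client you run; only the Playwright accessibility snapshot is an existing call.

import { Page } from "@playwright/test";

// Placeholder for whatever LLM client you use; the signature is an assumption.
declare function callModel(prompt: string): Promise<string>;

// Sketch of agentic re-identification: instead of pre-recorded selectors,
// describe the step's intent and let a model reason over what the page means.
async function healAgentically(
  page: Page,
  intent: string // e.g. "submit payment to complete order"
): Promise<{ selector: string; confidence: number; reasoning: string }> {
  const axTree = await page.accessibility.snapshot(); // role/name tree the agent reasons over
  const prompt = [
    `Step intent: ${intent}`,
    `Accessibility tree: ${JSON.stringify(axTree)}`,
    `Pick the element that matches the intent.`,
    `Respond with JSON: {"selector": "...", "confidence": 0..1, "reasoning": "..."}`,
  ].join("\n");

  return JSON.parse(await callModel(prompt));
}

Nothing here is deterministic; two runs can disagree, which is exactly why the confidence scores and review loop discussed below matter.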

Where self-healing reliably saves time

Most of the breakage that hits scripted suites is shallow. Here are the kinds of changes self-healing handles well:

  • Renamed CSS classes or hashed class names from a new build
  • A removed or changed data-testid
  • An element moved into a new parent (a button now lives inside a card)
  • Reordered elements on a page where the target is still uniquely identifiable
  • Minor copy edits: "Submit" becomes "Save changes"
  • An id swapped for an aria label, or vice versa
  • A form field wrapped in a new label component

These changes break tests but don't change what the test is checking. The user still clicks the same button to do the same thing. A self-healing test absorbs the diff and moves on. For a suite of a few hundred tests running through weekly UI tweaks, this can cut maintenance time meaningfully. We've seen this pattern in the flaky-test work that's adjacent to it: a lot of what looks like "flake" is actually unannounced selector drift, and self-healing absorbs most of that category.

Where self-healing breaks down

The honest answer: self-healing handles the surface, not the substance. When the change is shallow, it's a real win. When the change is deep, it'll either fail loudly (best case) or quietly "heal" to the wrong element and pass a test that should have failed (worst case).

Intent-level drift is the clearest example. Your checkout used to be three steps: cart, shipping, payment. The team ships a new flow that collapses shipping and payment into one step. The test was written against the old shape. A locator-based healer will fail because the "Continue to payment" button doesn't exist anymore. An agentic healer might be smart enough to recognize that the flow changed, but it shouldn't silently pretend the test still passes. The test is no longer valid. It needs a human to update what the test is for, not just where it clicks.

Removed features are similar. If a feature is gone, the test for it should fail in a way that surfaces the deletion, not heal toward whatever vaguely similar element happens to be on the page now. This is the trap with overly aggressive healing: it papers over real product changes.

Ambiguous matches are the third common failure mode. After a redesign, five buttons on the page could plausibly be the "next step" button. A confidence-scored healer should refuse to pick one, but cheap implementations grab the highest-ranked candidate even when the spread between top candidates is tiny. You get a test that passes because it clicked something, not because it verified the right thing.
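One way to guard against that, sketched below: refuse to auto-heal when the runner-up is nearly as good as the winner, even if the winner clears the absolute threshold. The margin value is an assumption to tune, not a standard.

interface ScoredCandidate {
  selector: string;
  confidence: number;
}

// Refuse to pick when the top two candidates are a near-tie: returning null
// here should surface the heal for review instead of clicking "something".
function pickOrRefuse(
  candidates: ScoredCandidate[],
  threshold = 0.8,
  minMargin = 0.2
): ScoredCandidate | null {
  const ranked = [...candidates].sort((a, b) => b.confidence - a.confidence);
  const [best, runnerUp] = ranked;
  if (!best || best.confidence < threshold) return null; // nothing confident enough
  if (runnerUp && best.confidence - runnerUp.confidence < minMargin) {
    return null; // coin flip between candidates: let a human decide
  }
  return best;
}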

And then there's the case where what the test should target has changed entirely. A button moved behind a feature flag. A form is now gated behind an upsell modal. A flow that was anonymous now requires login. No selector strategy can heal those: the page state itself has changed, and someone has to decide what the test should do about it.

What a heal log should actually contain

Whether self-healing helps or hurts mostly comes down to what the tool records when it heals. Here's the shape of a useful heal log:

{
  "test": "checkout_with_discount_code",
  "step": 4,
  "action": "click",
  "expected_locator": "#submit-payment",
  "expected_intent": "submit payment to complete order",
  "heal": {
    "trigger": "element_not_found",
    "candidates": [
      {
        "locator": "button[data-action='complete-order']",
        "label": "Complete order",
        "confidence": 0.94,
        "reasoning": "Same role and intent as original. Located in payment form, not in cart upsell modal."
      },
      {
        "locator": "button.btn-primary.upsell",
        "label": "Add warranty and continue",
        "confidence": 0.31,
        "reasoning": "Visually similar but progresses an upsell flow, not the payment submission."
      }
    ],
    "selected": "button[data-action='complete-order']",
    "confidence": 0.94,
    "auto_applied": true,
    "review_required": false
  }
}

Three things make this log useful. The expected intent is recorded alongside the original locator, so a reviewer can tell whether the new element actually does the same thing. Every candidate is scored and kept in the log, so you can see whether the choice was obvious or a coin flip between two near-equal options. And the auto-apply decision is explicit and tied to a confidence threshold you control.
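If you plan to ingest these logs into your own tooling (dashboards, review queues, audit trails), a TypeScript interface mirroring the record above might look like this. It is just the example re-typed, not a published schema.

interface HealCandidate {
  locator: string;
  label: string;
  confidence: number;
  reasoning: string;
}

// The heal record from the example above, typed so it can be validated and
// queried later. Field names mirror the sample log; the shape is illustrative.
interface HealRecord {
  test: string;
  step: number;
  action: string;
  expected_locator: string;
  expected_intent: string;
  heal: {
    trigger: string;
    candidates: HealCandidate[];
    selected: string;
    confidence: number;
    auto_applied: boolean;
    review_required: boolean;
  };
}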

Compare with what most black-box tools surface:

[INFO] test: checkout_with_discount_code
[INFO] step 4: locator updated automatically
[PASS] checkout_with_discount_code (3.2s)

Same outcome on the surface. But the second log gives you no way to verify the heal was correct, no way to audit it later, and no way for the next person on call at 2am to reason about whether the test still means what it used to.

The blind-trust problem

Self-healing tools produce clean output. Pass, fail, screenshots, a green check on the run report. Tempting to take at face value. But a test that healed three times during one run isn't the same kind of signal as a test that ran clean. It's asserting something weaker: "I found something that looked close enough to what I expected, three times."

The teams that get real value from self-healing treat heals as events, not non-events. Every successful heal generates a notification or a review queue item. A human (or a reviewing agent) confirms that the new locator points to the right thing before the heal is committed permanently. That generate-heal-review loop is the difference between a suite that adapts and a suite that drifts.
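A minimal sketch of that loop, assuming hypothetical hooks into your own test store and review queue (commitLocatorUpdate and notifyReviewQueue are stand-ins, and the threshold is yours to set):

interface HealEvent {
  test: string;
  selectedLocator: string;
  confidence: number;
}

// Stand-ins for your own test store and review queue.
declare function commitLocatorUpdate(event: HealEvent): Promise<void>;
declare function notifyReviewQueue(event: HealEvent, opts: { blocking: boolean }): Promise<void>;

// Every heal is an event: high-confidence heals are applied but still logged
// for audit; anything below the bar waits for approval before it mutates the test.
async function routeHeal(event: HealEvent, autoApplyThreshold = 0.9): Promise<void> {
  if (event.confidence >= autoApplyThreshold) {
    await commitLocatorUpdate(event);
    await notifyReviewQueue(event, { blocking: false });
  } else {
    await notifyReviewQueue(event, { blocking: true });
  }
}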

Without that loop, you end up with a suite that's passed continuously for six months while quietly checking the wrong elements. That's worse than a brittle suite. A brittle suite at least tells you it's broken.

What to look for in a self-healing tool

The space is crowded enough that the marketing claims start to blur. A few things worth pressing on when you evaluate:

  • Transparent heal logs. You should see what the original target was, what the new target is, and why the tool decided they're the same. If healing happens in a black box, you can't verify it later.
  • Confidence scores and thresholds. A tool that always heals to something isn't self-healing, it's guessing. The good ones expose a confidence number and let you set a threshold below which a heal isn't auto-applied.
  • Review workflow for low-confidence heals. When the confidence is borderline, the heal should land in a review queue, not silently mutate the test. A platform that combines structured test management with an approval step for AI changes is doing this right.
  • Persistent memory of past heals. If the agent saw a similar change last sprint and approved a particular fix, it shouldn't re-deliberate from scratch. Look for tools that accumulate context about your application across runs, not ones that treat every execution as a cold start.
  • Real browsers, not headless approximations. Heals based on a simplified DOM miss visual and layout-driven differences. The test surface needs to match what your users see.
  • A way to opt out. Some tests should never auto-heal. A regulatory check that asserts an exact label on an exact element shouldn't silently update if that label changes. You want the ability to mark a test as no-heal; the config sketch after this list shows one way that could look.
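To make the threshold, review, and opt-out ideas concrete, here is what a per-suite healing config could look like. Every key is hypothetical; no specific tool exposes exactly this shape.

// Hypothetical healing config; key names are illustrative, not any tool's real options.
const healingConfig = {
  autoApplyThreshold: 0.9,    // heals at or above this confidence are applied automatically
  reviewBelowThreshold: true, // lower-confidence heals land in a review queue instead
  persistHealMemory: true,    // remember approved heals so similar changes aren't re-deliberated
  realBrowser: true,          // heal against the rendered page, not a simplified DOM
  noHeal: [
    "regulatory_label_exact_match", // tests that must fail loudly if their target changes
  ],
};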

How this fits with the rest of your QA stack

Self-healing alone isn't a strategy. It's a feature that helps a suite survive in environments where UI churn is constant, which is most modern web products and especially the ones with a lot of AI-generated code landing every week. Underneath it, you still need the foundation: a clear list of what's being tested, a coverage map, run history you can audit, and a way to track which tests are healthy versus which have been quietly limping along.

The teams that benefit most from self-healing are the ones who already have that structure. They can tell when a heal is suspicious because they know what the test was for. Teams without structure end up with a suite that mutates faster than they can keep track of, and the heals compound until nobody trusts the results.

Self-healing tests: frequently asked questions

Are self-healing tests the same as AI testing? No. AI testing is the broader category. Self-healing is one feature within it, focused specifically on keeping existing tests alive when the UI changes. You can have AI-generated tests that don't self-heal, and you can have self-healing tests that weren't generated by AI.

Can self-healing fix flaky tests? Sometimes. A lot of what teams call flake is actually selector drift, and self-healing absorbs that. But timing-related flake (race conditions, slow network, animation issues) is a different problem and won't be fixed by smarter locators. For that, see how to fix flaky tests in 2026.

Does self-healing work with Playwright or Selenium? Both. Healenium plugs into Selenium directly. Most agentic platforms run on top of Playwright under the hood. You don't usually need to migrate frameworks to add self-healing; you add a layer on top.

Will self-healing tests eliminate test maintenance? No. They'll cut maintenance on shallow UI changes meaningfully. Maintenance for intent changes, business logic updates, and removed flows still needs a human. The realistic outcome is that your team spends less time on tedious selector fixes and more time on the judgement work that actually requires expertise.

How is agentic self-healing different from auto-healing in tools like Mabl? Mabl, Testim, and similar platforms use locator-based healing with ML scoring. They work well within the bounds of their candidate sets. Agentic healing decides at runtime by reasoning about the page, which means it can handle changes that fall outside any pre-computed locator list. It's also more expensive per run and needs governance to keep it honest.


qtrl runs agentic self-healing in real browsers and records every heal with the original intent, the candidates that were considered, and the confidence that picked the winner. Low-confidence heals route to a review queue instead of silently mutating your tests, and the agent carries memory of past heals across runs so it doesn't re-deliberate the same fix every sprint.

All of it sits on top of structured test management, so a heal you accepted last sprint stays connected to the test it was for, and you can tell at a glance which tests are healthy versus which have been quietly drifting. See how it works.

Have more questions about AI testing and QA? Check out our FAQ