Insights · 10 min read

What is agentic testing? How autonomous QA agents actually work

By qtrl Team · Engineering

A year ago, "AI testing" meant generating test scripts from prompts. You described what to test, an LLM spat out Playwright or Cypress code, and you ran it. Useful, but not that different from what a junior engineer does with better autocomplete.

Something has shifted. A new category of testing tools has emerged that doesn't just generate scripts. These tools make decisions. They navigate your app, figure out what to check, adapt when the UI changes, and build up knowledge about how your product works over time. The industry is calling this "agentic testing," and it's worth understanding what the term actually means before it loses all meaning.

Three generations of test automation in five minutes

To see why agentic testing matters, it helps to know where it sits in the history of how teams automate QA.

Generation one: scripted automation. You write code that drives a browser. Selenium, Cypress, Playwright. Every action is explicit: go to this URL, click this selector, assert this text. The test does exactly what you tell it to, nothing more. When the UI changes, the test breaks, and a human fixes it. Teams with thousands of scripted tests often spend more time fixing them than writing new ones.
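To make the brittleness concrete, here is a toy sketch of a generation-one test. It uses a simulated page (a list of dicts) instead of a real browser, and a hypothetical `find` helper standing in for something like Selenium's `find_element`. Every step names an exact selector, so a harmless rename breaks the test even though the feature still works.

```python
# Simulated generation-one test: exact selectors against a fake page.
# The page dicts and the find() helper are illustrative, not a real API.

def find(page, selector):
    """Exact-match lookup, like a find-by-CSS-selector call."""
    for element in page:
        if element["selector"] == selector:
            return element
    raise AssertionError(f"stale selector: {selector}")

page_v1 = [{"selector": "#submit-btn", "text": "Submit"}]
page_v2 = [{"selector": "#submit-button", "text": "Submit"}]  # renamed in a refactor

# The scripted test passes against the page it was written for...
assert find(page_v1, "#submit-btn")["text"] == "Submit"

# ...and fails against the renamed page, even though nothing user-facing changed.
try:
    find(page_v2, "#submit-btn")
    broke = False
except AssertionError:
    broke = True
assert broke
```

The failure isn't a bug in the app; it's a bug in the test's coupling to the markup. That coupling is what the later generations try to remove.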

Generation two: AI-assisted automation. An LLM generates the test scripts for you. You describe the test in natural language, the AI produces runnable code. Authoring gets faster. But the tests themselves are still static scripts. They still break when selectors change. The AI helped you write the test faster; it didn't make the test smarter.

Generation three: agentic testing. The AI doesn't produce a script and hand it off. It is the test runner. It reads the page, decides what to do next, handles unexpected UI changes on the fly, and makes judgment calls about whether something looks right. Instead of following a fixed sequence of steps, it pursues an objective: "verify that a user can complete checkout with a discount code." How it gets there can vary from run to run, because it's navigating a real browser in real time, not replaying a recording.
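The observe-decide-act loop at the core of generation three can be sketched in a few lines. Everything here is simulated: the pages, the transitions, and the `decide` function, which stands in for an LLM call. The point is the shape of the loop: the test is an objective plus a policy, not a fixed sequence of steps.

```python
# Toy agentic loop over a simulated checkout flow. PAGES, TRANSITIONS, and
# decide() are illustrative stand-ins; decide() replaces a real model call.

TRANSITIONS = {
    ("cart", "Apply discount"): "cart",
    ("cart", "Checkout"): "checkout",
    ("checkout", "Pay"): "done",
}

def decide(state, objective, history):
    """Stand-in for the model: pick the next action toward the objective."""
    if state == "cart":
        return "Apply discount" if "Apply discount" not in history else "Checkout"
    return "Pay"

def run(objective):
    state, history = "cart", []
    while state != "done":
        action = decide(state, objective, history)  # observe + decide
        history.append(action)
        state = TRANSITIONS[(state, action)]        # act
    return history

steps = run("complete checkout with a discount code")
assert steps == ["Apply discount", "Checkout", "Pay"]
```

The exact path the agent takes can differ between runs; what stays fixed is the objective and the check that it was met.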

What makes a test "agentic"

An agentic test has four properties that scripted tests don't.

The first is autonomy in execution. You give it an intent ("test the password reset flow"), not a step-by-step script. The agent figures out how to navigate to the right page, what to click, what to type. If a modal appears that wasn't there last week, it doesn't crash. It reads the modal, decides whether it's relevant, and continues.

Then there's self-healing at the interaction layer. When a button's class name changes or a form field gets restructured, the agent re-identifies the element using context (accessibility attributes, surrounding text, visual position) rather than failing on a stale selector. Self-healing works well for UI-level changes, not intent-level changes, like when your checkout flow goes from three steps to two. That still needs a human to update the test's objective. But for the kind of breakage that eats up most maintenance time (renamed classes, moved elements, restructured forms), it's a real improvement over fixing selectors by hand.
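A minimal sketch of that fallback logic, assuming a simulated page of element dicts with illustrative attribute names: try the exact selector first, then accessibility role plus label, then visible text.

```python
# Selector fallback sketch. Element dicts and attribute names are
# illustrative; a real implementation would query a live DOM or
# accessibility tree.

def resolve(page, selector, role=None, label=None, text=None):
    by_selector = [e for e in page if e.get("selector") == selector]
    if by_selector:
        return by_selector[0]
    by_aria = [e for e in page
               if role and e.get("role") == role and e.get("aria_label") == label]
    if by_aria:
        return by_aria[0]
    by_text = [e for e in page if text and e.get("text") == text]
    if by_text:
        return by_text[0]
    raise LookupError("element not found by any strategy")

# The class name changed in a redesign, but role and label survived.
page = [{"selector": ".btn-primary-v2", "role": "button",
         "aria_label": "Submit order", "text": "Submit"}]

el = resolve(page, ".btn-primary", role="button", label="Submit order")
assert el["selector"] == ".btn-primary-v2"
```

The stale selector `.btn-primary` misses, but the accessibility context still identifies the element, which is why attributes like roles and labels tend to be more durable anchors than class names.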

The third property is persistent memory. The agent builds a model of your application over time. It learns which screens connect to which features, what normal behavior looks like, where it's seen issues before. Early runs are slower and less precise. As the agent accumulates context, it gets faster and more targeted. Scripted tests are just as dumb on run one thousand as they were on run one.
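One simple way to picture that memory, as a hypothetical sketch: a store mapping objectives to paths that worked before, so the first run explores and later runs start from a known-good route.

```python
# Illustrative agent memory: objective -> actions that previously worked.
# A real platform would persist richer state (screen graphs, baselines).

import json

class AppMemory:
    def __init__(self):
        self.known_paths = {}  # objective -> list of actions

    def record(self, objective, actions):
        self.known_paths[objective] = actions

    def recall(self, objective):
        return self.known_paths.get(objective)

    def save(self):
        return json.dumps(self.known_paths)  # persisted between runs

memory = AppMemory()
assert memory.recall("test password reset") is None   # run 1: must explore
memory.record("test password reset", ["Login", "Forgot password", "Submit"])
# Run 2+: replay the remembered route, fall back to exploring if it breaks.
assert memory.recall("test password reset") == ["Login", "Forgot password", "Submit"]
```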

Finally, exploration beyond the script. An agentic test runner doesn't just verify the happy path you told it about. It can explore adjacent flows, try unexpected inputs, and surface issues in areas you didn't think to check. Not as a replacement for structured testing, but as a layer on top of it.

Why now

Agentic testing wasn't possible two years ago. Three things changed.

LLMs got good enough at understanding web UIs. Reading a page's accessibility tree or interpreting a screenshot and deciding "I should click the blue button labeled Submit" sounds simple, but it requires spatial and semantic reasoning that earlier models couldn't do reliably. Current models can.

Browser automation infrastructure matured in parallel. The MCP ecosystem, Playwright's API, and tools like Stagehand gave AI agents reliable ways to control browsers programmatically. The plumbing that connects "the model decided to click Submit" to "a real browser clicked Submit" is solid now.

And inference costs dropped enough to make it practical. An agentic test run involves many LLM calls: reading the page, deciding an action, verifying results, deciding the next step. At 2024 API prices, running a hundred agentic tests would have blown most QA budgets. That's no longer the case.
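The economics are easy to sanity-check with back-of-envelope arithmetic. All the numbers below are illustrative assumptions, not real vendor prices.

```python
# Rough cost of one agentic run. Every figure here is an assumption
# for illustration only.

steps_per_test = 15        # observe/decide/verify cycles per test
tokens_per_call = 4_000    # page context in, short decision out
price_per_mtok = 2.50      # assumed blended $ per million tokens

cost_per_test = steps_per_test * tokens_per_call / 1_000_000 * price_per_mtok
cost_per_100 = 100 * cost_per_test
print(f"${cost_per_test:.2f} per test, ${cost_per_100:.2f} per 100 tests")
```

Under these assumptions a hundred-test suite costs dollars, not hundreds of dollars; at token prices several times higher, the same arithmetic lands in budget-meeting territory.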

What agentic testing is good at

Not everything benefits from an agentic approach. But a few categories of testing get noticeably better.

Regression testing across UI changes is the obvious one. If your team ships UI updates frequently (and if you're doing continuous deployment, you do), scripted tests break constantly. Agentic tests absorb most UI changes without intervention. Your QA team stops spending Monday mornings fixing selectors and starts spending that time on work that actually requires human judgment.

Exploratory testing also scales differently. A skilled human tester explores your app with intuition and creativity. An agentic tester explores with breadth and persistence. It can spend hours navigating paths nobody assigned it, finding broken states and edge cases. The two approaches complement each other: humans find the subtle, contextual bugs; agents find the ones hiding in corners nobody visits.

Smoke testing across environments is another good fit. You deploy to staging, you want a quick sanity check that critical flows work. Agentic tests handle this well because they adapt to minor environment differences (slightly different data, different load times) without failing spuriously. No need to maintain a separate set of environment-specific scripts.

And for teams where QA isn't deeply technical, natural language specifications close the gap between "what we want to verify" and "what actually runs." A product manager describes the scenario in plain English. The agent handles the browser interaction. No translation from requirements to code required.

Where agentic testing needs guardrails

The trade-offs are real, and being honest about them is the difference between productive adoption and disappointment.

Non-determinism is the one that catches people off guard. Because the agent makes decisions at runtime, two runs of the same test might take slightly different paths. Usually that's fine; both paths verify the same intent. Occasionally it causes confusion when a test passes one run and fails the next, not because the app changed, but because the agent took a different route. Detailed execution logs help here. You need to see what happened and why.

Complex business logic is a harder problem. Agents are strong at navigating UIs and verifying that elements appear correctly. They're weaker at validating calculations or data transformations that require domain knowledge. "Does this tax amount look right for a customer in Ontario?" isn't something an agent can answer by looking at the screen. For those cases, you still need explicitly defined assertions from a human reviewer.

But the biggest risk isn't technical. It's blind trust. Agentic tests produce professional-looking results: screenshots, step logs, pass/fail verdicts. Tempting to take at face value. But an agent can "pass" a test by verifying the wrong thing, or miss a bug because it didn't check the right element. The teams that succeed with agentic testing build review into the workflow: AI generates or executes the test, a human reviews and approves it, approved tests run autonomously going forward. That generate-review-approve cycle is what separates real confidence from the illusion of it.
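That generate-review-approve cycle is essentially a small state machine, which a hypothetical sketch can make explicit: AI-authored tests start as drafts and only run autonomously after a human approves them.

```python
# Illustrative review gate for AI-generated tests. States and events
# are assumptions, not a real platform's workflow.

ALLOWED = {
    ("draft", "submit_for_review"): "in_review",
    ("in_review", "approve"): "approved",
    ("in_review", "reject"): "draft",
}

def advance(state, event):
    if (state, event) not in ALLOWED:
        raise ValueError(f"cannot {event} from {state}")
    return ALLOWED[(state, event)]

def can_run_autonomously(state):
    return state == "approved"

state = "draft"                      # AI generated the test
assert not can_run_autonomously(state)
state = advance(state, "submit_for_review")
state = advance(state, "approve")    # human signed off
assert can_run_autonomously(state)
```

The invariant the gate enforces is simple: no path reaches "approved" without passing through human review.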

There's also the governance question. In regulated industries, you need to know exactly what was tested, when, by whom (or what), and what the results were. A black-box agent that produces a pass/fail verdict isn't enough. You need full audit trails: every action, every assertion, every decision point. If your agentic testing setup doesn't provide this, you'll end up building the audit layer yourself.
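What an audit trail entry might look like, as a sketch with illustrative field names (not a real schema): one timestamped, append-only record per agent action, including the reason the agent chose it.

```python
# Illustrative audit-trail record for one agent action. Field names
# are assumptions; a real system would define and version its schema.

import json
import datetime

def audit_entry(run_id, actor, action, target, outcome, reason):
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "run_id": run_id,
        "actor": actor,      # "agent" or a human reviewer's id
        "action": action,    # e.g. "click", "assert", "navigate"
        "target": target,
        "outcome": outcome,  # "pass" / "fail" / "skipped"
        "reason": reason,    # why the agent chose this action
    }

trail = []
trail.append(audit_entry("run-42", "agent", "click", "button[Submit]",
                         "pass", "objective requires submitting the form"))
line = json.dumps(trail[-1])  # one JSON line per action, append-only
assert trail[-1]["actor"] == "agent" and "timestamp" in trail[-1]
```

The "reason" field is the one scripted tests can never give you, and the one auditors increasingly ask for.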

What to look for when evaluating tools

This category is moving fast, and the gap between a polished demo and a production-ready platform is wide. A few things to pay attention to.

Does the agent run in actual browser instances, or a simulated environment? Your users interact with real browsers. Your tests should too.

Can you start small? You shouldn't have to go from zero to "the AI runs everything" overnight. The best platforms let you start with structured test management, layer in AI-generated tests with human review, and expand autonomy as the agent earns your trust through results. Not by demanding you hand over the keys on day one.

Is there real test management underneath the agent layer? Agents without structure are just sophisticated explorers. You need the ability to organize tests into plans and runs, track what's been covered, and maintain a clear picture of where your gaps are. The agent handles execution. The structure ensures the right things get executed.

Does the agent remember anything between runs? One that starts from scratch every time is wasting money and repeating mistakes. Look for platforms where the agent accumulates knowledge about your application: what the screens look like, which flows connect where, what normal behavior looks like. That accumulated context is what makes the agent faster and more accurate over time.

And can you actually see what the agent did? Every step it took, why it took it, what it checked. If you can't explain what happened during a test run, you can't trust the result. Screenshots and action logs aren't optional.

Where this is heading

Agentic testing is still early. Most teams haven't adopted it yet, and the tooling is changing month to month.

But the pieces are in place. Models keep getting better at understanding UIs and handling edge cases. Costs keep dropping. The infrastructure for running agents at scale (browser pools, orchestration, result management) is catching up. Within a year or two, this will go from "interesting new approach" to "how most teams run their regression suites."

The teams that benefit most won't be the ones who jumped in without preparation. They'll be the ones who built a foundation first: organized test cases, clear coverage goals, governance workflows. When you hand an agent well-structured context about what to test and why, it performs better than when you point it at an app and say "go."

Structure doesn't slow you down. It makes the agents smarter.


qtrl combines structured test management with autonomous QA agents that execute in real browsers, build adaptive memory of your application, and operate within your defined rules. Start with the structure, add AI when you're ready, and expand autonomy as trust grows. See how it works.