AI agent governance: why uniform rules fail testing teams
By qtrl Team · Engineering
Most teams that try a testing agent land in one of two places. Either the agent is locked down so tightly it can barely open a page, and people quietly go back to writing scripts by hand. Or it has broad access to a staging environment and runs on its own, and nobody can say with confidence what it touched overnight. Both happen a lot, and neither feels right.
That gap is now something Gartner has put a number on. The pattern underneath it has a name, a failure mode, and a fix that maps almost exactly onto how a QA team should think about letting an agent into its pipeline.
What Gartner actually said
In May 2026, Gartner published a release with a blunt title: applying uniform governance across AI agents will lead to enterprise AI agent failure. The headline prediction: by 2027, 40% of enterprises will demote or decommission autonomous AI agents because of governance gaps that only surfaced after a production incident.
The root cause, in Gartner's framing, is that organizations treat agent governance as binary. An agent is either locked down or fully trusted. When the same controls get applied to every agent regardless of what it does, two things go wrong. Simple agents get over-restricted, which slows delivery and pushes people toward shadow tooling they set up on their own. More autonomous agents get under-restricted, which raises operational, security, and compliance risk. One failure is quiet, the other is loud, and most teams manage to hit both at once.
This sits inside a broader skepticism Gartner has been signaling about the category. The same firm expects that more than 40% of agentic AI projects will be cancelled by the end of 2027, and its 2026 Hype Cycle for Agentic AI places the whole field at the peak of inflated expectations. The interesting part isn't the doom. It's the specific reason agents get pulled: not that the models were bad, but that the governance around them was the wrong shape.
Why all-or-nothing breaks down in QA
Testing is where this binary trap bites early, because a testing agent is useful at several different levels of trust, and they aren't the same job.
Lock it down too hard and you get the first failure mode. An agent that can only run a fixed, pre-approved script with no room to navigate is barely an agent. Your team stops reaching for it, and the people who liked the idea go set up their own browser automation in a corner of the repo nobody reviews. You wanted control. What you got was a tool nobody uses and a second tool nobody can see.
Open it up too far and you get the second one. An agent with write access to a production-like environment, running unattended, can place an order, cancel a subscription, or send an email to a real address that happens to be sitting in your staging data. The output looks clean. The screenshots are tidy. And then someone finds the test account did something it shouldn't have, and trust in the whole approach evaporates in an afternoon.
The fix isn't to pick a point on that dial and hope. It's to stop treating autonomy as a single dial at all.
Four autonomy levels, applied to a testing agent
Gartner's recommendation is proportional governance: classify agents across distinct autonomy levels, where each level is a different trust boundary with its own controls. The framework names four levels: observe, advise, act with approval, and act autonomously. They map cleanly onto what a testing agent actually does.
Read in order, those four levels form a path, not a switch. An agent can earn its way up as it proves itself, and different agents can sit on different rungs at the same time.
| Level | What the testing agent does | Governance that fits |
|---|---|---|
| Observe | Explores the app read-only, maps flows, flags what looks off | Read access, no state changes, full log of where it went |
| Advise | Suggests test cases and coverage gaps for a human to act on | Proposals only; nothing runs until a person says so |
| Act with approval | Generates and runs tests, but results stay in draft | A review gate; approval before a result counts as truth |
| Act autonomously | Runs an approved regression suite on a schedule | Scoped to known-safe paths, every action audited |
Notice that the controls change at every rung. The observe-level agent needs strong read access and not much else. The autonomous one needs a tight scope and a complete audit trail, because the cost of a wrong move is highest there. Applying the autonomous-level controls to an observe-level agent is the over-restriction Gartner warns about. Applying observe-level trust to an autonomous one is the under-restriction.
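One way to keep that mapping honest is to write it down as data rather than leave it in people's heads. The sketch below is only an illustration of the table above: the `Controls` fields, the `CONTROLS` mapping, and `controls_for` are hypothetical names made up for this example, not part of Gartner's framework or of any particular tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """The four levels, ordered so that a higher value means more autonomy."""
    OBSERVE = 1
    ADVISE = 2
    ACT_WITH_APPROVAL = 3
    ACT_AUTONOMOUSLY = 4


@dataclass(frozen=True)
class Controls:
    # Hypothetical control fields, chosen to mirror the table's third column.
    may_change_state: bool           # can the agent mutate the app under test?
    needs_human_approval: bool       # must a person sign off before output counts?
    restricted_to_known_paths: bool  # is the agent limited to pre-approved flows?
    audited: str                     # what gets recorded at this level


# Assumed mapping: one possible reading of the table, not a standard.
CONTROLS = {
    AutonomyLevel.OBSERVE: Controls(False, False, False, "where the agent went"),
    AutonomyLevel.ADVISE: Controls(False, True, False, "proposals made"),
    AutonomyLevel.ACT_WITH_APPROVAL: Controls(True, True, False, "runs and draft results"),
    AutonomyLevel.ACT_AUTONOMOUSLY: Controls(True, False, True, "every action"),
}


def controls_for(level: AutonomyLevel) -> Controls:
    """Look up the controls that fit a given autonomy level."""
    return CONTROLS[level]
```

Because the controls are looked up per level, moving an agent to a different rung changes its guardrails in one place instead of relying on someone to remember which rules belong together.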
Ability to act is not the same as scope of access
The sharpest line in Gartner's analysis is the distinction between what an agent is allowed to do and what it is allowed to reach. Teams collapse the two, and that collapse is where the incidents come from.
In testing terms: an agent can be highly autonomous and still tightly scoped. Letting an agent run an approved suite without a human watching each step is an ability-to-act decision. Whether that agent can touch real payment rails, see production customer data, or use live credentials is a scope-of-access decision. You can grant the first and withhold the second. The agent runs the checkout regression every night, on its own, against a sandboxed environment with synthetic data and secrets it never actually sees. High autonomy, narrow blast radius.
This is also where the practical guardrails live. Keeping secrets out of the agent's reach entirely, scoping it to a single project, and running against a dedicated environment are all scope decisions you can make independently of how much you let the agent decide for itself. Get that separation right and the scary version of autonomy mostly goes away.
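To make the separation concrete, here is a minimal sketch that stores the two decisions in separate fields, so neither can silently stand in for the other. Everything in it is an assumption for illustration: `Scope`, `AgentPolicy`, `may_reach`, and the project and environment names are hypothetical and do not describe any real product's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """Scope of access: what the agent is allowed to reach."""
    project: str
    environment: str
    can_read_secrets: bool


@dataclass(frozen=True)
class AgentPolicy:
    """Ability to act and scope of access, kept as independent settings."""
    runs_unattended: bool  # ability to act: may it run without a human watching?
    scope: Scope           # scope of access: decided separately


def may_reach(policy: AgentPolicy, project: str, environment: str,
              reads_secret_value: bool) -> bool:
    """Check a requested target against scope only; autonomy plays no part here."""
    scope = policy.scope
    if project != scope.project or environment != scope.environment:
        return False
    if reads_secret_value and not scope.can_read_secrets:
        return False
    return True


# Hypothetical nightly checkout agent: high autonomy, narrow blast radius.
nightly_checkout = AgentPolicy(
    runs_unattended=True,
    scope=Scope(project="checkout", environment="sandbox", can_read_secrets=False),
)
```

Note that `may_reach` never looks at `runs_unattended`. Raising or lowering the agent's autonomy leaves what it can touch exactly where it was, which is the point of keeping the two apart.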
What proportional governance looks like day to day
The workflow that comes out of this is the generate-review-approve loop, and it's worth being concrete about how it runs.
A new agent starts at observe. It explores, it adapts to UI changes, it tells you what it found. You read its proposals (advise) and pick the ones worth turning into tests. It generates and runs those, and the results land in a draft state for a human to confirm (act with approval). Once a suite has proven stable across enough runs, you promote it to run on its own (act autonomously), still scoped and still logged. An agent doesn't arrive trusted. It earns the next rung by being right on the current one.
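"Proven stable across enough runs" is easier to enforce if the team writes the rule down. The sketch below is one possible version, resting on two assumptions that are not from Gartner or any tool: a run counts as stable when it passed and a human confirmed the result, and the default threshold of 20 consecutive runs is an arbitrary placeholder to tune.

```python
def eligible_for_promotion(run_history: list[bool],
                           required_stable_runs: int = 20) -> bool:
    """Return True when the most recent runs were all stable.

    run_history holds one entry per run, oldest first; True means the run
    passed and a human confirmed it (an assumed definition of "stable").
    """
    if required_stable_runs <= 0:
        raise ValueError("required_stable_runs must be positive")
    if len(run_history) < required_stable_runs:
        return False  # not enough evidence yet
    # Only the latest window counts, so an old failure does not block forever.
    return all(run_history[-required_stable_runs:])
```

A passing check here only makes a suite a candidate. The promotion itself is still a decision a person makes.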
Two things make this hold together. The first is a real audit trail: every action, every assertion, every decision point, recorded so you can answer what was tested, when, and by which agent. Without it, "act autonomously" is just hoping. The second is that the promotion between levels is a deliberate, reversible choice, not a config flag someone set once and forgot. If an autonomous suite starts behaving oddly, you demote it back to act-with-approval and look at the logs. That demotion is the system working, not failing.
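Both properties can be sketched in a few lines. In this illustration every level change goes through one method that demands a named approver and a reason, and writes an audit entry, so a promotion can't be a forgotten flag and a demotion is just another recorded decision. The `GovernedAgent` and `AuditEvent` classes and their fields are hypothetical, not the API of any real tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

LEVELS = ("observe", "advise", "act_with_approval", "act_autonomously")


@dataclass(frozen=True)
class AuditEvent:
    at: datetime  # when it happened (UTC)
    agent: str    # which agent
    actor: str    # who or what triggered it
    action: str   # what kind of event
    detail: str   # free-form description


@dataclass
class GovernedAgent:
    name: str
    level: str = "observe"  # an agent does not arrive trusted
    audit_log: list[AuditEvent] = field(default_factory=list)

    def record(self, actor: str, action: str, detail: str) -> None:
        """Append one entry to the audit trail."""
        now = datetime.now(timezone.utc)
        self.audit_log.append(AuditEvent(now, self.name, actor, action, detail))

    def set_level(self, new_level: str, approved_by: str, reason: str) -> None:
        """Promote or demote; either direction needs an approver and a reason."""
        if new_level not in LEVELS:
            raise ValueError(f"unknown level: {new_level}")
        old_level = self.level
        self.level = new_level
        self.record(approved_by, "level_change",
                    f"{old_level} -> {new_level}: {reason}")
```

Usage might look like this, with made-up names:

```python
agent = GovernedAgent("checkout-regression")
agent.set_level("act_autonomously", approved_by="qa-lead", reason="suite stable")
agent.set_level("act_with_approval", approved_by="qa-lead", reason="flaky overnight runs")
```

After the second call the agent is back behind the review gate, and the log shows who moved it and why.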
None of this is unique to regulated industries, though they feel it first. The accountability question of who signed off on an agent acting on its own is one every team ends up answering, usually right after the first surprise.
Proportional governance for testing agents: FAQ
Isn't this just role-based access control with extra steps?
RBAC covers part of it, the scope-of-access half. The autonomy levels add the other half: how much the agent is allowed to decide and act without a human in the loop. You need both, and Gartner's point is that teams tend to set one and assume it covers the other.
How many agents actually need the top level?
Fewer than you'd think. Most of the value in agentic testing shows up at observe and act-with-approval, where the agent does the tedious exploration and authoring and a human still confirms. Full autonomy is worth reserving for stable, well-understood suites where the cost of a missed check is low and the path is known.
Does proportional governance slow adoption down?
It does the opposite. The thing that stalls adoption is a bad incident that makes leadership distrust the whole idea. Starting agents low and promoting them on evidence is how you avoid the incident that gets the program cancelled, which is exactly the outcome Gartner is forecasting for teams that skip this.
Where do I start if I have one agent and no framework?
Put it at observe for a week. Read what it surfaces. The act of reviewing its output at the lowest level tells you whether you trust it enough to move it up, and it costs you almost nothing if the answer is no.
qtrl is built around exactly this idea. Agents operate within rules you set, results flow through a review-and-approve workflow, secrets stay out of the agent's reach, and every action lands in an audit trail. Autonomy is something an agent earns by proving value, not something you switch on and hope. If you're comparing options, here's how the agentic testing tools stack up. See how qtrl handles permissioned autonomy.