
EU AI Act Part 2: How to Test Non-Deterministic AI

By qtrl Team · Engineering

This is Part 2 of our series on the EU AI Act and QA. Part 1 covered what the Act means for your testing process. This post gets into the practical side: how do you actually test AI systems that don't produce the same output twice?

Traditional software testing rests on a simple assumption: given the same input, you get the same output. You write an assertion, run it a thousand times, and it either passes or fails. Deterministic behavior. Clean, binary, done.

AI doesn't work that way. Ask a language model the same question twice and you'll get two different answers. Run an image classifier on slightly rotated versions of the same photo and the confidence scores shift. Feed an AI-powered recommendation engine the same user profile on Monday and Friday, and the suggestions change because the model learned something in between.

The EU AI Act doesn't care about this distinction. Article 9 requires testing against "prior defined metrics and probabilistic thresholds." Article 15 requires "appropriate levels of accuracy, robustness and cybersecurity" with consistent performance. The regulation expects you to prove your AI works correctly, even when "correctly" doesn't mean "identically every time."

So how do you test something that gives different answers every time, in a way that satisfies both your engineering standards and a regulatory audit?

Why traditional assertions break down

If you're testing a checkout flow, you assert that the total equals the sum of items plus tax. Either it does or it doesn't. The test is pass/fail because the behavior is deterministic.

Now imagine you're testing an AI-powered search feature. A user types "running shoes for flat feet" and the system returns ten results. Are those the right ten results? There's no single correct answer. There's a range of acceptable answers, and that range depends on context, training data, and the model's internal state.

This is where most QA teams get stuck. They try to apply deterministic testing patterns to non-deterministic systems, and the result is either tests that are so loose they catch nothing, or tests that are so tight they fail randomly. Both are useless. The first gives you false confidence. The second gives you alert fatigue.

You need different testing techniques. Not weaker ones. Different ones.

Metamorphic testing: checking relationships instead of exact outputs

Metamorphic testing is the technique most directly useful for AI systems under the EU AI Act. Instead of asserting that the output matches a specific expected value, you assert that the relationship between inputs and outputs holds.

Here's the idea. You start with an input and record the output. Then you transform the input in a way where you know what should happen to the output. If the relationship breaks, something is wrong.

A few examples. Say your model classifies "This product is great" as positive. Then "This product is absolutely great" should also be positive. Adding emphasis shouldn't flip the classification. That's a metamorphic relation: the output property (positive sentiment) should be invariant under that input transformation.
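That invariance relation translates directly into a test. The sketch below uses a hypothetical `classify_sentiment` function as a stand-in for a real model call, stubbed with a trivial keyword heuristic so the example is self-contained:

```python
def classify_sentiment(text: str) -> str:
    # Hypothetical stand-in for a real model inference call;
    # replace with your actual model client.
    return "positive" if "great" in text.lower() else "negative"

def add_emphasis(text: str, word: str = "absolutely") -> str:
    # Input transformation under which the label should be invariant:
    # insert an emphasis word before the final token.
    tokens = text.split()
    return " ".join(tokens[:-1] + [word, tokens[-1]])

def test_emphasis_invariance():
    original = "This product is great"
    transformed = add_emphasis(original)
    # Metamorphic relation: adding emphasis must not flip the label.
    assert classify_sentiment(transformed) == classify_sentiment(original)
```

Note that the test never states what the "correct" label is; it only asserts that the two calls agree.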

Same idea in a medical context. If a patient profile with a heart rate of 120 gets flagged as urgent, the same profile with a heart rate of 140 should also be flagged as urgent, or more so. Increasing severity shouldn't lower the risk score.

The most powerful application for compliance is bias detection. Change a candidate's name or gender on an otherwise identical resume. If the ranking changes, you've found a discriminatory pattern, which is exactly what Article 10 of the Act requires you to test for.
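A counterfactual swap check can be sketched in a few lines. `score_resume` below is a hypothetical ranking model, stubbed to score on experience only so the check passes by construction; in practice it would be your real model call:

```python
def score_resume(profile: dict) -> float:
    # Hypothetical model call; a fair model ignores name and gender.
    return 50.0 + 5.0 * profile["years_experience"]

def counterfactual_variants(profile: dict, attribute: str, values: list) -> list:
    # Generate profiles that differ ONLY in the protected attribute.
    return [{**profile, attribute: v} for v in values]

def max_score_gap(profile: dict, attribute: str, values: list) -> float:
    scores = [score_resume(p) for p in counterfactual_variants(profile, attribute, values)]
    return max(scores) - min(scores)

profile = {"name": "Alex", "gender": "F", "years_experience": 6}
gap = max_score_gap(profile, "gender", ["F", "M", "X"])
assert gap == 0.0  # ranking must not move when only gender changes
```

Any nonzero gap here is the discriminatory pattern Article 10 asks you to look for.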

The power of metamorphic testing is that you don't need to know the "right" answer. You just need to know what shouldn't change (or how it should change) when you modify the input. That makes it practical for non-deterministic systems where pinning down exact outputs is impossible.

Statistical acceptance bands: defining "good enough" with numbers

The EU AI Act requires that accuracy levels be declared in the system's instructions of use (Article 15). That means you need actual numbers. Not "our model is accurate." Numbers.

Statistical acceptance bands give you a framework for this. Instead of asserting exact outputs, you define acceptable performance ranges and test whether the system stays within them over a statistically meaningful number of runs.

Say your AI classification system needs to achieve 95% accuracy on a benchmark dataset. You don't run the test once. You run it across the full dataset, calculate the accuracy, and check whether it falls within your declared band (say, 93% to 97%, accounting for variance). You also track this over time. If accuracy drifts below the lower bound, something changed and you need to investigate.
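As a sketch, the band check is just a comparison against pre-declared bounds; the predictions and labels below are synthetic, and the 93%–97% band mirrors the example above:

```python
def accuracy(predictions, labels):
    # Fraction of predictions that match the ground-truth labels.
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

def within_band(value: float, lower: float, upper: float) -> bool:
    # Declared BEFORE testing, per Article 9's "prior defined" requirement.
    return lower <= value <= upper

# Synthetic benchmark run: 100 examples, 95 classified correctly.
labels = [1] * 100
predictions = [1] * 95 + [0] * 5

acc = accuracy(predictions, labels)
assert within_band(acc, lower=0.93, upper=0.97), f"accuracy {acc:.3f} outside declared band"
```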

Which metrics you track depends on the system. For classification, that's precision, recall, and F1 score, broken down by class. A model that's 95% accurate overall but 60% accurate on a minority class has a bias problem hiding behind a good average.
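Per-class slicing is a small amount of code. The data below is synthetic, with class "b" as the minority; the overall figure looks healthy while the minority class sits at 60%:

```python
from collections import defaultdict

def per_class_accuracy(predictions, labels):
    # Accuracy broken down by true class, so a weak class can't hide
    # behind a good overall average.
    correct, total = defaultdict(int), defaultdict(int)
    for p, l in zip(predictions, labels):
        total[l] += 1
        correct[l] += (p == l)
    return {c: correct[c] / total[c] for c in total}

labels      = ["a"] * 90 + ["b"] * 10
predictions = ["a"] * 90 + ["b"] * 6 + ["a"] * 4  # minority class suffers

acc = per_class_accuracy(predictions, labels)
# Overall accuracy is 96%, but class "b" is only 60% -- the hidden problem.
```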

Generative AI is harder. You're looking at semantic similarity scores, factual accuracy rates checked against ground truth, toxicity rates, hallucination frequency on known-answer questions. None of these are as clean as a classification metric, but they're measurable and they're trackable, which is what the Act cares about. Recommendation engines sit somewhere in between: precision@K and recall@K are well-established, but you also need diversity metrics to catch demographic blind spots.

The important thing for compliance: document these thresholds before you test, not after. Article 9 says "prior defined." If you pick your thresholds after seeing the results, an auditor will notice.

Adversarial robustness: testing how your AI fails

Article 15 requires AI systems to be "resilient regarding errors, faults, or inconsistencies that may occur within the system or the environment." In plain language: your AI needs to handle bad inputs gracefully, and you need to prove it.

Adversarial testing means deliberately trying to break your AI. Not in a vague, exploratory way. In a structured, documented way that produces auditable results.

For most teams, this means three categories of tests:

Input perturbation

Take valid inputs and make small, realistic changes. Typos in text. Noise in images. Missing fields in structured data. Your AI should either handle these gracefully or explicitly flag that it can't process them. What it shouldn't do is silently produce garbage output with high confidence.
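A minimal perturbation harness might look like this. `classify` is a hypothetical model call, stubbed here so the example runs on its own; the property under test is that typos either leave the answer intact or trigger an explicit abstention:

```python
import random

def add_typos(text: str, n: int, seed: int = 0) -> str:
    # Deterministic typo injection so test runs are reproducible.
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n):
        i = rng.randrange(len(chars))
        chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def classify(text: str) -> tuple:
    # Hypothetical model: returns (label, confidence) and abstains
    # when it can't find anything it recognizes.
    known = {"refund", "shipping", "returns"}
    if set(text.lower().split()) & known:
        return ("routed", 0.9)
    return ("abstain", 0.0)

clean = classify("where is my refund")
perturbed = classify(add_typos("where is my refund", n=5, seed=1))

# Graceful behavior: the label survives the typos, or the model
# explicitly abstains. A confident garbage answer fails this check.
assert perturbed[0] == clean[0] or perturbed[0] == "abstain"
```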

Boundary probing

Every AI system has edge cases where it performs poorly. The Act doesn't expect perfection. It expects you to know where those edges are and document them. Test inputs at the boundaries of your training distribution. Inputs in languages or formats the model wasn't trained on. Extremely short or extremely long inputs. The goal is to map the failure boundary, not eliminate it.
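Boundary probes produce a failure map you can put in the audit record. `summarize` below is a hypothetical model call with a stubbed length limit; the names and the limit are illustrative:

```python
MAX_TOKENS = 512  # illustrative limit for the stub

def summarize(text: str) -> dict:
    # Hypothetical model call that rejects inputs outside its envelope
    # instead of silently degrading.
    tokens = text.split()
    if not tokens:
        return {"status": "rejected", "reason": "empty input"}
    if len(tokens) > MAX_TOKENS:
        return {"status": "rejected", "reason": "input too long"}
    return {"status": "ok", "summary": " ".join(tokens[:10])}

# Probe the edges and record where behavior changes.
probes = {
    "empty": "",
    "single_word": "hello",
    "at_limit": "word " * MAX_TOKENS,
    "over_limit": "word " * (MAX_TOKENS + 1),
}
failure_map = {name: summarize(text)["status"] for name, text in probes.items()}
```

The point is the `failure_map`: a documented statement of where the system's envelope ends, which is what the Act asks for.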

Adversarial attacks

For high-risk systems, you should also test against known attack patterns. Prompt injection for language models. Adversarial patches for vision systems. Data poisoning scenarios for systems that learn from user feedback. The Act specifically calls out cybersecurity resilience, and adversarial attacks are part of that surface.

Document what you tested, what failed, and what mitigations you put in place. The Act doesn't require that your system is invulnerable. It requires that you assessed the risks and addressed them proportionately.

Drift monitoring: testing doesn't stop at deployment

We covered post-deployment monitoring in Part 1, but it's worth going deeper here because drift is the specific mechanism that makes AI systems unreliable over time.

There are two kinds of drift to watch for. Data drift is when the inputs your system sees in production start looking different from the data it was trained on. Seasonal patterns, changing user demographics, a new product category. The model's performance degrades not because it changed, but because the world around it did.

Then there's concept drift, which is sneakier. The relationship between inputs and the correct output shifts over time. What counted as a good recommendation six months ago no longer does. This is especially common in systems that learn continuously, and Article 15 specifically flags it: systems that keep learning after deployment must be designed to prevent feedback loops that produce biased outputs.

In practice, drift monitoring means running your statistical acceptance tests continuously against production data, not just against a frozen benchmark. Set up alerts when metrics drop below your declared thresholds. Log everything. When drift triggers an alert, that's a compliance event: you need to investigate, document what happened, and either retrain or adjust your declared accuracy levels.
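The mechanics can be as simple as a rolling window over production outcomes, checked against the declared lower bound. The window size and threshold below are illustrative:

```python
from collections import deque

LOWER_BOUND = 0.93  # declared accuracy lower bound from your instructions of use

class DriftMonitor:
    def __init__(self, window_size: int = 200):
        # Rolling window of correct/incorrect outcomes from production.
        self.window = deque(maxlen=window_size)

    def record(self, correct: bool) -> None:
        self.window.append(correct)

    def accuracy(self) -> float:
        return sum(self.window) / len(self.window)

    def alert(self) -> bool:
        # Fires only on a full window, so a few early errors don't page anyone.
        return len(self.window) == self.window.maxlen and self.accuracy() < LOWER_BOUND

monitor = DriftMonitor(window_size=100)
for outcome in [True] * 90 + [False] * 10:  # 90% over the last 100 calls
    monitor.record(outcome)
assert monitor.alert()  # 0.90 < 0.93: a compliance event to investigate
```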

If you're using qtrl, this maps directly to running the same test suites across development, staging, and production. Your compliance-critical tests don't live in a separate workflow. They run alongside your functional tests, on the same schedule, with the same audit trail.

Bias testing: the part most teams skip

Article 10 of the Act requires that training, validation, and test datasets be examined for "possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination."

Most teams test for accuracy. Far fewer test for fairness. And the ones that do often treat it as a one-time audit rather than a continuous process.

Bias testing for EU AI Act compliance means checking your system's performance across protected characteristics: age, gender, ethnicity, disability status, and others depending on your domain. The technique is straightforward even if the execution takes effort:

  • Slice your test results by demographic group. If overall accuracy is 95% but accuracy for a specific group is 78%, you have a disparity that needs addressing.
  • Use metamorphic tests (as described above) to check for discriminatory behavior. Swap protected characteristics in otherwise identical inputs and compare outputs.
  • Test with representative datasets. If your training data skews toward one demographic, your test data should deliberately include underrepresented groups so you can measure the gap.
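The first of those steps, slicing by group, can be sketched like this. The records are synthetic and the 5-point tolerance is an illustrative assumption, not a figure from the Act:

```python
def slice_accuracy(records, attribute):
    # Accuracy per demographic group.
    groups = {}
    for r in records:
        g = groups.setdefault(r[attribute], [0, 0])
        g[0] += r["correct"]
        g[1] += 1
    return {k: c / n for k, (c, n) in groups.items()}

def disparity_flags(records, attribute, max_gap=0.05):
    # Flag any group whose accuracy trails the overall figure by more
    # than the tolerance. The tolerance itself is a policy decision.
    overall = sum(r["correct"] for r in records) / len(records)
    by_group = slice_accuracy(records, attribute)
    return {g: acc for g, acc in by_group.items() if overall - acc > max_gap}

records = (
    [{"group": "A", "correct": True}] * 95 + [{"group": "A", "correct": False}] * 5 +
    [{"group": "B", "correct": True}] * 39 + [{"group": "B", "correct": False}] * 11
)
flags = disparity_flags(records, "group")  # group B trails and gets flagged
```

Putting this in the regression suite, rather than a one-off notebook, is what turns it into the continuous process the previous paragraph describes.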

Build these checks into your regression suite. Bias isn't something you test for once. It's something you watch for continuously, because models can develop new biases as data distributions shift over time.

Where to start if this is new to you

If you're a QA lead looking at this list and wondering how to get from zero to compliant, don't try to do all of it at once. Here's the order we'd recommend.

First, figure out which AI features in your product are high-risk. Annex III of the Act lists the domains: healthcare, employment, education, financial services, law enforcement. If your AI touches any of those, start there. Write metamorphic test cases that check the properties that matter most for compliance: fairness, monotonicity (more severe inputs should produce more severe outputs), and consistency under minor input variations.

Next, get your data scientists, product managers, and legal team in a room and agree on acceptance thresholds. What does "good enough" look like in numbers? Write those numbers down before you run the tests. Article 9 says "prior defined," which means after-the-fact thresholds won't fly.

From there, layer in adversarial tests. Input perturbation is the easiest starting point. Boundary probing comes next. If your system is high-risk, add adversarial attack testing. Then set up drift monitoring against production data on a regular schedule, and make sure bias checks are part of your regression suite rather than a separate quarterly project.

The techniques are only half the compliance picture, though. The other half is proving you did them. Every test run, every metric, every threshold decision needs to be traceable and retrievable when someone asks. That's where structured test management earns its keep.

The tools gap is real

One honest observation: the tooling for AI-specific testing is still maturing. Most testing frameworks were built for deterministic software. They work well for functional tests, API tests, and end-to-end browser tests. They weren't designed for statistical acceptance bands, metamorphic relations, or drift monitoring.

Some teams are building custom harnesses on top of pytest or Jest. Others are using specialized libraries like Google's ML Test Score rubric or Fairlearn for bias detection. The challenge isn't running the tests. It's connecting them to a compliance workflow where results link back to requirements and every run produces an auditable record.

This is the gap we're building toward at qtrl. Your metamorphic checks and acceptance band validations need to live alongside your functional tests in the same traceability chain: requirement to test case to recorded result, with full history. When the auditor asks how you validated your AI's fairness, you pull up the test plan, the execution records, and the trend line over the last six months. That's what compliance actually looks like.

Non-deterministic doesn't mean untestable

We hear this a lot from QA teams: "AI is too unpredictable to test rigorously." We get the instinct. If you've spent your career writing assertions against exact expected values, a system that gives different answers every time feels like it's rejecting the premise of testing itself.

But the premise was always too narrow. Testing isn't about checking exact outputs. It's about building justified confidence that a system behaves acceptably. Metamorphic relations, acceptance bands, adversarial probes, drift alerts, demographic slicing: these are just the tools for doing that when the system is probabilistic instead of deterministic.

The EU AI Act makes these techniques mandatory for high-risk systems shipping into Europe. But even without the regulation, they're the right way to test AI. The teams already doing this will pass their first conformity assessment without scrambling. Everyone else will be retrofitting under deadline pressure.


This is Part 2 of our EU AI Act series. Read Part 1: What Your Testing Process Needs by August for the full overview of the Act's QA requirements and compliance deadlines.

qtrl gives your team structured test management with full traceability, immutable audit trails, and the ability to run the same suites across dev, staging, and production. If you're building AI features that need to hold up under the EU AI Act, try it free and see how it fits.