
How to catch security bugs in AI-generated code before they ship

By qtrl Team · Engineering

Your team adopted Cursor or Copilot six months ago. Pull requests are up. Features land faster. And somewhere in the last few deploys, a SQL injection slipped into production that nobody caught, because nobody wrote that code by hand and nobody reviewed it with security in mind.

It's not an isolated incident. The velocity gains from AI coding tools are real. So are the security gaps they introduce.

The numbers: what the data actually says

Veracode tested AI-generated code from more than 100 large language models across 80+ coding tasks. 45% of samples contained security vulnerabilities, including OWASP Top 10 flaws. Java was the worst performer at over 70%. For context-specific flaws like cross-site scripting and log injection, fewer than 1 in 7 samples passed.

Failure rate by language:

  • Java: 72%
  • C#: 45%
  • JavaScript: 43%
  • Python: 38%

Failure rate by vulnerability type:

  • Log injection: 88%
  • Cross-site scripting: 86%

Source: Veracode 2025 GenAI Code Security Report (100+ models, 80+ tasks)
Security failure rates in AI-generated code. Larger models didn't produce meaningfully more secure code than smaller ones.

That's not an outlier finding. Stanford researchers found that developers using AI coding assistants were more likely to introduce security bugs and, more concerning, more likely to rate their insecure code as secure. Only 67% of AI-assisted participants produced correct solutions, compared to 79% in the control group. The AI creates a false confidence that makes the problem harder to catch.

CodeRabbit's analysis of open-source pull requests confirmed the pattern: AI-generated code was 2.74x more likely to contain XSS vulnerabilities, 1.91x more likely to include insecure direct object references, and 1.88x more likely to introduce improper password handling.

Why AI code has different security bugs

AI-generated code doesn't just have more security bugs. It has different ones from those you're used to catching.

Human developers who write insecure code usually do so from ignorance or shortcuts: they skip input validation because they're rushing, or they don't know about a specific attack vector. The pattern is predictable. Security training and code review catch most of it.

AI models produce a different failure mode. They've been trained on millions of code samples, including insecure ones. They optimize for code that looks correct and compiles cleanly. They don't reason about security. An AI will generate a perfectly structured authentication flow that's vulnerable to timing attacks, because the training data contained plenty of authentication code with the same flaw.
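To make the timing-attack example concrete, here is a minimal Python sketch. The function names are hypothetical and the token handling is simplified. It assumes the flaw in question is a plain equality check on a secret, which is one common form of the problem, not the only one.

```python
import hmac


def verify_token_naive(supplied: str, expected: str) -> bool:
    # Hypothetical example of the flawed pattern: `==` can return as soon as
    # it finds a mismatch, so response time may leak how much of the token
    # was correct.
    return supplied == expected


def verify_token_constant_time(supplied: str, expected: str) -> bool:
    # hmac.compare_digest is designed so that comparison time does not depend
    # on where the two inputs differ.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

Both functions return the same answers, which is why the naive version passes functional tests and reads as correct in review.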

We keep seeing the same three patterns:

The happy-path problem

AI excels at the main flow. It builds clean login pages, functional CRUD endpoints, well-structured API routes. But it consistently skips the security edges: what happens when the session token is malformed, when the request body contains unexpected types, when a user sends a request they shouldn't have access to. Your existing tests probably check the happy path too, which means the AI's blind spots and your test suite's blind spots overlap.
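As an illustration of the edges that get skipped, here is a sketch of request validation that rejects unexpected types instead of assuming the happy path. The endpoint, field names, and rules are hypothetical. A real service would likely use a schema validation library.

```python
from typing import Any


def parse_transfer_request(body: Any) -> dict:
    """Validate a hypothetical JSON request body; raise ValueError on anything unexpected."""
    if not isinstance(body, dict):
        raise ValueError("body must be a JSON object")

    recipient = body.get("recipient_id")
    amount = body.get("amount_cents")

    if not isinstance(recipient, str) or not recipient:
        raise ValueError("recipient_id must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int):
        raise ValueError("amount_cents must be an integer")
    if amount <= 0:
        raise ValueError("amount_cents must be positive")

    return {"recipient_id": recipient, "amount_cents": amount}
```

A happy-path test only sends a well-formed body. The cases worth adding are a list instead of an object, a string where a number belongs, and a negative amount.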

The dependency problem

AI models reach for libraries. When you prompt "build a file upload endpoint," the AI pulls in packages it's seen in training data. Some of those packages are outdated, unmaintained, or have known CVEs. The Cloud Security Alliance found that AI coding tools are increasingly a vector for supply chain exposure, because developers trust the AI's library choices without auditing them.

The iteration trap

Teams use AI to fix AI-generated code. An IEEE-ISTAS study showed this makes security worse, not better. Researchers tested 400 AI-generated code samples and found a 37.6% increase in critical vulnerabilities after five rounds of iterative refinement. Average vulnerabilities per sample climbed from 2.1 in the first iteration to 6.2 by the tenth, even when the prompts explicitly asked for security improvements.
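For a sense of scale, the per-sample averages quoted above work out to roughly a threefold increase. The snippet only restates the two reported figures.

```python
first_iteration = 2.1   # average vulnerabilities per sample, first iteration (as reported above)
tenth_iteration = 6.2   # average vulnerabilities per sample, tenth iteration (as reported above)

growth_factor = tenth_iteration / first_iteration
print(f"{growth_factor:.2f}x")  # prints "2.95x"
```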

Five layers of defense that actually work

Catching security bugs in AI-generated code isn't one tool or one process. It's a stack. Each layer catches what the ones below miss.

  • Layer 5: Mindset: treat AI code as untrusted. Same rigor as external contributions.
  • Layer 4: E2E security tests. Auth, authorization, payments, access control.
  • Layer 3: Security-focused code review. Checklist on every AI-generated PR.
  • Layer 2: Dependency auditing. SCA, new-package review, supply chain.
  • Layer 1: Static analysis on every commit. Semgrep + ESLint security rules + secret detection.

The foundational layer (bottom) catches the most. Each layer catches what the ones below miss.
A layered defense for AI-generated code. Static analysis is the foundation; each layer above catches a different class of bug.

Layer 1: Static analysis on every commit

This is the easiest layer to automate and the one with the highest catch rate. Run Semgrep and ESLint security rules on every commit, either in a pre-commit hook or in your CI pipeline.

Semgrep catches security-specific patterns that ESLint misses: SQL injection, XSS, insecure direct object references, hardcoded secrets, and dangerous deserialization. ESLint's security plugin catches JavaScript-specific issues like unsafe eval() usage and prototype pollution patterns.
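For reference, this is the kind of pattern an injection rule is looking for, sketched in Python with the standard library's sqlite3 module. The table, function names, and payload are made up for illustration.

```python
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable pattern: user input is spliced into the SQL text.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized query: the driver passes `name` as data, never as SQL.
    return conn.execute("SELECT id, name FROM users WHERE name = ?", (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "nobody' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # 2: the injected OR clause matches every row
print(len(find_user_safe(conn, payload)))    # 0: the payload is treated as a literal name
```

Both functions behave identically for ordinary names, so a functional test will not tell them apart. A pattern-based scanner can.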

The key: don't just run them. Block the merge if they flag critical findings. AI-generated code gets a pass from human review more often than it should. Static analysis doesn't get tired or distracted.
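One way to turn findings into a hard gate is a small script in CI that reads the scanner's report and returns a non-zero exit code. The sketch below assumes a JSON report shaped like Semgrep's `--json` output: a top-level `results` list whose items carry `check_id`, `path`, and `extra.severity`. The field names and the `ERROR` severity label are assumptions to verify against the scanner version you run. `blocking_findings` and `main` are hypothetical helper names.

```python
import json


def blocking_findings(report: dict, blocking_severities=("ERROR",)) -> list:
    """Return findings whose severity should block the merge.

    Assumes a Semgrep-style report layout; adjust the keys for your scanner.
    """
    return [
        finding
        for finding in report.get("results", [])
        if finding.get("extra", {}).get("severity") in blocking_severities
    ]


def main(report_path: str) -> int:
    with open(report_path, encoding="utf-8") as handle:
        report = json.load(handle)
    blockers = blocking_findings(report)
    for finding in blockers:
        print(f"BLOCKING {finding.get('check_id')} in {finding.get('path')}")
    # A non-zero return value, passed to sys.exit, fails the CI job.
    return 1 if blockers else 0
```

In a pipeline step, the script would end with `sys.exit(main(path_to_report))`, and branch protection would require that job to pass.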

Layer 2: Dependency auditing

Every AI-generated import is a trust decision. Treat it like one.

Run npm audit, Snyk, or Dependabot on every PR. But go beyond known CVEs. Check whether the AI pulled in a package that's unmaintained, has very low weekly downloads, or was last published years ago. AI models recommend packages from their training data, which can be outdated. A package that was popular in 2023 might be abandoned with unfixed vulnerabilities in 2026.
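The checks beyond known CVEs can be automated once you have the package metadata. The sketch below assumes you have already fetched a last-published date and a weekly download count from your registry. The function name and both thresholds are illustrative assumptions, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def staleness_flags(
    last_published: datetime,
    weekly_downloads: int,
    now: datetime,
    max_age_days: int = 730,
    min_weekly_downloads: int = 1000,
) -> list:
    """Return human-readable reasons a package deserves a closer look."""
    flags = []
    if now - last_published > timedelta(days=max_age_days):
        flags.append("last published more than %d days ago" % max_age_days)
    if weekly_downloads < min_weekly_downloads:
        flags.append("fewer than %d weekly downloads" % min_weekly_downloads)
    return flags


# Example: a package last published three years before the check, with few downloads.
now = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(staleness_flags(datetime(2023, 1, 1, tzinfo=timezone.utc), 50, now))  # both checks fire
```

An empty list does not mean a package is safe. It only means these two signals did not fire.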

The practical step: add a policy that any new dependency introduced by AI must be reviewed by a human before merge. This catches supply chain risk before it enters your lockfile.
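A policy like this is easier to enforce when CI lists the new dependencies for the reviewer. Here is a minimal sketch that compares two parsed package.json-style manifests, one from the base branch and one from the PR. `new_dependencies` is a hypothetical helper. It looks only at direct dependencies, so transitive packages in the lockfile would need a separate check.

```python
def new_dependencies(base_manifest: dict, head_manifest: dict) -> set:
    """Names present in the PR's manifest but not in the base branch's.

    Only the "dependencies" and "devDependencies" sections are compared.
    """
    def names(manifest: dict) -> set:
        found = set()
        for section in ("dependencies", "devDependencies"):
            found.update(manifest.get(section, {}))
        return found

    return names(head_manifest) - names(base_manifest)
```

A CI step could fail, or request a specific reviewer, whenever the returned set is non-empty.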

Layer 3: Security-focused code review for AI PRs

Standard code review asks "does this work?" Security review asks "how can this be abused?"

For AI-generated code, add a lightweight security checklist to your PR template:

  • Does this validate all user inputs at system boundaries?
  • Does this check authorization before accessing resources?
  • Does this handle authentication tokens and secrets correctly?
  • Does this avoid exposing internal error details to the client?
  • Does this sanitize output that gets rendered in HTML?

Not bureaucracy. A forcing function. AI code looks clean, passes lint, and compiles without warnings. The security gaps hide in what the code doesn't do, and a checklist makes those gaps visible.
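Two of the checklist items, sanitizing rendered output and hiding internal error details, are easy to show in a few lines. This is a sketch using only the Python standard library. The function names and response shape are hypothetical, and a real application would normally rely on its template engine's auto-escaping.

```python
import html
import logging
import uuid

logger = logging.getLogger("app")


def render_comment(author: str, text: str) -> str:
    # Escape user-controlled values before placing them in HTML.
    return "<p><b>{}</b>: {}</p>".format(html.escape(author), html.escape(text))


def error_response(exc: Exception) -> dict:
    # Log the details server-side; return only a generic message and a
    # correlation id the client can quote to support.
    error_id = uuid.uuid4().hex
    logger.error("request failed [%s]: %r", error_id, exc)
    return {"error": "Internal error", "error_id": error_id}
```

In review, the question is whether the generated code has an equivalent of each function, or whether it formats raw input into markup and returns `str(exc)` to the caller.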

Layer 4: End-to-end tests for security-critical flows

Here's where most teams have the biggest gap. Unit tests verify that a function returns the right value. They don't verify that your authentication flow is secure against token replay, that your payment endpoint rejects tampered amounts, or that your API properly enforces role-based access.

For the flows that matter most (auth, payments, data access, file uploads, admin operations), write end-to-end tests that exercise the security boundaries:

  • Can a logged-out user access a protected route?
  • Can user A access user B's data?
  • Does the API reject malformed tokens instead of failing open?
  • Do rate limits actually work under load?

Integration-level security bugs are what static analysis misses and what AI-generated code consistently introduces.
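The first three questions above translate directly into tests. The sketch below runs against a tiny in-process stand-in so that it stays self-contained. Every name in it (`SESSIONS`, `DOCUMENTS`, `get_document`) is hypothetical. Real end-to-end tests would send the same requests to a deployed environment over HTTP or through a browser.

```python
# Hypothetical stand-in for a real service: a session store and one handler.
SESSIONS = {"token-a": "user_a", "token-b": "user_b"}
DOCUMENTS = {"doc-1": {"owner": "user_a", "body": "A's notes"}}


def get_document(token, doc_id):
    """Return (status_code, body). Any doubt about identity or ownership is a denial."""
    user = SESSIONS.get(token) if isinstance(token, str) else None
    if user is None:
        return 401, None  # missing or malformed token: fail closed
    doc = DOCUMENTS.get(doc_id)
    if doc is None or doc["owner"] != user:
        return 404, None  # do not reveal that another user's document exists
    return 200, doc["body"]


def test_logged_out_user_is_rejected():
    assert get_document(None, "doc-1") == (401, None)


def test_user_b_cannot_read_user_a_document():
    assert get_document("token-b", "doc-1")[0] == 404


def test_malformed_token_fails_closed():
    for bad in ("", "token-a ", {"$ne": None}, 12345):
        assert get_document(bad, "doc-1")[0] == 401


def test_owner_can_read_own_document():
    assert get_document("token-a", "doc-1") == (200, "A's notes")
```

The last test matters too. A suite that only checks denials will pass against an endpoint that rejects everyone.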

Layer 5: Treat AI code as untrusted by default

The Cloud Security Alliance now recommends treating all AI-generated code as untrusted third-party components. That's the right mental model.

You wouldn't merge a PR from an unknown contractor without reviewing it carefully. Apply the same standard to AI. The code might be correct. It might even be good. But it hasn't earned your trust yet, and the data says it fails security checks nearly half the time.

Nobody is saying reject AI code. Just verify it with the same rigor you'd apply to any external contribution.

The minimum viable security checklist

If your team ships AI-generated code, this is the baseline:

When: What to run

  • Pre-commit: Semgrep (OWASP patterns, injection, XSS), ESLint security plugin, secret detection (git-secrets or TruffleHog)
  • Per PR: Dependency audit (npm audit, Snyk, Dependabot), new-dependency review, security checklist on PR template, human review of auth and payment code
  • Per release: E2E security tests for critical flows, DAST scan against staging, supply chain audit of new transitive dependencies
  • Ongoing: Monitor for new CVEs in dependency tree, track which code paths are AI-generated, update Semgrep rules as new patterns emerge
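To show what the secret-detection step looks for, here is a toy scanner with two illustrative patterns. It is not how git-secrets or TruffleHog are implemented, and it is not a substitute for them. The pattern set and the `find_secrets` helper are assumptions made for the example.

```python
import re

# Illustrative patterns only; dedicated scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list:
    """Return (line_number, pattern_name) pairs for lines that look like secrets."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, name))
    return hits
```

Run against the staged diff in a pre-commit hook, a non-empty result would abort the commit before the secret reaches the repository history.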

Frequently asked questions

How much of AI-generated code has security vulnerabilities?

Veracode found 45% across 100+ models. Java is worst at over 70%. XSS and log injection failure rates exceed 85%. These numbers are consistent across model sizes. Larger models don't produce meaningfully more secure code.

Does iterating on AI code improve its security?

No. An IEEE study found the opposite: asking an AI to improve its own code led to a 37.6% increase in critical vulnerabilities after five rounds. Vulnerabilities nearly tripled by iteration ten. Human review between iterations is essential.

What static analysis tools should I use?

Semgrep for security-specific patterns (injection, XSS, auth bypass). ESLint's security plugin for JavaScript/TypeScript-specific issues. Run both. They catch different things.

Should I stop using AI coding tools?

No. The productivity gains are real. The fix is adding security verification that matches the pace of AI-generated code: static analysis, dependency auditing, security-focused review, and E2E tests for critical flows.

What about the OWASP Top 10 for LLM Applications?

The OWASP LLM list covers risks in AI-powered applications (prompt injection, data poisoning, excessive agency). It's complementary to what we've covered here. If your product uses LLMs, you need both: secure the code the AI writes and secure the AI-powered features your users interact with. We've written a QA playbook for testing AI agents that covers that side.


Security bugs in AI-generated code hide in the gaps between units: in auth flows, data access patterns, and the packages nobody audited. qtrl's AI-powered browser testing validates the flows that matter most in a real browser, without maintaining the infrastructure yourself.
