How to test AI-generated code (QA for vibe-coded apps)
By qtrl Team · Engineering
A developer on your team builds a feature in an afternoon. They describe what they want in plain language, accept what the AI produces, wire it up, and ship it. The feature works. Users are happy. Two weeks later, something breaks and nobody can explain why, because nobody actually wrote the code.
This is the QA side of vibe coding. And most teams aren't ready for it.
What vibe coding actually means for QA
Andrej Karpathy coined the term in February 2025: "fully give in to the vibes, embrace exponentials, and forget that the code even exists." It became Collins Dictionary's Word of the Year by December. A year later, close to half of all new code is AI-generated, and in Y Combinator's Winter 2025 cohort, a quarter of startups reported codebases that were roughly 95% AI-generated.
It's not the same thing as AI-assisted coding. When a developer uses an AI assistant for autocomplete, they're still in control. They understand the code, make the architectural calls, and know where the edge cases hide. Vibe coding is different in kind, not just degree. The developer describes intent. The AI decides how to build it. The developer may not read every line of what comes back.
For QA, this distinction matters. When developers understand their code, they can tell you where it's fragile. When they've vibe-coded it, they often can't. The codebase becomes something the team uses but doesn't fully understand. QA is no longer just verifying that code works correctly. It's often the only line of defense against code nobody can explain.
How vibe-coded apps break
We've already covered why AI coding tools stress your test suite: the shared blind spots between AI-generated code and AI-generated tests, the security gaps, the flakiness that compounds with velocity. All of that applies to vibe coding, but vibe coding adds its own layer of problems.
The core issue is architectural. When a developer prompts an AI to "build a user dashboard with analytics," the AI makes dozens of implementation decisions: how to structure state, which libraries to pull in, how components communicate, where data gets fetched. These decisions tend to work fine on day one. The problems show up weeks later, when the next feature needs to interact with the first one in ways the AI didn't anticipate. It wasn't thinking ahead. It was solving the prompt.
Then there's the debugging wall. When something breaks in code a developer wrote by hand, they have mental context for how it works. With vibe-coded features, that context doesn't exist. The person who prompted the AI may not understand the implementation well enough to trace a bug to its source. Teams often re-prompt the AI to fix the issue, which sometimes works and sometimes introduces new problems in the process.
Dependency sprawl is another pattern. AI models reach for libraries rather than writing solutions from scratch. That's not inherently bad, but vibe-coded apps can accumulate dependencies nobody on the team evaluated. Each one is an attack surface, a maintenance liability, and a potential source of breaking changes on the next update.
Security is where it gets serious. Veracode found that 45% of AI-generated code fails basic security tests. When the developer writing the prompt can't tell the difference between a secure auth implementation and a vulnerable one, they won't catch it in review. The AI produces what looks right. Whether it actually is right depends on training data patterns, not on your security requirements.
Your existing QA process assumes things that aren't true anymore
Most QA processes are built on a few assumptions. Someone on the team understands the architecture. Changes are incremental and reviewable. The codebase evolves gradually. Tests get written by someone who understands the feature they're testing.
Vibe coding breaks all of these. Features arrive fully formed from a prompt. Architecture emerges from whatever patterns the model has seen, not from design sessions. The person who "wrote" the code may not understand it well enough to review it meaningfully. And the pace of change can outrun any test suite that relies on humans writing every test by hand.
That doesn't make QA irrelevant. It means QA has to work differently.
Test from the outside in
If nobody fully understands the implementation, stop trying to test the implementation. Test behavior instead.
End-to-end tests that verify what a real user can do in a real browser are more valuable for vibe-coded apps than unit tests for internal functions nobody can explain. A unit test for a function you don't understand is just a snapshot of today's behavior. It'll pass tomorrow even if the behavior is wrong, because nobody knows what "right" looks like at that level.
Behavior-level tests work differently. "A user can sign up, create a project, invite a teammate, and run a report." That test is valid regardless of the implementation underneath. If a vibe-coded refactor changes how the internals work but the user flow still functions correctly, the test passes. That's the right outcome. If the refactor breaks the flow, the test catches it, even if nobody understands why the code failed.
That inverts the usual testing pyramid. Most testing advice says: broad base of unit tests, narrow layer of E2E tests. For vibe-coded apps, flip it. Put your confidence at the behavior layer, where you can define "correct" without understanding the code.
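Here's a minimal sketch of what a behavior-level test looks like. The `App` class below is a toy in-memory stand-in for whatever actually drives your product (a Playwright page object, an API client); the names and methods are hypothetical, chosen to mirror the sign-up flow described above. Note that the test never touches internals, only observable outcomes.

```python
class App:
    """Toy stand-in for a real browser/API driver. Hypothetical:
    replace with whatever exercises your actual product."""

    def __init__(self):
        self.users = {}
        self.projects = {}

    def sign_up(self, email):
        self.users[email] = {"email": email}

    def create_project(self, owner, name):
        self.projects[name] = {"owner": owner, "members": [owner]}

    def invite(self, project, email):
        self.projects[project]["members"].append(email)

    def run_report(self, project):
        return {"project": project,
                "members": len(self.projects[project]["members"])}


def test_signup_to_report_flow():
    # The whole assertion surface is user-visible behavior. If a
    # vibe-coded refactor changes the internals but the flow still
    # works, this test passes -- the right outcome.
    app = App()
    app.sign_up("owner@example.com")
    app.create_project("owner@example.com", "Q3 metrics")
    app.invite("Q3 metrics", "teammate@example.com")
    report = app.run_report("Q3 metrics")
    assert report["members"] == 2
```

The test survives any refactor that preserves the flow, which is exactly the property you want when nobody can vouch for the implementation.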
Pick your battles with risk-based testing
When code arrives faster than your team can test it, you can't cover everything. You have to choose.
The AI doesn't know that your payment integration matters more than your settings page. It doesn't know that a bug in user permissions could be a security incident. It doesn't know that your onboarding flow is the one thing standing between a free trial and a paying customer. A human has to make those calls.
Start with a list. What are the 10 to 15 user flows where a bug would hurt the most? Those get tested first, every time, no exceptions. Everything else gets tested as capacity allows. This isn't about accepting lower quality. It's about focusing quality effort where it has the highest impact, because you no longer have the luxury of testing everything by hand.
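One way to make that prioritization explicit is a simple risk score. This is a hypothetical sketch: the flow names and the impact/likelihood numbers are the human judgment calls the AI can't make for you.

```python
# Hypothetical flow inventory. Scores (1-5) are assigned by a human
# who knows the business: the AI doesn't know payments outrank settings.
FLOWS = [
    {"name": "checkout and payment", "impact": 5, "likelihood": 4},
    {"name": "user permissions",     "impact": 5, "likelihood": 3},
    {"name": "onboarding",           "impact": 4, "likelihood": 3},
    {"name": "settings page",        "impact": 2, "likelihood": 2},
]


def prioritize(flows, always_run=10):
    """Rank flows by risk (impact x likelihood). The top slice runs
    on every build, no exceptions; the rest run as capacity allows."""
    ranked = sorted(flows,
                    key=lambda f: f["impact"] * f["likelihood"],
                    reverse=True)
    return ranked[:always_run], ranked[always_run:]


every_build, as_capacity_allows = prioritize(FLOWS, always_run=2)
```

The point isn't the arithmetic, it's that the ranking is written down, reviewable, and owned by a person rather than implied by whatever got tested last.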
Lock down the boundaries
The boundaries of your application (APIs, database schemas, third-party integrations) are the most reliable testing surface for vibe-coded apps. The internals might shift with every prompt-driven refactor, but the contracts between systems should stay stable.
API contract tests catch it when an endpoint starts returning a different shape or status code than the frontend expects. Schema tests catch it when a migration silently drops a column. And integration tests tell you whether your app and its dependencies still speak the same language after a refactor.
When a developer vibe-codes a feature that restructures how the frontend calls the backend, a contract test catches the mismatch before users do. These tests are cheap to write, fast to run, and resilient to the kind of internal churn that vibe coding produces.
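A contract test can be as small as a shape check. The sketch below assumes a hypothetical `/projects` endpoint; the field names are illustrative. The "contract" is just the fields and types the frontend depends on, written down where a refactor can't silently change them.

```python
# The shape the frontend relies on (hypothetical endpoint and fields).
EXPECTED_SHAPE = {"id": int, "name": str, "owner": str}


def contract_violations(payload, shape=EXPECTED_SHAPE):
    """Return the names of fields that are missing or mistyped."""
    missing = [k for k in shape if k not in payload]
    wrong_type = [k for k, t in shape.items()
                  if k in payload and not isinstance(payload[k], t)]
    return missing + wrong_type


# A vibe-coded refactor that renames "owner" to "owner_id" trips the
# contract check before the frontend ever sees the new response:
refactored_response = {"id": 7, "name": "Q3 metrics", "owner_id": "u_42"}
assert contract_violations(refactored_response) == ["owner"]
```

In practice you'd use a schema library or an OpenAPI spec, but even a check this small catches the renamed-field class of breakage the moment it lands.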
Explore what you don't understand
Scripted tests cover the known risks. Exploratory testing covers the unknowns. For vibe-coded apps, the unknowns are bigger than usual.
A skilled tester or an AI agent navigating your app without a script will find things that planned tests miss. Weird state transitions. Missing error handling. Flows that technically work but feel broken. Edge cases in multi-step processes where the AI's implementation didn't account for interruptions or back-navigation.
This is where agentic testing earns its value. An AI agent that explores your app in a real browser, navigates unexpected paths, and reports what it finds is well-suited for codebases where the team doesn't have a complete mental model of how everything connects. The agent doesn't need to understand the code. It just needs to interact with the product the way a user would, including the ways users aren't supposed to.
Treat vibe-coded PRs like external contributions
When a contractor you've never worked with submits a pull request, you review it with extra care. You don't just check if it works. You check whether it follows your patterns, handles errors at boundaries, respects security conventions, and doesn't introduce dependencies you haven't vetted.
Vibe-coded PRs deserve the same treatment. Some teams have started using checklists for reviewing AI-generated code:
- Does the change handle authentication and authorization correctly?
- Does it validate inputs at system boundaries?
- Are there error handling paths, or only happy paths?
- Did it introduce new dependencies? Are they maintained and vetted?
- Do the tests actually test behavior, or do they just mock everything and assert on mocks?
That last point matters more than people realize. When AI generates both the feature and the tests, the tests often test nothing meaningful. They mock the database, mock the network, mock the service under test, and assert that a fake function returned a fake value. Coverage goes up. Confidence should not.
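The contrast is easy to see side by side. Both tests below use hypothetical names; the first is the anti-pattern (it asserts on its own mock and can never fail), the second exercises a real implementation and checks an observable outcome, unhappy path included.

```python
from unittest import mock


def test_invite_asserts_on_a_mock():
    # Anti-pattern: the "service" is a Mock, so the assertion checks
    # only that the mock returns what we just told it to return.
    service = mock.Mock()
    service.invite.return_value = {"sent": True}
    assert service.invite("teammate@example.com")["sent"]  # always passes


class InviteService:
    """Toy real implementation, for illustration."""

    def __init__(self):
        self.outbox = []

    def invite(self, email):
        if "@" not in email:
            return {"sent": False}
        self.outbox.append(email)
        return {"sent": True}


def test_invite_sends_real_invite():
    # Behavior test: assert on outcomes the implementation produces,
    # not on values we planted in a mock.
    service = InviteService()
    assert service.invite("teammate@example.com")["sent"]
    assert service.outbox == ["teammate@example.com"]
    assert not service.invite("not-an-email")["sent"]  # unhappy path
```

A quick heuristic during review: if deleting the implementation wouldn't fail the test, the test isn't testing anything.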
Capture every production bug as a regression test
In a traditional codebase, experienced developers can predict where bugs will appear. They know which parts are fragile, which edge cases are tricky, which integrations are finicky. In a vibe-coded app, that institutional knowledge doesn't exist in the same way. You're discovering the architecture's weak points in real time.
Every bug that reaches production is information. It tells you where the AI cut corners, where the implementation diverged from what the prompt described, where the edge cases live. Capture that information in a regression test. Over time, your test suite becomes a map of the places where vibe coding tends to fail. That map is more valuable than any coverage metric.
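Capturing that information can be mechanical. The sketch below is hypothetical (the incident, ticket ID, and function are invented for illustration): the bug report becomes a test that pins the fixed behavior, with the incident reference in the test itself so the context survives team turnover.

```python
def apply_discount(total_cents, discount_cents):
    """Post-fix behavior: an invoice total can never go negative."""
    return max(total_cents - discount_cents, 0)


def test_regression_discount_exceeds_total():
    # Hypothetical incident BUG-412: a vibe-coded discount feature
    # allowed discounts larger than the invoice total, producing
    # negative charges in production. This test pins the fix.
    assert apply_discount(5000, 8000) == 0
    # Normal case still works:
    assert apply_discount(5000, 1000) == 4000
```

Each of these tests is cheap, but the accumulated suite is the map the article describes: a record of exactly where AI-generated code failed in your product, not in the abstract.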
AI testing helps here (with structure underneath)
There's an irony here. The same AI capabilities that create the vibe coding problem also make testing vibe-coded apps more practical.
Self-healing test execution handles the constant churn. When vibe-coded refactors change selectors, class names, and DOM structure every week, a self-healing agent adapts instead of breaking. That's a real advantage over traditional automation, which would generate a wall of false failures after every major refactor.
AI-driven test generation from user stories helps teams build coverage quickly. When a vibe-coded feature lands and needs tests, describing the expected behavior in natural language and generating test cases is faster than writing them by hand, especially when the person writing the tests didn't write the code either.
But here's the part most teams miss: AI testing without structure underneath is just adding more AI to a problem caused by too much unstructured AI. If your tests aren't organized, if nobody owns them, if there's no record of what was tested and when, then AI execution produces more noise, not more confidence. You need structured test management as the foundation: clear test flows, ownership, traceability. The AI amplifies whatever sits underneath it. Make sure what sits underneath is solid.
Vibe coding is evolving, not disappearing
Karpathy himself has already moved past the term. In early 2026, he started calling it "agentic engineering": orchestrating AI agents to write code while humans provide architecture and oversight. The language is evolving, but the underlying reality isn't going anywhere. Teams will keep shipping AI-generated code at speeds that would have been unthinkable two years ago.
The QA teams that thrive won't be the ones trying to slow things down. They'll be the ones that adapt: testing behavior instead of implementation, focusing effort on the flows that matter most, locking down boundaries, and using AI-powered testing to keep pace with AI-powered development.
Structure underneath. Speed on top. That's how you QA a vibe-coded app.
qtrl pairs AI-powered testing with structured test management, so your team gets real browser validation on the flows that matter and full traceability for every test run. Built for codebases that move fast. Start free.
Have more questions about AI testing and QA? Check out our FAQ