Trust the Tests
Intent
Trust the tests, and not the agent. Create external quality checks, and ensure that the software you create passes those checks even when the agent reports that it’s “done”.
Motivation
Software engineers organize quality-checking work into two categories:
- Verification: determining whether the product is correctly built.
- Validation: determining whether the product provides the expected value.
In other words, verification asks “did we build it right?” while validation asks “did we build the right thing?”
When you prompt a coding assistant to generate software, the assistant is incapable of validation, as it doesn’t communicate with your customer or other project stakeholders. The assistant has limited verification capabilities, but the model’s completion bias leads it to prioritize completing implementation code over creating tests, or over ensuring that the tests it does create run and pass.
Additionally, when you Write Down TODOs to create “higher-order prompts”—prompts that create prompts—you introduce the possibility that the code-generation prompt diverges from the intentions you set out in your initial prompt.
A coding assistant that reports that its work is complete might not have created everything specified in your prompt, and the output it creates might not be something your project needs. Create an external system for verification and validation (whether you get an assistant to generate all, some, or none of its components) and use that to determine when the software is ready. You can set up this system so the Model Reviews the code it generates.
Applicability
Trust the Tests applies whenever you prompt a model to generate code that you use in an important system. You should take a risk-based approach to determining the level of review, trust, and manual effort you put into your evaluation.
If you’re using an agentic workflow in which coding agents automatically generate code, and you find that they create code more quickly than you can review it through manual inspection, then you can’t implement Trust the Tests by having the agents generate test code that you manually inspect. Such a change in workflow doesn’t remove the fundamental constraint: the agents create code faster than you can read it, whether what you read is the implementation code or the tests.
Consider automated analysis tools, including other LLM-based tools that engage in Adversarial Dialectics with your coding assistants, to discover and report issues. The only people who can tell you if your customers are happy are your customers—so have them validate your work, or help you define acceptance criteria that you evaluate automatically.
Consequences
Trust the Tests offers the following benefits:
- Independent assessment of the code a model generates.
- Reduced reliance on manual inspection.
- A level of scrutiny for generated code that you choose based on risk analysis.
Implementation
Trust the Tests by building a verification and validation system that doesn't rely on the model's report of whether a feature is implemented, or a bug fixed. This system can include:
- Automated tests, at any level of the test pyramid. You might get a coding assistant to generate these tests. Consider tests that investigate a broad range of scenarios, for example property-based tests, to reduce the amount of test code you create that itself needs review (a sketch follows this list).
- Mutation testing, where you make changes to your generated code and confirm that the automated tests detect the changed behavior as a regression. You can use a coding assistant to mutate your code (also sketched after this list).
- Static analysis tools and compiler diagnostics that scan source code for problems.
- Dynamic analysis tools, such as symbolic execution tools, that discover code paths leading to unwanted outcomes such as crashes.
- Customer validation.
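To illustrate the property-based item above, here is a minimal sketch using the jqwik library for Java. The function under test (applyDiscount) and its class are hypothetical, not taken from any project discussed in this article; the point is that one property, checked against many generated inputs, stands in for a pile of hand-written example tests you would otherwise need to review.

```java
import net.jqwik.api.ForAll;
import net.jqwik.api.Property;
import net.jqwik.api.constraints.IntRange;

class DiscountProperties {

    // Hypothetical function under test; stands in for generated code.
    int applyDiscount(int priceInCents, int discountPercent) {
        return priceInCents - (priceInCents * discountPercent / 100);
    }

    // jqwik generates many (price, discount) pairs and checks the property
    // for each one; a returned false is reported as a failing, shrunk case.
    @Property
    boolean discountNeverIncreasesThePriceOrGoesNegative(
            @ForAll @IntRange(min = 0, max = 1_000_000) int priceInCents,
            @ForAll @IntRange(min = 0, max = 100) int discountPercent) {
        int discounted = applyDiscount(priceInCents, discountPercent);
        return discounted <= priceInCents && discounted >= 0;
    }
}
```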
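And a minimal sketch of the mutation-testing item, again with hypothetical names (BulkOrders, isBulkOrder). A mutant flips a single operator in the generated code; a test suite that still passes against the mutant has a gap, while a test that fails against it “kills” the mutant and proves the suite would catch that regression. Tools such as PIT automate generating and running mutants like this for Java projects.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertTrue;

class BulkOrders {
    // Original generated code: a boundary check.
    boolean isBulkOrder(int quantity) {
        return quantity >= 100;
    }

    // A typical mutant: >= changed to >. A weak test suite that only
    // checks quantities like 50 and 500 passes against both versions.
    boolean isBulkOrderMutant(int quantity) {
        return quantity > 100;
    }
}

class BulkOrdersTest {
    @Test
    void exactlyOneHundredItemsIsABulkOrder() {
        // Passes against the original, fails against the mutant:
        // this boundary test kills the mutant.
        assertTrue(new BulkOrders().isBulkOrder(100));
    }
}
```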
You don’t need to use every aspect of your verification and validation system at every point during development. Take a risk-based approach that ensures all quality checks are complete by the time customers start using your generated code, and that you and your colleagues receive timely information about changes you and coding assistants make during development.
Example
Are you really done?
In a recent session with agents using the Devstral 2, Devstral Small 2, and Qwen3 Coder Next models, the coding assistant (Mistral Vibe) reported that it had completed a task “with production-quality code and 100% test coverage”. The code is a Spring Boot web application written in Java. Running the unit tests with mvn test, I see that they pass, but the acceptance tests, which are written in Cucumber and drive a web browser through the Selenium WebDriver, fail.
According to the failure report, the system doesn’t display any output on either success or failure. That supposedly production-ready, fully tested code isn’t doing anything useful for the people who use my application. I Call Out Error to get the model to integrate the code.
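For readers unfamiliar with that setup, here is a minimal sketch of the kind of Cucumber step definition that produces such a failure. The step text, class name, and element id are hypothetical, not taken from the session above; what matters is that the assertion runs against what the browser actually renders, independent of whatever the coding assistant reports.

```java
import io.cucumber.java.en.Then;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import static org.junit.jupiter.api.Assertions.assertEquals;

public class ResultStepDefinitions {

    // Shared via dependency injection (e.g. cucumber-picocontainer) in a real suite.
    private final WebDriver driver;

    public ResultStepDefinitions(WebDriver driver) {
        this.driver = driver;
    }

    // Hypothetical step: fails if the page displays nothing,
    // regardless of what the unit tests or the assistant say.
    @Then("I see the message {string}")
    public void iSeeTheMessage(String expected) {
        String actual = driver.findElement(By.id("result-message")).getText();
        assertEquals(expected, actual);
    }
}
```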
Related Patterns
In Adversarial Dialectics, the Model Reviews generated code and creates a report of problems for the model to address.
You Review tests that the model generates.
Generate Documentation to use in validation—either as the basis for automated tests, or for stakeholders to determine whether the software does what they expect.
Build a Model for people to validate without waiting for the whole system to be ready.