Bake Off
Intent
Give the same prompt to multiple assistants—or to multiple instances of the same assistant. Compare the results and identify whether you need to change your tools, your model, or your prompt.
Motivation
A coding assistant comprises multiple components. The important contributions to generating output in response to your instructions include:
- Your user prompt, the instructions you provide to the assistant.
- The system prompt, which defines general goals and guidelines for the model to follow when generating a response.
- The language model, including its training process and any post-training fine-tuning applied.
- The model’s generation parameters, including the temperature, the token-sampling approach, and the random seed.
- The assistant’s context, including any source code files, documentation, or memory to which you give the assistant access.
- The collection of tools, agent skills, and MCP servers to which the assistant gives the model access.
Changes to any of these components can lead to differences in the output the assistant generates for a given task. Take a scientific approach to evaluating configuration changes, modifying one component in the system at a time and measuring the effect on the generated output. Use quantitative measurements wherever possible, and statistical analysis techniques where appropriate.
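One lightweight way to enforce this discipline is to write down each run’s configuration explicitly, so that the baseline and the variant differ in exactly one entry. Here’s a sketch; the keys and values are illustrative, not any assistant’s real configuration format:

```python
# Illustrative run configurations: the baseline and the variant differ in
# exactly one entry, so any difference in output is attributable to that
# one change (plus randomness).
baseline = {
    "model": "gemini-2.5-pro",
    "temperature": 0.7,
    "system_prompt": "default",
    "tools": ["read_file", "grep", "write_file"],
}

# Vary only the model; every other component stays fixed.
variant = {**baseline, "model": "gemini-2.5-flash"}
```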
You can even keep all of the parameters the same and compare multiple instances of generated output in identical circumstances, to understand how randomness leads to differences in results. Use this technique as an alternative to Disclose Ambiguity to identify ambiguities or missing constraints in your prompts.
Applicability
Use Bake Off as part of a continuous improvement process, in which you evaluate your tools and the ways in which you use them, and hypothesize and test potential changes to the tools, their configuration, and your workflow.
For example, when you consider applying a different pattern from this catalog, design a Bake Off between your current practice and the pattern application and compare results. Alternatively, Bake Off the same task specification with a large model and a reasoning model, or with two models with differing token costs, to identify whether a task requires a frontier model to get acceptable quality, or whether you can use a cheaper model or even local inference and still get useful results.
Consequences
Bake Off offers the following benefits:
- Systematically understand and adapt your workflow and tools.
- Apply continuous process improvement to your interactions with coding assistants.
- Use objective criteria to evaluate your processes.
Implementation
Implement Bake Off as a continuous improvement process by following these steps:
- Identify a potential change to your toolset or process; for example, using a different model.
- Identify a measurement that would indicate whether the change is successful; for example, quality improves or the cost per response decreases.
- Prompt your assistant to generate responses in both the unchanged and the changed situation, and gather data on your measurement criteria.
- Analyze the data and determine whether the change represents an improvement.
Alternatively, implement Bake Off as a sequential variant of Ask for Alternatives by keeping all parameters of the generation the same, and invoking the assistant multiple times. Use this approach to test the reliability of “lights-out” agentic tasks, or to explore the effects of randomness and ambiguity in your prompt on the generated output.
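In either form, a small driver script keeps the runs consistent. Here’s a minimal sketch, assuming Gemini CLI’s headless mode (where -m selects a model and -p supplies a prompt non-interactively); the prompt text, model names, and run count are placeholders to adapt:

```python
import subprocess

PROMPT = "Create a debugging guide for this project..."  # your task prompt
MODELS = ["gemini-2.5-pro", "gemini-2.5-flash"]  # the single variable under test
RUNS_PER_MODEL = 3  # repeat identical runs to expose the effects of randomness

results: dict[str, list[str]] = {}
for model in MODELS:
    runs = []
    for _ in range(RUNS_PER_MODEL):
        # Each headless invocation starts a fresh session: a Clean Slate.
        completed = subprocess.run(
            ["gemini", "-m", model, "-p", PROMPT],
            capture_output=True, text=True, check=True,
        )
        runs.append(completed.stdout)
    results[model] = runs

# Score each captured response against your measurement criteria,
# then compare the score distributions across configurations.
```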
Examples
Programmer documentation
The Generate Documentation pattern might need a capable model, because the model must read the prompt, explore the source code, and then create documentation that gets the detailed steps right, in the correct sequence, and meets the readability and style criteria. I hypothesize that a cheaper, less capable model produces lower-quality documentation, and design a Bake Off to test this.
The variable I change is the choice of model, so I keep the assistant (Gemini CLI 0.34.0), source code (the AppScript project), and prompts constant. I use the “default” agent’s system prompt. Here’s the user prompt I use:
Create a debugging guide for this project, that shows a developer how to catch an exception they encounter in their AppScript code, interpret the exception information to diagnose the fault, and fix the problem. Include an example of a realistic scenario where a script raises an exception, and the steps the developer follows to find and fix the exception. Use clear, direct language, and aim for a Flesch-Kincaid reading ease score of at least 50. Save the guide in a Markdown file.
I run /clear or /exit after each iteration of the experiment, creating a Clean Slate for the model, and delete any guide or intermediate files created in earlier iterations. I complete the task three times with each of two models, Gemini 2.5 Flash and Gemini 2.5 Pro, and give each generated guide a score: its Flesch reading ease if the document accurately describes the debugging workflow, or 0 if the document is inaccurate (regardless of reading ease).
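The readability half of this scoring rule is easy to automate. Here’s a minimal sketch, assuming the third-party textstat package for the readability measurement; the accuracy judgement stays a manual input:

```python
import textstat  # third-party readability package: pip install textstat

def score_guide(text: str, is_accurate: bool) -> float:
    """Flesch reading ease if the guide is accurate, otherwise 0."""
    return textstat.flesch_reading_ease(text) if is_accurate else 0.0
```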
All six documents generated by the models are available in the sample code repository in the documentation_bake_off folder. The scores for the documents are shown in the table below.
| Document | Model | Accurate? | Reading Ease | Score |
|---|---|---|---|---|
| flash25-1.md | Gemini 2.5 Flash | Yes | 30.3 | 30.3 |
| flash25-2.md | Gemini 2.5 Flash | Yes | 56.9 | 56.9 |
| flash25-3.md | Gemini 2.5 Flash | Yes | 39.4 | 39.4 |
| pro25-1.md | Gemini 2.5 Pro | No | 66.1 | 0 |
| pro25-2.md | Gemini 2.5 Pro | Yes | 66.5 | 66.5 |
| pro25-3.md | Gemini 2.5 Pro | No | 65.7 | 0 |
The mean score for Gemini 2.5 Flash is 42.2. The mean score for Gemini 2.5 Pro is 22.2.
Contrary to my expectation, Gemini 2.5 Flash performs better than Gemini 2.5 Pro at this task: the documents it created are uniformly less readable, but more likely to be accurate. Inspecting the transcripts for these six sessions shows that Gemini 2.5 Flash read the project source files relevant to constructing exceptions, while Gemini 2.5 Pro used the grep tool to search for exception keywords. As a result, Gemini 2.5 Pro ingested less of the relevant source code before generating the documentation.
Side-by-side comparison
The side-by-side skill in the Chiron Codex agent skills repository instructs an agent to run the same task multiple times in separate workspaces, and to report or compare the results. Coding assistants tend to optimize for task completion, so it proved difficult to get any assistant to deliberately repeat work.
To implement the programmer documentation Bake Off from the example above, I needed to reinforce the skill’s instructions in my user prompt, even after a couple of rounds of Call Out Error and Extract Prompt to strengthen the skill’s instruction file.
The user prompt I provided to initiate the Bake Off (in Gemini CLI, using the Gemini 3 Pro model) was:
/side-by-side Instantiate distinct subagents in different workspaces, that use the gemini-2.5-pro and gemini-2.5-flash models, and give them the following prompt: "Create a debugging guide for this project, that shows a developer how to catch an exception they encounter in their AppScript code, interpret the exception information to diagnose the fault, and fix the problem. Include an example of a realistic scenario where a script raises an exception, and the steps the developer follows to find and fix the exception. Use clear, direct language, and aim for a Flesch-Kincaid reading ease score of at least 50. Save the guide in a Markdown file." The purpose of this task is a scientific comparison of the output of the different models, so you MUST delegate this work to separate subagents and you MUST give them the same prompt in different workspaces.
Gemini used the Gemini CLI tool in “headless” mode to conduct the Bake Off.
Related Patterns
Reflect to identify workflow improvements to try out.
Ask for Alternatives to get inspiration for experiments to try in a Bake Off.