Experiments and their role
To systematically understand and improve how your system behaves, you need a way to isolate cause and effect. That's what experiments give you. You pick one variable, run your dataset through two versions of your system, and compare what comes out. The result tells you whether a change actually helped, and by how much. To quantify the "how much", you also need to evaluate your experiment output (see Evaluation, TODO). This section covers the systematic experimentation that comes before evaluation.
The anatomy of an experiment
Every experiment has four components.
| Component | What it is |
|---|---|
| Baseline | Your current production system — the control condition everything else gets measured against. Keep it fixed while you vary one thing. |
| Dataset | The inputs you run both conditions against. Keep the same dataset across experiments so results are comparable over time. |
| Variable | The single thing you're changing — model, prompt, context, tool access, or agent architecture. See On variables below. |
| Outputs to compare | What your system produces under each condition. Comparing these is the actual work of running an experiment. |
Change two things at once and you can't tell which caused the difference.
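One cheap way to keep yourself honest here is a guard that refuses to run unless the two condition configs differ in exactly one field. A minimal Python sketch; the config keys and values are hypothetical stand-ins for your own:

```python
# Hypothetical configs for the two conditions; keys and values are stand-ins.
baseline = {"model": "model-a", "prompt": "v1", "tools": ["search"]}
variant = {"model": "model-b", "prompt": "v1", "tools": ["search"]}

# Enforce the one-variable rule before spending any tokens.
changed = {k for k in baseline if baseline[k] != variant[k]}
assert len(changed) == 1, f"expected exactly one changed variable, got {sorted(changed)}"
print(f"variable under test: {changed.pop()}")
```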
On variables
- Model. The AI model you are calling. Thinking models, cheap models, fast models: each comes with a different tradeoff between result quality and cost.
- Prompt. The most common lever. Before running a prompt experiment, ask: is the failure a specification problem (ambiguous or incomplete prompt) or a generalization problem (model applies clear instructions inconsistently)? The latter is worth measuring.
- Context. What information you include in the prompt: retrieved documents, conversation history, user metadata.
- Tool access. Adding or removing tools changes what paths your system can take.
- Agent architecture. Single agent vs. multi-agent, which framework, how tasks are decomposed. The biggest bets, the hardest to isolate.
How is it used?
The core loop: pick a variable & form a hypothesis, run both conditions against your dataset, compare outputs, learn something & repeat.
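In code, that loop can stay very small. A minimal sketch; `run_system` is a hypothetical stand-in for whatever invokes your actual system, and the important property is that both conditions see exactly the same inputs:

```python
def run_system(cfg: dict, example: str) -> str:
    """Stand-in for your real system: one call with a given model/prompt/tools."""
    return f"[{cfg['model']}] answer to: {example}"

def run_experiment(baseline_cfg: dict, variant_cfg: dict, dataset: list[str]) -> list[dict]:
    """Run both conditions over the same inputs and return paired outputs."""
    return [
        {
            "input": example,
            "baseline": run_system(baseline_cfg, example),
            "variant": run_system(variant_cfg, example),
        }
        for example in dataset
    ]
```

Everything downstream, reading, scoring, win rates, works off this list of paired outputs.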
Typical questions include:
- A new model is out: will it improve the performance of our system?
- Does my prompt change improve the output quality of our system?
- Does our new agent harness architecture produce better results than our current multi-agent system?
Comparing outputs starts qualitative. Open traces from the same input under both conditions side by side. Read them. Which one actually answered the question? Where did each one fail? This is how you build intuition for what your system is doing, and if you're not willing to read actual outputs on a regular cadence, your experiments will mislead you.
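A throwaway script is enough to support that habit. This sketch assumes the paired `results` list from the loop above and writes everything into one file you can scroll through:

```python
def write_side_by_side(results: list[dict], path: str = "comparison.md") -> None:
    """Dump paired outputs into one file so they can be read top to bottom."""
    with open(path, "w") as f:
        for i, r in enumerate(results, start=1):
            f.write(f"## Example {i}\n\n")
            f.write(f"**Input:** {r['input']}\n\n")
            f.write(f"**A (baseline):** {r['baseline']}\n\n")
            f.write(f"**B (variant):** {r['variant']}\n\n---\n\n")
```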
With scores, comparison gets concrete. You're looking at win rates ("condition B beats condition A on 68% of inputs"), score distributions (does B win consistently or just on a few inputs?), and cost-latency tradeoffs (B is better but 2x more expensive, is it worth it?).
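Once a score is attached to each paired output, the headline numbers take a few lines. This sketch assumes hypothetical `score_a`/`score_b` fields in `[0, 1]` on the `results` from earlier:

```python
def summarize(results: list[dict]) -> None:
    """Headline numbers: win rate, ties, mean scores, and the spread of wins."""
    n = len(results)
    wins = sum(r["score_b"] > r["score_a"] for r in results)
    ties = sum(r["score_b"] == r["score_a"] for r in results)
    mean_a = sum(r["score_a"] for r in results) / n
    mean_b = sum(r["score_b"] for r in results) / n
    print(f"B wins {wins}/{n} ({wins / n:.0%}), ties {ties}/{n}")
    print(f"mean score  A: {mean_a:.2f}   B: {mean_b:.2f}")
    # Sorted per-input deltas show whether B wins consistently
    # or only on a few outliers.
    deltas = sorted(round(r["score_b"] - r["score_a"], 2) for r in results)
    print(f"per-input deltas (sorted): {deltas}")
```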
What you're optimizing for matters. Performance, cost, and latency pull in different directions. A better model costs more. A shorter prompt is faster but might miss edge cases. A more capable agent architecture adds latency. Experiments let you see those tradeoffs in your data rather than in theory.
Where to start
TODO: Vibe check and error analysis before this.
Don't set up the full evaluation pipeline before running anything. Reading a few traces side by side will teach you more in the first hour than a week of infrastructure work.
- Get 20–30 real examples. Pull them from production traces. They don't need to cover everything, just a real slice of what your application handles.
- Change one thing and run both versions. Keep everything else identical.
- Read traces side by side. No evaluator needed yet. Just read. What's different? Which one is actually better and why? Pay attention to the type of failure — is the prompt unclear, or is the model applying clear instructions inconsistently? That distinction tells you what kind of fix to try next.
- Add an evaluator once you have intuition. After a few manual rounds you'll know what you're looking for. Encode it (see the sketch below). Now you can scale.
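As a toy example of that encoding step: suppose manual reading showed that good answers cite a source and stay concise. Those two observations become a first, crude evaluator (the criteria here are illustrative, not a recommendation):

```python
def score(output: str) -> float:
    """Crude 0.0-1.0 rubric distilled from manual reading; criteria are illustrative."""
    cites_source = "http" in output or "[source]" in output.lower()
    concise = len(output.split()) <= 150
    return (cites_source + concise) / 2

# Attach scores to the paired results from the experiment loop:
# for r in results:
#     r["score_a"], r["score_b"] = score(r["baseline"]), score(r["variant"])
```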
What comes next
To see whether your experiment led to an improvement, you need to evaluate your results. Learn more about evaluation methods in the next section.