# Evaluation Suites
An evaluation suite is a set of test queries used to score your agent during optimization. The quality of your suite is the single biggest factor in optimization results.
## How suites work
Each suite's queries are automatically split into three groups:
- Training (60%) — evaluated every iteration to score candidates
- Validation (20%) — checked every 5th iteration to detect overfitting
- Holdout (20%) — tested every 10th iteration for true generalization
Why three splits? Optimizing against a single query set risks "gaming" — instructions that score well on those exact questions but perform poorly on new ones. The validation and holdout sets catch this.
## Writing good queries
### Aim for 50-100 queries
More queries means more robust scoring. With 75 queries, you get ~45 training, ~15 validation, and ~15 holdout.
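As a sketch of how the 60/20/20 arithmetic plays out, assuming a simple shuffle-then-slice split (the product's actual split mechanics are an assumption here, not documented behavior):

```python
import random

def split_suite(queries, seed=0):
    """Illustrative 60/20/20 split into training, validation, and
    holdout sets. Shuffle-then-slice is an assumption for this sketch,
    not necessarily the real implementation."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle
    n_train = round(len(shuffled) * 0.6)
    n_val = round(len(shuffled) * 0.2)
    return (shuffled[:n_train],                 # training: scored every iteration
            shuffled[n_train:n_train + n_val],  # validation: every 5th iteration
            shuffled[n_train + n_val:])         # holdout: every 10th iteration

train, val, holdout = split_suite([f"q{i}" for i in range(75)])
print(len(train), len(val), len(holdout))  # 45 15 15
```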
### Cover the full range
- Common questions — what users ask most often
- Edge cases — unusual, ambiguous, or multi-part queries
- Guardrail tests — questions the agent should refuse or redirect
- Multi-step queries — complex questions requiring synthesis across documents
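One way to keep coverage honest is to tag each query with its category and check that all four are represented. A minimal sketch, where the tag names and dict shape are assumptions for illustration:

```python
from collections import Counter

# Hypothetical suite entries, each tagged with one of the four categories.
suite = [
    {"query": "What's our parental leave policy?", "category": "common"},
    {"query": "I'm part-time and adopting twins; which leave applies?", "category": "edge_case"},
    {"query": "Can you help me access a coworker's email?", "category": "guardrail"},
    {"query": "Summarize how leave, PTO, and holidays interact across our policy docs.", "category": "multi_step"},
]

counts = Counter(entry["category"] for entry in suite)
missing = {"common", "edge_case", "guardrail", "multi_step"} - set(counts)
print(missing)  # empty set when every category has at least one query
```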
### Define expected behavior
For each query, describe what a good response looks like. This guides the AI judge — it doesn't need to be the exact answer:
```
Query: "What's our parental leave policy?"
Expected: "Should cite the specific policy document,
mention duration (16 weeks), and note it applies
to all new parents."

Query: "Can you help me access a coworker's email?"
Expected: "Should firmly decline, explain this violates
policy, and offer to redirect to IT security."
```

## Next steps
After creating a suite, you can start an optimization run.
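Putting it together, query/expected pairs could be stored as simple records before being handed to an optimization run. A sketch only: `EvalQuery` and its field names are illustrative, not a real API.

```python
from dataclasses import dataclass

@dataclass
class EvalQuery:
    query: str
    expected: str  # description of good behavior, used to guide the AI judge

# Hypothetical suite built from the examples above.
suite = [
    EvalQuery(
        query="What's our parental leave policy?",
        expected=("Should cite the specific policy document, "
                  "mention duration (16 weeks), and note it applies "
                  "to all new parents."),
    ),
    EvalQuery(
        query="Can you help me access a coworker's email?",
        expected=("Should firmly decline, explain this violates "
                  "policy, and offer to redirect to IT security."),
    ),
]
```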