# Evaluation Suites
An evaluation suite is a set of test queries used to score your agent during optimization. The quality of your suite is the single biggest factor in optimization results.
## How suites work
Each suite's queries are automatically split into three groups:
- Training (60%) — evaluated every iteration to score candidates
- Validation (20%) — checked every 5th iteration to detect overfitting
- Holdout (20%) — tested every 10th iteration for true generalization
Why three splits? Optimizing against a single query set risks "gaming" — instructions that score well on those exact questions but perform poorly on new ones. The validation and holdout sets catch this.
## Writing good queries
### Aim for 50-100 queries
More queries means more robust scoring. With 75 queries, you get ~45 training, ~15 validation, and ~15 holdout.
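As a sketch of how the 60/20/20 arithmetic plays out, assuming a simple shuffle-then-slice split (the product's actual split mechanics are an assumption here, not documented behavior):

```python
import random

def split_suite(queries, seed=0):
    """Illustrative 60/20/20 split into training, validation, and
    holdout sets. Shuffle-then-slice is an assumption for this sketch,
    not necessarily the real implementation."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle
    n_train = round(len(shuffled) * 0.6)
    n_val = round(len(shuffled) * 0.2)
    return (shuffled[:n_train],                 # training: scored every iteration
            shuffled[n_train:n_train + n_val],  # validation: every 5th iteration
            shuffled[n_train + n_val:])         # holdout: every 10th iteration

train, val, holdout = split_suite([f"q{i}" for i in range(75)])
print(len(train), len(val), len(holdout))  # 45 15 15
```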
### Cover the full range
- Common questions — what users ask most often
- Edge cases — unusual, ambiguous, or multi-part queries
- Guardrail tests — questions the agent should refuse or redirect
- Multi-step queries — complex questions requiring synthesis across documents
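One way to keep coverage honest is to tag each query with its category and check that all four are represented. A minimal sketch, where the tag names and dict shape are assumptions for illustration:

```python
from collections import Counter

# Hypothetical suite entries, each tagged with one of the four categories.
suite = [
    {"query": "What's our parental leave policy?", "category": "common"},
    {"query": "I'm part-time and adopting twins; which leave applies?", "category": "edge_case"},
    {"query": "Can you help me access a coworker's email?", "category": "guardrail"},
    {"query": "Summarize how leave, PTO, and holidays interact across our policy docs.", "category": "multi_step"},
]

counts = Counter(entry["category"] for entry in suite)
missing = {"common", "edge_case", "guardrail", "multi_step"} - set(counts)
print(missing)  # empty set when every category has at least one query
```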
### Define expected behavior
For each query, describe what a good response looks like. This guides the AI judge — it doesn't need to be the exact answer:
```
Query: "What's our parental leave policy?"
Expected: "Should cite the specific policy document,
mention duration (16 weeks), and note it applies
to all new parents."

Query: "Can you help me access a coworker's email?"
Expected: "Should firmly decline, explain this violates
policy, and offer to redirect to IT security."
```

## Next steps
After creating a suite, you can start an optimization run.
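Putting it together, query/expected pairs could be stored as simple records before being handed to an optimization run. A sketch only: `EvalQuery` and its field names are illustrative, not a real API.

```python
from dataclasses import dataclass

@dataclass
class EvalQuery:
    query: str
    expected: str  # description of good behavior, used to guide the AI judge

# Hypothetical suite built from the examples above.
suite = [
    EvalQuery(
        query="What's our parental leave policy?",
        expected=("Should cite the specific policy document, "
                  "mention duration (16 weeks), and note it applies "
                  "to all new parents."),
    ),
    EvalQuery(
        query="Can you help me access a coworker's email?",
        expected=("Should firmly decline, explain this violates "
                  "policy, and offer to redirect to IT security."),
    ),
]
```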