Running Optimizations

With a registered agent and an evaluation suite, you can start an AI-driven optimization run.

The optimization loop

Each run follows a scientific experimentation process:

  1. Baseline evaluation — your live agent is queried with each training question via Power Automate (or Direct Line). These real responses are scored by an AI judge to establish a baseline.
  2. Mutation — AI generates new instruction variants using four strategies:
    • Textual Gradients — analyze failures, critique, and edit
    • Differential Evolution — combine the best ideas from top variants
    • Component Optimization — fix only the weakest-scoring dimension
    • Rule Induction — extract explicit rules from hard failure cases
  3. Simulated evaluation — each candidate variant is tested by simulating how an agent would respond using those instructions. An AI judge scores the simulated responses across accuracy, grounding, relevance, tone, and guardrails. Your live agent is not called during mutations — only the baseline uses the real agent.
  4. Selection — top performers survive; poor variants are replaced. Thompson Sampling allocates more attempts to the best-performing mutation strategy.
  5. Repeat — the loop continues until scores converge or the iteration limit is reached.
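The loop above can be sketched in a few lines. This is an illustrative reconstruction, not the tool's actual implementation: the strategy names come from the list above, while `score_fn`, `mutate_fn`, and the reward bookkeeping are assumptions. The Thompson Sampling step draws from a Beta distribution per strategy, so strategies whose variants get accepted more often receive more attempts.

```python
import random

# Mutation strategies from the list above; the rest of this sketch is
# hypothetical and only illustrates the evaluate-mutate-select cycle.
STRATEGIES = ["textual_gradients", "differential_evolution",
              "component_optimization", "rule_induction"]

def pick_strategy(wins, losses):
    """Thompson Sampling: draw Beta(wins+1, losses+1) per strategy and
    pick the largest draw, so better-performing strategies are tried
    more often while weaker ones still get occasional attempts."""
    draws = {s: random.betavariate(wins[s] + 1, losses[s] + 1)
             for s in STRATEGIES}
    return max(draws, key=draws.get)

def optimize(score_fn, mutate_fn, baseline, *, max_iters=30,
             min_improvement=0.02, patience=5):
    """Evaluate-mutate-select loop with early stopping.

    score_fn   — judge-scores a variant (simulated after the baseline)
    mutate_fn  — produces a new variant given the incumbent + strategy
    """
    wins = {s: 0 for s in STRATEGIES}
    losses = {s: 0 for s in STRATEGIES}
    best, best_score = baseline, score_fn(baseline)
    stale = 0
    for _ in range(max_iters):
        strategy = pick_strategy(wins, losses)
        candidate = mutate_fn(best, strategy)
        score = score_fn(candidate)            # simulated evaluation
        if score >= best_score * (1 + min_improvement):
            best, best_score = candidate, score
            wins[strategy] += 1
            stale = 0                          # progress resets patience
        else:
            losses[strategy] += 1
            stale += 1
        if stale >= patience:                  # early stopping
            break
    return best, best_score
```

With a toy scoring function (string length) each mutation improves the score, so the loop runs to the iteration limit; with a no-op mutation it stops after `patience` stale iterations.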

Default configuration

Max iterations:     30
Population size:    6 variants
Min improvement:    2% to accept a new variant
Patience:           5 iterations before early stopping
Judge model:        Claude Haiku 4.5
Voting rounds:      2
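The defaults above can be thought of as a single run configuration. The sketch below is illustrative (the field names are not the tool's actual schema), and it assumes the 2% minimum improvement is relative to the incumbent's score:

```python
from dataclasses import dataclass

# Hypothetical container for the documented defaults; names are
# illustrative, not the product's real configuration keys.
@dataclass(frozen=True)
class RunConfig:
    max_iterations: int = 30
    population_size: int = 6
    min_improvement: float = 0.02      # relative gain required to accept
    patience: int = 5                  # stale iterations before early stop
    judge_model: str = "Claude Haiku 4.5"
    voting_rounds: int = 2

def accepts(cfg: RunConfig, best_score: float, candidate_score: float) -> bool:
    """A candidate replaces the incumbent only if it beats it by at
    least min_improvement (assumed relative here)."""
    return candidate_score >= best_score * (1 + cfg.min_improvement)
```

For example, with a best score of 0.70, a candidate at 0.72 clears the 2% bar (0.714) but one at 0.71 does not.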

Monitoring progress

While a run is active, the experiment page shows:

  • Progress bar — current iteration, best score, and the active mutation strategy
  • Score convergence chart — a line chart showing the best score over iterations
  • Variant comparison — side-by-side diff of baseline vs. optimized instructions

Runs typically complete in 10–15 minutes.

Agent usage: Only the baseline step queries your live agent (once per training query). All mutation evaluations are simulated — no additional Copilot Credits are consumed after the baseline.

After the run

When the run converges or hits the iteration limit:

  • The best-scoring variant is highlighted with its full instruction text
  • You can review a diff against the baseline instructions
  • Per-dimension score breakdown shows where improvement was greatest
  • One-click Apply to Copilot Studio pushes the optimized instructions to your live agent (Power Automate agents only)
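The baseline-vs-optimized diff shown on the results page is an ordinary line diff of the two instruction texts. As a rough illustration (the instruction strings here are made up), Python's standard library can produce the same view:

```python
import difflib

# Hypothetical instruction texts, standing in for the baseline and
# best-scoring variant a run would produce.
baseline = """You are a support agent.
Answer questions about billing.
Be polite."""

optimized = """You are a support agent.
Answer questions about billing.
Ground every answer in the knowledge base.
Be polite and concise."""

diff_lines = list(difflib.unified_diff(
    baseline.splitlines(), optimized.splitlines(),
    fromfile="baseline", tofile="optimized", lineterm=""))
print("\n".join(diff_lines))
```

Added lines appear with a `+` prefix and removed lines with `-`, which is the same convention the side-by-side variant comparison follows.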