AxAgent Optimize Codegen Rules (@ax-llm/ax)
This skill helps an LLM generate correct AxAgent tuning and evaluation code using @ax-llm/ax. Use when the user asks about agent.optimize(…), judgeOptions, eval datasets, optimization targets, saved optimizedProgram artifacts, or agent optimization guidance.
Install
Install only this skill for TypeScript:
npx skills add https://ax-llm.github.io/ax/typescript/ --skill 'ax-agent-optimize'Published skill file: ax-agent-optimize/SKILL.md.
Source
- Source: src/ax/skills/ax-agent-optimize.md
- Version:
22.0.3
Skill Instructions
Use this skill for agent.optimize(...) workflows. Prefer short, modern, copyable patterns. Do not repeat general agent-authoring guidance unless the user needs it. For generic ax(...) or flow(...) tuning with top-level optimize(...), use the ax-gepa skill instead.
Your job is to help the model choose a good optimization setup for the user’s actual goal:
- If the user wants better tool use, prefer action-aware tasks and either a deterministic metric or the built-in judge depending on how objective the scoring is.
- If the user wants better wording only, responder optimization may be enough.
- If the user wants reusable improvements, include artifact save/load.
- If the user wants cost, tool-use, or child-agent delegation behavior improved, make the eval tasks expose those tradeoffs explicitly.
Use These Defaults
- Use
agent.optimize(...)only after the agent is already configured and runnable. - Prefer the built-in judge path first for normal agent tuning. Most users should start with tasks that include
inputandcriteria, then letagent.optimize(...)use its default actor target and judge-based metric. - Keep top-level
optimize(program, train, metric, options)for non-agent generators and flows; do not rewrite normal agent task-record examples to the generic helper. - Prefer a deterministic custom
metriconly when success is easy to score from the prediction and task record. - Add
judgeAIplusjudgeOptionswhen the judge should run on a stronger or separate model than the agent runtime model. - Only reach for a plain typed
AxGenevaluator when the user needs LLM-as-judge behavior outside the built-inagent.optimize(...)flow. - Default optimize target is the actor path; do not surface
targetunless the user clearly wants responder-only tuning or explicit program IDs. - Use eval-safe tools or in-memory mocks because optimization replays tasks many times.
- Prefer precise tool return schemas such as
f.object(...)over vaguef.json(...)whenever the agent must reason about returned fields. - Prefer task wording with canonical entity names like “the Atlas project” instead of ambiguous labels like “Atlas” when ambiguity could trigger pointless clarification.
- Save artifacts with
axSerializeOptimizedProgram(result.optimizedProgram!), then restore withaxDeserializeOptimizedProgram(saved)andagent.applyOptimization(...). - For browser-safe persistence, let the caller store the serialized JSON anywhere they want such as localStorage, IndexedDB, or a backend.
- If
bootstrapis enabled, bootstrapped demos are persisted insideresult.optimizedProgram.demos; raw failed traces are not saved in v1. - For first examples, pass a plain task array instead of splitting into
trainandvalidationunless the user already has a holdout set. - GEPA-backed
agent.optimize(...)now optimizes generic components exposed by the selected target programs;target: 'actor'only tunes actor components,target: 'responder'only tunes responder components, andtarget: 'all'broadens the component set. result.optimizedProgram.componentMapis the canonical saved artifact for agent GEPA runs. It may include actor instructions, descriptions, tool descriptions/names, templates, or runtime primitives depending on what the selected target exposes.- When child-agent delegation matters, expose the child agents as named functions and tune against realistic call/no-call tasks.
Decision Guide
Pick the optimization shape from the user’s need:
- “Make the agent use tools correctly” -> keep the default actor target and use
expectedActionsandforbiddenActions. - “Make final answers read better” -> consider
target: 'responder', but only if the task is not mostly tool-selection or clarification behavior. - “Make the whole agent better” -> use the default actor target first; only broaden target selection when the user clearly wants that extra scope.
- “Tune child-agent delegation” -> use tasks that exercise when to call the child agent, when to call normal tools, and when to answer directly.
- “Compare before and after” -> include a held-out task plus artifact save/load and replay.
Choose task design carefully:
- Prefer a small number of realistic tasks over broad but vague datasets.
- Prefer concrete criteria over generic “be helpful” language.
- Prefer explicit action expectations when correctness depends on tools, recipients, dates, or side effects.
- Prefer eval-safe mocks anytime the task touches email, scheduling, external APIs, or persistence.
Make Agents Optimizable
Optimization works much better when the agent and dataset remove avoidable ambiguity:
- Prefer typed tool outputs over free-form JSON blobs so the actor can rely on exact field names.
- Tell the actor the exact tool fields it may use when payload shape matters.
- Explicitly ban invented fields if the model has any reason to guess hidden IDs or alternate key names.
- If a child agent needs parent values, declare those fields in the child signature and pass them explicitly at the call site.
- For specialist synthesis, tell the agent what narrowed context should be passed to the child agent.
- Keep
maxSubAgentCallssmall in examples unless the user is explicitly testing broad fan-out behavior. - Use canonical, unambiguous task wording so the model does not burn turns asking for fake clarification.
- In JS-runtime agents, require raw runnable JavaScript only. Ban
javascript:prefixes, mixed prose/code, and multi-snippet turns.
Good pattern:
- tool schema says exactly what fields exist
- task names the exact entity to look up
- actor prompt says which fields to extract before calling a child agent
- metric or judge penalizes unnecessary child-agent calls and tool misuse
Bad pattern:
- tool returns
jsonwith an underspecified shape - task uses overloaded names like
Atlaswithout clarifying whether that is a project, team, or account - child agent is expected to infer hidden parent state that was never passed in its call arguments
- code agent is allowed to mix natural language with JavaScript in the same turn
Metric vs Judge
Choose the scoring path based on how objectively the task can be measured:
- Use a custom
metricwhen you can score success directly frompredictionandexample. - Use the built-in agent judge when success depends on a full-run qualitative review across tool choices, clarifications, and final output.
- Use
judgeOptions.descriptionto tell the built-in judge what to value most. - Use helper-based judge code only when the user is not inside
agent.optimize(...)and still wants LLM judging.
Quick rules:
- Tool correctness with exact expected calls or forbidden calls: prefer a deterministic metric first.
- Simple extraction or classification with known correct answers: prefer a deterministic metric.
- Open-ended assistant quality, nuanced clarification behavior, or broad synthesis quality: prefer the built-in judge.
- GEPA or optimizer flows outside agents that still need LLM judging: use a plain typed
AxGenevaluator.
Important:
- A custom
metricoverrides the built-in judge path entirely. - Do not introduce a dedicated judge abstraction in new examples; prefer a plain typed
AxGen. - Do not add both a custom
metricand judge guidance unless the user explicitly wants two separate scoring systems and understands only the custom metric drives optimization. - If the user builds a plain
AxGenjudge metric, prefer a numericscore:numberoutput over a string tier when possible. It is simpler and less fragile in practice.
Canonical Pattern
import {
AxAIGoogleGeminiModel,
AxJSRuntime,
axDefaultOptimizerLogger,
agent,
ai,
f,
fn,
axDeserializeOptimizedProgram,
axSerializeOptimizedProgram,
} from '@ax-llm/ax';
const tools = [
fn('sendEmail')
.namespace('email')
.description('Send an email message')
.arg('to', f.string('Recipient email address'))
.arg('body', f.string('Email body text'))
.returns(
f.object({
sent: f.boolean('Whether the email was sent'),
to: f.string('Recipient email address'),
})
)
.handler(async ({ to }) => ({ sent: true, to }))
.build(),
];
const studentAI = ai({
name: 'google-gemini',
apiKey: process.env.GOOGLE_APIKEY!,
config: { model: AxAIGoogleGeminiModel.Gemini25FlashLite, temperature: 0.2 },
});
const judgeAI = ai({
name: 'google-gemini',
apiKey: process.env.GOOGLE_APIKEY!,
config: { model: AxAIGoogleGeminiModel.Gemini3Pro, temperature: 1.0 },
});
const assistant = agent('query:string -> answer:string', {
ai: studentAI,
judgeAI,
contextFields: [],
runtime: new AxJSRuntime(),
functions: tools,
contextPolicy: { preset: 'checkpointed', budget: 'balanced' },
judgeOptions: {
description: 'Prefer correct tool use over polished wording.',
model: 'judge-model',
},
});
const tasks = [
{
input: { query: 'Send an email to Jim saying good morning.' },
criteria: 'Use the email tool and send the message to Jim.',
expectedActions: ['email.sendEmail'],
},
];
const result = await assistant.optimize(tasks, {
maxMetricCalls: 12,
verbose: true,
});
const saved = axSerializeOptimizedProgram(result.optimizedProgram!);
const restored = axDeserializeOptimizedProgram(saved);
assistant.applyOptimization(restored);Minimal Normal-User Pattern
Start here unless the user clearly needs a hand-built scorer:
const tasks = [
{
input: { query: 'Send an email to Jim saying good morning.' },
criteria: 'Use the email tool and send the message to Jim.',
expectedActions: ['email.sendEmail'],
},
];
const result = await assistant.optimize(tasks);
assistant.applyOptimization(result.optimizedProgram!);targetdefaults to actor optimization.metricdefaults to the built-in LLM judge.judgeAIis optional; if omitted, the agent falls back to its configured judge model or runtime model.bootstrap: trueis a good next step for tool-heavy agents when you want GEPA to start from successful traces from the provided tasks.- The one thing users still need is realistic task records with clear
criteria.
Deterministic Metric Pattern
Use this when the task has crisp correctness and cost/behavior tradeoffs:
const result = await assistant.optimize(tasks, {
target: 'actor',
metric: ({ prediction, example }) => {
if (prediction.completionType !== 'final' || !prediction.output) {
return 0;
}
let score = 0;
if (prediction.output.answer.includes('Jim')) score += 0.4;
if (
prediction.functionCalls.some(
(call) => call.qualifiedName === 'email.sendEmail'
)
) {
score += 0.4;
}
if (prediction.turnCount <= 3) {
score += 0.2;
}
return score;
},
});Use this pattern when:
- the task has a known correct answer or exact action pattern
- tool count, child-agent calls, or turn count must be measured explicitly
- you want repeatable, low-variance optimization runs
Built-In Judge Pattern
Use this when the agent behavior needs holistic review:
const result = await assistant.optimize(tasks, {
judgeAI,
judgeOptions: {
model: AxAIGoogleGeminiModel.Gemini3Pro,
description:
'Be strict about unnecessary child-agent calls, weak clarifications, and incorrect tool choices.',
},
maxMetricCalls: 12,
});Use this pattern when:
- task quality is open-ended or hard to score exactly
- the final answer quality matters together with the action trace
- the user wants a judge to consider clarifications, tool errors, and overall completion quality
Plain AxGen Judge Pattern
Use this only when the user needs LLM judging outside the built-in agent.optimize(...) path:
import { AxGen, s } from '@ax-llm/ax';
const judgeGen = new AxGen(
s(`
taskInput:json "Task input",
candidateOutput:json "Candidate output",
expectedOutput?:json "Optional reference output"
->
score:number "Normalized score from 0 to 1"
`)
);
judgeGen.setInstruction(
'Score the candidate output from 0 to 1. Reward correctness and task completion. Return only the score field.'
);
const metric = async ({ prediction, example }) => {
const result = await judgeGen.forward(judgeAI, {
taskInput: example,
candidateOutput: prediction,
expectedOutput: example.expectedOutput,
});
return Math.max(0, Math.min(1, result.score));
};
const result = await optimizer.compile(program, train, metric, {
validationExamples: validation,
});Use this pattern when:
- the user is optimizing an
AxGen, flow, or another program directly - the user wants LLM judging without the higher-level
agent.optimize(...)wrapper - the user wants to inspect judge results directly, not just a numeric score
Dataset And Judge Rules
- Pass already-loaded tasks. Do not invent a benchmark loader unless the user asks for one.
- Use
expectedActionsandforbiddenActionswhen tool correctness matters. judgeOptionsmirrors normal forward options and supports extra judge guidance throughdescription.- The built-in judge scores from the full agent run, not just the final reply. It can see completion type, clarification payload, final output, action log, normalized function calls, tool errors, and turn count.
- If the user provides a custom
metric, that overrides the built-in judge path. - If the user provides an LLM-based custom metric, keep the output schema as small as possible and prefer a direct numeric score.
Decision rules:
- Prefer a custom metric when the user has deterministic business scoring, exact action expectations, or explicit cost tradeoffs.
- Prefer the built-in judge when the user wants practical assistant-quality tuning and does not already have a trusted metric.
- Prefer a plain typed
AxGenevaluator when the user is not callingagent.optimize(...)but still wants LLM judging. - Prefer
judgeOptions.descriptionto steer the judge toward the user’s real priority, such as tool correctness, brevity, groundedness, or policy compliance.
Eval Semantics
agent.optimize(...)runs each evaluation rollout from a clean continuation state.- Saved runtime state from
getState()andsetState(...)is not used during eval rollouts. - During optimize/eval,
askClarification(...)is treated as a scored evaluation outcome instead of going through the responder. - For clarification outcomes in custom metrics, expect
prediction.completionType === 'askClarification', populatedprediction.clarification, and absentprediction.output. - For final outcomes in custom metrics, expect
prediction.completionType === 'final'and populatedprediction.output. target: 'responder'still works, but clarification-heavy tasks are usually low-signal for responder optimization.
Delegation Optimization Notes
- Prefer explicit child agents in
functions: [...]for specialist delegation. Their calls appear as normal function-call records. - When delegation behavior matters, tune against the same child-agent/tool structure you expect in production.
- Tell the actor which fields to pass to the child agent and which tasks should stay local.
- For synthesis-style tasks, specify the desired delegation pattern explicitly, for example “call
team.writer(...)only after narrowing tool output in JS.” - Penalize unnecessary child-agent calls directly in the metric or judge prompt.
- If one training task keeps collapsing to zero, inspect that task first instead of adding more optimizer rounds. Most failures come from task ambiguity, weak tool schemas, or vague delegation guidance rather than GEPA itself.
Artifacts And Replay
- Save
result.optimizedProgramif the user wants portable artifacts. - Restore artifacts with
new AxOptimizedProgramImpl(...), then callagent.applyOptimization(...). - Preserve the full optimized program when saving GEPA artifacts;
componentMapreapplies the learned strings. - For demonstrations, use fresh eval-safe tool state for baseline, optimize, and restored replay so side effects do not leak across phases.
- If the user wants to show improvement, run a held-out task before optimization, then replay it on a freshly restored optimized agent.
Examples
- RLM Agent Optimize — Gemini office-assistant tuning with save/load
- AxAgent GEPA Component Optimization — compact support-agent GEPA run with deterministic metric and artifact replay
Do Not Generate
- Do not optimize against production tools with real side effects unless the user explicitly wants that.
- Do not recommend responder-only optimization by default for clarification-heavy workflows.
- Do not omit artifact save/load steps when the user asks for reusable optimized configurations.
- Do not introduce a dedicated judge class or helper abstraction in new agent-optimize examples; prefer the built-in judge path or a plain typed
AxGen. - Do not rely on vague
jsontool returns when the agent must reason about specific fields across tool or child-agent calls. - Do not leave child-agent inputs implicit. If the child needs a fact, pass it explicitly.
- Do not let code-generation agents mix prose and JavaScript if the user is optimizing runtime behavior.