AxAgent Optimize Codegen Rules (@ax-llm/ax)

This skill helps an LLM generate correct AxAgent tuning and evaluation code using @ax-llm/ax. Use when the user asks about agent.optimize(…), judgeOptions, eval datasets, optimization targets, saved optimizedProgram artifacts, or agent optimization guidance.

Install

Install only this skill for TypeScript:

Shell

npx skills add https://ax-llm.github.io/ax/typescript/ --skill 'ax-agent-optimize'

Published skill file: ax-agent-optimize/SKILL.md.

Source

Source: src/ax/skills/ax-agent-optimize.md
Version: 23.0.5

Skill Instructions

Use this skill for agent.optimize(...) workflows. Prefer short, modern, copyable patterns. Do not repeat general agent-authoring guidance unless the user needs it. For generic ax(...) or flow(...) tuning with top-level optimize(...), use the ax-gepa skill instead.

Your job is to help the model choose a good optimization setup for the user’s actual goal:

If the user wants better tool use, prefer action-aware tasks and either a deterministic metric or the built-in judge depending on how objective the scoring is.
If the user wants better wording only, responder optimization may be enough.
If the user wants reusable improvements, include artifact save/load.
If the user wants cost, tool-use, or child-agent delegation behavior improved, make the eval tasks expose those tradeoffs explicitly.

Use These Defaults

Use agent.optimize(...) only after the agent is already configured and runnable.
Prefer the built-in judge path first for normal agent tuning. Most users should start with tasks that include input and criteria, then let agent.optimize(...) use its default actor target and judge-based metric.
Keep top-level optimize(program, train, metric, options) for non-agent generators and flows; do not rewrite normal agent task-record examples to the generic helper.
Prefer a deterministic custom metric only when success is easy to score from the prediction and task record.
Add judgeAI plus judgeOptions when the judge should run on a stronger or separate model than the agent runtime model.
Only reach for a plain typed AxGen evaluator when the user needs LLM-as-judge behavior outside the built-in agent.optimize(...) flow.
Default optimize target is the actor path; do not surface target unless the user clearly wants responder-only tuning or explicit program IDs.
Use eval-safe tools or in-memory mocks because optimization replays tasks many times.
Prefer precise tool return schemas such as f.object(...) over vague f.json(...) whenever the agent must reason about returned fields.
Prefer task wording with canonical entity names like “the Atlas project” instead of ambiguous labels like “Atlas” when ambiguity could trigger pointless clarification.
Save artifacts with axSerializeOptimizedProgram(result.optimizedProgram!), then restore with axDeserializeOptimizedProgram(saved) and agent.applyOptimization(...).
For browser-safe persistence, let the caller store the serialized JSON anywhere they want such as localStorage, IndexedDB, or a backend.
If bootstrap is enabled, bootstrapped demos are persisted inside result.optimizedProgram.demos; raw failed traces are not saved in v1.
Auto-promoted context fields (large undeclared inputs kept runtime-only by autoUpgrade) appear in captured traces/demos as their truncated preview string, not the full value — same as declared truncate-style contextFields. This is expected; do not treat the shortened value as a bug in the saved demos.
For first examples, pass a plain task array instead of splitting into train and validation unless the user already has a holdout set.
GEPA-backed agent.optimize(...) now optimizes generic components exposed by the selected target programs; target: 'actor' only tunes actor components, target: 'responder' only tunes responder components, and target: 'all' broadens the component set.
result.optimizedProgram.componentMap is the canonical saved artifact for agent GEPA runs. It may include actor instructions, descriptions, tool descriptions/names, templates, or runtime primitives depending on what the selected target exposes.
When child-agent delegation matters, expose the child agents as named functions and tune against realistic call/no-call tasks.

Decision Guide

Pick the optimization shape from the user’s need:

“Make the agent use tools correctly” -> keep the default actor target and use expectedActions and forbiddenActions.
“Make final answers read better” -> consider target: 'responder', but only if the task is not mostly tool-selection or clarification behavior.
“Make the whole agent better” -> use the default actor target first; only broaden target selection when the user clearly wants that extra scope.
“Tune child-agent delegation” -> use tasks that exercise when to call the child agent, when to call normal tools, and when to answer directly.
“Compare before and after” -> include a held-out task plus artifact save/load and replay.
“Repair the tasks it keeps failing, without eroding what works” -> this is playbook territory, not GEPA: use the agent-bound playbook evolve method (in TypeScript, agent.playbook().evolve(dataset)) to mine failures into verified playbook bullets under a held-out gate. Python, Java, C++, Go, and Rust expose the same loop with native method casing; see ax-playbook. optimize(...) maximizes a metric by tuning instructions and demos; playbook evolution grows durable rules.

Choose task design carefully:

Prefer a small number of realistic tasks over broad but vague datasets.
Prefer concrete criteria over generic “be helpful” language.
Prefer explicit action expectations when correctness depends on tools, recipients, dates, or side effects.
Prefer eval-safe mocks anytime the task touches email, scheduling, external APIs, or persistence.

Make Agents Optimizable

Optimization works much better when the agent and dataset remove avoidable ambiguity:

Prefer typed tool outputs over free-form JSON blobs so the actor can rely on exact field names.
Tell the actor the exact tool fields it may use when payload shape matters.
Explicitly ban invented fields if the model has any reason to guess hidden IDs or alternate key names.
If a child agent needs parent values, declare those fields in the child signature and pass them explicitly at the call site.
For specialist synthesis, tell the agent what narrowed context should be passed to the child agent.
Keep maxSubAgentCalls small in examples unless the user is explicitly testing broad fan-out behavior.
Use canonical, unambiguous task wording so the model does not burn turns asking for fake clarification.
In JS-runtime agents, require raw runnable JavaScript only. Ban javascript: prefixes, mixed prose/code, and multi-snippet turns.

Good pattern:

tool schema says exactly what fields exist
task names the exact entity to look up
actor prompt says which fields to extract before calling a child agent
metric or judge penalizes unnecessary child-agent calls and tool misuse

Bad pattern:

tool returns json with an underspecified shape
task uses overloaded names like Atlas without clarifying whether that is a project, team, or account
child agent is expected to infer hidden parent state that was never passed in its call arguments
code agent is allowed to mix natural language with JavaScript in the same turn

Metric vs Judge

Choose the scoring path based on how objectively the task can be measured:

Use a custom metric when you can score success directly from prediction and example.
Use the built-in agent judge when success depends on a full-run qualitative review across tool choices, clarifications, and final output.
Use judgeOptions.description to tell the built-in judge what to value most.
Use helper-based judge code only when the user is not inside agent.optimize(...) and still wants LLM judging.

Quick rules:

Tool correctness with exact expected calls or forbidden calls: prefer a deterministic metric first.
Simple extraction or classification with known correct answers: prefer a deterministic metric.
Open-ended assistant quality, nuanced clarification behavior, or broad synthesis quality: prefer the built-in judge.
GEPA or optimizer flows outside agents that still need LLM judging: use a plain typed AxGen evaluator.

Important:

A custom metric overrides the built-in judge path entirely.
Do not introduce a dedicated judge abstraction in new examples; prefer a plain typed AxGen.
Do not add both a custom metric and judge guidance unless the user explicitly wants two separate scoring systems and understands only the custom metric drives optimization.
If the user builds a plain AxGen judge metric, prefer a numeric score:number output over a string tier when possible. It is simpler and less fragile in practice.

Canonical Pattern

TypeScript

import {
  AxAIGoogleGeminiModel,
  AxJSRuntime,
  axDefaultOptimizerLogger,
  agent,
  ai,
  f,
  fn,
  axDeserializeOptimizedProgram,
  axSerializeOptimizedProgram,
} from '@ax-llm/ax';

const tools = [
  fn('sendEmail')
    .namespace('email')
    .description('Send an email message')
    .arg('to', f.string('Recipient email address'))
    .arg('body', f.string('Email body text'))
    .returns(
      f.object({
        sent: f.boolean('Whether the email was sent'),
        to: f.string('Recipient email address'),
      })
    )
    .handler(async ({ to }) => ({ sent: true, to }))
    .build(),
];

const studentAI = ai({
  name: 'google-gemini',
  apiKey: process.env.GOOGLE_APIKEY!,
  config: { model: AxAIGoogleGeminiModel.Gemini31FlashLite, temperature: 0.2 },
});

const judgeAI = ai({
  name: 'google-gemini',
  apiKey: process.env.GOOGLE_APIKEY!,
  config: { model: AxAIGoogleGeminiModel.Gemini35Flash, temperature: 1.0 },
});

const assistant = agent('query:string -> answer:string', {
  ai: studentAI,
  judgeAI,
  contextFields: [],
  runtime: new AxJSRuntime(),
  functions: tools,
  contextPolicy: { preset: 'checkpointed', budget: 'balanced' },
  judgeOptions: {
    description: 'Prefer correct tool use over polished wording.',
    model: 'judge-model',
  },
});

const tasks = [
  {
    input: { query: 'Send an email to Jim saying good morning.' },
    criteria: 'Use the email tool and send the message to Jim.',
    expectedActions: ['email.sendEmail'],
  },
];

const result = await assistant.optimize(tasks, {
  maxMetricCalls: 12,
  verbose: true,
});

const saved = axSerializeOptimizedProgram(result.optimizedProgram!);
const restored = axDeserializeOptimizedProgram(saved);
assistant.applyOptimization(restored);

Minimal Normal-User Pattern

Start here unless the user clearly needs a hand-built scorer:

TypeScript

const tasks = [
  {
    input: { query: 'Send an email to Jim saying good morning.' },
    criteria: 'Use the email tool and send the message to Jim.',
    expectedActions: ['email.sendEmail'],
  },
];

const result = await assistant.optimize(tasks);
assistant.applyOptimization(result.optimizedProgram!);

target defaults to actor optimization.
metric defaults to the built-in LLM judge.
judgeAI is optional; if omitted, the agent falls back to its configured judge model or runtime model.
bootstrap: true is a good next step for tool-heavy agents when you want GEPA to start from successful traces from the provided tasks.
The one thing users still need is realistic task records with clear criteria.

Deterministic Metric Pattern

Use this when the task has crisp correctness and cost/behavior tradeoffs:

TypeScript

const result = await assistant.optimize(tasks, {
  target: 'actor',
  metric: ({ prediction, example }) => {
    if (prediction.completionType !== 'final' || !prediction.output) {
      return 0;
    }

    let score = 0;

    if (prediction.output.answer.includes('Jim')) score += 0.4;

    if (
      prediction.functionCalls.some(
        (call) => call.qualifiedName === 'email.sendEmail'
      )
    ) {
      score += 0.4;
    }

    if (prediction.turnCount <= 3) {
      score += 0.2;
    }

    return score;
  },
});

Use this pattern when:

the task has a known correct answer or exact action pattern
tool count, child-agent calls, or turn count must be measured explicitly
you want repeatable, low-variance optimization runs

Built-In Judge Pattern

Use this when the agent behavior needs holistic review:

TypeScript

const result = await assistant.optimize(tasks, {
  judgeAI,
  judgeOptions: {
    model: AxAIGoogleGeminiModel.Gemini35Flash,
    description:
      'Be strict about unnecessary child-agent calls, weak clarifications, and incorrect tool choices.',
  },
  maxMetricCalls: 12,
});

Use this pattern when:

task quality is open-ended or hard to score exactly
the final answer quality matters together with the action trace
the user wants a judge to consider clarifications, tool errors, and overall completion quality

Plain `AxGen` Judge Pattern

Use this only when the user needs LLM judging outside the built-in agent.optimize(...) path:

TypeScript

import { AxGen, s } from '@ax-llm/ax';

const judgeGen = new AxGen(
  s(`
    taskInput:json "Task input",
    candidateOutput:json "Candidate output",
    expectedOutput?:json "Optional reference output"
    ->
    score:number "Normalized score from 0 to 1"
  `)
);
judgeGen.setInstruction(
  'Score the candidate output from 0 to 1. Reward correctness and task completion. Return only the score field.'
);

const metric = async ({ prediction, example }) => {
  const result = await judgeGen.forward(judgeAI, {
    taskInput: example,
    candidateOutput: prediction,
    expectedOutput: example.expectedOutput,
  });

  return Math.max(0, Math.min(1, result.score));
};

const result = await optimizer.compile(program, train, metric, {
  validationExamples: validation,
});

Use this pattern when:

the user is optimizing an AxGen, flow, or another program directly
the user wants LLM judging without the higher-level agent.optimize(...) wrapper
the user wants to inspect judge results directly, not just a numeric score

Dataset And Judge Rules

Pass already-loaded tasks. Do not invent a benchmark loader unless the user asks for one.
Use expectedActions and forbiddenActions when tool correctness matters.
judgeOptions mirrors normal forward options and supports extra judge guidance through description.
The built-in judge scores from the full agent run, not just the final reply. It can see completion type, clarification payload, final output, action log, normalized function calls, tool errors, and turn count.
If the user provides a custom metric, that overrides the built-in judge path.
If the user provides an LLM-based custom metric, keep the output schema as small as possible and prefer a direct numeric score.

Decision rules:

Prefer a custom metric when the user has deterministic business scoring, exact action expectations, or explicit cost tradeoffs.
Prefer the built-in judge when the user wants practical assistant-quality tuning and does not already have a trusted metric.
Prefer a plain typed AxGen evaluator when the user is not calling agent.optimize(...) but still wants LLM judging.
Prefer judgeOptions.description to steer the judge toward the user’s real priority, such as tool correctness, brevity, groundedness, or policy compliance.

Eval Semantics

MCP/UCP evaluation defaults to replay or sandbox mode. A live client is rejected unless mcpEvaluation: 'live' is explicit.
Use ax-mcp for recording/replay transport setup and MCP side-effect policy.
Use AxMCPRecordingTransport to capture a real session once and AxMCPReplayTransport for deterministic optimization/evaluation.
Replay normalized MCP notifications and task transitions through AxEventRuntime; do not leave a live subscription active in a default optimization run.
Action traces include qualified MCP/UCP operations, approvals, task transitions, raw protocol errors, and business outcomes for judges and deterministic metrics.
agent.optimize(...) runs each evaluation rollout from a clean continuation state.
Saved runtime state from getState() and setState(...) is not used during eval rollouts.
During optimize/eval, askClarification(...) is treated as a scored evaluation outcome instead of going through the responder.
For clarification outcomes in custom metrics, expect prediction.completionType === 'askClarification', populated prediction.clarification, and absent prediction.output.
For final outcomes in custom metrics, expect prediction.completionType === 'final' and populated prediction.output.
target: 'responder' still works, but clarification-heavy tasks are usually low-signal for responder optimization.

Delegation Optimization Notes

Prefer explicit child agents in functions: [...] for specialist delegation. Their calls appear as normal function-call records.
When delegation behavior matters, tune against the same child-agent/tool structure you expect in production.
Tell the actor which fields to pass to the child agent and which tasks should stay local.
For synthesis-style tasks, specify the desired delegation pattern explicitly, for example “call team.writer(...) only after narrowing tool output in JS.”
Penalize unnecessary child-agent calls directly in the metric or judge prompt.
If one training task keeps collapsing to zero, inspect that task first instead of adding more optimizer rounds. Most failures come from task ambiguity, weak tool schemas, or vague delegation guidance rather than GEPA itself.

Artifacts And Replay

Save result.optimizedProgram if the user wants portable artifacts.
Restore artifacts with new AxOptimizedProgramImpl(...), then call agent.applyOptimization(...).
Preserve the full optimized program when saving GEPA artifacts; componentMap reapplies the learned strings.
For demonstrations, use fresh eval-safe tool state for baseline, optimize, and restored replay so side effects do not leak across phases.
If the user wants to show improvement, run a held-out task before optimization, then replay it on a freshly restored optimized agent.

Examples

RLM Agent Optimize — Gemini office-assistant tuning with save/load
AxAgent GEPA Component Optimization — compact support-agent GEPA run with deterministic metric and artifact replay

Do Not Generate

Do not optimize against production tools with real side effects unless the user explicitly wants that.
Do not recommend responder-only optimization by default for clarification-heavy workflows.
Do not omit artifact save/load steps when the user asks for reusable optimized configurations.
Do not introduce a dedicated judge class or helper abstraction in new agent-optimize examples; prefer the built-in judge path or a plain typed AxGen.
Do not rely on vague json tool returns when the agent must reason about specific fields across tool or child-agent calls.
Do not leave child-agent inputs implicit. If the child needs a fact, pass it explicitly.
Do not let code-generation agents mix prose and JavaScript if the user is optimizing runtime behavior.

AxAgent Optimize Codegen Rules (@ax-llm/ax)

Install

Source

Skill Instructions

Use These Defaults

Decision Guide

Make Agents Optimizable

Metric vs Judge

Canonical Pattern

Minimal Normal-User Pattern

Deterministic Metric Pattern

Built-In Judge Pattern

Plain AxGen Judge Pattern

Dataset And Judge Rules

Eval Semantics

Delegation Optimization Notes

Artifacts And Replay

Examples

Do Not Generate

Plain `AxGen` Judge Pattern