Audio I/O Codegen Rules (@ax-llm/ax)

This skill helps an LLM generate correct audio code with @ax-llm/ax. Use when the user asks about ai.transcribe(), ai.speak(), signature audio inputs or outputs, agent audio behavior, .chat() conversational audio, OpenAI audio or realtime models, Gemini Live native audio, Grok Voice Agent models, voices, formats, transcripts, or how audio fits with structured outputs.

Install

Install only this skill for TypeScript:

Shell

npx skills add https://ax-llm.github.io/ax/typescript/ --skill 'ax-audio'

Published skill file: ax-audio/SKILL.md.

Source

Source: src/ax/skills/ax-audio.md
Version: 23.0.5

Skill Instructions

Use this skill for audio in Ax. Pick the smallest audio surface that matches the job:

Use ai.transcribe(...) for batch speech-to-text.
Use ai.speak(...) for batch text-to-speech.
Use speech:audio signature outputs for structured programs that should return synthesized audio artifacts.
Use .chat() audio config for conversational or realtime audio turns.

Core Rules

Input :audio is an audio input value: { data, format?, mimeType?, sampleRate?, channels? }.
Output :audio is a scripted audio artifact. The model returns plain text for that field; Ax synthesizes it after structured output parsing.
Output audio JSON schema is model-facing string, not a binary object.
Agents transcribe input audio fields before planner/executor/responder stages by default, so agent stages see text instead of base64 audio.
Realtime and conversational audio still use .chat() and modelConfig.audio.
Batch signature audio artifacts use forward-time speech options, not modelConfig.audio.

Direct Batch APIs

TypeScript

import { ai } from '@ax-llm/ax';

const llm = ai({ name: 'openai', apiKey: process.env.OPENAI_APIKEY! });

const transcript = await llm.transcribe({
  audio: { data: base64Wav, format: 'wav' },
  model: 'gpt-4o-mini-transcribe',
  language: 'en',
  prompt: 'Product support call',
});

const speech = await llm.speak({
  text: transcript.text,
  model: 'gpt-4o-mini-tts',
  voice: 'alloy',
  format: 'mp3',
});

console.log(transcript.text);
console.log(speech.data);
console.log(speech.transcript);

Providers without the requested batch audio capability throw AxMediaNotSupportedError.

Signature Audio Artifacts

TypeScript

import { ai, ax } from '@ax-llm/ax';

const llm = ai({ name: 'openai', apiKey: process.env.OPENAI_APIKEY! });
const say = ax('question:string -> speech:audio, summary:string');

const result = await say.forward(
  llm,
  { question: 'Explain retries in one sentence.' },
  {
    speech: {
      speak: { voice: 'alloy', format: 'mp3' },
      fields: {
        speech: { voice: 'alloy' },
      },
    },
  }
);

console.log(result.summary);
console.log(result.speech.data);
console.log(result.speech.mimeType);
console.log(result.speech.transcript);

The model emits a text script for speech; Ax replaces it with AxChatAudioOutput after result selection. If the field already contains an audio artifact with { data } or { id }, Ax leaves it alone.

Agent Audio Inputs

TypeScript

import { agent, ai } from '@ax-llm/ax';

const llm = ai({ name: 'openai', apiKey: process.env.OPENAI_APIKEY! });

const voiceAgent = agent(
  'recording:audio, question:string -> speech:audio, summary:string',
  {
    agentIdentity: {
      name: 'Voice Assistant',
      description: 'Answers spoken requests with spoken and written output',
    },
    contextFields: [],
  }
);

const result = await voiceAgent.forward(
  llm,
  {
    recording: { data: base64Wav, format: 'wav' },
    question: 'What should I do next?',
  },
  {
    speech: {
      transcribe: { model: 'gpt-4o-mini-transcribe' },
      speak: { voice: 'alloy', format: 'mp3' },
    },
  }
);

console.log(result.summary);
console.log(result.speech.data);

The agent runtime transcribes recording first and passes the transcript through the internal agent stages. Use direct ax(...) or .chat() when you specifically want native audio understanding in the model call.

Conversational `.chat()` Audio

Use modelConfig.audio for conversational audio turns where audio is part of the chat response instead of a structured signature field.

TypeScript

const res = await llm.chat({
  chatPrompt: [{ role: 'user', content: 'Say hello out loud.' }],
  modelConfig: {
    audio: { output: { enabled: true, voice: 'alloy', format: 'wav' } },
  },
});

console.log(res.results[0]?.content);
console.log(res.results[0]?.audio?.data);
console.log(res.results[0]?.audio?.transcript);

Config Shape

TypeScript

type AxAudioFormat =
  | 'wav'
  | 'mp3'
  | 'flac'
  | 'opus'
  | 'aac'
  | 'pcm16'
  | 'pcm'
  | 'ogg'
  | 'raw'
  | 'mulaw'
  | 'ulaw'
  | 'alaw';

type AxSpeechConfig = {
  transcribe?: {
    model?: string;
    language?: string;
    prompt?: string;
  };
  speak?: {
    model?: string;
    voice?: string;
    format?: AxAudioFormat;
  };
  fields?: Record<
    string,
    {
      model?: string;
      voice?: string;
      format?: AxAudioFormat;
    }
  >;
};

OpenAI Defaults

Use axAIOpenAIAudioDefaultConfig() for OpenAI request-based audio chat:

model: gpt-audio-mini
output enabled
voice: alloy
output format: wav
transcript enabled
streaming disabled by default
audio input formats: wav, mp3
audio output formats: wav, mp3, flac, opus, aac, pcm16

TypeScript

import { ai, axAIOpenAIAudioDefaultConfig } from '@ax-llm/ax';

const openai = ai({
  name: 'openai',
  apiKey: process.env.OPENAI_APIKEY!,
  config: axAIOpenAIAudioDefaultConfig(),
});

const res = await openai.chat({
  chatPrompt: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What is in this recording?' },
        { type: 'audio', data: base64Wav, format: 'wav' },
      ],
    },
  ],
});

console.log(res.results[0]?.content);
console.log(res.results[0]?.audio?.data);

Use axAIOpenAIRealtimeDefaultConfig() for OpenAI realtime speech-to-speech:

model: gpt-realtime-2
output enabled
voice: marin
output format: pcm16
input default: audio/pcm, mono, 24000 Hz
turn timeout: 30000
streaming disabled by default

Use axAIOpenAIRealtimeTranscriptionDefaultConfig() for realtime transcript deltas:

model: gpt-realtime-whisper
input default: audio/pcm, mono, 24000 Hz
output audio disabled; transcript text is returned on content

Realtime models use a one-turn WebSocket call under .chat(). In Node, pass a WebSocket constructor through request options:

TypeScript

import WebSocket from 'ws';
import { ai, axAIOpenAIRealtimeDefaultConfig } from '@ax-llm/ax';

const openai = ai({
  name: 'openai',
  apiKey: process.env.OPENAI_APIKEY!,
  config: axAIOpenAIRealtimeDefaultConfig(),
});

const stream = await openai.chat(
  {
    chatPrompt: [{ role: 'user', content: 'Say hello out loud.' }],
  },
  { stream: true, webSocket: WebSocket }
);

For follow-up turns, keep the assistant audio reference in history:

TypeScript

await openai.chat({
  chatPrompt: [
    { role: 'assistant', audio: { id: previousAudioId } },
    { role: 'user', content: 'Repeat that more slowly.' },
  ],
});

Gemini Live Defaults

Use axAIGoogleGeminiLiveAudioDefaultConfig() for Gemini native audio:

model: gemini-2.5-flash-native-audio-preview-12-2025
output enabled
voice: Kore
output format: pcm16
output sample rate: 24000
input default: audio/pcm;rate=16000, mono
transcript enabled
turn timeout: 30000
streaming disabled by default

TypeScript

import { ai, axAIGoogleGeminiLiveAudioDefaultConfig } from '@ax-llm/ax';

const gemini = ai({
  name: 'google-gemini',
  apiKey: process.env.GOOGLE_APIKEY!,
  config: axAIGoogleGeminiLiveAudioDefaultConfig(),
});

const res = await gemini.chat({
  chatPrompt: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Answer this spoken question.' },
        {
          type: 'audio',
          data: base64Pcm16,
          format: 'pcm16',
          sampleRate: 16000,
          channels: 1,
        },
      ],
    },
  ],
});

console.log(res.results[0]?.content);
console.log(res.results[0]?.audio?.data);

Gemini Live uses a one-turn WebSocket call under .chat(). It expects PCM input for native audio turns; use format: 'pcm16' or mimeType: 'audio/pcm;rate=16000'.

Grok Voice Defaults

Use axAIGrokVoiceDefaultConfig() for xAI Grok Voice Agent:

model: grok-voice-think-fast-1.0
output enabled
voice: eve
output format: pcm16
output sample rate: 24000
input default: audio/pcm, mono, 24000 Hz
transcript enabled
turn timeout: 30000
streaming disabled by default

TypeScript

import WebSocket from 'ws';
import { ai, axAIGrokVoiceDefaultConfig } from '@ax-llm/ax';

const grok = ai({
  name: 'grok',
  apiKey: process.env.GROK_API_KEY!,
  config: axAIGrokVoiceDefaultConfig(),
});

const res = await grok.chat(
  {
    chatPrompt: [{ role: 'user', content: 'Say hello out loud.' }],
  },
  { webSocket: WebSocket }
);

console.log(res.results[0]?.content);
console.log(res.results[0]?.audio?.data);

Grok Voice uses a one-turn WebSocket call under .chat(). It expects PCM input for spoken input turns; use format: 'pcm16' or mimeType: 'audio/pcm'.

Streaming Audio

OpenAI audio chat, OpenAI Realtime, Gemini Live, and Grok Voice all default to non-streaming, but each can stream deltas when you pass { stream: true }.

TypeScript

const stream = await llm.chat(
  {
    chatPrompt: [{ role: 'user', content: 'Say hello.' }],
  },
  { stream: true }
);

for await (const chunk of stream) {
  const audio = chunk.results[0]?.audio;
  if (audio?.isDelta) {
    playAudioChunk(audio.data);
  }
}

Structured Outputs

Use signature audio outputs for structured speech artifacts:

TypeScript

const gen = ax('question:string -> answer:string, speech:audio');

Use .chat() audio when the response itself is a conversational audio turn. Do not combine .chat() audio output with provider-native structured response formats unless that provider explicitly supports the combination.

Audio I/O Codegen Rules (@ax-llm/ax)

Install

Source

Skill Instructions

Core Rules

Direct Batch APIs

Signature Audio Artifacts

Agent Audio Inputs

Conversational .chat() Audio

Config Shape

OpenAI Defaults

Gemini Live Defaults

Grok Voice Defaults

Streaming Audio

Structured Outputs

Conversational `.chat()` Audio