Vapi · Schema

AssistantSpeechWordProgressTiming

AI, Voice Agents, Realtime, CPaaS

Properties

Name	Type	Description
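type	string	Discriminator for cursor-based word progress (e.g. Minimax subtitle data). The only allowed value is "word-progress". Required.
wordsSpoken	number	Number of words spoken so far in this turn. Required.
totalWords	number	Total number of words sent to the TTS provider for this turn. Required. Important: this value grows across events within a single turn because Minimax synthesizes audio incrementally as the LLM streams tokens. Treat it as "best known total so far"; it stabilizes once synthesis is complete. A value of 0 is a valid sentinel meaning "not yet known", which can occur on the very first `assistant-speech` event of a turn if audio begins playing before the TTS provider has confirmed word-count data. Clients must guard against divide-by-zero when computing a progress fraction.
segment	string	The text of the latest spoken segment (sentence or clause). Use this for caption display: it corresponds to the chunk just confirmed by the TTS provider, unlike `text` on the parent message, which carries the full turn text.
segmentDurationMs	number	Audio duration in milliseconds for the latest spoken segment. Pair with `segment` to animate karaoke-style word reveals: divide the segment text across this duration for approximate per-word timing.
words	array	Per-word timestamps for the latest spoken segment, as an array of AssistantSpeechWordTimestamp items. Available when the TTS provider supports word-level timing (e.g. Minimax with subtitle_type: "word"). Syllables from the provider are aggregated into whole words with start/end times relative to the segment start. Use these for precise karaoke-style highlighting instead of interpolating from segmentDurationMs.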
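
The `segment` and `segmentDurationMs` descriptions suggest spreading the segment text across its audio duration when per-word timestamps in `words` are not available. The following is a minimal TypeScript sketch of that idea, not an official client. The names `WordProgressTiming`, `ApproxWordTiming`, and `approximateWordTimings` are hypothetical, and weighting each word by its character count is an assumed heuristic; the schema only says the timing is approximate.

```typescript
// Shape mirroring the properties listed above. Items of `words` are left
// as `unknown` because AssistantSpeechWordTimestamp is a separate schema.
interface WordProgressTiming {
  type: "word-progress";
  wordsSpoken: number;
  totalWords: number;
  segment?: string;
  segmentDurationMs?: number;
  words?: unknown[];
}

// Hypothetical helper type for this sketch; not part of the schema.
interface ApproxWordTiming {
  word: string;
  startMs: number; // offset from the start of the segment
  endMs: number;   // offset from the start of the segment
}

// Splits the segment on whitespace and gives each word a share of the
// duration proportional to its character count (an assumed heuristic).
function approximateWordTimings(
  segment: string,
  segmentDurationMs: number,
): ApproxWordTiming[] {
  const words = segment.trim().split(/\s+/).filter((w) => w.length > 0);
  const totalChars = words.reduce((sum, w) => sum + w.length, 0);
  if (totalChars === 0 || segmentDurationMs <= 0) return [];

  const timings: ApproxWordTiming[] = [];
  let charsSoFar = 0;
  for (const word of words) {
    const startMs = (charsSoFar / totalChars) * segmentDurationMs;
    charsSoFar += word.length;
    const endMs = (charsSoFar / totalChars) * segmentDurationMs;
    timings.push({ word, startMs, endMs });
  }
  return timings;
}

// Usage: fall back to interpolation only when both fields are present.
const sampleEvent: WordProgressTiming = {
  type: "word-progress",
  wordsSpoken: 2,
  totalWords: 5,
  segment: "Hello there",
  segmentDurationMs: 800,
};
if (sampleEvent.segment !== undefined && sampleEvent.segmentDurationMs !== undefined) {
  const timings = approximateWordTimings(sampleEvent.segment, sampleEvent.segmentDurationMs);
  console.log(timings.map((t) => `${t.word}: ${Math.round(t.startMs)}-${Math.round(t.endMs)} ms`));
}
```

When `words` is present, its provider-supplied timestamps are the better source; this interpolation is only a fallback.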


JSON Schema

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "#/components/schemas/AssistantSpeechWordProgressTiming",
  "title": "AssistantSpeechWordProgressTiming",
  "type": "object",
  "properties": {
    "type": {
      "type": "string",
      "description": "Discriminator for cursor-based word progress (e.g. Minimax subtitle data).",
      "enum": [
        "word-progress"
      ]
    },
    "wordsSpoken": {
      "type": "number",
      "description": "Number of words spoken so far in this turn."
    },
    "totalWords": {
      "type": "number",
      "description": "Total number of words sent to the TTS provider for this turn.\n\n**Important**: this value grows across events within a single turn because\nMinimax synthesizes audio incrementally as the LLM streams tokens. Treat\nit as \"best known total so far\" \u2014 it will stabilize once synthesis is\ncomplete.\n\nA value of `0` is a valid sentinel meaning \"not yet known\". This can occur\non the very first `assistant-speech` event of a turn if audio begins\nplaying before the TTS provider has confirmed word-count data. Clients\n**must** guard against divide-by-zero when computing a progress fraction:\n\n```ts\nconst pct = totalWords > 0 ? wordsSpoken / totalWords : 0;\n```"
    },
    "segment": {
      "type": "string",
      "description": "The text of the latest spoken segment (sentence or clause). Use this\nfor caption display \u2014 it corresponds to the chunk just confirmed by\nthe TTS provider, unlike `text` on the parent message which carries\nthe full turn text."
    },
    "segmentDurationMs": {
      "type": "number",
      "description": "Audio duration in milliseconds for the latest spoken segment. Pair\nwith `segment` to animate karaoke-style word reveals \u2014 divide the\nsegment text across this duration for approximate per-word timing."
    },
    "words": {
      "description": "Per-word timestamps for the latest spoken segment. Available when the\nTTS provider supports word-level timing (e.g. Minimax with\nsubtitle_type: \"word\"). Syllables from the provider are aggregated\ninto whole words with start/end times relative to the segment start.\n\nUse these for precise karaoke-style highlighting instead of\ninterpolating from segmentDurationMs.",
      "type": "array",
      "items": {
        "$ref": "#/components/schemas/AssistantSpeechWordTimestamp"
      }
    }
  },
  "required": [
    "type",
    "wordsSpoken",
    "totalWords"
  ]
}
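
The schema's `required` list, its `type` enum, and the divide-by-zero note on `totalWords` can be combined on the client side. The sketch below is one way to do that, under stated assumptions: `isWordProgressTiming` and `progressFraction` are hypothetical helper names, not part of any Vapi SDK, and clamping the result to the range 0 to 1 is an extra defensive choice that the schema does not require.

```typescript
// Minimal shape covering only the fields the schema marks as required.
interface RequiredWordProgress {
  type: "word-progress";
  wordsSpoken: number;
  totalWords: number;
}

// Runtime check derived from the schema's `required` list and `type` enum.
// It does not validate the optional fields (segment, segmentDurationMs, words).
function isWordProgressTiming(value: unknown): value is RequiredWordProgress {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    v.type === "word-progress" &&
    typeof v.wordsSpoken === "number" &&
    typeof v.totalWords === "number"
  );
}

// Progress fraction with the guard the schema asks for: a totalWords of 0
// means "not yet known", so report 0 instead of dividing by zero.
function progressFraction(wordsSpoken: number, totalWords: number): number {
  if (totalWords <= 0) return 0;
  // Clamping is an assumption of this sketch, added for display safety.
  return Math.min(1, Math.max(0, wordsSpoken / totalWords));
}

// Usage with an untyped payload, e.g. a parsed message body.
const payload: unknown = JSON.parse(
  '{"type":"word-progress","wordsSpoken":3,"totalWords":0}',
);
if (isWordProgressTiming(payload)) {
  console.log(progressFraction(payload.wordsSpoken, payload.totalWords));
}
```

Because `totalWords` can grow during a turn, a progress bar driven by this fraction may move backwards between events; how to smooth that is left to the client.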