POST /v1/audio/speech
Generates speech audio from text. OpenAI TTS API compatible at the
field level — clients written against OpenAI’s /v1/audio/speech work
unchanged. aistack proxies to the Qwen3-TTS-12Hz-0.6B-CustomVoice
model running inside the aistack-qwen3-tts Docker container.
aistack adds no business logic at this layer; it is a transparent reverse proxy. Request body and response body flow through unchanged (minus hop-by-hop headers).
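As an illustration of the OpenAI-compatible field layer, here is a minimal client sketch using only the Python standard library. The base URL and the voice name are placeholders, and passing the model name from this page as the `model` value is an assumption; check the TTS reference for the identifiers your deployment accepts.

```python
import json
import urllib.request

# Assumptions for illustration only: the base URL is a placeholder, and the
# model identifier is assumed to match the model name given on this page.
BASE_URL = "http://localhost:8080"
MODEL = "Qwen3-TTS-12Hz-0.6B-CustomVoice"


def build_speech_request(text, voice, base_url=BASE_URL, response_format="wav"):
    """Build (but do not send) a POST /v1/audio/speech request."""
    payload = {
        "model": MODEL,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, voice, out_path, timeout=600):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, voice)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

Calling `synthesize("Hello", "example-voice", "out.wav")` would write the response body to disk untouched, which mirrors the proxy's own pass-through behaviour. The voice name here is hypothetical.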
The full request and response schema lives in the auto-generated
TTS reference. For OpenAI-compatible field
semantics (input, voice, response_format, model), the
authoritative reference is OpenAI’s
Audio API documentation.
This page covers the why — the transparent-proxy stance, output
format quirks, the pass-through endpoint surface, cold-start
behaviour, and how aistack gates TTS concurrency without queueing.
Why a transparent proxy
Three reasons aistack does not transform TTS requests/responses:
- OpenAI’s contract is the contract. Clients written for OpenAI’s TTS endpoint work unchanged against aistack. Adding business logic in the middle would require keeping it in sync with OpenAI’s evolving spec — drift surface for no value.
- The upstream owns the model. Qwen3-TTS extension fields (voice cloning, voice design, batched synthesis) are the upstream’s domain. aistack passes them through so consumers reach the full feature set.
- Audio is bytes, not JSON. Transcoding (WAV → MP3, sample-rate conversion) at the gateway would burn CPU and add latency for every request. Clients that need a different format transcode client-side with ffmpeg.
Output format
Output is whatever vLLM-Omni emits: at the time of writing,
24 kHz mono 16-bit PCM WAV regardless of the response_format
hint in the request. The vLLM-Omni server may add MP3/Opus/FLAC
encoding in a future release; aistack will pass that through
without code changes.
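Because the `response_format` hint is currently not honoured, a client may prefer to inspect what it actually received rather than trust the request. The sketch below uses Python's standard `wave` module to read the header; the function names are illustrative, and the expected tuple simply restates the format described above.

```python
import io
import wave


def describe_wav(audio_bytes):
    """Return (sample_rate_hz, channels, bits_per_sample) for WAV bytes.

    Raises wave.Error if the payload is not a RIFF/WAVE PCM stream.
    """
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8


def is_expected_format(audio_bytes, expected=(24000, 1, 16)):
    """True if the payload is WAV and matches the expected format tuple."""
    try:
        return describe_wav(audio_bytes) == expected
    except (wave.Error, EOFError):
        # Not a WAV payload (or truncated): the upstream may have changed
        # its output encoding, so the caller should not assume PCM.
        return False
```

A check like this also gives an early signal if a future vLLM-Omni release starts emitting a different encoding.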
If you need a specific format today, transcode client-side:
    ffmpeg -i out.wav -codec:a libmp3lame out.mp3
Pass-through endpoint surface
The full Qwen3-TTS extended surface is reachable through aistack
under /v1/audio/*:
- POST /v1/audio/speech: synthesis (custom voice / cloned voice / designed voice)
- POST /v1/audio/speech/stream: streaming synthesis (chunks audio as it generates)
- POST /v1/audio/speech/batch: batched synthesis
- GET /v1/audio/voices: list available voices
- POST /v1/audio/voices: register a new voice
- DELETE /v1/audio/voices/{name}: remove a registered voice
aistack proxies all of these verbatim. Their schemas are owned by vLLM-Omni’s Qwen3-TTS docs; aistack does not redocument them. Future aistack versions may add value-added behaviour (telemetry, voice-list aggregation across multi-backend deployments) without changing the over-the-wire format.
The /v1/models entry for TTS advertises supports_streaming: true
because of the /v1/audio/speech/stream pass-through path; the
streaming wire format is whatever vLLM-Omni emits there (chunked
audio bytes), not the transcript.text.delta SSE shape used by
ASR.
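Since the streaming path carries raw audio bytes rather than SSE events, a consumer can treat the response as a plain byte stream. The sketch below is a hypothetical helper that copies chunks from any file-like object with a `.read(n)` method (such as the object returned by `urllib.request.urlopen`) into a sink; an in-memory stand-in replaces the network so the example runs without a server.

```python
import io


def save_audio_stream(response, sink, chunk_size=4096):
    """Copy raw audio bytes from a file-like response into sink as they arrive.

    No SSE or JSON parsing is involved: each chunk is written as received.
    Returns the total number of bytes copied.
    """
    total = 0
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # an empty read marks the end of the stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Stand-in for a network response, so the sketch runs without a server.
fake_response = io.BytesIO(b"\x00\x01" * 5000)
saved = io.BytesIO()
copied = save_audio_stream(fake_response, saved)
```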
Cold start
The first request to a freshly started Docker container triggers
torch.compile + CUDA Graph capture inside vLLM-Omni — this takes
~60 to ~150 seconds depending on the request and machine state.
The proxy timeout is set to 600 seconds to absorb the worst case;
clients should display “warming up” rather than treating long
latency as a hang.
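One way a client might surface the "warming up" state is to start a timer alongside the request and report if the call is still pending after a few seconds. This is a sketch, not part of aistack: the helper name, the 5-second threshold, and the message text are all illustrative choices.

```python
import threading


def call_with_warmup_notice(call, notice_after=5.0, on_notice=print):
    """Run call(); if it is still running after notice_after seconds, report it.

    The call itself is never interrupted. The client's own HTTP timeout
    should be at least as long as the proxy's 600 s so that a cold start
    is not cut short on the client side.
    """
    timer = threading.Timer(notice_after, on_notice, args=("warming up...",))
    timer.daemon = True
    timer.start()
    try:
        return call()
    finally:
        # Cancel the notice if the call finished before the threshold.
        timer.cancel()
```

In a UI, `on_notice` would typically update a status label instead of printing.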
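After warmup, steady-state performance on an RTX 4060 Laptop GPU is typically a real-time factor (RTF) of 0.7–1.1, meaning synthesis takes roughly 0.7 to 1.1 times the duration of the audio produced (a few hundred milliseconds for a short utterance).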
Concurrency
aistack does apply the single global GPU slot to TTS requests:
the Qwen3-TTS container shares the physical GPU with in-process ASR
and the LLM proxy, and the slot represents “GPU is doing inference”
regardless of which process owns the kernels. Concurrent requests
across capabilities get HTTP 503 with Retry-After.
vLLM-Omni handles its own request queueing inside the Docker container. aistack does not implement a queue at the gateway layer — if you want a queue, build it client-side around the busy signal.
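A client-side wrapper around the busy signal could look like the following sketch. It assumes a caller-supplied `send()` function that returns `(status, headers, body)`; that shape, the helper names, and the attempt limit are hypothetical. Only a 503 that carries `Retry-After` is retried, so other 503 responses are returned to the caller immediately.

```python
import time


def parse_retry_after(value, default=1):
    """Interpret Retry-After as delay-seconds; fall back to default otherwise."""
    try:
        return max(0, int(value))
    except (TypeError, ValueError):
        # HTTP-date values and malformed headers end up here.
        return default


def send_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it stops reporting "busy" or attempts run out.

    send() returns (status, headers, body); headers is a dict-like mapping
    (a plain dict must use the exact key "Retry-After").
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        retry_after = headers.get("Retry-After")
        busy = status == 503 and retry_after is not None
        if not busy or attempt == max_attempts:
            return status, headers, body
        sleep(parse_retry_after(retry_after))
```

Passing `sleep` as a parameter keeps the wrapper easy to test and lets an application substitute its own scheduling.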
Recovery from container down
The most common failure mode is “Docker container is not running”:
    HTTP/1.1 503 Service Unavailable
    Content-Type: application/json

    {
      "error": {
        "kind": "network",
        "provider": "aistack",
        "message": "Qwen3-TTS container is not reachable. Start it with: docker compose -f docker/tts_qwen3/docker-compose.yml up -d"
      }
    }
The error message includes the recovery command. Containers that crash mid-cold-start can wedge in a state where requests time out at the proxy’s 600 s read timeout; restart the container in that case.
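A client can surface the recovery hint by reading the error object shown above. The sketch below assumes only the three fields visible in that example (`kind`, `provider`, `message`); the helper name is illustrative, and any body that does not match is treated as unparseable rather than guessed at.

```python
import json


def parse_error(body):
    """Extract (kind, provider, message) from an error body like the one above.

    Returns None if the body is not JSON or lacks an "error" object.
    """
    try:
        error = json.loads(body)["error"]
        return error.get("kind"), error.get("provider"), error.get("message")
    except (ValueError, KeyError, TypeError, AttributeError):
        # Not JSON, no "error" key, or "error" is not an object.
        return None
```

Showing the `message` field to an operator verbatim is usually enough, since it already contains the command to run.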
Stability
The OpenAI-compatible field layer (input, voice,
response_format, model) is stable within /v1. The Qwen3-TTS
extension fields (task_type, language, ref_audio, …) are
the upstream’s contract and are documented authoritatively in the
vLLM-Omni Qwen3-TTS repo.
The pass-through endpoint paths (/v1/audio/speech/stream,
/voices, …) follow vLLM-Omni’s contract; if a future TTS
backend exposes a different surface, aistack will document the
mapping in a separate migration note rather than silently rewrite
the contract.