ASR — speech to text
POST /v1/audio/transcriptions
Transcribe an audio file.
OpenAI Whisper API compatible — clients written against OpenAI’s
/v1/audio/transcriptions work with no code changes other than the
base URL.
Backend selection. The model field accepts any of:
- `whisper-{size}` (`whisper-tiny` … `whisper-large-v3-turbo`, `whisper-distil-large-v3`) → faster-whisper / CTranslate2
- `whisper-1` → maps to `whisper-small` for OpenAI legacy compat
- bare size alias (`small`, `medium`, …) → faster-whisper
- `parakeet` or `nvidia/parakeet-tdt-0.6b-v3` → NVIDIA NeMo
- `sensevoice` or `iic/SenseVoiceSmall` → Alibaba FunASR
- `auto` (or empty) → aistack router picks based on `language`: CJK → SenseVoice, European → Parakeet, else → Whisper-small. Falls back gracefully when a preferred backend is not installed.
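
The selection rules above can be sketched as a small resolver. This is an illustration only, not aistack's router: the language groupings, the `installed` set, the function name, and the choice of `whisper-small` as the fallback are assumptions, and only the Whisper sizes named in the text are listed because the full range is elided there.

```python
from typing import Optional, Set

# Assumed groupings: the text only says "CJK" and "European".
CJK = {"zh", "ja", "ko"}
EUROPEAN = {"en", "de", "fr", "es", "it"}
# Only the sizes named in the text; the full list is elided there.
NAMED_SIZES = {"tiny", "small", "medium", "large-v3-turbo", "distil-large-v3"}


def resolve_model(model: Optional[str], language: Optional[str],
                  installed: Set[str]) -> str:
    """Map a request's `model` field to a backend model name (hypothetical)."""
    name = (model or "").strip() or "auto"
    if name == "whisper-1":          # OpenAI legacy alias
        return "whisper-small"
    if name in NAMED_SIZES:          # bare size alias
        return f"whisper-{name}"
    if name.startswith("whisper-"):
        return name
    if name in ("parakeet", "nvidia/parakeet-tdt-0.6b-v3"):
        return "parakeet"
    if name in ("sensevoice", "iic/SenseVoiceSmall"):
        return "sensevoice"
    if name != "auto":
        raise ValueError(f"unsupported model id: {name}")
    # auto: pick by language, as described above
    if language in CJK:
        preferred = "sensevoice"
    elif language in EUROPEAN:
        preferred = "parakeet"
    else:
        preferred = "whisper-small"
    # Assumed fallback when the preferred backend is not installed.
    return preferred if preferred in installed else "whisper-small"
```

For example, `resolve_model("", "zh", {"sensevoice"})` resolves to `"sensevoice"`, while the same call with an empty `installed` set falls back to `"whisper-small"`.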
GPU scheduling. aistack runs at most one inference at a time
across ASR / LLM / TTS. Concurrent requests get HTTP 503 with a
Retry-After header. The blocking inference runs in a worker
thread; the FastAPI event loop stays responsive (/health keeps
answering, and the disconnect watcher can cancel a running
transcription cooperatively).
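
Because a busy gateway answers 503 with Retry-After, a client usually wants a small retry loop. The following is a minimal sketch, assuming the header carries a number of seconds; `send` is a hypothetical callable that performs one request and returns `(status, headers, body)`, so the loop stays independent of any HTTP library.

```python
import time
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], bytes]


def send_with_retry(send: Callable[[], Response],
                    max_attempts: int = 5,
                    default_wait: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send` until it returns something other than 503."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 503 or attempt == max_attempts:
            return status, headers, body
        # Header names are case-insensitive, so normalise before lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        try:
            wait = max(0.0, float(lowered.get("retry-after", "")))
        except ValueError:
            # Missing or non-numeric (e.g. HTTP-date) value: use the default.
            wait = default_wait
        sleep(wait)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.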
Streaming. When stream=true the response is text/event-stream
with data: {...}\n\n frames per SSE convention. Event types:
transcript.text.delta (one per emitted segment, with text and
optional start/end/words), and transcript.text.done once at
the end. Models advertising supports_streaming=false in
/v1/models (currently Parakeet) emit a single warning event
followed by a single delta containing the full transcription —
the gateway will not silently chunk a non-streaming model and pay
the WER cost.
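
A client can consume the stream with a few lines of parsing. The sketch below assumes the event type travels in a `type` field inside each JSON payload, which the text does not confirm; the `text` field on delta events is taken from the description above. Events of any other type, including the warning event, are skipped here.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one decoded JSON object per `data:` frame."""
    buf: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current frame.
            if buf:
                yield json.loads("\n".join(buf))
                buf = []
        elif line.startswith("data:"):
            payload = line[5:]
            if payload.startswith(" "):
                payload = payload[1:]  # SSE strips one leading space
            buf.append(payload)
    if buf:
        yield json.loads("\n".join(buf))


def collect_delta_texts(events: Iterable[Dict]) -> List[str]:
    """Gather the text of each delta until the done event arrives."""
    texts: List[str] = []
    for event in events:
        kind = event.get("type")  # assumed location of the event type
        if kind == "transcript.text.delta":
            texts.append(event.get("text", ""))
        elif kind == "transcript.text.done":
            break
    return texts
```

The helper returns the segment texts as a list rather than joining them, since the text does not say whether deltas carry their own whitespace.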
Cancellation. If the client disconnects mid-request, aistack sets a cooperative cancel token that the running transcription checks between segments, releases the GPU slot, and returns 499 (nginx convention) for clients that re-poll.
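
Cooperative cancellation of this kind is commonly built from a shared flag that the worker checks between units of work. The sketch below illustrates the pattern with `threading.Event` and a lock standing in for the GPU slot; all names are hypothetical and this is not aistack's implementation.

```python
import threading
from typing import Callable, Iterable, List

GPU_SLOT = threading.Lock()  # stand-in for the single inference slot


class Cancelled(Exception):
    """Raised when the cancel token is observed between segments."""


def run_transcription(chunks: Iterable[str],
                      cancel: threading.Event,
                      decode: Callable[[str], str] = str.upper) -> List[str]:
    """Decode chunks one by one, checking the cancel token before each."""
    if not GPU_SLOT.acquire(blocking=False):
        # In the gateway this case maps to HTTP 503 with Retry-After.
        raise RuntimeError("GPU slot busy")
    try:
        out: List[str] = []
        for chunk in chunks:
            if cancel.is_set():
                raise Cancelled("client disconnected")
            out.append(decode(chunk))  # placeholder for real decoding
        return out
    finally:
        # The slot is released on success, cancellation, and errors alike.
        GPU_SLOT.release()
```

A disconnect watcher would call `cancel.set()` from the event loop; the worker notices it at the next segment boundary rather than being interrupted mid-decode.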
Errors. Every non-2xx response uses the standard envelope
{error: {kind, provider, message}}. See the errors documentation
for status code semantics.
Request body
Content type: multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| file | binary (multipart upload) | yes | Audio file (any ffmpeg-readable format: mp3, mp4, wav, m4a, flac, ogg, webm, mkv). |
| model | string | no | Provider/model selector. Empty or ‘auto’ = pick best installed backend for the given language. Otherwise: whisper-{size} \| parakeet \| sensevoice. |
| language | string \| null | no | ISO 639-1 code (e.g. ‘en’, ‘zh’). Omit for auto-detect. |
| response_format | string | no | json \| verbose_json \| text. Ignored when stream=true. |
| translate | boolean | no | If true, transcribe to English instead of source language. Only Whisper-family models support translation; Parakeet and SenseVoice reject with 400. |
| stream | boolean | no | If true, return Server-Sent Events with one transcript.text.delta per decoded segment, ending with transcript.text.done. Models with supports_streaming=false in /v1/models still accept this and emit a warning event followed by a single delta. response_format is ignored when stream=true. |
| segment_granularity | string | no | How to group word timestamps into the segments field. ‘sentence’ (default) returns full sentences, the right input for line-by-line LLM translation, semantic search, agent reasoning. ‘subtitle’ returns SRT-cue-sized segments (≤70 chars, 1–7s) for clients that emit SRT/VTT directly without a downstream cue-sizing pass. Affects Parakeet only; faster-whisper and SenseVoice produce VAD-driven segments natively. |
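
To make the request shape concrete, here is a minimal sketch that assembles the multipart body with the standard library only. The base URL and helper names are assumptions, and the request is built but not sent. Field values are passed as strings, so a boolean such as `stream` would be written as `"true"`.

```python
import uuid
import urllib.request
from typing import Dict, Tuple


def encode_multipart(fields: Dict[str, str], filename: str,
                     audio: bytes) -> Tuple[str, bytes]:
    """Return (content_type, body) for the form fields plus the `file` part."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8")
    parts.append(file_header + audio + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)


def build_request(base_url: str, audio: bytes, filename: str,
                  **fields: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/audio/transcriptions."""
    content_type, body = encode_multipart(fields, filename, audio)
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

A call such as `build_request("http://localhost:8000", audio_bytes, "clip.wav", model="auto", response_format="verbose_json")` yields a request object that `urllib.request.urlopen` could send; the host and port here are placeholders.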
Responses
Successful response, by content type:

- `application/json` → `TranscriptionResponse`
- `text/plain` → string
- `text/event-stream` → string

Error responses, each returning `application/json` → `ErrorEnvelope`:

- Malformed request: unsupported response_format / segment_granularity / model id.
- Audio too large for the chosen backend’s VRAM budget.
- Client disconnected mid-request (499, nginx convention).
- Unexpected internal failure.
- GPU slot busy (503): gateway is serving another inference. Retry-After is set.
Schemas
ErrorEnvelope {#schema-errorenvelope}
Wire format for every non-2xx response from aistack.
The shape is identical regardless of which endpoint produced the error, so consumers can write one error-handling helper and reuse it across capabilities.
| Field | Type | Required | Description |
|---|---|---|---|
| error | ErrorBody | yes | |
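
Since the envelope is the same everywhere, one helper can turn any non-2xx response into an exception. This is a sketch under assumptions: the `kind`, `provider`, and `message` fields come from the envelope described on this page, while the class name, the function name, and the `"unknown"` fallback for unparseable bodies are illustrative.

```python
import json
from typing import Optional


class AistackError(Exception):
    """Client-side view of the {error: {kind, provider, message}} envelope."""

    def __init__(self, status: int, kind: str,
                 provider: Optional[str], message: str) -> None:
        super().__init__(f"{status} {kind} ({provider}): {message}")
        self.status = status
        self.kind = kind
        self.provider = provider
        self.message = message


def raise_for_error(status: int, body: bytes) -> None:
    """Raise AistackError for non-2xx responses; do nothing otherwise."""
    if 200 <= status < 300:
        return
    try:
        err = json.loads(body)["error"]
        kind = err["kind"]
        provider = err.get("provider")
        message = err["message"]
    except (ValueError, KeyError, TypeError, AttributeError):
        # Body was not a well-formed envelope; keep the raw text instead.
        raise AistackError(status, "unknown", None,
                           body.decode("utf-8", "replace")) from None
    raise AistackError(status, kind, provider, message)
```

Callers can then branch on `exc.status` or `exc.kind` in one place, regardless of which endpoint produced the error.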
TranscriptionResponse {#schema-transcriptionresponse}
Response shape for POST /v1/audio/transcriptions.
The shape is unified across the json and verbose_json
response_format variants; which fields are populated depends on the
request:
- `response_format=json` (default, OpenAI minimal) → only `text`
- `response_format=verbose_json` → all fields populated
- `response_format=text` → not this schema; route returns a plain text body (`text/plain` content type)
- `stream=true` → not this schema; route returns Server-Sent Events with `transcript.text.delta` events and a final `transcript.text.done`
The verbose-only fields are typed Optional so the same schema can
represent both shapes; consumers branching on response_format
know which fields to expect.
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | yes | The full transcribed text. Populated for both json and verbose_json formats. |
| language | string \| null | no | ISO 639-1 code of the detected (or hinted) source language. Verbose only. |
| duration | number \| null | no | Audio duration in seconds, from the backend’s own measurement. Verbose only. |
| segments | array of TranscriptionSegment \| null | no | Per-segment breakdown. Segmentation strategy depends on segment_granularity and the backend. Verbose only. |
| words | array of TranscriptionWord \| null | no | Flat per-word timestamp list. Convenient for clients that want word timing without traversing segments. Verbose only. |
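
On the client side, the unified shape maps naturally onto a single type with optional fields. The sketch below mirrors the table above; segments and words are kept as plain dictionaries because the `TranscriptionSegment` and `TranscriptionWord` schemas are not shown in this section, and the class is an illustration rather than an official client model.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class TranscriptionResponse:
    text: str
    # Verbose-only fields stay None for response_format=json.
    language: Optional[str] = None
    duration: Optional[float] = None
    segments: Optional[List[Dict[str, Any]]] = None
    words: Optional[List[Dict[str, Any]]] = None

    @classmethod
    def from_json(cls, raw: str) -> "TranscriptionResponse":
        """Parse either the minimal or the verbose JSON body."""
        data = json.loads(raw)
        return cls(
            text=data["text"],  # the only required field
            language=data.get("language"),
            duration=data.get("duration"),
            segments=data.get("segments"),
            words=data.get("words"),
        )
```

Parsing `'{"text": "hi"}'` leaves every verbose field as `None`, so code that requested `verbose_json` can treat a missing `segments` list as a signal that it received the minimal shape.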