ASR — speech to text

Transcribe an audio file

OpenAI Whisper API compatible — clients written against OpenAI’s /v1/audio/transcriptions work with no code changes other than the base URL.

Backend selection. The model field accepts any of:

  • whisper-{size} (whisper-tiny … whisper-large-v3-turbo, whisper-distil-large-v3) → faster-whisper / CTranslate2
  • whisper-1 → maps to whisper-small for OpenAI legacy compat
  • bare size alias (small, medium, …) → faster-whisper
  • parakeet or nvidia/parakeet-tdt-0.6b-v3 → NVIDIA NeMo
  • sensevoice or iic/SenseVoiceSmall → Alibaba FunASR
  • auto (or empty) → aistack router picks based on language: CJK → SenseVoice, European → Parakeet, else → Whisper-small. Falls back gracefully when a preferred backend is not installed.
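
The auto routing rule can be sketched as a small pure function. This is an illustration, not aistack's actual router: the language sets and the fallback order are assumptions, since this page does not list which codes count as CJK or European or which backend is tried next when the preferred one is missing.

```python
from typing import Optional, Set

# Hypothetical language groupings (not taken from aistack).
CJK = {"zh", "ja", "ko"}
EUROPEAN = {"en", "de", "fr", "es", "it", "pt", "nl", "pl"}


def pick_backend(language: Optional[str], installed: Set[str]) -> str:
    """Return the backend the auto rule would prefer, given what is installed."""
    lang = (language or "").lower()
    if lang in CJK:
        preferred = "sensevoice"
    elif lang in EUROPEAN:
        preferred = "parakeet"
    else:
        preferred = "whisper-small"
    if preferred in installed:
        return preferred
    # Assumed fallback order: Whisper first, then whatever else is installed.
    for candidate in ("whisper-small", "parakeet", "sensevoice"):
        if candidate in installed:
            return candidate
    raise RuntimeError("no ASR backend installed")
```

For example, `pick_backend("zh", {"whisper-small"})` falls back to `whisper-small` because SenseVoice is not in the installed set.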

GPU scheduling. aistack runs at most one inference at a time across ASR / LLM / TTS. Concurrent requests get HTTP 503 with a Retry-After header. The blocking inference runs in a worker thread; the FastAPI event loop stays responsive (/health keeps answering, and the disconnect watcher can cancel a running transcription cooperatively).
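
On the client side, a 503 with Retry-After is a signal to wait and resend. A minimal sketch of that loop follows, assuming Retry-After carries a number of seconds; `send` is a hypothetical callable that performs one HTTP request and returns `(status, headers, body)`, so the loop stays independent of any particular HTTP library.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read Retry-After as seconds; use the default if missing or not numeric."""
    value = headers.get("Retry-After", "")
    try:
        return max(0.0, float(value))
    except ValueError:
        return default  # empty value or HTTP-date form


def send_with_retry(send: Callable[[], Response], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until the status is not 503 or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 503 or attempt == max_attempts - 1:
            return status, headers, body
        sleep(retry_after_seconds(headers))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the loop testable without real delays.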

Streaming. When stream=true the response is text/event-stream with data: {...}\n\n frames per SSE convention. Event types: transcript.text.delta (one per emitted segment, with text and optional start/end/words), and transcript.text.done once at the end. Models advertising supports_streaming=false in /v1/models (currently Parakeet) emit a single warning event followed by a single delta containing the full transcription — the gateway will not silently chunk a non-streaming model and pay the WER cost.
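
A client can consume this stream with a small frame parser. The sketch below assumes each JSON payload names its event in a `type` field; the page gives the event names and the `text` field but not the key that carries the event name, so treat `type` as a placeholder.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each complete `data:` frame."""
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # SSE allows one optional space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_lines:
            # A blank line terminates the frame.
            yield json.loads("\n".join(data_lines))
            data_lines = []


def collect_deltas(lines: Iterable[str]) -> List[str]:
    """Return the text of every delta seen before transcript.text.done."""
    texts: List[str] = []
    for event in iter_sse_events(lines):
        kind = event.get("type")  # assumed field name
        if kind == "transcript.text.delta":
            texts.append(event.get("text", ""))
        elif kind == "transcript.text.done":
            break
    return texts
```

Other event types, such as the warning emitted for non-streaming models, pass through `iter_sse_events` and are skipped by `collect_deltas`.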

Cancellation. If the client disconnects mid-request, aistack sets a cooperative cancel token between segments, releases the GPU slot, and returns 499 (nginx convention) for clients that re-poll.
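
The cooperative pattern described here can be illustrated with a `threading.Event` checked between segments and a lock standing in for the GPU slot. This is a generic sketch of the pattern, not aistack's code; `Cancelled`, `transcribe_segments`, and `run_with_slot` are hypothetical names, and the decoding step is a placeholder.

```python
import threading
from typing import Iterable, List


class Cancelled(Exception):
    """Raised when the cancel token is observed between segments."""


def transcribe_segments(segments: Iterable[str], cancel: threading.Event) -> List[str]:
    out: List[str] = []
    for segment in segments:
        if cancel.is_set():
            raise Cancelled()
        out.append(segment.upper())  # placeholder for real decoding work
    return out


def run_with_slot(slot: threading.Lock, segments: Iterable[str],
                  cancel: threading.Event) -> List[str]:
    """Hold the single slot for the duration of one transcription."""
    if not slot.acquire(blocking=False):
        raise RuntimeError("GPU slot busy")  # a gateway would answer 503 here
    try:
        return transcribe_segments(segments, cancel)
    finally:
        slot.release()  # released on success, failure, or cancellation
```

Because the token is only checked between segments, a cancellation takes effect after the segment in progress finishes rather than immediately.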

Errors. Every non-2xx response uses the standard envelope {error: {kind, provider, message}}. See the errors documentation for status code semantics.

Content type: multipart/form-data

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| file | binary (multipart upload) | yes | Audio file (any ffmpeg-readable format: mp3, mp4, wav, m4a, flac, ogg, webm, mkv). |
| model | string | no | Provider/model selector. Empty or ‘auto’ = pick best installed backend for the given language. Otherwise: whisper-{size}, parakeet, or sensevoice. |
| language | string \| null | no | ISO 639-1 code (e.g. ‘en’, ‘zh’). Omit for auto-detect. |
| response_format | string | no | json, verbose_json, or text. Ignored when stream=true. |
| translate | boolean | no | If true, transcribe to English instead of source language. Only Whisper-family models support translation; Parakeet and SenseVoice reject with 400. |
| stream | boolean | no | If true, return Server-Sent Events with one transcript.text.delta per decoded segment, ending with transcript.text.done. Models with supports_streaming=false in /v1/models still accept this and emit a warning event followed by a single delta. response_format is ignored when stream=true. |
| segment_granularity | string | no | How to group word timestamps into the segments field. ‘sentence’ (default) returns full sentences, the right input for line-by-line LLM translation, semantic search, and agent reasoning. ‘subtitle’ returns SRT-cue-sized segments (≤70 chars, 1–7s) for clients that emit SRT/VTT directly without a downstream cue-sizing pass. Affects Parakeet only; faster-whisper and SenseVoice produce VAD-driven segments natively. |
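
To show how the fields above map onto a request, here is a standard-library sketch that encodes the form and builds (but does not send) a POST. The base URL is a placeholder for wherever aistack is deployed, and sending booleans as the strings "true"/"false" is an assumption about how the server parses form values.

```python
import urllib.request
import uuid
from typing import Dict, Tuple


def encode_multipart(fields: Dict[str, str], filename: str,
                     file_bytes: bytes) -> Tuple[bytes, str]:
    """Encode text fields plus one `file` part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(head.encode("utf-8") + value.encode("utf-8") + b"\r\n")
    file_head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_head.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(base_url: str, filename: str, audio: bytes,
                  **fields: str) -> urllib.request.Request:
    """Build the POST for /v1/audio/transcriptions without sending it."""
    body, content_type = encode_multipart(fields, filename, audio)
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

A call such as `build_request("http://localhost:8000", "talk.wav", audio, model="auto", response_format="verbose_json")` returns a request object that `urllib.request.urlopen` can send.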

Error responses:

  • Malformed request: unsupported response_format, segment_granularity, or model id.
  • Audio too large for the chosen backend’s VRAM budget.
  • Client disconnected mid-request (499, nginx convention).
  • Unexpected internal failure.
  • GPU slot busy (503): the gateway is serving another inference. Retry-After is set.


Wire format for every non-2xx response from aistack.

The shape is identical regardless of which endpoint produced the error, so consumers can write one error-handling helper and reuse it across capabilities.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| error | ErrorBody | yes | |
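
Because the envelope is the same everywhere, one helper can turn any non-2xx body into a typed error. The sketch below assumes only the three fields named on this page (kind, provider, message); `AistackError` and `parse_error` are hypothetical names, and the fallback for bodies that do not match the envelope (for example, an error page from a proxy) is a defensive choice, not documented behavior.

```python
import json
from typing import Optional


class AistackError(Exception):
    """Typed view of the {error: {kind, provider, message}} envelope."""

    def __init__(self, status: int, kind: str,
                 provider: Optional[str], message: str) -> None:
        super().__init__(f"{status} {kind} ({provider}): {message}")
        self.status = status
        self.kind = kind
        self.provider = provider
        self.message = message


def parse_error(status: int, body: bytes) -> AistackError:
    """Decode an error body, tolerating payloads that are not the envelope."""
    try:
        err = json.loads(body)["error"]
        return AistackError(status, err["kind"], err.get("provider"), err["message"])
    except (ValueError, KeyError, TypeError, AttributeError):
        # Not the documented envelope: keep the raw body as the message.
        return AistackError(status, "unknown", None, body.decode("utf-8", "replace"))
```

A caller would typically check the status code first and `raise parse_error(status, body)` for anything outside the 2xx range.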

TranscriptionResponse

Response shape for POST /v1/audio/transcriptions.

The shape is unified across the json and verbose_json response_format variants; which fields are populated depends on the request:

  • response_format=json (default, OpenAI minimal) → only text
  • response_format=verbose_json → all fields populated
  • response_format=text → not this schema; route returns a plain text body (text/plain content type)
  • stream=true → not this schema; route returns Server-Sent Events with transcript.text.delta events and a final transcript.text.done

The verbose-only fields are typed Optional so the same schema can represent both shapes; consumers branching on response_format know which fields to expect.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | yes | The full transcribed text. Populated for both json and verbose_json formats. |
| language | string \| null | no | ISO 639-1 code of the detected (or hinted) source language. Verbose only. |
| duration | number \| null | no | Audio duration in seconds, from the backend’s own measurement. Verbose only. |
| segments | array of TranscriptionSegment \| null | no | Per-segment breakdown. Segmentation strategy depends on segment_granularity and the backend. Verbose only. |
| words | array of TranscriptionWord \| null | no | Flat per-word timestamp list. Convenient for clients that want word timing without traversing segments. Verbose only. |
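
On the consumer side, the unified shape maps naturally onto one dataclass with optional verbose fields. This is a sketch: the inner layout of TranscriptionSegment and TranscriptionWord is not given in this section, so segments and words are kept as untyped lists here.

```python
import json
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class TranscriptionResponse:
    text: str
    language: Optional[str] = None      # verbose only
    duration: Optional[float] = None    # verbose only
    segments: Optional[List[Any]] = None  # verbose only; item shape not modeled
    words: Optional[List[Any]] = None     # verbose only; item shape not modeled

    @classmethod
    def from_json(cls, body: bytes) -> "TranscriptionResponse":
        """Parse a json or verbose_json response body."""
        data = json.loads(body)
        return cls(
            text=data["text"],
            language=data.get("language"),
            duration=data.get("duration"),
            segments=data.get("segments"),
            words=data.get("words"),
        )
```

With response_format=json only `text` is set and the remaining fields stay `None`; with verbose_json the same class carries the full payload.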