aistack Integration Guide
This document is the contract aistack publishes to consumers. It is the single authoritative source for “what aistack does and how to use it” from any client: CLI tools, GUI apps like VideoCraft, agent frameworks, and future dosmoon products. aistack does not adapt to any particular consumer’s needs; consumers conform to what is documented here.
If you are integrating aistack for the first time, start here. The per-endpoint design pages (under API —
asr/tts/llm/models/errors/observability) cover the why behind each capability; the auto-generated Reference covers the field-level what. This guide stitches them into a coherent journey.
1. What aistack is
aistack is a local AI capability gateway: a single OpenAI-API-compatible
server (default 127.0.0.1:11500) that exposes three capabilities, namely
speech-to-text (ASR), text-to-speech (TTS), and chat completion (LLM),
and owns local GPU scheduling so consumers do not have to.
| You as consumer | aistack |
|---|---|
| Send OpenAI-shape HTTP requests. | Picks which backend serves the request. |
| Read OpenAI-shape responses. | Coordinates the local GPU across all backends. |
| Don’t know or care which backend ran your request. | Hot-swaps models on memory pressure. |
| Don’t ship ML libraries with your app. | Hosts the ML libraries locally or proxies them. |
The contract is intentionally narrow: stable HTTP shapes, a stable
error envelope, capability discovery via /v1/models, and 503 +
Retry-After as the universal “we are momentarily busy” signal.
2. Endpoints at a glance
| Method | Path | Purpose | Reference |
|---|---|---|---|
| GET | /health | Liveness probe. | This page §10. |
| GET | /v1/models | Capability inventory — what the gateway can serve right now. | models |
| POST | /v1/audio/transcriptions | Speech-to-text. | asr |
| POST | /v1/audio/speech | Text-to-speech (and related TTS endpoints under /v1/audio/*). | tts |
| POST | /v1/chat/completions | Chat completion (proxied to local Ollama). | llm |
All endpoints are unauthenticated by default; aistack is meant to bind
on 127.0.0.1 or a private LAN. If exposed beyond that, put a
reverse proxy with auth in front.
3. Step 1 — Discover what’s available
Always start by calling GET /v1/models. This is how the gateway
introduces itself; no other endpoint reveals which capabilities are
currently usable.

    curl http://127.0.0.1:11500/v1/models

    {
      "object": "list",
      "data": [
        {
          "id": "auto",
          "object": "model",
          "owned_by": "aistack",
          "capabilities": ["asr"],
          "is_routing_alias": true
        },
        {
          "id": "whisper-small",
          "object": "model",
          "owned_by": "openai",
          "capabilities": ["asr"],
          "languages": ["en", "zh", "ja", "ko", "es", "fr", "..."]
        },
        {
          "id": "iic/SenseVoiceSmall",
          "object": "model",
          "owned_by": "alibaba",
          "capabilities": ["asr"],
          "languages": ["zh", "yue", "en", "ja", "ko"]
        },
        {
          "id": "qwen3-tts-12hz-0.6b-customvoice",
          "object": "model",
          "owned_by": "qwen",
          "capabilities": ["tts"]
        },
        {
          "id": "qwen3:4b",
          "object": "model",
          "owned_by": "ollama",
          "capabilities": ["llm"]
        }
      ]
    }

Field semantics
| Field | Type | Notes |
|---|---|---|
| id | string | Pass verbatim as model on capability endpoints. |
| object | string | Always "model" (OpenAI-spec required). |
| owned_by | string | Free-form attribution to the model author. Display only; do not branch on it. |
| capabilities | array of string | aistack extension. Subset of ["asr","tts","llm"]. Filter the picker by this. |
| languages | array of string (ASR only) | aistack extension. ISO 639-1 codes the model can transcribe. Absent on TTS / LLM entries and on routing aliases. |
| is_routing_alias | boolean | aistack extension. When true, the entry is a virtual id that aistack resolves internally rather than a real model. Currently only id="auto" for ASR. |
When to refresh
- Once at startup, cache by capability.
- Whenever the user opens a model picker (a backend may have just been installed, Ollama may have just started).
- Not before every inference call — the listing is cheap but not free, and stable on the seconds-to-minutes scale.
What it means when an entry is missing
The list reflects what is usable right now, not a static catalog:
| Backend | Visible only when … |
|---|---|
| ASR providers | the corresponding Python library is importable in the venv |
| TTS (Qwen3-TTS) | the Docker container responds to its own /health |
| LLM (Ollama) | aistack can reach Ollama’s /api/tags |
If the TTS container is down, /v1/audio/speech will respond 503; if
Ollama is down, /v1/chat/completions will respond 503. Treat
/v1/models as the discovery layer that tells you not to dispatch
to a missing capability in the first place.
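
As a rough illustration of using the listing as a discovery layer, the sketch below groups a parsed /v1/models response by capability and checks a capability before dispatching. The helper names (group_by_capability, can_dispatch) are hypothetical and not part of aistack, and the rule that a routing alias alone does not count as an installed backend is an assumption of this sketch.

```python
from collections import defaultdict


def group_by_capability(listing: dict) -> dict:
    """Map each capability ("asr", "tts", "llm") to the model ids that serve it."""
    grouped = defaultdict(list)
    for entry in listing.get("data", []):
        for capability in entry.get("capabilities", []):
            grouped[capability].append(entry["id"])
    return dict(grouped)


def can_dispatch(listing: dict, capability: str) -> bool:
    """True when at least one non-alias entry advertises the capability.

    Assumption: a routing alias on its own is not treated as proof that a
    real backend is installed.
    """
    return any(
        capability in entry.get("capabilities", [])
        and not entry.get("is_routing_alias", False)
        for entry in listing.get("data", [])
    )


# Abbreviated sample in the documented shape; here the TTS container is down,
# so no entry carries the "tts" capability.
sample = {
    "object": "list",
    "data": [
        {"id": "auto", "object": "model", "capabilities": ["asr"], "is_routing_alias": True},
        {"id": "whisper-small", "object": "model", "capabilities": ["asr"]},
        {"id": "qwen3:4b", "object": "model", "capabilities": ["llm"]},
    ],
}

by_capability = group_by_capability(sample)
```

A client could call can_dispatch(listing, "tts") before enabling a speech button, and fall back to handling the 503 only for the race where a backend disappears between discovery and dispatch.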
4. Step 2 — Transcribe audio (ASR)
POST /v1/audio/transcriptions mirrors OpenAI’s Whisper API. It accepts
multipart/form-data and returns the language, duration, full text,
per-segment timestamps, and per-word timestamps where the backend
supports them.
Pick a model
Three real backends are exposed when their libraries are installed,
plus the auto routing alias.
| Picker choice | Behavior |
|---|---|
| model=auto | aistack picks based on the language form field: CJK → SenseVoice, European → Parakeet, else → faster-whisper-small. Falls back gracefully when a preferred backend is not installed. |
| model=whisper-small (or any whisper size) | faster-whisper / CTranslate2. Default general-purpose choice. |
| model=parakeet | NVIDIA Parakeet TDT 0.6B v3. Strongest English/European accuracy. Word-level timestamps from the model itself. |
| model=sensevoice | Alibaba SenseVoice Small. Best CJK; also handles English/Japanese/Korean. |
Most consumers should expose auto as the default option in their
picker; advanced users can pin a specific backend.
Example: curl

    curl -X POST http://127.0.0.1:11500/v1/audio/transcriptions \
      -F "file=@speech.mp3" \
      -F "model=auto" \
      -F "language=en" \
      -F "response_format=verbose_json"

Example: Python (httpx)

    import httpx

    with open("speech.mp3", "rb") as f:
        r = httpx.post(
            "http://127.0.0.1:11500/v1/audio/transcriptions",
            files={"file": f},
            data={
                "model": "auto",
                "language": "en",  # optional hint; drives auto routing
                "response_format": "verbose_json",
                "translate": "false",  # set true for Whisper-only "to English" mode
            },
            timeout=120.0,
        )
    r.raise_for_status()
    result = r.json()
    print(result["text"])
    for seg in result["segments"]:
        print(f"[{seg['start']:.2f} → {seg['end']:.2f}] {seg['text']}")

Response shape (response_format=verbose_json)

    {
      "language": "en",
      "duration": 17.18,
      "text": "...",
      "segments": [
        {"id": 0, "start": 0.81, "end": 7.14, "text": "..."}
      ],
      "words": [
        {"start": 0.81, "end": 0.99, "word": "The"}
      ]
    }

words[] is populated for every backend that supports word-level
timestamps; clients should treat its absence as “not available for
this backend / configuration”, not as an error.
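
To make the response shape concrete, here is a minimal sketch that turns the segments array of a verbose_json result into SubRip (SRT) subtitle text and reads words[] defensively. The function names are hypothetical helpers for illustration; only the field names (segments, start, end, text, words) come from the shape above.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp in the form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(result: dict) -> str:
    """Build SRT cues (index, time range, text) from result["segments"]."""
    cues = []
    for index, seg in enumerate(result.get("segments", []), start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


def word_timings(result: dict) -> list:
    """Return per-word timestamps, or an empty list when the backend has none."""
    return result.get("words") or []


# Sample result without words[], as a backend lacking word timestamps might return.
sample_result = {
    "language": "en",
    "duration": 17.18,
    "text": "Hello there.",
    "segments": [{"id": 0, "start": 0.81, "end": 7.14, "text": " Hello there. "}],
}
srt_text = segments_to_srt(sample_result)
```

Because word_timings returns an empty list rather than raising, callers can treat missing word-level data as "not available" in line with the guidance above.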
Streaming transcription with stream=true
For long audio (or any case where the client wants partial results
as they become available), pass stream=true as a form field. The
response is text/event-stream instead of JSON; events follow
OpenAI’s transcription streaming shape with one extension event.
Example:

    curl -N -X POST http://127.0.0.1:11500/v1/audio/transcriptions \
      -F "file=@long_lecture.mp3" \
      -F "model=whisper-small" \
      -F "language=en" \
      -F "stream=true"

Wire format (one event per data: { ... }\n\n frame):

    data: {"type": "transcript.text.delta", "delta": "Hello world.", "segment": {"start": 0.0, "end": 1.7, "words": [{"start":0.0,"end":0.4,"word":"Hello"}, ...]}}

    data: {"type": "transcript.text.delta", "delta": "This is the second segment.", "segment": {"start": 1.7, "end": 4.2, "words": [...]}}

    ... (more deltas as the model produces segments) ...

    data: {"type": "transcript.text.done", "language": "en", "duration": 1020.0}

Event types
| type | Meaning |
|---|---|
| transcript.text.delta | Incremental segment. delta is the segment text; segment carries start/end in seconds and per-word timestamps. |
| transcript.text.done | End of transcription. Carries detected language and total duration. |
| warning (aistack extension) | Emitted before any delta when the chosen model does not support real streaming. Carries code, model, message. The transcription still arrives as a single delta after the warning. |
| error (aistack extension) | Mid-stream failure. Body matches the standard error envelope shape ({kind, provider, message}). No further events follow. |
Streaming-capable vs not
Models advertise their streaming behavior via supports_streaming in
/v1/models. As of the current contract:
- whisper-small (and any whisper size): streams natively, one delta per decoded segment, no warning.
- iic/SenseVoiceSmall: streams natively, one delta per VAD chunk, no warning.
- nvidia/parakeet-tdt-0.6b-v3: does not stream. The client gets a warning event up front, then one delta with the full text, then transcript.text.done. Selecting Parakeet via the picker is fine, but aware clients should hide it from streaming-only workflows by filtering on supports_streaming.
- auto: streams when the language hint routes to a streaming-capable backend, downgrades when it routes to a non-streaming one. supports_streaming on the alias entry is the AND of the candidate pool, so it is false whenever Parakeet is installed.
The downgrade path delivers the full transcription in one event rather
than failing the request, so OpenAI-shape clients that always send
stream=true get a working response. The warning event is the
discoverable signal; aware clients should branch on it.
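
A client-side reading of this event stream might look like the sketch below, which folds data: lines into a transcript and records whether a warning (downgrade) was seen. It makes several assumptions that the guide does not confirm: deltas are joined with a single space, a [DONE] sentinel is tolerated defensively even though the transcription stream may never send one, and the error event is surfaced whole because the exact position of its fields inside the event is not specified here. The function names and the sample warning values are illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each "data: ..." line, skipping everything else."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # blank frame separators and non-data lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":  # defensive only; an assumption, see lead-in
            return
        yield json.loads(payload)


def collect_transcript(lines: Iterable[str]) -> dict:
    """Fold a transcription event stream into text plus a downgrade flag."""
    parts = []
    summary = {"text": "", "downgraded": False, "language": None, "duration": None}
    for event in iter_events(lines):
        kind = event.get("type")
        if kind == "warning":
            summary["downgraded"] = True  # model could not stream natively
        elif kind == "transcript.text.delta":
            parts.append(event["delta"])
        elif kind == "transcript.text.done":
            summary["language"] = event.get("language")
            summary["duration"] = event.get("duration")
            break
        elif kind == "error":
            raise RuntimeError(f"transcription failed mid-stream: {event}")
    summary["text"] = " ".join(parts)  # assumption: deltas carry no own spacing
    return summary


# Illustrative downgrade-path stream; the warning field values are made up.
sample_stream = [
    'data: {"type": "warning", "code": "example-code", "model": "example-model", "message": "no native streaming"}',
    "",
    'data: {"type": "transcript.text.delta", "delta": "Hello world."}',
    "",
    'data: {"type": "transcript.text.done", "language": "en", "duration": 1.7}',
    "",
]
summary = collect_transcript(sample_stream)
```

In a real client the lines would come from the HTTP response's line iterator; keeping the parser independent of the transport makes it easy to test against recorded streams.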
Cancellation
Same pattern as the LLM stream (§6): close the HTTP connection. The
gateway polls request.is_disconnected() and propagates a cancel
token into the worker thread, which checks it between segments.
Long-audio transcriptions abort within ~1 second of disconnect for
streaming-capable backends; Parakeet (downgrade path) honors cancel
only at coarse boundaries.
response_format interaction
When stream=true, response_format is ignored; the response is
always SSE in the shape above. Consumers that need to choose between
plain text / json / verbose_json should leave stream=false.
5. Step 3 — Generate speech (TTS)
POST /v1/audio/speech is a transparent proxy to the locally-running
Qwen3-TTS container. Standard OpenAI fields are accepted (model,
input, voice, response_format); the upstream also exposes
extension fields for voice clone and voice design which pass through
unchanged.
Example: minimal speech synthesis

    curl -X POST http://127.0.0.1:11500/v1/audio/speech \
      -H "content-type: application/json" \
      -d '{
        "model": "qwen3-tts-12hz-0.6b-customvoice",
        "input": "Hello, this is a test of the local TTS gateway.",
        "voice": "alloy",
        "response_format": "wav"
      }' \
      --output out.wav

Example: voice cloning (extension field passthrough)

    curl -X POST http://127.0.0.1:11500/v1/audio/speech \
      -H "content-type: application/json" \
      -d '{
        "model": "qwen3-tts-12hz-0.6b-customvoice",
        "input": "Cloned-voice output.",
        "task_type": "voice_clone",
        "ref_audio": "/path/to/reference.wav",
        "ref_text": "Reference transcript matching the audio."
      }' \
      --output cloned.wav

The proxy is transparent: every field you pass and every byte you get back come straight from Qwen3-TTS. See the upstream’s documentation for the full surface; aistack will keep relaying as the upstream evolves.
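
For clients that are not shelling out to curl, the same requests could be issued from Python roughly as follows. This is a standard-library sketch, not an official client: build_speech_body and synthesize are hypothetical names, the 120-second timeout is an arbitrary assumption, and the extension fields are simply the ones from the voice-cloning example above.

```python
import json
import urllib.request

# Default bind address from this guide; adjust if aistack listens elsewhere.
SPEECH_URL = "http://127.0.0.1:11500/v1/audio/speech"


def build_speech_body(model: str, text: str, **fields) -> dict:
    """Combine the required fields with any optional or extension fields.

    Extension fields are not validated here, mirroring the passthrough
    behavior described above.
    """
    return {"model": model, "input": text, **fields}


def synthesize(body: dict, out_path: str, timeout: float = 120.0) -> None:
    """POST the JSON body and write the returned audio bytes to out_path."""
    request = urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as out:
        out.write(audio)


# Same fields as the voice-cloning curl example.
clone_body = build_speech_body(
    "qwen3-tts-12hz-0.6b-customvoice",
    "Cloned-voice output.",
    task_type="voice_clone",
    ref_audio="/path/to/reference.wav",
    ref_text="Reference transcript matching the audio.",
)
```

Calling synthesize(clone_body, "cloned.wav") would then perform the request against a running gateway.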
6. Step 4 — Chat completion (LLM)
POST /v1/chat/completions mirrors OpenAI’s chat API. Behind the
scenes aistack proxies to a locally-running Ollama daemon and adds two
gateway-level behaviors:
- asr-main cache eviction before forwarding. If aistack was serving an ASR backend recently, its model is dropped from cache before the LLM request is forwarded so VRAM frees up for Ollama’s model load.
- keep_alive default of "30s" when the client omits the field. Sequential LLM calls within 30 seconds reuse the loaded model; idle Ollama returns VRAM to whoever needs it next. Override explicitly for long-running agent sessions.
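
The keep_alive rule can be illustrated with a small request builder that only sends the field when the caller asks for an override, so the gateway default applies otherwise. The function name build_chat_request is a hypothetical helper; the field names and the "5m" value format come from this guide.

```python
from typing import Optional


def build_chat_request(model: str, messages: list, stream: bool = False,
                       keep_alive: Optional[str] = None) -> dict:
    """Build a chat completion body, omitting keep_alive unless overridden."""
    body = {"model": model, "messages": messages, "stream": stream}
    if keep_alive is not None:
        # Explicit override, e.g. "5m" for a long-running agent session.
        body["keep_alive"] = keep_alive
    return body


messages = [{"role": "user", "content": "Translate to Chinese: hello world"}]

# One-off call: field omitted, so aistack injects its "30s" default.
one_off = build_chat_request("qwen3:4b", messages)

# Agent session: keep the model loaded for longer between turns.
agent_turn = build_chat_request("qwen3:4b", messages, keep_alive="5m")
```

Leaving the field out rather than sending a client-side default keeps the VRAM hand-back policy in the gateway's hands.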
Example: non-streaming

    curl -X POST http://127.0.0.1:11500/v1/chat/completions \
      -H "content-type: application/json" \
      -d '{
        "model": "qwen3:4b",
        "messages": [
          {"role": "user", "content": "Translate to Chinese: hello world"}
        ]
      }'

Example: streaming with cancel

    import json

    import httpx


    def user_pressed_cancel() -> bool:
        """Placeholder: wire this to your application's cancel control."""
        return False


    req = {
        "model": "qwen3:4b",
        "messages": [{"role": "user", "content": "Write a short poem."}],
        "stream": True,
    }

    with httpx.stream("POST", "http://127.0.0.1:11500/v1/chat/completions",
                      json=req, timeout=600.0) as r:
        for line in r.iter_lines():
            if not line.startswith("data: "):
                continue
            if line == "data: [DONE]":
                break
            chunk = json.loads(line[6:])
            delta = chunk["choices"][0]["delta"].get("content") or ""
            print(delta, end="", flush=True)
            if user_pressed_cancel():
                # Leaving the with-block closes the HTTP response; aistack
                # stops pulling from Ollama and frees the slot.
                break

You do not need a separate cancel endpoint. Closing the HTTP response is the cancel signal. aistack’s stream loop polls for client disconnect and propagates the abort to the upstream model.
7. The single-task GPU slot
This is the most important gateway behavior to understand because it shapes how clients should structure parallelism.
At most one inference workload runs on the GPU at any moment, across all three capabilities. Concurrent calls to ASR + LLM (or LLM + TTS, etc.) result in:
- The first request acquires the slot and runs.
- The second request returns immediately with HTTP 503 + a Retry-After: 5 header and an error envelope of kind network.
- The client decides whether to back off and retry, or surface an error.
This is intentional. On 8 GB consumer cards, concurrent inference across capabilities OOMs the worker; serializing is the only way to make the gateway stable. On bigger hardware the same policy means predictable resource accounting at no real throughput cost.
Client guidance:
- Treat 503 + Retry-After as a transport-level back-pressure signal, not a fatal error. Sleep the indicated seconds and retry.
- Do not pipeline calls into aistack expecting parallelism; design for serial dispatch with retry on contention.
- For agent loops that interleave ASR / LLM / TTS, run the steps sequentially. aistack’s hot-swap policy will keep VRAM available even though the model changes between steps.
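
The "sleep the indicated seconds" step can be sketched as a small parsing helper. aistack is documented as sending a number of seconds; HTTP also allows Retry-After to carry a date, so this sketch accepts both forms. The name retry_delay, the 5-second fallback when the header is missing or unreadable, and the 60-second cap are assumptions of this example rather than gateway behavior.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(header_value: Optional[str], default: float = 5.0,
                cap: float = 60.0) -> float:
    """Return the seconds to wait for a Retry-After value.

    Accepts delay-seconds ("5") or an HTTP-date; falls back to `default`
    when the header is absent or unparseable, and clamps to [0, cap].
    """
    if header_value is None:
        return default
    try:
        seconds = float(header_value)
    except ValueError:
        try:
            when = parsedate_to_datetime(header_value)
        except (TypeError, ValueError):
            return default  # neither a number nor a parseable date
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        seconds = (when - datetime.now(timezone.utc)).total_seconds()
    return max(0.0, min(seconds, cap))


documented_delay = retry_delay("5")  # the value aistack is documented to send
```

Capping the delay protects an interactive client from a misconfigured proxy in front of the gateway that advertises a very long wait.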
A simple Python retry helper:
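
    import time

    import httpx


    def call_with_retry(method, url, max_attempts=5, **kw):
        """Retry on 503 slot contention, sleeping for Retry-After seconds each time."""
        for attempt in range(max_attempts):
            r = httpx.request(method, url, **kw)
            if r.status_code != 503:
                return r  # success or a non-contention error; the caller inspects it
            retry = float(r.headers.get("Retry-After", "5"))
            time.sleep(retry)
        # Still busy after max_attempts: raise httpx.HTTPStatusError for the last 503.
        r.raise_for_status()

8. Error envelope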
Every non-2xx response carries this shape, including the slot-busy 503 from §7:

    {
      "error": {
        "kind": "network | malformed | overflow | cancelled | unknown",
        "provider": "aistack | Faster-Whisper | Parakeet | SenseVoice | ...",
        "message": "human-readable details safe to surface to users"
      }
    }

Branch on error.kind. The five kinds mean:
| Kind | When | HTTP status | Client response |
|---|---|---|---|
| network | Upstream backend unreachable, model download failed, transport error | 503 | Show “service is down, please start it”; retry after delay |
| malformed | Bad input: unsupported audio format, missing field, unknown model id | 400 | Show error to user; do not retry |
| overflow | Input too large for the chosen model / VRAM | 413 | Suggest a smaller model or shorter input |
| cancelled | Client disconnected mid-request | 499 | Usually no UI needed; user already knows they cancelled |
| unknown | Anything that did not fit the categories above | 500 | Log and surface message; do not retry without diagnosis |
Full reference: errors.
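
Branching on error.kind could be sketched as below. The envelope field names come from this section; the function names and the action labels (retry_after_delay, show_error, and so on) are hypothetical placeholders for whatever a client's UI layer actually does, and the tolerant parsing of non-JSON bodies is a defensive assumption.

```python
import json


def parse_error(body_text: str) -> dict:
    """Extract kind/provider/message, tolerating bodies that are not an envelope."""
    try:
        parsed = json.loads(body_text)
    except ValueError:
        parsed = None
    envelope = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(envelope, dict):
        envelope = {}
    return {
        "kind": envelope.get("kind", "unknown"),
        "provider": envelope.get("provider", "unknown"),
        "message": envelope.get("message", body_text),
    }


def next_action(kind: str) -> str:
    """Map an error kind to a client-side action label (labels are illustrative)."""
    actions = {
        "network": "retry_after_delay",
        "malformed": "show_error",
        "overflow": "suggest_smaller_input",
        "cancelled": "ignore",
    }
    # "unknown" and any future kind fall through to the conservative default.
    return actions.get(kind, "log_and_show")


busy = parse_error(
    '{"error": {"kind": "network", "provider": "aistack", "message": "GPU slot busy"}}'
)
busy_action = next_action(busy["kind"])
```

Routing unrecognized kinds to the same path as unknown keeps the client safe if the set of kinds ever grows additively.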
9. Capability-specific notes
ASR:
- The language form field is a hint, not a constraint. faster-whisper auto-detects when language is omitted; SenseVoice ignores the hint when the audio’s actual language conflicts.
- response_format: json (minimal {text}), verbose_json (full shape with segments + words), text (plain text body).
- translate=true only works on Whisper-family backends. Parakeet and SenseVoice raise malformed if asked to translate.
- For long audio (>5 min), Parakeet on 8 GB hardware uses the local-attention encoder mode automatically; accuracy is near-identical to full attention but the memory ceiling moves from ~3 min to effectively unbounded.

TTS:
- The proxy is transparent; if Qwen3-TTS adds new fields upstream, aistack relays them without code change.
- The TTS container holds a fixed VRAM reservation at startup (configurable via gpu_memory_utilization in docker/tts_qwen3/docker-compose.yml). On 8 GB cards, lower it to 0.5 if the gateway feels tight.

LLM:
- Streaming uses Server-Sent Events with the standard OpenAI shape.
- keep_alive accepts the same string values Ollama does ("30s", "5m", "0", etc.). aistack injects "30s" only when the field is absent.
- Cloud-only LLMs (DeepSeek, Claude, Gemini) are explicitly out of scope for aistack. Call those APIs directly from your client; there is no gain to proxying them.
10. Health check

    curl http://127.0.0.1:11500/health

    {"status": "ok", "version": "0.0.1"}

Connection refused means aistack is not running. A 200 means the
worker is alive but does not certify any specific backend; combine it
with /v1/models to find out which capabilities are usable.
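
Combining the two probes might look like the following sketch, which reports whether the gateway is running and which capabilities are currently listed. The names fetch_json and probe are hypothetical, the fetcher is injectable so the logic can be exercised without a running gateway, and treating any transport or HTTP error on /health as "not running" is a simplifying assumption.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:11500"  # default bind address from this guide


def fetch_json(path: str, timeout: float = 5.0) -> dict:
    """GET a JSON document from the gateway using only the standard library."""
    with urllib.request.urlopen(BASE_URL + path, timeout=timeout) as response:
        return json.load(response)


def probe(fetch=fetch_json) -> dict:
    """Report liveness from /health and usable capabilities from /v1/models."""
    try:
        health = fetch("/health")
    except OSError:
        # Connection refused, timeouts, and urllib's URLError are all OSError.
        return {"running": False, "capabilities": []}
    listing = fetch("/v1/models")
    capabilities = sorted(
        {cap for entry in listing.get("data", []) for cap in entry.get("capabilities", [])}
    )
    return {"running": health.get("status") == "ok", "capabilities": capabilities}


def fake_fetch(path: str) -> dict:
    """Stand-in fetcher returning documented shapes, for offline illustration."""
    if path == "/health":
        return {"status": "ok", "version": "0.0.1"}
    return {"data": [{"id": "whisper-small", "capabilities": ["asr"]},
                     {"id": "qwen3:4b", "capabilities": ["llm"]}]}


status = probe(fake_fetch)
```

In production code, probe() with no arguments uses the real fetcher; a startup screen could show "running, but TTS unavailable" when "tts" is missing from the returned list.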
11. Versioning & stability
What is contract within /v1
The following are stable and only ever change in additive ways within
/v1:
- The set of endpoints listed in §2.
- Field names and types in /v1/models entries.
- The error envelope shape (error.kind, error.provider, error.message).
- The five error kinds in §8 and their HTTP status mappings.
- The 503 + Retry-After semantics for slot contention.
- The OpenAI-shape request/response bodies for the three capability endpoints.
What is not contract
- Whether a specific model id appears in /v1/models. That depends on installed backends and is environment-specific.
- Specific timing characteristics (latency, throughput).
- Internal headers, log lines, or admin UI shape.
- Whether a particular backend is hosted in-process or proxied.
Breaking changes
A breaking change to anything in the contract list above requires
bumping to /v2. /v1 will continue to serve for at least one
release cycle so consumers can migrate without coupled deploys.
12. Where to look next
| You want to … | Read |
|---|---|
| Understand the inventory response in detail | models |
| Build an ASR client | asr |
| Build a TTS client | tts |
| Build an LLM client | llm |
| Branch on errors precisely | errors |
| Stitch your trace IDs through aistack (optional X-Request-ID) | observability |
| Read the metrics / access log / payload capture aistack records about your traffic | observability |
| Understand aistack’s internal architecture (not needed for integration) | aistack repo’s docs/design/architecture.md (internal-only) |
If something behaves differently from what this guide promises, the
guide is the authority — please file an issue against
github.com/dosmoon/aistack.