POST /v1/chat/completions
OpenAI-compatible chat completion endpoint. aistack reverse-proxies
the request to a local Ollama daemon (default
http://127.0.0.1:11434) and adds two pieces of value over a direct
call.
The full request and response schema lives in OpenAI’s authoritative Chat Completion API reference. aistack proxies that schema verbatim: Ollama’s own OpenAI-compat layer mirrors it, and aistack does not transform the body beyond filling in keep_alive when it is missing. The auto-generated LLM reference describes aistack’s own envelope around the proxy (response status codes, error shapes).
This page covers the why: the two value-adds (GPU scheduling and the keep_alive policy), error attribution, and how concurrency works across the gateway.
What aistack adds over a direct Ollama call
- GPU scheduling. Before forwarding, aistack evicts its own ASR asr-main cache entries: the in-process Whisper / Parakeet / SenseVoice model that may be sitting in VRAM gets dropped, with gc.collect() and torch.cuda.empty_cache(), so the LLM model can fit. The handler then holds the global gateway GPU slot for the entire upstream call.
- Sensible keep_alive default. When the request omits keep_alive, aistack injects "30s". Sequential LLM calls within 30 s reuse the loaded model; once idle, Ollama releases the VRAM back for ASR. Pass an explicit value ("5m", "1h", "-1" for forever, "0" for unload now) to override.
Clients written against OpenAI’s /v1/chat/completions work
unchanged — aistack does not alter messages, tools,
response_format, or any other OpenAI field. It only fills in
keep_alive when missing.
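The injection rule can be sketched in a few lines. This is a minimal illustration, not aistack's actual handler code: the helper name with_default_keep_alive and the constant name are hypothetical. Only the "30s" default and the rule that every other field is left alone come from the text above.

```python
import copy

DEFAULT_KEEP_ALIVE = "30s"  # default stated above; the constant name is hypothetical


def with_default_keep_alive(body: dict) -> dict:
    """Return a copy of the request body with keep_alive filled in if absent.

    Hypothetical helper: every other field (messages, tools,
    response_format, ...) is passed through untouched.
    """
    out = copy.deepcopy(body)
    # setdefault only writes when the key is missing, so explicit client
    # values such as "5m", "-1" or "0" are preserved.
    out.setdefault("keep_alive", DEFAULT_KEEP_ALIVE)
    return out
```

A client that wants the model unloaded immediately after a one-off call would send "keep_alive": "0" and see that value forwarded as-is.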
Resource scheduling explained
When this endpoint receives a request:
- Acquire global GPU slot. Concurrent requests across ASR / TTS / LLM get HTTP 503 with Retry-After. The slot represents “GPU is doing inference” regardless of which process owns the kernels.
- Evict ASR. _model_cache.evict_category("asr-main") drops every currently-resident ASR main model.
- Inject keep_alive. If the request body omits it, set it to "30s".
- Forward to Ollama. Stream the response (when stream=true) chunk-by-chunk so the first token reaches the client as quickly as possible.
- Release the slot. On normal completion, on stream end, on client disconnect, or on upstream error.
This means: a fresh ASR call right after an LLM call may pay a
cold-load latency (the asr-main model was evicted to make room).
That trade-off is intentional — the alternative is OOM on tight
VRAM. For workflows that hammer the LLM many times in a row
(e.g. batch translation), the 30 s keep_alive default keeps Ollama
warm across calls. For workflows that interleave ASR and LLM
tightly (e.g. a real-time multimodal agent), expect cold-load
taxes either way on 8 GB hardware; lowering Qwen3-TTS Docker’s
gpu_memory_utilization is the principal lever to recover VRAM.
Streaming
Every Ollama-served LLM advertises supports_streaming: true in
/v1/models. When stream=true, aistack forwards Ollama’s SSE
chunks through unchanged — first-token latency is Ollama’s prefill
time plus a sub-millisecond proxy overhead on localhost.
Client disconnect propagates: aistack closes the upstream connection on disconnect so Ollama’s runner can abort generation rather than running to completion on a dead socket.
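On the client side, a streamed response in OpenAI's chunk format arrives as server-sent event lines of the form data: {json}, terminated by data: [DONE]. The sketch below extracts the text deltas from such lines. The helper name iter_content is hypothetical, and it assumes the caller already has decoded text lines from whichever HTTP client it uses.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Stopping iteration early and closing the HTTP response is what triggers the disconnect propagation described above.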
Error attribution
aistack distinguishes errors that originate at the gateway boundary from errors that originate at Ollama itself.
Gateway-originated errors use the standard
{error: {kind, provider, message}} envelope (see
errors):
    {
      "error": {
        "kind": "network",
        "provider": "aistack",
        "message": "Ollama is not reachable at http://127.0.0.1:11434. Start it with: ollama serve"
      }
    }

Backend-originated errors (Ollama rejected the request because of an unknown model, malformed messages, etc.) are passed through verbatim. The client sees Ollama’s error format directly. This is intentional: existing OpenAI-compatible clients have their own expectations for what an Ollama / OpenAI error looks like, and re-wrapping them would break that.
So: branch on error.kind when present (it is the aistack
envelope); fall back to OpenAI / Ollama error parsing for any
non-2xx response without a kind field.
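A client-side sketch of that branching rule follows. The function name classify_error is hypothetical, and because the exact shape of a passed-through backend error is not specified here, the fallback handles both an object-valued and a string-valued error field defensively.

```python
def classify_error(status: int, payload: object) -> tuple[str, str]:
    """Return (origin, detail) for a non-2xx response body."""
    err = payload.get("error") if isinstance(payload, dict) else None
    # The aistack envelope is recognised by the presence of error.kind.
    if isinstance(err, dict) and "kind" in err:
        return "gateway", str(err["kind"])
    # Anything else is treated as a passed-through backend error.
    if isinstance(err, dict):
        return "backend", str(err.get("message", ""))
    if isinstance(err, str):
        return "backend", err
    # No parseable error body: fall back to the status code alone.
    return "backend", f"HTTP {status}"
```

A caller might retry on ("gateway", "network") and surface backend errors to the user unchanged.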
Health and inventory
- GET /v1/models aggregates Ollama’s installed models with capabilities=["llm"] and owned_by="ollama". If Ollama is unreachable the LLM entries are silently omitted (no entry, no error); the rest of /v1/models continues to serve.
- A direct POST /v1/chat/completions while Ollama is unreachable returns 503 network with the actionable message above.
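Because missing LLM entries are the only signal in /v1/models that Ollama is down, a client can check for them before sending a chat request. The sketch below assumes the response uses OpenAI's list-models layout (a top-level data array of objects with an id field). That layout is an assumption here, and the function name llm_model_ids is hypothetical. Only the capabilities and owned_by fields come from the text above.

```python
def llm_model_ids(models_payload: dict) -> list[str]:
    """Return the ids of entries that advertise the "llm" capability."""
    return [
        entry["id"]
        for entry in models_payload.get("data", [])  # "data" layout is assumed
        if "llm" in entry.get("capabilities", [])
    ]
```

An empty result means either that no Ollama models are installed or that Ollama was unreachable when the inventory was built. The inventory alone cannot tell the two apart.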
Configuration
Server-side env var:

    AISTACK_OLLAMA_URL    default http://127.0.0.1:11434

Set in scripts/dev.bat or your launcher when Ollama runs on a
non-default port or another host on the LAN.
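Resolving the setting might look like the sketch below. The function name ollama_base_url is hypothetical, and two behaviours are assumptions rather than documented aistack behaviour: an empty value falls back to the default, and a trailing slash is stripped.

```python
import os
from typing import Mapping

DEFAULT_OLLAMA_URL = "http://127.0.0.1:11434"  # default stated above


def ollama_base_url(env: Mapping[str, str] = os.environ) -> str:
    """Return the Ollama base URL from AISTACK_OLLAMA_URL, or the default."""
    # "or" treats an unset and an empty variable the same way (assumption).
    url = env.get("AISTACK_OLLAMA_URL") or DEFAULT_OLLAMA_URL
    # Strip a trailing slash so paths can be appended uniformly (assumption).
    return url.rstrip("/")
```

Passing the environment as a parameter keeps the lookup easy to exercise in tests without mutating the process environment.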
Stability
OpenAI-compatible request and response shapes within /v1 follow
OpenAI’s published spec. Streaming format follows OpenAI’s chunk
schema. aistack-side behaviour (eviction, keep_alive injection)
is documented above and stable within /v1; if it changes
meaningfully (e.g. a different default keep_alive), it will be a
/v1 additive change with a release note, not a /v2 break.