LLM — chat completion
POST /v1/chat/completions
Chat completion (Ollama proxy)
OpenAI-compatible chat completion endpoint, reverse-proxied to a
local Ollama daemon (default http://127.0.0.1:11434).
Value-adds over a direct Ollama call. Two things happen between the client and the upstream that justify routing through aistack:
- GPU scheduling. Before forwarding, aistack evicts its own in-process ASR `asr-main` cache entries so that LLM inference does not contend with a hot Whisper/Parakeet/SenseVoice model for VRAM. The whole call holds the gateway’s single GPU slot, so concurrent LLM/ASR/TTS requests get HTTP 503 with `Retry-After`.
- `keep_alive` default. When the client omits the `keep_alive` field, aistack injects `"30s"` so Ollama releases the model shortly after the completion. Sequential LLM calls within that window reuse the loaded model; once the window lapses, the VRAM is released back for ASR. Clients that want a different lifetime set `keep_alive` explicitly.
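The `keep_alive` default described above can be pictured with a minimal sketch. This is not aistack's actual code: the function name `with_keep_alive_default` and the constant `DEFAULT_KEEP_ALIVE` are hypothetical, and the sketch assumes the request body is a JSON object.

```python
import json

# Default lifetime named in the text above; the constant name is hypothetical.
DEFAULT_KEEP_ALIVE = "30s"


def with_keep_alive_default(raw_body: bytes) -> bytes:
    """Return the request body with keep_alive filled in if the client omitted it."""
    payload = json.loads(raw_body)
    if not isinstance(payload, dict):
        raise ValueError("chat completion body must be a JSON object")
    # setdefault only writes when the key is absent, so an explicit
    # client value is forwarded unchanged.
    payload.setdefault("keep_alive", DEFAULT_KEEP_ALIVE)
    return json.dumps(payload).encode("utf-8")
```

Using `setdefault` rather than an unconditional assignment is what keeps the client override path intact: a body that already carries `keep_alive` passes through with its own value.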
Streaming. When stream=true the response is forwarded
chunk-by-chunk via FastAPI StreamingResponse, so the client sees
first tokens as soon as Ollama emits them. Client disconnect
propagates: aistack closes the upstream connection so Ollama’s
runner can abort generation rather than running to completion on
a dead socket.
Cancellation. Client disconnect mid-stream releases the GPU slot promptly and aborts the upstream call.
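On the client side, the stream can be consumed line by line. The sketch below assumes the OpenAI chunk shape (`choices[].delta.content`) and the `data: [DONE]` terminator. The function name `iter_content_deltas` is hypothetical, and the input is any iterable of already-decoded text lines, for example from whichever HTTP client is in use.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-shape SSE lines until data: [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and SSE comments carry no payload
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # delta may be missing or carry only a role; skip those chunks
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

A caller that stops iterating early and closes the HTTP response is the "client disconnect" case described above, which per this section lets aistack close the upstream connection and free the GPU slot.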
Request schema. OpenAI-compatible — see the OpenAI Chat
Completion API reference for field semantics. aistack does not
transform the request body except to inject the keep_alive
default; it forwards every other field verbatim.
Responses
Ollama’s response, forwarded verbatim. Schema follows OpenAI’s /v1/chat/completions contract; see https://platform.openai.com/docs/api-reference/chat for the field reference. When stream=true in the request, the response is a Server-Sent Events stream of OpenAI-shape delta chunks terminated by data: [DONE].
application/json→ object
text/event-stream→ string
Request body is not valid JSON.
application/json→ErrorEnvelope
Ollama upstream produced an unexpected error.
application/json→ErrorEnvelope
Either the GPU slot is busy serving another inference (gateway-level), or Ollama is unreachable (e.g. the daemon is not running). The error envelope’s provider field distinguishes the two.
application/json→ErrorEnvelope
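Because a busy GPU slot is reported as HTTP 503 with `Retry-After`, a client can derive its back-off from the response. The helper below is a minimal sketch, not part of aistack: the name `retry_delay_seconds`, the 1-second fallback, and the 30-second cap are all assumptions. It handles only the delta-seconds form of `Retry-After` and falls back when the header is missing or in another form.

```python
from typing import Mapping, Optional


def retry_delay_seconds(
    status: int,
    headers: Mapping[str, str],
    fallback: float = 1.0,  # assumed wait when Retry-After is absent or unparsable
    cap: float = 30.0,      # assumed upper bound on any single wait
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if the status is not 503."""
    if status != 503:
        return None
    value = None
    for name, header_value in headers.items():
        if name.lower() == "retry-after":  # header names are case-insensitive
            value = header_value
            break
    try:
        delay = float(value)
    except (TypeError, ValueError):
        # Missing header, or the HTTP-date form, which this sketch does not parse.
        delay = fallback
    return max(0.0, min(delay, cap))
```

The sketch does not branch on the envelope's provider field, since its possible values are not listed in this section. A client that wants to treat "Ollama unreachable" differently from "slot busy" would inspect that field before deciding to retry.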
Schemas
ErrorEnvelope {#schema-errorenvelope}
Wire format for every non-2xx response from aistack.
The shape is identical regardless of which endpoint produced the error, so consumers can write one error-handling helper and reuse it across capabilities.
| Field | Type | Required | Description |
|---|---|---|---|
| error | ErrorBody | yes | |