
LLM — chat completion


Chat completion (Ollama proxy)

OpenAI-compatible chat completion endpoint, reverse-proxied to a local Ollama daemon (default http://127.0.0.1:11434).

Value-adds over a direct Ollama call. Two things happen between the client and the upstream that justify routing through aistack:

  1. GPU scheduling. Before forwarding, aistack evicts its own in-process ASR cache entries (asr-main) so that LLM inference does not contend for VRAM with a resident Whisper, Parakeet, or SenseVoice model. The call holds the gateway’s single GPU slot for its full duration, so any LLM, ASR, or TTS request that arrives while the slot is held receives HTTP 503 with a Retry-After header.

  2. keep_alive default. When the client omits the keep_alive field, aistack injects "30s" so that Ollama unloads the model shortly after the completion finishes. Sequential LLM calls within that window reuse the loaded model; once the window passes with no further calls, the model is unloaded and its VRAM becomes available to ASR again. Clients that want a different lifetime set keep_alive explicitly.
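Because a busy GPU slot surfaces as HTTP 503 with Retry-After (item 1), a client will usually want a small retry loop. The following is a minimal sketch, not part of aistack: the names parse_retry_after and call_with_retry are hypothetical, the transport is passed in as a callable so the logic stays independent of any HTTP library, and only the delta-seconds form of Retry-After is handled.

```python
import time
from typing import Callable, Mapping, Tuple

# (status code, response headers, raw body): a hypothetical transport-neutral shape.
Response = Tuple[int, Mapping[str, str], bytes]


def parse_retry_after(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Return the Retry-After delay in seconds, or `default` if absent or non-numeric."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return max(0.0, float(value))
            except ValueError:
                return default  # the HTTP-date form is not handled in this sketch
    return default


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send` until it returns a non-503 status or the attempts run out."""
    response = send()
    for _ in range(max_attempts - 1):
        if response[0] != 503:
            break
        # Wait for as long as the gateway asked before trying again.
        sleep(parse_retry_after(response[1]))
        response = send()
    return response
```

Here send would wrap whatever HTTP client the application already uses. Note that a 503 can also mean Ollama is unreachable, in which case retrying on a short delay may not help.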

Streaming. When stream=true the response is forwarded chunk-by-chunk via FastAPI StreamingResponse, so the client sees first tokens as soon as Ollama emits them. Client disconnect propagates: aistack closes the upstream connection so Ollama’s runner can abort generation rather than running to completion on a dead socket.

Cancellation. Client disconnect mid-stream releases the GPU slot promptly and aborts the upstream call.

Request schema. OpenAI-compatible — see the OpenAI Chat Completion API reference for field semantics. aistack does not transform the request body except to inject the keep_alive default; it forwards every other field verbatim.
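The single body transformation described above can be sketched as follows. This illustrates the documented behaviour and is not aistack's actual code: the function name with_keep_alive_default is hypothetical, and the sketch assumes that "omits" means the key is absent (a client-supplied value, including null, is left alone).

```python
import json

DEFAULT_KEEP_ALIVE = "30s"  # the default stated in this document


def with_keep_alive_default(raw_body: bytes) -> bytes:
    """Add keep_alive when the client left it out; leave every other field untouched."""
    body = json.loads(raw_body)  # raises json.JSONDecodeError if the body is not valid JSON
    if isinstance(body, dict) and "keep_alive" not in body:
        body["keep_alive"] = DEFAULT_KEEP_ALIVE
    return json.dumps(body).encode("utf-8")
```

A request that already carries "keep_alive": "5m" passes through with that value unchanged. Re-serializing may change whitespace, but not the fields or their values.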

Ollama’s response, forwarded verbatim. Schema follows OpenAI’s /v1/chat/completions contract — see https://platform.openai.com/docs/api-reference/chat for the field reference. When stream=true in the request, the response is a Server-Sent Events stream of OpenAI-shape delta chunks terminated by data: [DONE].

  • application/json → object
  • text/event-stream → string
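For the text/event-stream case, a client has to pick the data: lines out of the stream, stop at data: [DONE], and read the text from each delta chunk. The sketch below assumes the usual OpenAI chunk shape (choices[].delta.content) and takes an iterable of already-decoded lines; the function name iter_delta_text is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_delta_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-shape streaming chunks."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # Role-only or empty deltas carry no content and are skipped.
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments reconstructs the full assistant message. Closing the underlying connection before [DONE] arrives is what triggers the disconnect handling described under Streaming.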

Request body is not valid JSON.

Ollama upstream produced an unexpected error.

Either the GPU slot is busy serving another inference (gateway-level), or Ollama is unreachable (e.g. the daemon is not running). The error envelope’s provider field distinguishes the two.


Wire format for every non-2xx response from aistack.

The shape is identical regardless of which endpoint produced the error, so consumers can write one error-handling helper and reuse it across capabilities.

Field    Type         Required    Description
error    ErrorBody    yes
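Since every non-2xx response shares this envelope, one helper can cover all endpoints. The sketch below is illustrative only: GatewayError and raise_for_error are hypothetical names, and it assumes the provider field mentioned above sits inside the error object, since the ErrorBody fields are not listed in this section.

```python
import json
from typing import Any, Dict


class GatewayError(Exception):
    """Raised for any non-2xx response; keeps the decoded error body for inspection."""

    def __init__(self, status: int, body: Dict[str, Any]) -> None:
        super().__init__(f"HTTP {status}: {body}")
        self.status = status
        self.body = body
        # Assumed location of `provider`; adjust once the ErrorBody schema is known.
        self.provider = body.get("provider")


def raise_for_error(status: int, raw_body: bytes) -> None:
    """Do nothing on 2xx; otherwise raise GatewayError with the envelope's `error` object."""
    if 200 <= status < 300:
        return
    try:
        envelope = json.loads(raw_body)
    except ValueError:
        envelope = {}  # body was not JSON (for example, an intermediary's error page)
    error = envelope.get("error") if isinstance(envelope, dict) else None
    raise GatewayError(status, error if isinstance(error, dict) else {})
```

A caller could then inspect exc.provider on a 503 to decide whether to retry after a short delay (GPU slot busy) or report that Ollama is down.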