OpenAI-compatible chat completions. Routes the request through the gateway’s provider layer (OpenAI, Azure OpenAI, Anthropic, Vertex, Bedrock, OpenAI-compatible) based on the resolved model spec.
Streaming. When stream: true, the response is a text/event-stream
of OpenAI-compatible delta chunks (ChatCompletionChunk) terminated by
a literal data: [DONE] payload. Provider-native stream shapes
(Anthropic, Bedrock, Vertex) are normalised to OpenAI deltas before
being forwarded. Provider errors mid-stream are emitted as a synthetic
chunk with a content message followed by [DONE].
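The stream shape above can be consumed with a small accumulator. A minimal sketch, assuming already-decoded text lines and the OpenAI delta layout described here; the function name and error handling are illustrative, not part of the gateway:

```python
import json

def accumulate_stream(sse_lines):
    """Accumulate assistant text from OpenAI-style SSE lines.

    Each event line looks like 'data: {...ChatCompletionChunk...}';
    the stream ends with the literal 'data: [DONE]' sentinel.
    """
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # terminal sentinel, nothing follows
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content") is not None:
            text.append(delta["content"])
    return "".join(text)
```

Because mid-stream provider errors arrive as a synthetic content chunk followed by `[DONE]`, a consumer like this will surface the error message as part of the accumulated text rather than raising.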
When stream: false (default), the response is a single JSON
ChatCompletionResponse. The non-streaming response is enriched with
usage.cost, usage.duration_ms, and usage.carbon (Prisme.ai
extensions over the standard OpenAI shape).
Prisme.ai extensions in the request body:
task_id - opaque correlation identifier for A2A flows.
analytics_context - caller-supplied context (orgSlug, agent_id, user_id, context_id, agent_allowed_models, call_type, message_turn) used to enrich analytics.llm.completion events.
Rate limiting. 100 requests per 60 seconds per consumer (auth.user_id or session.id).
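A request body carrying these extensions can be assembled like any other OpenAI-compatible payload. A sketch only; the helper name is illustrative, and the field names come from the list above:

```python
def build_chat_request(model, messages, task_id=None,
                       analytics_context=None, **params):
    """Assemble an OpenAI-compatible body with optional Prisme.ai extensions.

    Standard OpenAI fields (temperature, stream, tools, ...) pass through
    **params; task_id and analytics_context are only attached when given.
    """
    body = {"model": model, "messages": messages, **params}
    if task_id is not None:
        body["task_id"] = task_id
    if analytics_context is not None:
        body["analytics_context"] = analytics_context
    return body
```

Omitting the extensions yields a body indistinguishable from a plain OpenAI request, so the same helper works for both enriched and vanilla calls.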
Governance. Calls may be rejected with 403 MODEL_NOT_ALLOWED or
429 quota errors based on the caller’s organization governance
(resolved via ai-governance-v2).
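Callers typically want to distinguish the two governance outcomes, since only one is retryable. A hedged sketch, assuming the error body carries an error.code field alongside the HTTP status (the exact error envelope is not specified here):

```python
def classify_gateway_error(status, body):
    """Map gateway error responses to coarse retry guidance.

    403 MODEL_NOT_ALLOWED is a governance rejection: retrying with the
    same model will fail again. 429 is a quota/rate-limit error and is
    safe to retry after backing off.
    """
    if status == 403 and body.get("error", {}).get("code") == "MODEL_NOT_ALLOWED":
        return "switch_model"
    if status == 429:
        return "retry_later"
    return "fail"
```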
User session JWT or instance API key (iak_*). Send as
Authorization: Bearer <token>.
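Both credential types use the same header shape, so a single helper covers them. A trivial sketch (the function name is illustrative):

```python
def auth_headers(token):
    """Bearer auth works for both user-session JWTs and iak_* API keys."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```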
OpenAI-compatible chat completion request. Only the fields actually accepted by the gateway are documented here.
Model id from the catalogue (e.g. gpt-4o,
eu.anthropic.claude-sonnet-4-20250514-v1:0,
vertex-gemini-2.5-flash).
Conversation history (system + user/assistant/tool turns); up to 256 messages.
Sampling temperature (provider-dependent range, typically 0–2).
Max tokens to generate.
Nucleus sampling parameter.
OpenAI-style frequency penalty.
OpenAI-style presence penalty.
One or more stop sequences (string or array of strings).
When true, the response is a text/event-stream of
ChatCompletionChunk deltas terminating with data: [DONE].
Tool/function definitions made available to the model. Forwarded to providers that support tool calling.
Tool selection hint: "auto", "none", "required", or
{ type: "function", function: { name } }.
OpenAI-style structured output hint
(e.g. { "type": "json_object" }).
Provider seed for reproducible sampling (where supported).
Prisme.ai extension. Opaque correlation id propagated to A2A (agent-to-agent) flows.
Prisme.ai extension (limit 128). Caller-supplied analytics context merged into the analytics.llm.completion event.
Successful completion. Content type depends on request.stream:
application/json: non-streaming ChatCompletionResponse.
text/event-stream: SSE stream of ChatCompletionChunk payloads terminated by data: [DONE].
Non-streaming chat completion response. Mirrors OpenAI's shape with
Prisme.ai extensions on usage (cost, duration_ms, carbon).
Generated id (chatcmpl-<correlationId>).
chat.completion
Unix timestamp (seconds).
Resolved model id used to serve the request.
Token, cost, and carbon accounting.
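Since cost, duration_ms, and carbon are extensions over the standard OpenAI usage block, defensive access keeps a consumer compatible with both shapes. A minimal sketch over a decoded response dict (the function name is illustrative):

```python
def summarize_usage(response):
    """Pull token/cost/carbon accounting out of a ChatCompletionResponse dict.

    cost, duration_ms, and carbon are Prisme.ai extensions, so .get() is
    used in case a response omits them.
    """
    usage = response.get("usage", {})
    return {
        "tokens": usage.get("total_tokens"),
        "cost": usage.get("cost"),
        "duration_ms": usage.get("duration_ms"),
        "carbon": usage.get("carbon"),
    }
```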