Streaming

Stream responses token-by-token using server-sent events (SSE). This gives your users a real-time typing experience instead of waiting for the full response.

Enable streaming

Set stream to true in your request (stream=True in Python):

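# Assumes `client` is an already-configured SDK client (setup not shown here).
stream = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a poem about the Nile"}],
    stream=True,
)

# Each chunk carries an incremental delta; content can be None (e.g. on the final chunk).
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)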
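
If you also need the complete text once the stream finishes (for logging or storage, say), you can accumulate the deltas as they arrive. The following is a minimal sketch, not part of the API: collect_stream and fake_chunk are hypothetical helper names, and fake_chunk only imitates the chunk shape used in the loop above so the example runs without a network call.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join streamed delta contents into one string; also return the finish reason."""
    parts = []
    finish_reason = None
    for chunk in stream:
        if not chunk.choices:  # defensively skip chunks that carry no choices
            continue
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason


def fake_chunk(content=None, finish_reason=None):
    """Hypothetical stand-in for an SDK chunk, used only to exercise collect_stream."""
    delta = SimpleNamespace(content=content)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


demo = [fake_chunk("Hello"), fake_chunk(" world"), fake_chunk(finish_reason="stop")]
text, reason = collect_stream(demo)
print(text, reason)
```

With a real stream, you would pass the object returned by the create call in place of demo.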

SSE format

Each event is a single line that starts with data: followed by a JSON object. Events are separated by a blank line:

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

The stream ends with data: [DONE]. This sentinel is not JSON, so check for it before parsing the payload.
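
The format above can be handled one line at a time. Here is a minimal sketch of such a parser, assuming each line arrives complete; parse_sse_line is a hypothetical helper name, not something the API or an SDK provides.

```python
import json


def parse_sse_line(line):
    """Return (content, done) for one line of the stream.

    content is the delta text, or None if the line carries none.
    done is True only for the data: [DONE] sentinel.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None, False  # blank separator line or a non-data field
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None, True  # the sentinel is not JSON, so check before json.loads
    event = json.loads(payload)
    choices = event.get("choices") or []
    if not choices:
        return None, False
    return choices[0].get("delta", {}).get("content"), False


sample = 'data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}'
print(parse_sse_line(sample))
```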

Node.js streaming

// Assumes `client` is an already-configured SDK client (setup not shown here).
const stream = await client.chat.completions.create({
  model: 'gpt-oss-20b',
  messages: [{ role: 'user', content: 'Write a haiku about Cairo' }],
  stream: true,
});

// Write each piece of text to stdout as soon as it arrives.
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}

Raw HTTP streaming

curl -N https://backend.sovereigneg.com/v1/chat/completions \
  -H "Authorization: Bearer sk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'

The -N flag disables curl's output buffering, so you see tokens as they arrive.
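
When you read the raw response yourself instead of using an SDK, network chunks do not necessarily line up with event boundaries: a single data: line, or even a single multi-byte character, can be split across two reads. The sketch below shows one way to buffer for that, under the assumption that the bytes come from some iterable of chunks; iter_data_payloads is a hypothetical helper name, and the demo input is hand-made rather than captured from the API.

```python
import codecs


def iter_data_payloads(byte_chunks):
    """Yield the payload of each complete data: line from an iterable of byte chunks."""
    # The incremental decoder holds back partial UTF-8 sequences between chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in byte_chunks:
        buffer += decoder.decode(chunk)
        # Only newline-terminated lines are complete; keep the tail for the next chunk.
        *lines, buffer = buffer.split("\n")
        for line in lines:
            line = line.rstrip("\r")
            if line.startswith("data:"):
                yield line[len("data:"):].strip()


# Hand-made chunks that split one event, and the sentinel, across reads.
demo = [b'data: {"a":', b' 1}\n\ndata: [DO', b'NE]\n\n']
print(list(iter_data_payloads(demo)))
```

Any trailing text that never receives a newline is treated as incomplete and dropped, which suits a stream whose events each end with a blank line.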

Frontend integration

For web applications, use the fetch API with ReadableStream:

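const response = await fetch('https://backend.sovereigneg.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer sk-...',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'gpt-oss-20b',
    messages: [{ role: 'user', content: 'Hello' }],
    stream: true,
  }),
});
if (!response.ok) throw new Error(`Request failed with status ${response.status}`);

const reader = response.body.getReader();
const decoder = new TextDecoder();
const output = document.getElementById('output');
let buffer = '';      // holds a partial line until the rest of it arrives
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;

  // stream: true keeps multi-byte characters intact across network chunks
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // the last item may be an incomplete line

  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice('data: '.length).trim();
    if (data === '[DONE]') {
      finished = true; // a bare break would only exit this inner loop
      break;
    }

    const parsed = JSON.parse(data);
    const content = parsed.choices[0]?.delta?.content;
    if (content) output.textContent += content; // append to your UI
  }
}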

Performance notes

Time to first token (TTFT): Varies by model, prompt size, and routing — measure on your workload.
Token throughput: Varies by model size and load.
Streaming does not slow generation: the model produces tokens one at a time either way, so streaming only changes when you receive them.
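
To measure these figures on your own workload, you can time the stream as you consume it. This is a rough sketch under stated assumptions: measure_stream is a hypothetical helper, it takes any iterable of text pieces (for example the content values from a streaming loop), and it counts streamed pieces rather than exact tokens, since one piece may hold more than one token.

```python
import time


def measure_stream(pieces, clock=time.perf_counter):
    """Consume an iterable of text pieces and report simple timing figures."""
    start = clock()
    first_at = None
    count = 0
    for piece in pieces:
        if not piece:  # ignore empty or None deltas
            continue
        if first_at is None:
            first_at = clock()  # first non-empty piece marks time to first token
        count += 1
    end = clock()

    ttft = None if first_at is None else first_at - start
    gen_time = 0.0 if first_at is None else end - first_at
    # Rate of the pieces that arrived after the first one, over the time since it.
    rate = (count - 1) / gen_time if count > 1 and gen_time > 0 else None
    return {"ttft_s": ttft, "pieces": count, "pieces_per_s": rate}


# Stand-in generator so the example runs without a network call.
def slow_pieces():
    for piece in ["Hello", " ", "world"]:
        time.sleep(0.01)
        yield piece


print(measure_stream(slow_pieces()))
```

The clock parameter exists only so the timing source can be swapped out, for example in tests.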