Streaming LLM responses: why waiting for the full answer feels broken.

Why non-streaming feels slow

A non-streaming LLM response waits until the model has generated every token before sending anything. For a 500-word response at typical generation speeds, that’s 5-15 seconds of silence before the user sees anything.

Users have been trained by typewriter animations, loading spinners, and real-time search suggestions to expect progressive feedback. A blank screen that suddenly fills with text feels like a bug.

Streaming sends tokens as they’re generated. The first token arrives in ~500ms. The user sees text appearing character by character. Even if the total time is the same, streaming feels significantly faster.

Server-side streaming with the OpenAI SDK

import OpenAI from 'openai';

const openai = new OpenAI();

const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Explain how neural networks work.' }],
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) {
    process.stdout.write(delta); // or send to client
  }
}

Each chunk contains a delta with the incremental text. The loop ends when the stream closes.

Streaming to a browser client

The standard approach is Server-Sent Events (SSE) over HTTP. The browser’s EventSource API handles SSE natively:

// Express server endpoint
app.post('/api/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: req.body.messages,
    stream: true,
  });

  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) {
      res.write(`data: ${JSON.stringify({ text: delta })}\n\n`);
    }
  }

  res.write('data: [DONE]\n\n');
  res.end();
});

Client:

const eventSource = new EventSource('/api/chat');

eventSource.onmessage = (event) => {
  if (event.data === '[DONE]') {
    eventSource.close();
    return;
  }
  const { text } = JSON.parse(event.data);
  appendToMessage(text); // update the UI incrementally
};

Using the Vercel AI SDK

For React applications, the Vercel AI SDK abstracts the streaming plumbing:

// app/api/chat/route.ts (Next.js App Router)
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';

export async function POST(req) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai('gpt-4o'),
    messages,
  });

  return result.toDataStreamResponse();
}

// React component
import { useChat } from 'ai/react';

function ChatInterface() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();

  return (
    <div>
      {messages.map(m => (
        <div key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}

useChat handles the streaming, updating messages in real time as tokens arrive.

Handling streaming in React state

For custom implementations, accumulate the streaming text in state:

const [response, setResponse] = useState('');
const [isStreaming, setIsStreaming] = useState(false);

async function sendMessage(userMessage) {
  setIsStreaming(true);
  setResponse('');

  const res = await fetch('/api/chat', {
    method: 'POST',
    body: JSON.stringify({ message: userMessage }),
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const text = decoder.decode(value, { stream: true });
    setResponse(prev => prev + text);
  }

  setIsStreaming(false);
}

What streaming can’t do

Streaming doesn’t work with Structured Outputs (JSON schema mode) — you need the complete response before it can be validated against a schema. For structured output use cases, non-streaming is required, and you should communicate progress differently (a loading spinner, not incremental text).