Latency budget in a real-time AI app: where the milliseconds go.

When a user sends a message and nothing happens for three seconds, that feels broken. But where did those three seconds go? Without measuring each segment of the request path, optimization is guesswork.

Defining a latency budget

A latency budget is the maximum acceptable time for an end-to-end operation, broken into allocations for each stage. For a chat application targeting 1500ms to first token rendered on screen:

Stage	Budget
Client to server (network)	50ms
Auth middleware	10ms
Context retrieval (DB/cache)	100ms
LLM first token	800ms
Server to client (network + streaming setup)	100ms
Client render	50ms
Total	1110ms

That leaves 390ms of headroom, which sounds comfortable until you add logging, error handling, and the reality that P99 latency is 3-4x the median.

Instrumenting each hop

You cannot manage what you do not measure. Add timestamps at each boundary:

async function handleChatRequest(req, res) {
  const timings = { start: Date.now() };

  // Auth
  const user = await authenticate(req);
  timings.auth = Date.now();

  // Context retrieval
  const context = await fetchUserContext(user.id);
  timings.context = Date.now();

  // LLM call - track first token separately
  const stream = await client.chat.completions.create({
    model: "gpt-4o",
    messages: buildMessages(context, req.body.message),
    stream: true
  });
  timings.llmCallSent = Date.now();

  let firstToken = true;
  for await (const chunk of stream) {
    if (firstToken) {
      timings.firstToken = Date.now();
      firstToken = false;
    }
    const content = chunk.choices[0]?.delta?.content ?? "";
    if (content) res.write(`data: ${JSON.stringify({ content })}\n\n`);
  }
  timings.llmComplete = Date.now();
  res.end();

  // Log the breakdown
  console.log({
    authMs: timings.auth - timings.start,
    contextMs: timings.context - timings.auth,
    llmSetupMs: timings.llmCallSent - timings.context,
    ttftMs: timings.firstToken - timings.llmCallSent, // time to first token
    generationMs: timings.llmComplete - timings.firstToken,
    totalMs: timings.llmComplete - timings.start
  });
}

TTFT (time to first token) is the most important number for perceived responsiveness. Users tolerate slow generation better than they tolerate a blank screen.

WebSocket vs HTTP streaming

For real-time AI apps, the transport choice affects latency:

HTTP with Server-Sent Events (SSE): Simple, works over HTTP/1.1, supported natively by browsers. Each request opens a connection, the server streams events, the connection closes. Overhead per request: one TCP handshake, optionally one TLS handshake.

WebSocket: Persistent connection, lower per-message overhead once established. Better for bidirectional communication (chat, voice). The initial handshake takes one round trip more than a plain HTTP request.

For a chat interface where the user sends a message and waits, SSE is simpler and has comparable latency. For voice pipelines where audio is flowing continuously in both directions, WebSocket is the right choice.

// SSE setup
app.get("/stream", (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  // stream events...
});

// WebSocket setup with ws library
const wss = new WebSocketServer({ port: 8080 });
wss.on("connection", (ws) => {
  ws.on("message", async (data) => {
    const response = await getAIResponse(data.toString());
    ws.send(JSON.stringify({ type: "chunk", content: response }));
  });
});

Where latency hides

Cold starts: If your server is serverless, the first request after inactivity includes function initialization. This can add 500ms to 2 seconds. Warm the function with a scheduled ping, or use an always-on server for latency-sensitive paths.

Context retrieval: Fetching chat history from a database before every LLM call is a common hidden cost. Cache the last N messages in memory or Redis keyed to the session.

Token count: Longer prompts mean more tokens for the model to process before generating. Summarize old conversation history instead of sending the full log.

Model selection: GPT-4o mini has roughly 2-3x lower TTFT than GPT-4o at the cost of response quality. Route simple queries to the smaller model.

Region mismatch: If your server is in us-east-1 but your AI API endpoint is optimized for us-west-2, you’re adding a cross-region hop on every request. Check which regions your AI provider’s APIs are physically closest to.

Perceived vs actual latency

A spinner that appears immediately makes a 1.5-second wait feel shorter than a blank screen for 800ms. Optimistic UI patterns matter:

Show the user’s message immediately on send
Show a typing indicator within 100ms of send
Start rendering the streaming response character by character

Streaming to the client and rendering tokens as they arrive converts a “waiting for response” experience into a “watching it think” experience. The total time is the same; the perceived responsiveness is completely different.

Track P50, P90, and P99 latency separately. A mean of 800ms with a P99 of 5 seconds means a significant fraction of your users are having a bad experience that your average hides.