STT to LLM to TTS: a pipeline where every hop adds latency.

A voice AI pipeline has three stages: speech-to-text (STT) converts audio to a transcript, an LLM generates a response, and text-to-speech (TTS) converts that response back to audio. Each hop adds latency, and understanding where the time goes is the first step to reducing it.

The baseline numbers

A reasonable baseline for each stage on current cloud APIs:

Stage	Latency
STT (Whisper API, ~5s audio)	500-1200ms
LLM (GPT-4o, first token)	300-800ms
LLM (full response, 50 tokens)	1000-3000ms
TTS (first audio chunk, streaming)	200-600ms

End-to-end time before the user hears anything is roughly 1-3 seconds if you stream and overlap the stages: STT, then the LLM's first token and first sentence, then the first TTS chunk. If you wait for complete output at each stage, you’re looking at 5-8 seconds.
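The streaming estimate can be checked by adding up the ranges in the table. The sketch below is only arithmetic on those figures. The names stages and sumRanges are illustrative, and the streaming sum ignores the time the LLM needs to finish its first sentence.

```javascript
// Stage latency ranges in milliseconds, copied from the table above
const stages = {
  stt: [500, 1200],
  llmFirstToken: [300, 800],
  llmFullResponse: [1000, 3000],
  ttsFirstChunk: [200, 600],
};

// Add [low, high] ranges element-wise
function sumRanges(...ranges) {
  return ranges.reduce(([lo, hi], [a, b]) => [lo + a, hi + b], [0, 0]);
}

// Streaming: STT + first LLM token + first TTS chunk
const streaming = sumRanges(stages.stt, stages.llmFirstToken, stages.ttsFirstChunk);
// → [1000, 2600]

// Sequential lower bound: STT + full LLM response + at least one TTS chunk.
// Synthesizing the whole clip takes longer than its first chunk, so the
// real sequential figure sits above this range.
const sequentialFloor = sumRanges(stages.stt, stages.llmFullResponse, stages.ttsFirstChunk);
// → [1700, 4800]
```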

The naive sequential implementation

// Don't do this for real-time voice
async function voicePipelineNaive(audioBlob) {
  // Stage 1: STT
  const transcript = await transcribe(audioBlob);

  // Stage 2: LLM - wait for complete response
  const llmResponse = await generateFullResponse(transcript);

  // Stage 3: TTS - wait for complete audio
  const audioBuffer = await synthesize(llmResponse);

  return audioBuffer; // 5-8 seconds later
}

This is the worst case for latency: every stage waits for the previous one to finish completely, so the stage latencies simply add up.
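Before optimizing, it helps to measure where the time actually goes. The sketch below wraps each stage of the sequential pipeline in a timer. The timed helper and the voicePipelineTimed wrapper are hypothetical names, and the three stage functions are passed in so the example does not assume any particular provider.

```javascript
// Run an async stage and record its duration in milliseconds under `label`
async function timed(label, timings, fn) {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    timings[label] = performance.now() - start;
  }
}

// Sequential pipeline with per-stage timings.
// transcribe, generateFullResponse and synthesize are supplied by the caller.
async function voicePipelineTimed(audioBlob, { transcribe, generateFullResponse, synthesize }) {
  const timings = {};
  const transcript = await timed("stt", timings, () => transcribe(audioBlob));
  const reply = await timed("llm", timings, () => generateFullResponse(transcript));
  const audio = await timed("tts", timings, () => synthesize(reply));
  return { audio, timings };
}
```

Logging the timings object for a handful of real requests shows which stage dominates in your own setup, which may differ from the baseline table.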

Streaming overlap: the key optimization

The core insight is that you can start TTS before the LLM finishes. As tokens stream out of the LLM, you accumulate them into sentence-sized chunks and send each chunk to TTS immediately.

const OpenAI = require("openai");
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function voicePipelineStreaming(audioBlob) {
  // Stage 1: STT (must complete before LLM can start)
  const transcript = await transcribe(audioBlob);

  // Stage 2+3: LLM streams, TTS starts on first sentence
  const llmStream = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: transcript }],
    stream: true
  });

  let buffer = "";
  const audioQueue = [];

  for await (const chunk of llmStream) {
    const token = chunk.choices[0]?.delta?.content ?? "";
    buffer += token;

    // Match everything up to the last sentence-ending punctuation
    // that is followed by whitespace
    const complete = buffer.match(/^[\s\S]*[.!?](?=\s)/);
    if (complete && complete[0].length > 20) {
      // Flush only finished sentences; keep the start of the next one
      const sentence = complete[0].trim();
      buffer = buffer.slice(complete[0].length);
      // Don't await: TTS requests run concurrently.
      // synthesizeAndPlay must still play clips in call order.
      audioQueue.push(synthesizeAndPlay(sentence));
    }
  }

  // Flush remainder
  if (buffer.trim()) audioQueue.push(synthesizeAndPlay(buffer.trim()));
  await Promise.all(audioQueue);
}

The first TTS request fires the moment the LLM produces its first sentence. The user starts hearing audio 1-2 seconds after they stop speaking, rather than 5-8.
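Because the loop calls synthesizeAndPlay without awaiting it, a short second sentence could finish synthesis before a long first one. One way to keep playback in order while still overlapping the TTS requests is to chain playback on a promise. This is a minimal sketch: createOrderedPlayer is a hypothetical helper, and synthesize (text to audio clip) and play (resolves when the clip has finished playing) are assumed to be supplied by your TTS and audio layers.

```javascript
// Build a synthesizeAndPlay function that overlaps synthesis
// but plays clips strictly in the order they were requested.
function createOrderedPlayer(synthesize, play) {
  let tail = Promise.resolve(); // resolves when the last queued clip has played

  return function synthesizeAndPlay(sentence) {
    // Start synthesis immediately so requests run concurrently
    const clip = synthesize(sentence);
    // Mark the promise as handled; the error still surfaces via `await clip` below
    clip.catch(() => {});

    // Wait for the previous clip to finish, then play this one
    const result = tail.then(async () => play(await clip));
    // A failed clip should not block the clips queued after it
    tail = result.catch(() => {});
    return result;
  };
}
```

Each returned promise resolves when its own clip has finished playing, so the Promise.all at the end of the pipeline still waits for the full response.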

The sentence boundary problem

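Splitting on every period is fragile: “Dr. Smith said…” gets cut after “Dr.”. A minimal fix is to scan for candidate boundaries and skip the ones that follow a known abbreviation:

// Illustrative list of abbreviations that end in a period
// without ending a sentence; extend it for your domain
const ABBREVIATIONS = new Set(["dr", "mr", "mrs", "ms", "prof", "sr", "jr", "st", "vs", "etc"]);

function flushOnSentence(buffer) {
  // Candidate boundary: ., ! or ? followed by whitespace
  const boundary = /[.!?]\s+/g;
  let match;
  while ((match = boundary.exec(buffer)) !== null) {
    const end = match.index + match[0].length;
    const chunk = buffer.slice(0, end).trim();
    // Skip a period that directly follows an abbreviation such as "Dr."
    const lastWord = chunk.slice(0, -1).split(/\s+/).pop().toLowerCase();
    if (match[0][0] === "." && ABBREVIATIONS.has(lastWord)) continue;
    // Skip very short chunks so TTS is not sent fragments
    if (chunk.length <= 15) continue;
    return { chunk, remainder: buffer.slice(end) };
  }
  return null;
}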

For production, a small NLP sentence tokenizer is worth the dependency.
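If you would rather avoid a dependency, the built-in Intl.Segmenter (available in current browsers and recent Node.js versions) can also split text into sentences. The sketch below is one possible way to use it on a streaming buffer; splitCompleteSentences is a hypothetical helper. Whether abbreviations such as “Dr.” are handled correctly depends on the runtime's locale data, so test it against your own inputs before relying on it.

```javascript
// Split a streaming buffer into finished sentences plus an unfinished remainder
function splitCompleteSentences(buffer, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  const segments = Array.from(segmenter.segment(buffer), (s) => s.segment);
  // The last segment may still be growing, so hold it back
  const remainder = segments.pop() ?? "";
  const sentences = segments.map((s) => s.trim()).filter(Boolean);
  return { sentences, remainder };
}

const { sentences, remainder } = splitCompleteSentences("How are you? I am fine! And then");
// sentences → ["How are you?", "I am fine!"]
// remainder → "And then"
```

The remainder is carried over into the next call, and whatever is left when the LLM stream ends is flushed as the final chunk.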

STT: streaming vs batch

Whisper (via OpenAI API) is a batch model. You send audio, you get text. There is no word-by-word streaming from the current hosted API.

For lower STT latency, options include:

Deepgram Nova-2: streaming STT with word-level timestamps and ~200ms latency
AssemblyAI: streaming with similar latency
Whisper self-hosted: remove the network hop, run locally

Deepgram’s streaming API uses WebSockets:

const { createClient, LiveTranscriptionEvents } = require("@deepgram/sdk");
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

const connection = deepgram.listen.live({ model: "nova-2", smart_format: true });

connection.on(LiveTranscriptionEvents.Transcript, (data) => {
  const words = data.channel.alternatives[0].transcript;
  if (data.is_final && words) {
    onTranscript(words);
  }
});

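With streaming STT, transcription runs while the user is still talking, so the transcript is nearly complete by the time voice activity detection (VAD) reports that they have stopped. You can start the LLM call at that point, or even on an interim transcript before the user has fully finished, which overlaps STT and LLM processing.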
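One simple way to decide when to hand the transcript to the LLM is to collect finalized fragments and fire once no new fragment has arrived for a short silence window. This is a sketch under assumptions: createUtteranceCollector is a hypothetical helper, the 500ms default is an arbitrary starting point to tune, and a provider's own endpointing signal can replace the timer if it offers one.

```javascript
// Collect finalized transcript fragments and emit one utterance
// after `silenceMs` without any new fragment.
function createUtteranceCollector(onUtterance, silenceMs = 500) {
  let parts = [];
  let timer = null;

  return function onFinalTranscript(text) {
    parts.push(text);
    // Each new fragment restarts the silence window
    clearTimeout(timer);
    timer = setTimeout(() => {
      const utterance = parts.join(" ");
      parts = [];
      onUtterance(utterance);
    }, silenceMs);
  };
}
```

The returned function can be used as the onTranscript callback in the Deepgram handler, with onUtterance starting the LLM call. A shorter window lowers latency but risks cutting the user off mid-thought.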

Interruption handling

Real voice conversations have interruptions. The user starts talking while the AI is still speaking. Your pipeline needs:

Voice activity detection (VAD) to detect when the user speaks
Cancellation of any in-flight LLM or TTS requests
Clearing of the audio playback queue
Starting a new pipeline cycle immediately

vadStream.on("speech_start", () => {
  // Cancel in-flight generation and synthesis
  currentLLMController?.abort();
  currentTTSController?.abort();
  audioPlayer.flush(); // clear queued audio
  startNewCycle();
});

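The AbortController pattern works well for canceling in-flight fetch requests to both LLM and TTS APIs: pass the controller's signal to each request, and calling abort() rejects the pending call with an AbortError.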
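A minimal sketch of that pattern follows. fakeRequest is a stand-in for a real LLM or TTS call so the example runs without a network, and startNewCycle here is a hypothetical version that takes the cycle's work as a function. A real request would pass the same signal to fetch(url, { signal }).

```javascript
// Stand-in for an LLM or TTS request: resolves after `ms`, rejects if aborted
function fakeRequest(ms, signal) {
  return new Promise((resolve, reject) => {
    const abortError = () => Object.assign(new Error("aborted"), { name: "AbortError" });
    if (signal.aborted) return reject(abortError());
    const timer = setTimeout(() => resolve("done"), ms);
    signal.addEventListener("abort", () => {
      clearTimeout(timer);
      reject(abortError());
    }, { once: true });
  });
}

let currentController = null;

// Start a cycle, cancelling whichever cycle is still in flight.
// Resolves to the cycle's result, or null if it was superseded.
async function startNewCycle(run) {
  currentController?.abort();
  const controller = new AbortController();
  currentController = controller;
  try {
    return await run(controller.signal);
  } catch (err) {
    if (err.name === "AbortError") return null; // cancelled by a newer cycle
    throw err;
  }
}
```

Treating AbortError as a normal outcome keeps interruptions out of your error logs, while any other failure still propagates.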

Choosing a TTS model for real-time

OpenAI’s tts-1 model targets lower latency at the cost of some audio quality. ElevenLabs has a “Flash” tier designed for streaming with sub-400ms first-chunk latency. For telephony, dedicated providers like Deepgram’s Aura or Amazon Polly Neural are optimized for the audio formats phone systems expect.

The best end-to-end latency currently achievable with cloud APIs: roughly 800ms to first audio. With self-hosted STT and a small LLM, you can push below 500ms. The pipeline architecture matters more than any individual component.