Speaker diarization: turning a transcript into 'who said what.'

What diarization is

A Whisper transcript is text with timestamps, per segment by default and per word on request. It has no notion of speakers: it doesn’t know how many people are in the recording or which words each of them said.

Diarization answers the question: “at time T, which speaker is talking?” The output is a set of time-stamped segments labeled with speaker IDs (Speaker 1, Speaker 2, etc.). Combined with a word-level Whisper transcript, you get a complete picture of who said what and when.

How diarization works

Diarization models use audio embeddings — numerical representations of voice characteristics — to identify clusters of similar voice segments. The process:

Split the audio into short segments (typically 1-3 seconds)
Compute a voice embedding for each segment
Cluster segments by embedding similarity
Each cluster becomes a “speaker”

Modern models do this well for 2-4 speakers in clean audio. Performance degrades with many speakers, overlapping speech, or noisy environments.
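
The clustering steps above can be illustrated with a toy sketch. It assumes the embeddings are already computed (the 2-D vectors below are made up; real speaker embeddings have hundreds of dimensions) and uses a simple greedy cosine-similarity clustering with an arbitrary threshold of 0.8. Libraries such as pyannote use learned embeddings and more robust clustering, so treat this as an illustration of the idea, not of any library's algorithm.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cluster_segments(embeddings, threshold=0.8):
    """Assign each segment to the most similar existing cluster if the
    similarity reaches the threshold; otherwise start a new "speaker"."""
    centroids = []   # running-mean embedding per cluster
    counts = []      # number of segments in each cluster
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(emb, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
        labels.append(f"SPEAKER_{best:02d}")
    return labels

# Made-up 2-D "voice embeddings" for five consecutive segments
toy_embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.95, 0.15), (0.2, 0.9)]
labels = cluster_segments(toy_embeddings)
print(labels)
```

Segments whose vectors point in roughly the same direction end up with the same label. The threshold controls how readily a new speaker is created, which is one reason the speaker count can come out wrong on noisy audio.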

pyannote.audio

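The open-source standard for diarization is pyannote.audio, a Python library whose pretrained pipelines are hosted on Hugging Face. The models are gated: accept the user conditions on the model pages and create a Hugging Face access token before running the code below.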

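from pyannote.audio import Pipeline

# Load the pretrained pipeline; the token is a Hugging Face access token
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your_hf_token"
)

# Run diarization on the whole file
diarization = pipeline("interview.mp3")

# Each track is a (time segment, track id, speaker label) triple
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
    # Example output:
    # 0.0s - 2.3s: SPEAKER_00
    # 2.3s - 5.1s: SPEAKER_01
    # 5.1s - 8.7s: SPEAKER_00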

The output is a series of time segments, each assigned to a speaker label.

Combining diarization with Whisper

The standard pipeline:

Transcribe with Whisper (verbose_json, word timestamps)
Diarize with pyannote
For each word timestamp, find which speaker segment it falls in
Assign the speaker label to the word

from openai import OpenAI

client = OpenAI()

# Step 1: Transcribe
with open("interview.mp3", "rb") as f:
    transcription = client.audio.transcriptions.create(
        file=f,
        model="whisper-1",
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )

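# Step 2: Diarize (run separately with pyannote, as shown earlier), then
# flatten the result into a list of {start, end, speaker} dicts
diarization_segments = [
    {'start': turn.start, 'end': turn.end, 'speaker': speaker}
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]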

# Step 3: Assign speakers to words
def get_speaker_at(time, diarization_segments):
    for seg in diarization_segments:
        if seg['start'] <= time <= seg['end']:
            return seg['speaker']
    return 'UNKNOWN'

labeled_words = []
for word in transcription.words:
    # Recent openai SDK versions return word objects (word.start), not dicts
    midpoint = (word.start + word.end) / 2
    speaker = get_speaker_at(midpoint, diarization_segments)
    labeled_words.append({
        'word': word.word,
        'start': word.start,
        'end': word.end,
        'speaker': speaker,
    })
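
A word's midpoint can land in a gap between diarization segments, in which case the midpoint lookup returns 'UNKNOWN'. One possible alternative, sketched below for the same {start, end, speaker} segment format, picks the segment with the largest time overlap and falls back to the nearest segment when nothing overlaps. The function name get_speaker_by_overlap and the sample segments are hypothetical.

```python
def get_speaker_by_overlap(word_start, word_end, diarization_segments):
    """Return the speaker whose segment overlaps the word the most.
    If no segment overlaps, fall back to the segment closest in time."""
    if not diarization_segments:
        return 'UNKNOWN'

    best_speaker, best_overlap = None, 0.0
    for seg in diarization_segments:
        overlap = min(word_end, seg['end']) - max(word_start, seg['start'])
        if overlap > best_overlap:
            best_speaker, best_overlap = seg['speaker'], overlap
    if best_speaker is not None:
        return best_speaker

    # No overlap: pick the segment with the smallest gap to the word
    def gap(seg):
        return max(seg['start'] - word_end, word_start - seg['end'])
    return min(diarization_segments, key=gap)['speaker']

segments = [
    {'start': 0.0, 'end': 2.3, 'speaker': 'SPEAKER_00'},
    {'start': 2.5, 'end': 5.1, 'speaker': 'SPEAKER_01'},
]
straddling = get_speaker_by_overlap(2.2, 2.7, segments)    # spans both segments
in_gap = get_speaker_by_overlap(2.35, 2.42, segments)      # falls between them
print(straddling, in_gap)
```

The trade-off is that a word in a long silence still gets a label, so an application that needs to surface uncertainty may prefer to keep 'UNKNOWN' beyond some maximum gap.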

Formatting the output

Group consecutive words by the same speaker into utterances:

def group_by_speaker(labeled_words):
    utterances = []
    current = None

    for word in labeled_words:
        if current is None or word['speaker'] != current['speaker']:
            if current:
                utterances.append(current)
            current = {
                'speaker': word['speaker'],
                'start': word['start'],
                'end': word['end'],
                'text': word['word'].strip(),
            }
        else:
            # Word entries may lack a leading space, so add one explicitly
            current['text'] += ' ' + word['word'].strip()
            current['end'] = word['end']

    if current:
        utterances.append(current)

    return utterances

Formatted for display, the utterances look like this:

[00:00 - 00:05] SPEAKER_00: "Welcome to the podcast, glad to have you here."
[00:05 - 00:12] SPEAKER_01: "Thanks for having me, excited to be here."
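
The utterance dicts returned by group_by_speaker hold raw seconds, so producing lines like the ones above takes a small formatting step. A minimal sketch follows; format_timestamp and format_utterances are hypothetical helper names, and the sample utterances are hand-written for illustration.

```python
def format_timestamp(seconds):
    """Convert seconds to MM:SS (minutes keep counting past 59 for long audio)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"

def format_utterances(utterances):
    """Render utterance dicts as '[MM:SS - MM:SS] SPEAKER: "text"' lines."""
    lines = []
    for u in utterances:
        span = f"[{format_timestamp(u['start'])} - {format_timestamp(u['end'])}]"
        lines.append(f'{span} {u["speaker"]}: "{u["text"]}"')
    return "\n".join(lines)

sample = [
    {'speaker': 'SPEAKER_00', 'start': 0.0, 'end': 5.2,
     'text': 'Welcome to the podcast'},
    {'speaker': 'SPEAKER_01', 'start': 5.4, 'end': 12.1,
     'text': 'Thanks for having me'},
]
formatted = format_utterances(sample)
print(formatted)
```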

API services

If running Python models isn’t an option, several APIs offer combined transcription + diarization:

AssemblyAI: speaker_labels: true parameter
Deepgram: diarize=true parameter
Gladia: transcription with diarization in one call

These services run both steps and return a combined response, avoiding the need to merge two outputs manually.

Limitations

Speaker diarization is not perfect. Overlapping speech is difficult: most models pick one speaker for overlapping segments. Speaker count must be estimated by the model or configured (pyannote accepts num_speakers, or min_speakers and max_speakers, when calling the pipeline). Very short utterances (under 1 second) may be misassigned. For production applications, consider letting users correct speaker labels manually.
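
Manual correction can be as simple as applying a user-supplied mapping over the utterances and pointing reviewers at the segments most likely to be wrong. The sketch below assumes the utterance dict format used earlier; relabel_speakers, flag_short_utterances, and the 1-second default are illustrative choices, not part of any library.

```python
def relabel_speakers(utterances, corrections):
    """Return new utterances with speaker labels replaced per `corrections`
    (e.g. {'SPEAKER_00': 'Host'}); unmapped labels are left unchanged."""
    return [
        {**u, 'speaker': corrections.get(u['speaker'], u['speaker'])}
        for u in utterances
    ]

def flag_short_utterances(utterances, min_duration=1.0):
    """Return indices of utterances shorter than `min_duration` seconds,
    which are good candidates to show a user for review."""
    return [
        i for i, u in enumerate(utterances)
        if u['end'] - u['start'] < min_duration
    ]

utterances = [
    {'speaker': 'SPEAKER_00', 'start': 0.0, 'end': 5.0, 'text': 'Welcome to the podcast'},
    {'speaker': 'SPEAKER_01', 'start': 5.0, 'end': 5.4, 'text': 'Thanks'},
    {'speaker': 'SPEAKER_00', 'start': 5.4, 'end': 9.0, 'text': 'Let us get started'},
]
to_review = flag_short_utterances(utterances)
named = relabel_speakers(utterances, {'SPEAKER_00': 'Host'})
print(to_review, named[0]['speaker'], named[1]['speaker'])
```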