pyannote

2026-04-10

audio, ai, open-source, speaker-diarization

pyannote is the piece Whisper is missing. Whisper gives you a transcript; pyannote tells you who said each word, and when.

pyannote.ai →

I built two small apps with it live: a video editor prototype and a podcast context explorer. Both run on pyannote's hosted API.

1. what it does

pyannote ships an open-source speaker diarization model and a hosted API on top of it. You send audio, and you get back the transcript plus a list of speaker turns: "speaker A from 0:00 to 0:12, speaker B from 0:12 to 0:28", and so on.
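To make the "speaker turns" idea concrete, here is a minimal sketch of how turns could be represented and matched against word timestamps from a transcript. It does not call pyannote at all: the `Turn` class, the helper names, and the sample data are made up for illustration, and the real API response will be shaped differently.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One speaker turn (hypothetical shape, not the pyannote response format)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def clock(seconds: float) -> str:
    """Format a time in seconds as m:ss."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def describe(turns: list[Turn]) -> list[str]:
    """Render turns the way the review quotes them."""
    return [f"speaker {t.speaker} from {clock(t.start)} to {clock(t.end)}" for t in turns]


def speaker_at(turns: list[Turn], start: float, end: float):
    """Return the speaker whose turn overlaps [start, end] the most, or None."""
    best, best_overlap = None, 0.0
    for t in turns:
        overlap = min(t.end, end) - max(t.start, start)
        if overlap > best_overlap:
            best, best_overlap = t.speaker, overlap
    return best


# Sample data: two turns, plus (word, start, end) triples such as a
# transcription model with word timestamps might produce.
turns = [Turn("A", 0.0, 12.0), Turn("B", 12.0, 28.0)]
words = [("hello", 0.4, 0.9), ("there", 11.8, 12.1), ("hi", 12.5, 12.8)]

labeled = [(word, speaker_at(turns, s, e)) for word, s, e in words]
```

The word "there" straddles the 0:12 boundary and goes to speaker A because more of it falls inside A's turn. Picking the largest overlap is only one reasonable rule; assigning by word midpoint would be another.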

There is also a real-time streaming API, a dashboard with playgrounds, and full documentation you can point an LLM at.

2. what worked

The API is genuinely clean. I pasted a YouTube URL into my app, it downloaded the audio, ran diarization, and came back with speaker-labeled segments. No tuning, no model hosting, no GPU.

The docs are good enough that Claude Code built both apps for me. I literally pointed it at the pyannote docs and said "build this"; I did not write a single line of code myself. That is the real tell on documentation quality.

The open-source version is right there if you ever want to move off the API. No lock-in.

The dashboard has playgrounds so non-coders can test the thing without writing anything.

3. what did not work

Diarization is not perfect at the edges. On my second app I noticed fuzzy segment boundaries in a few spots; nothing catastrophic, but something you would want to clean up with an LLM pass before shipping to end users.

That is really my only complaint, and it is not specific to pyannote; every diarization system has this.
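For a sense of what boundary cleanup can look like, here is a rough sketch of a deterministic pass that could run before (or instead of) an LLM pass. This is not what I used in the apps and it is not part of pyannote: the `smooth` function and both thresholds are assumptions chosen for illustration. It merges same-speaker segments separated by a tiny gap and folds very short blips into the segment before them.

```python
def smooth(segments, max_gap=0.3, min_len=0.5):
    """Tidy a list of (speaker, start, end) tuples sorted by start time.

    max_gap and min_len are in seconds and are illustrative defaults,
    not values recommended by any library.
    """
    out = []
    for speaker, start, end in segments:
        if out:
            prev_speaker, prev_start, prev_end = out[-1]
            # Same speaker resuming after a short pause: treat as one turn.
            same_speaker = speaker == prev_speaker and start - prev_end <= max_gap
            # A blip too short to be a real turn: fold it into the previous one.
            too_short = end - start < min_len
            if same_speaker or too_short:
                out[-1] = (prev_speaker, prev_start, max(prev_end, end))
                continue
        # First segment, or a genuine speaker change: keep it as its own turn.
        out.append((speaker, start, end))
    return out


raw = [
    ("A", 0.0, 10.0),
    ("B", 10.0, 10.2),   # 0.2 s blip, likely a boundary artifact
    ("A", 10.2, 20.0),
    ("B", 20.0, 30.0),
]
cleaned = smooth(raw)
```

With this sample, the 0.2 second "B" blip is absorbed and the two "A" segments collapse into one turn from 0 to 20 seconds. A very short first segment is kept as-is, since there is nothing before it to merge into. Real interjections ("yeah", "mm-hm") can be shorter than half a second, so the thresholds would need tuning per use case.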

4. verdict

pyannote is one of the cleanest building blocks I have touched in a while. If you are building anything around audio (podcast tools, meeting transcription, dubbing, video editing), this is the diarization layer to start with. Good model, good API, good docs, open-source escape hatch.

I am already planning a follow-up video where I pair it with a TTS model to try to build a real open-source Descript alternative.

Best for: anyone building audio or video products who needs to know who is speaking, not just what was said.
