
FFmpeg + Whisper: The Future of Media-Processing Startups

FFmpeg 8.0 “Huffman” (August 2025) merged native OpenAI Whisper support — automatic speech recognition baked directly into the world’s most ubiquitous media tool. One command now transcodes, transcribes, and subtitles a video. No Python scripts, no API calls, no separate Whisper install. This changes the economics of an entire category of startups.

The core question: When transcription becomes a single FFmpeg flag, what happens to the $5.4B speech-to-text API market, the $6.7B captioning market, and the dozens of startups built on Whisper? Who dies, who thrives, and what should you build?

Companion to: FFmpeg as a Service Market Analysis (Feb 23, 2026).



1. What FFmpeg 8.0 Actually Changed

The Whisper Filter

FFmpeg 8.0 added a native whisper audio filter powered by whisper.cpp. Build with --enable-whisper, point at a model file, and FFmpeg does speech recognition inline — as part of the same pipeline that decodes, filters, and encodes your media.

One command to transcribe a video to SRT subtitles:

ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-base.bin:language=auto:queue=10:destination=output.srt:format=srt" -f null -

Filter Options

| Option | Default | What it does |
| --- | --- | --- |
| model | (required) | Path to whisper.cpp GGML model file |
| language | auto | Language code or auto-detect |
| queue | 3 (seconds) | Audio chunk buffer. Higher = better accuracy, more latency |
| use_gpu | true | GPU acceleration (CUDA, Metal, Vulkan) |
| gpu_device | 0 | GPU device index |
| format | text | Output format: text, srt, or json |
| destination | (none) | File path or URL (supports AVIO protocols — can POST JSON to HTTP endpoints) |
| vad_model | (none) | Silero VAD model for voice activity detection |
| vad_threshold | 0.5 | Speech detection sensitivity |
| vad_min_speech_duration | 0.1s | Minimum speaking duration |
| vad_min_silence_duration | 0.5s | Minimum silence between segments |
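The options compose into a single colon-separated filter string. A minimal Python sketch of assembling one programmatically — the `whisper_filter` helper and the model/VAD filenames are illustrative, not part of FFmpeg:

```python
def whisper_filter(model, destination, fmt="srt", language="auto",
                   queue=10, vad_model=None, vad_threshold=0.5):
    # Assemble the FFmpeg 8.0 `whisper` audio-filter string from the
    # options in the table above; omitted options fall back to FFmpeg's
    # own defaults.
    opts = {"model": model, "language": language, "queue": queue,
            "destination": destination, "format": fmt}
    if vad_model:  # optional Silero VAD pre-pass to skip silence
        opts["vad_model"] = vad_model
        opts["vad_threshold"] = vad_threshold
    return "whisper=" + ":".join(f"{k}={v}" for k, v in opts.items())


def transcribe_cmd(video, filt):
    # -vn drops the video stream; -f null discards the decoded audio once
    # the filter has written the transcript to `destination`.
    return ["ffmpeg", "-i", video, "-vn", "-af", filt, "-f", "null", "-"]


filt = whisper_filter("ggml-base.bin", "out.srt",
                      vad_model="silero-vad.bin")  # hypothetical filenames
print(" ".join(transcribe_cmd("input.mp4", filt)))
```

Building the argument list (rather than one shell string) avoids quoting bugs when paths contain spaces.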

What Else Shipped in 8.0

Vulkan compute codecs
Pure Vulkan 1.3 compute-shader FFv1 encode/decode and ProRes RAW decode, plus H.264 and HEVC Vulkan encoding. For the first time, a fully GPU-based decode→filter→encode pipeline runs without CPU round-trips.
VVC (H.266) support
VA-API decoding, Matroska container support, Screen Content Coding (IBC, Palette, ACT). Next-gen codec for 50% bitrate savings over HEVC.
Animated JPEG XL encoding
Via libjxl. Potential GIF/WebP killer for the web.
FLV v2 multitrack
Modern codec support in Flash Video containers. Relevant for live streaming.

Why This Matters for Startups

Before 8.0, transcription and media processing were separate concerns requiring separate tools, separate infrastructure, and separate API calls. A typical pipeline:

  1. FFmpeg: decode video, extract audio
  2. Upload audio to Deepgram/AssemblyAI/Whisper API ($0.006/min)
  3. Wait for transcription response
  4. FFmpeg: burn subtitles into video, re-encode

After 8.0, steps 1–4 collapse into one command. No network round-trips. No API costs. No separate billing. The marginal cost of transcription drops to the electricity needed to run inference on your GPU.
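The collapsed pipeline can be sketched as two local FFmpeg invocations: transcription happens in-process, and a second encode pass burns in the finished SRT (the `subtitles` filter needs a complete subtitle file, so burn-in is still a separate step even though nothing leaves the machine). The `local_pipeline` helper and filenames are illustrative:

```python
def local_pipeline(video, model="ggml-base.bin", srt="subs.srt",
                   out="subtitled.mp4"):
    # Step 1: decode audio and transcribe inline via the whisper filter;
    # the SRT is written by the filter itself, audio output is discarded.
    transcribe = [
        "ffmpeg", "-i", video, "-vn",
        "-af", f"whisper=model={model}:destination={srt}:format=srt",
        "-f", "null", "-",
    ]
    # Step 2: burn the generated subtitles into the video and re-encode.
    burn_in = ["ffmpeg", "-i", video, "-vf", f"subtitles={srt}", out]
    return [transcribe, burn_in]
```

Compare with the pre-8.0 flow: the upload, the API call, the polling loop, and the per-minute bill are all simply gone.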

Whisper Model Performance

Model size vs. accuracy vs. speed
| Model | Parameters | VRAM | Speed (vs. real-time) | WER (clean audio) |
| --- | --- | --- | --- | --- |
| Tiny | 39M | ~1GB | ~32x | ~7.7% |
| Base | 74M | ~1GB | ~16x | ~5.0% |
| Small | 244M | ~2GB | ~6x | ~3.4% |
| Medium | 769M | ~5GB | ~2x | ~2.9% |
| Large-v3 | 1.55B | ~10GB | ~1x | ~2.7% |
| Large-v3 Turbo | 809M | ~6GB | ~6x (216x with optimizations) | ~2.9% |

WER = Word Error Rate on LibriSpeech clean. Human-level WER: 4–6.8%. Real-world audio (background noise, accents, call centers) increases WER significantly — up to 17.7% for low-quality recordings.
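Model choice is a three-way tradeoff between VRAM, accuracy, and speed. A small selection helper using the figures from the table above (real throughput varies by hardware; `pick_model` is a hypothetical name, not a whisper.cpp API):

```python
# (name, vram_gb, speed_x, wer_pct) -- figures from the table above.
MODELS = [
    ("tiny",           1,  32, 7.7),
    ("base",           1,  16, 5.0),
    ("small",          2,   6, 3.4),
    ("medium",         5,   2, 2.9),
    ("large-v3",      10,   1, 2.7),
    ("large-v3-turbo", 6,   6, 2.9),
]


def pick_model(vram_gb, max_wer=5.0):
    # Fastest model that both fits in VRAM and meets the accuracy bar.
    fits = [m for m in MODELS if m[1] <= vram_gb and m[3] <= max_wer]
    if not fits:
        raise ValueError("no model satisfies both constraints")
    return max(fits, key=lambda m: m[2])[0]
```

For example, an 8GB card that needs sub-3% WER lands on Turbo, which is why it is the default recommendation for hosted services.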


2. The Markets Being Disrupted

Three overlapping markets
| Market | Size (2025–2026) | Growth | How FFmpeg+Whisper disrupts it |
| --- | --- | --- | --- |
| Speech-to-text APIs | $2.2B (2021) → $5.4B (2026 projected) | ~20% CAGR | If transcription is a free FFmpeg flag, why pay $0.006/min for an API? |
| Captioning & subtitling | $6.7B (2025) | ~8.2% CAGR to $11.5B (2032) | Automated captioning becomes a command-line operation, not a SaaS subscription. |
| Video processing / transcoding | Multi-billion (part of $85B video streaming market) | ~15% CAGR | Processing + transcription in one pipeline eliminates the need for separate services. |

The disruption pattern: Every time a capability moves from “requires a service” to “built into the tool you already use,” the market restructures. FFmpeg+Whisper does to transcription APIs what ImageMagick did to image processing services, what Let’s Encrypt did to SSL certificate vendors, and what Docker did to VM hosting.

But: The market doesn’t disappear. Let’s Encrypt didn’t kill certificate vendors — it killed basic certificate vendors and expanded the overall SSL market. FFmpeg+Whisper will kill basic transcription use cases (subtitle a video, transcribe a podcast) and create new demand for intelligent media processing that goes beyond what a single command can do.


3. The Incumbents & Who’s Threatened

Speech-to-Text API Providers

| Company | Funding | Revenue | Pricing | Threat level |
| --- | --- | --- | --- | --- |
| Deepgram | $250M total ($130M Series C, Jan 2026, $1.3B valuation) | Cash-flow positive 2024. 400+ enterprise customers. 200K developers. | $0.0065–$0.0077/min (Nova-3) | Medium. Voice Agent API ($4.50/hr) and real-time streaming are safe. Batch transcription is threatened. |
| AssemblyAI | $158.1M total | Not disclosed | $0.0025/min + add-ons (speaker ID, sentiment) | Medium. Audio intelligence (sentiment, topic, entity extraction) goes beyond raw transcription. |
| Rev.ai | Part of Rev.com ($400M+ raised) | Not disclosed | $0.002/min (standard), $1.99/min (human) | High. Their cheapest automated tier competes directly with free FFmpeg. |
| OpenAI Whisper API | Part of OpenAI | Part of OpenAI’s $13.8B revenue | $0.006/min flat | High. Same model, but their API version adds nothing over the local filter. |
| Google Cloud STT | Part of Google | Part of GCP | $0.006–$0.024/min | Low. Enterprise lock-in, compliance, and multi-language support create stickiness. |
| AWS Transcribe | Part of AWS | Part of AWS | $0.024/min (standard), $0.048/min (medical) | Low. Same reasons as Google. Plus medical/legal transcription is regulated. |

Captioning & Video Tools

| Company | Revenue / Traction | Funding | Threat level |
| --- | --- | --- | --- |
| Captions (now Mirage) | 3M+ videos/month, 10M+ creators | $100M raised, $500M valuation | Low. The value is the mobile-first creative UX, not raw transcription. |
| Descript | $50–$100M revenue | $101M (Series C from OpenAI) | Low. Text-based video editing is the moat, not transcription. |
| Opus Clip | ~$20M ARR, $215M valuation | $50M ($20M from SoftBank) | Medium. AI clip selection is the value, but transcription is a dependency. |
| Submagic | $8M ARR, 13 employees, bootstrapped | $0 raised | High. Animated captions are the value, but basic subtitling is commoditized. |
| Kapwing | $10.4M revenue (2024) | $12.7M | Medium. Browser-based editor moat, but auto-captioning is a key feature. |

FFmpeg-as-a-Service Players

| Company | Positioning | Threat level |
| --- | --- | --- |
| Rendi | FFmpeg API wrapper. Already blogged about FFmpeg 8.0 Whisper. | Opportunity — they can add transcription to their API at zero marginal cost. |
| Mux | “Stripe of video.” Full video infrastructure. | Opportunity — can integrate Whisper into their encoding pipeline. |
| Cloudflare Stream | Video hosting + delivery on Cloudflare’s edge. | Opportunity — auto-subtitling on upload becomes a checkbox feature. |

Summary: Companies whose core value is raw transcription accuracy are threatened. Companies whose value is UX, intelligence, or workflow (Descript, Captions/Mirage, Opus Clip) are safe. Companies that can add Whisper to their existing pipeline (Rendi, Mux, Cloudflare) get a free feature upgrade.


4. 10 Startup Opportunities

FFmpeg+Whisper doesn’t kill opportunities — it creates them by making the base layer free. Here’s what to build on top.

1. Smart Subtitle API (Transcription + Intelligence)

FFmpeg+Whisper gives you raw subtitles. They’re inaccurate on noisy audio, have no speaker identification, no punctuation correction, no profanity filtering, no keyword extraction, and no translation. Build an API that takes raw Whisper output and makes it usable: LLM-powered cleanup, speaker diarization (who said what), auto-translation to 50+ languages, branded subtitle templates, and SEO-optimized transcript formatting.

Pricing: $0.002/min (raw) to $0.01/min (full intelligence). Undercut Deepgram/AssemblyAI by 50% while offering more features. TAM: $500M+ (subset of captioning market).

2. Video Search Engine Infrastructure

Every video on the internet can now be transcribed for free. What if you could search inside every video? Build infrastructure that ingests video, transcribes with FFmpeg+Whisper, indexes the transcript with timestamps, and serves a search API. “Find every moment in this 500-hour video library where someone says ‘refund policy’” — and return a direct link to that second in the video.

Customers: E-learning platforms, corporate training, legal discovery, media archives, podcast networks. Pricing: $0.10–$0.50 per hour of indexed video + $29–$99/mo for search API access.
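The core of opportunity #2 is an inverted index over timestamped cues. A minimal sketch (function names are illustrative; a production system would use a real search engine, but the data model is exactly this):

```python
import re


def parse_srt(srt_text):
    # Yield (start_seconds, text) for each SRT cue.
    pat = re.compile(
        r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3}) --> [^\n]+\n(.*?)(?:\n\n|\Z)",
        re.S)
    for h, m, s, ms, text in pat.findall(srt_text):
        secs = int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
        yield secs, " ".join(text.split())


def build_index(videos):
    # videos: {video_id: srt_text} -> flat list of searchable cues.
    index = []
    for vid, srt in videos.items():
        for secs, text in parse_srt(srt):
            index.append((vid, secs, text.lower()))
    return index


def search(index, phrase):
    # Every (video_id, start_seconds) whose cue contains the phrase --
    # each hit is a direct deep link into the video.
    phrase = phrase.lower()
    return [(vid, secs) for vid, secs, text in index if phrase in text]
```

The “refund policy” query from the pitch above becomes `search(index, "refund policy")`, returning second-level timestamps across the whole library.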

3. Compliance Captioning for Regulated Industries

59% of organizations in media and education prioritize captioning for legal accessibility requirements. ADA, Section 508, WCAG 2.1, EAA (European Accessibility Act, June 2025), AODA (Ontario). FFmpeg+Whisper generates subtitles, but it doesn’t guarantee compliance. Build a platform that takes raw FFmpeg+Whisper output and adds: accuracy verification (99%+ required for ADA), formatting compliance (character limits, reading speed, positioning), human review routing for edge cases, and compliance certification with audit trails.

Pricing: $0.50–$2.00/min (10–100x the raw transcription cost, justified by regulatory requirements). TAM: $1B+ (accessibility compliance market).

4. Podcast Intelligence Platform

Transcribe every podcast episode (free with FFmpeg+Whisper), then build intelligence on top: topic extraction, guest identification, key quote extraction, episode summaries, show notes generation, cross-episode knowledge graphs, and “what did [guest] say about [topic] across all episodes?”

Customers: Podcast networks, PR/media monitoring firms, researchers, podcast hosts who want to repurpose content. Pricing: $29–$149/mo per podcast. See also: Podcast Monitoring Analysis.

5. Real-Time Subtitling for Live Streams

FFmpeg+Whisper supports live audio, but with significant latency (3–30 second chunks). Build a specialized low-latency pipeline: optimized chunk sizes, overlapping windows for continuity, client-side rendering for instant display, WebSocket delivery, and multi-language real-time translation. Target: Twitch, YouTube Live, webinars, conferences, church services, town halls.

Pricing: $0.10–$0.50/min of live stream. Differentiation: Sub-2-second latency with context-aware accuracy. This is hard — the 3-second default chunk creates word boundary errors that require overlapping buffers and N-best decoding.
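The overlapping-buffer trick reduces to stitching consecutive chunk transcripts by trimming the words the overlap re-transcribed. A simplified word-level sketch (real systems align on timestamps and N-best hypotheses; `merge_chunks` is a hypothetical helper):

```python
def merge_chunks(prev_words, new_words, max_overlap=8):
    # Overlapping windows re-transcribe the tail of the previous chunk.
    # Find the longest suffix of prev_words that is also a prefix of
    # new_words, and drop the duplicate before concatenating.
    for n in range(min(max_overlap, len(prev_words), len(new_words)), 0, -1):
        if prev_words[-n:] == new_words[:n]:
            return prev_words + new_words[n:]
    return prev_words + new_words
```

Exact word matching is a simplification: in practice the two passes can decode the boundary differently, which is why timestamp-based alignment earns its complexity.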

6. Media Asset Management with Built-In Search

DAMs (Digital Asset Managers) handle images and documents well, but video is a black box — you can’t search inside it. Build a DAM or DAM plugin that auto-transcribes every video on upload (free with FFmpeg+Whisper), indexes the transcript, and makes every moment searchable. “Find all footage where someone mentions ‘product launch’” across 10TB of corporate video.

Customers: Marketing teams, news organizations, film studios, corporate communications. Pricing: $49–$499/mo based on storage volume.

7. Meeting Intelligence (Post-Hoc)

Fireflies, Otter.ai, and Grain capture meetings live. But most meetings aren’t recorded by these tools — they’re recorded by Zoom/Teams/Meet natively, dumped to cloud storage, and never transcribed. Build a service that watches your meeting recording storage (Google Drive, OneDrive, S3), auto-transcribes new recordings with FFmpeg+Whisper, extracts action items with an LLM, and sends daily digests. No live integration required — just process the files that already exist.

Pricing: $19–$49/mo per user. Advantage over incumbents: No meeting bot, no live recording permissions, works retroactively on years of existing recordings.
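The “watch the storage, process what exists” loop is deliberately simple. A polling sketch against a local folder (S3/Drive would swap in their list APIs; `scan_for_new` is an illustrative name, and production would persist `seen` with checksums and status in a database):

```python
from pathlib import Path

MEDIA = {".mp4", ".mkv", ".m4a", ".mp3", ".wav"}


def scan_for_new(folder, seen):
    # One polling pass: return recordings not yet transcribed and mark
    # them as seen. Each hit is handed to the FFmpeg+Whisper worker queue.
    new = []
    for p in sorted(Path(folder).rglob("*")):
        if p.suffix.lower() in MEDIA and str(p) not in seen:
            seen.add(str(p))
            new.append(p)
    return new
```

Because the loop only reads files that already exist, it works retroactively on years of recordings, which is the whole advantage over meeting bots.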

8. Content Repurposing Pipeline

One video in, 15 assets out. FFmpeg+Whisper transcribes, then your pipeline generates: blog post, Twitter thread, LinkedIn post, newsletter section, short-form clips with burned-in captions (multiple aspect ratios), audiogram, pull quotes, show notes, and SEO metadata. All automated. All from one FFmpeg pipeline.

Customers: Content creators, marketing teams, agencies. Pricing: $29–$99/mo or $2–$5 per video processed. Submagic ($8M ARR, bootstrapped, 13 employees) proves this market is real and profitable.

9. Multilingual Subtitle Factory

FFmpeg+Whisper transcribes in 99 languages. Combine with an LLM translation layer to produce subtitles in 50+ languages from a single video. Target: YouTube creators who want to reach global audiences, e-learning platforms, SaaS companies with product videos, and media companies with international distribution.

Pricing: $0.05–$0.20 per minute per language. 10 languages × 60 minutes = $30–$120 per video. The $1.42B translation market meets the $6.7B captioning market.

10. Self-Hosted Transcription Appliance

For organizations that cannot send audio to the cloud (law firms, healthcare, government, defense): a pre-configured server or VM with FFmpeg 8.0 + Whisper + GPU, a web UI for uploading and managing transcriptions, and an API for integration. Ship as a Docker container, an OVA image, or a physical appliance. No data leaves the premises.

Pricing: $500–$5,000/year license + support. Or $2,000–$10,000 for a physical appliance. Market: Every organization that currently pays $0.048/min for AWS Transcribe Medical because “HIPAA.”


5. How to Differentiate When the Core Is Free

When the raw capability (transcription) is free and open-source, differentiation moves up the stack. Here’s the hierarchy:

Differentiation stack (bottom = commodity, top = defensible)
| Layer | Example | Defensibility | Margin |
| --- | --- | --- | --- |
| Raw transcription | FFmpeg+Whisper, OpenAI Whisper API | None — free and commoditized | 0% (free) to 5% |
| Accuracy & speed | Deepgram Nova-3, faster-whisper | Low — open-source models catch up fast | 10–30% |
| Intelligence | Speaker diarization, sentiment, topics, entities | Medium — requires pipeline engineering | 40–60% |
| Workflow | Auto-repurposing, compliance, search indexing | High — domain-specific integrations | 60–80% |
| UX / Creative tools | Descript, Captions/Mirage, Kapwing | Very high — brand + habit + creative output | 70–90% |
| Compliance & trust | HIPAA, SOC 2, ADA certification, audit trails | Very high — regulatory moat | 80–95% |

Differentiation Strategies That Work

1. Vertical depth over horizontal breadth
Don’t build “transcription for everyone.” Build “transcription for court reporters” or “transcription for podcast networks” or “transcription for surveillance footage.” Each vertical has specific accuracy requirements, output formats, compliance needs, and workflow integrations that justify 10–100x the price of raw transcription.
2. Pipeline, not endpoint
Don’t sell transcription. Sell the outcome. “Upload a video, get 15 social media assets” is worth $5/video. “Upload a video, get a transcript” is worth $0 (FFmpeg does it free). The more steps between input and valuable output, the harder to replicate.
3. Data flywheel
Every transcript you process is training data for better accuracy. Build custom models for specific domains (medical terminology, legal jargon, industry slang). Whisper’s WER on call center audio is 17.7% — a domain-specific model that drops this to 5% is worth real money.
4. On-premise / air-gapped
Most transcription providers are cloud-only. Many organizations (legal, healthcare, government, defense) cannot send audio to the cloud. FFmpeg+Whisper runs locally, but packaging it into an enterprise-ready appliance with a UI, API, monitoring, and support is a $5K–$50K/year product.
5. Integration depth
A Zoom plugin that auto-transcribes recordings. A Shopify app that auto-subtitles product videos. A WordPress plugin that generates searchable transcripts for embedded videos. A Slack bot that transcribes voice messages. Each integration is narrow but sticky.

6. Tactics to Get Early Customers

Week 1–2: Build the Proof

  1. Ship a working demo in 48 hours. FFmpeg+Whisper is the backend. Build a minimal web UI: upload video, get subtitled video + transcript + SRT file back. Deploy on a $50/mo GPU server (Hetzner, Vast.ai, RunPod). This is your proof of concept.
  2. Process 100 videos from potential customers for free. Find podcast hosts, YouTube creators, and course creators on Twitter/Reddit. DM: “I built a tool that auto-subtitles videos. Can I process 5 of yours for free? I want feedback.” The output is your portfolio.
  3. Write the blog post: “I Built a $0/Month Subtitle Generator with FFmpeg 8.0 + Whisper.” Post on Hacker News. This is the content-market-fit play — developers will read it, share it, and some will want the hosted version.

Week 3–4: Find Product-Market Fit

  1. Target the Submagic/Opus Clip audience. These users already pay $15–$49/mo for AI subtitles. Search Twitter for “submagic alternative” and “opus clip expensive.” These are people actively looking for cheaper options.
  2. Join 10 niche communities. Podcast subreddits, YouTube creator Discords, e-learning Slack groups, content marketing communities. Don’t pitch — answer questions about subtitling and transcription. Drop your tool when relevant. The Submagic founder built $1M ARR in 90 days using this exact community-first approach.
  3. Launch an affiliate program immediately. Submagic gets $1.6M/year from affiliates paying 30% lifetime commissions to 10,000+ partners. Affiliates are your sales force. YouTube reviewers, blog writers, and comparison site authors will promote you for recurring commissions.

Month 2–3: Scale

  1. SEO attack: own every “FFmpeg subtitle” query. Write guides for every combination: “FFmpeg + Whisper + SRT,” “FFmpeg auto-caption,” “FFmpeg 8.0 transcription tutorial.” Rank for the problem, convert to your hosted solution. These queries are exploding post-8.0 and have near-zero competition.
  2. Build the Zapier/Make/n8n integration. “When a new video is uploaded to Google Drive, auto-transcribe and add subtitles.” No-code users can’t run FFmpeg locally — they’ll pay for a managed version. List on all three marketplaces.
  3. Cold email 100 podcast production companies. They process 10–50 episodes/week. Per-episode pricing ($2–$5) looks tiny compared to their existing transcription costs ($0.50–$2/min × 60 min = $30–$120/episode). You’re 90% cheaper.
  4. Ship a Shopify app. “Auto-subtitle your product videos.” E-commerce stores have thousands of product videos without captions. Accessibility compliance is becoming legally required. Charge $9.99/mo for 50 videos, $29.99 for unlimited.

The Submagic Playbook (Proven: $0 → $8M ARR)

What Submagic did — and you should copy
| Tactic | Result |
| --- | --- |
| 30% lifetime affiliate commissions | 10,000+ affiliates, $1.6M/yr affiliate revenue |
| Community-first launch (no ads) | $1M ARR in 90 days |
| 13-person team, no VC funding | $615K revenue per employee |
| YouTube creator influencer partnerships | Viral demo videos drive organic signups |
| Free tier with watermark | Users upgrade when they’re hooked on the workflow |

7. Unit Economics & Pricing

Cost Structure (Self-Hosted FFmpeg+Whisper)

| Component | Cost | Notes |
| --- | --- | --- |
| GPU server (Hetzner GEX44) | ~$65/mo | RTX 4070, 64GB RAM. Handles ~20 concurrent transcriptions. |
| GPU cloud (RunPod H100) | $2.49/hr | For burst capacity. Large-v3 Turbo: 60 min transcribed in ~17 seconds. |
| Storage (S3/R2) | $0.015/GB/mo | Cloudflare R2 has no egress fees. |
| Bandwidth | $0 (R2) to $0.09/GB (AWS) | Video files are large. Egress costs can dominate. |
| LLM post-processing | $0.001–$0.01/min | For cleanup, speaker ID, summarization. Depends on model. |

Per-Minute Unit Economics

What it costs you vs. what you charge
| Service tier | Your cost/min | Your price/min | Gross margin |
| --- | --- | --- | --- |
| Raw transcription (SRT) | ~$0.0005 | $0.002 | 75% |
| Smart subtitles (cleanup + diarization) | ~$0.003 | $0.01 | 70% |
| Full pipeline (transcription + translation + burn-in) | ~$0.008 | $0.03 | 73% |
| Compliance-grade (human review + certification) | ~$0.15 | $0.50–$2.00 | 70–92% |

Key insight: Raw transcription is nearly free. The money is in everything around it: cleanup, diarization, translation, formatting, compliance, and workflow integration. Margins increase as you move up the intelligence stack.
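The margin column follows directly from cost and price. A quick sanity check of the table's figures (compliance uses the low end of its price range):

```python
def gross_margin(cost_per_min, price_per_min):
    # Gross margin as a whole-number percentage of price.
    return round(100 * (price_per_min - cost_per_min) / price_per_min)


tiers = {  # (your cost/min, your price/min), from the table above
    "raw":        (0.0005, 0.002),
    "smart":      (0.003,  0.01),
    "full":       (0.008,  0.03),
    "compliance": (0.15,   0.50),
}
margins = {name: gross_margin(c, p) for name, (c, p) in tiers.items()}
```

Note how the compliance tier keeps a 70%+ margin despite a 300x higher cost base: the price rises faster than the cost.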

Pricing Models That Work

Per-minute (API customers)
Deepgram model. $0.002–$0.05/min depending on features. Developers understand this. Free tier: 100 minutes/month.
Per-video (creator customers)
Submagic model. $2–$5 per video. Or subscription: $15–$49/mo for X videos/month. Creators don’t think in minutes; they think in videos.
Per-seat (enterprise customers)
Otter.ai model. $19–$49/user/month. Includes meeting transcription, search, summarization. Enterprise upsell to custom deployment at $500–$5,000/mo.
One-time (appliance customers)
$2,000–$10,000 for a pre-configured self-hosted instance. Annual support contract at 20% of purchase price. This is the Red Hat model for transcription.

8. Technical Architecture & Moats

The Minimal Viable Pipeline

  1. Ingestion: Accept video via upload, URL, or S3/R2 reference. Store in R2 (no egress).
  2. Processing: FFmpeg 8.0 with --enable-whisper. GPU-accelerated (CUDA/Metal). Whisper Large-v3 Turbo for best speed/accuracy tradeoff.
  3. Post-processing: LLM pass for punctuation, capitalization, speaker labels, profanity filtering. Optional: translation via LLM or dedicated translation model.
  4. Output: SRT, VTT, TXT, JSON with timestamps. Optional: burn subtitles into video, generate clips, create transcript page.
  5. Delivery: Webhook or polling. Stream results for long files.
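Step 4's output formats are mostly mechanical transforms of the SRT the whisper filter already emits. Two illustrative converters (hypothetical helper names; WebVTT is essentially SRT with a header and dots in the timestamps):

```python
import json
import re


def srt_to_vtt(srt_text):
    # WebVTT: prepend the header and swap the comma in timestamps for a dot.
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + body


def srt_to_json(srt_text):
    # Timestamped JSON cues for downstream search/indexing consumers.
    cues = []
    for block in re.split(r"\n\n+", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) >= 3 and "-->" in lines[1]:
            start, _, end = lines[1].partition(" --> ")
            cues.append({"start": start, "end": end,
                         "text": " ".join(lines[2:])})
    return json.dumps(cues)
```

Keeping SRT as the canonical intermediate means every other format is a cheap derivation rather than a second transcription pass.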

Technical Moats to Build

Custom domain models
Fine-tune Whisper on domain-specific audio (medical dictation, legal proceedings, financial earnings calls). A model that transcribes “amoxicillin” correctly every time is worth $2/min to a medical transcription company. Whisper Large-v3 gets it right ~80% of the time.
Speaker diarization pipeline
Whisper doesn’t identify speakers. Integrate pyannote.audio or equivalent for “who said what” labeling. This is the #1 feature request from enterprise customers and the #1 feature missing from FFmpeg+Whisper.
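The alignment step itself is simple once you have both outputs: assign each Whisper segment the diarizer speaker whose turn overlaps it most. A sketch with illustrative function names (the interval logic is standard; pyannote.audio's actual output objects differ):

```python
def overlap(a, b):
    # Seconds of overlap between two (start, end) intervals.
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def label_speakers(segments, turns):
    # segments: [(start, end, text)] from Whisper.
    # turns: [(start, end, speaker)] from a diarizer such as pyannote.audio.
    # Each segment gets the speaker whose turn overlaps it the most.
    labeled = []
    for s0, s1, text in segments:
        best = max(turns, key=lambda t: overlap((s0, s1), (t[0], t[1])),
                   default=(0.0, 0.0, "unknown"))
        labeled.append((best[2], text))
    return labeled
```

Maximum-overlap assignment handles the common case where segment and turn boundaries disagree by a few hundred milliseconds.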
Streaming architecture
Process files without waiting for complete upload. Start transcribing as bytes arrive. Return partial results via WebSocket. This is technically hard (FFmpeg expects seekable input for many formats) but dramatically improves UX for large files.
Multi-model orchestration
Tiny model for VAD and language detection. Large model for final transcription. Translation model for subtitle localization. LLM for cleanup and summarization. Each step runs on the optimal hardware. This pipeline is hard to replicate as a one-off.
Caching & deduplication
Audio fingerprinting. If the same audio has been transcribed before (common for reposts, syndicated content, podcast mirrors), return cached results instantly. Zero compute cost on repeat transcriptions. Builds a growing competitive advantage.
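A content-addressed cache captures the idea. This sketch hashes raw bytes for brevity; a real system would fingerprint normalized PCM (e.g. 16 kHz mono via `ffmpeg -i in.mp4 -ac 1 -ar 16000 -f s16le -`) so the same audio in different containers still hits, and the class name is illustrative:

```python
import hashlib


class TranscriptCache:
    # Key transcripts on a hash of the audio content, not the filename,
    # so reposts and mirrors of identical audio cost zero compute.
    def __init__(self):
        self._store = {}

    def key(self, audio_bytes):
        return hashlib.sha256(audio_bytes).hexdigest()

    def get_or_transcribe(self, audio_bytes, transcribe):
        k = self.key(audio_bytes)
        if k not in self._store:  # cache miss: pay for inference once
            self._store[k] = transcribe(audio_bytes)
        return self._store[k]
```

Every repeat transcription served from cache widens the cost gap against competitors who re-run inference.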

9. Risks & What Could Go Wrong

Risk 1: Mux/Cloudflare adds Whisper as a checkbox (60% probability)
The most likely disruption. Mux already has an MCP server and AI workflow features. Adding “auto-subtitle on upload” is a weekend project for them. Mitigation: Build above the transcription layer. Intelligence, compliance, and workflow are defensible. Raw subtitling is not.
Risk 2: Whisper gets replaced by a better open-source model (40% probability)
Whisper is 2+ years old. Competitors: Moonshine, Canary, Universal-2 (AssemblyAI), Nova-3 (Deepgram). If a dramatically better open-source STT model emerges, the FFmpeg filter could be extended or forked. Mitigation: Don’t couple to Whisper specifically. Build a pipeline that’s model-agnostic. The FFmpeg filter is the interface; the model is swappable.
Risk 3: GPU costs stay high (30% probability)
Whisper Large-v3 needs ~10GB VRAM. If you need to process 1,000 hours/day, GPU costs dominate. Mitigation: Use Turbo model (6GB VRAM, 216x real-time). Or use Tiny/Base for low-quality audio where accuracy doesn’t matter. Or negotiate spot GPU pricing on RunPod/Vast.ai.
Risk 4: Accuracy ceiling (35% probability)
Whisper’s WER on clean audio (2.7%) is excellent. On real-world audio (accents, background noise, multiple speakers, cross-talk), it degrades to 10–17.7%. If customers need 99%+ accuracy (legal, medical, compliance), you need human review in the loop, which destroys your unit economics. Mitigation: Be transparent about accuracy limitations. Offer tiered service: automated (95–98%), AI-assisted (98–99%), human-verified (99.5%+) at different price points.
Risk 5: “Just run it yourself” (50% probability)
FFmpeg+Whisper is free. Any developer with a GPU can do what you do. The “Cloud Run tutorial” competitor is always one blog post away. Mitigation: Target non-developers (creators, marketing teams, agencies). They will never run FFmpeg. Or target enterprises who want SLAs, compliance, and support — not a DIY tutorial.

10. The Bootstrapper’s Playbook

If You Have $0 and a Weekend

  1. Spin up a $50/mo GPU server (Hetzner GEX44 or RunPod community cloud).
  2. Install FFmpeg 8.0 with --enable-whisper. Download the Large-v3 Turbo model.
  3. Build a drag-and-drop web UI with file upload + transcript download.
  4. Process 50 videos for free for Twitter/Reddit content creators. Get testimonials.
  5. Write “How I Built a Free Subtitle Generator with FFmpeg 8.0” and post on HN.
  6. Charge $2/video or $15/mo for 20 videos. You’re in business.

If You Have $5K and 3 Months

  1. Pick one vertical: podcasts, e-learning, e-commerce product videos, or compliance captioning.
  2. Build the full pipeline: transcription + LLM cleanup + speaker diarization + output formatting.
  3. Add the intelligence layer that justifies premium pricing (topic extraction, key quotes, auto-chapters for podcasts; ADA compliance verification for e-learning).
  4. Launch with a 30% lifetime affiliate program. Recruit 50 YouTube creators in your niche to review the product. The Submagic playbook: affiliates are your sales force.
  5. SEO blitz: write 30 blog posts targeting every “FFmpeg + [your vertical]” query. These keywords are exploding post-8.0 and have near-zero competition.
  6. Ship Zapier/Make/n8n integrations. List on all three marketplaces. No-code users will find you.

The Best Bet

Compliance captioning for e-learning and corporate training. Here’s why:

  1. Demand is legally mandated: ADA, Section 508, WCAG 2.1, the EAA (in force since June 2025), and AODA all require accessible captions.
  2. Raw FFmpeg+Whisper output doesn’t satisfy those requirements, so the free base layer can’t undercut you.
  3. Regulatory stakes justify $0.50–$2.00/min pricing, 10–100x the raw transcription cost, in a $1B+ market.
  4. Accuracy verification, human review routing, certification, and audit trails form a trust moat that no weekend side project replicates.

FFmpeg+Whisper made the base layer free. The money is in everything above it: intelligence, compliance, workflow, and trust. Build there.