1. What LLM Routers Are and Why They Exist
An LLM router sits between an application and a pool of language models. When a request comes in, the router decides which model to send it to -- based on cost, latency, quality requirements, provider availability, or some combination. The application sends one request; the router figures out the rest.
The core problem is a pricing gap that's hard to ignore. GPT-4o costs $2.50 per million input tokens; GPT-4o-mini costs $0.15 -- roughly 17x cheaper. For a large share of real-world requests (simple Q&A, classification, summarization of short documents), the mini model is good enough. Without routing, companies pay frontier prices for commodity queries. With routing, vendors claim you can cut LLM spend by 40-85% without users noticing.
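The basic economics can be captured in a few lines. A minimal, hypothetical dispatch rule (the prices are from above; the routing heuristic itself is invented for illustration and is far cruder than what any real router ships):

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    input_cost_per_mtok: float  # USD per million input tokens

CHEAP = Model("gpt-4o-mini", 0.15)
FRONTIER = Model("gpt-4o", 2.50)

def route(prompt: str, needs_high_quality: bool = False) -> Model:
    """Send quality-critical or very long requests to the frontier
    model; everything else goes to the cheap one."""
    if needs_high_quality or len(prompt) > 4000:
        return FRONTIER
    return CHEAP

# Commodity query -> commodity price.
assert route("Classify this ticket as bug or feature.").name == "gpt-4o-mini"
```

The application still sends one request; the router decides which price it pays.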
But cost is just the entry pitch. The actual product is broader:
| Problem | What the Router Does | Claimed Improvement |
|---|---|---|
| LLM spend too high | Routes easy queries to cheap models | 40-85% cost reduction |
| Provider outages | Auto-failover to secondary provider in under 20ms | Near-zero downtime |
| Latency too high | Routes to fastest responding provider in real time | ~60% reduction in time-to-first-token |
| Repeated identical queries | Semantic caching returns cached response without LLM call | 96.9% latency reduction on cache hit |
| Long context re-computation | Prompt caching reuses computed prefix state | Up to 90% cost reduction on long contexts |
| No visibility into LLM spend | Per-user, per-team, per-key cost tracking | Full cost attribution |
| Many providers, many APIs | Single OpenAI-compatible endpoint across 100-623 models | One integration, no provider lock-in |
The market is young but moving fast. OpenRouter went from $10M annual inference run-rate in October 2024 to $100M+ ARR in May 2025 -- a 10x increase in seven months. The category exists, has paying customers, and has attracted serious VC money. The question is who wins.
2. Players: Managed SaaS
OpenRouter
| Funding | $40M (Series A at $500M valuation, June 2025). Investors: a16z, Sequoia, Menlo Ventures. |
| Revenue | $100M+ ARR (May 2025). Grew 10x in 7 months. |
| Users | 1M+ developers |
| Models | 623+ |
| Pricing | 5% commission on inference spend. Pay-as-you-go. Free tier. |
| Positioning | "One-stop shop marketplace for all AI models" |
OpenRouter is the market leader by every metric that matters: funding, revenue, user count, model coverage. The 5% commission model is simple and trust-building -- you know exactly what you're paying on top of model costs. The moat is the ecosystem: a 623-model catalog means any developer can find what they need without going anywhere else.
The risk for OpenRouter is commoditization. If every cloud provider ships a native router (Cloudflare, Vercel, and AWS already have), the aggregation story weakens. The counter-play is that OpenRouter has the most model coverage and the biggest developer community, both hard to replicate quickly.
Portkey
| Funding | Not disclosed |
| Models | 250+ |
| Pricing | Free / $49 / $499 / $5K+/month |
| Positioning | "Enterprise-grade AI Gateway and Production Control Plane" |
| Differentiator | Strongest observability, SSO/SCIM/RBAC, hierarchical budget controls |
Portkey is going after enterprise governance where OpenRouter goes after developer velocity. SSO, SCIM, RBAC, audit trails, virtual key management, canary deployments, circuit breakers -- this is the set of features a platform team at a 500-person company needs. The $5K+/month tier is a real enterprise play.
Portkey's core gateway is moving to open source with v2.0 in 2026. Classic open-core move: commoditize the gateway, monetize the governance layer.
Martian
| Models | 200+ |
| Pricing | Developer: $20/5,000 requests. Enterprise: custom. |
| Positioning | ML-based routing to the optimal model based on query complexity. Claims 20-96% cost reduction. |
| Differentiator | Learns which model is actually best for each type of query, not just cheapest |
Not Diamond
| Pricing | Contact sales (enterprise model) |
| Positioning | Ultra-low-latency routing (<100ms overhead). Pre-trained out-of-box router plus custom router training on your own data. |
| Distribution | Available on AWS Marketplace |
| Differentiator | Stack-agnostic, integrates with existing evaluation metrics, custom router training |
Requesty
| Models | 400+ |
| Pricing | 5% markup on model costs. Enterprise volume discounts. |
| Positioning | Unified gateway with strong compliance story: data residency (Frankfurt, Virginia, Singapore), PII detection and redaction. |
| Differentiator | Per-agent routing strategies, per-user spending caps, sub-20ms failover |
Helicone
| YC Batch | W23 |
| License | MIT (open source) |
| Providers | 100+ |
| Pricing | Free (10K requests/month). Paid tiers undisclosed. |
| Positioning | Observability-first, routing second. "See what's happening, then optimize it." |
Helicone is interesting because it came in as an observability tool (like Langfuse) but added routing. The observability angle is a wedge: get teams to install it for visibility, then upsell routing and cost optimization. One-line integration (just change the base URL) keeps the friction close to zero.
3. Players: Open Source / Self-Hosted
LiteLLM
| Stars | 33,000+ |
| Language | Python |
| License | Open source (self-hosted) |
| Providers | 100+ |
| Routing strategies | simple-shuffle, least-busy, usage-based, latency-based, complexity router (sub-millisecond, zero external calls) |
LiteLLM is the most adopted open source option by a wide margin. The complexity router is the interesting technical contribution: rule-based scoring of query complexity that makes routing decisions in under a millisecond without any external API call. That's the right tradeoff for high-throughput, latency-sensitive environments.
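In the spirit of the sub-millisecond, rule-based approach described above, a toy complexity scorer might look like the following. The heuristics, weights, and threshold are invented for illustration -- this is not LiteLLM's actual implementation:

```python
import re

def complexity_score(prompt: str) -> float:
    """Cheap heuristics only -- no model inference, no external calls,
    so the decision stays well under a millisecond."""
    score = 0.0
    score += min(len(prompt) / 2000, 1.0)  # longer prompts score higher
    if re.search(r"\b(prove|derive|step[- ]by[- ]step|reason)\b", prompt, re.I):
        score += 0.6                       # reasoning-style keywords
    if prompt.count("\n") > 10:
        score += 0.3                       # structured, multi-part input
    return score

def pick_model(prompt: str, threshold: float = 0.5) -> str:
    return "frontier-model" if complexity_score(prompt) > threshold else "cheap-model"
```

The point of the design is that everything here is string inspection: the routing decision costs microseconds, at the price of cruder accuracy than an ML classifier.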
The weakness: LiteLLM is Python, which creates performance constraints under serious production load. Bifrost was built specifically to address this.
Bifrost (by Maxim AI)
| Language | Go |
| License | Open source (self-hosted) |
| Performance | 11 microseconds overhead at 5,000 RPS. Claims 50x faster than LiteLLM. |
| Models | 1,000+ (15+ providers including OpenAI, Anthropic, AWS Bedrock, Google Vertex) |
| Key features | Two-layer semantic caching (exact hash + vector similarity), adaptive load balancing, RBAC, zero external dependencies, cluster mode |
Bifrost is the performance play. Go vs. Python at the proxy layer is not a subtle difference -- 11 microseconds vs. multiple milliseconds matters when you're handling millions of requests per day. No external dependencies is also important for enterprise ops teams who don't want surprise failure modes. If you need self-hosted and you care about throughput, Bifrost is the pick.
RouteLLM (LMSYS / UC Berkeley)
| License | Open source (all code and datasets public on HuggingFace) |
| Origin | Academic research framework from LMSYS (the Chatbot Arena team) |
| Model pair | Optimized for GPT-4 Turbo and Mixtral 8x7B routing |
| Performance | 95% GPT-4 quality with only 26% GPT-4 calls (48% cheaper than random baseline). On MT Bench: 85% cost reduction. |
| Available routers | sw_ranking, BERT classifier, causal LLM classifier, matrix factorization |
RouteLLM is a research framework, not a production product. Its value is benchmarking: it established the evaluation methodology that the whole field now uses. The limitation is that it's optimized for a single model pair and requires adaptation for any other scenario. It's the academic baseline, not the production deployment.
vLLM Semantic Router (v0.1 "Iris", January 2026)
| License | Open source |
| Contributors | 50+ engineers from Red Hat, IBM Research, AMD, Hugging Face |
| Positioning | System-level intelligent router for Mixture-of-Models. Handles stateful conversation management, tool filtering for agentic workflows. |
| Features | OpenAI Responses API support, Signal-Decision Plugin Chain, HaluGate (hallucination detection), modular LoRA, Helm charts |
The vLLM router is significant because it's coming from the inference layer up, not the API gateway layer down. This is mixture-of-models thinking: many small specialized models working together, routed intelligently, outperforming a single large model. As specialized models proliferate (reasoning models, math models, code models, vision models), this architecture becomes more interesting.
BricksLLM
| YC Batch | Yes (BricksAI) |
| Language | Go |
| License | Open source + optional managed dashboard |
| Differentiator | Fine-grained cost and rate limiting per API key. Per-user, per-app, per-environment spending limits. PII detection and masking. |
| Models | OpenAI, Azure OpenAI, Anthropic, vLLM, open-source LLMs |
4. Players: Infrastructure Extensions
These are companies that aren't primarily LLM routers but have added routing as part of a broader infrastructure play.
| Player | Core Product | Router Angle | Target |
|---|---|---|---|
| Cloudflare AI Gateway | CDN / edge network (20% of internet traffic) | 350+ models, caching, rate limiting, analytics at the edge. Global auto-scaling included. | Teams already on Cloudflare |
| Vercel AI Gateway | Frontend hosting (Next.js) | Sub-20ms routing, 100+ models, tight Next.js integration | JavaScript/TypeScript frontend teams |
| Kong AI Gateway | Enterprise API gateway | Extends Kong's platform to AI traffic. Most sophisticated semantic routing of any gateway product. | Enterprises with existing Kong investment |
| Anyscale / Ray Serve | Distributed compute platform | Prefix-aware routing, achieves 60% TTFT reduction. Custom routing for Ray-managed infra. | Teams running their own inference infrastructure |
| AWS Bedrock | Cloud AI service | Model routing within the AWS ecosystem, semantic routing support | AWS-native enterprises |
The infrastructure player threat is real. Cloudflare and Vercel have distribution that pure-play routers can't match -- they're already in the request path for millions of applications. Routing becomes a checkbox feature, not a standalone purchase. This is the commoditization risk the pure-plays face.
5. Technical Approaches: How Routing Actually Works
Not all routing is the same. The technical approach determines latency, accuracy, and maintenance cost.
| Strategy | How It Works | Latency Added | Accuracy | Who Uses It |
|---|---|---|---|---|
| Rule-based complexity scoring | Heuristics on query length, token count, keyword presence. No model inference. | <1ms | Moderate | LiteLLM complexity router, Bifrost |
| ML classifier (BERT) | Fine-tuned BERT predicts which model will perform better | 20-50ms | High | RouteLLM, Not Diamond |
| Semantic embedding | Embed the prompt, similarity-match against reference clusters, route to cluster's best model | 50-100ms | High for in-distribution | vLLM Semantic Router, Kong, AWS |
| LLM-as-router | A small LLM reads the query and decides which larger LLM to call | 200-500ms | Very high | Martian, some experimental setups |
| Matrix factorization | Collaborative-filtering style: predict which model the query is "closest to" based on historical performance vectors | 10-30ms | High | RouteLLM |
| Latency-based | Real-time probe of provider response times, route to fastest | <20ms | N/A (not about quality) | Almost all gateways |
| Load balancing | Distribute across instances by weight, round-robin, or least-busy | <10ms | N/A (not about quality) | All gateways |
| Fallback chains | Primary fails, try secondary, then tertiary | Depends on failure speed | N/A (reliability play) | All gateways |
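Of the strategies in the table, the fallback chain is the simplest to illustrate: try providers in order, moving on when one fails. A minimal sketch with invented provider stubs (real routers match on specific failure classes like timeouts, 429s, and 5xx responses, and track failure speed):

```python
from typing import Callable

def with_fallback(providers: list[Callable[[str], str]], prompt: str) -> str:
    """Call each provider in priority order; return the first success."""
    last_error: Exception | None = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as err:
            last_error = err  # remember why, keep going down the chain
    raise RuntimeError("all providers failed") from last_error

# Usage with stubbed providers: the primary fails, the secondary answers.
def flaky(prompt: str) -> str:
    raise TimeoutError("primary provider down")

def healthy(prompt: str) -> str:
    return f"answer to: {prompt}"

print(with_fallback([flaky, healthy], "hello"))  # answer to: hello
```

The latency cost noted in the table ("depends on failure speed") falls out directly: the chain is only as fast as the slowest failure it has to wait through before reaching a healthy provider.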
Caching as a Routing Bypass
The fastest "routing" decision is no LLM call at all. Three caching layers exist:
- Exact-match caching: Hash the prompt, return cached response if identical. Instant. Works for repeated identical queries.
- Prompt prefix caching: Anthropic and OpenAI now support computing prefix state once and reusing it. Up to 90% cost reduction on long-context queries with repeated system prompts.
- Semantic caching: Vector-embed the prompt, similarity-search against cached responses, return cache hit if semantically close enough. 96.9% latency reduction on hits. Bifrost implements this as a two-layer system (exact hash first, then vector similarity).
Research Findings on Router Fragility
ACL 2026 research found that current routers have structural problems:
- Routing collapse: As budget increases, routers default to expensive models even when cheaper ones suffice. The router "plays it safe" instead of optimizing.
- Training/decision mismatch: Routers trained to predict model performance don't optimize the same objective as routing decisions (discrete ranking). The loss function is wrong.
- Safety gap: BERT-based routers route potential jailbreaks to weaker models because the weak model scores lower on quality -- but lower quality doesn't mean safer. The opposite is often true.
These are known problems with no commercial solution yet.
6. Pricing Models Compared
| Player | Pricing Model | Entry Cost | At Scale |
|---|---|---|---|
| OpenRouter | 5% commission on inference spend | Free | Scales with usage. At $10K/month model spend = $500/month. |
| Requesty | 5% markup on model costs | Free | Same math as OpenRouter. Volume discounts at enterprise. |
| Martian | Per-request (developer) + custom (enterprise) | $20 per 5,000 requests after free tier | Custom enterprise pricing. VPC deployment option. |
| Not Diamond | Enterprise (contact sales) | Contact | Per-inference cost billing for model runs |
| Portkey | Freemium + seat-based tiers | $49/month (Starter) | $499/month (Pro, 20-person team), $5K+/month (Enterprise) |
| Helicone | Freemium | Free (10K req/month) | Paid tiers undisclosed |
| LiteLLM | Free (self-hosted) | $0 | $0 (you pay for your own infra) |
| Bifrost | Free (self-hosted) | $0 | $0 |
| BricksLLM | Free (open source) + optional managed dashboard | $0 | Managed dashboard pricing undisclosed |
| RouteLLM | Free (open source research framework) | $0 | $0 |
| Cloudflare AI Gateway | Included with Cloudflare Workers | $0 (on free plan) | Scales with Cloudflare account tier |
The 5% commission model (OpenRouter, Requesty) is the cleanest. You know exactly what you're paying. Percentage-of-spend aligns the router's incentives with the customer's: if model costs drop, the fee drops. The risk for the provider is that as models get cheaper industry-wide, the commission shrinks.
The self-hosted open source options (LiteLLM, Bifrost) have $0 sticker price but real costs: engineering time to maintain, infrastructure to run, on-call responsibility. For teams with a platform engineer, it's a good deal. For a two-person startup, it probably isn't.
7. Who Buys and How They Decide
Buyer Segments
| Segment | Profile | What They Buy | Decision Driver | Budget |
|---|---|---|---|---|
| Developer / indie hacker | Solo or tiny team, moving fast, no ops | OpenRouter, Helicone (free tiers) | Zero friction, free tier, works in 5 minutes | $0-$100/month |
| Cost-conscious startup | 5-30 people, LLM spend becoming a line item | Requesty, Martian, Portkey Starter | ROI: "Show me the cost reduction" | $100-$1K/month |
| Platform / infra team | 50+ person org with dedicated DevOps or platform engineering | LiteLLM, Bifrost, BricksLLM (self-hosted) | Control, on-prem, custom governance, no vendor dependency | Engineering time, not SaaS fees |
| Enterprise with governance needs | Fortune 500, regulated industries (finance, healthcare, legal) | Portkey Enterprise, Not Diamond, Kong AI Gateway | Compliance, audit trails, RBAC, data residency, SSO | $5K-$100K+/month |
| Cloudflare / Vercel native team | Frontend-heavy, already on these platforms | Cloudflare AI Gateway, Vercel AI Gateway | Already in the stack, zero new vendor | Bundled with existing spend |
How Decisions Actually Get Made
The LLM router purchase is almost always developer-initiated, not top-down. An engineer discovers the cost savings opportunity, tests a free tier over a weekend, shows the boss a 60% reduction in the LLM bill, and gets budget approved. The sales motion is product-led. The best routers optimize for this: one-line integration (just change the base URL), immediate ROI visibility in a dashboard, and no credit card to start.
For enterprise deals, the evaluation criteria shift toward compliance and governance. "Does your data stay in the EU?" is more important than "how much does it cost?" for a German bank building on LLMs. Portkey and Requesty are positioning here; most others aren't.
8. Funding and Market Size
| Company | Round | Amount | Date | Valuation |
|---|---|---|---|---|
| OpenRouter | Seed + Series A | $40M total | June 2025 | $500M |
| Martian | Undisclosed | -- | -- | -- |
| Not Diamond | Seed | Undisclosed | ~2023 | -- |
| Helicone | YC W23 | -- | 2023 | -- |
| BricksAI | YC | -- | ~2024 | -- |
Outside of OpenRouter, the funding picture is thin. Most pure-play routers are either bootstrapped, seed-stage, or YC-backed at early valuations. The VC money in the broader AI infra space ($750M to Groq, $500M+ to Together AI) is going to inference infrastructure, not routing middleware. Routing is seen as a layer on top of inference, not the primary bet.
The market size estimates are optimistic: $6.52 billion by 2030 at 21% CAGR. Whether that's the standalone router market or the broader AI gateway market is unclear. OpenRouter's $100M ARR is the only hard data point. The rest is projection.
9. Market Gaps: What Doesn't Exist Yet
| Gap | Current State | Opportunity | Difficulty |
|---|---|---|---|
| Capability-aware routing | Routers optimize for cost/latency/quality generics. No router knows that model X is specifically good at multi-step reasoning or multilingual tasks. | Build a capability profile for every model (benchmarks + production signals), then route based on task type, not just query complexity. | High (requires ongoing model evaluation across many dimensions) |
| Safety-aware routing | BERT routers route potential jailbreaks to weaker models (because weak = cheap), which elevates risk. Nobody has built routing that treats safety as a first-class routing signal. | Router that classifies intent first (safe vs. borderline vs. risky), then routes safe to cheap and risky to either more capable or a safety-specialized model. | High (requires intent classification without over-filtering legitimate queries) |
| Stateful multi-turn routing | Almost all routers make per-request decisions. Conversation context is ignored. | Route entire conversations to the same model for consistency, or adapt routing mid-conversation based on where the conversation goes. vLLM SR is starting here. | Medium (stateful session tracking at scale is solved infrastructure; the routing logic is novel) |
| MCP-aware routing | As of March 2026, no major router supports Model Context Protocol for tool-rich agentic workflows. Portkey and TrueFoundry are working on it. | Router that understands tool context (which tools are available, which tools the query needs) and routes to the model best suited for those specific tools. | Medium (MCP is new; the routing logic on top is non-trivial but buildable) |
| Compliance-driven routing | Data residency is addressed (Requesty has EU/US/APAC regions). Model-level compliance (HIPAA-eligible models, SOC2-certified providers, EU AI Act compliant models) is not. | Router that enforces routing constraints based on regulatory requirements: HIPAA queries only go to HIPAA-eligible model endpoints, EU user data never leaves EU models. | Medium (requires a maintained compliance attribute database for models, then rule enforcement) |
| Self-improving / adaptive routers | All current routers are static. A decision made in month 1 uses the same routing logic as month 12, regardless of what the production data shows. | Router that continuously retrains on production performance data. If model X keeps producing lower-quality outputs than predicted for a certain query type, update the routing weights. | Very High (requires a feedback loop, human-in-the-loop or LLM-as-judge, and a retraining pipeline) |
| Multimodal routing | All major routers handle text. Routing across image, video, audio, and multimodal models is not addressed. | Route based on modality requirements of the query. Image question? Route to vision model. Audio transcription? Route to Whisper-class model. Mixed? Route to GPT-4V or Gemini Ultra. | Medium (modality detection is straightforward; model capability databases for multimodal are less mature) |
| Standardized evaluation / leaderboard | RouterArena project emerging but not yet the authoritative benchmark. Every company uses its own evaluation methodology to claim cost savings. | The "HuggingFace Open LLM Leaderboard" for routers. Standardized tasks, standard model pairs, reproducible benchmarks. Whoever builds this becomes a reference point the whole market cites. | Low-Medium (research project, not a product) |
| Custom fine-tuned model routing | Routers are good at routing between public frontier models. Routing that includes custom fine-tunes or LoRA adapters is not well supported. | Router that benchmarks custom models, integrates them into the routing pool, and selects between public models and custom fine-tunes based on task fit. | Medium-High (requires a bring-your-own-model evaluation pipeline) |
10. Verdict: Where the Market Is Going
A few things are becoming clear.
OpenRouter has won the developer market. $100M ARR with 1M developers and the largest model catalog is a durable position. The 5% commission model is simple and trusted. Competing on pure model aggregation against OpenRouter is not a good idea.
Enterprise governance is still open. Portkey is going after it but is still relatively small. The enterprise AI governance market (compliance, chargeback, RBAC, audit trails) is worth a lot more than the developer-tier market, and it's less crowded. The play is to build the SOC2, HIPAA, and EU AI Act compliance story that Portkey doesn't have fully yet.
Infrastructure players will commoditize the basics. Cloudflare, Vercel, AWS, Azure, Kong -- they're all adding routing as a checkbox feature. Basic cost/latency routing with fallback is going to be free and built-in within 2-3 years. The pure-plays need to go up-market (governance, compliance, adaptive learning) or specialize vertically.
Self-hosted is not going away. Legal, healthcare, finance, and government will always have segments that can't send data to a managed service. Bifrost (Go, 11 microsecond overhead, no external dependencies) is the right technical approach for this segment. The opportunity there is to add enterprise governance on top of the raw performance.
The next battleground is agentic routing. Routing a single user query is solved. Routing across a multi-step AI agent workflow -- where context accumulates, tools get invoked, and the right model for step 3 depends on what happened in steps 1 and 2 -- is not solved. vLLM Semantic Router is the only player seriously attacking this. As AI agents move from demos to production, stateful multi-turn routing becomes the thing everyone needs.
The data flywheel hasn't kicked in yet. Every router that operates as a managed service accumulates a dataset of queries, model selections, and outcome quality. None of them appear to be using this data to train proprietary routing models. The first router to close this loop -- using production data to continuously improve routing decisions -- builds a compounding moat that rule-based competitors can't cross.
Short summary: buy OpenRouter for developer use, self-host Bifrost for performance-critical or privacy-sensitive workloads, watch the agentic routing space carefully, and ignore the "we save you X% on LLM costs" marketing from every player -- they all say the same thing.