Provider Routing and Streaming

Routing layer that mediates all external LLM requests with normalization, streaming, and usage tracking.

Planning, analysis, and transcription jobs call external providers through /api/llm and /api/audio endpoints. The routing service normalizes requests via provider_transformers, streams responses through ModernStreamHandler, and records usage metadata per job.

Provider routing map

Diagram of how requests flow from the desktop app to the proxy and out to providers.

Placeholder for provider routing diagram.

Why a Routing Layer Exists

Direct calls from the desktop client would embed provider credentials and require different payloads per provider. The routing layer keeps keys on the server, exposes a single request format, and maintains consistent streaming behavior.

Security Benefits

  • API keys never leave the server
  • Per-user rate limiting and quotas
  • Request validation before provider calls

Operational Benefits

  • Single request format for all providers
  • Centralized usage tracking and billing
  • Fallback to OpenRouter on failure

Supported Providers

All requests go through a single endpoint: /api/llm/chat/completions. The router determines the appropriate provider based on the model ID in the request payload. Each provider has dedicated handlers in server/src/handlers/proxy/.

Provider     Routing                                          Models
OpenAI       Direct                                           GPT-5.2, GPT-5.2-Pro, GPT-5-mini, o3, GPT-4o-transcribe
Anthropic    Direct (non-streaming), OpenRouter (streaming)   Claude Opus 4.5, Claude Sonnet 4.5
Google       Direct                                           Gemini 3 Pro, Gemini 3 Flash, Gemini 2.5 Pro
X.AI         Direct                                           Grok-4
DeepSeek     Via OpenRouter                                   DeepSeek-R1
OpenRouter   Direct                                           Fallback aggregator for all providers
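
The lookup itself can be a straightforward match on the model ID prefix. The sketch below is illustrative; the Provider enum and route_model function are hypothetical names, not the actual router.rs API.

// Illustrative model-based routing (names are assumptions)
enum Provider {
    OpenAi,
    Anthropic,
    Google,
    Xai,
    OpenRouter,
}

fn route_model(model_id: &str, stream: bool) -> Provider {
    // The prefix before the slash identifies the upstream provider.
    match model_id.split('/').next().unwrap_or(model_id) {
        "openai" => Provider::OpenAi,
        // Anthropic streaming is routed via OpenRouter, per the table above.
        "anthropic" if stream => Provider::OpenRouter,
        "anthropic" => Provider::Anthropic,
        "google" => Provider::Google,
        "x-ai" => Provider::Xai,
        // DeepSeek and unrecognized models fall back to the aggregator.
        _ => Provider::OpenRouter,
    }
}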

Request Normalization via provider_transformers

Job processors submit a normalized payload with task ID, job ID, prompt content, and model selection. The provider_transformers module maps that payload into provider-specific request shapes.

// Normalized request from desktop
{
  "model": "anthropic/claude-opus-4-5-20251101",
  "messages": [
    { "role": "system", "content": "..." },
    { "role": "user", "content": "..." }
  ],
  "max_tokens": 16384,
  "temperature": 0.7,
  "stream": true,
  "metadata": {
    "job_id": "uuid-...",
    "session_id": "uuid-...",
    "task_type": "implementation_plan"
  }
}

// Transformed for Anthropic
{
  "model": "claude-opus-4-5-20251101",
  "system": "...",
  "messages": [{ "role": "user", "content": "..." }],
  "max_tokens": 16384,
  "stream": true
}

Transformation Features

  • System message extraction for Anthropic API format (sketched below)
  • Vision payload validation for image models
  • Token limit enforcement based on model context window
  • Provider-specific parameter mapping (top_p, presence_penalty, etc.)
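
As a concrete sketch of the first transformation, the function below hoists system messages into Anthropic's top-level system field and strips the routing prefix from the model ID. The function name and field handling are illustrative, not the actual provider_transformers code.

// Illustrative Anthropic transformation (assumes serde_json)
use serde_json::{json, Value};

fn to_anthropic(normalized: &Value) -> Value {
    let messages = normalized["messages"].as_array().cloned().unwrap_or_default();

    // Anthropic expects the system prompt as a top-level "system" string,
    // not as a message with role "system".
    let system: String = messages
        .iter()
        .filter(|m| m["role"] == "system")
        .filter_map(|m| m["content"].as_str())
        .collect::<Vec<_>>()
        .join("\n");

    let rest: Vec<Value> = messages
        .into_iter()
        .filter(|m| m["role"] != "system")
        .collect();

    json!({
        // Strip the "anthropic/" routing prefix used by the proxy.
        "model": normalized["model"].as_str().unwrap_or("").trim_start_matches("anthropic/"),
        "system": system,
        "messages": rest,
        "max_tokens": normalized["max_tokens"].clone(),
        "stream": normalized["stream"].clone(),
    })
}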

Streaming via ModernStreamHandler

Responses are streamed back to the desktop client through ModernStreamHandler, enabling real-time UI updates and progressive plan rendering.

// ModernStreamHandler processing loop
use futures_util::StreamExt; // brings .next() into scope for the byte stream

async fn handle_stream(
    response: Response,
    job_id: &str,
) -> Result<StreamResult> {
    let mut stream = response.bytes_stream();
    let mut accumulated = String::new();

    while let Some(chunk) = stream.next().await {
        // Each chunk may contain one or more SSE frames.
        let text = parse_sse_chunk(&chunk?)?;
        accumulated.push_str(&text);

        // Emit event to desktop client
        emit_stream_event(job_id, StreamEvent::Chunk {
            content: text,
            accumulated_tokens: count_tokens(&accumulated),
        });
    }

    // Final usage from provider response
    let usage = extract_final_usage(&accumulated)?;
    Ok(StreamResult { content: accumulated, usage })
}
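
The handler above leans on parse_sse_chunk to pull content deltas out of each frame. A minimal sketch follows, assuming the same anyhow-style Result as above and OpenAI-style frames; the real parser also has to handle provider-specific frame shapes.

// Illustrative parse_sse_chunk for OpenAI-style SSE frames
fn parse_sse_chunk(bytes: &[u8]) -> anyhow::Result<String> {
    let mut out = String::new();
    for line in std::str::from_utf8(bytes)?.lines() {
        // Frames arrive as `data: {json}`; other lines are comments or keep-alives.
        let Some(data) = line.strip_prefix("data: ") else { continue };
        if data == "[DONE]" {
            break; // end-of-stream sentinel
        }
        let frame: serde_json::Value = serde_json::from_str(data)?;
        // The next content fragment lives at choices[0].delta.content.
        if let Some(delta) = frame["choices"][0]["delta"]["content"].as_str() {
            out.push_str(delta);
        }
    }
    Ok(out)
}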

Chunk Events

Token/chunk events forwarded to job listeners for live UI updates

Partial Artifacts

Partial summaries written to job artifacts during streaming

Completion Events

Final events close the job state with usage metadata
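
These three stages map onto a small event type. The sketch below is illustrative: Chunk matches the handler above, while the other variants are assumptions about the event shape.

// Illustrative stream event type sent to the desktop client
enum StreamEvent {
    // Matches the Chunk emission in handle_stream above.
    Chunk { content: String, accumulated_tokens: usize },
    // Partial summaries persisted to job artifacts mid-stream.
    PartialArtifact { summary: String },
    // Closes the job with final usage metadata.
    Completed { tokens_input: u64, tokens_output: u64, cost: f64 },
}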

Fallback to OpenRouter on Failure

When a primary provider fails (rate limit, outage, or error), the routing layer can automatically retry through OpenRouter as a fallback aggregator. This provides resilience without requiring user intervention.

Fallback Behavior

  • Primary provider failure triggers an OpenRouter retry (sketched below)
  • Model mapping ensures equivalent capabilities
  • Usage tracked separately for cost attribution
  • User notified of fallback in job metadata
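
A minimal sketch of the retry path, with the provider calls abstracted as closures (the actual handlers live in server/src/handlers/proxy/):

// Illustrative fallback wrapper; retryable errors trigger the OpenRouter path
fn with_openrouter_fallback<T, E>(
    primary: impl FnOnce() -> Result<T, E>,
    openrouter: impl FnOnce() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
) -> Result<T, E> {
    match primary() {
        Ok(resp) => Ok(resp),
        // Rate limits and outages retry through the aggregator;
        // auth and validation errors surface to the job as failures.
        Err(err) if is_retryable(&err) => openrouter(),
        Err(err) => Err(err),
    }
}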

Token Counting and Cost Calculation

Every request records usage metadata so teams can audit cost and performance. Token counts come from provider responses when available, with fallback to tiktoken-based estimation.

// Usage record stored per request
{
  "tokens_input": 4521,
  "tokens_output": 2847,
  "cache_read_tokens": 1200,   // Anthropic prompt caching
  "cache_write_tokens": 0,
  "cost": 0.0234,              // USD based on model pricing
  "service_name": "anthropic/claude-opus-4-5-20251101",
  "request_id": "550e8400-e29b-41d4-a716-446655440000"  // Server-generated UUID
}
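
When a provider response omits usage, token counts can be estimated locally. A minimal sketch, assuming the tiktoken-rs crate:

// Illustrative fallback estimation (tiktoken-rs assumed; exact counts come from providers)
use tiktoken_rs::cl100k_base;

fn estimate_tokens(text: &str) -> usize {
    // The encoding tables ship with the crate, so this only fails on a bad build.
    let bpe = cl100k_base().expect("bundled cl100k_base encoding");
    bpe.encode_with_special_tokens(text).len()
}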

Tracked Usage Fields

tokens_input         Prompt tokens consumed by the request
tokens_output        Completion tokens generated in response
cache_read_tokens    Tokens served from provider cache (Anthropic)
cache_write_tokens   Tokens written to provider cache
cost                 Computed cost based on model pricing
service_name         Model identifier used for the request (e.g., anthropic/claude-opus-4-5)
request_id           Server-generated UUID for request tracking
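
Cost is a straight per-million-token computation over those fields. The sketch below uses a placeholder pricing structure; real prices come from the model configuration.

// Illustrative cost computation (prices per 1M tokens, USD)
struct ModelPricing {
    input_per_million: f64,
    output_per_million: f64,
    cache_read_per_million: f64,  // discounted rate for cache hits
    cache_write_per_million: f64, // surcharge for cache writes
}

struct Usage {
    tokens_input: u64,
    tokens_output: u64,
    cache_read_tokens: u64,
    cache_write_tokens: u64,
}

fn compute_cost(u: &Usage, p: &ModelPricing) -> f64 {
    const M: f64 = 1_000_000.0;
    u.tokens_input as f64 / M * p.input_per_million
        + u.tokens_output as f64 / M * p.output_per_million
        + u.cache_read_tokens as f64 / M * p.cache_read_per_million
        + u.cache_write_tokens as f64 / M * p.cache_write_per_million
}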

Vision Validation for Image Models

Requests containing images are validated before routing to ensure the selected model supports vision capabilities. Invalid requests fail fast with clear error messages.

Validation Checks

  • Model supports vision (checked against config)
  • Image format is supported (JPEG, PNG, WebP, GIF)
  • Image size within provider limits
  • Base64 encoding is valid (see the sketch below)
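
A sketch of those checks in order; the size limit and function names are assumptions, and the base64 crate is assumed for decoding:

// Illustrative image validation (limits vary by provider)
use base64::Engine as _;

const MAX_IMAGE_BYTES: usize = 20 * 1024 * 1024;
const SUPPORTED_FORMATS: [&str; 4] = ["image/jpeg", "image/png", "image/webp", "image/gif"];

fn validate_image(model_supports_vision: bool, mime: &str, b64: &str) -> Result<(), String> {
    if !model_supports_vision {
        return Err("selected model does not support vision".into());
    }
    if !SUPPORTED_FORMATS.contains(&mime) {
        return Err(format!("unsupported image format: {mime}"));
    }
    // Base64 inflates data by ~4/3, so the decoded size can be bounded cheaply.
    if b64.len() / 4 * 3 > MAX_IMAGE_BYTES {
        return Err("image exceeds provider size limit".into());
    }
    // Decoding validates the encoding itself; corrupt payloads fail fast here.
    base64::engine::general_purpose::STANDARD
        .decode(b64)
        .map_err(|e| format!("invalid base64: {e}"))?;
    Ok(())
}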

Vision-Capable Models

  • GPT-5.2, GPT-5-mini
  • Claude Opus 4.5, Claude Sonnet 4.5
  • Gemini 3 Pro, Gemini 3 Flash, Gemini 2.5 Pro
  • Grok-4

Failure Handling

If a provider fails or no provider is configured, the job is marked failed and the error payload is stored. Users can retry or run the job with another model instead of relying on silent fallbacks.

Rate Limit Errors

Retry-After header respected, user notified of wait time

Authentication Errors

API key validation failed, check provider configuration

Context Length Errors

Prompt exceeds model limit, suggest smaller context or different model
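
A sketch of how those three cases can be classified from a provider response; the status-code and body heuristics are assumptions, not the exact handler logic:

// Illustrative error classification for user-facing messages
enum ProviderError {
    RateLimited { retry_after_secs: Option<u64> },
    AuthFailed,
    ContextTooLong,
    Other(String),
}

fn classify(status: u16, retry_after: Option<&str>, body: &str) -> ProviderError {
    match status {
        // Respect Retry-After so the UI can surface the wait time.
        429 => ProviderError::RateLimited {
            retry_after_secs: retry_after.and_then(|v| v.parse().ok()),
        },
        401 | 403 => ProviderError::AuthFailed,
        // Many providers signal context overflow as a 400 with a descriptive body.
        400 if body.contains("context") || body.contains("maximum") => {
            ProviderError::ContextTooLong
        }
        _ => ProviderError::Other(format!("HTTP {status}: {body}")),
    }
}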

Security Boundaries

API keys stay in the server configuration. The desktop client only receives allowed model lists and never embeds provider credentials.

Security Measures

  • Key Storage: Provider keys stored in encrypted vault, never sent to clients
  • Request Signing: All proxy requests include a server-signed JWT for authentication (sketched below)
  • Content Filtering: Optional content moderation before sending to providers
  • Audit Logging: All requests logged with user context for compliance
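
As a sketch of the request-signing step, assuming the jsonwebtoken crate and illustrative claim fields:

// Illustrative proxy-request signing (jsonwebtoken crate assumed)
use jsonwebtoken::{encode, EncodingKey, Header};
use serde::Serialize;

#[derive(Serialize)]
struct ProxyClaims {
    sub: String, // user making the request
    jid: String, // job ID for audit correlation
    exp: u64,    // short expiry limits replay of a leaked token
}

fn sign_proxy_request(
    user_id: &str,
    job_id: &str,
    secret: &[u8],
) -> Result<String, jsonwebtoken::errors::Error> {
    let now = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .expect("system clock before epoch")
        .as_secs();
    let claims = ProxyClaims {
        sub: user_id.to_owned(),
        jid: job_id.to_owned(),
        exp: now + 300, // five minutes
    };
    encode(&Header::default(), &claims, &EncodingKey::from_secret(secret))
}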

Building a Similar Proxy (Conceptual)

If you are building a similar architecture, the key components to implement are:

  • Model-based routing: Look up the model ID to determine which provider to use, then route internally
  • Request transformation: Convert normalized requests to provider-specific formats (e.g., extract system messages for Anthropic)
  • Streaming handlers: Process SSE chunks from providers and forward to clients with consistent event format
  • Usage tracking: Record input/output tokens, cache usage, and costs per request with server-generated request IDs
  • Fallback routing: Route certain providers through aggregators (e.g., Anthropic streaming via OpenRouter)

Implementation Note

The actual implementation uses Actix-web handlers with provider-specific modules in server/src/handlers/proxy/providers/. See router.rs for the main routing logic.

Continue into model configuration

Model configuration explains how allowed lists and token guardrails are exposed to the UI.