Voice Transcription
Recording lifecycle, device management, and transcription behavior for voice-driven prompts.
Voice transcription is available anywhere the desktop app exposes dictation controls, including the plan terminal and prompt editors. The feature records audio locally and sends a single recording to the transcription service when you stop, then inserts text into the active input field without blocking manual typing.
Voice transcription pipeline
Audio capture, provider transcription, and text insertion flow.
The useVoiceTranscription React hook manages the complete recording lifecycle. It initializes MediaRecorder to capture audio in WebM format with the Opus codec, monitors audio levels, and handles device switching.
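As a sketch of that initialization (the helper name and structure are illustrative; only getUserMedia and the MediaRecorder APIs are standard):

```typescript
// Minimal sketch of recorder setup as described above; not the hook's
// actual source. Names other than the web APIs are illustrative.
async function startRecording(deviceId?: string): Promise<MediaRecorder> {
  // Request the microphone, optionally pinning a specific input device.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: deviceId ? { deviceId: { exact: deviceId } } : true,
  });

  // Prefer WebM with the Opus codec, matching the format described above.
  const mimeType = MediaRecorder.isTypeSupported("audio/webm;codecs=opus")
    ? "audio/webm;codecs=opus"
    : "audio/webm";

  const chunks: Blob[] = [];
  const recorder = new MediaRecorder(stream, { mimeType });
  recorder.ondataavailable = (e) => {
    if (e.data.size > 0) chunks.push(e.data);
  };
  recorder.onstop = () => {
    // The hook sends this single blob to the transcription service.
    const recording = new Blob(chunks, { type: mimeType });
    void recording; // hand off to the upload path
  };
  recorder.start();
  return recorder;
}
```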
The desktop app invokes transcribe_audio_command to send audio data to the server endpoint /api/audio/transcriptions. The command validates minimum size (1KB), duration, temperature (0.0-1.0), and prompt length (max 1000 characters); the server enforces max file size (100MB).
Audio files must be between 1KB and 100MB. Supported formats: WAV, MP3, M4A, OGG, WebM, FLAC, AAC, and MP4. The transcription model must be specified explicitly and must be in the server allowlist (OpenAI models by default on hosted).
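The enforcement lives in the Tauri command and the server, but the documented limits can be mirrored client-side. The sketch below restates them in TypeScript; the function and field names are illustrative, not the command's actual interface:

```typescript
// Client-side mirror of the documented limits: 1KB minimum, 100MB maximum,
// temperature 0.0-1.0, prompt up to 1000 characters. The command also
// validates duration, whose bounds aren't documented here.
const MIN_BYTES = 1024; // 1KB minimum (validated by the Tauri command)
const MAX_BYTES = 100 * 1024 * 1024; // 100MB maximum (enforced server-side)
const MAX_PROMPT_CHARS = 1000;

function validateTranscriptionRequest(opts: {
  audioBytes: Uint8Array;
  temperature?: number;
  prompt?: string;
}): string | null {
  if (opts.audioBytes.length < MIN_BYTES) return "recording too small";
  if (opts.audioBytes.length > MAX_BYTES) return "recording too large";
  if (
    opts.temperature !== undefined &&
    (opts.temperature < 0 || opts.temperature > 1)
  ) {
    return "temperature must be between 0.0 and 1.0";
  }
  if (opts.prompt && opts.prompt.length > MAX_PROMPT_CHARS) {
    return "prompt exceeds 1000 characters";
  }
  return null; // request is within the documented limits
}
```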
Recording workflow
The recording hook keeps a state machine with idle, recording, processing, and error states, sketched after the list below. It records audio into a single blob, enforces a ten-minute cap, and sends the recording on stop.
Recording states
- idle: No recording in progress, microphone permissions may or may not be granted
- recording: Capturing audio via MediaRecorder with live level monitoring
- processing: Uploading the recording to the transcription endpoint and awaiting a response
- error: Recording failed due to permission denial, device disconnection, or transcription API error
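A minimal sketch of that state machine; the states match the list above, while the transition table is an assumption about how the hook likely moves between them:

```typescript
// Illustrative shape of the recording state machine; the hook's actual
// types and transitions may differ.
type RecordingState = "idle" | "recording" | "processing" | "error";

const TEN_MINUTES_MS = 10 * 60 * 1000; // documented recording cap

// Assumed legal transitions between the documented states.
const transitions: Record<RecordingState, RecordingState[]> = {
  idle: ["recording"],
  recording: ["processing", "error", "idle"],
  processing: ["idle", "error"],
  error: ["idle"],
};

function canTransition(from: RecordingState, to: RecordingState): boolean {
  return transitions[from].includes(to);
}
```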
Server-side processing
The server exposes /api/audio/transcriptions, which accepts multipart form data. It routes requests to OpenAI or Google based on the model's provider configuration, validates user credits, and calculates billing from the audio duration. A request sketch follows the parameter list below.
Request parameters
- file: Audio file data (required) - WAV, MP3, M4A, OGG, WebM, FLAC, AAC, or MP4
- model: Transcription model ID (required) - from server allowlist (e.g., openai/gpt-4o-transcribe)
- durationMs: Recording duration in milliseconds (required for billing calculation)
- language: ISO 639-1 language code (optional) - improves accuracy for specific languages
- prompt: Context hint for transcription (optional, max 1000 characters) - helps with domain-specific vocabulary
- temperature: Sampling temperature 0.0-1.0 (optional, default 0.0) - lower values produce more deterministic output
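For illustration, a direct request to the endpoint using these parameters might look like the following. In the desktop app the call actually goes through the Tauri command layer; the form field names simply restate the parameter list above, and anything else (helper name, sample values) is an assumption:

```typescript
// Hypothetical direct call to /api/audio/transcriptions with the
// documented multipart parameters.
async function transcribe(audio: Blob, durationMs: number, token: string) {
  const form = new FormData();
  form.append("file", audio, "recording.webm"); // required audio file
  form.append("model", "openai/gpt-4o-transcribe"); // must be on the allowlist
  form.append("durationMs", String(durationMs)); // required for billing
  form.append("language", "en"); // optional ISO 639-1 hint
  form.append("prompt", "Sprint planning, Kubernetes, PostgreSQL"); // optional, max 1000 chars
  form.append("temperature", "0.0"); // optional, default 0.0

  const res = await fetch("/api/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${token}` },
    body: form,
  });
  if (!res.ok) throw new Error(`transcription failed: ${res.status}`);
  return res.json(); // transcribed text payload
}
```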
Project-aware settings
When a recording session starts, the hook looks up the active project's transcription configuration so recordings follow the project's preferences.
Project transcription preferences are stored in SQLite key_value_store under project_task_settings and include the preferred model, language code, prompt, and temperature. Hosted uses managed providers; self-hosting can adjust the allowlist.
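A plausible TypeScript shape for those stored preferences; the field names are assumptions, with the documented constraints noted inline:

```typescript
// Assumed shape of per-project transcription preferences stored under
// project_task_settings in the SQLite key_value_store.
interface ProjectTranscriptionSettings {
  model: string;        // e.g. "openai/gpt-4o-transcribe", from the allowlist
  language?: string;    // ISO 639-1 code, e.g. "en"
  prompt?: string;      // domain vocabulary hint, max 1000 characters
  temperature?: number; // 0.0-1.0, defaults to 0.0
}
```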
Device management
The feature requests microphone permission, enumerates available audio inputs, and lets users choose the active device before recording. Changes take effect on the next recording.
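Enumerating inputs uses standard web APIs; note that device labels typically come back empty until microphone permission has been granted:

```typescript
// List available microphones after securing permission, so that
// human-readable labels are populated for the device picker.
async function listMicrophones(): Promise<MediaDeviceInfo[]> {
  await navigator.mediaDevices.getUserMedia({ audio: true }); // triggers the permission prompt
  const devices = await navigator.mediaDevices.enumerateDevices();
  return devices.filter((d) => d.kind === "audioinput");
}
```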
Real-time audio level monitoring displays visual feedback during recording. The system warns when audio is silent so you can catch muted microphones before sending the recording.
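One way to compute a live level and flag silence is an AnalyserNode over the recording stream; the RMS metric below is a common approach, not necessarily the app's exact implementation:

```typescript
// Derive a live audio level from the stream using the Web Audio API.
// A sustained level near zero suggests a muted microphone.
function monitorLevel(stream: MediaStream, onLevel: (rms: number) => void) {
  const ctx = new AudioContext();
  const source = ctx.createMediaStreamSource(stream);
  const analyser = ctx.createAnalyser();
  analyser.fftSize = 2048;
  source.connect(analyser);

  const buf = new Float32Array(analyser.fftSize);
  const tick = () => {
    analyser.getFloatTimeDomainData(buf);
    // Root-mean-square amplitude as a simple level metric.
    const rms = Math.sqrt(buf.reduce((s, x) => s + x * x, 0) / buf.length);
    onLevel(rms);
    requestAnimationFrame(tick);
  };
  tick();
}
```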
Data flow
Audio data flows from the browser through the Tauri command layer to the server, which proxies requests to the appropriate transcription provider. A sketch of the stop handler follows the steps below.
Processing steps
- Browser MediaRecorder captures audio in a single recording (WebM by default)
- useVoiceTranscription tracks duration and recording state
- On stop, the audio blob is converted to bytes and sent via transcribe_audio_command
- Tauri command validates size, duration, temperature, and prompt length
- Request sent to server /api/audio/transcriptions endpoint with auth token
- Server routes to the configured provider and returns transcribed text
- Transcribed text returned to desktop and inserted via callback
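A sketch tying the steps above together, assuming Tauri v2's invoke API; the argument names passed to transcribe_audio_command are hypothetical:

```typescript
import { invoke } from "@tauri-apps/api/core"; // Tauri v2 import path

// Stop handler sketch: blob -> bytes -> Tauri command -> insertion callback.
// The command name comes from this page; its argument shape is assumed.
async function onRecordingStopped(
  recording: Blob,
  durationMs: number,
  insertText: (text: string) => void,
) {
  const bytes = new Uint8Array(await recording.arrayBuffer());
  const text = await invoke<string>("transcribe_audio_command", {
    audio: Array.from(bytes), // serialized for the command layer
    durationMs,
  });
  insertText(text); // inserted into the active input via callback
}
```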
Multi-destination routing
Transcribed text is routed based on the active UI context and inserted into the appropriate input; a routing sketch follows this list.
- Task description editors: Cursor insertion with optional follow-up text_improvement
- Terminal dictation buffer: Command text inserted into PTY input
- Prompt editors: Direct insertion into active text inputs
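A minimal sketch of that routing; the context names and insertion helpers are hypothetical, standing in for the app's real destinations:

```typescript
// Illustrative dispatch of transcribed text by UI context.
type DictationContext = "task-editor" | "terminal" | "prompt-editor";

function routeTranscription(
  context: DictationContext,
  text: string,
  deps: {
    insertAtCursor: (t: string) => void;  // task description editors
    writeToPty: (t: string) => void;      // terminal dictation buffer
    insertIntoInput: (t: string) => void; // prompt editors
  },
) {
  switch (context) {
    case "task-editor":
      deps.insertAtCursor(text); // may be followed by text_improvement
      break;
    case "terminal":
      deps.writeToPty(text);
      break;
    case "prompt-editor":
      deps.insertIntoInput(text);
      break;
  }
}
```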
Key implementation files
- desktop/src/hooks/use-voice-recording/use-voice-transcription.ts
- desktop/src/actions/voice-transcription/transcribe.ts
- desktop/src-tauri/src/commands/audio_commands.rs
- server/src/handlers/proxy/specialized/transcription.rs
- server/src/clients/openai/transcription.rs
- server/src/clients/google_client.rs
Usage examples
Common voice transcription workflows:
- Sprint planning: Dictate tasks, then run text_improvement and task_refinement
- Terminal commands: Dictation transcribed and typed directly into PTY for execution
- Bug reports: Verbal description captured, refined with text_improvement, then stored in task history
- Walkthrough notes: Narrate a screen recording and attach the video analysis summary to the task
Continue exploring
Learn how transcribed text can be refined and how meeting recordings are processed into actionable tasks.