Meeting & Recording Ingestion
How recordings become structured task inputs and artifacts.
PlanToCode can process meeting recordings and screen captures to extract task-relevant information. This document describes the ingestion workflow from recording to structured artifacts.
[Figure: Recording ingestion flow. How recordings flow through transcription and analysis.]
Supported Inputs
The meeting ingestion pipeline accepts various recording formats:
- Screen recordings (MP4, WebM, MOV)
- Meeting recordings from Zoom, Meet, Teams
- Audio-only files (MP3, WAV, M4A)
- Direct screen capture from desktop
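As a minimal sketch of how a client might route the formats listed above, the helper below classifies a file by its extension. The function name `classifyRecording` and the `InputKind` type are hypothetical and do not come from the PlanToCode source; a real implementation would likely also inspect the container metadata rather than trust the extension alone.

```typescript
// Hypothetical helper: classify a recording by file extension.
// The extension lists mirror the formats named in this document.
type InputKind = "video" | "audio" | "unsupported";

const VIDEO_EXTENSIONS = ["mp4", "webm", "mov"];
const AUDIO_EXTENSIONS = ["mp3", "wav", "m4a"];

function classifyRecording(fileName: string): InputKind {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return "unsupported"; // no extension at all
  const ext = fileName.slice(dot + 1).toLowerCase();
  if (VIDEO_EXTENSIONS.indexOf(ext) !== -1) return "video";
  if (AUDIO_EXTENSIONS.indexOf(ext) !== -1) return "audio";
  return "unsupported";
}
```

Audio-only inputs can skip frame sampling entirely, which is why the sketch distinguishes them from video inputs up front.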
Upload Process
Recordings are uploaded to the server as multipart form data.
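A minimal sketch of building such a multipart body with the standard `FormData` API follows. The field names `file` and `session_id` and the function name `buildUploadForm` are assumptions for illustration; the actual upload contract is not shown in this document.

```typescript
// Hypothetical sketch: the field names below are assumed, not taken from the server code.
function buildUploadForm(recording: Blob, fileName: string, sessionId: string): FormData {
  const form = new FormData();
  // The third argument sets the filename reported in the multipart part.
  form.append("file", recording, fileName);
  form.append("session_id", sessionId);
  return form;
}
```

When a `FormData` object is passed as the body of a `fetch` request, the runtime sets the `multipart/form-data` content type and boundary itself, so the caller should not set that header by hand.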
Processing Steps
- File uploaded to server temporary storage
- Metadata extracted (duration, format, resolution)
- Audio track separated for transcription
- Video frames sampled for visual analysis
- Results combined and returned to client
Format Normalization
Input formats are normalized before processing. Audio is converted to 16 kHz mono WAV for Whisper compatibility. Video is processed at its native resolution, with configurable frame sampling.
Normalized outputs ensure consistent downstream processing regardless of input format.
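The audio normalization described above can be sketched as an argument list for ffmpeg. This assumes ffmpeg is the conversion tool, which the document does not state; the function name `whisperAudioArgs` is hypothetical.

```typescript
// Assumption: ffmpeg performs the conversion. The document does not name a tool.
function whisperAudioArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-i", inputPath,
    "-vn",               // drop any video stream
    "-ac", "1",          // downmix to mono
    "-ar", "16000",      // resample to 16 kHz
    "-c:a", "pcm_s16le", // 16-bit PCM, the usual WAV payload
    outputPath,
  ];
}
```

Returning an argument array rather than a shell string avoids quoting problems when file names contain spaces.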
Multimodal Analysis
Recordings with both audio and video are analyzed using multimodal models. Models with the google/* prefix support native video understanding.
Audio transcription and visual analysis are combined to produce a comprehensive understanding of the recording content.
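The prefix rule above amounts to a simple string check. In the sketch below, the function name `supportsNativeVideo` is hypothetical, and the model identifiers in the usage comment are illustrative only.

```typescript
// Hypothetical helper implementing the google/* prefix rule from this document.
function supportsNativeVideo(modelId: string): boolean {
  return modelId.indexOf("google/") === 0;
}

// Illustrative identifiers, not a list of models the product actually offers:
// supportsNativeVideo("google/some-model")  -> true
// supportsNativeVideo("openai/some-model")  -> false
```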
Audio Transcription
Audio tracks are transcribed using OpenAI Whisper through the server API.
Speaker diarization attempts to attribute text to different speakers when multiple voices are detected.
Transcription Features
- Multiple language support with auto-detection
- Word-level timestamps for alignment
- Speaker diarization (multi-speaker)
- Punctuation and formatting restoration
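To illustrate how word-level timestamps and speaker labels fit together, the sketch below merges consecutive words from the same speaker into turns. The `Word` and `Turn` shapes and the function name `groupIntoTurns` are assumptions; the actual transcription response format is not shown in this document.

```typescript
// Assumed shapes: the real transcription payload may differ.
interface Word { text: string; start: number; end: number; speaker: string; }
interface Turn { speaker: string; start: number; end: number; text: string; }

function groupIntoTurns(words: Word[]): Turn[] {
  const turns: Turn[] = [];
  for (const word of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === word.speaker) {
      // Same speaker keeps talking: extend the current turn.
      last.text += " " + word.text;
      last.end = word.end;
    } else {
      // First word, or a speaker change: start a new turn.
      turns.push({ speaker: word.speaker, start: word.start, end: word.end, text: word.text });
    }
  }
  return turns;
}
```

Each turn keeps its start and end time, so it can still be aligned with sampled video frames.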
Frame Sampling
Video frames are sampled at configurable intervals to capture UI state changes and user actions.
Each frame includes its timestamp for correlation with the audio transcript.
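A minimal sketch of interval-based sampling and timestamp correlation follows. The names `sampleTimestamps`, `textAtFrame`, and `TimedText` are hypothetical, and fixed-interval sampling is assumed here; the document only says the interval is configurable.

```typescript
// Assumption: frames are sampled at a fixed interval starting at t = 0.
function sampleTimestamps(durationSec: number, intervalSec: number): number[] {
  if (intervalSec <= 0) throw new Error("intervalSec must be positive");
  const stamps: number[] = [];
  // Multiply by the index instead of accumulating, to avoid floating-point drift.
  for (let i = 0; i * intervalSec < durationSec; i++) {
    stamps.push(i * intervalSec);
  }
  return stamps;
}

interface TimedText { start: number; end: number; text: string; }

// Return the transcript segment being spoken when the frame was captured, if any.
function textAtFrame(frameSec: number, segments: TimedText[]): string | null {
  for (const seg of segments) {
    if (frameSec >= seg.start && frameSec < seg.end) return seg.text;
  }
  return null;
}
```

A shorter interval captures more UI state changes at the cost of more frames to analyze.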
Structured Extraction
The combined analysis produces structured outputs suitable for planning:
Extracted Elements
- Action items and decisions mentioned
- UI elements and navigation paths shown
- Error states and issues demonstrated
- Technical context for implementation
Analysis Artifacts
Meeting analysis produces several artifacts stored in the session:
- meeting_transcript: Full text with timestamps
- action_items: Extracted tasks and decisions
- ui_observations: Visual state changes
- combined_context: Merged analysis summary
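The four artifacts above could be typed as follows. Only the artifact names come from this document; the field shapes and the helper `missingArtifacts` are assumptions made for illustration.

```typescript
// Artifact names are from this document; the field shapes are assumed.
interface TranscriptSegment { start: number; end: number; text: string; }

interface MeetingArtifacts {
  meeting_transcript: TranscriptSegment[];
  action_items: string[];
  ui_observations: { timestamp: number; description: string }[];
  combined_context: string;
}

const ARTIFACT_KEYS = ["meeting_transcript", "action_items", "ui_observations", "combined_context"];

// Hypothetical check: report which artifacts a stored session object lacks.
function missingArtifacts(stored: { [key: string]: unknown }): string[] {
  return ARTIFACT_KEYS.filter((key) => !(key in stored));
}
```

A check like this is useful for audio-only recordings, where `ui_observations` may legitimately be absent.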
Key Source Files
- desktop/src/components/meeting/MeetingUploader.tsx
- server/src/handlers/proxy/video_handler.rs
- server/src/services/video_processor.rs
Planning Handoff
Meeting analysis artifacts can be incorporated into the task description.
The combined context flows into the file discovery and plan generation pipeline, providing rich context for implementation planning.
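One way this handoff could look is sketched below: the action items and combined context are appended to the user's task description as labeled sections. The function name `appendMeetingContext` and the section labels are hypothetical; the document does not specify how the merge is formatted.

```typescript
// Hypothetical merge format: the section labels are assumptions.
function appendMeetingContext(
  taskDescription: string,
  combinedContext: string,
  actionItems: string[],
): string {
  const parts = [taskDescription.trim()];
  if (actionItems.length > 0) {
    const bullets = actionItems.map((item) => "- " + item).join("\n");
    parts.push("Action items from meeting:\n" + bullets);
  }
  if (combinedContext.trim().length > 0) {
    parts.push("Meeting context:\n" + combinedContext.trim());
  }
  // Blank lines between sections keep the description readable downstream.
  return parts.join("\n\n");
}
```

Empty sections are skipped, so a recording that produced no action items leaves the task description free of empty headings.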