Inputs

Meeting & Recording Ingestion

How recordings become structured task inputs and artifacts.

8 min read

PlanToCode can process meeting recordings and screen captures to extract task-relevant information. This document describes the ingestion workflow from recording to structured artifacts.

Recording ingestion flow

How recordings flow through transcription and analysis.

Recording ingestion flow diagram

Supported Inputs

The meeting ingestion pipeline accepts various recording formats:

  • Screen recordings (MP4, WebM, MOV)
  • Meeting recordings from Zoom, Meet, Teams
  • Audio-only files (MP3, WAV, M4A)
  • Direct screen capture from desktop

Upload Process

Recordings are uploaded to the server as multipart form data.
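A minimal client-side sketch of such an upload. The endpoint path and form field names here are assumptions for illustration, not the actual PlanToCode API:

```typescript
// Hypothetical upload helper; endpoint and field names are assumed.
function buildRecordingForm(file: Blob, filename: string): FormData {
  const form = new FormData();
  form.append("recording", file, filename); // binary payload
  return form;
}

async function uploadRecording(file: Blob, filename: string): Promise<Response> {
  // fetch sets the multipart boundary and Content-Type automatically.
  return fetch("/api/recordings", {
    method: "POST",
    body: buildRecordingForm(file, filename),
  });
}
```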

Processing Steps

  1. File uploaded to server temporary storage
  2. Metadata extracted (duration, format, resolution)
  3. Audio track separated for transcription
  4. Video frames sampled for visual analysis
  5. Results combined and returned to client
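The steps above can be sketched as a small pipeline. The stage signatures below are illustrative, not the server's actual interfaces:

```typescript
// Illustrative ingestion pipeline; types and stage names are assumptions
// that mirror the numbered steps, not the real server code.
type Recording = { path: string };
type Metadata = { durationSec: number; format: string };

interface Stages {
  extractMetadata(r: Recording): Metadata; // step 2
  splitAudio(r: Recording): string;        // step 3: path to audio track
  sampleFrames(r: Recording): string[];    // step 4: frame image paths
}

function ingest(r: Recording, s: Stages) {
  const metadata = s.extractMetadata(r);
  const audio = s.splitAudio(r);
  const frames = s.sampleFrames(r);
  return { metadata, audio, frames }; // step 5: combined result
}
```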

Format Normalization

Various input formats are normalized before processing. Audio is converted to 16 kHz mono WAV for Whisper compatibility. Video is processed at native resolution with configurable frame sampling.

Normalized outputs ensure consistent downstream processing regardless of input format.
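The audio half of this normalization maps onto standard ffmpeg flags. Invoking ffmpeg this way is an assumption about the implementation, but the target format (16 kHz mono WAV) comes from the text:

```typescript
// Builds an ffmpeg argument list for the normalization described above.
// Using ffmpeg at all is an assumption; the flags themselves are standard.
function audioNormalizeArgs(input: string, output: string): string[] {
  return [
    "-i", input,
    "-vn",          // drop the video stream
    "-ar", "16000", // resample to 16 kHz for Whisper
    "-ac", "1",     // downmix to mono
    output,         // .wav extension selects WAV output
  ];
}
// e.g. spawn("ffmpeg", audioNormalizeArgs("meeting.mp4", "meeting.wav"))
```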

Multimodal Analysis

Recordings with both audio and video are analyzed using multimodal models. Models with the google/* prefix support native video understanding.

Audio transcription and visual analysis are combined to produce a comprehensive understanding of the recording content.
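A sketch of how a client might branch on that model-ID prefix; the helper name and fallback behavior are assumptions for illustration:

```typescript
// Hypothetical routing check based on the google/* prefix mentioned above.
function supportsNativeVideo(modelId: string): boolean {
  return modelId.startsWith("google/");
}
// Models without native video support would fall back to separate audio
// transcription plus sampled-frame analysis.
```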

Audio Transcription

Audio tracks are transcribed using OpenAI Whisper through the server API.

Speaker diarization attempts to attribute text to different speakers when multiple voices are detected.

Transcription Features

  • Multiple language support with auto-detection
  • Word-level timestamps for alignment
  • Speaker diarization (multi-speaker)
  • Punctuation and formatting restoration
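Word-level timestamps are what make diarization usable: each word can be attributed to whichever speaker turn contains it. The types and merging logic below are illustrative, not the actual server implementation:

```typescript
// Attributes transcribed words to diarized speaker turns by timestamp.
// Data shapes are assumptions for illustration.
type Word = { text: string; startSec: number };
type Turn = { speaker: string; startSec: number; endSec: number };

function attributeWords(words: Word[], turns: Turn[]): Map<string, string[]> {
  const bySpeaker = new Map<string, string[]>();
  for (const w of words) {
    const turn = turns.find(t => w.startSec >= t.startSec && w.startSec < t.endSec);
    const speaker = turn ? turn.speaker : "unknown";
    if (!bySpeaker.has(speaker)) bySpeaker.set(speaker, []);
    bySpeaker.get(speaker)!.push(w.text);
  }
  return bySpeaker;
}
```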

Frame Sampling

Video frames are sampled at configurable intervals to capture UI state changes and user actions.

Each frame includes its timestamp for correlation with the audio transcript.
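Computing the sample timestamps is straightforward; the default interval below is an assumption, since the text only says the interval is configurable:

```typescript
// Timestamps (in seconds) at which to sample frames from a recording.
// The 2-second default is an assumed value, not a documented one.
function sampleTimestamps(durationSec: number, intervalSec = 2): number[] {
  const stamps: number[] = [];
  for (let t = 0; t < durationSec; t += intervalSec) stamps.push(t);
  return stamps;
}
// sampleTimestamps(10, 3) → [0, 3, 6, 9]
```

Each returned timestamp pairs the extracted frame with a position in the audio transcript.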

Structured Extraction

The combined analysis produces structured outputs suitable for planning:

Extracted Elements

  • Action items and decisions mentioned
  • UI elements and navigation paths shown
  • Error states and issues demonstrated
  • Technical context for implementation

Analysis Artifacts

Meeting analysis produces several artifacts stored in the session:

  • meeting_transcript: Full text with timestamps
  • action_items: Extracted tasks and decisions
  • ui_observations: Visual state changes
  • combined_context: Merged analysis summary
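The artifact names above suggest a session shape like the following; the field types are assumptions for illustration, only the artifact names come from the text:

```typescript
// Assumed shape of the stored meeting-analysis artifacts.
interface TranscriptEntry {
  startSec: number; // timestamp within the recording
  speaker?: string; // present when diarization attributed a speaker
  text: string;
}

interface MeetingAnalysis {
  meeting_transcript: TranscriptEntry[]; // full text with timestamps
  action_items: string[];                // extracted tasks and decisions
  ui_observations: string[];             // visual state changes
  combined_context: string;              // merged analysis summary
}

const example: MeetingAnalysis = {
  meeting_transcript: [{ startSec: 0, text: "Let's fix the login bug." }],
  action_items: ["Fix the login bug"],
  ui_observations: ["Error toast shown on submit"],
  combined_context: "Login form fails on submit; validation needs fixing.",
};
```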

Key Source Files

  • desktop/src/components/meeting/MeetingUploader.tsx
  • server/src/handlers/proxy/video_handler.rs
  • server/src/services/video_processor.rs

Planning Handoff

Meeting analysis artifacts can be incorporated into the task description.

The combined context flows into the file discovery and plan generation pipeline, providing rich context for implementation planning.
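One way that handoff could look, as an illustrative merge function; the section formatting and parameter names are assumptions:

```typescript
// Merges analysis artifacts into a task description for planning.
// Layout and headings are assumed, not the actual handoff format.
function buildTaskDescription(
  task: string,
  combinedContext: string,
  actionItems: string[],
): string {
  const items = actionItems.map(i => `- ${i}`).join("\n");
  return `${task}\n\n## Meeting context\n${combinedContext}\n\n## Action items\n${items}`;
}
```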

Continue to video analysis

Learn more about how video frames are analyzed.