
Video Analysis

Frame sampling, prompts, and analysis artifacts from recordings.


Video analysis extracts UI state and action sequences from screen recordings. This enables understanding of user workflows and bug reproduction contexts.

Video analysis pipeline

How frames flow through the analysis model.

[Screenshot] The video analysis interface showing frame capture and analysis options.

API Endpoint

Video analysis is handled by /api/llm/video/analyze on the server. The endpoint accepts multipart form data with the video file and analysis parameters.

Payload Fields

  • video: The video file (MP4, WebM, MOV)
  • model: Model identifier for analysis
  • prompt: Optional custom analysis prompt
  • max_frames: Maximum frames to sample
  • fps: Frame sampling rate
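
As a rough illustration, a browser-side upload using these fields might look like the sketch below. The endpoint path comes from the description above; the model identifier and prompt text are placeholders.

```ts
// Browser-side sketch of a request to /api/llm/video/analyze.
// The model identifier and prompt text here are illustrative placeholders.
async function analyzeVideo(file: File): Promise<unknown> {
  const form = new FormData();
  form.append("video", file);                      // MP4, WebM, or MOV
  form.append("model", "google/gemini-1.5-flash"); // provider/model format
  form.append("prompt", "Describe the UI state and user actions.");
  form.append("max_frames", "30");
  form.append("fps", "1");

  const res = await fetch("/api/llm/video/analyze", {
    method: "POST",
    body: form, // the browser sets the multipart boundary automatically
  });
  if (!res.ok) throw new Error(`Analysis request failed: ${res.status}`);
  return res.json();
}
```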

Supported Input Formats

  • MP4 with H.264 or H.265 codec
  • WebM with VP8 or VP9 codec
  • MOV from screen recording tools
  • Maximum file size: 100MB
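
A minimal pre-upload check against these limits might look like the following; note that MIME types alone cannot verify the codec, so the server remains the authority.

```ts
// Pre-upload validation against the documented limits. MIME types cannot
// verify the codec (H.264/H.265, VP8/VP9), so the server has the final say.
const MAX_BYTES = 100 * 1024 * 1024; // 100MB
const ACCEPTED_TYPES = ["video/mp4", "video/webm", "video/quicktime"]; // MP4, WebM, MOV

function validateRecording(file: File): string | null {
  if (!ACCEPTED_TYPES.includes(file.type)) {
    return `Unsupported format: ${file.type || "unknown"}`;
  }
  if (file.size > MAX_BYTES) {
    return "File exceeds the 100MB limit";
  }
  return null; // null means the file passed both checks
}
```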

Frame Sampling

Frames are extracted at configurable intervals to balance coverage and API costs. Lower frame rates reduce token usage but may miss rapid changes.

The default rate is 1 frame per second; detailed UI analysis may need 2-3 FPS.

Sampling Parameters

  • fps: Frames per second to extract (0.5-5)
  • max_frames: Maximum total frames (10-100)
  • start_time: Offset to begin sampling
  • end_time: Offset to stop sampling
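
How these parameters could interact is sketched below; the actual extraction logic lives in server/src/services/video_processor.rs and may differ.

```ts
// Sketch of how the sampling parameters could combine. Timestamps are
// capped by max_frames, so long recordings get sparser coverage.
function sampleTimestamps(
  durationSec: number,
  fps = 1,               // 0.5-5 frames per second
  maxFrames = 30,        // 10-100 total frames
  startTime = 0,         // offset to begin sampling
  endTime = durationSec, // offset to stop sampling
): number[] {
  const interval = 1 / fps;
  const times: number[] = [];
  for (let t = startTime; t <= endTime && times.length < maxFrames; t += interval) {
    times.push(t);
  }
  return times; // second offsets at which frames are extracted
}
```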

Model Requirements

Video analysis requires vision-capable models. Model identifiers follow the provider/model format, and currently only google/* models support native video analysis.

Google Gemini models can process video natively, while other vision models require frame-by-frame image analysis.
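
A hypothetical routing check mirroring that rule:

```ts
// google/* models take the native video path; all other vision models
// fall back to per-frame image analysis.
function supportsNativeVideo(model: string): boolean {
  return model.startsWith("google/");
}

supportsNativeVideo("google/gemini-1.5-flash"); // true  -> native video
supportsNativeVideo("openai/gpt-4o");           // false -> frame-by-frame
```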

Analysis Process

Sampled frames are sent to the vision model along with the analysis prompt. The model produces structured observations about UI state and user actions.

System prompts guide the model to focus on specific aspects of the recording.

Prompt Elements

  • UI inventory: List visible elements and controls
  • Action sequence: Describe user actions in order
  • Error detection: Identify error states and messages
  • Navigation paths: Track screen transitions
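
Assembling a system prompt from these elements might look like the sketch below; the exact wording of the shipped prompt is not documented here.

```ts
// Sketch of composing a system prompt from the elements above, with an
// optional custom prompt appended as an extra focus area.
function buildAnalysisPrompt(customPrompt?: string): string {
  const base = [
    "For each frame, inventory the visible UI elements and controls.",
    "Describe the user's actions in the order they occur.",
    "Identify any error states or messages.",
    "Track navigation paths between screens.",
  ].join("\n");
  return customPrompt ? `${base}\n\nAdditional focus:\n${customPrompt}` : base;
}
```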

Analysis Outputs

  • frame_observations: Per-frame UI descriptions
  • action_timeline: Ordered list of user actions
  • error_summary: Any errors or issues observed
  • context_summary: High-level workflow description
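
The response shape implied by these fields can be expressed as a TypeScript interface; the exact field types are an assumption.

```ts
// Assumed response shape inferred from the output fields above; the exact
// types in the JSON payload may differ.
interface VideoAnalysisResult {
  frame_observations: { timestamp: number; description: string }[];
  action_timeline: string[]; // ordered user actions
  error_summary: string;     // errors or issues observed, if any
  context_summary: string;   // high-level workflow description
}
```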

Token Usage & Billing

Video analysis consumes tokens based on frame count and resolution: each sampled frame is tokenized as an image, so more frames mean more prompt tokens.

  • tokens_sent: Prompt + image tokens
  • tokens_received: Analysis response tokens
  • actual_cost: Computed from model pricing
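
The arithmetic over these fields is straightforward; the per-token prices in the sketch below are placeholders, not actual model pricing.

```ts
// Illustrative cost computation from the usage fields above. The per-token
// prices are assumptions for the example only.
function estimateCost(tokensSent: number, tokensReceived: number): number {
  const INPUT_PRICE = 0.25 / 1_000_000; // assumed $ per prompt/image token
  const OUTPUT_PRICE = 1.0 / 1_000_000; // assumed $ per response token
  return tokensSent * INPUT_PRICE + tokensReceived * OUTPUT_PRICE;
}
```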

Result Storage

Analysis results are stored in the background_jobs table with task_type 'video_analysis'. The response contains the full analysis in JSON format.

Results can be incorporated into task descriptions or used directly in the planning workflow.
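
A hypothetical polling helper for retrieving a stored result is sketched below; the /api/jobs/:id route and status values are assumptions, while the task_type and JSON response format come from the storage description above.

```ts
// Hypothetical polling loop for a background_jobs row with task_type
// 'video_analysis'. Route and status names are assumptions.
async function waitForAnalysis(jobId: string): Promise<unknown> {
  for (;;) {
    const res = await fetch(`/api/jobs/${jobId}`);
    const job = await res.json();
    if (job.status === "completed") return JSON.parse(job.response);
    if (job.status === "failed") throw new Error(job.error);
    await new Promise((resolve) => setTimeout(resolve, 2000)); // poll every 2s
  }
}
```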

Key Source Files

  • server/src/handlers/proxy/video_handler.rs
  • server/src/services/video_processor.rs
  • desktop/src/components/video/VideoAnalyzer.tsx

Integration with Planning

Video analysis outputs can feed directly into the task description for context-aware planning.

The context_summary is particularly useful as a starting point for implementation planning.
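
One way to seed a task description from the summary, as a minimal sketch; the section layout here is an illustrative convention, not a required format.

```ts
// Prepend the recording context to the user's task notes so planning
// starts from the observed workflow.
function seedTaskDescription(contextSummary: string, taskNotes: string): string {
  return `Recording context:\n${contextSummary}\n\nTask:\n${taskNotes}`;
}
```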

See meeting ingestion

Learn how video analysis fits into the broader meeting ingestion workflow.