Technical Guide

File Discovery Workflow

Comprehensive technical guide to the 4-stage AI workflow that identifies and filters relevant files for task execution.

12 min read

PlanToCode identifies the right files before you plan or run commands. The 4-stage workflow narrows scope and keeps context tight.

File discovery pipeline

The 4-stage workflow: root folder selection, regex filtering, AI relevance assessment, and extended path discovery.


Workflow Architecture

The workflow operates as an orchestrated background job system with four distinct stages that execute sequentially. Each stage builds upon the previous stage's output, progressively refining the file selection based on task requirements.

The system uses a distributed job architecture where each stage runs as an independent background job, enabling cancellation, retry logic, and detailed progress tracking. Real-time events are published throughout execution to provide immediate feedback to the user interface.

Key Architecture Features:

  • Event-driven progress reporting with WebSocket-like updates
  • Comprehensive error handling with automatic retry mechanisms
  • Cost tracking and timeout management for AI operations
  • Intermediate result persistence in SQLite job records for reuse and debugging
  • Git integration with fallback to directory traversal
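
As a rough TypeScript sketch of the job and event model described above, the per-stage events might look like this; the WorkflowStageEvent and onWorkflowEvent names are illustrative, not the actual API:

// Hypothetical shape of a per-stage background job event (names are illustrative).
type WorkflowStage =
  | "root_folder_selection"
  | "regex_file_filter"
  | "file_relevance_assessment"
  | "extended_path_finder";

interface WorkflowStageEvent {
  jobId: string;              // background job identifier persisted in SQLite
  stage: WorkflowStage;       // which of the four stages emitted the event
  status: "started" | "progress" | "completed" | "failed";
  progressPercentage?: number;
  retryAttempt?: number;      // populated when automatic retry kicks in
  actualCostUsd?: number;     // cost tracked for AI stages
}

// Consumers subscribe to events rather than polling job state.
function onWorkflowEvent(handler: (event: WorkflowStageEvent) => void): void {
  // wiring to the event bus is app-specific and omitted in this sketch
}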

4-Stage Workflow Process

Stage 1: Root Folder Selection

Uses AI to intelligently select the most relevant root directories from a list of candidate paths based on the task description. The LLM analyzes the primary project directory and candidate roots to determine which directories are most likely to contain files relevant to the task.

Technical Details: Receives candidate root directories (up to depth 2) and the task description. The LLM evaluates each path against the task context and returns a filtered list of root directories that will be searched in subsequent stages.
Input/Output: Receives candidate_roots array and task_description. Returns root_directories array containing the AI-selected directories most relevant to the task.
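
A minimal TypeScript sketch of Stage 1's contract, using the input and output field names described above; the selectRootFolders wrapper is hypothetical:

// Illustrative contract for Stage 1 (field names follow the description above).
interface RootFolderSelectionInput {
  task_description: string;
  candidate_roots: string[];   // candidate directories, up to depth 2
}

interface RootFolderSelectionOutput {
  root_directories: string[];  // AI-selected roots searched by later stages
}

// Hypothetical wrapper around the LLM call.
async function selectRootFolders(
  input: RootFolderSelectionInput
): Promise<RootFolderSelectionOutput> {
  // The LLM scores each candidate against the task and returns the relevant subset.
  throw new Error("sketch only");
}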

Stage 2: Regex File Filter

Uses AI to generate intelligent regex pattern groups based on the task description and directory structure. Each pattern group can include path patterns (positive and negative) and content patterns. The processor then applies these patterns to filter files from each selected root directory.

Technical Details: Generates a directory tree for each root, calls the LLM to produce patternGroups with path_pattern, content_pattern, and negative_path_pattern fields. Uses fancy-regex for lookahead/lookbehind support. Processes roots in parallel with configurable concurrency.
Git Integration: Finds the git repository root for each selected directory and uses git_utils to get all non-ignored files, respecting .gitignore rules while including both tracked and untracked files.
Binary Detection: Filters out files with binary extensions (.jpg, .png, .pdf, .exe, etc.) and uses content analysis to detect binary files by null bytes and non-printable character ratios.
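
A minimal TypeScript sketch of a pattern group and how it might be applied; the field names follow the description above, but the real processor applies them with fancy-regex in Rust, so native RegExp is used here only for illustration:

// Illustrative pattern group as generated by the LLM.
interface PatternGroup {
  path_pattern?: string;           // include files whose path matches
  negative_path_pattern?: string;  // exclude files whose path matches
  content_pattern?: string;        // include files whose content matches
}

// Sketch of checking one file against a pattern group.
function matchesGroup(path: string, content: string, group: PatternGroup): boolean {
  if (group.path_pattern && !new RegExp(group.path_pattern).test(path)) return false;
  if (group.negative_path_pattern && new RegExp(group.negative_path_pattern).test(path)) return false;
  if (group.content_pattern && !new RegExp(group.content_pattern).test(content)) return false;
  return true;
}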

Stage 3: AI File Relevance Assessment

Employs AI models to analyze file content and assess relevance to the specific task description. This stage performs deep content analysis by reading file contents and having the LLM identify which files are most relevant to the task.

Technical Details: Estimates tokens per file using file-type-aware heuristics (code ~3 chars/token, structured data ~5 chars/token). Creates content-aware chunks to stay under the 90k token threshold. Processes chunks in parallel with streaming to avoid timeouts. Validates all LLM-suggested paths against the filesystem.
AI Processing: Uses large language models to evaluate file content against task requirements, with intelligent chunking based on actual file sizes and token estimates to manage context windows efficiently.
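
A sketch of the token estimation and chunking logic described above; the chars-per-token ratios and 90k threshold come from the description, while the extension list and greedy packing strategy are assumptions:

// Rough token estimation per file, mirroring the heuristics described above.
const MAX_CHUNK_TOKENS = 90_000;

function estimateTokens(path: string, content: string): number {
  // Structured data is roughly 5 chars/token, code roughly 3 chars/token.
  const charsPerToken = /\.(json|ya?ml|csv|xml)$/i.test(path) ? 5 : 3;
  return Math.ceil(content.length / charsPerToken);
}

// Greedy content-aware chunking: pack files until the token budget is reached.
function chunkFiles(files: { path: string; content: string }[]): { path: string; content: string }[][] {
  const chunks: { path: string; content: string }[][] = [];
  let current: { path: string; content: string }[] = [];
  let used = 0;
  for (const file of files) {
    const tokens = estimateTokens(file.path, file.content);
    if (used + tokens > MAX_CHUNK_TOKENS && current.length > 0) {
      chunks.push(current);
      current = [];
      used = 0;
    }
    current.push(file);
    used += tokens;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}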

Stage 4: Extended Path Finder

Discovers additional relevant files by providing the LLM with the previously identified files and their contents, along with the directory tree. The AI analyzes imports, dependencies, and project structure to find related files that enhance the context for the task.

Technical Details: Generates a combined directory tree for selected root directories. Reads content of all initial_paths files. Uses streaming LLM calls to avoid Cloudflare timeouts. Validates discovered paths against the filesystem and normalizes to relative paths within the project.
Relationship Analysis: Reads content of all previously identified files and provides it to the LLM alongside the directory tree (scoped to selected roots if available). The AI identifies additional files based on imports, references, and structural relationships.
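
A sketch of the path validation and normalization step, assuming Node's fs/path modules for illustration; the actual implementation lives in the desktop backend:

// Validate LLM-suggested paths against the filesystem and normalize to project-relative paths.
import { existsSync } from "node:fs";
import { isAbsolute, relative, resolve } from "node:path";

function validateAndNormalize(projectDir: string, suggested: string[]): string[] {
  return suggested
    .map((p) => (isAbsolute(p) ? p : resolve(projectDir, p)))
    .filter((abs) => existsSync(abs))          // drop paths that do not exist
    .map((abs) => relative(projectDir, abs))   // normalize to relative paths
    .filter((rel) => !rel.startsWith(".."));   // keep only paths inside the project
}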

Configuration Options

Workflow Configuration

Timeout Management

Configure maximum execution time for the entire workflow or individual stages to prevent indefinite hanging.

timeoutMs: 300000 // 5 minutes default

Exclusion Patterns

Define directories and file patterns to exclude from the discovery process.

excludedPaths: ["node_modules", ".git", "dist", "build"]
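
The options can be combined into a single configuration object, as in this illustrative example (the exact shape may differ):

// Example workflow configuration combining the options above.
const workflowConfig = {
  timeoutMs: 300_000,   // 5 minutes for the entire workflow
  excludedPaths: ["node_modules", ".git", "dist", "build"],
};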

API Usage Examples

Starting a Workflow

const tracker = await WorkflowTracker.startWorkflow(
  sessionId,
  "Add user authentication to the login page",
  "/path/to/project",
  ["node_modules", "dist"],
  { timeoutMs: 300000 }
);

Monitoring Progress

tracker.onProgress((state) => {
  console.log(`Stage: ${state.currentStage}`);
  console.log(`Progress: ${state.progressPercentage}%`);
});
tracker.onComplete((results) => {
  console.log(`Selected ${results.selectedFiles.length} files`);
});

Retrieving Results

const results = await tracker.getResults();
const selectedFiles = results.selectedFiles;
const intermediateData = results.intermediateData;
const totalCost = results.totalActualCost;

Performance Considerations

Memory Management

The workflow uses token-aware chunking, streaming responses, and cleanup of temporary data to manage memory. There is no fixed file batch size.

Cost Optimization

AI stages track actual costs from API responses, implement intelligent batching to minimize token usage, and provide cost estimates before execution to help manage expenses.
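
As a usage sketch, a pre-flight cost check might look like this; getCostEstimate and its return shape are assumptions, not a documented API:

// Hypothetical pre-flight cost check before starting the AI stages.
const estimate = await tracker.getCostEstimate();
if (estimate.totalEstimatedCost > 1.0) {
  console.warn(`Estimated cost $${estimate.totalEstimatedCost.toFixed(2)} exceeds budget`);
}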

Performance Monitoring

Built-in performance tracking monitors execution times, memory usage, and throughput metrics, and provides optimization recommendations based on historical data.

Integration Patterns

Desktop Application

The workflow integrates seamlessly with the desktop application through Tauri commands, providing native file system access and event-driven updates via the WorkflowTracker class.

Implementation Plans Integration

Selected files are automatically fed into the Implementation Plans panel, ensuring that plan generation uses the same optimized file context without requiring re-execution of the discovery workflow.

Session Management

Selected files and task history persist per session so follow-up actions can reuse the same context without rerunning discovery.

Error Handling & Troubleshooting

Common Issues

  • Git repository not found: Falls back to directory traversal with standard exclusions
  • Binary file detection: Uses both extension-based and content-based binary detection
  • Token limit exceeded: Implements intelligent batching and provides clear error messages
  • Network timeouts: Automatic retry with exponential backoff for transient failures (see the sketch after this list)
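
A generic retry-with-exponential-backoff sketch for the transient-failure case above; the attempt count and delays are illustrative, not the workflow's actual settings:

// Retry a flaky async operation with exponential backoff (1s, 2s, 4s, ...).
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      const delayMs = 1000 * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}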

Error Categories

  • Validation Errors: Invalid session ID, missing task description, or invalid project directory
  • Workflow Errors: Stage-specific failures with detailed context and retry suggestions
  • Billing Errors: Insufficient credits or payment failures with actionable guidance
  • System Errors: File system access, git command failures, or memory constraints

Debugging Tools

The workflow provides comprehensive logging, performance metrics export, and detailed error context including stage information, retry attempts, and intermediate data for troubleshooting.

Workflow State Management

State Transitions

The workflow progresses through clearly defined states: Created → Running → Paused (optional) → Completed/Failed/Canceled. Each state transition publishes events that can be monitored for real-time updates.
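
The states and their allowed transitions could be modeled in TypeScript as follows; the lowercase state names and transition map are illustrative:

// Workflow states mirroring the transitions described above.
type WorkflowState = "created" | "running" | "paused" | "completed" | "failed" | "canceled";

// Allowed transitions (illustrative); terminal states have no outgoing transitions.
const transitions: Record<WorkflowState, WorkflowState[]> = {
  created: ["running", "canceled"],
  running: ["paused", "completed", "failed", "canceled"],
  paused: ["running", "canceled"],
  completed: [],
  failed: [],
  canceled: [],
};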

Intermediate Data Storage

Each stage stores its output in a structured intermediate data format, including directory tree content, regex patterns, and filtered file lists. This data is accessible for debugging and can be used to resume workflows from specific stages.

Event-Driven Updates

The system publishes real-time events for workflow status changes, stage completions, and error conditions. These events enable responsive user interfaces and integration with external monitoring systems.

SQLite Storage

All workflow state, intermediate results, and job metadata are persisted in SQLite. Each stage stores its output in the background_jobs table, enabling workflow resumption and debugging. The job records include token usage, cost tracking, and system prompt templates for each AI stage.
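
An illustrative TypeScript view of a background_jobs record; the column names are assumptions based on the description above, not the actual schema:

// Hypothetical shape of a persisted job record in the background_jobs table.
interface BackgroundJobRecord {
  id: string;
  sessionId: string;
  stage: string;                 // which of the four stages produced this record
  status: "running" | "completed" | "failed" | "canceled";
  intermediateData: string;      // JSON blob: directory trees, patterns, file lists
  systemPromptTemplate: string;  // prompt template used for the AI stage
  inputTokens: number;
  outputTokens: number;
  actualCostUsd: number;
  createdAt: string;
  updatedAt: string;
}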

Need the desktop app?

The file discovery workflow runs inside the desktop client alongside implementation planning and terminal sessions.