File Discovery Workflow
Comprehensive technical guide to the 4-stage AI workflow that identifies and filters relevant files for task execution.
PlanToCode identifies the right files before you plan or run commands. The 4-stage workflow narrows scope and keeps context tight.
File discovery pipeline
The 4-stage workflow: root folder selection, regex filtering, AI relevance assessment, and extended path discovery.
Workflow Architecture
The workflow operates as an orchestrated background job system with four distinct stages that execute sequentially. Each stage builds upon the previous stage's output, progressively refining the file selection based on task requirements.
The system uses a distributed job architecture where each stage runs as an independent background job, enabling cancellation, retry logic, and detailed progress tracking. Real-time events are published throughout execution to provide immediate feedback to the user interface.
Key Architecture Features:
- Event-driven progress reporting with WebSocket-like updates
- Comprehensive error handling with automatic retry mechanisms
- Cost tracking and timeout management for AI operations
- Intermediate result persistence in SQLite job records for reuse and debugging
- Git integration with fallback to directory traversal
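To make the orchestration concrete, the sketch below models the four stages as sequential background jobs whose outputs chain together. The StageJob interface and runPipeline helper are hypothetical names for exposition, not PlanToCode's internal API.
// Hypothetical pipeline model; StageJob and runPipeline are illustrative names.
type StageName =
  | "root-folder-selection"
  | "regex-file-filter"
  | "relevance-assessment"
  | "extended-path-finder";

interface StageJob {
  stage: StageName;
  // Each stage receives the previous stage's output plus a cancellation signal.
  run(input: unknown, signal: AbortSignal): Promise<unknown>;
}

async function runPipeline(
  stages: StageJob[],
  initialInput: unknown,
  signal: AbortSignal
): Promise<unknown> {
  let input = initialInput;
  for (const job of stages) {
    if (signal.aborted) throw new Error("Workflow canceled");
    input = await job.run(input, signal); // this output feeds the next stage
  }
  return input;
}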
4-Stage Workflow Process
Stage 1: Root Folder Selection
Uses AI to intelligently select the most relevant root directories from a list of candidate paths based on the task description. The LLM analyzes the primary project directory and candidate roots to determine which directories are most likely to contain files relevant to the task.
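For illustration, the input and output of this stage might carry shapes like the following; the field names are assumptions, not the actual schema.
// Hypothetical Stage 1 shapes; all field names are illustrative.
interface RootSelectionRequest {
  taskDescription: string;   // e.g. "Add user authentication to the login page"
  primaryProjectDir: string; // the primary project directory
  candidateRoots: string[];  // candidate paths the LLM chooses from
}

interface RootSelectionResult {
  selectedRoots: string[]; // subset of candidateRoots judged relevant to the task
}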
Stage 2: Regex File Filter
Uses AI to generate intelligent regex pattern groups based on the task description and directory structure. Each pattern group can include path patterns (positive and negative) and content patterns. The processor then applies these patterns to filter files from each selected root directory.
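One plausible reading of a pattern group is sketched below: a file passes if it matches a positive path pattern, no negative path pattern, and, when content patterns are present, at least one content pattern. The RegexPatternGroup shape and matchesGroup semantics are assumptions; the real schema may differ.
// Illustrative pattern-group structure; the actual schema may differ.
interface RegexPatternGroup {
  pathPatterns: string[];         // positive path regexes, e.g. "src/auth/.*\\.ts$"
  negativePathPatterns: string[]; // path regexes to exclude, e.g. ".*\\.test\\.ts$"
  contentPatterns: string[];      // regexes matched against file contents
}

function matchesGroup(path: string, content: string, group: RegexPatternGroup): boolean {
  const anyMatch = (patterns: string[], s: string) =>
    patterns.some((p) => new RegExp(p).test(s));
  if (!anyMatch(group.pathPatterns, path)) return false;
  if (anyMatch(group.negativePathPatterns, path)) return false;
  return group.contentPatterns.length === 0 || anyMatch(group.contentPatterns, content);
}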
Stage 3: AI File Relevance Assessment
Employs AI models to analyze file content and assess relevance to the specific task description. This stage performs deep content analysis by reading file contents and having the LLM identify which files are most relevant to the task.
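A batch of this pass could be wired as below. Here callModel stands in for whichever model client is in use, and the prompt layout and FileRelevance shape are assumptions for exposition.
// Sketch of one relevance batch; names and prompt layout are illustrative.
interface FileRelevance {
  path: string;
  relevant: boolean;
}

async function assessBatch(
  task: string,
  files: { path: string; content: string }[],
  callModel: (prompt: string) => Promise<FileRelevance[]>
): Promise<FileRelevance[]> {
  const prompt = [
    `Task: ${task}`,
    ...files.map((f) => `--- ${f.path} ---\n${f.content}`),
  ].join("\n");
  return callModel(prompt); // the model decides which files matter for the task
}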
Stage 4: Extended Path Finder
Discovers additional relevant files by providing the LLM with the previously identified files and their contents, along with the directory tree. The AI analyzes imports, dependencies, and project structure to find related files that enhance the context for the task.
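One concrete signal available to this stage is the import graph of the already-selected files. The helper below is a simplified, hypothetical extractor for JS/TS-style imports; real dependency analysis would handle more languages and syntaxes.
// Simplified import extraction; matches `import ... from "x"` and `require("x")`.
function extractImports(source: string): string[] {
  const importRe = /(?:from\s+|require\()\s*["']([^"']+)["']/g;
  return [...source.matchAll(importRe)].map((m) => m[1]);
}

// extractImports('import { login } from "./auth/session";')
// -> ["./auth/session"], hinting that the session module belongs in context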
Configuration Options
Workflow Configuration
Timeout Management
Configure maximum execution time for the entire workflow or individual stages to prevent indefinite hanging.
timeoutMs: 300000 // 5 minutes default
Exclusion Patterns
Define directories and file patterns to exclude from the discovery process.
excludedPaths: ["node_modules", ".git", "dist", "build"]
API Usage Examples
Starting a Workflow
const tracker = await WorkflowTracker.startWorkflow(
sessionId,
"Add user authentication to the login page",
"/path/to/project",
["node_modules", "dist"],
{ timeoutMs: 300000 }
);
Monitoring Progress
tracker.onProgress((state) => {
console.log(`Stage: ${state.currentStage}`);
console.log(`Progress: ${state.progressPercentage}%`);
});
tracker.onComplete((results) => {
console.log(`Selected ${results.selectedFiles.length} files`);
});
Retrieving Results
const results = await tracker.getResults();
const selectedFiles = results.selectedFiles;
const intermediateData = results.intermediateData;
const totalCost = results.totalActualCost;
Performance Considerations
Memory Management
The workflow uses token-aware chunking, streaming responses, and cleanup of temporary data to manage memory. There is no fixed file batch size.
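To make token-aware chunking concrete, here is a minimal sketch that greedily packs files into batches under a token budget. The four-characters-per-token heuristic is an assumption; a real implementation would use the model's tokenizer.
// Greedy token-budgeted batching; the chars/4 token estimate is a rough heuristic.
function chunkByTokens(
  files: { path: string; content: string }[],
  maxTokens: number
): { path: string; content: string }[][] {
  const approxTokens = (s: string) => Math.ceil(s.length / 4);
  const batches: { path: string; content: string }[][] = [];
  let batch: { path: string; content: string }[] = [];
  let used = 0;
  for (const file of files) {
    const tokens = approxTokens(file.content);
    if (batch.length > 0 && used + tokens > maxTokens) {
      batches.push(batch); // close the current batch before it overflows
      batch = [];
      used = 0;
    }
    batch.push(file);
    used += tokens;
  }
  if (batch.length > 0) batches.push(batch);
  return batches;
}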
Cost Optimization
AI stages track actual costs from API responses, implement intelligent batching to minimize token usage, and provide cost estimates before execution to help manage expenses.
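A pre-execution estimate can be as simple as multiplying projected token counts by per-token rates, as in the sketch below. The rates are placeholders, not PlanToCode's pricing.
// Illustrative cost estimate; rates are placeholder values.
function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  const INPUT_RATE = 3 / 1_000_000;   // $ per input token (assumed)
  const OUTPUT_RATE = 15 / 1_000_000; // $ per output token (assumed)
  return inputTokens * INPUT_RATE + outputTokens * OUTPUT_RATE;
}

// e.g. estimateCostUsd(200_000, 5_000) ≈ $0.675 before any stage runs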
Performance Monitoring
Built-in performance tracking monitors execution times, memory usage, and throughput metrics, and provides optimization recommendations based on historical data analysis.
Integration Patterns
Desktop Application
The workflow integrates seamlessly with the desktop application through Tauri commands, providing native file system access and event-driven updates via the WorkflowTracker class.
Implementation Plans Integration
Selected files are automatically fed into the Implementation Plans panel, ensuring that plan generation uses the same optimized file context without requiring re-execution of the discovery workflow.
Session Management
Selected files and task history persist per session so follow-up actions can reuse the same context without rerunning discovery.
Error Handling & Troubleshooting
Common Issues
- Git repository not found: Falls back to directory traversal with standard exclusions
- Binary file detection: Uses both extension-based and content-based binary detection
- Token limit exceeded: Implements intelligent batching and provides clear error messages
- Network timeouts: Automatic retry with exponential backoff for transient failures (see the sketch after this list)
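As a minimal sketch of the retry behavior for transient failures, the helper below retries a failing async call with exponential backoff. The attempt count and base delay are illustrative defaults, not the workflow's actual settings.
// Retry with exponential backoff; defaults are illustrative.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // give up after the last attempt
      const delayMs = baseDelayMs * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}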
Error Categories
- Validation Errors: Invalid session ID, missing task description, or invalid project directory
- Workflow Errors: Stage-specific failures with detailed context and retry suggestions
- Billing Errors: Insufficient credits or payment failures with actionable guidance
- System Errors: File system access, git command failures, or memory constraints
Debugging Tools
The workflow provides comprehensive logging, performance metrics export, and detailed error context including stage information, retry attempts, and intermediate data for troubleshooting.
Workflow State Management
State Transitions
The workflow progresses through clearly defined states: Created → Running → Paused (optional) → Completed/Failed/Canceled. Each state transition publishes events that can be monitored for real-time updates.
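The same lifecycle can be encoded as a transition table. The states below mirror this page; the table itself is an illustrative encoding, not the shipped implementation.
// Transition table mirroring the documented lifecycle; encoding is illustrative.
type WorkflowState =
  | "Created" | "Running" | "Paused"
  | "Completed" | "Failed" | "Canceled";

const transitions: Record<WorkflowState, WorkflowState[]> = {
  Created: ["Running", "Canceled"],
  Running: ["Paused", "Completed", "Failed", "Canceled"],
  Paused: ["Running", "Canceled"],
  Completed: [], // terminal
  Failed: [],    // terminal
  Canceled: [],  // terminal
};

const canTransition = (from: WorkflowState, to: WorkflowState) =>
  transitions[from].includes(to);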
Intermediate Data Storage
Each stage stores its output in a structured intermediate data format, including directory tree content, regex patterns, and filtered file list results. This data is accessible for debugging and can be used to resume workflows from specific stages.
Event-Driven Updates
The system publishes real-time events for workflow status changes, stage completions, and error conditions. These events enable responsive user interfaces and integration with external monitoring systems.
SQLite Storage
All workflow state, intermediate results, and job metadata are persisted in SQLite. Each stage stores its output in the background_jobs table, enabling workflow resumption and debugging. The job records include token usage, cost tracking, and system prompt templates for each AI stage.
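The table's exact schema is not documented here, so the DDL below is only a guess at its shape, derived from the fields this page mentions (stage output, token usage, cost, system prompts).
// Hypothetical background_jobs schema; column names are guesses for illustration.
const CREATE_BACKGROUND_JOBS = `
  CREATE TABLE IF NOT EXISTS background_jobs (
    id            TEXT PRIMARY KEY,
    session_id    TEXT NOT NULL,
    stage         TEXT NOT NULL,  -- e.g. 'regex-file-filter'
    status        TEXT NOT NULL,  -- Created/Running/Completed/Failed/Canceled
    output_json   TEXT,           -- intermediate results enabling resumption
    input_tokens  INTEGER,
    output_tokens INTEGER,
    actual_cost   REAL,
    system_prompt TEXT,
    created_at    TEXT DEFAULT (datetime('now'))
  );
`;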
Need the desktop app?
The file discovery workflow runs inside the desktop client alongside implementation planning and terminal sessions.