ADR-004: Capability-Aware Model Routing
Status: Proposed (Revised)
Date: 2026-03-26
Revised: 2026-03-26
Deciders: Jeremy McSpadden
Related: ADR-003 (pipeline simplification), Issue #2655, docs/dynamic-model-routing.md
Context
Section titled “Context”GSD already supports dynamic model routing in auto-mode, but the current router is fundamentally complexity-tier and cost based, not task-capability based.
Today the selection pipeline is:
unit dispatch → classifyUnitComplexity(unitType, unitId, basePath, budgetPct) → UNIT_TYPE_TIERS default mapping → analyzeTaskComplexity() / analyzePlanComplexity() [metadata heuristics] → getAdaptiveTierAdjustment() [routing history] → applyBudgetPressure() [budget ceiling] → resolveModelForComplexity(classification, phaseConfig, routingConfig, availableModelIds) → downgrade-only: never upgrades beyond user's configured model → MODEL_CAPABILITY_TIER lookup → cheapest available in tier → fallback chain assembly → resolveModelId() → pi.setModel() → before_provider_request hook (payload mutation only)This architecture works when all models inside a tier are effectively interchangeable. That assumption no longer holds.
Users increasingly configure heterogeneous provider pools through models.json, scoped provider setup, and /scoped-models. In practice:
- Claude-class models often perform best on greenfield implementation and architecture work
- Codex-class models often perform best on debugging, refactoring, and root-cause analysis
- Gemini-class models often perform best on long-context synthesis and research-heavy tasks
- Fast small models are often best for cheap validation, triage, and lightweight hooks
The current router cannot express those differences. If Claude and Codex are both available at the same tier, GSD either:
- treats them as equivalent and picks the cheaper one, or
- requires the user to hardcode specific phase models manually
That produces three structural problems:
1. Wrong optimization target
Section titled “1. Wrong optimization target”The router optimizes primarily for task difficulty vs model cost. The real problem is task requirements vs model strengths, subject to cost constraints.
2. Poor behavior with heterogeneous pools
Section titled “2. Poor behavior with heterogeneous pools”Different users have different subscriptions and provider access. A fixed mapping like “research always uses Gemini” does not generalize when the user only has Claude + Codex, or only local models.
3. Capability knowledge is trapped in user intuition
Section titled “3. Capability knowledge is trapped in user intuition”Experienced users know which models are better at coding, debugging, research, long-context work, or instruction following. GSD has no representation for that knowledge, so it cannot route intelligently on the user’s behalf.
The system already has several building blocks that make a richer router feasible:
- unit types already encode the kind of work being dispatched
complexity-classifier.tsalready extracts richTaskMetadata(file counts, dependency counts, tags, complexity keywords, code block counts)auto-dispatch.tsand prompt builders provide stable task categoriesctx.modelRegistry.getAvailable()exposes the current model poolmodels.jsonalready supports user overrides and cost data per model- budget ceilings, routing history, and retry escalation already exist
- the
model_selecthook fires on model changes and could be extended for pre-selection interception
Decision
Section titled “Decision”Extend dynamic routing from a one-dimensional tier system to a two-dimensional system that combines complexity classification (“how hard”) with capability scoring (“what kind”), while preserving downgrade-only semantics, budget controls, and user overrideability.
Design Principles
Section titled “Design Principles”-
Downgrade-only invariant is preserved. The user’s configured model for a phase is always the ceiling. Capability scoring ranks models within the eligible set — it never promotes above the user’s configured model.
-
Complexity classification remains. The existing
classifyUnitComplexity()pipeline (unit type defaults, task plan analysis, adaptive learning, budget pressure) continues to determine tier eligibility. Capability scoring selects among tier-eligible models. -
Cost is a constraint, not a score dimension. Budget pressure constrains which models are eligible. Capability profiles describe what models are good at, not what they cost.
-
Requirement vectors are dynamic, not static. Task requirements are computed from
(unitType, TaskMetadata), not from unit type alone.
The Revised Routing Pipeline
Section titled “The Revised Routing Pipeline”unit dispatch → classifyUnitComplexity(unitType, unitId, basePath, budgetPct) [unchanged — determines tier eligibility and budget filtering] → resolveModelForComplexity(classification, phaseConfig, routingConfig, availableModelIds) → STEP 1: filter to tier-eligible models (downgrade-only from user ceiling) → STEP 2: if capability routing enabled AND >1 eligible model: → computeTaskRequirements(unitType, taskMetadata) → scoreEligibleModels(eligible, taskRequirements) → select highest-scoring model (deterministic tie-break by cost, then ID) → STEP 3: assemble fallback chain → resolveModelId() → pi.setModel()Model Capability Profiles
Section titled “Model Capability Profiles”Each model gains an optional capability profile:
interface ModelCapabilities { coding: number; // greenfield implementation, code generation debugging: number; // root-cause analysis, error diagnosis, refactoring research: number; // information synthesis, investigation, exploration reasoning: number; // multi-step logic, planning, architecture speed: number; // response latency (inverse of thinking time) longContext: number; // effective use of large input windows instruction: number; // instruction following, structured output adherence}Scores are normalized 0–100. Seven dimensions. No costEfficiency dimension — cost is handled separately by budget pressure and tier economics.
Models without a capability profile are treated as having uniform scores across all dimensions (score 50 in each), which makes capability scoring a no-op for those models and falls back to the existing cheapest-in-tier behavior.
Dynamic Task Requirement Vectors
Section titled “Dynamic Task Requirement Vectors”Requirement vectors are computed as a function of (unitType, TaskMetadata), not looked up from a static table. This preserves the nuance that classifyUnitComplexity already captures.
function computeTaskRequirements( unitType: string, metadata?: TaskMetadata,): Partial<Record<keyof ModelCapabilities, number>> { // Base vector from unit type const base = BASE_REQUIREMENTS[unitType] ?? { reasoning: 0.5 };
// Refine based on task metadata (only for execute-task) if (unitType === "execute-task" && metadata) { // Docs/config/rename tasks → boost instruction, reduce coding if (metadata.tags?.some(t => /^(docs?|readme|comment|config|typo|rename)$/i.test(t))) { return { ...base, instruction: 0.9, coding: 0.3, speed: 0.7 }; } // Debugging keywords → boost debugging and reasoning if (metadata.complexityKeywords?.some(k => k === "concurrency" || k === "compatibility")) { return { ...base, debugging: 0.9, reasoning: 0.8 }; } // Migration/architecture → boost reasoning and coding if (metadata.complexityKeywords?.some(k => k === "migration" || k === "architecture")) { return { ...base, reasoning: 0.9, coding: 0.8 }; } // Many files or high estimated lines → boost coding if ((metadata.fileCount ?? 0) >= 6 || (metadata.estimatedLines ?? 0) >= 500) { return { ...base, coding: 0.9, reasoning: 0.7 }; } }
return base;}Base requirement vectors by unit type:
const BASE_REQUIREMENTS: Record<string, Partial<Record<keyof ModelCapabilities, number>>> = { "execute-task": { coding: 0.9, instruction: 0.7, speed: 0.3 }, "research-milestone": { research: 0.9, longContext: 0.7, reasoning: 0.5 }, "research-slice": { research: 0.9, longContext: 0.7, reasoning: 0.5 }, "plan-milestone": { reasoning: 0.9, coding: 0.5 }, "plan-slice": { reasoning: 0.9, coding: 0.5 }, "replan-slice": { reasoning: 0.9, debugging: 0.6, coding: 0.5 }, "reassess-roadmap": { reasoning: 0.9, research: 0.5 }, "complete-slice": { instruction: 0.8, speed: 0.7 }, "run-uat": { instruction: 0.7, speed: 0.8 }, "discuss-milestone": { reasoning: 0.6, instruction: 0.7 }, "complete-milestone": { instruction: 0.8, reasoning: 0.5 },};Scoring Function
Section titled “Scoring Function”function scoreModel( model: ModelCapabilities, requirements: Partial<Record<keyof ModelCapabilities, number>>,): number { let weightedSum = 0; let weightSum = 0; for (const [dim, weight] of Object.entries(requirements)) { const capability = model[dim as keyof ModelCapabilities] ?? 50; weightedSum += weight * capability; weightSum += weight; } return weightSum > 0 ? weightedSum / weightSum : 50;}This produces a weighted average in the range 0–100, where each dimension’s contribution is proportional to its requirement weight. The output is directly comparable across models regardless of how many dimensions the requirement vector has.
Tie-breaking: When two models score within 2 points of each other, prefer the cheaper model (by MODEL_COST_PER_1K_INPUT). If cost is also equal, break ties by lexicographic model ID for determinism.
Configuration Model
Section titled “Configuration Model”Built-in capability profiles ship as a data table alongside MODEL_CAPABILITY_TIER and MODEL_COST_PER_1K_INPUT in model-router.ts:
const MODEL_CAPABILITY_PROFILES: Record<string, ModelCapabilities> = { "claude-opus-4-6": { coding: 95, debugging: 90, research: 85, reasoning: 95, speed: 30, longContext: 80, instruction: 90 }, "claude-sonnet-4-6": { coding: 85, debugging: 80, research: 75, reasoning: 80, speed: 60, longContext: 75, instruction: 85 }, "claude-haiku-4-5": { coding: 60, debugging: 50, research: 45, reasoning: 50, speed: 95, longContext: 50, instruction: 75 }, "gpt-4o": { coding: 80, debugging: 75, research: 70, reasoning: 75, speed: 65, longContext: 70, instruction: 80 }, "gpt-4o-mini": { coding: 55, debugging: 45, research: 40, reasoning: 45, speed: 90, longContext: 45, instruction: 70 }, "gemini-2.5-pro": { coding: 75, debugging: 70, research: 85, reasoning: 75, speed: 55, longContext: 90, instruction: 75 }, "gemini-2.0-flash": { coding: 50, debugging: 40, research: 50, reasoning: 40, speed: 95, longContext: 60, instruction: 65 }, "deepseek-chat": { coding: 75, debugging: 65, research: 55, reasoning: 70, speed: 70, longContext: 55, instruction: 65 }, "o3": { coding: 80, debugging: 85, research: 80, reasoning: 92, speed: 25, longContext: 70, instruction: 85 },};Users can override capability profiles in models.json per provider:
{ "providers": { "anthropic": { "modelOverrides": { "claude-sonnet-4-6": { "capabilities": { "debugging": 90, "research": 85 } } } } }}Partial overrides are deep-merged with built-in defaults. This uses the same modelOverrides path that already supports contextWindow, cost, and compat overrides.
Profile Versioning
Section titled “Profile Versioning”Built-in capability profiles are maintained alongside the existing MODEL_CAPABILITY_TIER and MODEL_COST_PER_1K_INPUT tables in model-router.ts. When the @gsd/pi-ai model catalog is updated with new models, the capability profile table must be updated in the same PR. A linting rule should flag any model present in MODEL_CAPABILITY_TIER but missing from MODEL_CAPABILITY_PROFILES.
Profiles are versioned implicitly by GSD release. The existing models.json modelOverrides mechanism allows users to correct stale defaults immediately without waiting for a GSD update.
Extension-First Rollout
Section titled “Extension-First Rollout”Capability-aware routing should be prototypable as an extension before moving to core. The current hook surface is insufficient for this:
before_provider_requestfires after model selection, at the API payload level — too late to swap model choice.model_selectfires reactively when a model changes, not before selection — it cannot influence the choice.
Required hook addition: A before_model_select hook that fires within selectAndApplyModel() after tier classification but before resolveModelForComplexity(). This hook would receive:
interface BeforeModelSelectEvent { unitType: string; unitId: string; classification: ClassificationResult; taskMetadata: TaskMetadata; eligibleModels: string[]; // tier-filtered available models phaseConfig: ResolvedModelConfig;}Return value: { modelId: string } | undefined (override selection, or undefined to use default).
This hook enables an extension to implement capability scoring externally, test it against real workloads, and validate behavior before the logic moves into model-router.ts.
Rollout sequence:
- Phase 1: Add
before_model_selecthook andTaskMetadatatoClassificationResult. Ship built-in capability profile data table. No core routing changes. - Phase 2: Implement capability scoring as an extension that hooks
before_model_select. Gather user feedback through routing history. - Phase 3: If behavior proves stable, move scoring into
resolveModelForComplexity()in core. Extension hook remains for custom routing strategies.
Observability
Section titled “Observability”Every routing decision must be inspectable. The existing RoutingDecision interface is extended:
interface RoutingDecision { modelId: string; fallbacks: string[]; tier: ComplexityTier; wasDowngraded: boolean; reason: string; // New fields: capabilityScores?: Record<string, number>; // model ID → score taskRequirements?: Partial<Record<string, number>>; // dimension → weight selectionMethod: "tier-only" | "capability-scored";}When verbose mode is on, the routing notification includes the top-scoring models and why the winner was selected:
Dynamic routing [S]: claude-sonnet-4-6 (scored 82.3 — coding:0.9×85, debugging:0.6×80) runner-up: gpt-4o (scored 78.1)Consequences
Section titled “Consequences”Positive
Section titled “Positive”1. Better model-task fit
Section titled “1. Better model-task fit”Routing decisions become based on the kind of work being done, not only how expensive or complex the work appears. A debugging task routes to the strongest debugger in the pool; a research task routes to the best synthesizer.
2. Works across arbitrary model pools
Section titled “2. Works across arbitrary model pools”The router no longer depends on a hardcoded vendor assumption. If a user has only Claude + Codex, it can still route intelligently between them. If the user adds Gemini or local models later, the same scoring system continues to work.
3. Preserves all existing invariants
Section titled “3. Preserves all existing invariants”- Downgrade-only semantics: capability scoring never upgrades beyond the user’s configured phase model.
- Budget pressure: unchanged — constrains tier eligibility before scoring runs.
- Retry escalation: unchanged — escalates tier, then scoring picks the best model in the new tier.
- Fallback chains: assembled the same way, with capability-scored winner as primary.
4. Creates a testable, versionable contract for routing behavior
Section titled “4. Creates a testable, versionable contract for routing behavior”Capability profiles and task vectors are explicit data structures. Routing decisions are inspectable in verbose mode. The scoring function is a pure function suitable for deterministic unit tests.
5. Opens the door to adaptive learning
Section titled “5. Opens the door to adaptive learning”Existing routing history (routing-history.ts) can later refine capability scores per task type. When a model consistently fails at a particular task shape, its effective score for that dimension decreases. This is a natural extension of the existing getAdaptiveTierAdjustment() mechanism.
6. Graceful degradation
Section titled “6. Graceful degradation”Models without capability profiles get uniform scores, producing the same cheapest-in-tier behavior as today. Zero behavior change for users who don’t configure heterogeneous pools.
Negative
Section titled “Negative”1. More metadata to maintain
Section titled “1. More metadata to maintain”Built-in model profiles will drift as model families evolve. Mitigation: profiles live in a single data table, versioned with GSD releases, with a lint rule for completeness.
2. Scoring can create false precision
Section titled “2. Scoring can create false precision”A 0–100 capability scale looks exact but is still heuristic. Mitigation: document profiles as “relative rankings, not benchmarks.” The 2-point tie-breaking threshold prevents insignificant score differences from overriding cost optimization.
3. More routing complexity
Section titled “3. More routing complexity”The current tier router is simple to explain and debug. Multi-dimensional scoring is more powerful but harder to reason about. Mitigation: verbose observability output shows scores and reasons. The selectionMethod field in routing decisions makes it clear whether capability scoring was active.
4. Stronger test requirements
Section titled “4. Stronger test requirements”The router will need coverage for:
- profile loading and override merge rules (partial deep-merge from
modelOverrides) computeTaskRequirements()with various unit types and metadata combinations- scoring function correctness (weighted average, tie-breaking)
- interaction with tier eligibility filtering
- budget pressure applied before scoring, not conflicting with it
- fallback behavior when no scored model is eligible
- graceful degradation when no profiles exist (uniform scores)
before_model_selecthook contract (extension path)
5. New hook surface to maintain
Section titled “5. New hook surface to maintain”The before_model_select hook adds a new extension API contract that must be maintained across releases. Mitigation: the hook is narrowly scoped — one event type, optional return.
Neutral / Migration
Section titled “Neutral / Migration”1. Tier-based routing does not disappear
Section titled “1. Tier-based routing does not disappear”Complexity tiers remain as:
- the primary “how hard is this” signal that determines tier eligibility
- the fallback behavior for models without capability profiles
- the escalation path on retries (light → standard → heavy)
Capability scoring adds the “what kind of work” signal on top. The two systems are layered, not competing.
2. Existing preferences continue to work
Section titled “2. Existing preferences continue to work”dynamic_routing.tier_models still works — it pins a specific model per tier, bypassing capability scoring for that tier. Per-phase model overrides (models.planning, models.execution, etc.) continue to set the ceiling. No existing configuration breaks.
3. Documentation update required
Section titled “3. Documentation update required”docs/dynamic-model-routing.md must be updated to explain:
- what capability profiles are and how to override them
- how scoring interacts with tier routing
- how to read verbose routing output
- how to use
before_model_selectfor custom routing extensions
1. Hardcoded vendor stereotypes become stale
Section titled “1. Hardcoded vendor stereotypes become stale”If the default profiles are not reviewed regularly, GSD will encode outdated assumptions about which models are “best” at which tasks.
Mitigation: Keep defaults in a single data table (not scattered conditionals). Lint for completeness against the model catalog. User overrides via modelOverrides provide immediate escape hatch. Document profiles as heuristic rankings, not benchmarks.
2. Budget logic and capability logic may conflict in user perception
Section titled “2. Budget logic and capability logic may conflict in user perception”The highest-scoring model may not be selected because budget pressure constrained the eligible tier. This could look inconsistent if the user doesn’t understand the pipeline order.
Mitigation: Pipeline order is explicit and enforced in code:
- Complexity classification determines tier
- Budget pressure may downgrade tier
- Tier-eligible models are filtered (downgrade-only from user ceiling)
- Capability scoring ranks the eligible set
- Cost tie-breaks within scoring threshold
Verbose output shows each step. The user sees “budget pressure: 85%” in the reason string when downgrade occurs.
3. Task-type classification may be too coarse initially
Section titled “3. Task-type classification may be too coarse initially”A unit type like execute-task contains many sub-shapes. The initial base vector plus metadata refinement may not distinguish all meaningful cases.
Mitigation: The computeTaskRequirements() function is designed for iterative refinement. The existing TaskMetadata already captures tags, complexity keywords, file counts, dependency counts, and code block counts. New metadata signals can be added to the existing extractTaskMetadata() without changing the scoring function. Routing history provides signal on where refinement is needed.
4. Unknown and custom models may score poorly by default
Section titled “4. Unknown and custom models may score poorly by default”Users often bring custom provider IDs, local models, or vendor aliases that will not exist in the built-in profile table.
Mitigation: Unknown models receive uniform scores (50 across all dimensions), making capability scoring a no-op — they compete on cost within their tier, same as today. Users can add capability profiles via modelOverrides in models.json for models they know well.
5. Extension hook adds API surface
Section titled “5. Extension hook adds API surface”The before_model_select hook creates a contract that extensions may depend on.
Mitigation: The hook has a narrow, well-defined interface. It is additive (existing hooks unchanged). The return type is simple ({ modelId } | undefined). Breaking changes would be handled through the same extension API versioning as other hooks.
Alternatives Considered
Section titled “Alternatives Considered”A. Keep pure complexity-tier routing
Section titled “A. Keep pure complexity-tier routing”Rejected because it optimizes cost within a tier but still treats meaningfully different models as interchangeable. The existing MODEL_CAPABILITY_TIER table already proves this is a recognized gap — it just stops at three buckets.
B. Hardcode task → model mappings
Section titled “B. Hardcode task → model mappings”Rejected because it breaks as soon as the user does not have the expected model. This is appropriate for a closed product with a fixed fleet, not for GSD’s user-configured provider model.
C. Route only by user-specified per-phase models
Section titled “C. Route only by user-specified per-phase models”Rejected because it pushes all routing intelligence onto the user and does not adapt to retries, task subtype, or provider heterogeneity.
D. Use capability-aware routing only as an extension, never in core
Section titled “D. Use capability-aware routing only as an extension, never in core”Not rejected as a starting point, but insufficient as the long-term architecture. Extension prototyping is the recommended first phase. However, coherent preferences, diagnostics, testing, and profile versioning will likely require core integration if the model proves valuable.
E. Add costEfficiency as a capability dimension
Section titled “E. Add costEfficiency as a capability dimension”Rejected because it conflates two concerns. If cost appears in both the scoring function and the budget constraint, the router has two competing cost signals that produce confusing behavior (e.g., a cheap model wins on costEfficiency score but then gets filtered out by budget pressure, or vice versa). Cost constrains eligibility; capability determines ranking.
F. Use static requirement vectors per unit type (no metadata refinement)
Section titled “F. Use static requirement vectors per unit type (no metadata refinement)”Rejected because the existing classifyUnitComplexity() already proves that unit type alone is too coarse. A execute-task for docs vs. a execute-task for migration are categorically different. The metadata signals (tags, complexity keywords, file counts) that the classifier already extracts should inform requirement vectors.
Appendix: Current Architecture Reference
Section titled “Appendix: Current Architecture Reference”For implementors, the current routing pipeline files:
| File | Role |
|---|---|
auto-dispatch.ts | Rule table that determines unit type + prompt |
auto-model-selection.ts | Orchestrates model selection for each dispatch |
complexity-classifier.ts | Tier classification with task metadata analysis |
model-router.ts | Tier → model resolution with downgrade-only semantics |
routing-history.ts | Adaptive learning from success/failure patterns |
preferences-models.ts | Per-phase model config resolution and fallbacks |
register-hooks.ts | Hook registration including before_provider_request |
The capability scoring additions would primarily touch model-router.ts (profiles, scoring function) and auto-model-selection.ts (passing metadata to the router, new hook point).