Frontier Techniques for GSD-2

Research into cutting-edge AI agent techniques that map directly to GSD-2’s architecture, ranked by impact and feasibility.

Date: 2026-03-25 | Status: Research / Pre-RFC



GSD-2 is a multi-layered, event-driven agent platform with strong extensibility primitives: a skill system, file-based memory, session branching, compaction, and 16+ extension lifecycle hooks. These existing primitives create natural integration points for six frontier techniques that could fundamentally change how GSD operates.

The techniques fall into three categories:

| Category | Techniques | Theme |
|---|---|---|
| Self-Improvement | Skill Library Evolution, Cross-Session Learning Graph | GSD gets better the more you use it |
| Performance | DAG Tool Execution, Speculative Tool Execution | GSD gets faster per turn |
| Intelligence | Semantic Context Compression, MCTS Planning | GSD reasons better with the same context budget |

Technique #1: Skill Library Evolution

Category: Self-Improvement | Impact: Massive | Effort: Medium | Priority: #1

Inspired by SkillRL (ICLR 2026), this technique transforms GSD’s skill system from static instruction files into a self-improving knowledge base. Instead of skills being written once and updated manually, they evolve based on execution outcomes.

SkillRL demonstrates that agents with learned skill libraries outperform baselines by 15.3%+ across task benchmarks, with 10-20% token compression compared to raw trajectory storage.

┌─────────────────────────────────────────────────────────┐
│ EXECUTION LOOP │
│ │
│ 1. Skill invoked → agent executes task │
│ 2. Outcome captured (success/failure + trajectory) │
│ 3. Trajectory distilled: │
│ ├─ Success → strategic pattern extracted │
│ └─ Failure → anti-pattern + lesson recorded │
│ 4. Skill file updated with versioned improvement │
│ 5. Next invocation benefits from accumulated learnings │
│ │
└─────────────────────────────────────────────────────────┘

Two types of learned knowledge:

| Type | Description | Example |
|---|---|---|
| General Skills | Universal strategic guidance applicable across tasks | "When editing TypeScript files, always check for type errors via LSP before committing" |
| Task-Specific Skills | Category-level heuristics for specific skill domains | "The fix-issue skill should check CI status before opening a PR, not after" |

GSD already has every primitive needed:

  • Skill files (~/.claude/skills/, .claude/skills/) — the storage layer exists
  • Extension hooks (turn_end, agent_end) — outcome capture points exist
  • Memory system (MEMORY.md + individual files) — persistence exists
  • /improve-skill and /heal-skill commands — manual versions of this loop already exist

The gap is automation: connecting execution outcomes back to skill files without human intervention.
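One way the loop could hang off the existing hooks, sketched with hypothetical types (`Outcome`, `SkillUpdate`, `distill`) that make no claim about GSD's actual extension API:

```typescript
// Hypothetical sketch: distill an execution outcome into a versioned lesson
// and append it to a skill file without disturbing the original skill text.

interface Outcome {
  skill: string;
  success: boolean;
  trajectory: string[]; // tool calls + results, in order
}

interface SkillUpdate {
  version: number;
  kind: "pattern" | "anti-pattern";
  lesson: string;
}

// Success -> strategic pattern; failure -> anti-pattern + lesson (step 3 above).
function distill(outcome: Outcome, priorVersion: number): SkillUpdate {
  return {
    version: priorVersion + 1,
    kind: outcome.success ? "pattern" : "anti-pattern",
    lesson: outcome.success
      ? `Strategy that worked: ${outcome.trajectory.join(" -> ")}`
      : `Avoid: ${outcome.trajectory.join(" -> ")} (failed)`,
  };
}

// Append the lesson as a versioned block, preserving the original skill intent.
function applyUpdate(skillBody: string, update: SkillUpdate): string {
  return (
    skillBody +
    `\n\n<!-- learned v${update.version} (${update.kind}) -->\n${update.lesson}\n`
  );
}
```

An `agent_end` hook would call `distill` with the session's trajectory and write the result back through `applyUpdate`, which only ever appends.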

| GSD Component | Role in Integration |
|---|---|
| agent-session.ts turn_end event | Captures execution outcome (success/failure signals) |
| Extension hook: agent_end | Triggers trajectory distillation |
| Skill file system | Receives versioned updates with learned patterns |
| compaction.ts | Provides trajectory data from the session for distillation |
User invokes skill
┌──────────────┐ ┌──────────────────┐
│ AgentSession │────▶│ Skill Executor │
│ (turn_end) │ │ (tracks outcome) │
└──────────────┘ └────────┬─────────┘
┌─────────▼──────────┐
│ Outcome Classifier │
│ (success/failure/ │
│ partial) │
└─────────┬──────────┘
┌───────────────┼───────────────┐
▼ ▼ ▼
┌────────────┐ ┌──────────────┐ ┌───────────┐
│ Success │ │ Failure │ │ Partial │
│ Distiller │ │ Distiller │ │ Analyzer │
└─────┬──────┘ └──────┬───────┘ └─────┬─────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────┐
│ Skill File Updater │
│ • Appends learned pattern to skill │
│ • Versions the update │
│ • Preserves original skill intent │
└─────────────────────────────────────────────┘
Open questions:

  • Drift prevention: How to prevent accumulated learnings from overwhelming the original skill intent?
  • Conflict resolution: What happens when a lesson from one session contradicts another?
  • Quality gate: Should updates require a validation pass before being written?

Technique #2: DAG Tool Execution

Category: Performance | Impact: High | Effort: Medium | Priority: #2

The LLM Compiler pattern (ICML 2024) treats multi-tool workflows like a compiler optimization pass. When the model returns multiple tool calls in a single response, instead of executing them sequentially, the system:

  1. Analyzes dependencies between tool calls
  2. Constructs a Directed Acyclic Graph (DAG)
  3. Executes independent tools in parallel
  4. Blocks only on actual data dependencies

Current GSD behavior (sequential):

Read(auth.ts) ─── 150ms ───▶ result
Read(types.ts) ─── 120ms ──▶ result
Grep("login") ─── 80ms ────▶ result
Read(test.ts) ─── 130ms ───▶ result
Total: ~480ms sequential

With DAG execution (parallel):

Read(auth.ts) ─── 150ms ──▶ result ─┐
Read(types.ts) ─── 120ms ──▶ result ─┤
Grep("login") ─── 80ms ───▶ result ─┤── all complete at 150ms
Read(test.ts) ─── 130ms ──▶ result ─┘
Total: ~150ms (max of parallel set)
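The timing arithmetic above is just sum versus max over the batch:

```typescript
// Sequential latency is the sum of tool durations; parallel latency is the
// max of the batch, since all four calls are independent reads/greps.
const durations = [150, 120, 80, 130]; // ms, from the example above

const sequentialMs = durations.reduce((a, b) => a + b, 0); // 480
const parallelMs = Math.max(...durations); // 150
```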

Dependency analysis rules:

| Tool A | Tool B | Dependency? | Reason |
|---|---|---|---|
| Read(file) | Read(file) | No | Reads are idempotent |
| Read(file) | Grep(pattern) | No | Independent data sources |
| Read(file) | Edit(file) | Yes | Edit depends on Read content |
| Edit(file) | Edit(file) | Yes | Edits to the same file must serialize |
| Bash(cmd) | Bash(cmd) | Maybe | Depends on side effects |
| Write(file) | Read(file) | Yes | Read after write needs the write to complete |
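The rules above can be sketched as a conservative dependency check plus a level-based scheduler. The `ToolCall` shape and rule set are simplified assumptions, not GSD's actual tool types:

```typescript
// Simplified sketch of DAG construction from the dependency rules.

interface ToolCall {
  name: "Read" | "Grep" | "Glob" | "Edit" | "Write" | "Bash";
  target?: string; // file path, if any
}

// Conservative rules: Bash is treated as impure and serialized against
// everything; writes conflict with anything touching the same file.
function dependsOn(later: ToolCall, earlier: ToolCall): boolean {
  if (earlier.name === "Bash" || later.name === "Bash") return true;
  const writes = (t: ToolCall) => t.name === "Edit" || t.name === "Write";
  const sameFile = later.target !== undefined && later.target === earlier.target;
  return sameFile && (writes(later) || writes(earlier));
}

// Group calls into levels: a call lands one level after its deepest dependency.
function planLevels(calls: ToolCall[]): ToolCall[][] {
  const level: number[] = [];
  calls.forEach((call, i) => {
    let deepest = -1;
    for (let j = 0; j < i; j++) {
      if (dependsOn(call, calls[j])) deepest = Math.max(deepest, level[j]);
    }
    level[i] = deepest + 1;
  });
  const levels: ToolCall[][] = [];
  calls.forEach((c, i) => {
    if (!levels[level[i]]) levels[level[i]] = [];
    levels[level[i]].push(c);
  });
  return levels;
}
```

Each inner array can then be run with `Promise.all`, in level order, with results reassembled in the model's original order.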

The model already emits multiple tool_use blocks in a single response. GSD processes them, but the execution path in agent-loop.ts handles them in sequence. The parallelism opportunity is sitting right there.

Estimated impact: A typical coding turn involves 3-5 tool calls. If roughly 60% are parallelizable (reads, greps, globs), per-turn latency drops by 40-60%. Over a 50-turn session, that adds up to minutes saved.

| GSD Component | Role in Integration |
|---|---|
| agent-loop.ts tool execution path | Replace sequential execution with a DAG scheduler |
| Tool definitions | Annotate tools with side-effect metadata (pure/impure) |
| Extension hooks (tool_*) | Must still fire in the correct order per dependency chain |
Model response with N tool_use blocks
┌──────────────────────────────┐
│ Dependency Analyzer │
│ • Parse tool calls │
│ • Identify file overlaps │
│ • Identify data dependencies │
│ • Classify: pure vs impure │
└──────────────┬───────────────┘
┌──────────────────────────────┐
│ DAG Constructor │
│ • Nodes = tool calls │
│ • Edges = dependencies │
│ • Topological sort │
└──────────────┬───────────────┘
┌──────────────────────────────┐
│ Parallel Executor │
│ • Execute roots immediately │
│ • On completion, unlock │
│ dependent nodes │
│ • Collect all results │
│ • Return in original order │
└──────────────────────────────┘
Open questions:

  • Bash side effects: How to determine if two Bash commands conflict without executing them?
  • Extension hooks: Should tool_start/tool_end events fire in execution order or original order?
  • Error propagation: If a parallel tool fails, do dependent tools get cancelled or receive the error?

Technique #3: Speculative Tool Execution

Category: Performance | Impact: High | Effort: Low-Medium | Priority: #3

Based on Speculative Tool Calls research, this technique predicts which tools the model will request and pre-executes them before the model responds. Correct predictions eliminate the first tool-call round-trip entirely. Wrong predictions are discarded at zero cost beyond compute.

┌─────────────────────────────────────────────────────────────┐
│ User: "fix the bug in auth.ts" │
│ │
│ BEFORE model responds: │
│ Speculator predicts: │
│ ├─ Read("auth.ts") → pre-executed ✓ │
│ ├─ Grep("error|bug", "auth") → pre-executed ✓ │
│ ├─ LSP diagnostics(auth.ts) → pre-executed ✓ │
│ └─ Read("auth.test.ts") → pre-executed ✓ │
│ │
│ Model responds with tool calls: │
│ ├─ Read("auth.ts") → CACHE HIT (0ms) │
│ ├─ Read("auth.test.ts") → CACHE HIT (0ms) │
│ └─ Grep("login", "src/") → cache miss (execute) │
│ │
│ Hit rate: 2/3 = 67% │
│ Latency saved: ~300ms on this turn │
└─────────────────────────────────────────────────────────────┘

Prediction strategies (simplest to most sophisticated):

| Strategy | Description | Expected Hit Rate |
|---|---|---|
| Keyword extraction | Parse the user prompt for file paths and function names, then Read those files | 40-60% |
| Session history | Track which tools follow which user prompt patterns | 50-70% |
| Learned patterns | Use the skill library evolution data to predict tool sequences | 60-80% |
| Model pre-query | Ask a fast/cheap model to predict tool calls | 70-85% |

The #1 latency bottleneck in GSD is the round-trip: user prompt → model thinks → model requests tool → tool executes → result sent back → model thinks again. Speculative execution attacks the highest-latency step.

GSD’s architecture makes this easy to add:

  • AgentSession.prompt() already processes user input before sending to the model
  • Tool results are already cached in the message array
  • The extension system can intercept input and spawn pre-fetches
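A minimal sketch of the keyword-extraction strategy, the simplest row in the table above. The `executeTool` parameter and the tool+args cache key scheme are assumptions for illustration; none of these names exist in GSD today:

```typescript
// Hypothetical speculator: extract file-looking tokens from the prompt and
// pre-execute Reads while the model is still thinking.

type ToolResult = string;

const speculationCache = new Map<string, Promise<ToolResult>>();

const cacheKey = (tool: string, args: unknown) => `${tool}:${JSON.stringify(args)}`;

// Strategy 1 from the table: pull path-like tokens out of the user prompt.
function extractPaths(prompt: string): string[] {
  return prompt.match(/[\w./-]+\.[a-z]{1,4}\b/g) ?? [];
}

// Fire-and-forget pre-execution of pure tools, keyed so duplicates are skipped.
function speculate(
  prompt: string,
  executeTool: (t: string, a: object) => Promise<ToolResult>
) {
  for (const path of extractPaths(prompt)) {
    const key = cacheKey("Read", { path });
    if (!speculationCache.has(key)) {
      speculationCache.set(key, executeTool("Read", { path }));
    }
  }
}

// In the tool-execution path: serve a cache hit, otherwise fall through.
async function executeWithCache(
  tool: string,
  args: object,
  executeTool: (t: string, a: object) => Promise<ToolResult>
): Promise<ToolResult> {
  const hit = speculationCache.get(cacheKey(tool, args));
  return hit ?? executeTool(tool, args);
}
```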
| GSD Component | Role in Integration |
|---|---|
| AgentSession.prompt() | Trigger speculation after user input, before the model call |
| Tool result cache (new) | Store speculated results keyed by tool + args |
| agent-loop.ts tool execution | Check the cache before executing; serve the cached result on a hit |
| Extension hook: input | Parse user intent for file paths and patterns |
User input arrives
├──────────────────────────────────────┐
│ │
▼ ▼
┌───────────────┐ ┌──────────────────┐
│ Send to LLM │ │ Speculator │
│ (normal path) │ │ • Extract paths │
│ │ │ • Predict tools │
│ ... waiting │ │ • Pre-execute │
│ for response │ │ • Cache results │
│ │ └──────────────────┘
│ │ │
│ │◀─── model returns ──────────│
│ │ tool_use blocks │
└───────┬───────┘ │
│ │
▼ │
┌───────────────┐ │
│ Tool Executor │◀──── check cache ───────────┘
│ • Cache hit? │
│ → return │
│ • Cache miss? │
│ → execute │
└───────────────┘
| Scenario | Cost |
|---|---|
| Correct prediction | ~0ms latency (result already available). Compute cost: the pre-execution itself (trivial for Read/Grep). |
| Wrong prediction | Wasted compute for the pre-executed tool. For Read/Grep/Glob, this is <10ms of I/O. |
| Partial hit | Net positive as long as the hit rate exceeds 20% (given how cheap misses are). |
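A quick expected-value check on the break-even claim, treating ~150ms saved per hit and ~10ms wasted per miss as representative figures taken from the numbers above:

```typescript
// Expected net saving per speculated tool call. Break-even is at
// 10 / (150 + 10) ≈ 6% hit rate, so a >20% hit rate is comfortably positive.
function netSavingMs(hitRate: number, savedPerHit = 150, costPerMiss = 10): number {
  return hitRate * savedPerHit - (1 - hitRate) * costPerMiss;
}
```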
Open questions:

  • TTL for cached results: How long are speculated results valid? File contents can change between speculation and model request.
  • Side effects: Should only pure tools (Read, Grep, Glob, LSP) be speculatable?
  • Resource limits: Cap on number of speculative executions per turn to prevent I/O storms?

Technique #4: Semantic Context Compression

Category: Intelligence | Impact: High | Effort: High | Priority: #4

GSD’s compaction system uses a char/4 heuristic for token estimation and all-or-nothing LLM summarization for context reduction. Research from Zylos and context engineering literature shows that embedding-based compression achieves 80-90% token reduction while preserving the ability to selectively recall specific historical context.

Current GSD Compaction (Weaknesses Highlighted)

Messages: [M1, M2, M3, M4, M5, M6, M7, M8, M9, M10]
Token budget exceeded │ recent
Current approach:
┌─────────────────────────┬─────────────────────────┐
│ M1-M6: LLM-summarized │ M7-M10: kept verbatim │
│ into single blob │ (last ~20k tokens) │
│ │ │
│ ⚠ All detail lost │ ✓ Full fidelity │
│ ⚠ No selective recall │ │
│ ⚠ char/4 overestimates │ │
└─────────────────────────┴─────────────────────────┘

Three specific weaknesses:

| Weakness | Impact | Current Code Location |
|---|---|---|
| char/4 token estimation | ~25% overestimate → compacts too early → wastes context | compaction.ts:201-259 |
| All-or-nothing summarization | Loses specific details that may be relevant later | compaction.ts:327-400 |
| No retrieval from compacted history | Once summarized, detail is gone forever | compaction-orchestrator.ts |
┌─────────────────────────────────────────────────────────┐
│ HOT TIER │
│ Recent turns (last ~20k tokens) │
│ Full text, full fidelity │
│ Storage: in-context messages │
│ Access: always in prompt │
├─────────────────────────────────────────────────────────┤
│ WARM TIER │
│ Older turns (beyond context window) │
│ Stored as embeddings + compressed text │
│ Storage: session-local vector index │
│ Access: retrieved when semantically relevant to │
│ current turn │
│ Token cost: only retrieved segments count │
├─────────────────────────────────────────────────────────┤
│ COLD TIER │
│ Ancient turns / previous sessions │
│ Stored as summaries + metadata │
│ Storage: disk (existing session files) │
│ Access: retrieved only on explicit recall │
│ Token cost: minimal summary headers │
└─────────────────────────────────────────────────────────┘

How retrieval works per turn:

New user prompt arrives
┌───────────────────┐
│ Embed the prompt │ (compute embedding of user's question)
└────────┬──────────┘
├──── query warm tier ──▶ top-K relevant historical turns
│ (cosine similarity > threshold)
├──── always include ──▶ hot tier (recent turns, full text)
┌───────────────────┐
│ Compose context │
│ = hot + retrieved │
│ + system prompt │
└───────────────────┘
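The warm-tier query step can be sketched as plain cosine similarity over stored turn embeddings. The `WarmEntry` shape and the 0.7 threshold are illustrative assumptions; the embedding function itself is assumed to exist elsewhere:

```typescript
// Sketch of warm-tier retrieval: top-K historical turns by cosine similarity.

interface WarmEntry {
  turnId: number;
  embedding: number[];
  compressedText: string;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Top-K entries above the similarity threshold, best first.
function retrieveWarm(
  queryEmbedding: number[],
  warmTier: WarmEntry[],
  k: number,
  threshold = 0.7
): WarmEntry[] {
  return warmTier
    .map((e) => ({ e, score: cosine(queryEmbedding, e.embedding) }))
    .filter((x) => x.score > threshold)
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.e);
}
```

Only the `compressedText` of the retrieved entries is spliced into the prompt, so the token cost scales with relevance rather than history length.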

Replace char/4 with adaptive estimation:

| Approach | Accuracy | Cost |
|---|---|---|
| char/4 (current) | ~75% (overestimates) | Zero |
| Provider-reported usage | 100% (for the last turn) | Zero (already tracked) |
| tiktoken / provider tokenizer | ~98% | ~5ms per message |
| Hybrid: actual for recent, char/4 for old | ~95% | Negligible |

The hybrid approach — use actual token counts from provider responses for recent messages, fall back to char/4 for older messages — is a quick win that requires no new dependencies.
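The hybrid estimator is small enough to sketch in full; the `reportedTokens` field is an assumed location for provider usage data, not GSD's actual message shape:

```typescript
// Hybrid token estimation: provider-reported counts where available,
// char/4 fallback for older messages that predate usage tracking.

interface Message {
  text: string;
  reportedTokens?: number; // from provider usage, when we have it
}

function estimateTokens(messages: Message[]): number {
  return messages.reduce(
    (sum, m) => sum + (m.reportedTokens ?? Math.ceil(m.text.length / 4)),
    0
  );
}
```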

| GSD Component | Role in Integration |
|---|---|
| compaction.ts | Replace the cut-point algorithm with the tiered approach |
| compaction-orchestrator.ts | Add warm-tier retrieval before the model call |
| agent-session.ts message building | Inject retrieved warm-tier segments |
| Session persistence layer | Store embeddings alongside session entries |
Open questions:

  • Embedding model: Local (fast, private) or API (better quality, adds latency)?
  • Index format: Simple cosine similarity on flat arrays vs. HNSW index?
  • Retrieval budget: How many tokens to allocate to warm-tier retrievals per turn?
  • Coherence: How to prevent retrieved historical context from confusing the model about the current state?

Technique #5: Cross-Session Learning Graph

Category: Self-Improvement | Impact: Transformative | Effort: High | Priority: #5

GSD’s memory system (MEMORY.md + individual files) stores flat, file-based memories. A learning graph extends this into a structured knowledge base that captures relationships between codebases, files, errors, solutions, and patterns across all sessions.

This is informed by research on agent memory architectures and the emerging discipline of context engineering.

| Aspect | Current (MEMORY.md) | Learning Graph |
|---|---|---|
| Structure | Flat file list | Nodes + edges (graph) |
| Relationships | None | "file X often breaks when Y changes" |
| Retrieval | All loaded into context | Query-driven, only relevant nodes |
| Learning | Manual (user says "remember X") | Automatic from execution outcomes |
| Scope | Per-project directory | Per-project with cross-project patterns |
| Staleness | Manual cleanup | Confidence decay over time |
┌──────────┐ touches ┌──────────┐
│ Session │────────────────▶│ File │
│ │ │ │
│ • date │ │ • path │
│ • outcome │ │ • type │
│ • tokens │ │ • churn │
└────┬──────┘ └─────┬─────┘
│ │
│ encountered │ involved_in
│ │
▼ ▼
┌──────────┐ resolved_by ┌──────────┐
│ Error │────────────────▶│ Solution │
│ │ │ │
│ • type │ │ • pattern │
│ • message │ │ • success │
│ • freq │ │ rate │
└──────────┘ └──────────┘
│ │
│ prevented_by │ uses
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Pattern │ │ Tool │
│ │ │ │
│ • type │ │ • name │
│ • desc │ │ • avg │
│ • conf │ │ time │
└──────────┘ └──────────┘
| Query | Result |
|---|---|
| "What errors have occurred in auth.ts?" | List of error nodes connected to that file node |
| "What's the typical fix for TypeError in this codebase?" | Solution nodes with the highest success rate for that error type |
| "Which files tend to break together?" | File clusters with high co-occurrence in error sessions |
| "What tools are slowest in this project?" | Tool nodes sorted by average execution time |
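A minimal in-memory sketch of the node/edge model and the first query above; the schema is illustrative only, not a proposed storage format:

```typescript
// Hypothetical learning-graph sketch: typed nodes, labeled edges, and one
// example query ("what errors have occurred in this file?").

interface Node { id: string; kind: "file" | "error" | "solution"; label: string; }
interface Edge { from: string; to: string; rel: string; }

class LearningGraph {
  nodes = new Map<string, Node>();
  edges: Edge[] = [];

  add(node: Node) { this.nodes.set(node.id, node); }
  link(from: string, rel: string, to: string) { this.edges.push({ from, to, rel }); }

  // Error nodes linked to a file node via the involved_in relation.
  errorsForFile(fileId: string): Node[] {
    return this.edges
      .filter((e) => e.rel === "involved_in" && e.to === fileId)
      .map((e) => this.nodes.get(e.from))
      .filter((n): n is Node => n?.kind === "error");
  }
}
```

The other queries in the table are the same pattern with different relations (resolved_by, uses) and different aggregations over node attributes.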
| GSD Component | Role in Integration |
|---|---|
| session-manager.ts | Write graph nodes on session save |
| agent-session.ts prompt building | Query the graph for relevant context before the model call |
| Memory system (MEMORY.md) | Coexists — the graph handles structured knowledge; memory handles preferences/feedback |
| Extension hook: agent_end | Trigger a graph update with the session outcome |
| Option | Pros | Cons |
|---|---|---|
| SQLite + JSON columns | Simple, no dependencies, fast queries | No native vector search |
| SQLite + sqlite-vss | Adds vector similarity to SQLite | Extra native dependency |
| Flat JSON files | Zero dependencies, git-friendly | Slow for large graphs |
| LanceDB | Embedded vector DB, no server | Additional dependency |
Open questions:

  • Privacy: The graph contains detailed codebase interaction history — should it be encrypted at rest?
  • Portability: Should the graph travel with the project (.claude/ dir) or stay user-local?
  • Garbage collection: How to prune stale nodes (e.g., files that no longer exist)?

Technique #6: MCTS Planning

Category: Intelligence | Impact: Transformative | Effort: Very High | Priority: #6

Inspired by ToolTree and Monte Carlo Tree Search, this technique replaces GSD’s linear action selection with a tree-based planner that explores multiple solution paths simultaneously.

Instead of the model deciding one action at a time and hoping it works, the system:

  1. Generates N candidate next-actions
  2. Scores each based on estimated probability of reaching the goal
  3. Explores promising branches in parallel
  4. Backtracks when a path fails, without wasting the user’s context on dead ends
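The four steps above reduce to a propose/score/select/prune skeleton. This is a heavily simplified sketch (real MCTS also tracks visit counts and backpropagates values through the tree), and every name in it is illustrative:

```typescript
// Best-first skeleton of the planning loop: prune weak candidates, explore
// the most promising branch first, move on when a branch fails.

interface Candidate { action: string; score: number; }

function selectBranch(
  candidates: Candidate[],
  tryBranch: (action: string) => boolean, // true = branch succeeded (e.g. tests pass)
  minScore = 0.4
): string | null {
  const ranked = candidates
    .filter((c) => c.score >= minScore)       // prune low-scoring branches up front
    .sort((a, b) => b.score - a.score);       // explore the highest score first
  for (const c of ranked) {
    if (tryBranch(c.action)) return c.action; // backtrack on failure by moving on
  }
  return null; // no branch succeeded: fall back to linear execution
}
```

With the scores from the tree-search example below this picks "Check auth middleware" first and never explores the 0.3-scored config branch.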

Current (linear):

User: "fix the auth bug"
Action 1: Read auth.ts ──▶ Action 2: Edit line 45 ──▶ Action 3: Run tests
Tests fail ✗
Action 4: Try different edit
Tests fail ✗
Action 5: Read error log...
(linear flailing)

With MCTS (tree search):

User: "fix the auth bug"
Read auth.ts
├── Branch A: Edit line 45 (score: 0.6)
│ └── Run tests → FAIL → prune
├── Branch B: Check auth middleware (score: 0.7) ◀── highest score
│ └── Edit middleware.ts → Run tests → PASS ✓
└── Branch C: Check env config (score: 0.3)
└── (not explored — lower score)
Result: Branch B succeeds after 2 actions, not 5+

GSD already has session branching primitives:

  • fork() creates a branch from any message
  • Branch summaries compress history at fork points
  • Tree navigation (/tree) lets users explore branches
  • Session tree is already a first-class concept

The gap: these primitives are user-triggered. MCTS would make the agent trigger them automatically during problem-solving.

┌─────────────────────────────────────────────────────────┐
│ MCTS Planning Layer │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ Proposer │───▶│ Scorer │───▶│ Selector │ │
│ │ Generate N │ │ Estimate P │ │ Pick best │ │
│ │ candidates │ │ of success │ │ to explore │ │
│ └─────────────┘ └──────────────┘ └─────┬──────┘ │
│ │ │
│ ┌─────────────┐ ┌──────────────┐ │ │
│ │ Pruner │◀───│ Executor │◀─────────┘ │
│ │ Kill dead │ │ Run action │ │
│ │ branches │ │ in worktree │ │
│ └─────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────┐
│ Agent Session │
│ (receives winning │
│ branch as result) │
└─────────────────────┘
| Approach | Speed | Quality | Cost |
|---|---|---|---|
| Heuristic (file relevance, error proximity) | Fast | Low | Free |
| Fast model (haiku-class rates candidates) | Medium | Medium | Low |
| Self-evaluation (main model rates its own proposals) | Slow | High | High |
| Learned scorer (trained on past outcomes from the learning graph) | Fast | High | Free at inference |
| GSD Component | Role in Integration |
|---|---|
| agent-loop.ts | New planning phase between user prompt and action execution |
| Session branching (fork()) | Used to create exploration branches |
| Git worktrees | Each branch explored in an isolated worktree |
| agent-session.ts | Receives the winning branch and presents it as the result |
| Skill Library Evolution (#1) | Provides learned patterns to improve the scorer over time |
| Factor | Value |
|---|---|
| LLM calls per turn | 2-5x more (proposal generation + scoring) |
| Token usage | 3-10x more per complex problem |
| Success rate on hard problems | Estimated 30-50% improvement |
| Time to solution | Fewer total turns despite more LLM calls per turn |
| User experience | Agent appears to "think harder" on hard problems |
Open questions:

  • When to activate: MCTS is expensive. Should it only activate when the agent detects a hard problem (repeated failures, high uncertainty)?
  • Branch isolation: Git worktrees work for file changes, but how to isolate Bash side effects?
  • Budget control: How many branches to explore before falling back to linear execution?
  • Transparency: Should the user see the exploration tree or just the winning path?

| # | Technique | Impact | Effort | Compounding | Dependencies |
|---|---|---|---|---|---|
| 1 | Skill Library Evolution | Massive | Medium | Yes — improves all other techniques | None |
| 2 | DAG Tool Execution | High | Medium | No — static speedup | None |
| 3 | Speculative Tool Execution | High | Low-Med | Yes — improves with learning | Benefits from #1 |
| 4 | Semantic Context Compression | High | High | No — static improvement | None |
| 5 | Cross-Session Learning Graph | Transformative | High | Yes — feeds #1, #3, #6 | Benefits from #1 |
| 6 | MCTS Planning | Transformative | Very High | Yes — improves with #1, #5 | Benefits from #1, #5 |
Phase 1 (Foundation) Phase 2 (Performance) Phase 3 (Intelligence)
───────────────────── ───────────────────── ─────────────────────
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Skill Library │ │ DAG Tool Exec │ │ Semantic Context│
│ Evolution │──feeds──▶│ │ │ Compression │
│ │ │ Speculative │ │ │
│ │──feeds──▶│ Tool Exec │ │ MCTS Planning │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ ▲
┌─────────────────┐ │ │
│ Cross-Session │───────────────────┴──────────────────────────┘
│ Learning Graph │ (feeds intelligence layer)
└─────────────────┘

Phase 1 creates the feedback loop that makes everything else better over time. Phase 2 delivers immediate, measurable performance wins. Phase 3 requires the most architectural change but delivers the deepest capability gains.