Recipe: Recover from Errors

GSD auto-mode has stopped unexpectedly. Maybe the agent hit a context limit, an API went down, a tool call timed out, or something else interrupted execution mid-task. The terminal shows no progress, and you’re not sure what state the project is in.

This recipe walks through the recovery process — from recognizing the problem through automatic healing and, if needed, deeper investigation.

  • GSD installed and available in your terminal
  • A project that was running auto-mode (/gsd auto) when the failure occurred
  • Familiarity with the /gsd doctor and /gsd forensics commands

The scenario: Cookmate was mid-execution on the recipe image upload feature (M002, S01, T02) when the agent hit a rate limit on tool calls. Auto-mode stopped. The terminal just sits there — no error message, no completion.

The first sign is usually silence — auto-mode stops producing output. You might also see:

  • A stale auto.lock file in .gsd/
  • STATE.md showing an in-progress phase with no recent activity
  • An incomplete task — T02-PLAN.md exists but T02-SUMMARY.md doesn’t

Check the current state:

> cat .gsd/STATE.md
# GSD State
**Active Milestone:** M002: Recipe Image Upload
**Active Slice:** S01: Storage and Upload API
**Phase:** executing
...

The auto.lock file confirms auto-mode was running and didn’t shut down cleanly. It’s a JSON file that records the PID, which unit was dispatched, when it started, and the session file path:

{
  "pid": 48221,
  "startedAt": "2025-01-15T10:00:00.000Z",
  "unitType": "execute-task",
  "unitId": "M002/S01/T02",
  "unitStartedAt": "2025-01-15T10:24:00.000Z",
  "completedUnits": 3,
  "sessionFile": ".gsd/activity/004-execute-task-M002-S01-T02.jsonl"
}
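To check by hand whether a lock like this is stale, a PID liveness probe is enough. The following is a minimal sketch, not GSD's own implementation: the function names `is_pid_alive` and `describe_lock` are hypothetical, the field names come from the lock file shown above, and the probe assumes a POSIX system.

```python
import json
import os
from pathlib import Path


def is_pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False  # no such process: the lock is stale
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def describe_lock(lock_path: Path) -> str:
    """Summarize an auto.lock file and say whether it looks stale."""
    lock = json.loads(lock_path.read_text())
    state = "running" if is_pid_alive(lock["pid"]) else "stale"
    return f"{lock['unitType']} {lock['unitId']} (pid {lock['pid']}): {state}"
```

Calling `describe_lock(Path(".gsd/auto.lock"))` from the project root would report the dispatched unit together with a `running` or `stale` verdict.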

The directory structure confirms T02 never finished:

.gsd/
├── auto.lock                 ← stale lock — auto-mode crashed
├── STATE.md                  ← shows executing phase, S01 active
└── milestones/
    └── M002/
        └── slices/
            └── S01/
                ├── S01-PLAN.md
                └── tasks/
                    ├── T01-PLAN.md
                    ├── T01-SUMMARY.md      ← T01 completed
                    ├── T02-PLAN.md         ← T02 was in progress
                    └── (no T02-SUMMARY)    ← T02 never finished
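The same check can be scripted when a slice has many tasks. This is a small sketch that assumes the `T*-PLAN.md` / `T*-SUMMARY.md` naming shown in the tree above; `unfinished_tasks` is a hypothetical helper, not a GSD command.

```python
from pathlib import Path


def unfinished_tasks(tasks_dir: Path) -> list[str]:
    """Return task IDs that have a PLAN file but no matching SUMMARY file."""
    unfinished = []
    for plan in sorted(tasks_dir.glob("T*-PLAN.md")):
        task_id = plan.name.removesuffix("-PLAN.md")  # "T02-PLAN.md" -> "T02"
        if not (tasks_dir / f"{task_id}-SUMMARY.md").exists():
            unfinished.append(task_id)
    return unfinished
```

For the layout above, `unfinished_tasks(Path(".gsd/milestones/M002/slices/S01/tasks"))` would return `["T02"]`.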

Start by scanning for issues without making any changes:

> /gsd doctor

Doctor runs checks across six domains:

  • Structural checks — Missing summaries, UAT files, unchecked roadmap entries, missing tasks/ directories, duplicate task IDs, premature slice completion
  • Git health — Orphaned worktrees, stale milestone branches, corrupt merge or rebase state, runtime files tracked by git, orphaned worktree directories, stale or dirty worktrees
  • Runtime health — Stale auto.lock (PID check), stranded lock directory, stale parallel sessions, orphaned completed-unit keys, stale hook state, activity log bloat, STATE.md staleness, gitignore drift, metrics ledger bloat
  • Global health — Orphaned project state directories in ~/.gsd/projects/ whose git root no longer exists
  • Environment health — Node version, dependencies, missing .env files, port conflicts, disk space, Docker, package manager, TypeScript/Python/Rust/Go tool presence, git remote reachability
  • Provider/auth checks — API key availability for required LLM providers, remote questions channel tokens, optional tool integrations

The report shows what’s wrong before anything is repaired:

GSD doctor report.
Scope: M002/S01
Issues: 2 total · 1 error(s) · 1 warning(s) · 2 fixable
Priority issues:
- [ERROR] project: stale auto.lock — PID 48221 is dead
- [WARN] M002/S01/T02: summary exists but task not checked in plan

Once you understand what’s wrong, apply automatic repairs:

> /gsd doctor fix

Fix mode repairs everything marked as fixable. Depending on what’s detected, this includes:

  • Removing the stale auto.lock crash lock
  • Removing stranded .gsd.lock/ directories
  • Marking tasks [x] in the slice plan when a summary already exists on disk
  • Un-checking tasks in the slice plan when the summary is missing (so they re-execute cleanly)
  • Un-marking slices as done in the roadmap when completion was premature
  • Creating placeholder summaries for done-but-missing slice artifacts
  • Creating placeholder UAT files when all tasks are complete but UAT is missing
  • Marking the slice [x] in the roadmap when all tasks and artifacts are complete
  • Removing orphaned completed-unit keys that reference missing artifacts
  • Rebuilding STATE.md from current disk state
  • Pruning activity logs to a 7-day retention window
  • Removing orphaned worktrees and stale milestone branches
  • Adding missing GSD runtime patterns to .gitignore

Doctor fix only touches issues with fixable: true — it won’t rewrite plan content or make judgment calls.
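The two checkbox rules in the list (check a task when its summary exists, un-check it when the summary is missing) amount to one reconciliation pass over the slice plan. The sketch below illustrates that idea only. It assumes task lines of the form `- [ ] **T02: ...`, and `reconcile_checkboxes` is a hypothetical name, not part of GSD.

```python
import re

# Matches lines such as "- [ ] **T02: Upload endpoint**" (assumed format).
TASK_LINE = re.compile(
    r"^(?P<head>[ \t]*- \[)[ xX](?P<tail>\] \*\*(?P<task>T\d+):)",
    re.MULTILINE,
)


def reconcile_checkboxes(plan_text: str, tasks_with_summary: set[str]) -> str:
    """Check tasks that have a summary on disk and un-check those that do not."""

    def fix(match: re.Match) -> str:
        mark = "x" if match.group("task") in tasks_with_summary else " "
        return f"{match.group('head')}{mark}{match.group('tail')}"

    # Only the checkbox character changes; all other plan content is untouched.
    return TASK_LINE.sub(fix, plan_text)
```

Because only the character inside the brackets is rewritten, the rest of the plan stays byte-for-byte identical, which matches the rule that fix mode does not rewrite plan content.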

GSD doctor report.
Scope: M002/S01
Issues: 2 total · 1 error(s) · 1 warning(s) · 2 fixable
Fixes applied:
- cleared stale auto.lock
- marked T02 done in .gsd/milestones/M002/slices/S01/S01-PLAN.md

If doctor fix resolves the issues cleanly, resume execution:

> /gsd auto

When auto-mode starts after a crash, it reads the crash lock (if still present) and displays context-aware recovery information based on what was happening when it crashed — for example, confirming that completed work is preserved for an interrupted execute-task, or that a plan-slice unit may need to re-run. The crash lock is cleared on clean exit so this message only appears after an unclean stop.

Auto-mode re-derives state from disk on startup. If T02 never wrote a summary, auto-mode re-executes T02 from scratch — it reads T01’s summary for context and continues the slice. If T02’s summary exists but the plan checkbox was unchecked, doctor already fixed that and auto-mode picks up the next task.

Auto-mode also runs lightweight self-healing on every startup:

  • Stale runtime records — Clears dispatched records older than 1 hour (which means the process crashed before the unit could complete).
  • Complete-slice invariant repair — If a complete-slice unit has both a SUMMARY and UAT file on disk but the roadmap checkbox is still unchecked (crashed after writing artifacts but before updating the roadmap), auto-mode flips the checkbox automatically.
  • Merge state reconciliation — Checks for leftover MERGE_HEAD or SQUASH_MSG from a prior session. If all conflicts are resolved, it finalizes the commit. If conflicts involve only .gsd/ state files, it auto-resolves by accepting theirs. If code conflicts remain, it aborts and resets.
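The merge reconciliation rule is a three-way decision on the set of conflicted paths. The sketch below is an illustration under assumptions: the function name and the returned labels are hypothetical, and the list of conflicted paths is taken as given rather than read from git.

```python
def reconcile_merge(conflicted_paths: list[str]) -> str:
    """Pick a recovery action for a leftover merge, based on remaining conflicts."""
    if not conflicted_paths:
        return "finalize-commit"  # all conflicts resolved: finish the commit
    if all(path.startswith(".gsd/") for path in conflicted_paths):
        return "accept-theirs"  # only GSD state files conflict
    return "abort-and-reset"  # code conflicts remain
```

For example, `reconcile_merge([".gsd/STATE.md"])` yields `"accept-theirs"`, while adding any path outside `.gsd/` to the list yields `"abort-and-reset"`.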

In addition to startup self-healing, a proactive healing layer runs continuously during an auto-mode session:

  • Pre-dispatch health gate — Before each unit dispatch, checks for stale crash locks, corrupt merge state, missing STATE.md, missing integration branches, and low disk space. Attempts auto-repair of each. Blocks dispatch if critical issues cannot be resolved.
  • Health score tracking — After each unit, records a health snapshot and tracks trends over the last 50 snapshots. Emits level-change events when health transitions between green (no errors), yellow (1+ errors or degrading trend), and red (3+ consecutive error units).
  • Auto-heal escalation — After 5 consecutive units with unresolved errors, if the trend is not improving, escalates to LLM-assisted healing. Escalation fires at most once per auto-mode session.
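The thresholds above can be modelled as a small state tracker. This is a sketch only: `HealthTracker` and its method names are hypothetical, and the trend rule (comparing the two most recent snapshots) is an assumption, since the source does not say how the trend is computed.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HealthTracker:
    """Tracks per-unit error counts over a sliding window of 50 snapshots."""

    snapshots: deque = field(default_factory=lambda: deque(maxlen=50))
    consecutive_error_units: int = 0
    escalated: bool = False  # escalation fires at most once per session

    def record(self, error_count: int) -> str:
        """Store a snapshot for a finished unit and return the new level."""
        self.snapshots.append(error_count)
        if error_count > 0:
            self.consecutive_error_units += 1
        else:
            self.consecutive_error_units = 0
        return self.level()

    def trend(self) -> str:
        """Assumed trend rule: compare the two most recent snapshots."""
        if len(self.snapshots) < 2:
            return "stable"
        previous, latest = self.snapshots[-2], self.snapshots[-1]
        if latest > previous:
            return "degrading"
        return "improving" if latest < previous else "stable"

    def level(self) -> str:
        if self.consecutive_error_units >= 3:
            return "red"
        if (self.snapshots and self.snapshots[-1] > 0) or self.trend() == "degrading":
            return "yellow"
        return "green"

    def should_escalate(self) -> bool:
        """True once, after 5+ consecutive error units with no improvement."""
        if self.escalated or self.consecutive_error_units < 5:
            return False
        if self.trend() == "improving":
            return False
        self.escalated = True
        return True
```

Keeping the `escalated` flag on the tracker is one simple way to express the once-per-session limit: a second call to `should_escalate()` returns `False` even if errors continue.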

When a unit fails repeatedly: If automatic recovery exhausts all retries, auto-mode writes a BLOCKER placeholder artifact for the stuck unit and advances the pipeline. The placeholder marks the task as complete with a note that it failed recovery and needs manual review. This prevents a single stuck unit from halting the entire execution.

5. If structural issues remain — use heal mode


When auto-fix can’t resolve everything (non-fixable issues, missing artifacts that need real content), use heal mode to dispatch remaining issues to the LLM:

> /gsd doctor heal

Heal mode first applies all fixable repairs, then filters the remaining issues — all errors plus UAT-related warnings (all_tasks_done_missing_slice_uat and slice_checked_missing_uat) — and dispatches them to the LLM as a structured list. The LLM receives the full doctor report with issue codes, unit IDs, and file paths, and is instructed to generate real artifacts from existing context (task summaries, plan files) rather than leaving placeholders when possible.
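The filter described above can be sketched as follows. The `Issue` type and `select_for_heal` are hypothetical names used for illustration, and the sketch assumes every fixable issue was repaired successfully before filtering. The two warning codes are the ones named in the text.

```python
from dataclasses import dataclass

# UAT-related warnings that heal mode forwards alongside errors.
UAT_WARNING_CODES = {
    "all_tasks_done_missing_slice_uat",
    "slice_checked_missing_uat",
}


@dataclass(frozen=True)
class Issue:
    code: str
    severity: str  # "error" or "warning"
    unit_id: str
    fixable: bool


def select_for_heal(issues: list[Issue]) -> list[Issue]:
    """Return the issues that would be dispatched to the LLM."""
    # Assumption: fixable issues have already been repaired and drop out.
    remaining = [issue for issue in issues if not issue.fixable]
    return [
        issue
        for issue in remaining
        if issue.severity == "error" or issue.code in UAT_WARNING_CODES
    ]
```

Warnings with any other code are left out of the dispatch list, so they stay in the report without being sent to the LLM.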

GSD doctor heal prep.
Scope: M002/S01
Issues: 1 total · 1 error(s) · 0 warning(s) · 0 fixable
Doctor heal dispatched 1 issue(s) to the LLM.
● Investigating: all_tasks_done_missing_slice_uat for M002/S01...
Reading task summaries to build UAT script...
Wrote .gsd/milestones/M002/slices/S01/S01-UAT.md
GSD doctor heal complete.

6. If the crash is unusual — run /gsd forensics


When the failure is behavioral (keeps happening, unusual cost spikes, stuck loops) rather than structural, use forensics for a deeper investigation:

> /gsd forensics auto-mode keeps crashing on T02 of the image upload slice

Forensics gathers a structured report from multiple data sources:

  • Activity logs — Tool calls, reasoning traces, and errors from the crashed session (worktree-aware: checks both the project root and any active worktree)
  • Metrics ledger — Cost and timing data to detect spikes or excessive retries
  • Crash lock — PID liveness to confirm the crash
  • Doctor checks — Full structural scan run internally
  • Completed keys — Cross-referenced against expected artifacts to find stale completions

Forensics then detects anomalies — stuck loops, cost spikes, timeouts, missing artifacts, error traces — and dispatches the full report to the LLM for interactive root-cause analysis. The LLM traces from symptom to root cause in the GSD source code and produces a filing-ready GitHub issue draft with specific file references and a concrete fix suggestion.

Every forensic report is saved as a timestamped file in .gsd/forensics/ for future reference.

If neither doctor nor forensics can resolve the state automatically, you can repair it manually. When auto-mode detects a stuck loop, it provides concrete remediation steps based on the unit type.

Stuck execute-task:

  1. Write the task summary — Even a partial summary in T02-SUMMARY.md is sufficient to unblock the pipeline
  2. Mark the task done — Change - [ ] **T02: to - [x] **T02: in the slice plan
  3. Reconcile state — Run /gsd doctor fix to rebuild STATE.md and clear any stale runtime records
  4. Resume — Run /gsd auto to restart execution from the next task

Stuck complete-slice:

  1. Write the slice summary and UAT file
  2. Mark the slice [x] in the milestone roadmap
  3. Run /gsd doctor fix
  4. Resume auto-mode

Stuck plan-slice or research-slice:

  1. Write the slice plan or research file manually (or with the LLM in interactive mode)
  2. Run /gsd doctor fix to reconcile state
  3. Resume auto-mode

Stuck validate-milestone:

  1. Write the validation file with verdict: pass
  2. Run /gsd doctor fix
  3. Resume auto-mode

The recovery process may produce:

| File | Purpose |
| --- | --- |
| .gsd/STATE.md | Regenerated from current disk state |
| .gsd/auto.lock | Removed (stale lock cleared) |
| .gsd/milestones/*/slices/*/tasks/*-SUMMARY.md | Stub summary if task was done but summary missing |
| .gsd/milestones/*/slices/*/S*-PLAN.md | Task checkbox updated (checked or un-checked) based on summary presence |
| .gsd/milestones/*/M*-ROADMAP.md | Slice checkbox updated if all tasks done, or un-marked if premature |
| .gsd/milestones/*/slices/*/S*-UAT.md | Placeholder UAT if all tasks complete but UAT missing |
| .gitignore | Missing GSD runtime patterns appended |

| File | Purpose |
| --- | --- |
| .gsd/milestones/*/slices/*/S*-SUMMARY.md | Reconstructed from task summaries |
| .gsd/milestones/*/slices/*/S*-UAT.md | Generated from slice plan and task context |

| File | Purpose |
| --- | --- |
| .gsd/milestones/*/slices/*/tasks/*-SUMMARY.md | BLOCKER placeholder written when retry exhaustion occurs |

| File | Purpose |
| --- | --- |
| .gsd/forensics/report-*.md | Timestamped, redacted forensic report with filing-ready GitHub issue draft |

  • /gsd doctor — Structural health checks, auto-repair, and LLM-assisted heal mode
  • /gsd forensics — Behavioral investigation of stuck loops, cost spikes, and crash analysis
  • /gsd status — View current project state
  • /gsd auto — Resume auto-mode after recovery
  • /gsd skip — Skip a stuck unit and advance the pipeline