Skip to content

CI/CD Pipeline Design — GSD 2

A three-stage promotion pipeline for GSD 2 that moves merged PRs through Dev → Test → Prod using npm dist-tags as environment markers, GitHub Environments for approval gates, and Docker images for both CI acceleration and end-user distribution.

  1. Every merged PR is immediately installable via npx gsd-pi@dev
  2. Verified builds auto-promote to @next for early adopters
  3. Production releases require manual approval and optional live-LLM validation
  4. CI builds are fast and reproducible via pre-built Docker builder image
  5. End users can run GSD via Docker as an alternative to npm
  6. LLM-dependent behavior is testable without API calls via recorded fixtures
  • Replacing the existing PR gate workflow (ci.yml)
  • Replacing the native binary cross-compilation workflow (build-native.yml)
  • Cross-platform native binary builds (macOS/Windows remain on build-native.yml)
  • Hosting GSD as a web service
  • Automated prompt regression testing (future work)
┌─────────────────────────────────────────────────────────────┐
│ PR Merged to main │
│ ci.yml runs (build, test, typecheck) │
└──────────────────────────┬──────────────────────────────────┘
▼ (workflow_run: ci.yml success)
┌──────────────────────────────────────────────────────────────┐
│ STAGE: DEV Environment: dev │
│ │
│ 1. Version stamp: <current>-dev.<short-sha> │
│ 2. npm publish gsd-pi@<version>-dev.<sha> --tag dev │
│ 3. Smoke test: npx gsd-pi@dev --version │
│ │
│ Note: Build/test/typecheck already ran in ci.yml │
│ Docker: Build CI builder image (only if Dockerfile changed) │
└──────────────────────────┬──────────────────────────────────┘
▼ (auto-promote if all green)
┌──────────────────────────────────────────────────────────────┐
│ STAGE: TEST Environment: test │
│ │
│ 1. Install gsd-pi@dev from registry │
│ 2. CLI smoke tests (--version, init, help, config) │
│ 3. Dry-run fixture suite (recorded LLM conversations) │
│ - Agent session replay with fixture provider │
│ - Tool use round-trips verified │
│ - Extension loading validated │
│ 4. npm dist-tag add gsd-pi@<version> next │
│ │
│ Docker: Build + push runtime image to GHCR as :next │
└──────────────────────────┬──────────────────────────────────┘
▼ (manual approval required)
┌──────────────────────────────────────────────────────────────┐
│ STAGE: PROD Environment: prod │
│ │
│ 1. (Optional) Real LLM integration tests │
│ - Gated behind workflow input flag │
│ - Uses ANTHROPIC_API_KEY / OPENAI_API_KEY secrets │
│ - Budget-capped: small models, short conversations │
│ 2. npm dist-tag add gsd-pi@<version> latest │
│ 3. GitHub Release created with changelog │
│ 4. Docker: tag runtime image as :latest + :v<version> │
│ 5. Post-publish smoke test against @latest │
└──────────────────────────────────────────────────────────────┘
Dist-tagWhen publishedVersion formatRisk level
@devEvery merged PR2.27.0-dev.a3f2c1bBleeding edge
@nextAuto-promoted from DevSame version, new tagCandidate
@latestManually approved from TestSame version, new tagProduction

The -dev. prerelease identifier is distinct from the existing -next. convention used in build-native.yml. The two pipelines do not overlap — build-native.yml only triggers on v* tags and checks for -next. to determine npm dist-tag. The -dev. versions are published exclusively by pipeline.yml.

Dev versions (@dev tag) use the native binaries from the most recent stable build-native.yml release. The optionalDependencies in package.json use >= ranges, so a -dev. version of gsd-pi resolves the latest stable @gsd-build/engine-* packages from the registry.

If a PR modifies Rust native crate code (native/ directory), the dev publish will bundle stale native binaries. This is acceptable because:

  • Native crate changes are infrequent and always accompanied by a v* tag release
  • The Test stage validates the installed package works end-to-end
  • Full native binary validation happens via build-native.yml on the version tag
concurrency:
group: pipeline-${{ github.sha }}
cancel-in-progress: false

Policy:

  • Each pipeline run is keyed to its commit SHA — no two runs for the same commit race
  • Newer merges do NOT cancel in-progress promotions — a version already in the Test stage completes its promotion
  • If Run A is promoting version X to @next while Run B publishes version Y to @dev, they operate independently — @next and @dev point to different versions, which is correct
  • The Prod stage always promotes whatever version is currently at @next, so approving promotion after a newer version has already moved to @next promotes the newer one (last-writer-wins, which is the desired behavior)
FailureImpactRecovery
Dev publish succeeds, smoke test failsBroken version on @dev tagNext successful merge overwrites @dev. Manual fix: npm dist-tag add gsd-pi@<last-good> dev
Test stage fails after promoting to @nextBroken version on @next tagManual: npm dist-tag add gsd-pi@<last-good> next. @latest is never affected.
Prod promotion publishes @latest then found brokenBroken production releaseManual: npm dist-tag add gsd-pi@<previous-stable> latest and docker tag ghcr.io/gsd-build/gsd-pi:<previous> latest && docker push. Post-mortem required.
Docker push succeeds, npm dist-tag failsImages and npm out of syncRe-run the failed job (GitHub Actions retry). Images are tagged by version so stale tags are harmless.
GHCR push failsNo Docker image for this versionNon-blocking — npm publish is the primary distribution. Docker image can be rebuilt manually.

Rollback responsibility: any maintainer with npm publish rights and GHCR push access. The Prod environment’s required-reviewers list doubles as the rollback-authorized list.

FileTriggerPurposeStatus
ci.ymlPR opened/updated, push to mainPre-merge gate: build, test, typecheckUnchanged
build-native.ymlv* tag or manual dispatchCross-compile native binaries for 5 platformsUnchanged
pipeline.ymlworkflow_run (after ci.yml succeeds on main)Post-merge promotion: Dev → Test → ProdNew

The pipeline triggers via workflow_run after ci.yml completes successfully on main, avoiding duplicate build/test work. The Dev stage only performs version stamping, publishing, and smoke testing.

Two images from a single Dockerfile at the repo root.

  • Name: ghcr.io/gsd-build/gsd-ci-builder
  • Base: node:22-bookworm
  • Contains: Node 22, Rust stable toolchain, aarch64-linux-gnu cross-compiler
  • Size: ~2 GB
  • Tags: :latest, :<YYYY-MM-DD> (date-stamped for rollback)
  • Rebuilt: Only when Dockerfile changes
  • Used by: pipeline.yml Dev stage, optionally ci.yml
  • Purpose: Eliminates 3-5 min toolchain install on every CI run

The builder image does NOT include Playwright system deps (not needed for current CI jobs). If browser-based E2E tests are added later, Playwright deps can be added at that point.

Builder images are tagged with both :latest and a date stamp (e.g., :2026-03-17). The pipeline.yml workflow pins to a specific date-stamped tag. When the Dockerfile is updated, the PR that changes it also updates the tag reference in pipeline.yml. This prevents a broken Dockerfile change from silently breaking all subsequent runs.

  • Name: ghcr.io/gsd-build/gsd-pi
  • Base: node:22-slim
  • Contains: Node 22, git, gsd-pi installed globally
  • Size: ~250 MB
  • Tags: :latest, :next, :v2.27.0
  • Published: On every Prod promotion
  • Purpose: docker run ghcr.io/gsd-build/gsd-pi as alternative to npx
  • Bookworm for CI: The Rust native crates depend on vendored libgit2, image processing, and cross-compilation to ARM64. Debian Bookworm provides the full toolchain via apt. Alpine breaks due to musl vs glibc incompatibilities with N-API bindings.
  • Slim for runtime: Only needs Node + git. Native .node binaries are prebuilt and bundled in the npm package — no Rust toolchain needed at runtime.

The fixture system hooks into the pi-ai provider abstraction layer to capture and replay LLM conversations without hitting real APIs.

Agent Session
pi-ai provider abstraction
FixtureProvider (intercept layer)
├── record mode → Real API + save to fixture JSON
└── replay mode → Load fixture JSON (no API call)

The FixtureProvider implements the Provider interface from @gsd/pi-ai (the same interface all 20+ built-in providers implement). It registers itself via environment variable detection at provider initialization:

// Pseudocode — actual implementation will follow pi-ai patterns
import type { Provider, StreamingResponse } from "@gsd/pi-ai";
class FixtureProvider implements Provider {
// In record mode: wraps the real provider, saves responses
// In replay mode: returns saved responses directly
async *stream(request: ProviderRequest): AsyncGenerator<StreamingResponse> {
if (this.mode === "replay") {
// Yield fixture response chunks (simulated streaming)
yield* this.replayTurn(this.turnIndex++);
} else {
// Proxy to real provider, capture response
const chunks = [];
for await (const chunk of this.realProvider.stream(request)) {
chunks.push(chunk);
yield chunk;
}
this.saveTurn(request, chunks);
}
}
}

Key integration details:

  • Streaming: Fixture replay simulates streaming by yielding saved response chunks with minimal delay. This exercises the same consumer code paths as real streaming.
  • Registration: When GSD_FIXTURE_MODE is set, the fixture provider wraps the configured real provider. No changes to provider selection logic needed.
  • Provider-agnostic: Fixtures are captured at the Provider interface level (above HTTP transport), so they work regardless of which underlying provider was used during recording.
ModeTriggerBehavior
RecordGSD_FIXTURE_MODE=record GSD_FIXTURE_DIR=./fixturesWraps real provider, saves request/response pairs
ReplayGSD_FIXTURE_MODE=replay GSD_FIXTURE_DIR=./fixturesReturns saved responses, zero API calls
OffDefault (no env vars)Normal operation, no interception

One JSON file per recorded session:

{
"name": "agent-creates-file",
"recorded": "2026-03-17T00:00:00Z",
"provider": "anthropic",
"model": "claude-sonnet-4-6",
"turns": [
{
"request": {
"messages": [{ "role": "user", "content": "Create hello.ts" }],
"tools": ["Write", "Read"],
"model": "claude-sonnet-4-6"
},
"response": {
"content": [
{ "type": "text", "text": "I'll create hello.ts for you." },
{ "type": "tool_use", "name": "Write", "input": { "file_path": "hello.ts", "content": "console.log('hello')" } }
],
"stopReason": "toolUse",
"usage": { "input": 150, "output": 45 }
}
}
]
}

Turn-index based. Response N is served for request N in sequence. If the conversation diverges from the fixture (e.g., unexpected turn count), the test fails explicitly with a descriptive error rather than silently producing wrong results.

Why not request-body hashing: request bodies contain timestamps, random IDs, and system prompt variations that cause brittle mismatches.

Why not a generic HTTP VCR: The pi-ai layer abstracts 20+ providers with different wire formats. Intercepting above the transport means fixtures are provider-agnostic.

  • Agent session lifecycle (start → tool calls → completion)
  • Tool dispatch and response handling
  • Multi-turn conversation flow
  • Extension loading and routing
  • Error handling paths (fixtures can include error responses)

What Does NOT Get Tested (Deferred to Live Gate)

Section titled “What Does NOT Get Tested (Deferred to Live Gate)”
  • Model output quality
  • Prompt regression
  • New tool compatibility with live APIs

Committed to repo under tests/fixtures/recordings/. Each fixture is 5-50KB of JSON. Recording is a manual developer action, not automated in CI.

Old -dev. versions accumulate on npm with every merged PR. A scheduled workflow (cleanup-dev-versions.yml) runs weekly and unpublishes dev versions older than 30 days via npm unpublish gsd-pi@<old-dev-version>. This prevents registry bloat while keeping recent dev versions available.

tests/
├── smoke/ # CLI smoke tests (Stage: Test)
│ ├── run.ts
│ ├── test-version.ts
│ ├── test-help.ts
│ └── test-init.ts
├── fixtures/ # Recorded LLM replay tests (Stage: Test)
│ ├── run.ts # Test runner
│ ├── record.ts # Recording helper
│ ├── provider.ts # FixtureProvider intercept layer
│ └── recordings/
│ ├── agent-creates-file.json
│ ├── agent-reads-and-edits.json
│ ├── agent-handles-error.json
│ └── agent-multi-turn-tools.json
├── live/ # Real LLM tests (Stage: Prod, optional)
│ ├── run.ts
│ ├── test-anthropic-roundtrip.ts
│ └── test-openai-roundtrip.ts
scripts/
├── version-stamp.mjs # Stamps <version>-dev.<sha>
Dockerfile # Multi-stage: builder + runtime
.github/workflows/pipeline.yml # Promotion pipeline
.github/workflows/cleanup-dev-versions.yml # Weekly dev version pruning

All test files use .ts with --experimental-strip-types for consistency with the existing test convention in the project.

{
"test:smoke": "node --experimental-strip-types tests/smoke/run.ts",
"test:fixtures": "node --experimental-strip-types tests/fixtures/run.ts",
"test:fixtures:record": "GSD_FIXTURE_MODE=record node --experimental-strip-types tests/fixtures/record.ts",
"test:live": "GSD_LIVE_TESTS=1 node --experimental-strip-types tests/live/run.ts",
"pipeline:version-stamp": "node scripts/version-stamp.mjs",
"docker:build-runtime": "docker build --target runtime -t ghcr.io/gsd-build/gsd-pi .",
"docker:build-builder": "docker build --target builder -t ghcr.io/gsd-build/gsd-ci-builder ."
}
SettingValue
Environment: devNo protection rules
Environment: testNo protection rules (auto-promote)
Environment: prodRequired reviewers: maintainers
Secret: NPM_TOKENAll environments
Secret: ANTHROPIC_API_KEYProd only
Secret: OPENAI_API_KEYProd only
GHCREnabled for org
  1. A merged PR is installable via npx gsd-pi@dev within 15 minutes (assumes warm CI builder image cache)
  2. Fixture replay tests complete in under 60 seconds with zero API calls
  3. The full Dev → Test promotion completes without human intervention
  4. Prod promotion is blocked until a maintainer explicitly approves
  5. docker run ghcr.io/gsd-build/gsd-pi --version returns the correct version
  6. Existing ci.yml and build-native.yml workflows continue to work unchanged
  7. CI builder image reduces toolchain setup from ~3-5 min to ~30s pull