Session-First Architecture
Status: Approved with risks (v6 — post-review round 5)
Author: Design review collaboration
Date: 2026-03-08

Problem Statement
Gas City has two parallel session-management models that don’t interoperate:

- Agent-centric (controller path): config.Agent → buildOneAgent() → agent.Agent → runtime.Provider.Start(). The controller rebuilds the full agent list from config on every tick. Sessions are an implementation detail — tmux session names derived from agent names.
- Session-centric (gc session path): session.Manager.Create() → bead (type = "session") → runtime.Provider.Start(). Sessions are persistent, resumable, bead-backed. But the controller doesn’t know about them.
- Pool members have no persistence. When a pool member is stopped, its history disappears. There’s no way to query old pool sessions.
- The agent.Agent interface is redundant. It’s a thin wrapper around runtime.Provider + a session name. session.Manager already provides the same operations plus persistence.
- Config-driven identity is fragile. Pool instances get slot-based names (worker-3) that change when scaling happens. Sessions need stable identity.
- Two code paths to maintain. buildOneAgent (200+ lines) and session.Manager.Create do overlapping work with different models.
Design Principles
- Session is the primitive. A session is a persistent, bead-backed conversation between a human/system and an agent. It has stable identity (bead ID), lifecycle state, and history.
- Templates replace agent types. config.Agent becomes a session template. The single/multi/pool distinction becomes a policy on how many concurrent sessions a template allows and how they’re scaled.
- The controller manages sessions, not agents. Instead of rebuilding agent.Agent objects from config each tick, the controller reconciles session beads against desired state.
- Pool growth = new session. Pool shrink = drain + archive session. Old pool sessions remain queryable but don’t receive new work.
- Single-writer per lifecycle. At every migration phase, exactly one system owns runtime lifecycle mutations. No dual-writer ambiguity.
- Fail closed. On partial failure (bead store errors, stale reads), the controller aborts the tick rather than acting on incomplete data.
Core Invariants
INV-1: Creating a session requires only a target template. A template is a reusable agent definition (provider, prompt, env, hooks, etc.) drawn from [[agent]] config. Creating a session from a template resolves
the provider, builds the command, and starts the runtime.
INV-2: Non-pool templates allow unlimited concurrent sessions.
Any template without pool config can have an arbitrary number of sessions.
The controller doesn’t enforce a count limit — sessions are created on demand
and persist until closed.
INV-3: Pool templates have bounded occupancy.
A pool template’s max field caps occupancy: the count of creating +
active + suspended + quarantined sessions. Archived, draining, and
closed sessions do NOT count. Growing = create new session (reserves a
creating slot) or reactivate archived. Shrinking = drain + archive
excess sessions.
INV-4: Sessions support template overlay at creation time.
A session can override a strict allowlist of template defaults (model, name,
title, prompt) and per-template-allowed env vars at creation time. The
overlay is stored on the session bead so resume uses the same overrides.
Overlays are a second config source — the session bead records the effective
configuration, and gc session inspect shows both template defaults and
overlay overrides for full transparency.
INV-5: Single controller exclusivity and single source of truth.
Only one controller process manages session lifecycle at a time. Enforced by
controller.lock (flock). The reconciliation loop is single-threaded —
no concurrent tick execution. All lifecycle mutations (including CLI
commands like gc session close) go through the controller socket. The
CLI sends mutation requests via controller.sock; the controller applies
them within the event loop and updates the in-memory index synchronously.
No out-of-band bead store writes for lifecycle state.
Architecture
Session Bead Schema
Every session is a bead with type = "session". The bead stores all state
needed to start, resume, suspend, and query the session.
State Reason Values
Every state transition records the reason. Valid values:

| State | Valid Reasons |
|---|---|
| creating | pool_scale_up, user_request, config_drift_replace |
| active | creation_complete, resumed, reactivated, quarantine_cleared |
| suspended | user_request, idle_timeout, dependency_down |
| draining | scale_down, config_drift, manual |
| archived | drain_complete, drain_timeout, crash_during_drain, suspended_scale_down, quarantine_evicted |
| quarantined | crash_loop |
| closed | user_request, pruned, manual, stale_creating |
crash_recovery is used internally by the repair table for
active → suspended transitions during crash recovery, mapping to
suspended with state_reason=crash_recovery.
Two-Axis State Model
Session state uses two axes:

- bead.status ∈ {open, closed}: Record-level lifecycle. closed is terminal and immutable. Maps to the bead store’s native status field.
- metadata.state ∈ {creating, active, suspended, draining, archived, quarantined}: Operational lifecycle within an open bead.

Rules:

- closed beads MUST have bead.status = "closed". The state field is not meaningful for closed beads (set to empty string on close).
- All other states require bead.status = "open".
- CLI output maps both axes: a bead with status=closed shows state closed regardless of the metadata state field.
Pool Occupancy Accounting
Which states count against pool max:
| State | Counts Against max | Rationale |
|---|---|---|
| creating | Yes | Reserves capacity; prevents creation burst |
| active | Yes | Running and receiving work |
| suspended | Yes | Holds context, temporarily paused |
| draining | No | Being retired, already de-routed |
| archived | No | Retired, no resources held |
| quarantined | Yes* | Holds a slot; see note below |
| closed | No | Terminal |
*Quarantined sessions hold their slot and count against max to prevent replacement. When a
quarantined session’s cooldown expires, the reconciler checks current pool
occupancy. If the pool is at max (because other sessions were created),
the quarantined session transitions to archived instead of active.
This prevents max violations from quarantine reactivation.
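The occupancy rule in this table can be expressed as a small pure predicate. A minimal sketch in Go, with illustrative type and function names (not the actual Gas City code):

```go
package main

// State is a session's operational lifecycle state (metadata.state,
// plus "closed" for terminal beads).
type State string

const (
	Creating    State = "creating"
	Active      State = "active"
	Suspended   State = "suspended"
	Draining    State = "draining"
	Archived    State = "archived"
	Quarantined State = "quarantined"
	Closed      State = "closed"
)

// countsAgainstMax reports whether a session in the given state
// occupies a pool slot, per the occupancy table above.
func countsAgainstMax(s State) bool {
	switch s {
	case Creating, Active, Suspended, Quarantined:
		return true
	default: // draining, archived, closed do not hold a slot
		return false
	}
}

// poolOccupancy sums the occupied slots for one template's sessions.
func poolOccupancy(states []State) int {
	n := 0
	for _, s := range states {
		if countsAgainstMax(s) {
			n++
		}
	}
	return n
}
```

A pool with one creating, one active, one draining, one archived, and one quarantined session reports an occupancy of 3 against max.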
Session Name Convention
Session names use the {template}-{short-hash} format, where short-hash is the
first 6 characters of the bead ID. Examples: polecat-a3f2b7, worker-b7c1d9.
Six characters provide ~16 million values per template, making collisions
negligible. On collision (detected at creation), a 7th character is appended.
This preserves operator ergonomics (tab-completable, human-readable) while
maintaining stable identity via the bead ID internally. Pool sessions also
store a pool_slot metadata field with a sequential number, visible in
default gc session list output and usable as a CLI selector via
gc session attach worker~3 syntax.
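The naming rule might be sketched as follows, with collision handling against a set of existing names; `sessionName` and `taken` are hypothetical:

```go
package main

// sessionName builds {template}-{short-hash} from the first six
// characters of the bead ID, appending one more character per
// collision against already-taken names.
func sessionName(template, beadID string, taken map[string]bool) string {
	for n := 6; n <= len(beadID); n++ {
		name := template + "-" + beadID[:n]
		if !taken[name] {
			return name
		}
		// Collision: extend the short-hash by one character and retry.
	}
	return template + "-" + beadID // fall back to the full bead ID
}
```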
Generation and Instance Token
- generation: Incremented each time a session transitions from archived → active (reactivation). Starts at 1 on creation. Used for auditing how many incarnations a pool slot has had.
- instance_token: Random value set on create and reactivate. The drain protocol checks this token — if the token on the bead doesn’t match the controller’s expected value, the drain targets a stale incarnation and is aborted. Prevents races where a drain for incarnation N arrives after incarnation N+1 has started.
Session States
creating: The bead exists but the runtime is not yet confirmed live. The
pool: label is NOT set. creating_at records when this state was entered.
If the runtime
starts successfully and IsRunning() confirms liveness, transitions to
active (and pool: label is added for pool sessions). If the bead remains
in creating for longer than creation_timeout (default 60s), the
reconciler treats it as stale: checks IsRunning() — if alive, completes
the transition to active; if dead, closes the bead with
state_reason=stale_creating. creating beads count against pool max
to prevent creation bursts during slow provider starts. Visible in
gc session list default output with state creating.
active: Has a live runtime session. Receives work (for pool sessions).
Crash bookkeeping: crash_count incremented on unexpected exit, reset on
successful operation. If crash_count exceeds max_restarts_per_window
within restart_window, transitions to quarantined.
Single crash (below threshold): On unexpected runtime exit while
crash_count is below the quarantine threshold, the controller
restarts the runtime in-place (re-runs Start() on the existing bead)
without changing state. The pool: label remains set during the brief
restart window; the next tick detects non-liveness if restart fails and
increments crash_count. This is a restart-in-place, not a state
transition — the session remains active throughout.
suspended: No runtime resources. Resumable with full context. User- or
system-initiated pause. Counts against pool max (the session is paused,
not retired). For pool sessions, the pool: label is removed on suspend
(same pattern as draining — a non-running session must not be routable).
The member:{template} label preserves pool membership for queries.
suspended → archived occurs when the controller needs to scale down and
finds suspended sessions (archived first before draining active sessions).
draining: Transitional state for pool sessions being scaled down. The
pool: label is removed (stops new work routing), the runtime continues
until in-flight work completes or drain_timeout expires. On completion,
transitions to archived. The runtime is NOT killed until drain completes.
If the runtime crashes during drain, transitions immediately to archived
with state_reason=crash_during_drain (no quarantine — already being
retired). Does not increment crash_count.
archived: No runtime resources. Queryable but does NOT receive new work.
Used for old pool sessions. Can be reactivated if the pool needs to grow
and wake_mode=resume. Non-pool sessions cannot enter this state.
quarantined: No runtime resources. Auto-restarts blocked until
quarantine_until timestamp passes (exponential backoff, capped at 5min).
quarantine_cycle is incremented on each quarantined → active transition
(persisted on the bead, survives controller restart). On cooldown expiry,
the reconciler checks pool occupancy: if the pool is at max, the session
transitions to archived instead of active. If it can reactivate, it
transitions to active and resets crash_count (but not quarantine_cycle).
After quarantine_max_attempts (default 3) cycles without sustained healthy
operation (defined as quarantine_healthy_duration, default 5 minutes,
without crash after reactivation), the session is evicted: transitioned
to archived with state_reason=quarantine_evicted. This frees the slot
for fresh capacity. A session.quarantine.evicted event is emitted for
operator attention.
closed: Terminal. Bead status set to "closed". The metadata state
field is cleared. History preserved. Sensitive metadata (session_key,
overlay.env.*, overlay.prompt) scrubbed on close (scrub BEFORE marking
closed on ExecStore to ensure fail-closed). Any beads claimed by this
session are marked blocked with reason=session_closed.
Orphan Work Cleanup
All state transitions that terminate or abandon a runtime MUST clean up claimed work. This applies to:

| Transition | Orphan Action |
|---|---|
| drain timeout | Mark claimed beads blocked (reason=session_archived) |
| crash during drain | Mark claimed beads blocked (reason=session_crash_drain) |
| gc session close | Mark claimed beads blocked (reason=session_closed) |
| gc session suspend | Mark claimed beads blocked (reason=session_suspended) |
| active → quarantined | Mark claimed beads blocked (reason=session_quarantined) |
Cleanup queries by the session’s session_name or bead ID to identify
claimed work. This is a single query + batch update, executed before the
state transition is written.
Atomic State Mutations
State transitions that involve multiple field changes (e.g., archive requires state → archived + label removal + an archived_at timestamp) MUST be written
as a single SetMetadataBatch call. The bead store guarantees batch writes
are atomic for MemStore and FileStore (single lock). For ExecStore
(bd/br CLI), writes are sequential but ordered to fail closed:
Creation ordering (fail closed):

- Create bead with state=creating, NO pool: label
- Start runtime
- Confirm liveness (IsRunning())
- Set state=active, state_reason=creation_complete (batch)
- Add pool: label (enables routing — only after runtime confirmed)

Suspend ordering (fail closed):

- Remove pool: label (stops routing)
- Set state=suspended, suspended_at, state_reason (batch)
- Kill runtime

Archive ordering (fail closed):

- Remove pool: label (stops routing — safe even if crash follows)
- Set state=archived, archived_at, state_reason (batch)
- Kill runtime

Reactivation ordering (fail closed):

- Start runtime (session must be alive before routing)
- Confirm runtime liveness (IsRunning())
- Set state=active, state_reason=reactivated, generation++ (batch)
- Add pool: label (enables routing — only after runtime confirmed)

Resume ordering (fail closed):

- Start runtime
- Confirm runtime liveness (IsRunning())
- Set state=active, state_reason=resumed (batch)
- Add pool: label (enables routing — only after runtime confirmed)
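The creation ordering can be sketched end to end; `store`, `runtimeProvider`, and `fakeRT` are illustrative stand-ins for the bead store and runtime.Provider, not the real interfaces:

```go
package main

import "fmt"

// store is a toy stand-in for a session bead.
type store struct {
	state  string
	labels map[string]bool
}

// runtimeProvider is a toy stand-in for runtime.Provider.
type runtimeProvider interface {
	Start() error
	IsRunning() bool
}

// createSession applies the fail-closed creation ordering: the pool:
// label (routing eligibility) is only added after the runtime is
// confirmed live, so a crash at any step leaves the session de-routed.
func createSession(s *store, rt runtimeProvider, pooled bool) error {
	s.state = "creating"               // 1. bead exists, NOT routable
	if err := rt.Start(); err != nil { // 2. start runtime
		return err // bead stays in creating; the reconciler cleans it up
	}
	if !rt.IsRunning() { // 3. confirm liveness
		return fmt.Errorf("runtime not live after start")
	}
	s.state = "active" // 4. batch metadata write in the real store
	if pooled {
		s.labels["pool:"] = true // 5. enable routing last
	}
	return nil
}

type fakeRT struct{ live bool }

func (f *fakeRT) Start() error    { f.live = true; return nil }
func (f *fakeRT) IsRunning() bool { return f.live }
```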
ExecStore Partial-Failure Repair Table
For ExecStore (sequential writes), a crash between steps can leave intermediate states. The reconciler detects and repairs these deterministically. The guiding principle is fail closed: when in doubt, leave the session de-routed (no pool: label) rather than
accidentally routing work to a broken session.
| state | Has pool: label | Runtime running | Is pool session? | Repair Action |
|---|---|---|---|---|
| creating | No | Yes | Yes | Complete: set state=active, add pool: label |
| creating | No | Yes | No | Complete: set state=active |
| creating | No | No | Any | Close bead (stale_creating) if age > creation_timeout |
| active | No | Yes | Yes | If pool under max: restore label. If at max: begin drain. |
| active | No | Yes | No | No repair needed (non-pool, no label expected) |
| active | No | No | Any | Set state=suspended, state_reason=crash_recovery |
| draining | Yes | Yes | Yes | Remove pool: label (interrupted drain start) |
| draining | Yes | No | Yes | Remove pool: label, set state=archived |
| draining | No | Yes | Yes | No repair needed (drain in progress) |
| draining | No | No | Yes | Set state=archived (drain crash completion) |
| archived | Yes | No | Yes | Remove pool: label (interrupted archive) |
| archived | No | Yes | Yes | Kill runtime (should not be running) |
| suspended | Yes | No | Yes | Remove pool: label (interrupted suspend) |
| suspended | No | Yes | Any | Kill runtime (should not be running) |
| quarantined | Yes | No | Yes | Remove pool: label (interrupted quarantine entry) |
| quarantined | Yes | Yes | Yes | Remove pool: label, kill runtime |
| quarantined | No | Yes | Any | Kill runtime (quarantined should not be running) |
| quarantined | No | No | Any | No repair needed (correct quarantine state) |
An active pool session missing its pool: label is
auto-healed based on pool occupancy. If the pool is under max, the label
is restored (the session was likely interrupted during creation). If at
max, the session is drained (it was likely interrupted during retirement).
A session.repair.active_no_label event is emitted in both cases for
operator visibility.
The repair table is the single source of truth for crash recovery. Each
row is a test case in TestExecStore_PartialFailureRepair.
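Each row of the table maps naturally onto a table-driven Go test. A sketch covering two rows, with illustrative names (the real repair logic covers all eighteen):

```go
package main

// observed captures one row's input columns from the repair table.
type observed struct {
	state    string
	hasLabel bool
	running  bool
	isPool   bool
}

// repairAction returns the repair for a small subset of the table.
func repairAction(o observed) string {
	switch {
	case o.state == "draining" && o.hasLabel && !o.running && o.isPool:
		return "remove pool: label, set state=archived"
	case o.state == "suspended" && !o.hasLabel && o.running:
		return "kill runtime"
	default:
		return "no repair"
	}
}
```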
Template Model
Templates are defined incity.toml via [[agent]] — the existing config
format. The key shift is conceptual: agents become templates, and templates
produce sessions.
Pool templates scale between min and max based on the check command. Excess sessions are drained
then archived (not destroyed). Archived sessions can be reactivated (warm)
or new fresh sessions created, controlled by wake_mode.
check command failure behavior: If check returns a non-zero exit
code, times out (10s default), or produces non-numeric output, the
controller logs a warning and skips scaling for that template on this tick.
It does NOT default to 0 or any assumed count — this preserves the current
session count (fail static).
Scale targets and tick budget: The controller executes check commands
concurrently across templates (goroutine per template, bounded by
runtime.NumCPU()), with a hard per-tick deadline of 30 seconds.
Templates whose check command hasn’t returned by the deadline are skipped
for that tick. The in-memory index makes drain-completion checks O(1) per
session (the index tracks claimed-work counts, updated on bead mutations).
Claimed-work synchronization: Work claims are made out-of-band by
agents (not through the controller socket). The controller synchronizes
its claim index via an authoritative query of the bead store
immediately before transitioning from draining → archived. This
ensures no race between a late claim and archival. During normal ticks,
the index maintains an approximate claim count for display/scheduling
purposes via the bead mutation feed (if available) or periodic scan
(every 10 ticks). The authoritative pre-archive query is the safety
gate — approximate counts only affect scheduling priority, not
correctness.
Target: reconciliation tick completes in <1s for 50 templates × 100
sessions with warm index.
Controller Reconciliation
The controller’s tick loop changes from “rebuild agents from config” to “reconcile session beads against desired state.”

Current Flow (agent-centric)
Target Flow (session-first)
Reconciliation Idempotency
The reconciliation loop MUST be idempotent — running the same tick twice with the same inputs produces the same result. This is guaranteed by:

- Single-controller exclusivity. controller.lock (flock) ensures only one controller process runs. The reconciliation loop is single-threaded within that process. No concurrent tick execution.
- Creation-intent markers. When creating a new session, the controller first creates a bead with state=creating and a deterministic key (template:{name}:tick:{tick_id}:slot:{n}). Before creating, it checks for existing creating beads from prior ticks and reconciles them (either completing the creation or terminating the partial bead).
- Fail-closed startup. If the startup index population (populateIndex()) fails, the controller does not start reconciliation. During normal operation, the in-memory index is the authoritative source (maintained synchronously). If a bead store write fails during a mutation, the index is NOT updated — the mutation is retried next tick.
In the target flow, the controller no longer builds agent.Agent
objects. It reads config templates, evaluates pool desired counts, and
manages session beads directly. Runtime operations go through
session.Manager (or a thin wrapper), not agent.Agent.
Config Hash Canonicalization
The config_hash field detects whether a session’s effective configuration
has drifted from its template. The hash is computed over the effective
resolved config (template defaults merged with overlay overrides) to
correctly detect drift for overlaid sessions.
- Field inclusion list (behavioral fields only): provider, command (resolved), prompt_template (content hash), env (sorted key=value pairs, including overlay env), work_dir, hooks (sorted), model, wake_mode, session_setup, session_setup_script, pre_start.
- Excluded from hash (non-behavioral): TOML whitespace, comments, key ordering, name, title, description, pool scaling config (min, max, check), drain_timeout, archive_order, max_archived.
- Canonicalization: Fields sorted lexicographically, values normalized (paths resolved, env sorted), concatenated as key=value\n, SHA-256 hashed, truncated to 16 hex characters.
- Drift response: On drift detection, sessions are drained in a rolling update — at most max_unavailable (default 1) sessions per template are drained simultaneously per tick. This prevents a template config change from dropping pool capacity to zero. After each drained session is archived, a replacement is created with the updated config. A bounded retry prevents churn: if drift-triggered recreates exceed 3 per 10 minutes (tracked on the bead via drift_recreate_count and drift_recreate_window), the controller logs a warning and skips further drift recreates for that template until the window expires.
- Unit test requirement: A test MUST prove that semantically identical configs with different TOML formatting produce identical hashes. A separate test MUST prove that template + overlay produces the same hash as the equivalent flat config.
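The canonicalization steps can be sketched as one function over already-resolved fields; `configHash` is a hypothetical name, and the field map is assumed to contain only the included behavioral fields:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// configHash canonicalizes resolved behavioral fields: sort keys
// lexicographically, join as key=value\n, SHA-256, truncate to 16 hex
// characters.
func configHash(fields map[string]string) string {
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "%s=%s\n", k, fields[k])
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])[:16]
}
```

Because map iteration order never reaches the hash (keys are sorted first), two configs that differ only in TOML formatting or key order hash identically, which is exactly what the unit-test requirement checks.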
Pool Session Lifecycle
Pool sessions are the most complex case. Here’s the complete lifecycle:

Drain Protocol
When the controller decides to archive a pool session:

- Remove pool: label — prevents new work from being routed.
- Set state=draining, drain_started, state_reason.
- Wait for in-flight work. The controller checks each tick whether the session has any open beads claimed by this session (assigned work, not just ready-queue presence). The check uses the session’s session_name or bead ID to identify claimed work, not the pool label.
- On drain complete (no claimed work): set state=archived, send SIGTERM to runtime, wait 5s, then SIGKILL if still running.
- On drain timeout (drain_timeout from pool config, default 30s): set state=archived with state_reason=drain_timeout, send SIGTERM then SIGKILL. Any orphaned beads are marked blocked with reason=session_archived.
- On crash during drain (runtime exits unexpectedly while draining): set state=archived with state_reason=crash_during_drain. Any orphaned beads are marked blocked with reason=session_crash_drain (same cleanup as drain timeout).
Work Routing for Pools
Work discovery must exclude non-active sessions. The pool:{template}
label is the routing gate — it means “eligible for new work dispatch NOW”:
- Creating sessions: No pool: label → no routing.
- Active sessions: Have the pool: label → receive work.
- Suspended sessions: pool: label removed on suspend → no routing. The member:{template} label preserves pool membership for queries. On resume, the pool: label is restored after runtime liveness is confirmed.
- Draining sessions: pool: label already removed → no new work.
- Archived sessions: pool: label removed → no new work. The template: and member: labels preserve associations for queries.
Routing reduces to a single check on pool: label presence. The
pool: label is ONLY present on active sessions with confirmed-live
runtimes. No metadata inspection needed at routing time.
Session Creation with Overlay
When creating a session, the caller can override template defaults from a strict allowlist:

Overlay Allowlist
| Key | Description |
|---|---|
| model | Override provider model |
| name | Override session display name |
| title | Override session title |
| prompt | Append to template prompt (see note) |
| env.{KEY} | Override environment variable (per-template allowlist) |
The overlay.prompt value is appended to the
template’s prompt_template content (separated by \n\n---\n\nAdditional context provided at session creation:\n\n). It cannot replace or remove
template prompt content — the template’s safety instructions and identity
are always preserved. The overlay is explicitly framed as lower-trust
supplementary context, not as instructions that override the template.
Templates can disable prompt overlay entirely by omitting prompt from
allow_overlay (a new config field, default: ["model", "name", "title"]
— prompt overlay requires explicit opt-in via allow_overlay = ["model", "name", "title", "prompt"]). A size cap of 16KB is enforced. Logged as
session.overlay.prompt event. Scrubbed on close. Redacted in all
gc session inspect output (shows [16KB appended] not content).
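The append-only composition might look like this sketch; the separator string and 16KB cap come from the text above, while the function name and reject-by-boolean shape are illustrative:

```go
package main

// overlaySep frames overlay text as lower-trust supplementary context.
const overlaySep = "\n\n---\n\nAdditional context provided at session creation:\n\n"

// maxOverlayPrompt is the enforced overlay size cap (16KB).
const maxOverlayPrompt = 16 * 1024

// composePrompt appends the overlay after the full template prompt.
// The template's content always comes first and can never be replaced.
func composePrompt(templatePrompt, overlay string) (string, bool) {
	if overlay == "" {
		return templatePrompt, true
	}
	if len(overlay) > maxOverlayPrompt {
		return "", false // reject oversized overlays at Create time
	}
	return templatePrompt + overlaySep + overlay, true
}
```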
Environment Variable Override Security
Environment variable overrides use a per-template allowlist, not a global denylist. Templates declare which env vars may be overridden:

- Only keys listed in allow_env_override are accepted via env.{KEY}.
- If allow_env_override is omitted, no env overrides are permitted.
- Env key names must match ^[A-Z][A-Z0-9_]{0,127}$.
- This eliminates the fragile denylist approach entirely — templates opt in to exactly which variables callers may override.
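A sketch of the validation, combining the key pattern with the per-template allowlist; `envOverrideAllowed` is a hypothetical helper:

```go
package main

import "regexp"

// envKeyRe is the documented env key pattern.
var envKeyRe = regexp.MustCompile(`^[A-Z][A-Z0-9_]{0,127}$`)

// envOverrideAllowed accepts a key only if it matches the pattern AND
// appears in the template's allow_env_override list. A nil or empty
// list means no overrides are permitted.
func envOverrideAllowed(key string, allowEnvOverride []string) bool {
	if !envKeyRe.MatchString(key) {
		return false
	}
	for _, a := range allowEnvOverride {
		if a == key {
			return true
		}
	}
	return false
}
```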
Banned Overlay Keys (rejected at Create time)
These keys are always rejected regardless of template config:

- Command/provider: command, provider, resume_flag, resume_style
- Internal state: session_key, state, generation, instance_token
- Any key not in the allowlist above

Validation happens at Create() time. Unknown keys outside the allowlist are rejected with an error listing valid keys.
Overlay fields are stored on the session bead (prefixed with overlay.)
so that resume reconstructs the same configuration.
Overlay revalidation on resume/reactivate: When a session is resumed
or reactivated, stored overlays are revalidated against the current
template policy (allow_overlay, allow_env_override). If the template
owner has revoked an overlay key since the session was created, the
offending overlay fields are stripped from the bead and the session
resumes with the template default for that field. A
session.overlay.stripped event is emitted listing the removed fields.
This prevents archived sessions from bypassing updated security policies.
The config hash is recomputed after stripping — if this changes the hash,
a drift event is also emitted.
Template resolution at start time merges: template defaults ← overlay fields.
gc session inspect {session} shows the effective configuration with both
layers visible for debugging.
Session Key Lifecycle
The session_key is a provider-specific resume handle (e.g., Claude’s
--resume session ID). It requires lifecycle management:
- Set on create: Generated by the provider on first start.
- Preserved on suspend/archive: Enables resume with warm context.
- Rotated on reactivate (if wake_mode=fresh): New key, fresh context.
- Scrubbed on close: Set to empty string when bead status → closed.
- Redacted in CLI output: gc session list and gc session inspect show [redacted] instead of the raw key value.
- No TTL (by design): The key’s lifetime matches the session’s lifetime. Archived sessions may hold keys for extended periods — the retention policy (see below) bounds this.
Archived Bead Retention
Archived sessions accumulate over time. To prevent unbounded growth:

- Per-template cap: max_archived in pool config (default 10). When creating a new archived session would exceed the cap, the oldest archived session is closed (bead status → closed, sensitive metadata scrubbed).
- Excluded from hot path: The reconciliation loop’s in-memory session index only tracks active, suspended, draining, and quarantined sessions. Archived sessions are not queried per-tick — only on reactivation (filtered query by template + state=archived).
- Sensitive metadata scrubbed on close: When an archived bead is pruned to closed, session_key and overlay.env.* fields are cleared.
- Time-based secret scrubbing: Archived sessions with wake_mode=fresh have their session_key and overlay.env.* scrubbed after archived_secret_ttl (default 24h) even while the bead remains open. These sessions will never be resumed with their old key, so early scrubbing is safe. Archived sessions with wake_mode=resume retain secrets until closed (they need the key for reactivation). The reconciler checks archived_at + archived_secret_ttl on each tick for wake_mode=fresh archived beads and scrubs expired secrets in place.
Removing agent.Agent
The agent.Agent interface (internal/agent/agent.go) becomes unnecessary.
Its operations map directly to session.Manager + runtime.Provider:
| agent.Agent method | Replacement |
|---|---|
| Start() | session.Manager.Create() or .Attach() |
| Stop() | session.Manager.Suspend() or .Close() |
| Attach() | session.Manager.Attach() |
| IsRunning() | sp.IsRunning(sessionName) |
| IsAttached() | sp.IsAttached(sessionName) |
| Nudge() | sp.Nudge(sessionName, msg) |
| Peek() | session.Manager.Peek() |
| SessionConfig() | Template resolution (pure function) |
The managed struct (internal/agent/agent.go:246-258) is replaced by the
session bead + template resolution. buildOneAgent (cmd/gc/build_agent.go)
becomes resolveTemplate() — a pure function that produces
session.CreateParams from config without creating in-memory objects.
Migration Path
This is a large architectural change. Migration proceeds in phases to avoid a big-bang rewrite. Each phase has a defined single-writer for runtime lifecycle and a rollback procedure.

Phase 0: Bead Schema Migration (no risk)
Existing session beads use type: "agent_session" with label
gc:agent_session and states active/stopped/orphaned/suspended. This
phase adds forward-compatible handling: the controller recognizes both
agent_session and session bead types. New beads are created with
type: "session". Existing beads are NOT migrated — they continue to work
and are naturally replaced as sessions are recreated. After Phase 4, any
remaining agent_session beads can be closed via a one-time cleanup
command.
Legacy state mapping:
| Legacy state (agent_session) | New state (session) | Pool occupancy |
|---|---|---|
| active | active | Counts against max |
| suspended | suspended | Counts against max |
| stopped | closed (terminal) | Does not count |
| orphaned | suspended (no runtime) | Counts against max |
Legacy beads count against pool max during the hybrid period to prevent
over-provisioning.
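The mapping table can be written directly as a function, which Phase 0’s tests can assert row by row; `mapLegacyState` is a hypothetical name:

```go
package main

// mapLegacyState translates an agent_session state into the new
// session-state model and whether it counts against pool max.
func mapLegacyState(legacy string) (newState string, countsAgainstMax bool) {
	switch legacy {
	case "active":
		return "active", true
	case "suspended":
		return "suspended", true
	case "stopped":
		return "closed", false // terminal
	case "orphaned":
		return "suspended", true // no runtime, but still holds a slot
	default:
		return legacy, false // unknown legacy states pass through
	}
}
```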
Phase 0 tests:
- TestLegacyBeadRecognition — controller reads agent_session beads
- TestLegacyStateMapping — legacy states map to the new model correctly
- TestHybridPoolOccupancy — mixed legacy + new beads count correctly
Phase 1: Template Resolution (low risk)
Extract template resolution from buildOneAgent into a pure function that
returns session.CreateParams (command, env, hints, workDir). No behavioral
change — buildOneAgent calls the new function internally.
Single writer: agent.Agent (unchanged).
Rollback: Revert the extraction. buildOneAgent is self-contained again.
Phase 2: Controller Uses session.Manager (medium risk)
Modify the controller to create sessions via session.Manager.Create()
instead of agent.Agent.Start(). Session beads become the source of truth.
agent.Agent objects are still built but become read-only — they are
used only for operations that don’t mutate lifecycle (peek, nudge, attach,
status queries). All lifecycle mutations (start, stop, suspend) go through
session.Manager exclusively.
Single writer: session.Manager (lifecycle). agent.Agent (read-only
operations only — Peek(), IsRunning(), IsAttached(), Nudge()).
Anti-corruption boundary: agent.Agent.Start() and agent.Agent.Stop()
are made unreachable in Phase 2 (panic if called, caught by tests).
Rollback: Re-enable agent.Agent lifecycle methods, revert controller
to use agent.Agent.Start().
Phase 3: Pool Archival (medium risk)
Implement the drain protocol and archived state for pool sessions. Old pool sessions transition through draining → archived instead of being
destroyed. Work routing excludes non-active sessions. Controller prefers
reactivation vs fresh creation based on wake_mode.
Session naming: The {template}-{short-hash} naming convention is
introduced in Phase 3 alongside the new pool lifecycle. During Phase 2,
session names remain compatible with the existing agent-name format.
Downgrade handling: On rollback to Phase 2, hash-named sessions are
unknown to the old binary. The rollback runbook (step 1) closes all
Phase 3 sessions before downgrading. The old binary’s forward-compatibility
(skip unknown states/names with warning) prevents crashes if any are missed.
A TestPhase3Downgrade_HashNamedSessions integration test validates this.
Single writer: session.Manager (lifecycle, including new drain/archive).
Rollback: Revert to immediate destroy on scale-down. Rollback runbook:
- While the new controller is still running, execute cleanup via socket:
  - gc session drain-all --template=X (drains active sessions)
  - gc session close --state=archived,quarantined,creating (closes beads)
- Stop the new controller
- Start the old binary (Phase 2)
- Old binary skips unknown state values with warning (forward-compat)

If the new controller has already crashed (can’t use socket), use gc session admin-close --offline, which: (a) acquires controller.lock (non-blocking — fails if another controller is running), (b) kills runtimes by session_name via runtime.Provider.Stop(), (c) marks orphaned beads as blocked, and (d) writes state changes directly to the bead store (bypassing socket). Requires the --yes flag for non-interactive confirmation. This is the ONLY sanctioned offline mutation path and does NOT require a running controller — it operates directly on the bead store and runtime provider.

Forward compatibility: Unknown state values are skipped with a session.unknown_state warning event, not errors. This allows safe rollback from Phase 3 to Phase 2 without crashing on draining/archived beads that the older binary doesn’t understand.
Phase 4: Remove agent.Agent (low risk, large diff)
Replace all agent.Agent usage with direct session.Manager +
runtime.Provider calls. Remove internal/agent/agent.go, buildOneAgent,
buildAgentsFromConfig. The controller operates entirely on session beads.
Single writer: session.Manager (only writer remaining).
Rollback: Restore agent.Agent as read-only wrapper. Larger revert but
mechanically straightforward since Phase 2-3 already proved bead-driven
lifecycle.
Phase 5: Multi-Instance Consolidation
Remove multiRegistry. Multi-instance agents are just templates with
unlimited sessions — `gc session new {template}` creates a new session
from the template; `gc session suspend {session}` suspends or closes it.
The multi-instance bead tracking is subsumed by session beads.
Single writer: session.Manager (unchanged).
Rollback: Restore multiRegistry as a compatibility shim that delegates
to session beads.
Depends-On Across Templates
Today depends_on is agent-to-agent. In the session model, it becomes
template-to-template: “at least one active session of the dependency
template must be alive.” This is already how allDependenciesAlive works
for pools — generalize it.
Specifically: depends_on: ["mayor"] means “at least one session with
template:mayor label must be in active state.” This is checked before
waking any session of the depending template.
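The check described above can be sketched as a small pure function. `SessionInfo` and `dependenciesAlive` here are hypothetical stand-ins for the real types, not the actual `allDependenciesAlive` code:

```go
package main

import "fmt"

// Sketch of the template-level depends_on check: a depending template may
// wake only if every dependency template has at least one active session.
// SessionInfo and dependenciesAlive are illustrative names.

type SessionInfo struct {
	Template string // value of the template: label
	State    string // lifecycle state from bead metadata
}

func dependenciesAlive(dependsOn []string, sessions []SessionInfo) bool {
	for _, dep := range dependsOn {
		alive := false
		for _, s := range sessions {
			if s.Template == dep && s.State == "active" {
				alive = true
				break
			}
		}
		if !alive {
			return false // one missing dependency blocks the wake
		}
	}
	return true
}

func main() {
	sessions := []SessionInfo{
		{Template: "mayor", State: "active"},
		{Template: "worker", State: "draining"},
	}
	fmt.Println(dependenciesAlive([]string{"mayor"}, sessions))  // true
	fmt.Println(dependenciesAlive([]string{"worker"}, sessions)) // false: no active worker session
}
```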
CLI Changes
gc session list Default Output
The default output includes a `pool_slot` column for pool sessions (a dash for non-pool sessions).
`gc session inspect` redacts `session_key` and `overlay.env.*` values,
showing `[redacted]` instead.
Default filter: Shows creating, active, suspended, draining,
quarantined. Archived and closed sessions are hidden by default.
Flags:
- `--all` — show all states including archived and closed
- `--state=archived` — filter to a specific state
- `--template=worker` — filter by template name
Ambiguity Resolution
When `gc session peek {name}` matches multiple sessions (e.g., multiple
polecat sessions), the CLI returns an error and requires an unambiguous
session identifier (the same behavior as ambiguous `gc agent` commands).
Test Strategy
Each migration phase has a defined test plan. All pool lifecycle tests use `runtime.Fake` + `beads.MemStore` — no tmux required.
Phase 1 Tests
| Test | Type | What It Verifies |
|---|---|---|
| TestResolveTemplate_Basic | Unit | Pure function produces correct CreateParams |
| TestResolveTemplate_WithOverlay | Unit | Overlay merges correctly with template defaults |
| TestResolveTemplate_OverlayDenylist | Unit | Banned keys rejected at creation |
| TestConfigHash_Canonical | Unit | Semantically identical configs produce identical hashes |
| TestConfigHash_Behavioral | Unit | Non-behavioral changes (comments, whitespace) don’t change hash |
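The two config-hash tests above rely on hashing a canonical form of the parsed config rather than raw file bytes, so comments and whitespace cannot affect the hash while behavioral fields do. A minimal sketch, assuming a hypothetical `Template` struct and `configHash` helper (not the real gc code):

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// Template is an illustrative stand-in for a parsed session template.
type Template struct {
	Name  string `json:"name"`
	Model string `json:"model"`
}

// configHash canonicalizes via deterministic JSON encoding of the struct
// (field order is fixed by the struct definition), then hashes. Two files
// that parse to the same struct hash identically, regardless of formatting.
func configHash(t Template) string {
	b, _ := json.Marshal(t)
	sum := sha256.Sum256(b)
	return fmt.Sprintf("%x", sum[:8]) // short hash, as in {template}-{short-hash}
}

func main() {
	a := Template{Name: "worker", Model: "fast"}
	b := Template{Name: "worker", Model: "fast"} // e.g. parsed from a reformatted file
	fmt.Println(configHash(a) == configHash(b)) // true: non-behavioral edits don't change the hash
	c := Template{Name: "worker", Model: "slow"}
	fmt.Println(configHash(a) == configHash(c)) // false: behavioral change changes the hash
}
```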
Phase 2 Tests
| Test | Type | What It Verifies |
|---|---|---|
| TestController_SessionManager_Create | Integration | Controller creates sessions via Manager, not agent.Agent |
| TestController_AgentStart_Panics | Unit | agent.Agent.Start() is unreachable |
| TestController_BeadDrivenLifecycle | Integration | 3+ ticks with controller restart; no duplicate sessions, no orphaned beads |
| TestController_FailedBeadRead_AbortsTick | Unit | Bead store error → tick aborted, no mutations |
Existing tests that call agent.Agent.Start() directly
need updating to use session.Manager.Create().
Phase 3 Tests
| Test | Type | What It Verifies |
|---|---|---|
| TestDrainProtocol_InFlightCompletes | Integration | Drain waits for work, then archives |
| TestDrainProtocol_Timeout | Integration | Drain timeout → archive + orphan beads marked |
| TestDrainProtocol_CrashDuringDrain | Integration | Crash during drain → immediate archive |
| TestArchive_LabelRemoved | Unit | Archived session has no pool: label |
| TestSuspend_PoolLabelRemoved | Unit | Suspended pool session has no pool: label |
| TestResume_LabelRestoredAfterLiveness | Integration | Label only added after runtime confirmed alive |
| TestReactivate_LabelRestoredAfterLiveness | Integration | Label only added after runtime confirmed alive |
| TestCreation_LabelAddedAfterLiveness | Integration | pool: label only after state=active |
| TestArchive_Reactivate_AtomicMutations | Unit | State + label changes are batched |
| TestArchivedSession_NoWorkRouting | Integration | bd ready excludes archived sessions |
| TestSuspendedSession_NoWorkRouting | Integration | bd ready excludes suspended sessions |
| TestRetentionPolicy_MaxArchived | Unit | Oldest archived closed when cap exceeded |
| TestCrashLoop_Quarantine | Integration | N crashes → quarantined, cooldown → reactivated |
| TestQuarantine_ReactivationBlockedAtMax | Integration | At-max pool → quarantined→archived |
| TestQuarantine_CycleCountPersisted | Unit | quarantine_cycle survives controller restart |
| TestScaleDown_SuspendedFirst | Integration | Suspended archived before active drained |
| TestExecStore_PartialFailureRepair | Integration | Each repair table row (uses fault-injecting store wrapper) |
| TestSocketConcurrency_MutationDuringTick | Integration | CLI mutation via socket during active tick |
| TestCreating_StaleCleanup | Integration | Creating bead >60s → closed or completed |
| TestForwardCompatibility_UnknownState | Unit | Unknown state values skipped with warning |
| TestReactivate_OverlayRevalidation | Integration | Revoked overlay keys stripped on reactivate |
| TestArchivedSecretTTL_FreshMode | Integration | Secrets scrubbed after TTL for wake_mode=fresh |
| TestAdminClose_Offline_KillsRuntimes | Integration | Offline admin-close kills runtimes + marks beads |
| TestDrainCompletion_AuthoritativeQuery | Integration | Pre-archive query catches late work claims |
| TestExecStore_QuarantineRepair | Integration | All quarantine repair table rows |
| TestActiveCrash_BelowThreshold_RestartInPlace | Integration | Single crash restarts without state change |
Phase 4 Tests
| Test | Type | What It Verifies |
|---|---|---|
| TestNoAgentAgentImports | Build | No package imports internal/agent |
| TestController_DirectManagerOps | Integration | All operations work without agent.Agent |
Existing tests that exercise the agent.Agent interface
directly get a mechanical update to session.Manager equivalents.
Phase 5 Tests
| Test | Type | What It Verifies |
|---|---|---|
| TestMultiInstance_ViaSessionBeads | Integration | gc session new creates session, gc session suspend closes |
| TestNoMultiRegistry | Build | multi_registry.go removed, no references |
Conformance Suite Additions
The session conformance suite (internal/session/conformance_test.go) gains:
- `TestConformance_CreatingState` — creating → active with liveness check
- `TestConformance_CreatingStale` — creating cleanup after timeout
- `TestConformance_DrainState` — draining → archived transition
- `TestConformance_DrainCrash` — crash during drain → immediate archive
- `TestConformance_QuarantineState` — crash loop → quarantine → recovery
- `TestConformance_QuarantineAtMax` — quarantine reactivation blocked at max
- `TestConformance_ArchivedReactivation` — archived → active with generation bump
- `TestConformance_OverlayValidation` — per-template env allowlist enforcement
- `TestConformance_AtomicStateTransitions` — batch writes for multi-field transitions
- `TestConformance_SuspendedPoolRouting` — suspended pool session not routable
- `TestConformance_TwoAxisState` — bead.status × metadata.state consistency
- `TestConformance_UnknownStateForwardCompat` — unknown states skipped safely
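In the spirit of the conformance suite, lifecycle transitions lend themselves to a single table-driven check. The transition table below is a reading of the states named in this document, not the actual implementation:

```go
package main

import "fmt"

// allowed encodes the lifecycle edges this document describes:
// creating→active, draining→archived, quarantine recovery, and
// archived→active on reactivation. Illustrative only.
var allowed = map[string][]string{
	"creating":    {"active", "closed"},
	"active":      {"suspended", "draining", "quarantined"},
	"suspended":   {"active", "archived"},
	"draining":    {"archived"},
	"quarantined": {"active", "archived"},
	"archived":    {"active", "closed"}, // reactivation, or pruning via max_archived
}

// validTransition reports whether a from→to edge is in the table.
func validTransition(from, to string) bool {
	for _, t := range allowed[from] {
		if t == to {
			return true
		}
	}
	return false
}

func main() {
	cases := []struct {
		from, to string
		want     bool
	}{
		{"draining", "archived", true},
		{"archived", "active", true},  // reactivation with generation bump
		{"draining", "active", false}, // no un-drain edge in this sketch
	}
	for _, c := range cases {
		fmt.Println(validTransition(c.from, c.to) == c.want)
	}
}
```

A conformance test built on such a table can enumerate every state pair and assert the manager rejects all edges not listed.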
Impact Analysis
Files to Change
| File | Phase | Change |
|---|---|---|
| internal/session/manager.go | 1-2 | Extend Create to accept template resolution |
| cmd/gc/build_agent.go | 1 | Extract resolveTemplate() |
| cmd/gc/build_agents.go | 2-4 | Rewrite to produce desired template counts |
| cmd/gc/session_reconciler.go | 2-3 | Reconcile against templates, not agents |
| cmd/gc/session_beads.go | 2-3 | Simplify (beads are now canonical) |
| cmd/gc/session_wake.go | 3 | Add drain/archived/quarantine state transitions |
| cmd/gc/pool.go | 3-4 | Pool scaling creates/drains/archives sessions |
| cmd/gc/multi_registry.go | 5 | Remove entirely |
| internal/agent/agent.go | 4 | Remove entirely |
| cmd/gc/city_runtime.go | 2-4 | Remove agent.Agent fields |
| internal/config/config.go | 1 | Add defaults section to Agent |
Backward Compatibility
- city.toml format: No breaking changes. `[[agent]]` syntax is unchanged. Pool config is unchanged. The `[agent.defaults]` section and new pool fields (`drain_timeout`, `archive_order`, etc.) are additive.
- CLI commands: `gc session new/suspend/peek/attach` are the primary interface. `gc agent` is config-only (add/suspend/resume).
- Bead schema: New metadata fields are additive. Existing session beads are compatible (missing fields use defaults).
- Environment variables: `GC_SESSION_NAME` and `GC_TEMPLATE` (already emitted) become canonical. Legacy `GC_AGENT` continues during migration.
Risks
- Drain protocol complexity. The `draining` state adds a transitional lifecycle path. Implementation must handle edge cases: drain of a session that crashes during drain, drain timeout racing with work completion, double-drain of the same session.
- Migration duration. Five phases over multiple PRs. The intermediate states increase code complexity temporarily, but the single-writer contract and anti-corruption boundary (Phase 2) limit the blast radius.
- Performance. The reconciliation hot path uses an in-memory session index (same pattern as the convergence active index). The index maps bead ID → session record for all non-closed, non-archived sessions. It is populated at startup via a one-time full scan, then maintained synchronously on every mutation by the single-writer controller. Since all lifecycle mutations (including CLI commands) go through the controller socket (INV-5), the index is always consistent — no periodic full reconcile needed. The index eliminates per-tick store queries. Archived sessions are queried on-demand only during reactivation.
- Naming transition. Pool instances today have deterministic names (`worker-3`). Session-based naming uses `{template}-{short-hash}`. The `pool_slot` metadata field provides backward-compatible sequential references for operators who need them.
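The in-memory index from the performance note above can be sketched as a small type maintained by the single writer. `sessionIndex` and its methods are illustrative names:

```go
package main

import "fmt"

// Session is an illustrative stand-in for a session bead's hot-path fields.
type Session struct {
	BeadID   string
	Template string
	State    string
}

// sessionIndex maps bead ID → session record for all non-closed,
// non-archived sessions, so reconciliation ticks never query the store.
type sessionIndex struct {
	byID map[string]Session
}

// newIndex seeds the index with a one-time full scan at startup.
func newIndex(startupScan []Session) *sessionIndex {
	idx := &sessionIndex{byID: make(map[string]Session)}
	for _, s := range startupScan {
		idx.apply(s)
	}
	return idx
}

// apply is called synchronously on every lifecycle mutation by the single
// writer. Closed and archived sessions are evicted; everything else is upserted.
func (idx *sessionIndex) apply(s Session) {
	switch s.State {
	case "closed", "archived":
		delete(idx.byID, s.BeadID)
	default:
		idx.byID[s.BeadID] = s
	}
}

func main() {
	idx := newIndex([]Session{{BeadID: "b1", Template: "worker", State: "active"}})
	idx.apply(Session{BeadID: "b1", Template: "worker", State: "archived"})
	_, live := idx.byID["b1"]
	fmt.Println(live) // false: archived sessions leave the hot-path index
}
```

Because every mutation flows through the controller socket, the synchronous `apply` call is sufficient to keep the index consistent without a periodic reconcile.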
Resolved Questions
- Archived session pruning: Yes, auto-pruned via `max_archived` per template (default 10). Oldest archived sessions are closed when the cap is exceeded. Sensitive metadata is scrubbed on close.
- Reactivation semantics: When `wake_mode=resume`, the controller reactivates an archived session (same bead, same key, warm context). When `wake_mode=fresh`, the controller creates a new session bead with fresh context — archived sessions are NOT reactivated. The archived beads stay archived until pruned by `max_archived`.
- Template overlay scope: Overlays are limited to a strict allowlist (`model`, `name`, `title`, `prompt`, `env.*` with denylist). Unknown keys are rejected at creation time.
- `depends_on` across templates: Template-to-template: “at least one active session of the dependency template must be alive.” Generalized from the existing pool dependency check.
- Routing label as ZFC compromise: The controller manages the `pool:` routing label (adding/removing it during state transitions). The label string is parameterized via `routing_label` in pool config, so Go code manipulates a configured value, not a hardcoded prefix. This is a v1 pragmatic compromise — future versions could externalize routing entirely to agent-driven label management via hooks.
Open Questions
- Should `draining` sessions be visible to `gc session peek`? They’re still running but about to be archived. Current recommendation: yes, peek works on any running session regardless of state.
- Multi-template overlays. Could a session combine fields from multiple templates? Current answer: no. One template per session. If needed, create a new template that inherits from others.
Appendix: Current vs Target Comparison
Creating a Pool Member
Current (7 steps, in-memory):
1. `evaluatePool()` → desired count
2. `poolAgents()` → deep copy config per instance
3. `buildOneAgent()` → resolve provider, build command, create agent.Agent
4. `syncSessionBeads()` → create bead to match agent
5. `reconcileSessionBeads()` → decide to wake
6. `agent.Agent.Start()` → runtime session
7. Agent picks up work via `bd ready --label=pool:{template}`
Target:
1. `evaluatePool()` → desired count
2. `resolveTemplate()` → session.CreateParams from config
3. `session.Manager.Create()` → bead (state=creating, no pool: label)
4. Runtime starts, liveness confirmed → state=active, pool: label added
5. Session picks up work via pool label on bead
Stopping a Pool Member
Current (destroyed):
1. Controller sees excess instances
2. `agent.Agent.Stop()` → tmux session killed
3. `syncSessionBeads()` → bead closed
4. History lost
Target:
1. Controller sees excess active sessions
2. `session.Manager.Drain()` → state=draining, pool label removed
3. Wait for in-flight work (or timeout)
4. `session.Manager.Archive()` → state=archived, runtime killed
5. Queryable via `gc session list --state=archived --template=worker`
6. Reactivatable if pool grows again
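The target scale-down sequence can be sketched end to end. The `poolSession` type and its methods below are illustrative stand-ins for `session.Manager`, under the assumptions stated in the drain protocol above:

```go
package main

import "fmt"

// poolSession is a toy stand-in for a pool session bead's lifecycle fields.
type poolSession struct {
	State     string
	PoolLabel bool // presence of the pool: routing label
	inFlight  int  // outstanding work items
}

// Drain flips state and strips the routing label first, so no new work
// routes to the session while it finishes what it has.
func (s *poolSession) Drain() {
	s.State = "draining"
	s.PoolLabel = false
}

// Archive runs once in-flight work completes or the drain timeout fires.
// The runtime is killed, but the bead stays queryable and reactivatable.
func (s *poolSession) Archive() {
	s.State = "archived"
}

func main() {
	s := &poolSession{State: "active", PoolLabel: true, inFlight: 1}
	s.Drain()
	fmt.Println(s.State, s.PoolLabel) // draining false
	s.inFlight = 0                    // in-flight work completed within the timeout
	s.Archive()
	fmt.Println(s.State) // archived
}
```

Ordering matters: removing the label before waiting means the window for late work claims is closed before drain begins, which `TestDrainCompletion_AuthoritativeQuery` double-checks with a pre-archive query.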