Scaling Backlog — 100+ Agent Deployments
Phased improvements for running Gas City at scale. Phase 1 items are pure algorithm changes that benefit all providers (tmux, exec, K8s). Later phases are K8s-specific infrastructure.Phase 1: Algorithm Changes (all providers)
These reduce per-tick I/O calls from O(N) per-agent subprocess invocations to O(1) batch calls + O(N) map lookups.-
Pre-fetch running set in reconcile loop — Call
listRunning()once at top ofdoReconcileAgents, skipIsRunning()+ProcessAlive()for sessions not in set, reuse for orphan cleanup (eliminates secondlistRunning()call). Saves N provider calls per tick for non-running agents. -
Replace
countRunningPoolInstanceswithListRunning— Current code callssp.IsRunning()per pool instance (1..max). Replace with singlesp.ListRunning()call + set intersection. For pool max=100, saves 99 provider calls. -
Parallelize pool
scale_checkcommands — Pool scale checks are independent shell commands. Run them concurrently with goroutines inbuildAgents. For 5 pools with 2s checks each, wall-clock drops from ~10s to ~2s.
Phase 2: K8s Infrastructure
-
gc-events-dolt— Replace ConfigMap-based events with a Dolt table. Eliminates CAS counter serialization bottleneck. ConfigMap list performance degrades after ~1000 events; Dolt handles millions of rows. - Event cleanup CronJob — Prune event ConfigMaps older than N hours. Prevents unbounded accumulation that slows list operations.
- Scale mcp-mail — Bump resources to 500m/512Mi request, 1 CPU / 1Gi limit. SQLite single-writer means multiple replicas won’t help; scale the single replica instead.
- Scale Dolt — Increase to 500m/1Gi request, 2 CPU / 4Gi limit, raise max_connections to 500, storage to 20Gi.
-
Increase
patrol_interval— Added scaling guidance to example-city.toml (30s recommended for 50+ agents). Controller resource env vars added to gc-controller-k8s.
Phase 3: Native Providers
-
Native K8s session provider —
internal/session/k8s/package usingclient-gofor direct API calls over reused HTTP/2 connections. Eliminates ~300 kubectl subprocesses per tick at 100 agents. Pod manifests compatible with gc-session-k8s for mixed-mode migration. Configured via[session] provider = "k8s"orGC_SESSION=k8s. - MCP mail PostgreSQL backend — Replace SQLite with PostgreSQL for concurrent writes. SQLite serializes at ~15-20 agents under burst.
-
Bake city into agent image —
gc build-imageassembles a Docker context and builds a prebaked image.prebaked = truein[session.k8s]skips init containers and staging. Pod startup drops from 30-60s to seconds.
Scaling Estimates
| Agents | Phase 1 (algorithm) | + Phase 2 (infra) | + Phase 3 (native) |
|---|---|---|---|
| 10 | works today | — | — |
| 25-30 | removes bottleneck | — | — |
| 50 | marginal | removes bottleneck | — |
| 100 | insufficient alone | marginal | removes bottleneck |
