ADR-0021: Cross-fleet pattern integration wave (2026-06-13)¶
Status¶
Accepted
Date¶
2026-06-13
Authors¶
praxis maintainers (pattern-integration wave following ADR-0020)
Context¶
A survey of proven operational patterns from the broader fleet-tooling domain (infrastructure provisioning, configuration management, operator runbooks, shell and SSH actuation, and an agent runtime) was run against praxis to find additions worth adopting. praxis had already absorbed most generic hardening idioms (finite or non-negative numeric validation at construction, per-link failure containment in the routing-chain dispatcher, provider-token redaction shapes, the append-only hash-chained audit with degrade-to-stderr, process-group subprocess kill, an env allowlist, chain resume on restart), so the genuinely missing, high-value additions clustered in four areas. They are integrated here natively, with no import from or runtime coupling to any external tooling (ADR-0001, ADR-0014): praxis reimplements the idea, it does not depend on the source.
The wave deliberately implements the four self-contained, well-scoped additions and records the larger or decision-bearing findings as tracked backlog items rather than rushing them into a security-sensitive change.
Decision¶
-
Machine-checkable compliance catalog (BL-031, closing the open item; advances the module-back-citation part of BL-036). The prose compliance map (
docs/governance/compliance-map.md) and the STPA security constraints (docs/stpa/07-security-constraints.md) are projected intodocs/governance/compliance-controls.json, a pydantic-validated catalog keyed by the existing SEC-1..SEC-10 ids plus one governance control (CTL-001, supply-chain and secure development). The model (src/praxis/governance/catalog.py) is the single source of truth for a generated JSON Schema (docs/schema/compliance-controls.schema.json, under the existingmake schemadrift guard), mirroring the SKILL.md frontmatter pattern (ADR-0014).src/praxis/governance/validate.pyruns eleven bidirectional cross-reference rules: id format; the SEC controls are exactly the STPA constraints; every cited module exists and (for a SEC control'ssrc/praxismodules) carries the matchingSEC-Nback-citation token; everySEC-Ntoken in the source tree names a catalog control; invariants are in 1..9; every regulatory framework is known and every in-scope framework is cited; every listed proving testpath::functionexists and an implemented control names at least one (partial/planned exempt); the prose map cites no undefined control; and a partial/planned control carries a tracking BL id while an implemented one does not.scripts/validate_compliance.pywires it intomake validate-complianceand theci-successaggregate, and the suite re-runs it (tests/governance/) somake checkcatches drift too. -
Redaction hardening (BL-097).
execution/redaction.pygains a PyPI upload-token shape (anchored on the fixedAgEIcHlwaS5vcmcmacaroon prefix), runs the npm and GitLab token bodies UNBOUNDED from their length floor so a longer-than-minimum token collapses whole instead of leaving a tail in the audit log, and adds a context-gated compact MySQL password redaction (mysql -p<secret>) that fires only when a MySQL-family client is present in the same string, so-p-as-port for unrelated tools (ssh -p22,nmap -p1-1000) is not over-scrubbed. Strengthens SEC-9 (invariant 3); no classification change, soPATTERNS_VERSIONis unchanged. -
Talos partition-scoped reset (BL-098).
actuation/talosctl.pygains an additivesystem_labelsstructured param mapping totalosctl reset --system-labels-to-wipe(allowlisted toEPHEMERAL/STATE, case-normalised). Wiping onlyEPHEMERALpreserves theSTATEpartition (node identity and secrets), so the node rejoins the cluster rather than needing a full re-provision. It is mutually exclusive with the disk-scoped--wipe-mode(supplying both is refused: fail closed on an ambiguous reset scope). The documentedwipe_modedefault (system-disk, BL-025) is left unchanged: this is a strictly additive capability. The partition reset keeps its T3 authority because theresetverb already floors at T3 in the classifier (a test pins this), so nopatterns.pychange is needed (SEC-5, invariant 6). -
Content-hash compare-and-set for supersede (BL-027, closing the open item). A new
VersionedStoreProtocol (store/base.py) sits besideVectorStoreon the extension ladder, advertised viaCapability.COMPARE_AND_SET.Fact.content_hashis a stable SHA-256 over the asserted content (key, value, write provenance), not the store-assigned lifecycle stamps, so the version a caller reads offget_activesurvives the store round-trip and is identical on both backends.put_fact_if(fact, expected_version=...)makes the read-compare-write atomic and version-gated: SQLite takes the write lock up front withBEGIN IMMEDIATE(plus abusy_timeoutso a second instance serialises rather than failing fast), Postgres locks the active row withSELECT ... FOR UPDATE; a mismatch raisesVersionConflictand writes nothing. This forecloses a lost update where an operator approves replacing the fact they read but a different value lands in between (SEC-6, invariant 4). The Postgres create-if-absent case cannot be row-locked (there is no row yet), so a concurrent create that wins the race trips the partial unique index, which is translated toVersionConflictso the create path honours the contract like the supersede path (live-PG concurrency verification tracked as BL-103). A SQLite concurrency test proves exactly one of two racing writers wins; the rest run in the shared backend-parity suite. -
The larger or decision-bearing findings are tracked, not rushed (mirrors how ADR-0011 recorded a validated backlog rather than implementing all at once): a CIS-Talos desired-state baseline as drift data (large; needs a fact-predicate schema decision, filed for a dedicated ADR), a multi-sink audit fan-out with per-sink failure containment (latent until a second audit sink is wired, e.g. the Postgres path), an audit
request_id/client_idcorrelation field (lower value for a stdio-default single-operator server), and a client-side-only (--server=false) talosctl pre-flight health probe (a behavioural change to a HARD safety precondition, raised for an operator decision rather than changed unilaterally). Seedocs/backlog.md.
Consequences¶
Positive:
- The compliance map stops being prose a reviewer must trust and becomes a CI gate: a stale module path, a control that no longer back-cites its constraint, an uncovered framework, a missing proving test, or a map row referencing an undefined control all break the build. The compliance map's own stated principle ("a control without a test is a visible gap") is now machine-enforced.
- The redaction additions close real audit-log leak paths (a long npm token tail, a bare PyPI token, the most common MySQL password form) without over-scrubbing.
- Operators gain a STATE-preserving Talos reset (the safer, more common reset) as a first-class, audited, T3-gated option, without changing the documented default.
- The store gains a lost-update guard that matches the human-gated convergence promise: an approval bound to a read cannot silently apply to a value that changed underneath it. The two backends are held to one behaviour by the parity suite.
Negative:
- The catalog is a new artifact to maintain: adding a SEC constraint or moving an enforcement file now also means updating the catalog, or CI fails. That coupling is the point (it is the drift guard), but it is real maintenance.
put_fact_ifadds a third fact-write path besideput_factandsupersede; callers that want lost-update safety must read the version and pass it. The plainput_fact(last-writer-wins under the unique-index race) is unchanged, so this is opt-in, not a default tightening.- The SQLite
busy_timeoutmakes a contended write wait up to 5 s rather than fail fast; for the single-operator stdio default there is normally one connection, so no contention, but a multi-instance deployment now blocks instead of erroring.
Neutral:
- No change to the execution pipeline, the approval gate, the transport guard, or
PATTERNS_VERSION. The redaction change is additive pattern coverage; the talos change is an additive param; the CAS change is an additive Protocol and method. - The catalog keys on SEC-N rather than inventing a parallel control namespace, so the existing STPA traceability stays the single spine and the catalog is a projection of it, not a competitor.
Alternatives considered and rejected¶
- Model the compliance catalog in YAML. Rejected: praxis has no YAML dependency and hand-rolls its only frontmatter parser (BL-057) to avoid the parser attack surface; JSON plus a pydantic model is the in-repo idiom (ADR-0014) and needs no new dependency.
- Invent a
CTL-NNNnamespace for every control. Rejected: the SEC constraints are already the tested, STPA-derived control set; keying the catalog on them keeps one spine. A singleCTL-001is used only for the supply-chain/secure-development control that has no single SEC home. - Change the talosctl reset default to the STATE-preserving scope. Rejected here: the
system-diskdefault is a documented BL-025 decision, and reversing a destructive default is an operator decision that belongs in its own ADR, not a side effect of adding an option. The safer scope is now available; the default question is a revisit trigger. - Implement compare-and-set as a new MCP tool. Rejected: it is a store primitive, not a new control action; it routes through no new tool surface and so needs no new UCA row. The fact-writing tools that already have UCA coverage can adopt it internally.
- Add an audit
request_id/client_idfield in this wave. Rejected for now: the value is correlation across concurrent clients, which the stdio-default single-operator server does not have; filed rather than built.
Revisit triggers¶
- A second audit sink is wired (the Postgres audit path, a syslog target): introduce
the multi-sink fan-out with per-sink failure containment,
BaseExceptionstill propagating, before the second sink can mask the first. - The drift engine needs a machine-readable CIS-Talos desired-state baseline: open a
dedicated ADR for the fact-predicate schema, then transcribe the baseline as
KNOWN_GOODdata. - An operator decides the default Talos reset scope should preserve
STATE: supersede the BL-025 default in a new ADR and flip_DEFAULT_WIPE_MODEwith its test. - The HTTP transport (BL-012) lands and the server serves multiple clients: add the
audit
request_id/client_idcorrelation field (additive to the audit record). - A new SEC constraint or a moved enforcement file: update
docs/governance/compliance-controls.jsonin the same change (the validator will flag the omission). - Live PostgreSQL becomes available to CI or the operator: add the concurrent create-if-absent compare-and-set test (BL-103) and confirm the IntegrityError-to-VersionConflict translation under real contention.