ADR-0011: External fleet-repository audit (2026-06) and validated hardening backlog¶
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-06-07 |
| Authors | Roman Mednitzer |
Context¶
praxis is self-contained (ADR-0001): zero runtime dependency on, and no imports
from, any other repository. It does not follow that praxis cannot learn from the
sibling repositories that share its operator, its EU-sovereign posture, and its
security concerns. Nine sibling repositories were audited for transferable
patterns, with auditing, security, and capability prioritised:
relay-shell(shell/SSH MCP server; the execution trust boundary),isms-mcp(read-only MCP overlay; server surface, path boundary, classification),agents(harness and memory backends; the multi-audit ADR cadence),core-graph(bitemporal four-timestamp model; evidence integrity; engine-level RLS),ai-stack(governed Helm chart; supply-chain parity; governance-as-code),infraandrunbooks(Talos no-SSH paradigm; destructive-op guards),automationandisms(compliance-controls mapping; append-only evidence).
The audit confirmed that several controls are already correct in praxis and must
not be re-done (the RFC 6962 Merkle construction in audit/merkle.py, the
active-fact partial unique index in store/sqlite.py, single-target T3 in
execution/runner.py, the DRY_RUN preview in actuation/base.py, the
fail-closed transport guard in config.py, and readOnlyRootFilesystem: true,
which is stricter than the ai-stack helper default and is kept). Two agent
claims about praxis were checked and found incorrect: praxis already has a
forward-linked, restart-resuming audit hash chain, and its bitemporal
supersession does not suffer the overlap-constraint trap.
The remaining findings are recorded here as a backlog wave. Per the standalone constraint every item is a pattern to re-implement, never a dependency to add.
Decision¶
- Adopt periodic external-fleet audit as a recurring practice, recorded as an
ADR per wave (the cadence observed in the
agentsrepository, ADR 0007 to 0023, where each audit generalises an invariant class to a new boundary). This ADR is the first such wave. - Each finding is validated against a trusted source before acceptance, not only against the sibling repository that surfaced it. The validation verdict is recorded in the table below. Validation materially changed four findings: BL-019 was downgraded (no current injection bypass exists), BL-022 and BL-023 were reframed (Talos already verifies the snapshot hash; a pre-upgrade health check is best practice, not a Talos requirement), and BL-024 was strengthened (the first OWASP defence is to not let the caller supply the path at all, so a runbook registry by id is preferred over a path allow-list).
- The validated findings are accepted as backlog items BL-017 to BL-036, each
mapped to a security constraint (
docs/stpa/07-security-constraints.md) or an invariant, and each citing this ADR as its source. Accepting an item schedules the work; it does not weaken any current default.
Validated findings¶
Trusted sources: OpenSSH ssh_config(5) (man.openbsd.org); the Python
subprocess documentation; the PostgreSQL 17 documentation (CREATE TRIGGER,
ddl-rowsecurity, ddl-priv, advisory locks); the Talos / talosctl
documentation; the OWASP OS Command Injection, Path Traversal, and Logging
guidance (the trusted references named in relay-shell and the praxis security
posture); IEEE 754 for floating-point comparison semantics.
| BL | Finding | Constraint | Source | Trusted-source verdict |
|---|---|---|---|---|
| 017 | Read and ingest tools (query_facts, fact_history, ingest_observation, drift_scan) return without an audit record; only run_action flows through run(). |
SEC-2, SEC-9, INV 1 | relay-shell | OWASP Logging: log access to sensitive data and access-control failures. Confirmed. |
| 018 | Trifecta denials raise TrifectaViolation (in context.py and tools/actuate.py) with no audit record, unlike the runner denied() path. |
SEC-2, SEC-4 | agents (BL-202) | OWASP Logging: authorization failures must always be logged. Confirmed. |
| 019 | The classify and deny probe sees only the command string, not the tool name (and would not see stdin or env if those were ever passed through). | SEC-1, SEC-3 | relay-shell | OWASP Command Injection: array-argument execution (no shell) is the primary defence and praxis already does it; the full argv is already classified. Downgraded to an enhancement (tool-scoped deny rules) plus a forward note. |
| 020 | The SSH adapter emits a bare ssh target action with no host-key policy, risking MITM and interactive hangs. |
SEC-5, SEC-8 | relay-shell | OpenSSH ssh_config(5): StrictHostKeyChecking accept-new adds new keys but refuses changed keys; BatchMode=yes disables prompts and fails fast. Confirmed. |
| 021 | run_subprocess has no start_new_session and no killpg on timeout, orphaning descendants (acute for ansible and tofu partial state). |
SEC-6, SEC-8 | relay-shell | Python subprocess: run() timeout kills the direct child only; start_new_session=True calls setsid(), enabling os.killpg. Confirmed. |
| 022 | The talosctl etcd-restore path must preserve snapshot integrity verification. | SEC-6, SEC-4 | runbooks | Talos: bootstrap --recover-from hash-verifies the snapshot by default (skippable only with --recover-skip-hash-check). Reframed: never pass the skip flag; an optional praxis-side sidecar verify is defence in depth. |
| 023 | A pre-flight cluster health check before a talosctl upgrade. | SEC-6 | runbooks | Talos: not mandated by the upgrade API; it is SRE best practice. Reframed as a recommended HARD precondition, not a requirement. |
| 024 | The runbook adapter runs bash <caller-supplied-path> with no path boundary. |
SEC-4 | isms-mcp | OWASP Path Traversal: prefer not letting the caller supply the path (use an index of known-good items); otherwise normalise and constrain within a base directory. Strengthened: prefer a runbook registry keyed by id over a path allow-list. |
| 025 | The talosctl reset scope should not inherit the most destructive default. | SEC-6, INV 9 | runbooks | Talos: talosctl reset --wipe-mode defaults to ALL (wipes system and user disks). Confirmed: require an explicit scope and treat ALL as a T3-confirmed choice. |
| 026 | Numeric fields parsed from collected host data are not checked for NaN or infinity before use. | SEC-10, SEC-4 | agents, isms-mcp | IEEE 754: NaN comparisons are always false, so a NaN can silently disable a <= 0 or ordering check. Confirmed: parse with a finite-or-default helper at every collector site. |
| 027 | The store exposes only StoreProtocol and VectorStore; no additive extension ladder and no content-hash compare-and-set to harden the one-active-fact supersede. |
SEC-10 | agents | Optimistic concurrency control is standard; the additive-stability rule keeps it non-breaking. Accepted as additive. |
| 028 | The Postgres backend enforces append-only by trigger only. | SEC-10 | core-graph | PostgreSQL 17: TRUNCATE is not subject to row security and is a separately revocable privilege; a TRUNCATE trigger is statement-level; owners and superusers bypass RLS unless FORCEd; RESTRICTIVE policies combine with AND. Confirmed: add REVOKE plus a BEFORE TRUNCATE trigger; an optional RESTRICTIVE RLS classification floor raises the bar against non-superuser paths. |
| 029 | The audit hash chain is not write-serialised under concurrent writers. | SEC-2, SEC-9 | core-graph | PostgreSQL pg_advisory_xact_lock serialises chain appends. Confirmed. Low now (stdio is serial), load-bearing once the HTTP or Postgres audit path is concurrent. |
| 030 | Collected snapshots are not integrity-bound into the Merkle checkpoint (the chain covers what is written, not what was read from the host). | SEC-9, SEC-10 | isms | Chain-of-custody practice; consistent with the RFC 6962 checkpoint already in audit/evidence.py. Accepted: stamp a raw_snapshot_hash. |
| 031 | compliance-map.md is prose only, with no machine check that each control maps to an enforcing file and a test, and each declared framework has at least one article-level citation. |
governance | automation, isms | GRC-as-code practice (the automation bidirectional validator and the isms validator suite). Accepted. |
| 032 | There are no chart-rendering assertions; a regression flipping automountServiceAccountToken or dropping readOnlyRootFilesystem passes make check and fails only at deploy. |
SEC-7, INV 9 | ai-stack | helm-unittest is the standard chart-assertion tool. Accepted. |
| 033 | The airgap zarf.yaml carries a placeholder digest, there is no SBOM, and no values-to-SBOM-to-zarf parity check, yet the compliance map cites an SBOM as the CRA enforcement. |
supply chain | ai-stack | CycloneDX is the SBOM standard; CRA Annex I expects supply-chain traceability; a placeholder digest fails airgap pull. Accepted. |
| 034 | parse_ansible_check maps every changed task to WARNING, missing FAILED (ERROR), unreachable (CRITICAL), and ok (known-good). |
SEC-3, SEC-6 | automation | Conservative round-up (SEC-3) and trustworthy DRY_RUN before approval (SEC-6) require severity fidelity. Accepted. |
| 035 | The audit and evidence chain has no documented retention policy. | governance | automation | NIS2 Art. 23 and ISO/IEC 27001 A.8.15 expect defined log retention. Accepted: documented tiers bound in config. |
| 036 | Governance hygiene: back-citation headers in enforcing modules, agent hard-rules (no fabrication, no bypass) in CLAUDE.md, a values-prod.yaml overlay and version-bump checklist, a namespace default-deny NetworkPolicy, regulatory-deadline data (NISG 2026 in force 2026-10-01), and an empty-string-is-not-loopback test. |
governance | automation, isms, ai-stack, core-graph | Practice-level, each traceable to a sibling-repo control. Accepted as a consolidated low-priority item. |
Consequences¶
Positive: the hardening backlog is derived from a cross-fleet audit and each item is validated against an authoritative source, so the work is defensible and traceable (finding to constraint to source). The audit-as-ADR cadence becomes a repeatable maintenance method.
Negative: the backlog grows by twenty items; some (BL-027 to BL-030, BL-031 to BL-033) are medium effort and touch the store, the Postgres backend, and the deploy and governance surfaces.
Neutral: this ADR records findings and their validation; the enforcement is the code and tests delivered under each backlog item. No current default changes on acceptance.
Alternatives considered and rejected¶
- Take a runtime dependency on a sibling repository (for example reuse relay-shell as the actuator). Rejected: ADR-0001 makes praxis self-contained; the value here is the patterns, not the code.
- Accept the sibling-repo recommendations as-is without independent validation. Rejected: validation corrected four findings (BL-019, BL-022, BL-023, BL-024), which would otherwise have produced wrong or wasted work.
- Record the findings only in the backlog without an ADR. Rejected: the backlog format requires a source ADR, and the validation evidence needs a durable home.
Revisit triggers¶
- A new sibling repository enters the operator's fleet, or a material change lands in an audited one.
- The HTTP transport or a concurrent Postgres audit path is implemented (raises BL-029 from low to load-bearing).
- A trusted source contradicts a recorded verdict (correct by an appended audit note here and a new finding, never by rewriting an accepted row).