ADR-0013: Third audit wave (2026-06): actuation, audit, and untrusted-input hardening
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-06-07 |
| Authors | Roman Mednitzer |
Context
A third review wave read the full source again, re-ran every gate, and validated
the architecture against established hardening practice for the surfaces praxis
fuses: privileged SSH/subprocess execution (OpenSSH BatchMode and host-key
policy, POSIX process groups), append-only audit logging (owner-only file mode,
visible-seam chain resume), untrusted-input parsing (finite-or-default numeric
coercion, frontmatter fence and size discipline), and secret redaction (structural
provider-token anchors, value-complete Authorization redaction). The design was
distilled from proven systems and reimplemented natively; this wave closes the gap
between the distilled intent and the delivered code. No third-party code or
cross-repository reference is introduced: praxis stays self-contained (ADR-0001).
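Finite-or-default numeric coercion is the technique behind the vector-ranking fix (BL-054). A `NaN` compares false against everything, so a single poisoned score makes a sort order undefined rather than merely wrong.

The following is a minimal sketch that assumes embeddings arrive as float sequences keyed by document id. `finite_or_default`, `safe_cosine` and `rank` are hypothetical names for illustration. They are not the praxis `_cosine`/`similar` functions.

```python
from __future__ import annotations

import math
from collections.abc import Mapping, Sequence


def finite_or_default(value: object, default: float = 0.0) -> float:
    """Coerce an untrusted value to a finite float; NaN, +/-inf and junk become `default`."""
    try:
        number = float(value)  # type: ignore[arg-type]  # accepts int, float, numeric str
    except (TypeError, ValueError, OverflowError):
        return default
    return number if math.isfinite(number) else default


def safe_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity that never yields NaN/inf: a poisoned vector scores 0.0."""
    if not a or len(a) != len(b):
        return 0.0
    if not all(math.isfinite(x) for x in a) or not all(math.isfinite(y) for y in b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    # Huge but finite inputs can still overflow to inf/inf = NaN; coerce that too.
    return finite_or_default(dot / norm)


def rank(query: Sequence[float], vectors: Mapping[str, Sequence[float]]) -> list[tuple[str, float]]:
    scored = [(doc_id, safe_cosine(query, vec)) for doc_id, vec in vectors.items()]
    # Every score is finite, so the comparison is a total order and the sort is well defined.
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For example, `rank([1.0, 0.0], {"poison": [float("nan"), 0.0], "good": [1.0, 0.0]})` places `"good"` first, and the poisoned vector drops to a neutral score.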
The review confirmed that the previously audited controls still hold: the append-only triggers, the fail-closed evidence verifier, the SSRF numeric-form normalisation, and the deny-first policy. It also confirmed that a coherent cluster of open backlog items from ADR-0011 and ADR-0012 is real and surgically fixable, and it surfaced a small set of new hardening gaps and one self-containment defect:
- The SSH adapter built `["ssh", target, action]` with no host-key policy and no `BatchMode`, so it could prompt (and hang a TTY-less MCP call) or accept an unverified host key; a leading-dash target was an ssh-option-injection vector.
- The subprocess runner inherited the server environment and stdin, and on timeout killed only the direct child: a wrapped tool could read the MCP stdio stream, hang on a credential prompt, or leak a grandchild process tree.
- The talosctl adapter enforced the T3 single-target rule on `host.name` rather than the actual `host.nodes`, and tokenised a free-form action into argv.
- A trifecta refusal raised out of the tool handler with no audit record.
- The audit logger reopened the file after degrading and dropped the sink to stderr on a corrupt tail (losing records); the log was created world/group readable.
- A poisoned embedding (`NaN`) could drag a `NaN` score into vector ranking; the manifest parser accepted `----` as a fence, had no size cap, and allowed indented and duplicate keys; an empty AIDE report read as a clean host.
- A `context.py` docstring cited an out-of-tree prototype by name as its rationale, a self-containment (ADR-0001) and docs-honesty defect.
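The manifest-parser gaps listed above are an exact fence, a size cap, no indented or duplicate keys, and no crash on non-UTF-8 input. They can be illustrated with a fail-closed parser for a flat `key: value` frontmatter block.

This is a simplified sketch, not the praxis loader. `parse_frontmatter` and `MAX_MANIFEST_BYTES` are hypothetical, and the 64 KiB cap is an assumed value.

```python
from __future__ import annotations

MAX_MANIFEST_BYTES = 64 * 1024  # hypothetical cap; tune to real manifest sizes
FENCE = "---"


def parse_frontmatter(data: bytes, max_bytes: int = MAX_MANIFEST_BYTES) -> dict[str, str]:
    """Parse a flat `key: value` frontmatter block, failing closed on anything odd."""
    if len(data) > max_bytes:  # enforce the cap before decoding or splitting
        raise ValueError(f"manifest is {len(data)} bytes, cap is {max_bytes}")
    try:
        text = data.decode("utf-8")  # strict: bad bytes become a clean error, not a crash
    except UnicodeDecodeError as exc:
        raise ValueError("manifest is not valid UTF-8") from exc
    # Split on \n only (tolerating \r\n) so exotic Unicode line breaks never form a fence.
    lines = [line.rstrip("\r") for line in text.split("\n")]
    if lines[0] != FENCE:  # exact match: '----' or '--- x' is not a fence
        raise ValueError("manifest must open with a '---' fence line")
    try:
        close = lines.index(FENCE, 1)
    except ValueError:
        raise ValueError("manifest has no closing '---' fence") from None
    fields: dict[str, str] = {}
    for lineno, line in enumerate(lines[1:close], start=2):
        if not line.strip():
            continue  # blank lines inside the block are allowed
        if line[0].isspace():
            raise ValueError(f"line {lineno}: indented key (nested structure not allowed)")
        key, sep, value = line.partition(":")
        key = key.rstrip()
        if not sep or not key:
            raise ValueError(f"line {lineno}: expected 'key: value'")
        if key in fields:
            raise ValueError(f"line {lineno}: duplicate key {key!r}")
        fields[key] = value.strip()
    return fields
```

Rejecting duplicates closes a real ambiguity. Otherwise, two parsers that keep the first and the last value respectively would read the same manifest differently.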
Decision
- Adopt this wave as the third recurring audit (after the external ADR-0011 and the internal ADR-0012), validating delivered code against established hardening practice and reimplementing every adopted technique natively. No sibling repository is named in code or docs; the self-contained invariant is upheld.
- Remediate the coherent, surgical, security-relevant cluster in the change that accompanies this ADR, each fix with a regression test: BL-018, BL-020, BL-021, BL-034, BL-047, BL-048, BL-054, BL-055, BL-057, BL-058, BL-059 resolved, plus the new items BL-063 to BL-067. Architectural open items (BL-046 SSRF resolution, BL-049 credential wiring, BL-051/052 CI/deploy gating) stay open and tracked.
- Where a resolved item carried a residual beyond the security-critical core, carve the residual into a new tracked item rather than over-claim closure: the store `seq` cross-connection race (residual of BL-054) becomes BL-068, and the talosctl structured-parameter refactor (beyond the BL-048 verb allowlist) is noted as a future refinement, not a delivered guarantee.
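The delivered talosctl guarantees combine a verb allowlist (BL-048) with a T3 gate counted over the nodes actually targeted (BL-047). The following is a minimal sketch under stated assumptions:

- The allowlist contents are illustrative.
- `reset` stands in for a T3 verb.
- A host with no `nodes` list falls back to its name as a single target.

`Host`, `ALLOWED_VERBS`, `T3_VERBS` and `build_talosctl_argv` are hypothetical names. It does not attempt the structured-parameter refactor noted above as future work.

```python
from __future__ import annotations

import shlex
from dataclasses import dataclass

# Illustrative allowlist of leading verbs; extend deliberately, one verb at a time.
ALLOWED_VERBS = frozenset({"version", "health", "services", "get", "reboot", "reset"})
# Assumption for this sketch: destructive verbs are T3 and must hit exactly one node.
T3_VERBS = frozenset({"reset"})


@dataclass(frozen=True)
class Host:
    name: str
    nodes: tuple[str, ...] = ()


def build_talosctl_argv(host: Host, action: str) -> list[str]:
    tokens = shlex.split(action)
    if not tokens or tokens[0] not in ALLOWED_VERBS:
        raise ValueError(f"talosctl verb not allowlisted: {action!r}")
    verb, rest = tokens[0], tokens[1:]
    # Node targeting comes only from the inventory, never from the free-form action.
    if any(tok.startswith(("-n", "--nodes")) for tok in rest):
        raise ValueError("node selection flags are not accepted in the action")
    # Gate on the nodes actually targeted, not on the host's display name.
    nodes = host.nodes or (host.name,)
    if verb in T3_VERBS and len(nodes) != 1:
        raise ValueError(f"T3 verb {verb!r} requires exactly one node, got {len(nodes)}")
    return ["talosctl", verb, "--nodes", ",".join(nodes), *rest]
```

With this shape, `build_talosctl_argv(Host("cluster", ("10.0.0.1", "10.0.0.2")), "reset")` is refused even though the host has a single name. That is exactly the case the `host.name` check missed.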
Findings
Verification: R = reproduced by executing the code, V = verified against the exact source.
| BL | Finding | Constraint | Sev | Verify | Status |
|---|---|---|---|---|---|
| 020 | SSH adapter has no host-key policy or `BatchMode`; a leading-dash target is an option-injection vector | SEC-5, INV 5 | High | V | resolved |
| 021 | Subprocess runner inherits stdin (can read the MCP stdio stream) and env (can hang on a prompt), and kills only the direct child on timeout (leaks the process tree) | SEC-8 | High | R | resolved |
| 047 | talosctl T3 single-target rule is enforced on `host.name`, not the `host.nodes` list, so one T3 reset can wipe multiple nodes | SEC-6, INV 6 | High | V | resolved |
| 048 | talosctl tokenises a free-form action into argv; constrain the leading verb to an allowlist | SEC-8 | Med | V | resolved |
| 018 | A trifecta refusal raises out of the tool handler with no audit record | SEC-4, INV 3 | Med | R | resolved |
| 055 | Audit logger reopens the file after `_degrade`, drops the sink to stderr on a corrupt tail (losing records), and leaks the handle | SEC-8, INV 3 | Med | R | resolved |
| 057 | Manifest parser accepts `----` as a fence, has no size cap, and allows indented and duplicate keys; non-UTF-8 bytes crash the loader | INV 8 | Med | R | resolved |
| 058 | AIDE empty output reads as a clean host (false negative); collected telemetry has no size cap before parsing | INV 8 | Med | R | resolved |
| 054 | `_cosine`/`similar` propagate a `NaN` from a poisoned or corrupted embedding into the ranking | SEC-10 | Med | R | resolved |
| 034 | `parse_ansible_check` only reads `changed:`; a `FAILED`/`UNREACHABLE` host during a check is dropped | SEC-6 | Med | V | resolved |
| 059 | An `UNEXPECTED` security-predicate finding (a rogue port or user) is ranked `INFO`, below a changed/missing one | SEC-6 | Med | V | resolved |
| 063 | Actuation subprocess does not scrub the env (no `GIT_TERMINAL_PROMPT=0` or `DEBIAN_FRONTEND=noninteractive`) or detach stdin | SEC-8 | Med | R | resolved |
| 064 | Audit log is created world/group readable and is not opened with `O_APPEND` at the OS level | SEC-9 | Med | R | resolved |
| 065 | Redaction misses common provider token shapes (`github_pat_`, `glpat-`, `npm_`, `AIza`, `ya29.`, Stripe, OpenAI scoped) and stops `Authorization` at the first space, leaking a comma-separated SigV4 signature | SEC-9, INV 3 | Med | R | resolved |
| 066 | `context.py` cites an out-of-tree prototype by name as rationale (self-containment and docs-honesty defect) | governance, INV (self-contained) | Low | V | resolved |
| 067 | `PRAXIS_HTTP_HOST` is not whitespace-stripped, so a `"127.0.0.1\n"` value is misread as non-loopback | SEC-7 | Low | R | resolved |
| 068 | Store `seq` is not unique; the `MAX(seq)+1` read can race across two store instances on one file (residual of BL-054) | SEC-10 | Low | V | open |
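BL-065 has two parts. The first is to anchor token patterns on their provider prefix. The second is to redact an `Authorization` header value through to the end of the line, so a SigV4 value such as `AWS4-HMAC-SHA256 Credential=..., SignedHeaders=..., Signature=...` is not cut off at its first space.

The following is a minimal sketch. `redact` and `REDACTED` are hypothetical names. The prefix list mirrors the shapes named in the table, but the regexes and the 16-character minimum run (which limits false positives such as `npm_config_cache`) are assumptions, not the praxis pattern set.

```python
import re

REDACTED = "[REDACTED]"

# Illustrative prefix-anchored shapes; the {16,} minimum length is an assumption.
_TOKEN_PATTERNS = [
    re.compile(pattern)
    for pattern in (
        r"\bgithub_pat_[A-Za-z0-9_]{16,}",          # GitHub fine-grained PAT
        r"\bglpat-[A-Za-z0-9_\-]{16,}",             # GitLab personal access token
        r"\bnpm_[A-Za-z0-9]{16,}",                  # npm access token
        r"\bAIza[0-9A-Za-z_\-]{16,}",               # Google API key
        r"\bya29\.[0-9A-Za-z_\-]{16,}",             # Google OAuth access token
        r"\b[rs]k_(?:live|test)_[0-9A-Za-z]{16,}",  # Stripe secret / restricted key
        r"\bsk-proj-[0-9A-Za-z_\-]{16,}",           # OpenAI project-scoped key
    )
]

# Match the header name (also inside Proxy-Authorization) and everything to end of line.
_AUTHORIZATION = re.compile(r"(?i)\b(authorization[ \t]*[:=][ \t]*)[^\r\n]+")


def redact(text: str) -> str:
    # Value-complete: the whole header value goes, not just the first space-delimited word.
    text = _AUTHORIZATION.sub(lambda m: m.group(1) + REDACTED, text)
    for pattern in _TOKEN_PATTERNS:
        text = pattern.sub(REDACTED, text)
    return text
```

Running the header rule first means a token embedded in an `Authorization` value is removed even if no prefix pattern recognises it.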
Consequences
Positive: the privileged-execution surface (SSH host-key policy and option-injection guard, process-group isolation, stdin detachment, env scrubbing, the talosctl verb allowlist and node-aware T3 gate) is materially hardened with tests; every denial is now audited; the audit log is owner-only and survives a corrupt tail without losing records; untrusted parsing (vectors, manifests, AIDE, telemetry size) is robust; and redaction covers more secret shapes value-completely. The repository is again fully self-contained in code and docs.
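An owner-only, OS-level append-only log (BL-064) comes down to the flags and mode passed to `open(2)`. The following is a minimal POSIX sketch in which `open_audit_log` is a hypothetical helper.

The `0o600` mode only applies when the file is created. For that reason, the sketch also tightens an existing log with `fchmod`. `O_NOFOLLOW` is used only where the platform provides it.

```python
from __future__ import annotations

import os
import stat
import tempfile
from typing import TextIO


def open_audit_log(path: str) -> TextIO:
    """Open an append-only, owner-only audit log (POSIX sketch)."""
    # O_APPEND: the kernel positions every write at end-of-file, so records never overwrite.
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    flags |= getattr(os, "O_NOFOLLOW", 0)  # refuse to follow a planted symlink where supported
    fd = os.open(path, flags, 0o600)  # mode applies only when the file is newly created
    try:
        os.fchmod(fd, 0o600)  # also tighten a pre-existing, over-permissive log
        return os.fdopen(fd, "a", encoding="utf-8")
    except BaseException:
        os.close(fd)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "audit.jsonl")
        with open_audit_log(log_path) as log:
            log.write('{"seq": 1}\n')
        print(oct(stat.S_IMODE(os.stat(log_path).st_mode)))  # 0o600
```

Opening the file once and keeping the handle also avoids the reopen-after-degrade pattern recorded in BL-055.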
Negative: the actuation subprocess path moved from `subprocess.run` to `Popen` with an explicit timeout and kill, which puts more code on the trust boundary (covered by a new timeout test). The talosctl verb allowlist must be extended deliberately when a new subcommand is needed.
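The `Popen`-based path described above combines four controls:

- stdin detached from the MCP stdio stream
- a scrubbed environment with non-interactive hints
- a new session, so the child leads its own process group
- a group-wide kill on timeout

The following is a minimal POSIX sketch under stated assumptions. `run_isolated`, `scrubbed_env` and the `_PASS_THROUGH` variable list are hypothetical, and the real praxis allowlist of inherited variables may differ.

```python
from __future__ import annotations

import os
import signal
import subprocess
from collections.abc import Sequence

# Hypothetical allowlist of inherited variables; everything else is dropped.
_PASS_THROUGH = ("PATH", "HOME", "LANG", "LC_ALL", "TZ")


def scrubbed_env() -> dict[str, str]:
    env = {k: os.environ[k] for k in _PASS_THROUGH if k in os.environ}
    env["GIT_TERMINAL_PROMPT"] = "0"  # git fails instead of prompting for credentials
    env["DEBIAN_FRONTEND"] = "noninteractive"  # apt/dpkg never wait for input
    return env


def run_isolated(argv: Sequence[str], timeout: float) -> subprocess.CompletedProcess[str]:
    proc = subprocess.Popen(
        list(argv),
        stdin=subprocess.DEVNULL,  # the tool cannot read the server's stdio stream
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env=scrubbed_env(),
        start_new_session=True,  # setsid() in the child: its pid is also its process-group id
        text=True,
    )
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)  # kill the whole group, grandchildren included
        except ProcessLookupError:
            pass  # the group already exited
        out, err = proc.communicate()  # reap the child and drain the pipes
        raise subprocess.TimeoutExpired(proc.args, timeout, output=out, stderr=err)
    return subprocess.CompletedProcess(proc.args, proc.returncode, out, err)
```

Killing only `proc.pid` would leave a backgrounded grandchild holding the output pipes open, and the final `communicate()` would then block. The group kill is what makes the timeout bounded. A grandchild that calls `setsid()` itself escapes the group, so this is a containment measure, not a sandbox.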
Neutral: this ADR records the wave and its acceptance; enforcement is the code and tests under each item. The architectural open items from ADR-0012 are unchanged.
Alternatives considered and rejected
- Resolve every open item from ADR-0011/0012 in one change. Rejected: the architectural items (hostname-resolving SSRF, credential wiring, CI/deploy gating) are larger than this surgical security cluster and merit their own reviewable changes, consistent with ADR-0012's staging.
- Default the SSH host-key policy to `StrictHostKeyChecking=yes`. Rejected for v0: a fleet with no pre-seeded `known_hosts` would refuse every first connection; `accept-new` (Trust-On-First-Use, refusing a changed key) is the secure default and is overridable to `yes` once `known_hosts` is seeded.
- Degrade the audit sink to stderr on a corrupt tail (the prior behaviour). Rejected: losing the audit record is worse than a visible seq-reset seam that the verifier reports; the seam is the security signal.
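The adopted SSH policy (BL-020) has three parts:

- `BatchMode=yes`, so ssh never prompts
- `StrictHostKeyChecking=accept-new` by default, overridable to `yes`
- a refusal of option-like targets, plus `--` to end ssh option parsing

The following is a minimal sketch in which `build_ssh_argv` and `HOST_KEY_POLICIES` are hypothetical names. The remote command string is still interpreted by the remote user's shell, so this addresses only prompting, host-key trust and option injection.

```python
from __future__ import annotations

# Hypothetical helper; the policy values are real OpenSSH StrictHostKeyChecking options.
HOST_KEY_POLICIES = frozenset({"accept-new", "yes"})


def build_ssh_argv(target: str, action: str, host_key_policy: str = "accept-new") -> list[str]:
    if not target or target.startswith("-"):
        raise ValueError(f"refusing ssh target {target!r}: empty or option-like")
    if any(ch.isspace() or not ch.isprintable() for ch in target):
        raise ValueError(f"refusing ssh target {target!r}: whitespace or control character")
    if host_key_policy not in HOST_KEY_POLICIES:
        raise ValueError(f"unsupported host-key policy {host_key_policy!r}")
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt for passwords or host-key confirmation
        "-o", f"StrictHostKeyChecking={host_key_policy}",  # TOFU by default; changed keys refused
        "--",  # end of ssh options: nothing after this is parsed as a flag
        target,
        action,
    ]
```

The leading-dash check and `--` are deliberately redundant. Either alone stops a target like `-oProxyCommand=...` from being read as an option.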
Revisit triggers
- The HTTP transport is implemented (raises BL-046 SSRF resolution and the consent registry in urgency).
- A concurrent or multi-instance store path is implemented (raises BL-068).
- A later audit contradicts a recorded verdict here (append an audit note, never rewrite a resolved row).