# Architecture Decision Records
The decisions documented here shape the trust boundary, the runtime,
and the operator-facing posture of relay-shell. Read them in order
the first time; revisit one when its subject area changes.
The format follows Michael Nygard's ADR template: context, decision, consequences. Each ADR has a `Status` line at the top. Status values:
| Status | Meaning |
|---|---|
| Proposed | Drafted, not yet adopted. |
| Accepted | Adopted; the codebase reflects the decision. |
| Superseded | Replaced by a later ADR; see its `Superseded by` line. |
| Deprecated | No longer applicable; kept for history. |
## Index
| ADR | Title | Status | Date | Subject |
|---|---|---|---|---|
| 0001 | Runtime, SDK, and SSH library | Accepted | 2026-05-19 | Why Python 3.12 + the official `mcp` SDK + `asyncssh`, with the alternatives that were rejected (`paramiko`, building a transport from scratch). |
| 0002 | Unsandboxed, full-access posture | Accepted | 2026-05-19 | Why the executor runs without a meaningful internal sandbox: the project exists to give an MCP client real administrative power, so the safety story rests on compensating controls (audit, tier policy, redaction, bounds, deployment discipline) instead. |
| 0003 | Tiered authority | Accepted | 2026-05-19 | The four-tier classification (read-only / reversible / stateful / irreversible) plus open / guarded / readonly admission modes that consume it. The deny list is enforced first in every mode. |
| 0004 | Automated TLS at the edge | Accepted | 2026-05-20 | Why `deploy/install-edge.sh` provisions Caddy + ACME (Let's Encrypt) for the HTTP transport, and why certbot+cron and native TLS in the Python service were rejected. |
| 0005 | Codebase validation against known-good sources | Accepted | 2026-05-24 | A repeatable validation pass against the upstream `mcp` / `asyncssh` / OAuth surfaces, the audit record schema, and the documented redaction / tier behavior. A running record that appends a dated outcome per pass: 2026-05-24 (three documentation-drift findings), 2026-05-31 (F-004, redaction coverage for bare provider-token shapes), 2026-06-01 (F-005 C-005 runbook drift; the ADR 0007 audit hash-chain landed in the same pass), 2026-06-12 (D-001, the `mcp` 1.27.1 → 1.27.2 pin-drift reconciliation, recorded as a full audit pass under `audit/`), and 2026-06-21 (DOC-1, the runbook §8.18 next-free-ADR marker corrected to match this index), plus a same-day 2026-06-21 full audit pass (scanner battery + steps 1-4 clean; SEC-3 dependency-floor hardening, TOOL-4 CODEOWNERS; deferrals in `audit/2026-06-21-engagement.md`). |
| 0006 | Syscall-level audit channel via seccomp-bpf notification mode | Accepted | 2026-06-02 | An audit-only seccomp-bpf channel (notify mode, never blocking) that closes the audit gap on the child side of `asyncio.create_subprocess_*` without re-introducing a sandbox. Shipped in `src/relay_shell/seccomp.py` (pure `ctypes`, no new deps): opt-in via `RELAY_SHELL_SECCOMP_NOTIFY` (default off), `CAP_SYS_ADMIN`-gated so set-uid/sudo posture is preserved verbatim, Linux/x86_64/kernel ≥ 5.5, additive `syscall_notify` / `syscall_notify_overflow` audit lines (extending the ADR 0007 chain) plus two bounded `/metrics` counters. Proposed 2026-05-24; accepted with the implementing PR (runbook §7.5 B-021). Follow-ups landed 2026-06-09 (filter version 2): `prctl` notified for privilege-relevant options via an eq-any predicate (B-024), and coverage extended to local PTY sessions, whose transport adopts the monitor for the session lifetime (B-026). |
| 0007 | Tamper-evident audit log via per-record hash chaining | Accepted | 2026-06-01 | An opt-in (`RELAY_SHELL_AUDIT_CHAIN`, default off), additive per-record hash chain (`seq`/`prev`/`chain`) that makes edits, insertions, reorders, and interior deletions of the on-disk audit log detectable by recomputation. The fail-closed `relay-shell --verify-audit` also rejects a missing, empty, or head-truncated log by default (use `--segment` for a rotation segment; tail truncation needs the off-host copy). Together these close the integrity gap that `chattr +a` plus off-host shipping leave open against the ADR 0002 residual-risk attacker. `jsonl` only; a CLI verb, not an MCP tool. |
| 0008 | Operating-guidance MCP prompt, audited like a resource read | Accepted | 2026-06-08 | Adds one MCP prompt (`operating_guide`) as the canonical home for detailed "when to use which tool" guidance (one-shot vs PTY session, the `spawn` + `session_*` workflow, fleet/transfer entry points), beyond the concise instructions string and per-tool descriptions. A fetch is a model-context pull, so it is audited (tier 0, stable `tool="prompt:<name>"`) and bounded by the same `max_output` cap, bypassing `Relay.run` exactly as resource reads do; `prompts/list` returns metadata only and is not audited. No audit-record-shape change, only a new `prompt:` tool namespace alongside `resource:` / `syscall_notify`. |
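The ADR 0007 row names three per-record fields: `seq`, `prev`, and `chain`. The sketch below is a minimal, hypothetical illustration of how such a chain detects the listed tampering by recomputation. It is not relay-shell's implementation, and its details are assumptions:

- SHA-256 over sorted-key JSON of every field except `chain`.
- An all-zero genesis `prev` for the first record.
- An in-memory list in place of the on-disk `jsonl` file.
- No model of `--segment`.

The names `append`, `verify`, and `GENESIS` are illustrative.

```python
import hashlib
import json

# Assumption: the first record links to a fixed all-zero "genesis" value.
GENESIS = "0" * 64


def _digest(record: dict) -> str:
    """SHA-256 over canonical JSON of every field except "chain" itself."""
    unsigned = {k: v for k, v in record.items() if k != "chain"}
    canonical = json.dumps(unsigned, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append(log: list[dict], body: dict) -> dict:
    """Add seq/prev/chain to a record body and append it to the in-memory log."""
    prev = log[-1]["chain"] if log else GENESIS
    record = {**body, "seq": len(log), "prev": prev}
    record["chain"] = _digest(record)
    log.append(record)
    return record


def verify(log: list[dict]) -> bool:
    """Recompute the chain; False on an edited, inserted, reordered, or interior-deleted record."""
    if not log:
        return False  # fail closed: an empty log is rejected
    prev = GENESIS
    for expected_seq, record in enumerate(log):
        # A gap, reorder, or head truncation breaks the seq/prev linkage.
        if record.get("seq") != expected_seq or record.get("prev") != prev:
            return False
        # An edited field no longer matches the stored digest.
        if record.get("chain") != _digest(record):
            return False
        prev = record["chain"]
    return True
```

With this shape, a log whose last record was dropped still verifies. That is the tail-truncation case the ADR leaves to the off-host copy.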
## When to write an ADR
Any of the following needs an ADR before code lands:
- A new transport (e.g. `unix-socket` alongside `stdio`/`streamable-http`).
- A new auth provider (e.g. JWT static-keys alongside the file-backed OAuth 2.1).
- A change to the audit-record shape or to the no-sandbox posture.
- A new policy category (not just another verb in the existing `TIER2_PATTERN`/`TIER3_PATTERN`; see `docs/runbook.md` §6.4).
Routine additions (a new tool, a new redaction pattern, a tightened test) do not need an ADR; they go through the normal review loop. `docs/runbook.md` §6 has a recipe for each case.
## How to write one
- Number sequentially. The next free number is 0009.
- Filename pattern: `NNNN-short-slug.md`.
- Required header:
- Sections: `Context`, `Decision`, `Consequences`, `Rejected alternatives` (when applicable).
- Reference an ADR by number in code or other docs, not by file path: the path is stable, but the number is the canonical handle.
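The numbering and filename rules above are mechanical enough to check with a script. The minimal sketch below makes these assumptions:

- ADR files are identified by file name alone.
- A slug is lowercase ASCII words joined by hyphens.
- The `Status` line is written as `Status: <value>`.

This page does not reproduce the required header, so `has_status_line` only looks for a `Status` line carrying one of the four values in the status table. All helper names are hypothetical.

```python
import re

# Assumption: "short-slug" means lowercase ASCII words joined by hyphens.
FILENAME_RE = re.compile(r"^(?P<number>\d{4})-[a-z0-9]+(?:-[a-z0-9]+)*\.md$")
# The four status values from the table above; the "Status: <value>" layout is assumed.
STATUS_RE = re.compile(
    r"^Status:\s*(?:Proposed|Accepted|Superseded|Deprecated)\b", re.MULTILINE
)


def adr_numbers(names: list[str]) -> list[int]:
    """Sorted ADR numbers taken from names matching NNNN-short-slug.md."""
    return sorted(int(m["number"]) for n in names if (m := FILENAME_RE.match(n)))


def next_free_number(names: list[str]) -> str:
    """Highest existing number plus one, zero-padded to four digits."""
    numbers = adr_numbers(names)
    nxt = numbers[-1] + 1 if numbers else 1
    return f"{nxt:04d}"


def is_sequential(names: list[str]) -> bool:
    """True when the numbers run 1, 2, 3, ... with no gaps or duplicates."""
    numbers = adr_numbers(names)
    return numbers == list(range(1, len(numbers) + 1))


def has_status_line(text: str) -> bool:
    """True when some line reads "Status: <one of the four values>"."""
    return STATUS_RE.search(text) is not None
```

Given the file names of the eight ADRs indexed above, and assuming they follow the pattern, `next_free_number` returns `"0009"`. Non-ADR files such as an index page do not match the pattern and are skipped.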
## Cross-references
- `docs/architecture.md`: request lifecycle, module table, and how the ADRs map onto the runtime.
- `docs/runbook.md` §6: extension recipes that may require an ADR before code lands.
- `SECURITY.md`: the threat model and how ADRs 0002 / 0003 constrain it.