ADR 0006: Syscall-level audit channel via seccomp-bpf notification mode
- Status: Accepted (Proposed 2026-05-24)
- Date: 2026-06-02
Context
ADR 0002 establishes the service account — not an in-process sandbox — as the trust boundary, and ADR 0003 adds a tier classifier that records the intended blast radius of each tool call. Together they cover the MCP-side of the request: what the model asked for, how the runner classified it, what bytes flowed back, and the SHA-256 of the output.
What they do not cover is anything the spawned process does after
asyncio.create_subprocess_* returns. Once a child is running (a one-shot
shell.run, a script body, a PTY session, or an SSH session's local
half), the kernel sees every syscall but relay-shell sees only the
combined stdout/stderr and the exit code. Two failure modes follow:
- Audit gap on the child side of the boundary. A long-running script
may open files, exec further processes, mount/unmount, set/clear
capabilities, or call prctl(PR_SET_DUMPABLE, 0), none of which appear
in the audit trail. The hash-of-output invariant from ADR 0002 is
preserved, but a forensic question of the form "did the child shell
out to nc after the first stdout line?" cannot be answered from
audit.jsonl alone.
- No structured channel for the host's own monitoring. Operators
shipping audit.jsonl to a SIEM (see docs/audit-shipper.md) have no way
to correlate syscall-level activity with the tool call that spawned
it. Linux auditd runs at the host level and produces syscall events
for every process on the box; matching those to a relay-shell
request_id requires PID-tree walking, which is racy under fast-exiting
children.
B-021 in docs/runbook.md §7.5 flagged seccomp-bpf notification mode as
a way to close the first gap without re-introducing a sandbox. This ADR
records the design constraints before any code lands, per the ADR
README criteria ("A change to the audit-record shape... needs an ADR").
Decision
Add an audit-only seccomp-bpf channel using user-notify mode
(SECCOMP_RET_USER_NOTIF, first available in Linux 5.0; the effective
floor for this design is Linux >= 5.5 because
SECCOMP_USER_NOTIF_FLAG_CONTINUE lands there). The channel is:
- Notify-only, never block. Every notified syscall is allowed to
continue via SECCOMP_USER_NOTIF_FLAG_CONTINUE (Linux >= 5.5). The
notify handler emits an audit event and returns; it does not gate the
syscall, does not rewrite arguments, and does not kill the child. This
preserves ADR 0002 verbatim: the executor still runs unsandboxed, the
service account is still the boundary.
- Opt-in. Disabled by default. A new RELAY_SHELL_SECCOMP_NOTIFY setting
(default off) turns it on. When off, no seccomp filter is installed on
the child and the spawn path is byte-identical to today.
- Linux-only. macOS and the BSDs treat the setting as a no-op (logged
once at startup with the host's uname -s). The runbook will document
the supported kernel floor (5.5 for CONTINUE; note that ADDFD arrived
later, in 5.9; 6.0+ recommended for stable seccomp_notify_id_valid
semantics) when the implementing PR lands.
- Narrow syscall set. The filter notifies on a small, forensically
interesting list: execve, execveat, openat with
O_WRONLY|O_RDWR|O_CREAT, mount, umount2, setuid/setgid, unshare, prctl
(with capability-relevant option values), and ptrace. Everything else
stays in the default SECCOMP_RET_ALLOW path with no kernel-userspace
round-trip. The list will live in a dedicated module under
src/relay_shell/ and be version-pinned the way patterns.py is; the
exact filename is the implementing PR's call.
- Bounded audit volume. The notify handler writes one JSON event per
notification into the same audit.jsonl stream with
tool="syscall_notify" and tier=0 (a passive observation, not a call).
The event carries the originating request_id, the child PID, and the
syscall name and numeric arguments (no buffer dereferencing, which
requires PIDFD_GETFD and re-introduces a sandbox-shaped attack
surface). A per-call event cap (RELAY_SHELL_SECCOMP_NOTIFY_CAP,
default 256) bounds the worst case: beyond the cap, the channel
records a single syscall_notify_overflow line and stops emitting for
that call.
- Failure isolation. If the supervisor's notify socket dies, the child
continues unaffected (the kernel falls back to ALLOW for every
subsequent notification once the listener is gone). The audit pipeline
records degraded=true on the next call, identically to the existing
AuditLogger degraded path.
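The bounded-volume rule above can be sketched in a few lines of Python. This is an illustration only, not the shipped module: the class name, the `args` and `cap` field names, and the assumption that the overflow marker uses tool="syscall_notify_overflow" are all hypothetical; only the tool/tier values, the request_id and PID fields, and the default cap of 256 come from the Decision.

```python
import io
import json
from dataclasses import dataclass
from typing import TextIO


@dataclass
class CappedNotifyEmitter:
    """Hypothetical per-call emitter: one JSON line per notification, capped."""

    stream: TextIO
    request_id: str
    cap: int = 256  # mirrors the RELAY_SHELL_SECCOMP_NOTIFY_CAP default
    emitted: int = 0
    overflowed: bool = False

    def _write(self, record: dict) -> None:
        self.stream.write(json.dumps(record, sort_keys=True) + "\n")

    def notify(self, pid: int, syscall: str, args: list[int]) -> bool:
        """Record one notification; return False once the cap is exhausted."""
        if self.overflowed:
            return False  # already marked; stay silent for this call
        if self.emitted >= self.cap:
            # Assumed shape of the single overflow marker line.
            self._write({"tool": "syscall_notify_overflow", "tier": 0,
                         "request_id": self.request_id, "cap": self.cap})
            self.overflowed = True
            return False
        self._write({"tool": "syscall_notify", "tier": 0,
                     "request_id": self.request_id, "pid": pid,
                     "syscall": syscall, "args": args})  # numeric args only
        self.emitted += 1
        return True


def run_demo(cap: int, notifications: int) -> tuple[list[bool], list[dict]]:
    """Drive the emitter against an in-memory stream and parse what it wrote."""
    buf = io.StringIO()
    emitter = CappedNotifyEmitter(stream=buf, request_id="req-1", cap=cap)
    results = [emitter.notify(4242, "execve", [0, 0, 0])
               for _ in range(notifications)]
    return results, [json.loads(line) for line in buf.getvalue().splitlines()]
```

With a cap of 2 and four notifications, the stream holds two syscall_notify lines followed by exactly one overflow marker; later notifications are dropped without further output.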
Consequences
- The audit record schema grows two new event types (syscall_notify,
syscall_notify_overflow). The implementing PR will document the new
shape in docs/architecture.md §"Request lifecycle" and docs/tools.md
§"Audit shape". The existing per-call record is unchanged: the new
events are additional lines, not replacement fields, so log shippers
and off-host parsers built against the current shape keep working.
- The runbook §2 audit pass gains a step under §3 (Upstream surface
validation): assert that the kernel-side constants the filter uses
(SECCOMP_RET_USER_NOTIF, SECCOMP_USER_NOTIF_FLAG_CONTINUE) are present
in libseccomp's headers on the build host. This catches a
silently-downgraded kernel.
- The HTTP /metrics endpoint (ADR 0001 / B-012) gains two counters:
relay_shell_seccomp_notify_events_total{syscall="..."} and
relay_shell_seccomp_notify_overflow_total, so an operator can alert on
a chatty child or a per-call cap that needs bumping.
- The verifier (relay-shell --verify-deploy, B-020) gains a check that
warns when RELAY_SHELL_SECCOMP_NOTIFY=on is set on a host whose kernel
is below 5.5; the verifier already speaks the env-var vocabulary so
this is a one-row addition.
- The CI matrix (B-009, Python 3.12/3.13/3.14) does not need a new
axis; the seccomp code path is gated behind the env var and skipped
in CI by default. A dedicated seccomp pytest mark covers the
Linux-only tests; CI runs them on the ubuntu-latest leg only.
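As a rough illustration of the two counters named above, the following sketch renders them in the Prometheus text exposition format. The class and method names are hypothetical, and the tuple of label values is simply the Decision's syscall list, not the shipped one; the point is that the syscall label is drawn from a fixed set, so cardinality stays bounded.

```python
from collections import Counter

# Illustrative label set taken from the Decision's list (an assumption,
# not the shipped table).
NOTIFIED_SYSCALLS = (
    "execve", "execveat", "openat", "mount", "umount2",
    "setuid", "setgid", "unshare", "prctl", "ptrace",
)


class SeccompMetrics:
    """Hypothetical in-memory holder for the two /metrics counters."""

    def __init__(self) -> None:
        self.events: Counter[str] = Counter()
        self.overflow = 0

    def record_event(self, syscall: str) -> None:
        # Refuse unknown names so the label set cannot grow without bound.
        if syscall not in NOTIFIED_SYSCALLS:
            raise ValueError(f"syscall not in the notified set: {syscall}")
        self.events[syscall] += 1

    def record_overflow(self) -> None:
        self.overflow += 1

    def render(self) -> str:
        """Render both counters in Prometheus text exposition format."""
        lines = ["# TYPE relay_shell_seccomp_notify_events_total counter"]
        for name in NOTIFIED_SYSCALLS:
            count = self.events[name]
            if count:
                lines.append(
                    f'relay_shell_seccomp_notify_events_total{{syscall="{name}"}} {count}'
                )
        lines.append("# TYPE relay_shell_seccomp_notify_overflow_total counter")
        lines.append(f"relay_shell_seccomp_notify_overflow_total {self.overflow}")
        return "\n".join(lines) + "\n"


def demo_render() -> str:
    metrics = SeccompMetrics()
    metrics.record_event("execve")
    metrics.record_event("execve")
    metrics.record_event("ptrace")
    metrics.record_overflow()
    return metrics.render()
```

An alert on a chatty child would then be a rate over the events counter, and a non-zero overflow counter suggests the per-call cap is too low for the workload.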
Rejected alternatives
- Seccomp filter mode (SECCOMP_RET_KILL_PROCESS / SECCOMP_RET_ERRNO).
This is a sandbox, exactly the posture ADR 0002 rejects. A
kill-on-violation filter would make every new syscall added by a
kernel upgrade a potential outage and would require maintaining a
denylist big enough to cover the long tail of what an operator's
scripts legitimately do. Notify-mode keeps the capability and adds
visibility without taking on the kill-list maintenance burden.
- eBPF tracing via bpftrace / tracee / a custom BPF_PROG_TYPE_KPROBE
program. Heavier dependency surface (kernel headers, BTF on older
kernels, a privileged loader process), and the per-event payload
carries kernel pointers we would have to peer-dereference to attribute
to a relay-shell call. Seccomp-notify ties events to the child PID we
just spawned, which is the attribution we actually want; eBPF would
deliver firehose-scoped events we then have to filter back down to one
PID tree.
- ptrace(PTRACE_SEIZE) per child. Quadratic single-stepping cost on
syscall-heavy workloads (a find / would crawl), a well-known DoS
surface (the tracer can be stalled by an uncooperative tracee), and
Linux allows only one tracer per task, so adoption would conflict with
operators running their own strace / gdb on the same child.
seccomp-notify is the kernel's purpose-built mechanism for this exact
case.
- Lean on host auditd. Already runs on most production hosts, and the
operator should keep it on, but it sees every process on the box, not
just relay-shell children, and the attribution back to a relay-shell
request_id requires PID-tree walking that is racy under fast-exiting
children. The two channels are complementary, not substitutes: auditd
covers host-level events, the seccomp-notify channel covers per-call
attribution.
- A separate "syscall_audit.jsonl" sink. Splitting the audit trail
across files makes off-host shipping (ADR-aligned with
docs/audit-shipper.md) and forensic correlation harder. One
append-only stream with a discriminator on tool is consistent with the
existing resource-read events (tool="resource:<name>", see
docs/architecture.md).
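To make the single-stream argument concrete, here is a minimal sketch of how a consumer might split one JSONL stream on the tool discriminator and answer a per-request question. The sample records are invented for illustration: the "status" resource name, the tier value on the per-call record, and the `syscall` key placement are assumptions, not the documented shape.

```python
import json
from collections import defaultdict
from typing import Iterable

# Invented sample lines; only the tool values and request_id/tier/syscall
# field names echo the ADR.
SAMPLE_STREAM = [
    '{"tool": "shell_exec", "request_id": "r1", "tier": 1}',
    '{"tool": "syscall_notify", "request_id": "r1", "tier": 0, "syscall": "execve"}',
    '{"tool": "resource:status", "request_id": "r2", "tier": 0}',
    '{"tool": "syscall_notify", "request_id": "r3", "tier": 0, "syscall": "mount"}',
]


def split_by_tool(lines: Iterable[str]) -> dict[str, list[dict]]:
    """Group parsed records by their tool discriminator."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in a tailed file
        record = json.loads(line)
        groups[record.get("tool", "<missing>")].append(record)
    return dict(groups)


def syscalls_for(lines: Iterable[str], request_id: str) -> list[str]:
    """List the notified syscalls attributed to one request, in stream order."""
    events = split_by_tool(lines).get("syscall_notify", [])
    return [e["syscall"] for e in events if e.get("request_id") == request_id]
```

Because the per-call record and its syscall events share one file and one request_id, the correlation is a filter over a single stream rather than a join across two sinks.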
Validation outcome
Accepted 2026-06-02 with the implementing PR (runbook §7.5 B-021). The
channel ships in src/relay_shell/seccomp.py (version-pinned
SECCOMP_FILTER_VERSION, like patterns.py), wired into the local
executor via a per-call ContextVar, with additive syscall_notify /
syscall_notify_overflow audit lines and two bounded /metrics
counters. The four-step ADR 0005 pass ran green against the
implementation:
- Code index: one new module, no new tool (the 21-tool contract is
unchanged; this is an audit event, not a tool). server_info grows a
seccomp block; Settings grows seccomp_notify / seccomp_notify_cap.
- Quality gates: ruff / ruff format / mypy --strict clean; pytest
green; coverage holds the 90% floor (seccomp.py ~97% with the portable
unit suite alone; the privileged paths carry a # pragma: no cover or
are exercised by the seccomp-marked end-to-end tests).
- Upstream surface validation: the kernel ABI constants
(SECCOMP_FILTER_FLAG_NEW_LISTENER = 1<<3, SECCOMP_RET_USER_NOTIF,
SECCOMP_USER_NOTIF_FLAG_CONTINUE, the notify ioctl numbers, and the
80/24/64 struct sizes) were validated against a live Linux 6.18 /
x86_64 host; platform_support() re-checks the struct sizes via
SECCOMP_GET_NOTIF_SIZES at runtime and disables the channel on a
mismatch (the "silently-downgraded kernel" guard this ADR called for).
- Behavior validation: seccomp-marked end-to-end tests drive a real
child and assert that execve and a write-openat are observed and
allowed to continue, a read-only open is not notified, the per-call
cap emits one overflow marker while the child still runs to
completion, and the events extend the ADR 0007 hash chain.
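The 80/24/64 struct sizes mentioned above can be reproduced with plain ctypes. The sketch below mirrors the kernel's UAPI layouts for struct seccomp_notif, seccomp_notif_resp, seccomp_data, and seccomp_notif_sizes; the Python class and function names are hypothetical and are not taken from seccomp.py. The live query assumes x86_64 (syscall number 317) and is shown for context only.

```python
import ctypes


class SeccompData(ctypes.Structure):
    _fields_ = [("nr", ctypes.c_int),
                ("arch", ctypes.c_uint32),
                ("instruction_pointer", ctypes.c_uint64),
                ("args", ctypes.c_uint64 * 6)]


class SeccompNotif(ctypes.Structure):
    _fields_ = [("id", ctypes.c_uint64),
                ("pid", ctypes.c_uint32),
                ("flags", ctypes.c_uint32),
                ("data", SeccompData)]


class SeccompNotifResp(ctypes.Structure):
    _fields_ = [("id", ctypes.c_uint64),
                ("val", ctypes.c_int64),
                ("error", ctypes.c_int32),
                ("flags", ctypes.c_uint32)]


class SeccompNotifSizes(ctypes.Structure):
    _fields_ = [("seccomp_notif", ctypes.c_uint16),
                ("seccomp_notif_resp", ctypes.c_uint16),
                ("seccomp_data", ctypes.c_uint16)]


SYS_SECCOMP_X86_64 = 317      # x86_64 only; other arches differ
SECCOMP_GET_NOTIF_SIZES = 3


def local_sizes() -> tuple[int, int, int]:
    """Sizes of our ctypes mirrors, in the kernel's reporting order."""
    return (ctypes.sizeof(SeccompNotif),
            ctypes.sizeof(SeccompNotifResp),
            ctypes.sizeof(SeccompData))


def abi_matches(kernel: SeccompNotifSizes) -> bool:
    """True when the kernel-reported sizes equal the compiled-in layouts."""
    reported = (kernel.seccomp_notif, kernel.seccomp_notif_resp,
                kernel.seccomp_data)
    return reported == local_sizes()


def query_kernel_sizes() -> SeccompNotifSizes:
    """Ask a live Linux x86_64 kernel for its sizes (not exercised here)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    sizes = SeccompNotifSizes()
    rc = libc.syscall(ctypes.c_long(SYS_SECCOMP_X86_64),
                      ctypes.c_long(SECCOMP_GET_NOTIF_SIZES),
                      ctypes.c_long(0), ctypes.byref(sizes))
    if rc != 0:
        raise OSError(ctypes.get_errno(), "seccomp(SECCOMP_GET_NOTIF_SIZES) failed")
    return sizes
```

A runtime guard in this style would call query_kernel_sizes() once and treat any mismatch as "unsupported" rather than guessing at a changed layout.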
Refinements adopted at implementation (deltas from the Decision)
- No libseccomp dependency. The forward-looking note proposed a
python-libseccomp extra; the channel is implemented in pure ctypes
instead, so the bare and [dev] installs gain zero new dependencies.
The [seccomp] extra is therefore unnecessary and was not added.
- CAP_SYS_ADMIN-gated, never no_new_privs. A seccomp filter can be
installed either with CAP_SYS_ADMIN or by latching no_new_privs.
Latching no_new_privs would silently disable set-uid escalation in
audited children (sudo would break), a capability regression this
project forbids and one the Decision's "preserves ADR 0002 verbatim"
claim cannot tolerate. The channel therefore activates only with
CAP_SYS_ADMIN (e.g. running as root, a supported privileged posture)
and installs without no_new_privs, so set-uid / sudo semantics are
unchanged. Without CAP_SYS_ADMIN the channel cleanly no-ops.
- x86_64 only in v1. Only syscall-number tables we can validate on a
live host ship; any other arch makes platform_support() report
unsupported and the channel no-op, so a guessed number can never
notify the wrong syscall. aarch64 is a recorded follow-up (runbook
§7.5).
- Syscall set. Implemented unconditionally: execve, execveat, ptrace,
mount, umount2, unshare, setns, chroot, pivot_root, setuid / setgid /
setreuid / setregid / setresuid / setresgid; plus openat / open gated
on a write/create flag (O_WRONLY|O_RDWR|O_CREAT). prctl
option-filtering is deferred (it needs per-argument BPF predicates and
has volume concerns); recorded as a follow-up. The
privilege/namespace coverage is broader than the Decision's sketch.
- Runtime support check placement. The Decision put a kernel-floor
check in the verifier (--verify-deploy); that command does template
drift detection only. The runtime check instead lives in
platform_support(), is surfaced by server_info.seccomp (supported +
reason), logged once at startup, and reflected by --check-config.
- Scope (v1). The one-shot local executor (shell_exec / shell_script /
ssh_keyscan). Long-lived PTY sessions and the SSH-local half are a
recorded follow-up (runbook §7.5). (Both resolved 2026-06-09; see the
follow-up section below.)
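The write/create gate on open and openat comes down to a bitmask test on the flags argument. The sketch below expresses that predicate in Python for readability; the real filter evaluates it in BPF. The function name and the argument-index table are hypothetical, and the flag values are the Linux x86_64 ones, hardcoded because the filter targets that ABI regardless of the host running this snippet.

```python
# Linux x86_64 open(2) flag values (hardcoded for the target ABI).
O_RDONLY = 0o0
O_WRONLY = 0o1
O_RDWR = 0o2
O_CREAT = 0o100
O_CLOEXEC = 0o2000000

WRITE_OR_CREATE = O_WRONLY | O_RDWR | O_CREAT

# Position of the flags argument in each syscall's argument list:
# open(path, flags, mode) and openat(dirfd, path, flags, mode).
FLAGS_ARG_INDEX = {"open": 1, "openat": 2}


def should_notify_open(syscall: str, args: list[int]) -> bool:
    """True when an open/openat call asks for write access or creation."""
    flags = args[FLAGS_ARG_INDEX[syscall]] & 0xFFFFFFFF  # flags is a 32-bit int
    return bool(flags & WRITE_OR_CREATE)
```

Since O_RDONLY is zero, a read-only open never sets any bit in the mask, which is why read-only opens pass through without a notification.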
Follow-ups landed 2026-06-09 (B-024, B-026; filter version 2)
- prctl option-filtering (B-024). prctl joined the notified set, gated
on the privilege/capability-relevant option values the Decision
sketched: PR_SET_DUMPABLE, PR_SET_KEEPCAPS, PR_SET_SECCOMP,
PR_CAPBSET_DROP, PR_SET_SECUREBITS, PR_SET_NO_NEW_PRIVS,
PR_CAP_AMBIENT (the PRCTL_NOTIFIED_OPTIONS tuple in seccomp.py,
validated against a live host's <linux/prctl.h>). The filter assembler
gained an eq-any predicate on args[0] alongside the existing
write-flag predicate, so high-volume benign options (PR_SET_NAME from
thread naming, glibc's PR_SET_VMA tagging) never trap, which addresses
the volume concern that deferred this in v1. SECCOMP_FILTER_VERSION is
now 2. The paired positive / near-miss tests run portably through a
small classic-BPF interpreter in tests/test_seccomp.py (near-misses
include the numerically-adjacent GET twins of each notified option),
plus a seccomp-marked live test driving a real child through one
notified and one near-miss prctl.
- PTY session coverage (B-026). sessions.LocalPtyTransport.spawn
consults the same ambient per-call monitor the one-shot executor uses,
and the transport adopts it: the monitor is stopped in aclose() (and
on the spawn-failure path), not when the originating shell_spawn call
returns. The session child and everything it forks inherit the filter,
so commands typed into a session are observed; events keep the
spawning call's request_id, and the RELAY_SHELL_SECCOMP_NOTIFY_CAP
bound applies per session rather than per call. The off path is
byte-identical, as before.
- SSH-local half: resolved as vacuous. The Context's "an SSH session's
local half" anticipated a local child on the SSH path; as implemented
there is none: asyncssh runs in-process and sshpool.py spawns no
subprocess (no ProxyCommand support is wired). There is nothing local
to observe, so no code change applies. If a local-subprocess proxy
path ever lands, the ambient-monitor pattern covers it the same way
the PTY path does.
- Audit-record shape: unchanged (syscall_notify /
syscall_notify_overflow as accepted; prctl events are just a new value
in the existing syscall field, and the /metrics syscall label set
stays bounded by NOTIFIED_SYSCALLS).
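The eq-any predicate and the portable interpreter idea can be illustrated together. The sketch below is not the shipped assembler or test harness: it builds a prctl-only classic-BPF program as (code, jt, jf, k) tuples and evaluates it with a three-opcode interpreter. It assumes little-endian x86_64, takes the option numbers from <linux/prctl.h>, and deliberately omits the arch check and the high 32 bits of args[0] that a production filter would also examine.

```python
import struct

# Classic-BPF opcodes used here: load word at absolute offset,
# jump-if-equal against a constant, return a constant.
BPF_LD_W_ABS = 0x20
BPF_JEQ_K = 0x15
BPF_RET_K = 0x06

SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_USER_NOTIF = 0x7FC00000

SYS_PRCTL_X86_64 = 157
OFF_NR = 0        # seccomp_data.nr
OFF_ARG0_LO = 16  # low 32 bits of seccomp_data.args[0] (little-endian)

# PR_SET_DUMPABLE, PR_SET_KEEPCAPS, PR_SET_SECCOMP, PR_CAPBSET_DROP,
# PR_SET_SECUREBITS, PR_SET_NO_NEW_PRIVS, PR_CAP_AMBIENT
PRCTL_NOTIFIED_OPTIONS = (4, 8, 22, 24, 28, 38, 47)


def build_prctl_filter(options: tuple[int, ...]) -> list[tuple[int, int, int, int]]:
    """Emit: if nr == prctl and args[0] in options -> NOTIFY, else ALLOW."""
    n = len(options)
    prog = [
        (BPF_LD_W_ABS, 0, 0, OFF_NR),
        (BPF_JEQ_K, 0, n + 1, SYS_PRCTL_X86_64),  # not prctl: skip to ALLOW
        (BPF_LD_W_ABS, 0, 0, OFF_ARG0_LO),
    ]
    for i, option in enumerate(options):
        # On a match, jump over the rest of the chain and the ALLOW return.
        prog.append((BPF_JEQ_K, n - i, 0, option))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_USER_NOTIF))
    return prog


def run(prog: list[tuple[int, int, int, int]], data: bytes) -> int:
    """Interpret the three opcodes above against a packed seccomp_data."""
    acc = 0
    pc = 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1  # jump offsets are relative to the next instruction
        if code == BPF_LD_W_ABS:
            (acc,) = struct.unpack_from("<I", data, k)
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")


def seccomp_data(nr: int, args: tuple[int, ...] = ()) -> bytes:
    """Pack a 64-byte seccomp_data with AUDIT_ARCH_X86_64 and a zero ip."""
    padded = list(args) + [0] * (6 - len(args))
    return struct.pack("<iIQ6Q", nr, 0xC000003E, 0, *padded)
```

Running PR_SET_DUMPABLE (4) through the program yields the notify action, while its GET twin (3) and PR_SET_NAME (15) fall through to allow, which is the positive / near-miss pairing the tests describe.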
Operational notes (as accepted)
Operator-facing detail now lives in docs/deployment.md §6a; the as-built
rationale is captured here next to the decision.
- Activation prerequisites, not a kernel-floor installer check. The
channel self-gates at runtime via platform_support() (Linux / x86_64 /
kernel ≥ 5.5 / CAP_SYS_ADMIN / a matching notify ABI). When
RELAY_SHELL_SECCOMP_NOTIFY=on but a prerequisite is missing, it logs
once at startup, server_info.seccomp.supported is false with a reason,
and local spawns are byte-identical to the off path; there is no
separate installer uname -r gate to drift out of sync.
- No libseccomp packaging step. The pure-ctypes implementation needs
no system package; there is nothing to add to a base image beyond a
kernel that meets the floor and the CAP_SYS_ADMIN posture.
- Off-host shipping. No change to docs/audit-shipper.md: the new events
ride the same JSONL stream the three existing recipes already tail,
and the schema discriminator (tool field) is exactly what Vector /
Fluent Bit / journal-remote already key on.
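For operators who want to reason about the self-gating order, the sketch below models a prerequisite check that returns a supported flag plus a reason. It is an approximation under stated assumptions, not the real platform_support(): the function names, the reason strings, and the order of checks are hypothetical, the notify-ABI size check is omitted, and CAP_SYS_ADMIN is read as bit 21 of the CapEff mask in /proc/self/status.

```python
import re
from typing import NamedTuple

CAP_SYS_ADMIN_BIT = 21
KERNEL_FLOOR = (5, 5)


class Support(NamedTuple):
    supported: bool
    reason: str


def parse_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a uname -r string such as '6.18.0-1-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unparseable kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_cap_sys_admin(proc_status: str) -> bool:
    """Check the effective capability mask from /proc/<pid>/status text."""
    for line in proc_status.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return bool((mask >> CAP_SYS_ADMIN_BIT) & 1)
    return False


def platform_support_sketch(system: str, machine: str, release: str,
                            proc_status: str) -> Support:
    """Return the first missing prerequisite, or supported=True."""
    if system != "Linux":
        return Support(False, f"unsupported OS: {system}")
    if machine != "x86_64":
        return Support(False, f"unsupported arch: {machine}")
    if parse_release(release) < KERNEL_FLOOR:
        return Support(False, f"kernel {release} is below 5.5")
    if not has_cap_sys_admin(proc_status):
        return Support(False, "CAP_SYS_ADMIN not in effective set")
    return Support(True, "ok")
```

On a live host the inputs would come from platform.system(), platform.machine(), platform.release(), and the contents of /proc/self/status; passing them in as strings keeps the check testable off Linux.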