Deployment¶
relay-shell grants real administrative power. The single most important control is
where and as whom it runs. This guide describes a production-grade
deployment modeled on a mature MCP gateway.
0. Pre-flight checklist¶
Run through this list before §1 (service account creation). Each row is a hard precondition for the matching later section; skipping any of them turns a smooth install into an outage.
- [ ] Service account name decided. The default is
relay-shellwith home directory/var/lib/relay-shell(§1). Pick a different name only if your host conventions require it; update the systemd unit + logrotate config +chownlines below accordingly. - [ ] Audit directory writable by the service account.
/var/log/relay-shell/exists, is owned by the service account, and the filesystem supportschattr +a(ext2/3/4, xfs). Append-only on atmpfsor fuse mount silently degrades to a normal write — confirmlsattrafterchattr +a. - [ ] DNS resolves for the edge domain (HTTP transport only).
dig +short ${RELAY_SHELL_EDGE_DOMAIN}returns the host's public IP. The HTTP-01 ACME challenge does not work without this. - [ ] Ports 80 and 443 reachable from the internet (HTTP transport
only). Port 80 is required for HTTP-01; closing it blocks
issuance and renewal both. If a host firewall is in front, open
it before running
install-edge.sh(or setRELAY_SHELL_EDGE_OPEN_FIREWALL=1to have the installer try). - [ ] SSH service account keypair generated (
ssh-keygen -t ed25519as the service account) andknown_hostsstrategy chosen (strictfor production — pre-populate withssh-keyscan; see §7). - [ ] Off-host audit shipping target ready (Vector / Fluent Bit /
journal-remote — see
docs/audit-shipper.md). An on-host audit file is evidence only until the host is compromised; the shipper must be in place before the unit is enabled in production.
If any row is unchecked, fix it before continuing — the install is idempotent but recovery from a half-configured edge or a non-shipping audit log is more work than the precondition.
1. Service account¶
Default recommendation: run as a dedicated unprivileged user, never a human's account.
sudo useradd --system --create-home --home-dir /var/lib/relay-shell \
--shell /usr/sbin/nologin relay-shell
Grant only the privileges the workload genuinely needs. If sudo is required
for the intended tasks, prefer command-scoped sudoers entries over
NOPASSWD: ALL. A single-owner lab host may accept a broader grant; a
multi-tenant or sensitive host must not. State the choice in an ADR.
If your explicit goal is maximum model capability (full root/sudo behavior), run the service in privileged posture on an isolated admin host and treat that host as a high-trust control plane.
2. Install¶
sudo -u relay-shell python3 -m venv /var/lib/relay-shell/venv
sudo -u relay-shell /var/lib/relay-shell/venv/bin/pip install --upgrade pip
sudo -u relay-shell /var/lib/relay-shell/venv/bin/pip install /path/to/relay-shell # or: pip install relay-shell
deploy/install.sh does this idempotently. It deliberately does not
auto-start the service; review the unit and configuration first.
Validate the resolved configuration against the service account's environment before enabling the unit:
sudo -u relay-shell \
--preserve-env=RELAY_SHELL_AUDIT_PATH,RELAY_SHELL_TRANSPORT,RELAY_SHELL_POLICY_MODE \
/var/lib/relay-shell/venv/bin/relay-shell --check-config
--check-config loads the settings, builds the server (audit sink open,
ssh_config + inventory parse, OAuth state dir creation if auth_enabled=true)
without starting a transport, and exits 0 on success or 2 on any
initialization failure - including a degraded audit sink, which is the
single most common production misconfiguration. Wire this into the
image-bake step of your CI pipeline.
3. systemd¶
deploy/systemd/relay-shell.service plus the relay-shell.service.d/hardening.conf
drop-in. The hardening is intentionally partial: filesystem, capability,
and syscall confinement (ProtectSystem=strict, NoNewPrivileges,
SystemCallFilter) would break the very shell/SSH capability this service
exists to provide (see docs/adr/0002). What is applied: resource caps
(MemoryMax, CPUQuota, TasksMax), PrivateTmp, restart limits, and the
non-execution-breaking Protect* directives. Encrypted credentials are
delivered via LoadCredentialEncrypted=.
sudo cp deploy/systemd/relay-shell.service /etc/systemd/system/
sudo mkdir -p /etc/systemd/system/relay-shell.service.d
sudo cp deploy/systemd/relay-shell.service.d/hardening.conf /etc/systemd/system/relay-shell.service.d/
sudo systemctl daemon-reload
sudo systemctl enable --now relay-shell
4. Network edge (HTTP transport)¶
The HTTP transport binds 127.0.0.1 by design. Terminate TLS and restrict by
source IP at a reverse proxy. deploy/Caddyfile is shipped parameterized
through environment variables and provides:
- automatic TLS via ACME (Let's Encrypt by default, ZeroSSL fallback),
- an
@blockedmatcher that 403s any source outside the allowlisted CIDRs, - HSTS /
X-Content-Type-Options/X-Frame-Options/Referrer-Policy/Content-Security-Policy(default-src 'self'; frame-ancestors 'none'), reverse_proxyto the loopback MCP port.
Set the allowlist to the CIDRs of your MCP client only. The OAuth browser
endpoints (/authorize, /.well-known/*) are reachable for the redirect
flow; tool traffic and /token are CIDR-restricted.
Defense in depth: a host firewall (only 80/443 inbound), the proxy CIDR matcher, OAuth 2.1, then the policy/audit layer.
4a. Automated TLS (Caddy + Let's Encrypt)¶
deploy/install-edge.sh is the supported turnkey path. It installs Caddy
from the official apt repository (if missing), drops the parameterized
Caddyfile in place, writes a systemd environment drop-in, validates the
config, and starts the service. Renewal is driven by Caddy's built-in ACME
scheduler - no cron, no certbot.
Prerequisites: a public DNS A/AAAA record for the chosen hostname pointing at this host, and TCP/80 + TCP/443 reachable from the internet (port 80 is required for the HTTP-01 challenge).
# Edit /etc/relay-shell/relay-shell.env (or export inline) and set at minimum:
# RELAY_SHELL_EDGE_DOMAIN=relay-shell.example.org
# RELAY_SHELL_EDGE_ACME_EMAIL=admin@example.org
# RELAY_SHELL_EDGE_CLIENT_CIDRS="203.0.113.0/24 198.51.100.0/24"
sudo deploy/install-edge.sh
| Variable | Purpose |
|---|---|
RELAY_SHELL_EDGE_DOMAIN |
Public hostname presented in the TLS certificate. |
RELAY_SHELL_EDGE_ACME_EMAIL |
Contact email for the ACME account. |
RELAY_SHELL_EDGE_CLIENT_CIDRS |
Space-separated source CIDR allowlist (defaults to loopback only - remote clients get 403 until you set this). |
RELAY_SHELL_EDGE_UPSTREAM |
Loopback target (default 127.0.0.1:8080). |
RELAY_SHELL_EDGE_ACME_CA |
ACME directory override; set to https://acme-staging-v02.api.letsencrypt.org/directory for dry runs against LE staging. |
RELAY_SHELL_EDGE_OPEN_FIREWALL |
Set to 1 to ufw allow 80,443/tcp if ufw is present. |
RELAY_SHELL_EDGE_DRY_RUN |
Set to 1 to log the resolved values and print the parameterized Caddyfile template, then exit without installing. Caddy substitutes the {$RELAY_SHELL_EDGE_*} placeholders at service start. |
RELAY_SHELL_EDGE_FORCE |
Set to 1 to overwrite an existing /etc/caddy/Caddyfile that this installer did not write. Without it, the installer refuses to clobber a Caddyfile that may serve other sites on the host. |
RELAY_SHELL_EDGE_CADDY_GPG_FPR |
Pin the Caddy apt repo signing key: the installer fails closed if the fetched key's fingerprint does not match (DEP-1). The key is fetched over TLS but otherwise trust-on-first-use; the installer always logs the observed fingerprint, so set this to that value (after confirming it at https://caddyserver.com/docs/install) to enforce it. Unset = unpinned, with a warning. Only used when the installer provisions Caddy via apt. |
The installer is idempotent: re-run it after editing the env file to push
changes. Both installers create /etc/relay-shell as 0750
(root:relay-shell), not world-listable — systemd reads the EnvironmentFiles
as root, so the service is unaffected. To pin the Caddy apt repo signing key,
set RELAY_SHELL_EDGE_CADDY_GPG_FPR (see the env-var table above); otherwise
the installer logs the fetched key's fingerprint and warns that it is
unpinned. The drop-in at
/etc/systemd/system/caddy.service.d/relay-shell-edge.conf is static and
references a managed EnvironmentFile=/etc/relay-shell/relay-shell-edge.env,
so user-supplied values never land inside systemd unit syntax. Cert state
persists under Caddy's data directory across restarts. See
docs/adr/0004-edge-tls-automation.md
for the design rationale and rejected alternatives.
Operators running a non-Caddy edge (nginx, HAProxy, an upstream LB with its own ACME integration) can still use the relay-shell service; the loopback-bind contract and CIDR/header expectations described above are all that matter.
5. OAuth 2.1 (optional)¶
Authentication is opt-in and off by default (RELAY_SHELL_AUTH_ENABLED
defaults to false) and applies only to the HTTP transport. A fresh install
never stands up an authenticated — or unauthenticated — network listener
unless you choose the HTTP transport; once you do, enable OAuth for any
exposure beyond a trusted loopback + edge.
RELAY_SHELL_TRANSPORT=http
RELAY_SHELL_AUTH_ENABLED=true # default false — opt in explicitly
RELAY_SHELL_AUTH_ISSUER=https://relay-shell.example.org
RELAY_SHELL_AUTH_STATE_DIR=/var/lib/relay-shell/oauth
RELAY_SHELL_AUTH_SINGLE_CLIENT=true # lock DCR after the first client registers
Install the [http] extra. Tokens are file-backed under the state dir
(clients.json, codes.json, tokens.json), access tokens are short-lived,
refresh tokens rotate on use, and expiry is enforced lazily on read. With
single-client lockdown, dynamic registration is refused once one client
exists. See auth.md for the full authentication lifecycle — how a
client registers, obtains tokens, and stays authenticated via refresh
rotation.
6. Audit¶
RELAY_SHELL_AUDIT_PATH (default /var/log/relay-shell/audit.jsonl). Make it append-only
and rotate it without losing that attribute:
sudo mkdir -p /var/log/relay-shell && sudo chown relay-shell:relay-shell /var/log/relay-shell
sudo touch /var/log/relay-shell/audit.jsonl && sudo chattr +a /var/log/relay-shell/audit.jsonl
sudo cp deploy/logrotate/relay-shell /etc/logrotate.d/relay-shell
RELAY_SHELL_AUDIT_FORMAT controls serialization for downstream SIEM ingest:
jsonl (default), cef, or leef.
The bundled logrotate config drops the append-only bit only for the rotate
and restores it immediately. Ship the log off-host and alert on gaps; an
on-host log is evidence only until the host is compromised. See
docs/audit-shipper.md for worked examples using
Vector, Fluent Bit, and journalctl → systemd-journal-remote.
6a. Tamper-evident chain (optional)¶
chattr +a and off-host shipping protect the log, but neither makes a
single altered record detectable, and the shipper has a flush window. Set
RELAY_SHELL_AUDIT_CHAIN=true (requires RELAY_SHELL_AUDIT_FORMAT=jsonl) to
add a per-record hash chain (ADR 0007): each
record carries seq, the previous record's prev hash, and its own chain
hash. Default off keeps the record byte-identical; server_info.audit.chain
reports the live state.
What the chain proves, and what it does not. From a single file the chain detects any edit, insertion, reorder, or interior deletion by recomputation — including from the shipped copy, without trusting the relay host. Head-truncation (excising leading records) is caught by the genesis anchor. Tail-truncation (dropping the newest records) leaves a shorter but valid prefix and is not detectable from the file alone — catch it by comparing against the off-host copy, which has the later records. This split is by design (ADR 0007 delegates durability/truncation defense to off-host shipping).
--verify-audit is fail-closed: it exits 0 only when the file exists,
carries a chained record, verifies clean, and is genesis-anchored; a missing /
empty / unchained log, a broken chain, or a non-genesis start (head-truncation)
exits 2. Enable chaining on a freshly rotated log so the chain runs from
genesis. Verify the on-host log or a shipped copy:
relay-shell --verify-audit # uses RELAY_SHELL_AUDIT_PATH
relay-shell --verify-audit --audit-path /var/log/relay-shell/audit.jsonl-20260601 \
--segment --json # a mid-stream rotation segment
# exit 0 = clean, genesis-anchored chain; exit 2 = missing/empty log, a record
# edited / reordered / inserted / deleted from the interior, or a non-genesis
# start. Pass --segment when the file legitimately starts at seq > 0 (a rotation
# segment); a missing/empty log and a broken chain fail regardless.
Rotation. While the process keeps running, rotation preserves the chain:
the in-memory anchor follows the file (WatchedFileHandler reopens, or
copytruncate keeps the fd), so the new file continues the same seq/chain.
A rotation immediately followed by a restart (before any record lands in
the fresh file) re-anchors at genesis: the new file is a fresh genesis segment
with seq restarting at 0 — a visible seam, not a silent gap. Verify each
genesis-anchored segment independently; cross-segment continuity lives in the
ordered off-host stream, consistent with ADR 0007's delegation of cross-file
durability to off-host shipping.
6b. Syscall-level audit channel (optional)¶
The audit record covers what the model asked for and the SHA-256 of the
output, but not what a spawned child does after exec returns. Set
RELAY_SHELL_SECCOMP_NOTIFY=true to add an audit-only seccomp user-notify
channel (ADR 0006):
locally-spawned children — one-shot (shell_exec / shell_script /
ssh_keyscan) and local PTY sessions (shell_spawn) — get a BPF filter that
traps a small, high-signal syscall set (execve, the set[re|res]?[ug]id
family, mount/umount2, unshare/setns, chroot/pivot_root,
ptrace, write-open/openat, and prctl for privilege-relevant options
such as PR_SET_SECUREBITS / PR_CAP_AMBIENT / PR_SET_NO_NEW_PRIVS), and
a supervisor appends one syscall_notify line per observed call to the same
JSONL stream (it extends the §6a hash chain when that is on). It never
blocks a syscall — the supervisor always answers CONTINUE — so ADR 0002's
no-sandbox posture is unchanged. For a PTY session the filter rides the
session child (and everything it forks) for the session's whole life, and
events carry the spawning call's request_id. SSH sessions have no local
child (asyncssh runs in-process), so nothing is observed — or missed —
on that path.
Activation is gated, by design. A seccomp filter installs with
CAP_SYS_ADMIN or by latching no_new_privs; the latter would silently
disable set-uid escalation in the child (sudo would break). To preserve the
privileged-admin posture verbatim, the channel installs only with
CAP_SYS_ADMIN (e.g. running as root) and never latches no_new_privs. It
also requires Linux / x86_64 / kernel ≥ 5.5 and a notify ABI matching the
build. Where any prerequisite is missing it cleanly no-ops — local spawns
are byte-identical to the off path — and server_info.seccomp.supported is
false with a reason (also logged once at startup). There is no system
package to install: the channel is pure ctypes.
RELAY_SHELL_SECCOMP_NOTIFY_CAP (default 256, range 1..65536) bounds the
event volume per spawned child — per call for the one-shot executors, per
session for shell_spawn (a long-lived interactive session runs many
commands under one filter, so size the cap for the session, not the
command): beyond it, one syscall_notify_overflow line is written
and emission stops while the child still runs to completion. Watch
relay_shell_seccomp_notify_overflow_total (§9a) and raise the cap if a
legitimate workload trips it. The events ride the same off-host shipper as the
rest of the log (docs/audit-shipper.md); shippers route on the tool field.
7. SSH credential scoping¶
The realized credential surface is whatever keys the service account can use.
Prefer one key per role/scope, revocable independently, over one all-powerful
key. RELAY_SHELL_SSH_KNOWN_HOSTS=strict is recommended for production; pre-populate
~/.ssh/known_hosts for the service account. Provide a JSON inventory via
RELAY_SHELL_INVENTORY for hosts not in ssh_config.
The SSH connection pool caches one connection per user@host:port and
reuses it for follow-up calls. RELAY_SHELL_SSH_IDLE_TIMEOUT (default
1800 seconds) drops a cached connection that has not been used for that
many seconds the next time the pool is consulted; set 0 to keep the
historical behavior (closed connections are still purged on the next
sweep). Long-running deployments that fan out across a large host
inventory should leave the reaper on so a long-lived server does not
accumulate idle handles.
8. Policy posture¶
open- full access, every call still classified and audited. The documented single-owner default.guarded- Tier 2+ refused unlessRELAY_SHELL_POLICY_ALLOWmatches; set an allowlist of sanctioned change patterns.readonly- only Tier 0. Useful for an observation-only client.
RELAY_SHELL_POLICY_DENY is a regex evaluated first in every mode, against
the probe text "<tool> <command>" (the tool name is prepended, so anchor with
\b/substrings rather than ^command; you can also deny a whole tool, e.g.
^ssh_keyscan). Use it as an always-on first-line filter — but treat it as
defence in depth, not an absolute prohibition: it matches command text, so
a determined caller can evade it with shell obfuscation (extra whitespace,
quoting like r''m, ${IFS}, $(...), base64 | sh) or alternate encodings.
ssh_keyscan normalizes any literal IP in its target list into the probe, so
an IP deny (e.g. on the cloud metadata address) is not dodged by a
decimal/hex/octal/IPv4-mapped spelling of the same address (SSRF-1) — but this
cannot help for hostnames (no DNS is resolved in the policy path; a DNS or
rebinding answer can differ from the one the connection dials). Enforce hard
prohibitions with OS/network controls — an egress firewall (DNS-rebinding-proof),
seccomp/AppArmor, a restricted service account, readonly/guarded mode — not
the deny list alone. See ADR 0003.
9. Health¶
scripts/healthcheck.sh checks the local HTTP port. For stdio, liveness is
the supervising client's concern. server_info reports effective limits and
whether the audit sink is degraded (a degraded audit sink is an alert).
The end-to-end HTTP smoke (start the transport, hit
/.well-known/oauth-protected-resource, stop it) is documented in
runbook.md §4.6.
9a. Prometheus metrics¶
The HTTP transport exposes a GET /metrics endpoint in Prometheus text
exposition format (no auth - the route bypasses OAuth by design, scope it
via the Caddy CIDR allowlist). The audit log remains the source of truth
for what happened; metrics are for dashboards only and reset on restart.
| metric | type | meaning |
|---|---|---|
relay_shell_tool_calls_total |
counter | One per tool call. Labels: tool, tier, mode, outcome. |
relay_shell_seccomp_notify_events_total |
counter | One per observed syscall when the §6b channel is on. Label: syscall (bounded set). |
relay_shell_seccomp_notify_overflow_total |
counter | One per tool call that hit RELAY_SHELL_SECCOMP_NOTIFY_CAP. |
relay_shell_active_sessions |
gauge | Live local + SSH PTY sessions. |
relay_shell_active_forwards |
gauge | Live SSH port forwards. |
relay_shell_audit_degraded |
gauge | 1 if the audit sink is degraded, 0 otherwise. Should always be 0. |
outcome is one of ok (work returned), denied (policy refused), or
error (work raised). Combine mode + tier + outcome for the classic
"denied tier-3 calls per minute" panel.
The stdio transport does not expose /metrics; the route is gated on
RELAY_SHELL_TRANSPORT=http.
10. Drift detection¶
After install, and on a periodic schedule in production, run:
It compares each shipped template against the file the installer placed:
| name | install path |
|---|---|
| systemd-unit | /etc/systemd/system/relay-shell.service |
| systemd-hardening | /etc/systemd/system/relay-shell.service.d/hardening.conf |
| logrotate | /etc/logrotate.d/relay-shell |
| caddyfile | /etc/caddy/Caddyfile (marker line is stripped) |
Exit 0 means every entry matched byte-for-byte; exit 2 means at least one
DRIFT, MISSING, or ABSENT_TEMPLATE row was reported. Pair with
--json for machine-readable output (Nagios / Prometheus blackbox /
Ansible drift-detection callouts). A cron line like:
17 4 * * * relay-shell /var/lib/relay-shell/venv/bin/relay-shell --verify-deploy --json > /var/log/relay-shell/drift.json
…lets a log shipper trip an alert when ok: false lands in the JSON.
11. Backup and restore¶
The relay's persistent state is small and lives in three directories. Back them up together; the relay itself is stateless beyond these:
| What | Where | Why it matters |
|---|---|---|
| OAuth state | ${RELAY_SHELL_AUTH_STATE_DIR}/{clients.json, codes.json, tokens.json} (default /var/lib/relay-shell/oauth/) |
Losing it logs every client out and re-runs single-client lockdown from scratch (a new client can register, the old one cannot). |
| systemd EnvironmentFile | /etc/relay-shell/relay-shell.env (plus /etc/relay-shell/relay-shell-edge.env for the edge) |
Source of truth for every RELAY_SHELL_* knob; recreating it from scratch is error-prone. |
| Audit log | ${RELAY_SHELL_AUDIT_PATH} (default /var/log/relay-shell/audit.jsonl) plus audit.jsonl.{1..N}.gz rotations |
The on-host copy is the local fallback when the off-host shipper has fallen behind. |
A simple recipe (adapt to your backup tool):
backup_dir=/root/relay-shell-backups
sudo install -d -m 700 "$backup_dir"
sudo install -m 600 /dev/null "$backup_dir/relay-shell-state-$(date +%F).tar.gz"
sudo tar -czf "$backup_dir/relay-shell-state-$(date +%F).tar.gz" \
/etc/relay-shell \
/var/lib/relay-shell/oauth \
/var/log/relay-shell
The install steps intentionally create a root-only destination
(directory 0700, archive 0600) before tar writes sensitive OAuth
state and environment material.
Restore the OAuth state with chmod 0700 oauth/ && chmod 0600 oauth/*.json
preserved (the file modes are part of the trust boundary — the relay
re-applies them on next write, but an attacker reading between restore
and first write should not see 0644 files). The audit log is append-only
on disk; restore it before the relay starts so chattr +a does not
race with a write into the unmoved file. The drift-detection CLI
(relay-shell --verify-deploy, §10) confirms the systemd unit + Caddyfile
+ logrotate config match the templates after a restore.
What is not in scope for backup:
- The Python venv under
/var/lib/relay-shell/venv/— recreate withpip install relay-shell(§2). - The Caddy data directory (
/var/lib/caddy/) — Caddy will re-issue the certificate from ACME on first start. Backing it up is only worth doing if you are rate-limited on the ACME directory. - The SSH known_hosts file under the service account's
~/.ssh/— reseed withssh_keyscan(§7) rather than restoring stale entries.
Emergency¶
- Disable fast:
sudo systemctl stop relay-shell(and revoke OAuth tokens by clearingtokens.json, or rotate the proxy CIDR allowlist to none). - Revoke SSH reach: remove/disable the service account's keys on targets.
- The audit log (off-host copy) is the post-incident record.