ADR 0004: Automated TLS at the edge¶
- Status: Accepted
- Date: 2026-05-20
Context¶
The HTTP transport binds 127.0.0.1 by design (ADR 0001, docs/architecture.md).
A reverse proxy in front terminates TLS and enforces the CIDR allowlist,
security headers, and access logging. Until now the deployment guide pointed
at a reference Caddyfile that operators had to hand-edit for their domain,
email, and CIDRs; the install path stopped at the Python service. That left
two recurring frictions for production deployers:
- Bootstrapping the TLS edge correctly (CA contact, ACME directory, loopback upstream, security headers, log paths) was a manual checklist every time, with no validation step.
- There was no documented automated-renewal story. Operators familiar with certbot tend to reach for a cron + reload pattern that adds moving parts we do not need.
Decision¶
Adopt Caddy's built-in ACME client as the supported automated-TLS path, and ship it as a turnkey installer:
deploy/Caddyfileis now parameterized via Caddy's native{$VAR:default}env substitution:RELAY_SHELL_EDGE_DOMAIN,RELAY_SHELL_EDGE_ACME_EMAIL,RELAY_SHELL_EDGE_CLIENT_CIDRS,RELAY_SHELL_EDGE_UPSTREAM, andRELAY_SHELL_EDGE_ACME_CA. The same file works for every deployment; the config is the environment, not a fork of the Caddyfile.deploy/install-edge.shinstalls Caddy from the official Cloudsmith apt repository when absent, writes a systemd drop-in (/etc/systemd/system/caddy.service.d/relay-shell-edge.conf) carrying those variables, validates the Caddyfile, and starts the service. It is idempotent and re-running it picks up edits to/etc/relay-shell/relay-shell.env.- Let's Encrypt is the default ACME CA;
RELAY_SHELL_EDGE_ACME_CAswitches to the LE staging directory for dry runs, or to ZeroSSL / an internal CA if the deployment requires it. Caddy's existing ZeroSSL fallback on Let's Encrypt outage remains in effect. - Renewals are driven by Caddy's internal scheduler (well before the 30-day threshold), not by cron. Certificates and account keys persist under Caddy's data directory across restarts.
Consequences¶
- The supported edge becomes "install Caddy, set three env vars, run the script". Bootstrapping a new deployment no longer requires reading the Caddyfile line by line.
- No new code paths inside the Python service: TLS termination stays at the edge, the server still binds loopback, and the audit/policy layer is unchanged. The capability and security posture documented in ADR 0002 and ADR 0003 are preserved verbatim.
- Operators using a non-Caddy edge (nginx, HAProxy, an internal load balancer) are not constrained. The Caddyfile is the reference and the automated path; alternative proxies remain supported, with the operator responsible for their own ACME integration.
- The default CIDR allowlist in the rendered Caddyfile is loopback only. The installer warns when it is unchanged, so a misconfigured deployment fails closed (clients get a 403) rather than silently exposing the tool surface to the internet.
Rejected¶
- certbot + cron + reload: more moving parts (separate package, hook scripts, a reload race during renewal) for a problem Caddy solves in process. Renewal failures would surface only at expiry, not at the next request.
- Native TLS in the Python server: would require key/cert lifecycle code inside the service, contradicts the loopback-bind architecture, and duplicates capability the edge proxy already owns. The proxy also handles CIDR allowlisting and security headers, which would otherwise migrate into the application.
- Manual cert installation: workable for a one-off deploy but loses the automated-renewal property the task requires.
Operational notes¶
The day-to-day operator-facing surface for the edge boils down to two
commands. They live here rather than in deployment.md so the
troubleshooting context stays next to the decision they exercise.
- Cert issuance / renewal troubleshooting. Caddy logs ACME activity to its systemd journal stream. The most useful first look:
sudo journalctl -u caddy -n 200 --no-pager
sudo journalctl -u caddy -f # follow during an issuance attempt
ACME failures usually show as obtain: ... or solve: ... errors;
the message names the failing challenge (HTTP-01) and the upstream
reason (rate limit, DNS, port 80 unreachable). Switch to the LE
staging directory via RELAY_SHELL_EDGE_ACME_CA (see
deployment.md §4a) while reproducing — production rate limits are
unforgiving.
- Caddyfile drift / validation. A misconfigured Caddyfile fails closed at service-start. Validate before pushing:
This is what install-edge.sh runs internally before swapping the
file; running it standalone is the right move when the install
finished but a downstream config change (a new CIDR, a header
edit) needs a one-off check. Pair with relay-shell --verify-deploy
(see deployment.md §10) for a byte-for-byte comparison against
the shipped template — useful when "did someone hand-edit this?"
is the operative question.