ADR 0005: Codebase validation against known-good sources¶
- Status: Accepted
- Date: 2026-05-24
Context¶
ADR 0002 makes the trust boundary the service account, not an internal
sandbox; ADRs 0003 and 0004 add the compensating controls (tiered authority,
audited execution, edge TLS). All three rest on a contract with upstream
sources: the mcp SDK (FastMCP, the OAuth provider interface, the resource
+ tool decorators), asyncssh (the SSH semantics we expose), and
pydantic-settings (the configuration layer). If any of those drifts
silently — a renamed kwarg, a removed method, a relaxed default — the
documented guarantees stop being load-bearing.
The runbook §2 codifies the audit pass, but it had not been executed end-to-end against the actual upstream surfaces on the pinned versions since the original release. This ADR records the methodology, the result, and the small drift the pass surfaced.
Decision¶
Treat a periodic, repeatable validation pass against the upstream APIs as a first-class operating procedure, owned by the runbook §2 (Audit) and referenced from this ADR. The pass produces, in this order:
- Code index.
git ls-files, line counts for every package and test file, the registered tool set (fromFastMCP.list_tools()), and the registered resource URIs. - Quality gates.
ruff check,ruff format --check,mypy --strict,pytest -q, andcoverage reportwith subprocess collection wired (runbook §4.3) so the stdio e2e contributes. - Upstream surface validation. Import every dependency the code touches and assert that the symbols and signatures we rely on still exist on the pinned version:
mcp.server.fastmcp.FastMCP.__init__accepts the kwargsbuild_serverpasses (instructions,host,port,stateless_http,json_response,auth,auth_server_provider).mcp.server.fastmcp.Contextexposesrequest_idandclient_id(the audit pipeline reads both best-effort).mcp.server.auth.provider.OAuthAuthorizationServerProvideradvertises the nine methodsFileOAuthProviderimplements (authorize,exchange_authorization_code,exchange_refresh_token,get_client,load_access_token,load_authorization_code,load_refresh_token,register_client,revoke_token).mcp.server.auth.provider.AuthorizationParamsexposes the fields the provider reads (code_challenge,redirect_uri,redirect_uri_provided_explicitly,scopes,state,resource).mcp.shared.auth.OAuthClientInformationFull/mcp.shared.auth.OAuthTokencarry the fields the provider sets.asyncssh.connectacceptsknown_hosts,username,port,client_keys,connect_timeout,config,tunnel,keepalive_interval— the surface the pool relies on.- Behavior validation. Drive at least one real tool through
build_server(...).call_tool(...), then inspect the resulting audit record on disk and confirm: - the record schema (
ts,tool,tier,denied,args,output_sha256,output_len,exit_code) is intact; - the output body is not present in the record (only its hash);
- a sample of tier classifications matches the table in ADR 0003;
- canonical redaction inputs (Bearer, AWS SigV4, MySQL
-p, URL creds, PEM block, GitHub token,--password "two words") all produce the documented redacted output andssh -p22is not redacted (the well-known under-scrub regression). - Findings: any code/doc mismatch surfaced by steps 1-4 becomes a line item with a severity (P0 release-blocker through P3 nice-to-fix) and a resolution plan that lands in the same PR.
Validation outcome (2026-05-24)¶
All four steps passed without code-level regressions:
- 21 MCP tools registered, matching
tests/test_server.py::_EXPECTEDanddocs/tools.md. - 3 MCP resources registered, matching
docs/tools.md§Resources. ruff check,ruff format --check,mypy --strictclean.pytest -q— 195 passed, 13 deselected (thefuzzmarker is nightly- only by design; seepyproject.toml).coverage— 89% with subprocess collection enabled (floor is 85%).- Every upstream symbol in step 3 resolved on the pinned versions.
- The audit record schema and the redaction / tier samples in step 4 match the documented behavior verbatim.
Three documentation-drift findings were surfaced and fixed in the same PR that lands this ADR, so they do not survive past 0005:
| ID | Severity | Subject | Resolution |
|---|---|---|---|
| F-001 | P2 | requirements.txt pinned starlette==1.0.0 / PyJWT==2.12.1 / ruff==0.15.13 while pip install -e ".[dev]" resolved 1.1.0 / 2.13.0 / 0.15.14 — the file header claimed "validated build set" but the pins did not match reality. |
Pins refreshed to the actually-tested set; header reworded to "validated against the current development matrix". |
| F-002 | P2 | docs/runbook.md §4.3 still said "CI floor: 75%, current ~78%". The floor moved to 85% in B-022 and measured coverage is 89%. |
Header and body of §4.3 updated to "CI floor: 85%, current ~89%". |
| F-003 | P3 | docs/runbook.md §3.4 referenced the obsolete tool-count assertion len(names) == 18. Actual: tests/test_server.py:36 asserts 21. |
Reference updated to 21. |
No security findings. No capability regressions. The trust boundary described in ADR 0002 and the admission semantics described in ADR 0003 still hold byte-for-byte against the current code.
Validation outcome (2026-05-31)¶
Re-ran steps 1-4 against the same pinned surfaces. The gates and the upstream contract are still green:
- 21 MCP tools registered, matching
tests/test_server.py::_EXPECTEDanddocs/tools.md; 3 MCP resources registered. ruff check,ruff format --check,mypy --strictclean.pytest -q— 250 passed, 13 deselected (up from 195/244 as the redaction-coverage tests below landed; thefuzzmarker is nightly-only by design).pytest -m fuzz— 13 property invariants pass, includingredactidempotency on the new shapes.coverage— 92% with subprocess collection (floor 90%);patterns.pyandredaction.pyat 100%.- Every upstream symbol in step 3 still resolves on
mcp==1.27.1/asyncssh2.23.0 (FastMCP kwargs,Contextids, the nine OAuth provider methods,AuthorizationParamsfields, the eightasyncssh.connectoption kwargs,OAuthTokenfields).
Step 4 surfaced one security finding against the redaction sample set, fixed in the same PR:
| ID | Severity | Subject | Resolution |
|---|---|---|---|
| F-004 | P2 | The whole-match REDACTION_PATTERNS set covered GitHub / OpenAI(sk-) / AWS(AKIA) / Slack(xox*) but missed several of the most common structurally-anchored secret shapes when they arrive bare in an audited argument (a JSON body, a log line, a flag the CLI-flag prefix list does not name): Google API key (AIza), Google OAuth token (ya29.), Stripe sk_/rk_ keys, GitLab glpat-, npm npm_, PyPI pypi-, and JWTs. The OpenAI sk- shape also missed the sk-proj-/sk-svcacct-/sk-admin- prefixes (an internal hyphen broke the run). Secret leakage into the audit log is in SECURITY.md §Scope. |
Added the seven shapes and widened the OpenAI prefix in src/relay_shell/patterns.py; anchors and length floors track the canonical secret-scanning rulesets (gitleaks / GitHub secret scanning). PATTERNS_VERSION bumped "3" → "4". Paired over-scrub / under-scrub tests in tests/test_patterns.py (lines under test_openai_project_and_service_keys through test_registry_and_jwt_shapes_positive_and_negative) and a bare-in-args scenario in tests/test_redaction.py. redaction.py docstring and SECURITY.md redaction bullet updated. |
PR review (Copilot + Codex) further hardened the F-004 fix before merge:
the provider-token bodies were made to run unbounded from their length
floor and to admit each token's full alphabet (Google OAuth ya29. dots,
OpenAI sk-proj-/sk-svcacct-/sk-admin- URL-safe separators), and the
JWT rule's payload/signature floor was lowered so a compact JWT with a
small claim set is still redacted. The intent in every case is the same:
a match must collapse the whole token, never leave a tail, and never
miss a valid-but-compact credential. Regression fixtures pinning each
edge live in tests/test_patterns.py.
No capability regressions; no change to policy admission, the audit record schema, or any tool's response shape. The trust boundary (ADR 0002) and tier semantics (ADR 0003) are unchanged — this pass hardened a compensating control, it did not move the boundary.
Validation outcome (2026-06-01)¶
Re-ran steps 1-4. This pass also landed ADR 0007
(tamper-evident audit hash chain) in the same PR, so the behavior
validation in step 4 was extended to cover the new chain surface. The
broader engagement is recorded in
audit/2026-06-01-engagement.md.
- 21 MCP tools and 3 resources registered, matching
tests/test_server.py::_EXPECTEDanddocs/tools.md. ADR 0007 adds a CLI verb (--verify-audit), not a tool, so the tool contract is unchanged. ruff check,ruff format --check,mypy --strictclean.pytest -q— 277 passed, 13 deselected (up from 250; +27 tests for the chain emit/resume, theverify_chaintamper / head-truncation / tail-truncation cases, the config cross-field validator, and the fail-closed--verify-auditCLI incl.--segment).pytest -m fuzz— 13 invariants pass.coverage— 92% with subprocess collection (floor 90%);config.py99%,audit.py95%,patterns.py/redaction.py/policy.py100%.- Every upstream symbol in step 3 still resolves on
mcp==1.27.1/asyncssh2.23.0;pydanticmodel_validator(mode="after")(the new cross-fieldaudit_chain→jsonlguard) resolves on the pinnedpydantic>=2.11. - Step 4 audit-record schema: the default-off record is byte-identical to
the 2026-05-31 pass; with
RELAY_SHELL_AUDIT_CHAIN=trueit gains exactlyseq/prev/chainandverify_chainconfirms a clean chain and flags edit / forgery / deletion / reorder / garbage-line / non-genesis-anchor.
| ID | Severity | Subject | Resolution |
|---|---|---|---|
| F-005 | P3 | docs/runbook.md §5.1 still listed C-005 (Inventory field naming) as an open consolidation candidate. It was closed in PR #57 (Inventory(ssh_config=...) ctor + ssh_config_file resolved-iff-exists property), as recorded in audit/2026-05-27-engagement.md §2.2/§3 and CHANGELOG.md. The code carries no ssh_config_path; the runbook entry was the last stale reference. |
Moved C-005 to the §5.1 "Closed (do not re-add)" block with the resolution note. |
| G-1 | (gap) | The audit log — ADR 0002's first compensating control — had no in-record tamper-evidence; integrity rested solely on chattr +a + off-host shipping, both defeatable by the documented residual-risk attacker in the pre-ship window. Not a regression against the docs (the docs never claimed in-record integrity), recorded as a hardening gap. |
Closed by ADR 0007: opt-in, additive per-record hash chain + relay-shell --verify-audit. Default-off keeps every existing deployment byte-identical. |
No security regressions. No capability regressions. The trust boundary (ADR 0002) and tier semantics (ADR 0003) are unchanged. The audit-record schema change is additive and opt-in (default off), so off-host parsers built against the prior shape keep working.
Validation outcome (2026-06-12)¶
Re-ran steps 1-4 as part of a full audit / validation / hardening pass
(evidence pack under audit/00-inventory.md, audit/01-baseline.md,
audit/02-security-findings.md, audit/03-final-report.md). This pass also
reconciled the mcp pin drift surfaced as D-001 below. All gates green; the
upstream contract holds on the bumped pin.
- 21 MCP tools registered, equal to
tests/test_server.py::_EXPECTEDanddocs/tools.md; 3 MCP resources; 1 prompt (operating_guide, ADR 0008). ruff check,ruff format --check,mypy --strictclean (19 source files).pytest— 339 passed, 13 deselected (thefuzzmarker is nightly-only by design);pytest -m fuzz— 13 property invariants pass. Runtime ~36 s.coverage— 93% with subprocess collection (floor 90%);patterns.py/redaction.py/policy.py/metrics.py100%,audit.py94%,server.py95%,seccomp.py97%.- Step 3 upstream surface now resolves on
mcp==1.27.2(the pin moved 1.27.1 → 1.27.2 in PR #66, 2026-06-04) /asyncssh2.23.1: FastMCP kwargs,Contextids, the nine OAuth provider methods,AuthorizationParams/OAuthTokenfields, and theasyncssh.connectoption kwargs all resolve unchanged;pydanticmodel_validatorresolves on the pinnedpydantic>=2.11. - Step 4 behavior: audit-record schema intact (
ts, tool, tier, denied, args, output_sha256, output_len, exit_code); a sentinel present only in command output does not reach the audit log;--verify-auditconfirms a freshly chained log (anchored=True); tier classification and the redaction sample set match ADR 0003 /SECURITY.md. External scanners (semgrep, bandit, pip-audit, trivy) reported zero findings.
| ID | Severity | Subject | Resolution |
|---|---|---|---|
| D-001 | low (docs) | The mcp pin moved 1.27.1 → 1.27.2 in PR #66 (2026-06-04), but the README status line (README.md:6) + compatibility matrix (README.md:151) and docs/architecture.md:15 still named 1.27.1, and no ADR 0001 Consequences entry / ADR 0005 outcome had recorded the bump (the runbook §8.9 trigger was missed). |
Synced the living docs to 1.27.2 + asyncssh 2.23.1 (this pass); added the ADR 0001 Consequences pin-movement line; recorded this outcome paragraph. Frozen ADR bodies and prior dated outcomes left intact. |
No security regressions. No capability regressions. The trust boundary (ADR 0002) and tier semantics (ADR 0003) are unchanged; this pass corrected documentation drift and did not touch executor, policy, redaction, or audit behavior.
Validation outcome (2026-06-21)¶
Re-ran steps 1-4 on the pinned upstream surfaces. No dependency change since the 2026-06-12 pass; this is a documentation-drift reconciliation pass whose single finding (DOC-1) is a stale cross-reference, fixed in the same PR that records this outcome. All gates green; the upstream contract holds.
- 21 MCP tools registered, equal to
tests/test_server.py::_EXPECTEDanddocs/tools.md; 3 MCP resources (2 static + theinventory/{host}template); 1 prompt (operating_guide, ADR 0008). ruff check,ruff format --check,mypy --strictclean (19 source files).pytest— 342 passed, 13 deselected (thefuzzmarker is nightly-only by design);pytest -m fuzz— 13 property invariants pass.coverage— 93% with subprocess collection (floor 90%);patterns.py/redaction.py/policy.py/metrics.py100%,config.py99%,seccomp.py97%,server.py/sshpool.py95%,audit.py94%.- Step 3 upstream surface resolves on
mcp==1.27.2/asyncssh2.23.1:FastMCP.__init__kwargs,Contextids, the nine OAuth provider methods,AuthorizationParams/OAuthTokenfields, and theasyncssh.connectoption kwargs all resolve unchanged;pydanticmodel_validatorresolves on the pinnedpydantic>=2.11. - Step 4 behavior: a real
shell_execcall writes an audit record carrying the intact schema (ts, tool, tier, denied, args, output_sha256, output_len, exit_code) with the command output referenced by SHA-256 + length only — no raw output-body field on the record. Tier classification and the redaction sample set are exercised green by the suite.
| ID | Severity | Subject | Resolution |
|---|---|---|---|
| DOC-1 | info (docs) | docs/runbook.md §8.18 described the ADR next-free-number marker as "currently 0008", but docs/adr/README.md (the authoritative marker) has read 0009 since ADR 0008 landed (2026-06-08); the runbook's maintenance note for the README was one number behind. |
Corrected the §8.18 parenthetical to 0009 to match docs/adr/README.md, and appended this pass to the ADR 0005 index subject. Frozen ADR bodies and prior dated outcomes were left intact — the "next free ... 0006" line in this ADR's own Consequences is the 2026-05-24 record, not a live marker. |
No security findings. No capability regressions. The trust boundary (ADR 0002) and tier semantics (ADR 0003) are unchanged; this pass corrected documentation drift and did not touch executor, policy, redaction, or audit behavior.
Validation outcome (2026-06-21, full audit pass)¶
A full validation + security audit, run the same day as the doc-drift pass above;
evidence pack:
audit/2026-06-21-engagement.md. Steps
1-4 were re-run on a clean venv at the pinned mcp==1.27.2 / asyncssh==2.23.1
(a semgrep-into-the-project-venv mcp downgrade to 1.23.3 was caught and the
venv rebuilt clean, with semgrep thereafter run isolated via uvx), plus an
external scanner battery and reviews of the trust boundary, the CI supply chain,
and structure-vs-spec conformance against trusted external sources.
- Gates green:
ruff/ruff format/mypy --strictclean (19 files);pytest342 passed, 13 deselected;pytest -m fuzz13;coverage93% (floor 90%). 21 tools / 3 resources / 1 prompt; upstream surface resolves; the audit-record output-hash-only invariant holds. - Scanner battery all clean: pip-audit (OSV/PyPA), trivy fs (vuln/secret/misconfig), bandit (0 medium+; 12 benign Low), semgrep (0 findings, 183 rules), actionlint, shellcheck, gitleaks. No pinned dependency carries a known CVE at its pinned version.
| ID | Severity | Subject | Resolution |
|---|---|---|---|
| SEC-3 | P2 | pyproject.toml lower bounds sat below the patched minimum for asyncssh (GHSA-g794-3fmp-753h), starlette (BadHost GHSA-86qp-5c8j-p5mr / GHSA-jp82-jpqv-5vv3), PyJWT (HMAC confusion GHSA-xgmm-8j9v-c9wx), and cryptography (GHSA-537c-gmf6-5ccf) — a cold pip install resolver could pick a vulnerable transitive version, though the pinned set was already safe. |
Floors raised to asyncssh>=2.23.0, starlette>=1.3.0, PyJWT>=2.13.0, cryptography>=48.0.1 (mirrors PR #97's pydantic-settings floor bump). Installed/tested set unchanged; gate green. |
| TOOL-4 | info | .github/CODEOWNERS required review on a non-existent /.github/dependabot.yml; the repo uses Renovate. |
Reference corrected to /renovate.json5. |
The remaining findings are incremental hardening / format-conformance with no
P0/P1 (SEC-4 Anthropic sk-ant- / HuggingFace hf_ redaction shapes is the
recommended next item); they are registered in the engagement pack §7 and
deferred to BACKLOG.md / runbook §7. No security regression; the ADR 0002
trust boundary and ADR 0003 tier semantics are unchanged.
Consequences¶
- The runbook §2 audit pass is now grounded in a concrete, repeatable procedure rather than a free-form review. Reproducing the pass is a matter of re-running steps 1-4; deviations become a finding row in this table format.
- Each
mcpSDK bump triggers a fresh §2 pass (the step-3 symbol set is the diff target). The latest validated pin is the one named in the most recent dated outcome paragraph above (currentlymcp==1.27.2, validated 2026-06-21), not a hardcoded version here — so this bullet does not drift as the pin moves. - Each subsequent ADR landing should record its own validation outcome using the same format — terse findings table, severity, resolution — so the audit trail of decisions and the audit trail of executions stay symmetrical.
- The next free ADR number is 0006.
Rejected alternatives¶
- Annual-only audit. A yearly pass surfaces drift too late; the SDK ships minor versions in weeks and the policy/redaction modules drift fastest. The pass belongs in the runbook (always available) and is triggered on event, not date.
- Continuous validation via a check-everything CI job. Some of the
step-3 introspection (importing the SDK, listing tool registration) is
already covered by
tests/test_server.pyandtests/test_oauth.py. Duplicating those into a separate "validation" job would double the surface and the upkeep without buying anythingpytestdoes not already give us. Coverage subprocess collection in CI catches the remaining cases (the e2e contributes the wrapper bodies). - Auto-generated findings file. A standalone
docs/audit/2026-05-24.mdwould split the audit trail across files. Recording each pass inline in the closing ADR keeps the consequences and the resolution co-located.