ADR-0031: Opt-in client-side-only talosctl pre-upgrade health probe (2026-06-14)
Status
Accepted
Date
2026-06-14
Authors
praxis maintainers (implements BL-102, raised for operator decision in ADR-0021)
Context
A real-run talosctl upgrade (and upgrade-k8s) is gated on a HARD pre-flight
talosctl health precondition (BL-023, SEC-5): if the cluster is not healthy, the
upgrade is refused as an audited HARD violation. The probe is built in
src/praxis/actuation/talosctl.py::_health_ok and runs talosctl --nodes ...
--endpoints ... health.
talosctl health defaults to a server-side check (--server=true): the Talos API server
runs the health assertions across the cluster. On a freshly bootstrapped or
single-node cluster, or one whose discovery service is not fully converged, those
server-side checks can fail for reasons unrelated to upgrade readiness, so the HARD
gate can spuriously block a legitimate upgrade. talosctl health --server=false runs
only the client-side checks, which avoids that class of false negative at the cost of
a less comprehensive check.
BL-102 (from the ADR-0021 audit) flagged this as a change to a HARD safety precondition (SEC-5) and explicitly deferred it to an operator decision rather than changing it unilaterally. The operator has now chosen to add the capability.
Decision
- Add an additive, opt-in health_client_side_only boolean param (default False). When truthy, _health_ok appends --server=false to the health argv, so only the client-side checks run. The default is unchanged: the full server-side check.
- Never weaken the default and never skip the gate. The health precondition stays HARD and always runs for a real-run upgrade; the flag narrows its scope, it does not remove it. There is no value of the flag that bypasses the health check.
- Fail closed on a malformed flag. The value is coerced by _as_health_flag inside the health predicate's test, so a non-boolean is a HARD audited refusal (a throwing predicate is a HARD failure), never a silent relaxation of the gate.
- Keep it audited and operator-driven. The flag is a structured param on the run_action tool (health_client_side_only), so it is part of the request args written to the tamper-evident audit record and shown in the DRY_RUN preview the operator approves out-of-band. An autonomous caller cannot use it to dodge the gate: the gate still runs (client-side), and the upgrade still needs the minted T2/T3 approval.
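The fail-closed rule can be illustrated with a minimal sketch. Only the names _as_health_flag and health_client_side_only come from this ADR. The function bodies, evaluate_hard_precondition, make_health_test, and the run_health callback are hypothetical stand-ins, not the praxis implementation. The sketch assumes the strictest reading of "non-boolean": only a real bool is accepted.

```python
from typing import Callable


def _as_health_flag(value: object) -> bool:
    """Strict coercion (assumed behaviour): only a real bool is accepted."""
    if isinstance(value, bool):
        return value
    raise TypeError(
        f"health_client_side_only must be a bool, got {type(value).__name__}"
    )


def evaluate_hard_precondition(test: Callable[[], bool]) -> tuple[bool, str]:
    """Hypothetical gate runner: a predicate that raises is a HARD failure."""
    try:
        ok = test()
    except Exception as exc:  # fail closed: an error is never "healthy"
        return False, f"HARD violation: predicate raised {type(exc).__name__}: {exc}"
    if ok:
        return True, "ok"
    return False, "HARD violation: cluster not healthy"


def make_health_test(
    params: dict, run_health: Callable[[bool], bool]
) -> Callable[[], bool]:
    """Coerce the flag inside the predicate so a bad value fails the gate."""

    def test() -> bool:
        client_side_only = _as_health_flag(
            params.get("health_client_side_only", False)
        )
        # run_health is a hypothetical callback that performs the probe.
        return run_health(client_side_only)

    return test
```

Because coercion happens inside the predicate, a malformed value such as the string "yes" never reaches the probe: the predicate raises, and the gate reports a HARD violation instead of falling back to either check mode.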
Consequences
Positive: an operator can complete a legitimate upgrade on a post-bootstrap cluster whose server-side health checks spuriously fail, without disabling the safety gate. The capability is additive, opt-in, fail-closed, and audited; the default posture is unchanged.
Negative: a client-side-only check is less comprehensive than the server-side check, so when the operator opts in they accept a weaker (but non-empty) pre-upgrade assurance. This is a deliberate, audited operator choice, visible in the DRY_RUN and the audit trail, not a default.
Neutral: the flag maps directly to the talosctl --server flag, so its semantics
track the tool. It is set as a single --server=false token (not a separated value),
consistent with the adapter's structured-argv, no-free-form-options posture (BL-082).
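A sketch of the argv shape, under stated assumptions: build_health_argv and its parameters are hypothetical names, and the real _health_ok may assemble the command differently. The single-token form matters because talosctl's flag parser (pflag, as far as we know) reads a boolean value only when it is attached with "=", so a separated "--server false" would leave "false" as a stray positional argument.

```python
from typing import Sequence


def build_health_argv(
    nodes: Sequence[str],
    endpoints: Sequence[str],
    client_side_only: bool = False,
) -> list[str]:
    """Build a structured argv for the pre-upgrade health probe (illustrative)."""
    argv = [
        "talosctl",
        "--nodes", ",".join(nodes),
        "--endpoints", ",".join(endpoints),
        "health",
    ]
    if client_side_only:
        # One token, never ["--server", "false"].
        argv.append("--server=false")
    return argv


if __name__ == "__main__":
    # Default: server-side check, no extra token.
    print(build_health_argv(["10.0.0.2"], ["10.0.0.2"]))
    # Opt-in: client-side-only check.
    print(build_health_argv(["10.0.0.2"], ["10.0.0.2"], client_side_only=True))
```

With the flag off the argv is unchanged from the existing probe, which is what keeps the capability additive.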
Alternatives considered and rejected
- Always pass --server=false. Rejected: it weakens the default health gate for every upgrade to avoid an occasional false negative, contrary to the never-weaken-a-default rule.
- Make the health precondition SOFT (a warning) when server-side fails. Rejected: it silently downgrades a HARD safety control, and an autonomous caller could proceed past an unhealthy cluster.
- Leave BL-023 unchanged and require the operator to run the upgrade outside praxis when server-side health is spuriously red. Rejected: it pushes a privileged action off the audited path, losing the audit record and the approval gate.
Revisit triggers
- talosctl changes the --server flag semantics or default, or splits client/server health into a different interface.
- A SOFT-with-explicit-acknowledgement design is wanted (record the spurious server-side failure as a finding, then allow client-side-only) once a structured finding surface exists for pre-flight checks.