ADR 0018: FSM transition guards for STPA safety constraints

  • Status: Accepted
  • Date: 2026-05-21
  • Authors: rmednitzer
  • Builds on: ADR 0004, ADR 0009

Context

The hand-rolled FSM in src/nous/state/machine.py admits every transition listed in the explicit table. The STPA artefacts under docs/stpa/ specify several preconditions the FSM should enforce, but the v0.1 table left those constraints to the controller. In particular, SC-2 (no MISSION transition while thermal headroom is exhausted) and the UCAs for state_transition were unenforceable in code.

A controller bug, a stale estimate, or a malicious peer could therefore drive the simulator into MISSION on a hot device. That defeats the purpose of recording the constraint in docs/stpa/05-safety-constraints.md.

Decision

StateMachine.transition takes an optional context: Mapping[str, Any] and consults a per-transition guard. The guard returns (ok, reason); on ok=False the FSM raises GuardDenied with structured attributes and records the refusal in refusals() for audit. The guard set is explicit in _GUARDS at the bottom of machine.py so the safety preconditions live next to the transition table.
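The guard mechanism described above might look roughly like the following. This is a minimal sketch, not the contents of machine.py: the State members, the thermal_headroom_c context key, the GuardDenied attribute names, and the shape of the transition table are all assumptions made for illustration.

```python
from enum import Enum, auto
from typing import Any, Callable, Dict, List, Mapping, Optional, Tuple


class State(Enum):
    # Hypothetical subset of states; the real table lives in machine.py.
    IDLE = auto()
    MISSION = auto()


GuardResult = Tuple[bool, str]
Guard = Callable[[Mapping[str, Any]], GuardResult]


class GuardDenied(Exception):
    """Raised when a guard refuses a transition (attribute names assumed)."""

    def __init__(self, source: State, target: State, reason: str) -> None:
        super().__init__(f"{source.name} -> {target.name} denied: {reason}")
        self.source = source
        self.target = target
        self.reason = reason


def _guard_enter_mission(context: Mapping[str, Any]) -> GuardResult:
    # SC-2: no MISSION transition while thermal headroom is exhausted.
    # "thermal_headroom_c" is an assumed key name.
    headroom = context.get("thermal_headroom_c")
    if headroom is None:
        return False, "thermal_headroom_c missing"
    if headroom <= 0:
        return False, "thermal headroom exhausted (SC-2)"
    return True, ""


_TRANSITIONS = {(State.IDLE, State.MISSION), (State.MISSION, State.IDLE)}


class StateMachine:
    def __init__(self, initial: State = State.IDLE) -> None:
        self.state = initial
        self._refusals: List[GuardDenied] = []

    def transition(
        self, target: State, context: Optional[Mapping[str, Any]] = None
    ) -> None:
        key = (self.state, target)
        if key not in _TRANSITIONS:
            raise ValueError(f"no transition {self.state.name} -> {target.name}")
        guard = _GUARDS.get(key)
        if guard is not None:
            ok, reason = guard(context if context is not None else {})
            if not ok:
                denied = GuardDenied(self.state, target, reason)
                self._refusals.append(denied)  # kept for audit
                raise denied
        self.state = target

    def refusals(self) -> Tuple[GuardDenied, ...]:
        return tuple(self._refusals)


# Guard set kept explicit and next to the transition table.
_GUARDS: Dict[Tuple[State, State], Guard] = {
    (State.IDLE, State.MISSION): _guard_enter_mission,
}
```

In this sketch a refused transition leaves the state unchanged, so a controller can catch GuardDenied, inspect reason, and retry once the context improves.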

The engine surfaces a default safety context (thermal headroom, SoC critical threshold) through Engine.request_transition, which merges caller-supplied context with the engine-derived defaults. A controller that calls the raw FSM keeps full control; the engine helper is for the common case.
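One way to picture the merge step is the sketch below. The function name merge_safety_context and the keys thermal_headroom_c and soc_critical are hypothetical, and the precedence shown (caller-supplied keys override engine-derived defaults) is an assumption; the ADR does not state which side wins on a key collision.

```python
from typing import Any, Dict, Mapping, Optional


def merge_safety_context(
    engine_defaults: Mapping[str, Any],
    caller_context: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a new context dict without mutating either input.

    Assumed precedence: caller-supplied keys override engine defaults.
    """
    merged = dict(engine_defaults)
    merged.update(caller_context or {})
    return merged


# Engine-derived defaults (illustrative values only).
defaults = {"thermal_headroom_c": 4.5, "soc_critical": 0.15}

# The caller adds its own key and leaves the safety keys to the engine.
merged = merge_safety_context(defaults, {"operator": "test-rig"})
```

If the engine-derived values should instead be authoritative, swapping the order of the two updates would give that behaviour; which precedence is right depends on how much the caller is trusted.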

A guard that lacks the context it needs refuses the transition (fails closed). Without that context the device's condition is unobservable, so the FSM cannot assume it is safe.
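The fail-closed rule can be factored into a small wrapper so that every guard treats missing context the same way. The following is a sketch under assumptions: fail_closed, the soc and soc_critical keys, and the idea of guarding a transition on state of charge are illustrative names, not taken from machine.py.

```python
from typing import Any, Callable, Mapping, Sequence, Tuple

GuardResult = Tuple[bool, str]
Guard = Callable[[Mapping[str, Any]], GuardResult]


def fail_closed(required: Sequence[str], check: Guard) -> Guard:
    """Wrap a predicate so that any missing (or None) key refuses the transition."""

    def guard(context: Mapping[str, Any]) -> GuardResult:
        missing = [key for key in required if context.get(key) is None]
        if missing:
            # Missing context is treated the same as an unsafe one.
            return False, "missing context: " + ", ".join(missing)
        return check(context)

    return guard


def _soc_above_critical(context: Mapping[str, Any]) -> GuardResult:
    # Only called once both keys are known to be present.
    if context["soc"] <= context["soc_critical"]:
        return False, "state of charge at or below critical threshold"
    return True, ""


# Hypothetical guard built from the wrapper.
guard_soc = fail_closed(("soc", "soc_critical"), _soc_above_critical)
```

Keeping the presence check in one place means a new guard cannot accidentally default a missing value to "safe".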

Consequences

Easier: SC-2 is enforced in code, not just in prose. A guard refusal is an observable signal: the guard returns (ok=False, reason=...), the controller receives GuardDenied carrying that reason, and the audit trail records the refused transition. The same guard pattern extends to the cool and recover triggers, both of which had their own UCAs.

Harder: a controller that does not surface a safety context cannot transition to MISSION at all. The default-deny stance is intentional (a missing context is treated the same as an unsafe one), but it means a controller that stops supplying context leaves the simulator unable to enter MISSION. The integration test in tests/integration/test_concurrent_anomalies.py documents the contract a controller must meet.

Alternatives rejected:

  • A separate "policy" layer on top of the FSM. Splits the constraint across two files and lets the FSM ship with a bypass.
  • Guards as decorators on the transition table. Harder to type-check with mypy; harder to enumerate the guarded transitions in one screen.

Revisit triggers

  • A new UCA needs a context that does not fit Mapping[str, Any].
  • The guard set exceeds ~15 entries and the per-transition lookup is no longer the cheapest abstraction.
  • An external safety analyser (Polyspace, Astrée) needs to round-trip the guard predicates as constraints.