In-depth systems review (2026-05-21)

Scope and method

This review inspects architecture, runtime behavior, safety controls, security posture, test coverage, and simulation realism. Findings are based on evidence in the repository's code and documentation, and each is paired with a prioritized recommendation.

Executive summary

The project has a strong control spine for a simulator at this maturity stage: explicit policy tiers, per-call audit records, typed modules, and documented boundaries. The largest gaps are incomplete simulation realism, shallow assurance evidence for regulated contexts, and the risk of mismatch between the documented architecture and the current implementation.

Top priority improvements:

Close architecture-to-implementation drift by wiring or demoting currently stubbed subsystems and tools.
Add deterministic replay and calibration harnesses to convert realism claims into measurable evidence.
Harden deployment and supply chain posture for regulated operations, especially around auto-deploy and cloud inference pathways.
Expand observability and negative-path testing for policy, audit degradation, and scenario reproducibility.

Prioritized findings

1) Correctness defects and behavior gaps

Architecture promises broader subsystem orchestration than the active engine loop.
docs/architecture.md describes tick orchestration across all subsystems and self-model refresh each tick.
src/nous/engine.py currently wires power and APU estimators only, with defaulted load and ambient values.
Risk: consumers over-trust device health and capability claims while major subsystem couplings are not active.
Improvement: introduce explicit capability_matrix and fidelity_level outputs in device_info and device_health so downstream logic can gate behavior by implemented fidelity.
Profile loading has permissive fallback behavior.
_load_profile returns a generic fallback dict if the profile file is missing or has an invalid shape.
Risk: silent simulation under unrealistic defaults can invalidate scenario outcomes.
Improvement: fail fast on missing profile in non-test mode, with an explicit strictness setting for CI/dev.
Thermal and other subsystem stubs can imply false precision.
src/nous/subsystems/thermal.py returns fixed values with additive time progression only.
Risk: capability decisions can appear physically grounded while being static placeholders.
Improvement: expose model_fidelity: stub|parametric|validated per subsystem in tool output.
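The strict profile loading and fidelity metadata proposed in this section could look roughly like the sketch below. It is illustrative only: load_profile, ProfileLoadError, fidelity_report, the required keys, and the fallback values are hypothetical names and numbers, and JSON is used only to stay within the standard library (the project's profiles appear to be YAML).

```python
import json
from pathlib import Path
from typing import Any

# Hypothetical fallback and schema; the real profile keys are not shown in the review.
FALLBACK_PROFILE: dict[str, Any] = {"name": "generic", "idle_power_w": 5.0}
REQUIRED_KEYS = {"name", "idle_power_w"}
FIDELITY_ORDER = ("stub", "parametric", "validated")  # lowest to highest


class ProfileLoadError(RuntimeError):
    """Raised in strict mode when a profile is missing or malformed."""


def load_profile(path: Path, strict: bool = True) -> tuple[dict[str, Any], bool]:
    """Return (profile, used_fallback). Strict mode never falls back."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
            raise ValueError("profile has an invalid shape")
        return data, False
    except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
        if strict:
            raise ProfileLoadError(f"cannot load profile {path}: {exc}") from exc
        return dict(FALLBACK_PROFILE), True


def fidelity_report(subsystems: dict[str, str], used_fallback: bool) -> dict[str, Any]:
    """Summarize per-subsystem fidelity; the overall level is the weakest one."""
    level = min(subsystems.values(), key=FIDELITY_ORDER.index, default="stub")
    return {
        "capability_matrix": dict(subsystems),
        "fidelity_level": level,
        "profile_fallback": used_fallback,
    }
```

Reporting the weakest subsystem as the overall fidelity_level is one possible design choice; it lets a downstream consumer refuse to act on health claims whenever any contributing subsystem is still a stub.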

2) Security and control-plane findings

Admission control is clear but coarse for high-impact tool families.
Tiering exists and runner enforcement is centralized, which is strong.
Risk: regex-based allow/deny can be brittle and hard to audit across changing tool names.
Improvement: add signed, versioned policy bundles with explicit tool IDs and optional argument constraints.
Auto-deploy-to-production cadence is operationally aggressive.
STATUS.md states production tracks main with frequent polling and auto-restart.
Risk: narrow rollback windows and accidental propagation of quality regressions.
Improvement: add staged promotion (main -> canary -> prod), with health gate checks and automatic rollback on failed post-deploy smoke checks.
Cloud inference path creates data sovereignty and lock-in exposure.
LIMITATIONS.md documents Anthropic cap behavior and cloud dependency path.
Risk: non-EU processing and external control surface dependency for mission flows.
Improvement: prioritize a local model execution path for the default mission mode, with cloud inference only as an explicitly enabled fallback gated by data classification.
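As a rough illustration of the policy-bundle recommendation above, the sketch below checks a versioned bundle keyed by explicit tool IDs with simple argument limits. Everything here is an assumption for illustration: the bundle layout, the tool IDs, and the max_args field are hypothetical, and HMAC-SHA256 with a shared key stands in for whatever signing scheme the project would adopt (an asymmetric signature would be the more likely choice in production).

```python
import hashlib
import hmac
import json
from typing import Any


def sign_bundle(bundle: dict[str, Any], key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the bundle."""
    payload = json.dumps(bundle, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_bundle(bundle: dict[str, Any], signature: str, key: bytes) -> bool:
    """Constant-time comparison against the expected signature."""
    return hmac.compare_digest(sign_bundle(bundle, key), signature)


def is_allowed(bundle: dict[str, Any], tool_id: str, args: dict[str, float]) -> bool:
    """Default-deny: unknown tool IDs and out-of-range arguments are rejected."""
    rule = bundle["tools"].get(tool_id)
    if rule is None or not rule.get("allow", False):
        return False
    for name, limit in rule.get("max_args", {}).items():
        if name in args and args[name] > limit:
            return False
    return True


# Hypothetical bundle; tool IDs and tiers are placeholders.
BUNDLE = {
    "version": 3,
    "tools": {
        "radio.set_power": {"tier": "T2", "allow": True, "max_args": {"dbm": 20}},
        "device.reset": {"tier": "T3", "allow": False},
    },
}
```

Compared with regex matching on tool names, an exact-ID lookup fails closed when a tool is renamed, and the signature makes any edit to the bundle detectable before it is loaded.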

3) Data loss, migration, and auditability risks

Audit durability and verification depth are not fully evidenced in tests.
Runner writes one record per call and hashes output body, which is a strong baseline.
Risk: degraded sink or partial write behavior can reduce forensic confidence.
Improvement: add crash-consistency tests for audit append behavior, and a periodic integrity verification job that checks line-level schema and hash consistency metadata.
Reset and irreversible tool pathways need stronger guardrails in docs and tests.
Irreversible classes exist in policy tiering.
Risk: accidental enablement under permissive mode during operator workflows.
Improvement: add a mandatory dual-control token for T3 operations and explicit replayable audit evidence in integration tests.
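A periodic integrity check of the kind recommended above might start from something like the following sketch. It assumes a JSON-lines audit file with one record per call; the field names (ts, tool, tier, output_sha256) are hypothetical, since the actual record schema is not shown in this review.

```python
import hashlib
import json

REQUIRED_FIELDS = ("ts", "tool", "tier", "output_sha256")  # hypothetical schema
HEX_CHARS = set("0123456789abcdef")


def make_record(ts: float, tool: str, tier: str, output: str) -> str:
    """Build one audit line, hashing the output body as the runner is said to do."""
    digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
    return json.dumps({"ts": ts, "tool": tool, "tier": tier, "output_sha256": digest})


def check_audit_lines(lines: list[str]) -> list[tuple[int, str]]:
    """Return (line_number, problem) pairs; an empty list means no findings."""
    problems: list[tuple[int, str]] = []
    for number, raw in enumerate(lines, start=1):
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            problems.append((number, "unparseable line (possible partial write)"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
            continue
        digest = record["output_sha256"]
        if not (isinstance(digest, str) and len(digest) == 64 and set(digest) <= HEX_CHARS):
            problems.append((number, "malformed output_sha256"))
    return problems
```

A crash-consistency test could then truncate the last line of a known-good file at arbitrary byte offsets and assert that the checker reports exactly that line.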

4) Concurrency and determinism

Deterministic replay is not first-class.
Engine uses time and tick progression but deterministic seed controls are not surfaced as a clear contract.
Risk: inability to reproduce a mission-state sequence exactly for incident review.
Improvement: add a deterministic seed, deterministic noise generator selection, and a paired scenario event-log export and replay command.
Tick-loop load coupling is currently simplified.
Power load is drawn from idle profile value rather than live compute/inference coupling.
Risk: underestimation of thermal-power interaction and endurance variability.
Improvement: implement cross-coupled compute -> thermal -> power derate feedback with a bounded-step solver.
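The replay contract described in this section can be sketched as a seed plus an event log that together fully determine a run. The toy loop below is not the project's engine: run_scenario, export_log, replay, and the load and noise values are hypothetical stand-ins chosen to show the shape of the contract.

```python
import json
import random


def run_scenario(seed: int, ticks: int, events: dict[int, float]) -> list[float]:
    """Toy tick loop. events maps a tick index to an injected load delta in watts."""
    rng = random.Random(seed)  # private generator, independent of global random state
    load_w = 5.0
    trace = []
    for tick in range(ticks):
        load_w += events.get(tick, 0.0)
        # Seeded Gaussian noise stands in for sensor or estimator noise.
        trace.append(round(load_w + rng.gauss(0.0, 0.1), 6))
    return trace


def export_log(seed: int, ticks: int, events: dict[int, float]) -> str:
    """Serialize everything needed to reproduce the run."""
    return json.dumps({"seed": seed, "ticks": ticks, "events": sorted(events.items())})


def replay(log: str) -> list[float]:
    """Re-run a scenario from an exported log."""
    data = json.loads(log)
    events = {int(tick): float(delta) for tick, delta in data["events"]}
    return run_scenario(data["seed"], data["ticks"], events)
```

The key design point is that the generator is constructed from the seed inside the run rather than shared globally, so unrelated code that draws random numbers cannot perturb the sequence.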

5) Error handling and observability

Error handling returns bounded strings, but structured machine-parsable errors are limited.
Runner catches exceptions and returns [error Class: message] style output.
Risk: controllers cannot reliably differentiate transient, policy, and invariant violations.
Improvement: standardize a JSON error envelope with code, class, retryable, and correlation_id fields.
Degraded-mode signaling can be expanded.
device_info surfaces audit.degraded which is useful.
Improvement: add degraded flags for estimator divergence, stale sensor paths, profile fallback, and policy regex compile failures.
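One possible shape for the structured error envelope is sketched below. The code values and the mapping from exception types to codes are hypothetical; only the four field names (code, class, retryable, correlation_id) come from the recommendation above.

```python
import json
import uuid
from typing import Optional

# Hypothetical mapping from exception type to (code, retryable).
ERROR_CODES = {
    PermissionError: ("POLICY_DENIED", False),
    TimeoutError: ("TRANSIENT_TIMEOUT", True),
    AssertionError: ("INVARIANT_VIOLATION", False),
}


def error_envelope(exc: BaseException, correlation_id: Optional[str] = None) -> str:
    """Convert an exception into a machine-parsable JSON envelope."""
    code, retryable = "INTERNAL_ERROR", False
    for exc_type, (mapped_code, mapped_retryable) in ERROR_CODES.items():
        if isinstance(exc, exc_type):
            code, retryable = mapped_code, mapped_retryable
            break
    envelope = {
        "error": {
            "code": code,
            "class": type(exc).__name__,
            "message": str(exc)[:200],  # keep the message bounded
            "retryable": retryable,
            "correlation_id": correlation_id or uuid.uuid4().hex,
        }
    }
    return json.dumps(envelope, sort_keys=True)
```

With an envelope like this, a controller can branch on retryable and code instead of parsing the bracketed string form.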

6) Test and verification posture

Coverage appears focused on selected unit and smoke paths.
There are unit, integration, and e2e folders, with limited evidence of scenario replay conformance checks.
Improvement: add property-based invariants for energy conservation, monotonicity constraints, and safe-state transitions under randomized injections.
Calibration against real benchmarks needs a stronger continuous process.
Documentation requires realistic values and BOM-first updates, which is excellent.
Improvement: add a calibration CI task that compares profile and model-card constants to docs/bom.md references and fails on undocumented drift.
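The invariants named above (energy conservation, monotonicity, safe bounds) can be exercised with randomized inputs even before a property-testing library is adopted. The sketch below uses the standard library's random module rather than a dedicated tool such as Hypothesis, and the discharge function is a toy battery model invented for illustration, not the project's power estimator.

```python
import random


def discharge(capacity_wh: float, loads_w: list[float], dt_s: float) -> tuple[list[float], float]:
    """Toy battery: drain by each load for dt_s seconds, never going below zero."""
    remaining = capacity_wh
    delivered = 0.0
    history = [remaining]
    for load in loads_w:
        step = min(remaining, load * dt_s / 3600.0)  # Wh drawn this tick, clamped
        remaining -= step
        delivered += step
        history.append(remaining)
    return history, delivered


def check_invariants(trials: int = 200, seed: int = 0) -> int:
    """Run randomized trials and assert the invariants; return the trial count."""
    rng = random.Random(seed)
    for _ in range(trials):
        capacity = rng.uniform(1.0, 100.0)
        loads = [rng.uniform(0.0, 50.0) for _ in range(rng.randint(1, 200))]
        history, delivered = discharge(capacity, loads, dt_s=1.0)
        # Monotonicity: stored energy never increases under discharge-only loads.
        assert all(later <= earlier for earlier, later in zip(history, history[1:]))
        # Safe bound: stored energy never goes negative.
        assert history[-1] >= 0.0
        # Conservation: remaining plus delivered equals the initial capacity.
        assert abs(history[-1] + delivered - capacity) < 1e-9
    return trials
```

A fixed seed keeps failures reproducible in CI; a property-testing library would add input shrinking on top of the same assertions.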

Feature roadmap to improve realism

A) Physics and subsystem realism

Add thermal RC network model with ambient and enclosure conduction, fan curve, and compute power coupling.
Add battery hysteresis and temperature-dependent internal resistance curve, plus aging state (cycle_count, soh_pct).
Add comms channel model with path loss exponent variants, shadowing, burst error model, and optional terrain plugin seam.
Add a physiology-lite operator biometrics model (hydration/strain coupling) before attempting a full biophysical model.
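As a starting point for the thermal item above, a single-node RC model is about the smallest step up from a fixed-value stub. It integrates C * dT/dt = P - (T - T_ambient) / R with explicit Euler steps. The function names and the parameter values (R = 2 degC/W, C = 50 J/degC) are illustrative assumptions, not values taken from the project's BOM; a fuller network would add an enclosure node and make R depend on a fan curve.

```python
def thermal_step(temp_c: float, power_w: float, ambient_c: float,
                 r_c_per_w: float, c_j_per_c: float, dt_s: float) -> float:
    """Advance one explicit Euler step of C * dT/dt = P - (T - T_ambient) / R."""
    heat_out_w = (temp_c - ambient_c) / r_c_per_w  # heat leaving through the thermal resistance
    return temp_c + dt_s * (power_w - heat_out_w) / c_j_per_c


def simulate(power_w: float, ambient_c: float = 25.0, r_c_per_w: float = 2.0,
             c_j_per_c: float = 50.0, dt_s: float = 1.0, steps: int = 2000) -> float:
    """Run from ambient at constant power and return the final temperature."""
    temp_c = ambient_c
    for _ in range(steps):
        temp_c = thermal_step(temp_c, power_w, ambient_c, r_c_per_w, c_j_per_c, dt_s)
    return temp_c
```

At constant power the model settles at T_ambient + P * R, which gives a closed-form check for tests. Explicit Euler is only stable here when dt_s is well below the time constant R * C (100 s with these values), which is one reason a bounded step size matters once compute load is coupled in.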

B) Mission realism and scenario expressiveness

Extend scenario DSL with uncertainty annotations and adversarial injectors.
Add mission phase templates (patrol, relay, stationary overwatch, extraction) with reusable load/comms envelopes.
Add fault campaign runner that sweeps injector matrices and outputs pass/fail against STPA-derived constraints.

C) Assurance for regulated infrastructure

Introduce signed artifacts for profile YAML and scenario bundles.
Add reproducible run manifest: code revision, profile hash, scenario hash, seed, policy mode, tool transcript hashes.
Add mandatory evidence package export for each run: key states, denied operations, degraded flags, and conformance report.
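The run manifest listed above could be assembled along the following lines. The field names mirror that list; the function names and the choice of SHA-256 over canonical JSON are assumptions made for this sketch.

```python
import hashlib
import json
from typing import Any


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(code_revision: str, profile_bytes: bytes, scenario_bytes: bytes,
                   seed: int, policy_mode: str, tool_outputs: list[str]) -> dict[str, Any]:
    """Collect the inputs that determine a run, as hashes where content is large."""
    return {
        "code_revision": code_revision,
        "profile_sha256": sha256_hex(profile_bytes),
        "scenario_sha256": sha256_hex(scenario_bytes),
        "seed": seed,
        "policy_mode": policy_mode,
        "tool_transcript_sha256": [sha256_hex(o.encode("utf-8")) for o in tool_outputs],
    }


def manifest_id(manifest: dict[str, Any]) -> str:
    """Stable identifier: hash of the manifest's canonical JSON encoding."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return sha256_hex(canonical.encode("utf-8"))
```

Two runs with the same manifest_id claim identical inputs, so any divergence in their outputs points at nondeterminism in the engine rather than at configuration drift.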

D) EU sovereignty oriented architecture options

Default to self-hosted model runners for sensitive modes.
Add configurable data localization controls for logs, metrics, and any external adapter output.
Provide a no-cloud operational profile that hard-disables cloud inference and external telemetry adapters.
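A minimal sketch of how a no-cloud profile and classification gating might combine in backend selection is shown below. The InferencePolicy fields, the classification labels, and select_backend are all hypothetical; the project's actual configuration surface is not described in this review.

```python
from dataclasses import dataclass

# Hypothetical classification labels, ordered from least to most sensitive.
CLASSIFICATION_RANK = {"public": 0, "internal": 1, "restricted": 2}


@dataclass(frozen=True)
class InferencePolicy:
    no_cloud: bool = True                       # hard-disable switch for cloud inference
    cloud_enabled: bool = False                 # explicit opt-in for the fallback path
    max_cloud_classification: str = "public"    # most sensitive data allowed off-device


def select_backend(policy: InferencePolicy, classification: str,
                   local_available: bool = True) -> str:
    """Prefer local inference; permit cloud only as a gated fallback."""
    if classification not in CLASSIFICATION_RANK:
        raise ValueError(f"unknown classification: {classification}")
    if local_available:
        return "local"
    cloud_permitted = (
        not policy.no_cloud
        and policy.cloud_enabled
        and CLASSIFICATION_RANK[classification]
        <= CLASSIFICATION_RANK[policy.max_cloud_classification]
    )
    if cloud_permitted:
        return "cloud"
    raise RuntimeError("no permitted inference backend for this request")
```

The defaults are deliberately the restrictive ones, so a deployment that never touches the policy stays local-only.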

Proposed implementation plan (minimal to high impact)

Immediate (small diffs, high confidence):
Add fidelity metadata fields to tool outputs.
Add strict profile-load option with explicit startup failure behavior.
Add structured runner error envelope.
Near term (moderate effort):
Deterministic replay framework with seed and event-log export.
Energy and thermal invariant property tests.
Audit integrity periodic checker and report tool.
Mid term (larger realism gains):
Thermal and compute coupling implementation.
Comms propagation and burst-noise model.
Scenario fault campaign executor with STPA requirement assertions.

Unknowns and minimal checks needed

Unknown: current CI branch protection and release promotion controls.
Minimal check: inspect CI workflow files and deployment gate scripts.
Unknown: real-world calibration error of current power/APU model against benchmark traces.
Minimal check: run replay against measured trace set and compute RMSE/MAPE.
Unknown: runtime performance at target tick rates with full subsystem coupling.
Minimal check: benchmark with representative scenario profiles and publish p50/p95 loop latency.
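The metrics named in these checks are straightforward to compute with the standard library. The helpers below are a sketch with hypothetical function names: RMSE and MAPE for comparing a simulated trace against a measured one, and p50/p95 for loop latency samples.

```python
import math
import statistics


def rmse(measured: list[float], simulated: list[float]) -> float:
    """Root-mean-square error between two equal-length traces."""
    if not measured or len(measured) != len(simulated):
        raise ValueError("traces must be non-empty and of equal length")
    squared = sum((m - s) ** 2 for m, s in zip(measured, simulated))
    return math.sqrt(squared / len(measured))


def mape(measured: list[float], simulated: list[float]) -> float:
    """Mean absolute percentage error, in percent; zero-valued measurements are skipped."""
    terms = [abs((m - s) / m) for m, s in zip(measured, simulated) if m != 0]
    if not terms:
        raise ValueError("no non-zero measurements to compare against")
    return 100.0 * sum(terms) / len(terms)


def latency_percentiles(samples_ms: list[float]) -> tuple[float, float]:
    """Return (p50, p95) using inclusive interpolation over the samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]
```

MAPE is undefined where the measured value is zero, so this sketch skips those samples; for traces that spend time near zero, RMSE is the more trustworthy of the two.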

Rollout and rollback guidance for recommended changes

Rollout: ship one feature flag at a time (strict_profile_load, deterministic_replay, structured_errors) with default-off where compatibility risk exists.
Rollback: disable the feature flag and redeploy the previous tagged release; preserve audit and scenario artifacts for post-incident review.