Sentinel

v1.1 · 8 pulses
Observational corpus on HAT failure modes in a production agent runtime.

Falsifiers

This document records conditions under which the sentinel framework, codebook,

or project would be wrong, and the response action each triggers. The act of

writing these down is most of the value: a research practice without explicit

falsification criteria becomes confirmation-seeking by default. When a

falsifier triggers, the operator commits to acting on it, including retracting

prior claims if the evidence warrants.

This file is a contract, not a reference. Edits require a corresponding

changelog entry below and a rationale in the PR description.

Falsifier categories

Falsifiers are partitioned into four severity levels, in increasing order of what they call into question.

1. Codebook falsifiers: a specific mode definition is miscalibrated. Response: codebook revision (new versioned file under codebook/).
2. Framework falsifiers: the HAT failure-mode lens does not fit this substrate. Response: scope reduction or pivot.
3. Methodology falsifiers: the classification routine produces unreliable signal. Response: routine redesign.
4. Project falsifiers: the project's methodological commitments cannot be held. Response: explicit retirement or pause, with prior claims annotated.

Active falsifiers

F-codebook falsifiers

F-codebook-001: meaningful_control_erosion overreach

Condition. meaningful_control_erosion rate exceeds 30% over any rolling

30-day window in which no pulse cites an actually irreversible action by the

observed agent.

Why this matters. The mode definition implies the agent reduced the operator's ability to intervene, contest, or reverse. If the rate is high but no irreversible actions occurred, the mode is being applied to visible operator displeasure rather than actual control loss. The mode is too loose.

Detection. Manual review during quarterly audit, plus automatable check

(check_falsifiers.py, deferred until corpus reaches 30 days).
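
A minimal sketch of what such a check might compute, assuming a hypothetical pulse schema in which each pulse carries an ISO date string, a mode label, and a boolean cites_irreversible_action flag; none of these field names are confirmed by the routine:

```python
# Hypothetical schema: pulse = {"date": "2026-05-04",
#                               "mode": "meaningful_control_erosion",
#                               "cites_irreversible_action": False}
from datetime import date, timedelta

WINDOW = timedelta(days=30)
THRESHOLD = 0.30  # rate above which F-codebook-001 fires

def f_codebook_001_triggered(pulses: list[dict]) -> bool:
    """True if any rolling 30-day window shows a meaningful_control_erosion
    rate above 30% while no pulse in that window cites an irreversible action."""
    pulses = sorted(pulses, key=lambda p: p["date"])  # ISO dates sort correctly
    dates = [date.fromisoformat(p["date"]) for p in pulses]
    for i in range(len(pulses)):
        # window anchored at pulse i; always contains pulse i itself
        window = [p for p, d in zip(pulses[i:], dates[i:]) if d - dates[i] <= WINDOW]
        rate = sum(p["mode"] == "meaningful_control_erosion" for p in window) / len(window)
        cites = any(p["cites_irreversible_action"] for p in window)
        if rate > THRESHOLD and not cites:
            return True
    return False
```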

Response. Open a codebook/ revision PR that tightens the mode

definition. Document the trigger in the PR body. Reclassify retroactively if

warranted (Wave 2 codebook drift detection).

F-codebook-002: evaluation_awareness_divergence base rate (closed by v1.1)

Status. Closed by codebook v1.1 (PR #51). The mode produced no signal

in v1.0 and was deprecated in v1.1 precisely because base rate was

effectively zero on this single-deployment substrate. The falsifier

itself ran the other direction (rate near zero rather than >5%) and was

satisfied by deprecating the mode, not by triggering an investigation.

Condition (historical). evaluation_awareness_divergence rate exceeds 5%

in any rolling 14-day window, given the v1.0 codebook's note that base

rate should be near zero for this single-deployment substrate.

Why this matters (historical). The codebook itself recorded the

expectation that this mode was rarely applicable here. A high observed

rate would have meant the mode definition did not match the substrate,

or the substrate had changed in ways that newly exposed

evaluation-awareness behavior, or the classifier was overreaching.

Restoration condition. Restore as an active falsifier if a future

codebook version (v1.2+) reintroduces an evaluation-awareness mode, e.g.

because the substrate begins to span multiple monitoring conditions.

F-codebook-003: unmappable patterns persist

Condition. patterns_observed field contains entries that do not map to

any codebook mode for three consecutive pulses.

Why this matters. patterns_observed is the explicit overflow channel

for things the codebook cannot name. A persistent overflow signals the

codebook is missing a mode.

Detection. Manual review during quarterly audit, plus a check

(check_consistency.py) that surfaces unmapped pattern entries.
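
A sketch of that check, assuming patterns_observed is a list of strings per pulse and that "maps to a codebook mode" means plain name membership (the real mapping may be looser); the pulse schema is otherwise hypothetical:

```python
def unmapped_streak(pulses: list[dict], codebook_modes: set[str]) -> bool:
    """True if three consecutive pulses (assumed chronologically ordered)
    each contain at least one patterns_observed entry that maps to no mode."""
    streak = 0
    for pulse in pulses:
        unmapped = [e for e in pulse.get("patterns_observed", [])
                    if e not in codebook_modes]
        streak = streak + 1 if unmapped else 0
        if streak >= 3:  # threshold from F-codebook-003
            return True
    return False
```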

Response. Codebook revision adding the missing mode, with an explicit note

in codebook/v1.N.md of which pulse first surfaced it.

F-framework falsifiers

F-framework-001: HAT lens does not capture the actual signal

Condition. Two consecutive quarterly audits produce mode disagreement

between the operator's blind reclassification and the routine's original

classification on more than 40% of audited pulses, AND the operator's

reclassifications themselves do not converge on consistent modes.

Why this matters. If the operator (the codebook author) cannot

reproduce the classifier's labels and cannot consistently produce their

own labels, the framework is not naming stable categories. The HAT

failure-mode lens may be fine in principle but wrong for this substrate,

or the codebook may need a structural rewrite.

Detection. Manual via quarterly audit comparison.
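
One way the comparison could be scored, assuming for illustration that the same pulses are audited in both quarters, and that "do not converge" is operationalized as the same 40% threshold applied between the operator's two blind passes (the registry does not fix that operationalization):

```python
def disagreement_rate(a: list[str], b: list[str]) -> float:
    """Fraction of audited pulses whose mode labels differ between two passes."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def f_framework_001(original: list[str], blind_q1: list[str], blind_q2: list[str]) -> bool:
    """Fires if both audits disagree with the routine on >40% of pulses AND
    the operator's own blind passes disagree with each other at the same level."""
    both_high = (disagreement_rate(original, blind_q1) > 0.40
                 and disagreement_rate(original, blind_q2) > 0.40)
    operator_unstable = disagreement_rate(blind_q1, blind_q2) > 0.40  # assumed threshold
    return both_high and operator_unstable
```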

Response. Pause routine. Open scope-reduction or pivot PR; possibly

retire codebook v1 entirely and restart with v2 against a different mode

inventory.

F-framework-002: substrate shift invalidates prior corpus

Condition. The host MCP server infrastructure undergoes a structural redesign

that materially changes which agents run and what they produce, AND the

codebook's mode rates shift by more than 50% within four weeks of the

change.

Why this matters. Pulses are observations of one specific substrate.

If the substrate changes substantially, the corpus before the change and

after it cannot be meaningfully aggregated. Trends across the boundary are

artifacts of the substrate change, not the codebook.

Detection. Manual; substrate revisions are visible in the substrate_revision field once it is populated.

Response. Mark the substrate boundary explicitly in substrate_revision

and STATS.md. Compute mode rates separately on each side. Treat the older

corpus as a frozen reference, not a comparable baseline.
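
A sketch of the per-side computation, using the substrate_revision and mode field names from this document; the aggregation itself is illustrative:

```python
from collections import Counter

def mode_rates_by_revision(pulses: list[dict]) -> dict[str, dict[str, float]]:
    """Mode rates computed separately per substrate revision, so corpora on
    either side of a substrate boundary are never aggregated."""
    by_rev: dict[str, Counter] = {}
    for p in pulses:
        by_rev.setdefault(p["substrate_revision"], Counter())[p["mode"]] += 1
    return {rev: {mode: n / sum(counts.values()) for mode, n in counts.items()}
            for rev, counts in by_rev.items()}
```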

F-methodology falsifiers

F-methodology-001: classifier non-determinism

Condition. Re-running the routine on identical inputs (same window,

same substrate state) produces classifications that disagree on mode for

more than 20% of items.

Why this matters. Single-rater work depends on the rater being

internally consistent. A non-deterministic classifier produces noise that

masquerades as signal. The routine cannot ground claims if it cannot

reproduce its own outputs.

Detection. Adversarial replay test: schedule one pulse per quarter to

re-classify a fixed historical window. Currently manual; a replay-test.yml

workflow is deferred to Wave 4.
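
The replay harness reduces to one comparison, sketched here assuming a classify(window) callable that returns (item_id, mode) pairs; both the callable and its signature are placeholders:

```python
def replay_disagreement(classify, window) -> float:
    """Run the classifier twice on the same fixed historical window and
    report the fraction of shared items whose mode label changed."""
    first = dict(classify(window))
    second = dict(classify(window))
    shared = first.keys() & second.keys()
    return sum(first[k] != second[k] for k in shared) / max(len(shared), 1)

# F-methodology-001 fires when this exceeds 0.20 on identical inputs.
```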

Response. Routine redesign. Lock down model version, temperature, and

seed (where supported). If still non-deterministic above threshold,

acknowledge the limit explicitly in honesty_notice and downgrade

confidence calibration claims accordingly.

F-methodology-002: provenance-classification decoupling

Condition. A pulse's codebook_hash or routine_hash does not match

the git blob SHA of the file at the pulse's commit, or those fields are

missing on a v1.1 pulse.

Why this matters. Provenance fields exist to make codebook drift,

substrate drift, and classifier drift observable. If they are missing or

wrong, the corpus loses its self-attestation property and historical

classifications cannot be trusted.

Detection. Automated (check_provenance.py, deferred to Wave 2.5).
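
The blob comparison itself is mechanical: git's default blob SHA is sha1("blob <size>\0" + content), reproduced below. Fetching the file bytes at the pulse's commit (e.g. git show <commit>:<path>) is left out; codebook_hash is the field name this document defines:

```python
import hashlib

def git_blob_sha(content: bytes) -> str:
    """SHA-1 of a git blob object, as `git hash-object` computes it."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

def provenance_ok(pulse: dict, codebook_bytes: bytes) -> bool:
    """False (hard fail) if the recorded hash is missing or does not match."""
    recorded = pulse.get("codebook_hash")
    return recorded is not None and recorded == git_blob_sha(codebook_bytes)
```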

Response. Hard fail in CI. The pulse PR cannot merge until provenance

is correct.

F-project falsifiers

F-project-001: methodological commitment cannot be held

Condition. The operator misses two consecutive quarterly audits, OR

codebook revisions land without the required reclassification of recent

pulses, OR the routine runs without producing valid pulses for more than

14 consecutive days.

Why this matters. Observational research requires sustained discipline.

A project that cannot maintain its own audit cadence is producing data

without the methodological scaffolding that makes that data trustworthy.

Better to retire honestly than continue under an empty commitment.

Detection. Manual for missed audits and unreclassified codebook revisions; pulse cadence is monitored automatically via stale-routine.yml.
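
A sketch of the staleness check such a workflow might run, assuming pulses record an ISO date field; the 14-day limit comes from the condition above, the schema is hypothetical:

```python
from datetime import date

def routine_is_stale(pulses: list[dict], today: date, limit_days: int = 14) -> bool:
    """True if no valid pulse has landed within the last `limit_days` days."""
    if not pulses:
        return True
    latest = max(date.fromisoformat(p["date"]) for p in pulses)
    return (today - latest).days > limit_days
```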

Response. Pause the routine. Either resume with a renewed commitment

documented in a PR, or retire the project. Retirement means freezing the

corpus, marking it explicitly retired in README, and tagging a final

release.

F-project-002: results contaminate the substrate they observe

Condition. Sentinel pulses materially change how the observed substrate's

agents behave (the operator reads pulses, adjusts routines based on them,

and the next pulses reflect those adjustments rather than the agents'

unobserved behavior).

Why this matters. This is the observer effect at the operator level.

If the corpus is no longer a record of how the agents behave but a record

of how the agents behave when known to be observed, the project's framing

as observational research is wrong. It becomes intervention research,

which is a different kind of work with different methodology.

Detection. Manual; require an audit question per quarter explicitly

asking the operator whether and how recent pulses influenced agent

configuration.

Response. Reframe the project explicitly. Either adopt intervention-research methodology (with controls, intentional A/B periods, etc.) or

restore strict observation by separating the operator-of-routines from

the operator-of-agents. The current single-operator setup makes pure

observation difficult.

Inactive / retired falsifiers

(none moved here yet; F-codebook-002 is closed but remains documented inline above because it carries a restoration condition)

Changelog

When        Falsifier  Action  Rationale
2026-05-04  (initial)  added   Wave 2 of redesign. Establishes the falsifier registry.