ADR 0005: Anthropic client with a hard daily cap and prompt-cache discipline¶

Status: Accepted
Date: 2026-05-20
Authors: rmednitzer
Builds on: ADR 0001

Context¶

inference_cloud is the seam through which a nous deployment talks to Anthropic. Three operational concerns drove the design:

Cost. An unattended simulator can hit a billing surprise quickly.
Latency. Repeated calls with the same system prompt should hit the prompt cache.
Prompt injection. Sensor and operator inputs reach the model; trusted and untrusted content must not share a cache slot.

src/nous/anthropic_client.py:

Counts every call against a file-locked daily counter at $NOUS_HOME/.anthropic_daily_count. The default cap is 100 calls per UTC day and is set by NOUS_ANTHROPIC_DAILY_CAP. When the cap is exhausted, inference_cloud raises CapExhausted and the caller is expected to fall back to inference_local.
Marks the system prompt and every trusted-context block with cache_control={"type": "ephemeral"} so repeated calls within the cache window pay the input-token discount.
Reserves the system slot for trusted content (controller instructions, self-model claims, structured engine outputs) and confines untrusted content (sensor text, intercepted radio payloads) to the user slot.

The default model is claude-haiku-4-5-20251001. Operators who need the larger model set NOUS_ANTHROPIC_MODEL_ADVANCED=claude-sonnet-4-6.

Easier: budget surprises become a configuration knob, not a billing ticket. Cache discipline is enforced by the client, not by every caller.

Harder: the file-locked counter introduces a small but real I/O cost per call. The slot discipline must be respected by every callsite.

Anthropic releases a new model that supersedes the current defaults.
The cache TTL changes (currently 5 minutes for ephemeral).
A use case needs multi-tenant per-workspace caps; the file lock is single-host today.