Model card: Inference (local mock)
Module: src/nous/server.py::inference_local
Backlog: BL-013, BL-043 (real local model in L3)
Inputs
- A prompt string from the controller.
Outputs
A structured JSON response with the fields {model, prompt_len, response}.
The response field echoes the first 160 characters of the prompt, prefixed
with [local mock]. The mock does not run a model.
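As a rough illustration of the output shape described above, here is a minimal sketch. The function name mock_inference_local, the "local-mock" model value, the single space after the prefix, and counting prompt_len in characters are all assumptions for illustration; the actual code in src/nous/server.py may differ.

```python
import json

MOCK_PREFIX = "[local mock]"
ECHO_LIMIT = 160  # characters of the prompt echoed back, per this card


def mock_inference_local(prompt: str) -> dict:
    """Hypothetical stand-in for inference_local; no model is run."""
    return {
        "model": "local-mock",  # assumed placeholder value
        "prompt_len": len(prompt),  # assumed: length in characters
        # Assumed: prefix and echoed text are separated by one space.
        "response": f"{MOCK_PREFIX} {prompt[:ECHO_LIMIT]}",
    }


if __name__ == "__main__":
    print(json.dumps(mock_inference_local("What is the battery level?"), indent=2))
```

Because the output depends only on the prompt, the same prompt always yields the same response under these assumptions.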
SLA
- Latency: a fixed, configurable delay (default: 20 ms wall-clock, 200 ms simulated time).
- Energy: derived from the hardware profile's compute curve.
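The two latency defaults above could be modelled roughly as follows. This is only a sketch: MockLatency, SimClock, and apply_mock_latency are hypothetical names, and the card does not say how the simulator clock is actually implemented.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class MockLatency:
    wall_ms: float = 20.0   # default wall-clock delay from this card
    sim_ms: float = 200.0   # default simulated delay from this card


class SimClock:
    """Hypothetical simulator clock that advances only when told to."""

    def __init__(self) -> None:
        self.now_ms = 0.0

    def advance(self, ms: float) -> None:
        self.now_ms += ms


def apply_mock_latency(clock: SimClock, latency: Optional[MockLatency] = None) -> float:
    """Sleep for the wall-clock delay, then advance simulated time."""
    if latency is None:
        latency = MockLatency()
    time.sleep(latency.wall_ms / 1000.0)
    clock.advance(latency.sim_ms)
    return clock.now_ms
```

Keeping the wall-clock and simulated delays as separate fields lets a test suite shorten the real sleep without changing the simulated timeline.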
Known failure modes
- The mock is not a model. It cannot answer questions or reason. Treat
its outputs as deterministic placeholders. Pair any call to
inference_local with inference_cloud in scenarios that need an actual
response.
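One possible reading of the pairing advice above is sketched here: always call the local mock, and also call the cloud tool when a real answer is required. The helper run_paired and its parameters are hypothetical, and the two tools are passed in as plain callables because their real signatures are not given in this card.

```python
from typing import Callable, Dict, Optional

# Assumed tool shape: takes a prompt, returns a JSON-like dict.
Tool = Callable[[str], Dict[str, object]]


def run_paired(
    prompt: str,
    local: Tool,
    cloud: Tool,
    needs_answer: bool,
) -> Dict[str, Optional[Dict[str, object]]]:
    """Call the local mock; add a cloud call only when a real answer is needed."""
    placeholder = local(prompt)  # deterministic placeholder, never a real answer
    real = cloud(prompt) if needs_answer else None
    return {"local": placeholder, "cloud": real}
```

A caller would then read the real answer from the "cloud" entry and treat the "local" entry purely as a placeholder.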
Replacement
The L3 implementation (BL-043) wires either TensorRT-LLM (Jetson) or llama.cpp (Pi 5 / x86) into this tool. Once the mock is replaced, the tool will return real responses with real energy and latency figures, and the SLA in this card will be superseded by the new model's card.