Behavioral Contracts for Human-Autonomy Teaming

Abstract: Three research communities are independently converging on the same idea: formalized behavioral agreements between humans and autonomous agents. Human-Autonomy Teaming (HAT) calls them working agreements. AI alignment calls them constitutions and model specifications. Safety engineering calls them control structure constraints. None of them cite each other. This article bridges the three fields, argues that scoped operator-authored contracts are theoretically necessary (not merely convenient), and identifies eight structural gaps in the literature that a unified framework would close. The argument draws on cybernetics, systems-theoretic safety analysis, and the empirical HAT literature to show that universal behavioral rules for AI are provably insufficient, and that the alternative is not weaker governance but more precise governance.

The convergence nobody noticed

Three mature fields have spent decades developing the same core concept under different names, in different venues, with different citation networks.

In Human-Autonomy Teaming, Gutzwiller, Espinosa, Kenny, and Lange formalized “working agreements” as a design pattern in 2017: structured pre-definitions of task allocation, transition points, and manual override conditions between human operators and autonomous systems. NATO’s STO-TR-HFM-247 report, produced by twenty scientists from seven nations, identified working agreements, dialog management, and explicit intent communication as foundational HAT solutions, emphasizing “flexible, adjustable and trustworthy automation operated under user sovereignty.”

In AI alignment, Anthropic’s Constitutional AI (Bai et al., 2022) demonstrated that natural-language principles can govern model behavior through self-critique and reinforcement learning. By 2025–2026, both OpenAI and Anthropic had published detailed model specifications defining hierarchical authority chains: platform rules that are never overridable, operator-level configurations, and user-level preferences. Anthropic’s specification introduces a “hardcoded/softcoded” distinction: absolute prohibitions versus defaults that operators and users can adjust within bounded ranges.

In safety engineering, Leveson’s STAMP framework (2004) models safety as a control problem where accidents arise from inadequate constraint enforcement in hierarchical control structures. Arkin’s ethical governor (2009) demonstrated encoding behavioral constraints as machine-enforceable rules for autonomous systems, with two constraint types: forbidding (from Laws of War) and obligating (from Rules of Engagement).

All three communities arrived at the same architectural pattern: a structured specification of what an autonomous agent may, must, and must not do, authored or configured by a human with operational authority, enforced at runtime, and subject to override under defined conditions. The HAT community has the deepest empirical tradition. The alignment community has the most advanced formalization. The safety engineering community has the strongest theoretical foundations. And the cross-citation rate between them is approximately zero.

Why universal rules fail: the impossibility argument

The intuition that a single set of behavioral rules should govern all AI interactions is attractive and wrong. Three classical results from computability theory and logic establish why.

Rice’s theorem (1953) proves that all non-trivial semantic properties of programs are undecidable. A 2024 paper applied this directly to AI alignment, proving that determining whether an arbitrary AI model satisfies a non-trivial alignment function is undecidable in the general case. You cannot build a universal verifier for behavioral compliance.

Gödel’s incompleteness theorems establish that no consistent formal system rich enough to express arithmetic can prove its own consistency. Applied to AI governance: a behavioral specification rich enough to be useful for real-world interaction cannot guarantee, from within its own rules, that it is free of contradictions. Every non-trivial contract has edge cases where its axioms collide.

The frame problem (McCarthy and Hayes, 1969) demonstrates that specifying the effects of actions requires enumerating all non-effects, producing a combinatorial explosion that defeats universal specification. A behavioral rule that says “do not cause harm” requires defining harm across every possible context, which is precisely the enumeration that the frame problem proves intractable.

The practical consequence is not that behavioral governance is impossible. It is that universal behavioral governance is impossible. Scoped contracts that constrain the operational domain can sidestep these impossibility results in the same way that type systems in programming languages achieve decidable safety guarantees: by restricting expressiveness. A contract that governs “how this operator interacts with this AI in this operational context” operates in a dramatically smaller state space than “how all humans should interact with all AIs in all contexts.”

This is not a weakness. It is a design principle. Specificity is the mechanism that converts undecidable universal problems into tractable local ones.

What the HAT literature actually shows

The empirical HAT literature is mature, well-structured, and methodologically narrow. O’Neill, McNeese, Barron, and Schelble’s 2022 systematic review in Human Factors analyzed 76 empirical HAT studies and found that 100% used simulation testbeds. No field studies met inclusion criteria. 75% used military or emergency-related simulations. A 2025 CHI meta-analytical review of 57 articles confirmed these patterns and found that 77% involved a single human paired with a single AI agent. Chung, Holder, Shah, and Yang (2026) extended the analysis and confirmed that the simulation-only pattern persists.

Within these constraints, three constructs are well-established and directly relevant to behavioral contracts.

Shared mental models originated in human-team research (Cannon-Bowers, Salas, and Converse, 1993) and were validated empirically by Mathieu, Heffner, Goodwin, Salas, and Cannon-Bowers (2000) using flight-combat simulations. The critical HAT finding came from Schelble, Flathmann, McNeese, Freeman, and Mallick (2022): explicit goal-sharing was significantly more important for shared mental model formation in human-agent teams than in all-human teams, where implicit socialization processes suffice. This is the empirical case for contracts. What human teams build tacitly through social interaction, human-AI teams must build explicitly through specification.

Nancy Cooke’s Interactive Team Cognition framework offers an important alternative: team cognition is an activity residing in interaction patterns, not a property stored in individual heads. Under this framing, behavioral contracts are not knowledge-alignment tools but interaction protocol specifications. Both perspectives support the same architectural conclusion: the agreement must be externalized.

Calibrated trust is formally defined by Lee and See (2004) in Human Factors as the correspondence between a person’s trust in the automation and the automation’s actual capabilities. The failure modes are well-characterized: overtrust leads to misuse (Parasuraman and Riley, 1997), undertrust leads to disuse, and automation bias (Mosier and Skitka, 1996) produces commission errors where operators follow incorrect automated recommendations without verification.

Behavioral contracts are trust calibration mechanisms. A well-specified contract makes the AI’s capabilities and boundaries explicit, reducing the gap between perceived and actual trustworthiness. The contract’s epistemic controls (confidence tagging, source verification requirements, pointer-gates for volatile information) directly address the conditions under which automation bias manifests: when the operator has no independent basis for evaluating the AI’s output.
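These epistemic controls can be made concrete as data. A minimal Python sketch, where the clause names (`volatile_topics`, `auto_accept`, the `Confidence` levels) are illustrative assumptions, not drawn from any published contract:

```python
from dataclasses import dataclass, field
from enum import Enum

class Confidence(Enum):
    VERIFIED = "verified"   # grounded in a source the AI actually checked
    REPORTED = "reported"   # stated by a source, not independently checked
    INFERRED = "inferred"   # model reasoning only, no direct source

@dataclass
class EpistemicControls:
    """Hypothetical epistemic-control clauses of an operator-layer contract."""
    # Topics whose facts go stale quickly; answers must point to a live
    # source (a "pointer-gate") rather than assert a cached value.
    volatile_topics: set[str] = field(
        default_factory=lambda: {"pricing", "cve-status"})
    # Confidence levels the operator accepts without manual verification.
    auto_accept: set[Confidence] = field(
        default_factory=lambda: {Confidence.VERIFIED})

    def requires_verification(self, topic: str, tag: Confidence) -> bool:
        """An output needs operator verification if it touches a volatile
        topic or carries a confidence tag below the auto-accept set."""
        return topic in self.volatile_topics or tag not in self.auto_accept

controls = EpistemicControls()
print(controls.requires_verification("cve-status", Confidence.VERIFIED))    # True: pointer-gated
print(controls.requires_verification("architecture", Confidence.INFERRED))  # True: low confidence
print(controls.requires_verification("architecture", Confidence.VERIFIED))  # False
```

The point of the sketch is that the contract, not the operator’s mood, decides when verification is mandatory; that is exactly the condition under which automation bias is suppressed.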

Recent trust repair research adds urgency. Schelble, Lopez, Textor, and colleagues (2024) found that neither apologies nor denials restored trust after ethical violations by AI teammates. Trust, once lost through behavioral boundary violations, does not recover through social repair strategies. This means behavioral contracts must prevent violations rather than rely on post-hoc repair. The contract is the primary trust maintenance mechanism, not a fallback.

Dynamic task allocation has been studied through multiple frameworks. The Parasuraman-Sheridan-Wickens (2000) model defines automation across four information-processing functions on a 10-level continuum. Johnson, Bradshaw, and Feltovich’s Coactive Design work (2011–2017) critiques this as overly reductive, arguing that interdependence must shape autonomy through three requirements: observability, predictability, and directability. Miller and Parasuraman’s Playbook architecture (2007) is the most direct precursor to behavioral contracts: pre-authored behavioral specifications that operators select, combine, and modify, analogous to sports plays.

A behavioral contract unifies all three perspectives. It specifies task boundaries (Parasuraman-Sheridan-Wickens), encodes interdependence requirements (Johnson-Bradshaw-Feltovich), and provides a vocabulary for intent communication (Miller-Parasuraman). The contract is not a replacement for these frameworks but an integration point.

The alignment community reinvents working agreements

The most significant recent development in AI governance is the publication of detailed model specifications by frontier AI labs. These documents instantiate HAT concepts with remarkable precision, entirely without citing HAT literature.

Anthropic’s model specification (2026) establishes a three-tier principal hierarchy: Anthropic (foundation constraints) at the top, operators (system prompts, enterprise configuration) in the middle, and users (preferences, instructions) at the bottom. The specification distinguishes “hardcoded” behaviors (absolute, never overridable) from “softcoded” behaviors (defaults that operators and users can adjust within bounded ranges). This is precisely the working agreement pattern: a pre-defined task allocation with transition points, override conditions, and escalation rules.

The specification describes the AI as a “seconded employee” from a staffing agency who maintains the agency’s principles while serving the operator’s objectives. This metaphor is more precise than it appears: it acknowledges that the AI has both inherent constraints (from training, from constitutional principles) and operator-authored specifications (from system prompts, from custom instructions), and that these two layers can conflict. The resolution hierarchy (Anthropic’s constraints override operator instructions, which override user preferences) is a formal authority structure.

OpenAI’s model specification (2025) uses a four-level chain: Platform, Developer, User, Guideline, with “hard rules” at the platform level. The pattern is identical: hierarchical behavioral governance with scoped override authority.
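The shared resolution pattern behind both specifications can be sketched in a few lines. Everything here is an assumption for illustration: the tier constants mirror the platform/operator/user chain, and `hardcoded` marks a rule that no lower tier may adjust:

```python
from dataclasses import dataclass

# Lower number = higher authority, mirroring the platform > operator > user chain.
PLATFORM, OPERATOR, USER = 0, 1, 2

@dataclass(frozen=True)
class Rule:
    behavior: str            # e.g. "reveal-system-prompt"
    setting: str             # the value this tier requests
    tier: int
    hardcoded: bool = False  # absolute: no lower tier may adjust it

def resolve(rules: list[Rule], behavior: str) -> str:
    """A hardcoded rule at the highest applicable tier wins outright;
    otherwise the most specific (lowest-authority) softcoded setting applies."""
    applicable = sorted((r for r in rules if r.behavior == behavior),
                        key=lambda r: r.tier)
    for rule in applicable:          # scan from highest authority down
        if rule.hardcoded:
            return rule.setting
    return applicable[-1].setting if applicable else "unspecified"

rules = [
    Rule("reveal-system-prompt", "never", PLATFORM, hardcoded=True),
    Rule("reveal-system-prompt", "always", USER),   # ignored: platform rule is hard
    Rule("tone", "formal", OPERATOR),               # softcoded default
    Rule("tone", "casual", USER),                   # user adjustment is honored
]
print(resolve(rules, "reveal-system-prompt"))  # never
print(resolve(rules, "tone"))                  # casual
```

The hardcoded/softcoded split falls out of a single scan: absolute constraints short-circuit, defaults defer downward.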

What neither specification does is connect this architecture to the decades of HAT research on the constructs it implements. Trust calibration through transparency? Chen and Barnes’ SAT model at the Army Research Laboratory addressed this in 2014. Authority hierarchies for human-autonomy interaction? NATO’s HFM-247 formalized this in 2020. Behavioral specifications as shared mental model artifacts? Schelble et al. demonstrated this empirically in 2022. The alignment community is learning by doing what the HAT community learned by studying. The result is convergent architecture with divergent vocabulary.

Safety engineering provides the theoretical grammar

Cybernetics offers the deepest theoretical foundation for why behavioral contracts work and when they fail.

Ashby’s Law of Requisite Variety (1956) states that only variety can absorb variety: a controller must have at least as many available response states as the system it governs has disturbance states. Applied to human-AI teaming: an operator governing an AI must possess sufficient control variety (constraints, overrides, mode switches, contracts) to match the AI’s behavioral variety. A behavioral contract is a variety-management mechanism. It constrains the AI’s behavioral space to a range the operator can monitor and govern, while providing the operator with sufficient control levers to respond to deviations.
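Ashby’s inequality can be illustrated as a coverage check. The disturbance classes and control responses below are invented for the example; the structure is what matters:

```python
# Requisite variety as a set-coverage check (illustrative, not a formal proof):
# governance is feasible only if the operator's distinct control responses
# cover the AI's distinct disturbance classes.

ai_disturbance_classes = {
    "hallucinated-fact", "stale-data", "scope-creep",
    "overconfident-tone", "unsafe-suggestion",
}

# Control levers the contract gives the operator, keyed by the disturbance
# each one absorbs (hypothetical lever names):
operator_responses = {
    "hallucinated-fact": "require-source",
    "stale-data": "pointer-gate",
    "scope-creep": "contract-escalation",
    "overconfident-tone": "confidence-tag-audit",
    # no mapped response for "unsafe-suggestion" -> a variety deficit
}

deficit = ai_disturbance_classes - operator_responses.keys()
print(f"Requisite variety satisfied: {not deficit}; uncovered: {sorted(deficit)}")
```

A non-empty deficit is precisely Ashby’s warning: the AI has a behavior class the operator has no lever against, so the contract must either add a lever or constrain that behavior away.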

Conant and Ashby’s Good Regulator Theorem (1970) proves that every effective controller must be (or contain) a model of the system it controls. The operator must maintain an accurate model of the AI’s behavior to govern it effectively. The behavioral contract serves double duty: it constrains the AI’s behavior (reducing the model complexity the operator must maintain) and it externalizes the model itself (making the expected behavior explicit and inspectable).

Beer’s Viable System Model (1972–1984) maps governance onto five recursive subsystems: operations (S1), coordination (S2), operational management (S3), strategic intelligence (S4), and policy/identity (S5). A behavioral contract spans multiple VSM levels: coordination protocols (S2), operational constraint enforcement (S3), and value-level constraints (S5). The contract is not just a control mechanism. It is a governance architecture that makes the human-AI system viable in Beer’s technical sense: capable of maintaining identity and purpose under perturbation.

Leveson’s STAMP framework provides the safety-specific application. In STAMP, accidents arise not from component failures but from inadequate enforcement of constraints within a hierarchical control structure. The human operator is a controller in this hierarchy. The behavioral contract defines the constraints the operator enforces on the AI. Unsafe control actions (the STPA construct) correspond to contract violations: actions not provided when needed, incorrect actions, actions at wrong timing, actions that persist too long or stop too early.

Pennington, Johnson, Hobbs, and Colombi (2025) published the first application of STPA-Coordination to HAT, analyzing the “Loyal Wingman” air combat concept. A separate 2025 paper applied STPA directly to frontier AI systems, noting that its “foundation in systems theory makes it well suited for analysis of emergent properties in complex systems.” The mapping from STPA to behavioral contracts is direct and productive: contracts are safety constraints in a control hierarchy, and STPA provides the methodology for deriving them systematically.

Contracts drift: Rasmussen explains why

Rasmussen’s risk management framework (1997) describes how organizations migrate toward safety boundaries driven by gradients toward least effort and maximum performance. Three boundaries constrain a “discretionary space”: economic failure, unacceptable workload, and functionally acceptable behavior. The operating point drifts invisibly until crossing the safety boundary.

Applied to human-AI teaming: as operators become comfortable with AI teammates, they delegate more, monitor less, and expand AI authority. The behavioral contract’s boundaries erode not through explicit renegotiation but through incremental norm drift. The operator skips the verification step that the contract specifies. The AI’s confidence tags go unexamined because the last fifty outputs were correct. The escalation threshold creeps upward because the operator trusts the AI to handle more.

This is not operator failure. It is the predictable behavior of any system under Rasmussen’s gradient pressures. The contract must be designed with drift resistance: regular verification of contract compliance, explicit renegotiation triggers, and mechanisms that make the drift visible before the safety boundary is crossed.

Sarter, Woods, and Billings (1997) documented the closely related phenomenon of automation surprises: operators losing track of what the automation is doing, why, and what it will do next. Rath (2025) quantified “Agent Drift” in multi-agent LLM systems, finding that behavioral boundaries degraded fastest among all drift types, with a 46% decline over 500 interactions. Behavioral contracts that lack maintenance mechanisms will degrade. This is not a risk. It is a certainty with an empirically characterized rate.
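A drift-resistance mechanism of the kind argued for above might look like the following sketch. The `DriftMonitor` name, window size, and threshold are illustrative choices, not values from the cited studies:

```python
from collections import deque

class DriftMonitor:
    """Track contract compliance over a sliding window and flag
    renegotiation before the safety boundary is crossed."""
    def __init__(self, window: int = 50, floor: float = 0.9):
        self.results = deque(maxlen=window)
        self.floor = floor  # minimum acceptable compliance rate

    def record(self, compliant: bool) -> None:
        self.results.append(compliant)

    def needs_renegotiation(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False    # not enough evidence yet
        rate = sum(self.results) / len(self.results)
        return rate < self.floor

monitor = DriftMonitor(window=10, floor=0.9)
# Nine compliant interactions, then two skipped verification steps.
# The oldest result ages out of the window, leaving 8/10 compliance:
for outcome in [True] * 9 + [False, False]:
    monitor.record(outcome)
print(monitor.needs_renegotiation())  # True
```

The design choice that matters is the explicit trigger: drift becomes a measured quantity that forces renegotiation, rather than an invisible erosion of the operating point.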

The scoping advantage

The impossibility results establish that universal rules fail. Cybernetics establishes that effective governance requires matched variety between controller and system. Safety engineering establishes that constraints must be enforced within a control hierarchy. Together, these provide the theoretical basis for why scoped operator-authored contracts are not merely convenient but architecturally necessary.

A contract scoped to one operator, one operational context, and one interaction mode has three structural advantages over universal specifications.

First, it operates in a tractable state space. The contract does not need to resolve “what does safety mean for all humans interacting with all AIs.” It needs to resolve “what does correctness look like when this operator, with this expertise, asks this AI to perform this class of tasks in this operational context.” The operator’s expertise is part of the contract’s operating assumptions, not a variable to be accommodated.

Second, it supports calibrated trust through specificity. A universal rule like “be helpful and harmless” provides no basis for trust calibration because it does not specify what the AI can and cannot do in any particular domain. A scoped contract that specifies epistemic controls, confidence tagging, verification requirements, and escalation conditions gives the operator concrete criteria for evaluating the AI’s output.

Third, it is maintainable. A universal specification requires governance across all possible contexts, which means governance by committee, which means slow updates, which means drift between specification and practice. An operator-authored contract can be updated by the operator when the operational context changes, with the maintenance burden proportional to the contract’s scope.

The analogy is not a constitution (universal, slow to amend, requiring broad consensus). It is a cockpit checklist: scoped to a specific aircraft type, a specific phase of flight, a specific crew configuration. Nobody expects a 737 checklist to also govern how to sail a boat. And when the checklist has a gap, the pilot’s judgment fills it. The checklist does not override the pilot; it supports them.

The governance hierarchy

If operator-authored contracts are scoped, they need a governance hierarchy to prevent conflicts with higher-level constraints. This hierarchy already exists in practice, implemented independently by the alignment and safety communities.

The foundation layer contains universal axioms, kept minimal precisely because universality and richness are in tension. Something like: fidelity over comfort, bounded authority, transparent failure, bidirectional correction. Not much more. The fewer axioms at this layer, the fewer collision surfaces.

The domain layer adds domain-specific constraints, risk models, regulatory bindings, and acceptable failure modes. A medical domain contract has different degradation priorities than a defense one. This is where most of the actual governance lives.

The operator layer is the scoped contract. It inherits from the domain contract, which inherits from the foundation. Conflicts resolve upward: the foundation constrains the domain, which constrains the operator.
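Upward conflict resolution can be sketched as monotonic narrowing: a lower layer may remove permissions but never restore one a higher layer withheld. Layer contents here are invented for illustration:

```python
# Hypothetical permitted-action sets for each governance layer:
foundation = {"permitted": {"advise", "draft", "summarize", "execute"}}
domain = {"permitted": {"advise", "draft", "summarize"}}      # e.g. medical: no autonomous execution
operator = {"permitted": {"advise", "summarize", "execute"}}  # requests more than the domain allows

def effective_permissions(*layers: dict) -> set[str]:
    """Intersect top-down: conflicts resolve upward by construction,
    because intersection can only shrink the permitted set."""
    perms = set(layers[0]["permitted"])
    for layer in layers[1:]:
        perms &= layer["permitted"]
    return perms

# "execute" is dropped even though the operator layer requested it:
print(sorted(effective_permissions(foundation, domain, operator)))  # ['advise', 'summarize']
```
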

This is architecturally identical to Anthropic’s principal hierarchy (Anthropic constraints, operator system prompts, user preferences), to STAMP’s hierarchical control structure (system-level constraints, subsystem constraints, component constraints), and to Beer’s VSM recursive governance (S5 policy, S3 operational management, S1 operations). The pattern is convergent because the problem is convergent.

The enforcement question is where the communities diverge. HAT relies on soft enforcement through shared understanding and mutual adaptation. Alignment relies on training-time enforcement through RLHF and constitutional self-critique, plus runtime enforcement through system prompts. Safety engineering relies on formal verification and runtime monitoring. A complete governance architecture would use all three: training-time alignment for the foundation layer, runtime enforcement for the domain layer, and operator-authored contracts with mutual adaptation for the operator layer.

What is missing from the literature

Eight structural gaps exist where fundamental concepts from one domain answer open questions in another. No existing work bridges them.

First, no paper connects HAT working agreements to Constitutional AI. These solve the same problem using entirely different vocabularies and citation networks. The bridge is straightforward: Constitutional AI principles are foundation-layer behavioral contracts; model specifications are domain-layer contracts; system prompts and custom instructions are operator-layer contracts. The HAT literature provides the empirical basis for how these contracts affect team performance. The alignment literature provides the enforcement mechanisms.

Second, “operator-authored behavioral contracts” is not an established term in any literature. Bhardwaj (2026) formalizes “agent behavioral contracts” from a software engineering perspective. Gutzwiller et al. (2017) formalized “working agreements” from a human factors perspective. Neither addresses the operator-as-author framing that distinguishes runtime customization from design-time specification.

Third, no formal argument connects impossibility theorems to the design rationale for scoped contracts. The logical bridge exists but has not been made explicit: universal rules are provably insufficient (Rice, Gödel, the frame problem), therefore scoping is theoretically necessary, not merely practically convenient.

Fourth, LLM sycophancy has not been framed as automation bias. The conceptual parallel is precise: sycophancy produces commission-like errors (users accept AI-affirmed incorrect positions) and suppresses verification behaviors. Mosier and Skitka demonstrated in 1998 that accountability pressures reduce automation bias. Anti-sycophancy controls in behavioral contracts are accountability pressures by another name.

Fifth, no rigorous cybernetics-HAT integration exists in peer-reviewed literature. Requisite variety, the Good Regulator Theorem, and the Viable System Model all provide powerful theoretical apparatus for behavioral contracts, but the connections appear only in practitioner publications.

Sixth, Rasmussen’s drift model has not been formally applied to HAT. The migration-toward-boundaries framework explains precisely why contracts degrade, but no published work makes this application explicit.

Seventh, the field study gap persists. 100% of empirical HAT studies use simulation testbeds. Behavioral contracts have never been tested in ecologically valid settings over sustained periods.

Eighth, contract maintenance over time is unaddressed. While Rath (2025) documents agent behavioral drift and Sarter and Woods document automation surprises, no work examines how explicit human-AI behavioral agreements degrade, require renegotiation, or interact with operator skill development.

Implications for practitioners

For infrastructure engineers, security architects, and governance professionals working with AI systems in regulated environments, the literature review supports several practical conclusions.

Behavioral contracts are not bureaucratic overhead. They are the mechanism that converts an undecidable universal governance problem into a tractable local one. If your AI interaction has no explicit behavioral specification, you are relying on the AI’s training-time alignment to match your operational needs. For well-understood consumer interactions, this may be sufficient. For high-consequence operational contexts (infrastructure, security, compliance, safety-critical systems), it is not.

The contract should be authored by the operator, not the vendor. The operator has domain knowledge the vendor cannot anticipate. The vendor provides the foundation and domain layers. The operator provides the operational layer. This is the same principle that makes system prompts and custom instructions useful: they adapt a general capability to a specific context.

Contracts require maintenance. Rasmussen’s drift model predicts that unmaintained contracts will degrade. Build in explicit renegotiation triggers: when the operational context changes, when the AI’s capabilities change, when the operator’s expertise develops, when post-incident review reveals a contract gap. The contract is a living artifact, not a configuration file.

Anti-failure modes belong in the contract. The HAT literature on automation bias, the alignment literature on sycophancy, and the safety literature on automation surprises all describe the same family of failures: the operator stops verifying the AI’s output because the output is usually correct. Explicit contract provisions for bidirectional correction, confidence tagging, and mandatory verification for high-consequence outputs are not features. They are structural defenses against well-characterized failure modes.

Conclusion

The behavioral contract pattern is not new. HAT researchers formalized it in 2017. Alignment researchers implemented it in 2022. Safety engineers have been modeling it since 2004. What is new is the recognition that these are the same pattern, that the theoretical foundations from cybernetics and computability theory explain why it works, and that the empirical literature from all three fields validates the same architectural conclusion: scoped, operator-authored, hierarchically governed behavioral agreements are the mechanism that makes human-AI teaming governable.

The communities have converged independently. The next step is convergence by design: a unified framework that draws on HAT’s empirical tradition, alignment’s enforcement mechanisms, and safety engineering’s theoretical architecture. The eight identified gaps are not incremental research questions. They are structural absences where fundamental concepts from one domain answer open questions in another.

The window for this synthesis is narrow. The alignment community is formalizing agent behavioral contracts (Bhardwaj, 2026; Ye and Tan, 2026) and industry model specifications are becoming de facto standards. If the HAT community does not assert its decades-deep expertise in this design space, the vocabulary and the research agenda will be defined entirely by AI engineering, without the empirical rigor and human-factors grounding that the HAT tradition provides. And if the safety engineering community does not connect STAMP, requisite variety, and Rasmussen’s drift to human-AI teaming governance, the theoretical foundations will remain implicit rather than explicit, making the contracts harder to design, harder to verify, and harder to maintain.

The pattern exists. The evidence exists. The theory exists. What does not yet exist is the bridge.

Key references

Andrews, R. W., Lilly, J. M., Srivastava, D., & Feigh, K. M. (2023). The role of shared mental models in human-AI teams: A theoretical review. Theoretical Issues in Ergonomics Science, 24(2), 129–175.

Arkin, R. C. (2009). Governing Lethal Behavior in Autonomous Robots. Chapman & Hall/CRC.

Ashby, W. R. (1956). An Introduction to Cybernetics. Chapman & Hall.

Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.

Beer, S. (1972). Brain of the Firm. Allen Lane/The Penguin Press.

Bhardwaj, R. (2026). Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents. arXiv:2602.22302.

Chen, J. Y. C., Procci, K., Boyce, M., Wright, J., Garcia, A., & Barnes, M. (2014). Situation Awareness-Based Agent Transparency. ARL-TR-6905.

Conant, R. C., & Ashby, W. R. (1970). Every good regulator of a system must be a model of that system. International Journal of Systems Science, 1(2), 89–97.

Gutzwiller, R. S., Espinosa, S. P., Kenny, C., & Lange, D. S. (2018). A Design Pattern for Working Agreements in Human-Autonomy Teaming. In Advances in Intelligent Systems and Computing (vol. 591, pp. 12–24). Springer.

Johnson, M., Bradshaw, J. M., & Feltovich, P. J. (2014). Coactive Design: Designing Support for Interdependence in Joint Activity. Journal of Human-Robot Interaction, 3(1), 43–69.

Lee, J. D., & See, K. A. (2004). Trust in Automation: Designing for Appropriate Reliance. Human Factors, 46(1), 50–80.

Leveson, N. G. (2004). A new accident model for engineering safer systems. Safety Science, 42(4), 237–270.

Leveson, N. G., & Thomas, J. P. (2018). STPA Handbook.

Miller, C. A., & Parasuraman, R. (2007). Designing for Flexible Interaction Between Humans and Automation: Delegation Interfaces for Supervisory Control. Human Factors, 49(1), 57–75.

Mosier, K. L., & Skitka, L. J. (1996). Human Decision Makers and Automated Decision Aids: Made for Each Other? In R. Parasuraman & M. Mouloua (Eds.), Automation and Human Performance.

O’Neill, T., McNeese, N., Barron, A., & Schelble, B. (2022). Human-Autonomy Teaming: A Review and Analysis of the Empirical Literature. Human Factors, 64(5), 904–938.

Parasuraman, R., & Riley, V. (1997). Humans and Automation: Use, Misuse, Disuse, Abuse. Human Factors, 39(2), 230–253.

Parasuraman, R., Sheridan, T. B., & Wickens, C. D. (2000). A model for types and levels of human interaction with automation. IEEE Transactions on Systems, Man, and Cybernetics – Part A, 30(3), 286–297.

Pennington, B. T., Johnson, C. W., Hobbs, A. A., & Colombi, J. M. (2025). Engineering safe human-autonomy teaming using STPA-coordination. Safety Science.

Rasmussen, J. (1997). Risk management in a dynamic society: A modelling problem. Safety Science, 27(2–3), 183–213.

Santoni de Sio, F., & van den Hoven, J. (2018). Meaningful Human Control over Autonomous Systems: A Philosophical Account. Frontiers in Robotics and AI, 5, Article 15.

Schelble, B. G., Flathmann, C., McNeese, N. J., Freeman, G., & Mallick, R. (2022). Let’s Think Together! Assessing Shared Mental Models, Performance, and Trust in Human-Agent Teams. Proceedings of the ACM on Human-Computer Interaction, 6(GROUP).

Shen, T., et al. (2024). Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions. arXiv:2406.09264.
