Abstract: A reference architecture for maximising operational autonomy in IT infrastructure while maintaining compliance, safety, and auditability. The architecture targets 70–85% automation of operational toil with explicit human decision gates at irreducible control points. It covers five operational layers (platform, patching, security, compliance, change management), observability, agentic AI integration, and a phased implementation roadmap.
Executive Summary
Core principle: Automate the predictable; gate the consequential.
Every domain has a hard ceiling – the point beyond which automation requires human judgement, regulatory accountability, or decisions under genuine uncertainty. This architecture is designed around those ceilings, not against them.
| Domain | Automatable | Hard ceiling |
|---|---|---|
| Self-healing infrastructure | 80–90% | Novel failures, cascading events, split-brain |
| Continuous patching | 60–75% | Stateful upgrades, breaking changes, kernel updates |
| Security response | 50–65% | Novel attacks, availability-impacting actions |
| Compliance automation | 40–60% | Risk acceptance, management attestation |
| Change management | 70–80% | Architecture changes, non-standard changes |
1. Architecture Principles
1.1 Design Axioms
- Immutability over mutation. Prefer replacing infrastructure over patching in place. Immutable images, declarative state, GitOps reconciliation.
- Deterministic over probabilistic. Automation actions must be predictable and testable. LLM/AI-driven actions are advisory until validated in a closed-loop with deterministic verification.
- Least privilege, least blast radius. Every automated agent operates with scoped permissions and bounded blast radius. Canary before fleet. Feature flag before hard deploy.
- Evidence by default. Every automated action produces a signed, timestamped audit trail. If you can’t prove it happened correctly, it didn’t.
- Fail safe, not fail silent. Unknown states trigger safe-halt or degraded mode, never silent continuation.
- Human gates are structural, not temporary. Certain decisions require human accountability by regulation and by good engineering. These are not automation gaps to be closed – they are load-bearing control points.
1.2 Autonomy Levels (Graduated Model)
Adapted from the operational maturity model emerging in the Agentic SRE space:
| Level | Name | Description | Human involvement |
|---|---|---|---|
| L0 | Manual | Runbook-driven, human executes | Full |
| L1 | Assisted | System recommends, human approves and executes | Decision + execution |
| L2 | Semi-automated | System executes pre-approved actions, human approves scope | Decision only |
| L3 | Supervised autonomous | System executes and verifies, human monitors and can intervene | Oversight only |
| L4 | Autonomous (bounded) | System executes within policy, alerts human on boundary | Exception only |
Target state for this architecture: L3–L4 for standard operations, L1–L2 for risk-bearing changes.
No component should operate at a level beyond what its verification evidence supports.
2. Layer 1 – Immutable, Self-Healing Base Platform
2.1 Purpose
Provide a compute substrate that recovers from common failure modes without human intervention. This is the most mature automation domain and should be built first.
2.2 Reference Stack
Operating system: Immutable, minimal-attack-surface OS.
- Talos Linux (Kubernetes-native, API-managed, no SSH, no shell)
- Flatcar Container Linux (auto-updating, immutable root)
- Bottlerocket (AWS-native, API-managed)
Selection criteria: No general-purpose OS in the compute plane. General-purpose Linux (Ubuntu, RHEL) only for management/bastion nodes with CIS hardening.
Container orchestration: Kubernetes (managed or self-hosted Talos).
- Control plane: minimum 3 nodes, etcd with automated backup and restore
- Node auto-repair: cloud provider node auto-repair or Cluster API machine health checks
- Pod self-healing: liveness/readiness probes, PodDisruptionBudgets, topology spread constraints
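The pod self-healing primitives above can be sketched as a manifest; the `payments` workload, probe paths, image, and thresholds are illustrative assumptions, not prescriptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb
spec:
  minAvailable: 2                 # keep at least 2 replicas during voluntary disruption
  selector:
    matchLabels:
      app: payments
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      topologySpreadConstraints:
        - maxSkew: 1              # spread replicas evenly across zones
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: payments
      containers:
        - name: payments
          image: registry.example.com/payments:1.4.2
          livenessProbe:          # kubelet restarts the container on repeated failure
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:         # gate traffic until the pod reports ready
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
```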
Service mesh: Optional but recommended for mTLS, circuit breaking, retry budgets.
- Istio, Linkerd, or Cilium service mesh
- Automatic mTLS rotation (cert-manager with short-lived certificates)
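A short-lived workload certificate managed by cert-manager might look like the following sketch; the `internal-ca` issuer and DNS name are assumptions for illustration:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: mesh-workload-cert
  namespace: payments
spec:
  secretName: mesh-workload-tls
  duration: 24h        # short-lived certificate
  renewBefore: 8h      # cert-manager rotates well before expiry
  issuerRef:
    name: internal-ca  # assumed ClusterIssuer backed by an internal CA
    kind: ClusterIssuer
  dnsNames:
    - payments.payments.svc.cluster.local
```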
2.3 Self-Healing Capabilities
| Failure mode | Automated response | Autonomy level | Verification |
|---|---|---|---|
| Pod crash | Restart (kubelet) | L4 | Restart count metric, SLO check |
| Node failure | Reschedule pods, replace node (Cluster API / ASG) | L4 | Node ready count, workload distribution |
| Disk pressure | Evict pods, alert, trigger volume expansion | L3–L4 | Disk utilisation metric, PV status |
| Network partition | Circuit breaker, retry with backoff, failover | L4 | Error rate metric, mesh health |
| Certificate expiry | Auto-rotation (cert-manager) | L4 | Cert expiry metric, TLS handshake success |
| Config drift | GitOps reconciliation (Flux/ArgoCD) | L4 | Drift detection alert, sync status |
| Resource exhaustion | HPA/VPA scaling | L3–L4 | Resource utilisation, scaling events |
| Cascading failure | Circuit breaker + rate limiting + load shedding | L3 | Error budget burn rate, human review |
| Split-brain / data inconsistency | Human gate – safe-halt, alert | L1 | Requires manual diagnosis |
2.4 What Cannot Be Automated
- Novel failure modes not covered by existing remediation playbooks
- Cascading failures crossing service boundaries in unpredictable ways
- Split-brain scenarios requiring data reconciliation decisions
- Infrastructure architecture changes (adding regions, changing topology)
2.5 Implementation Pattern
┌─────────────────────────────────────────────────────┐
│ Git Repository │
│ (Infrastructure state, Kubernetes manifests, │
│ Helm charts, Kustomize overlays) │
└──────────────┬──────────────────────────────────────┘
│ GitOps sync
▼
┌──────────────────────────┐ ┌─────────────────────┐
│ Flux / ArgoCD │◄───│ Drift Detection │
│ (Continuous reconcile) │ │ (alert on manual │
│ │ │ cluster changes) │
└──────────┬───────────────┘ └─────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Kubernetes Cluster (Talos / managed K8s) │
│ ┌──────────┐ ┌──────────┐ ┌───────────────────┐ │
│ │ HPA/VPA │ │ PDB │ │ Cluster API / │ │
│ │ (scaling)│ │ (budget) │ │ Node auto-repair │ │
│ └──────────┘ └──────────┘ └───────────────────┘ │
│ ┌──────────────────┐ ┌────────────────────────┐ │
│ │ cert-manager │ │ Service mesh (mTLS, │ │
│ │ (cert rotation) │ │ circuit breaker) │ │
│ └──────────────────┘ └────────────────────────┘ │
└──────────────────────────────────────────────────────┘
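The GitOps reconciliation loop in the diagram can be sketched with Flux resources; the repository URL and path are placeholders:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: infra
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.example.com/platform/infrastructure
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-base
  namespace: flux-system
spec:
  interval: 10m        # reconcile loop: manual drift is reverted each cycle
  prune: true          # delete resources removed from Git
  sourceRef:
    kind: GitRepository
    name: infra
  path: ./clusters/production
```

With `prune: true`, the cluster converges on Git as the single source of truth, which is what makes the drift-detection alerts in the diagram meaningful.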
2.6 Evidence Capture
- GitOps sync status and history (Flux/ArgoCD events)
- Node lifecycle events (creation, deletion, repair)
- Scaling events (HPA/VPA decisions with reasoning)
- Certificate rotation events (cert-manager logs)
- Drift detection alerts with before/after state
3. Layer 2 – Continuous Patching Pipeline
3.1 Purpose
Automatically apply security and dependency patches with verification gates, maintaining compliance SLAs for patch windows while minimising human intervention.
3.2 Patching Domains
| Domain | Approach | Autonomy target |
|---|---|---|
| OS base image | Immutable image rebuild on upstream CVE | L3–L4 |
| Container base images | Automated rebuild pipeline | L3–L4 |
| Application dependencies | Renovate/Dependabot with auto-merge rules | L3 (low-risk), L2 (high-risk) |
| Kubernetes components | Managed K8s auto-upgrade or staged rollout | L2–L3 |
| Database engines | Staged, human-gated | L1–L2 |
| Kernel / firmware | Human-gated, scheduled maintenance | L1 |
3.3 Automated Patching Pipeline
CVE Feed / Upstream Release
│
▼
┌───────────────────────┐
│ Vulnerability Scanner │ ◄── Trivy, Grype, Snyk
│ (continuous scan of │ scanning container
│ images + deps) │ registry + repos
└──────────┬────────────┘
│ New CVE or dependency update detected
▼
┌───────────────────────┐
│ Renovate / Dependabot │
│ (auto-PR with │
│ changelog + diff) │
└──────────┬────────────┘
│ PR opened
▼
┌───────────────────────┐
│ CI Pipeline │
│ Build, test, SAST, │
│ DAST, SCA, container │
│ scan, SBOM, signing │
└───────────┬───────────┘
│ All checks pass
▼
┌─────────────────────────────────────────────┐
│ Auto-merge Policy Engine │
│ IF severity < CRITICAL │
│ AND test coverage >= threshold │
│ AND no breaking API changes │
│ AND dependency is in approved-list │
│ AND SBOM diff is within policy │
│ THEN auto-merge to staging branch │
│ ELSE require human review │
└──────────────┬──────────────────────────────┘
│
▼
┌───────────────────────┐
│ Canary Deployment │
│ (Argo Rollouts / │
│ Flagger) │
│ Monitor: error rate, │
│ latency, success rate,│
│ resource usage │
│ Auto-promote if SLO │
│ met; rollback if not │
└──────────┬────────────┘
│
▼
┌───────────────────────┐
│ Progressive Rollout │
│ 5% → 25% → 50% → 100%│
│ with SLO gates at │
│ each stage │
└───────────────────────┘
3.4 Auto-Merge Rules (Policy-as-Code)
These rules determine which patches can proceed without human review. They should be conservative and tightened over time based on incident data.
```yaml
# Example: Renovate auto-merge policy (conceptual)
auto_merge_criteria:
  # Patch version bumps of well-known, low-risk deps
  - match:
      update_type: "patch"
      dependency_type: "production"
      severity: ["low", "medium"]
    requires:
      ci_pass: true
      test_coverage_delta: ">= 0"    # no coverage regression
      breaking_changes: false
      sbom_policy_check: pass
    action: auto_merge

  # Security patches – critical CVEs get fast-tracked,
  # but still require canary verification
  - match:
      update_type: "any"
      cve_severity: "critical"
      cisa_kev: true                 # Known Exploited Vulnerability
    requires:
      ci_pass: true
    action: auto_merge_to_canary
    escalation: page_oncall_if_canary_fails

  # Everything else: human review
  - match:
      update_type: "major"
    action: require_human_review
  - match:
      dependency_type: "database_engine"
    action: require_human_review
  - match:
      update_type: "minor"
      breaking_changes: true
    action: require_human_review
```
3.5 OS-Level Patching (Immutable Image Rebuild)
For immutable OS (Talos, Flatcar, Bottlerocket):
- Upstream publishes new image with security fixes
- CI pipeline builds new machine image incorporating the update
- Image is scanned (Trivy/vulnerability assessment)
- Staged node replacement: drain → replace → verify, one node at a time
- PodDisruptionBudgets ensure workload availability during rollout
- Rollback: revert to previous image if node health checks fail
For traditional OS (management nodes):
- Unattended-upgrades for security patches (automatic)
- Ansible playbooks for coordinated upgrades
- Snapshot before, apply, verify, rollback if failed
- Kernel updates: scheduled maintenance window, human approval
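The snapshot → apply → verify sequence for management nodes can be sketched as an Ansible playbook; the snapshot command and health endpoint are assumptions to replace with your own tooling:

```yaml
# Sketch: coordinated security patching for traditional management nodes.
- name: Patch management nodes with verification
  hosts: management
  serial: 1                          # one node at a time
  tasks:
    - name: Snapshot the node before patching (hypothetical script)
      ansible.builtin.command: /usr/local/bin/create-snapshot.sh
      # Replace with your hypervisor or cloud snapshot mechanism

    - name: Apply updates
      ansible.builtin.apt:
        upgrade: dist
        update_cache: true
      register: patch_result

    - name: Verify node health after patching
      ansible.builtin.uri:
        url: http://localhost:9100/metrics   # node exporter as a liveness signal
        status_code: 200
      register: health
      until: health.status == 200
      retries: 5
      delay: 10
```

`serial: 1` mirrors the one-node-at-a-time rule used for immutable node replacement; a failed health check halts the play before the next node is touched.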
3.6 Compliance Patch SLAs
Regulatory frameworks set expectations for patch timelines. These should be encoded as policy:
| Severity | SLA target | Auto-action |
|---|---|---|
| Critical (CVSS ≥ 9.0, CISA KEV) | 24–72 hours | Auto-canary, page on-call |
| High (CVSS 7.0–8.9) | 7 days | Auto-PR, auto-merge if policy met |
| Medium (CVSS 4.0–6.9) | 30 days | Auto-PR, batch with next release |
| Low (CVSS < 4.0) | 90 days | Auto-PR, low priority queue |
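One way to encode the SLA table as policy-as-code. This is a hypothetical schema, not a real tool's format — adapt it to whatever policy engine drives your pipeline:

```yaml
# Hypothetical patch-SLA policy; field names are illustrative.
patch_slas:
  - severity: critical
    condition: "cvss >= 9.0 or cisa_kev == true"
    deadline_hours: 72
    auto_action: auto_canary
    escalation: page_oncall
  - severity: high
    condition: "cvss >= 7.0"
    deadline_days: 7
    auto_action: auto_merge_if_policy_met
  - severity: medium
    condition: "cvss >= 4.0"
    deadline_days: 30
    auto_action: batch_with_next_release
  - severity: low
    condition: "cvss < 4.0"
    deadline_days: 90
    auto_action: low_priority_queue
```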
3.7 Human Gates (Non-Automatable)
- Database engine major version upgrades (schema compatibility)
- Kubernetes control plane upgrades (API deprecation review)
- Kernel updates on bare-metal with custom drivers
- Any patch that changes API contracts or data formats
- First-time patching of a new dependency (no historical data)
3.8 Evidence Capture
- SBOM for every deployed image (CycloneDX or SPDX)
- Vulnerability scan results at build time and runtime
- Signed image digests (cosign / Sigstore)
- Canary metrics during bake period
- Rollback events with reason codes
- Patch compliance dashboard (time-to-patch by severity)
4. Layer 3 – Security Automation and Autonomous Defense
4.1 Purpose
Detect, contain, and respond to security threats with minimal human latency for known attack patterns, while maintaining human oversight for novel threats and availability-impacting responses.
4.2 Defense-in-Depth Stack
Layer 6: Compliance-as-Code (OPA, Kyverno, Cedar)
Continuous policy enforcement, admission control
Layer 5: SOAR (Tines, Shuffle, XSOAR)
Playbook-driven automated response
Layer 4: SIEM / Correlation (Wazuh, Elastic SIEM)
Event correlation, alert enrichment, threat intelligence
Layer 3: Runtime Security (Falco, Tetragon, Sysdig)
Syscall monitoring, behavioural detection, eBPF
Layer 2: Network Security (Cilium, Calico, NP-as-code)
Network policy, DNS filtering, egress control
Layer 1: Supply Chain (Trivy, cosign, SLSA, admission)
Image signing, SBOM, vulnerability gates
Layer 0: Identity (Keycloak/OIDC, RBAC, SPIFFE/SPIRE)
Zero-trust identity, workload identity, least privilege
4.3 Automated Response Playbooks
| Threat pattern | Automated response | Autonomy level | Constraint |
|---|---|---|---|
| Known malware hash in container | Kill pod, quarantine image, alert | L4 | Pre-approved action |
| Brute force authentication | Progressive rate limit, temp block IP, alert | L4 | Threshold-based |
| Anomalous egress traffic | Block egress to unknown destination, alert | L3 | May impact availability |
| Privilege escalation attempt | Kill process, alert, capture forensics | L4 | Pre-approved action |
| CVE in running container | Schedule replacement with patched image | L3 | Follows patching pipeline |
| Certificate about to expire | Auto-rotate | L4 | cert-manager handles |
| Config drift from policy | Auto-remediate to desired state | L3–L4 | Policy-as-code |
| Unusual API call patterns | Increase logging, alert, reduce rate limit | L3 | May impact legitimate traffic |
| Novel attack pattern | Human gate – alert, capture, do not auto-remediate | L1 | Unknown blast radius |
| Insider threat indicators | Human gate – alert security team, capture evidence | L1 | Legal/HR implications |
4.4 Policy-as-Code (Admission Control)
Prevent bad state from entering the cluster rather than detecting it after the fact.
Kubernetes admission:
- Kyverno or OPA/Gatekeeper for pod security standards
- Image signature verification (cosign + admission webhook)
- No privileged containers, no host networking (except explicit allowlist)
- Resource limits required on all pods
- Network policies required for all namespaces
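A Kyverno sketch enforcing two of the admission rules above (resource limits required, no privileged containers); the policy name and messages are illustrative:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: baseline-admission
spec:
  validationFailureAction: Enforce   # reject non-compliant pods at admission
  rules:
    - name: require-resource-limits
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
    - name: disallow-privileged
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): "false"
```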
Cloud-level:
- AWS SCP / Azure Policy / GCP Organization Policy for guardrails
- Terraform / OpenTofu with plan validation (OPA on plan output)
- No direct console changes – all changes through IaC pipeline
Runtime:
- Falco rules for syscall-level behavioural detection
- Tetragon / eBPF for kernel-level enforcement
- Wazuh for host-level integrity monitoring (FIM) and log analysis
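A minimal Falco rule sketch for behavioural detection. Falco's default ruleset ships a similar shell-in-container rule; the condition here is illustrative:

```yaml
- rule: Shell spawned in container
  desc: Detect an interactive shell started inside a container
  condition: >
    container.id != host and evt.type = execve
    and proc.name in (bash, sh, zsh)
  output: >
    Shell spawned in container
    (user=%user.name container=%container.name
     image=%container.image.repository cmdline=%proc.cmdline)
  priority: WARNING
  tags: [container, shell, mitre_execution]
```

In this architecture, the alert would flow into the SIEM layer for correlation, with any automated kill action gated by a pre-approved SOAR playbook.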
4.5 Vulnerability Management Pipeline
Continuous Scanning
├── Registry scan (Trivy) — on push and scheduled
├── Runtime scan (Trivy operator) — running containers
├── Host scan (Wazuh) — OS and installed packages
├── IaC scan (Checkov, tfsec) — in CI pipeline
└── Dependency scan (SCA) — in CI pipeline
│
▼
Prioritisation Engine
├── CVSS score
├── EPSS (Exploit Prediction Scoring)
├── CISA KEV (Known Exploited)
├── Reachability analysis (is the vuln actually reachable?)
├── Asset criticality (what does this run on?)
└── Exposure context (internet-facing? internal?)
│
▼
Action routing
├── Critical + Exploited + Reachable → Emergency patch
├── High + Reachable → Fast-track to patching pipeline
├── Medium/Low or Not Reachable → Standard patching SLA
└── Accepted risk → Document in risk register, review quarterly
4.6 Human Gates (Non-Automatable)
- Novel attack patterns requiring investigation
- Actions that would impact production availability (killing services, blocking IP ranges)
- Incident response decisions with legal/regulatory implications
- Threat intelligence assessment (is this a false positive or a real campaign?)
- Risk acceptance decisions (accepting a vulnerability that can’t be patched)
- Security architecture changes
4.7 Evidence Capture
- SIEM event logs with correlation IDs
- Automated response execution logs (SOAR audit trail)
- Forensic captures (container snapshots, memory dumps) for incidents
- Policy enforcement logs (admission webhook decisions)
- Vulnerability scan history and remediation timelines
- Image signatures and SBOM for deployed artifacts
5. Layer 4 – Compliance Automation and Continuous Assurance
5.1 Purpose
Maintain continuous compliance posture with automated evidence generation, policy enforcement, and drift detection, while preserving human accountability for risk decisions and regulatory attestation.
5.2 Compliance Automation Model
Compliance Control Plane
├── Policy Engine     (OPA / Kyverno / Cedar)
├── Evidence Store    (immutable, signed, timestamped)
└── Audit Dashboard   (continuous posture)

Control Mapping Layer
├── ISO 27001 <-> NIS2 <-> SOC 2 <-> DORA <-> CIS <-> NIST
├── Maps technical controls to regulatory requirements
└── One control can satisfy multiple frameworks
5.3 Continuous Control Monitoring
| Control category | Automated check | Frequency | Evidence type |
|---|---|---|---|
| Access control | RBAC audit, stale accounts, over-privileged roles | Continuous | RBAC dump, access review report |
| Encryption | TLS version check, cert validity, at-rest encryption | Continuous | Scan results, cert inventory |
| Patch compliance | Vulnerability age vs. SLA | Continuous | Patch timeline report |
| Network segmentation | Network policy coverage, egress audit | Continuous | Policy dump, connectivity test |
| Logging and monitoring | Log pipeline health, retention compliance | Continuous | Pipeline metrics, retention proof |
| Backup integrity | Restore test results, RPO compliance | Daily/weekly | Restore test logs, hash verification |
| Change management | All changes via GitOps (no manual cluster changes) | Continuous | Git history, drift detection alerts |
| Incident response | Playbook test results, MTTR metrics | Quarterly exercise | Exercise report, MTTR dashboard |
| Supply chain | SBOM coverage, image signing rate | Continuous | SBOM inventory, signing audit |
| Identity lifecycle | Joiner/mover/leaver automation, MFA enforcement | Continuous | IGA audit logs, MFA coverage |
5.4 Evidence Generation (OSCAL-Based)
OSCAL (Open Security Controls Assessment Language) provides a machine-readable format for compliance evidence:
- System Security Plan (SSP): Generated from IaC and policy-as-code definitions
- Assessment Plan: Automated test definitions mapped to controls
- Assessment Results: Continuous scan and test results in OSCAL format
- Plan of Action and Milestones (POA&M): Auto-generated from failed controls
Evidence pipeline:
Control check runs → Results stored (immutable, signed) →
Mapped to framework requirements → Dashboard updated →
Auditor accesses dashboard + evidence store
5.5 Regulatory Framework Requirements
NIS2 (applicable if operating in EU critical sectors):
- Risk management measures (Article 21): Automated control enforcement
- Incident reporting: 24-hour early warning, 72-hour notification, 1-month final report
- Supply chain security: SBOM, vendor assessment automation
- Management accountability: Human sign-off required – cannot be automated
DORA (applicable if in EU financial sector):
- ICT risk management framework: Continuous control monitoring
- Incident classification and reporting: 4-hour classification, then tiered reporting
- Threat-Led Penetration Testing (TLPT): Must be conducted – cannot be purely automated
- Third-party risk management: Contract-level controls, cannot be fully automated
SOC 2 / ISO 27001:
- Continuous control monitoring replaces point-in-time audits for many controls
- Management review and risk treatment decisions: Human gate
- Internal audit: Can be assisted by automation but requires human judgement
- Certification audit: External auditor, fully human process
5.6 GitSecOps: Git as the Source of Compliance Truth
The strongest pattern for demonstrating compliance to auditors:
- All infrastructure state in Git – signed commits, PR-based review
- All policy in Git – OPA/Kyverno policies version-controlled
- All changes traceable – commit → PR → CI → deploy → verify
- Immutable evidence – build logs, scan results, deployment events stored with integrity protection
- Automated mapping – technical controls mapped to regulatory requirements
This turns audit from “show me your documents” to “here is the commit history, the policy enforcement logs, and the continuous compliance dashboard.”
5.7 Human Gates (Structurally Required by Regulation)
These are not automation gaps. They are requirements imposed by every major compliance framework:
- Risk acceptance decisions: A named human must accept residual risk
- Management review: ISO 27001 Clause 9.3, NIS2 Article 20 – senior management must review
- Incident severity classification: Initial triage can be automated, final classification needs human judgement
- Vendor risk assessment: Questionnaires can be automated, risk decisions cannot
- Audit response: External auditors interact with humans
- Policy approval: Policy-as-code must be approved by a human before enforcement
- Exception management: Granting exceptions to policy requires documented human decision
5.8 Evidence Capture
- OSCAL-formatted assessment results
- Git commit history with signed commits (GPG/SSH)
- CI/CD pipeline execution logs
- Policy enforcement decision logs (admission webhook audit)
- Access review reports
- Incident timeline and response evidence
- Management review meeting minutes (human-generated)
- Risk register with decision trail
6. Layer 5 – Change Management and Autonomous Deployment
6.1 Purpose
Enable safe, fast, automated deployment of standard changes while maintaining rigorous gates for non-standard and emergency changes.
6.2 Change Classification
| Type | Definition | Process | Autonomy |
|---|---|---|---|
| Standard change | Pre-approved, bounded blast radius, automated verification | Fully automated pipeline | L4 |
| Normal change | Requires review, moderate risk | PR review + automated deploy + canary | L2–L3 |
| Emergency change | Urgent fix, expedited process | Abbreviated review + automated deploy + immediate verify | L2 |
| Major change | High risk, architecture impact | Full CAB review + staged manual rollout | L1 |
6.3 Standard Change Automation (L4)
Standard changes account for the largest share of change volume and are the highest-value automation target. These are pre-approved change types where the blast radius is bounded and verification is automated.
Examples of standard changes:
- Application deployment (same architecture, same APIs, code changes only)
- Dependency patch (within auto-merge policy)
- Scaling adjustments (within defined limits)
- Feature flag toggle (within defined scope)
- Certificate rotation
- Configuration value change (within defined schema)
Pipeline:
Developer pushes code
│
▼
CI Pipeline (build, test, scan, sign)
│ All gates pass
▼
PR Auto-merge (if policy met)
│
▼
GitOps Sync (Flux/ArgoCD) → Canary Deploy (Argo Rollouts)
│
SLO Verification
(bake time: 15–30 min)
│
Pass? Promote
Fail? Rollback
│
▼
Progressive Rollout: 5% → 25% → 50% → 100%
with SLO gates at each stage
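The canary-and-promote flow above maps onto an Argo Rollouts resource; the `slo-check` AnalysisTemplate, workload name, and bake durations are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 15m }   # bake at 5% while analysis runs
        - setWeight: 25
        - pause: { duration: 15m }
        - setWeight: 50
        - pause: { duration: 15m }
        # 100% is implicit once all steps pass
      analysis:
        templates:
          - templateName: slo-check  # assumed AnalysisTemplate (see 6.6)
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: payments
          image: registry.example.com/payments:1.4.2
```

A failed analysis at any step aborts the rollout and reverts traffic to the stable ReplicaSet, which is what makes the L4 classification of standard changes defensible.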
6.4 Feature Flags (Decouple Deploy from Release)
Feature flags enable deploying code without activating it, then progressively enabling:
- Deploy: Code ships to production, feature is off
- Canary release: Feature enabled for 1–5% of traffic
- Progressive rollout: Gradual increase based on metrics
- Full release: Feature enabled for all traffic
- Kill switch: Instant disable without redeployment
Tools: LaunchDarkly, Flipt (self-hosted), Unleash, OpenFeature SDK
Integration with SLOs: Feature flags should be wired to SLO monitoring. If enabling a feature degrades SLOs beyond threshold, auto-disable.
6.5 Rollback Strategy
Every deployment must have a tested rollback path:
| Scenario | Rollback method | Time to restore |
|---|---|---|
| Canary failure | Argo Rollouts auto-rollback | Seconds–minutes |
| Post-deploy SLO violation | GitOps revert (revert commit) | Minutes |
| Feature flag issue | Disable flag | Seconds |
| Schema migration failure | Forward-fix preferred; backward migration if tested | Minutes–hours |
| Infrastructure change failure | Terraform/OpenTofu state rollback | Minutes |
Critical rule: Never deploy a schema migration that cannot be rolled back, or a migration that requires the new code to function. Deploy migrations and code changes in separate steps (expand-contract pattern).
6.6 SLO-Gated Deployment
Every automated deployment should be gated on SLO health:
```yaml
# Conceptual: Argo Rollouts analysis template
analysis:
  metrics:
    - name: error-rate
      provider: prometheus
      query: |
        sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
      threshold:
        max: 0.01      # 1% error rate
      interval: 60s
    - name: latency-p99
      provider: prometheus
      query: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
      threshold:
        max: 0.5       # 500ms p99
      interval: 60s
    - name: success-rate
      provider: prometheus
      query: |
        sum(rate(http_requests_total{status=~"2.."}[5m]))
        / sum(rate(http_requests_total[5m]))
      threshold:
        min: 0.995     # 99.5% success rate
      interval: 60s
# Rollback if ANY metric fails
# Promote only if ALL metrics pass for the full bake time
```
6.7 Human Gates
- Non-standard changes (new service, new dependency, architecture change)
- Emergency changes (abbreviated review, but still human-approved)
- Changes crossing compliance boundaries
- First deployment of a new service (no baseline metrics)
- Changes to the deployment pipeline itself
6.8 Evidence Capture
- Git PR history with reviews and approvals
- CI pipeline logs (build, test, scan results)
- Canary analysis results (metrics during bake period)
- Rollback events with trigger reason
- Feature flag change history
- Deployment timeline (start, promote, complete, or rollback)
7. Observability – The Nervous System
7.1 Purpose
Observability is the foundation that enables all other layers. Without high-quality telemetry, self-healing is guesswork, SLO gating is impossible, and compliance evidence is incomplete.
7.2 Observability Stack
Dashboards & alerts:  Grafana

Metrics:   Prometheus + Mimir / Thanos
Logs:      Loki / OpenSearch / Elastic
Traces:    Tempo / Jaeger

OpenTelemetry Collector (unified pipeline)

Sources:
- Applications:    OTel SDK
- Infrastructure:  node exporter, kube-state-metrics
- Security:        Falco, Wazuh
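A minimal OpenTelemetry Collector configuration wiring the three signal types to the backends above. Exporter names and endpoints vary by collector distribution and version (the Loki exporter in particular has changed across contrib releases), so treat this as a sketch:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch: {}
exporters:
  prometheusremotewrite:
    endpoint: http://mimir.example.com/api/v1/push
  loki:
    endpoint: http://loki.example.com/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo.example.com:4317
    tls:
      insecure: true   # sketch only; use mTLS in production
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
```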
7.3 SLI/SLO Framework
Define SLIs and SLOs for every user-facing service:
| SLI | Measurement | Typical SLO |
|---|---|---|
| Availability | Successful requests / total requests | 99.9% (30d) |
| Latency | p99 response time | < 500ms |
| Error rate | 5xx responses / total responses | < 0.1% |
| Throughput | Requests per second sustained | Within capacity plan |
| Correctness | Business logic validation pass rate | 99.99% |
Error budget: The difference between 100% and the SLO target is the error budget. Automation decisions consume error budget. If error budget is exhausted, freeze automated deployments until budget recovers.
7.4 Alert Design (Anti-Alert-Fatigue)
- Alert on SLO burn rate, not individual metrics
- Multi-window, multi-burn-rate alerts (fast burn = page, slow burn = ticket)
- Actionable alerts only: Every alert must have a runbook link and a defined action
- Silence expected noise: Planned maintenance, known conditions
- Review alert quality monthly: Track alert-to-incident ratio, false positive rate
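A multi-window, multi-burn-rate alert pair can be sketched as Prometheus rules for a 99.9% availability SLO (0.1% error budget). The burn-rate factors follow the common fast/slow split; metric names and runbook URLs are assumptions:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn (14.4x): consumes ~2% of a 30-day budget in 1 hour → page.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
        labels:
          severity: page
        annotations:
          runbook: https://runbooks.example.com/slo-fast-burn
      # Slow burn (3x): steady budget drain over a day → ticket.
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[24h]))
              / sum(rate(http_requests_total[24h])) > (3 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[2h]))
              / sum(rate(http_requests_total[2h])) > (3 * 0.001)
          )
        labels:
          severity: ticket
        annotations:
          runbook: https://runbooks.example.com/slo-slow-burn
```

The short window in each pair stops the alert from firing long after the burn has ended, which keeps these alerts actionable rather than noisy.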
7.5 Evidence Capture
- SLO compliance reports (monthly, quarterly)
- Error budget consumption history
- Alert history and response times
- Dashboard snapshots at time of incidents
- Telemetry retention proof (meeting regulatory requirements)
8. Agentic AI in Operations
AI agents are entering infrastructure operations: triaging alerts, suggesting remediations, and beginning to execute bounded actions autonomously. The engineering case is strong. The governance case requires deliberate architecture.
8.1 Current Maturity
| Capability | Maturity | Autonomy level |
|---|---|---|
| Alert triage and enrichment | Production-ready | L2–L3 |
| Root cause suggestion | Production-ready | L2–L3 |
| Log analysis and summarisation | Production-ready | L2–L3 |
| Autonomous remediation (known patterns) | Emerging | L1–L2 |
| Natural-language infrastructure changes | Experimental | L1 |
| Autonomous architecture decisions | Not ready | – |
8.2 Guardrails
AI agents in this architecture must operate within the same enforcement model as any other automated component:
- Scoped authority: Read-only by default. Write actions require policy-as-code approval, not prompt instructions.
- Structured evidence: Every action produces a signed record: trigger, context, decision, execution, outcome.
- Human gates at consequence boundaries: Actions that cross security, availability, or compliance thresholds require human approval.
- Circuit breakers: If the agent’s error rate or SLO impact exceeds thresholds, it halts automatically.
8.3 Adoption Path
Start at L1 (read-only advisory), build trust evidence over 3–6 months, then graduate to L2 (human-approved actions). Advance to L3 only for well-understood action classes with documented success rates. Do not skip phases – each builds the evidence needed to justify the next.
For a detailed treatment of guardrail architecture, evidence requirements, and the accountability gap in EU-regulated environments, see Agentic AI in Regulated Infrastructure.
9. Cross-Cutting Concerns
| Concern | Approach | Key constraint |
|---|---|---|
| Secrets management | Vault or equivalent. Short-lived credentials via OIDC federation. Runtime injection via CSI driver. Automated rotation. All access logged. | Never in Git. No long-lived API keys in production. |
| Disaster recovery | RPO/RTO defined per service. Automated daily restore tests. etcd backup verified. DR runbook exercised quarterly. | Multi-zone minimum for critical services. Untested backups are not backups. |
| Cost governance | Resource requests/limits enforced. VPA right-sizing. Cost anomaly alerts. Chargeback per team/service. | Spot instances for non-critical batch only. |
| Network architecture | Zero-trust: default-deny, all traffic explicitly allowed via network policy as code. mTLS via service mesh. Filtered external DNS. | No implicit trust between services. |
10. Implementation Roadmap
Phase 1: Foundation (Months 1–3)
Goal: Immutable base platform with GitOps and basic observability.
- Deploy Kubernetes (managed or Talos) with GitOps (Flux or ArgoCD)
- Implement pod security standards (Kyverno/OPA baseline)
- Deploy OpenTelemetry Collector + Prometheus + Grafana + Loki
- Define SLIs/SLOs for existing services
- Implement cert-manager with automated certificate rotation
- Establish Git repository structure for infrastructure state
- Configure node auto-repair and HPA/VPA
Phase 2: Patching and Supply Chain (Months 3–6)
Goal: Automated patching pipeline with compliance SLAs.
- Deploy Renovate/Dependabot with auto-merge policies
- Implement CI pipeline with SAST, SCA, container scanning
- Set up image signing (cosign/Sigstore) and admission verification
- Generate SBOMs for all deployed images
- Implement canary deployment (Argo Rollouts / Flagger)
- Define and enforce patch SLA policy
- Establish vulnerability prioritisation workflow
Phase 3: Security Automation (Months 6–9)
Goal: Automated detection and response for known threat patterns.
- Deploy Falco/Tetragon for runtime security
- Deploy Wazuh for host-level monitoring
- Implement SOAR playbooks for top 10 threat patterns
- Configure network policies as code (default-deny)
- Implement egress filtering
- Deploy SIEM correlation rules
- Establish incident response automation (containment playbooks)
Phase 4: Compliance Automation (Months 9–12)
Goal: Continuous compliance posture with automated evidence.
- Map technical controls to regulatory frameworks (NIS2/DORA/SOC2/ISO27001)
- Implement OSCAL-based evidence generation
- Deploy continuous compliance dashboard
- Automate access reviews
- Establish risk register with automated findings intake
- Conduct first automated compliance assessment
- Schedule management review (human gate)
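One way to make evidence a pipeline byproduct, per the "evidence by default" axiom, is to emit a tamper-evident record for every control check. A minimal hash-chain sketch; the field names are illustrative, and real deployments would add a cryptographic signature (e.g. via cosign or a KMS) and map records into OSCAL, which is out of scope here:

```python
import hashlib
import json
from datetime import datetime, timezone

def evidence_record(control_id: str, result: str, detail: dict,
                    prev_hash: str) -> dict:
    """Append-only evidence entry. Each record hashes its predecessor,
    so tampering with any earlier record breaks the chain."""
    body = {
        "control_id": control_id,   # e.g. an ISO 27001 Annex A control ID
        "result": result,           # "pass" / "fail"
        "detail": detail,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev_hash,
    }
    payload = json.dumps(body, sort_keys=True).encode()
    body["hash"] = hashlib.sha256(payload).hexdigest()
    return body

def verify_chain(records: list[dict]) -> bool:
    """Recompute every hash and check each record points at its predecessor."""
    prev = "genesis"
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if rec["prev_hash"] != prev or rec["hash"] != hashlib.sha256(payload).hexdigest():
            return False
        prev = rec["hash"]
    return True
```

The compliance dashboard then reads from a chain it can verify, and the auditor gets reproducible evidence instead of screenshots.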
Phase 5: Advanced Autonomy (Months 12+)
Goal: Expand automation envelope with AI-assisted operations.
- Deploy AI-assisted alert triage (read-only, advisory)
- Implement SLO-based error budget policies
- Expand SOAR playbooks based on incident data
- Evaluate AI agent tools for supervised remediation
- Refine auto-merge policies based on 6+ months of data
- Conduct autonomy level assessment for each component
- Document trust evidence for current autonomy levels
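The autonomy level assessment becomes repeatable if each component is scored against explicit trust evidence rather than judgement alone. A minimal sketch; the criteria and the four-level ladder are illustrative assumptions to be adapted to the autonomy ceilings in the summary table:

```python
# Ordered trust criteria: a component earns the next autonomy level only
# when every earlier criterion is also satisfied (illustrative ladder).
CRITERIA = [
    "slo_defined",       # Level 1: observe  — SLIs/SLOs exist and are monitored
    "audit_trail",       # Level 2: recommend — every action is reconstructible
    "canary_tested",     # Level 3: act with approval — rollback proven in canary
    "six_months_clean",  # Level 4: act autonomously — incident-free track record
]

def autonomy_level(evidence: set[str]) -> int:
    """Highest level whose criteria are all met, counting from 0 (manual).
    Gaps do not count: later evidence without earlier evidence earns nothing."""
    level = 0
    for criterion in CRITERIA:
        if criterion not in evidence:
            break
        level += 1
    return level
```

The strict ordering encodes the document's core principle: a component cannot be promoted past a missing foundation, however impressive its later evidence looks.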
11. Anti-Patterns
| Anti-pattern | Why it fails |
|---|---|
| Automate without observability | You cannot verify what you cannot measure. Automation without SLO gating is guesswork. |
| Skip canary for speed | The time saved is repaid with interest during the inevitable incident. |
| AI agents without audit trails | If you cannot reconstruct why the agent acted, you cannot trust it – and neither can an auditor. |
| Bolt-on compliance | Compliance evidence must be a byproduct of the pipeline, not a separate workstream. |
| Eliminate human gates | Regulation and engineering both require human accountability at consequence boundaries. |
| Alert-driven ops without SLOs | Optimising for alert count rather than user impact produces noise, not reliability. |
| Single-vendor security | One product failure should not compromise your entire security posture. |
| Immutable OS without rollback testing | Immutability only delivers value if you can revert to the previous image within minutes. |
| Policy-as-code without negative tests | Untested policies block legitimate workloads in production. |
12. Decision Log
| Decision | Rationale | Alternatives considered | Reversibility |
|---|---|---|---|
| GitOps as deployment model | Auditability, drift detection, rollback via revert | Push-based CI/CD, manual kubectl | High |
| Immutable OS for compute | Reduced attack surface, consistent state | Hardened Ubuntu/RHEL | Medium |
| OPA/Kyverno for policy | Kubernetes-native, declarative, testable | Cedar, HashiCorp Sentinel, custom webhooks | High |
| OpenTelemetry for instrumentation | Vendor-neutral, standard, broad ecosystem | Vendor-specific agents | High |
| Canary deployment default | Lowest-risk deployment pattern | Blue/green, rolling update | High |
| Feature flags for release | Decouples deploy from release, instant rollback | Branch-based releases | High |
| Human gates for risk-bearing changes | Regulatory requirement, safety requirement | Full automation | N/A (structurally required) |
Appendix A: Tool Reference
| Category | Recommended | Alternatives |
|---|---|---|
| OS (compute) | Talos Linux | Flatcar, Bottlerocket |
| Orchestration | Kubernetes | Nomad (for specific use cases) |
| GitOps | Flux CD | ArgoCD |
| Progressive delivery | Argo Rollouts | Flagger |
| Feature flags | Flipt (self-hosted) | LaunchDarkly, Unleash |
| CI/CD | GitLab CI, GitHub Actions | Tekton, Jenkins |
| IaC | OpenTofu / Terraform | Pulumi, Crossplane |
| Policy engine | Kyverno | OPA/Gatekeeper, Cedar |
| Metrics | Prometheus + Mimir/Thanos | Datadog, New Relic |
| Logs | Loki | OpenSearch, Elastic |
| Traces | Tempo | Jaeger |
| Dashboards | Grafana | – |
| Telemetry collection | OpenTelemetry | – (de facto standard) |
| Runtime security | Falco + Tetragon | Sysdig |
| Host security | Wazuh | OSSEC, CrowdSec |
| SIEM | Wazuh / Elastic SIEM | Splunk, Microsoft Sentinel |
| SOAR | Tines | Shuffle, Cortex XSOAR |
| Vulnerability scanning | Trivy + Grype | Snyk, Prisma Cloud |
| Image signing | cosign (Sigstore) | Notary v2 |
| SBOM | Syft (CycloneDX) | SPDX tools |
| Secrets | HashiCorp Vault | AWS Secrets Manager, Azure Key Vault, SOPS |
| Certificate management | cert-manager | Vault PKI |
| Dependency updates | Renovate | Dependabot |
| Compliance evidence | OSCAL tooling | Manual evidence collection |
| Identity | Keycloak | Auth0, Okta |
| Network policy | Cilium | Calico |
| Service mesh | Cilium, Linkerd | Istio |
Appendix B: Regulatory Quick Reference
| Framework | Scope | Key automation-relevant requirements | Human gate requirements |
|---|---|---|---|
| NIS2 | EU critical infrastructure | Risk management (Art. 21), incident reporting (24h), supply chain security | Management accountability (Art. 20), risk acceptance |
| DORA | EU financial sector | ICT risk management, incident reporting (initial report within 4h of classifying an incident as major), threat-led penetration testing (TLPT), third-party oversight | Management oversight, TLPT execution, vendor risk decisions |
| SOC 2 | US, voluntary | Trust services criteria (security, availability, etc.) | Management assertions, auditor interaction |
| ISO 27001 | Global, voluntary | Annex A controls, ISMS operation | Management review (9.3), internal audit (9.2), risk treatment (6.1) |
| CRA | EU, products with digital elements | Vulnerability handling, SBOM, security updates | Conformity assessment, incident reporting |