Abstract: A reference architecture for maximising operational autonomy in IT infrastructure while maintaining compliance, safety, and auditability. The architecture targets 70–85% automation of operational toil with explicit human decision gates at irreducible control points. It covers five operational layers (platform, patching, security, compliance, change management), observability, agentic AI integration, and a phased implementation roadmap.
Executive Summary
Core principle: Automate the predictable; gate the consequential.
Every domain has a hard ceiling – the point beyond which automation requires human judgement, regulatory accountability, or decisions under genuine uncertainty. This architecture is designed around those ceilings, not against them.
| Domain | Automatable | Hard ceiling |
|---|---|---|
| Self-healing infrastructure | 80–90% | Novel failures, cascading events, split-brain |
| Continuous patching | 60–75% | Stateful upgrades, breaking changes, kernel updates |
| Security response | 50–65% | Novel attacks, availability-impacting actions |
| Compliance automation | 40–60% | Risk acceptance, management attestation |
| Change management | 70–80% | Architecture changes, non-standard changes |
1. Architecture Principles
1.1 Design Axioms
- Immutability over mutation. Prefer replacing infrastructure over patching in place. Immutable images, declarative state, GitOps reconciliation.
- Deterministic over probabilistic. Automation actions must be predictable and testable. LLM/AI-driven actions are advisory until validated in a closed-loop with deterministic verification.
- Least privilege, least blast radius. Every automated agent operates with scoped permissions and bounded blast radius. Canary before fleet. Feature flag before hard deploy.
- Evidence by default. Every automated action produces a signed, timestamped audit trail. If you can’t prove it happened correctly, it didn’t.
- Fail safe, not fail silent. Unknown states trigger safe-halt or degraded mode, never silent continuation.
- Human gates are structural, not temporary. Certain decisions require human accountability by regulation and by good engineering. These are not automation gaps to be closed – they are load-bearing control points.
1.2 Autonomy Levels (Graduated Model)
Adapted from the operational maturity model emerging in the Agentic SRE space:
| Level | Name | Description | Human involvement |
|---|---|---|---|
| L0 | Manual | Runbook-driven, human executes | Full |
| L1 | Assisted | System recommends, human approves and executes | Decision + execution |
| L2 | Semi-automated | System executes pre-approved actions, human approves scope | Decision only |
| L3 | Supervised autonomous | System executes and verifies, human monitors and can intervene | Oversight only |
| L4 | Autonomous (bounded) | System executes within policy, alerts human on boundary | Exception only |
Target state for this architecture: L3–L4 for standard operations, L1–L2 for risk-bearing changes.
No component should operate at a level beyond what its verification evidence supports.
2. Layer 1 – Immutable, Self-Healing Base Platform
2.1 Purpose
Provide a compute substrate that recovers from common failure modes without human intervention. This is the most mature automation domain and should be built first.
2.2 Reference Stack
Operating system: Immutable, minimal-attack-surface OS.
- Talos Linux (Kubernetes-native, API-managed, no SSH, no shell)
- Flatcar Container Linux (auto-updating, immutable root)
- Bottlerocket (AWS-native, API-managed)
Selection criteria: No general-purpose OS in the compute plane. General-purpose Linux (Ubuntu, RHEL) only for management/bastion nodes with CIS hardening.
Container orchestration: Kubernetes (managed or self-hosted Talos).
- Control plane: minimum 3 nodes, etcd with automated backup and restore
- Node auto-repair: cloud provider node auto-repair or Cluster API machine health checks
- Pod self-healing: liveness/readiness probes, PodDisruptionBudgets, topology spread constraints
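The pod self-healing primitives above can be sketched as a manifest; the `payments` workload, probe paths, image, and thresholds are illustrative assumptions, not prescriptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb
spec:
  minAvailable: 2                 # keep at least 2 replicas during voluntary disruption
  selector:
    matchLabels:
      app: payments
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      topologySpreadConstraints:
        - maxSkew: 1              # spread replicas evenly across zones
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: payments
      containers:
        - name: payments
          image: registry.example.com/payments:1.4.2
          livenessProbe:          # kubelet restarts the container on repeated failure
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:         # gate traffic until the pod reports ready
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
```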
Service mesh: Optional but recommended for mTLS, circuit breaking, retry budgets.
- Istio, Linkerd, or Cilium service mesh
- Automatic mTLS rotation (cert-manager with short-lived certificates)
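A short-lived workload certificate managed by cert-manager might look like the following sketch; the `internal-ca` issuer and DNS name are assumptions for illustration:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: mesh-workload-cert
  namespace: payments
spec:
  secretName: mesh-workload-tls
  duration: 24h        # short-lived certificate
  renewBefore: 8h      # cert-manager rotates well before expiry
  issuerRef:
    name: internal-ca  # assumed ClusterIssuer backed by an internal CA
    kind: ClusterIssuer
  dnsNames:
    - payments.payments.svc.cluster.local
```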
2.3 Self-Healing Capabilities
| Failure mode | Automated response | Autonomy level | Verification |
|---|---|---|---|
| Pod crash | Restart (kubelet) | L4 | Restart count metric, SLO check |
| Node failure | Reschedule pods, replace node (Cluster API / ASG) | L4 | Node ready count, workload distribution |
| Disk pressure | Evict pods, alert, trigger volume expansion | L3–L4 | Disk utilisation metric, PV status |
| Network partition | Circuit breaker, retry with backoff, failover | L4 | Error rate metric, mesh health |
| Certificate expiry | Auto-rotation (cert-manager) | L4 | Cert expiry metric, TLS handshake success |
| Config drift | GitOps reconciliation (Flux/ArgoCD) | L4 | Drift detection alert, sync status |
| Resource exhaustion | HPA/VPA scaling | L3–L4 | Resource utilisation, scaling events |
| Cascading failure | Circuit breaker + rate limiting + load shedding | L3 | Error budget burn rate, human review |
| Split-brain / data inconsistency | Human gate – safe-halt, alert | L1 | Requires manual diagnosis |
2.4 What Cannot Be Automated
- Novel failure modes not covered by existing remediation playbooks
- Cascading failures crossing service boundaries in unpredictable ways
- Split-brain scenarios requiring data reconciliation decisions
- Infrastructure architecture changes (adding regions, changing topology)
2.5 Implementation Pattern
┌─────────────────────────────────────────────────────┐
│ Git Repository │
│ (Infrastructure state, Kubernetes manifests, │
│ Helm charts, Kustomize overlays) │
└──────────────┬──────────────────────────────────────┘
│ GitOps sync
▼
┌──────────────────────────┐ ┌─────────────────────┐
│ Flux / ArgoCD │◄───│ Drift Detection │
│ (Continuous reconcile) │ │ (alert on manual │
│ │ │ cluster changes) │
└──────────┬───────────────┘ └─────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Kubernetes Cluster (Talos / managed K8s) │
│ ┌──────────┐ ┌──────────┐ ┌───────────────────┐ │
│ │ HPA/VPA │ │ PDB │ │ Cluster API / │ │
│ │ (scaling)│ │ (budget) │ │ Node auto-repair │ │
│ └──────────┘ └──────────┘ └───────────────────┘ │
│ ┌──────────────────┐ ┌────────────────────────┐ │
│ │ cert-manager │ │ Service mesh (mTLS, │ │
│ │ (cert rotation) │ │ circuit breaker) │ │
│ └──────────────────┘ └────────────────────────┘ │
└──────────────────────────────────────────────────────┘
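The GitOps reconciliation loop in the diagram can be sketched with Flux resources; the repository URL and path are placeholders:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: infra
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.example.com/platform/infrastructure
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-base
  namespace: flux-system
spec:
  interval: 10m        # reconcile loop: manual drift is reverted each cycle
  prune: true          # delete resources removed from Git
  sourceRef:
    kind: GitRepository
    name: infra
  path: ./clusters/production
```

With `prune: true`, the cluster converges on Git as the single source of truth, which is what makes the drift-detection alerts in the diagram meaningful.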
2.6 Evidence Capture
- GitOps sync status and history (Flux/ArgoCD events)
- Node lifecycle events (creation, deletion, repair)
- Scaling events (HPA/VPA decisions with reasoning)
- Certificate rotation events (cert-manager logs)
- Drift detection alerts with before/after state
3. Layer 2 – Continuous Patching Pipeline
3.1 Purpose
Automatically apply security and dependency patches with verification gates, maintaining compliance SLAs for patch windows while minimising human intervention.
3.2 Patching Domains
| Domain | Approach | Autonomy target |
|---|---|---|
| OS base image | Immutable image rebuild on upstream CVE | L3–L4 |
| Container base images | Automated rebuild pipeline | L3–L4 |
| Application dependencies | Renovate/Dependabot with auto-merge rules | L3 (low-risk), L2 (high-risk) |
| Kubernetes components | Managed K8s auto-upgrade or staged rollout | L2–L3 |
| Database engines | Staged, human-gated | L1–L2 |
| Kernel / firmware | Human-gated, scheduled maintenance | L1 |
3.3 Automated Patching Pipeline
CVE Feed / Upstream Release
│
▼
┌───────────────────────┐
│ Vulnerability Scanner │ ◄── Trivy, Grype, Snyk
│ (continuous scan of │ scanning container
│ images + deps) │ registry + repos
└──────────┬────────────┘
│ New CVE or dependency update detected
▼
┌───────────────────────┐
│ Renovate / Dependabot │
│ (auto-PR with │
│ changelog + diff) │
└──────────┬────────────┘
│ PR opened
▼
┌───────────────────────┐
│ CI Pipeline │
│ Build, test, SAST, │
│ DAST, SCA, container │
│ scan, SBOM, signing │
└───────────┬───────────┘
│ All checks pass
▼
┌─────────────────────────────────────────────┐
│ Auto-merge Policy Engine │
│ IF severity < CRITICAL │
│ AND test coverage >= threshold │
│ AND no breaking API changes │
│ AND dependency is in approved-list │
│ AND SBOM diff is within policy │
│ THEN auto-merge to staging branch │
│ ELSE require human review │
└──────────────┬──────────────────────────────┘
│
▼
┌───────────────────────┐
│ Canary Deployment │
│ (Argo Rollouts / │
│ Flagger) │
│ Monitor: error rate, │
│ latency, success rate,│
│ resource usage │
│ Auto-promote if SLO │
│ met; rollback if not │
└──────────┬────────────┘
│
▼
┌───────────────────────┐
│ Progressive Rollout │
│ 5% → 25% → 50% → 100%│
│ with SLO gates at │
│ each stage │
└───────────────────────┘
3.4 Auto-Merge Rules (Policy-as-Code)
These rules determine which patches can proceed without human review. They should be conservative and tightened over time based on incident data.
```yaml
# Example: Renovate auto-merge policy (conceptual)
auto_merge_criteria:
  # Patch version bumps of well-known, low-risk deps
  - match:
      update_type: "patch"
      dependency_type: "production"
      severity: ["low", "medium"]
    requires:
      ci_pass: true
      test_coverage_delta: ">= 0"    # no coverage regression
      breaking_changes: false
      sbom_policy_check: pass
    action: auto_merge

  # Security patches – critical CVEs get fast-tracked,
  # but still require canary verification
  - match:
      update_type: "any"
      cve_severity: "critical"
      cisa_kev: true                 # Known Exploited Vulnerability
    requires:
      ci_pass: true
    action: auto_merge_to_canary
    escalation: page_oncall_if_canary_fails

  # Everything else: human review
  - match:
      update_type: "major"
    action: require_human_review
  - match:
      dependency_type: "database_engine"
    action: require_human_review
  - match:
      update_type: "minor"
      breaking_changes: true
    action: require_human_review
```
3.5 OS-Level Patching (Immutable Image Rebuild)
For immutable OS (Talos, Flatcar, Bottlerocket):
- Upstream publishes new image with security fixes
- CI pipeline builds new machine image incorporating the update
- Image is scanned (Trivy/vulnerability assessment)
- Staged node replacement: drain → replace → verify, one node at a time
- PodDisruptionBudgets ensure workload availability during rollout
- Rollback: revert to previous image if node health checks fail
For traditional OS (management nodes):
- Unattended-upgrades for security patches (automatic)
- Ansible playbooks for coordinated upgrades
- Snapshot before, apply, verify, rollback if failed
- Kernel updates: scheduled maintenance window, human approval
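The snapshot → apply → verify sequence for management nodes can be sketched as an Ansible playbook; the snapshot command and health endpoint are assumptions to replace with your own tooling:

```yaml
# Sketch: coordinated security patching for traditional management nodes.
- name: Patch management nodes with verification
  hosts: management
  serial: 1                          # one node at a time
  tasks:
    - name: Snapshot the node before patching (hypothetical script)
      ansible.builtin.command: /usr/local/bin/create-snapshot.sh
      # Replace with your hypervisor or cloud snapshot mechanism

    - name: Apply updates
      ansible.builtin.apt:
        upgrade: dist
        update_cache: true
      register: patch_result

    - name: Verify node health after patching
      ansible.builtin.uri:
        url: http://localhost:9100/metrics   # node exporter as a liveness signal
        status_code: 200
      register: health
      until: health.status == 200
      retries: 5
      delay: 10
```

`serial: 1` mirrors the one-node-at-a-time rule used for immutable node replacement; a failed health check halts the play before the next node is touched.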
3.6 Compliance Patch SLAs
Regulatory frameworks set expectations for patch timelines. These should be encoded as policy:
| Severity | SLA target | Auto-action |
|---|---|---|
| Critical (CVSS ≥ 9.0, CISA KEV) | 24–72 hours | Auto-canary, page on-call |
| High (CVSS 7.0–8.9) | 7 days | Auto-PR, auto-merge if policy met |
| Medium (CVSS 4.0–6.9) | 30 days | Auto-PR, batch with next release |
| Low (CVSS < 4.0) | 90 days | Auto-PR, low priority queue |
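One way to encode the SLA table as policy-as-code. This is a hypothetical schema, not a real tool's format — adapt it to whatever policy engine drives your pipeline:

```yaml
# Hypothetical patch-SLA policy; field names are illustrative.
patch_slas:
  - severity: critical
    condition: "cvss >= 9.0 or cisa_kev == true"
    deadline_hours: 72
    auto_action: auto_canary
    escalation: page_oncall
  - severity: high
    condition: "cvss >= 7.0"
    deadline_days: 7
    auto_action: auto_merge_if_policy_met
  - severity: medium
    condition: "cvss >= 4.0"
    deadline_days: 30
    auto_action: batch_with_next_release
  - severity: low
    condition: "cvss < 4.0"
    deadline_days: 90
    auto_action: low_priority_queue
```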
3.7 Human Gates (Non-Automatable)
- Database engine major version upgrades (schema compatibility)
- Kubernetes control plane upgrades (API deprecation review)
- Kernel updates on bare-metal with custom drivers
- Any patch that changes API contracts or data formats
- First-time patching of a new dependency (no historical data)
3.8 Evidence Capture
- SBOM for every deployed image (CycloneDX or SPDX)
- Vulnerability scan results at build time and runtime
- Signed image digests (cosign / Sigstore)
- Canary metrics during bake period
- Rollback events with reason codes
- Patch compliance dashboard (time-to-patch by severity)
4. Layer 3 – Security Automation and Autonomous Defense
4.1 Purpose
Detect, contain, and respond to security threats with minimal human latency for known attack patterns, while maintaining human oversight for novel threats and availability-impacting responses.
4.2 Defense-in-Depth Stack
Layer 6: Compliance-as-Code (OPA, Kyverno, Cedar)
Continuous policy enforcement, admission control
Layer 5: SOAR (Tines, Shuffle, XSOAR)
Playbook-driven automated response
Layer 4: SIEM / Correlation (Wazuh, Elastic SIEM)
Event correlation, alert enrichment, threat intelligence
Layer 3: Runtime Security (Falco, Tetragon, Sysdig)
Syscall monitoring, behavioural detection, eBPF
Layer 2: Network Security (Cilium, Calico, NP-as-code)
Network policy, DNS filtering, egress control
Layer 1: Supply Chain (Trivy, cosign, SLSA, admission)
Image signing, SBOM, vulnerability gates
Layer 0: Identity (Keycloak/OIDC, RBAC, SPIFFE/SPIRE)
Zero-trust identity, workload identity, least privilege
4.3 Automated Response Playbooks
| Threat pattern | Automated response | Autonomy level | Constraint |
|---|---|---|---|
| Known malware hash in container | Kill pod, quarantine image, alert | L4 | Pre-approved action |
| Brute force authentication | Progressive rate limit, temp block IP, alert | L4 | Threshold-based |
| Anomalous egress traffic | Block egress to unknown destination, alert | L3 | May impact availability |
| Privilege escalation attempt | Kill process, alert, capture forensics | L4 | Pre-approved action |
| CVE in running container | Schedule replacement with patched image | L3 | Follows patching pipeline |
| Certificate about to expire | Auto-rotate | L4 | cert-manager handles |
| Config drift from policy | Auto-remediate to desired state | L3–L4 | Policy-as-code |
| Unusual API call patterns | Increase logging, alert, reduce rate limit | L3 | May impact legitimate traffic |
| Novel attack pattern | Human gate – alert, capture, do not auto-remediate | L1 | Unknown blast radius |
| Insider threat indicators | Human gate – alert security team, capture evidence | L1 | Legal/HR implications |
4.4 Policy-as-Code (Admission Control)
Prevent bad state from entering the cluster rather than detecting it after the fact.
Kubernetes admission:
- Kyverno or OPA/Gatekeeper for pod security standards
- Image signature verification (cosign + admission webhook)
- No privileged containers, no host networking (except explicit allowlist)
- Resource limits required on all pods
- Network policies required for all namespaces
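A Kyverno sketch enforcing two of the admission rules above (resource limits required, no privileged containers); the policy name and messages are illustrative:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: baseline-admission
spec:
  validationFailureAction: Enforce   # reject non-compliant pods at admission
  rules:
    - name: require-resource-limits
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
    - name: disallow-privileged
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): "false"
```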
Cloud-level:
- AWS SCP / Azure Policy / GCP Organization Policy for guardrails
- Terraform / OpenTofu with plan validation (OPA on plan output)
- No direct console changes – all changes through IaC pipeline
Runtime:
- Falco rules for syscall-level behavioural detection
- Tetragon / eBPF for kernel-level enforcement
- Wazuh for host-level integrity monitoring (FIM) and log analysis
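A minimal Falco rule sketch for behavioural detection. Falco's default ruleset ships a similar shell-in-container rule; the condition here is illustrative:

```yaml
- rule: Shell spawned in container
  desc: Detect an interactive shell started inside a container
  condition: >
    container.id != host and evt.type = execve
    and proc.name in (bash, sh, zsh)
  output: >
    Shell spawned in container
    (user=%user.name container=%container.name
     image=%container.image.repository cmdline=%proc.cmdline)
  priority: WARNING
  tags: [container, shell, mitre_execution]
```

In this architecture, the alert would flow into the SIEM layer for correlation, with any automated kill action gated by a pre-approved SOAR playbook.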
4.5 Vulnerability Management Pipeline
Continuous Scanning
├── Registry scan (Trivy) — on push and scheduled
├── Runtime scan (Trivy operator) — running containers
├── Host scan (Wazuh) — OS and installed packages
├── IaC scan (Checkov, tfsec) — in CI pipeline
└── Dependency scan (SCA) — in CI pipeline
│
▼
Prioritisation Engine
├── CVSS score
├── EPSS (Exploit Prediction Scoring)
├── CISA KEV (Known Exploited)
├── Reachability analysis (is the vuln actually reachable?)
├── Asset criticality (what does this run on?)
└── Exposure context (internet-facing? internal?)
│
▼
Action routing
├── Critical + Exploited + Reachable → Emergency patch
├── High + Reachable → Fast-track to patching pipeline
├── Medium/Low or Not Reachable → Standard patching SLA
└── Accepted risk → Document in risk register, review quarterly
4.6 Human Gates (Non-Automatable)
- Novel attack patterns requiring investigation
- Actions that would impact production availability (killing services, blocking IP ranges)
- Incident response decisions with legal/regulatory implications
- Threat intelligence assessment (is this a false positive or a real campaign?)
- Risk acceptance decisions (accepting a vulnerability that can’t be patched)
- Security architecture changes
4.7 Evidence Capture
- SIEM event logs with correlation IDs
- Automated response execution logs (SOAR audit trail)
- Forensic captures (container snapshots, memory dumps) for incidents
- Policy enforcement logs (admission webhook decisions)
- Vulnerability scan history and remediation timelines
- Image signatures and SBOM for deployed artifacts
5. Layer 4 – Compliance Automation and Continuous Assurance
5.1 Purpose
Maintain continuous compliance posture with automated evidence generation, policy enforcement, and drift detection, while preserving human accountability for risk decisions and regulatory attestation.
5.2 Compliance Automation Model
Compliance Control Plane
├── Policy Engine     (OPA / Kyverno / Cedar)
├── Evidence Store    (immutable, signed, timestamped)
└── Audit Dashboard   (continuous posture)

Control Mapping Layer
├── ISO 27001 <-> NIS2 <-> SOC 2 <-> DORA <-> CIS <-> NIST
├── Maps technical controls to regulatory requirements
└── One control can satisfy multiple frameworks
5.3 Continuous Control Monitoring
| Control category | Automated check | Frequency | Evidence type |
|---|---|---|---|
| Access control | RBAC audit, stale accounts, over-privileged roles | Continuous | RBAC dump, access review report |
| Encryption | TLS version check, cert validity, at-rest encryption | Continuous | Scan results, cert inventory |
| Patch compliance | Vulnerability age vs. SLA | Continuous | Patch timeline report |
| Network segmentation | Network policy coverage, egress audit | Continuous | Policy dump, connectivity test |
| Logging and monitoring | Log pipeline health, retention compliance | Continuous | Pipeline metrics, retention proof |
| Backup integrity | Restore test results, RPO compliance | Daily/weekly | Restore test logs, hash verification |
| Change management | All changes via GitOps (no manual cluster changes) | Continuous | Git history, drift detection alerts |
| Incident response | Playbook test results, MTTR metrics | Quarterly exercise | Exercise report, MTTR dashboard |
| Supply chain | SBOM coverage, image signing rate | Continuous | SBOM inventory, signing audit |
| Identity lifecycle | Joiner/mover/leaver automation, MFA enforcement | Continuous | IGA audit logs, MFA coverage |
5.4 Evidence Generation (OSCAL-Based)
OSCAL (Open Security Controls Assessment Language) provides a machine-readable format for compliance evidence:
- System Security Plan (SSP): Generated from IaC and policy-as-code definitions
- Assessment Plan: Automated test definitions mapped to controls
- Assessment Results: Continuous scan and test results in OSCAL format
- Plan of Action and Milestones (POA&M): Auto-generated from failed controls
Evidence pipeline:
Control check runs → Results stored (immutable, signed) →
Mapped to framework requirements → Dashboard updated →
Auditor accesses dashboard + evidence store
5.5 Regulatory Framework Requirements
NIS2 (applicable if operating in EU critical sectors):
- Risk management measures (Article 21): Automated control enforcement
- Incident reporting: 24-hour early warning, 72-hour notification, 1-month final report
- Supply chain security: SBOM, vendor assessment automation
- Management accountability: Human sign-off required – cannot be automated
DORA (applicable if in EU financial sector):
- ICT risk management framework: Continuous control monitoring
- Incident classification and reporting: 4-hour classification, then tiered reporting
- Threat-Led Penetration Testing (TLPT): Must be conducted – cannot be purely automated
- Third-party risk management: Contract-level controls, cannot be fully automated
SOC 2 / ISO 27001:
- Continuous control monitoring replaces point-in-time audits for many controls
- Management review and risk treatment decisions: Human gate
- Internal audit: Can be assisted by automation but requires human judgement
- Certification audit: External auditor, fully human process
5.6 GitSecOps: Git as the Source of Compliance Truth
The strongest pattern for demonstrating compliance to auditors:
- All infrastructure state in Git – signed commits, PR-based review
- All policy in Git – OPA/Kyverno policies version-controlled
- All changes traceable – commit → PR → CI → deploy → verify
- Immutable evidence – build logs, scan results, deployment events stored with integrity protection
- Automated mapping – technical controls mapped to regulatory requirements
This turns audit from “show me your documents” to “here is the commit history, the policy enforcement logs, and the continuous compliance dashboard.”
5.7 Human Gates (Structurally Required by Regulation)
These are not automation gaps. They are requirements imposed by every major compliance framework:
- Risk acceptance decisions: A named human must accept residual risk
- Management review: ISO 27001 Clause 9.3, NIS2 Article 20 – senior management must review
- Incident severity classification: Initial triage can be automated, final classification needs human judgement
- Vendor risk assessment: Questionnaires can be automated, risk decisions cannot
- Audit response: External auditors interact with humans
- Policy approval: Policy-as-code must be approved by a human before enforcement
- Exception management: Granting exceptions to policy requires documented human decision
5.8 Evidence Capture
- OSCAL-formatted assessment results
- Git commit history with signed commits (GPG/SSH)
- CI/CD pipeline execution logs
- Policy enforcement decision logs (admission webhook audit)
- Access review reports
- Incident timeline and response evidence
- Management review meeting minutes (human-generated)
- Risk register with decision trail
6. Layer 5 – Change Management and Autonomous Deployment
6.1 Purpose
Enable safe, fast, automated deployment of standard changes while maintaining rigorous gates for non-standard and emergency changes.
6.2 Change Classification
| Type | Definition | Process | Autonomy |
|---|---|---|---|
| Standard change | Pre-approved, bounded blast radius, automated verification | Fully automated pipeline | L4 |
| Normal change | Requires review, moderate risk | PR review + automated deploy + canary | L2–L3 |
| Emergency change | Urgent fix, expedited process | Abbreviated review + automated deploy + immediate verify | L2 |
| Major change | High risk, architecture impact | Full CAB review + staged manual rollout | L1 |
6.3 Standard Change Automation (L4)
Standard changes account for the largest share of change volume and are the highest-value automation target. These are pre-approved change types where the blast radius is bounded and verification is automated.
Examples of standard changes:
- Application deployment (same architecture, same APIs, code changes only)
- Dependency patch (within auto-merge policy)
- Scaling adjustments (within defined limits)
- Feature flag toggle (within defined scope)
- Certificate rotation
- Configuration value change (within defined schema)
Pipeline:
Developer pushes code
│
▼
CI Pipeline (build, test, scan, sign)
│ All gates pass
▼
PR Auto-merge (if policy met)
│
▼
GitOps Sync (Flux/ArgoCD) → Canary Deploy (Argo Rollouts)
│
SLO Verification
(bake time: 15–30 min)
│
Pass? Promote
Fail? Rollback
│
▼
Progressive Rollout: 5% → 25% → 50% → 100%
with SLO gates at each stage
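The canary-and-promote flow above maps onto an Argo Rollouts resource; the `slo-check` AnalysisTemplate, workload name, and bake durations are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 15m }   # bake at 5% while analysis runs
        - setWeight: 25
        - pause: { duration: 15m }
        - setWeight: 50
        - pause: { duration: 15m }
        # 100% is implicit once all steps pass
      analysis:
        templates:
          - templateName: slo-check  # assumed AnalysisTemplate (see 6.6)
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: payments
          image: registry.example.com/payments:1.4.2
```

A failed analysis at any step aborts the rollout and reverts traffic to the stable ReplicaSet, which is what makes the L4 classification of standard changes defensible.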
6.4 Feature Flags (Decouple Deploy from Release)
Feature flags enable deploying code without activating it, then progressively enabling:
- Deploy: Code ships to production, feature is off
- Canary release: Feature enabled for 1–5% of traffic
- Progressive rollout: Gradual increase based on metrics
- Full release: Feature enabled for all traffic
- Kill switch: Instant disable without redeployment
Tools: LaunchDarkly, Flipt (self-hosted), Unleash, OpenFeature SDK
Integration with SLOs: Feature flags should be wired to SLO monitoring. If enabling a feature degrades SLOs beyond threshold, auto-disable.
6.5 Rollback Strategy
Every deployment must have a tested rollback path:
| Scenario | Rollback method | Time to restore |
|---|---|---|
| Canary failure | Argo Rollouts auto-rollback | Seconds–minutes |
| Post-deploy SLO violation | GitOps revert (revert commit) | Minutes |
| Feature flag issue | Disable flag | Seconds |
| Schema migration failure | Forward-fix preferred; backward migration if tested | Minutes–hours |
| Infrastructure change failure | Terraform/OpenTofu state rollback | Minutes |
Critical rule: Never deploy a schema migration that cannot be rolled back, or a migration that requires the new code to function. Deploy migrations and code changes in separate steps (expand-contract pattern).
6.6 SLO-Gated Deployment
Every automated deployment should be gated on SLO health:
```yaml
# Conceptual: Argo Rollouts analysis template
analysis:
  metrics:
    - name: error-rate
      provider: prometheus
      query: |
        sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
      threshold:
        max: 0.01      # 1% error rate
      interval: 60s
    - name: latency-p99
      provider: prometheus
      query: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
      threshold:
        max: 0.5       # 500ms p99
      interval: 60s
    - name: success-rate
      provider: prometheus
      query: |
        sum(rate(http_requests_total{status=~"2.."}[5m]))
        / sum(rate(http_requests_total[5m]))
      threshold:
        min: 0.995     # 99.5% success rate
      interval: 60s
# Rollback if ANY metric fails
# Promote only if ALL metrics pass for the full bake time
```
6.7 Human Gates
- Non-standard changes (new service, new dependency, architecture change)
- Emergency changes (abbreviated review, but still human-approved)
- Changes crossing compliance boundaries
- First deployment of a new service (no baseline metrics)
- Changes to the deployment pipeline itself
6.8 Evidence Capture
- Git PR history with reviews and approvals
- CI pipeline logs (build, test, scan results)
- Canary analysis results (metrics during bake period)
- Rollback events with trigger reason
- Feature flag change history
- Deployment timeline (start, promote, complete, or rollback)
7. Observability – The Nervous System
7.1 Purpose
Observability is the foundation that enables all other layers. Without high-quality telemetry, self-healing is guesswork, SLO gating is impossible, and compliance evidence is incomplete.
7.2 Observability Stack
Dashboards & alerts:  Grafana

Metrics:   Prometheus + Mimir / Thanos
Logs:      Loki / OpenSearch / Elastic
Traces:    Tempo / Jaeger

OpenTelemetry Collector (unified pipeline)

Sources:
- Applications:    OTel SDK
- Infrastructure:  node exporter, kube-state-metrics
- Security:        Falco, Wazuh
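A minimal OpenTelemetry Collector configuration wiring the three signal types to the backends above. Exporter names and endpoints vary by collector distribution and version (the Loki exporter in particular has changed across contrib releases), so treat this as a sketch:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch: {}
exporters:
  prometheusremotewrite:
    endpoint: http://mimir.example.com/api/v1/push
  loki:
    endpoint: http://loki.example.com/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo.example.com:4317
    tls:
      insecure: true   # sketch only; use mTLS in production
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
```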
7.3 SLI/SLO Framework
Define SLIs and SLOs for every user-facing service:
| SLI | Measurement | Typical SLO |
|---|---|---|
| Availability | Successful requests / total requests | 99.9% (30d) |
| Latency | p99 response time | < 500ms |
| Error rate | 5xx responses / total responses | < 0.1% |
| Throughput | Requests per second sustained | Within capacity plan |
| Correctness | Business logic validation pass rate | 99.99% |
Error budget: The difference between 100% and the SLO target is the error budget. Automation decisions consume error budget. If error budget is exhausted, freeze automated deployments until budget recovers.
7.4 Alert Design (Anti-Alert-Fatigue)
- Alert on SLO burn rate, not individual metrics
- Multi-window, multi-burn-rate alerts (fast burn = page, slow burn = ticket)
- Actionable alerts only: Every alert must have a runbook link and a defined action
- Silence expected noise: Planned maintenance, known conditions
- Review alert quality monthly: Track alert-to-incident ratio, false positive rate
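A multi-window, multi-burn-rate alert pair can be sketched as Prometheus rules for a 99.9% availability SLO (0.1% error budget). The burn-rate factors follow the common fast/slow split; metric names and runbook URLs are assumptions:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn (14.4x): consumes ~2% of a 30-day budget in 1 hour → page.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
        labels:
          severity: page
        annotations:
          runbook: https://runbooks.example.com/slo-fast-burn
      # Slow burn (3x): steady budget drain over a day → ticket.
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[24h]))
              / sum(rate(http_requests_total[24h])) > (3 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[2h]))
              / sum(rate(http_requests_total[2h])) > (3 * 0.001)
          )
        labels:
          severity: ticket
        annotations:
          runbook: https://runbooks.example.com/slo-slow-burn
```

The short window in each pair stops the alert from firing long after the burn has ended, which keeps these alerts actionable rather than noisy.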
7.5 Evidence Capture
- SLO compliance reports (monthly, quarterly)
- Error budget consumption history
- Alert history and response times
- Dashboard snapshots at time of incidents
- Telemetry retention proof (meeting regulatory requirements)
8. Agentic AI in Operations
AI agents are entering infrastructure operations: triaging alerts, suggesting remediations, and beginning to execute bounded actions autonomously. The engineering case is strong. The governance case requires deliberate architecture.
8.1 Current Maturity
| Capability | Maturity | Autonomy level |
|---|---|---|
| Alert triage and enrichment | Production-ready | L2–L3 |
| Root cause suggestion | Production-ready | L2–L3 |
| Log analysis and summarisation | Production-ready | L2–L3 |
| Autonomous remediation (known patterns) | Emerging | L1–L2 |
| Natural-language infrastructure changes | Experimental | L1 |
| Autonomous architecture decisions | Not ready | – |
8.2 Guardrails
AI agents in this architecture must operate within the same enforcement model as any other automated component:
- Scoped authority: Read-only by default. Write actions require policy-as-code approval, not prompt instructions.
- Structured evidence: Every action produces a signed record: trigger, context, decision, execution, outcome.
- Human gates at consequence boundaries: Actions that cross security, availability, or compliance thresholds require human approval.
- Circuit breakers: If the agent’s error rate or SLO impact exceeds thresholds, it halts automatically.
8.3 Adoption Path
Start at L1 (read-only advisory), build trust evidence over 3–6 months, then graduate to L2 (human-approved actions). Advance to L3 only for well-understood action classes with documented success rates. Do not skip phases – each builds the evidence needed to justify the next.
For a detailed treatment of guardrail architecture, evidence requirements, and the accountability gap in EU-regulated environments, see Agentic AI in Regulated Infrastructure.
9. Cross-Cutting Concerns
| Concern | Approach | Key constraint |
|---|---|---|
| Secrets management | Vault or equivalent. Short-lived credentials via OIDC federation. Runtime injection via CSI driver. Automated rotation. All access logged. | Never in Git. No long-lived API keys in production. |
| Disaster recovery | RPO/RTO defined per service. Automated daily restore tests. etcd backup verified. DR runbook exercised quarterly. | Multi-zone minimum for critical services. Untested backups are not backups. |
| Cost governance | Resource requests/limits enforced. VPA right-sizing. Cost anomaly alerts. Chargeback per team/service. | Spot instances for non-critical batch only. |
| Network architecture | Zero-trust: default-deny, all traffic explicitly allowed via network policy as code. mTLS via service mesh. Filtered external DNS. | No implicit trust between services. |
10. Implementation Roadmap
Phase 1: Foundation (Months 1–3)
Goal: Immutable base platform with GitOps and basic observability.
- Deploy Kubernetes (managed or Talos) with GitOps (Flux or ArgoCD)
- Implement pod security standards (Kyverno/OPA baseline)
- Deploy OpenTelemetry Collector + Prometheus + Grafana + Loki
- Define SLIs/SLOs for existing services
- Implement cert-manager with automated certificate rotation
- Establish Git repository structure for infrastructure state
- Configure node auto-repair and HPA/VPA
Phase 2: Patching and Supply Chain (Months 3–6)
Goal: Automated patching pipeline with compliance SLAs.
- Deploy Renovate/Dependabot with auto-merge policies
- Implement CI pipeline with SAST, SCA, container scanning
- Set up image signing (cosign/Sigstore) and admission verification
- Generate SBOMs for all deployed images
- Implement canary deployment (Argo Rollouts / Flagger)
- Define and enforce patch SLA policy
- Establish vulnerability prioritisation workflow
Phase 3: Security Automation (Months 6–9)
Goal: Automated detection and response for known threat patterns.
- Deploy Falco/Tetragon for runtime security
- Deploy Wazuh for host-level monitoring
- Implement SOAR playbooks for top 10 threat patterns
- Configure network policies as code (default-deny)
- Implement egress filtering
- Deploy SIEM correlation rules
- Establish incident response automation (containment playbooks)
Phase 4: Compliance Automation (Months 9–12)
Goal: Continuous compliance posture with automated evidence.
- Map technical controls to regulatory frameworks (NIS2/DORA/SOC2/ISO27001)
- Implement OSCAL-based evidence generation
- Deploy continuous compliance dashboard
- Automate access reviews
- Establish risk register with automated findings intake
- Conduct first automated compliance assessment
- Schedule management review (human gate)
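One way to make evidence a pipeline byproduct, per the "evidence by default" axiom, is to emit a tamper-evident record for every control check. A minimal hash-chain sketch; the field names are illustrative, and real deployments would add a cryptographic signature (e.g. via cosign or a KMS) and map records into OSCAL, which is out of scope here:

```python
import hashlib
import json
from datetime import datetime, timezone

def evidence_record(control_id: str, result: str, detail: dict,
                    prev_hash: str) -> dict:
    """Append-only evidence entry. Each record hashes its predecessor,
    so tampering with any earlier record breaks the chain."""
    body = {
        "control_id": control_id,   # e.g. an ISO 27001 Annex A control ID
        "result": result,           # "pass" / "fail"
        "detail": detail,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev_hash,
    }
    payload = json.dumps(body, sort_keys=True).encode()
    body["hash"] = hashlib.sha256(payload).hexdigest()
    return body

def verify_chain(records: list[dict]) -> bool:
    """Recompute every hash and check each record points at its predecessor."""
    prev = "genesis"
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if rec["prev_hash"] != prev or rec["hash"] != hashlib.sha256(payload).hexdigest():
            return False
        prev = rec["hash"]
    return True
```

The compliance dashboard then reads from a chain it can verify, and the auditor gets reproducible evidence instead of screenshots.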
Phase 5: Advanced Autonomy (Months 12+)
Goal: Expand automation envelope with AI-assisted operations.
- Deploy AI-assisted alert triage (read-only, advisory)
- Implement SLO-based error budget policies
- Expand SOAR playbooks based on incident data
- Evaluate AI agent tools for supervised remediation
- Refine auto-merge policies based on 6+ months of data
- Conduct autonomy level assessment for each component
- Document trust evidence for current autonomy levels
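The autonomy level assessment becomes repeatable if each component is scored against explicit trust evidence rather than judgement alone. A minimal sketch; the criteria and the four-level ladder are illustrative assumptions to be adapted to the autonomy ceilings in the summary table:

```python
# Ordered trust criteria: a component earns the next autonomy level only
# when every earlier criterion is also satisfied (illustrative ladder).
CRITERIA = [
    "slo_defined",       # Level 1: observe  — SLIs/SLOs exist and are monitored
    "audit_trail",       # Level 2: recommend — every action is reconstructible
    "canary_tested",     # Level 3: act with approval — rollback proven in canary
    "six_months_clean",  # Level 4: act autonomously — incident-free track record
]

def autonomy_level(evidence: set[str]) -> int:
    """Highest level whose criteria are all met, counting from 0 (manual).
    Gaps do not count: later evidence without earlier evidence earns nothing."""
    level = 0
    for criterion in CRITERIA:
        if criterion not in evidence:
            break
        level += 1
    return level
```

The strict ordering encodes the document's core principle: a component cannot be promoted past a missing foundation, however impressive its later evidence looks.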
11. Anti-Patterns
| Anti-pattern | Why it fails |
|---|---|
| Automate without observability | You cannot verify what you cannot measure. Automation without SLO gating is guesswork. |
| Skip canary for speed | The time saved is repaid with interest during the inevitable incident. |
| AI agents without audit trails | If you cannot reconstruct why the agent acted, you cannot trust it – and neither can an auditor. |
| Bolt-on compliance | Compliance evidence must be a byproduct of the pipeline, not a separate workstream. |
| Eliminate human gates | Regulation and engineering both require human accountability at consequence boundaries. |
| Alert-driven ops without SLOs | Optimising for alert count rather than user impact produces noise, not reliability. |
| Single-vendor security | One product failure should not compromise your entire security posture. |
| Immutable OS without rollback testing | Immutability only delivers value if you can revert to the previous image within minutes. |
| Policy-as-code without negative tests | Untested policies block legitimate workloads in production. |
12. Decision Log
| Decision | Rationale | Alternatives considered | Reversibility |
|---|---|---|---|
| GitOps as deployment model | Auditability, drift detection, rollback via revert | Push-based CI/CD, manual kubectl | High |
| Immutable OS for compute | Reduced attack surface, consistent state | Hardened Ubuntu/RHEL | Medium |
| OPA/Kyverno for policy | Kubernetes-native, declarative, testable | Cedar, HashiCorp Sentinel, custom webhooks | High |
| OpenTelemetry for instrumentation | Vendor-neutral, standard, broad ecosystem | Vendor-specific agents | High |
| Canary deployment default | Lowest-risk deployment pattern | Blue/green, rolling update | High |
| Feature flags for release | Decouples deploy from release, instant rollback | Branch-based releases | High |
| Human gates for risk-bearing changes | Regulatory requirement, safety requirement | Full automation | N/A (structurally required) |
Appendix A: Tool Reference
| Category | Recommended | Alternatives |
|---|---|---|
| OS (compute) | Talos Linux | Flatcar, Bottlerocket |
| Orchestration | Kubernetes | Nomad (for specific use cases) |
| GitOps | Flux CD | ArgoCD |
| Progressive delivery | Argo Rollouts | Flagger |
| Feature flags | Flipt (self-hosted) | LaunchDarkly, Unleash |
| CI/CD | GitLab CI, GitHub Actions | Tekton, Jenkins |
| IaC | OpenTofu / Terraform | Pulumi, Crossplane |
| Policy engine | Kyverno | OPA/Gatekeeper, Cedar |
| Metrics | Prometheus + Mimir/Thanos | Datadog, New Relic |
| Logs | Loki | OpenSearch, Elastic |
| Traces | Tempo | Jaeger |
| Dashboards | Grafana | – |
| Telemetry collection | OpenTelemetry | – (de facto standard) |
| Runtime security | Falco + Tetragon | Sysdig |
| Host security | Wazuh | OSSEC, CrowdSec |
| SIEM | Wazuh / Elastic SIEM | Splunk, Microsoft Sentinel |
| SOAR | Tines | Shuffle, Cortex XSOAR |
| Vulnerability scanning | Trivy + Grype | Snyk, Prisma Cloud |
| Image signing | cosign (Sigstore) | Notary v2 |
| SBOM | Syft (CycloneDX) | SPDX tools |
| Secrets | HashiCorp Vault | AWS Secrets Manager, Azure Key Vault, SOPS |
| Certificate management | cert-manager | Vault PKI |
| Dependency updates | Renovate | Dependabot |
| Compliance evidence | OSCAL tooling | Manual evidence collection |
| Identity | Keycloak | Auth0, Okta |
| Network policy | Cilium | Calico |
| Service mesh | Cilium, Linkerd | Istio |
Appendix B: Regulatory Quick Reference
| Framework | Scope | Key automation-relevant requirements | Human gate requirements |
|---|---|---|---|
| NIS2 | EU critical infrastructure | Risk management (Art. 21), incident reporting (24h), supply chain security | Management accountability (Art. 20), risk acceptance |
| DORA | EU financial sector | ICT risk management, incident reporting (initial report within 4h of classifying an incident as major), threat-led penetration testing (TLPT), third-party oversight | Management oversight, TLPT execution, vendor risk decisions |
| SOC 2 | US, voluntary | Trust services criteria (security, availability, etc.) | Management assertions, auditor interaction |
| ISO 27001 | Global, voluntary | Annex A controls, ISMS operation | Management review (9.3), internal audit (9.2), risk treatment (6.1) |
| CRA | EU, products with digital elements | Vulnerability handling, SBOM, security updates | Conformity assessment, incident reporting |