
Reference Architecture: Maximum-Autonomy IT Operations

Version: 1.0.0 · Classification: Conceptual – Implementation-oriented · Status: For review and adaptation to specific environment

Abstract: A reference architecture for maximising operational autonomy in IT infrastructure while maintaining compliance, safety, and auditability. The architecture targets 70–85% automation of operational toil with explicit human decision gates at irreducible control points. It covers five operational layers (platform, patching, security, compliance, change management), observability, agentic AI integration, and a phased implementation roadmap.

Executive Summary

Core principle: Automate the predictable; gate the consequential.

Every domain has a hard ceiling – the point beyond which automation requires human judgement, regulatory accountability, or decisions under genuine uncertainty. This architecture is designed around those ceilings, not against them.

| Domain | Automatable | Hard ceiling |
|---|---|---|
| Self-healing infrastructure | 80–90% | Novel failures, cascading events, split-brain |
| Continuous patching | 60–75% | Stateful upgrades, breaking changes, kernel updates |
| Security response | 50–65% | Novel attacks, availability-impacting actions |
| Compliance automation | 40–60% | Risk acceptance, management attestation |
| Change management | 70–80% | Architecture changes, non-standard changes |

1. Architecture Principles

1.1 Design Axioms

  1. Immutability over mutation. Prefer replacing infrastructure over patching in place. Immutable images, declarative state, GitOps reconciliation.
  2. Deterministic over probabilistic. Automation actions must be predictable and testable. LLM/AI-driven actions are advisory until validated in a closed-loop with deterministic verification.
  3. Least privilege, least blast radius. Every automated agent operates with scoped permissions and bounded blast radius. Canary before fleet. Feature flag before hard deploy.
  4. Evidence by default. Every automated action produces a signed, timestamped audit trail. If you can’t prove it happened correctly, it didn’t.
  5. Fail safe, not fail silent. Unknown states trigger safe-halt or degraded mode, never silent continuation.
  6. Human gates are structural, not temporary. Certain decisions require human accountability by regulation and by good engineering. These are not automation gaps to be closed – they are load-bearing control points.

1.2 Autonomy Levels (Graduated Model)

Adapted from the operational maturity model emerging in the Agentic SRE space:

| Level | Name | Description | Human involvement |
|---|---|---|---|
| L0 | Manual | Runbook-driven, human executes | Full |
| L1 | Assisted | System recommends, human approves and executes | Decision + execution |
| L2 | Semi-automated | System executes pre-approved actions, human approves scope | Decision only |
| L3 | Supervised autonomous | System executes and verifies, human monitors and can intervene | Oversight only |
| L4 | Autonomous (bounded) | System executes within policy, alerts human on boundary | Exception only |

Target state for this architecture: L3–L4 for standard operations, L1–L2 for risk-bearing changes.

No component should operate at a level beyond what its verification evidence supports.


2. Layer 1 – Immutable, Self-Healing Base Platform

2.1 Purpose

Provide a compute substrate that recovers from common failure modes without human intervention. This is the most mature automation domain and should be built first.

2.2 Reference Stack

Operating system: Immutable, minimal-attack-surface OS.

Selection criterion: no general-purpose OS in the compute plane. General-purpose Linux (Ubuntu, RHEL) only for management/bastion nodes with CIS hardening.

Container orchestration: Kubernetes (managed or self-hosted Talos).

Service mesh: Optional but recommended for mTLS, circuit breaking, retry budgets.

2.3 Self-Healing Capabilities

| Failure mode | Automated response | Autonomy level | Verification |
|---|---|---|---|
| Pod crash | Restart (kubelet) | L4 | Restart count metric, SLO check |
| Node failure | Reschedule pods, replace node (Cluster API / ASG) | L4 | Node ready count, workload distribution |
| Disk pressure | Evict pods, alert, trigger volume expansion | L3–L4 | Disk utilisation metric, PV status |
| Network partition | Circuit breaker, retry with backoff, failover | L4 | Error rate metric, mesh health |
| Certificate expiry | Auto-rotation (cert-manager) | L4 | Cert expiry metric, TLS handshake success |
| Config drift | GitOps reconciliation (Flux/ArgoCD) | L4 | Drift detection alert, sync status |
| Resource exhaustion | HPA/VPA scaling | L3–L4 | Resource utilisation, scaling events |
| Cascading failure | Circuit breaker + rate limiting + load shedding | L3 | Error budget burn rate, human review |
| Split-brain / data inconsistency | Human gate – safe-halt, alert | L1 | Requires manual diagnosis |
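
Several of the L4 responses in the table are plain Kubernetes primitives. A minimal sketch (names, namespaces, and thresholds are illustrative placeholders, not recommendations):

```yaml
# Illustrative only: resource names and thresholds are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out before exhaustion
---
# PodDisruptionBudget keeps a floor of healthy replicas during node
# replacement and voluntary evictions (disk pressure, staged upgrades).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: prod
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```

The PDB is what makes node auto-repair safe to run at L4: the replacement loop can drain nodes aggressively without dropping below the availability floor.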

2.4 What Cannot Be Automated

2.5 Implementation Pattern

┌─────────────────────────────────────────────────────┐
│                   Git Repository                     │
│  (Infrastructure state, Kubernetes manifests,        │
│   Helm charts, Kustomize overlays)                   │
└──────────────┬──────────────────────────────────────┘
               │ GitOps sync
               ▼
┌──────────────────────────┐    ┌─────────────────────┐
│  Flux / ArgoCD           │◄───│  Drift Detection    │
│  (Continuous reconcile)  │    │  (alert on manual    │
│                          │    │   cluster changes)   │
└──────────┬───────────────┘    └─────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────┐
│  Kubernetes Cluster (Talos / managed K8s)             │
│  ┌──────────┐  ┌──────────┐  ┌───────────────────┐  │
│  │ HPA/VPA  │  │ PDB      │  │ Cluster API /     │  │
│  │ (scaling)│  │ (budget) │  │ Node auto-repair  │  │
│  └──────────┘  └──────────┘  └───────────────────┘  │
│  ┌──────────────────┐  ┌────────────────────────┐   │
│  │ cert-manager     │  │ Service mesh (mTLS,    │   │
│  │ (cert rotation)  │  │  circuit breaker)      │   │
│  └──────────────────┘  └────────────────────────┘   │
└──────────────────────────────────────────────────────┘
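
The reconcile loop in the diagram can be expressed declaratively. A minimal Flux sketch, assuming a hypothetical repository URL and path layout:

```yaml
# Illustrative Flux v2 configuration: URL, branch, and path are placeholders.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: infra
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.example.com/org/infra
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster
  namespace: flux-system
spec:
  interval: 10m          # periodic reconcile also reverts manual drift
  sourceRef:
    kind: GitRepository
    name: infra
  path: ./clusters/prod
  prune: true            # delete resources removed from Git
  wait: true             # report health, not just apply success
```

Because reconciliation is periodic, any manual `kubectl` change is reverted on the next sync, which is what turns drift detection from an alert into an enforcement mechanism.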

2.6 Evidence Capture


3. Layer 2 – Continuous Patching Pipeline

3.1 Purpose

Automatically apply security and dependency patches with verification gates, maintaining compliance SLAs for patch windows while minimising human intervention.

3.2 Patching Domains

| Domain | Approach | Autonomy target |
|---|---|---|
| OS base image | Immutable image rebuild on upstream CVE | L3–L4 |
| Container base images | Automated rebuild pipeline | L3–L4 |
| Application dependencies | Renovate/Dependabot with auto-merge rules | L3 (low-risk), L2 (high-risk) |
| Kubernetes components | Managed K8s auto-upgrade or staged rollout | L2–L3 |
| Database engines | Staged, human-gated | L1–L2 |
| Kernel / firmware | Human-gated, scheduled maintenance | L1 |

3.3 Automated Patching Pipeline

CVE Feed / Upstream Release
        │
        ▼
┌───────────────────────┐
│  Vulnerability Scanner │ ◄── Trivy, Grype, Snyk
│  (continuous scan of   │     scanning container
│   images + deps)       │     registry + repos
└──────────┬────────────┘
           │ New CVE or dependency update detected
           ▼
┌───────────────────────┐
│  Renovate / Dependabot │
│  (auto-PR with         │
│   changelog + diff)    │
└──────────┬────────────┘
           │ PR opened
           ▼
┌───────────────────────┐
│  CI Pipeline           │
│  Build, test, SAST,    │
│  DAST, SCA, container  │
│  scan, SBOM, signing   │
└───────────┬───────────┘
            │ All checks pass
            ▼
┌─────────────────────────────────────────────┐
│  Auto-merge Policy Engine                    │
│  IF severity < CRITICAL                      │
│  AND test coverage >= threshold              │
│  AND no breaking API changes                 │
│  AND dependency is in approved-list           │
│  AND SBOM diff is within policy              │
│  THEN auto-merge to staging branch           │
│  ELSE require human review                   │
└──────────────┬──────────────────────────────┘
               │
               ▼
┌───────────────────────┐
│  Canary Deployment     │
│  (Argo Rollouts /      │
│   Flagger)             │
│  Monitor: error rate,  │
│  latency, success rate,│
│  resource usage        │
│  Auto-promote if SLO   │
│  met; rollback if not  │
└──────────┬────────────┘
           │
           ▼
┌───────────────────────┐
│  Progressive Rollout   │
│  5% → 25% → 50% → 100%│
│  with SLO gates at     │
│  each stage            │
└───────────────────────┘

3.4 Auto-Merge Rules (Policy-as-Code)

These rules determine which patches can proceed without human review. They should be conservative and tightened over time based on incident data.

# Example: Renovate auto-merge policy (conceptual)
auto_merge_criteria:
  # Patch version bumps of well-known, low-risk deps
  - match:
      update_type: "patch"
      dependency_type: "production"
      severity: ["low", "medium"]
    requires:
      ci_pass: true
      test_coverage_delta: ">= 0"  # no coverage regression
      breaking_changes: false
      sbom_policy_check: pass
    action: auto_merge

  # Security patches – critical CVEs get fast-tracked
  # but still require canary verification
  - match:
      update_type: "any"
      cve_severity: "critical"
      cisa_kev: true  # Known Exploited Vulnerability
    requires:
      ci_pass: true
    action: auto_merge_to_canary
    escalation: page_oncall_if_canary_fails

  # Everything else: human review
  - match:
      update_type: "major"
    action: require_human_review

  - match:
      dependency_type: "database_engine"
    action: require_human_review

  - match:
      update_type: "minor"
      breaking_changes: true
    action: require_human_review

3.5 OS-Level Patching (Immutable Image Rebuild)

For immutable OS (Talos, Flatcar, Bottlerocket):

  1. Upstream publishes new image with security fixes
  2. CI pipeline builds new machine image incorporating the update
  3. Image is scanned (Trivy/vulnerability assessment)
  4. Staged node replacement: drain → replace → verify, one node at a time
  5. PodDisruptionBudgets ensure workload availability during rollout
  6. Rollback: revert to previous image if node health checks fail

For traditional OS (management nodes):

  1. Unattended-upgrades for security patches (automatic)
  2. Ansible playbooks for coordinated upgrades
  3. Snapshot before, apply, verify, rollback if failed
  4. Kernel updates: scheduled maintenance window, human approval

3.6 Compliance Patch SLAs

Regulatory frameworks set expectations for patch timelines. These should be encoded as policy:

| Severity | SLA target | Auto-action |
|---|---|---|
| Critical (CVSS ≥ 9.0, CISA KEV) | 24–72 hours | Auto-canary, page on-call |
| High (CVSS 7.0–8.9) | 7 days | Auto-PR, auto-merge if policy met |
| Medium (CVSS 4.0–6.9) | 30 days | Auto-PR, batch with next release |
| Low (CVSS < 4.0) | 90 days | Auto-PR, low priority queue |
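
Encoded as policy, the SLA table becomes machine-checkable, and an SLA breach becomes a compliance finding rather than a spreadsheet entry. A conceptual sketch (the schema is illustrative, not a specific tool's):

```yaml
# Conceptual patch-SLA policy; field names are illustrative placeholders.
patch_slas:
  - severity: critical          # CVSS >= 9.0 or on CISA KEV
    deadline_hours: 72
    on_detect: [auto_canary, page_oncall]
  - severity: high              # CVSS 7.0–8.9
    deadline_days: 7
    on_detect: [auto_pr, auto_merge_if_policy_met]
  - severity: medium            # CVSS 4.0–6.9
    deadline_days: 30
    on_detect: [auto_pr, batch_next_release]
  - severity: low               # CVSS < 4.0
    deadline_days: 90
    on_detect: [auto_pr, low_priority_queue]
breach_handling:
  action: open_finding_and_alert    # breach feeds the compliance layer
  evidence: patch_timeline_report   # links to Layer 4 evidence store
```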

3.7 Human Gates (Non-Automatable)

3.8 Evidence Capture


4. Layer 3 – Security Automation and Autonomous Defense

4.1 Purpose

Detect, contain, and respond to security threats with minimal human latency for known attack patterns, while maintaining human oversight for novel threats and availability-impacting responses.

4.2 Defense-in-Depth Stack

Layer 6: Compliance-as-Code (OPA, Kyverno, Cedar)
         Continuous policy enforcement, admission control

Layer 5: SOAR (Tines, Shuffle, XSOAR)
         Playbook-driven automated response

Layer 4: SIEM / Correlation (Wazuh, Elastic SIEM)
         Event correlation, alert enrichment, threat intelligence

Layer 3: Runtime Security (Falco, Tetragon, Sysdig)
         Syscall monitoring, behavioural detection, eBPF

Layer 2: Network Security (Cilium, Calico, NP-as-code)
         Network policy, DNS filtering, egress control

Layer 1: Supply Chain (Trivy, cosign, SLSA, admission)
         Image signing, SBOM, vulnerability gates

Layer 0: Identity (Keycloak/OIDC, RBAC, SPIFFE/SPIRE)
         Zero-trust identity, workload identity, least privilege

4.3 Automated Response Playbooks

| Threat pattern | Automated response | Autonomy level | Constraint |
|---|---|---|---|
| Known malware hash in container | Kill pod, quarantine image, alert | L4 | Pre-approved action |
| Brute force authentication | Progressive rate limit, temp block IP, alert | L4 | Threshold-based |
| Anomalous egress traffic | Block egress to unknown destination, alert | L3 | May impact availability |
| Privilege escalation attempt | Kill process, alert, capture forensics | L4 | Pre-approved action |
| CVE in running container | Schedule replacement with patched image | L3 | Follows patching pipeline |
| Certificate about to expire | Auto-rotate | L4 | cert-manager handles |
| Config drift from policy | Auto-remediate to desired state | L3–L4 | Policy-as-code |
| Unusual API call patterns | Increase logging, alert, reduce rate limit | L3 | May impact legitimate traffic |
| Novel attack pattern | Human gate – alert, capture, do not auto-remediate | L1 | Unknown blast radius |
| Insider threat indicators | Human gate – alert security team, capture evidence | L1 | Legal/HR implications |
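
The detection side of the anomalous-egress playbook can be sketched as a Falco rule; the response (blocking) is then wired through the SOAR layer. The macro names follow Falco's default ruleset conventions, but the allow-list and output fields here are illustrative placeholders:

```yaml
# Illustrative Falco rule; the list contents are placeholders maintained
# via policy-as-code, not hand-edited on hosts.
- list: allowed_egress_ips
  items: ["10.0.0.5"]

- rule: Unexpected outbound connection from container
  desc: Detect egress to destinations not on the workload's allow-list
  condition: >
    outbound and container
    and not fd.sip in (allowed_egress_ips)
  output: >
    Unexpected egress (command=%proc.cmdline connection=%fd.name
    container=%container.name image=%container.image.repository)
  priority: NOTICE
  tags: [network, egress]
```

Keeping detection (Falco) and response (SOAR playbook) separate is deliberate: the L3 constraint in the table means a human or policy gate sits between "seen" and "blocked".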

4.4 Policy-as-Code (Admission Control)

Prevent bad state from entering the cluster rather than detecting it after the fact.

Kubernetes admission:

Cloud-level:

Runtime:
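
As one concrete instance of Kubernetes admission control, a minimal Kyverno policy that rejects privileged pods before they are scheduled (a sketch based on Kyverno's pattern-anchor syntax; the policy name and message are placeholders):

```yaml
# Illustrative Kyverno admission policy: bad state is rejected at the API
# server, rather than detected at runtime.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce   # use "Audit" for a dry-run rollout
  rules:
    - name: privileged-containers
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Privileged containers are not permitted."
        pattern:
          spec:
            containers:
              # =(...) means "if the field exists, it must match"
              - =(securityContext):
                  =(privileged): "false"
```

Rolling a policy out in `Audit` mode first, then flipping to `Enforce`, is the admission-control equivalent of canary before fleet.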

4.5 Vulnerability Management Pipeline

Continuous Scanning
    ├── Registry scan (Trivy) — on push and scheduled
    ├── Runtime scan (Trivy operator) — running containers
    ├── Host scan (Wazuh) — OS and installed packages
    ├── IaC scan (Checkov, tfsec) — in CI pipeline
    └── Dependency scan (SCA) — in CI pipeline
            │
            ▼
    Prioritisation Engine
    ├── CVSS score
    ├── EPSS (Exploit Prediction Scoring)
    ├── CISA KEV (Known Exploited)
    ├── Reachability analysis (is the vuln actually reachable?)
    ├── Asset criticality (what does this run on?)
    └── Exposure context (internet-facing? internal?)
            │
            ▼
    Action routing
    ├── Critical + Exploited + Reachable → Emergency patch
    ├── High + Reachable → Fast-track to patching pipeline
    ├── Medium/Low or Not Reachable → Standard patching SLA
    └── Accepted risk → Document in risk register, review quarterly
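
The action routing above is itself a candidate for policy-as-code, so prioritisation decisions are versioned and auditable rather than ad hoc. A conceptual sketch (the schema is illustrative):

```yaml
# Conceptual vulnerability routing policy; field names are placeholders.
# Signals: severity (CVSS), exploitation (EPSS / CISA KEV), reachability,
# and exposure context. First matching route wins.
routes:
  - when: { cvss: ">=9.0", kev: true, reachable: true }
    action: emergency_patch            # bypass batching, page on-call
  - when: { cvss: ">=7.0", reachable: true }
    action: fast_track_patch_pipeline
  - when: { reachable: false }
    action: standard_sla               # defer to severity-based SLA
  - when: { risk_accepted: true }
    action: risk_register              # human decision; quarterly review
default: standard_sla
```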

4.6 Human Gates (Non-Automatable)

4.7 Evidence Capture


5. Layer 4 – Compliance Automation and Continuous Assurance

5.1 Purpose

Maintain continuous compliance posture with automated evidence generation, policy enforcement, and drift detection, while preserving human accountability for risk decisions and regulatory attestation.

5.2 Compliance Automation Model

Compliance Control Plane

  Policy Engine      Evidence Store     Audit Dashboard
  (OPA/Kyverno/      (immutable,        (continuous
   Cedar)             signed,            posture)
                      timestamped)

            Control Mapping Layer

  ISO 27001 <-> NIS2 <-> SOC 2 <-> DORA <-> CIS <-> NIST

  Maps technical controls to regulatory requirements
  One control can satisfy multiple frameworks

5.3 Continuous Control Monitoring

| Control category | Automated check | Frequency | Evidence type |
|---|---|---|---|
| Access control | RBAC audit, stale accounts, over-privileged roles | Continuous | RBAC dump, access review report |
| Encryption | TLS version check, cert validity, at-rest encryption | Continuous | Scan results, cert inventory |
| Patch compliance | Vulnerability age vs. SLA | Continuous | Patch timeline report |
| Network segmentation | Network policy coverage, egress audit | Continuous | Policy dump, connectivity test |
| Logging and monitoring | Log pipeline health, retention compliance | Continuous | Pipeline metrics, retention proof |
| Backup integrity | Restore test results, RPO compliance | Daily/weekly | Restore test logs, hash verification |
| Change management | All changes via GitOps (no manual cluster changes) | Continuous | Git history, drift detection alerts |
| Incident response | Playbook test results, MTTR metrics | Quarterly exercise | Exercise report, MTTR dashboard |
| Supply chain | SBOM coverage, image signing rate | Continuous | SBOM inventory, signing audit |
| Identity lifecycle | Joiner/mover/leaver automation, MFA enforcement | Continuous | IGA audit logs, MFA coverage |

5.4 Evidence Generation (OSCAL-Based)

OSCAL (Open Security Controls Assessment Language) provides a machine-readable format for compliance evidence:

  1. System Security Plan (SSP): Generated from IaC and policy-as-code definitions
  2. Assessment Plan: Automated test definitions mapped to controls
  3. Assessment Results: Continuous scan and test results in OSCAL format
  4. Plan of Action and Milestones (POA&M): Auto-generated from failed controls

Evidence pipeline:

Control check runs → Results stored (immutable, signed) →
Mapped to framework requirements → Dashboard updated →
Auditor accesses dashboard + evidence store

5.5 Regulatory Framework Requirements

NIS2 (applicable if operating in EU critical sectors):

DORA (applicable if in EU financial sector):

SOC 2 / ISO 27001:

5.6 GitSecOps: Git as the Source of Compliance Truth

The strongest pattern for demonstrating compliance to auditors:

This turns audit from “show me your documents” to “here is the commit history, the policy enforcement logs, and the continuous compliance dashboard.”

5.7 Human Gates (Structurally Required by Regulation)

These are not automation gaps. They are requirements imposed by every major compliance framework:

5.8 Evidence Capture


6. Layer 5 – Change Management and Autonomous Deployment

6.1 Purpose

Enable safe, fast, automated deployment of standard changes while maintaining rigorous gates for non-standard and emergency changes.

6.2 Change Classification

| Type | Definition | Process | Autonomy |
|---|---|---|---|
| Standard change | Pre-approved, bounded blast radius, automated verification | Fully automated pipeline | L4 |
| Normal change | Requires review, moderate risk | PR review + automated deploy + canary | L2–L3 |
| Emergency change | Urgent fix, expedited process | Abbreviated review + automated deploy + immediate verify | L2 |
| Major change | High risk, architecture impact | Full CAB review + staged manual rollout | L1 |

6.3 Standard Change Automation (L4)

Standard changes are the largest volume and the highest-value automation target. These are pre-approved change types where the blast radius is bounded and verification is automated.

Examples of standard changes:

Pipeline:

Developer pushes code
        │
        ▼
  CI Pipeline (build, test, scan, sign)
        │ All gates pass
        ▼
  PR Auto-merge (if policy met)
        │
        ▼
  GitOps Sync (Flux/ArgoCD) → Canary Deploy (Argo Rollouts)
                                    │
                              SLO Verification
                              (bake time: 15–30 min)
                                    │
                              Pass? Promote
                              Fail? Rollback
        │
        ▼
  Progressive Rollout: 5% → 25% → 100%
  with SLO gates at each stage

6.4 Feature Flags (Decouple Deploy from Release)

Feature flags enable deploying code without activating it, then progressively enabling:

Tools: LaunchDarkly, Flipt (self-hosted), Unleash, OpenFeature SDK

Integration with SLOs: Feature flags should be wired to SLO monitoring. If enabling a feature degrades SLOs beyond threshold, auto-disable.
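
A conceptual flag definition tying these ideas together (this schema is illustrative, not any specific tool's format; flag key, SLO name, and dates are placeholders):

```yaml
# Conceptual feature flag; all names and values are placeholders.
flags:
  - key: new-checkout-flow
    state: enabled
    rollout:
      strategy: percentage
      steps: [1, 5, 25, 100]          # progressive exposure
    kill_switch:
      slo: checkout-availability      # auto-disable on SLO breach
      max_error_budget_burn: 2.0      # burn-rate threshold for disable
    owner: payments-team
    expiry: 2026-03-31                # flags are debt; force a cleanup date
```

The `kill_switch` stanza is the point: a flag without an automated disable path is only half a rollback mechanism.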

6.5 Rollback Strategy

Every deployment must have a tested rollback path:

| Scenario | Rollback method | Time to restore |
|---|---|---|
| Canary failure | Argo Rollouts auto-rollback | Seconds–minutes |
| Post-deploy SLO violation | GitOps revert (revert commit) | Minutes |
| Feature flag issue | Disable flag | Seconds |
| Schema migration failure | Forward-fix preferred; backward migration if tested | Minutes–hours |
| Infrastructure change failure | Terraform/OpenTofu state rollback | Minutes |

Critical rule: Never deploy a schema migration that cannot be rolled back, or a migration that requires the new code to function. Deploy migrations and code changes in separate steps (expand-contract pattern).

6.6 SLO-Gated Deployment

Every automated deployment should be gated on SLO health:

# Conceptual: Argo Rollouts analysis template
analysis:
  metrics:
    - name: error-rate
      provider: prometheus
      query: |
        sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
      threshold:
        max: 0.01  # 1% error rate
      interval: 60s

    - name: latency-p99
      provider: prometheus
      query: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[5m]))
          by (le))
      threshold:
        max: 0.5  # 500ms p99
      interval: 60s

    - name: success-rate
      provider: prometheus
      query: |
        sum(rate(http_requests_total{status=~"2.."}[5m]))
        / sum(rate(http_requests_total[5m]))
      threshold:
        min: 0.995  # 99.5% success rate
      interval: 60s

  # Rollback if ANY metric fails
  # Promote only if ALL metrics pass for full bake time

6.7 Human Gates

6.8 Evidence Capture


7. Observability – The Nervous System

7.1 Purpose

Observability is the foundation that enables all other layers. Without high-quality telemetry, self-healing is guesswork, SLO gating is impossible, and compliance evidence is incomplete.

7.2 Observability Stack

Dashboards & Alerts: Grafana

  Metrics            Logs               Traces
  Prometheus /       Loki / OpenSearch   Tempo / Jaeger
  Mimir / Thanos     / Elastic

              OpenTelemetry Collector
              (unified pipeline)

  Applications      Infrastructure      Security
  (OTel SDK)        (node exporter,     (Falco,
                     kube-state)         Wazuh)

7.3 SLI/SLO Framework

Define SLIs and SLOs for every user-facing service:

| SLI | Measurement | Typical SLO |
|---|---|---|
| Availability | Successful requests / total requests | 99.9% (30d) |
| Latency | p99 response time | < 500ms |
| Error rate | 5xx responses / total responses | < 0.1% |
| Throughput | Requests per second sustained | Within capacity plan |
| Correctness | Business logic validation pass rate | 99.99% |

Error budget: The difference between 100% and the SLO target is the error budget. Automation decisions consume error budget. If error budget is exhausted, freeze automated deployments until budget recovers.
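
The deployment-freeze rule can be made mechanical with a burn-rate alert whose label the pipeline checks before promoting. An illustrative sketch for a 99.9% availability SLO (metric names and the freeze label are placeholders; the 14.4x multiplier follows the common fast-burn heuristic, here simplified to a single window):

```yaml
# Illustrative PrometheusRule; a production setup would pair fast and slow
# burn windows rather than a single 5m window.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-burn-rate
spec:
  groups:
    - name: slo.rules
      rules:
        - alert: ErrorBudgetFastBurn
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
            ) > (14.4 * 0.001)        # 14.4x burn of a 0.1% budget
          for: 2m
          labels:
            severity: page
            freeze_deploys: "true"    # pipeline gate keys off this label
```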

7.4 Alert Design (Anti-Alert-Fatigue)

7.5 Evidence Capture


8. Agentic AI in Operations

AI agents are entering infrastructure operations: triaging alerts, suggesting remediations, and beginning to execute bounded actions autonomously. The engineering case is strong. The governance case requires deliberate architecture.

8.1 Current Maturity

| Capability | Maturity | Autonomy level |
|---|---|---|
| Alert triage and enrichment | Production-ready | L2–L3 |
| Root cause suggestion | Production-ready | L2–L3 |
| Log analysis and summarisation | Production-ready | L2–L3 |
| Autonomous remediation (known patterns) | Emerging | L1–L2 |
| Natural-language infrastructure changes | Experimental | L1 |
| Autonomous architecture decisions | Not ready | — |

8.2 Guardrails

AI agents in this architecture must operate within the same enforcement model as any other automated component:

8.3 Adoption Path

Start at L1 (read-only advisory), build trust evidence over 3–6 months, then graduate to L2 (human-approved actions). Advance to L3 only for well-understood action classes with documented success rates. Do not skip phases – each builds the evidence needed to justify the next.

For a detailed treatment of guardrail architecture, evidence requirements, and the accountability gap in EU-regulated environments, see Agentic AI in Regulated Infrastructure.


9. Cross-Cutting Concerns

| Concern | Approach | Key constraint |
|---|---|---|
| Secrets management | Vault or equivalent. Short-lived credentials via OIDC federation. Runtime injection via CSI driver. Automated rotation. All access logged. | Never in Git. No long-lived API keys in production. |
| Disaster recovery | RPO/RTO defined per service. Automated daily restore tests. etcd backup verified. DR runbook exercised quarterly. | Multi-zone minimum for critical services. Untested backups are not backups. |
| Cost governance | Resource requests/limits enforced. VPA right-sizing. Cost anomaly alerts. Chargeback per team/service. | Spot instances for non-critical batch only. |
| Network architecture | Zero-trust: default-deny, all traffic explicitly allowed via network policy as code. mTLS via service mesh. Filtered external DNS. | No implicit trust between services. |

10. Implementation Roadmap

Phase 1: Foundation (Months 1–3)

Goal: Immutable base platform with GitOps and basic observability.

Phase 2: Patching and Supply Chain (Months 3–6)

Goal: Automated patching pipeline with compliance SLAs.

Phase 3: Security Automation (Months 6–9)

Goal: Automated detection and response for known threat patterns.

Phase 4: Compliance Automation (Months 9–12)

Goal: Continuous compliance posture with automated evidence.

Phase 5: Advanced Autonomy (Months 12+)

Goal: Expand automation envelope with AI-assisted operations.


11. Anti-Patterns

| Anti-pattern | Why it fails |
|---|---|
| Automate without observability | You cannot verify what you cannot measure. Automation without SLO gating is guesswork. |
| Skip canary for speed | The time saved is repaid with interest during the inevitable incident. |
| AI agents without audit trails | If you cannot reconstruct why the agent acted, you cannot trust it – and neither can an auditor. |
| Bolt-on compliance | Compliance evidence must be a byproduct of the pipeline, not a separate workstream. |
| Eliminate human gates | Regulation and engineering both require human accountability at consequence boundaries. |
| Alert-driven ops without SLOs | Optimising for alert count rather than user impact produces noise, not reliability. |
| Single-vendor security | One product failure should not compromise your entire security posture. |
| Immutable OS without rollback testing | Immutability only delivers value if you can revert to the previous image within minutes. |
| Policy-as-code without negative tests | Untested policies block legitimate workloads in production. |

12. Decision Log

| Decision | Rationale | Alternatives considered | Reversibility |
|---|---|---|---|
| GitOps as deployment model | Auditability, drift detection, rollback via revert | Push-based CI/CD, manual kubectl | High |
| Immutable OS for compute | Reduced attack surface, consistent state | Hardened Ubuntu/RHEL | Medium |
| OPA/Kyverno for policy | Kubernetes-native, declarative, testable | Cedar, Sentinel, custom webhooks | High |
| OpenTelemetry for instrumentation | Vendor-neutral, standard, broad ecosystem | Vendor-specific agents | High |
| Canary deployment default | Lowest-risk deployment pattern | Blue/green, rolling update | High |
| Feature flags for release | Decouples deploy from release, instant rollback | Branch-based releases | High |
| Human gates for risk-bearing changes | Regulatory requirement, safety requirement | Full automation | N/A (structurally required) |

Appendix A: Tool Reference

| Category | Recommended | Alternatives |
|---|---|---|
| OS (compute) | Talos Linux | Flatcar, Bottlerocket |
| Orchestration | Kubernetes | Nomad (for specific use cases) |
| GitOps | Flux CD | ArgoCD |
| Progressive delivery | Argo Rollouts | Flagger |
| Feature flags | Flipt (self-hosted) | LaunchDarkly, Unleash |
| CI/CD | GitLab CI, GitHub Actions | Tekton, Jenkins |
| IaC | OpenTofu / Terraform | Pulumi, Crossplane |
| Policy engine | Kyverno | OPA/Gatekeeper, Cedar |
| Metrics | Prometheus + Mimir/Thanos | Datadog, New Relic |
| Logs | Loki | OpenSearch, Elastic |
| Traces | Tempo | Jaeger |
| Dashboards | Grafana | — |
| Telemetry collection | OpenTelemetry | — (de facto standard) |
| Runtime security | Falco + Tetragon | Sysdig |
| Host security | Wazuh | OSSEC, CrowdSec |
| SIEM | Wazuh / Elastic SIEM | Splunk, Sentinel |
| SOAR | Tines | Shuffle, Cortex XSOAR |
| Vulnerability scanning | Trivy + Grype | Snyk, Prisma Cloud |
| Image signing | cosign (Sigstore) | Notary v2 |
| SBOM | Syft (CycloneDX) | SPDX tools |
| Secrets | HashiCorp Vault | AWS SM, Azure KV, SOPS |
| Certificate management | cert-manager | Vault PKI |
| Dependency updates | Renovate | Dependabot |
| Compliance evidence | OSCAL tooling | Manual evidence collection |
| Identity | Keycloak | Auth0, Okta |
| Network policy | Cilium | Calico |
| Service mesh | Cilium, Linkerd | Istio |

Appendix B: Regulatory Quick Reference

| Framework | Scope | Key automation-relevant requirements | Human gate requirements |
|---|---|---|---|
| NIS2 | EU critical infrastructure | Risk management (Art. 21), incident reporting (24h), supply chain security | Management accountability (Art. 20), risk acceptance |
| DORA | EU financial sector | ICT risk management, incident reporting (4h classify), TLPT, third-party oversight | Management oversight, TLPT execution, vendor risk decisions |
| SOC 2 | US, voluntary | Trust services criteria (security, availability, etc.) | Management assertions, auditor interaction |
| ISO 27001 | Global, voluntary | Annex A controls, ISMS operation | Management review (9.3), internal audit (9.2), risk treatment (6.1) |
| CRA | EU, products with digital elements | Vulnerability handling, SBOM, security updates | Conformity assessment, incident reporting |