ACME AGENT SUPPLY CO. · ARCHITECTURE REFERENCE
SERIAL 77B · v6

OpenClaw Reliability Stack
How OpenClaw observes, protects, and recovers AI agent systems.

[Figure: OpenClaw Reliability Stack architecture (reliability loop), showing layers for observing, protecting, and recovering AI agent systems. Layers: Operator Console (Triage entry point, SphinxGate access control); Observe Layer (RadCheck, Observe); Resilience Layer, where Detection (Watchdog, Sentinel, InfraWatch) is wired via the Resilience Event Bus (REB) to Readiness (Lazarus) and Recovery (Agent911, Recall, ORP); Memory Integrity Layer (Elixir); Agent Runtime Environment (Claude, OpenAI, local, and specialized agents).]

LAYER 0: Operator Console
Entry point. Start here.

The entry point to the OpenClaw system. Triage captures a deterministic proof bundle before any recovery action begins.

Triage
Triage is the first-response triage terminal. It captures a deterministic, read-only proof bundle before any recovery action. Run Triage first, every time.
Card: Triage, ENTRY. First-response triage terminal. triage -watch

SphinxGate
SphinxGate is the access control layer for model routing. It governs which models run on which workloads, enforces token discipline per lane, and produces a routing audit trail. It sits in Layer 0 and operates before the resilience layer.
Card: SphinxGate, ACCESS CTRL. Model routing policy. Lane enforcement.

Status panel: HEALTHY. Reliability: 87 / 100.

LAYER 1: Observe Layer
Visibility. Signals before stalls.

The Observe Layer aggregates signals about agent health and system behavior. Tools include RadCheck for reliability scoring from 0 to 100, and Observe for signal aggregation. Agent presence and liveness monitoring is available via recall status --watch.

RadCheck
RadCheck measures system reliability scores from 0 to 100 by analyzing agent health signals and telemetry across five domains: gateway, sessions, disk, memory drift, and agent churn. Free and read-only.
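To make the idea of a 0 to 100 score over five domains concrete, here is a minimal Python sketch. It is not RadCheck's actual method: the source names the domains but not how they are combined, so the weights, the function name reliability_score, and the per-domain health inputs in the range 0.0 to 1.0 are all assumptions made for illustration.

```python
# Hypothetical weights: the source lists the five domains but does not say
# how RadCheck weighs them. These values are illustrative and sum to 100.
DOMAIN_WEIGHTS = {
    "gateway": 30,
    "sessions": 20,
    "disk": 15,
    "memory_drift": 20,
    "agent_churn": 15,
}


def reliability_score(domain_health: dict) -> int:
    """Combine per-domain health values (0.0 to 1.0) into a 0-100 score."""
    missing = DOMAIN_WEIGHTS.keys() - domain_health.keys()
    if missing:
        raise ValueError(f"missing domains: {sorted(missing)}")

    total = 0.0
    for domain, weight in DOMAIN_WEIGHTS.items():
        # Clamp each input so one bad reading cannot push the score out of range.
        health = min(1.0, max(0.0, domain_health[domain]))
        total += weight * health
    return round(total)


if __name__ == "__main__":
    sample = {
        "gateway": 1.0,
        "sessions": 0.9,
        "disk": 1.0,
        "memory_drift": 0.6,
        "agent_churn": 0.8,
    }
    print(reliability_score(sample))  # 87
```

A weighted sum keeps the score easy to explain to an operator: each domain contributes a fixed maximum number of points, and a drop in the total can be traced back to the domain that lost them.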
Card: RadCheck. Reliability scoring, 0–100. Free tier.

Observe
The Observe module provides unified aggregation of runtime signals from all connected agents. Its output feeds into RadCheck scoring and Sentinel anomaly detection.
Card: Observe. Signal aggregation layer. Operator action.

LAYER 2: Detection Layer
Continuous detection. Emits to REB.

The Detection Layer continuously monitors runtime behavior, infrastructure configuration, and process liveness. All three detection products emit events to the Resilience Event Bus (REB). Tools include Watchdog for heartbeat and liveness probes, Sentinel for silent failure detection, and InfraWatch for infrastructure config drift detection.

Watchdog
Watchdog provides heartbeat supervision with liveness probes, cron-safe cadence checks, and lock collision alerts. Port availability does not confirm process health.
Card: Watchdog. Heartbeat supervision. Monitoring.

Sentinel
Sentinel provides continuous detection of silent failures, output deviation, and stuck runs. It watches for failure modes that logs do not surface.
Card: Sentinel. Silent failure detection. Always on.

InfraWatch
InfraWatch detects configuration drift in your agent stack's infrastructure: ingest chains, daemon configs, and routing. Emits events to the Resilience Event Bus. Included in Operator Bundle.
Card: InfraWatch. Infra config drift. Emits to REB. Bundle only.

Transmission
Transmission routes each task to the right model at the right cost. Circuit breaker prevents rate limit failures before they happen. Token efficiency report shows real savings.
Card: Transmission. Task-aware routing. Economics layer. Coming soon.

LAYER 3: Recovery Layer
Evidence before recovery. Always.

The Recovery Layer enables deterministic recovery operations. Tools include ORP (OpenClaw Recovery Protocol) for evidence-first recovery doctrine, Agent911 for read-only recovery cockpit diagnostics, Recall for manual operator intervention, and Lazarus for backup readiness verification.
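The evidence-first rule can be pictured as a small state tracker that refuses any phase attempted out of turn. The Python sketch below uses ORP's stated order (evidence capture, diagnosis, safe recovery, verification); the class names Phase, RecoveryRun, and OutOfOrderError are hypothetical and do not correspond to any documented OpenClaw API.

```python
from enum import IntEnum


class Phase(IntEnum):
    """ORP phases, numbered in the mandatory order given in the text."""
    EVIDENCE_CAPTURE = 1
    DIAGNOSIS = 2
    SAFE_RECOVERY = 3
    VERIFICATION = 4


class OutOfOrderError(RuntimeError):
    """Raised when a phase is attempted before its predecessors are done."""


class RecoveryRun:
    """Hypothetical tracker for one recovery; illustrative only."""

    def __init__(self) -> None:
        self.completed = []  # phases finished so far, in order

    @property
    def done(self) -> bool:
        return len(self.completed) == len(Phase)

    def complete(self, phase: Phase) -> None:
        """Mark a phase complete, but only if it is the next one in sequence."""
        if self.done:
            raise OutOfOrderError("all phases already completed")
        expected = Phase(len(self.completed) + 1)
        if phase is not expected:
            raise OutOfOrderError(f"expected {expected.name}, got {phase.name}")
        self.completed.append(phase)


if __name__ == "__main__":
    run = RecoveryRun()
    try:
        run.complete(Phase.SAFE_RECOVERY)  # skipping evidence capture is rejected
    except OutOfOrderError as err:
        print(f"blocked: {err}")
    for phase in Phase:  # iterating the enum yields the phases in order
        run.complete(phase)
    print(run.done)
```

Encoding the order in the type rather than in operator habit means a recovery script cannot reach the recovery step without an evidence-capture step having been recorded first.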
ORP (OpenClaw Recovery Protocol)
ORP is the recovery doctrine layer. It sequences recovery in the correct order: evidence capture, then diagnosis, then safe recovery, then verification. The order is mandatory.
Card: ORP. Recovery Protocol.

Agent911
Agent911 provides recovery cockpit diagnostics and read-only triage capabilities during system recovery operations. It aggregates protection proofs and guides operators through playbooks.
Card: Agent911. Recovery cockpit.

Recall
Recall is the manual intervention surface for the OpenClaw control plane. Operators use Recall to stall, freeze, quarantine, lockdown, and recover agents when automated recovery is insufficient.
Card: Recall. Manual intervention.

Lazarus
Lazarus verifies backup readiness before recovery is needed. Recovery readiness degrades silently. Lazarus confirms you can return to a clean state before an incident forces you to try. Included free with Agent911.
Card: Lazarus. Backup readiness.

LAYER 4: Memory Integrity
Long-horizon agent coherence.

The Memory Integrity Layer ensures long-term agent coherence. Elixir provides deterministic agent rehydration via the BOOT, DIGEST, ANCHORS, ORIENTATION sequence. Infrastructure config drift is handled by InfraWatch in the Detection Layer.

Elixir
Elixir provides deterministic agent rehydration following the BOOT, DIGEST, ANCHORS, ORIENTATION sequence. Agents that use Elixir return to a coherent operational state reliably after restarts or context loss.
Card: Elixir. Deterministic rehydration. BOOT → ORIENT.

LAYER 5: Agent Runtime Environment
The systems being protected.

The Agent Runtime Environment represents the external agent systems that OpenClaw observes, protects, and recovers. This includes Anthropic Claude agents, OpenAI agents, self-hosted local model agents, and specialized or multi-modal agent systems.

Claude Agents
Anthropic Claude agent runtime environment. OpenClaw provides observability, protection, and recovery for agents running on Anthropic models.
Card: Claude Agents. Anthropic runtime.

OpenAI Agents
OpenAI agent runtime environment. OpenClaw provides observability, protection, and recovery for agents running on OpenAI models.
Card: OpenAI Agents. OpenAI runtime.

Local Agents
Self-hosted local model agent runtime environment. OpenClaw supports agents running on locally deployed language models and custom infrastructure.
Card: Local Agents. Self-hosted runtime.

Specialized Agents
Custom and multi-modal agent runtime environments, including specialized agent architectures, domain-specific models, and multi-agent orchestration systems.
Card: Specialized Agents. Custom / multi-modal.

Diagram flow labels: SIGNALS · ALERTS · PROOFS

Observe → Protect → Recover. In that order.

ACME · FIELD SUPPLY DIVISION · SERIAL 77B · v6

OpenClaw Reliability Stack — architecture for observing, protecting, and recovering AI agent systems.