An Operator's Tale

How this reliability stack was built, and why it stays operator-safe.

Most reliability failures don't start with alarms.

They start with doubt: something feels off, but nothing is clearly broken.

In agent systems, "port up" and "process alive" often mean very little. Throughput can collapse while everything still looks green. State can drift without producing obvious errors. Compaction pressure can rise silently until the system feels haunted.

The first job of an operator isn't recovery. It's evidence.

Field Log · Entry 001 - The Problem

Small systems fail quietly. Before dashboards. Before alerts. Before anyone is sure what they're looking at.

When that happens, the first question isn't "how do I fix it?" It's "what is actually true right now?"

Operators need a calm first move that is safe under stress and useful even when the control plane is degraded.

Field Log · Entry 002 - The Early Failure Patterns

This stack grew out of real, persistent agent workloads: not demos, not sandboxes, but systems that had to remain coherent across days and weeks.

The failures were rarely dramatic. That was the danger.

Agents would stall mid-run. Gateways would report healthy while the system quietly froze underneath. Watchdog loops would confirm liveness while actual work had stopped.

The logs looked normal. The port was open. The process was "fine."

The system was lying.
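One way to catch this mismatch is to track progress separately from liveness. The sketch below is illustrative only and is not how any tool in this stack is implemented. It assumes the operator can observe two things: whether the port is open, and when the last unit of real work finished. The function name, the `last_progress_at` timestamp, and the 300-second threshold are all hypothetical.

```python
def classify(port_open: bool, last_progress_at: float, now: float,
             stall_after: float = 300.0) -> str:
    """Separate 'the process answers' from 'the process is doing work'.

    All timestamps are seconds on the same clock (for example time.monotonic()).
    stall_after is an assumed threshold; tune it to the workload.
    """
    if not port_open:
        # No liveness signal at all: the classic, loud failure.
        return "down"
    if now - last_progress_at > stall_after:
        # Liveness is green but no work has completed recently: the quiet failure.
        return "stalled"
    return "healthy"
```

A gateway that still accepts connections but has finished nothing for ten minutes would come back as "stalled" rather than "healthy", which is the distinction a port check alone cannot make.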

Field Log · Entry 003 - First Response Doctrine

Out of those failures came a doctrine we now treat as non-negotiable:

Never recover first. Capture evidence first. Recovery without proof is just hoping the problem doesn't come back.

OpenClaw Triage Unit exists to make that doctrine easy. It produces a deterministic, timestamped proof bundle before anyone starts changing things. No screenshot archaeology. No ad-hoc log grepping. A structured artifact that survives the postmortem.

Evidence first. Recovery second.
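What "deterministic, timestamped" can mean in practice is easier to see in code. This is a minimal sketch under stated assumptions, not the actual bundle format of OpenClaw Triage Unit: the `build_bundle` function, the manifest fields, and the idea of passing evidence in as already-captured text are hypothetical choices made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_bundle(evidence: dict[str, str], captured_at: datetime) -> dict:
    """Assemble a manifest from evidence that was already collected read-only.

    evidence maps an artifact name (hypothetical, e.g. "process_list") to its text.
    captured_at should be a timezone-aware datetime.
    """
    # Hash each artifact so later truncation or tampering is detectable.
    artifacts = {}
    for name, text in sorted(evidence.items()):
        raw = text.encode("utf-8")
        artifacts[name] = {
            "sha256": hashlib.sha256(raw).hexdigest(),
            "bytes": len(raw),
        }

    manifest = {
        "captured_at": captured_at.astimezone(timezone.utc).isoformat(),
        "artifacts": artifacts,
    }

    # Canonical JSON (sorted keys, fixed separators) means the same evidence
    # captured at the same time always yields the same bundle ID.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["bundle_id"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return manifest
```

Because the function only reads what it is handed and never touches the running system, it fits the evidence-first rule: nothing changes until the manifest exists.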

Field Log · Entry 004 - How the Stack Emerged

Each tool was built the same way: a failure pattern became visible, got named, and then got a purpose-built response.

There was no roadmap. There was forensics.

RadCheck

Surfaces early instability signals before the system stalls.

OCTriageUnit (OpenClaw Triage Unit)

Creates a read-only proof bundle for first-response triage.

Sentinel

Continuous detection for silent failures and drift signals that don't make it to the operator surface.

SphinxGate

Token discipline and lane enforcement so background work can't quietly consume foreground budget.

Drift Guard

Baseline comparison and drift analysis when predictability starts eroding over time.

Lazarus

Recovery readiness verification so "we can restore" isn't a guess.

Watchdog

Heartbeat supervision when liveness signals exist but the system isn't truly alive.

The sequence matters. Detect → observe → control → prove → recover. That order is not marketing. It is what operators actually need at 2am.
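Of the tools listed above, lane enforcement is the easiest to show in a few lines. The sketch below is an assumption about what "lanes" and "token discipline" could look like, not SphinxGate's real interface: the `LaneBudget` class, the lane names, and the limits are hypothetical.

```python
class LaneBudget:
    """Per-lane token budgets: a lane can only spend what it was allotted."""

    def __init__(self, limits: dict[str, int]) -> None:
        self._limits = dict(limits)
        self._spent = {lane: 0 for lane in limits}

    def try_spend(self, lane: str, tokens: int) -> bool:
        """Record the spend and return True, or refuse and return False."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self._spent[lane] + tokens > self._limits[lane]:
            # Refuse rather than borrow from another lane.
            return False
        self._spent[lane] += tokens
        return True

    def remaining(self, lane: str) -> int:
        return self._limits[lane] - self._spent[lane]


# Hypothetical lanes: background work has its own, smaller allowance.
budget = LaneBudget({"foreground": 10_000, "background": 2_000})
budget.try_spend("background", 1_500)
over_limit_allowed = budget.try_spend("background", 1_000)  # would exceed 2_000
```

The design choice is that a refused spend fails in its own lane. Background work that runs out of budget stops, and the foreground allowance is untouched.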

Field Log · Entry 005 - Why ACME Exists

These tools were built to survive our own systems first: to make failures explainable and recoveries repeatable.

If you've operated agent systems long enough, you already know the feeling this stack is for.

If you haven't yet, this is the set of tools we wish had existed before the first long night.

Field Notes

Capture evidence before recovery - proof beats memory.
Deterministic bundles beat screenshots - structure survives postmortems.
Systems drift before they fail - drift is the early signal.
Operators need clarity, not dashboards - fewer panels, more answers.
Heartbeats lie - port up does not mean healthy.
Recovery readiness degrades silently - verify it before you need it.

Next: Start with OpenClaw Triage Unit to capture evidence before recovery.