RadCheck
Surfaces early instability signals before the system stalls.
How this reliability stack was built - and why it stays operator-safe.
Most reliability failures don't start with alarms.
They start with doubt: something feels off, but nothing is clearly broken.
In agent systems, "port up" and "process alive" often mean very little. Throughput can collapse while everything still looks green. State can drift without producing obvious errors. Compaction pressure can rise silently until the system feels haunted.
The first job of an operator isn't recovery. It's evidence.
Small systems fail quietly. Before dashboards. Before alerts. Before anyone is sure what they're looking at.
When that happens, the first question isn't "how do I fix it?" It's "what is actually true right now?"
Operators need a calm first move that is safe under stress and useful even when the control plane is degraded.
This stack grew out of real, persistent agent workloads - not demos, not sandboxes - systems that had to remain coherent across days and weeks.
The failures were rarely dramatic. That was the danger.
Agents would stall mid-run. Gateways would report healthy while the system quietly froze underneath. Watchdog loops would confirm liveness while actual work had stopped.
The logs looked normal. The port was open. The process was "fine."
The system was lying.
Out of those failures came a doctrine we now treat as non-negotiable:
Never recover first. Capture evidence first. Recovery without proof is just hoping the problem doesn't come back.
OpenClaw Triage Unit exists to make that doctrine easy. It produces a deterministic, timestamped proof bundle before anyone starts changing things. No screenshot archaeology. No ad-hoc log grepping. A structured artifact that survives the postmortem.
Evidence first. Recovery second.
Each tool was built the same way: a failure pattern became visible, got named, and then got a purpose-built response.
There was no roadmap. There was forensics.
Surfaces early instability signals before the system stalls.
Creates a read-only proof bundle for first-response triage.
Continuous detection for silent failures and drift signals that don't make it to the operator surface.
Token discipline and lane enforcement so background work can't quietly consume foreground budget.
Baseline comparison and drift analysis when predictability starts eroding over time.
Recovery readiness verification so "we can restore" isn't a guess.
Heartbeat supervision when liveness signals exist but the system isn't truly alive.
The sequence matters. Detect → observe → control → prove → recover. That order is not marketing. It is what operators actually need at 2am.
These tools were built to survive our own systems first - to make failures explainable and recoveries repeatable.
If you've operated agent systems long enough, you already know the feeling this stack is for.
If you haven't yet - this is the set of tools we wished existed before the first long night.
Next: Start with OpenClaw Triage Unit to capture evidence before recovery.