engineering · Apr 03, 2026 · 5 min read

The 3 AM Alert That Taught Me About Monitoring

It was 3 AM when the Telegram notification finally came through. It came not from my OpenClaw agent but from a cron job whose only purpose is to check whether the agent is still alive. The agent had gone silent nine hours earlier. Nine hours of missed checkpoints, unprocessed tasks, and a growing queue nobody was watching.

What Happened

The agent runs on a heartbeat: every 30 minutes it wakes up, checks for due tasks, processes them, and checkpoints its state. The heartbeat stopped at 6 PM the previous evening. The process itself was still running. It had deadlocked on a file lock and just stopped. No crash, no error log, no stack trace. It sat there doing nothing while looking, to every check I had, like it was fine.

Why I Didn't Notice

I had monitoring for crashes. I had monitoring for high CPU and memory. What I didn't have was monitoring for the absence of activity, so a process that is alive but doing nothing sailed straight through every check I'd built. By every metric I was tracking, the system was healthy. It just wasn't doing anything.

The Fix: A Dead Man's Switch

Every heartbeat now writes a timestamp to a state file. A separate cron job, running on a different machine so it can't fail the same way at the same time, checks that file every 15 minutes. If the timestamp is older than 45 minutes, meaning at least one 30-minute heartbeat has been missed, the cron job escalates:

First alert: a Telegram message to me.
Second alert, 15 minutes later: force-restart the agent process.
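
Here is a rough Python sketch of that switch, not the exact script I run. The JSON layout, the function names, and the idea that the watcher can read the state file over a shared or synced path are all assumptions made for illustration. Treating a missing or unreadable file as stale is also my assumption here.

```python
import json
import os
import tempfile
import time

STALE_AFTER_SECONDS = 45 * 60    # staleness threshold from the post
RESTART_AFTER_SECONDS = 15 * 60  # gap between the first alert and the forced restart


def write_heartbeat(path, now=None):
    """Agent side: atomically record the time of the latest heartbeat."""
    now = time.time() if now is None else now
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump({"last_heartbeat": now}, tmp)
    # Rename into place so the watcher never sees a half-written file.
    os.replace(tmp_path, path)


def read_heartbeat(path):
    """Watcher side: return the last heartbeat time, or None if unreadable."""
    try:
        with open(path) as f:
            return float(json.load(f)["last_heartbeat"])
    except (OSError, ValueError, KeyError, TypeError):
        return None


def decide_action(last_heartbeat, first_alert_at, now):
    """Decide what one watcher run should do.

    Returns 'ok', 'alert' (send the first message), 'wait', or 'restart'.
    first_alert_at is when the first alert went out, or None if none has.
    """
    if last_heartbeat is not None and now - last_heartbeat <= STALE_AFTER_SECONDS:
        return "ok"
    if first_alert_at is None:
        return "alert"
    if now - first_alert_at >= RESTART_AFTER_SECONDS:
        return "restart"
    return "wait"
```

The decision is kept as a pure function of three timestamps so it can be tested without touching Telegram or the process manager; the cron script would only need to persist first_alert_at between runs and clear it once the result is 'ok' again.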

The restart step loops on its own instead of firing once and hoping. If the process comes back up, I get a success notification and that's the end of it. If it doesn't, the attempt gets logged and the switch tries again. Every 5 failed attempts, it pings me directly instead of waiting for me to notice. That cycle, 5 attempts then a ping, repeats up to 3 times. After the third round, 15 attempts in, it stops trying and stops paging me. At that point the problem needs me at the keyboard directly.
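
The retry logic is easier to see as code. This is a minimal sketch under assumptions: restart, notify, and log are hypothetical callables standing in for whatever restarts the process, sends the Telegram message, and writes the log, and the pause between attempts is a placeholder because the exact interval isn't something I've pinned down above.

```python
import time


def run_restart_cycle(restart, notify, log,
                      attempts_per_round=5, max_rounds=3, delay_seconds=0):
    """Retry a restart in rounds; ping after each failed round.

    restart() should return True if the agent came back up.
    Returns True on success, False once every round is exhausted.
    """
    attempt = 0
    for round_number in range(1, max_rounds + 1):
        for _ in range(attempts_per_round):
            attempt += 1
            if restart():
                notify(f"agent restarted on attempt {attempt}")
                return True
            log(f"restart attempt {attempt} failed")
            if delay_seconds:
                time.sleep(delay_seconds)
        # A full round failed: page a human instead of failing quietly.
        notify(f"{attempt} restart attempts failed "
               f"(round {round_number} of {max_rounds})")
    # 15 attempts and 3 pings with the defaults; stop trying and stop paging.
    return False
```

With the defaults, a restart that never succeeds produces 15 log entries and 3 pings, then the function returns False and goes quiet, which is the point where I have to take over by hand.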

What Else I Added

Queue depth monitoring. Alerts if tasks are piling up faster than they're being processed.
Checkpoint freshness. Alerts if the last checkpoint file is stale, independent of the heartbeat check.
Process liveness. Confirms the process is making syscalls, since a hung process still holds a valid PID.
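
A sketch of what these three checks could look like, with every name and threshold invented for illustration. The liveness part is the shakiest: it assumes Linux and uses the read/write syscall counters in /proc/<pid>/io as a stand-in for "making syscalls", which only covers I/O calls and needs permission to read that file.

```python
import os
import time


def queue_is_backing_up(depth_samples, min_samples=3):
    """True if queue depth rose strictly across the last min_samples readings."""
    recent = depth_samples[-min_samples:]
    if len(recent) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def checkpoint_is_stale(path, max_age_seconds, now=None):
    """True if the checkpoint file is missing or older than max_age_seconds."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) > max_age_seconds
    except OSError:
        return True


def read_io_syscalls(pid):
    """Linux only: sum of syscr and syscw from /proc/<pid>/io."""
    counts = {}
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            key, _, value = line.partition(":")
            counts[key.strip()] = int(value)
    return counts["syscr"] + counts["syscw"]


def process_is_active(count_before, count_after):
    """True if the syscall counter moved between two samples."""
    return count_after > count_before
```

The two samples passed to process_is_active need to be far enough apart that a healthy agent would have done some I/O in between; taking them closer together than one heartbeat interval would flag a perfectly idle agent as hung.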

The Lesson

The real lesson was that green metrics don't mean the job is getting done; they just mean nothing has crashed yet. A crash pages you. A quiet deadlock doesn't, unless you've built something to catch it. Nine hours of silence with every dashboard reading fine is exactly the kind of failure that slips through. Now something is actually watching for that.