The 3 AM Alert That Taught Me About Monitoring
It was 3 AM when the Slack notification finally came through. Not from Atreus — from a cron job that checks if Atreus is alive. The agent had gone silent 9 hours earlier. Nine hours of missed checkpoints, unprocessed tasks, and growing queue depth.
What Happened
Atreus runs on a heartbeat system — every 30 minutes, it wakes up, checks for due tasks, processes them, and checkpoints state. The heartbeat stopped at 6 PM the previous evening. The process was still running, but it had deadlocked on a file lock. No crash, no error log, just... silence.
Why I Didn't Notice
I had monitoring for crashes and errors. I had monitoring for high CPU and memory. What I didn't have was monitoring for the absence of activity. The system was "healthy" by every metric I was tracking — it just wasn't doing anything.
The Fix: Dead Man's Switch
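I built a dead man's switch. Every heartbeat writes a timestamp to a state file. A separate cron job (on a different machine, so it survives whatever takes down the agent's host) checks that file every 15 minutes. The threshold is 45 minutes, one and a half heartbeat intervals, so a single slow cycle does not trip it. If the timestamp is older than that, it escalates: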
- First step: send me a Telegram message
- Second step (15 min later): force-restart the agent process
- Third step (30 min later): page me with a phone call
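The check described above can be sketched roughly as follows. This is a minimal illustration, not the code Atreus actually runs: the JSON file layout, the function names, and the `ACTIONS` labels are all hypothetical, and I am assuming both "later" offsets are measured from the first alert, so the steps land at 45, 60, and 75 minutes of silence. The real Telegram, restart, and phone-call actions are left out.

```python
import json
import os
import tempfile
import time

STALE_AFTER_S = 45 * 60  # from the post: a heartbeat older than 45 minutes is stale
STEP_S = 15 * 60         # assumption: each escalation step comes 15 minutes after the previous one

# Hypothetical labels; the real actions (Telegram, restart, phone call) are not shown.
ACTIONS = {0: "ok", 1: "telegram", 2: "force_restart", 3: "phone_call"}


def write_heartbeat(path, now=None):
    """Agent side: store the current time, replacing the file atomically."""
    now = time.time() if now is None else now
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"ts": now}, f)
    os.replace(tmp_path, path)  # a reader never sees a half-written file


def read_heartbeat(path):
    """Checker side: return the last heartbeat time, or None if unreadable."""
    try:
        with open(path) as f:
            return float(json.load(f)["ts"])
    except (OSError, ValueError, KeyError, TypeError):
        return None


def escalation_level(age_s):
    """Map heartbeat age in seconds to a level: 0 healthy, 1 to 3 escalating."""
    if age_s <= STALE_AFTER_S:
        return 0
    overdue = age_s - STALE_AFTER_S
    return min(3, 1 + int(overdue // STEP_S))


def check(path, now=None):
    """Checker side: map the state file to an action label."""
    now = time.time() if now is None else now
    beat = read_heartbeat(path)
    if beat is None:
        # Assumption: a missing or corrupt file gets a first-level alert.
        return ACTIONS[1]
    return ACTIONS[escalation_level(now - beat)]
```

Because the level is derived purely from the age of the timestamp, a successful force-restart resets it to healthy on the next heartbeat, and the phone call only fires if the restart did not help. A stateless checker like this repeats the current action on every run while the file stays stale, so a real one would probably want to remember what it has already done.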
What Else I Added
- Queue depth monitoring: alert if tasks are accumulating faster than they're being processed
- Checkpoint freshness: alert if the last checkpoint file is stale
- Process liveness: not just "is the PID running" but "is it making syscalls"
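Here is one way those three checks might look, as a sketch under assumptions rather than what I actually deployed. All function names and thresholds are hypothetical. "Accumulating faster than processed" is approximated as "queue depth rose on several consecutive samples", and "making syscalls" is approximated with the `syscr` and `syscw` counters in Linux's `/proc/<pid>/io`, which only count read-style and write-style calls, not every syscall.

```python
import os
import time


def queue_is_backing_up(depths, min_samples=3):
    """True if the last `min_samples` queue-depth readings are strictly increasing."""
    recent = list(depths)[-min_samples:]
    if len(recent) < min_samples:
        return False
    return all(a < b for a, b in zip(recent, recent[1:]))


def checkpoint_is_stale(path, max_age_s, now=None):
    """True if the checkpoint file is missing or was last modified too long ago."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True
    return now - mtime > max_age_s


def parse_proc_io(text):
    """Parse the 'key: value' lines of Linux /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def made_io_syscalls(before, after):
    """True if the read or write syscall counters advanced between two samples."""
    return (after.get("syscr", 0) > before.get("syscr", 0)
            or after.get("syscw", 0) > before.get("syscw", 0))


def sample_proc_io(pid):
    """Linux only: read the counters for `pid` (needs permission to read the file)."""
    with open(f"/proc/{pid}/io") as f:
        return parse_proc_io(f.read())
```

The liveness check works by calling `sample_proc_io` twice and passing both results to `made_io_syscalls`. An agent that is legitimately asleep between heartbeats will also show flat counters, so the two samples need to be far enough apart to span at least one expected heartbeat.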
The Lesson
Monitoring isn't about watching metrics. It's about detecting the gap between expected behavior and actual behavior. The most dangerous failures aren't the loud ones — they're the silent ones. Build your systems to notice when nothing is happening.