Top 10 Alerts to Configure in RedEyes Host Monitor

Monitoring hosts effectively means getting the right alerts to the right people at the right time — no noise, no missed signals. Below are the top 10 alerts you should configure in RedEyes Host Monitor, why each matters, recommended thresholds and actions, and tips to reduce false positives.
1) Host Down / Unreachable
Why it matters:
- Indicates full service outage or network partition — immediate impact on users and dependent systems.
Recommended configuration:
- Frequency: check every 30–60 seconds for critical hosts; 2–5 minutes for less critical.
- Threshold: 3 consecutive failed pings or connection attempts before alerting.
- Actions: page on-call, trigger automated failover if available, run a network traceroute and collect interface stats.
Reduce false positives:
- Combine ICMP/ping with TCP port checks (e.g., SSH, HTTP) and service probes.
- Stagger checks across multiple monitoring nodes to avoid network blips causing false outages.
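To make the consecutive-failure logic concrete, here is a minimal Python sketch of the idea (not RedEyes' own check engine). It uses TCP connect probes, since ICMP normally requires raw-socket privileges, and the probed port list is an assumption to adapt per host:

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def host_reachable(host: str, ports=(22, 80, 443)) -> bool:
    """Treat the host as reachable if any probed service port answers."""
    return any(tcp_check(host, p) for p in ports)

def should_page(check_history: list, failures_required: int = 3) -> bool:
    """Page only after `failures_required` consecutive failed checks."""
    recent = check_history[-failures_required:]
    return len(recent) == failures_required and not any(recent)

# Each monitoring cycle: check_history.append(host_reachable("web01.example.com"))
```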
2) High CPU Usage
Why it matters:
- Sustained high CPU leads to slow responses, timeouts, and cascading failures.
Recommended configuration:
- Metric: CPU utilization percentage (or load averages on Unix-like systems).
- Thresholds: warn at 70–80% for 5–10 minutes; critical at 90%+ sustained for 5 minutes.
- Actions: notify ops, capture top processes, trigger autoscaling or migration.
Reduce false positives:
- Use sustained-duration thresholds and compare to baseline seasonal patterns.
- Alert on abnormal process spikes (e.g., a single process consuming >50% CPU).
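A minimal sketch of the sustained-duration approach, assuming the third-party psutil package is installed; the thresholds and sampling interval are illustrative, not RedEyes defaults:

```python
import psutil  # third-party: pip install psutil

WARN_PCT, CRIT_PCT = 80.0, 90.0
SAMPLE_INTERVAL = 15          # seconds per sample
SUSTAIN_SAMPLES = 20          # 20 x 15 s = 5 minutes sustained

def classify(window):
    """Alert only if every sample in the window is above the threshold."""
    if len(window) < SUSTAIN_SAMPLES:
        return "ok"
    floor = min(window[-SUSTAIN_SAMPLES:])
    if floor >= CRIT_PCT:
        return "critical"
    if floor >= WARN_PCT:
        return "warn"
    return "ok"

window = []
for _ in range(SUSTAIN_SAMPLES):
    # cpu_percent blocks for the interval and returns overall utilization
    window.append(psutil.cpu_percent(interval=SAMPLE_INTERVAL))
print(classify(window))
```

Using min() over the window means a single quiet sample resets the alert, which is exactly the sustained-duration behavior described above.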
3) Memory Pressure / Low Available Memory
Why it matters:
- Low available memory can cause swapping, degradation, or OOM kills.
Recommended configuration:
- Metric: free memory, available memory, swap usage.
- Thresholds: warn when available memory <20% or swap >30%; critical when swap >70% or available memory falls below roughly 10%.
- Actions: collect process list, memory maps; restart leaky processes or scale out.
Reduce false positives:
- On Linux, monitor "available" memory (which accounts for reclaimable caches and buffers) rather than raw free memory.
- Correlate with recent deployments or batch jobs.
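A minimal sketch of such a check, again assuming psutil; the cutoffs mirror the thresholds above and should be tuned per host class:

```python
import psutil  # third-party: pip install psutil

def memory_status(warn_avail_pct=20.0, warn_swap_pct=30.0,
                  crit_swap_pct=70.0, crit_avail_pct=10.0):
    """Classify memory pressure from *available* memory and swap usage."""
    vm = psutil.virtual_memory()
    swap = psutil.swap_memory()
    avail_pct = vm.available / vm.total * 100
    if avail_pct < crit_avail_pct or swap.percent > crit_swap_pct:
        return "critical"
    if avail_pct < warn_avail_pct or swap.percent > warn_swap_pct:
        return "warn"
    return "ok"

print(memory_status())
```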
4) Disk Space Low / Inode Exhaustion
Why it matters:
- Full disks break logging, databases, and application writes — often catastrophic.
Recommended configuration:
- Metric: percent used and inode usage per mount.
- Thresholds: warn at 75–85%; critical at 90–95% used, or inodes >85%.
- Actions: rotate logs, free temp files, grow volumes or mount additional storage.
Reduce false positives:
- Exclude transient filesystems (e.g., CI build tmpdirs) or set different thresholds per partition.
- Alert on sustained growth trends, not single spikes.
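A stdlib-only sketch of a per-mount check; os.statvfs is POSIX-only, and some filesystems report zero total inodes, which the code treats as "no inode data":

```python
import os
import shutil

def disk_status(mount="/", warn_pct=80.0, crit_pct=90.0, inode_crit_pct=85.0):
    """Check percent used and inode usage for a single mount point."""
    usage = shutil.disk_usage(mount)
    used_pct = usage.used / usage.total * 100

    st = os.statvfs(mount)          # POSIX-only inode counters
    inode_used_pct = 0.0
    if st.f_files:                  # zero total inodes means no inode data
        inode_used_pct = (st.f_files - st.f_ffree) / st.f_files * 100

    if used_pct >= crit_pct or inode_used_pct >= inode_crit_pct:
        return "critical"
    if used_pct >= warn_pct:
        return "warn"
    return "ok"

print(disk_status("/"))
```

Running this per mount point makes it easy to apply different thresholds to different partitions, as suggested above.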
5) High Network Latency / Packet Loss
Why it matters:
- Network issues cause slow user experiences and can break distributed systems.
Recommended configuration:
- Metrics: round-trip time, jitter, packet loss between monitoring nodes and host.
- Thresholds: warn when latency >100–200 ms depending on service; critical when packet loss >1–5% sustained.
- Actions: run traceroute, notify network team, failover traffic if available.
Reduce false positives:
- Use multiple probes from different collectors to rule out collector-network issues.
- Correlate with interface errors and router/switch alerts.
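A minimal sketch that uses TCP connect time and connect failures as rough proxies for latency and packet loss; a real deployment would run it from several collectors, as noted above. The host, port, and thresholds are placeholders:

```python
import socket
import statistics
import time

def probe(host, port=443, samples=20, timeout=2.0):
    """Measure TCP-connect RTT and failure rate as a rough latency/loss proxy."""
    rtts, failures = [], 0
    for _ in range(samples):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                rtts.append((time.monotonic() - start) * 1000)   # ms
        except OSError:
            failures += 1
        time.sleep(0.2)
    loss_pct = failures / samples * 100
    avg_rtt = statistics.mean(rtts) if rtts else float("inf")
    return avg_rtt, loss_pct

avg_rtt, loss = probe("example.com")
status = "critical" if loss > 5 else "warn" if avg_rtt > 150 or loss > 1 else "ok"
print(f"avg_rtt={avg_rtt:.1f} ms loss={loss:.1f}% -> {status}")
```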
6) High Disk I/O or I/O Wait
Why it matters:
- Heavy I/O can make applications unresponsive even with low CPU usage.
Recommended configuration:
- Metrics: IOPS, throughput (MB/s), %iowait.
- Thresholds: warn at application-specific baselines (e.g., iowait >20%); critical when iowait >50% or IOPS saturate device limits.
- Actions: identify top I/O processes, move heavy tasks to other disks, optimize queries or add caching.
Reduce false positives:
- Monitor per-disk and per-process metrics; compare against expected workload patterns.
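A sketch assuming psutil; note that the iowait field is Linux-specific, so the code falls back to 0.0 on other platforms:

```python
import psutil  # third-party: pip install psutil

def iowait_status(interval=5, warn=20.0, crit=50.0):
    """Classify %iowait; the iowait field only exists on Linux."""
    cpu = psutil.cpu_times_percent(interval=interval)
    iowait = getattr(cpu, "iowait", 0.0)
    if iowait >= crit:
        return "critical", iowait
    if iowait >= warn:
        return "warn", iowait
    return "ok", iowait

print(iowait_status())

# Per-device counters help pinpoint which disk is saturated.
for device, io in psutil.disk_io_counters(perdisk=True).items():
    print(device, io.read_bytes, io.write_bytes)
```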
7) Service-Specific Health Checks (HTTP 5xx, Database Connections)
Why it matters:
- Application-layer failures often precede or accompany infrastructure problems.
Recommended configuration:
- Metrics: HTTP response codes, response times, DB connection pool saturation, failed queries.
- Thresholds: warn on increased 5xx rates (e.g., >1% of requests) or rising response time percentiles; critical on sustained 5xx spikes or DB connections >90% of pool.
- Actions: collect application logs, restart services, roll back recent deploys.
Reduce false positives:
- Use rolling windows and rate-based alerts, not single-request failures.
- Tie alerts to deployments or config changes.
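A minimal sketch of a rate-based check over a rolling window of recent status codes, rather than alerting on any single failed request; the window size and 1% threshold are assumptions:

```python
from collections import deque

class ErrorRateWindow:
    """Rolling window of recent status codes: alert on rates, not single failures."""
    def __init__(self, size=1000):
        self.codes = deque(maxlen=size)

    def record(self, status):
        self.codes.append(status)

    def five_xx_rate(self):
        if not self.codes:
            return 0.0
        errors = sum(1 for c in self.codes if 500 <= c <= 599)
        return errors / len(self.codes) * 100

window = ErrorRateWindow()
for status in (200, 200, 503, 200, 500, 200, 200, 200):
    window.record(status)
if window.five_xx_rate() > 1.0:     # warn when >1% of recent requests are 5xx
    print(f"warn: 5xx rate {window.five_xx_rate():.1f}%")
```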
8) Certificate Expiry
Why it matters:
- Expired TLS certificates break secure connections and can cause user trust and availability issues.
Recommended configuration:
- Metric: days until certificate expiry.
- Thresholds: warn at 30 days; critical at 7 days or less.
- Actions: notify devops, trigger automated renewal pipelines, or failover to backup certs.
Reduce false positives:
- Monitor the certificate chain and the actual certificate presented by the service, not just stored copies.
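A stdlib-only sketch that checks the certificate actually presented on the wire, matching the advice above; the host name is a placeholder:

```python
import socket
import ssl
import time

def days_until_expiry(host, port=443, timeout=5.0):
    """Days remaining on the certificate actually presented by the service."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400

days = days_until_expiry("example.com")
status = "critical" if days <= 7 else "warn" if days <= 30 else "ok"
print(f"{days:.0f} days left -> {status}")
```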
9) Security/Intrusion Indicators (Unusual Auth Failures, Port Scans)
Why it matters:
- Early detection of compromise or brute-force attacks prevents bigger incidents.
Recommended configuration:
- Metrics: failed login attempts, new listening ports, unusual outbound connections, unexpected root logins.
- Thresholds: warn at small anomalies; critical on rapid spikes (e.g., >50 failed attempts in 5 minutes) or new persistent unauthorized users.
- Actions: block offending IPs, isolate host, initiate incident response.
Reduce false positives:
- Whitelist expected automation IPs, and correlate alerts with maintenance windows and configuration changes.
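A minimal sketch of a sliding-window counter for SSH auth failures. The log path, message format, and regex are assumptions that vary by distribution, and the log tailer itself is left to whatever tooling you already use:

```python
import re
import time
from collections import deque

WINDOW_SECONDS = 300      # 5-minute window
CRIT_FAILURES = 50
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")

events = deque()          # (timestamp, source_ip) tuples for failed logins

def record_line(line):
    """Append an event if the log line is an SSH auth failure."""
    match = FAILED_LOGIN.search(line)
    if match:
        events.append((time.time(), match.group(2)))

def check_window():
    """Critical when failures inside the window exceed the threshold."""
    now = time.time()
    while events and now - events[0][0] > WINDOW_SECONDS:
        events.popleft()
    return "critical" if len(events) >= CRIT_FAILURES else "ok"

# Feed record_line() from your log tailer while following /var/log/auth.log
# (the path and message format vary by distribution).
```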
10) Backup/Job Failures & Application-Specific Cron Jobs
Why it matters:
- Failed backups or scheduled jobs can cause data loss and missed business SLAs.
Recommended configuration:
- Metrics: job completion status, runtime, output error codes.
- Thresholds: any failure should be critical for backups; warn if runtime exceeds expected thresholds.
- Actions: notify owners, retry jobs, investigate job logs.
Reduce false positives:
- Use job heartbeats and success markers rather than inferring from log absence; correlate with upstream systems.
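A minimal sketch of the heartbeat pattern: the job writes an explicit success marker, and the monitor alerts when the marker is missing, failed, or stale. The file path and maximum age are assumptions:

```python
import json
import time
from pathlib import Path

HEARTBEAT = Path("/var/run/backup-heartbeat.json")   # assumed location
MAX_AGE_SECONDS = 26 * 3600                          # daily job plus some slack

def write_heartbeat(status, runtime_seconds):
    """Called by the backup job itself when it finishes."""
    HEARTBEAT.write_text(json.dumps({
        "status": status,
        "runtime": runtime_seconds,
        "finished_at": time.time(),
    }))

def check_heartbeat():
    """Monitor side: alert on explicit failure or a stale/missing heartbeat."""
    if not HEARTBEAT.exists():
        return "critical"                 # no success marker at all
    beat = json.loads(HEARTBEAT.read_text())
    if beat["status"] != "success":
        return "critical"
    if time.time() - beat["finished_at"] > MAX_AGE_SECONDS:
        return "critical"                 # job has silently stopped running
    return "ok"
```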
Alerting Best Practices for RedEyes Host Monitor
- Use multi-stage alerts: warning → critical → escalations.
- Attach contextual runbooks and automated diagnostics (logs, top, netstat, traces) to alerts.
- Route alerts by severity and service ownership; avoid global paging for non-critical issues.
- Implement maintenance windows and alert suppression for deployment periods.
- Tune thresholds based on historical baselines and adjust after post-incident reviews.
- Use dependent/compound alerts to suppress floods (e.g., suppress service alerts when host is down).
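As a small illustration of dependent/compound suppression, here is a sketch that drops child alerts when their parent alert is already firing; the alert names and dependency map are hypothetical:

```python
def effective_alerts(raw_alerts, depends_on):
    """Drop child alerts whose parent (e.g., the host itself) is already firing."""
    firing = {name for name, active in raw_alerts.items() if active}
    return {name for name in firing if depends_on.get(name) not in firing}

raw = {"host-down:web01": True, "http-5xx:web01": True, "disk-full:db01": True}
deps = {"http-5xx:web01": "host-down:web01"}
print(effective_alerts(raw, deps))   # {'host-down:web01', 'disk-full:db01'}
```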
Example Alert Rule Template (variables to adapt)
- Name: [Service] — [Metric] — [Severity]
- Scope: host group or tag (e.g., prod:web-servers)
- Condition: metric X > Y for Z minutes OR consecutive failed checks N
- Notification: on-call rotation via Pager/SMS/Slack; escalation policy after M minutes
- Automated actions: run diagnostics script, collect a heap or thread dump, trigger autoscale or failover
- Runbook link: URL to playbook with triage steps
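For illustration, the same template expressed as a small Python structure; the field names and example values are assumptions, not RedEyes Host Monitor's actual configuration format:

```python
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    """Illustrative structure only; field names are assumptions, not RedEyes' schema."""
    name: str                          # "[Service] - [Metric] - [Severity]"
    scope: str                         # host group or tag, e.g. "prod:web-servers"
    metric: str
    threshold: float
    duration_minutes: int              # "for Z minutes"
    severity: str                      # "warning" or "critical"
    notify: list = field(default_factory=list)
    escalate_after_minutes: int = 15
    automated_actions: list = field(default_factory=list)
    runbook_url: str = ""

rule = AlertRule(
    name="web - http_5xx_rate - critical",
    scope="prod:web-servers",
    metric="http_5xx_rate_percent",
    threshold=1.0,
    duration_minutes=5,
    severity="critical",
    notify=["oncall-web"],
    automated_actions=["collect_app_logs", "capture_thread_dump"],
    runbook_url="https://runbooks.example.com/web-5xx",
)
print(rule)
```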
Final notes
Start with these ten alerts and iterate: monitor what produces noisy alerts, adjust thresholds, and expand visibility (application metrics, distributed tracing). The aim is timely, actionable alerts that reduce mean time to detect and resolve real problems.