As a network becomes larger and more complex, the amount of alerts generated increases and in turn causes network managers to be bombarded with multiple notifications about all the activities and issues in their network. Having a robust alerting mechanism helps network engineers to troubleshoot key issues faster.
Intelligent alerting avoids unnecessary notifications so you can focus on those that are most important. In addition, intelligent alerts can be set in various ways. For example, to notify different people on different days, different times of the day, different people for different events, or any combination of times, events, and people.
Automated alerts are essential to monitoring. They allow you to spot problems anywhere in your infrastructure, so that you can rapidly identify their causes and minimize service degradation and disruption.
Not all alerts carry the same degree of urgency. Some require immediate human intervention, some require eventual human intervention, and some point to areas where attention may be needed in the future. All alerts should, at a minimum, be logged to a central location for easy correlation with other metrics and events.
Alerts as records (low severity)
Many alerts will not be associated with a service problem, so a human may never even need to be aware of them. For instance, when a data store that supports a user-facing service starts serving queries much slower than usual, but not slow enough to make an appreciable difference in the overall service’s response time, that should generate a low-urgency alert that is recorded in your monitoring system for future reference or investigation but does not interrupt anyone’s work. After all, transient issues that could be to blame, such as network congestion, often go away on their own. But should the service start returning a large number of timeouts, that alert-based data will provide invaluable context for your investigation.
Alerts as notifications (moderate severity)
The next tier of alerting urgency is for issues that do require intervention, but not right away. Perhaps the data store is running low on disk space and should be scaled out in the next several days. Sending an email and/or posting a notification in the service owner’s chat room is a perfect way to deliver these alerts—both message types are highly visible, but they won’t wake anyone in the middle of the night or disrupt an engineer’s flow.
Alerts as pages (high severity)
The most urgent alerts should receive special treatment and be escalated to a page (as in “pager”) to urgently request human attention. Response times for your web application, for instance, should have an internal SLA that is at least as aggressive as your strictest customer-facing SLA. Any instance of response times exceeding your internal SLA would warrant immediate attention, whatever the hour.