Saturday, October 20, 2012

Monitoring and Alerting

You cannot fully appreciate the ability to collect vital data and receive alerts when something malfunctions until you have hundreds or thousands of devices spread across the world.  Collecting and trending data from those devices provides a simple way to see when an anomaly occurs.  If you are not collecting and trending the data, there is no baseline of expected behavior to compare against.

Recently, we identified a network anomaly introduced by a setting on customer-premises devices that increased messaging from the affected devices by a factor of 12.  Luckily, we caught this when only 15% of the devices had been changed and quickly addressed the situation.  Had we not recognized it, it could have caused huge outages for our customers.  How did we notice the anomaly?  Simply reading and plotting the CPU% and memory% of our Session Border Controllers gave us the first clue that something was happening.  That prompted us to look at the network traffic, where we discovered a huge increase in the number of SIP NOTIFY messages.
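
To make the idea of comparing against known data concrete, here is a minimal sketch in Python of flagging samples that stray from a trailing baseline.  It is not the tooling we used; the function name, the window of 12 samples, the threshold of 3 standard deviations, and the CPU% readings are all made-up assumptions for illustration.

```python
from statistics import mean, stdev

def find_anomalies(samples, window=12, threshold=3.0):
    """Return the indexes of samples that deviate from the trailing baseline.

    window and threshold are illustrative defaults, not tuned values.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]   # the previous `window` samples
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Use a floor of 1.0 so a perfectly flat baseline does not
        # turn every tiny wiggle into an alert.
        limit = threshold * max(sigma, 1.0)
        if abs(samples[i] - mu) > limit:
            anomalies.append(i)
    return anomalies

# Hypothetical CPU% readings: steady around 21-23%, then a sudden jump.
cpu = [21, 22, 20, 23, 21, 22, 20, 21, 23, 22, 21, 22, 21, 23, 58, 61]
print(find_anomalies(cpu))
```

A check like this only gives the first clue, the same way the plots did; someone still has to look at the traffic to find the cause.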

Another practical example occurs when you need to purchase licenses for equipment.  In my experience, trending the number of licenses or devices in use is the best way to estimate how many licenses you'll need.  Otherwise, you're at the mercy of a marketing or sales group whose estimates typically have no basis in actual data.  If nothing else, the trend provides a good data point for validating information from other sources before it's too late.
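
As a rough sketch of what trending license usage can look like, the Python below fits a straight line to evenly spaced usage counts and projects it forward.  The function names and the monthly figures are hypothetical, and a straight line is only one possible assumption about growth; real usage may be seasonal or step-shaped.

```python
def fit_line(counts):
    """Least-squares slope and intercept for evenly spaced samples.

    Sample i is treated as x = i; needs at least two samples.
    """
    n = len(counts)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(counts) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, counts))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return slope, intercept

def project_usage(counts, periods_ahead):
    """Extend the fitted line `periods_ahead` periods past the last sample."""
    slope, intercept = fit_line(counts)
    return intercept + slope * (len(counts) - 1 + periods_ahead)

# Hypothetical licenses in use at the end of each of the last six months.
licenses_in_use = [410, 428, 455, 470, 492, 515]
print(round(project_usage(licenses_in_use, 6)))
```

The projected number is something to hold up against a sales forecast, not a replacement for judgment about upcoming launches or migrations.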

Collecting this data can be very simple with some Perl scripts and SNMP.  Once there's a simple implementation, you can invest the time and effort as needed to enhance the data collection and analysis.
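
The sketch below shows how small a first collector can be.  It is written in Python rather than Perl, and it deliberately leaves the SNMP part out: read_metric is a hypothetical callable you would supply yourself (for example, a wrapper around whatever SNMP library or command-line tool you already use).  The CSV layout and all names here are assumptions for illustration.

```python
import csv
import time

def record_sample(path, device, metric, value, timestamp=None):
    """Append one timestamped reading to a CSV file."""
    ts = time.time() if timestamp is None else timestamp
    with open(path, "a", newline="") as fh:
        csv.writer(fh).writerow([int(ts), device, metric, value])

def collect_once(path, devices, metric, read_metric):
    """Poll each device once and record the result.

    read_metric(device, metric) is a caller-supplied, hypothetical
    function that returns the current value, e.g. from an SNMP get.
    Returns the list of devices that could not be polled.
    """
    failed = []
    for device in devices:
        try:
            value = read_metric(device, metric)
        except Exception as exc:
            # One unreachable device should not stop the whole run.
            print(f"poll failed for {device}: {exc}")
            failed.append(device)
            continue
        record_sample(path, device, metric, value)
    return failed
```

Run from cron every few minutes, something like this produces a file that is easy to plot, and the list of failed devices is itself a useful thing to alert on.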
