SCOM: Heartbeat Failure Alert Tuning

I recently deployed SCOM in a highly distributed network. Most of the edge locations had slow WAN links and would often go offline. Between the slow links and the outages, SCOM would flood the team with alerts and emails from the Health Service Heartbeat Failure and Computer Not Reachable monitors.

This had to be tuned out because the alerts were overwhelming the team. Also, as soon as an edge location went offline, the team was already notified by other network monitoring tools and by the staff at that location.

These edge locations would often go offline because of power or ISP outages, and they could stay down for long periods, 2-3 days at a time. Fixing the issues was often out of the team's control, so receiving alerts from the edge locations during these outages was not helpful. The team still needed alerts right away if servers at the corporate locations went offline. There are several ways to tune alerts for these monitors.

One way to tune the Health Service Heartbeat Failure and Computer Not Reachable monitors is to adjust the heartbeat interval (default is 60 seconds) and the number of missed heartbeats SCOM will tolerate. Note that this is a global change in SCOM across all monitored servers. To access these settings, do the following:

In the SCOM console, go to Administration>>Settings. In the right-hand pane under Type: Agent you will see Heartbeat. Right-click Heartbeat and open the properties. In the same pane under Type: Server you will see another Heartbeat. Right-click that Heartbeat and open the properties as well. You can see this in the following screenshot:

clip_image001
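To get a feel for how these two settings interact, here is a minimal sketch of the arithmetic in Python. It is only an approximation of the behavior described above, not SCOM code. The function name is made up for illustration, and the value of 3 missed heartbeats is an assumption (a commonly cited default), so check the value in your own Server Heartbeat settings.

```python
def seconds_until_heartbeat_alert(interval_seconds: int, missed_allowed: int) -> int:
    """Rough time an agent can be silent before a heartbeat failure is raised.

    Approximation: heartbeat interval multiplied by the number of missed
    heartbeats SCOM will tolerate.
    """
    if interval_seconds <= 0 or missed_allowed <= 0:
        raise ValueError("interval and missed heartbeat count must be positive")
    return interval_seconds * missed_allowed


# 60 seconds is the default interval mentioned above.
# 3 missed heartbeats is an ASSUMPTION; verify it in your environment.
default_window = seconds_until_heartbeat_alert(60, 3)

# Hypothetical, more tolerant global setting for slow WAN links.
relaxed_window = seconds_until_heartbeat_alert(300, 5)

print(f"default: {default_window} s, relaxed: {relaxed_window} s")
```

With these example numbers the alert window grows from about 3 minutes to about 25 minutes. Because the setting is global, it would delay alerts for every monitored server, not only the ones on slow links.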

Another way to tune the alerts on these monitors is to adjust the heartbeat interval at the individual server level. This is only useful if you have a small number of servers generating these alerts and you know which servers they are. To access these settings in the SCOM console, go to Administration>>Device Management>>Agent Managed. Find your server(s). Right-click the server and select Properties. Under the Heartbeat tab, select the checkbox next to Override global agent settings and then adjust the Heartbeat interval.

clip_image002

For more information about both of these options, visit:

Heartbeat and Heartbeat Failure Settings in Operations Manager 2007

http://technet.microsoft.com/en-us/library/cc540380.aspx

Neither of those options helped in my situation, because we needed these alerts right away from one group of servers but not from another. Here is what I did to tune these monitors so that the team would not become overwhelmed by the alerts.

Before I go into the solution, I need to point out a few things about this particular environment.

  • The team did not want to monitor heartbeat or ping (basically, connectivity) to the edge servers at all. They were more interested in gathering performance data, the status of the applications on those servers, and more.
  • The servers in the edge locations had a different number sequence in the computer name than the servers in the corporate locations. The naming schema was structured like this:
    • Corporate location #1 server names: PROD100-xxV or PROD100-xxP.
    • Corporate location #2 server names: PROD200-xxV or PROD200-xxP.
    • Edge server names: PROD404-xxV or PROD404-xxP (404 stands for the number of that edge location, so it varies from edge to edge).

The naming schema was a big help in breaking things out. I created an edge server group in SCOM with dynamic membership that excludes all corporate locations. Here is what it looked like to build this:

clip_image003

Building the logic:

clip_image004

What it looks like in the group:

clip_image005

With that rule in place, the group contains all servers from all edge locations and none from the corporate locations. The member list is built dynamically, so the team never has to worry about adding new edge servers to the group by hand.
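The exclusion logic behind the group can be sketched outside of SCOM in a few lines of Python. This is only an illustration of the idea, not the actual membership rule from the screenshots. It assumes site codes are always three digits, that 100 and 200 are the only corporate codes, and the server names in the example are hypothetical.

```python
import re

# ASSUMPTION: 100 and 200 are the only corporate site codes (from the schema above).
CORPORATE_SITE_CODES = {"100", "200"}

# Matches names like PROD404-01V: "PROD", a site code, a dash, then the rest.
# ASSUMPTION: the site code is always exactly three digits.
NAME_PATTERN = re.compile(r"^PROD(\d{3})-", re.IGNORECASE)


def is_edge_server(name: str) -> bool:
    """Return True when the name follows the schema and is not a corporate site."""
    match = NAME_PATTERN.match(name)
    if match is None:
        # Names outside the schema are left out of the edge group.
        return False
    return match.group(1) not in CORPORATE_SITE_CODES


def edge_members(names):
    """Keep only the edge servers, preserving the input order."""
    return [name for name in names if is_edge_server(name)]


# Hypothetical server names for illustration only.
servers = ["PROD100-01V", "PROD200-02P", "PROD404-01V", "PROD517-03P", "FILESRV01"]
print(edge_members(servers))
```

Excluding the known corporate codes, instead of listing every edge code, is what keeps the group maintenance free: a new edge location with a new number falls into the group automatically.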
