Sunday, March 11, 2007

Understanding "Heartbeats"

MOM 2005 agents routinely report their presence to their assigned management server by sending a heartbeat. Understanding agent heartbeating is helpful as adjustment to the default values may be beneficial in some environments. Let's go through how this works.
Heartbeating is divided into two parts - the agent and the management server. The agent heartbeat settings are adjusted through global settings on the management server(s) as shown in figure 1 of the attachment.

The one configurable setting for agents is the 'heartbeat interval'. By default, the agent is configured to send a heartbeat via UDP port 1270 every 10 seconds. You will note that this screen also shows the management server 'heartbeat scan interval' - this value defines how often the management server will look for a heartbeat from a particular agent. More on that in a moment but for now just note that the heartbeat scan interval needs to be longer, by default three times longer, than the heartbeat interval.
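To make the fire-and-forget nature of a UDP heartbeat concrete, here is a minimal Python sketch. It is not MOM's agent code: the function name `send_heartbeats`, the payload bytes, the `count` parameter and the server name in the usage line are all made up for illustration, and the real heartbeat packet format is not covered in this post. Only the port (1270) and the 10 second interval come from the settings described above.

```python
import socket
import time


def send_heartbeats(address, count, interval=10.0, payload=b"heartbeat"):
    """Send `count` UDP datagrams to `address`, `interval` seconds apart.

    `payload` is a placeholder, not the real MOM heartbeat format.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for i in range(count):
            # UDP is connectionless: sendto() returns once the datagram is
            # handed to the network stack, with no acknowledgement from the
            # receiver. The sender never learns whether it arrived.
            sock.sendto(payload, address)
            if i < count - 1:
                time.sleep(interval)
```

A call such as `send_heartbeats(("mgmt-server.example", 1270), count=3)` would send three datagrams over 20 seconds. Because nothing confirms delivery, it is up to the receiving side to decide how long to wait before treating silence as a problem.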

On the management server side we have several more configuration options as shown in figure 2 of the attachment.

The first block of settings is to configure 'Heartbeat Scan'. There are two options here. The first option, 'Interval to Scan for Agent Heartbeats', defines how often the management server will look to see if it has received a heartbeat from an agent. The default setting is 30 seconds. As you will recall, the agent will, by default, send in a heartbeat every 10 seconds. With the default settings, then, the agent will have up to 3 opportunities to send up a heartbeat before the management server looks to see if one has been received. Since heartbeats are sent over UDP, which does not guarantee delivery, it's possible one may not arrive. Using these settings, MOM accounts for that fact and avoids flagging a problem simply because of a transient communications failure.
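A quick back-of-the-envelope sketch shows why three opportunities per scan help. The function name `missed_scan_probability` and the idea of a fixed, independent per-packet loss rate are assumptions of mine for illustration; real network loss is often bursty, so treat the result as a rough guide only.

```python
def missed_scan_probability(loss_rate, heartbeat_interval=10, scan_interval=30):
    """Chance that every heartbeat in one scan window is lost.

    Assumes each datagram is lost independently with probability
    `loss_rate` (a simplification, not a property of MOM).
    """
    # Number of heartbeats the agent sends during one scan window.
    opportunities = scan_interval // heartbeat_interval
    # The scan only comes up empty if all of them are lost.
    return loss_rate ** opportunities
```

With the defaults and a 10% loss rate, a healthy agent would miss a whole scan window roughly one time in a thousand. Raising the heartbeat interval to 30 seconds would leave only one opportunity per scan, and the same loss rate would produce an empty scan one time in ten.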

Also in the first block is the setting 'Scan agentless computers every specified number of times Management Server performs agent scan'. The default setting is 3. This setting is specific to machines that are agentless monitored - not a common scenario - and by default indicates that the management server should scan agentless machines every 90 seconds (3 times 30 seconds as defined for agent managed machines).

The second block of settings is to configure 'Heartbeat Ping' behavior. During heartbeat checking, as we will see in a minute, each time the management server looks for a heartbeat and fails to find one, MOM will initiate a ping to determine if the agent machine is actually online. Just because a machine fails to send a heartbeat doesn't mean that the machine is down - MOM uses ping checks to distinguish machines that are offline from those that simply haven't sent a heartbeat.

The 'Number of Ping attempts' setting defines how many pings will be done to determine if the target machine responds. The 'Time between pings' setting defines how long to wait between each ping attempt. The 'Ping time out' setting defines how long to wait without hearing a response before the ping attempt is considered a failure. The 'Number of scans before generating service unavailability' setting defines how many scan attempts will be done prior to flagging the MOM agent service as unavailable.

Let's pull all of this together to discuss how this mechanism works. Assume all settings are at their defaults and a MOM agent that has been heartbeating every 10 seconds suddenly stops - due to a system problem, server reboot, etc. The MOM management server is somewhere in its 30 second detection period when this happens. Assuming MOM has received a valid heartbeat within the current 30 second window, the management server will wait for another 30 second period and then check again for a heartbeat. This time no heartbeat will be seen, and in response the management server will initiate a series of pings.

If the ping attempt fails, MOM will immediately generate an event/alert indicating the ping failed and the target machine may be down. When a machine is actually down, then, the notification happens as close to real time as possible.

If the ping attempt succeeds, MOM will wait another 30 second window to see if a heartbeat arrives. If no heartbeat has arrived by the end of that second window, MOM will again initiate the ping test to verify the system is online. If that succeeds, MOM will wait a third 30 second window and, if there is still no heartbeat, will initiate a third series of pings. If that comes back OK, MOM will generate an event/alert indicating that it is not receiving current heartbeats from the agent but did verify the agent machine was online. Remember, the 3 scan attempts are driven by the setting on the management server and are configurable.
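The decision flow just described can be sketched in a few lines of Python. This is my own model of the narrative, not MOM's implementation: the function name `evaluate_scans`, the outcome strings, and the choice to reset the miss counter whenever a heartbeat is seen are all assumptions made for illustration.

```python
def evaluate_scans(scans, scans_before_unavailable=3):
    """Walk through successive scans and report the first alert, if any.

    `scans` is an iterable of (heartbeat_seen, ping_ok) pairs, one per
    scan window. Returns (scan_number, outcome), or None if no alert fires.
    """
    missed = 0
    for number, (heartbeat_seen, ping_ok) in enumerate(scans, start=1):
        if heartbeat_seen:
            missed = 0  # assumption: a fresh heartbeat clears the count
            continue
        missed += 1
        if not ping_ok:
            # No heartbeat and no ping reply: alert right away.
            return number, "ping failed, machine may be down"
        if missed >= scans_before_unavailable:
            # Machine answers pings but the agent has stayed silent.
            return number, "no heartbeat, machine online"
    return None
```

For example, `evaluate_scans([(True, True), (False, False)])` alerts on the second scan with the machine-down outcome, while three consecutive `(False, True)` scans are needed before the agent-unavailable outcome is reported.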

Based on the above description, you may see an event/alert combination after approx. 30 seconds when MOM realizes a machine is totally offline or, if the machine is actually OK but the MOM agent is the one having problems, there will be a delay of approx. 2 minutes before receiving the heartbeat failure event/alert.
These default settings can be adjusted to fit the needs of each operating environment - but it is crucial to understand how all of these settings interact to predict the end behavior of MOM. If, for example, the default number of scans was adjusted from 3 to 10, MOM would delay notification on missing heartbeats for approx. 6-7 minutes. This time period may be even more drastically affected by adjusting combinations of settings.
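To get a feel for how the scan settings drive the delay, here is a small Python sketch of the arithmetic. The function name `detection_window` is hypothetical, and the model counts only scan windows: it leaves out the time spent in each ping series (attempts, time between pings, timeouts), so real delays will run somewhat longer than these figures.

```python
def detection_window(scan_interval=30, scans_before_alert=3):
    """Return (earliest, latest) seconds from the last heartbeat to the alert.

    Counts scan windows only; ping time is deliberately not modelled.
    """
    # Each scan that counts toward the alert needs one full empty window.
    earliest = scans_before_alert * scan_interval
    # The agent can stop anywhere inside the window already in progress,
    # which still contains a heartbeat and so adds up to one more interval.
    latest = earliest + scan_interval
    return earliest, latest
```

With the defaults this gives a range of 90 to 120 seconds for an agent that has stopped on a machine that still answers pings. Passing `scans_before_alert=1` covers the machine-down case, where the first failed ping raises the alert, and gives 30 to 60 seconds. Changing either setting scales the range accordingly, before any ping time is added on top.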

One further comment on this. MOM heartbeat data is stored in the database, but this information is NOT what is used to determine the last heartbeat received from an agent. Instead, each management server maintains an in-memory list of its managed agents and their last heartbeat times. This is what is used for heartbeat checking.
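As a rough picture of what such an in-memory list might look like, here is a minimal Python sketch. The class name `HeartbeatTable`, its methods, and the use of a plain dictionary are my own illustration; the actual data structure inside the management server is not described in this post.

```python
class HeartbeatTable:
    """In-memory map of agent name to last heartbeat time (in seconds)."""

    def __init__(self):
        self._last_seen = {}

    def record(self, agent, now):
        # Called whenever a heartbeat arrives; overwrites the previous time.
        self._last_seen[agent] = now

    def overdue(self, now, scan_interval=30.0):
        # Agents with no heartbeat inside the most recent scan window.
        return sorted(
            agent
            for agent, seen in self._last_seen.items()
            if now - seen > scan_interval
        )
```

A scan then amounts to calling `overdue()` once per interval and pinging whatever comes back. Nothing here touches a database, which matches the point above that the stored heartbeat data is not what drives the check.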

I may blog more on this in future submissions, as there are even more 'behind the scenes' details as to how this works, both in terms of the mechanics and the rules that detect these potential failures.

- excerpt from Steve Rachui's Manageability blog (http://blogs.msdn.com/steverac/archive/2006/02/11/530292.aspx)
