Friday, March 23, 2007

Rules vs. Monitors

Back in the days of MOM 2005, all monitoring was done through what were called 'Rules'. We would create a rule to track events and generate alerts.

In Ops Manager 2007, a new component of Management Packs is introduced: 'Monitors'!

So this is how I understand it...

MONITORS:
Used to assess various conditions that can occur on monitored objects, such as:

  • the value of a performance counter
  • the existence of an event
  • the occurrence of data in a log file
  • the status of a Windows Service
  • the occurrence of an SNMP trap

The result of this assessment determines the health state of a target and the alerts that are generated.

RULES:

Used to collect data, such as events, generated by managed objects.
They can also be used instead of monitors to generate alerts when the data collected from managed objects DOES NOT indicate the health state of those objects.
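
To make the distinction concrete, here is a minimal sketch in Python. It is not Ops Manager code or its API; the class names, the "Healthy"/"Critical" state strings and the event ID are all hypothetical and only illustrate the idea that a monitor owns a health state while a rule just collects data and may raise an alert.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceMonitor:
    """Hypothetical monitor: assesses a condition and sets a health state."""
    health_state: str = "Healthy"
    alerts: list = field(default_factory=list)

    def assess(self, service_running: bool) -> str:
        new_state = "Healthy" if service_running else "Critical"
        # Alert only on the transition into the unhealthy state.
        if new_state == "Critical" and self.health_state != "Critical":
            self.alerts.append("Service stopped")
        self.health_state = new_state
        return self.health_state


@dataclass
class EventCollectionRule:
    """Hypothetical rule: collects data and may alert, but has no health state."""
    match_event_id: int
    collected: list = field(default_factory=list)
    alerts: list = field(default_factory=list)

    def process(self, event_id: int, message: str) -> None:
        self.collected.append((event_id, message))
        if event_id == self.match_event_id:
            self.alerts.append(f"Event {event_id}: {message}")
```

The monitor changes the target's health state; the rule never does, even when it generates an alert.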

A useful experience

I delivered a workshop on Operations Manager 2007 this week and it was fantastic! The amount of learning and discovery was way beyond expectations. Here are some of the questions asked during the workshop, with the answers:

1. When is MOM 2005's end of life?
[ANS] Mainstream Support will end Jan 2010 and Extended Support will end Jan 2015.

2. Do you need an OML to monitor an SNMP Printer?
[ANS] Yes. At the moment, the only devices that DO NOT need an OML are those operating at OSI Layer 3 and below (i.e., routers, switches, hubs).

3. Does Audit Collection Service need a separate license?
[ANS] No. ACS licensing is part of the OML.

4. Can the Audit Collection Service Database be installed on the same server as the ACS Server?
[ANS] Yes, but it is not recommended. The amount of data that ACS collects is huge!

5. Can Ops Manager 2007 be used to monitor Linux servers?
[ANS] Yes, with the use of third party providers such as eXc Software, Engyro, Quest, Jalasoft, etc. Essentially, these providers communicate with the non-Windows environment and feed information back to Ops Manager. For example, products like eXc Software support a wide range of connectors, from operating systems to network devices, printers, UPSes, storage, etc.

There were more questions, but I need to digest them first and find the answers. Will update on this soon.

Sunday, March 18, 2007

Ops Mgr 2007 RC2 Installation Order

Dealing with BETA code is always a challenge. Here was my final installation order which reported no errors:

  1. Install Windows Server 2003 SP1
  2. Upgrade to Windows Server 2003 R2
  3. Install SQL 2005
  4. Install KB918222 (http://support.microsoft.com/default.aspx/kb/918222/en-us)
  5. Install SQL 2005 SP2 (yes, KB918222 is supposed to be in SP2, but somehow if you install SP2 directly, the pre-req checker will report that it is missing)
  6. Install .NET Framework 3.0
  7. Install Windows PowerShell (if you are planning to experiment with the Command Shell)
  8. Run SetupOM.exe from the SCOM Source directory to install Ops Mgr 2007

Once done, open up the Consolidated Operator Console and use the Discovery Wizard to discover servers and clients on the network.

Then, import management packs from the Management Pack folder in the SCOM course files.

A couple of days ago, Microsoft released Windows Server 2003 Service Pack 2. I have not tested installing it or how it would affect Ops Mgr. Will post an update after trying it out.

Sunday, March 11, 2007

Understanding "Heartbeats"

MOM 2005 agents routinely report their presence to their assigned management server by sending a heartbeat. Understanding agent heartbeating is helpful because adjusting the default values may be beneficial in some environments. Let's go through how this works.
Heartbeating is divided into two parts - the agent and the management server. The agent heartbeat settings are adjusted through global settings on the management server(s) as shown in figure 1 of the attachment.

The one configurable setting for agents is the 'heartbeat interval'. By default, the agent is configured to send a heartbeat via UDP port 1270 every 10 seconds. You will note that this screen also shows the management server 'heartbeat scan interval' - this value defines how often the management server will look for a heartbeat from a particular agent. More on that in a moment but for now just note that the heartbeat scan interval needs to be longer, by default three times longer, than the heartbeat interval.

On the management server side we have several more configuration options as shown in figure 2 of the attachment.

The first block of settings is to configure 'Heartbeat Scan'. There are two options here. The first option, 'Interval to Scan for Agent Heartbeats', defines how often the management server will look to see if it has received a heartbeat from an agent. The default setting is 30 seconds. As you will recall, the agent will, by default, send in a heartbeat every 10 seconds. With the default settings, then, the agent will have up to 3 opportunities to send up a heartbeat before the management server looks to see if one has been received. Since heartbeats are sent over UDP, it's possible one may not arrive. Using these settings, MOM accounts for that fact and avoids flagging a problem simply because of a potential and transient communications failure.

Also in the first block is the setting 'Scan agentless computers every specified number of times Management Server performs agent scan'. The default setting is 3. This setting is specific to machines that are agentless monitored - not a common scenario - and by default indicates that the management server should scan agentless machines every 90 seconds (3 times 30 seconds as defined for agent managed machines).
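
The relationship between these defaults is simple arithmetic, sketched below in Python. The class and field names are hypothetical (they are not MOM setting names or any MOM API); only the default values of 10 seconds, 30 seconds and 3 come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatSettings:
    """Hypothetical container for the defaults described in the text."""
    heartbeat_interval_s: int = 10        # agent sends a heartbeat this often
    scan_interval_s: int = 30             # server checks for heartbeats this often
    agentless_scan_multiplier: int = 3    # agentless scan every N agent scans

    def heartbeats_per_scan(self) -> int:
        # Whole heartbeats an agent can send inside one scan window.
        return self.scan_interval_s // self.heartbeat_interval_s

    def agentless_scan_interval_s(self) -> int:
        return self.scan_interval_s * self.agentless_scan_multiplier

    def is_sane(self) -> bool:
        # The scan interval needs to be longer than the heartbeat interval.
        return self.scan_interval_s > self.heartbeat_interval_s


defaults = HeartbeatSettings()
print(defaults.heartbeats_per_scan())        # 3 opportunities per window
print(defaults.agentless_scan_interval_s())  # 90 seconds
```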

The second block of settings is to configure 'Heartbeat Ping' behavior. During heartbeat checking, as we will see in a minute, each time the management server looks for a heartbeat and fails to find one, MOM will initiate a ping to determine if the agent machine is actually online. Just because a machine fails to send a heartbeat doesn't mean that the machine is down - MOM heartbeat checking distinguishes machines that are offline from those that simply haven't sent a heartbeat by doing ping checks.

The 'Number of Ping attempts' setting defines how many pings will be done to determine if the target machine responds. The 'Time between pings' setting defines how long to wait between each ping attempt. The 'Ping time out' setting defines how long to wait without hearing a response before the ping attempt is considered a failure. The 'Number of scans before generating service unavailability' setting defines how many scan attempts will be done prior to flagging the MOM agent service as unavailable.

Let's pull all of this together to discuss how this mechanism works. Assume all settings are at their defaults and a MOM agent that is heartbeating every 10 seconds suddenly stops - due to a system problem, server reboot, etc. The MOM management server is somewhere in its 30 second detection period when this happens. Assuming MOM has received a valid heartbeat within the current 30 second window, the management server will wait for another 30 second period and then check again for a heartbeat. This time no heartbeat will be seen. In response to that, the management server will initiate a series of pings.

Assuming the ping attempt fails, MOM will immediately generate an event/alert indicating the ping failed and the target machine may be down. In the instance of a machine actually being down, the notification happens as close to real time as possible.

Assuming the ping attempt succeeds, MOM will wait another 30 second window to see if a heartbeat arrives. If no heartbeat has arrived by the end of the second 30 second window, MOM will again initiate the ping test to verify the system is online. Assuming that succeeds, MOM will wait a third 30 second window and, if there is still no heartbeat, will initiate a third series of pings. Assuming that comes back OK, MOM will generate an event/alert indicating it failed to hear from an agent with current heartbeats but did verify the agent machine was online. Remember, the 3 scan attempts is driven by the setting on the management server and is configurable.
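
The decision flow described above can be sketched as a small Python function. This is only an illustration of the logic as described, not MOM's implementation: the function name, the outcome strings and the two callbacks (`heartbeat_seen`, `ping_ok`) are hypothetical stand-ins for whatever the management server does internally.

```python
from typing import Callable, Tuple


def evaluate_agent(
    heartbeat_seen: Callable[[int], bool],
    ping_ok: Callable[[int], bool],
    scans_before_unavailable: int = 3,
) -> Tuple[str, int]:
    """Walk the scan windows and return (outcome, scan number that decided it)."""
    for scan in range(1, scans_before_unavailable + 1):
        if heartbeat_seen(scan):
            # A heartbeat arrived in this window: nothing to report.
            return ("healthy", scan)
        if not ping_ok(scan):
            # No heartbeat and no ping response: alert immediately.
            return ("machine_unreachable", scan)
        # Ping worked, so the machine is up; wait for the next window.
    # Every scan missed a heartbeat, yet the machine always answered pings.
    return ("agent_not_heartbeating", scans_before_unavailable)


# Machine is down: flagged on the first failed scan.
print(evaluate_agent(lambda s: False, lambda s: False))
# Machine is up but the agent is silent: flagged after the third scan.
print(evaluate_agent(lambda s: False, lambda s: True))
```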

Based on the above description, you may see an event/alert combination after approx. 30 seconds when MOM realizes a machine is totally offline or, if the machine is actually OK but the MOM agent is the one having problems, there will be a delay of approx. 2 minutes before receiving the heartbeat failure event/alert.
These default settings can be adjusted to fit the needs of each operating environment - but it is crucial to understand how all of these settings interact to predict the end behavior of MOM. If, for example, the default number of scans was adjusted from 3 to 10, MOM would delay notification on missing heartbeats for approx. 6-7 minutes. This time period may be even more drastically affected by adjusting combinations of settings.
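
As a rough way to reason about these delays, the sketch below estimates how long it takes before the heartbeat failure alert is raised when the machine keeps answering pings. It rests on assumptions that are mine, not the source's: the agent stops partway through a window that already saw a heartbeat, each failed scan adds a fixed `ping_overhead_s` (a hypothetical parameter standing in for the combined effect of the ping settings), and that overhead delays the next scan.

```python
from typing import Tuple


def heartbeat_alert_delay_s(
    scans: int = 3,
    scan_interval_s: float = 30.0,
    ping_overhead_s: float = 0.0,
) -> Tuple[float, float]:
    """Return (shortest, longest) estimated delay in seconds before the alert."""
    if scans < 1:
        raise ValueError("scans must be at least 1")
    # Each failed scan costs one full window plus the time spent pinging.
    base = scans * (scan_interval_s + ping_overhead_s)
    # The agent may stop anywhere inside the current window, which adds
    # between zero and one extra scan interval before the first failed scan.
    return (base, base + scan_interval_s)


print(heartbeat_alert_delay_s())          # defaults: 90 to 120 seconds
print(heartbeat_alert_delay_s(scans=10))  # 300 to 330 seconds before ping time
```

Scan windows alone give roughly 5 to 5.5 minutes for 10 scans; any time spent in the ping series comes on top of that, which would push the figure toward the longer delay quoted above.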

One further comment on this. MOM heartbeat data is stored in the database, but this information is NOT what is used to determine the last heartbeat received from an agent. Instead, each management server maintains an in-memory list of each of its managed agents and their last heartbeat time. This is what is used for heartbeat checking.
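
An in-memory list like this could look something like the following Python sketch. It is purely illustrative: the class, its methods and the injectable clock are hypothetical and say nothing about how MOM actually structures this data.

```python
import time
from typing import Callable, Dict, List


class HeartbeatTable:
    """Hypothetical in-memory map of agent name to last heartbeat time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def record(self, agent: str) -> None:
        # Called whenever a heartbeat arrives from an agent.
        self._last_seen[agent] = self._clock()

    def missing(self, window_s: float) -> List[str]:
        # Agents with no heartbeat inside the last scan window.
        now = self._clock()
        return sorted(
            agent for agent, seen in self._last_seen.items()
            if now - seen > window_s
        )
```

Passing the clock in as a parameter keeps the sketch easy to exercise with a fake time source instead of waiting for real seconds to pass.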

I may blog more on this in future submissions as there are even more 'behind the scenes' details as to how this works, both in terms of the mechanics and the rules that detect these potential failures.

- excerpt from Steve Rachui's Manageability blog (http://blogs.msdn.com/steverac/archive/2006/02/11/530292.aspx)