IBM i (and its predecessors) has had the capability to automatically identify and report software problems to IBM for many releases. This was first introduced as the “Software error logging” (QSFWERRLOG) system value.
The capability to automatically identify problems when they occur is something we call First Failure Data Collection (FFDC). The intent is that the first time a problem occurs, the data necessary for problem determination is collected automatically. That data, for operating system problems, can then be sent to IBM for diagnosis. The lofty goal is to identify and resolve problems the first time they happen, without ever having to recreate them.
This is an admirable goal, but in reality it is very difficult to achieve. The challenge is that defects occur in unknown and unexpected places, simply because defects are unexpected failures and programmers don't intend for unexpected failures to occur. As such, any attempt to decide in advance what data should be collected for a potential software bug is inherently flawed: writing the data collection routines requires predicting what the problem will be.
In the 5.4 release, we took a new approach to our FFDC support. Key to that approach is the recognition that one can never predict where a problem may occur, so there needed to be a dynamic way to modify both the criteria that identify potential problems and the data collected for them. This new support for software problem reporting was called Service Monitor. With the introduction of Service Monitor in the 5.4 release, we moved to a design that depends on a “policy” that identifies potential problems along with the data that should be collected for them; this policy can be updated dynamically. Service Monitor is started automatically under the *LOG option of the QSFWERRLOG system value.
While Service Monitor has existed for a few releases now, it seems to be a relatively unknown feature. The biggest reason is probably that there’s no high-level overview of the Service Monitor function in the Information Center. Bits and pieces of information can be found by reviewing APIs, but there’s no summary of the function and its capabilities.
Service Monitor, then, is a policy-based software function that automatically identifies problems as they occur and takes the actions defined by the policy. It’s primarily used for problems within the operating system and Licensed Internal Code.
In the 5.4 release, customers primarily noticed the presence of the Service Monitor function through the many QSRVMONxxxx jobs that ran in the QUSRWRK subsystem; a job was started for each potential problem that could be reported to IBM. A common question I've heard is “What are all those jobs for?” In the 6.1 release, the design was changed to use prestart jobs rather than individual jobs, so the number of jobs has been reduced, but the Service Monitor function remains.
The policy file that is used by Service Monitor is maintained by the IBM Support Center in Rochester. As experience is gained with known problems, or when new problems are encountered, the policy file can be updated. The latest version of the policy file is downloaded to your system when you connect to IBM using Electronic Service Agent.
The policy file identifies the symptom of the problem, which can be a message, a licensed internal code (LIC) log (also known as a Vlog), or a product activity log (PAL) entry. The policy also identifies the action that is taken when that symptom occurs. That action could be to collect diagnostic data to send to IBM, to download a PTF that corrects the issue, or to take some other action.
Several months ago I wrote a blog about Watches. Service Monitor uses watches as the underlying mechanism for automated notification of problems that can be detected. Service Monitor policies identify the conditions to watch for (messages, LIC logs and product activity log (PAL) entries), and Service Monitor sets up a watch for each item in the policy file. When a watch condition is matched, a Service Monitor exit program is invoked, and it uses the policy definition to identify the actions to take.
If you use the WRKWCH command to look at the *SRVMON watches, you will find many active watches (unless you have changed the QSFWERRLOG system value to *NOLOG). By displaying individual watch entries, you can see the kinds of things that Service Monitor watches for: messages, LIC logs or PALs. For example, on a 6.1 system, if you look at the SRVMON0003 watch, you can see that it’s monitoring for CPF1101 sent to the QSYSOPR message queue. CPF1101 is “Subsystem &1 had a function check.” You can imagine that IBM probably wants to know if a subsystem takes a function check and has some basic diagnostic information that should be reviewed if this situation occurs. When this message occurs, the policy indicates that the job log should be collected and the problem reported to IBM.
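To make the symptom-to-action idea concrete, here is a small conceptual sketch in Python of how a policy entry like the CPF1101 one could drive a watch. This is only a toy model and not how Service Monitor is actually implemented: the real policy file format isn't described here, and the names PolicyEntry, PolicyWatcher, load_policy, on_event and the action strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Symptom kinds named in the text; the string values are illustrative.
MESSAGE = "message"
LIC_LOG = "lic_log"
PAL = "pal"


@dataclass(frozen=True)
class PolicyEntry:
    """One hypothetical policy rule: a symptom plus the actions to take."""
    kind: str                 # MESSAGE, LIC_LOG, or PAL
    symptom_id: str           # e.g. a message ID such as "CPF1101"
    actions: Tuple[str, ...]  # hypothetical action names


class PolicyWatcher:
    """Toy model of a policy-driven monitor; not the real Service Monitor."""

    def __init__(self, policy: List[PolicyEntry]) -> None:
        self.watches: Dict[Tuple[str, str], PolicyEntry] = {}
        self.load_policy(policy)

    def load_policy(self, policy: List[PolicyEntry]) -> None:
        # One watch per policy entry. Reloading replaces the whole set,
        # which models a policy that can be updated dynamically.
        self.watches = {(e.kind, e.symptom_id): e for e in policy}

    def on_event(self, kind: str, symptom_id: str) -> List[str]:
        # Plays the role of the exit program: look up the matched watch
        # and return the actions its policy entry defines.
        entry = self.watches.get((kind, symptom_id))
        return list(entry.actions) if entry else []


if __name__ == "__main__":
    watcher = PolicyWatcher([
        PolicyEntry(MESSAGE, "CPF1101", ("collect_job_log", "report_to_ibm")),
    ])
    print(watcher.on_event(MESSAGE, "CPF1101"))  # actions for a watched message
    print(watcher.on_event(MESSAGE, "CPF0000"))  # an ID with no policy entry
```

The point of the sketch is that nothing about CPF1101 is hard-coded in the monitor itself. Swapping in a new policy list changes what is watched and what happens on a match, which is the property that lets the policy be updated without changing the code.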
Most of the policies used by Service Monitor are for LIC logs. Because LIC logs are commonly used to record diagnostic information for problems detected by the Licensed Internal Code, they are easy to monitor, and the log data can then be sent to IBM automatically for review.