Symptoms deep dive, Part 2: Cool things you can do with symptoms

Use common scenarios and patterns for increased autonomic computing

Marcelo Perazolo (mperazol@us.ibm.com), Autonomic Computing Architecture, IBM, Software Group
Marcelo Perazolo is a member of the IBM Autonomic Computing Architecture team, where he serves as an architect for symptoms and other knowledge formats and defines Management Integration Taxonomies related to autonomic computing. He has worked for IBM since 1990, with various assignments in network and systems management. Marcelo received an M.S. degree in Electrical Engineering in 1994. His interests include problem determination and prediction, process optimization techniques, security, correlation technologies, and knowledge representation.

Summary:  Get introduced to some interesting IT situations and learn how to use autonomic computing canonical symptoms to handle them. This article covers how each symptom is described, how it is recognized, and which actions are recommended to resolve the situation associated with it.

Date:  13 Dec 2005
Level:  Intermediate

Introduction

This second installment of the Symptoms deep dive series builds on Part 1, which discussed the new autonomic computing symptoms format. For more on symptoms, their format, and content, see Resources.

This article discusses some common IT scenarios that show how both users and IT personnel would benefit from a symptoms-based autonomic computing architecture. You'll also see how generic situations can be associated with relevant canonical symptoms to enable a higher degree of automation in IT processes. The scenarios include situations involving:

  • Security
  • Service support
  • Service availability
  • Service continuity
  • Business logic

Symptoms and self-CHOP

In the autonomic computing architecture, classes of situations that can be handled in an autonomic way are often called self-*, meaning that different and varied IT functions can be automated and managed by the component itself. The fundamental features, sometimes called self-CHOP, are:

  • Self-configuring
  • Self-healing
  • Self-optimizing
  • Self-protecting

Figure 1 shows a classical view of the self-CHOP autonomic computing features, represented as parts of the whole self-management solution.


Figure 1. The self-CHOP vision

A different, and also valid, approach is the automation of more formal IT processes, following the ones defined by the IT Infrastructure Library (ITIL). In this approach, different processes in ITIL are composed of services, activities, and tasks that can be selectively automated; they adopt autonomic computing elements such as autonomic managers, knowledge sources, touchpoints, and so on. To achieve autonomic computing functions, each automated activity has to support different types of knowledge, such as Common Base Events, symptoms, policies, change requests, and change plans. See Resources for more on ITIL.

This article won't try to classify symptoms based on self-management disciplines, or on processes and areas of automation. Instead, you'll see some IT scenarios that would benefit from automation, where symptoms could be part of the autonomic computing solution. Symptoms extracted from the scenarios are canonical, and can be applied in variations of autonomic functions associated with the self-CHOP vision, or with the automation of traditional IT processes.

It's a common misconception that symptoms are all about the self-healing discipline of autonomic computing. Though "symptoms" are typically associated with "healing" in other arenas, such as medicine, in the autonomic computing context symptoms are generic indications of situations that need to be automated (not necessarily healed). Combinations of symptoms can also enable self-configuring, self-optimizing, and self-protecting solutions.

In the scenarios and canonical symptoms in this article, I assume the following philosophy and conventions:

  • Canonical symptoms are named and described generically, so they can easily be associated with different environments and reused in different conditions.
  • Canonical symptoms can be, and often are, extended. Specific solutions use the base knowledge defined in a canonical symptom, but extend it to provide additional knowledge associated with the specific environment or solution.
  • Canonical symptom definitions are also described in a generic way by the adoption of conceptual patterns similar to those implemented by the Advanced Correlation Technology (ACT), a Tivoli component (see Resources for more on ACT patterns).
  • In general, symptom definitions and recommendations (in Table 1 through Table 5) are typical and observed values in real world implementations, and not necessarily the best way to identify a given canonical symptom or to promote automated reaction to the same symptom. For a full autonomic computing-based solution, definitions and recommendations should be revisited for each application of a canonical symptom.
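
The first two conventions (generic canonical symptoms that specific solutions extend) can be pictured with a minimal sketch. The `Symptom` class, its field names, the `extend` helper, and the payroll example are all hypothetical; they are not the autonomic computing symptom format itself, only an in-memory illustration of reusing a canonical entry from Table 1.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Symptom:
    """Hypothetical in-memory view of a canonical symptom (not the real schema)."""
    name: str
    description: str
    definition: str                       # conceptual pattern, kept as text here
    recommendations: Tuple[str, ...] = ()


def extend(base: Symptom, **overrides) -> Symptom:
    """Return a solution-specific copy; the canonical symptom stays untouched."""
    return replace(base, **overrides)


# Canonical entry, using the wording of Table 1.
AUTHORIZATION_FAILURE = Symptom(
    name="Authorization failure",
    description=("Unauthorized attempt to access resources associated with "
                 "this symptom was made, and access was denied"),
    definition="Filter pattern: event(access_denied)",
    recommendations=("Log for auditing purposes",),
)

# Hypothetical extension for one environment: same definition,
# plus one recommendation that only makes sense in that environment.
PAYROLL_AUTHORIZATION_FAILURE = extend(
    AUTHORIZATION_FAILURE,
    name="Payroll authorization failure",
    recommendations=AUTHORIZATION_FAILURE.recommendations
    + ("Notify payroll security officer",),
)
```

Making the dataclass frozen is a design choice for this sketch: it keeps the canonical entry reusable, because an extension can only be produced as a new object.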

Security symptoms

Security management is very common in many IT products and solutions. Within the security management discipline, common scenarios include: authentication of users (or principals), authorization of access to resources being secured, and deployment of security prevention updates.

Authentication failure
When a user logs on to an application, you need to verify that the user is who they claim to be. This is commonly governed by the assignment of user IDs and passwords that must be authenticated at login. The symptom authentication failure can be designed to point out when authentication fails for a given application.
Authorization failure
When a user tries to access a particular function inside an application, and this function is deemed controlled (can only be performed by certain kinds of users that have the skill or authority), the function demands authorization. This is commonly governed by assignment of access control lists that correlate the users and their roles with the resources they're trying to access, and how they can access them. The access control lists must be consulted in the authorization process. You can associate the authorization failure symptom to flag authorization denials for a given controlled resource.
Prevention deployment failure
Some applications or systems depend on periodic security updates to perform their security prevention functions. This often happens when special security monitoring and management applications actively manage a given application or system. A common scenario involves periodic updates associated with virus signatures that are needed by virus scanning services to function properly.

New viruses are always being created and spread on the Internet, so the applications need a live connection with a support desk that feeds them the periodic updates of new virus signatures. The same principle may also be applied to other security functions, such as intrusion detection. You could use the symptom prevention deployment failure to indicate that a needed security prevention update could not be performed, and to establish a reason for the failure.

Table 1 lists examples of the canonical symptoms you can use to detect and react to security situations.


Table 1. Security symptoms
| Symptom name | Symptom description | Symptom definition | Symptom recommendations |
|---|---|---|---|
| Authentication failure | Attempt to access resources associated with this symptom was made, but there was an authentication failure | Collection pattern: event(wrong_password), n=3, timeout=24h | Log for auditing purposes |
| Authorization failure | Unauthorized attempt to access resources associated with this symptom was made, and access was denied | Filter pattern: event(access_denied) | Log for auditing purposes |
| Prevention deployment failure | Failure occurred while deploying security prevention resources (virus update table, security patch, and so on) | Filter pattern: event(security_install_failed) | Analyze security prevention failure and alert security administrator |
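
The collection pattern used for authentication failure (three `wrong_password` events within 24 hours) could be recognized along the following lines. This is a minimal sketch, not the ACT implementation: the `CollectionPattern` class and its `feed` method are hypothetical, events are assumed to arrive in chronological order, and the 24-hour timeout is interpreted as a sliding window, which is only one possible reading of the table.

```python
from collections import deque
from datetime import datetime, timedelta


class CollectionPattern:
    """Fires when `n` matching events fall within one `timeout` window."""

    def __init__(self, event_type: str, n: int, timeout: timedelta):
        self.event_type = event_type
        self.n = n
        self.timeout = timeout
        self._times = deque()

    def feed(self, event_type: str, when: datetime) -> bool:
        """Record one event; return True if the symptom is recognized."""
        if event_type != self.event_type:
            return False
        self._times.append(when)
        # Drop events that have fallen out of the sliding window.
        while when - self._times[0] > self.timeout:
            self._times.popleft()
        if len(self._times) >= self.n:
            self._times.clear()  # start a fresh collection after firing
            return True
        return False


# Values from Table 1: n=3, timeout=24h.
auth_failure = CollectionPattern("wrong_password", n=3, timeout=timedelta(hours=24))
start = datetime(2005, 12, 13, 9, 0)
results = [
    auth_failure.feed("wrong_password", start + timedelta(minutes=m))
    for m in (0, 5, 10)
]
# The third failed attempt inside the window recognizes the symptom.
```

A real deployment would probably keep one collection per user ID or per application; the table does not say, so the sketch keeps a single counter.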

Service support symptoms

Almost all applications depend on some sort of supporting services to function normally. Configuring and deploying distributed applications that share data with one another is considerably more complicated, and often leads to unexpected situations. Several types of situations can arise due to:

  • Misconfiguration
  • Lack of configuration information
  • Missing dependencies
  • Mismatch of dependent releases of system components

In some scenarios, part of the configuration data is only needed in selected activity flows, and the missing or misconfigured data isn't always noticed before it's too late. The same happens for complex distributed solutions that frequently depend on specific releases of other components to function properly. Symptoms that indicate these situations are important so autonomic computing solutions know when they cannot perform their automated functions properly, and know what is missing. Sometimes automated recovery, assumption of default configuration, or rollback to previous releases is possible, and sometimes not. A good autonomic computing solution should plan for these situations and react accordingly.

Some of the best known scenarios involving configuration and change management data are:

Configuration unavailable
Restrictions can be caused by unavailability of configuration data. Sometimes the missing configuration data can be inferred by the system or is not critical, and the system can continue operating in an impaired or degraded state. Sometimes the configuration data is critical, the system cannot continue, and a failure is detected. Symptoms can signal situations in which configuration data is needed but not found. You can use the canonical symptom configuration unavailable to provide this function.
Configuration invalid
Restrictions can be caused by misconfiguration. Because configuration data is generally maintained by people, mistakes are commonplace. The configuration data exists, but it isn't always possible to determine whether the system's resulting behavior is what was originally intended. When the outcome is deterministic, you can use symptoms to signal that an expected result was not achieved, and to identify the corresponding configuration data that led to that result. You can use the canonical symptom configuration invalid to provide this function.
Dependency unavailable
Restrictions can be caused by the unavailability of dependent components, products, or solutions. Missing dependencies generally cause failure when the system tries to perform specific services, activities, or tasks supported only by the missing dependency. You can use the canonical symptom dependency unavailable to signal this type of unavailability.
Dependency mismatch
Restrictions can be caused by the mismatch of dependent components, products, or solutions. Mismatch of dependencies might cause failure, or impaired or degraded support, to specific services, activities, or tasks supported in full only by the correct releases of the dependent elements. For example, in the Java™ world these types of functions are commonly called deprecated, indicating that a new, more complete or precise function is available. You can use the canonical symptom dependency mismatch to provide this function.

Table 2 shows examples of the canonical symptoms you can use to detect and react to the service support scenarios.


Table 2. Service support symptoms
| Symptom name | Symptom description | Symptom definition | Symptom recommendations |
|---|---|---|---|
| Configuration unavailable | Some configuration information for the resources associated with this symptom was not found | Filter pattern: event(configuration_not_found) | Alert administrator and flag service provided by resource as "marginal" |
| Configuration invalid | Configuration information for the resources associated with this symptom was processed and determined to be invalid | Sequence pattern: event(configuration_found); event(configuration_invalid) | Alert administrator and flag service provided by resource as "marginal" |
| Dependency unavailable | One or more dependencies (resources) needed by other resources do not exist | Sequence pattern: event(dependency_request, resource); event(inventory, resource not within [inventory_list]) | Install missing resource |
| Dependency mismatch | Release level of one or more resources associated with this symptom is not what was expected | Filter pattern: event(wrong_release) | Update resource to required release |
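
The two sequence patterns in Table 2 can be sketched as an ordered match over an event stream, followed by an inventory check for the dependency case. The function names are hypothetical, and the sketch assumes that unrelated events may be interleaved between the steps of a sequence; the table does not say how such noise is handled.

```python
from typing import Iterable, List, Sequence


def matches_sequence(events: Iterable[str], expected: Sequence[str]) -> bool:
    """Return True if the `expected` event names occur, in order, in `events`."""
    position = 0
    for event in events:
        if position < len(expected) and event == expected[position]:
            position += 1
    return position == len(expected)


def missing_dependencies(requested: Iterable[str], inventory: Iterable[str]) -> List[str]:
    """Return the requested resources that are not within the inventory list."""
    known = set(inventory)
    return [resource for resource in requested if resource not in known]


# Configuration invalid: configuration was found, then determined to be invalid.
CONFIGURATION_INVALID = ("configuration_found", "configuration_invalid")
log = ["startup", "configuration_found", "heartbeat", "configuration_invalid"]
config_is_invalid = matches_sequence(log, CONFIGURATION_INVALID)

# Dependency unavailable: a requested resource is absent from the inventory.
# The resource names are made up for illustration.
to_install = missing_dependencies(["db_driver", "message_bus"], ["db_driver"])
```

The list returned by `missing_dependencies` maps directly onto the "Install missing resource" recommendation.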


Service availability symptoms

Service availability is by far the most common situation that is already managed to some degree (though not usually automated) in IT systems. The following scenarios deal with analyzing situations that might lead to unavailability of a service, an activity, or a simple task in a system.

Systems usually depend on and monitor resources, whether they are software, hardware, communication resources, or others. The system needs these resources to properly perform its intended service, activity, or task. The whole system may also be a resource itself. When these resources are present but can't perform their required services, activities, or tasks, it's useful to indicate the reason why. Either the resource itself may indicate the cause, or a monitoring system might do so on behalf of the resource.

Some of the best known scenarios involving availability of resources (or lack thereof) are:

Resource capacity met
System resources are usually shared, and might be at their maximum capacity when clients request their services. In such a case, the resource should be able to signal its condition as a warning and let the clients know it cannot accept more load or it will become unavailable to them. You can use the canonical symptom resource capacity met to provide this function.
Resource unavailable
System resources might be in a situation where they cannot perform the services requested, or they may be present but not operative. In this case, you can use the special symptom resource unavailable to provide this indication.
Resource degraded
System resources might be in a situation where they cannot perform the full extent of the services requested, but can perform them partially. You can use a special symptom called resource degraded to signal such situations, whether temporary or not, and to let the client decide whether to request a partial service level from the resource.
Resource unreachable
System resources are sometimes distributed, and might be remote from the clients requesting their services. Sometimes communication failures can prevent the clients of remote resources from reaching them when they request services. The system should be able to detect this condition and signal a warning to the clients of the unreachable resource, preventing unnecessary communication loops and retries while the resource is unreachable. You can use the canonical symptom resource unreachable to provide this function.
Repeated availability problem
With many applications, a common scenario is to have classes of resources that are considered not perfectly reliable. For these resources, it is often OK for availability symptoms to occur from time to time. Sometimes the solution deployer is only interested in tracking if a certain number of availability-related symptoms happen for a certain resource in a given time period. In such cases, the rapid succession of the same types of availability problems for the same resource might point to a more chronic problem, or might predict imminent failure (total unavailability). For these situations, there is the symptom repeated availability problem.

Table 3 has examples of the canonical symptoms you can use to detect and react to the service availability scenarios above.


Table 3. Service availability symptoms
| Symptom name | Symptom description | Symptom definition | Symptom recommendations |
|---|---|---|---|
| Resource capacity met | A given resource or set of resources is fully loaded and has reached its maximum capacity | Filter pattern: event(metric, utilization=capacity) | Perform load balancing on resources of the same type |
| Resource unavailable | A given resource or set of resources is installed but not available | Filter pattern: event(resource_not_available) or Sequence pattern: event(resource_not_available_events) | Heal resource or alert administrator and flag service provided by resource as "down" |
| Resource degraded | A given resource or set of resources had its service level degraded | Filter pattern: event(resource_degraded) or Sequence pattern: event(resource_degraded_events) | Heal resource or alert administrator and flag service provided by resource as "marginal" |
| Resource unreachable | A given resource or set of resources cannot be reached | Sequence pattern: event(resource_communication_error); symptom(resource_not_available, router); (resource belongs to router subnet) | Reconfigure routing tables and alert network administrator |
| Repeated availability problem | A given resource or set of resources fails multiple times within a specific time period | Collection pattern: event(resource_availability_problem), source[i]=source[i-1], n=3, timeout=24h | Alert administrator or analyze why problem keeps happening, and heal |
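
The resource unreachable definition differs from the others in that it correlates an event with an existing symptom: a communication error only counts as unreachable when a router serving the resource's subnet is itself unavailable. A minimal sketch of that correlation follows. The `is_unreachable` function, the shape of the `routers_down` mapping, and the addresses are assumptions made for illustration.

```python
import ipaddress
from typing import Dict


def is_unreachable(resource_ip: str, routers_down: Dict[str, str]) -> bool:
    """Correlate a resource_communication_error with unavailable routers.

    `routers_down` maps a router name to the subnet it serves (CIDR text),
    for every router with an active resource unavailable symptom.
    Returns True if the resource belongs to one of those subnets.
    """
    address = ipaddress.ip_address(resource_ip)
    return any(
        address in ipaddress.ip_network(subnet)
        for subnet in routers_down.values()
    )


# Hypothetical state: one router is down and serves 10.1.2.0/24.
routers_down = {"router-a": "10.1.2.0/24"}

# A communication error for 10.1.2.7 is explained by the router outage,
# so the client can stop retrying until the router symptom clears.
behind_failed_router = is_unreachable("10.1.2.7", routers_down)

# A communication error for 10.1.3.7 is not explained by it.
elsewhere = is_unreachable("10.1.3.7", routers_down)
```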

Service continuity symptoms

The distinction between service availability and service continuity is often a fine line. Sometimes the distinction is solely based on the resource self-monitoring capabilities, or on the capabilities of external systems monitoring these resources. A resource has continuity problems if it is capable of starting the execution of a requested service, activity, or task, but cannot perform it to completion. It is not considered degraded; it simply cannot perform a request in its entirety. When such situations occur, the continuity of the service being provided by that resource is impacted.

Some of the common scenarios involving continuity of resources are:

Execution failure
When a resource is present, not degraded, and has all dependencies in place to perform a service, activity, or task but fails to do so, it should provide an indication about what part of its execution has failed. Sometimes failure is due to misconfiguration or mismatch of dependencies. But, if a resource is not instrumented to detect these situations, or it's impossible to do so in a deterministic way, you can use the symptom execution failure. This should provide a hint about what actually failed in the execution of the service.
Resource allocation failure
A special class of execution failures may exist when a resource depends on allocation services to function properly, but the allocation services cannot be executed. In such situations, the service, activity, or task to be performed by the resource itself will also fail, of course. You can use the symptom resource allocation failure to indicate special allocation failure situations.
Dependency failure
A special class of execution failures may exist when a resource depends on services provided by another resource to function properly, but the services cannot be executed. The service, activity, or task to be performed by the resource itself will also fail. You can use the symptom dependency failure to indicate execution failure situations associated with dependent resources.
Imminent failure predicted
On rare occasions, a monitoring system or the resources themselves may be able to predict that they can complete the current service, activity, or task request, but will not be able to perform future requests. In this situation, you can use a symptom called imminent failure predicted as an indication to the resource clients so they know a future request might not be completed successfully. Lack of service continuity has not actually happened yet, but is predicted to happen in the near future.

Table 4 has examples of canonical symptoms that you can use to detect and react to the service continuity scenarios.


Table 4. Service continuity symptoms
| Symptom name | Symptom description | Symptom definition | Symptom recommendations |
|---|---|---|---|
| Execution failure | A given resource or set of resources is available but failed to execute its required service | Filter pattern: event(service_delivery_failure) | Alert administrator |
| Resource allocation failure | A resource or set of resources is available, but could not be allocated to perform a service | Filter pattern: event(resource_allocation_failure) | Alert administrator (analyze why resource could not be allocated, and heal) |
| Dependency failure | A dependent resource was available but failed to provide service to a particular resource or set of resources | Filter pattern: event(service_dependency_failure) | Alert administrator |
| Imminent failure predicted | A given resource or set of resources will fail shortly | Filter pattern: event(metric, metric_value>=threshold) | Alert administrator and Replace resource |
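
A filter pattern needs only a single event to recognize a symptom. The imminent failure predicted definition, `event(metric, metric_value>=threshold)`, could look like the following sketch. The `MetricEvent` class, the metric name, and the threshold value are hypothetical; in practice the threshold would come from the resource vendor or from observed history.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricEvent:
    """Hypothetical metric event reported by a resource or its monitor."""
    resource: str
    metric: str
    value: float


def imminent_failure(event: MetricEvent, metric: str, threshold: float) -> Optional[str]:
    """Filter pattern: one event at or above the threshold recognizes the symptom."""
    if event.metric == metric and event.value >= threshold:
        return f"Imminent failure predicted for {event.resource}"
    return None


# Assumed example: a disk reporting its count of reallocated sectors.
healthy = imminent_failure(
    MetricEvent("disk-7", "reallocated_sectors", 3), "reallocated_sectors", 50
)
failing = imminent_failure(
    MetricEvent("disk-7", "reallocated_sectors", 50), "reallocated_sectors", 50
)
```

The comparison is `>=`, as in the table, so an event exactly at the threshold already recognizes the symptom.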


Business logic symptoms

Business logic symptoms are alarms that can help determine the behavior of a system. Frequently, these situations are signaled by symptoms that correspond to logical states of the business, or to integration data exchanged between distributed components in an organized or expected fashion.

A few generic scenarios involving business logic are:

Business logic alarm
When a system passes through a set of determined states in its business logic, it's good practice to associate business alarms both with the successful execution of these states and with unsuccessful sequences of states. The alarms that signal unsuccessful or unexpected business states are represented in the special class called business logic alarm symptoms. For example, a service is expected to update a database table, but after completion the expected change is not detected.
Bad application data
When distributed components exchange business-related data, it's good practice to expect that the target component will perform some validation checks on the data received from the source component. This is often a way for distributed components to guarantee that no unexpected behavior will occur. Sometimes undetected business logic failures can manifest themselves through the exchange of unexpected data between distributed components. Such situations may be indicated by the bad application data symptom. For example, a request to a distributed service carries a parameter that is out of valid bounds.

Table 5 shows examples of canonical symptoms you can use to detect and react to business logic scenarios.


Table 5. Business logic symptoms
| Symptom name | Symptom description | Symptom definition | Symptom recommendations |
|---|---|---|---|
| Business logic alarm | A given resource or set of resources raised a specific alarm associated with its business logic | Filter pattern: event(specific_business_logic_event) | Depends on the business logic |
| Bad application data | A given resource or set of resources has communicated using bad data | Filter pattern: event(bad_message_content) | Alert application integrator |
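
The bad application data scenario (a target component validating what it receives) could produce the `bad_message_content` event as follows. This is a sketch under stated assumptions: the parameter names, their bounds, and the dictionary shape of the event are invented for illustration.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical valid ranges for the parameters of a distributed service.
BOUNDS: Mapping[str, Tuple[float, float]] = {
    "quantity": (1, 1000),
    "discount_percent": (0, 100),
}


def validate_request(params: Mapping[str, float]) -> Optional[dict]:
    """Return a bad_message_content event for the first out-of-bounds
    parameter, or None if every parameter is within its valid range."""
    for name, value in params.items():
        # Parameters without declared bounds are accepted as-is.
        low, high = BOUNDS.get(name, (float("-inf"), float("inf")))
        if not low <= value <= high:
            return {
                "event": "bad_message_content",
                "parameter": name,
                "value": value,
            }
    return None


accepted = validate_request({"quantity": 10, "discount_percent": 15})
rejected = validate_request({"quantity": 10, "discount_percent": 250})
```

Carrying the offending parameter and value in the event gives the application integrator, who is the recommended recipient in Table 5, enough context to locate the source component.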


Conclusion

You can use autonomic computing symptoms as helpful indicators for partial analysis, and for yielding rich data about combinations of events or state data in a system. There are classic scenarios that you can apply to innumerable resources associated with applications and components, and to the services, activities, and tasks they perform. This article categorized these scenarios, and showed the simple generic canonical symptoms for each scenario. You can use the categories and canonical symptoms to enable your unique solutions.

Knowledge and reuse of analysis patterns, such as those provided by canonical symptoms, can play an important role in shortening the time and resources needed for enabling autonomic computing capabilities.


Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

  • developerWorks blogs: Get involved in the developerWorks community.
    Dave Bartlett, IBM VP, blogs each week on his thoughts about the state of autonomic computing in the industry.
