Consider a typical administrator's user story:
You're an enterprise system administrator deploying a new application server.
Enterprise software means redundancy and reliability, so in reality you're
deploying a clustered application server in "sandbox", then "staging", and
maybe one day in the not too distant future - "production". At first
everything is wonderful in the sysadmin Garden of Eden. But then the user
issues start. A fix here, a fix there - then 10 fixes - wait ... now it's a
whole fixpack. And somewhere along the way you find yourself muttering, "I
thought we applied a fix for that". Well, you did, but it was applied to
"staging" and somehow never made it to "sandbox", and now you can't seem to
reproduce the user issue.
That enterprise application server does not run like your Windows OS. It does
not automatically update itself with the latest and greatest fixes because
every fix carries some risk. So you apply only the fixes needed to solve a
particular issue. But that approach requires a user to report the problem,
someone to analyze the logs, someone to pick out the error that can then be
searched for on the support website, and then locating the fix, downloading
it, applying it, bookkeeping, and the list goes on.
Now, here's another story, the support engineer's:
A support engineer identifies a problem in your application server. All
products differ, but often there's a log file with debug or trace messages
and, most importantly, an error signature. For J2EE-based products such as
WebSphere Application Server, WebSphere Portal, and Lotus Quickr (services for
Portal), that error signature is usually a stack trace. Other times, the
message ID is a clear indicator of the type of failure the system has
experienced. The point is, something in the log told the engineer a problem
occurred - a clue. The defect gets fixed, the binary is placed on Fix Central,
and an APAR or technote is written to document the error and solution. If the
error is reported again, the engineer can simply search to find the technote.
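To make the idea of an error signature concrete, here is a minimal Python sketch that pulls message IDs and exception class names out of a log. The two patterns are my assumptions about what such signatures tend to look like (a short uppercase prefix, four digits, and a severity letter; a dotted Java class name ending in Exception or Error), and the sample log is made up for illustration. It is not how any IBM tool does it.

```python
import re

# Assumed shape of a message ID: 4-5 uppercase letters, 4 digits, then a
# severity letter. Adjust the pattern for the product you actually run.
MESSAGE_ID = re.compile(r"\b([A-Z]{4,5}\d{4}[A-Z])\b")
# Assumed shape of a Java exception name, e.g. java.lang.NullPointerException.
EXCEPTION = re.compile(r"\b((?:[a-z_][\w$]*\.)+[A-Z][\w$]*(?:Exception|Error))\b")


def extract_signatures(log_text):
    """Return the distinct message IDs and exception names found in log_text."""
    message_ids = []
    exceptions = []
    for line in log_text.splitlines():
        for match in MESSAGE_ID.finditer(line):
            if match.group(1) not in message_ids:
                message_ids.append(match.group(1))
        for match in EXCEPTION.finditer(line):
            if match.group(1) not in exceptions:
                exceptions.append(match.group(1))
    return {"message_ids": message_ids, "exceptions": exceptions}


# Made-up sample log; DEMO0001E is a hypothetical message ID.
SAMPLE_LOG = """\
[1/2/09 10:15:01:123 EST] 0000002a DemoServlet E   DEMO0001E: Uncaught exception thrown
java.lang.NullPointerException
\tat com.example.Widget.render(Widget.java:42)
"""

if __name__ == "__main__":
    print(extract_signatures(SAMPLE_LOG))
```

Either kind of signature works as the search key the engineer would otherwise type into the support site by hand.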
So what's wrong here? Both scenarios are destined to rediscover the same
problem. The administrator inevitably finds that a previous fix or
configuration needs to be applied to other servers, and the support engineer
hands the same solution to one administrator after another. When a team is
involved, the knowledge or change set going into the application is even more
compartmentalized, which makes the problem worse.
Well, there just may be something that can help - autonomic software. That's a
fancy way of saying, "when the server breaks, I want it to fix itself". And
that ability is not too far-fetched with today's tooling.
Tivoli distributes two applications: Log Analyzer and Symptom Editor. Together,
they are a good match for our user stories. As the support engineer attributes
a fix or configuration to a particular error message, he or she creates a
symptom. The symptom is nothing more than a description of the error, the
conditions that trigger the error, and the recommendation. The symptom is
complementary to the Log Analyzer. As the administrator reviews log files in
Log Analyzer, the messages can be cross-checked against the list of symptoms.
And since you now have a tool to analyze log files and a symptom database to
describe the error and solution, let's put all of it on auto-pilot.
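As a rough illustration of what a symptom holds and how cross-checking works, here is a small Python sketch. The Symptom class, its field names, and the use of a regular expression as the trigger condition are all my own assumptions for the example; they are not the Symptom Editor's data model or Log Analyzer's matching logic.

```python
import re
from dataclasses import dataclass


@dataclass
class Symptom:
    """Hypothetical symptom: what the error is, what triggers it, what to do."""
    description: str
    pattern: str          # regular expression standing in for the trigger condition
    recommendation: str


def cross_check(log_lines, symptoms):
    """Return (line_number, symptom) for every log line that a symptom matches."""
    compiled = [(re.compile(symptom.pattern), symptom) for symptom in symptoms]
    hits = []
    for number, line in enumerate(log_lines, start=1):
        for regex, symptom in compiled:
            if regex.search(line):
                hits.append((number, symptom))
    return hits


if __name__ == "__main__":
    # Made-up symptom and log lines, for illustration only.
    known = [Symptom("Demo render failure", r"DEMO0001E", "Apply the hypothetical fix")]
    log = ["server started", "DEMO0001E: Uncaught exception thrown"]
    for number, symptom in cross_check(log, known):
        print(f"line {number}: {symptom.description} -> {symptom.recommendation}")
```

The point of the structure is that the recommendation travels with the error signature, so whoever reads the log gets the answer without a separate search.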
Server logs error message
Symptom provides context and solution
Tooling downloads fix or supplies
Administrator is notified
The first three stages are possible today. The rest - maybe, we'll see.
Interested?
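The stages above could be wired together along these lines. This is only a sketch under stated assumptions: the symptom table is a plain dictionary keyed by an error signature, and fetch_fix and notify are hypothetical callbacks standing in for whatever would actually download a fix or alert the administrator. None of this is an existing Log Analyzer API.

```python
def run_pipeline(log_lines, symptoms, fetch_fix, notify):
    """Walk the log, match symptoms, fetch each fix once, and notify.

    symptoms maps an error signature (a substring to look for) to a dict
    with hypothetical 'solution' and 'fix_id' keys.
    """
    handled = []
    for line in log_lines:                        # server logs error message
        for signature, symptom in symptoms.items():
            if signature not in line:
                continue
            fix_id = symptom["fix_id"]            # symptom provides context and solution
            if fix_id in handled:
                continue                          # already dealt with this fix
            fetch_fix(fix_id)                     # tooling downloads the fix
            notify(f"{symptom['solution']} (fix {fix_id} retrieved)")  # administrator is notified
            handled.append(fix_id)
    return handled


if __name__ == "__main__":
    # Made-up data; print stands in for the real download and notification steps.
    table = {"DEMO0001E": {"solution": "Apply the hypothetical fix", "fix_id": "FIX-1"}}
    run_pipeline(["DEMO0001E: boom"], table, fetch_fix=print, notify=print)
```

Passing the download and notification steps in as callbacks keeps the matching logic testable while the later stages remain open questions.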
Where to download
Log Analyzer is most easily accessible through the IBM Support Assistant (ISA) workbench.
Start by downloading ISA
IBM Support Assistant V4.1
How to install
IBM Support Assistant: Installation demonstration
Demonstrates analyzing a simple test case log file. Typically you'll see many
more records corresponding to log messages.
You can batch analyze an entire log file against the symptom database.
I've created a symptom for a real-world error message, which you can see alongside the actual IBM
APAR. Following the link details the problem as well as where to download the
Quickr Symptom Database
I've created a Quickr symptom database with around 40 errors. Unzip the file
to produce the IBM_Quickr_Portal_Version_8.symptom database file and simply
import it (really, it's just XML) into Log Analyzer using the File menu. Once
the database is added, you can begin analyzing your server logs to find known
issues.
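Since the database is just XML, you can peek inside it with any XML parser before importing it. The Python sketch below counts element names without assuming anything about the symptom schema; the sample document in it is made up and is not the real symptom format.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip an XML namespace such as '{uri}name' down to 'name'."""
    return tag.rsplit("}", 1)[-1]


def summarize_xml(source):
    """Count element names in an XML document given as a path or binary file object."""
    counts = Counter()
    for _event, element in ET.iterparse(source):
        counts[local_name(element.tag)] += 1
        element.clear()  # release children as we go to keep memory use low
    return counts


# Made-up document used only to exercise the function; not the symptom schema.
SAMPLE_XML = b'<catalog xmlns="urn:example"><entry><rule/></entry><entry/></catalog>'

if __name__ == "__main__":
    for name, count in summarize_xml(io.BytesIO(SAMPLE_XML)).most_common():
        print(name, count)
```

Pointing summarize_xml at the unzipped IBM_Quickr_Portal_Version_8.symptom file gives a quick feel for how the database is organized.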
Now for some final thoughts. The solution is not perfect. Importing large log
files raises performance concerns, and larger files come with a high memory
footprint. The symptom database is subject to the same concerns as technotes
and APARs: if the content lacks quantity and quality, the tool lacks utility.
But it's built on Eclipse - you know how I feel about Eclipse - which means
that it's an excellent base to build on. For example, the later stages on the
previous list could be future contributed plugins. And the act of symptom
creation already fits within IBM's process for knowledge review. With all that
said, I think the tool is fantastic. Analyzing a 20 MB log file in a matter of
minutes and producing a list of known problems makes me far more productive.
So I'm going to continue building the Quickr symptom database with my team and
select customers. I'll post status updates on progress made, and if you're
interested in contributing symptoms, ideas, requirements, etc. - please feel
free to contact me.