You're an enterprise system administrator deploying a new application server. Enterprise software means redundancy and reliability, so in reality you're deploying a clustered application server in "sandbox", then "staging", and maybe one day in the not too distant future - "production". At first everything is wonderful in the sysadmin Garden of Eden. But then the user issues start. A fix here, a fix there - then 10 fixes - wait ... now it's a whole fixpack. And somewhere along the way you find yourself muttering, "I thought we applied a fix for that". Well, you did, but it was applied to "staging" and somehow never made it to "sandbox", and you can't seem to reproduce the user issue.
That enterprise application server does not run like your Windows OS. It does not automatically update itself with the latest and greatest fixes because every fix carries some risk. So you apply only the fixes needed to solve a particular issue. But that approach requires a user to report the problem, someone to analyze the logs, someone to pick out the error that can then be searched on the support website, and someone to locate the fix, download it, apply it, and do the bookkeeping. The list goes on.
Now, here's another story, the support engineer's:
IBM Support identifies a problem in your application server. All products differ, but often there's a log file with debug or trace messages, and most importantly - an error signature. For J2EE-based products (WebSphere Application Server, WebSphere Portal, Lotus Quickr services for Portal, etc.), that error signature is usually a stack trace. Other times, the message ID is a clear indicator of the type of failure the system has experienced. The point is, something in the log told the engineer a problem occurred - a clue. The defect gets fixed, the binary is placed on Fix Central, and an APAR or technote is written to document the error and solution. If the error is reported again, the engineer can simply search to find the technote.
So what's wrong here? Both scenarios are destined for rediscovery of the problem. The administrator inevitably finds that a previous fix or configuration needs to be applied to other servers, and the support engineer hands the same solution to one administrator after another. When a team is involved, the knowledge or change set going into the application is even more compartmentalized, which makes the problem even more troubling.
Well, there just may be something that can help - autonomic software. That's a fancy term for simply saying, "when the server breaks, I want it to fix itself". And that ability is not too far-fetched with today's tooling.
Tivoli distributes two applications: Log Analyzer and Symptom Editor. Together, they are a good match for our user stories. As the support engineer attributes a fix or configuration to a particular error message, he or she creates a symptom. The symptom is nothing more than a description of the error, the conditions that trigger the error, and the recommendation. The symptom is complementary to the Log Analyzer. As the administrator reviews log files in Log Analyzer, the messages can be cross-checked against the list of symptoms. But since you now have a tool to analyze log files and a symptom database to describe the error and solution, let's put all of it on auto-pilot.
- Server logs error message
- Log Analyzer analyzes event
- Symptom provides context and solution
- Tooling downloads fix or supplies recommendation
- Administrator is notified
The first three stages are possible today. The last two - maybe, we'll see. Interested?
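To make the stages concrete, here is a minimal sketch in Python. It is not how Log Analyzer or Symptom Editor work internally; it only illustrates the idea of a symptom (a description, a trigger condition, and a recommendation) being matched against log lines, with a stand-in for the notification stage. The names Symptom, analyze, and notify, the regular expressions, the message ID EXMP0001E, and the fix ID EX00001 are all hypothetical and made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Symptom:
    """One known problem: how to recognize it and what to do about it."""
    pattern: str         # regular expression standing in for the trigger condition
    description: str     # what the error means
    recommendation: str  # fix or configuration change to apply


def analyze(log_lines: Iterable[str], symptoms: List[Symptom]) -> List[Tuple[int, Symptom]]:
    """Stages 2 and 3: return (line number, symptom) for each line matching a symptom."""
    compiled = [(re.compile(s.pattern), s) for s in symptoms]
    hits = []
    for number, line in enumerate(log_lines, start=1):
        for regex, symptom in compiled:
            if regex.search(line):
                hits.append((number, symptom))
    return hits


def notify(hits: List[Tuple[int, Symptom]]) -> List[str]:
    """Stage 5 stand-in: build one message per hit instead of really alerting anyone."""
    return [f"line {n}: {s.description} -> {s.recommendation}" for n, s in hits]


# Hypothetical symptoms and log lines, not taken from a real symptom database.
SYMPTOMS = [
    Symptom(r"java\.lang\.OutOfMemoryError", "JVM heap exhausted",
            "Review the maximum heap size"),
    Symptom(r"\bEXMP0001E\b", "Example message ID",
            "Apply the hypothetical fix EX00001"),
]

LOG = [
    "[1/1/10 10:00:00] INFO server started",
    "[1/1/10 10:05:00] ERROR java.lang.OutOfMemoryError: Java heap space",
    "[1/1/10 10:06:00] ERROR EXMP0001E something failed",
]

if __name__ == "__main__":
    for message in notify(analyze(LOG, SYMPTOMS)):
        print(message)
```

Stage 4 (downloading the fix) is deliberately left out of the sketch, since that is the part that does not exist yet.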
Where to download
Log Analyzer is most easily accessible through the IBM Support Assistant workbench (ISA). Start by downloading ISA here: IBM Support Assistant V4.1 Workbench.
The demonstration analyzes a simple test case log file. Typically you'll see many more records corresponding to log messages.
Quickr Symptom Database
I've created a Quickr symptom database with around 40 errors. Unzip the file to produce the IBM_Quickr_Portal_Version_8.symptom database file and simply import the database (really it's just XML) into Log Analyzer using the File menu. Once the database is added, you can begin analyzing your server logs to find known issues.
Now for some final thoughts. The solution is not perfect. Importing large log files raises some performance concerns, and larger files come with a high memory footprint. The symptom database is subject to the same concerns as technotes and APARs; if the content lacks quantity and quality, then the tool lacks utility. But it's built on Eclipse - you know how I feel about Eclipse - which means that it's an excellent base to build on. For example, stages 4 and 5 in the list above could be future contributed plugins. And the act of symptom creation already fits within IBM's process for knowledge review. With all that said, I think the tool is fantastic. Analyzing a 20 MB log file in a matter of minutes and getting back a list of known problems makes me far more productive. So I'm going to continue the Quickr symptom database within my team and with select customers. I'll post status updates on progress made, and if you're interested in contributing symptoms, ideas, requirements, etc. - please feel free to contact me.