The first step in good problem analysis is to describe
the problem completely. Without a problem description, you will not
know where to start investigating the cause of the problem.
This step includes asking yourself such basic questions as:
- What are the symptoms?
- Where is the problem happening?
- When does the problem happen?
- Under which conditions does the problem happen?
- Is the problem reproducible?
Answering these and other questions will lead to a good description
of most problems, and is the best way to start down the path of problem
resolution.
What are the symptoms?
When starting to
describe a problem, the most obvious question is "What is the problem?"
This might seem like a straightforward question; however, it can be
broken down into several other questions to create a more descriptive
picture of the problem. These questions can include:
- Who or what is reporting the problem?
- What are the error codes and error messages?
- How does it fail? For example: loop, hang, stop, performance degradation,
incorrect result.
- What is the effect on business?
Where is the problem happening?
Determining
where the problem originates is not always easy, but it is one of
the most important steps in resolving a problem. Many layers of technology
can exist between the reporting and failing components. Networks,
disks, and drivers are only a few components to be considered when
you are investigating problems.
- Is the problem platform specific, or common to multiple platforms?
- Are the current environment and configuration supported?
- Is the application running locally on the database server or on
a remote server?
- Is there a gateway involved?
- Is the database stored on individual disks, or on a RAID disk
array?
These types of questions will help you isolate the problem
layer, and are necessary to determine the problem source. Remember
that just because one layer is reporting a problem, it does not always
mean the root cause exists there.
Part of identifying where
a problem is occurring is understanding the environment in which it
exists. You should always take some time to completely describe the
problem environment, including the operating system, its version,
all corresponding software and versions, and hardware information.
Confirm that you are running in a supported
configuration, because many problems can be explained by discovering software
levels that are not meant to run together, or have not been fully
tested together.
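As an illustration only, the environment description can be captured with a short script so that it is recorded the same way every time. The following is a minimal sketch in Python using the standard `platform` module; the function name `describe_environment` and the `extra_versions` argument are hypothetical, and product levels (database, drivers, gateways) still have to be filled in by hand.

```python
import platform
import sys

def describe_environment(extra_versions=None):
    """Return a dict describing the machine this script runs on.

    extra_versions is a hypothetical mapping of product name to version
    string that you supply yourself (for example, database or driver levels).
    """
    info = {
        "os": platform.system(),           # for example "Linux" or "Windows"
        "os_release": platform.release(),  # kernel or OS release string
        "os_version": platform.version(),  # detailed build information
        "machine": platform.machine(),     # hardware architecture
        "python": sys.version.split()[0],  # interpreter running this script
    }
    info["software"] = dict(extra_versions or {})
    return info

if __name__ == "__main__":
    # "example-app" and "1.2.3" are placeholder values, not real products.
    for key, value in describe_environment({"example-app": "1.2.3"}).items():
        print(f"{key}: {value}")
```

Saving this output alongside the problem report makes it easier to compare the failing environment with the list of supported configurations.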
When does the problem happen?
Developing
a detailed time line of events leading up to a failure is another
necessary step in problem analysis, especially for those cases that
are one-time occurrences. You can most easily do this by working backwards:
start at the time an error was reported (as precisely as possible, even
down to milliseconds), and work backwards through the available logs and
information. Usually you only have to look as far as the first suspicious
event that you find in any diagnostic log. However, this is not always
easy to do and comes only with practice. Knowing when to stop
is especially difficult when there are multiple layers of technology,
each with its own diagnostic information.
- Does the problem only happen at a certain time of day or night?
- How often does it happen?
- What sequence of events leads up to the time the problem is reported?
- Does the problem happen after an environment change such as upgrading
existing or installing new software or hardware?
Responding to questions like these will help you create
a detailed time line of events, and will provide you with a frame
of reference in which to investigate.
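The working-backwards approach described in this section can be sketched in code. The example below is a minimal Python illustration, not a tool shipped with any product: it assumes every log line has the form `YYYY-MM-DD HH:MM:SS.mmm LEVEL message`, and the function names, the five-minute window, and the choice of `WARNING` and `ERROR` as "suspicious" levels are all assumptions made for the sketch. Real diagnostic logs use their own formats and need their own parsers.

```python
from datetime import datetime, timedelta

def parse_line(source, line):
    """Parse 'YYYY-MM-DD HH:MM:SS.mmm LEVEL message' (an assumed format)."""
    date_part, time_part, level, message = line.split(" ", 3)
    when = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S.%f")
    return when, source, level, message

def timeline_before(logs, error_time, window=timedelta(minutes=5)):
    """Merge entries from every log and return those inside the window
    that ends at error_time, newest first."""
    events = [parse_line(source, line)
              for source, lines in logs.items()
              for line in lines]
    recent = [e for e in events if error_time - window <= e[0] <= error_time]
    return sorted(recent, key=lambda e: e[0], reverse=True)

def first_suspicious(timeline, error_time, levels=("WARNING", "ERROR")):
    """Walk backwards from the error and return the first earlier event
    whose level looks suspicious, or None if there is none."""
    for event in timeline:
        when, _source, level, _message = event
        if when < error_time and level in levels:
            return event
    return None

if __name__ == "__main__":
    # Made-up sample entries from two hypothetical layers.
    logs = {
        "app": ["2024-01-01 10:00:05.250 ERROR query failed",
                "2024-01-01 09:59:00.000 INFO request started"],
        "db": ["2024-01-01 10:00:04.900 WARNING lock timeout",
               "2024-01-01 09:50:00.000 INFO checkpoint"],
    }
    reported = datetime(2024, 1, 1, 10, 0, 5, 250000)
    for when, source, level, message in timeline_before(logs, reported):
        print(when.isoformat(timespec="milliseconds"), source, level, message)
    print("first suspicious:", first_suspicious(timeline_before(logs, reported), reported))
```

Merging all layers into one ordered list is a deliberate choice here: it shows events from different components side by side, which helps when the reporting layer is not the failing layer.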
Under which conditions does the problem happen?
Knowing
what else is running at the time of a problem is important for any
complete problem description. If a problem occurs in a certain environment
or under certain conditions, that can be a key indicator of the problem
cause.
- Does the problem always occur when performing the same task?
- Does a certain sequence of events need to occur for the problem
to surface?
- Do other applications fail at the same time?
Answering these types of questions will help you explain
the environment in which the problem occurs, and correlate any dependencies.
Remember that just because multiple problems occurred around
the same time, it does not necessarily mean that they are related.
Is the problem reproducible?
From a problem
description and investigation standpoint, the "ideal" problem is one
that is reproducible. With reproducible problems you almost always
have a larger set of tools or procedures available to use to help
your investigation. Consequently, reproducible problems are usually
easier to debug and solve.
However, reproducible problems can
have a disadvantage: if the problem is of significant business impact,
you don't want it recurring. If possible, recreating the problem
in a test or development environment is often preferable in this case.
- Can the problem be recreated on a test machine?
- Are multiple users or applications encountering the same type
of problem?
- Can the problem be recreated by running a single command, a set of commands,
or a particular existing application or deliberately crafted test
application?
- Can the problem be recreated by executing the equivalent command/query
with the DB2® command line processor?
Recreating a single incident problem in a test or development
environment is often preferable, as there is usually much more flexibility
and control when investigating.
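When a problem might be recreated by running a single command, a small harness can run that command several times in a test environment and record how often it fails. The following Python sketch uses the standard `subprocess` module; the function name `reproduce`, the number of attempts, and the timeout are assumptions chosen for illustration, and the command you pass in is whatever reproduces the problem in your own environment.

```python
import subprocess
import sys

def reproduce(command, attempts=5, timeout=30):
    """Run command (a list of arguments) several times and collect failures.

    A run counts as a failure if it exits with a nonzero return code or
    does not finish within timeout seconds.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(command, capture_output=True,
                                    text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            failures.append((attempt, "timeout", ""))
            continue
        if result.returncode != 0:
            failures.append((attempt, result.returncode, result.stderr.strip()))
    return {"attempts": attempts, "failures": failures}

if __name__ == "__main__":
    # Placeholder command that always exits with code 3; replace it with
    # the command that is suspected of triggering the problem.
    report = reproduce([sys.executable, "-c", "import sys; sys.exit(3)"], attempts=3)
    print(f"{len(report['failures'])} of {report['attempts']} runs failed")
```

A result where every run fails suggests a consistently reproducible problem, while occasional failures point to a dependence on timing or on other conditions.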