Error logging tasks

This section describes how to read, generate, stop, clean, and copy error logs, and introduces the liberrlog services.

Reading an error report

To obtain a report of all errors logged in the 24 hours prior to the failure, type:

errpt -a -s mmddhhmmyy | pg

where mmddhhmmyy represents the month, day, hour, minute, and year 24 hours prior to the failure.
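
If the failure has only just occurred, the start stamp can be computed rather than typed by hand. The following is a minimal sketch, not part of the error-logging facility: the function name errpt_start_stamp is hypothetical, and it assumes that either a date command with GNU-style relative dates or perl with the POSIX module is installed. The stock AIX date command is assumed not to accept relative dates, hence the perl fallback.

```shell
# Print the mmddhhmmyy stamp for a moment N hours in the past (default 24).
# Assumption: GNU date (-d) or perl with the POSIX module is available.
errpt_start_stamp() {
    hours=${1:-24}
    # Try GNU-style relative dates first; fall through to perl if that fails.
    date -d "$hours hours ago" +%m%d%H%M%y 2>/dev/null && return 0
    perl -MPOSIX -e 'print strftime("%m%d%H%M%y", localtime(time - 3600 * $ARGV[0])), "\n"' "$hours"
}
```

The result can then be passed to the -s flag, for example: errpt -a -s "$(errpt_start_stamp 24)" | pg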

An error-log report contains the following information:

Note: Not all errors generate information for each of the following categories.
LABEL
Predefined name for the event.
ID
Numerical identifier for the event.
Date/Time
Date and time of the event.
Sequence Number
Unique number for the event.
Machine ID
Identification number of your system processor unit.
Node ID
Mnemonic name of your system.
Class
General source of the error. The possible error classes are:
H
Hardware. (When you receive a hardware error, refer to your system operator guide for information about performing diagnostics on the problem device or other piece of equipment. The diagnostics program tests the device and analyzes the error log entries related to it to determine the state of the device.)
S
Software.
O
Informational messages.
U
Undetermined (for example, a network).
Type
Severity of the error that has occurred. The following types of errors are possible:
PEND
The loss of availability of a device or component is imminent.
PERF
The performance of the device or component has degraded to below an acceptable level.
PERM
Condition that could not be recovered from. Error types with this value are usually the most severe errors and are more likely to mean that you have a defective hardware device or software module. Error types other than PERM usually do not indicate a defect, but they are recorded so that they can be analyzed by the diagnostics programs.
TEMP
Condition that was recovered from after a number of unsuccessful attempts. This error type is also used to record informational entries, such as data transfer statistics for DASD devices.
UNKN
It is not possible to determine the severity of the error.
INFO
The error log entry is informational and was not the result of an error.
Resource name
Name of the resource that has detected the error. For software errors, this is the name of a software component or an executable program. For hardware errors, this is the name of a device or system component. It does not indicate that the component is faulty or needs replacement. Instead, it is used to determine the appropriate diagnostic modules to be used to analyze the error.
Resource class
General class of the resource that detected the failure (for example, a device class of disk).
Resource type
Type of the resource that detected the failure (for example, a device type of 355mb).
Location code
Path to the device. There may be up to four fields, which refer to drawer, slot, connector, and port, respectively.
VPD
Vital product data. The contents of this field, if any, vary. Error log entries for devices typically return information concerning the device manufacturer, serial number, Engineering Change levels, and Read Only Storage levels.
Description
Summary of the error.
Probable cause
List of some of the possible sources of the error.
User causes
List of possible reasons for errors due to user mistakes. An improperly inserted disk and external devices (such as modems and printers) that are not turned on are examples of user-caused errors.
Recommended actions
Description of actions for correcting a user-caused error.
Install causes
List of possible reasons for errors due to incorrect installation or configuration procedures. Examples of this type of error include hardware and software mismatches, incorrect installation of cables or cable connections becoming loose, and improperly configured systems.
Recommended actions
Description of actions for correcting an installation-caused error.
Failure causes
List of possible defects in hardware or software.
Note: A failure causes section in a software error log usually indicates a software defect. Logs that list user or installation causes or both, but not failure causes, usually indicate that the problem is not a software defect.

If you suspect a software defect, or are unable to correct user or installation causes, report the problem to your software service department.

Recommended actions
Description of actions for correcting the failure. For hardware errors, PERFORM PROBLEM DETERMINATION PROCEDURES is one of the recommended actions listed; following it leads to running the diagnostic programs.
Detailed data
  • Failure data that is unique for each error log entry, such as device sense data.
  • Information on the current working directory of the process, such as FILE SYSTEM SERIAL NUMBER and INODE NUMBER when the process dumps the core.
To display a shortened version of the detailed report produced by the -a flag, use the -A flag. The -A flag is not valid with the -a, -g, or -t flags. The items reported when you use -A to produce the shortened version of the report are:
  • Label
  • Date and time
  • Type
  • Resource name
  • Description
  • Detail data
The example output of this flag is in the following format:
LABEL:           STOK_RCVRY_EXIT
Date/Time:       Tue Dec 14 15:25:33
Type:            TEMP Resource Name:   tok0
Description PROBLEM RESOLVED
Detail Data FILE NAME line: 273 file: stok_wdt.c 
SENSE DATA 
0000 0000 0000 0000 0000 0000 DEVICE ADDRESS 0004 AC62 25F1
Reporting can be turned off for some errors. To show which errors have reporting turned off, type:
errpt -t -F report=0 | pg

If reporting is turned off for any errors, enable reporting of all errors using the errupdate command.

Logging may also have been turned off for some errors. To show which errors have logging turned off, type:
errpt -t -F log=0 | pg

If logging is turned off for any errors, enable logging for all errors using the errupdate command. Logging all errors is useful if it becomes necessary to re-create a system error.
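
As a sketch of how the two listings above could feed the errupdate command, the following helper turns template IDs into modify stanzas. The function name is hypothetical. It assumes that the template ID is the first column of the errpt -t output, and that errupdate accepts stanzas of the form "=ID:" followed by "Report = TRUE" or "Log = TRUE", as its documentation describes. Check both assumptions on your system before applying any changes.

```shell
# Read `errpt -t -F report=0` (or log=0) output on stdin and print one
# errupdate modify stanza per template ID.
#   $1 = attribute to switch on: Report or Log
errupdate_enable_stanzas() {
    awk -v field="$1" '
        # Assumption: template IDs are 8 hex digits in the first column.
        # This test also skips the header line.
        length($1) == 8 && $1 ~ /^[0-9A-Fa-f]+$/ {
            printf "=%s:\n%s = TRUE\n\n", $1, field
        }
    '
}
```

For example, errpt -t -F report=0 | errupdate_enable_stanzas Report prints the stanzas for review. Saving them to a file and passing that file to errupdate as root is one way to apply them.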

Examples of detailed error reports

The following are sample error-report entries that are generated by issuing the errpt -a command.

An error-class value of H and an error-type value of PERM indicate that the system encountered a hardware problem (for example, with a SCSI adapter device driver) and could not recover from it. Diagnostic information might be associated with this type of error. If so, it displays at the end of the error listing, as illustrated in the following example of a problem encountered with a device driver:
LABEL:      SCSI_ERR1
ID:         0502F666

Date/Time:        Jun 19 22:29:51
Sequence Number:  95
Machine ID:       123456789012
Node ID:          host1
Class:            H
Type:             PERM
Resource Name:    scsi0
Resource Class:   adapter
Resource Type:    hscsi
Location:         00-08
VPD:
     Device Driver Level.........00
     Diagnostic Level............00
     Displayable Message.........SCSI
     EC Level....................C25928
     FRU Number..................30F8834
     Manufacturer................IBM97F
     Part Number.................59F4566
     Serial Number...............00002849
     ROS Level and ID............24
     Read/Write Register Ptr.....0120

Description
ADAPTER ERROR

Probable Causes
ADAPTER HARDWARE CABLE
CABLE TERMINATOR DEVICE

Failure Causes
ADAPTER
CABLE LOOSE OR DEFECTIVE

          Recommended Actions
          PERFORM PROBLEM DETERMINATION PROCEDURES
          CHECK CABLE AND ITS CONNECTIONS

Detail Data
SENSE DATA
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 

Diagnostic Log sequence number:  153
Resource Tested:        scsi0
Resource Description:   SCSI I/O Controller
Location:               00-08
SRN:                    889-191
Description:            Error log analysis indicates hardware failure.
Probable FRUs:
    SCSI Bus        FRU: n/a            00-08
                    Fan Assembly
    SCSI2           FRU: 30F8834        00-08
                    SCSI I/O Controller
An error-class value of H and an error-type value of PEND indicate that a piece of hardware (the Token Ring) may become unavailable soon due to numerous errors detected by the system.
LABEL:    TOK_ESERR
ID:       AF1621E8

Date/Time:       Jun 20 11:28:11
Sequence Number: 17262
Machine Id:      123456789012
Node Id:         host1
Class:           H
Type:            PEND
Resource Name:   TokenRing
Resource Class:  tok0
Resource Type:   Adapter
Location:        TokenRing

Description
EXCESSIVE TOKEN-RING ERRORS

Probable Causes
TOKEN-RING FAULT DOMAIN

Failure Causes
TOKEN-RING FAULT DOMAIN

        Recommended Actions
        REVIEW LINK CONFIGURATION DETAIL DATA
        CONTACT TOKEN-RING ADMINISTRATOR RESPONSIBLE FOR THIS LAN

Detail Data
SENSE DATA
0ACA 0032 A440 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 2080 0000 0000 0010 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 78CC 0000 0000 0005 C88F 0304 F4E0 0000 1000 5A4F 5685 
1000 5A4F 5685 3030 3030 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000
An error-class value of S and an error-type value of PERM indicate that the system encountered a problem with software and could not recover from it.
LABEL:    DSI_PROC
ID:       20FAED7F
 
Date/Time:       Jun 28 23:40:14
Sequence Number: 20136
Machine Id:      123456789012
Node Id:         123456789012
Class:           S
Type:            PERM
Resource Name:   SYSVMM

Description
Data Storage Interrupt, Processor

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
Data Storage Interrupt Status Register
4000 0000
Data Storage Interrupt Address Register
0000 9112
Segment Register, SEGREG
D000 1018
EXVAL
0000 0005
An error-class value of S and an error-type value of TEMP indicate that the system encountered a problem with software. After several attempts, the system was able to recover from the problem.
LABEL:          SCSI_ERR6
ID:             52DB7218
 
Date/Time:       Jun 28 23:21:11
Sequence Number: 20114
Machine Id:      123456789012
Node Id:         host1
Class:           S
Type:            TEMP
Resource Name:   scsi0

Description
SOFTWARE PROGRAM ERROR

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SENSE DATA
0000 0000 0000 0000 0000 0011 0000 0008 000E 0900 0000 0000 FFFF 
FFFE 4000 1C1F 01A9 09C4 0000 000F 0000 0000 0000 0000 FFFF FFFF 
0325 0018 0040 1500 0000 0000 0000 0000 0000 0000 0000 0000 0800 
0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000
An error-class value of O indicates that an informational message has been logged.
LABEL:     OPMSG
ID:        AA8AB241
 
Date/Time:       Jul 16 03:02:02
Sequence Number: 26042
Machine Id:      123456789012
Node Id:         host1
Class:           O
Type:            INFO
Resource Name:   OPERATOR

Description
OPERATOR NOTIFICATION

User Causes
errlogger COMMAND

        Recommended Actions
        REVIEW DETAILED DATA

Detail Data
MESSAGE FROM errlogger COMMAND
hdisk1 : Error log analysis indicates a hardware failure.
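
Because each detailed entry starts with a LABEL line and carries its class, type, and resource name on "Field: value" lines, a long errpt -a listing can be condensed with a short filter. The following is a sketch only: the function name is hypothetical, and it assumes the field spellings shown in the examples above (LABEL, Class, Type, Resource Name).

```shell
# Condense errpt -a output (read from stdin) to one line per entry:
#   LABEL CLASS TYPE RESOURCE_NAME
# Fields missing from an entry are printed as "-".
errpt_detail_summary() {
    awk '
        function emit() { printf "%s %s %s %s\n", label, cls, typ, res }
        /^LABEL:/         { if (label != "") emit(); label = $2; cls = typ = res = "-" }
        /^Class:/         { cls = $2 }
        /^Type:/          { typ = $2 }
        /^Resource Name:/ { res = $3 }
        END               { if (label != "") emit() }
    '
}
```

For example, errpt -a | errpt_detail_summary | awk '$2 == "H" && $3 == "PERM"' would list only the unrecovered hardware errors.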

Example of a summary error report

The following is an example of a summary error report generated using the errpt command. One line of information is returned for each error entry.

ERROR_
IDENTIFIER TIMESTAMP  T CL RESOURCE_NAME ERROR_DESCRIPTION
192AC071   0101000070 I O  errdemon      Error logging turned off
0E017ED1   0405131090 P H  mem2          Memory failure
9DBCFDEE   0101000070 I O  errdemon      Error logging turned on
038F2580   0405131090 U H  scdisk0       UNDETERMINED ERROR
AA8AB241   0405130990 I O  OPERATOR      OPERATOR NOTIFICATION
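
Since the summary report has one entry per line in fixed columns, it is easy to tally. The following sketch counts entries per class (CL) and type (T) pair. The function name is hypothetical, and the column positions are assumed to match the example above, with the identifier in column 1, T in column 3, and CL in column 4.

```shell
# Tally an errpt summary report (read from stdin) by class and type.
# Output lines have the form: CL T COUNT
errpt_summary_tally() {
    awk '
        # Only lines whose first field is an 8-character hex identifier are
        # entries; this skips the two header lines.
        length($1) == 8 && $1 ~ /^[0-9A-F]+$/ { count[$4 " " $3]++ }
        END { for (k in count) print k, count[k] }
    ' | LC_ALL=C sort
}
```

Running errpt | errpt_summary_tally gives a quick view of how many hardware, software, and informational entries the log holds.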

Generating an error report

To create an error report of software or hardware problems, do the following:

  1. Determine if error logging is on or off by determining if the error log contains entries:
    errpt -a
    The errpt command generates an error report from entries in the system error log.
    If the error log does not contain entries, error logging has been turned off. Activate the facility by typing:
    /usr/lib/errdemon
    Note: You must have root user access to run this command.

    The errdemon daemon starts error logging and writes error log entries in the system error log. If the daemon is not running, errors are not logged.

  2. Generate an error log report using the errpt command. For example, to see all the errors for the hdisk1 disk drive, type:
    errpt -N hdisk1
  3. Generate an error log report using SMIT. For example, use the smit errpt command:
    smit errpt
    1. Select 1 to send the error report to standard output, or select 2 to send the report to the printer.
    2. Select yes to display or print error log entries as they occur. Otherwise, select no.
    3. Specify the appropriate device name in the Select resource names option (such as hdisk1).
    4. Select Do.
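
Steps 1 and 2 can be combined in a small wrapper. This is a sketch under stated assumptions: the function name is hypothetical, and it assumes that errpt prints nothing when the log holds no entries. It only reports that logging may be off; starting /usr/lib/errdemon is left to the root user.

```shell
# Print the error report for one resource, after checking that the log
# contains entries at all.
#   $1 = resource name, for example hdisk1
report_for_resource() {
    res=$1
    if ! command -v errpt >/dev/null 2>&1; then
        echo "errpt not found; this sketch applies only where errpt is installed" >&2
        return 127
    fi
    # Assumption: an empty log produces no output from errpt.
    if [ -z "$(errpt | head -n 2)" ]; then
        echo "error log is empty; logging may be off (root can start /usr/lib/errdemon)" >&2
        return 1
    fi
    errpt -N "$res"
}
```

For example, report_for_resource hdisk1 prints the same report as errpt -N hdisk1 when the log is not empty.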

Stopping an error log

This procedure describes how to stop the error-logging facility.

To turn off error logging, use the errstop command. You must have root user authority to use this command.

Ordinarily, you would not want to turn off the error-logging facility. Instead, you should clean the error log of old or unnecessary entries.

Turn off the error-logging facility when you are installing or experimenting with new software or hardware. This way the error logging daemon does not use CPU time to log problems you know you are causing.

Cleaning an error log

Error-log cleaning is normally done for you as part of the daily cron command. If it is not done automatically, clean the error log yourself every couple of days after you have examined the contents to make sure there are no significant errors.

You can also clean up specific errors. For example, if you get a new disk and you do not want the old disk's errors in the log, you can clean just the old disk's errors.

Delete entries from your error log by doing either of the following:

  • Use the errclear -d command. For example, to delete all software errors, type:
    errclear -d S 0
    The errclear command deletes entries from the error log that are older than a specified number of days. The 0 in the previous example indicates that you want to delete entries for all days.
  • Use the smit errclear command:
    smit errclear
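
For the case described earlier, where only the entries for one resource (such as a replaced disk) should be removed, the errclear command is documented as accepting a -N flag with a resource name. Because deletion cannot be undone, the sketch below only prints the command it would run. The function name is hypothetical, and the -N usage should be checked against the errclear documentation on your system.

```shell
# Print (do not run) the errclear command for one cleanup case.
#   $1 = minimum age in days (0 means entries of any age)
#   $2 = optional resource name, for example the name of a replaced disk
errclear_plan() {
    days=${1:-0}
    res=${2:-}
    if [ -n "$res" ]; then
        echo "errclear -N $res $days"
    else
        echo "errclear $days"
    fi
}
```

For example, errclear_plan 0 hdisk1 prints the command that would delete every entry logged for hdisk1, and errclear_plan 30 prints the command that would delete all entries older than 30 days.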

Copying an error log to diskette or tape

Copy an error log by doing one of the following:

  • To copy the error log to diskette, use the ls and backup commands. Insert a formatted diskette into the diskette drive and type:
    ls /var/adm/ras/errlog | backup -ivp
  • To copy the error log to tape, insert a tape in the drive and type:
    ls /var/adm/ras/errlog | backup -ivpf/dev/rmt0
  • To gather system configuration information in a tar file and copy it to diskette, use the snap command. Insert a formatted diskette into the diskette drive and type:
    snap -a -o /dev/rfd0
    Note: To use the snap command, you need root user authority.

    The snap command in this example uses the -a flag to gather all information about your system configuration. The -o flag copies the compressed tar file to the device you name. In this example, /dev/rfd0 names your diskette drive.

    To gather all configuration information in a tar file and copy it to tape, type:
    snap -a -o /dev/rmt0

    The /dev/rmt0 names your tape drive.

Using the liberrlog services

The liberrlog services allow you to read entries from an error log, and provide a limited update capability. They are especially useful from an error notification method written in the C programming language, rather than a shell script. Accessing the error log using the liberrlog functions is much more efficient than using the errpt command.