matthewpmattson
2 Posts

Tracing SC_DISK_ERR4

2013-04-29T20:42:45Z

Hi all,

I have some AIX hosts connected via SAN to a 3PAR storage array, and while running I/O on the disks I am seeing TEMP errors in my errpt. Full details are below.

From the errpt -c command:

DCB47997   0429144313 T H hdisk1         DISK OPERATION ERROR
DCB47997   0429144313 T H hdisk11        DISK OPERATION ERROR
DCB47997   0429144313 T H hdisk11        DISK OPERATION ERROR
DCB47997   0429144313 T H hdisk20        DISK OPERATION ERROR
DCB47997   0429144313 T H hdisk4         DISK OPERATION ERROR
DCB47997   0429144313 T H hdisk15        DISK OPERATION ERROR
DCB47997   0429144313 T H hdisk15        DISK OPERATION ERROR
DCB47997   0429144313 T H hdisk5         DISK OPERATION ERROR
DCB47997   0429144413 T H hdisk2         DISK OPERATION ERROR
DCB47997   0429144413 T H hdisk2         DISK OPERATION ERROR
DCB47997   0429144413 T H hdisk13        DISK OPERATION ERROR
DCB47997   0429144413 T H hdisk13        DISK OPERATION ERROR
DCB47997   0429144513 T H hdisk17        DISK OPERATION ERROR
DCB47997   0429144513 T H hdisk17        DISK OPERATION ERROR
DCB47997   0429144513 T H hdisk16        DISK OPERATION ERROR
DCB47997   0429144513 T H hdisk5         DISK OPERATION ERROR
DCB47997   0429144513 T H hdisk5         DISK OPERATION ERROR
DCB47997   0429144513 T H hdisk2         DISK OPERATION ERROR
DCB47997   0429144613 T H hdisk1         DISK OPERATION ERROR

Detail of one of the errors:

---------------------------------------------------------------------------
LABEL:          SC_DISK_ERR4
IDENTIFIER:     DCB47997

Date/Time:       Sun Apr 28 22:14:51 CDT 2013
Sequence Number: 33081
Machine Id:      00F7A89E4C00
Node Id:         blue7
Class:           H
Type:            TEMP
WPAR:            Global
Resource Name:   hdisk18
Resource Class:  disk
Resource Type:   3PAR_VV_MPIO
Location:        U8231.E1C.06A89ER-V1-C35-T1-W20210002AC0185E0-L12000000000000

VPD:
        Manufacturer................3PARdata
        Machine Type and Model......VV
        Serial Number...............C000009900000000

Description
DISK OPERATION ERROR

Probable Causes
MEDIA
DASD DEVICE

User Causes
MEDIA DEFECTIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
PATH ID
           0
SENSE DATA
0A00 2800 0116 7E20 0000 4004 0000 0000 0000 0000 0000 0000 0200 0300 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 C336 0002 5080 0000 0000 0000 0000 0000 0000 0000 0083 0000
0000 0035 001D

-------------------------------------------------------------------------------------------------------------

Can anyone tell from the sense data why these errors are appearing?

I have searched the net for this specific error, but there does not seem to be any single cause for it, so I am looking for ideas on where to start debugging.

Thanks,

Matt

 

  • dukessd
    345 Posts

    Re: Tracing SC_DISK_ERR4

    2013-04-29T23:40:03Z

    Hi Matt,

    The sense data shows a SCSI command timeout.

    The "Fibre Channel Planning and Integration" guide from IBM shows how to understand these errors:

    http://publib.boulder.ibm.com/systems/hardware_docs/pdf/234329.pdf

    Page 89 onwards covers decoding the SC_DISK_ERR events.


    Sense Data Layout

    LL00 CCCC CCCC CCCC CCCC CCCC CCCC CCCC CCCC RRRR RRRR RRRR VVSS AARR DDDD KKDD
    0A00 2800 0116 7E20 0000 4004 0000 0000 0000 0000 0000 0000 0200 0300 0000 0000


    You have VV = 02 and AA = 03, which means:


    "Command Timeout. This indicates that the SCSI command did not
    complete within the allowed time. This usually indicates a hardware
    problem related to the SCSI transport layer."
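As an illustration of the layout above, here is a minimal Python sketch that pulls the LL, CDB, VV, SS and AA bytes out of the first line of sense data. The function name and the dictionary keys are made up for this example, the byte offsets are simply read off the layout line, and the interpretation of VV = 02 / AA = 03 as a command timeout is taken from the guide as quoted above, not derived by the code.

```python
def decode_sense(sense_hex: str) -> dict:
    """Split the first 32 bytes of errpt SENSE DATA by the layout
    LL00 CCCC*8 RRRR*3 VVSS AARR DDDD KKDD (offsets counted from that line)."""
    data = bytes.fromhex("".join(sense_hex.split()))
    if len(data) < 32:
        raise ValueError("need at least the first 32 bytes of sense data")
    cdb_len = data[0]                      # LL: length of the SCSI command
    return {
        "LL": cdb_len,
        "CDB": data[2:2 + min(cdb_len, 16)].hex(),  # C bytes, at most 16
        "VV": data[24],
        "SS": data[25],
        "AA": data[26],
    }


# First line of SENSE DATA from the hdisk18 entry in this thread.
first_line = ("0A00 2800 0116 7E20 0000 4004 0000 0000 "
              "0000 0000 0000 0000 0200 0300 0000 0000")
fields = decode_sense(first_line)
print(fields)

# Per the guide quoted above, VV = 02 with AA = 03 is a command timeout.
is_timeout = fields["VV"] == 0x02 and fields["AA"] == 0x03
print("command timeout" if is_timeout else "other status, see the guide")
```

The leading CDB byte 0x28 is the SCSI READ(10) opcode, so in this entry the command that timed out was a read.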

    Check whether they all have the same path ID. If so, look at what that path has in common: adapter, switch, storage subsystem port.
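One way to do the path ID check across many entries is to save the detailed report to a file and tally the PATH ID values per disk. The sketch below is only a rough helper, assuming the detailed output looks like the hdisk18 entry earlier in this thread (a "Resource Name:" line, and a "PATH ID" line followed by the value on the next line). The function name is hypothetical and the sample text is a trimmed, made-up excerpt in that format, not real log data.

```python
import re
from collections import Counter


def tally_paths(errpt_detail: str) -> Counter:
    """Count (resource name, path id) pairs in detailed errpt text."""
    counts = Counter()
    resource = None
    lines = errpt_detail.splitlines()
    for i, line in enumerate(lines):
        match = re.match(r"\s*Resource Name:\s+(\S+)", line)
        if match:
            resource = match.group(1)      # remember the current disk
        elif line.strip() == "PATH ID" and i + 1 < len(lines):
            # The path id value sits on the line after the PATH ID label.
            counts[(resource, lines[i + 1].strip())] += 1
    return counts


# Trimmed, illustrative excerpt with two entries for the same disk.
sample = """\
Resource Name:   hdisk18
Detail Data
PATH ID
           0
Resource Name:   hdisk18
Detail Data
PATH ID
           1
"""
counts = tally_paths(sample)
for (disk, path), n in sorted(counts.items()):
    print(f"{disk} path {path}: {n} error(s)")
```

If every disk shows errors on a single path ID, the shared components of that path are the first suspects. Errors spread over several paths point further downstream or upstream of the individual links.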

    If there are no other errors, that would suggest a problem on the storage subsystem: for some reason that controller, port, or LUN is not responding in a timely manner.

    As you are using NPIV, you should also check for adapter and interface errors on the associated VIOS.

    HTH

  • matthewpmattson
    2 Posts

    Re: Tracing SC_DISK_ERR4

    2013-04-30T17:05:48Z

    Hi,

    The path ID alternates between 0 and 1, which I guess means commands are timing out on both paths? These are the only errors I am seeing in the AIX errpt. I checked the VIOS error log as you suggested and found a few "Misbehaved Virtual FC Client" errors. I am not sure whether they are related or something different, but the entry reads:

    ---------------------------------------------------------------------------
    LABEL:          VFC_CLIENT_FAILURE
    IDENTIFIER:     88E96781

    Date/Time:       Tue Apr 30 09:37:11 PDT 2013
    Sequence Number: 79
    Machine Id:      00F7A89E4C00
    Node Id:         blue246
    Class:           S
    Type:            TEMP
    WPAR:            Global
    Resource Name:   vfchost1

    Description
    Misbehaved Virtual FC Client

    Probable Causes
    Bad IU, or Protocol Violation

    Failure Causes
    Bad IU, or Protocol Violation

            Recommended Actions
            Remove Virtual FC Client, then Configure the same instance

    Detail Data
    ADDITIONAL INFORMATION
            module: trans_event     rc: 00000000FFFFFFD8    location: 00000514
            data:  1 1 0 0 0
    ---------------------------------------------------------------------------

     

    Matt