IBM Support

IT35893: ANR9999D FAILURES ARE POSSIBLE WHEN RUNNING BACKUP NODE TO CONTAINER STORAGE POOLS

APAR status

  • Closed as program error.

Error description

  • [Problem Description]
    A "BACKUP NODE" process that writes to a container storage pool
    can fail with one of several ANR9999D errors.
    
    [Customer/L2 Diagnostics]
    Example 1:
    
    02/01/2021 13:35:59      ANR9999D_4237730896 SdAdjustBuf(sdbuf.c:1735) Thread<371>: The number of CQ slots for session 000000BD65823CE0 is being reduced to ZERO.(SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371> issued message 9999 from: (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdab864504 OutDiagToCons()+b4 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdab85db72 outDiagfExt()+112 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdab5a31c8 SdAdjustBuf()+4b8 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdab595b9f SdStore()+bff (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdab593e67 sdCreate()+8a7 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdaaeeb0d2 CreateBitfile()+ba2 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdaaedf152 bfCreate()+1332 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdaae84969 bfNASCreate()+b9 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdb856f936 moverAcceptConnection()+206 ndserver.c:1865 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdb8567295 ndmpdSelect()+2a5 ndmpconn.c:1154 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdb856f377 connectionHandler()+227 ndserver.c:695 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdaac1c443 startThread()+153 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdb9eb4f7f beginthreadex()+107 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdb9eb5126 endthreadex()+192 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdcda513f2 BaseThreadInitThunk()+22 (SESSION: 28)
    02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdce6f54f4 RtlUserThreadStart()+34 (SESSION: 28)
    
    The problem occurs only when NDMP backups are written to
    container storage pools. NDMP stream parsing produces a chunk
    that is too large, which causes errors in the circular buffer
    queue.
    
    The problem originates in stream parsing, which expects reads
    from the NAS filer to fall on 1K boundaries. When this
    assumption is violated, the read becomes misaligned.
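
To illustrate the alignment assumption, the following minimal sketch tracks the running stream offset after each read and reports the first read that ends off a boundary. It assumes that "1K" means 1024 bytes. The function name first_misaligned_read is hypothetical and is not part of the server code. The read sizes are the amountToCopy values from the trace shown later in this description.

```python
BOUNDARY = 1024  # assumption: "1K" in the description means 1024 bytes


def first_misaligned_read(read_sizes, boundary=BOUNDARY):
    """Return (index, offset) of the first read that ends off a boundary.

    Returns None if every read ends on a multiple of `boundary`.
    """
    offset = 0
    for index, size in enumerate(read_sizes):
        offset += size
        if offset % boundary != 0:
            return index, offset
    return None


# amountToCopy values taken from the trace in this APAR
reads = [348, 8388608, 8388608]
print(first_misaligned_read(reads))             # (0, 348)
print(first_misaligned_read([1024, 8388608]))   # None
```

Because the first read is only 348 bytes, every later 8 MiB read in this sequence also ends 348 bytes past a 1024-byte boundary.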
    
    Example 2:
    
    01/27/21   15:01:52   ANR9999D_1525641611 SdWriteNonDedupDataX(sdcreate.c:3755) Thread<1414>: Unexpected large meta data chunk size: 13046784. (SESSION: 10)
    01/27/21   15:01:52   ANR9999D Thread<1414> issued message 9999 from: (SESSION: 10)
    01/27/21   15:01:52   ANR9999D Thread<1414>  0x0000000100086a30 StdPutText (SESSION: 10)
    01/27/21   15:01:52   ANR9999D Thread<1414>  0x0000000100087364 OutDiagToCons (SESSION: 10)
    01/27/21   15:01:52   ANR9999D Thread<1414>  0x00000001000633e4 outDiagfExt (SESSION: 10)
    01/27/21   15:01:52   ANR9999D Thread<1414>  0x0000000100e2d0c0 SdWriteNonDedupDataX  (SESSION: 10)
    01/27/21   15:01:52   ANR9999D Thread<1414>  0x0000000100e34f48 SdWriteDedupData  (SESSION: 10)
    01/27/21   15:01:52   ANR9999D Thread<1414>  0x0000000101f67720 SdCQSinkThread (SESSION: 10)
    01/27/21   15:01:52   ANR9999D Thread<1414>  0x000000010009654c StartThread (SESSION: 10)
    
    As in Example 1, an oversized chunk is produced, this time in
    the non-dedup chunk path. The server then reports that the
    metadata chunk is too large to store.
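
As a rough aid for spotting either failure in an exported activity log, the sketch below searches for the first message of Example 1 and Example 2. It is an illustration only: the function name find_signatures and the signature labels are hypothetical, and the patterns are derived solely from the two messages quoted in this APAR, so other ANR9999D variants of this problem would not be matched.

```python
import re

# First-failure messages quoted in Example 1 and Example 2 of this APAR.
# The dictionary keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "cq_slots_reduced_to_zero": re.compile(
        r"number of CQ slots for session \S+ is being reduced to ZERO"
    ),
    "large_metadata_chunk": re.compile(
        r"Unexpected large meta data chunk size: \d+"
    ),
}


def find_signatures(log_text):
    """Return the sorted labels of the signatures found in log_text."""
    # Activity log output wraps long messages across lines, so collapse
    # all whitespace (including newlines) to single spaces before matching.
    flat = " ".join(log_text.split())
    return sorted(name for name, rx in SIGNATURES.items() if rx.search(flat))


sample = """
02/01/2021 13:35:59      ANR9999D_4237730896
SdAdjustBuf(sdbuf.c:1735)
                          Thread<371>: The number of CQ slots
for session
                          000000BD65823CE0 is being reduced to
ZERO.(SESSION: 28)
"""
print(find_signatures(sample))  # ['cq_slots_reduced_to_zero']
```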
    
    In both cases, running with the SPI SPID BF RABIN SD trace
    enabled is helpful for diagnosis. The trace shows the failing
    read iteration, in which the server reads data from the filer
    and creates an unexpectedly large chunk that is sent down to
    the container layer (SD). The trace looks similar to the
    following:
    
    10:36:54.625 [368][bfdedup.c][14021][NdmpObjectSinkFunc]:dataAmount: 0, current: 0, bufLeft: 348, amountToCopy: 348
    10:36:56.153 [368][bfdedup.c][14021][NdmpObjectSinkFunc]:dataAmount: 0, current: 348, bufLeft: 8388608, amountToCopy: 8388608
    10:36:57.804 [368][bfdedup.c][14021][NdmpObjectSinkFunc]:dataAmount: 0, current: 8388956, bufLeft: 8388608, amountToCopy: 8388608
    10:38:14.743 [368][sdbuf.c][1691][SdAdjustBuf]:Number 1 segment: length 8388260, bytesRecv 8388260, residual 25165824
    10:38:14.743 [368][sdbuf.c][1691][SdAdjustBuf]:Number 0 segment: length 16777564, bytesRecv 16777564, residual 16777564
    10:38:14.743 [368][sdbuf.c][1728][SdAdjustBuf]:Slot 000000BD6AEC76F0 is too small to hold one complete data chunk. Merging it into the next slot
    
    Note that the "amountToCopy" values in the first three lines
    (348 + 8388608 + 8388608) add up to 16777564, which is the
    length of the oversized chunk reported for segment 0. The trace
    then shows the buffer trying to compensate by merging the slot
    into the next slot. The merge fails and causes the ANR9999D
    error.
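
The same arithmetic can be checked mechanically when reviewing a longer trace. The following is a minimal sketch that assumes the trace lines have the field layout shown above. The helper names copied_total and segment_lengths are hypothetical, and the regular expressions tolerate line wrapping inside a message.

```python
import re

# Trace excerpt from this APAR (timestamps omitted).
TRACE = """\
[368][bfdedup.c][14021][NdmpObjectSinkFunc]:dataAmount: 0, current: 0, bufLeft: 348, amountToCopy: 348
[368][bfdedup.c][14021][NdmpObjectSinkFunc]:dataAmount: 0, current: 348, bufLeft: 8388608, amountToCopy: 8388608
[368][bfdedup.c][14021][NdmpObjectSinkFunc]:dataAmount: 0, current: 8388956, bufLeft: 8388608, amountToCopy: 8388608
[368][sdbuf.c][1691][SdAdjustBuf]:Number 1 segment: length 8388260, bytesRecv 8388260, residual 25165824
[368][sdbuf.c][1691][SdAdjustBuf]:Number 0 segment: length 16777564, bytesRecv 16777564, residual 16777564
"""


def copied_total(trace_text):
    """Sum every amountToCopy value found in the trace text."""
    return sum(int(n) for n in re.findall(r"amountToCopy:\s*(\d+)", trace_text))


def segment_lengths(trace_text):
    """Map segment number to its reported length."""
    pairs = re.findall(r"Number\s+(\d+)\s+segment:\s+length\s+(\d+)", trace_text)
    return {int(number): int(length) for number, length in pairs}


total = copied_total(TRACE)
segments = segment_lengths(TRACE)
print(total)                                                   # 16777564
print([n for n, length in segments.items() if length == total])  # [0]
```

A segment whose length equals the accumulated amountToCopy total, rather than a single 8388608-byte buffer, is the oversized chunk described above.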
    
    [IBM Spectrum Protect Versions Affected]
    IBM Spectrum Protect Server 8.1.10.000 and higher on all
    supported platforms.
    
    [Initial Impact]
    High
    
    [Additional Keywords]
    TSM NAS NDMP backup ANR9999D "BACKUP NODE" "Spectrum Protect"
    container
    

Local fix

  • Temporarily redirect NDMP backups to storage pools that use a
    sequential-access device class.
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * All IBM Spectrum Protect server users.                       *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See error description.                                       *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply fixing level when available. This problem is currently *
    * projected to be fixed in levels 8.1.10.300, 8.1.11.100, and  *
    * 8.1.12. Note that this is subject to change at the           *
    * discretion of IBM.                                           *
    ****************************************************************
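
To compare an installed server level against the levels named in this APAR, a check along the following lines could be used. It is a sketch under stated assumptions: the fix levels are the projected ones listed above and are subject to change, the function names parse_level and apar_status are hypothetical, and levels outside the 8.1 release are reported as not covered because this APAR does not describe them.

```python
# Projected fix levels from this APAR: modification -> minimum fix level.
# 8.1.12 and later 8.1 modifications are treated as containing the fix.
FIXED_IN = {10: 300, 11: 100}
FIRST_AFFECTED_MOD = 10
FIRST_FIXED_MOD = 12


def parse_level(level):
    """Parse 'V.R.M' or 'V.R.M.F' into a 4-tuple of integers."""
    parts = [int(p) for p in level.split(".")]
    if len(parts) == 3:
        parts.append(0)  # treat a missing fix level as 0
    if len(parts) != 4:
        raise ValueError(f"unexpected level format: {level!r}")
    return tuple(parts)


def apar_status(level):
    """Classify a server level relative to the levels named in IT35893."""
    version, release, mod, fix = parse_level(level)
    if (version, release) != (8, 1):
        return "not covered by this APAR"
    if mod < FIRST_AFFECTED_MOD:
        return "not listed as affected"
    if mod >= FIRST_FIXED_MOD:
        return "fixed"
    return "fixed" if fix >= FIXED_IN[mod] else "affected"


for level in ("8.1.9.300", "8.1.10.000", "8.1.10.300", "8.1.11.0", "8.1.12"):
    print(level, "->", apar_status(level))
```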
    

Problem conclusion

  • This problem was fixed.
    Affected platforms for reported release:  AIX, Linux, and
    Windows.
    Platforms fixed:  AIX, Linux, and Windows.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT35893

  • Reported component name

    TSM SERVER

  • Reported component ID

    5698ISMSV

  • Reported release

    81A

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2021-02-12

  • Closed date

    2021-03-04

  • Last modified date

    2021-03-04

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    TSM SERVER

  • Fixed component ID

    5698ISMSV

Applicable component levels

  • R81A PSY

       UP

  • R81L PSY

       UP

  • R81W PSY

       UP

Document Information

Modified date:
18 November 2021