IT35893: ANR9999D FAILURES ARE POSSIBLE WHEN RUNNING BACKUP NODE TO CONTAINER STORAGE POOLS

APAR status

Closed as program error.

Error description

[Problem Description]
A "BACKUP NODE" process might fail with various ANR9999D errors
when it backs up data to a container storage pool.

[Customer/L2 Diagnostics]
Example 1:

02/01/2021 13:35:59      ANR9999D_4237730896 SdAdjustBuf(sdbuf.c:1735) Thread<371>: The number of CQ slots for session 000000BD65823CE0 is being reduced to ZERO.(SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371> issued message 9999 from: (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdab864504 OutDiagToCons()+b4 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdab85db72 outDiagfExt()+112 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdab5a31c8 SdAdjustBuf()+4b8 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdab595b9f SdStore()+bff (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdab593e67 sdCreate()+8a7 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdaaeeb0d2 CreateBitfile()+ba2 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdaaedf152 bfCreate()+1332 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdaae84969 bfNASCreate()+b9 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdb856f936 moverAcceptConnection()+206 ndserver.c:1865 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdb8567295 ndmpdSelect()+2a5 ndmpconn.c:1154 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdb856f377 connectionHandler()+227 ndserver.c:695 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdaac1c443 startThread()+153 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdb9eb4f7f beginthreadex()+107 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdb9eb5126 endthreadex()+192 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdcda513f2 BaseThreadInitThunk()+22 (SESSION: 28)
02/01/2021 13:35:59      ANR9999D Thread<371>  7ffdce6f54f4 RtlUserThreadStart()+34 (SESSION: 28)

The problem occurs only when NDMP backups are written to
container storage pools. NDMP stream parsing produces a chunk
that is too large, which causes errors in the circular buffer
queue.

The problem originates during stream parsing, which expects
reads from the NAS filer to fall on 1K boundaries. When this
assumption is violated, the read becomes misaligned.
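
The alignment expectation can be illustrated with a minimal sketch.
It assumes that "1K" means 1024 bytes and that the "current" values
in the NdmpObjectSinkFunc trace lines of this APAR are byte offsets
into the receive buffer. Both are assumptions made for illustration;
the helper name misalignment is hypothetical and is not part of the
server code.

```python
BOUNDARY = 1024  # assumption: "1K" in the APAR text means 1024 bytes


def misalignment(offset: int, boundary: int = BOUNDARY) -> int:
    """Return how many bytes `offset` lies past the previous boundary."""
    return offset % boundary


# "current" values copied from the NdmpObjectSinkFunc trace lines in this
# APAR; treating them as buffer offsets is an assumption.
offsets = [0, 348, 8388956]

for off in offsets:
    # A result of 0 means the offset sits exactly on a 1K boundary.
    print(off, misalignment(off))
```

Under these assumptions, every offset after the initial 348-byte
copy lies 348 bytes past a 1K boundary.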

Example 2:

01/27/21   15:01:52   ANR9999D_1525641611 SdWriteNonDedupDataX(sdcreate.c:3755) Thread<1414>: Unexpected large meta data chunk size: 13046784. (SESSION: 10)
01/27/21   15:01:52   ANR9999D Thread<1414> issued message 9999 from: (SESSION: 10)
01/27/21   15:01:52   ANR9999D Thread<1414>  0x0000000100086a30 StdPutText (SESSION: 10)
01/27/21   15:01:52   ANR9999D Thread<1414>  0x0000000100087364 OutDiagToCons (SESSION: 10)
01/27/21   15:01:52   ANR9999D Thread<1414>  0x00000001000633e4 outDiagfExt (SESSION: 10)
01/27/21   15:01:52   ANR9999D Thread<1414>  0x0000000100e2d0c0 SdWriteNonDedupDataX (SESSION: 10)
01/27/21   15:01:52   ANR9999D Thread<1414>  0x0000000100e34f48 SdWriteDedupData (SESSION: 10)
01/27/21   15:01:52   ANR9999D Thread<1414>  0x0000000101f67720 SdCQSinkThread (SESSION: 10)
01/27/21   15:01:52   ANR9999D Thread<1414>  0x000000010009654c StartThread (SESSION: 10)

As in the previous example, a large chunk is produced, this time
in the non-dedup chunk path, and the error indicates that the
metadata chunk is too large to store.

In both cases, running with SPI SPID BF RABIN SD tracing enabled
is helpful for diagnosis. The trace shows the failing read
iteration, in which the server reads data from the filer and
creates an unexpectedly large chunk that is sent down to the
container layer (SD). The trace looks similar to the following:

10:36:54.625 [368][bfdedup.c][14021][NdmpObjectSinkFunc]:dataAmount: 0, current: 0, bufLeft: 348, amountToCopy: 348
10:36:56.153 [368][bfdedup.c][14021][NdmpObjectSinkFunc]:dataAmount: 0, current: 348, bufLeft: 8388608, amountToCopy: 8388608
10:36:57.804 [368][bfdedup.c][14021][NdmpObjectSinkFunc]:dataAmount: 0, current: 8388956, bufLeft: 8388608, amountToCopy: 8388608
10:38:14.743 [368][sdbuf.c][1691][SdAdjustBuf]:Number 1 segment: length 8388260, bytesRecv 8388260, residual 25165824
10:38:14.743 [368][sdbuf.c][1691][SdAdjustBuf]:Number 0 segment: length 16777564, bytesRecv 16777564, residual 16777564
10:38:14.743 [368][sdbuf.c][1728][SdAdjustBuf]:Slot 000000BD6AEC76F0 is too small to hold one complete data chunk. Merging it into the next slot

Note that the "amountToCopy" values in the first three lines add
up to the length of the large chunk in one of the buffer slots
(348 + 8388608 + 8388608 = 16777564, the length of segment 0).
The trace then shows the buffer trying to compensate by merging
the slot into the next slot. The merge fails and causes the
ANR9999D error.
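
The arithmetic can be checked directly against the trace. The
following is a minimal sketch that extracts the numbers from the
trace lines shown above (timestamps and file tags are omitted for
brevity) and compares the totals. The regular expressions and
variable names are illustrative only. The comparison of the two
segment lengths with the "residual" value is an observation about
these numbers, not a documented meaning of that field.

```python
import re

# Trace lines copied from this APAR, with timestamps and file tags removed.
TRACE = """\
[NdmpObjectSinkFunc]:dataAmount: 0, current: 0, bufLeft: 348, amountToCopy: 348
[NdmpObjectSinkFunc]:dataAmount: 0, current: 348, bufLeft: 8388608, amountToCopy: 8388608
[NdmpObjectSinkFunc]:dataAmount: 0, current: 8388956, bufLeft: 8388608, amountToCopy: 8388608
[SdAdjustBuf]:Number 1 segment: length 8388260, bytesRecv 8388260, residual 25165824
[SdAdjustBuf]:Number 0 segment: length 16777564, bytesRecv 16777564, residual 16777564
"""

# Bytes copied by each NdmpObjectSinkFunc iteration.
copied = [int(n) for n in re.findall(r"amountToCopy: (\d+)", TRACE)]

# Segment number -> segment length, as reported by SdAdjustBuf.
segments = {
    int(num): int(length)
    for num, length in re.findall(r"Number (\d+) segment: length (\d+)", TRACE)
}

total_copied = sum(copied)

print("sum of amountToCopy:", total_copied)
print("matches segment 0 length:", total_copied == segments[0])
print("segment 0 + segment 1:", segments[0] + segments[1])
```

The sum of the three copies equals the 16777564-byte length of
segment 0, and the two segment lengths together equal the 25165824
residual that is reported on the "Number 1 segment" line.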

[IBM Spectrum Protect Versions Affected]
IBM Spectrum Protect Server 8.1.10.000 and higher on all
supported platforms.

[Initial Impact]
High

[Additional Keywords]
TSM NAS NDMP backup ANR9999D "BACKUP NODE" "Spectrum Protect"
container

Local fix

Redirect NDMP backups temporarily to sequential device class
storage pools.

Problem summary

****************************************************************
* USERS AFFECTED:                                              *
* All IBM Spectrum Protect server users.                       *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See error description.                                       *
****************************************************************
* RECOMMENDATION:                                              *
* Apply fixing level when available. This problem is currently *
* projected to be fixed in levels 8.1.10.300, 8.1.11.100, and  *
* 8.1.12. Note that this is subject to change at the           *
* discretion of IBM.                                           *
****************************************************************
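
A server level can be compared against the levels named in this
APAR with a short script. The following is a minimal sketch, not an
IBM-provided tool. It assumes that levels are written as dotted
numbers such as 8.1.10.200, and that every level at or above the
projected fixing level in its own release stream, and every release
from 8.1.12 onward, contains the fix. The fixing levels are
projections that are subject to change. The function names
parse_level and is_exposed are hypothetical.

```python
# Levels taken from this APAR. The fixing levels are projections only.
FIRST_AFFECTED = (8, 1, 10, 0)
PROJECTED_FIXES = [(8, 1, 10, 300), (8, 1, 11, 100), (8, 1, 12, 0)]


def parse_level(text: str) -> tuple:
    """Turn '8.1.10.200' into (8, 1, 10, 200), padding short levels with zeros."""
    parts = tuple(int(p) for p in text.strip().split("."))
    return parts + (0,) * (4 - len(parts))


def is_exposed(level_text: str) -> bool:
    """Return True if the level is affected and below its projected fixing level."""
    level = parse_level(level_text)
    if level < FIRST_AFFECTED:
        return False  # earlier than the first affected level
    for fix in PROJECTED_FIXES:
        if level[:3] == fix[:3]:
            return level < fix  # same release stream: compare with its fix
    # Assumption: releases later than 8.1.12 carry the fix forward.
    return level < PROJECTED_FIXES[-1]


for candidate in ["8.1.9.300", "8.1.10.200", "8.1.10.300", "8.1.11.000", "8.1.12"]:
    print(candidate, "exposed" if is_exposed(candidate) else "not exposed")
```

A level that is reported as exposed still needs NDMP backups to
container storage pools to encounter the problem.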

Problem conclusion

This problem was fixed.
Affected platforms for reported release:  AIX, Linux, and
Windows.
Platforms fixed:  AIX, Linux, and Windows.

Temporary fix

Comments

APAR Information

APAR number
IT35893
Reported component name
TSM SERVER
Reported component ID
5698ISMSV
Reported release
81A
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2021-02-12
Closed date
2021-03-04
Last modified date
2021-03-04

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name
TSM SERVER
Fixed component ID
5698ISMSV

Applicable component levels

R81A PSY
UP
R81L PSY
UP
R81W PSY
UP


Document Information

Modified date:
18 November 2021
