IBM Support

Node error 565 "Node internal disk is failing" due to /dumps full

Troubleshooting


Problem

A canister reported "Error ID (Event ID) 074565 : Node internal disk is failing" and "Error Code: 1039 : Canister failure, canister replacement required" because the internal directory /dumps exceeded its size limit. The canister goes into service state, so "cleardumps" cannot be used to free up the space. Instead, you need to run "satask snap -clean" to clean the /dumps directory and then bring the canister out of service state.

Cause

There is a defect whereby old files in /dumps are not deleted automatically, so the directory fills up over time.

Diagnosing The Problem

When this problem occurs, the following error log entry is opened:


Error Log Entry
 Node Identifier       : node2
 Object Type           : node
 Object ID             : 2

   :
 Error ID              : 074565 : Node internal disk is failing
 Error Code            : 1039 : Canister failure, canister replacement required
 

The command "sainfo lsservicenodes" shows the affected node in service state and shows that the size of /dumps is causing the problem:


> sainfo lsservicenodes
panel_name cluster_id       cluster_name node_id node_name relation node_status error_data
01-2       00000XXXXXXXXXXX Flash840_Sec 2       node2     local    Service     565 Disk full: /dumps
01-1       00000XXXXXXXXXXX Flash840_Sec 3       node1     partner  Active

Resolving The Problem

If you hit the above scenario, do not replace the canister, even though error code 1039 asks you to.
Instead, clean the /dumps directory and bring the node out of service. Normally the CLI command "cleardumps -prefix /dumps" is available for this purpose, but it does not work on a node that is in service state. Therefore, proceed as follows:

1) Connect via CLI to the service IP of the affected node.
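The service assistant CLI is typically reached over SSH as the superuser account. A minimal sketch, assuming a placeholder service IP (substitute the service IP of the affected canister in your environment):

```shell
# Placeholder service IP; replace with the service IP of the affected node.
SERVICE_IP=192.168.70.121

# The service assistant CLI is reached over SSH as the superuser account.
CONNECT_CMD="ssh superuser@${SERVICE_IP}"
echo "Run: ${CONNECT_CMD}"
# ${CONNECT_CMD}   # uncomment to actually open the session
```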

2) Run "sainfo lsservicenodes" to confirm that you are connected to the correct canister. The relation should be "local" for the node that is in service state. You can also confirm the error "565 Disk full: /dumps". Further, check that the partner node is active.

> sainfo lsservicenodes
panel_name cluster_id       cluster_name node_id node_name relation node_status error_data
01-2       00000XXXXXXXXXXX Flash840_Sec 2       node2     local    Service     565 Disk full: /dumps
01-1       00000XXXXXXXXXXX Flash840_Sec 3       node1     partner  Active
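If you want to script this check, a minimal sketch that looks for the error 565 signature in the command output (the output is simulated here with sample text; in practice capture the live "sainfo lsservicenodes" output):

```shell
# Simulated "sainfo lsservicenodes" output; on the system you would capture
# the real output, e.g. output=$(sainfo lsservicenodes).
output='01-2 00000XXXXXXXXXXX Flash840_Sec 2 node2 local Service 565 Disk full: /dumps
01-1 00000XXXXXXXXXXX Flash840_Sec 3 node1 partner Active'

# Flag the condition this article describes: a local node in service state
# because /dumps is full.
if printf '%s\n' "$output" | grep -q '565 Disk full: /dumps'; then
  echo "affected node found: /dumps is full"
fi
```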

3) Run the command "sainfo lsfiles -prefix /dumps" to list the current content of the /dumps directory. (You should see many snap.* files.)

> sainfo lsfiles -prefix /dumps
:
snap.13XXXXX-2.150512.182453.tgz
snap.13XXXXX-2.150519.094056.tgz
snap.13XXXXX-2.150622.091408.tgz
snap.13XXXXX-2.160211.154117.tgz
snap.13XXXXX-2.160407.123056.tgz
snap.13XXXXX-2.160420.114250.tgz
snap.13XXXXX-2.160502.112306.tgz
snap.13XXXXX-2.160601.121321.tgz
snap.13XXXXX-2.160801.140249.tgz
snap.13XXXXX-2.160816.095037.tgz
snap.13XXXXX-2.160831.122534.tgz
snap.13XXXXX-2.170124.171429.tgz
snap.13XXXXX-2.170124.172908.tgz
snap.13XXXXX-2.170125.152054.tgz
snap.13XXXXX-2.170201.161036.tgz
snap.13XXXXX-2.170215.160706.tgz
snap.13XXXXX-2.170216.090743
snap.13XXXXX-2.170302.094907.tgz
snap.13XXXXX-2.170307.160534.log
snap.cmds.log
snap.ietd.13XXXXX-2.170307.160534.tar
snap.single.13XXXXX-2.170214.141531.tgz
svc.config.backup.bak_13XXXXX-2
svc.config.backup.log_13XXXXX-2
svc.config.backup.sh_13XXXXX-2
svc.config.backup.xml_13XXXXX-2
svc.config.cron.bak_13XXXXX-1
svc.config.cron.bak_13XXXXX-2
svc.config.cron.log_13XXXXX-2
svc.config.cron.sh_13XXXXX-2
svc.config.cron.xml_13XXXXX-2
:
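The snap file names embed a capture date, which makes it easy to judge how old the accumulated snaps are. A sketch, using two file names taken from the listing above (the name layout snap.&lt;serial&gt;-&lt;node&gt;.&lt;YYMMDD&gt;.&lt;HHMMSS&gt;.tgz is inferred from that listing):

```shell
# The third dot-separated field of a snap file name is the capture date
# in YYMMDD form, e.g. 150512 = 12 May 2015.
for f in snap.13XXXXX-2.150512.182453.tgz snap.13XXXXX-2.170302.094907.tgz; do
  d=$(echo "$f" | cut -d. -f3)
  echo "$f -> captured $d"
done
```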

4) Run the command "satask snap -clean" to delete old snaps. (NOTE: this is an undocumented command.)

> satask snap -clean
>

5) Verify with the command "sainfo lsfiles -prefix /dumps" that the old snaps were deleted.

> sainfo lsfiles -prefix /dumps
:
svc.config.backup.bak_13XXXXX-2
svc.config.backup.log_13XXXXX-2
svc.config.backup.sh_13XXXXX-2
svc.config.backup.xml_13XXXXX-2
svc.config.cron.bak_13XXXXX-1
svc.config.cron.bak_13XXXXX-2
svc.config.cron.log_13XXXXX-2
svc.config.cron.sh_13XXXXX-2
svc.config.cron.xml_13XXXXX-2
:

6) Reboot the local node with "satask stopnode -reboot".

> satask stopnode -reboot
>

Your CLI session will stop.

7) Wait about 10 minutes and reconnect the CLI session.
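If you prefer to poll rather than watch the clock, a generic retry helper can wait for the node to answer again. A sketch; the `retry 3 true` call is only a placeholder probe, and in this scenario the probe would be something like `nc -z -w 5 <service_ip> 22` to test the SSH port:

```shell
# Generic retry helper: run a probe command until it succeeds or the
# attempt budget is exhausted.
retry() {
  attempts=$1; shift
  i=1
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep 1
  done
}

# Placeholder probe; replace "true" with a real reachability check.
retry 3 true && echo "probe succeeded"
```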

8) Verify with "sainfo lsservicenodes" that both canisters are active.

> sainfo lsservicenodes
panel_name cluster_id       cluster_name node_id node_name relation node_status error_data
01-1       00000XXXXXXXXXXX Flash840_Sec 3       node1     local    Active
01-2       00000XXXXXXXXXXX Flash840_Sec 2       node2     partner  Active


The 1039 event should also be fixed automatically and no longer be visible.
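The post-reboot verification can be scripted as well. A minimal sketch with simulated output (in practice capture the live "sainfo lsservicenodes" output), which confirms that no node is left in service state:

```shell
# Simulated post-reboot "sainfo lsservicenodes" output; on the system you
# would capture the real output instead.
output='01-1 00000XXXXXXXXXXX Flash840_Sec 3 node1 local Active
01-2 00000XXXXXXXXXXX Flash840_Sec 2 node2 partner Active'

# Confirm no node remains in service state and no 565 error data remains.
if printf '%s\n' "$output" | grep -qE 'Service|565'; then
  echo "a node is still in service state"
else
  echo "both canisters active"   # expected after a successful recovery
fi
```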

9) After the node is back online, use the GUI or the "lsdumps" command to check whether there are more files in /dumps and /home/admin/upgrade that can be deleted to free up additional space.

To look for data, run:
> lsdumps -prefix /dumps node1
> lsdumps -prefix /dumps node2

> lsdumps -prefix /home/admin/upgrade node1
> lsdumps -prefix /home/admin/upgrade node2

To clear it, run:
> cleardumps -prefix /dumps node1
> cleardumps -prefix /dumps node2

> cleardumps -prefix /home/admin/upgrade node1
> cleardumps -prefix /home/admin/upgrade node2

-- End of procedure --


Document Information

Modified date:
17 February 2023

UID

ssg1S1010087