IBM Support

Node error 565 "Node internal disk is failing" due to /dumps full

Troubleshooting


Problem

A canister reported "Error ID (Event ID) 074565 : Node internal disk is failing" and "Error Code: 1039 : Canister failure, canister replacement required" because the internal directory /dumps exceeded its size limit. The canister goes into service state, so "cleardumps" cannot be used to free up the space. Instead, you need to run "satask snap -clean" to clean the /dumps directory and then bring the canister out of service state.

Cause

There is a defect whereby old files in /dumps are not deleted automatically, so the directory fills up over time.

Diagnosing The Problem

When this problem occurs, the following error log entry is opened:


Error Log Entry
 Node Identifier       : node2
 Object Type           : node
 Object ID             : 2

   :
 Error ID              : 074565 : Node internal disk is failing
 Error Code            : 1039 : Canister failure, canister replacement required
 

The command "sainfo lsservicenodes" shows the affected node in service state and shows that the size of /dumps is causing the problem:


> sainfo lsservicenodes
panel_name cluster_id       cluster_name node_id node_name relation node_status error_data
01-2       00000XXXXXXXXXXX Flash840_Sec 2       node2     local    Service     565 Disk full: /dumps
01-1       00000XXXXXXXXXXX Flash840_Sec 3       node1     partner  Active

Resolving The Problem

If you hit the above scenario, do not replace the canister, even though error code 1039 asks you to.
Instead, clean the /dumps directory and bring the node out of service. Normally the CLI command "cleardumps -prefix /dumps" is available for this purpose, but it does not work on a node that is in service state. Therefore, proceed as follows:

1) Connect via CLI to the service IP of the affected node.
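The service assistant CLI is typically reached over SSH as the superuser account. A minimal sketch, assuming a placeholder service IP (substitute the service IP of the affected canister in your environment):

```shell
# Placeholder service IP; replace with the service IP of the affected node.
SERVICE_IP=192.168.70.121

# The service assistant CLI is reached over SSH as the superuser account.
CONNECT_CMD="ssh superuser@${SERVICE_IP}"
echo "Run: ${CONNECT_CMD}"
# ${CONNECT_CMD}   # uncomment to actually open the session
```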

2) Run "sainfo lsservicenodes" to confirm that you are connected to the correct canister. The relation should be "local" for the node that is in service state. You can also confirm the error "565 Disk full: /dumps". Further, check that the partner node is active.

> sainfo lsservicenodes
panel_name cluster_id       cluster_name node_id node_name relation node_status error_data
01-2       00000XXXXXXXXXXX Flash840_Sec 2       node2     local    Service     565 Disk full: /dumps
01-1       00000XXXXXXXXXXX Flash840_Sec 3       node1     partner  Active
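If you want to script this check, a minimal sketch that looks for the error 565 signature in the command output (the output is simulated here with sample text; in practice capture the live "sainfo lsservicenodes" output):

```shell
# Simulated "sainfo lsservicenodes" output; on the system you would capture
# the real output, e.g. output=$(sainfo lsservicenodes).
output='01-2 00000XXXXXXXXXXX Flash840_Sec 2 node2 local Service 565 Disk full: /dumps
01-1 00000XXXXXXXXXXX Flash840_Sec 3 node1 partner Active'

# Flag the condition this article describes: a local node in service state
# because /dumps is full.
if printf '%s\n' "$output" | grep -q '565 Disk full: /dumps'; then
  echo "affected node found: /dumps is full"
fi
```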

3) Run the command "sainfo lsfiles -prefix /dumps" to list the current content of the /dumps directory. (You should see many snap.* files.)

> sainfo lsfiles -prefix /dumps
:
snap.13XXXXX-2.150512.182453.tgz
snap.13XXXXX-2.150519.094056.tgz
snap.13XXXXX-2.150622.091408.tgz
snap.13XXXXX-2.160211.154117.tgz
snap.13XXXXX-2.160407.123056.tgz
snap.13XXXXX-2.160420.114250.tgz
snap.13XXXXX-2.160502.112306.tgz
snap.13XXXXX-2.160601.121321.tgz
snap.13XXXXX-2.160801.140249.tgz
snap.13XXXXX-2.160816.095037.tgz
snap.13XXXXX-2.160831.122534.tgz
snap.13XXXXX-2.170124.171429.tgz
snap.13XXXXX-2.170124.172908.tgz
snap.13XXXXX-2.170125.152054.tgz
snap.13XXXXX-2.170201.161036.tgz
snap.13XXXXX-2.170215.160706.tgz
snap.13XXXXX-2.170216.090743
snap.13XXXXX-2.170302.094907.tgz
snap.13XXXXX-2.170307.160534.log
snap.cmds.log
snap.ietd.13XXXXX-2.170307.160534.tar
snap.single.13XXXXX-2.170214.141531.tgz
svc.config.backup.bak_13XXXXX-2
svc.config.backup.log_13XXXXX-2
svc.config.backup.sh_13XXXXX-2
svc.config.backup.xml_13XXXXX-2
svc.config.cron.bak_13XXXXX-1
svc.config.cron.bak_13XXXXX-2
svc.config.cron.log_13XXXXX-2
svc.config.cron.sh_13XXXXX-2
svc.config.cron.xml_13XXXXX-2
:
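The snap file names embed a capture date, which makes it easy to judge how old the accumulated snaps are. A sketch, using two file names taken from the listing above (the name layout snap.&lt;serial&gt;-&lt;node&gt;.&lt;YYMMDD&gt;.&lt;HHMMSS&gt;.tgz is inferred from that listing):

```shell
# The third dot-separated field of a snap file name is the capture date
# in YYMMDD form, e.g. 150512 = 12 May 2015.
for f in snap.13XXXXX-2.150512.182453.tgz snap.13XXXXX-2.170302.094907.tgz; do
  d=$(echo "$f" | cut -d. -f3)
  echo "$f -> captured $d"
done
```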

4) Run the command "satask snap -clean" to delete old snaps. (NOTE: this is an undocumented command.)

> satask snap -clean
>

5) Verify with the command "sainfo lsfiles -prefix /dumps" that the old snaps were deleted.

> sainfo lsfiles -prefix /dumps
:
svc.config.backup.bak_13XXXXX-2
svc.config.backup.log_13XXXXX-2
svc.config.backup.sh_13XXXXX-2
svc.config.backup.xml_13XXXXX-2
svc.config.cron.bak_13XXXXX-1
svc.config.cron.bak_13XXXXX-2
svc.config.cron.log_13XXXXX-2
svc.config.cron.sh_13XXXXX-2
svc.config.cron.xml_13XXXXX-2
:

6) Reboot the local node with "satask stopnode -reboot".

> satask stopnode -reboot
>

Your CLI session will stop.

7) Wait about 10 minutes and reconnect the CLI session.
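If you prefer to poll rather than watch the clock, a generic retry helper can wait for the node to answer again. A sketch; the `retry 3 true` call is only a placeholder probe, and in this scenario the probe would be something like `nc -z -w 5 <service_ip> 22` to test the SSH port:

```shell
# Generic retry helper: run a probe command until it succeeds or the
# attempt budget is exhausted.
retry() {
  attempts=$1; shift
  i=1
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep 1
  done
}

# Placeholder probe; replace "true" with a real reachability check.
retry 3 true && echo "probe succeeded"
```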

8) Verify with "sainfo lsservicenodes" that both canisters are active.

> sainfo lsservicenodes
panel_name cluster_id       cluster_name node_id node_name relation node_status error_data
01-1       00000XXXXXXXXXXX Flash840_Sec 3       node1     local    Active
01-2       00000XXXXXXXXXXX Flash840_Sec 2       node2     partner  Active


The 1039 event should also be fixed automatically and no longer be visible.
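The post-reboot verification can be scripted as well. A minimal sketch with simulated output (in practice capture the live "sainfo lsservicenodes" output), which confirms that no node is left in service state:

```shell
# Simulated post-reboot "sainfo lsservicenodes" output; on the system you
# would capture the real output instead.
output='01-1 00000XXXXXXXXXXX Flash840_Sec 3 node1 local Active
01-2 00000XXXXXXXXXXX Flash840_Sec 2 node2 partner Active'

# Confirm no node remains in service state and no 565 error data remains.
if printf '%s\n' "$output" | grep -qE 'Service|565'; then
  echo "a node is still in service state"
else
  echo "both canisters active"   # expected after a successful recovery
fi
```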

9) After the node is back online, use the GUI or the "lsdumps" command to check whether there are more files in /dumps and /home/admin/upgrade that can be deleted to free up additional space.

To look for data, run:
> lsdumps -prefix /dumps node1
> lsdumps -prefix /dumps node2

> lsdumps -prefix /home/admin/upgrade node1
> lsdumps -prefix /home/admin/upgrade node2

To clear it, run:
> cleardumps -prefix /dumps node1
> cleardumps -prefix /dumps node2

> cleardumps -prefix /home/admin/upgrade node1
> cleardumps -prefix /home/admin/upgrade node2

-- End of procedure --


Document Information

Modified date:
17 February 2023

UID

ssg1S1010087