Troubleshooting
Problem
A canister logged "Error ID (Event ID) 074565 : Node internal disk is failing" and "Error Code: 1039 : Canister failure, canister replacement required" because the internal directory "/dumps" exceeded its size limit. The canister goes into service state, so "cleardumps" cannot be used to free up the space. Instead, you need to run "satask snap -clean" to clean the /dumps directory and then bring the canister out of service state.
Cause
There is a problem where old files in /dumps are not deleted automatically.
Diagnosing The Problem
When you hit this problem, the following error log entry is open:
Error Log Entry
Node Identifier : node2
Object Type : node
Object ID : 2
:
Error ID : 74565 : Node internal disk is failing
Error Code : 1039 : Canister failure, canister replacement required
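If the clustered CLI is still reachable through the active partner node, the same alert can also be listed there. This is a minimal example, assuming the standard "lseventlog" command is available at your code level (the exact filter options can vary):
> lseventlog -alert yes
Look for the unfixed entry with error code 1039 against the affected node.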
The command "sainfo lsservicenodes" will show the affected node in Service state and show that the size of /dumps is causing the problem:
> sainfo lsservicenodes
panel_name cluster_id cluster_name node_id node_name relation node_status error_data
01-2 00000XXXXXXXXXXX Flash840_Sec 2 node2 local Service 565 Disk full: /dumps
01-1 00000XXXXXXXXXXX Flash840_Sec 3 node1 partner Active
Resolving The Problem
If you hit the above scenario, do not replace the canister, even though error code 1039 asks for that.
Instead, clean the /dumps directory and bring the node out of service state. In general, the CLI command "cleardumps -prefix /dumps" is available for this purpose, but it does not work on a node that is in service state. Therefore, proceed as follows:
1) Connect via CLI to the service IP of the affected node.
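For example, open an SSH session to the service IP address of the node in service state. The IP address below is a placeholder, and "superuser" is assumed to be the service login on your system:
> ssh superuser@<service_IP_of_node2>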
2) Run "sainfo lsservicenodes" to confirm that you are connected to the correct canister. The relation should be local for that node which is in service state. You can also confirm the "error 565 Disk full: /dumps" . Further check that the partner node is active.
> sainfo lsservicenodes
panel_name cluster_id cluster_name node_id node_name relation node_status error_data
01-2 00000XXXXXXXXXXX Flash840_Sec 2 node2 local Service 565 Disk full: /dumps
01-1 00000XXXXXXXXXXX Flash840_Sec 3 node1 partner Active
3) Run the command "sainfo lsfiles -prefix /dumps" to list the current content of the /dumps directory. (You should see many snap.* files.)
> sainfo lsfiles -prefix /dumps
:
snap.13XXXXX-2.150512.182453.tgz
snap.13XXXXX-2.150519.094056.tgz
snap.13XXXXX-2.150622.091408.tgz
snap.13XXXXX-2.160211.154117.tgz
snap.13XXXXX-2.160407.123056.tgz
snap.13XXXXX-2.160420.114250.tgz
snap.13XXXXX-2.160502.112306.tgz
snap.13XXXXX-2.160601.121321.tgz
snap.13XXXXX-2.160801.140249.tgz
snap.13XXXXX-2.160816.095037.tgz
snap.13XXXXX-2.160831.122534.tgz
snap.13XXXXX-2.170124.171429.tgz
snap.13XXXXX-2.170124.172908.tgz
snap.13XXXXX-2.170125.152054.tgz
snap.13XXXXX-2.170201.161036.tgz
snap.13XXXXX-2.170215.160706.tgz
snap.13XXXXX-2.170216.090743
snap.13XXXXX-2.170302.094907.tgz
snap.13XXXXX-2.170307.160534.log
snap.cmds.log
snap.ietd.13XXXXX-2.170307.160534.tar
snap.single.13XXXXX-2.170214.141531.tgz
svc.config.backup.bak_13XXXXX-2
svc.config.backup.log_13XXXXX-2
svc.config.backup.sh_13XXXXX-2
svc.config.backup.xml_13XXXXX6-2
svc.config.cron.bak_13XXXXX-1
svc.config.cron.bak_13XXXXX-2
svc.config.cron.log_13XXXXX6-2
svc.config.cron.sh_13XXXXX-2
svc.config.cron.xml_13XXXXX-2
:
4) Run the command "satask snap -clean" to delete the old snaps. (NOTE: the -clean option is undocumented.) Be careful to include the -clean option; "satask snap" without it collects a new snap and consumes additional space in /dumps.
> satask snap -clean
>
5) Verify with the command "sainfo lsfiles -prefix /dumps" that the old snaps were deleted.
> sainfo lsfiles -prefix /dumps
:
svc.config.backup.bak_13XXXXX-2
svc.config.backup.log_13XXXXX-2
svc.config.backup.sh_13XXXXX-2
svc.config.backup.xml_13XXXXX-2
svc.config.cron.bak_13XXXXX-1
svc.config.cron.bak_13XXXXX6-2
svc.config.cron.log_13XXXXX-2
svc.config.cron.sh_13XXXXX-2
svc.config.cron.xml_13XXXXX-2
:
6) Reboot the local node with "satask stopnode -reboot".
> satask stopnode -reboot
>
Your CLI session will stop.
7) Wait about 10 minutes and reconnect the CLI session.
8) Verify with "sainfo lsservicenodes" that both canisters are active.
> sainfo lsservicenodes
panel_name cluster_id cluster_name node_id node_name relation node_status error_data
01-1 00000XXXXXXXXXXX Flash840_Sec 3 node1 local Active
01-2 00000XXXXXXXXXXX Flash840_Sec 2 node2 partner Active
The 1039 event should also be marked as fixed automatically and should no longer be visible.
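If you want to double-check this from the clustered CLI, the alert list can be queried again; as above, this assumes the "lseventlog" command is available at your code level:
> lseventlog -alert yes
The 1039 entry should no longer appear as an unfixed alert.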
9) After the node is back online, the GUI or the lsdumps command can be used to see if there are any more files in /dumps and /home/admin/upgrade that can be erased to free up more space.
To look for data, run:
> lsdumps -prefix /dumps node1
> lsdumps -prefix /dumps node2
> lsdumps -prefix /home/admin/upgrade node1
> lsdumps -prefix /home/admin/upgrade node2
To clear it, run:
> cleardumps -prefix /dumps node1
> cleardumps -prefix /dumps node2
> cleardumps -prefix /home/admin/upgrade node1
> cleardumps -prefix /home/admin/upgrade node2
-- End of procedure --
Document Information
Modified date:
17 February 2023
UID
ssg1S1010087