IT29264: ZFS COMMANDS HANGING AND TIMING OUT CAUSING BACKUPS TO FAIL


APAR status

  • Closed as program error.

Error description

  • ZFS can hang even when there are no underlying hardware or
    storage problems, with the following symptoms:

    In vsnap.log, messages of the following type are seen:
    [<timestamp>] ERROR pid-2468 vsnap.linux.system Timed out (300 seconds) waiting for command to complete: sudo -n zfs list -Hp vpool1
    [<timestamp>]  INFO pid-2468 vsnap.linux.system Executing command: sudo -n kill -s 15 6731 6737
    [<timestamp>] ERROR pid-2468 vsnap.api Traceback (most recent call last):
      File "/src/workspace/vsnap/zfs.py", line 548, in get_pools
      File "/src/workspace/vsnap/linux/system.py", line 282, in run_shell_command
    vsnap.common.errors.CommandTimeoutError: Command timed out: zfs list -Hp vpool1

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/opt/vsnap/venv/lib64/python3.4/site-packages/flask/app.py", line 1813, in full_dispatch_request
        rv = self.dispatch_request()
      File "/opt/vsnap/venv/lib64/python3.4/site-packages/flask/app.py", line 1799, in dispatch_request
        return self.view_functions[rule.endpoint](**req.view_args)
      File "/src/workspace/vsnap/api/rest.py", line 82, in decorated
      File "/src/workspace/vsnap/api/rest.py", line 323, in api_pool_get_all
      File "/src/workspace/vsnap/core.py", line 792, in pool_get
      File "/src/workspace/vsnap/zfs.py", line 559, in get_pools
    vsnap.common.errors.PoolInfoError: Failed to collect pool information
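
    A quick way to confirm the symptom outside of vSnap (a
    diagnostic sketch, not part of the APAR) is to run the same
    command manually with the coreutils timeout utility, mimicking
    vsnap's 300-second limit; the pool name vpool1 is taken from
    the log above:

       # Exit code 124 means the command timed out, matching the vsnap symptom
       [serveradmin@<SPPHost> ~]$ timeout 300 sudo -n zfs list -Hp vpool1; echo "rc=$?"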
    
    And the system messages file on the vSnap host shows kernel
    hung-task stack traces:
    
    <timestamp> drcscivsnap01 kernel: INFO: task zfs:6737 blocked for more than 1800 seconds.
    <timestamp> drcscivsnap01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    <timestamp> drcscivsnap01 kernel: zfs D ffff97d8a68daf70 0 6737 6731 0x00000084
    <timestamp> drcscivsnap01 kernel: Call Trace:
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf1c05fa>] ? __do_fault.isra.58+0x8a/0x100
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf713e69>] schedule_preempt_disabled+0x29/0x70
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf711c27>] __mutex_lock_slowpath+0xc7/0x1d0
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf71100f>] mutex_lock+0x1f/0x2f
    <timestamp> drcscivsnap01 kernel: [<ffffffffc06d5cc1>] spa_open_common+0x61/0x4d0 [zfs]
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf1f59a0>] ? __kmalloc_node+0x280/0x2b0
    <timestamp> drcscivsnap01 kernel: [<ffffffffc05be0fd>] ? entity_namecheck+0xfd/0x260 [zcommon]
    <timestamp> drcscivsnap01 kernel: [<ffffffffc06d6153>] spa_open+0x13/0x20 [zfs]
    <timestamp> drcscivsnap01 kernel: [<ffffffffc0717032>] pool_status_check.part.37+0x32/0xe0 [zfs]
    <timestamp> drcscivsnap01 kernel: [<ffffffffc07174b0>] zfsdev_ioctl+0x3d0/0x660 [zfs]
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf22fb90>] do_vfs_ioctl+0x350/0x560
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf2d82bf>] ? file_has_perm+0x9f/0xb0
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf22fe41>] SyS_ioctl+0xa1/0xc0
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf71f7d5>] system_call_fastpath+0x1c/0x21
    <timestamp> drcscivsnap01 kernel: INFO: task zpool:7028 blocked for more than 1800 seconds.
    <timestamp> drcscivsnap01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    <timestamp> drcscivsnap01 kernel: zpool D ffff97d5f64f5ee0 0 7028 7026 0x00000084
    <timestamp> drcscivsnap01 kernel: Call Trace:
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf713e69>] schedule_preempt_disabled+0x29/0x70
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf711c27>] __mutex_lock_slowpath+0xc7/0x1d0
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf71100f>] mutex_lock+0x1f/0x2f
    <timestamp> drcscivsnap01 kernel: [<ffffffffc06da980>] spa_all_configs+0x40/0x120 [zfs]
    <timestamp> drcscivsnap01 kernel: [<ffffffffc0716d9b>] zfs_ioc_pool_configs+0x1b/0x70 [zfs]
    <timestamp> drcscivsnap01 kernel: [<ffffffffc07176fb>] zfsdev_ioctl+0x61b/0x660 [zfs]
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf22fb90>] do_vfs_ioctl+0x350/0x560
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf2d82bf>] ? file_has_perm+0x9f/0xb0
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf22fe41>] SyS_ioctl+0xa1/0xc0
    <timestamp> drcscivsnap01 kernel: [<ffffffffaf71f7d5>] system_call_fastpath+0x1c/0x21
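
    The D in the task lines above is the uninterruptible-sleep
    state. As an additional diagnostic sketch (not from the APAR
    text), processes stuck this way can be listed with ps; WCHAN
    shows the kernel function each process is blocked in:

       # List zfs/zpool processes in uninterruptible sleep (state D)
       [serveradmin@<SPPHost> ~]$ ps -eo pid,stat,wchan:30,comm | awk '$2 ~ /^D/ && ($4 == "zfs" || $4 == "zpool")'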
    
    The cause is a defect in the ZFS Event Daemon (ZED), a service
    that monitors pool events. When the defect is encountered, the
    syslog is flooded with messages like:
    
    <timestamp> drcscivsnap01 zed: eid=13774249 class=config_sync
    pool_guid=0x31D3D1D0EABC1E74
    <timestamp> drcscivsnap01 zed: eid=13774250 class=config_sync
    pool_guid=0x31D3D1D0EABC1E74
    <timestamp> drcscivsnap01 zed: eid=13774251 class=config_sync
    pool_guid=0x31D3D1D0EABC1E74
    <timestamp> drcscivsnap01 zed: eid=13774252 class=config_sync
    pool_guid=0x31D3D1D0EABC1E74
    
    As a result of this flood, ZED can crash and produce core
    files, and/or ZFS can hang.
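
    To gauge how severe the flood is, the config_sync events can be
    counted directly in the syslog (a diagnostic sketch; the log
    path /var/log/messages is an assumption based on RHEL/CentOS
    defaults):

       # Count zed config_sync events recorded in the system log
       [serveradmin@<SPPHost> ~]$ sudo grep -c 'zed:.*class=config_sync' /var/log/messages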
    
    IBM Spectrum Protect Plus Versions Affected:
    IBM Spectrum Protect Plus 10.1.x
    
    
    Initial Impact: High
    
    Additional Keywords: SPP, SPPLUS, TS002249496, crash, zfs, hang
    

Local fix

  • While waiting for the fixing version to become available, ZED,
    which is not used by vSnap, can be disabled to avoid the
    problem as follows:

    1. Check whether the ZED service is enabled and running:
    
       [serveradmin@<SPPHost> ~]$ systemctl is-enabled zfs-zed
          enabled

       [serveradmin@<SPPHost> ~]$ systemctl status zfs-zed
          ● zfs-zed.service - ZFS Event Daemon (zed)
             Loaded: loaded (/usr/lib/systemd/system/zfs-zed.service; enabled; vendor preset: enabled)
             Active: active (running) since <timestamp>; 1h 14min ago
    
    2. Stop and disable it:
    
       [serveradmin@<SPPHost> ~]$ sudo systemctl stop zfs-zed

       [serveradmin@<SPPHost> ~]$ sudo systemctl disable zfs-zed
          Removed symlink /etc/systemd/system/zfs.target.wants/zfs-zed.service.
          Removed symlink /etc/systemd/system/zed.service.

       [serveradmin@<SPPHost> ~]$ systemctl is-enabled zfs-zed
          disabled
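
    3. Optionally, verify that no ZED process is left running (an
       extra sanity check beyond the original steps):

       # pgrep -x matches the process name exactly; a nonzero exit means none found
       [serveradmin@<SPPHost> ~]$ pgrep -lx zed || echo "no zed process running"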
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * IBM Spectrum Protect Plus version 10.1.3                     *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See Error Description                                        *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply fixing level when available. This problem is currently *
    * projected to be fixed in IBM Spectrum Protect Plus level     *
    * 10.1.3 patch 1 and 10.1.4. Note that this is subject to      *
    * change at the discretion of IBM.                             *
    ****************************************************************
    

Problem conclusion

  • The zfs-zed service that is packaged within vSnap can trigger
    hangs and crashes due to internal bugs. vSnap does not rely on
    this service, but it was enabled and running by default, which
    affected system stability. The problem has been resolved by
    stopping and disabling the zfs-zed service so that it does not
    interfere with vSnap functionality.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT29264

  • Reported component name

    SP PLUS

  • Reported component ID

    5737SPLUS

  • Reported release

    A13

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2019-05-24

  • Closed date

    2019-06-11

  • Last modified date

    2021-05-10

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SP PLUS

  • Fixed component ID

    5737SPLUS

Applicable component levels

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSNQFQ","label":"IBM Spectrum Protect Plus"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"A13","Line of Business":{"code":"LOB26","label":"Storage"}}]
