IBM Support

IT31841: VSNAP SHOWS 'READY' OR 'OFFLINE' STATUS AND BACKUP JOBS HANG

APAR status

  • Closed as program error.

Error description

  • The vSnap server can experience I/O hangs due to deadlocks in
    the kernel and filesystem modules.
    
    The problem can be seen initially with one or more of the
    following symptoms:
    
    - Under the Disk screen of the SPP UI, the vSnap shows 'Ready'
    or 'Offline' status
    - VM backup operations appear to hang/stall for hours
    - VMware backups fail while trying to create or update NFS
    shares with errors indicating timeout of share commands on the
    vSnap server
    - SQL backups fail while trying to map LUNs to the application
    server with errors indicating timeout of 'vsnap_targetcli'
    commands
    - SQL backups fail with errors seen in the job log: "The system
    cannot find the file specified, The device is not ready."
    
    The root cause of all these symptoms is that I/O operations on
    the vSnap server are hanging.
    
    The problem can be confirmed by inspecting the vSnap server.
    Running "ps aux | grep D" on the vSnap server shows many zfs
    processes in "D" state (uninterruptible sleep, typically
    blocked waiting for I/O). The system log (/var/log/messages)
    on the vSnap server also shows hung-process errors with call
    traces similar to the following:
    
    kernel: Call Trace:
    kernel: ? __schedule+0x2ab/0x880
    kernel: schedule+0x32/0x80
    kernel: schedule_preempt_disabled+0xa/0x10
    kernel: __mutex_lock.isra.11+0x21b/0x4e0
    kernel: ? cityhash4+0x78/0xa0 [zfs]
    kernel: dbuf_find+0xb8/0x190 [zfs]
    kernel: dbuf_hold_impl+0x62/0x590 [zfs]
    kernel: dbuf_hold_level+0x33/0x60 [zfs]
    kernel: dmu_tx_check_ioerr+0x32/0xc0 [zfs]
    kernel: dmu_tx_count_write+0xdd/0x190 [zfs]
    kernel: dmu_tx_hold_write_by_dnode+0x35/0x50 [zfs]
    kernel: zfs_write+0x516/0xcd0 [zfs]
    kernel: zpl_write_common_iovec+0xa9/0x120 [zfs]
    kernel: zpl_iter_write_common+0x98/0xc0 [zfs]
    kernel: zpl_iter_write+0x3f/0x70 [zfs]
    
    OR
    
    kernel: z_wr_iss        D    0 18751      2 0x80000080
    kernel: Call Trace:
    kernel: ? __schedule+0x2ab/0x880
    kernel: schedule+0x32/0x80
    kernel: schedule_preempt_disabled+0xa/0x10
    kernel: __mutex_lock.isra.11+0x21b/0x4e0
    kernel: ? cityhash4+0x78/0xa0 [zfs]
    kernel: dbuf_find+0x5a/0x190 [zfs]
    kernel: dbuf_hold_impl+0x62/0x590 [zfs]
    kernel: dbuf_hold_level+0x33/0x60 [zfs]
    kernel: dmu_buf_hold_noread+0x7c/0x100 [zfs]
    kernel: dmu_buf_hold+0x37/0x80 [zfs]
    kernel: zap_lockdir+0x4e/0xc0 [zfs]
    kernel: ? _cond_resched+0x15/0x30
    kernel: ? __kmalloc_node+0x209/0x270
    kernel: zap_length_uint64+0x51/0x100 [zfs]
    kernel: ddt_zap_lookup+0x62/0xe0 [zfs]
    kernel: ? spl_kmem_cache_alloc+0x91/0x110 [spl]
    kernel: ddt_lookup+0xce/0x1a0 [zfs]
    kernel: ? abd_checksum_SHA256+0x5e/0xb0 [zfs]
    kernel: ? zio_checksum_compute+0x24c/0x3b0 [zfs]
    kernel: zio_ddt_write+0x7a/0x530 [zfs]
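
    The two checks described above can be scripted. The following
    is a minimal sketch, not an IBM-supplied tool. It parses the
    text output of "ps aux" and keeps only processes whose STAT
    column begins with "D", which is stricter than a plain
    "grep D" (that also matches any line that merely contains the
    letter D). It also extracts the [zfs] frames from kernel
    call-trace lines. The function names d_state_processes and
    zfs_frames are hypothetical, and the standard procps "ps aux"
    column layout is assumed.

```python
import re


def d_state_processes(ps_output):
    """Return (pid, command) pairs for processes whose STAT starts with 'D'."""
    hung = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # Assumed layout: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)  # COMMAND (11th column) may contain spaces
        if len(fields) < 11:
            continue
        pid, stat, command = fields[1], fields[7], fields[10]
        if stat.startswith("D"):
            hung.append((int(pid), command))
    return hung


# Matches frames such as "kernel: dbuf_find+0xb8/0x190 [zfs]",
# including the "? " prefix the kernel adds to unreliable frames.
ZFS_FRAME = re.compile(
    r"kernel:\s+(?:\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+\s+\[zfs\]"
)


def zfs_frames(log_text):
    """Return the function names of [zfs] frames found in call-trace lines."""
    return ZFS_FRAME.findall(log_text)
```

    On the vSnap server, the first function could be fed the
    captured output of "ps aux" and the second the contents of
    /var/log/messages. Many processes in "D" state together with
    [zfs] frames such as dbuf_find in the traces would be
    consistent with the symptoms described here.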
    
    IBM Spectrum Protect Plus Versions Affected:
    IBM Spectrum Protect Plus 10.1.x
    
    Initial Impact: Medium
    
    Additional Keywords: SPP, SPPlus, TS003424236
    

Local fix

  • Reboot the vSnap to clear the hangs and run the backups again.
    To avoid triggering the hangs, try to reduce the amount of
    concurrent I/O on the vSnap server. This can be achieved using
    one or more of the following techniques:
    
    - Modify schedules of overlapping backup/replication jobs to
    prevent too many jobs from running at the same time
    - Modify the schedule of the Maintenance job to make sure it
    runs during off-hours when other backup/replication jobs do not
    usually run
    - Modify VADP Proxy configuration by reducing the number of
    proxies or reducing the proxy softcap limit setting to prevent
    too many concurrent backup streams from running at the same time
    - Modify the Concurrent Backup setting on the Advanced Options
    page under the Disk management UI. Change the setting from
    "Unlimited" to "Limit" and set a limit value of 15 to throttle
    the number of concurrent backup streams written to the vSnap
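
    To judge whether job schedules overlap enough to matter, it
    can help to compute the peak number of jobs that run at the
    same time. A minimal sketch follows. The job windows are
    made-up illustration data, and peak_concurrency is a
    hypothetical helper, not part of IBM Spectrum Protect Plus.

```python
def peak_concurrency(windows):
    """Return the maximum number of (start, end) windows active at one time.

    Times can be any comparable values (here, minutes after midnight).
    A window that ends exactly when another starts does not count as overlap.
    """
    events = []
    for start, end in windows:
        events.append((start, 1))   # job begins
        events.append((end, -1))    # job finishes
    # Sort by time; at equal times an end (-1) sorts before a start (+1).
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Hypothetical schedule: three backup jobs and one maintenance job.
schedule = [(60, 180), (120, 240), (150, 200), (240, 300)]
print(peak_concurrency(schedule))
```

    If the peak is higher than the vSnap handles comfortably,
    shifting the start time of one of the overlapping jobs and
    recomputing shows whether the change helps.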
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * IBM Spectrum Protect Plus levels 10.1.5 and 10.1.6.          *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See Error Description                                        *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply fixing level when available. This problem is currently *
    * projected to be fixed in IBM Spectrum Protect Plus level     *
    * 10.1.6.ifix3 and 10.1.7. Note that this is subject to change *
    * at the discretion of IBM.                                    *
    ****************************************************************
    

Problem conclusion

  • Multiple fixes were made in the filesystem modules to address
    hangs/deadlocks:
    
    - Fixed a race condition between destroying snapshots and
    regular I/O which caused hangs when maintenance operations
    overlapped with backups.
    - Fixed an issue with open NFS file descriptors not getting
    unlinked correctly.
    - Fixed a bug with filesystem commands deadlocking due to bad
    ordering of read/write locks.
    - Fixed a minor issue with filesystem memory cache contention
    in the shrinker algorithm, which the operating system calls to
    drop cached information from RAM and free up space. In some
    cases the algorithm could block I/O on the entire system for
    long periods of time.
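
    The lock-ordering item above refers to a classic deadlock
    pattern: two threads take the same pair of locks in opposite
    order, and each waits forever for the other. The sketch below
    is a generic illustration of the usual remedy, which is to
    acquire locks in one fixed global order. It is not the actual
    filesystem fix, it uses plain mutexes rather than read/write
    locks for simplicity, and all names in it are hypothetical.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
completed = []


def acquire_in_order(*locks):
    """Acquire locks sorted by id() so every thread uses the same order."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered


def worker(name, first, second):
    # Callers pass the locks in different orders; sorting removes the hazard.
    held = acquire_in_order(first, second)
    try:
        completed.append(name)
    finally:
        for lock in reversed(held):
            lock.release()


t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
t1.start()
t2.start()
t1.join()
t2.join()
print(sorted(completed))
```

    Without the sorting step, t1 could hold lock_a while t2 holds
    lock_b, and neither thread would ever proceed.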
    
    It was also found that the hangs are more likely to occur when
    the virtual vSnap server is under CPU or memory pressure at the
    hypervisor level. To alleviate this problem, configure the
    vSnap virtual machine so that sufficient CPU and memory are
    reserved for it.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT31841

  • Reported component name

    SP PLUS

  • Reported component ID

    5737SPLUS

  • Reported release

    A10

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2020-03-05

  • Closed date

    2020-08-31

  • Last modified date

    2020-08-31

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SP PLUS

  • Fixed component ID

    5737SPLUS

Applicable component levels


Document Information

Modified date:
30 January 2024