IBM Support

IJ37068: HOSTS CRASHING RANDOMLY AFTER UPGRADE FROM 5.1.1-3 TO 5.1.2.1

APAR status

  • Closed as program error.

Error description

  • A node crashes at random with the following kernel oops:
    
    [282129.573157] BUG: unable to handle kernel NULL pointer
    dereference at 00000000000000c4
    [282129.581099] IP: [<ffffffffc3430ac8>]
    _Z9gpfsFsyncP13gpfsVfsData_tP9MMFSVInfoP9cxiNode_tiP10ext
    _cred_t+0x2f8/0x370 [mmfs26]
    [282129.592252] PGD 800000210d04e067 PUD 2cf5d99067 PMD 0
    [282129.597536] Oops: 0000 [#1] SMP
    [282129.600890] Modules linked in: stap_netlog(OE)
    nfs_layout_nfsv41_files cts rpcsec_gss_krb5 nfsv4
    dns_resolver tcp_diag udp_diag
    inet_diag unix_diag af_packet_diag netlink_diag nfsv3
    nfs_acl nfs lockd grace fscache isofs loop mmfs26(OE)
    mmfslinux(OE) tracedev(OE)
    8021q garp mrp bridge stp llc proclog_dd7de3(OE)
    rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE)
    ib_umad(OE)
    dell_rbu nvidia_drm(POE) nvidia_modeset(POE)
    nvidia_uvm(OE) nvidia(POE) dcdbas skx_edac
    intel_powerclamp coretemp intel_rapl
    iosf_mbi kvm_intel kvm irqbypass crc32_pclmul
    ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper
    ablk_helper cryptd mei_me
    pcspkr sg i2c_i801 mei lpc_ich ipmi_si ipmi_devintf
    ipmi_msghandler acpi_power_meter acpi_pad sch_fq_codel
    binfmt_misc auth_rpcgss
    sunrpc ip_tables xfs dm_thin_pool dm_persistent_data
    [282129.673879]  dm_bio_prison dm_bufio libcrc32c
    mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sd_mod crc_t10dif
    crct10dif_generic mgag200
    i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
    sysimgblt fb_sys_fops mlx5_core(OE) ttm mlxfw(OE) ahci
    psample ptp pps_core drm
    libahci crct10dif_pclmul auxiliary(OE) nvme devlink
    crct10dif_common crc32c_intel libata megaraid_sas(OE)
    nvme_core mlx_compat(OE)
    drm_panel_orientation_quirks nfit libnvdimm dm_mirror
    dm_region_hash dm_log dm_mod fuse [last unloaded:
    stap_netlog]
    [282129.717874] CPU: 20 PID: 27804 Comm: python Kdump:
    loaded Tainted: P           OE  ------------
    3.10.0-1160.49.1.el7.x86_64 #1
    [282129.729501] Hardware name: Dell Inc. PowerEdge
    R640/0W23H8, BIOS 1.4.9 06/29/2018
    [282129.737057] task: ffff9fcddbbd5280 ti:
    ffff9fd2550e4000 task.ti: ffff9fd2550e4000
    [282129.744605] RIP: 0010:[<ffffffffc3430ac8>]
    [<ffffffffc3430ac8>]
    _Z9gpfsFsyncP13gpfsVfsData_tP9MMFSVInfoP9cxiNode_tiP10ext
    _cred_t+0x2f8/0x370 [mmfs26]
    [282129.758172] RSP: 0018:ffff9fd2550e7cf0  EFLAGS:
    00010286
    [282129.763558] RAX: 0000000000000000 RBX:
    0000000000000000 RCX: 0000000000000005
    [282129.770758] RDX: ffff9fcddbbd58f8 RSI:
    0000000000006c9c RDI: ffffffffc334bea8
    [282129.777961] RBP: ffff9fd2550e7db0 R08:
    0000000000000000 R09: 0000000000000005
    [282129.785164] R10: 0000000000000001 R11:
    0000000000000208 R12: ffffffffffffffff
    [282129.792367] R13: ffff9fd2550e7d28 R14:
    ffff9ffeec67af48 R15: 0000000000800000
    [282129.799567] FS:  00007fdbeba37740(0000)
    GS:ffff9fda7f480000(0000) knlGS:0000000000000000
    [282129.807721] CS:  0010 DS: 0000 ES: 0000 CR0:
    0000000080050033
    [282129.813541] CR2: 00000000000000c4 CR3:
    0000002022544000 CR4: 00000000007607e0
    [282129.820742] DR0: 0000000000000000 DR1:
    0000000000000000 DR2: 0000000000000000
    [282129.827943] DR3: 0000000000000000 DR6:
    00000000fffe0ff0 DR7: 0000000000000400
    [282129.835138] PKRU: 55555554
    [282129.837931] Call Trace:
    [282129.840475]  [<ffffffffc3337321>]
    fsyncInternal.constprop.120+0x101/0x210 [mmfslinux]
    [282129.848368]  [<ffffffffbaac6f4b>] ?
    wake_up_atomic_t+0x2b/0x30
    [282129.854279]  [<ffffffffc0c0897c>] ?
    nfs_file_fsync+0x9c/0x1b0 [nfs]
    [282129.860617]  [<ffffffffc333754b>]
    gpfs_f_flush+0xab/0xc0 [mmfslinux]
    [282129.867044]  [<ffffffffbac4ba77>]
    filp_close+0x37/0x90
    [282129.872257]  [<ffffffffbac6fa2c>]
    __close_fd+0x8c/0xb0
    [282129.877472]  [<ffffffffbac4d5a3>] SyS_close+0x23/0x50
    [282129.882600]  [<ffffffffbb195f92>]
    system_call_fastpath+0x25/0x2a
    [282129.909001] RIP  [<ffffffffc3430ac8>]
    _Z9gpfsFsyncP13gpfsVfsData_tP9MMFSVInfoP9cxiNode_tiP10ext
    _cred_t+0x2f8/0x370 [mmfs26]
    [282129.920223]  RSP <ffff9fd2550e7cf0>
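
The faulting-instruction line in the oops above is wrapped in the middle of the symbol name by the page layout. The following is a minimal sketch, not part of the APAR, of how such a wrapped frame could be rejoined and split into symbol, offset, function size and module. The names FRAME_RE and parse_frame are hypothetical, and joining the pieces without a separator is an assumption that only suits frames wrapped like the ones shown here.

```python
import re

# Matches "symbol+0xOFFSET/0xSIZE [module]" as printed in Linux oops frames.
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s*\[(?P<module>\w+)\])?"
)


def parse_frame(wrapped_text):
    """Rejoin a line-wrapped oops frame and return its parts, or None."""
    # Assumption: wrapping may split a symbol mid-name, so the stripped
    # pieces are concatenated with no separator.
    joined = "".join(line.strip() for line in wrapped_text.splitlines())
    match = FRAME_RE.search(joined)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "module": match.group("module"),
    }


# The "IP:" line copied from the oops, including its line breaks.
IP_LINE = """\
[282129.581099] IP: [<ffffffffc3430ac8>]
_Z9gpfsFsyncP13gpfsVfsData_tP9MMFSVInfoP9cxiNode_tiP10ext
_cred_t+0x2f8/0x370 [mmfs26]
"""

frame = parse_frame(IP_LINE)
```

For the frame above this gives offset 0x2f8 inside a 0x370-byte function in module mmfs26. The rejoined symbol is a mangled C++ name; passing it to a demangler such as c++filt should yield gpfsFsync(gpfsVfsData_t*, MMFSVInfo*, cxiNode_t*, int, ext_cred_t*).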
    

Local fix

Problem summary

  • The code path that flushes file data to disk did not
    properly check for a stale file system. A flush issued
    against a stale file system therefore dereferenced a
    NULL pointer and crashed the node.
    

Problem conclusion

  • This problem is fixed in 5.1.2 PTF 4.
    To see all Spectrum Scale APARs and their respective fix
    solutions, refer to:
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
    
    Benefits of the solution:
    Node does not crash in this scenario
    
    Work Around: N/A
    
    Problem trigger:
    A file descriptor is opened and kept open while the file
    system goes stale (for example, because the daemon is
    restarted). A subsequent request to flush the file's data
    to disk, either explicit or the implicit flush on close
    (flushOnClose), then triggers the crash.
    
    Symptom: Abend/Crash
    Platforms affected: ALL Linux OS environments
    
    Functional Area affected: All Scale Users
    
    Customer Impact: High Importance
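
The trigger sequence described above (open, keep open, file system goes stale, flush or close) can be sketched from user space as follows. This is an illustrative sketch only, not an IBM-provided reproducer: the helper names flush_then_close and demo are hypothetical, the step that makes the file system stale is left as a comment, and the APAR does not state which error, if any, a fixed level returns to the application, so the code simply records whatever errno is raised.

```python
import os
import tempfile


def flush_then_close(fd):
    """Explicitly fsync, then close the descriptor.

    Returns (fsync_errno, close_errno); None means that step succeeded.
    """
    fsync_err = None
    close_err = None
    try:
        os.fsync(fd)  # explicit flush request
    except OSError as exc:
        fsync_err = exc.errno
    try:
        os.close(fd)  # close() also flushes implicitly
    except OSError as exc:
        close_err = exc.errno
    return fsync_err, close_err


def demo(directory=None):
    # 'directory' is a hypothetical parameter: on a test system it would
    # point at a directory on the file system under test. By default the
    # system temporary directory is used.
    with tempfile.NamedTemporaryFile(dir=directory) as tmp:
        fd = os.open(tmp.name, os.O_WRONLY)
        os.write(fd, b"payload")
        # On a test system the file system would be made stale here
        # (for example by restarting the daemon) while fd stays open.
        return flush_then_close(fd)


if __name__ == "__main__":
    print(demo())
```

On a healthy file system both steps succeed and demo() returns (None, None). The call trace in the error description shows the implicit variant: close() reaches gpfs_f_flush and then gpfsFsync.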
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ37068

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    512

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-01-10

  • Closed date

    2022-03-22

  • Last modified date

    2022-03-22

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels

  • Business Unit: IBM Infrastructure w/TPS (BU058)

  • Product code: STXKQY

  • Platform: Platform Independent (PF025)

  • Version: 512

  • Line of Business: Storage (LOB26)

Document Information

Modified date:
23 March 2022