IBM Support

IBM Spectrum MPI 10.1.0 silent data corruption with MPI_Ireduce

Flashes (Alerts)


Abstract

Customers will experience silent data corruption when using the
MPI_Ireduce collective with Spectrum MPI if all of the following
conditions are met:

1. The application calls the MPI_Ireduce collective
2. The call uses the MPI_IN_PLACE option
3. The rank count in the collective is less than or equal to 4
4. The message size is 64 KB or larger
5. The collective is selected from the libnbc collective library

This issue is fixed in IBM Spectrum MPI 10.1.0 PTF 3.

Content

ERROR DESCRIPTION:
Customers will experience silent data corruption when using the MPI_Ireduce collective with Spectrum MPI if all of the following conditions are met:

1. The application calls the MPI_Ireduce collective
2. The call uses the MPI_IN_PLACE option
3. The rank count in the collective is less than or equal to 4
4. The message size is 64 KB or larger
5. The collective is selected from the libnbc collective library
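
The conditions above can be expressed as a simple check. The following is a minimal sketch, not part of Spectrum MPI: the struct, its field names, and the function name are hypothetical, and it assumes that "64 KB" means 64 * 1024 bytes. In practice the last field is usually not known to the caller, so treat the check as a way to reason about a call site rather than as a runtime guard.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one MPI_Ireduce call site (not a Spectrum MPI type). */
struct ireduce_call {
    bool   uses_in_place;    /* MPI_IN_PLACE is passed to the call */
    int    rank_count;       /* number of ranks taking part in the collective */
    size_t message_bytes;    /* element count multiplied by the datatype size */
    bool   served_by_libnbc; /* libnbc implementation selected (often unknown) */
};

/* Assumption: the advisory's 64 KB threshold means 64 * 1024 bytes. */
#define ADVISORY_MIN_BYTES ((size_t)64 * 1024)
#define ADVISORY_MAX_RANKS 4

/* Returns true only when every listed condition holds for the call. */
bool ireduce_call_matches_advisory(const struct ireduce_call *c)
{
    return c->uses_in_place
        && c->rank_count <= ADVISORY_MAX_RANKS
        && c->message_bytes >= ADVISORY_MIN_BYTES
        && c->served_by_libnbc;
}
```

When it is unclear whether libnbc is selected, setting served_by_libnbc to true gives the conservative answer.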


USERS AFFECTED:
The IBM libcoll collective library does not have this issue. However, IBM libcoll can be used only with the IBM PAMI interconnect protocol, and it does not support user-defined operations (MPI_Op_create). Applications that use user-defined operations with MPI_Ireduce, and that meet the other criteria, may therefore experience silent data corruption.

For Power customers, only the PAMI interconnect protocol is supported. The IBM libcoll collective library is used by default. If the application provides a user-defined operation to MPI_Ireduce, silent data corruption may occur.

For x86 customers, the PAMI interconnect protocol uses IBM libcoll by default, although there are supported options that can alter that default behavior. If the application provides a user-defined operation to MPI_Ireduce, silent data corruption may occur.

For x86 customers, all non-PAMI interconnects can reach the libnbc collective library. When the MPI_Ireduce API is called, a customer has no way to know whether the libnbc implementation will be selected and used.
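
The exposure rules described above can be summarized in a short sketch. The enum and function names below are hypothetical and are not Spectrum MPI identifiers. The sketch assumes default settings; the supported options that change the default on x86 with PAMI are not modeled.

```c
#include <stdbool.h>

/* Hypothetical labels for this sketch (not Spectrum MPI identifiers). */
enum platform     { PLATFORM_POWER, PLATFORM_X86 };
enum interconnect { INTERCONNECT_PAMI, INTERCONNECT_OTHER };

/*
 * Returns true when an MPI_Ireduce call might be handled outside IBM libcoll
 * and could therefore reach the affected libnbc implementation.
 * Assumes default settings.
 */
bool ireduce_may_reach_libnbc(enum platform p, enum interconnect ic,
                              bool user_defined_op)
{
    /* Power supports only the PAMI interconnect protocol. */
    if (p == PLATFORM_POWER) {
        ic = INTERCONNECT_PAMI;
    }

    /* x86 with a non-PAMI interconnect: libnbc can be selected. */
    if (ic != INTERCONNECT_PAMI) {
        return true;
    }

    /* PAMI uses libcoll by default, but libcoll does not support
     * user-defined operations created with MPI_Op_create. */
    return user_defined_op;
}
```

A true result means only that the call may be exposed; the remaining conditions listed in the error description must also hold for corruption to occur.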

There is no workaround for this issue.

IBM Spectrum MPI 10.1.0 PTF 3 contains the following additional fixes:

  • Fix segmentation fault / core dump when --display-diffable-map is used without the -n option.
  • Fix MPI hang and failure to clean up following task failure.
  • Fix PGI compiler wrapper issues with mpipgicc and mpipgic++
  • Fix MPI_T_pvar_get_index
  • usNIC Updates (x86 only).
  • Fix extremely large TCP messages issue (x86 only)
  • Fix powerpc atomics (Issue 2610) (Power only)
  • Fix lower bound and extent in Datatypes (Issue 2560)
  • Fix libnbc - coll/libnbc: fix MPI_IN_PLACE handling in i{gather,scatter}[v]
  • Fix libnbc - fix race condition with multi threaded apps (Issue 2427)
  • Limitation: The IBM Spectrum MPI collectives component (libcollectives) does not support user-defined operations. This limitation applies to all IBM Spectrum MPI 10.1.0 products (GA, PTF1, PTF2, and PTF3).

Update to IBM Spectrum MPI 10.1.0 PTF 3 as soon as practical.


Document Information

Modified date:
25 September 2022

UID

isg3T1025073