Intermittent DB2 Checksum Corruption (SQLB_CSUM) on VMware ESXi 8.0 Platforms affecting Engineering Lifecycle Management

Troubleshooting

Problem

IBM Engineering Lifecycle Management (ELM) applications fail to start, report "Not Migrated" errors (CRRRS5439E), or crash during operation. These failures are often traced to underlying DB2 database connection errors and page corruption, even when the physical storage is reported as healthy by IT infrastructure teams.

Symptom

ELM Application Layer:

The RM application fails to initialize, showing CRRRS5439E in the setup wizard or logs.

Users encounter "Internal Server Error" or CRRRS1007E when accessing specific artifacts or links.

Database Layer (db2diag.log):

FUNCTION: DB2 UDB, buffer pool services, sqlbReadPage

MESSAGE: ZRC=0x86020019=-2046689255=SQLB_CSUM "Bad Page, Checksum Error"
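
The MESSAGE line shows the same return code three ways: as hexadecimal, as a signed 32-bit integer, and as a symbolic name. When searching logs or support material that quote only one form, it can help to convert between the first two. The following is a minimal sketch of that conversion; the function name zrc_to_signed is illustrative and not part of any DB2 tooling.

```python
def zrc_to_signed(hex_code: str) -> int:
    """Interpret a hexadecimal ZRC value as a signed 32-bit integer."""
    value = int(hex_code, 16)
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("ZRC value does not fit in 32 bits")
    # Values with the high bit set are negative in two's complement.
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    # The code quoted in the db2diag.log entry above.
    print(zrc_to_signed("0x86020019"))
```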

Maintenance Symptoms:

Scheduled DB2 backups fail for the RM database while JTS, CCM, or QM databases may initially appear unaffected.

Cause

This issue is caused by a silent data corruption bug stemming from a conflict between Intel AVX-512 instructions and certain VMware ESXi 8.0 hypervisor versions.

In an ELM environment, the RM application is particularly vulnerable because it frequently performs complex read/write operations involving deep linking and UUID lookups. When DB2 uses AVX-512 to verify these pages, the hypervisor's mismanagement of the instruction leads to a "false" checksum failure, causing the database to mark ELM data as corrupt.

Environment

Application: IBM Engineering Lifecycle Management (ELM).

Database: IBM DB2 (all versions that use AVX-512 instructions for page checksum verification).

Virtualization: VMware ESXi 8.0, 8.0U1, or 8.0U2 (Builds prior to 23825572).

Hardware: Intel CPUs supporting AVX-512 (e.g., Ice Lake or Sapphire Rapids).
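
The virtualization condition above can be checked mechanically when many hosts are involved. The sketch below assumes the host reports its version as a single line in the form "VMware ESXi 8.0.2 build-22380479" (the form typically printed by vmware -v on the ESXi shell); the sample strings and the function names are illustrative assumptions, and only the 23825572 threshold comes from this document.

```python
import re

# Builds lower than this are listed as affected in the Environment section.
FIRST_UNAFFECTED_BUILD = 23825572

# Assumed shape of the version line, e.g. "VMware ESXi 8.0.2 build-22380479".
VERSION_LINE = re.compile(r"ESXi (\d+\.\d+)(?:\.\d+)?\s+build-(\d+)")


def is_affected_host(version_line: str) -> bool:
    """Return True for an ESXi 8.0 host whose build is below the threshold."""
    match = VERSION_LINE.search(version_line)
    if match is None:
        raise ValueError(f"unrecognized version line: {version_line!r}")
    release, build = match.group(1), int(match.group(2))
    # Only the 8.0 release line is in scope for this document.
    return release == "8.0" and build < FIRST_UNAFFECTED_BUILD


if __name__ == "__main__":
    for line in ("VMware ESXi 8.0.2 build-22380479",
                 "VMware ESXi 8.0.3 build-24674464"):
        print(line, "->", "affected" if is_affected_host(line) else "not affected")
```

A host that the check reports as not affected can still hold pages that were damaged before it was patched, so the db2dart and log checks remain necessary.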

Diagnosing The Problem

Log Correlation: Match ELM application startup timestamps with SQLB_CSUM entries in the db2diag.log.

Verify Hypervisor Build: Confirm whether the ESXi build number is lower than 23825572.

Run DB2DART: Execute db2dart <DB_NAME> /DB and check the .RPT file for BPS Tail incorrect CBITS value.

Identify ELM Impact: Check if corruption is localized in ELM-specific tables such as REPOSITORY.ITEM_STATES or RESOURCE.RESOURCE.
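
For the log correlation step, the timestamps of the checksum errors can be pulled out of db2diag.log with a short script and then compared against the ELM startup times. This is a minimal sketch, assuming each db2diag.log record begins with a timestamp of the form 2024-03-15-10.22.33.123456 at the start of a line; the function name and the sample timestamp are illustrative.

```python
import re
import sys
from typing import Iterable, List

# Assumed record header: a timestamp such as 2024-03-15-10.22.33.123456
# at the start of the first line of each db2diag.log record.
RECORD_START = re.compile(r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6})")


def checksum_error_timestamps(lines: Iterable[str]) -> List[str]:
    """Return the timestamp of every record that mentions SQLB_CSUM."""
    hits: List[str] = []
    current = None      # timestamp of the record being read
    reported = False    # whether that record was already added to hits
    for line in lines:
        header = RECORD_START.match(line)
        if header:
            current, reported = header.group(1), False
        elif "SQLB_CSUM" in line and current is not None and not reported:
            hits.append(current)
            reported = True
    return hits


if __name__ == "__main__":
    # Usage: python scan_db2diag.py /path/to/db2diag.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for stamp in checksum_error_timestamps(log):
            print(stamp)
```

Each record is reported once even if SQLB_CSUM appears on several of its lines, so the output is one timestamp per failed page read.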

Resolving The Problem

Step 1: Immediate Workaround (Instance Level)

Disable AVX-512 usage within the DB2 instance to prevent further damage to ELM data:

Run: db2set DB2_CPU_FEATURE_DISABLE=AVX512
Restart the DB2 instance (db2stop / db2start).

Step 2: Permanent Infrastructure Fix

Update the VMware environment to resolve the underlying instruction set defect:

Target Version: VMware ESXi 8.0 Update 3 (Build 24674464) or later.

Step 3: ELM Data Recovery

Because ELM applications rely on strict link integrity, choose a recovery path carefully:

Restore (Recommended): Revert to a healthy backup created before the corruption occurred. This is the only way to ensure 100% link integrity between RM, JTS, and other ELM containers.
Salvage (High Risk): Use db2dart /DDEL to extract data from corrupted tables. Warning: This will result in "data holes" and broken links within the RM application, potentially leading to ghost artifacts.
Re-indexing: After any database repair or restore, perform a full RM re-index (on Linux or UNIX servers, use repotools-rm.sh instead of the .bat script):
```
repotools-rm.bat -reindex all
```

Step 4: Post-Fix Verification

Perform a final validation to ensure the ELM environment is stable:

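Connect to the RM database and run db2 "inspect check database results keep <results_file>", then format the results file with db2inspf and review it for errors.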
Verify RM application accessibility and artifact consistency in the web UI.

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB77","label":"Automation Platform"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSPRJQ","label":"IBM Engineering Lifecycle Management Base"},"ARM Category":[{"code":"a8m0z000000CbQgAAK","label":"Jazz Team Server-\u003EDatabase-\u003EDB2"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"7.0.0;7.0.1;7.0.2;7.0.3;7.1.0;7.2.0"}]
