IBM Support

Voltaire Subnet Manager (SM) hang due to memory usage - IBM System Cluster 1350 (Type 4669)

Troubleshooting


Problem

When an ISR9024S-M (Type 4669-017) or ISR9024D-M (Type 4669-018) running 3.4.5_b463 is connected to Voltaire ISR9024S/ISR9024S-M switches built with early-revision switch ASICs (chip rev 0xA0 or 0xA1), the GridVision performance monitor fails to recognize the ASIC device ID and repeatedly prints an "invalid revision id" message to nvigor.log. Over the course of a few weeks, depending on the number of old-revision ASICs, these repeated messages grow the file until it consumes all available flash memory, leading to a Subnet Manager (SM) hang and total system malfunction.

Resolving The Problem

Source

RETAIN tip: H175659

Symptom

When an ISR9024S-M (Type 4669-017) or ISR9024D-M (Type 4669-018) running 3.4.5_b463 is connected to Voltaire ISR9024S/ISR9024S-M switches built with early-revision switch ASICs (chip rev 0xA0 or 0xA1), the GridVision performance monitor fails to recognize the ASIC device ID and repeatedly prints an "invalid revision id" message to nvigor.log.

Over the course of a few weeks, depending on the number of old-revision ASICs, these repeated messages grow the file until it consumes all available flash memory, leading to a Subnet Manager (SM) hang and total system malfunction.
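
As a rough aid for confirming the symptom, the following Python sketch counts how many lines of a saved copy of the log contain the "invalid revision id" text quoted above. It assumes the log has already been copied off the switch to a workstation; the function names and the command-line usage are illustrative and are not part of any Voltaire or IBM tool.

```python
import sys

# Message text quoted in this tip; matched case-insensitively.
MARKER = "invalid revision id"


def count_marker_lines(lines, marker=MARKER):
    """Return the number of lines that contain the marker text."""
    needle = marker.lower()
    return sum(1 for line in lines if needle in line.lower())


def marker_fraction(lines, marker=MARKER):
    """Return the share of lines (0.0 to 1.0) that contain the marker."""
    lines = list(lines)
    if not lines:
        return 0.0
    return count_marker_lines(lines, marker) / len(lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_log.py <path to a copy of nvigor.log>
    with open(sys.argv[1], errors="replace") as handle:
        content = handle.readlines()
    print(count_marker_lines(content), "matching lines,",
          f"{marker_fraction(content):.1%} of the file")
```

A log in which most lines match the marker is consistent with the behavior described in this tip.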

Affected configurations

The system may be any of the following IBM servers:

  • IBM System Cluster 1350, type 4669, any model

This tip is not option specific.

The 3.4.5_b463 firmware for the Voltaire ISR9024 (IBM M/T 4669) is affected.

This tip is not Operating System specific.

Solution

Voltaire has fixed this issue in the new ISR9024 GridVision (GV) software version 3.4.5, build 467, which eliminates the faulty prints and monitors the size of these logs.

In addition, oversized logs are deleted once recognized.

Voltaire recommends that all users upgrade to this new GridVision software immediately to avoid any potential switch hangs. Version 3.4.5, build 467, is currently available for download at the following URL:

http://www.voltaire.com/ftp/support-products/source/9024/pImage.ibswlrm

Workaround

Returning the switch to factory defaults erases the flash and recovers the switch.

To set the ISR9024 back to factory defaults, log in to the switch CLI and enter the following commands in order: enable --> config --> factory-default.

Additional information

Voltaire managed switches run the GridVision fabric and device manager software on an embedded Linux CPU board (the management CPU). The runtime code runs in RAM; however, some key elements, such as database and configuration files, are kept on a 64 MB flash device for data persistence. If this flash capacity is exhausted, the switch will hang.
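
To get a feel for why the failure takes weeks and depends on the number of old-revision ASICs, the arithmetic can be sketched as below. Only the 64 MB flash size comes from this tip; the message size, the per-ASIC message rate, and the function name are hypothetical parameters that would have to be measured on a real system.

```python
# 64 MB flash capacity, as stated in this tip.
FLASH_BYTES = 64 * 1024 * 1024


def days_until_full(free_bytes, bytes_per_message, messages_per_hour_per_asic,
                    old_asic_count):
    """Estimate the days until free_bytes of flash are consumed.

    All rate parameters are assumptions supplied by the caller; this tip
    does not state the actual message size or print frequency.
    """
    if min(bytes_per_message, messages_per_hour_per_asic, old_asic_count) <= 0:
        raise ValueError("rates and counts must be positive")
    bytes_per_day = (bytes_per_message * messages_per_hour_per_asic
                     * old_asic_count * 24)
    return free_bytes / bytes_per_day
```

Because the growth rate scales with the number of old-revision ASICs, doubling that count halves the estimated time to fill the flash.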

One of the elements stored in flash is a system log called nvigor.log. This log captures all system power-up messages every time the system is rebooted or powered up. A file containing only normal system bring-up messages is typically about 40 KB.

The system can keep up to four nvigor.log files. Unlike all other system logs, nvigor.log had no file size limit, which allowed the log to grow until it consumed all available flash space.
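
The kind of constraint that was missing can be illustrated with a general size-capped log rotation sketch. This is not Voltaire's implementation or the fix shipped in build 467; it uses Python's standard `RotatingFileHandler`, and the 40 KB cap and the helper name `make_capped_logger` are assumptions chosen only to mirror the figures in this tip (one active file plus three backups, four files in total).

```python
import logging
from logging.handlers import RotatingFileHandler


def make_capped_logger(path, max_bytes=40 * 1024, backups=3):
    """Return a logger whose files are rotated before exceeding max_bytes.

    With backups=3, at most four files exist: path, path.1, path.2, path.3.
    The oldest backup is discarded on each rotation, so total disk use is
    bounded by roughly (backups + 1) * max_bytes.
    """
    logger = logging.getLogger(f"capped:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the root logger
    if not logger.handlers:   # avoid attaching duplicate handlers
        handler = RotatingFileHandler(path, maxBytes=max_bytes,
                                      backupCount=backups)
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger
```

With such a cap in place, a repetitive message can overwrite older log content but cannot grow the logs without bound.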

Document Location

Worldwide

Operating System

System x:Operating system independent / None


Document Information

Modified date:
29 January 2019

UID

ibm1MIGR-5073289