Switch health check patch release notes

After the patch is applied, Platform Manager reduces the status polling frequency for the 1 GbE management switches and the Fibre Channel switches. The previous high rate of health checks was causing eUSB flash storage failures in both the management switches and the Fibre Channel switches. The new check interval is 90 minutes. Applying this patch is highly recommended to avoid switch failures.

Before you begin

  • The patch applies to any Integrated Analytics System version lower than 1.0.27.0.
  • The patch must be reapplied after any upgrade to a version lower than 1.0.27.0.
  • The patch is applied by running the interactive update_switches_delay.py script, which guides you through the process.
  • The estimated run time is 5-10 minutes, depending on the system size. Platform Manager is stopped during the run, but the applications and the database remain online.
  • Before you run the script, cat the appropriate json file so that you can compare it with the results afterward.
    Note: Depending on the Integrated Analytics System version, the file to check may be in different locations:
    • /usr/lib/python2.7/site-packages/magneto/cfg/sf_rack_leader.json
    • /usr/lib/python2.7/site-packages/magneto/cfg/3452_rack_leader.json
    • /usr/lib/python2.7/site-packages/magneto/cfg/3452_hub.json
  • Run the script as root. Either log in as root directly, or use the su - command. The su root command does not work and causes the process to fail.
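
As an alternative to reading the whole file by eye, the pre-patch values can be recorded with a short script. The following is a minimal sketch, not part of the patch: the function names existing_files and switch_delays are hypothetical, and it assumes that the configuration file parses as strict JSON. Because the overall layout of the file is not described in these notes, the sketch searches every nested object for entries whose "type" is mgtsw or fcsw instead of assuming a fixed location.

```python
import json
import os

# Candidate locations from the note above; which one exists depends on the version.
CANDIDATES = [
    "/usr/lib/python2.7/site-packages/magneto/cfg/sf_rack_leader.json",
    "/usr/lib/python2.7/site-packages/magneto/cfg/3452_rack_leader.json",
    "/usr/lib/python2.7/site-packages/magneto/cfg/3452_hub.json",
]


def existing_files(paths):
    """Return the candidate paths that exist on this node."""
    return [p for p in paths if os.path.isfile(p)]


def switch_delays(node, wanted=("mgtsw", "fcsw")):
    """Walk parsed JSON and collect (type, delay) pairs for switch entries.

    Assumption: the file layout is unknown, so every nested dict and
    list is searched rather than a fixed key path.
    """
    found = []
    if isinstance(node, dict):
        if node.get("type") in wanted and "delay" in node:
            found.append((node["type"], node["delay"]))
        for value in node.values():
            found.extend(switch_delays(value, wanted))
    elif isinstance(node, list):
        for item in node:
            found.extend(switch_delays(item, wanted))
    return found


if __name__ == "__main__":
    # Print the current delay values for each configuration file that is present.
    for path in existing_files(CANDIDATES):
        with open(path) as handle:
            print("%s %s" % (path, sorted(switch_delays(json.load(handle)))))
```

Running the same script again after the patch gives a second list to compare with the first. The patched entries are expected to report a delay of 5400.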

Procedure

  1. Download the 1.0.0.0.switch_monitoring_policy-IM-IIAS-fpXXX package, where XXX stands for the latest package number, from Fix Central.
  2. After the tar file is downloaded, extract switch_monitoring_policy-1.0_release.tar.gz by using the tar -xvf command. The update_switches_delay directory is created. Example:
    tar -xvf switch_monitoring_policy-1.0_release.tar.gz 
    update_switches_delay/
    update_switches_delay/update_switches_delay.py
  3. Change to the update_switches_delay directory:
    cd update_switches_delay
  4. Run the following command without any parameters:
    python update_switches_delay.py
  5. Wait for the node check to complete, and confirm that you want to update the configuration files:
    [root@e1-n1 update_switches_delay]# ./update_switches_delay.py 
    ###############################################################################
    Started config update script
    Checking nodes list...
    node0101-fab, node0102-fab, node0103-fab, node0104-fab, node0105-fab, node0106-fab, node0107-fab
    Checking nodes reachability...
    Checking reachability of node0101-fab... ok
    Checking reachability of node0102-fab... ok
    Checking reachability of node0103-fab... ok
    Checking reachability of node0104-fab... ok
    Checking reachability of node0105-fab... ok
    Checking reachability of node0106-fab... ok
    Checking reachability of node0107-fab... ok
    All the nodes are reachable, proceeding
    Checking system state... Ready
    Platform Manager is currently running, do you want to stop it and update configuration files?
    Continue? y/[n]: y
    Stopping Platform Management... 
    Successfully deactivated platform
    Updating /usr/lib/python2.7/site-packages/magneto/cfg/3452_rack_leader.json file on node0101-fab
    Updating /usr/lib/python2.7/site-packages/magneto/cfg/3452_rack_leader.json file on node0102-fab
    Updating /usr/lib/python2.7/site-packages/magneto/cfg/3452_rack_leader.json file on node0103-fab
    Updating /usr/lib/python2.7/site-packages/magneto/cfg/3452_rack_leader.json file on node0104-fab
    Updating /usr/lib/python2.7/site-packages/magneto/cfg/3452_rack_leader.json file on node0105-fab
    Updating /usr/lib/python2.7/site-packages/magneto/cfg/3452_rack_leader.json file on node0106-fab
    Updating /usr/lib/python2.7/site-packages/magneto/cfg/3452_rack_leader.json file on node0107-fab
    Updated, running apstart -p... 
    Successfully activated platform
    Script is done, exiting
    ###############################################################################
  6. Verify the result:
    • cat the appropriate json file and compare it with the previous results.
      Note: Depending on the Integrated Analytics System version, the file to check may be in different locations:
      • /usr/lib/python2.7/site-packages/magneto/cfg/sf_rack_leader.json
      • /usr/lib/python2.7/site-packages/magneto/cfg/3452_rack_leader.json
      • /usr/lib/python2.7/site-packages/magneto/cfg/3452_hub.json
      The mgtsw and fcsw entries should now show a delay of 5400 (5400 seconds = 90 minutes):
      {"type": "mgtsw", "target": "hw://#bom:mgtsw/hadomain#", "timeout": 120, "delay": 5400},
      {"type": "fcsw", "target": "hw://#bom:fcsw/hadomain#", "timeout": 120, "delay": 5400},
    • Look at the text files for mgtsw and fcsw located in /var/log/appliance/platform/management/resmgr_out/. After the patch has been applied, the check is run approximately every 90 minutes:
      • mgtsw:
        2022-03-10 12:09:19|24.76s|STATUS|mgtsw@hw://hadomain1.mgtswa
        2022-03-10 13:40:23|24.36s|STATUS|mgtsw@hw://hadomain1.mgtswa
        2022-03-10 15:03:04|24.07s|STATUS|mgtsw@hw://hadomain1.mgtswa
        
        2022-03-10 12:09:20|25.54s|STATUS|mgtsw@hw://hadomain1.mgtswb
        2022-03-10 13:46:39|23.85s|STATUS|mgtsw@hw://hadomain1.mgtswb
        2022-03-10 15:19:16|23.72s|STATUS|mgtsw@hw://hadomain1.mgtswb
        
      • fcsw:
        2022-03-10 12:09:30|35.91s|STATUS|fcsw@hw://hadomain1.fcswa
        2022-03-10 13:44:44|34.61s|STATUS|fcsw@hw://hadomain1.fcswa
        2022-03-10 15:17:12|34.98s|STATUS|fcsw@hw://hadomain1.fcswa
        
        2022-03-10 12:09:30|35.56s|STATUS|fcsw@hw://hadomain1.fcswb
        2022-03-10 13:35:09|33.79s|STATUS|fcsw@hw://hadomain1.fcswb
        2022-03-10 15:11:41|33.75s|STATUS|fcsw@hw://hadomain1.fcswb
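
The spacing between checks can also be computed instead of read off by eye. The following is a minimal sketch under stated assumptions: the function name check_intervals is hypothetical, and it assumes that every relevant line has the four pipe-separated fields shown above (timestamp, duration, STATUS, target). It takes lines of text as input, so it makes no assumption about the names of the files in the resmgr_out directory.

```python
from collections import defaultdict
from datetime import datetime


def check_intervals(lines):
    """Return {target: [minutes between consecutive STATUS checks]}."""
    stamps = defaultdict(list)
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) != 4 or parts[2] != "STATUS":
            continue  # skip blank lines and anything that is not a STATUS record
        stamps[parts[3]].append(datetime.strptime(parts[0], "%Y-%m-%d %H:%M:%S"))
    result = {}
    for target, times in stamps.items():
        times.sort()
        # Difference between each timestamp and the next one, in minutes.
        result[target] = [
            round((later - earlier).total_seconds() / 60.0, 1)
            for earlier, later in zip(times, times[1:])
        ]
    return result


if __name__ == "__main__":
    sample = [
        "2022-03-10 12:09:19|24.76s|STATUS|mgtsw@hw://hadomain1.mgtswa",
        "2022-03-10 13:40:23|24.36s|STATUS|mgtsw@hw://hadomain1.mgtswa",
        "2022-03-10 15:03:04|24.07s|STATUS|mgtsw@hw://hadomain1.mgtswa",
    ]
    for target, minutes in check_intervals(sample).items():
        print("%s %s" % (target, minutes))
```

For the three mgtswa lines in the example above, the intervals are about 91 and 83 minutes, which matches the approximate 90-minute spacing. Intervals of only a few minutes after the patch would suggest that the new delay is not in effect on that node.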