Networking issues

A list of troubleshooting information and known issues for IBM Fusion HCI networking.

Configuring the F5 load balancer

Problem statement
You can use an external load balancer in place of the default load balancer to access the Red Hat® OpenShift® cluster after the installation of IBM Fusion HCI.
Resolution
Use node placement and tolerations to place the router-default pods on the infrastructure nodes in your environment, and then configure the external load balancer to route traffic to the infrastructure nodes.
  1. Run the following commands to label worker nodes as infrastructure nodes. Use the correct compute nodes for your environment.
    oc label nodes compute-1-ru5.rackd.mydomain.com node-role.kubernetes.io/infra=''
    oc label nodes compute-1-ru6.rackd.mydomain.com node-role.kubernetes.io/infra=''
  2. Create a default.yaml file to place the router-default pods on the infrastructure nodes.
    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: default
      namespace: openshift-ingress-operator
    spec:
      nodePlacement:
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/infra: ""
        tolerations:
        - effect: NoSchedule
          operator: Exists
      replicas: 2
    status: {}
  3. Run the following command to apply the change.
    oc apply -f default.yaml
  4. Run the following command to verify that the router-default pods are running on the infrastructure nodes.
    oc -n openshift-ingress get pod -o wide
    Example output:
    NAME                            READY   STATUS    RESTARTS   AGE   IP              NODE                               NOMINATED NODE   READINESS GATES
    router-default-6c8f978c-9bd9j   1/1     Running   0          23m   172.20.240.29   compute-1-ru6.rackd.mydomain.com   <none>           <none>
    router-default-6c8f978c-zglxz   1/1     Running   0          23m   172.20.240.28   compute-1-ru5.rackd.mydomain.com   <none>           <none>
  5. Configure your external load balancer. For more information, see Configuring an external load balancer.
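
The check in step 4 can also be scripted. The following is a minimal sketch and not part of the documented procedure. pods_off_infra is a hypothetical helper name. The sketch assumes that it receives one "pod node" pair per line, for example from the custom-columns output format of oc get, and that you pass the infrastructure node names as arguments.

```shell
# Hypothetical helper: reads "POD NODE" pairs on stdin.
# Arguments are the node names that you labeled as infrastructure nodes.
# Prints every pod that is scheduled on a node outside that list.
pods_off_infra() {
  while read -r pod node; do
    [ -n "$pod" ] || continue
    on_infra=no
    for infra in "$@"; do
      [ "$node" = "$infra" ] && on_infra=yes
    done
    [ "$on_infra" = yes ] || echo "$pod is on $node"
  done
}

# Usage against a live cluster (commented out so that the sketch loads without one):
# oc -n openshift-ingress get pod --no-headers \
#   -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName \
#   | pods_off_infra compute-1-ru5.rackd.mydomain.com compute-1-ru6.rackd.mydomain.com
```

If the helper prints nothing, every router-default pod runs on one of the listed nodes.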

Missing network events that are raised during port down or up on a management switch

Problem statement
If a cumulusLinkDown event is observed on the IBM Fusion HCI user interface for any high-speed or management switch, follow the steps in the Resolution section.
Resolution
Complete the following workaround steps:
  1. Go to Infrastructure > Network > Switches.
  2. In the ellipsis menu of the switch, click Run Command.
  3. Select net show interface all from the list and click Run.
  4. Check whether the state of the port is UP, DN, or ADMDN.
    1. If the state of the port is UP, then the port is recovered and is up and running. In this scenario, the event is not a new entry and can be ignored.
    2. If the state of the port is DN or ADMDN, then contact IBM support to check the switch.
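
The decision in step 4 can be sketched as a small script for cases where you save the command output to a file. This is an illustrative sketch only. port_state and port_action are hypothetical helper names, and the sketch assumes that the net show interface all output lists the state in the first column and the port name in the second column. Verify that layout against the output of your switch before you rely on it.

```shell
# Hypothetical helper: print the state of one port.
# Input on stdin: saved "net show interface all" output.
# Assumption: column 1 is the state (UP, DN, ADMDN), column 2 is the port name.
port_state() {
  awk -v port="$1" '$2 == port { print $1; found = 1; exit } END { exit (found ? 0 : 1) }'
}

# Hypothetical helper: map a state to the action that step 4 describes.
port_action() {
  case "$1" in
    UP)       echo "port recovered, event can be ignored" ;;
    DN|ADMDN) echo "port down, contact IBM support" ;;
    *)        echo "unrecognized state" ;;
  esac
}

# Usage (swp1 and interfaces.txt are placeholder names):
# port_action "$(port_state swp1 < interfaces.txt)"
```

port_state returns a nonzero exit status when the port name does not appear in the input.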

Management switch unavailability

Problem statement
A management switch is not available.
Resolution
Do the following steps to log in to the management switch and resolve the issue.
  1. Run the following oc command to get the corresponding switch IPv6 or IPv4 address.
    oc get switches <switchName> -o yaml | grep switchIpadress
    For example, if the switch name is mgmt1-rackc, then run the following command:
    oc get switches mgmt1-rackc -o yaml | grep switchIpadress
  2. Run the following oc commands to get the switch login credentials from the switch secret.
    oc get secrets <switchName>-secret -oyaml | grep defaultUserName
    oc get secrets <switchName>-secret -oyaml | grep defaultUserPasswrd
    For example:
    oc get secrets mgmt1-rackc-secret -oyaml | grep defaultUserName
    oc get secrets mgmt1-rackc-secret -oyaml | grep defaultUserPasswrd
  3. Run the following command once for the username and once for the password that you obtained in the previous step to decode each value.
    echo <username/password> | base64 -d
  4. From the Red Hat OpenShift user interface, open a terminal on one of the compute nodes. Use the username and password that you decoded in the previous step to log in to the switch.
    ssh <username>@<switchIpadress>
  5. After you log in to the switch, switch to the root user.
    sudo su
  6. Check the values of ClientAliveInterval and ClientAliveCountMax in sshd_config.
    root@mgmt1-rackc:mgmt:~$ cat /etc/ssh/sshd_config | egrep "ClientAliveInterval|ClientAliveCountMax"
    #ClientAliveInterval 0
    #ClientAliveCountMax 3
  7. Update the /etc/ssh/sshd_config file to set ClientAliveInterval to 600 and ClientAliveCountMax to 0, and save the file.
    vi /etc/ssh/sshd_config
    Remove the "#" from both lines and change the values as shown in the following lines. To save the file, press Esc to enter command mode, and then type :wq to write and quit.
    ClientAliveInterval 600
    ClientAliveCountMax 0
  8. Check that the values of ClientAliveInterval and ClientAliveCountMax are correctly set.
    root@mgmt1-rackc:mgmt:~$ cat /etc/ssh/sshd_config | egrep "ClientAliveInterval|ClientAliveCountMax"
    ClientAliveInterval 600
    ClientAliveCountMax 0
  9. Restart the sshd service.
    root@mgmt1-rackc:mgmt:~$ sudo systemctl restart sshd
  10. Run the following command to clear the SSH sessions that belong to ISFUSER (most of the SSH sessions belong to ISFUSER).
    for i in $(ps -aef | grep ssh | grep ISFUSER | awk '{print $2}'); do echo "$i"; kill -9 "$i"; done
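
As an alternative to editing the file in vi in step 7, the same two changes can be applied with sed. The following is a minimal sketch and not the documented method. set_sshd_keepalive is a hypothetical helper name. The sketch assumes that both directives already exist in the file, commented or uncommented, each on its own line. It does not add a directive that is missing, so back up the file first and run the check in step 8 afterward.

```shell
# Hypothetical helper: set ClientAliveInterval to 600 and ClientAliveCountMax to 0
# in the sshd configuration file that is passed as the first argument.
# Existing lines are rewritten in place, with any leading "#" removed.
set_sshd_keepalive() {
  cfg="$1"
  tmp="$(mktemp)" || return 1
  sed -E \
    -e 's/^[[:space:]]*#?[[:space:]]*ClientAliveInterval[[:space:]].*$/ClientAliveInterval 600/' \
    -e 's/^[[:space:]]*#?[[:space:]]*ClientAliveCountMax[[:space:]].*$/ClientAliveCountMax 0/' \
    "$cfg" > "$tmp" && cat "$tmp" > "$cfg"   # cat keeps the owner and mode of the original file
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Usage as the root user on the switch (commented out so that the sketch changes nothing when loaded):
# cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak
# set_sshd_keepalive /etc/ssh/sshd_config
```

The sshd service still needs a restart, as described in step 9, before the new values take effect.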

Switches moved to critical state after you power off TOR

Problem statement
Switches are in critical state because the high availability cluster connectivity breaks between the racks.
Cause
The switches from one rack do not share loopback Anycast IP to the spine and other switches.
Resolution
Important: This issue and its workaround steps are applicable only to a high availability cluster.
  1. Log in to all high-speed switches. Steps 2 to 5 describe how to log in to a switch.
  2. Run the following oc command to get the corresponding switch IPv6 or IPv4 address.
    oc get switches <switchName> -o yaml | grep switchIpadress
    For example, if the switch name is hspeed1-rackc, then run the following command:
    oc get switches hspeed1-rackc -o yaml | grep switchIpadress
  3. Run the following oc commands to get the switch login credentials from the switch secret.
    oc get secrets <switchName>-secret -oyaml | grep defaultUserName
    oc get secrets <switchName>-secret -oyaml | grep defaultUserPasswrd
    For example, if the switch name is hspeed1-rackc, then run the following commands:
    oc get secrets hspeed1-rackc-secret -oyaml | grep defaultUserName
    oc get secrets hspeed1-rackc-secret -oyaml | grep defaultUserPasswrd
  4. Run the following command once for the username and once for the password that you obtained in the previous step to decode each value.
    echo <username/password> | base64 -d
  5. From the Red Hat OpenShift user interface, open a terminal on one of the compute nodes. Use the username and password that you decoded in the previous step to log in to the switch.
    ssh <username>@<switchIpadress>
  6. Run the following command to restart the frr service on the switch.
    sudo systemctl restart frr
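
The credential lookup and decoding in steps 3 and 4 can be combined into one helper. This is a minimal sketch under stated assumptions. secret_field is a hypothetical helper name, and the sketch assumes that the secret YAML contains each key on its own line in the form "key: base64value", as the grep commands in step 3 suggest.

```shell
# Hypothetical helper: extract one base64-encoded field from secret YAML
# on stdin and print the decoded value.
# $1 is the key name, for example defaultUserName.
secret_field() {
  awk -v key="$1:" '$1 == key { print $2; exit }' | base64 -d
}

# Usage against a live cluster (commented out so that the sketch loads without one):
# oc get secrets hspeed1-rackc-secret -oyaml | secret_field defaultUserName
# oc get secrets hspeed1-rackc-secret -oyaml | secret_field defaultUserPasswrd
```

The helper prints nothing when the key is not found in the input.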

Adding a VLAN fails on an expansion rack setup

Problem statement
Adding a VLAN fails on an expansion rack setup that combines gen1 and gen2 racks.
Resolution
If you encounter issues when you add a VLAN on an expansion rack setup, contact IBM support.

Network adapter validation fails

Problem statement
The network adapter validation fails when you replace a network adapter.
Resolution
Whenever you replace a network adapter, update the kickstart file with the MAC address of the replaced network adapter.

Admin network in a critical state

Problem statement
The admin network is in a critical state, showing an error on the IBM Fusion HCI user interface because pods are unable to communicate between sites.
Cause
The submariner gateway pods are unable to establish a connection after Site1 is recovered, preventing the submariner from connecting successfully.
Diagnosis
Follow these steps to diagnose the issue through the user interface:
  1. Log in to the site2 OpenShift Container Platform user interface.
  2. Go to Administration > CustomResourceDefinitions.
  3. Search for MetroDR and select MetroDR from the list.
  4. Go to the Instances tab and select metrodrsite.
  5. Go to the YAML tab and check submarinerMonitoringCommandOutput under metroDRSiteStatus.
  6. Check whether you see the following error for the node.
    control-1-ru4.rackae1.mydomain   active 0 connections out of 1 are established
  7. Note down the node name.
Follow these steps to diagnose the issue through the CLI:
  1. Log in to the site2 OpenShift Container Platform user interface.
  2. Run the following command.
    oc get mdr -o yaml
  3. Check submarinerMonitoringCommandOutput under metroDRSiteStatus in the output.
  4. Check whether you see the following error for the node.
    control-1-ru4.rackae1.mydomain   active 0 connections out of 1 are established
  5. Note down the node name.
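
The check in steps 3 to 5 of the CLI diagnosis can be sketched as a filter. This is an illustrative sketch only. nodes_without_connections is a hypothetical helper name, and the sketch assumes that each node entry in submarinerMonitoringCommandOutput appears on its own line with the node name first, as in the example error.

```shell
# Hypothetical helper: print the name of every node whose gateway reports
# "0 connections out of ...".
# Input on stdin: text that contains the submarinerMonitoringCommandOutput lines.
nodes_without_connections() {
  awk '/ 0 connections out of / { print $1 }'
}

# Usage against a live cluster (commented out so that the sketch loads without one):
# oc get mdr -o yaml | nodes_without_connections
```

The leading space in the pattern prevents a match on counts such as 10 or 20.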
Resolution
Follow these steps to resolve the issue through the user interface:
  1. Log in to the OpenShift Container Platform user interface of site2.
  2. Go to Workloads > Pods.
  3. Select submariner-operator from the Project drop-down.
  4. Click the Manage columns icon beside the search bar.
  5. Select Node under Additional columns.

    The pod details appear.

  6. Restart or delete the submariner-gateway pod that runs on the node that you noted in the diagnosis steps.
  7. The error resolves automatically after you apply the resolution steps. If the issue persists, contact IBM support.
Follow these steps to resolve the issue through the CLI:
  1. Log in to the OpenShift Container Platform user interface of site2.
  2. Run the following command to get the list of submariner gateway pods.
    oc get pods -o wide -n submariner-operator | grep submariner-gateway
    Example output:
    submariner-gateway-jc6pt                        1/1     Running   1             16h   172.20.102.26   control-1-ru3.rackae2.mydomain.ibm.com   <none>           <none>
    submariner-gateway-kpqpc                        1/1     Running   1             16h   172.20.102.25   control-1-ru2.rackae2.mydomain.ibm.com   <none>           <none>
    submariner-gateway-mp7jt                        1/1     Running   2 (16h ago)   16h   172.20.102.27   control-1-ru4.rackae2.mydomain.ibm.com   <none>           <none>
  3. Run the following command to delete the gateway pod that runs on the node that you noted in the diagnosis steps.
    oc delete pod <podname> -n submariner-operator
    For example:
    oc delete pod submariner-gateway-mp7jt -n submariner-operator
    pod "submariner-gateway-mp7jt" deleted
  4. The error resolves automatically after you apply the resolution steps. If the issue persists, contact IBM support.
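
Steps 2 and 3 of the CLI resolution can be sketched as one helper that looks up the gateway pod on the affected node. This is a minimal sketch and not part of the documented procedure. gateway_pod_on_node is a hypothetical helper name. The sketch assumes the oc get pods -o wide layout from the example output, and it matches a node either by its full name or by a name prefix that ends at a dot.

```shell
# Hypothetical helper: print the pod whose NODE field matches the given node.
# Input on stdin: "oc get pods -o wide" lines for the submariner-gateway pods.
# All fields are scanned because the RESTARTS column can span several words,
# for example "2 (16h ago)", which shifts the position of the NODE column.
gateway_pod_on_node() {
  awk -v node="$1" '{
    for (i = 2; i <= NF; i++)
      if ($i == node || index($i, node ".") == 1) { print $1; next }
  }'
}

# Usage against a live cluster (commented out so that the sketch deletes nothing when loaded):
# pod="$(oc get pods -o wide -n submariner-operator | grep submariner-gateway \
#   | gateway_pod_on_node control-1-ru4.rackae2.mydomain)"
# oc delete pod "$pod" -n submariner-operator
```

Check that the helper returned exactly one pod name before you run the delete command.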