Networking issues
Troubleshooting information and known issues for networking in IBM Fusion HCI.
Configuring the F5 load balancer
- Problem statement
- You can use an external load balancer in place of the default load balancer to access the Red Hat® OpenShift® cluster after the installation of IBM Fusion HCI.
- Resolution
- You can use node placement and tolerations to place router-default pods on the infrastructure nodes in your environment, and then configure the external load balancer to route traffic to the infrastructure nodes.
- Run the following commands to label worker nodes as infrastructure nodes. Use the correct compute nodes for your environment.
    oc label nodes compute-1-ru5.rackd.mydomain.com node-role.kubernetes.io/infra=''
    oc label nodes compute-1-ru6.rackd.mydomain.com node-role.kubernetes.io/infra=''
- Create a default.yaml file to place router-default pods on the infrastructure nodes.
    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      name: default
      namespace: openshift-ingress-operator
    spec:
      nodePlacement:
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/infra: ""
        tolerations:
        - effect: NoSchedule
          operator: Exists
      replicas: 2
    status: {}
- Run the following command to apply the change.
    oc apply -f default.yaml
- Run the following command to verify that the router-default pods are running on the infrastructure nodes.
    oc -n openshift-ingress get pod -o wide
  Example output:
    NAME                            READY   STATUS    RESTARTS   AGE   IP              NODE                               NOMINATED NODE   READINESS GATES
    router-default-6c8f978c-9bd9j   1/1     Running   0          23m   172.20.240.29   compute-1-ru6.rackd.mydomain.com   <none>           <none>
    router-default-6c8f978c-zglxz   1/1     Running   0          23m   172.20.240.28   compute-1-ru5.rackd.mydomain.com   <none>           <none>
- Configure your external load balancer. For more information, see Configuring an external load balancer.
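The verification step can also be scripted. The following is a minimal sketch and not part of the IBM procedure: the function name pods_off_infra is hypothetical, and it assumes the column layout of the example output, where the pod name is the first column and the node name is the seventh. A RESTARTS value that contains spaces would shift the columns and break this assumption.

```shell
#!/bin/sh
# Hypothetical helper: reads `oc get pod -o wide` output on stdin and prints
# every router-default pod whose NODE column (assumed to be column 7) is not
# one of the infrastructure nodes passed as arguments.
pods_off_infra() {
    infra_nodes="$*"
    awk -v nodes="$infra_nodes" '
        BEGIN {
            # Build a lookup table of the expected infrastructure nodes.
            n = split(nodes, list, " ")
            for (i = 1; i <= n; i++) infra[list[i]] = 1
        }
        # The header line is skipped because it does not match the pod name pattern.
        $1 ~ /^router-default-/ && !($7 in infra) { print $1 " is on " $7 }
    '
}
```

For example, pipe the command output into the helper: oc -n openshift-ingress get pod -o wide | pods_off_infra compute-1-ru5.rackd.mydomain.com compute-1-ru6.rackd.mydomain.com. If the helper prints nothing, every router-default pod in the listing runs on one of the named nodes.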
Missing network events that are raised during port down or up on a management switch
- Problem statement
- If a cumulusLinkDown event is observed on the IBM Fusion HCI user interface for any high-speed or management switch, then follow the steps in the Resolution section.
- Resolution
- Do the following workaround steps:
- Go to .
- In the ellipsis menu of the switch, click Run Command.
- Select net show interface all from the list and click Run.
- Check whether the state of the port is UP, DN, or ADMDN.
  - If the state of the port is UP, then the port is recovered and is up and running. In this scenario, the event is not a new entry and can be ignored.
  - If the state of the port is DN or ADMDN, then contact IBM support to check the switch.
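The decision in the last step can be written down as a small lookup. This is only an illustrative sketch: the function name port_state_action is hypothetical, and the state value must be read manually from the net show interface all output.

```shell
#!/bin/sh
# Hypothetical helper: maps a port state reported by `net show interface all`
# to the action described in the resolution steps.
port_state_action() {
    case "$1" in
        UP)       echo "ignore: the port recovered and the event is not a new entry" ;;
        DN|ADMDN) echo "contact IBM support to check the switch" ;;
        *)        echo "unknown state: $1" ;;
    esac
}
```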
Management switch unavailability
- Problem statement
- A management switch is not available.
- Resolution
- Do the following steps to log in to the management switch and resolve the issue.
- Run the following oc command to get the corresponding switch IPv6 or IPv4 address.
    oc get switches <switchName> -o yaml | grep switchIpadress
  For example, if the switch name is mgmt1-rackc, then run the following command:
    oc get switches mgmt1-rackc -o yaml | grep switchIpadress
- Run the following oc commands to get the switch login credentials from the switch secret.
    oc get secrets <switchName>-secret -oyaml | grep defaultUserName
    oc get secrets <switchName>-secret -oyaml | grep defaultUserPasswrd
  For example:
    oc get secrets mgmt1-rackc-secret -oyaml | grep defaultUserName
    oc get secrets mgmt1-rackc-secret -oyaml | grep defaultUserPasswrd
- Run the following command to decode both the username and the password that you obtained in the previous step.
    echo <username/password> | base64 -d
- Use the decoded username and password to log in to the switch from one of the compute nodes. From the Red Hat OpenShift user interface, go to one of the compute node terminals and run the following command:
    ssh <username>@<switchIpadress>
- After you log in to the switch through SSH, switch to the root user.
    sudo su
- Check the values of ClientAliveInterval and ClientAliveCountMax in sshd_config.
    root@mgmt1-rackc:mgmt:~$ cat /etc/ssh/sshd_config | egrep "ClientAliveInterval|ClientAliveCountMax"
    #ClientAliveInterval 0
    #ClientAliveCountMax 3
- Update the /etc/ssh/sshd_config file to set ClientAliveInterval to 600 and ClientAliveCountMax to 0, and save the file.
    vi /etc/ssh/sshd_config
  Remove the "#" on both lines, change the values as follows, and save the file (press Esc to enter command mode, and then type :wq to write and quit):
    ClientAliveInterval 600
    ClientAliveCountMax 0
- Check that the values of ClientAliveInterval and ClientAliveCountMax are correctly set.
    root@mgmt1-rackc:mgmt:~$ cat /etc/ssh/sshd_config | egrep "ClientAliveInterval|ClientAliveCountMax"
    ClientAliveInterval 600
    ClientAliveCountMax 0
- Restart the sshd service.
    root@mgmt1-rackc:mgmt:~$ sudo systemctl restart sshd
- Run the following command to clear the SSH sessions that belong to ISFUSER (most of the SSH sessions belong to ISFUSER).
    for i in `ps -aef | grep ssh | grep ISFUSER | awk {'print $2'}`; do echo $i; kill -9 $i; done
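The sshd_config edit and its verification can also be done without an interactive editor. The following is a hedged sketch rather than the documented procedure: the function names set_client_alive and client_alive_ok are hypothetical, and the sketch assumes that both settings already exist in the file, either commented out or active, as in the example output. If a setting is missing entirely, the sketch does not add it.

```shell
#!/bin/sh
# Hypothetical helper: sets ClientAliveInterval to 600 and ClientAliveCountMax
# to 0 in the given sshd_config file, uncommenting the lines if needed.
set_client_alive() {
    cfg="$1"
    tmp="$(mktemp)" || return 1
    # Write the edited copy to a temporary file, then copy its content back
    # so that the ownership and permissions of the original file are kept.
    sed -E \
        -e 's/^#?[[:space:]]*ClientAliveInterval[[:space:]].*/ClientAliveInterval 600/' \
        -e 's/^#?[[:space:]]*ClientAliveCountMax[[:space:]].*/ClientAliveCountMax 0/' \
        "$cfg" > "$tmp" && cat "$tmp" > "$cfg"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Hypothetical check: succeeds only if both settings are active (not
# commented out) and have the expected values.
client_alive_ok() {
    grep -Eq '^ClientAliveInterval[[:space:]]+600[[:space:]]*$' "$1" &&
    grep -Eq '^ClientAliveCountMax[[:space:]]+0[[:space:]]*$' "$1"
}
```

On the switch, the helpers would be run as the root user against /etc/ssh/sshd_config, for example set_client_alive /etc/ssh/sshd_config && client_alive_ok /etc/ssh/sshd_config. The sshd service still needs to be restarted afterward, as described in the steps.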
Switches moved to critical state after you power off TOR
- Problem statement
- Switches are in critical state because the high availability cluster connectivity breaks between the racks.
- Cause
- The switches from one rack do not share loopback Anycast IP to the spine and other switches.
- Resolution
- Important: This issue and its workaround steps are applicable only to a high availability cluster.
- Log in to all high-speed switches:
  - Run the following oc command to get the corresponding switch IPv6 or IPv4 address.
      oc get switches <switchName> -o yaml | grep switchIpadress
    For example, if the switch name is hspeed1-rackc, then run the following command:
      oc get switches hspeed1-rackc -o yaml | grep switchIpadress
  - Run the following oc commands to get the switch login credentials from the switch secret.
      oc get secrets <switchName>-secret -oyaml | grep defaultUserName
      oc get secrets <switchName>-secret -oyaml | grep defaultUserPasswrd
    For example, if the switch name is hspeed1-rackc, then run the following commands:
      oc get secrets hspeed1-rackc-secret -oyaml | grep defaultUserName
      oc get secrets hspeed1-rackc-secret -oyaml | grep defaultUserPasswrd
  - Run the following command to decode both the username and the password that you obtained in the previous step.
      echo <username/password> | base64 -d
  - Use the decoded username and password to log in to the switch from one of the compute nodes. From the Red Hat OpenShift user interface, go to one of the compute node terminals and run the following command:
      ssh <username>@<switchIpadress>
- Run the following command to restart the frr service on the switch.
    sudo systemctl restart frr
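The credential lookup and the base64 decoding can be combined into one pipeline. The following is a minimal sketch under stated assumptions: the function name decode_secret_field is hypothetical, and it assumes that grep returns a line of the form "key: value" and that the first matching line is the one you want.

```shell
#!/bin/sh
# Hypothetical helper: reads "key: value" lines (as printed by
# `oc get secrets ... -oyaml | grep <key>`) on stdin and prints the
# base64-decoded value of the first line.
decode_secret_field() {
    awk -F': *' '{ print $2; exit }' | tr -d '[:space:]' | base64 -d
}
```

For example: oc get secrets hspeed1-rackc-secret -oyaml | grep defaultUserName | decode_secret_field. The decoded value is printed without a trailing newline.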
Adding a VLAN fails on an expansion rack setup
- Problem statement
- Adding a VLAN fails on an expansion rack setup with a combination of gen1 and gen2 racks.
- Resolution
- If you face any issues when you add a VLAN on an expansion rack setup, then contact IBM support.
Network adapters validation fails
- Problem statement
- The network adapter validation fails when you replace a network adapter.
- Resolution
- Update the kickstart file whenever you replace a network adapter, and ensure that it contains the MAC address of the replacement network adapter.
Admin network in a critical state
- Problem statement
- The admin network is in a critical state, showing an error on the IBM Fusion HCI user interface because pods are unable to communicate between sites.
- Cause
- After Site1 is recovered, the submariner gateway pods are unable to establish a connection, which prevents submariner from connecting successfully.
- Diagnosis
- Follow these steps to diagnose the issue through the user interface:
- Log in to the site2 OpenShift Container Platform user interface.
- Go to .
- Search for MetroDR and select MetroDR from the list.
- Go to the Instances tab and select metrodrsite.
- Go to the YAML tab and check submarinerMonitoringCommandOutput under metroDRSiteStatus.
- Check whether you see the following error for the node.
    control-1-ru4.rackae1.mydomain active 0 connections out of 1 are established
- Note down the node name.
- Resolution
- Follow these steps to resolve the issue through the user interface:
- Log in to the OpenShift Container Platform user interface of site2.
- Go to
- Select submariner-operator from the Project drop-down.
- Click the Manage columns icon beside the search bar.
- Select Node under Additional columns.
The pod details appear.
- Restart or delete the submariner-gateway pod that runs on the node that you noted in the diagnosis steps.
- The error resolves automatically after you apply the resolution steps. If the issue persists, contact IBM support.
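If you copy the submarinerMonitoringCommandOutput text into a file or a pipeline, the affected node names can be extracted automatically. This is a hedged sketch only: the function name nodes_with_missing_connections is hypothetical, and the line format is assumed from the single example in the diagnosis steps (node name, status, established count, the words "connections out of", and the expected count).

```shell
#!/bin/sh
# Hypothetical helper: reads status lines such as
#   control-1-ru4.rackae1.mydomain active 0 connections out of 1 are established
# on stdin and prints the node name when fewer connections are established
# than expected.
nodes_with_missing_connections() {
    awk '$4 == "connections" && $5 == "out" && ($3 + 0) < ($7 + 0) { print $1 }'
}
```

Each printed node name identifies a node whose submariner-gateway pod is a candidate for the restart described in the resolution steps.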