Troubleshooting

The following are some troubleshooting scenarios.

General Issues

Early Users of 'unrhel'

The following message is returned when the environment is migrated over from RHEL using unrhel@v5.1.1 or below.

FAILED => Missing sudo password

To fix this, execute the following commands before performing an upgrade.

$ ssh sevone@<'agent' IP address>
 
$ sudo -i
 
$ echo "sevone ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

ACCESS DENIED in GraphQL Logs

If the SOA API keys are outdated or expired, you are likely to see this error. To fix it, update the datasource keys.

  1. Execute the following command to ensure that the GraphQL pod is Running and not in an Errored or CrashLoopBackOff state.

    Example

    $ kubectl get pods | grep graphql
    di-graphql-7d88c8c7b5-fbwgc                  1/1     Running     0              22h
  2. If the third column reads Errored or CrashLoopBackOff, using a text editor of your choice, edit /opt/SevOne/chartconfs/di_custom.yaml to include the following environment variable and then, save it.

    Example

    graphql:
      env:
        SKIP_REPORT_MIGRATION_DRY_RUN: true
  3. Apply the change made to /opt/SevOne/chartconfs/di_custom.yaml file.
    $ sevone-cli playbook up --tags apps
  4. Once the GraphQL pod state is running / healthy, generate new SOA API keys for each affected datasource. Execute the following command.
    $ sevone-cli exec graphql -- npm run reconfig-datasource
    Note: You will be prompted several times. Keep all the default values, except enter y at the prompt Login instead of providing an API key?.

    The username must be admin.

    The password must be the admin Graphical User Interface password for the datasource.
    Important: Repeat this step for each datasource.

    Example

    $ sevone-cli exec graphql -- npm run reconfig-datasource
    
    > insight-server@6.7.0 reconfig-datasource /insight-server
    > NODE_PATH=./dist/libs node dist/scripts/database-init/reconfigure-datasource.js
    
    Datasource config:
      Name:         Data Insight API
      Address:      https://staging.soa.sevone.doc
      API key:      eyJ1dWlkIjoiYzMxNTQzZWUtMTgxZC00NWMyLTlkNjctNTUwZWRhODQ2MGFkIiwiYXBwbGljYXRpb24iOiJEYXRhIEluc2lnaHQgKGNrcmwxdmhqZzAwMDA1M3MwM3Z5bmRlZXMpIiwiZW50cm9weSI6IjZVaWxuQStzVDk2ZUFKeG92WW1Nak1odS9nZ29JSWhLNVBDZ05yZHBBT1lrSE11ZlM0eU9CbCs4YWxEUXd3a1MifQ==
      Dstype:       METRICS/FLOW
    
    Datasource name [Data Insight API]:
    
    [1] METRICS/FLOW
    [2] splunk-datasource
    [3] elastic-datasource
    [0] Keep: METRICS/FLOW
    
    Datasource dstype [1, 2, 3, 0]: 0
    Datasource address [https://staging.soa.sevone.doc]:
    Login instead of providing an API key? [y/n]: y
    Username: admin
    Password: ******
    info: [Data Insight API@reconfigure-datasource] SOA request (SOA-1) post https://staging.soa.sevone.doc/api/v3/users/signin
    (node:347) Warning: Setting the NODE_TLS_REJECT_UNAUTHORIZED environment variable to '0' makes TLS connections and HTTPS requests insecure by disabling certificate verification.
    (Use `node --trace-warnings ...` to show where the warning was created)
    info: [Data Insight API@reconfigure-datasource] SOA response (SOA-1) elapsed 1755ms.
    info: [Data Insight API@reconfigure-datasource] SOA request (SOA-2) post https://staging.soa.sevone.doc/api/v3/users/apikey
    info: [Data Insight API@reconfigure-datasource] SOA response (SOA-2) elapsed 277ms.
    
    New datasource config:
      Name:         Data Insight API
      Address:      https://staging.soa.sevone.doc
      API key:      eyJ1dWlkIjoiMzgyMDdhMjItNzE2Mi00OWRlLTk5NTYtYmI3OTVkYjc5NzZkIiwiYXBwbGljYXRpb24iOiJEYXRhIEluc2lnaHQgKGNrczUyYnd0YzAwMDA5bnMxNnQ3aWcxYnQpIiwiZW50cm9weSI6IkNoYS9tbDFWbVVyYThQcHVsLzIzY05JZk94QXcxWFQrVnEyM0hPSzYzSTdPNGNMbkJTVjQyWUVRSW1FeGtDaEoifQ==
      Dstype:       METRICS/FLOW
    
    Is this config correct? [y/n]: y
    info: [Data Insight API@create-datasource] SOA request (SOA-3) get https://staging.soa.sevone.doc/api/v3/users/self
    info: [Data Insight API@create-datasource] SOA response (SOA-3) elapsed 275ms.
    
    Datasource config updated!
    
    Datasource reconfiguration complete.
  5. Once all the datasources have been updated, restart the GraphQL pod.
    $ kubectl delete pods -l app.kubernetes.io/component=graphql
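Once the pod is recreated, you can confirm that it is healthy and that the ACCESS DENIED messages have stopped. The label selector below is the same one used in the delete command above; no output from the grep indicates the error is gone.

$ kubectl get pods -l app.kubernetes.io/component=graphql

$ kubectl logs -l app.kubernetes.io/component=graphql --tail=200 | grep -i "access denied"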

Error Fetching Widget when Loading Report

This error occurs when the wdkserver is not serving the widgets to the user interface. In most cases, it is caused by an invalid cookie value set in your browser. You may inspect the network activity using the browser's Developer Tools and look for requests to /wdkserver. If you are unable to inspect the network activity, please contact IBM SevOne Support.

If you observe the following error message coming back from /wdkserver, remove the offending cookie or disable strict headers in wdkserver. If you need assistance with this, please contact IBM SevOne Support.

{"statusCode":400,"error":"Bad Request","message":"Invalid cookie value"}
  1. Using a text editor of your choice, edit /opt/SevOne/chartconfs/di_custom.yaml to include the following environment variable and then, save it.
    wdkserver:
      env:
        DISABLE_STRICT_HEADER: true    
  2. Apply the change made to /opt/SevOne/chartconfs/di_custom.yaml file.
    $ sevone-cli playbook up --tags apps
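As an optional check, once the wdkserver pod has restarted, you can confirm that the new environment variable was rendered into the pod spec. The di-wdkserver deployment name below is taken from the pod listing later in this guide and may differ in your environment.

$ kubectl get pods | grep wdkserver

$ kubectl describe pod <di-wdkserver pod name> | grep -i DISABLE_STRICT_HEADER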

Unable to connect to the server: x509: certificate has expired

If you see the error message x509: certificate has expired when running kubectl commands, your certificates have expired and must be rotated manually. Please refer to SevOne Data Insight Administration Guide > section Rotate Kubernetes Certificates for details.
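As a quick way to confirm the expiry, you can inspect the certificate presented by the Kubernetes API server. This assumes the default k3s API port 6443.

$ echo | openssl s_client -connect 127.0.0.1:6443 2>/dev/null | openssl x509 -noout -enddate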

[ WARN ] No upgrade available

The No upgrade available warning usually occurs when attempting to retry a failed upgrade or if the upgrade .tgz file is placed in an incorrect directory.

  1. Ensure the .tgz file is in the correct directory as outlined in SevOne Data Insight Upgrade Process Guide > section Confirm SevOne Data Insight Version.
  2. Using ssh, log into SevOne Data Insight as sevone.
    $ ssh sevone@<SevOne Data Insight 'control plane' node IP address or hostname>
  3. Revert your SevOne Data Insight major / minor version in /SevOne.info to a prior / lower version using a text editor of your choice.
    $ vi /SevOne.info

    Example# 1
    Assume the current SevOne Data Insight version is 6.7.0. The version prior to SevOne Data Insight 6.7.0 is SevOne Data Insight 6.6. In this case, to go to the prior version, you must change the major and minor versions to match the prior / lower version.

    major = 6           # e.g.: if this is `6` then leave it as-is
    minor = 7           # e.g.: if this is `7` then set it to `6`
    patch = 0
    build = 160         # e.g.: enter the build number for the prior version i.e. 160

    The prior version is,

    major = 6
    minor = 6 
    patch = 0
    build = 139

    Example# 2
    Assume the current SevOne Data Insight version is 6.5.0. The version prior to SevOne Data Insight 6.5.0 is SevOne Data Insight 3.14. In this case, to go to the prior version, you must change the major and minor versions to match the prior / lower version.

    major = 6      # e.g.: if this is `6` then set it to `3`
    minor = 5      # e.g.: if this is `5` then set it to `14` or `13` or lower version
    patch = 0
    build = 67     # e.g.: enter the build number for the prior version i.e. 162

    The prior version is,

    major = 3
    minor = 14 
    patch = 0
    build = 162

    Example# 3
    Assume the current SevOne Data Insight version is 3.14. The version prior to SevOne Data Insight 3.14 is SevOne Data Insight 3.13. In this case, to go to the prior version, you must change the major and minor versions to match the prior / lower version.

    major = 3           
    minor = 14           # e.g.: if this is `14` then set it to `13`
    patch = 0
    build = 162          # e.g.: enter the build number for the prior version i.e. 54         

    The prior version is,

    major = 3
    minor = 13 
    patch = 0
    build = 54
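    After editing the file, you can confirm the values before retrying the upgrade.

    $ cat /SevOne.info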

Domain Name Resolution (DNS) not working

Important: A working DNS configuration is a requirement for any SevOne Data Insight deployment.

The DNS server must be able to resolve SevOne Data Insight's hostname on both the control plane and the agent nodes; otherwise, SevOne Data Insight will not work. DNS servers can be added via nmtui or by editing the /etc/resolv.conf file directly, as shown in the steps below.

In the example below, let's use the following SevOne Data Insight IP addresses.

Hostname     IP Address     Role
sdi-node01   10.123.45.67   control plane
sdi-node02   10.123.45.68   agent

Also, in this example, the following nameservers are used, along with the DNS search records sevone.com and nwk.sevone.com.

Nameserver IP Address
nameserver 10.168.16.50
nameserver 10.205.8.50
  1. Using ssh, log into the designated SevOne Data Insight control plane node and agent node as sevone from two different terminal windows.

    SSH to 'control plane' node from terminal window 1

    $ ssh sevone@10.123.45.67

    SSH to 'agent' node from terminal window 2

    $ ssh sevone@10.123.45.68
  2. Obtain a list of DNS entries in /etc/resolv.conf file for both control plane and agent nodes in this example.

    From terminal window 1

    $ cat /etc/resolv.conf
    # Generated by NetworkManager
    search sevone.com nwk.sevone.com
    nameserver 10.168.16.50
    nameserver 10.205.8.50

    From terminal window 2

    $ cat /etc/resolv.conf
    # Generated by NetworkManager
    search sevone.com nwk.sevone.com
    nameserver 10.168.16.50
    nameserver 10.205.8.50
  3. Ensure that the DNS server can resolve SevOne Data Insight's hostname / IP address on both the control plane and the agent nodes, along with the DNS entries in the /etc/resolv.conf file (see the search line and nameserver(s)).

    From terminal window 1, the following output shows that the DNS server can resolve the hostname / IP address of both the control plane and the agent nodes.

    Check if 'nslookup' resolves the 'control plane' IP address

    $ nslookup 10.123.45.67
    67.45.123.10.in-addr.arpa   name = sdi-node01.sevone.com.

    Check if 'nslookup' resolves the 'control plane' hostname

    $ nslookup sdi-node01.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sdi-node01.sevone.com
    Address: 10.123.45.67

    Check if 'nslookup' resolves the 'agent' IP address

    $ nslookup 10.123.45.68
    68.45.123.10.in-addr.arpa   name = sdi-node02.sevone.com.

    Check if 'nslookup' resolves the 'agent' hostname

    $ nslookup sdi-node02.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sdi-node02.sevone.com
    Address: 10.123.45.68

    nslookup name 'sevone.com' in search line in /etc/resolv.conf

    $ nslookup sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sevone.com
    Address: 23.185.0.4

    nslookup name 'nwk.sevone.com' in search line in /etc/resolv.conf

    $ nslookup nwk.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   nwk.sevone.com
    Address: 25.185.0.4

    nslookup nameserver '10.168.16.50' in /etc/resolv.conf

    $ nslookup 10.168.16.50
    50.16.168.10.in-addr.arpa   name = infoblox.nwk.sevone.com.

    nslookup nameserver '10.205.8.50' in /etc/resolv.conf

    $ nslookup 10.205.8.50
    50.8.205.10.in-addr.arpa    name = infoblox.colo2.sevone.com.

    From terminal window 2, the following output shows that the DNS server can resolve the hostname / IP address of both the control plane and the agent nodes.

    Check if 'nslookup' resolves the 'agent' IP address

    $ nslookup 10.123.45.68
    68.45.123.10.in-addr.arpa   name = sdi-node02.sevone.com.

    Check if 'nslookup' resolves the 'agent' hostname

    $ nslookup sdi-node02.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sdi-node02.sevone.com
    Address: 10.123.45.68

    Check if 'nslookup' resolves the 'control plane' IP address

    $ nslookup 10.123.45.67
    67.45.123.10.in-addr.arpa   name = sdi-node01.sevone.com.

    Check if 'nslookup' resolves the 'control plane' hostname

    $ nslookup sdi-node01.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sdi-node01.sevone.com
    Address: 10.123.45.67

    nslookup name 'sevone.com' in search line in /etc/resolv.conf

    $ nslookup sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sevone.com
    Address: 23.185.0.4

    nslookup name 'nwk.sevone.com' in search line in /etc/resolv.conf

    $ nslookup nwk.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   nwk.sevone.com
    Address: 25.185.0.4

    nslookup nameserver '10.168.16.50' in /etc/resolv.conf

    $ nslookup 10.168.16.50
    50.16.168.10.in-addr.arpa   name = infoblox.nwk.sevone.com.

    nslookup nameserver '10.205.8.50' in /etc/resolv.conf

    $ nslookup 10.205.8.50
    50.8.205.10.in-addr.arpa    name = infoblox.colo2.sevone.com.
    Note: If any of the nslookup commands in terminal window 1 or terminal window 2 above fail or return one or more of the following, you must first resolve the name resolution issue; otherwise, SevOne Data Insight will not work.

    Examples

    ** server can't find 67.45.123.10.in-addr.arpa.: NXDOMAIN
    
    or
    
    ** server can't find 68.45.123.10.in-addr.arpa.: NXDOMAIN
    
    or
    
    *** Can't find nwk.sevone.com: No answer
    
    etc.

    If name resolution fails for any reason after SevOne Data Insight has been deployed, normal operations in SevOne Data Insight can also fail. Hence, ensure that the DNS configuration is always working.
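    If you need to add or correct the DNS servers and search domains persistently, one option is nmcli; changes made directly to /etc/resolv.conf may be overwritten by NetworkManager. The connection name below is environment-specific, and the nameserver and search values shown are the ones used in this example.

    $ sudo nmcli connection modify <connection name> ipv4.dns "10.168.16.50 10.205.8.50" ipv4.dns-search "sevone.com nwk.sevone.com"

    $ sudo nmcli connection up <connection name>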


ERROR: Failed to open ID file '/home/sevone/.pub': No such file or directory

As a security measure, fresh installations do not ship with pre-generated SSH keys.

  1. Using ssh, log into SevOne Data Insight as sevone.
    $ ssh sevone@<SevOne Data Insight 'control plane' node IP address or hostname>
  2. Execute the following command to generate unique SSH keys for your cluster.
    $ sevone-cli cluster setup-keys
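To confirm that the keys were generated, list the sevone user's .ssh directory; the exact key file names may vary.

$ ls -l /home/sevone/.ssh/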

TimeShift between SevOne Data Insight & SevOne NMS

If the time difference between SevOne Data Insight and SevOne NMS appliances is more than 5 minutes, the following steps must be performed.

  1. Check the time on the SevOne Data Insight appliance.
    $ date
  2. Check the time on the SevOne NMS appliance.
    $ date
  3. If the time difference between the SevOne Data Insight and SevOne NMS appliances is more than 5 minutes, check the NTP configuration on both appliances. Both appliances must be time-synchronized via NTP.
    Note: If the NTP server is unavailable, manually set the same time on both appliances as shown in the example below.
    Example
     $ date --set="6 OCT 2023 18:00:00" 
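    To check whether an appliance is synchronized via NTP, timedatectl summarizes the sync state on systemd-based systems; chronyc tracking gives more detail when chrony is the NTP client in use.

    $ timedatectl

    $ chronyc tracking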

Pre-check Failures

TASK [ Confirm free space ]

  1. If this task fails, you can try to clean up old installer files that may be found in various parts of the file system. For example,
    • /root
    • /home/sevone
    • /opt/SevOne/upgrade
    • /var/lib/rancher/k3s/agent/images
  2. Clear the scheduled report cache. Execute the following command to delete files older than one week (604800 seconds).
    Note: SevOne Data Insight maintains a cache of the printed PDFs for scheduled reports. Depending on your usage of report scheduling, it is recommended to occasionally clean up the cache to free up disk space.
    $ sevone-cli exec graphql -- "npm run asset-sweeper -- --prefix=scheduledReports --age=604800"

  3. Running the command below helps track down what files in the system are taking up the most space.
    $ du -sh /*
  4. In your investigation, if you find that the following directory is filling up your HDD (Hard Disk Drive), then a container or a pod is the culprit.
    /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots

    You must continue running du -sh to further pinpoint the exact container or pod. In some cases, it may be the printer container taking up the space due to node.js core dump files. Execute the following command to identify those files.

    $ find /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots \
    -name "core\.*"

TASK [ FN000## ]

  1. If the pre-check fails to validate any of the Internal Field Notices (IFNs), apply the IFNs and reboot the appliance(s).
    Note: If FN00068 and/or FN00070 needs to be applied, please contact IBM SevOne Support for the IFN's patch instructions / workaround / solution.
  2. Rerun the pre-check playbook to verify that the IFNs have been applied, or you may verify it manually (see the next step).
    $ ansible-playbook /opt/SevOne/upgrade/ansible/playbooks/precheck.yaml
  3. Validate that Internal Field Notices (IFNs) FN00068 and FN00070 have been applied on both the control plane and agent nodes.

    Check if FN00068 is applied

    This issue is due to a bug with the CentOS kernel reporting incorrect memory usage. Due to this, Kubernetes does not schedule or restart any pods on the affected node because it thinks there is no memory remaining. To check if the IFN needs to be applied, execute the following command.

    $ cat /proc/cmdline | grep -qi 'cgroup.memory=nokmem' || \
    echo ">> IFN 68 NOT APPLIED"

    Check if FN00070 is applied

    This issue only affects users who have been migrated over from RHEL (using the unrhel migration tool). To check if the IFN is applied, execute the following command.

    $ nmcli dev | grep -i ^eth && (cat /proc/cmdline | \
    grep -qi 'biosdevname=0 net.ifnames=0' || \
    echo ">> IFN 70 NOT APPLIED") || \
    echo ">> IFN 70 NOT NEEDED"

Install / Upgrade Failures

TASK [ k3s : Initialize the cluster ]

If this task fails, you can observe the status of the k3s service using the following command.

$ systemctl status k3s
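If the service is failing, the recent k3s journal entries usually contain the underlying error.

$ sudo journalctl -u k3s --no-pager --since "1 hour ago" | tail -n 100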

Unable to find suitable network address. No default routes found.

Check if there is a default route added to the routing table.

$ ip route | grep default

If this returns empty, you will need to add a default route.

Add default route

$ ip route add default via <default_gateway>

TASK [ Stop k3s-server if upgrading to new version ]

If this task does not complete within a minute, you will have to apply the following workaround before continuing with the upgrade.

Note: If you are upgrading using the GUI Installer, you must stop the GUI Installer API and Client processes.

Check the status of the API and Client processes

$ sudo systemctl status sevone-guii-@api
$ sudo systemctl status sevone-guii-@client

Stop the API and Client processes

$ sudo systemctl stop sevone-guii-@api
$ sudo systemctl stop sevone-guii-@client

Once the upgrade completes, start the API and Client processes again

$ sudo systemctl start sevone-guii-@api
$ sudo systemctl start sevone-guii-@client

for SevOne Data Insight <= 3.9

$ sed -i 's/.*k3s-killall.sh.*/    echo noop/' \
/opt/SevOne/upgrade/ansible/playbooks/roles/k3s/tasks/02_setup.yaml

$ ansible-playbook /opt/SevOne/upgrade/ansible/playbooks/up.yaml \
--tags kube,apps,kernel

for SevOne Data Insight >= 3.10

$ sed -i 's/.*k3s-killall.sh.*/    echo noop/' \
/opt/SevOne/upgrade/ansible/playbooks/roles/k3s/tasks/02_setup.yaml

$ sevone-cli playbook up --tags kube,apps,kernel

This is due to an upstream issue with the k3s-killall.sh script hanging when attempting to shut down some running containerd processes.

TASK [ prep : Ensure hostname set ]

When attempting to run an upgrade, you may run into the following error.

TASK [prep : Ensure hostname set] ****************************************************************************************************************************************************
fatal: [sevonek8s]: FAILED! => {"changed": false, \
"msg": "Command failed rc=1, out=, err=Could not get property: \
Failed to activate service 'org.freedesktop.hostname1': timed out\n"}

This happens when hostnamed has likely crashed. Restart hostnamed.

$ sudo systemctl restart systemd-hostnamed
Important: If hostnamed restart does not work, restart the machine.
$ sudo reboot
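After restarting hostnamed (or rebooting), you can confirm that it responds again.

$ hostnamectl status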

TASK [ freight : Install centos-update-*.el7.tgz ]

When upgrading to SevOne Data Insight 3.8 or higher, the culprit is likely that the installed packages are too up-to-date. This can happen if your machine has internet access and can reach yum package servers. The fix is to retry the upgrade while skipping the yum packages with broken dependencies.

  1. Remove lingering yum packages or package conflicts.
    $ sudo yum clean all
    
    $ sudo rm -rf /var/cache/yum/
  2. Retry the upgrade via the Command Line Interface (CLI).
    $ sevone-cli playbook up --extra-vars "freight_install_skip_broken=yes"

TASK [ helm upgrade/install default/<chart_name> ]

There are several reasons why this task may fail. Unfortunately, Helm does not report useful errors or debug information, so further investigation is required. Please look for the stderr key in the large JSON body that is returned in the task output.
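Before retrying, it can also help to review the state of the Helm releases; the release name di matches the one used in the rollback commands below.

$ helm ls -a

$ helm history di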

UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

This failure occurs when a previous Helm operation did not complete and the failed deployment must be rolled back manually. The failed operation may have occurred before you initiated the upgrade, perhaps when configuring the SevOne Data Insight Helm chart. Please refer to SevOne Data Insight Administration Guide, section Helm Chart for details.

Execute the following command.

$ helm rollback di
Important: If the issue is related to ingress, execute the following command.
$ helm rollback ingress

Upon completion of the command above, you may then resume the upgrade by executing the following command.

$ sevone-cli playbook up --tags apps,kernel

UPGRADE FAILED: to deploy apps

If you are upgrading between 3.5.x versions, for example, from 3.5.1 to 3.5.3, the upgrade will fail to deploy apps.

To fix this, execute the following commands before performing an upgrade.

$ sevone-cli playbook up --skip-tags apps,kernel

$ sudo systemctl restart k3s

$ ssh sevone@<'agent' IP address>

$ sudo systemctl restart k3s-agent

UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version

This is caused when upgrading from SevOne Data Insight 3.5.x directly to SevOne Data Insight 3.11 and above. Please refer to SevOne Data Insight Upgrade Process Guide > Pre-Upgrade Checklist > section Version Matrix for more information.

Note: Error: UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version and it is therefore unable to build the kubernetes objects for performing the diff. error from kubernetes: unable to recognize "": no matches for kind "Ingress" in version "networking.k8s.io/v1beta1"

Execute the following steps.

  1. Go to /home/sevone directory.
    $ cd /home/sevone
  2. Create fix-manifest.sh script file.
    $ touch fix-manifest.sh
  3. Using a text editor of your choice, edit /home/sevone/fix-manifest.sh script file, add the following and then, save it.
    #!/bin/bash
    
    # set up vars. change these as needed
    release=di
    namespace=default
    
    # create temp file to output files to
    tmp_dir=$(mktemp -d -t fix-manifest-XXXXX)
    
    # grab helm release object and decode it
    releaseObject=$(kubectl get secret -l owner=helm,status=deployed,name=$release --namespace $namespace | awk '{print $1}' | grep -v NAME)
    kubectl get secret $releaseObject -n $namespace -o yaml > $tmp_dir/$release.release.yaml
    cp $tmp_dir/$release.release.yaml $tmp_dir/$release.release.bak
    cat $tmp_dir/$release.release.yaml | grep -oP '(?<=release: ).*' | base64 -d | base64 -d | gzip -d > $tmp_dir/$release.release.data.decoded
    sed -i -e 's/networking.k8s.io\/v1beta1/networking.k8s.io\/v1/' $tmp_dir/$release.release.data.decoded
    cat $tmp_dir/$release.release.data.decoded | gzip | base64 | base64 > $tmp_dir/$release.release.data.encoded
    
    # patch the helm release object
    tr -d "\n" < $tmp_dir/$release.release.data.encoded > $tmp_dir/$release.release.data.encoded.final
    releaseData=$(cat $tmp_dir/$release.release.data.encoded.final)
    sed 's/^\(\s*release\s*:\s*\).*/\1'$releaseData'/' $tmp_dir/$release.release.yaml > $tmp_dir/$release.final.release.yaml
    kubectl apply -f $tmp_dir/$release.final.release.yaml -n $namespace
    
    # clean up
    rm -rf $tmp_dir
  4. Execute fix-manifest.sh script file.
    $ bash fix-manifest.sh
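    As an optional check, confirm that the patched release manifest no longer references the removed API version; a count of 0 is expected. The release name di matches the one set in the script.

    $ helm get manifest di | grep -c 'networking.k8s.io/v1beta1'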

General Debugging Tips

There are several reasons why task helm upgrade/install default/<chart_name> may fail. Helm does not provide useful debug information and further investigation is required to understand the failure.

  1. Execute the following command to retry the upgrade.
    $ sevone-cli playbook up --tags apps
  2. While the above command is in progress, from another terminal window, run k9s.
  3. Monitor the status of each pod and refer to the table below for some basic debugging techniques.
    If logs are not shown when observing the logs via k9s, press 0 to enable logs for all time.
    Status Action
    CrashLoopBackOff Check the pod logs by hovering over the pod and pressing l.
    Error Check the pod logs by hovering over the pod and pressing l.
    Pending Check the pod event log by hovering over it and pressing d.
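    If you prefer plain kubectl over k9s, roughly equivalent checks are shown below. For CrashLoopBackOff or Error, --previous shows the logs from the last failed container run; for Pending, describe shows the scheduling events.

    $ kubectl describe pod <pod-name>

    $ kubectl logs <pod-name> --previous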

Other Issues

Configuration Check

SevOne Data Insight requires configuration of several components to operate properly. When troubleshooting issues, it can be cumbersome to check the configuration and health of the system because there are different tools and network requirements, such as exposing certain ports.

The following methods display the configuration and health of the Data Insight environment so that a misconfiguration or system error can be quickly identified.

CLI method on production Kubernetes

$ ssh sevone@<SevOne Data Insight 'control plane' node IP address or hostname>

$ sevone-cli exec graphql -- npm run health
Important: The command above should return no errors or failures.

GraphQL method

Here are some sample GraphQL queries.

Check Data Insight system health

query health {
  health {
    minio { ...componentHealthDetails }
    mysql { ...componentHealthDetails }
    rabbitMq { ...componentHealthDetails }
    redis { ...componentHealthDetails }
    reportScheduler { ...componentHealthDetails }
    soa { ...componentHealthDetails }
  }
}
fragment componentHealthDetails on ComponentHealthDetails {
  host port error ok
}

Check a single datasource

query ds {
  datasources(ids: [ 1 ]) {
    id
    name
    address
  }
}

Check all datasources on all tenants

query tenants {
  tenants {
    id
    name
    datasources {
      id
      name
      address
    }
  }
}

Check datasources at authentication

mutation auth {
  authentication(tenant: "MyTenant", username: "admin", password: "password") {
    token
    success
    datasources {
      id
      name
      address
    }
  }
}
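These queries can be run from the GraphQL playground in the SevOne Data Insight user interface or with curl. A minimal curl sketch is shown below; the /graphql path and the Authorization header format are assumptions and may differ in your deployment, and the token comes from the authentication mutation above.

$ curl -sk https://<SevOne Data Insight hostname>/graphql \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <token>' \
  -d '{"query":"query { health { soa { host port error ok } } }"}'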

Error getting NMS IP List

SOA version
SOA must be on the latest version on all appliances in the SevOne NMS cluster. The Command Line Interface (CLI) must be used to upgrade SOA on all peers, as the graphical user interface (GUI) only upgrades SOA for the NMS appliance you are connected to.
Add the --all-peers flag if you want to install / upgrade SOA on all peers in the cluster.

Error

$ sevone-cli soa upgrade /opt/SevOne/upgrade/utilities/SevOne-soa-*.rpm --all-peers
>> [INFO] ATTEMPTING TO AUTO-DETECT SOA DATASOURCES...
Defaulted container "mysql" out of: mysql, metrics
...
...
<returns an ERROR>

If you get this error, please make sure you are logged into SevOne Data Insight as sevone.

$ ssh sevone@<SevOne Data Insight IP address or hostname>

Now, re-run the command to upgrade SOA.

$ sevone-cli soa upgrade /opt/SevOne/upgrade/utilities/SevOne-soa-*.rpm --all-peers

Incorrect information entered at Bootstrap and/or Provisioning prompts?

If you entered incorrect information at bootstrap and/or provisioning prompts, execute the following commands to allow you to override the input. These commands can only be run once your SevOne Data Insight is up and running.

$ ssh sevone@<SevOne Data Insight IP address or hostname>

$ sevone-cli exec graphql -- npm run bootstrap -- -f

$ sevone-cli exec graphql -- npm run provision -- -f
These commands re-run the bootstrap and provisioning prompts, which is especially useful if incorrect information was provided the first time.

Pod Stuck in a Terminating State

If a pod is ever stuck and you want to force it to restart, you can append --grace-period=0 --force to the end of your delete pod command.

Example

$ ssh sevone@<SevOne Data Insight IP address or hostname>

$ kubectl delete pod $(kubectl get pods | grep 'dsm' | awk '{print $1}') --grace-period=0 --force
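To list any pods that are currently stuck in a Terminating state, you can filter the pod listing.

$ kubectl get pods | grep Terminating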

Review / Collect Logs

Logs can be collected at the pod level. The status of pods must be Running.

Note: In the commands below, to obtain the logs, you need to pass <resource-type/pod-name>. For example, deployment.apps/di-printer or deploy/di-printer.

By default, resource-type = pod. For logs where resource-type = pod, you may choose to pass the pod-name only; resource-type is optional.

Using ssh, log into SevOne Data Insight as sevone.

$ ssh sevone@<SevOne Data Insight IP address or hostname>

Example: Get 'pod' names

$ kubectl get pods
NAME                                                      READY   STATUS      RESTARTS        AGE
di-create-secrets-xllfj                                   0/1     Completed   0               22h
di-upgrade-l2cs8                                          0/1     Completed   0               22h
clienttest-success-89lmt                                  0/1     Completed   0               22h
clienttest-fail-lb8mq                                     0/1     Completed   0               22h
di-report-version-sweeper-28276440-zpcxt                  0/1     Completed   0               20h
ingress-ingress-nginx-controller-54dfdbc9cf-g9wdz         1/1     Running     0               22h
di-prometheus-node-exporter-shnxk                         1/1     Running     0               22h
di-graphql-7d88c8c7b5-fbwgc                               1/1     Running     0               22h
di-ui-5b8fbcfc54-rtwlq                                    1/1     Running     0               22h
di-kube-state-metrics-6f4fbc67cb-tsbbk                    1/1     Running     0               22h
di-migrator-fdb9dd58b-29kl2                               2/2     Running     0               22h
ingress-ingress-nginx-defaultbackend-69f644c9dc-7jvvs     1/1     Running     0               22h
di-printer-7888679b59-cqp9q                               2/2     Running     0               22h
di-scheduler-7845d64d57-bdsm2                             1/1     Running     0               22h
di-registry-68c7bbc47b-45l5v                              1/1     Running     0               22h
di-djinn-api-5b4bbb446b-prsjd                             1/1     Running     1 (22h ago)     22h
di-mysql-0                                                2/2     Running     0               22h
di-prometheus-server-7dc67cb6b5-bjzn5                     2/2     Running     0               22h
di-redis-master-0                                         2/2     Running     0               22h
di-wdkserver-6db95bb9c9-5w2kt                             2/2     Running     0               22h
di-assetserver-5c4769bd8-6f2hw                            1/1     Running     0               22h
di-prometheus-node-exporter-mp5xf                         1/1     Running     0               22h
di-report-tombstone-sweeper-28277040-kj227                1/1     Running     0               10h
datasource-operator-controller-manager-5cf6f7f675-h5lng   2/2     Running     3 (5h37m ago)   22h
di-asset-sweeper-28277645-tq6gb                           0/1     Completed   0               12m
di-user-sync-28277645-dl6ks                               0/1     Completed   0               12m
di-asset-sweeper-28277650-hxwvn                           0/1     Completed   0               7m46s
di-user-sync-28277650-6kxf7                               0/1     Completed   0               7m46s
di-asset-sweeper-28277655-gjtpr                           0/1     Completed   0               2m46s
di-user-sync-28277655-chgxd                               0/1     Completed   0               2m46s
Pod names are the names found under column NAME.

Get resource types
Get 'all' resource types

$ kubectl get all | more

Get resource type for a pod

$ kubectl get all | grep <pod-name>

Example: Get resource type for pod-name containing 'printer'

$ kubectl get all | grep printer
pod/di-printer-68f6bddb6f-hkhdt            1/1     Running   2 (27h ago)   2d3h
deployment.apps/di-printer                 1/1     1         1             2d3h
replicaset.apps/di-printer-68f6bddb6f      1       1         1             2d3h

Example: Get resource type for pod-name containing 'rabbitmq'

$ kubectl get all | grep rabbitmq
pod/di-rabbitmq-0             1/1         Running          2 (27h ago)  2d3h
service/di-rabbitmq-headless  ClusterIP   None             <none>       4369/TCP,5672/TCP,25672/TCP,15672/TCP            2d3h
service/di-rabbitmq           ClusterIP   192.168.108.109  <none>       5672/TCP,4369/TCP,25672/TCP,15672/TCP,9419/TCP   2d3h
statefulset.apps/di-rabbitmq  1/1                                                                                        2d3h
Important: pod, deployment.apps, replicaset.apps, service, statefulset.apps, etc. in the examples above are resource types.

di-printer, di-rabbitmq, etc. in the examples above are pod names.

Get logs

$ kubectl logs <resource-type>/<pod-name> 

Example: Get logs for pod-name 'di-printer'

$ kubectl logs deployment.apps/di-printer 

OR

$ kubectl logs deploy/di-printer

Example: Get logs for pod-name 'di-rabbitmq'

$ kubectl logs statefulset.apps/di-rabbitmq

OR

$ kubectl logs sts/di-rabbitmq

Example: Get logs for pod-name 'rabbitmq' with timestamps

$ kubectl logs statefulset.apps/di-rabbitmq --timestamps

OR

$ kubectl logs sts/di-rabbitmq --timestamps

By default, resource-type = pod.

In the example below, to obtain the logs for <resource-type>/<pod-name> = pod/di-mysql-0, <resource-type> pod is optional.

Example: <resource-type> = pod; <resource-type> is optional

$ kubectl logs pod/di-mysql-0

OR

$ kubectl logs di-mysql-0
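For long-running pods, you can limit the amount of log output; the following flags work with any of the log commands above.

$ kubectl logs deploy/di-printer --tail=100

$ kubectl logs deploy/di-printer --since=1h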
Important: Each pod can have one or more associated containers.

Collect Logs for a Pod with One Container

  1. Using ssh, log into SevOne Data Insight as sevone.
    $ ssh sevone@<SevOne Data Insight IP address or hostname>
  2. Obtain the list of containers that belong to a pod.
    Example: Pod name 'di-mysql-0' contains one container, 'mysql'
    $ kubectl get pods di-mysql-0 -o jsonpath='{.spec.containers[*].name}{"\n"}'
    mysql metrics
  3. Collect logs.
    Note: For pods with one container only, -c <container-name> in the command below is optional.
    $ kubectl logs <pod-name> -c <container-name>
    
    or
    
    $ kubectl logs <pod-name>

    Example

    $ kubectl logs di-mysql-0 -c mysql
    
    or
    
    $ kubectl logs di-mysql-0

Collect Logs for a Pod with More Than One Container

  1. Using ssh, log into SevOne Data Insight as sevone.
    $ ssh sevone@<SevOne Data Insight IP address or hostname>
  2. Obtain the list of containers that belong to a pod.
    Example: Pod name 'svclb-ingress-ingress-nginx-controller-5pcm7' contains two containers, 'lb-port-80' and 'lb-port-443'
    $ kubectl get pods svclb-ingress-ingress-nginx-controller-5pcm7 \
    -o jsonpath='{.spec.containers[*].name}{"\n"}'
    
    lb-port-80 lb-port-443
  3. Collect logs.
    Important: For pods with more than one container, -c <container-name> is required.
    $ kubectl logs <pod-name> -c <container-name>

    Example: Get logs for <container-name> = lb-port-80

    $ kubectl logs svclb-ingress-ingress-nginx-controller-5pcm7 -c lb-port-80

    Example: Get logs for <container-name> = lb-port-443

    $ kubectl logs svclb-ingress-ingress-nginx-controller-5pcm7 -c lb-port-443

Collect All Logs

  1. To collect all the logs relevant to the SevOne Data Insight pods and their containers, create a working directory where all the logs can be collected.
    $ TMPDIR="/tmp/sdi_logs/$(date +%d%b%y)"
    $ mkdir -p $TMPDIR
  2. Execute the following command to collect all logs for all SevOne Data Insight containers.
    Note: The --timestamps option in the command below allows you to collect the logs with the timestamps.

    Example: Command to collect logs from all SevOne Data Insight Pods and containers

    $ for POD in $(kubectl get pods --no-headers -n default | \
    awk '{print $1}'); do for CONTAINER in $(kubectl get pods \
    $POD -o jsonpath='{.spec.containers[*].name}{"\n"}'); \
    do echo "Collecting logs for POD: $POD - CONTAINER: \
    $CONTAINER in log file $TMPDIR/${POD}_${CONTAINER}.log.gz" ; \
    kubectl logs $POD -c $CONTAINER --timestamps | \
    gzip > $TMPDIR/${POD}_${CONTAINER}.log.gz 2>&1; done ; done

    The for command is shown here with indentations for clarity.

    for POD in $(kubectl get pods --no-headers -n default | awk '{print $1}') ;
      do
      for CONTAINER in $(kubectl get pods $POD -o jsonpath='{.spec.containers[*].name}{"\n"}') ;
      do
        echo "Collecting logs for POD: $POD - CONTAINER: $CONTAINER in log file $TMPDIR/$POD_$CONTAINER.log.gz" ;
        kubectl logs $POD -c $CONTAINER --timestamps | gzip > $TMPDIR/$POD_$CONTAINER.log.gz 2>&1 ;
      done ;
    done

    Command to see files contained in $TMPDIR

    $ ls -lh $TMPDIR
  3. Once the logs are collected, the contents can be put in a tar file. There is no need to compress again since the logs are already compressed.
    $ tar -cf /tmp/sdi_logs-$(date +%d%b%y).tar $TMPDIR
    
    $ ls -l /tmp/sdi_logs-$(date +%d%b%y).tar
    
    $ md5sum /tmp/sdi_logs-$(date +%d%b%y).tar
  4. Delete the log directory to free-up the space.
    $ rm -rf $TMPDIR
  5. You may upload the tar file /tmp/sdi_logs-$(date +%d%b%y).tar for further investigation.

'Agent' Nodes in a Not Ready State after Rebooting

Perform the following actions if the agent nodes are in a Not Ready state after rebooting.

Ensure Data Insight is 100% deployed

Check the status of the deployment by running the following command. Ensure that every pod is in a Running or Completed status.

$ ssh sevone@<SevOne Data Insight IP address or hostname>

$ kubectl get pods
NAME                                                      READY   STATUS      RESTARTS        AGE
di-create-secrets-xllfj                                   0/1     Completed   0               22h
di-upgrade-l2cs8                                          0/1     Completed   0               22h
clienttest-success-89lmt                                  0/1     Completed   0               22h
clienttest-fail-lb8mq                                     0/1     Completed   0               22h
di-report-version-sweeper-28276440-zpcxt                  0/1     Completed   0               20h
ingress-ingress-nginx-controller-54dfdbc9cf-g9wdz         1/1     Running     0               22h
di-prometheus-node-exporter-shnxk                         1/1     Running     0               22h
di-graphql-7d88c8c7b5-fbwgc                               1/1     Running     0               22h
di-ui-5b8fbcfc54-rtwlq                                    1/1     Running     0               22h
di-kube-state-metrics-6f4fbc67cb-tsbbk                    1/1     Running     0               22h
di-migrator-fdb9dd58b-29kl2                               2/2     Running     0               22h
ingress-ingress-nginx-defaultbackend-69f644c9dc-7jvvs     1/1     Running     0               22h
di-printer-7888679b59-cqp9q                               2/2     Running     0               22h
di-scheduler-7845d64d57-bdsm2                             1/1     Running     0               22h
di-registry-68c7bbc47b-45l5v                              1/1     Running     0               22h
di-djinn-api-5b4bbb446b-prsjd                             1/1     Running     1 (22h ago)     22h
di-mysql-0                                                2/2     Running     0               22h
di-prometheus-server-7dc67cb6b5-bjzn5                     2/2     Running     0               22h
di-redis-master-0                                         2/2     Running     0               22h
di-wdkserver-6db95bb9c9-5w2kt                             2/2     Running     0               22h
di-assetserver-5c4769bd8-6f2hw                            1/1     Running     0               22h
di-prometheus-node-exporter-mp5xf                         1/1     Running     0               22h
di-report-tombstone-sweeper-28277040-kj227                1/1     Running     0               10h
datasource-operator-controller-manager-5cf6f7f675-h5lng   2/2     Running     3 (5h37m ago)   22h
di-asset-sweeper-28277645-tq6gb                           0/1     Completed   0               12m
di-user-sync-28277645-dl6ks                               0/1     Completed   0               12m
di-asset-sweeper-28277650-hxwvn                           0/1     Completed   0               7m46s
di-user-sync-28277650-6kxf7                               0/1     Completed   0               7m46s
di-asset-sweeper-28277655-gjtpr                           0/1     Completed   0               2m46s
di-user-sync-28277655-chgxd                               0/1     Completed   0               2m46s
Note: To see additional pod details, you may use kubectl get pods -o wide command.
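You can also check the node status directly. kubectl describe shows the node conditions and recent events that explain why an agent node is NotReady; use the node name reported by kubectl get nodes.

$ kubectl get nodes -o wide

$ kubectl describe node <agent node name>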

Restart SOA

If SevOne NMS has been upgraded or downgraded, please make sure that the SOA container is restarted after a successful upgrade / downgrade. Execute the following commands.

From the SevOne NMS appliance,

$ ssh root@<NMS appliance>

$ supervisorctl restart soa
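To confirm the restart, check the process status; this assumes SOA runs under supervisord as shown above.

$ supervisorctl status soa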