Troubleshooting

The following are some troubleshooting scenarios.

General Issues

Early Users of 'unrhel'

The following message is returned when the environment is migrated over from RHEL using unrhel@v5.1.1 or below.

FAILED => Missing sudo password

To fix this, execute the following commands before performing an upgrade.

$ ssh sevone@<'agent' IP address>
 
$ sudo -i
 
$ echo "sevone ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

ACCESS DENIED in GraphQL Logs

If the SOA API keys are outdated or expired, you are likely to see this error. To fix it, update the datasource keys.

  1. Execute the following command to ensure that the GraphQL pod is Running and not in an Errored or CrashLoopBackOff state.

    Example

    $ kubectl get pods | grep graphql
    di-graphql-7d88c8c7b5-fbwgc                  1/1     Running     0              22h
  2. If the third column reads Errored or CrashLoopBackOff, using a text editor of your choice, edit /opt/SevOne/chartconfs/di_custom.yaml to include the following environment variable and then, save it.

    Example

    graphql:
      env:
        SKIP_REPORT_MIGRATION_DRY_RUN: true
  3. Apply the change made to /opt/SevOne/chartconfs/di_custom.yaml file.
    $ sevone-cli playbook up --tags apps
  4. Once the GraphQL pod state is running / healthy, generate new SOA API keys for each affected datasource. Execute the following command.
    $ sevone-cli exec graphql -- npm run reconfig-datasource
    Note: You will be prompted several times. Keep all the default values, except enter y at the prompt Login instead of providing an API key?.

    The username must be admin.

    The password must be the admin Graphical User Interface password for the datasource.
    Important: Repeat this step for each datasource.

    Example

    $ sevone-cli exec graphql -- npm run reconfig-datasource
    
    > insight-server@6.7.0 reconfig-datasource /insight-server
    > NODE_PATH=./dist/libs node dist/scripts/database-init/reconfigure-datasource.js
    
    Datasource config:
      Name:         Data Insight API
      Address:      https://staging.soa.sevone.doc
      API key:      eyJ1dWlkIjoiYzMxNTQzZWUtMTgxZC00NWMyLTlkNjctNTUwZWRhODQ2MGFkIiwiYXBwbGljYXRpb24iOiJEYXRhIEluc2lnaHQgKGNrcmwxdmhqZzAwMDA1M3MwM3Z5bmRlZXMpIiwiZW50cm9weSI6IjZVaWxuQStzVDk2ZUFKeG92WW1Nak1odS9nZ29JSWhLNVBDZ05yZHBBT1lrSE11ZlM0eU9CbCs4YWxEUXd3a1MifQ==
      Dstype:       METRICS/FLOW
    
    Datasource name [Data Insight API]:
    
    [1] METRICS/FLOW
    [2] splunk-datasource
    [3] elastic-datasource
    [0] Keep: METRICS/FLOW
    
    Datasource dstype [1, 2, 3, 0]: 0
    Datasource address [https://staging.soa.sevone.doc]:
    Login instead of providing an API key? [y/n]: y
    Username: admin
    Password: ******
    info: [Data Insight API@reconfigure-datasource] SOA request (SOA-1) post https://staging.soa.sevone.doc/api/v3/users/signin
    (node:347) Warning: Setting the NODE_TLS_REJECT_UNAUTHORIZED environment variable to '0' makes TLS connections and HTTPS requests insecure by disabling certificate verification.
    (Use `node --trace-warnings ...` to show where the warning was created)
    info: [Data Insight API@reconfigure-datasource] SOA response (SOA-1) elapsed 1755ms.
    info: [Data Insight API@reconfigure-datasource] SOA request (SOA-2) post https://staging.soa.sevone.doc/api/v3/users/apikey
    info: [Data Insight API@reconfigure-datasource] SOA response (SOA-2) elapsed 277ms.
    
    New datasource config:
      Name:         Data Insight API
      Address:      https://staging.soa.sevone.doc
      API key:      eyJ1dWlkIjoiMzgyMDdhMjItNzE2Mi00OWRlLTk5NTYtYmI3OTVkYjc5NzZkIiwiYXBwbGljYXRpb24iOiJEYXRhIEluc2lnaHQgKGNrczUyYnd0YzAwMDA5bnMxNnQ3aWcxYnQpIiwiZW50cm9weSI6IkNoYS9tbDFWbVVyYThQcHVsLzIzY05JZk94QXcxWFQrVnEyM0hPSzYzSTdPNGNMbkJTVjQyWUVRSW1FeGtDaEoifQ==
      Dstype:       METRICS/FLOW
    
    Is this config correct? [y/n]: y
    info: [Data Insight API@create-datasource] SOA request (SOA-3) get https://staging.soa.sevone.doc/api/v3/users/self
    info: [Data Insight API@create-datasource] SOA response (SOA-3) elapsed 275ms.
    
    Datasource config updated!
    
    Datasource reconfiguration complete.
  5. Once all the datasources have been updated, restart the GraphQL pod.
    $ kubectl delete pods -l app.kubernetes.io/component=graphql
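Once the pod is recreated, you can confirm that it is healthy and that the ACCESS DENIED messages have stopped. The label selector below is the same one used in the delete command above; no output from the grep indicates the error is gone.

$ kubectl get pods -l app.kubernetes.io/component=graphql

$ kubectl logs -l app.kubernetes.io/component=graphql --tail=200 | grep -i "access denied"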

Error Fetching Widget when Loading Report

This error occurs when the wdkserver is not serving the widgets to the user interface. In most cases, it is caused by an invalid cookie value set in your browser. You may inspect the network activity using the browser's Developer Tools and look for requests to /wdkserver. If you are unable to inspect the network activity, please contact IBM SevOne Support.

If you observe the following error message coming back from /wdkserver, remove the offending cookie or disable strict headers in wdkserver. If you need assistance with this, please contact IBM SevOne Support.

{"statusCode":400,"error":"Bad Request","message":"Invalid cookie value"}
  1. Using a text editor of your choice, edit /opt/SevOne/chartconfs/di_custom.yaml to include the following environment variable and then, save it.
    wdkserver:
      env:
        DISABLE_STRICT_HEADER: true    
  2. Apply the change made to /opt/SevOne/chartconfs/di_custom.yaml file.
    $ sevone-cli playbook up --tags apps
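As an optional check, once the wdkserver pod has restarted, you can confirm that the new environment variable was rendered into the pod spec. The di-wdkserver deployment name below is taken from the pod listing later in this guide and may differ in your environment.

$ kubectl get pods | grep wdkserver

$ kubectl describe pod <di-wdkserver pod name> | grep -i DISABLE_STRICT_HEADER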

Unable to connect to the server: x509: certificate has expired

If you see the error message x509: certificate has expired when running kubectl commands, your certificates have expired and must be rotated manually. Please refer to SevOne Data Insight Administration Guide > section Rotate Kubernetes Certificates for details.
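As a quick way to confirm the expiry, you can inspect the certificate presented by the Kubernetes API server. This assumes the default k3s API port 6443.

$ echo | openssl s_client -connect 127.0.0.1:6443 2>/dev/null | openssl x509 -noout -enddate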

[ WARN ] No upgrade available

The No upgrade available warning usually occurs when attempting to retry a failed upgrade or if the upgrade .tgz file is placed in an incorrect directory.

  1. Ensure the .tgz file is in the correct directory as outlined in SevOne Data Insight Upgrade Process Guide > section Confirm SevOne Data Insight Version.
  2. Using ssh, log into SevOne Data Insight as sevone.
    $ ssh sevone@<SevOne Data Insight 'control plane' node IP address or hostname>
  3. Revert your SevOne Data Insight major / minor version in /SevOne.info to a prior / lower version using a text editor of your choice.
    $ vi /SevOne.info

    Example# 1
    Assume the current SevOne Data Insight version is 6.7.0. The version prior to SevOne Data Insight 6.7.0 is SevOne Data Insight 6.6. In this case, to go to the prior version, you must change the major and minor versions to match the prior / lower version.

    major = 6           # e.g.: if this is `6` then leave it as-is
    minor = 7           # e.g.: if this is `7` then set it to `6`
    patch = 0
    build = 160         # e.g.: enter the build number for the prior version i.e. 160

    The prior version is,

    major = 6
    minor = 6 
    patch = 0
    build = 139

    Example# 2
    Assume the current SevOne Data Insight version is 6.5.0. The version prior to SevOne Data Insight 6.5.0 is SevOne Data Insight 3.14. In this case, to go to the prior version, you must change the major and minor versions to match the prior / lower version.

    major = 6      # e.g.: if this is `6` then set it to `3`
    minor = 5      # e.g.: if this is `5` then set it to `14` or `13` or lower version
    patch = 0
    build = 67     # e.g.: enter the build number for the prior version i.e. 162

    The prior version is,

    major = 3
    minor = 14 
    patch = 0
    build = 162

    Example# 3
    Assume the current SevOne Data Insight version is 3.14. The version prior to SevOne Data Insight 3.14 is SevOne Data Insight 3.13. In this case, to go to the prior version, you must change the major and minor versions to match the prior / lower version.

    major = 3           
    minor = 14           # e.g.: if this is `14` then set it to `13`
    patch = 0
    build = 162          # e.g.: enter the build number for the prior version i.e. 54         

    The prior version is,

    major = 3
    minor = 13 
    patch = 0
    build = 54
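    After editing the file, you can confirm the values before retrying the upgrade.

    $ cat /SevOne.info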

Domain Name Resolution (DNS) not working

Important: A working DNS configuration is a requirement for any SevOne Data Insight deployment.

The DNS server must be able to resolve SevOne Data Insight's hostname on both the control plane and the agent nodes; otherwise, SevOne Data Insight will not work. DNS servers can be added via nmtui or by editing the /etc/resolv.conf file directly, as shown in the steps below.

In the example below, let's use the following SevOne Data Insight IP addresses.

Hostname     IP Address     Role
sdi-node01   10.123.45.67   control plane
sdi-node02   10.123.45.68   agent

Also, in this example, the following nameservers are used, along with the DNS search records sevone.com and nwk.sevone.com.

Nameserver IP Address
nameserver 10.168.16.50
nameserver 10.205.8.50
  1. Using ssh, log into the designated SevOne Data Insight control plane node and agent node as sevone from two different terminal windows.

    SSH to 'control plane' node from terminal window 1

    $ ssh sevone@10.123.45.67

    SSH to 'agent' node from terminal window 2

    $ ssh sevone@10.123.45.68
  2. Obtain a list of DNS entries in /etc/resolv.conf file for both control plane and agent nodes in this example.

    From terminal window 1

    $ cat /etc/resolv.conf
    # Generated by NetworkManager
    search sevone.com nwk.sevone.com
    nameserver 10.168.16.50
    nameserver 10.205.8.50

    From terminal window 2

    $ cat /etc/resolv.conf
    # Generated by NetworkManager
    search sevone.com nwk.sevone.com
    nameserver 10.168.16.50
    nameserver 10.205.8.50
  3. Ensure that the DNS server can resolve SevOne Data Insight's hostname / IP address on both the control plane and the agent nodes, along with the DNS entries in the /etc/resolv.conf file (see the search line and nameserver(s)).

    From terminal window 1, the following output shows that the DNS server can resolve the hostname / IP address of both the control plane and the agent nodes.

    Check if 'nslookup' resolves the 'control plane' IP address

    $ nslookup 10.123.45.67
    67.45.123.10.in-addr.arpa   name = sdi-node01.sevone.com.

    Check if 'nslookup' resolves the 'control plane' hostname

    $ nslookup sdi-node01.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sdi-node01.sevone.com
    Address: 10.123.45.67

    Check if 'nslookup' resolves the 'agent' IP address

    $ nslookup 10.123.45.68
    68.45.123.10.in-addr.arpa   name = sdi-node02.sevone.com.

    Check if 'nslookup' resolves the 'agent' hostname

    $ nslookup sdi-node02.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sdi-node02.sevone.com
    Address: 10.123.45.68

    nslookup name 'sevone.com' in search line in /etc/resolv.conf

    $ nslookup sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sevone.com
    Address: 23.185.0.4

    nslookup name 'nwk.sevone.com' in search line in /etc/resolv.conf

    $ nslookup nwk.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   nwk.sevone.com
    Address: 25.185.0.4

    nslookup nameserver '10.168.16.50' in /etc/resolv.conf

    $ nslookup 10.168.16.50
    50.16.168.10.in-addr.arpa   name = infoblox.nwk.sevone.com.

    nslookup nameserver '10.205.8.50' in /etc/resolv.conf

    $ nslookup 10.205.8.50
    50.8.205.10.in-addr.arpa    name = infoblox.colo2.sevone.com.

    From terminal window 2, the following output shows that the DNS server can resolve the hostname / IP address of both the control plane and the agent nodes.

    Check if 'nslookup' resolves the 'agent' IP address

    $ nslookup 10.123.45.68
    68.45.123.10.in-addr.arpa   name = sdi-node02.sevone.com.

    Check if 'nslookup' resolves the 'agent' hostname

    $ nslookup sdi-node02.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sdi-node02.sevone.com
    Address: 10.123.45.68

    Check if 'nslookup' resolves the 'control plane' IP address

    $ nslookup 10.123.45.67
    67.45.123.10.in-addr.arpa   name = sdi-node01.sevone.com.

    Check if 'nslookup' resolves the 'control plane' hostname

    $ nslookup sdi-node01.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sdi-node01.sevone.com
    Address: 10.123.45.67

    nslookup name 'sevone.com' in search line in /etc/resolv.conf

    $ nslookup sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   sevone.com
    Address: 23.185.0.4

    nslookup name 'nwk.sevone.com' in search line in /etc/resolv.conf

    $ nslookup nwk.sevone.com
    Server:     10.168.16.50
    Address:    10.168.16.50#53
    
    Name:   nwk.sevone.com
    Address: 25.185.0.4

    nslookup nameserver '10.168.16.50' in /etc/resolv.conf

    $ nslookup 10.168.16.50
    50.16.168.10.in-addr.arpa   name = infoblox.nwk.sevone.com.

    nslookup nameserver '10.205.8.50' in /etc/resolv.conf

    $ nslookup 10.205.8.50
    50.8.205.10.in-addr.arpa    name = infoblox.colo2.sevone.com.
    Note: If any of the nslookup commands in terminal window 1 or terminal window 2 above fail or return one or more of the following, you must first resolve the name resolution issue; otherwise, SevOne Data Insight will not work.

    Examples

    ** server can't find 67.45.123.10.in-addr.arpa.: NXDOMAIN
    
    or
    
    ** server can't find 68.45.123.10.in-addr.arpa.: NXDOMAIN
    
    or
    
    *** Can't find nwk.sevone.com: No answer
    
    etc.

    If name resolution fails for any reason after SevOne Data Insight has been deployed, normal operations in SevOne Data Insight can also fail. Hence, ensure that the DNS configuration is always working.
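    If you need to add or correct the DNS servers and search domains persistently, one option is nmcli; changes made directly to /etc/resolv.conf may be overwritten by NetworkManager. The connection name below is environment-specific, and the nameserver and search values shown are the ones used in this example.

    $ sudo nmcli connection modify <connection name> ipv4.dns "10.168.16.50 10.205.8.50" ipv4.dns-search "sevone.com nwk.sevone.com"

    $ sudo nmcli connection up <connection name>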


ERROR: Failed to open ID file '/home/sevone/.pub': No such file or directory

As a security measure, fresh installations do not ship with pre-generated SSH keys.

  1. Using ssh, log into SevOne Data Insight as sevone.
    $ ssh sevone@<SevOne Data Insight 'control plane' node IP address or hostname>
  2. Execute the following command to generate unique SSH keys for your cluster.
    $ sevone-cli cluster setup-keys
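To confirm that the keys were generated, list the sevone user's .ssh directory; the exact key file names may vary.

$ ls -l /home/sevone/.ssh/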

TimeShift between SevOne Data Insight & SevOne NMS

If the time difference between SevOne Data Insight and SevOne NMS appliances is more than 5 minutes, the following steps must be performed.

  1. Check the time on the SevOne Data Insight appliance.
    $ date
  2. Check the time on the SevOne NMS appliance.
    $ date
  3. If the time difference between the SevOne Data Insight and SevOne NMS appliances is more than 5 minutes, check the NTP configuration on both appliances. Both appliances must be time-synchronized via NTP.
    Note: If the NTP server is unavailable, manually set the same time on both appliances as shown in the example below.
    Example
     $ date --set="6 OCT 2023 18:00:00" 
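    To check whether an appliance is synchronized via NTP, timedatectl summarizes the sync state on systemd-based systems; chronyc tracking gives more detail when chrony is the NTP client in use.

    $ timedatectl

    $ chronyc tracking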

Pre-check Failures

TASK [ Confirm free space ]

  1. If this task fails, you can try to clean up old installer files that may be found in various parts of the file system. For example,
    • /root
    • /home/sevone
    • /opt/SevOne/upgrade
    • /var/lib/rancher/k3s/agent/images
  2. Clear the scheduled report cache. Execute the following command to delete files older than one week (604800 seconds).
    Note: SevOne Data Insight maintains a cache of the printed PDFs for scheduled reports. Depending on your usage of report scheduling, it is recommended to occasionally clean up the cache to free up disk space.
    $ sevone-cli exec graphql -- "npm run asset-sweeper -- --prefix=scheduledReports --age=604800"

  3. Running the command below helps track down what files in the system are taking up the most space.
    $ du -sh /*
  4. In your investigation, if you find that the following directory is filling up your HDD (Hard Disk Drive), then a container or a pod is the culprit.
    /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots

    You must continue running du -sh to further pinpoint the exact container or pod. In some cases, it may be the printer container taking up the space due to node.js core dump files. Execute the following command to identify those files.

    $ find /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots \
    -name "core\.*"

TASK [ FN000## ]

  1. If the pre-check fails to validate any of the Internal Field Notices (IFNs), apply the IFNs and reboot the appliance(s).
    Note: If FN00068 and/or FN00070 needs to be applied, please contact IBM SevOne Support for the IFN's patch instructions / workaround / solution.
  2. Rerun the pre-check playbook to verify that the IFNs have been applied, or you may verify it manually (see the next step).
    $ ansible-playbook /opt/SevOne/upgrade/ansible/playbooks/precheck.yaml
  3. Validate that Internal Field Notices (IFNs) FN00068 and FN00070 have been applied on both the control plane and agent nodes.

    Check if FN00068 is applied

    This issue is due to a bug with the CentOS kernel reporting incorrect memory usage. Due to this, Kubernetes does not schedule or restart any pods on the affected node because it thinks there is no memory remaining. To check if the IFN needs to be applied, execute the following command.

    $ cat /proc/cmdline | grep -qi 'cgroup.memory=nokmem' || \
    echo ">> IFN 68 NOT APPLIED"

    Check if FN00070 is applied

    This issue only affects users who have been migrated over from RHEL (using the unrhel migration tool). To check if the IFN is applied, execute the following command.

    $ nmcli dev | grep -i ^eth && (cat /proc/cmdline | \
    grep -qi 'biosdevname=0 net.ifnames=0' || \
    echo ">> IFN 70 NOT APPLIED") || \
    echo ">> IFN 70 NOT NEEDED"

Install / Upgrade Failures

TASK [ k3s : Initialize the cluster ]

If this task fails, you can observe the status of the k3s service using the following command.

$ systemctl status k3s
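If the service is failing, the recent k3s journal entries usually contain the underlying error.

$ sudo journalctl -u k3s --no-pager --since "1 hour ago" | tail -n 100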

Unable to find suitable network address. No default routes found.

Check if there is a default route added to the routing table.

$ ip route | grep default

If this returns empty, you will need to add a default route.

Add default route

$ ip route add default via <default_gateway>

TASK [ Stop k3s-server if upgrading to new version ]

If this task does not complete within a minute, you will have to apply the following workaround before continuing with the upgrade.

Note: If you are upgrading using the GUI Installer, you must stop the GUI Installer API and Client processes.

Check the status of the API and Client processes

$ sudo systemctl status sevone-guii-@api
$ sudo systemctl status sevone-guii-@client

Stop the API and Client processes

$ sudo systemctl stop sevone-guii-@api
$ sudo systemctl stop sevone-guii-@client

Once the upgrade completes, start the API and Client processes again

$ sudo systemctl start sevone-guii-@api
$ sudo systemctl start sevone-guii-@client

for SevOne Data Insight <= 3.9

$ sed -i 's/.*k3s-killall.sh.*/    echo noop/' \
/opt/SevOne/upgrade/ansible/playbooks/roles/k3s/tasks/02_setup.yaml

$ ansible-playbook /opt/SevOne/upgrade/ansible/playbooks/up.yaml \
--tags kube,apps,kernel

for SevOne Data Insight >= 3.10

$ sed -i 's/.*k3s-killall.sh.*/    echo noop/' \
/opt/SevOne/upgrade/ansible/playbooks/roles/k3s/tasks/02_setup.yaml

$ sevone-cli playbook up --tags kube,apps,kernel

This is due to an upstream issue with the k3s-killall.sh script hanging when attempting to shut down some running containerd processes.

TASK [ prep : Ensure hostname set ]

When attempting to run an upgrade, you may run into the following error.

TASK [prep : Ensure hostname set] ****************************************************************************************************************************************************
fatal: [sevonek8s]: FAILED! => {"changed": false, \
"msg": "Command failed rc=1, out=, err=Could not get property: \
Failed to activate service 'org.freedesktop.hostname1': timed out\n"}

This happens when hostnamed has likely crashed. Restart hostnamed.

$ sudo systemctl restart systemd-hostnamed
Important: If hostnamed restart does not work, restart the machine.
$ sudo reboot
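After restarting hostnamed (or rebooting), you can confirm that it responds again.

$ hostnamectl status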

TASK [ freight : Install centos-update-*.el7.tgz ]

When upgrading to SevOne Data Insight 3.8 or higher, the culprit is likely that the installed packages are too up-to-date. This can happen if your machine has internet access and can reach yum package servers. The fix is to retry the upgrade while skipping the yum packages with broken dependencies.

  1. Remove lingering yum packages or package conflicts.
    $ sudo yum clean all
    
    $ sudo rm -rf /var/cache/yum/
  2. Retry the upgrade via the Command Line Interface (CLI).
    $ sevone-cli playbook up --extra-vars "freight_install_skip_broken=yes"

TASK [ helm upgrade/install default/<chart_name> ]

There are several reasons why this task may fail. Unfortunately, Helm does not report useful errors or debug information, so further investigation is required. Please look for the stderr key in the large JSON body that is returned in the task output.
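Before retrying, it can also help to review the state of the Helm releases; the release name di matches the one used in the rollback commands below.

$ helm ls -a

$ helm history di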

UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

This failure occurs when a previous Helm operation did not complete and the failed deployment must be rolled back manually. The failed operation may have occurred before you initiated the upgrade, perhaps when configuring the SevOne Data Insight Helm chart. Please refer to SevOne Data Insight Administration Guide, section Helm Chart for details.

Execute the following command.

$ helm rollback di
Important: If the issue is related to ingress, execute the following command.
$ helm rollback ingress

Upon completion of the command above, you may then resume the upgrade by executing the following command.

$ sevone-cli playbook up --tags apps,kernel

UPGRADE FAILED: to deploy apps

If you are upgrading between 3.5.x versions, for example, from 3.5.1 to 3.5.3, the upgrade will fail to deploy apps.

To fix this, execute the following commands before performing an upgrade.

$ sevone-cli playbook up --skip-tags apps,kernel

$ sudo systemctl restart k3s

$ ssh sevone@<'agent' IP address>

$ sudo systemctl restart k3s-agent

UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version

This is caused when upgrading from SevOne Data Insight 3.5.x directly to SevOne Data Insight 3.11 and above. Please refer to SevOne Data Insight Upgrade Process Guide > Pre-Upgrade Checklist > section Version Matrix for more information.

Note: Error: UPGRADE FAILED: current release manifest contains removed kubernetes api(s) for this kubernetes version and it is therefore unable to build the kubernetes objects for performing the diff. error from kubernetes: unable to recognize "": no matches for kind "Ingress" in version "networking.k8s.io/v1beta1"

Execute the following steps.

  1. Go to /home/sevone directory.
    $ cd /home/sevone
  2. Create fix-manifest.sh script file.
    $ touch fix-manifest.sh
  3. Using a text editor of your choice, edit /home/sevone/fix-manifest.sh script file, add the following and then, save it.
    #!/bin/bash
    
    # set up vars. change these as needed
    release=di
    namespace=default
    
    # create temp file to output files to
    tmp_dir=$(mktemp -d -t fix-manifest-XXXXX)
    
    # grab helm release object and decode it
    releaseObject=$(kubectl get secret -l owner=helm,status=deployed,name=$release --namespace $namespace | awk '{print $1}' | grep -v NAME)
    kubectl get secret $releaseObject -n $namespace -o yaml > $tmp_dir/$release.release.yaml
    cp $tmp_dir/$release.release.yaml $tmp_dir/$release.release.bak
    cat $tmp_dir/$release.release.yaml | grep -oP '(?<=release: ).*' | base64 -d | base64 -d | gzip -d > $tmp_dir/$release.release.data.decoded
    sed -i -e 's/networking.k8s.io\/v1beta1/networking.k8s.io\/v1/' $tmp_dir/$release.release.data.decoded
    cat $tmp_dir/$release.release.data.decoded | gzip | base64 | base64 > $tmp_dir/$release.release.data.encoded
    
    # patch the helm release object
    tr -d "\n" < $tmp_dir/$release.release.data.encoded > $tmp_dir/$release.release.data.encoded.final
    releaseData=$(cat $tmp_dir/$release.release.data.encoded.final)
    sed 's/^\(\s*release\s*:\s*\).*/\1'$releaseData'/' $tmp_dir/$release.release.yaml > $tmp_dir/$release.final.release.yaml
    kubectl apply -f $tmp_dir/$release.final.release.yaml -n $namespace
    
    # clean up
    rm -rf $tmp_dir
  4. Execute fix-manifest.sh script file.
    $ bash fix-manifest.sh
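    As an optional check, confirm that the patched release manifest no longer references the removed API version; a count of 0 is expected. The release name di matches the one set in the script.

    $ helm get manifest di | grep -c 'networking.k8s.io/v1beta1'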

General Debugging Tips

There are several reasons why task helm upgrade/install default/<chart_name> may fail. Helm does not provide useful debug information and further investigation is required to understand the failure.

  1. Execute the following command to retry the upgrade.
    $ sevone-cli playbook up --tags apps
  2. While the above command is in progress, from another terminal window, run k9s.
  3. Monitor the status of each pod and refer to the table below for some basic debugging techniques.
    If logs are not shown when observing the logs via k9s, press 0 to enable logs for all time.
    Status Action
    CrashLoopBackOff Check the pod logs by hovering over the pod and pressing l.
    Error Check the pod logs by hovering over the pod and pressing l.
    Pending Check the pod event log by hovering over it and pressing d.
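    If you prefer plain kubectl over k9s, roughly equivalent checks are shown below. For CrashLoopBackOff or Error, --previous shows the logs from the last failed container run; for Pending, describe shows the scheduling events.

    $ kubectl describe pod <pod-name>

    $ kubectl logs <pod-name> --previous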

Other Issues

Configuration Check

SevOne Data Insight requires configuration of several components to operate properly. When troubleshooting issues, it can be cumbersome to check the configuration and health of the system because there are different tools and network requirements, such as exposing certain ports.

The following methods display the configuration and health of the Data Insight environment so that a misconfiguration or system error can be quickly identified.

CLI method on production Kubernetes

$ ssh sevone@<SevOne Data Insight 'control plane' node IP address or hostname>

$ sevone-cli exec graphql -- npm run health
Important: The command above should return no errors or failures.

GraphQL method

Here are some sample GraphQL queries.

Check Data Insight system health

query health {
  health {
    minio { ...componentHealthDetails }
    mysql { ...componentHealthDetails }
    rabbitMq { ...componentHealthDetails }
    redis { ...componentHealthDetails }
    reportScheduler { ...componentHealthDetails }
    soa { ...componentHealthDetails }
  }
}
fragment componentHealthDetails on ComponentHealthDetails {
  host port error ok
}

Check a single datasource

query ds {
  datasources(ids: [ 1 ]) {
    id
    name
    address
  }
}

Check all datasources on all tenants

query tenants {
  tenants {
    id
    name
    datasources {
      id
      name
      address
    }
  }
}

Check datasources at authentication

mutation auth {
  authentication(tenant: "MyTenant", username: "admin", password: "password") {
    token
    success
    datasources {
      id
      name
      address
    }
  }
}
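These queries can be run from the GraphQL playground in the SevOne Data Insight user interface or with curl. A minimal curl sketch is shown below; the /graphql path and the Authorization header format are assumptions and may differ in your deployment, and the token comes from the authentication mutation above.

$ curl -sk https://<SevOne Data Insight hostname>/graphql \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <token>' \
  -d '{"query":"query { health { soa { host port error ok } } }"}'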

Error getting NMS IP List

SOA version
SOA must be on the latest version on all appliances in the SevOne NMS cluster. The Command Line Interface (CLI) must be used to upgrade SOA on all peers, as the graphical user interface (GUI) only upgrades SOA for the NMS appliance you are connected to.
Add the --all-peers flag if you want to install / upgrade SOA on all peers in the cluster.

Error

$ sevone-cli soa upgrade /opt/SevOne/upgrade/utilities/SevOne-soa-*.rpm --all-peers
>> [INFO] ATTEMPTING TO AUTO-DETECT SOA DATASOURCES...
Defaulted container "mysql" out of: mysql, metrics
...
...
<returns an ERROR>

If you get this error, please make sure you are logged into SevOne Data Insight as sevone.

$ ssh sevone@<SevOne Data Insight IP address or hostname>

Now, re-run the command to upgrade SOA.

$ sevone-cli soa upgrade /opt/SevOne/upgrade/utilities/SevOne-soa-*.rpm --all-peers

Incorrect information entered at Bootstrap and/or Provisioning prompts?

If you entered incorrect information at bootstrap and/or provisioning prompts, execute the following commands to allow you to override the input. These commands can only be run once your SevOne Data Insight is up and running.

$ ssh sevone@<SevOne Data Insight IP address or hostname>

$ sevone-cli exec graphql -- npm run bootstrap -- -f

$ sevone-cli exec graphql -- npm run provision -- -f
These commands re-run the bootstrap and provisioning prompts, which is especially useful if incorrect information was provided the first time.

Pod Stuck in a Terminating State

If a pod is ever stuck and you want to force it to restart, you can append --grace-period=0 --force to the end of your delete pod command.

Example

$ ssh sevone@<SevOne Data Insight IP address or hostname>

$ kubectl delete pod $(kubectl get pods | grep 'dsm' | awk '{print $1}') --grace-period=0 --force
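To list any pods that are currently stuck in a Terminating state, you can filter the pod listing.

$ kubectl get pods | grep Terminating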

Review / Collect Logs

Logs can be collected at the pod level. The status of pods must be Running.

Note: In the commands below, to obtain the logs, you need to pass <resource-type/pod-name>. For example, deployment.apps/di-printer or deploy/di-printer.

By default, resource-type = pod. For logs where resource-type = pod, you may choose to pass the pod-name only; resource-type is optional.

Using ssh, log into SevOne Data Insight as sevone.

$ ssh sevone@<SevOne Data Insight IP address or hostname>

Example: Get 'pod' names

$ kubectl get pods
NAME                                                      READY   STATUS      RESTARTS        AGE
di-create-secrets-xllfj                                   0/1     Completed   0               22h
di-upgrade-l2cs8                                          0/1     Completed   0               22h
clienttest-success-89lmt                                  0/1     Completed   0               22h
clienttest-fail-lb8mq                                     0/1     Completed   0               22h
di-report-version-sweeper-28276440-zpcxt                  0/1     Completed   0               20h
ingress-ingress-nginx-controller-54dfdbc9cf-g9wdz         1/1     Running     0               22h
di-prometheus-node-exporter-shnxk                         1/1     Running     0               22h
di-graphql-7d88c8c7b5-fbwgc                               1/1     Running     0               22h
di-ui-5b8fbcfc54-rtwlq                                    1/1     Running     0               22h
di-kube-state-metrics-6f4fbc67cb-tsbbk                    1/1     Running     0               22h
di-migrator-fdb9dd58b-29kl2                               2/2     Running     0               22h
ingress-ingress-nginx-defaultbackend-69f644c9dc-7jvvs     1/1     Running     0               22h
di-printer-7888679b59-cqp9q                               2/2     Running     0               22h
di-scheduler-7845d64d57-bdsm2                             1/1     Running     0               22h
di-registry-68c7bbc47b-45l5v                              1/1     Running     0               22h
di-djinn-api-5b4bbb446b-prsjd                             1/1     Running     1 (22h ago)     22h
di-mysql-0                                                2/2     Running     0               22h
di-prometheus-server-7dc67cb6b5-bjzn5                     2/2     Running     0               22h
di-redis-master-0                                         2/2     Running     0               22h
di-wdkserver-6db95bb9c9-5w2kt                             2/2     Running     0               22h
di-assetserver-5c4769bd8-6f2hw                            1/1     Running     0               22h
di-prometheus-node-exporter-mp5xf                         1/1     Running     0               22h
di-report-tombstone-sweeper-28277040-kj227                1/1     Running     0               10h
datasource-operator-controller-manager-5cf6f7f675-h5lng   2/2     Running     3 (5h37m ago)   22h
di-asset-sweeper-28277645-tq6gb                           0/1     Completed   0               12m
di-user-sync-28277645-dl6ks                               0/1     Completed   0               12m
di-asset-sweeper-28277650-hxwvn                           0/1     Completed   0               7m46s
di-user-sync-28277650-6kxf7                               0/1     Completed   0               7m46s
di-asset-sweeper-28277655-gjtpr                           0/1     Completed   0               2m46s
di-user-sync-28277655-chgxd                               0/1     Completed   0               2m46s
Pod names are the names found under column NAME.

Get resource types
Get 'all' resource types

$ kubectl get all | more

Get resource type for a pod

$ kubectl get all | grep <pod-name>

Example: Get resource type for pod-name containing 'printer'

$ kubectl get all | grep printer
pod/di-printer-68f6bddb6f-hkhdt            1/1     Running   2 (27h ago)   2d3h
deployment.apps/di-printer                 1/1     1         1             2d3h
replicaset.apps/di-printer-68f6bddb6f      1       1         1             2d3h

Example: Get resource type for pod-name containing 'rabbitmq'

$ kubectl get all | grep rabbitmq
pod/di-rabbitmq-0             1/1         Running          2 (27h ago)  2d3h
service/di-rabbitmq-headless  ClusterIP   None             <none>       4369/TCP,5672/TCP,25672/TCP,15672/TCP            2d3h
service/di-rabbitmq           ClusterIP   192.168.108.109  <none>       5672/TCP,4369/TCP,25672/TCP,15672/TCP,9419/TCP   2d3h
statefulset.apps/di-rabbitmq  1/1                                                                                        2d3h
Important: pod, deployment.apps, replicaset.apps, service, statefulset.apps, etc. in the examples above are resource types.

di-printer, di-rabbitmq, etc. in the examples above are pod names.

Get logs

$ kubectl logs <resource-type>/<pod-name> 

Example: Get logs for pod-name 'di-printer'

$ kubectl logs deployment.apps/di-printer 

OR

$ kubectl logs deploy/di-printer

Example: Get logs for pod-name 'di-rabbitmq'

$ kubectl logs statefulset.apps/di-rabbitmq

OR

$ kubectl logs sts/di-rabbitmq

Example: Get logs for pod-name 'rabbitmq' with timestamps

$ kubectl logs statefulset.apps/di-rabbitmq --timestamps

OR

$ kubectl logs sts/di-rabbitmq --timestamps

By default, resource-type = pod.

In the example below, to obtain the logs for <resource-type>/<pod-name> = pod/di-mysql-0, <resource-type> pod is optional.

Example: <resource-type> = pod; <resource-type> is optional

$ kubectl logs pod/di-mysql-0

OR

$ kubectl logs di-mysql-0
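For long-running pods, you can limit the amount of log output; the following flags work with any of the log commands above.

$ kubectl logs deploy/di-printer --tail=100

$ kubectl logs deploy/di-printer --since=1h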
Important: Each pod can have one or more associated containers.

Collect Logs for a Pod with One Container

  1. Using ssh, log into SevOne Data Insight as sevone.
    $ ssh sevone@<SevOne Data Insight IP address or hostname>
  2. Obtain the list of containers that belong to a pod.
    Example: Pod name 'di-mysql-0' contains one container, 'mysql'
    $ kubectl get pods di-mysql-0 -o jsonpath='{.spec.containers[*].name}{"\n"}'
    mysql metrics
  3. Collect logs.
    Note: For pods with one container only, -c <container-name> in the command below is optional.
    $ kubectl logs <pod-name> -c <container-name>
    
    or
    
    $ kubectl logs <pod-name>

    Example

    $ kubectl logs di-mysql-0 -c mysql
    
    or
    
    $ kubectl logs di-mysql-0

Collect Logs for a Pod with More Than One Container

  1. Using ssh, log into SevOne Data Insight as sevone.
    $ ssh sevone@<SevOne Data Insight IP address or hostname>
  2. Obtain the list of containers that belong to a pod.
    Example: Pod name 'svclb-ingress-ingress-nginx-controller-5pcm7' contains two containers, 'lb-port-80' and 'lb-port-443'
    $ kubectl get pods svclb-ingress-ingress-nginx-controller-5pcm7 \
    -o jsonpath='{.spec.containers[*].name}{"\n"}'
    
    lb-port-80 lb-port-443
  3. Collect logs.
    Important: For pods with more than one container, -c <container-name> is required.
    $ kubectl logs <pod-name> -c <container-name>

    Example: Get logs for <container-name> = lb-port-80

    $ kubectl logs svclb-ingress-ingress-nginx-controller-5pcm7 -c lb-port-80

    Example: Get logs for <container-name> = lb-port-443

    $ kubectl logs svclb-ingress-ingress-nginx-controller-5pcm7 -c lb-port-443

Collect All Logs

  1. To collect all the logs relevant to the SevOne Data Insight pods and their containers, create a working directory where all the logs can be collected.
    $ TMPDIR="/tmp/sdi_logs/$(date +%d%b%y)"
    $ mkdir -p $TMPDIR
  2. Execute the following command to collect all logs for all SevOne Data Insight containers.
    Note: The --timestamps option in the command below allows you to collect the logs with the timestamps.

    Example: Command to collect logs from all SevOne Data Insight Pods and containers

    $ for POD in $(kubectl get pods --no-headers -n default | \
    awk '{print $1}'); do for CONTAINER in $(kubectl get pods \
    $POD -o jsonpath='{.spec.containers[*].name}{"\n"}'); \
    do echo "Collecting logs for POD: $POD - CONTAINER: \
    $CONTAINER in log file $TMPDIR/${POD}_${CONTAINER}.log.gz" ; \
    kubectl logs $POD -c $CONTAINER --timestamps | \
    gzip > $TMPDIR/${POD}_${CONTAINER}.log.gz 2>&1; done ; done

    The for command is shown here with indentations for clarity.

    for POD in $(kubectl get pods --no-headers -n default | awk '{print $1}') ;
      do
      for CONTAINER in $(kubectl get pods $POD -o jsonpath='{.spec.containers[*].name}{"\n"}') ;
      do
        echo "Collecting logs for POD: $POD - CONTAINER: $CONTAINER in log file $TMPDIR/$POD_$CONTAINER.log.gz" ;
        kubectl logs $POD -c $CONTAINER --timestamps | gzip > $TMPDIR/$POD_$CONTAINER.log.gz 2>&1 ;
      done ;
    done

    Command to see files contained in $TMPDIR

    $ ls -lh $TMPDIR
  3. Once the logs are collected, the contents can be put in a tar file. There is no need to compress again since the logs are already compressed.
    $ tar -cf /tmp/sdi_logs-$(date +%d%b%y).tar $TMPDIR
    
    $ ls -l /tmp/sdi_logs-$(date +%d%b%y).tar
    
    $ md5sum /tmp/sdi_logs-$(date +%d%b%y).tar
  4. Delete the log directory to free-up the space.
    $ rm -rf $TMPDIR
  5. You may upload the tar file /tmp/sdi_logs-$(date +%d%b%y).tar for further investigation.

'Agent' Nodes in a Not Ready State after Rebooting

Perform the following actions if the agent nodes are in a Not Ready state after rebooting.

Ensure Data Insight is 100% deployed

Check the status of the deployment by running the following command. Ensure that every pod is in a Running or Completed status.

$ ssh sevone@<SevOne Data Insight IP address or hostname>

$ kubectl get pods
NAME                                                      READY   STATUS      RESTARTS        AGE
di-create-secrets-xllfj                                   0/1     Completed   0               22h
di-upgrade-l2cs8                                          0/1     Completed   0               22h
clienttest-success-89lmt                                  0/1     Completed   0               22h
clienttest-fail-lb8mq                                     0/1     Completed   0               22h
di-report-version-sweeper-28276440-zpcxt                  0/1     Completed   0               20h
ingress-ingress-nginx-controller-54dfdbc9cf-g9wdz         1/1     Running     0               22h
di-prometheus-node-exporter-shnxk                         1/1     Running     0               22h
di-graphql-7d88c8c7b5-fbwgc                               1/1     Running     0               22h
di-ui-5b8fbcfc54-rtwlq                                    1/1     Running     0               22h
di-kube-state-metrics-6f4fbc67cb-tsbbk                    1/1     Running     0               22h
di-migrator-fdb9dd58b-29kl2                               2/2     Running     0               22h
ingress-ingress-nginx-defaultbackend-69f644c9dc-7jvvs     1/1     Running     0               22h
di-printer-7888679b59-cqp9q                               2/2     Running     0               22h
di-scheduler-7845d64d57-bdsm2                             1/1     Running     0               22h
di-registry-68c7bbc47b-45l5v                              1/1     Running     0               22h
di-djinn-api-5b4bbb446b-prsjd                             1/1     Running     1 (22h ago)     22h
di-mysql-0                                                2/2     Running     0               22h
di-prometheus-server-7dc67cb6b5-bjzn5                     2/2     Running     0               22h
di-redis-master-0                                         2/2     Running     0               22h
di-wdkserver-6db95bb9c9-5w2kt                             2/2     Running     0               22h
di-assetserver-5c4769bd8-6f2hw                            1/1     Running     0               22h
di-prometheus-node-exporter-mp5xf                         1/1     Running     0               22h
di-report-tombstone-sweeper-28277040-kj227                1/1     Running     0               10h
datasource-operator-controller-manager-5cf6f7f675-h5lng   2/2     Running     3 (5h37m ago)   22h
di-asset-sweeper-28277645-tq6gb                           0/1     Completed   0               12m
di-user-sync-28277645-dl6ks                               0/1     Completed   0               12m
di-asset-sweeper-28277650-hxwvn                           0/1     Completed   0               7m46s
di-user-sync-28277650-6kxf7                               0/1     Completed   0               7m46s
di-asset-sweeper-28277655-gjtpr                           0/1     Completed   0               2m46s
di-user-sync-28277655-chgxd                               0/1     Completed   0               2m46s
Note: To see additional pod details, you may use kubectl get pods -o wide command.
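You can also check the node status directly. kubectl describe shows the node conditions and recent events that explain why an agent node is NotReady; use the node name reported by kubectl get nodes.

$ kubectl get nodes -o wide

$ kubectl describe node <agent node name>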

Restart SOA

If SevOne NMS has been upgraded or downgraded, please make sure that the SOA container is restarted after a successful upgrade / downgrade. Execute the following commands.

From the SevOne NMS appliance,

$ ssh root@<NMS appliance>

$ supervisorctl restart soa
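To confirm the restart, check the process status; this assumes SOA runs under supervisord as shown above.

$ supervisorctl status soa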