Troubleshooting
The following are some troubleshooting scenarios.
General Issues
Early Users of 'unrhel'
The following message is returned when the environment is migrated over from RHEL using unrhel@v5.1.1 or below.
FAILED => Missing sudo password
To fix this, execute the following commands before performing an upgrade.
ssh sevone@<'agent' IP address>
sudo -i
echo "sevone ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
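As a quick sanity check, the exact sudoers line can be validated before (or after) editing the real file. The sketch below is illustrative only: it runs against a scratch file rather than /etc/sudoers, since the line must match exactly for passwordless sudo to take effect.

```shell
# Validate the sudoers entry format against a scratch copy (illustrative;
# the real target is /etc/sudoers, edited as root).
SCRATCH=$(mktemp)
echo "sevone ALL=(ALL) NOPASSWD: ALL" >> "$SCRATCH"

# Passwordless sudo only works if the line is present verbatim.
ENTRY_OK=no
grep -q '^sevone ALL=(ALL) NOPASSWD: ALL$' "$SCRATCH" && ENTRY_OK=yes
echo "sudoers entry: $ENTRY_OK"
rm -f "$SCRATCH"
```

On the real system, `sudo -n true` run as sevone should then succeed without prompting for a password.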
the connection plugin 'local' was not found
This is caused by an Ansible upgrade being triggered in the middle of a playbook run. To recover, execute the following commands.
sudo rpm -Uvh /opt/SevOne/upgrade/utilities/sevone-cli*$(rpm --eval '%{dist}')*.rpm
sevone-cli playbook install
sevone-cli playbook up
the connection plugin 'ssh' was not found
This is the same issue as the connection plugin 'local' was not found.
Could not find platform independent libraries
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Python path configuration:
PYTHONHOME = (not set)
PYTHONPATH = (not set)
program name = '/usr/bin/python3.12'
isolated = 0
environment = 1
user site = 1
safe_path = 0
import site = 1
is in build tree = 0
stdlib dir = '/root/.pyenv/versions/3.12.1/lib/python3.12'
sys._base_executable = '/usr/bin/python3.12'
sys.base_prefix = '/root/.pyenv/versions/3.12.1'
sys.base_exec_prefix = '/root/.pyenv/versions/3.12.1'
sys.platlibdir = 'lib'
sys.executable = '/usr/bin/python3.12'
sys.prefix = '/root/.pyenv/versions/3.12.1'
sys.exec_prefix = '/root/.pyenv/versions/3.12.1'
sys.path = [
'/root/.pyenv/versions/3.12.1/lib/python312.zip',
'/root/.pyenv/versions/3.12.1/lib/python3.12',
'/root/.pyenv/versions/3.12.1/lib/python3.12/lib-dynload',
]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
ModuleNotFoundError: No module named 'encodings'
Current thread 0x00007f7bdf71e740 (most recent call first):
<no Python frame>
>> [ERROR] An error occurred running the playbook /opt/SevOne/upgrade/ansible/playbooks/up.yaml. Please check the output above.
ACCESS DENIED in GraphQL Logs
If SOA API keys are outdated or expired, you are likely to get this error. To fix this, update the datasource keys.
- Execute the following command to ensure that the GraphQL pod is Running and not in an Errored or CrashLoopBackOff state.
Example
kubectl get pods | grep graphql
di-graphql-7d88c8c7b5-fbwgc    1/1    Running    0    22h
- If the third column reads Errored or CrashLoopBackOff, use a text editor of your choice to edit the config file based on the SevOne Data Insight version (as shown in the table in section find config file) to include the following environment variable, and then save it.
Example: for SDI 6.8.x and below,
graphql:
  env:
    SKIP_REPORT_MIGRATION_DRY_RUN: true
Example: for SDI 7.0.x and above,
graphql = { env = { "SKIP_REPORT_MIGRATION_DRY_RUN" = "true" } }
- Apply the change made to the config file.
sevone-cli playbook up --tags apps
- Once the GraphQL pod is Running / healthy, generate new SOA API keys for each affected datasource. Execute the following command.
sevone-cli exec graphql -- npm run reconfig-datasource
Note: You will be prompted several times. Keep all the default values, except enter y when prompted to Login instead of providing an API key.
- username must be admin
- password must be the admin Graphical User Interface password for the datasource.
Important: Repeat this step for each datasource.
Example
sevone-cli exec graphql -- npm run reconfig-datasource
Respond to the following prompts:
> insight-server@7.1.0 reconfig-datasource /insight-server
> NODE_PATH=./dist/libs node dist/scripts/database-init/reconfigure-datasource.js

Datasource config:
  Name: Data Insight API
  Address: https://staging.soa.sevone.doc
  API key: eyJ1dWlkIjoiYzMxNTQzZWUtMTgxZC00NWMyLTlkNjctNTUwZWRhODQ2MGFkIiwiYXBwbGljYXRpb24iOiJEYXRhIEluc2lnaHQgKGNrcmwxdmhqZzAwMDA1M3MwM3Z5bmRlZXMpIiwiZW50cm9weSI6IjZVaWxuQStzVDk2ZUFKeG92WW1Nak1odS9nZ29JSWhLNVBDZ05yZHBBT1lrSE11ZlM0eU9CbCs4YWxEUXd3a1MifQ==
  Dstype: METRICS/FLOW
Datasource name [Data Insight API]:
  [1] METRICS/FLOW
  [2] splunk-datasource
  [3] elastic-datasource
  [0] Keep: METRICS/FLOW
Please respond to the following prompts:
  Datasource dstype [1, 2, 3, 0]: 0
  Datasource address [https://staging.soa.sevone.doc]:
  Login instead of providing an API key? [y/n]: y
  Username: admin
  Password: ******
Output:
  info: [Data Insight API@reconfigure-datasource] SOA request (SOA-1) post https://staging.soa.sevone.doc/api/v3/users/signin
  (node:347) Warning: Setting the NODE_TLS_REJECT_UNAUTHORIZED environment variable to '0' makes TLS connections and HTTPS requests insecure by disabling certificate verification.
  (Use `node --trace-warnings ...` to show where the warning was created)
  info: [Data Insight API@reconfigure-datasource] SOA response (SOA-1) elapsed 1755ms.
  info: [Data Insight API@reconfigure-datasource] SOA request (SOA-2) post https://staging.soa.sevone.doc/api/v3/users/apikey
  info: [Data Insight API@reconfigure-datasource] SOA response (SOA-2) elapsed 277ms.
New datasource config:
  Name: Data Insight API
  Address: https://staging.soa.sevone.doc
  API key: eyJ1dWlkIjoiMzgyMDdhMjItNzE2Mi00OWRlLTk5NTYtYmI3OTVkYjc5NzZkIiwiYXBwbGljYXRpb24iOiJEYXRhIEluc2lnaHQgKGNrczUyYnd0YzAwMDA5bnMxNnQ3aWcxYnQpIiwiZW50cm9weSI6IkNoYS9tbDFWbVVyYThQcHVsLzIzY05JZk94QXcxWFQrVnEyM0hPSzYzSTdPNGNMbkJTVjQyWUVRSW1FeGtDaEoifQ==
  Dstype: METRICS/FLOW
Please respond to the following prompts:
  Is this config correct? [y/n]: y
Output:
  info: [Data Insight API@create-datasource] SOA request (SOA-3) get https://staging.soa.sevone.doc/api/v3/users/self
  info: [Data Insight API@create-datasource] SOA response (SOA-3) elapsed 275ms.
  Datasource config updated!
  Datasource reconfiguration complete.
- Once all the datasources have been updated, restart the GraphQL pod.
kubectl delete pods -l app.kubernetes.io/component=graphql
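While waiting for the restarted GraphQL pod to come back, a small polling helper can save repeated manual checks. This is an illustrative sketch: `stub_probe` stands in for the real kubectl query (shown in the comment) so the helper can be exercised anywhere.

```shell
# Poll a status probe until it reports "Running" or the retry budget runs out.
# In a real session the probe would be a kubectl query such as:
#   kubectl get pods -l app.kubernetes.io/component=graphql \
#     -o jsonpath='{.items[0].status.phase}'
wait_for_running() {
  probe=$1; tries=$2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if [ "$($probe)" = "Running" ]; then
      echo "pod is Running"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out waiting for pod" >&2
  return 1
}

stub_probe() { echo "Running"; }   # stand-in for the kubectl query above
wait_for_running stub_probe 5
# → pod is Running
```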
Error Fetching Widget when Loading Report
This error occurs when the wdkserver is not serving widgets to the user interface. In most cases, it is caused by an invalid cookie value set in your browser. You may inspect the network activity using the browser's Developer Tools and look for requests to /wdkserver. If you are unable to inspect the network activity, please contact IBM SevOne Support.
If you observe the following error message coming back from /wdkserver, remove the offending cookie or disable strict headers in wdkserver. If you need assistance with this, please contact IBM SevOne Support.
{"statusCode":400,"error":"Bad Request","message":"Invalid cookie value"}
- Using a text editor of your choice, edit the config file based on the SevOne Data Insight version (as shown in the table in section find config file) to include the following environment variable, and then save it.
For SDI 6.8.x and below,
For SDI 6.8.x and below,
wdkserver:
  env:
    DISABLE_STRICT_HEADER: true
For SDI 7.0.x and above,
wdkserver = { env = { "DISABLE_STRICT_HEADER" = "true" } }
- Apply the change made to the config file.
sevone-cli playbook up --tags apps
Unable to connect to the server: x509: certificate has expired
If you see the error message x509: certificate has expired when running kubectl commands, your certificates have expired and must be rotated manually. Please refer to SevOne Data Insight Administration Guide > section Rotate Kubernetes Certificates for details.
[ WARN ] No upgrade available
The No upgrade available warning usually occurs when attempting to retry a failed upgrade or if the upgrade .tgz file is placed in an incorrect directory.
- Ensure the .tgz file is in the correct directory as outlined in SevOne Data Insight Upgrade Process Guide > section Confirm SevOne Data Insight Version.
- Using ssh, log into SevOne Data Insight as sevone.
ssh sevone@<SevOne Data Insight 'control plane' node IP address or hostname>
- Revert your SevOne Data Insight major / minor version in /SevOne.info to a prior / lower version using a text editor of your choice.
vi /SevOne.info
Example # 1
The current SevOne Data Insight version is 7.2.0. The version prior to SevOne Data Insight 7.2.0 is SevOne Data Insight 7.1.0. In this case, to go to the prior / lower version, you must change the minor version.
major = 7        # e.g.: if this is `7` then leave it as-is
minor = 2        # e.g.: if this is `2` then set it to `1`
patch = 0
build = <###>    # e.g.: replace the build number with the one for the prior version i.e., 184
The prior version is,
major = 7
minor = 1
patch = 0
build = 184
Example # 2
The current SevOne Data Insight version is 7.0.0. The version prior to SevOne Data Insight 7.0.0 is SevOne Data Insight 6.8.0. In this case, to go to the prior / lower version, you must change both the major and minor versions.
major = 7        # e.g.: if this is `7` then set it to `6`
minor = 0        # e.g.: if this is `0` then set it to `8`
patch = 0
build = <###>    # e.g.: replace the build number with the one for the prior version i.e., 45
The prior version is,
major = 6
minor = 8
patch = 0
build = 45
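The manual edit above can also be scripted. The sketch below applies the Example #2 rollback with sed against a scratch copy of /SevOne.info; the starting build number is illustrative, and on a real node you would run the same sed against /SevOne.info itself.

```shell
# Roll the recorded version back from 7.0.0 to 6.8.0 build 45 on a scratch
# copy of /SevOne.info (build 123 below is an illustrative placeholder).
INFO=$(mktemp)
cat > "$INFO" <<'EOF'
major = 7
minor = 0
patch = 0
build = 123
EOF

sed -i -e 's/^major = .*/major = 6/' \
       -e 's/^minor = .*/minor = 8/' \
       -e 's/^build = .*/build = 45/' "$INFO"

ROLLED_BACK=no
grep -q '^major = 6$' "$INFO" && grep -q '^minor = 8$' "$INFO" && ROLLED_BACK=yes
echo "rolled back: $ROLLED_BACK"
rm -f "$INFO"
```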
Domain Name Resolution (DNS) not working
The DNS server must be able to resolve SevOne Data Insight's hostname on both the control plane and the agent nodes; otherwise, SevOne Data Insight will not work. This can be done by adding your DNS servers via nmtui or by editing the /etc/resolv.conf file directly, as shown in the steps below.
In the example below, let's use the following SevOne Data Insight IP addresses.
| Hostname | IP Address | Role |
|---|---|---|
| sdi-node01 | 10.123.45.67 | control plane |
| sdi-node02 | 10.123.45.68 | agent |
Also, in this example, the following DNS configuration is used, with DNS search records sevone.com and nwk.sevone.com.
| Nameserver | IP Address |
|---|---|
| nameserver | 10.168.16.50 |
| nameserver | 10.205.8.50 |
- Using ssh, log into the designated SevOne Data Insight control plane node and agent node as sevone from two different terminal windows.
SSH to 'control plane' node from terminal window 1
ssh sevone@10.123.45.67
SSH to 'agent' node from terminal window 2
ssh sevone@10.123.45.68
- Obtain a list of DNS entries in the /etc/resolv.conf file for both the control plane and agent nodes in this example.
From terminal window 1
cat /etc/resolv.conf
# Generated by NetworkManager
search sevone.com nwk.sevone.com
nameserver 10.168.16.50
nameserver 10.205.8.50
From terminal window 2
cat /etc/resolv.conf
# Generated by NetworkManager
search sevone.com nwk.sevone.com
nameserver 10.168.16.50
nameserver 10.205.8.50
- Ensure that the DNS server can resolve SevOne Data Insight's hostname / IP address on both the control plane and the agent nodes, along with the DNS entries in the /etc/resolv.conf file (see the search line and nameserver(s)).
From terminal window 1
The following output shows that the DNS server can resolve hostname / IP address on both the control plane and the agent nodes.
Check if 'nslookup' resolves the 'control plane' IP address
nslookup 10.123.45.67
67.45.123.10.in-addr.arpa    name = sdi-node01.sevone.com.
Check if 'nslookup' resolves the 'control plane' hostname
nslookup sdi-node01.sevone.com
Server:     10.168.16.50
Address:    10.168.16.50#53

Name:       sdi-node01.sevone.com
Address:    10.123.45.67
Check if 'nslookup' resolves the 'agent' IP address
nslookup 10.123.45.68
68.45.123.10.in-addr.arpa    name = sdi-node02.sevone.com.
Check if 'nslookup' resolves the 'agent' hostname
nslookup sdi-node02.sevone.com
Server:     10.168.16.50
Address:    10.168.16.50#53

Name:       sdi-node02.sevone.com
Address:    10.123.45.68
nslookup name 'sevone.com' in search line in /etc/resolv.conf
nslookup sevone.com
Server:     10.168.16.50
Address:    10.168.16.50#53

Name:       sevone.com
Address:    23.185.0.4
nslookup name 'nwk.sevone.com' in search line in /etc/resolv.conf
nslookup nwk.sevone.com
Server:     10.168.16.50
Address:    10.168.16.50#53

Name:       nwk.sevone.com
Address:    25.185.0.4
nslookup nameserver '10.168.16.50' in /etc/resolv.conf
nslookup 10.168.16.50
50.16.168.10.in-addr.arpa    name = infoblox.nwk.sevone.com.
nslookup nameserver '10.205.8.50' in /etc/resolv.conf
nslookup 10.205.8.50
50.8.205.10.in-addr.arpa    name = infoblox.colo2.sevone.com.
From terminal window 2
The following output shows that the DNS server can resolve hostname / IP address on both the control plane and the agent nodes.
Check if 'nslookup' resolves the 'agent' IP address
nslookup 10.123.45.68
68.45.123.10.in-addr.arpa    name = sdi-node02.sevone.com.
Check if 'nslookup' resolves the 'agent' hostname
nslookup sdi-node02.sevone.com
Server:     10.168.16.50
Address:    10.168.16.50#53

Name:       sdi-node02.sevone.com
Address:    10.123.45.68
Check if 'nslookup' resolves the 'control plane' IP address
nslookup 10.123.45.67
67.45.123.10.in-addr.arpa    name = sdi-node01.sevone.com.
Check if 'nslookup' resolves the 'control plane' hostname
nslookup sdi-node01.sevone.com
Server:     10.168.16.50
Address:    10.168.16.50#53

Name:       sdi-node01.sevone.com
Address:    10.123.45.67
nslookup name 'sevone.com' in search line in /etc/resolv.conf
nslookup sevone.com
Server:     10.168.16.50
Address:    10.168.16.50#53

Name:       sevone.com
Address:    23.185.0.4
nslookup name 'nwk.sevone.com' in search line in /etc/resolv.conf
nslookup nwk.sevone.com
Server:     10.168.16.50
Address:    10.168.16.50#53

Name:       nwk.sevone.com
Address:    25.185.0.4
nslookup nameserver '10.168.16.50' in /etc/resolv.conf
nslookup 10.168.16.50
50.16.168.10.in-addr.arpa    name = infoblox.nwk.sevone.com.
nslookup nameserver '10.205.8.50' in /etc/resolv.conf
nslookup 10.205.8.50
50.8.205.10.in-addr.arpa    name = infoblox.colo2.sevone.com.
Note: If any of the nslookup commands in terminal window 1 or terminal window 2 above fail or return one or more of the following, you must first resolve the name resolution issue; otherwise, SevOne Data Insight will not work.
Examples
** server can't find 67.45.123.10.in-addr.arpa.: NXDOMAIN
** server can't find 68.45.123.10.in-addr.arpa.: NXDOMAIN
*** Can't find nwk.sevone.com: No answer
If name resolution fails for any reason after SevOne Data Insight has been deployed, normal operations in SevOne Data Insight may also fail. It is therefore recommended to ensure that the DNS configuration is always working.
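When auditing the DNS setup, it can help to extract the nameservers and search domains that every nslookup check above depends on. A minimal sketch, run here against a here-doc copy of /etc/resolv.conf so it works anywhere; on an SDI node, point RESOLV at the real file.

```shell
# Scratch copy of /etc/resolv.conf (contents mirror the example above);
# on a real node use RESOLV=/etc/resolv.conf instead.
RESOLV=$(mktemp)
cat > "$RESOLV" <<'EOF'
# Generated by NetworkManager
search sevone.com nwk.sevone.com
nameserver 10.168.16.50
nameserver 10.205.8.50
EOF

# Each nameserver listed must be reachable and able to resolve the SDI
# hostnames; feed every one of them to nslookup as shown above.
NAMESERVERS=$(awk '/^nameserver/ {print $2}' "$RESOLV")
echo "$NAMESERVERS"
# → 10.168.16.50
#   10.205.8.50
rm -f "$RESOLV"
```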
ERROR: Failed to open ID file '/home/sevone/.pub': No such file or directory
As a security measure, fresh installations do not ship with pre-generated SSH keys.
- Using ssh, log into SevOne Data Insight as sevone.
ssh sevone@<SevOne Data Insight 'control plane' node IP address or hostname>
- Execute the following command to generate unique SSH keys for your cluster.
sevone-cli cluster setup-keys
TimeShift between SevOne Data Insight & SevOne NMS
If the time difference between SevOne Data Insight and SevOne NMS appliances is more than 5 minutes, the following steps must be performed.
- Check the time on the SevOne Data Insight appliance.
date
- Check the time on the SevOne NMS appliance.
date
- If the time difference between the SevOne Data Insight and SevOne NMS appliances is more than 5 minutes, check the NTP configuration on both appliances. Both appliances must be time-sync'd to the NTP.
Note: If the NTP server is unavailable, manually set the same time on both appliances as shown in the example below.
Example
date --set="6 OCT 2023 18:00:00"
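The 5-minute threshold above can be checked numerically by comparing epoch timestamps from the two appliances. A minimal sketch with illustrative epoch values; on real systems, capture `date +%s` on each appliance and substitute those values.

```shell
# Illustrative epoch timestamps; on real systems run `date +%s` on each box.
SDI_EPOCH=1696615200       # Data Insight node
NMS_EPOCH=1696615600       # NMS appliance

DRIFT=$((SDI_EPOCH - NMS_EPOCH))
[ "$DRIFT" -lt 0 ] && DRIFT=$((-DRIFT))   # absolute value

if [ "$DRIFT" -gt 300 ]; then
  echo "time drift ${DRIFT}s exceeds 5 minutes: check NTP"
else
  echo "time drift ${DRIFT}s is acceptable"
fi
# → time drift 400s exceeds 5 minutes: check NTP
```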
Pre-Check Failures
TASK [ Confirm free space ]
- If this task fails, you can try to clean up old installer files that may be found in various parts of the file system. For example,
- /root
- /home/sevone
- /opt/SevOne/upgrade
- /var/lib/rancher/k3s/agent/images
Important: Please exercise caution before proceeding with the cleanup. Ensure that you only delete files that are no longer needed and have been confirmed as safe to remove. For example, look for installer files of the format sdi-x.y.z-build.<###>.tgz where x.y.z is an older version and not the one you are upgrading to.
- Clear scheduled report caching. Execute the following command to delete files older than one week (604800 seconds).
Note: SevOne Data Insight maintains a cache of the printed PDF's for scheduled reports. Depending on your usage of report scheduling, it is recommended to occasionally clean up the cache to free up disk space.
sevone-cli exec graphql -- "ASSET_SERVER_TEMP_PATH='/' npm run asset-sweeper -- --extension=pdf --age=604800"
- Running the command below helps track down which files in the system are taking up the most space.
du -sh /*
- In your investigation, if you find that the following directory is filling up your HDD (Hard Disk Drive), then a container or a pod is the culprit.
/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots
Important: Do not delete the folder. Only remove files that are safe and approved for deletion. Please proceed with caution when deleting files. If the folder is accidentally deleted, it can be restored by running the following commands.
sudo systemctl restart containerd
sudo systemctl restart k3s
You must continue running du -sh to further pinpoint the exact container or pod. In some cases, it may be the printer container taking up the space due to node.js core dump files. Execute the following command to identify those files.
find /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots \ -name "core\.*"
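The one-week age cutoff used by the asset sweeper can also be expressed with plain find when hunting for stale files elsewhere on disk: -mtime +7 matches files last modified more than 7 days (~604800 seconds) ago. The sketch below demonstrates this in a scratch directory with illustrative file names.

```shell
# Scratch directory with one fresh and one 10-day-old file (illustrative).
WORK=$(mktemp -d)
touch "$WORK/fresh.pdf"
touch -d "10 days ago" "$WORK/stale.pdf"

# List week-old PDFs; once the list is verified, `-delete` could be added.
OLD_FILES=$(find "$WORK" -name '*.pdf' -mtime +7)
echo "$OLD_FILES"

rm -rf "$WORK"
```

Only stale.pdf is listed; the freshly created file is left alone.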
Install / Upgrade Failures
TASK [ k3s : Initialize the cluster ]
If this task fails, you can observe the status of k3s service by using the following command.
systemctl status k3s
Unable to find suitable network address. No default routes found.
Check if there is a default route added to the routing table.
ip route | grep default
If this returns empty, you will need to add a default route.
Add default route
ip route add default via <default_gateway>
TASK [ Stop k3s-server if upgrading to new version ]
If this task does not complete within a minute, apply the following workaround before continuing with the upgrade.
Stop API and Client processes
sudo systemctl status sevone-guii-@api
sudo systemctl status sevone-guii-@client
sudo systemctl start sevone-guii-@api
sudo systemctl start sevone-guii-@client
sudo systemctl stop sevone-guii-@api
sudo systemctl stop sevone-guii-@client
sed -i 's/.*k3s-killall.sh.*/ echo noop/' \
/opt/SevOne/upgrade/ansible/playbooks/roles/k3s/tasks/02_setup.yaml
sevone-cli playbook up --tags kube,apps,kernel
This is due to an upstream issue with the k3s-killall.sh script hanging when attempting to shut down some running containerd processes.
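The sed workaround above simply rewrites any playbook line that calls k3s-killall.sh into a no-op. The sketch below shows the same substitution against a scratch file; the file content is illustrative, and the real target is the 02_setup.yaml task file referenced above.

```shell
# Scratch file standing in for the playbook task file (illustrative content).
TASK=$(mktemp)
printf '%s\n' 'cmd: /usr/local/bin/k3s-killall.sh' > "$TASK"

# Same substitution as the workaround: any line mentioning the hanging
# script becomes a harmless no-op.
sed -i 's/.*k3s-killall.sh.*/ echo noop/' "$TASK"

PATCHED=$(cat "$TASK")
echo "$PATCHED"
# → echo noop
rm -f "$TASK"
```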
TASK [ prep : Ensure hostname set ]
When attempting to run an upgrade, you may run into the following error.
TASK [prep : Ensure hostname set] ****************************************************************************************************************************************************
fatal: [sevonek8s]: FAILED! => {"changed": false, \
"msg": "Command failed rc=1, out=, err=Could not get property: \
Failed to activate service 'org.freedesktop.hostname1': timed out\n"}
This happens when hostnamed has likely crashed. Restart hostnamed.
sudo systemctl restart systemd-hostnamed
sudo reboot
TASK [ freight : Install rhel8-update-*.el8.tgz ]
When upgrading to SevOne Data Insight 3.8 or higher, the likely culprit is that the installed packages are too up-to-date. This can happen if your machine has internet access and can resolve yum package servers. The fix is to retry the upgrade while skipping the yum packages with broken dependencies.
- Remove lingering yum packages or package conflicts.
sudo yum clean all
sudo rm -rf /var/cache/yum/
- Retry the upgrade via the Command Line Interface (CLI).
sevone-cli playbook up --extra-vars "freight_install_skip_broken=yes"
TASK [ helm upgrade/install default/<chart_name> ]
There are several reasons why this task may fail. Unfortunately, Helm does not report errors or useful debug information, so further investigation is required. Look for the stderr key in the large JSON body returned in the task output.
UPGRADE FAILED: to deploy apps
If you are upgrading between 3.5.x versions, for example, from 3.5.1 to 3.5.3, the upgrade will fail to deploy apps.
To fix this, execute the following commands before performing an upgrade.
sevone-cli playbook up --skip-tags apps,kernel
sudo systemctl restart k3s
ssh sevone@<'agent' IP address>
sudo systemctl restart k3s-agent
General Debugging Tips
There are several reasons why the task helm upgrade/install default/<chart_name> may fail. Helm does not provide useful debug information, so further investigation is required to understand the failure.
- Execute the following command to retry the upgrade.
sevone-cli playbook up --tags apps
- While the above command is in progress, from another terminal window, run k9s.
- Monitor the status of each pod and refer to the table below for some basic debugging techniques.
Note: If logs are not shown when observing the logs via k9s, press 0 to enable logs for all time.
| Status | Action |
|---|---|
| CrashLoopBackOff | Check the pod logs by hovering over the pod and pressing 1. |
| Error | Check the pod logs by hovering over the pod and pressing 1. |
| Pending | Check the pod event log by hovering over it and pressing d. |
Other Issues
Error getting NMS IP List
This section only applies if your SevOne NMS is on 6.x and you are getting this error.
- SOA must be on the latest version on all appliances in the SevOne NMS cluster. The Command Line Interface (CLI) must be used to upgrade SOA on all peers, as the graphical user interface (GUI) only upgrades SOA on the NMS appliance you are connected to.
- Add flag --all-peers if you want to install / upgrade SOA on all peers in the cluster.
Error
sevone-cli soa upgrade /opt/SevOne/upgrade/utilities/SevOne-soa-*.rpm --all-peers
>> [INFO] ATTEMPTING TO AUTO-DETECT SOA DATASOURCES...
Defaulted container "mysql" out of: mysql, metrics
...
...
<returns an ERROR>
If you get this error, please make sure you are logged into SevOne Data Insight as sevone.
ssh sevone@<SevOne Data Insight IP address or hostname>
Now, re-run the command to upgrade SOA.
sevone-cli soa upgrade /opt/SevOne/upgrade/utilities/SevOne-soa-*.rpm --all-peers
Incorrect information entered at Bootstrap and/or Provisioning prompts?
If you entered incorrect information at bootstrap and/or provisioning prompts, execute the following commands to allow you to override the input. These commands can only be run once your SevOne Data Insight is up and running.
ssh sevone@<SevOne Data Insight IP address or hostname>
sevone-cli exec graphql -- npm run bootstrap -- -f
sevone-cli exec graphql -- npm run provision -- -f
Pod Stuck in a 'Terminating' State
If a pod is ever stuck and you want it to reboot, you can append --grace-period=0 --force to the end of your delete pod command.
Example
ssh sevone@<SevOne Data Insight IP address or hostname>
kubectl delete pod $(kubectl get pods | grep 'dsm' | awk '{print $1}') --grace-period=0 --force
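The grep/awk pipeline above isolates the pod name from the first column of kubectl get pods output. The sketch below demonstrates the extraction against canned output (pod names are illustrative) so it can be exercised without a cluster.

```shell
# Canned `kubectl get pods`-style output; the pod names are illustrative.
PODS='NAME                      READY   STATUS        RESTARTS   AGE
di-dsm-7d88c8c7b5-x1abc   1/1     Terminating   0          22h
di-ui-5b8fbcfc54-r2def    1/1     Running       0          22h'

# grep selects the matching row, awk keeps only the NAME column --
# exactly what the delete command above interpolates.
POD_NAME=$(echo "$PODS" | grep 'dsm' | awk '{print $1}')
echo "$POD_NAME"
# → di-dsm-7d88c8c7b5-x1abc
```

Note that grep matches a substring, so a pattern like 'dsm' should be specific enough to select a single pod.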
Review / Collect Logs
Logs can be collected at the pod level. The status of pods must be Running.
By default, resource-type = pod. For logs where resource-type = pod, you may choose to pass only the pod-name; resource-type is optional.
Using ssh, log into SevOne Data Insight as sevone.
ssh sevone@<SevOne Data Insight IP address or hostname>
Example: Get 'pod' names
kubectl get pods
NAME                                                      READY   STATUS      RESTARTS        AGE
di-create-secrets-xllfj                                   0/1     Completed   0               22h
di-upgrade-l2cs8                                          0/1     Completed   0               22h
clienttest-success-89lmt                                  0/1     Completed   0               22h
clienttest-fail-lb8mq                                     0/1     Completed   0               22h
di-report-version-sweeper-28276440-zpcxt                  0/1     Completed   0               20h
ingress-ingress-nginx-controller-54dfdbc9cf-g9wdz         1/1     Running     0               22h
di-prometheus-node-exporter-shnxk                         1/1     Running     0               22h
di-graphql-7d88c8c7b5-fbwgc                               1/1     Running     0               22h
di-ui-5b8fbcfc54-rtwlq                                    1/1     Running     0               22h
di-kube-state-metrics-6f4fbc67cb-tsbbk                    1/1     Running     0               22h
di-migrator-fdb9dd58b-29kl2                               2/2     Running     0               22h
ingress-ingress-nginx-defaultbackend-69f644c9dc-7jvvs     1/1     Running     0               22h
di-printer-7888679b59-cqp9q                               2/2     Running     0               22h
di-scheduler-7845d64d57-bdsm2                             1/1     Running     0               22h
di-registry-68c7bbc47b-45l5v                              1/1     Running     0               22h
di-djinn-api-5b4bbb446b-prsjd                             1/1     Running     1 (22h ago)     22h
di-mysql-0                                                2/2     Running     0               22h
di-prometheus-server-7dc67cb6b5-bjzn5                     2/2     Running     0               22h
di-redis-master-0                                         2/2     Running     0               22h
di-wdkserver-6db95bb9c9-5w2kt                             2/2     Running     0               22h
di-assetserver-5c4769bd8-6f2hw                            1/1     Running     0               22h
di-prometheus-node-exporter-mp5xf                         1/1     Running     0               22h
di-report-tombstone-sweeper-28277040-kj227                1/1     Running     0               10h
datasource-operator-controller-manager-5cf6f7f675-h5lng   2/2     Running     3 (5h37m ago)   22h
di-asset-sweeper-28277645-tq6gb                           0/1     Completed   0               12m
di-user-sync-28277645-dl6ks                               0/1     Completed   0               12m
di-asset-sweeper-28277650-hxwvn                           0/1     Completed   0               7m46s
di-user-sync-28277650-6kxf7                               0/1     Completed   0               7m46s
di-asset-sweeper-28277655-gjtpr                           0/1     Completed   0               2m46s
di-user-sync-28277655-chgxd                               0/1     Completed   0               2m46s
Get resource types
kubectl get all | more
kubectl get all | grep <pod-name>
kubectl get all | grep printer
pod/di-printer-68f6bddb6f-hkhdt             1/1     Running   2 (27h ago)   2d3h
deployment.apps/di-printer                  1/1     1         1             2d3h
replicaset.apps/di-printer-68f6bddb6f       1       1         1             2d3h
kubectl get all | grep rabbitmq
pod/di-rabbitmq-0                  1/1         Running           2 (27h ago)                                      2d3h
service/di-rabbitmq-headless       ClusterIP   None              <none>   4369/TCP,5672/TCP,25672/TCP,15672/TCP   2d3h
service/di-rabbitmq                ClusterIP   192.168.108.109   <none>   5672/TCP,4369/TCP,25672/TCP,15672/TCP,9419/TCP   2d3h
statefulset.apps/di-rabbitmq       1/1         2d3h
di-printer, di-rabbitmq, etc. in the examples above are pod names.
kubectl logs <resource-type>/<pod-name>
kubectl logs deployment.apps/di-printer
OR
kubectl logs deploy/di-printer
kubectl logs statefulset.apps/di-rabbitmq
OR
kubectl logs sts/di-rabbitmq
kubectl logs statefulset.apps/di-rabbitmq --timestamps
OR
kubectl logs sts/di-rabbitmq --timestamps
By default, resource-type = pod.
In the example below, to obtain the logs for <resource-type>/<pod-name> = pod/di-mysql-0, <resource-type> pod is optional.
kubectl logs pod/di-mysql-0
OR
kubectl logs di-mysql-0
Collect Logs for a Pod with One Container
- Using ssh, log into SevOne Data Insight as sevone.
ssh sevone@<SevOne Data Insight IP address or hostname>
- Obtain the list of containers that belong to a pod.
Example: Pod name 'di-mysql-0' contains one container, 'mysql'
kubectl get pods di-mysql-0 -o jsonpath='{.spec.containers[*].name}{"\n"}'
mysql
- Collect logs.
Note: For pods with one container only, -c <container-name> in the command below is optional.
kubectl logs <pod-name> -c <container-name>
Or
kubectl logs <pod-name>
Example
kubectl logs di-mysql-0 -c mysql
Or
kubectl logs di-mysql-0
Collect Logs for a Pod with More Than One Container
- Using ssh, log into SevOne Data Insight as sevone.
ssh sevone@<SevOne Data Insight IP address or hostname>
- Obtain the list of containers that belong to a pod.
Example: Pod name 'svclb-ingress-ingress-nginx-controller-5pcm7' contains two containers, 'lb-port-80' and 'lb-port-443'
kubectl get pods svclb-ingress-ingress-nginx-controller-5pcm7 \
  -o jsonpath='{.spec.containers[*].name}{"\n"}'
lb-port-80 lb-port-443
- Collect logs.
Important: For pods with more than one container, -c <container-name> is required.
kubectl logs <pod-name> -c <container-name>
Example: Get logs for <container-name> = lb-port-80
kubectl logs svclb-ingress-ingress-nginx-controller-5pcm7 -c lb-port-80
Example: Get logs for <container-name> = lb-port-443
kubectl logs svclb-ingress-ingress-nginx-controller-5pcm7 -c lb-port-443
Collect All Logs
- To collect all the logs relevant for SevOne Data Insight pods and its containers, create a working directory where all the logs can be collected.
TMPDIR="/tmp/sdi_logs/$(date +%d%b%y)"
mkdir -p $TMPDIR
- Execute the following command to collect all logs for all SevOne Data Insight containers.
Note: The --timestamps option in the command below allows you to collect the logs with the timestamps.
Example: Command to collect logs from all SevOne Data Insight Pods and containers
for POD in $(kubectl get pods --no-headers -n default | \
  awk '{print $1}'); do for CONTAINER in $(kubectl get pods \
  $POD -o jsonpath='{.spec.containers[*].name}{"\n"}'); \
  do echo "Collecting logs for POD: $POD - CONTAINER: \
  $CONTAINER in log file $TMPDIR/${POD}_${CONTAINER}.log.gz" ; \
  kubectl logs $POD -c $CONTAINER --timestamps | \
  gzip > $TMPDIR/${POD}_${CONTAINER}.log.gz 2>&1; done ; done
The for command is shown here with indentations for clarity.
for POD in $(kubectl get pods --no-headers -n default | awk '{print $1}') ; do
  for CONTAINER in $(kubectl get pods $POD -o jsonpath='{.spec.containers[*].name}{"\n"}') ; do
    echo "Collecting logs for POD: $POD - CONTAINER: $CONTAINER in log file $TMPDIR/${POD}_${CONTAINER}.log.gz"
    kubectl logs $POD -c $CONTAINER --timestamps | gzip > $TMPDIR/${POD}_${CONTAINER}.log.gz 2>&1
  done
done
Command to see files contained in $TMPDIR
ls -lh $TMPDIR - Once the logs are collected, the contents can be put in a tar file. There is no need to compress again since the logs are already compressed.
tar -cf /tmp/sdi_logs-$(date +%d%b%y).tar $TMPDIR
ls -l /tmp/sdi_logs-$(date +%d%b%y).tar
md5sum /tmp/sdi_logs-$(date +%d%b%y).tar
- Delete the log directory to free up the space.
rm -rf $TMPDIR
- You may upload the tar file /tmp/sdi_logs-$(date +%d%b%y).tar for further investigation.
'Agent' Nodes in a Not Ready State after Rebooting
Perform the following actions if the agent nodes are in a Not Ready state after rebooting.
Ensure Data Insight is 100% deployed
Check the status of the deployment by running the following command. Ensure that everything is in Running status.
ssh sevone@<SevOne Data Insight IP address or hostname>
kubectl get pods
NAME                                                      READY   STATUS      RESTARTS        AGE
di-create-secrets-xllfj                                   0/1     Completed   0               22h
di-upgrade-l2cs8                                          0/1     Completed   0               22h
clienttest-success-89lmt                                  0/1     Completed   0               22h
clienttest-fail-lb8mq                                     0/1     Completed   0               22h
di-report-version-sweeper-28276440-zpcxt                  0/1     Completed   0               20h
ingress-ingress-nginx-controller-54dfdbc9cf-g9wdz         1/1     Running     0               22h
di-prometheus-node-exporter-shnxk                         1/1     Running     0               22h
di-graphql-7d88c8c7b5-fbwgc                               1/1     Running     0               22h
di-ui-5b8fbcfc54-rtwlq                                    1/1     Running     0               22h
di-kube-state-metrics-6f4fbc67cb-tsbbk                    1/1     Running     0               22h
di-migrator-fdb9dd58b-29kl2                               2/2     Running     0               22h
ingress-ingress-nginx-defaultbackend-69f644c9dc-7jvvs     1/1     Running     0               22h
di-printer-7888679b59-cqp9q                               2/2     Running     0               22h
di-scheduler-7845d64d57-bdsm2                             1/1     Running     0               22h
di-registry-68c7bbc47b-45l5v                              1/1     Running     0               22h
di-djinn-api-5b4bbb446b-prsjd                             1/1     Running     1 (22h ago)     22h
di-mysql-0                                                2/2     Running     0               22h
di-prometheus-server-7dc67cb6b5-bjzn5                     2/2     Running     0               22h
di-redis-master-0                                         2/2     Running     0               22h
di-wdkserver-6db95bb9c9-5w2kt                             2/2     Running     0               22h
di-assetserver-5c4769bd8-6f2hw                            1/1     Running     0               22h
di-prometheus-node-exporter-mp5xf                         1/1     Running     0               22h
di-report-tombstone-sweeper-28277040-kj227                1/1     Running     0               10h
datasource-operator-controller-manager-5cf6f7f675-h5lng   2/2     Running     3 (5h37m ago)   22h
di-asset-sweeper-28277645-tq6gb                           0/1     Completed   0               12m
di-user-sync-28277645-dl6ks                               0/1     Completed   0               12m
di-asset-sweeper-28277650-hxwvn                           0/1     Completed   0               7m46s
di-user-sync-28277650-6kxf7                               0/1     Completed   0               7m46s
di-asset-sweeper-28277655-gjtpr                           0/1     Completed   0               2m46s
di-user-sync-28277655-chgxd                               0/1     Completed   0               2m46s
Restart SOA
If SevOne NMS has been upgraded or downgraded, please make sure that the SOA container is restarted after a successful upgrade/downgrade. Execute the following command.
From SevOne NMS appliance,
ssh root@<NMS appliance>
supervisorctl restart soa